
ACER algorithm (#184)

* initial ACER commit

* Code cleanup + several fixes

* Q-retrace bug fix + small clean-ups

* added documentation for acer

* ACER benchmarks

* update benchmarks table

* Add nightly running of golden and trace tests. (#202)

Resolves #200

* comment out nightly trace tests until values reset.

* remove redundant observe ignore (#168)

* ensure nightly test env containers exist. (#205)

Also bump integration test timeout

* wxPython removal (#207)

Replacing wxPython with Python's Tkinter.
Also removing the option to choose multiple files as it is unused and causes errors, and fixing the load file/directory spinner.

* Create CONTRIBUTING.md (#210)

* Create CONTRIBUTING.md.  Resolves #188

* run nightly golden tests sequentially. (#217)

Should reduce resource requirements and potential CPU contention but increases
overall execution time.

* tests: added new setup configuration + test args (#211)

- added utils for future tests and conftest
- added test args

* new docs build

* golden test update
shadiendrawis
2019-02-20 23:52:34 +02:00
committed by GitHub
parent 7253f511ed
commit 2b5d1dabe6
175 changed files with 2327 additions and 664 deletions


@@ -14,6 +14,7 @@ A detailed description of those algorithms can be found by navigating to each of
:caption: Agents
policy_optimization/ac
policy_optimization/acer
imitation/bc
value_optimization/bs_dqn
value_optimization/categorical_dqn


@@ -32,8 +32,8 @@ Training the network
Given a batch of transitions, run them through the network to get the current predictions of the future measurements
per action, and set them as the initial targets for training the network. For each transition
:math:`(s_t,a_t,r_t,s_{t+1})` in the batch, the target of the network for the action that was taken is the actual
measurements that were seen in time-steps :math:`t+1,t+2,t+4,t+8,t+16` and :math:`t+32`.
For the actions that were not taken, the targets are the current values.
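A minimal NumPy sketch of this target construction (``predictions`` has shape ``[num_actions, len(OFFSETS) * measurement_dim]``, ``measurements`` holds the episode's recorded measurements; end-of-episode handling is simplified here and all names are illustrative, not the agent's actual API):
.. code-block:: python

   import numpy as np

   OFFSETS = [1, 2, 4, 8, 16, 32]   # future time-steps whose measurements serve as targets

   def dfp_targets(predictions, measurements, t, action):
       targets = predictions.copy()                      # actions that were not taken keep the current values
       future = [measurements[min(t + k, len(measurements) - 1)] for k in OFFSETS]
       targets[action] = np.concatenate(future)          # the taken action gets the observed future measurements
       return targets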
.. autoclass:: rl_coach.agents.dfp_agent.DFPAlgorithmParameters


@@ -0,0 +1,60 @@
ACER
============
**Action space:** Discrete
**References:** `Sample Efficient Actor-Critic with Experience Replay <https://arxiv.org/abs/1611.01224>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/acer.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
The policy network is used to predict action probabilities. During training, an action is sampled from the categorical
distribution defined by these probabilities. During testing, the action with the highest probability is taken.
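A minimal sketch of this action-selection rule (``probs`` stands for the softmax output of the policy head; the function is illustrative, not part of the agent's API):
.. code-block:: python

   import numpy as np

   def choose_action(probs, training=True):
       # probs: softmax output of the policy head, shape [num_actions]
       if training:
           return int(np.random.choice(len(probs), p=probs))   # sample from the categorical distribution
       return int(np.argmax(probs))                             # greedy action at test time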
Training the network
++++++++++++++++++++
Each iteration performs one on-policy update with a batch of the last :math:`T_{max}` transitions,
and :math:`n` (the replay ratio) off-policy updates with batches of :math:`T_{max}` transitions sampled from the replay buffer.
Each update follows this procedure:
1. **Calculate state values:**
.. math:: V(s_t) = \mathbb{E}_{a \sim \pi} [Q(s_t,a)]
2. **Calculate Q retrace:**
.. math:: Q^{ret}(s_t,a_t) = r_t +\gamma \bar{\rho}_{t+1}[Q^{ret}(s_{t+1},a_{t+1}) - Q(s_{t+1},a_{t+1})] + \gamma V(s_{t+1})
.. math:: \text{where} \quad \bar{\rho}_{t} = \min{\left\{c,\rho_t\right\}},\quad \rho_t=\frac{\pi (a_t \mid s_t)}{\mu (a_t \mid s_t)}
3. **Accumulate gradients:**
:math:`\bullet` **Policy gradients (with bias correction):**
.. math:: \hat{g}_t^{policy} & = & \bar{\rho}_{t} \nabla \log \pi (a_t \mid s_t) [Q^{ret}(s_t,a_t) - V(s_t)] \\
& & + \mathbb{E}_{a \sim \pi} \left(\left[\frac{\rho_t(a)-c}{\rho_t(a)}\right]_+ \nabla \log \pi (a \mid s_t) [Q(s_t,a) - V(s_t)] \right)
:math:`\bullet` **Q-Head gradients (MSE):**
.. math:: \hat{g}_t^{Q} = (Q^{ret}(s_t,a_t) - Q(s_t,a_t)) \nabla Q(s_t,a_t)
4. **(Optional) Trust region update:** change the policy loss gradient w.r.t network output:
.. math:: \hat{g}_t^{trust-region} = \hat{g}_t^{policy} - \max \left\{0, \frac{k^T \hat{g}_t^{policy} - \delta}{\lVert k \rVert_2^2}\right\} k
.. math:: \text{where} \quad k = \nabla D_{KL}[\pi_{avg} \parallel \pi]
The average policy network is an exponential moving average of the parameters of the network (:math:`\theta_{avg}=\alpha\theta_{avg}+(1-\alpha)\theta`).
The goal of the trust region update is to bound the difference between the updated policy and the average policy, which helps ensure stability.
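The Q-retrace recursion and the trust-region projection can be summarised in a short NumPy sketch (an illustrative simplification under this page's notation, not the agent's implementation; the array names, the default hyperparameter values, and the zero bootstrap for terminal trajectories are assumptions):
.. code-block:: python

   import numpy as np

   def q_retrace(rewards, q_values, probs, behaviour_probs, actions,
                 bootstrap_value=0.0, gamma=0.99, c=1.0):
       """Backward-recursive Q-retrace targets for a trajectory of length T.

       q_values, probs, behaviour_probs: arrays of shape [T, num_actions]
       rewards, actions: arrays of length T
       """
       T = len(rewards)
       values = (probs * q_values).sum(axis=1)               # V(s_t) = E_{a~pi}[Q(s_t, a)]
       rho = probs[np.arange(T), actions] / behaviour_probs[np.arange(T), actions]
       rho_bar = np.minimum(c, rho)                          # truncated importance weights
       q_ret = np.zeros(T)
       running = bootstrap_value                             # Q^ret beyond the last step (0 if terminal)
       for t in reversed(range(T)):
           running = rewards[t] + gamma * running
           q_ret[t] = running
           # re-express the recursion at step t for the next (earlier) iteration
           running = rho_bar[t] * (q_ret[t] - q_values[t, actions[t]]) + values[t]
       return q_ret

   def trust_region(g_policy, k, delta=1.0):
       """Clip the policy gradient along k = grad D_KL[pi_avg || pi] to stay within delta."""
       scale = max(0.0, (k @ g_policy - delta) / (k @ k))
       return g_policy - scale * k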
.. autoclass:: rl_coach.agents.acer_agent.ACERAlgorithmParameters


@@ -19,7 +19,7 @@ Training the network
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the online network to find the :math:`Q`-maximizing
action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.
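A minimal NumPy sketch of this selection/evaluation split (``online_q_next`` and ``target_q_next`` stand for the two networks' Q-value outputs on the next states; names are illustrative):
.. code-block:: python

   import numpy as np

   def ddqn_next_state_values(online_q_next, target_q_next):
       # online network picks the action, target network evaluates it
       best_actions = np.argmax(online_q_next, axis=1)                    # argmax_a Q_online(s_{t+1}, a)
       batch_idx = np.arange(len(best_actions))
       return target_q_next[batch_idx, best_actions]                      # Q_target(s_{t+1}, argmax_a ...)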


@@ -26,7 +26,7 @@ Training the network
use the current states from the sampled batch, and run the online network to get the current Q-value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation to calculate the target of the network (see the sketch after this list):
:math:`y_t=r(s_t,a_t)+\gamma \cdot \max_a Q(s_{t+1},a)`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
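A minimal NumPy sketch of steps 3-5's target construction (``current_q`` are the online network's predictions for the current states, ``next_q`` the Q-value predictions for the next states, typically taken from the target network; terminal-state handling is omitted and all names are illustrative rather than the agent's API):
.. code-block:: python

   import numpy as np

   def dqn_targets(current_q, next_q, actions, rewards, gamma=0.99):
       # current_q, next_q: arrays of shape [batch_size, num_actions]
       targets = current_q.copy()                       # actions that were not played keep their current values
       batch_idx = np.arange(len(actions))
       targets[batch_idx, actions] = rewards + gamma * next_q.max(axis=1)
       return targets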