
ACER algorithm (#184)

* initial ACER commit

* Code cleanup + several fixes

* Q-retrace bug fix + small clean-ups

* added documentation for acer

* ACER benchmarks

* update benchmarks table

* Add nightly running of golden and trace tests. (#202)

Resolves #200

* comment out nightly trace tests until values reset.

* remove redundant observe ignore (#168)

* ensure nightly test env containers exist. (#205)

Also bump integration test timeout

* wxPython removal (#207)

Replacing wxPython with Python's Tkinter.
Also removing the option to choose multiple files as it is unused and causes errors, and fixing the load file/directory spinner.

* Create CONTRIBUTING.md (#210)

* Create CONTRIBUTING.md.  Resolves #188

* run nightly golden tests sequentially. (#217)

Should reduce resource requirements and potential CPU contention but increases
overall execution time.

* tests: added new setup configuration + test args (#211)

- added utils for future tests and conftest
- added test args

* new docs build

* golden test update
shadiendrawis
2019-02-20 23:52:34 +02:00
committed by GitHub
parent 7253f511ed
commit 2b5d1dabe6
175 changed files with 2327 additions and 664 deletions


@@ -14,6 +14,7 @@ A detailed description of those algorithms can be found by navigating to each of
:caption: Agents
policy_optimization/ac
policy_optimization/acer
imitation/bc
value_optimization/bs_dqn
value_optimization/categorical_dqn


@@ -32,8 +32,8 @@ Training the network
Given a batch of transitions, run them through the network to get the current predictions of the future measurements
per action, and set them as the initial targets for training the network. For each transition
:math:`(s_t,a_t,r_t,s_{t+1})` in the batch, the target of the network for the action that was taken is the actual
measurements that were seen in time-steps :math:`t+1,t+2,t+4,t+8,t+16` and :math:`t+32`.
For the actions that were not taken, the targets are the current values.
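A minimal NumPy sketch of this target construction (``predictions`` has shape ``[num_actions, len(OFFSETS) * measurement_dim]``, ``measurements`` holds the episode's recorded measurements; end-of-episode handling is simplified here and all names are illustrative, not the agent's actual API):
.. code-block:: python

   import numpy as np

   OFFSETS = [1, 2, 4, 8, 16, 32]   # future time-steps whose measurements serve as targets

   def dfp_targets(predictions, measurements, t, action):
       targets = predictions.copy()                      # actions that were not taken keep the current values
       future = [measurements[min(t + k, len(measurements) - 1)] for k in OFFSETS]
       targets[action] = np.concatenate(future)          # the taken action gets the observed future measurements
       return targets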
.. autoclass:: rl_coach.agents.dfp_agent.DFPAlgorithmParameters


@@ -0,0 +1,60 @@
ACER
============
**Action space:** Discrete
**References:** `Sample Efficient Actor-Critic with Experience Replay <https://arxiv.org/abs/1611.01224>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/acer.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
The policy network is used to predict action probabilities. During training, an action is sampled from the categorical
distribution defined by these probabilities. During testing, the action with the highest probability is taken.
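A minimal sketch of this action-selection rule (``probs`` stands for the softmax output of the policy head; the function is illustrative, not part of the agent's API):
.. code-block:: python

   import numpy as np

   def choose_action(probs, training=True):
       # probs: softmax output of the policy head, shape [num_actions]
       if training:
           return int(np.random.choice(len(probs), p=probs))   # sample from the categorical distribution
       return int(np.argmax(probs))                             # greedy action at test time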
Training the network
++++++++++++++++++++
Each iteration performs one on-policy update with a batch of the last :math:`T_{max}` transitions,
and :math:`n` (the replay ratio) off-policy updates with batches of :math:`T_{max}` transitions sampled from the replay buffer.
Each update follows this procedure:
1. **Calculate state values:**
.. math:: V(s_t) = \mathbb{E}_{a \sim \pi} [Q(s_t,a)]
2. **Calculate Q retrace:**
.. math:: Q^{ret}(s_t,a_t) = r_t +\gamma \bar{\rho}_{t+1}[Q^{ret}(s_{t+1},a_{t+1}) - Q(s_{t+1},a_{t+1})] + \gamma V(s_{t+1})
.. math:: \text{where} \quad \bar{\rho}_{t} = \min{\left\{c,\rho_t\right\}},\quad \rho_t=\frac{\pi (a_t \mid s_t)}{\mu (a_t \mid s_t)}
3. **Accumulate gradients:**
:math:`\bullet` **Policy gradients (with bias correction):**
.. math:: \hat{g}_t^{policy} & = & \bar{\rho}_{t} \nabla \log \pi (a_t \mid s_t) [Q^{ret}(s_t,a_t) - V(s_t)] \\
& & + \mathbb{E}_{a \sim \pi} \left(\left[\frac{\rho_t(a)-c}{\rho_t(a)}\right]_+ \nabla \log \pi (a \mid s_t) [Q(s_t,a) - V(s_t)] \right)
:math:`\bullet` **Q-Head gradients (MSE):**
.. math:: \hat{g}_t^{Q} = (Q^{ret}(s_t,a_t) - Q(s_t,a_t)) \nabla Q(s_t,a_t)
4. **(Optional) Trust region update:** change the policy loss gradient w.r.t network output:
.. math:: \hat{g}_t^{trust-region} = \hat{g}_t^{policy} - \max \left\{0, \frac{k^T \hat{g}_t^{policy} - \delta}{\lVert k \rVert_2^2}\right\} k
.. math:: \text{where} \quad k = \nabla D_{KL}[\pi_{avg} \parallel \pi]
The average policy network is an exponential moving average of the parameters of the network (:math:`\theta_{avg}=\alpha\theta_{avg}+(1-\alpha)\theta`).
The goal of the trust region update is to bound the difference between the updated policy and the average policy, which helps ensure stability.
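The Q-retrace recursion and the trust-region projection can be summarised in a short NumPy sketch (an illustrative simplification under this page's notation, not the agent's implementation; the array names, the default hyperparameter values, and the zero bootstrap for terminal trajectories are assumptions):
.. code-block:: python

   import numpy as np

   def q_retrace(rewards, q_values, probs, behaviour_probs, actions,
                 bootstrap_value=0.0, gamma=0.99, c=1.0):
       """Backward-recursive Q-retrace targets for a trajectory of length T.

       q_values, probs, behaviour_probs: arrays of shape [T, num_actions]
       rewards, actions: arrays of length T
       """
       T = len(rewards)
       values = (probs * q_values).sum(axis=1)               # V(s_t) = E_{a~pi}[Q(s_t, a)]
       rho = probs[np.arange(T), actions] / behaviour_probs[np.arange(T), actions]
       rho_bar = np.minimum(c, rho)                          # truncated importance weights
       q_ret = np.zeros(T)
       running = bootstrap_value                             # Q^ret beyond the last step (0 if terminal)
       for t in reversed(range(T)):
           running = rewards[t] + gamma * running
           q_ret[t] = running
           # re-express the recursion at step t for the next (earlier) iteration
           running = rho_bar[t] * (q_ret[t] - q_values[t, actions[t]]) + values[t]
       return q_ret

   def trust_region(g_policy, k, delta=1.0):
       """Clip the policy gradient along k = grad D_KL[pi_avg || pi] to stay within delta."""
       scale = max(0.0, (k @ g_policy - delta) / (k @ k))
       return g_policy - scale * k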
.. autoclass:: rl_coach.agents.acer_agent.ACERAlgorithmParameters


@@ -19,7 +19,7 @@ Training the network
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the online network to find the :math:`Q`-maximizing
action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.
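A minimal NumPy sketch of this selection/evaluation split (``online_q_next`` and ``target_q_next`` stand for the two networks' Q-value outputs on the next states; names are illustrative):
.. code-block:: python

   import numpy as np

   def ddqn_next_state_values(online_q_next, target_q_next):
       # online network picks the action, target network evaluates it
       best_actions = np.argmax(online_q_next, axis=1)                    # argmax_a Q_online(s_{t+1}, a)
       batch_idx = np.arange(len(best_actions))
       return target_q_next[batch_idx, best_actions]                      # Q_target(s_{t+1}, argmax_a ...)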


@@ -26,7 +26,7 @@ Training the network
use the current states from the sampled batch, and run the online network to get the current Q-value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation to calculate the target of the network (see the sketch after this list):
:math:`y_t=r(s_t,a_t)+\gamma \cdot \max_a Q(s_{t+1},a)`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
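A minimal NumPy sketch of steps 3-5's target construction (``current_q`` are the online network's predictions for the current states, ``next_q`` the Q-value predictions for the next states, typically taken from the target network; terminal-state handling is omitted and all names are illustrative rather than the agent's API):
.. code-block:: python

   import numpy as np

   def dqn_targets(current_q, next_q, actions, rewards, gamma=0.99):
       # current_q, next_q: arrays of shape [batch_size, num_actions]
       targets = current_q.copy()                       # actions that were not played keep their current values
       batch_idx = np.arange(len(actions))
       targets[batch_idx, actions] = rewards + gamma * next_q.max(axis=1)
       return targets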