update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
2025-12-18 11:40:18 +01:00 · 2018-11-15 15:00:13 +02:00
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions
--- a/docs_raw/source/components/agents/value_optimization/bs_dqn.rst
+++ b/docs_raw/source/components/agents/value_optimization/bs_dqn.rst
@@ -0,0 +1,43 @@
+Bootstrapped DQN
+================
+
+**Actions space:** Discrete
+
+**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/bs_dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+Choosing an action
++++++++++++++++++
+The current states are used as the input to the network. The network contains several :math:`Q` heads, which
+return different estimates of the action :math:`Q` values. For each episode, the bootstrapped exploration policy
+selects a single head to act with during that episode, and only the :math:`Q` values output by that head
+are used. Using those :math:`Q` values, the exploration policy then selects the action to take.
+
+Storing the transitions
+++++++++++++++++++++++
+For each transition, a Binomial mask is generated according to a predefined probability, and the number of output heads.
+The mask is a binary vector where each element holds a 0 for heads that shouldn't train on the specific transition,
+and 1 for heads that should use the transition for training. The mask is stored as part of the transition info in
+the replay buffer.
+
+Training the network
++++++++++++++++++++
+First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
+current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
+and for each output head, if the mask entry for that head is 1, change the target of the played action to :math:`y_t`,
+according to the standard DQN update rule:
+
+:math:`y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a)`
+
+Otherwise, leave it intact so that the transition does not affect the learning of this head.
+Then, train the online network according to the calculated targets.
+
+As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.
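+
+The masking steps above can be sketched in plain Python. This is a minimal illustration, not the Coach implementation:
+the function names are hypothetical, and it assumes each head bootstraps from its own head in the target network.
+

```python
import random


def make_head_mask(num_heads, keep_prob, rng):
    """Draw one 0/1 entry per head; 1 means the head trains on the transition."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_heads)]


def masked_targets(q_current, q_next_target, action, reward, mask, gamma):
    """Build per-head targets for a single transition.

    q_current[h][a]     : online-network Q values of head h for the current state
    q_next_target[h][a] : target-network Q values of head h for the next state
                          (assumption: each head bootstraps from itself)
    """
    # Start from the current predictions so untouched entries give zero error.
    targets = [list(head_q) for head_q in q_current]
    for h, use_head in enumerate(mask):
        if use_head:
            targets[h][action] = reward + gamma * max(q_next_target[h])
    return targets


if __name__ == "__main__":
    rng = random.Random(0)
    mask = make_head_mask(num_heads=4, keep_prob=0.5, rng=rng)
    print(mask)
```

+
+Heads whose mask entry is 0 keep their own predictions as targets, so the transition contributes no error for them.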
+
--- a/docs_raw/source/components/agents/value_optimization/categorical_dqn.rst
+++ b/docs_raw/source/components/agents/value_optimization/categorical_dqn.rst
@@ -0,0 +1,39 @@
+Categorical DQN
+===============
+
+**Actions space:** Discrete
+
+**References:** `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/distributional_dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+
+Training the network
++++++++++++++++++++
+
+1. Sample a batch of transitions from the replay buffer.
+
+2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
+   that the :math:`i`-th component of the projected update is calculated as follows:
+
+   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
+
+   where:
+
+   *  :math:`[ \cdot ]^b_a` bounds its argument in the range :math:`[a, b]`
+   *  :math:`\hat{T}_{z_{j}}` is the Bellman update for atom :math:`z_j`: :math:`\hat{T}_{z_{j}} := r+\gamma z_j`
+
+
+3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
+   probability distribution. Only the targets of the actions that were actually taken are updated.
+
+4. Once in every few thousand steps, weights are copied from the online network to the target network.
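+
+The projection in step 2 can be written out directly from the formula. The sketch below is illustrative only: the
+function name is hypothetical, the atoms are assumed to be evenly spaced between :math:`V_{MIN}` and :math:`V_{MAX}`,
+and terminal transitions are not handled.
+

```python
def project_distribution(next_probs, reward, gamma, v_min, v_max):
    """Project the Bellman-updated atoms back onto the fixed support.

    next_probs[j] is p_j(s_{t+1}, pi(s_{t+1})), the probability of atom z_j
    for the next state under the chosen next action.
    """
    n = len(next_probs)
    delta_z = (v_max - v_min) / (n - 1)
    atoms = [v_min + i * delta_z for i in range(n)]
    projected = [0.0] * n
    for j, p_j in enumerate(next_probs):
        # Bellman update for atom z_j, clipped to [v_min, v_max].
        tz = min(v_max, max(v_min, reward + gamma * atoms[j]))
        for i, z_i in enumerate(atoms):
            # Triangular weight, clipped to [0, 1].
            weight = min(1.0, max(0.0, 1.0 - abs(tz - z_i) / delta_z))
            projected[i] += weight * p_j
    return projected


if __name__ == "__main__":
    print(project_distribution([0.0, 1.0, 0.0], reward=0.5, gamma=1.0, v_min=0.0, v_max=2.0))
```

+
+The nested loop mirrors the double index of the formula; a practical implementation would vectorize it.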
+
+
+
+.. autoclass:: rl_coach.agents.categorical_dqn_agent.CategoricalDQNAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/double_dqn.rst
+++ b/docs_raw/source/components/agents/value_optimization/double_dqn.rst
@@ -0,0 +1,35 @@
+Double DQN
+==========
+
+**Actions space:** Discrete
+
+**References:** `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461.pdf>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+
+Training the network
++++++++++++++++++++
+
+1. Sample a batch of transitions from the replay buffer.
+
+2. Using the next states from the sampled batch, run the online network in order to find the :math:`Q`-maximizing
+   action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
+   network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.
+
+3. So that actions that were not played produce zero MSE loss, and therefore no update,
+   use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
+   Set those values as the targets for the actions that were not actually played.
+
+4. For each action that was played, use the following equation for calculating the targets of the network:
+   :math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a))`
+
+5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
+
+6. Once in every few thousand steps, copy the weights from the online network to the target network.
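+
+Steps 2 to 4 can be sketched as follows. This is a minimal example under stated assumptions: the function name is
+hypothetical, the network outputs are passed in as nested lists, and terminal states are ignored.
+

```python
def ddqn_targets(q_online_current, q_online_next, q_target_next, actions, rewards, gamma):
    """Compute Double DQN targets for a batch.

    q_online_current[i][a] : online network, current state of transition i
    q_online_next[i][a]    : online network, next state (used to select the action)
    q_target_next[i][a]    : target network, next state (used to evaluate the action)
    """
    # Unplayed actions keep the current predictions, so their error is zero.
    targets = [list(row) for row in q_online_current]
    for i, (action, reward) in enumerate(zip(actions, rewards)):
        next_q = q_online_next[i]
        best = max(range(len(next_q)), key=lambda k: next_q[k])
        targets[i][action] = reward + gamma * q_target_next[i][best]
    return targets


if __name__ == "__main__":
    print(ddqn_targets([[0.0, 0.0]], [[1.0, 2.0]], [[10.0, 4.0]], [0], [1.0], 0.5))
```
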
--- a/docs_raw/source/components/agents/value_optimization/dqn.rst
+++ b/docs_raw/source/components/agents/value_optimization/dqn.rst
@@ -0,0 +1,37 @@
+Deep Q Networks
+===============
+
+**Actions space:** Discrete
+
+**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+
+Training the network
++++++++++++++++++++
+
+1. Sample a batch of transitions from the replay buffer.
+
+2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
+   the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.
+
+3. So that actions that were not played produce zero MSE loss, and therefore no update,
+   use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
+   Set those values as the targets for the actions that were not actually played.
+
+4. For each action that was played, use the following equation for calculating the targets of the network:
+   :math:`y_t=r(s_t,a_t )+\gamma \cdot max_a Q(s_{t+1},a)`
+
+5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
+
+6. Once in every few thousand steps, copy the weights from the online network to the target network.
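+
+The target construction in steps 2 to 4 can be sketched as follows. The function name is hypothetical, network outputs
+are passed in as nested lists, and terminal states are ignored for brevity.
+

```python
def dqn_targets(q_online_current, q_target_next, actions, rewards, gamma):
    """Compute DQN targets for a batch.

    q_online_current[i][a] : online network, current state of transition i
    q_target_next[i][a]    : target network, next state of transition i
    """
    # Unplayed actions keep the current predictions, so their error is zero.
    targets = [list(row) for row in q_online_current]
    for i, (action, reward) in enumerate(zip(actions, rewards)):
        targets[i][action] = reward + gamma * max(q_target_next[i])
    return targets


if __name__ == "__main__":
    print(dqn_targets([[1.0, 2.0]], [[3.0, 5.0]], [1], [1.0], 0.5))
```
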
+
+
+.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/dueling_dqn.rst
+++ b/docs_raw/source/components/agents/value_optimization/dueling_dqn.rst
@@ -0,0 +1,27 @@
+Dueling DQN
+===========
+
+**Actions space:** Discrete
+
+**References:** `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/dueling_dqn.png
+   :align: center
+
+General Description
+-------------------
+Dueling DQN changes the network structure compared to DQN.
+
+Dueling DQN uses a specialized *Dueling Q Head* in order to separate :math:`Q` into an :math:`A` (advantage)
+stream and a :math:`V` (state value) stream. Adding this type of structure to the network head allows the network to better differentiate
+actions from one another, and significantly improves the learning.
+
+In many states, the values of the different actions are very similar, and it matters less which action is taken.
+This is especially relevant in environments where there are many actions to choose from. In DQN, on each training
+iteration, for each of the states in the batch, we update the :math:`Q` values only for the specific actions taken in
+those states. This results in slower learning, as we do not learn the :math:`Q` values for actions that have not been taken yet.
+With the dueling architecture, on the other hand, learning is faster, as we start learning the state value even if only a
+single action has been taken in this state.
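+
+As an illustration of how the two streams can be recombined, the sketch below uses the mean-subtracted aggregation
+from the referenced paper. This page does not state which aggregation the Dueling Q Head uses, so treat the formula
+and the function name as assumptions.
+

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


if __name__ == "__main__":
    # A change in V shifts the Q values of all actions at once.
    print(dueling_q_values(2.0, [1.0, 3.0]))
    print(dueling_q_values(5.0, [1.0, 3.0]))
```

+
+Because :math:`V` is shared by all actions, an update driven by one played action also moves the estimates of the others.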
--- a/docs_raw/source/components/agents/value_optimization/mmc.rst
+++ b/docs_raw/source/components/agents/value_optimization/mmc.rst
@@ -0,0 +1,37 @@
+Mixed Monte Carlo
+=================
+
+**Actions space:** Discrete
+
+**References:** `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+Training the network
++++++++++++++++++++
+
+In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
+
+The DDQN targets are calculated in the same manner as in the DDQN agent:
+
+:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
+
+The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:
+
+:math:`y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} )`
+
+A mixing ratio :math:`\alpha` is then used to get the final targets:
+
+:math:`y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}`
+
+Finally, the online network is trained using the current states as inputs, and the calculated targets.
+Once in every few thousand steps, copy the weights from the online network to the target network.
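+
+The Monte Carlo returns and the mixing step can be sketched as follows. Function names are hypothetical, and the
+DDQN targets are assumed to have been computed already.
+

```python
def discounted_returns(rewards, gamma):
    """Total discounted return from each step to the end of the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mixed_targets(ddqn_targets, mc_returns, alpha):
    """Blend DDQN targets with Monte Carlo returns using mixing ratio alpha."""
    return [(1.0 - alpha) * d + alpha * m for d, m in zip(ddqn_targets, mc_returns)]


if __name__ == "__main__":
    mc = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
    print(mc)
    print(mixed_targets([2.0, 2.0, 2.0], mc, alpha=0.25))
```
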
+
+
+.. autoclass:: rl_coach.agents.mmc_agent.MixedMonteCarloAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/n_step.rst
+++ b/docs_raw/source/components/agents/value_optimization/n_step.rst
@@ -0,0 +1,35 @@
+N-Step Q Learning
+=================
+
+**Actions space:** Discrete
+
+**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+
+Training the network
++++++++++++++++++++
+
+The :math:`N`-step Q learning algorithm works in a similar manner to DQN except for the following changes:
+
+1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every
+   :math:`N` steps using the latest :math:`N` steps played by the agent.
+
+2. In order to stabilize the learning, multiple workers work together to update the network.
+   This has the same effect as decorrelating the samples used for training.
+
+3. Instead of using single-step Q targets for the network, the rewards from :math:`N` consecutive steps are accumulated
+   to form the :math:`N`-step Q targets, according to the following equation:
+   :math:`R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})`
+   where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
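+
+The target equation in item 3 can be evaluated with a single backward pass over the rollout. The sketch below is
+illustrative: the function name is hypothetical, and ``bootstrap_value`` stands for :math:`V` of the state reached
+after the last step of the rollout.
+

```python
def n_step_targets(rewards, bootstrap_value, gamma):
    """Return the N-step target for every state in a rollout.

    For the state at index t in a rollout of length T_max, the target sums
    k = T_max - t discounted rewards and adds gamma**k * bootstrap_value.
    """
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


if __name__ == "__main__":
    print(n_step_targets([1.0, 1.0], bootstrap_value=4.0, gamma=0.5))
```

+
+Earlier states in the rollout use more real rewards and a more heavily discounted bootstrap value.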
+
+
+
+.. autoclass:: rl_coach.agents.n_step_q_agent.NStepQAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/naf.rst
+++ b/docs_raw/source/components/agents/value_optimization/naf.rst
@@ -0,0 +1,33 @@
+Normalized Advantage Functions
+==============================
+
+**Actions space:** Continuous
+
+**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/naf.png
+   :width: 600px
+   :align: center
+
+Algorithm Description
+---------------------
+Choosing an action
++++++++++++++++++
+The current state is used as an input to the network. The action mean :math:`\mu(s_t )` is extracted from the output head.
+It is then passed to the exploration policy which adds noise in order to encourage exploration.
+
+Training the network
++++++++++++++++++++
+The network is trained by using the following targets:
+:math:`y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1})`
+Use the next states as the inputs to the target network and extract the :math:`V` value, from within the head,
+to get :math:`V(s_{t+1} )`. Then, update the online network using the current states and actions as inputs,
+and :math:`y_t` as the targets.
+After every training step, use a soft update in order to copy the weights from the online network to the target network.
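+
+The target and the soft update can be sketched as follows. The function names and the rate ``tau`` are assumptions
+made for illustration; the page does not name the soft update parameter, and weights are represented as flat lists.
+

```python
def naf_target(reward, next_state_value, gamma):
    """y_t = r(s_t, a_t) + gamma * V(s_{t+1}), with V taken from the target network."""
    return reward + gamma * next_state_value


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau of the way towards the online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]


if __name__ == "__main__":
    print(naf_target(1.0, 4.0, 0.5))
    print(soft_update([0.0, 2.0], [1.0, 4.0], tau=0.5))
```

+
+With a small ``tau`` the target network trails the online network slowly, instead of being replaced every few thousand steps.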
+
+
+
+.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/nec.rst
+++ b/docs_raw/source/components/agents/value_optimization/nec.rst
@@ -0,0 +1,50 @@
+Neural Episodic Control
+=======================
+
+**Actions space:** Discrete
+
+**References:** `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/nec.png
+   :width: 500px
+   :align: center
+
+Algorithm Description
+---------------------
+Choosing an action
++++++++++++++++++
+
+1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate
+   output from the middleware.
+
+2. For each possible action :math:`a_i`, run the DND head using the state embedding and the selected action :math:`a_i` as inputs.
+   The DND is queried and returns the :math:`P` nearest neighbor keys and values. The keys and values are used to calculate
+   and return the action :math:`Q` value from the network.
+
+3. Pass all the :math:`Q` values to the exploration policy and choose an action accordingly.
+
+4. Store the state embeddings and actions taken during the current episode in a small buffer :math:`B`, in order to
+   accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
+
+Finalizing an episode
+++++++++++++++++++++
+For each step in the episode, the state embeddings and the taken actions are stored in the buffer :math:`B`.
+When the episode is finished, the replay buffer calculates the :math:`N`-step total return of each transition in the
+buffer, bootstrapped using the maximum :math:`Q` value of the :math:`N`-th transition. The stored state embeddings
+and actions are then inserted, along with their calculated returns, into the DND, and the buffer :math:`B` is reset.
+
+Training the network
++++++++++++++++++++
+Train the network only when the DND has enough entries for querying.
+
+To train the network, the current states are used as the inputs and the :math:`N`-step returns are used as the targets.
+The :math:`N`-step return used takes into account :math:`N` consecutive steps, and bootstraps the last value from
+the network if necessary:
+:math:`y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N   max_a Q(s_{t+N},a)`
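+
+A minimal sketch of this return, with a hypothetical function name. It assumes that bootstrapping is skipped when the
+episode ends within the :math:`N` steps, which is represented here by passing no next-state :math:`Q` values.
+

```python
def n_step_return(rewards, gamma, next_q_values=None):
    """Sum up to N discounted rewards and optionally bootstrap from max_a Q(s_{t+N}, a)."""
    total = sum((gamma ** j) * r for j, r in enumerate(rewards))
    if next_q_values:
        # Bootstrap only when the episode continues past the last counted step.
        total += (gamma ** len(rewards)) * max(next_q_values)
    return total


if __name__ == "__main__":
    print(n_step_return([1.0, 2.0], 0.5))
    print(n_step_return([1.0, 2.0], 0.5, next_q_values=[4.0, 8.0]))
```
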
+
+
+
+.. autoclass:: rl_coach.agents.nec_agent.NECAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/pal.rst
+++ b/docs_raw/source/components/agents/value_optimization/pal.rst
@@ -0,0 +1,45 @@
+Persistent Advantage Learning
+=============================
+
+**Actions space:** Discrete
+
+**References:** `Increasing the Action Gap: New Operators for Reinforcement Learning <https://arxiv.org/abs/1512.04860>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+Training the network
++++++++++++++++++++
+
+1. Sample a batch of transitions from the replay buffer. 
+
+2. Start by calculating the initial target values in the same manner as they are calculated in DDQN
+   :math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
+
+3. The action gap :math:`V(s_t )-Q(s_t,a_t)` should then be subtracted from each of the calculated targets.
+   To calculate the action gap, run the target network using the current states and get the :math:`Q` values
+   for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state:
+   :math:`V(s_t )=max_a Q(s_t,a)`
+
+4. For *advantage learning (AL)*, reduce the action gap weighted by a predefined parameter :math:`\alpha` from
+   the targets :math:`y_t^{DDQN}`:
+   :math:`y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t ))`
+
+5. For *persistent advantage learning (PAL)*, the target network is also used in order to calculate the action
+   gap for the next state:
+   :math:`V(s_{t+1} )-Q(s_{t+1},a_{t+1})`
+   where :math:`a_{t+1}` is chosen by running the next states through the online network and choosing the action that
+   has the highest predicted :math:`Q` value. Finally, the targets are defined as:
+   :math:`y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} ))`
+
+6. Train the online network using the current states as inputs, and with the aforementioned targets.
+
+7. Once in every few thousand steps, copy the weights from the online network to the target network.
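+
+The AL and PAL corrections in steps 3 to 5 can be sketched as follows. Function names are hypothetical; the
+:math:`Q` value lists are assumed to come from the target network, and ``next_action`` from the online network,
+as described above.
+

```python
def action_gap(q_values, action):
    """V(s) - Q(s, a), with V(s) estimated as max_a Q(s, a)."""
    return max(q_values) - q_values[action]


def al_target(ddqn_target, q_current, action, alpha):
    """Advantage learning: subtract the weighted action gap of the current state."""
    return ddqn_target - alpha * action_gap(q_current, action)


def pal_target(ddqn_target, q_current, action, q_next, next_action, alpha):
    """Persistent advantage learning: subtract the smaller of the two action gaps."""
    gap_now = action_gap(q_current, action)
    gap_next = action_gap(q_next, next_action)
    return ddqn_target - alpha * min(gap_now, gap_next)


if __name__ == "__main__":
    print(al_target(5.0, [1.0, 3.0], 0, 0.5))
    print(pal_target(5.0, [1.0, 3.0], 0, [2.0, 2.5], 0, 0.5))
```
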
+
+
+.. autoclass:: rl_coach.agents.pal_agent.PALAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/qr_dqn.rst
+++ b/docs_raw/source/components/agents/value_optimization/qr_dqn.rst
@@ -0,0 +1,33 @@
+Quantile Regression DQN
+=======================
+
+**Actions space:** Discrete
+
+**References:** `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/qr_dqn.png
+   :align: center
+
+Algorithm Description
+---------------------
+
+Training the network
++++++++++++++++++++
+
+1. Sample a batch of transitions from the replay buffer.
+
+2. First, the next state quantiles are predicted. These are used in order to calculate the targets for the network,
+   by following the Bellman equation.
+   Next, the current quantile locations for the current states are predicted, sorted, and used for calculating the
+   quantile midpoints targets.
+
+3. The network is trained with the quantile regression loss between the resulting quantile locations and the target
+   quantile locations. Only the targets of the actions that were actually taken are updated.
+
+4. Once in every few thousand steps, weights are copied from the online network to the target network.
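+
+The page does not spell out the quantile midpoints or the loss, so the sketch below follows the definitions in the
+referenced paper: midpoints :math:`(2i+1)/(2N)` and the quantile Huber loss. The function names and the default
+``kappa`` are assumptions, and the loss is shown for a single state and action.
+

```python
def quantile_midpoints(n):
    """Midpoints of n equal-probability intervals: (2i + 1) / (2n)."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def huber(u, kappa=1.0):
    """Huber loss: quadratic within [-kappa, kappa], linear outside."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(predicted, targets, kappa=1.0):
    """Sum over predicted quantiles of the mean asymmetric Huber error to the targets."""
    taus = quantile_midpoints(len(predicted))
    total = 0.0
    for tau, theta in zip(taus, predicted):
        for target in targets:
            u = target - theta
            indicator = 1.0 if u < 0 else 0.0
            total += abs(tau - indicator) * huber(u, kappa)
    return total / len(targets)


if __name__ == "__main__":
    print(quantile_midpoints(4))
    print(quantile_huber_loss([0.0, 1.0], [0.5, 1.5]))
```
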
+
+
+.. autoclass:: rl_coach.agents.qr_dqn_agent.QuantileRegressionDQNAlgorithmParameters
--- a/docs_raw/source/components/agents/value_optimization/rainbow.rst
+++ b/docs_raw/source/components/agents/value_optimization/rainbow.rst
@@ -0,0 +1,51 @@
+Rainbow
+=======
+
+**Actions space:** Discrete
+
+**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_
+
+Network Structure
+-----------------
+
+.. image:: /_static/img/design_imgs/rainbow.png
+   :align: center
+
+Algorithm Description
+---------------------
+
+Rainbow combines 6 recent advancements in reinforcement learning:
+
+* N-step returns
+* Distributional state-action value learning
+* Dueling networks
+* Noisy Networks
+* Double DQN
+* Prioritized Experience Replay
+
+Training the network
++++++++++++++++++++
+
+1. Sample a batch of transitions from the replay buffer.
+
+2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
+   that the :math:`i`-th component of the projected update is calculated as follows:
+
+   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
+
+   where:
+
+   *  :math:`[ \cdot ]^b_a` bounds its argument in the range :math:`[a, b]`
+   *  :math:`\hat{T}_{z_{j}}` is the :math:`n`-step Bellman update for atom :math:`z_j`:
+      :math:`\hat{T}_{z_{j}} := r_t+\gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^{n} z_j`
+
+
+3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
+   probability distribution. Only the targets of the actions that were actually taken are updated.
+
+4. Once in every few thousand steps, weights are copied from the online network to the target network.
+
+5. After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer
+   using the KL divergence loss that is returned from the network.
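+
+A minimal sketch of the :math:`n`-step Bellman update applied to the atoms, before projection. It assumes the
+standard :math:`n`-step return, in which the :math:`k`-th reward is discounted by :math:`\gamma^k` and the atom by
+:math:`\gamma^n`; the function name is hypothetical and terminal transitions are not handled.
+

```python
def n_step_atoms(rewards, atoms, gamma, v_min, v_max):
    """Shift and scale every atom by the n-step return, then clip to [v_min, v_max]."""
    n = len(rewards)
    n_step_reward = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return [min(v_max, max(v_min, n_step_reward + (gamma ** n) * z)) for z in atoms]


if __name__ == "__main__":
    print(n_step_atoms([1.0, 1.0], [0.0, 4.0], gamma=0.5, v_min=-10.0, v_max=10.0))
```
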
+
+
+.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters