update of api docstrings across coach and tutorials [WIP] (#91)
* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
@@ -0,0 +1,43 @@
Bootstrapped DQN
================

**Actions space:** Discrete

**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/bs_dqn.png
   :align: center

Algorithm Description
---------------------

Choosing an action
++++++++++++++++++
The current state is used as the input to the network. The network contains several :math:`Q` heads, which
return different estimations of the action :math:`Q` values. For each episode, the bootstrapped exploration policy
selects a single head to play with during the episode. According to the selected head, only the relevant
output :math:`Q` values are used. Using those :math:`Q` values, the exploration policy then selects the action for acting.

Storing the transitions
+++++++++++++++++++++++
For each transition, a binomial mask is generated according to a predefined probability and the number of output heads.
The mask is a binary vector where each element holds a 0 for heads that should not train on the specific transition,
and a 1 for heads that should use the transition for training. The mask is stored as part of the transition info in
the replay buffer.

Training the network
++++++++++++++++++++
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
and for each output head, if the transition mask is 1, change the target of the played action to :math:`y_t`,
according to the standard DQN update rule:

:math:`y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a)`

Otherwise, leave it intact so that the transition does not affect the learning of this head.
Then, train the online network according to the calculated targets.
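A minimal NumPy sketch of this masked target assignment is shown below. It is an illustration only, not the Coach
implementation; the helper name ``bootstrapped_dqn_targets``, the array shapes, and the terminal-state handling via
``dones`` are assumptions made for the example.

.. code-block:: python

   import numpy as np

   def bootstrapped_dqn_targets(q_online, q_target_next, actions, rewards, dones, mask, gamma=0.99):
       """Per-head DQN targets; heads that are masked out keep their original predictions.

       q_online:      (batch, heads, actions) online network predictions for the current states
       q_target_next: (batch, heads, actions) target network predictions for the next states
       actions:       (batch,) actions that were actually played
       rewards:       (batch,) rewards
       dones:         (batch,) 1.0 where the episode terminated (zeroes the bootstrap term)
       mask:          (batch, heads) binomial mask stored with each transition
       """
       targets = q_online.copy()  # untouched entries contribute zero error to the loss
       batch_size, num_heads, _ = q_online.shape
       for i in range(batch_size):
           for k in range(num_heads):
               if mask[i, k] == 1:
                   y = rewards[i] + gamma * (1.0 - dones[i]) * q_target_next[i, k].max()
                   targets[i, k, actions[i]] = y
       return targets

   # The mask itself is drawn when the transition is stored, e.g. for 10 heads with a
   # 50% chance of assigning the transition to each head:
   mask = np.random.binomial(n=1, p=0.5, size=(1, 10))

The loops are written for readability; an actual implementation would vectorize them over the batch.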
As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.
@@ -0,0 +1,39 @@
Categorical DQN
===============

**Actions space:** Discrete

**References:** `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/distributional_dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. The Bellman update is projected onto the set of atoms representing the :math:`Q` values distribution, such
   that the :math:`i`-th component of the projected update is calculated as follows
   (a NumPy sketch of this projection follows the list):

   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`

   where:

   * :math:`[ \cdot ]^b_a` bounds its argument in the range :math:`[a, b]`
   * :math:`\hat{T}_{z_{j}}` is the Bellman update for atom :math:`z_j`: :math:`\hat{T}_{z_{j}} := r+\gamma z_j`

3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
   probability distribution. Only the targets of the actions that were actually taken are updated.

4. Once in every few thousand steps, weights are copied from the online network to the target network.
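To make the projection step concrete, the following NumPy sketch builds the target distribution from the
next-state probabilities. It is an illustration under assumed names and default values (``next_probs``, ``v_min``,
``v_max``, the ``dones`` terminal handling), not the Coach implementation.

.. code-block:: python

   import numpy as np

   def project_distribution(next_probs, rewards, dones, gamma=0.99,
                            v_min=-10.0, v_max=10.0, n_atoms=51):
       """Project the Bellman update onto the fixed support of atoms.

       next_probs: (batch, n_atoms) probabilities p_j(s_{t+1}, pi(s_{t+1})) of the chosen next action
       rewards, dones: (batch,)
       Returns the target distribution m with shape (batch, n_atoms).
       """
       z = np.linspace(v_min, v_max, n_atoms)           # atom locations z_i
       delta_z = (v_max - v_min) / (n_atoms - 1)
       batch = rewards.shape[0]
       rows = np.arange(batch)
       m = np.zeros((batch, n_atoms))
       for j in range(n_atoms):
           # Bellman update for atom z_j, clipped to [v_min, v_max]; terminal states keep only the reward
           tz_j = np.clip(rewards + gamma * (1.0 - dones) * z[j], v_min, v_max)
           b = (tz_j - v_min) / delta_z                 # fractional index of tz_j on the support
           lower = np.floor(b).astype(int)
           upper = np.ceil(b).astype(int)
           # distribute the probability mass of atom j between its two nearest atoms
           m[rows, lower] += next_probs[:, j] * (upper - b + (lower == upper))
           m[rows, upper] += next_probs[:, j] * (b - lower)
       return m

The resulting distribution is then used as the target of the cross-entropy loss in step 3.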
.. autoclass:: rl_coach.agents.categorical_dqn_agent.CategoricalDQNAlgorithmParameters
@@ -0,0 +1,35 @@
Double DQN
==========

**Actions space:** Discrete

**References:** `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. Using the next states from the sampled batch, run the online network in order to find the :math:`Q` maximizing
   action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
   network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.

3. To zero out the updates for the actions that were not played (by zeroing their contribution to the MSE loss),
   use the current states from the sampled batch, and run the online network to get the current :math:`Q` value
   predictions. Set those values as the targets for the actions that were not actually played.

4. For each action that was played, use the following equation for calculating the targets of the network
   (a short sketch of steps 2-4 follows the list):

   :math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a))`

5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.

6. Once in every few thousand steps, copy the weights from the online network to the target network.
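A compact NumPy sketch of the target construction in steps 2-4 is given below. The array names and the
terminal-state handling via ``dones`` are assumptions made for the example, not the Coach API.

.. code-block:: python

   import numpy as np

   def double_dqn_targets(q_online_curr, q_online_next, q_target_next,
                          actions, rewards, dones, gamma=0.99):
       """Double DQN targets: select actions with the online network, evaluate with the target network.

       q_online_curr: (batch, actions) online network predictions for the current states
       q_online_next: (batch, actions) online network predictions for the next states
       q_target_next: (batch, actions) target network predictions for the next states
       """
       rows = np.arange(rewards.shape[0])
       best_next_actions = np.argmax(q_online_next, axis=1)      # step 2: argmax_a Q_online(s_{t+1}, a)
       next_q = q_target_next[rows, best_next_actions]           # step 2: evaluate with the target network
       targets = q_online_curr.copy()                            # step 3: non-played actions get a zero error
       targets[rows, actions] = rewards + gamma * (1.0 - dones) * next_q   # step 4
       return targets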
docs_raw/source/components/agents/value_optimization/dqn.rst
@@ -0,0 +1,37 @@
Deep Q Networks
===============

**Actions space:** Discrete

**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
   the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.

3. To zero out the updates for the actions that were not played (by zeroing their contribution to the MSE loss),
   use the current states from the sampled batch, and run the online network to get the current :math:`Q` value
   predictions. Set those values as the targets for the actions that were not actually played.

4. For each action that was played, use the following equation for calculating the targets of the network
   (a short sketch of steps 2-4 follows the list):

   :math:`y_t=r(s_t,a_t )+\gamma \cdot max_a Q(s_{t+1},a)`

5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.

6. Once in every few thousand steps, copy the weights from the online network to the target network.
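The same steps can be sketched in NumPy as follows; the names and the terminal-state handling via ``dones`` are
assumptions made for the example, not the Coach API.

.. code-block:: python

   import numpy as np

   def dqn_targets(q_online_curr, q_target_next, actions, rewards, dones, gamma=0.99):
       """Standard DQN targets: bootstrap from the maximum target-network value of the next state."""
       rows = np.arange(rewards.shape[0])
       max_next_q = q_target_next.max(axis=1)                    # step 2: max_a Q(s_{t+1}, a)
       targets = q_online_curr.copy()                            # step 3: non-played actions get a zero error
       targets[rows, actions] = rewards + gamma * (1.0 - dones) * max_next_q   # step 4
       return targets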
.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters
@@ -0,0 +1,27 @@
Dueling DQN
===========

**Actions space:** Discrete

**References:** `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dueling_dqn.png
   :align: center

General Description
-------------------
Dueling DQN introduces a change in the network structure compared to DQN.

Dueling DQN uses a specialized *Dueling Q Head* in order to separate :math:`Q` into an :math:`A` (advantage)
stream and a :math:`V` (state value) stream. Adding this type of structure to the network head allows the network
to better differentiate actions from one another, and significantly improves the learning.

In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training
iteration, for each of the states in the batch, we update the :math:`Q` values only for the specific actions taken in
those states. This results in slower learning, as we do not learn the :math:`Q` values for actions that were not taken
yet. With the dueling architecture, on the other hand, learning is faster, as we start learning the state value even
if only a single action has been taken in this state.
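The way the two streams are recombined into :math:`Q` values can be sketched as follows. This is a minimal NumPy
illustration of the aggregation described in the Dueling DQN paper, with assumed array names, and not the Coach
head implementation.

.. code-block:: python

   import numpy as np

   def dueling_q_values(value, advantages):
       """Combine the V stream and the A stream into Q values.

       value:      (batch, 1)        output of the state value stream
       advantages: (batch, actions)  output of the advantage stream
       """
       # subtracting the mean advantage keeps V and A identifiable:
       # adding a constant to A and removing it from V would otherwise leave Q unchanged
       return value + advantages - advantages.mean(axis=1, keepdims=True)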
docs_raw/source/components/agents/value_optimization/mmc.rst
@@ -0,0 +1,37 @@
Mixed Monte Carlo
=================

**Actions space:** Discrete

**References:** `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).

The DDQN targets are calculated in the same manner as in the DDQN agent:

:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`

The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:

:math:`y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} )`

A mixing ratio :math:`\alpha` is then used to get the final targets:

:math:`y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}`

Finally, the online network is trained using the current states as inputs, and the calculated targets.
Once in every few thousand steps, copy the weights from the online network to the target network.
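The mixing can be sketched as follows; the helper names and the per-episode reward array are assumptions made for
the example, not the Coach implementation.

.. code-block:: python

   import numpy as np

   def discounted_returns(rewards, gamma=0.99):
       """Total discounted return y_t^MC for every step of a finished episode."""
       returns = np.zeros(len(rewards))
       running = 0.0
       for t in reversed(range(len(rewards))):
           running = rewards[t] + gamma * running
           returns[t] = running
       return returns

   def mixed_monte_carlo_targets(ddqn_targets, mc_returns, alpha=0.1):
       """Blend the DDQN targets with the Monte Carlo returns using the mixing ratio alpha."""
       return (1.0 - alpha) * np.asarray(ddqn_targets) + alpha * np.asarray(mc_returns)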
.. autoclass:: rl_coach.agents.mmc_agent.MixedMonteCarloAlgorithmParameters
@@ -0,0 +1,35 @@
N-Step Q Learning
=================

**Actions space:** Discrete

**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

The :math:`N`-step Q learning algorithm works in a similar manner to DQN, except for the following changes:

1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every
   :math:`N` steps using the latest :math:`N` steps played by the agent.

2. In order to stabilize the learning, multiple workers work together to update the network.
   This has a decorrelating effect on the training samples, similar to the one achieved by a replay buffer.

3. Instead of using single-step Q targets for the network, the rewards from :math:`N` consecutive steps are
   accumulated to form the :math:`N`-step Q targets, according to the following equation:

   :math:`R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})`

   where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch
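A minimal NumPy sketch of the :math:`N`-step target computation for a single rollout is given below; the function
name and arguments are assumptions made for the example, not the Coach implementation.

.. code-block:: python

   import numpy as np

   def n_step_q_targets(rewards, bootstrap_value, gamma=0.99):
       """N-step Q targets for the latest N steps played.

       rewards:         (N,) rewards r_t ... r_{t+N-1} of the rollout
       bootstrap_value: V(s_{t+N}) predicted by the network for the state that follows the rollout
       For each step, the horizon k shrinks towards the end of the rollout, matching k = T_max - state_index.
       """
       targets = np.zeros(len(rewards))
       running = bootstrap_value
       for i in reversed(range(len(rewards))):
           running = rewards[i] + gamma * running
           targets[i] = running
       return targets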
.. autoclass:: rl_coach.agents.n_step_q_agent.NStepQAlgorithmParameters
docs_raw/source/components/agents/value_optimization/naf.rst
@@ -0,0 +1,33 @@
Normalized Advantage Functions
==============================

**Actions space:** Continuous

**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/naf.png
   :width: 600px
   :align: center

Algorithm Description
---------------------

Choosing an action
++++++++++++++++++
The current state is used as an input to the network. The action mean :math:`\mu(s_t )` is extracted from the output head.
It is then passed to the exploration policy, which adds noise in order to encourage exploration.

Training the network
++++++++++++++++++++
The network is trained using the following targets:

:math:`y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1})`

Use the next states as the inputs to the target network and extract the :math:`V` value from within the head
to get :math:`V(s_{t+1} )`. Then, update the online network using the current states and actions as inputs,
and :math:`y_t` as the targets.
After every training step, use a soft update in order to copy the weights from the online network to the target network.
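Both the target construction and the soft target update can be sketched as follows; the parameter name ``tau`` and
the list-of-arrays representation of the weights are assumptions made for the example, not the Coach API.

.. code-block:: python

   import numpy as np

   def naf_targets(rewards, v_next, dones, gamma=0.99):
       """NAF training targets y_t = r(s_t, a_t) + gamma * V(s_{t+1}); terminal states bootstrap zero."""
       return rewards + gamma * (1.0 - dones) * v_next

   def soft_update(target_params, online_params, tau=0.001):
       """Polyak averaging applied after every training step: target <- tau * online + (1 - tau) * target."""
       return [tau * o + (1.0 - tau) * t for t, o in zip(target_params, online_params)]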
.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters
docs_raw/source/components/agents/value_optimization/nec.rst
@@ -0,0 +1,50 @@
Neural Episodic Control
=======================

**Actions space:** Discrete

**References:** `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/nec.png
   :width: 500px
   :align: center

Algorithm Description
---------------------

Choosing an action
++++++++++++++++++

1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate
   output from the middleware.

2. For each possible action :math:`a_i`, run the DND head using the state embedding and the selected action :math:`a_i`
   as inputs. The DND is queried and returns the :math:`P` nearest neighbor keys and values. The keys and values are
   used to calculate and return the action :math:`Q` value from the network (a query sketch follows this list).

3. Pass all the :math:`Q` values to the exploration policy and choose an action accordingly.

4. Store the state embeddings and actions taken during the current episode in a small buffer :math:`B`, in order to
   accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
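The DND query in step 2 can be sketched in NumPy as follows. The inverse-distance kernel follows the NEC paper;
the function name, the parameter names and the kernel constant ``delta`` are assumptions made for the example,
not the Coach implementation.

.. code-block:: python

   import numpy as np

   def dnd_lookup(embedding, dnd_keys, dnd_values, p=50, delta=1e-3):
       """Estimate Q(s, a) for a single action by querying that action's DND.

       embedding:  (d,)    state embedding produced by the middleware
       dnd_keys:   (n, d)  stored embeddings for this action
       dnd_values: (n,)    stored N-step returns for this action
       p:          number of nearest neighbors to use
       """
       distances = np.sum((dnd_keys - embedding) ** 2, axis=1)
       nearest = np.argsort(distances)[:p]               # the P nearest neighbor keys
       kernel = 1.0 / (distances[nearest] + delta)       # k(h, h_i) = 1 / (||h - h_i||^2 + delta)
       weights = kernel / kernel.sum()
       return np.dot(weights, dnd_values[nearest])       # weighted average of the stored returns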
Finalizing an episode
+++++++++++++++++++++
For each step in the episode, the state embeddings and the taken actions are stored in the buffer :math:`B`.
When the episode is finished, the replay buffer calculates the :math:`N`-step total return of each transition in the
buffer, bootstrapped using the maximum :math:`Q` value of the :math:`N`-th transition. Those returns are then inserted
into the DND along with their corresponding state embeddings, and the buffer :math:`B` is reset.

Training the network
++++++++++++++++++++
Train the network only when the DND has enough entries for querying.

To train the network, the current states are used as the inputs and the :math:`N`-step returns are used as the targets.
The :math:`N`-step return used takes into account :math:`N` consecutive steps, and bootstraps the last value from
the network if necessary:

:math:`y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N max_a Q(s_{t+N},a)`
.. autoclass:: rl_coach.agents.nec_agent.NECAlgorithmParameters
docs_raw/source/components/agents/value_optimization/pal.rst
@@ -0,0 +1,45 @@
Persistent Advantage Learning
=============================

**Actions space:** Discrete

**References:** `Increasing the Action Gap: New Operators for Reinforcement Learning <https://arxiv.org/abs/1512.04860>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. Start by calculating the initial target values in the same manner as they are calculated in DDQN:

   :math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`

3. The action gap :math:`V(s_t )-Q(s_t,a_t)` should then be subtracted from each of the calculated targets.
   To calculate the action gap, run the target network using the current states and get the :math:`Q` values
   for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state:

   :math:`V(s_t )=max_a Q(s_t,a)`

4. For *advantage learning (AL)*, subtract the action gap, weighted by a predefined parameter :math:`\alpha`, from
   the targets :math:`y_t^{DDQN}`:

   :math:`y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t ))`

5. For *persistent advantage learning (PAL)*, the target network is also used in order to calculate the action
   gap for the next state:

   :math:`V(s_{t+1} )-Q(s_{t+1},a_{t+1})`

   where :math:`a_{t+1}` is chosen by running the next states through the online network and choosing the action that
   has the highest predicted :math:`Q` value. Finally, the targets are defined as
   (a sketch of the AL and PAL targets follows the list):

   :math:`y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} ))`

6. Train the online network using the current states as inputs, and with the aforementioned targets.

7. Once in every few thousand steps, copy the weights from the online network to the target network.
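The AL and PAL targets of steps 4 and 5 can be sketched as follows; the array names and the ``persistent`` switch
are assumptions made for the example, not the Coach implementation.

.. code-block:: python

   import numpy as np

   def pal_targets(ddqn_targets, q_target_curr, q_target_next, q_online_next,
                   actions, alpha=0.9, persistent=True):
       """Advantage Learning (AL) and Persistent Advantage Learning (PAL) targets.

       ddqn_targets:  (batch,) y_t^DDQN computed as in the Double DQN agent
       q_target_curr: (batch, actions) target network predictions for the current states
       q_target_next: (batch, actions) target network predictions for the next states
       q_online_next: (batch, actions) online network predictions for the next states
       """
       rows = np.arange(len(ddqn_targets))
       # action gap of the current state: V(s_t) - Q(s_t, a_t), with V(s_t) = max_a Q(s_t, a)
       gap_curr = q_target_curr.max(axis=1) - q_target_curr[rows, actions]
       if not persistent:
           return ddqn_targets - alpha * gap_curr                         # AL, step 4
       # action gap of the next state, with a_{t+1} chosen greedily by the online network
       next_actions = np.argmax(q_online_next, axis=1)
       gap_next = q_target_next.max(axis=1) - q_target_next[rows, next_actions]
       return ddqn_targets - alpha * np.minimum(gap_curr, gap_next)       # PAL, step 5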
.. autoclass:: rl_coach.agents.pal_agent.PALAlgorithmParameters
@@ -0,0 +1,33 @@
Quantile Regression DQN
=======================

**Actions space:** Discrete

**References:** `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/qr_dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. First, the next state quantiles are predicted. These are used to calculate the targets for the network,
   by following the Bellman equation.
   Next, the quantile locations for the current states are predicted, sorted, and used for calculating the
   quantile midpoint targets.

3. The network is trained with the quantile regression loss between the resulting quantile locations and the target
   quantile locations (a sketch of this loss follows the list). Only the targets of the actions that were actually
   taken are updated.

4. Once in every few thousand steps, weights are copied from the online network to the target network.
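A NumPy sketch of the quantile regression (Huber) loss for a single transition is given below; the argument names
and the Huber threshold ``kappa`` are assumptions made for the example, not the Coach implementation.

.. code-block:: python

   import numpy as np

   def quantile_regression_loss(pred_quantiles, target_quantiles, kappa=1.0):
       """Quantile regression Huber loss between predicted and target quantile locations.

       pred_quantiles:   (N,) quantile locations predicted for the taken action
       target_quantiles: (N,) target quantile locations r + gamma * theta_j(s_{t+1}, a*)
       """
       n = len(pred_quantiles)
       tau_hat = (np.arange(n) + 0.5) / n                         # quantile midpoints tau_i
       u = target_quantiles[None, :] - pred_quantiles[:, None]    # pairwise TD errors
       huber = np.where(np.abs(u) <= kappa,
                        0.5 * u ** 2,
                        kappa * (np.abs(u) - 0.5 * kappa))
       weight = np.abs(tau_hat[:, None] - (u < 0))                # asymmetric quantile weighting
       return (weight * huber).sum(axis=1).mean() / kappa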
.. autoclass:: rl_coach.agents.qr_dqn_agent.QuantileRegressionDQNAlgorithmParameters
@@ -0,0 +1,51 @@
Rainbow
=======

**Actions space:** Discrete

**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/rainbow.png
   :align: center

Algorithm Description
---------------------

Rainbow combines 6 recent advancements in reinforcement learning:

* N-step returns
* Distributional state-action value learning
* Dueling networks
* Noisy Networks
* Double DQN
* Prioritized Experience Replay

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. The Bellman update is projected onto the set of atoms representing the :math:`Q` values distribution, such
   that the :math:`i`-th component of the projected update is calculated as follows:

   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`

   where:

   * :math:`[ \cdot ]^b_a` bounds its argument in the range :math:`[a, b]`
   * :math:`\hat{T}_{z_{j}}` is the :math:`n`-step Bellman update for atom :math:`z_j`:
     :math:`\hat{T}_{z_{j}} := r_t+\gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^{n} z_j`

3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
   probability distribution. Only the targets of the actions that were actually taken are updated.

4. Once in every few thousand steps, weights are copied from the online network to the target network.

5. After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer
   using the KL divergence loss that is returned from the network (a sketch of steps 2 and 5 follows the list).
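A short sketch of the two Rainbow-specific pieces above, the :math:`n`-step Bellman atoms of step 2 and the priority
update of step 5, is given below. The ``replay_buffer.update_priority`` call is an assumed interface used for
illustration only, not the actual Coach API.

.. code-block:: python

   import numpy as np

   def n_step_bellman_atoms(rewards, z, gamma=0.99):
       """Atoms of the n-step distributional Bellman update: sum_k gamma^k r_{t+k} + gamma^n z_j.

       rewards: (n,) the rewards r_t ... r_{t+n-1}
       z:       (n_atoms,) the fixed support of the value distribution
       """
       n = len(rewards)
       n_step_reward = sum(gamma ** k * rewards[k] for k in range(n))
       return n_step_reward + gamma ** n * z

   def update_priorities(replay_buffer, indices, kl_losses, epsilon=1e-6):
       """Use the per-transition KL divergence loss as the new priority (hypothetical buffer interface)."""
       for idx, kl in zip(indices, kl_losses):
           replay_buffer.update_priority(idx, kl + epsilon)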
.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters