
update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions


@@ -0,0 +1,51 @@
Rainbow
=======
**Action space:** Discrete
**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/rainbow.png
:align: center
Algorithm Description
---------------------
Rainbow combines six recent advances in deep reinforcement learning, two of which are sketched in code below:
* N-step returns
* Distributional state-action value learning
* Dueling networks
* Noisy Networks
* Double DQN
* Prioritized Experience Replay
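
Two of these ingredients reduce to a few lines. Below is a minimal NumPy sketch (an illustration, not
Coach's implementation): the dueling aggregation of the value and advantage streams, and greedy action
selection over a distributional head, where each action's :math:`Q` value is the expectation of its atom
distribution.

.. code-block:: python

    import numpy as np

    def dueling_aggregate(value, advantages):
        # Dueling networks: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
        # `value` has shape (1,); `advantages` has shape (num_actions,).
        return value + advantages - advantages.mean(axis=-1, keepdims=True)

    def greedy_action(dist, z):
        # Distributional head: the greedy action maximizes the expected value
        # of its atom distribution, argmax_a sum_j z_j * p_j(s, a).
        # `dist` has shape (num_actions, num_atoms); `z` has shape (num_atoms,).
        return int(np.argmax(dist @ z))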
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the prioritized replay buffer.
2. The Bellman update is projected onto the set of atoms representing the :math:`Q`-value distribution, such
   that the :math:`i`-th component of the projected update is calculated as follows (a NumPy sketch of this
   projection appears after these steps):

   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+n}, \pi(s_{t+n}))`

   where:

   * :math:`[\cdot]^b_a` bounds its argument in the range :math:`[a, b]`
   * :math:`\hat{T}_{z_{j}}` is the n-step Bellman update for atom :math:`z_j`:
     :math:`\hat{T}_{z_{j}} := r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n z_j`
3. The network is trained with the cross-entropy loss between the resulting probability distribution and the
   target probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once every few thousand steps, the weights are copied from the online network to the target network.
5. After every training step, the priorities of the batch transitions are updated in the prioritized replay
   buffer, using the KL-divergence loss returned by the network (see the sketch below).
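
The projection in step 2, the loss in step 3, and the priority update in step 5 fit in a short NumPy
sketch. This is an illustration of the math above under simplifying assumptions (a single transition and a
fixed support of atoms between ``v_min`` and ``v_max``), not Coach's implementation:

.. code-block:: python

    import numpy as np

    def project_bellman_update(rewards, next_dist, v_min, v_max, gamma, n):
        """Step 2: project the n-step Bellman update onto the fixed atom support."""
        num_atoms = len(next_dist)
        z = np.linspace(v_min, v_max, num_atoms)   # atom locations z_j
        delta_z = (v_max - v_min) / (num_atoms - 1)
        # n-step Bellman update applied to every atom:
        # T z_j = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n z_j
        n_step_return = sum(gamma ** k * rewards[k] for k in range(n))
        tz = np.clip(n_step_return + gamma ** n * z, v_min, v_max)
        # split each atom's probability mass between its two nearest support atoms
        b = (tz - v_min) / delta_z                 # fractional index of T z_j
        lower = np.floor(b).astype(int)
        upper = np.ceil(b).astype(int)
        projected = np.zeros(num_atoms)
        np.add.at(projected, lower, next_dist * (upper - b))
        np.add.at(projected, upper, next_dist * (b - lower))
        exact = lower == upper                     # T z_j landed exactly on an atom
        np.add.at(projected, lower[exact], next_dist[exact])
        return projected

    def cross_entropy(pred_dist, target_dist, eps=1e-8):
        # Step 3: cross entropy between the projected target distribution and
        # the predicted distribution of the action that was actually taken.
        return -np.sum(target_dist * np.log(pred_dist + eps))

    def kl_priority(pred_dist, target_dist, eps=1e-8):
        # Step 5: KL(target || prediction) serves as the transition's new
        # priority in the prioritized replay buffer.
        return np.sum(target_dist * (np.log(target_dist + eps)
                                     - np.log(pred_dist + eps)))

    # Toy usage with made-up numbers: a 3-step transition and 51 atoms.
    rng = np.random.default_rng(0)
    rewards = [1.0, 0.0, 0.5]
    next_dist = rng.dirichlet(np.ones(51))         # p_j(s_{t+n}, pi(s_{t+n}))
    target = project_bellman_update(rewards, next_dist,
                                    v_min=-10.0, v_max=10.0, gamma=0.99, n=3)
    assert np.isclose(target.sum(), 1.0)           # still a valid distribution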
.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters
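
As a usage note, a preset would typically instantiate the agent's parameter container and tune fields on
its ``algorithm`` attribute. The sketch below follows Coach's usual ``<Agent>Parameters`` convention; the
``RainbowDQNAgentParameters`` name and the tuned field are assumptions, so consult the class reference
above for the actual fields.

.. code-block:: python

    # Assumed preset-style usage; the class documented above is expected to be
    # the `algorithm` member of the agent's parameter container.
    from rl_coach.agents.rainbow_dqn_agent import RainbowDQNAgentParameters

    agent_params = RainbowDQNAgentParameters()
    agent_params.algorithm.discount = 0.99  # assumed: standard discount field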