coach/docs_raw/source/components/agents/value_optimization/rainbow.rst at 19ad2d60a7022bb5125855c029f27d86aaa46d64

gryf/coach

mirror of https://github.com/gryf/coach.git synced 2025-12-18 19:50:17 +01:00

Files

Itai Caspi 6d40ad1650 update of api docstrings across coach and tutorials [WIP] (#91 )

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation

2018-11-15 15:00:13 +02:00

1.8 KiB

Raw Blame History

Actions space: Discrete

References: Rainbow: Combining Improvements in Deep Reinforcement Learning

Network Structure

Algorithm Description

Rainbow combines 6 recent advancements in reinforcement learning:

N-step returns
Distributional state-action value learning
Dueling networks
Noisy Networks
Double DQN
Prioritized Experience Replay

Training the network

Sample a batch of transitions from the replay buffer.
The Bellman update is projected to the set of atoms representing the Q values distribution, such that the i − th component of the projected update is calculated as follows:

(ΦT̂Z_θ(s_t, a_t))_i = ∑^N − 1_j = 0[1 − (|[T̂_{z_j}]^V_MAX_{V_MIN} − z_i|)/(Δz)]¹₀ p_j(s_t + 1, π(s_t + 1))

where: * [⋅] bounds its argument in the range [a, b] * T̂_{z_j} is the Bellman update for atom z_j: T̂_{z_j} : = r_t + γr_t + 1 + ... + γr_{t + n − 1} + γ^n − 1z_j
Network is trained with the cross entropy loss between the resulting probability distribution and the target probability distribution. Only the target of the actions that were actually taken is updated.
Once in every few thousand steps, weights are copied from the online network to the target network.
After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer using the KL divergence loss that is returned from the network.

System Message: ERROR/3 (<string>, line 51)

Unknown directive type "autoclass".

.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters

1.8 KiB Raw Blame History Unescape Escape

Network Structure

Algorithm Description

Training the network

1.8 KiB

Raw Blame History