gryf/coach

mirror of https://github.com/gryf/coach.git synced 2025-12-17 11:10:20 +01:00

Itai Caspi 00fca9b6e0 updated the paper links in the docs and restyled the theme

2017-10-19 17:16:12 +03:00

Distributional DQN

Action space: Discrete

References: A Distributional Perspective on Reinforcement Learning

Network Structure

Algorithmic Description

Training the network

Sample a batch of transitions from the replay buffer.
The Bellman update is projected onto the set of atoms representing the Q-value distribution, such that the i-th component of the projected update is calculated as follows:
$$(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))$$
where:
* $[ \cdot ]^b_a$ bounds its argument in the range $[a, b]$
* $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: $\hat{T}_{z_{j}} := r+\gamma z_j$
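The projection can be sketched in NumPy as follows. This is a minimal illustration, not Coach's implementation. It assumes the atoms are evenly spaced between V_MIN and V_MAX, so that \Delta z = (V_MAX - V_MIN) / (N - 1), and it ignores terminal transitions, which the formula above does not cover. The function name `project_distribution` and its parameters are hypothetical.

```python
import numpy as np

def project_distribution(rewards, next_probs, gamma, v_min, v_max):
    """Project the per-atom Bellman update r + gamma * z_j onto the fixed atoms z_i.

    rewards:    shape (batch,)
    next_probs: shape (batch, n_atoms), i.e. p_j(s_{t+1}, pi(s_{t+1}))
    returns:    shape (batch, n_atoms), the projected target distribution
    """
    n_atoms = next_probs.shape[1]
    # Assumption: evenly spaced atoms z_0 .. z_{N-1}
    z = np.linspace(v_min, v_max, n_atoms)
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Bellman update for every atom, bounded to [v_min, v_max]: (batch, n_atoms)
    tz = np.clip(rewards[:, None] + gamma * z[None, :], v_min, v_max)

    # weights[b, i, j] = [1 - |tz[b, j] - z_i| / delta_z] bounded to [0, 1]
    distance = np.abs(tz[:, None, :] - z[None, :, None])
    weights = np.clip(1.0 - distance / delta_z, 0.0, 1.0)

    # Sum over j of weights[b, i, j] * p_j
    return np.einsum("bij,bj->bi", weights, next_probs)
```

With atoms at 0, 1 and 2, a reward of 0.5 and \gamma = 0.5, the updated atoms land at 0.5, 1.0 and 1.5, so each one splits its probability mass between its two nearest fixed atoms and the result still sums to 1.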
The network is trained with the cross-entropy loss between its predicted probability distribution and the projected target distribution. Only the targets of the actions that were actually taken are updated.
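As a rough sketch of the loss, assuming the network outputs one probability vector over the atoms for every action, the distribution of the taken action can be selected and compared with the target like this. The helper names, the array layout and the `eps` stabilizer are illustrative assumptions, not Coach's API.

```python
import numpy as np

def select_taken_actions(all_action_probs, actions):
    """Pick the predicted distribution of the action taken in each transition.

    all_action_probs: shape (batch, n_actions, n_atoms)
    actions:          shape (batch,), integer action indices
    returns:          shape (batch, n_atoms)
    """
    return all_action_probs[np.arange(len(actions)), actions]

def cross_entropy_loss(target_probs, predicted_probs, eps=1e-8):
    """Batch mean of -sum_i target_i * log(predicted_i)."""
    # eps (an assumption) guards against log(0) for atoms with zero probability
    per_sample = -np.sum(target_probs * np.log(predicted_probs + eps), axis=1)
    return float(np.mean(per_sample))
```

Because only the taken action's distribution enters the loss, the outputs for the other actions receive no gradient from that transition.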
Once every few thousand steps, the weights are copied from the online network to the target network.
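The periodic copy might look like the sketch below, which assumes each network's weights are held in a dictionary of NumPy arrays keyed by layer name. The function `maybe_sync_target` and the default interval `sync_every=2000` are hypothetical placeholders. The text above only says "every few thousand steps".

```python
import numpy as np

def maybe_sync_target(step, online_weights, target_weights, sync_every=2000):
    """Copy the online weights into the target network every `sync_every` steps.

    Returns True when a copy was performed on this step.
    """
    if step % sync_every != 0:
        return False
    for name, value in online_weights.items():
        # Copy so that later online updates do not leak into the target network
        target_weights[name] = value.copy()
    return True
```

Between copies the target network stays fixed, so the distribution p_j(s_{t+1}, \pi(s_{t+1})) used in the projection does not shift with every gradient step.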