Categorical DQN
Action space: Discrete
References: A Distributional Perspective on Reinforcement Learning (Bellemare et al., 2017, https://arxiv.org/abs/1707.06887)
Network Structure
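The network follows the usual DQN feature extractor, but its head outputs, for each action, a probability distribution over a fixed set of N atoms rather than a single Q-value. A minimal PyTorch-style sketch of such a head (layer sizes are illustrative assumptions; the 51-atom choice follows the paper's C51 configuration):

```python
import torch
import torch.nn as nn

class CategoricalDQNHead(nn.Module):
    """Maps a feature vector to a probability distribution over n_atoms
    per action. A sketch, not Coach's actual implementation; feature_dim
    and n_actions are illustrative."""
    def __init__(self, feature_dim=512, n_actions=4, n_atoms=51):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        self.fc = nn.Linear(feature_dim, n_actions * n_atoms)

    def forward(self, features):
        logits = self.fc(features).view(-1, self.n_actions, self.n_atoms)
        # Softmax over atoms yields a valid probability distribution per action
        return torch.softmax(logits, dim=-1)
```

For action selection, Q-values are recovered from the distribution as $Q(s, a) = \sum_i z_i \, p_i(s, a)$, where $z_i$ are the atoms of the fixed support.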
Algorithm Description
Training the network
- Sample a batch of transitions from the replay buffer.
- The Bellman update is projected onto the set of atoms representing the value distribution, such that the $i$-th component of the projected update is calculated as follows (see the NumPy sketch after this list):

  $$\left(\Phi \hat{T} Z_\theta(s_t, a_t)\right)_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\left| \left[\hat{T} z_j\right]_{V_{min}}^{V_{max}} - z_i \right|}{\Delta z} \right]_0^1 p_j\big(s_{t+1}, \pi(s_{t+1})\big)$$

  where:

  - $[\cdot]_a^b$ bounds its argument in the range $[a, b]$
  - $\hat{T} z_j$ is the Bellman update for atom $z_j$: $\hat{T} z_j := r + \gamma z_j$
- The network is trained with the cross-entropy loss between the resulting projected distribution and the online network's predicted distribution (a sketch of this loss also follows the list). Only the targets of the actions that were actually taken are updated.
- Once every few thousand steps, the weights are copied from the online network to the target network.
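To make the projection step concrete, here is a hedged NumPy sketch of the update above. The function name and the done-mask handling are illustrative additions (terminal handling follows Algorithm 1 of the paper, which the equation above omits); `v_min`, `v_max`, and the 51-atom support are the paper's Atari defaults. `next_probs` is the target network's distribution for the next-state action selected by maximizing the expected value:

```python
import numpy as np

def project_bellman_update(rewards, dones, next_probs, gamma=0.99,
                           v_min=-10.0, v_max=10.0, n_atoms=51):
    """Project the Bellman update T z_j = r + gamma * z_j onto the fixed
    support {z_i}, distributing each atom's probability mass between the
    two nearest atoms of the support."""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    z = np.linspace(v_min, v_max, n_atoms)          # the atom support z_i
    # Bellman update per atom; terminal transitions collapse to the reward
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :]
    tz = np.clip(tz, v_min, v_max)                  # [.]_{V_min}^{V_max}
    b = (tz - v_min) / delta_z                      # fractional atom index
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    target = np.zeros_like(next_probs)
    for i in range(next_probs.shape[0]):
        for j in range(n_atoms):
            if lo[i, j] == hi[i, j]:                # T z_j landed exactly on an atom
                target[i, lo[i, j]] += next_probs[i, j]
            else:                                   # split mass between neighbours
                target[i, lo[i, j]] += next_probs[i, j] * (hi[i, j] - b[i, j])
                target[i, hi[i, j]] += next_probs[i, j] * (b[i, j] - lo[i, j])
    return target
```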
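The training loss is then the cross-entropy between this projected target and the predicted distribution for the taken actions. A small sketch (the function name is illustrative, not Coach's API; no gradient flows through the target):

```python
import numpy as np

def categorical_loss(pred_probs, target_probs, eps=1e-8):
    """Cross-entropy between the projected target distribution and the
    online network's predicted distribution for the taken actions."""
    return -np.mean(np.sum(target_probs * np.log(pred_probs + eps), axis=-1))
```

The periodic target-network update is a plain parameter copy (e.g. `target.load_state_dict(online.state_dict())` in PyTorch), performed every few thousand training steps as described above.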