N-Step Q Learning
Action space: Discrete
References: Asynchronous Methods for Deep Reinforcement Learning
Network Structure
Algorithmic Description
Training the network
The $N$-step Q learning algorithm works in a similar manner to DQN, except for the following changes:
- No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every $N$ steps using the latest $N$ steps played by the agent.
- In order to stabilize the learning, multiple workers update the network in parallel. This has a similar effect to decorrelating the samples used for training.
- Instead of using single-step Q targets for the network, the rewards from $N$ consecutive steps are accumulated to form the $N$-step Q targets, according to the following equation (a short sketch of this computation follows the list):
$$ R(s_t, a_t) = \sum_{i=t}^{t + k - 1} \gamma^{i-t} r_i + \gamma^{k} V(s_{t+k}) $$

where $k$ is $T_{max} - State\_Index$ for each state in the batch.
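
The following is a minimal Python sketch of how these $N$-step targets could be accumulated for a single worker's rollout. It is not Coach's actual implementation; the function name, arguments, and the backward-accumulation loop are illustrative assumptions.

```python
import numpy as np


def n_step_q_targets(rewards, bootstrap_value, terminated, gamma=0.99):
    """Hypothetical helper: compute N-step Q targets for one rollout.

    rewards         -- list of rewards [r_t, ..., r_{t+N-1}] from the rollout
    bootstrap_value -- V(s_{t+N}) predicted by the network
    terminated      -- True if the rollout ended in a terminal state
    gamma           -- discount factor
    """
    targets = np.zeros(len(rewards))
    # No value to bootstrap from when the episode has terminated.
    running_return = 0.0 if terminated else bootstrap_value
    # Accumulate backwards: each state gets the discounted sum of the rewards
    # that follow it plus the discounted bootstrap value, matching the
    # equation above with k = T_max - State_Index.
    for i in reversed(range(len(rewards))):
        running_return = rewards[i] + gamma * running_return
        targets[i] = running_return
    return targets
```

For example, a 3-step rollout with rewards `[1.0, 0.0, 2.0]`, `V(s_{t+3}) = 0.5`, and `gamma = 0.9` yields targets `[2.9845, 2.205, 2.45]`: the most recent state gets a one-step target, while the oldest state gets the full three-step return.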
