Actions space: Discrete | Continuous
References: Asynchronous Methods for Deep Reinforcement Learning
Network Structure
Algorithm Description
Choosing an action - Discrete actions
The policy network is used to predict action probabilities. During training, an action is sampled from the categorical distribution defined by these probabilities. During testing, the action with the highest probability is selected.
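A minimal sketch of this selection rule, assuming the policy's output probabilities are already available as a numpy array (the function and argument names are illustrative, not Coach's API):

```python
import numpy as np

def choose_action(policy_probabilities: np.ndarray, is_training: bool) -> int:
    """Select a discrete action from the policy network's output probabilities."""
    if is_training:
        # Sample from the categorical distribution defined by the probabilities
        return int(np.random.choice(len(policy_probabilities), p=policy_probabilities))
    # At test time, act greedily: pick the most probable action
    return int(np.argmax(policy_probabilities))
```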
Training the network
A batch of T_max transitions is used, and the advantages are calculated over it.
Advantages can be calculated by either of the following methods (configured by the selected preset); a sketch of both estimators follows the list:
A_VALUE - Estimating the advantage directly: $A(s_t, a_t) = \underbrace{\sum_{i=t}^{t+k-1} \gamma^{i-t} r_i + \gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)$, where $k$ is $T_{max} - \text{State\_Index}$ for each state in the batch.
GAE - By following the Generalized Advantage Estimation paper.
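A sketch of the two advantage estimators described above, assuming numpy arrays of per-step rewards and state values of length T_max plus a bootstrap value $V(s_{t+k})$ for the state following the batch; variable and function names are illustrative, not taken from Coach's code:

```python
import numpy as np

def a_value_advantages(rewards, values, bootstrap_value, discount=0.99):
    """A_VALUE: A(s_t) = sum_{i=t}^{t+k-1} gamma^{i-t} r_i + gamma^k V(s_{t+k}) - V(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    for t in range(T):
        k = T - t  # k = T_max - state index
        discounted_return = sum(discount ** (i - t) * rewards[i] for i in range(t, T))
        q_estimate = discounted_return + discount ** k * bootstrap_value
        advantages[t] = q_estimate - values[t]
    return advantages

def gae_advantages(rewards, values, bootstrap_value, discount=0.99, lam=0.95):
    """GAE: backward recursion over TD residuals, following Schulman et al."""
    values = np.append(values, bootstrap_value)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * lam * gae
        advantages[t] = gae
    return advantages
```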
The advantages are then used to accumulate gradients according to the policy loss $L = -\mathbb{E}[\log(\pi) \cdot A]$.
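A minimal sketch of this loss, assuming per-step log-probabilities of the taken actions and advantages that are treated as constants (names are illustrative):

```python
import numpy as np

def policy_loss(log_probs: np.ndarray, advantages: np.ndarray) -> float:
    """L = -E[log(pi) * A], averaged over the batch of transitions."""
    # Advantages act as fixed weights here; no gradient flows through them
    return float(-np.mean(log_probs * advantages))
```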