# Actor-Critic

**Action space:** Discrete | Continuous

**References:** [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)

## Network Structure
## Algorithm Description

### Choosing an action - Discrete actions
The policy network is used to predict action probabilities. During training, an action is sampled from the categorical distribution defined by these probabilities. During testing, the action with the highest probability is chosen.
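As an illustration, here is a minimal NumPy sketch of this selection rule. The function name `choose_action` and its arguments are hypothetical, not part of Coach's API; it only assumes the policy network's softmax output is available as a probability vector.

```python
import numpy as np

def choose_action(action_probabilities: np.ndarray, is_training: bool) -> int:
    """Pick a discrete action given the policy's softmax output (hypothetical helper)."""
    if is_training:
        # Sample from the categorical distribution defined by the probabilities.
        return int(np.random.choice(len(action_probabilities), p=action_probabilities))
    # During testing, act greedily: take the most probable action.
    return int(np.argmax(action_probabilities))

# Example with illustrative probabilities over 3 actions:
probs = np.array([0.2, 0.5, 0.3])
print(choose_action(probs, is_training=True))   # stochastic sample
print(choose_action(probs, is_training=False))  # always action 1
```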
### Training the network
A batch of $T_{max}$ transitions is used, and the advantages are calculated over it.

Advantages can be calculated by either of the following methods (configured by the selected preset); a short sketch of both estimators follows the list:

- **A_VALUE** - Estimating the advantage directly:
  $$ A(s_t, a_t) = \underbrace{\sum_{i=t}^{t + k - 1} \gamma^{i-t} r_i + \gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t) $$
  where $k$ is $T_{max} - State\_Index$ for each state in the batch.
- **GAE** - By following the Generalized Advantage Estimation paper.
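The sketch below shows both estimators in plain NumPy, under the assumption that we have the batch rewards $r_0 \dots r_{T_{max}-1}$, the critic's value predictions for each state, and a bootstrap value $V(s_{T_{max}})$ for the state following the batch. Function and argument names are illustrative, not Coach's implementation, and terminal-state handling is omitted for brevity.

```python
import numpy as np

def a_value_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A_VALUE: k-step return minus baseline, with k = T_max - state_index."""
    t_max = len(rewards)
    advantages = np.zeros(t_max)
    discounted_return = bootstrap_value  # V(s_{T_max})
    # Walk backwards so each step reuses the return computed for the next step.
    for t in reversed(range(t_max)):
        discounted_return = rewards[t] + gamma * discounted_return
        advantages[t] = discounted_return - values[t]
    return advantages

def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """GAE: exponentially weighted sum of TD errors (Schulman et al., 2015)."""
    t_max = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    advantages = np.zeros(t_max)
    gae = 0.0
    for t in reversed(range(t_max)):
        # One-step TD error, then discounted accumulation with decay gamma * lambda.
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```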
The advantages are then used to accumulate gradients according to
$$ L = -\mathop{\mathbb{E}} [\log (\pi) \cdot A] $$
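For reference, a minimal sketch of evaluating this loss over a batch is shown below. It assumes the advantages are treated as constants (no gradient flows through them) and that the probabilities of the taken actions come from the policy network's softmax output; the names are illustrative, and a real implementation would compute this with differentiable ops in the underlying framework.

```python
import numpy as np

def policy_gradient_loss(action_probabilities, actions, advantages):
    """L = -E[log(pi(a_t|s_t)) * A(s_t, a_t)] over the batch (illustrative only).

    action_probabilities: (T_max, num_actions) softmax outputs of the policy network
    actions:              (T_max,) indices of the actions actually taken
    advantages:           (T_max,) advantage estimates, treated as constants
    """
    taken = action_probabilities[np.arange(len(actions)), actions]
    log_pi = np.log(taken + 1e-10)  # small epsilon for numerical stability
    return -np.mean(log_pi * advantages)
```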
