
**Action space:** Discrete | Continuous

**Paper**

## Network Structure

## Algorithmic Description

### Choosing an action - Discrete actions

The policy network is used to predict action probabilities. During training, the action is sampled from the categorical distribution defined by these probabilities. During evaluation, the action with the highest probability is chosen.
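
As a rough sketch of this rule (illustrative NumPy code, not Coach's implementation; `action_probabilities` is assumed to be the softmax output of the policy network for a single state):

```python
import numpy as np

def choose_action(action_probabilities, training=True):
    """Pick an action from the policy network's softmax output.

    action_probabilities: 1-D array of per-action probabilities (sums to 1).
    """
    if training:
        # Training: sample from the categorical distribution
        # defined by the predicted probabilities.
        return int(np.random.choice(len(action_probabilities), p=action_probabilities))
    # Evaluation: act greedily on the most probable action.
    return int(np.argmax(action_probabilities))
```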

### Training the network

A batch of $T_{max}$ transitions is used, and the advantages are calculated over it.

Advantages can be calculated by either of the following methods (configured by the selected preset), as sketched in the code after this list:

  1. A_VALUE - Estimating the advantage directly:
     $$ A(s_t, a_t) = \underbrace{\sum_{i=t}^{t + k - 1} \gamma^{i-t} r_i + \gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t) $$
     where $k$ is $T_{max} - State\_Index$ for each transition in the batch.
  2. GAE - By following the Generalized Advantage Estimation paper.
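
Below is a minimal NumPy sketch of both options, assuming no episode terminates inside the batch; `rewards` and `values` are the per-transition rewards and value estimates over the $T_{max}$ transitions, `bootstrap_value` stands for $V(s_{t+k})$ at the end of the batch, and `gamma` / `lam` are the discount and GAE coefficients. This is illustrative, not Coach's actual code.

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A_VALUE: A(s_t) = sum_{i=t}^{t+k-1} gamma^(i-t) r_i + gamma^k V(s_{t+k}) - V(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    ret = bootstrap_value
    for t in reversed(range(T)):
        ret = rewards[t] + gamma * ret    # discounted k-step return, k = T - t
        advantages[t] = ret - values[t]   # subtract the state-value baseline
    return advantages

def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """GAE: exponentially weighted sum of TD residuals (Generalized Advantage Estimation)."""
    T = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```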

The advantages are then used to accumulate gradients according to

$$ L = -\mathop{\mathbb{E}} [\log (\pi) \cdot A] $$
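
As an illustration (a TensorFlow-style sketch, not Coach's actual code), the batch loss could be written as follows; `action_log_probs` is assumed to hold $\log \pi(a_t|s_t)$ for the actions taken, and the advantages are treated as constants so that gradients flow only through the policy:

```python
import tensorflow as tf

def actor_loss(action_log_probs, advantages):
    # L = -E[log(pi(a_t|s_t)) * A(s_t, a_t)], averaged over the batch.
    # stop_gradient keeps the advantage estimates fixed with respect to the
    # policy parameters; only log(pi) contributes gradients.
    return -tf.reduce_mean(action_log_probs * tf.stop_gradient(advantages))
```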