Actions space: Discrete | Continuous
References: Asynchronous Methods for Deep Reinforcement Learning
Network Structure
Algorithm Description
Choosing an action - Discrete actions
The policy network is used to predict action probabilities. During training, an action is sampled from the categorical distribution defined by these probabilities. During testing, the action with the highest probability is selected.
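A minimal sketch of this selection rule, assuming the policy's output probabilities are already available as a numpy array (the function and argument names are illustrative, not Coach's API):

```python
import numpy as np

def choose_action(policy_probabilities: np.ndarray, is_training: bool) -> int:
    """Select a discrete action from the policy network's output probabilities."""
    if is_training:
        # Sample from the categorical distribution defined by the probabilities
        return int(np.random.choice(len(policy_probabilities), p=policy_probabilities))
    # At test time, act greedily: pick the most probable action
    return int(np.argmax(policy_probabilities))
```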
Training the network
A batch of T_max transitions is used, and the advantages are calculated over it.
Advantages can be calculated by either of the following methods (configured by the selected preset); a sketch of both estimators follows the list:
A_VALUE - Estimating the advantage directly: $A(s_t, a_t) = \underbrace{\sum_{i=t}^{t+k-1} \gamma^{i-t} r_i + \gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)$, where $k$ is $T_{max} - \text{State\_Index}$ for each state in the batch.
GAE - By following the Generalized Advantage Estimation paper.
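A sketch of the two advantage estimators described above, assuming numpy arrays of per-step rewards and state values of length T_max plus a bootstrap value $V(s_{t+k})$ for the state following the batch; variable and function names are illustrative, not taken from Coach's code:

```python
import numpy as np

def a_value_advantages(rewards, values, bootstrap_value, discount=0.99):
    """A_VALUE: A(s_t) = sum_{i=t}^{t+k-1} gamma^{i-t} r_i + gamma^k V(s_{t+k}) - V(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    for t in range(T):
        k = T - t  # k = T_max - state index
        discounted_return = sum(discount ** (i - t) * rewards[i] for i in range(t, T))
        q_estimate = discounted_return + discount ** k * bootstrap_value
        advantages[t] = q_estimate - values[t]
    return advantages

def gae_advantages(rewards, values, bootstrap_value, discount=0.99, lam=0.95):
    """GAE: backward recursion over TD residuals, following Schulman et al."""
    values = np.append(values, bootstrap_value)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * lam * gae
        advantages[t] = gae
    return advantages
```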
The advantages are then used to accumulate gradients according to the policy loss $L = -\mathbb{E}[\log(\pi) \cdot A]$.
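A minimal sketch of this loss, assuming per-step log-probabilities of the taken actions and advantages that are treated as constants (names are illustrative):

```python
import numpy as np

def policy_loss(log_probs: np.ndarray, advantages: np.ndarray) -> float:
    """L = -E[log(pi) * A], averaged over the batch of transitions."""
    # Advantages act as fixed weights here; no gradient flows through them
    return float(-np.mean(log_probs * advantages))
```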