Soft Actor Critic
Action space: Continuous
References: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (https://arxiv.org/abs/1801.01290)
Network Structure
Algorithm Description
Choosing an action - Continuous actions
The policy network predicts a mean and log standard deviation for each action dimension. During training, the action is sampled from a Gaussian distribution parameterized by these mean and std values. During testing, the agent can either act deterministically by picking the mean value, or sample from the Gaussian as in training.
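A minimal PyTorch-style sketch of this action-selection rule (not Coach's actual implementation; the `policy_net` call returning `(mean, log_std)` and the final tanh squashing used by most SAC implementations are assumptions here):

```python
import torch


def choose_action(policy_net, state, deterministic=False):
    """Select an action from a Gaussian policy head (sketch)."""
    mean, log_std = policy_net(state)      # assumed to return per-action mean and log std
    if deterministic:
        raw_action = mean                  # test-time option: take the mean
    else:
        std = log_std.exp()
        raw_action = torch.distributions.Normal(mean, std).sample()
    # Most SAC implementations squash the action into [-1, 1] with tanh.
    return torch.tanh(raw_action)
```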
Training the network
Start by sampling a batch B of transitions from the experience replay.
To train the Q networks, use the following targets:

$$y_t^Q = r(s_t, a_t) + \gamma \cdot V(s_{t+1})$$

The state value used in the above target is obtained by running the target state value network.
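As a sketch (PyTorch-style, not Coach's code), the target could be computed as below; the batch tensor shapes and the `(1 - done)` masking of terminal transitions are assumptions:

```python
import torch


def q_targets(rewards, next_states, dones, target_v_net, gamma=0.99):
    """y_t^Q = r(s_t, a_t) + gamma * V(s_{t+1}), with V from the target V network."""
    with torch.no_grad():
        v_next = target_v_net(next_states).squeeze(-1)   # V(s_{t+1}) from the *target* network
        return rewards + gamma * (1.0 - dones) * v_next  # terminal states contribute only the reward
```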
To train the State Value network, use the following targets:

$$y_t^V = \min_{i=1,2} Q_i(s_t, \tilde{a}) - \log \pi(\tilde{a} \mid s_t), \qquad \tilde{a} \sim \pi(\cdot \mid s_t)$$

The state value network is trained using a sample-based approximation of the connection between the state value and the state-action values. The actions used for constructing the target are not sampled from the replay buffer, but rather from the current policy.
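A sketch of this target, assuming a `policy.sample` method that returns a fresh action together with its log-probability (assumed API, not Coach's):

```python
import torch


def v_targets(states, policy, q1_net, q2_net):
    """y_t^V = min_i Q_i(s_t, a~) - log pi(a~ | s_t), with a~ sampled from the current policy."""
    with torch.no_grad():
        actions, log_probs = policy.sample(states)   # fresh samples, not replay-buffer actions
        q_min = torch.min(q1_net(states, actions), q2_net(states, actions)).squeeze(-1)
        return q_min - log_probs
```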
To train the actor network, use the following equation:

$$\nabla_\theta J \approx \nabla_\theta \frac{1}{|B|} \sum_{s_t \in B} \Big( Q(s_t, \tilde{a}_\theta(s_t)) - \log \pi_\theta(\tilde{a}_\theta(s_t) \mid s_t) \Big), \qquad \tilde{a}_\theta(s_t) \sim \pi_\theta(\cdot \mid s_t)$$

Here $\tilde{a}_\theta(s_t)$ is sampled from the current policy using the reparameterization trick, so the gradient can flow through the sampled action.
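Framed as a loss to minimize, a sketch might look like the following; `policy.rsample` is assumed to return a reparameterized sample with its log-probability, and the function names are illustrative:

```python
import torch


def policy_loss(states, policy, q_net):
    """The gradient-ascent objective above, written as a loss to minimize."""
    # rsample: reparameterized sample, so the gradient flows through a~_theta(s_t).
    actions, log_probs = policy.rsample(states)
    q = q_net(states, actions).squeeze(-1)   # some implementations use min(Q1, Q2) here
    return (log_probs - q).mean()
```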
After every training step, do a soft update of the V target network's weights from the online V network.
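A common way to implement the soft (Polyak) update, sketched with an assumed averaging coefficient `tau`:

```python
import torch


def soft_update(target_v_net, online_v_net, tau=0.005):
    """target <- (1 - tau) * target + tau * online, parameter by parameter."""
    with torch.no_grad():
        for target_p, online_p in zip(target_v_net.parameters(), online_v_net.parameters()):
            target_p.mul_(1.0 - tau).add_(tau * online_p)
```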