Action space: Continuous

References: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (Haarnoja et al., 2018)

Network Structure

[Figure: SAC network structure (/_static/img/design_imgs/sac.png)]

Algorithm Description

Choosing an action - Continuous actions

The policy network is used to predict a mean and a log standard deviation for each action dimension. During training, an action is sampled from a Gaussian distribution with these mean and std values. At test time, the agent can either act deterministically by taking the mean, or sample from the Gaussian distribution as it does during training.
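
As a rough illustration of this scheme, here is a minimal NumPy sketch (not Coach's actual implementation; choose_action and its inputs are placeholder names). It samples from the predicted Gaussian during training and takes the mean at evaluation time:

    import numpy as np

    def choose_action(policy_mean, policy_log_std, deterministic=False):
        """Pick an action from the Gaussian parameterized by the policy outputs.

        policy_mean / policy_log_std: per-action-dimension outputs of the
        policy network for the current state (placeholder inputs here).
        """
        if deterministic:
            # Evaluation: act with the distribution's mean.
            return policy_mean
        # Training: sample from N(mean, std) independently per action dimension.
        std = np.exp(policy_log_std)
        return policy_mean + std * np.random.randn(*policy_mean.shape)

    # Example with a 2-dimensional continuous action space
    mean = np.array([0.1, -0.3])
    log_std = np.array([-1.0, -0.5])
    print(choose_action(mean, log_std))                       # stochastic (training)
    print(choose_action(mean, log_std, deterministic=True))   # deterministic (evaluation)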

Training the network

Start by sampling a batch B of transitions from the experience replay buffer. The targets and losses listed below are also summarized in a short code sketch at the end of this section.

  • To train the Q network, use the following targets:

    y_t^Q = r(s_t, a_t) + \gamma V(s_{t+1})

    The state value used in the above target is acquired by running the target state value network.

  • To train the State Value network, use the following targets:

    y_t^V = \min_{i=1,2} Q_i(s_t, \tilde{a}) - \log \pi(\tilde{a} | s_t), \qquad \tilde{a} \sim \pi(\cdot | s_t)

    The state value network is trained using a sample-based approximation of the relation between the state value and the state-action values. The actions used for constructing this target are not taken from the replay buffer, but are sampled from the current policy.

  • To train the actor network, use the following equation:

    \nabla_\theta J \approx \nabla_\theta \frac{1}{|B|} \sum_{s_t \in B} \left( Q(s_t, \tilde{a}_\theta(s_t)) - \log \pi_\theta(\tilde{a}_\theta(s_t) | s_t) \right), \qquad \tilde{a} \sim \pi(\cdot | s_t)

After every training step, do a soft update of the V target network's weights towards the online V network's weights.
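
The following NumPy-style sketch ties the targets above into one training step. It is illustrative only, not Coach's implementation: q1, q2, v_target and policy are hypothetical callables standing in for the two online Q networks, the target V network and the policy network; the entropy temperature and tanh squashing are omitted, and the gradient/optimizer steps are only indicated in comments.

    import numpy as np

    def sac_targets(batch, q1, q2, v_target, policy, gamma=0.99):
        """Compute the SAC targets above for one sampled batch (illustrative only).

        batch: dict of arrays with keys 's', 'a', 'r', 's_next'
        q1, q2, v_target, policy: hypothetical callables standing in for the
        two online Q networks, the target V network and the policy network.
        """
        s, a, r, s_next = batch['s'], batch['a'], batch['r'], batch['s_next']

        # Q-network target: y^Q = r(s, a) + gamma * V_target(s')
        y_q = r + gamma * v_target(s_next)

        # V-network target: y^V = min_i Q_i(s, a~) - log pi(a~ | s),  a~ ~ pi(.|s)
        a_tilde, log_pi = policy.sample(s)   # fresh actions from the current policy
        y_v = np.minimum(q1(s, a_tilde), q2(s, a_tilde)) - log_pi

        # Actor objective (to be maximized): mean over the batch of
        # Q(s, a~_theta(s)) - log pi_theta(a~_theta(s) | s)
        actor_objective = np.mean(q1(s, a_tilde) - log_pi)

        # A real implementation would now take a gradient step on the squared
        # error of each Q/V network to its target, and on -actor_objective for
        # the policy; those optimizer calls are omitted here.
        return y_q, y_v, actor_objective

    def soft_update(target_params, online_params, tau=0.005):
        """Soft (Polyak) update of the V target network towards the online V network."""
        return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

The coefficient tau controls how quickly the target network tracks the online network; the value used here is only an illustrative default.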