Soft Actor Critic
Action space: Continuous
References: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (https://arxiv.org/abs/1801.01290)
Network Structure
Algorithm Description
Choosing an action - Continuous actions
The policy network predicts a mean and log standard deviation for each action dimension. During training, the action is sampled from a Gaussian distribution parameterized by these mean and std values. During testing, the agent can either act deterministically by picking the mean value, or sample from the Gaussian as in training.
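A minimal PyTorch-style sketch of this action-selection rule (not Coach's actual implementation; the `policy_net` call returning `(mean, log_std)` and the final tanh squashing used by most SAC implementations are assumptions here):

```python
import torch


def choose_action(policy_net, state, deterministic=False):
    """Select an action from a Gaussian policy head (sketch)."""
    mean, log_std = policy_net(state)      # assumed to return per-action mean and log std
    if deterministic:
        raw_action = mean                  # test-time option: take the mean
    else:
        std = log_std.exp()
        raw_action = torch.distributions.Normal(mean, std).sample()
    # Most SAC implementations squash the action into [-1, 1] with tanh.
    return torch.tanh(raw_action)
```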
Training the network
Start by sampling a batch B of transitions from the experience replay.
To train the Q networks, use the following targets:

$$y_t^Q = r(s_t, a_t) + \gamma \cdot V(s_{t+1})$$

The state value used in the above target is obtained by running the target state value network.
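As a sketch (PyTorch-style, not Coach's code), the target could be computed as below; the batch tensor shapes and the `(1 - done)` masking of terminal transitions are assumptions:

```python
import torch


def q_targets(rewards, next_states, dones, target_v_net, gamma=0.99):
    """y_t^Q = r(s_t, a_t) + gamma * V(s_{t+1}), with V from the target V network."""
    with torch.no_grad():
        v_next = target_v_net(next_states).squeeze(-1)   # V(s_{t+1}) from the *target* network
        return rewards + gamma * (1.0 - dones) * v_next  # terminal states contribute only the reward
```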
To train the State Value network, use the following targets:

$$y_t^V = \min_{i=1,2} Q_i(s_t, \tilde{a}) - \log \pi(\tilde{a} \mid s_t), \qquad \tilde{a} \sim \pi(\cdot \mid s_t)$$

The state value network is trained using a sample-based approximation of the connection between the state value and the state-action values. The actions used for constructing the target are not sampled from the replay buffer, but rather from the current policy.
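A sketch of this target, assuming a `policy.sample` method that returns a fresh action together with its log-probability (assumed API, not Coach's):

```python
import torch


def v_targets(states, policy, q1_net, q2_net):
    """y_t^V = min_i Q_i(s_t, a~) - log pi(a~ | s_t), with a~ sampled from the current policy."""
    with torch.no_grad():
        actions, log_probs = policy.sample(states)   # fresh samples, not replay-buffer actions
        q_min = torch.min(q1_net(states, actions), q2_net(states, actions)).squeeze(-1)
        return q_min - log_probs
```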
To train the actor network, use the following equation:

$$\nabla_\theta J \approx \nabla_\theta \frac{1}{|B|} \sum_{s_t \in B} \Big( Q(s_t, \tilde{a}_\theta(s_t)) - \log \pi_\theta(\tilde{a}_\theta(s_t) \mid s_t) \Big), \qquad \tilde{a}_\theta(s_t) \sim \pi_\theta(\cdot \mid s_t)$$

Here $\tilde{a}_\theta(s_t)$ is sampled from the current policy using the reparameterization trick, so the gradient can flow through the sampled action.
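Framed as a loss to minimize, a sketch might look like the following; `policy.rsample` is assumed to return a reparameterized sample with its log-probability, and the function names are illustrative:

```python
import torch


def policy_loss(states, policy, q_net):
    """The gradient-ascent objective above, written as a loss to minimize."""
    # rsample: reparameterized sample, so the gradient flows through a~_theta(s_t).
    actions, log_probs = policy.rsample(states)
    q = q_net(states, actions).squeeze(-1)   # some implementations use min(Q1, Q2) here
    return (log_probs - q).mean()
```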
After every training step, do a soft update of the V target network's weights from the online V network.
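A common way to implement the soft (Polyak) update, sketched with an assumed averaging coefficient `tau`:

```python
import torch


def soft_update(target_v_net, online_v_net, tau=0.005):
    """target <- (1 - tau) * target + tau * online, parameter by parameter."""
    with torch.no_grad():
        for target_p, online_p in zip(target_v_net.parameters(), online_v_net.parameters()):
            target_p.mul_(1.0 - tau).add_(tau * online_p)
```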