Mirror of https://github.com/gryf/coach.git (synced 2025-12-17 19:20:19 +01:00)
SAC algorithm (#282)
* SAC algorithm
* SAC - updates to the agent (learn_from_batch), sac_head and sac_q_head to fix a problem in the gradient calculation; the SAC agent is now able to train. gym_environment - fixed an error in accessing gym.spaces
* Soft Actor Critic - code cleanup
* code cleanup
* V-head initialization fix
* SAC benchmarks
* SAC documentation
* typo fix
* documentation fixes
* documentation and version update
* README typo
Binary files not shown in this diff: an existing image was updated (Before: 51 KiB, After: 59 KiB) and a new image, docs_raw/source/_static/img/design_imgs/sac.png, was added (109 KiB).
@@ -21,6 +21,7 @@ A detailed description of those algorithms can be found by navigating to each of
imitation/cil
policy_optimization/cppo
policy_optimization/ddpg
policy_optimization/sac
other/dfp
value_optimization/double_dqn
value_optimization/dqn
@@ -38,6 +38,7 @@ Each update perform the following procedure:
.. math:: \text{where} \quad \bar{\rho}_{t} = \min{\left\{c,\rho_t\right\}},\quad \rho_t=\frac{\pi (a_t \mid s_t)}{\mu (a_t \mid s_t)}

3. **Accumulate gradients:**

:math:`\bullet` **Policy gradients (with bias correction):**

.. math:: \hat{g}_t^{policy} & = & \bar{\rho}_{t} \nabla \log \pi (a_t \mid s_t) [Q^{ret}(s_t,a_t) - V(s_t)] \\
@@ -0,0 +1,49 @@
Soft Actor-Critic
=================

**Action space:** Continuous

**References:** `Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor <https://arxiv.org/abs/1801.01290>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/sac.png
   :align: center

Algorithm Description
---------------------

Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++

The policy network predicts a mean and a log standard deviation for each action dimension. During training, an action
is sampled from a Gaussian distribution parameterized by these values. During evaluation, the agent can either act
deterministically by picking the mean, or keep sampling from the Gaussian distribution as it does in training.

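The snippet below is a minimal NumPy sketch of this action-selection rule, not the Coach implementation; the function
and argument names are illustrative, and the actual head may additionally squash the sampled action (e.g. with tanh)
into the valid action range.

.. code-block:: python

   import numpy as np

   def choose_action(mean, log_std, deterministic=False, rng=None):
       """Return an action given the policy head outputs for a single state."""
       if rng is None:
           rng = np.random.default_rng()
       if deterministic:
           return mean                        # evaluation: act with the mean action
       std = np.exp(log_std)                  # training: sample from N(mean, std^2)
       return mean + std * rng.standard_normal(mean.shape)

   # Example: a 2-dimensional continuous action
   action = choose_action(np.array([0.1, -0.3]), np.array([-1.0, -0.5]))
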
Training the network
++++++++++++++++++++

Start by sampling a batch :math:`B` of transitions from the experience replay.

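Purely as an illustration (not Coach's replay-buffer API), uniformly sampling such a batch from a plain Python list of
transitions might look like:

.. code-block:: python

   import random

   def sample_batch(replay_buffer, batch_size=64):
       """Uniformly sample a batch of (s, a, r, s', done) transitions."""
       return random.sample(replay_buffer, k=min(batch_size, len(replay_buffer)))
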
* To train the **Q network**, use the following targets:

.. math:: y_t^Q=r(s_t,a_t)+\gamma \cdot V(s_{t+1})

The state value in this target is produced by the target state value network.

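For illustration only (not the Coach code), the Q targets above can be computed per batch as follows; ``rewards`` and
``next_state_values`` are assumed arrays taken from the batch and from the target state value network.

.. code-block:: python

   import numpy as np

   def q_targets(rewards, next_state_values, dones, discount=0.99):
       """y^Q_t = r(s_t, a_t) + gamma * V_target(s_{t+1}).

       The (1 - dones) mask is a common implementation detail, not spelled out in
       the equation above: it zeroes the bootstrap term for terminal transitions.
       """
       return rewards + discount * (1.0 - dones) * next_state_values

   y_q = q_targets(np.array([1.0, 0.0]), np.array([2.0, 1.5]), np.array([0.0, 1.0]))
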
* To train the **State Value network**, use the following targets:

.. math:: y_t^V = \min_{i=1,2}Q_i(s_t,\tilde{a}) - \log\pi (\tilde{a} \vert s_t),\,\,\,\, \tilde{a} \sim \pi(\cdot \vert s_t)

The state value network is trained with a sample-based approximation of the relation between the state value and the
state-action values. The actions used to construct the target are **not** taken from the replay buffer; they are sampled
from the current policy.

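A minimal NumPy sketch of these targets, assuming ``q1`` and ``q2`` hold the two Q heads evaluated at
:math:`(s_t, \tilde{a})` and ``log_pi`` holds :math:`\log\pi(\tilde{a} \vert s_t)` for actions freshly sampled from the
current policy (the names are illustrative, not Coach's):

.. code-block:: python

   import numpy as np

   def v_targets(q1, q2, log_pi):
       """y^V_t = min_i Q_i(s_t, a~) - log pi(a~ | s_t)."""
       return np.minimum(q1, q2) - log_pi

   y_v = v_targets(np.array([1.2, 0.7]), np.array([1.0, 0.9]), np.array([-0.3, -1.1]))
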
* To train the **actor network**, use the following gradient estimate:

.. math:: \nabla_{\theta} J \approx \nabla_{\theta} \frac{1}{\vert B \vert} \sum_{s_t\in B} \left( Q \left(s_t, \tilde{a}_\theta(s_t)\right) - \log\pi_{\theta}(\tilde{a}_{\theta}(s_t)\vert s_t) \right),\,\,\,\, \tilde{a} \sim \pi(\cdot \vert s_t)

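As a sketch of the objective being maximized (in practice the gradient is taken with automatic differentiation through
the reparameterized action :math:`\tilde{a}_\theta(s_t)`; the names below are illustrative):

.. code-block:: python

   import numpy as np

   def actor_objective(q_values, log_pi):
       """J ~ batch mean of Q(s_t, a~_theta(s_t)) - log pi_theta(a~_theta(s_t) | s_t)."""
       return np.mean(q_values - log_pi)

   j = actor_objective(np.array([1.5, 0.2, 0.9]), np.array([-0.4, -1.2, -0.1]))
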
After every training step, do a soft update of the V target network's weights from the online V network.

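A minimal sketch of such a soft (Polyak) update, where ``tau`` is an assumed name for the soft update rate:

.. code-block:: python

   import numpy as np

   def soft_update(target_weights, online_weights, tau=0.005):
       """target <- (1 - tau) * target + tau * online, applied per weight tensor."""
       return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]

   new_target = soft_update([np.zeros((2, 2))], [np.ones((2, 2))], tau=0.1)
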
.. autoclass:: rl_coach.agents.soft_actor_critic_agent.SoftActorCriticAlgorithmParameters
File diff suppressed because one or more lines are too long
@@ -198,6 +198,14 @@ The algorithms are ordered by their release date in descending order.
            improve stability it also employs bias correction and trust region optimization techniques.
        </span>
    </div>
    <div class="algorithm continuous off-policy" data-year="201808">
        <span class="badge">
            <a href="components/agents/policy_optimization/sac.html">SAC</a>
            <br>
            Soft Actor-Critic is an algorithm which optimizes a stochastic policy in an off-policy way.
            One of the key features of SAC is that it solves a maximum entropy reinforcement learning problem.
        </span>
    </div>
    <div class="algorithm continuous off-policy" data-year="201509">
        <span class="badge">
            <a href="components/agents/policy_optimization/ddpg.html">DDPG</a>