update of api docstrings across coach and tutorials [WIP] (#91)
* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
docs_raw/source/components/agents/policy_optimization/ac.rst (new file, 40 lines)
@@ -0,0 +1,40 @@

Actor-Critic
============

**Actions space:** Discrete | Continuous

**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ac.png
   :width: 500px
   :align: center

Algorithm Description
---------------------

Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++

The policy network is used to predict action probabilities. During training, an action is sampled from a categorical
distribution parameterized by these probabilities. When testing, the action with the highest probability is taken.
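
As a minimal illustration of this rule (a sketch, not Coach's internal implementation; ``probabilities`` is assumed to be the softmax output of the policy head):

.. code-block:: python

   import numpy as np

   def choose_action(probabilities, is_training):
       """Sample from the categorical policy while training; act greedily when testing."""
       if is_training:
           return np.random.choice(len(probabilities), p=probabilities)
       return int(np.argmax(probabilities))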

Training the network
++++++++++++++++++++

A batch of :math:`T_{max}` transitions is used, and the advantages are calculated over it.

Advantages can be calculated by either of the following methods (configured by the selected preset):

1. **A_VALUE** - Estimating the advantage directly:

   :math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`

   where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.

2. **GAE** - Following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.

The advantages are then used to accumulate gradients according to the loss

:math:`L = -\mathop{\mathbb{E}} [\log (\pi) \cdot A]`
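
For illustration, a short NumPy sketch of the **A_VALUE** estimator and the resulting loss (the function and variable names are illustrative, not Coach's API; ``bootstrap_value`` stands for :math:`V(s_{t+k})` of the state that follows the batch):

.. code-block:: python

   import numpy as np

   def a_value_advantages(rewards, values, bootstrap_value, gamma=0.99):
       """A(s_t, a_t) = sum_{i=t}^{t+k-1} gamma^(i-t) * r_i + gamma^k * V(s_{t+k}) - V(s_t)."""
       advantages = np.zeros(len(rewards))
       ret = bootstrap_value                  # V(s_{t+k}), bootstrapped from the last state
       for t in reversed(range(len(rewards))):
           ret = rewards[t] + gamma * ret     # n-step discounted return, i.e. Q(s_t, a_t)
           advantages[t] = ret - values[t]    # subtract the baseline V(s_t)
       return advantages

   def policy_loss(log_probs, advantages):
       """L = -E[log(pi) * A], with the advantages treated as constants."""
       return -np.mean(log_probs * advantages)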

.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters

@@ -0,0 +1,44 @@

Clipped Proximal Policy Optimization
====================================

**Actions space:** Discrete | Continuous

**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ppo.png
   :align: center

Algorithm Description
---------------------

Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++

Same as in PPO.

Training the network
++++++++++++++++++++

Very similar to PPO, with several small (but very simplifying) changes:

1. Train both the value and policy networks simultaneously, by defining a single loss function
   which is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.

2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network, as in PPO).

3. Value targets are now also calculated based on the GAE advantages.
   In this method, the :math:`V` values are predicted by the critic network and then added to the GAE-based advantages
   to get a :math:`Q` value for each action. Since the critic network predicts a :math:`V` value for
   each state, setting the calculated :math:`Q` action-values as targets will, on average, serve as a :math:`V` state-value target.

4. Instead of adapting the KL divergence penalty coefficient used in PPO, the likelihood ratio
   :math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped to achieve a similar effect.
   This is done by defining the policy's loss function as the minimum between the standard surrogate loss and an epsilon-clipped
   surrogate loss (sketched below):

   :math:`L^{CLIP}(\theta)=\mathbb{E}_{t}[\min(r_t(\theta)\cdot \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`
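
A minimal NumPy sketch of this objective (the names are illustrative, and the gradient flow through the log-probabilities is left to the underlying framework):

.. code-block:: python

   import numpy as np

   def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
       """L^CLIP = -E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)]."""
       ratio = np.exp(new_log_probs - old_log_probs)                 # r_t(theta)
       clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
       # the surrogate objective is maximized, so the loss is its negation
       return -np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))

In the same spirit, the value targets of change 3 amount to adding the predicted :math:`V` values to the GAE advantages, i.e. ``value_targets = values + advantages``.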

.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters

@@ -0,0 +1,50 @@

Deep Deterministic Policy Gradient
==================================

**Actions space:** Continuous

**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ddpg.png
   :align: center

Algorithm Description
---------------------

Choosing an action
++++++++++++++++++

Pass the current states through the actor network to get an action mean vector :math:`\mu`. While in the training phase,
use a continuous exploration policy, such as the Ornstein-Uhlenbeck process, to add exploration noise to the action.
When testing, use the mean vector :math:`\mu` as-is.
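
As an illustration, a minimal NumPy sketch of this exploration scheme (``actor``, ``choose_action``, and the noise parameters are illustrative assumptions, not Coach's API):

.. code-block:: python

   import numpy as np

   class OrnsteinUhlenbeckNoise:
       """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)."""

       def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
           self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
           self.state = np.full(action_dim, mu)

       def sample(self):
           self.state += self.theta * (self.mu - self.state) * self.dt \
                         + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
           return self.state

   def choose_action(actor, state, noise, is_training):
       action_mean = actor(state)               # mu(s) predicted by the actor network
       if is_training:
           return action_mean + noise.sample()  # add exploration noise while training
       return action_mean                       # act deterministically when testing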

Training the network
++++++++++++++++++++

Start by sampling a batch of transitions from the experience replay.

* To train the **critic network**, use the following targets:

  :math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`

  First, run the actor target network, using the next states as the inputs, to get :math:`\mu (s_{t+1} )`.
  Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
  calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
  as the inputs, and :math:`y_t` as the targets.

* To train the **actor network**, use the following equation:

  :math:`\nabla_{\theta^\mu } J \approx \mathbb{E}_{s_t \sim \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ]`

  Use the actor's online network to get the action mean values, using the current states as the inputs.
  Then, use the critic's online network to get the gradients of the critic output with respect to the
  action mean values, :math:`\nabla_a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
  Using the chain rule, calculate the gradients of the actor's output with respect to the actor weights,
  given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.

After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
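
A hedged sketch of the critic targets and the soft target update (``target_actor`` and ``target_critic`` are assumed to be callables wrapping the target networks; the actor's gradient step relies on the framework's automatic differentiation and is not shown):

.. code-block:: python

   def critic_targets(rewards, next_states, target_actor, target_critic, gamma=0.99):
       """y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1})), computed with the target networks."""
       next_actions = target_actor(next_states)   # mu(s_{t+1})
       return rewards + gamma * target_critic(next_states, next_actions)

   def soft_update(target_weights, online_weights, tau=0.001):
       """Move each target weight a small step of size tau towards its online counterpart."""
       return [(1.0 - tau) * t + tau * w for t, w in zip(target_weights, online_weights)]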

.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters

@@ -0,0 +1,24 @@

Hierarchical Actor Critic
=========================

**Actions space:** Continuous

**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ddpg.png
   :align: center

Algorithm Description
---------------------

Choosing an action
++++++++++++++++++

Pass the current states through the actor network to get an action mean vector :math:`\mu`. While in the training phase,
use a continuous exploration policy, such as the Ornstein-Uhlenbeck process, to add exploration noise to the action.
When testing, use the mean vector :math:`\mu` as-is.

Training the network
++++++++++++++++++++

docs_raw/source/components/agents/policy_optimization/pg.rst (new file, 39 lines)
@@ -0,0 +1,39 @@

Policy Gradient
===============

**Actions space:** Discrete | Continuous

**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/pg.png
   :align: center

Algorithm Description
---------------------

Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++

Run the current states through the network to get a policy distribution over the actions.
While training, sample an action from the policy distribution. When testing, take the action with the highest probability.

Training the network
++++++++++++++++++++

The policy head loss is defined as :math:`L=-\log (\pi) \cdot PolicyGradientRescaler`.
The :code:`PolicyGradientRescaler` is used to reduce the variance of the policy gradient updates, since noisy gradient
updates might destabilize the policy's convergence. The rescaler is a configurable parameter, and there are a few
options to choose from:

* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - The discounted return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode, normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
  which are calculated separately for each timestep, across different episodes.

Gradients are accumulated over a number of fully played episodes. Accumulating gradients over several episodes
serves the same purpose - reducing the update variance. After accumulating gradients for several episodes,
they are applied to the network.
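
As an illustration, the *Future Return* rescaler and its per-episode normalized variant can be computed as in the following sketch (illustrative names, not Coach's API):

.. code-block:: python

   import numpy as np

   def future_returns(rewards, gamma=0.99):
       """Discounted return from each timestep until the end of the episode."""
       returns = np.zeros(len(rewards))
       running = 0.0
       for t in reversed(range(len(rewards))):
           running = rewards[t] + gamma * running
           returns[t] = running
       return returns

   def normalize_by_episode(returns, eps=1e-8):
       """The 'Future Return Normalized by Episode' variant."""
       return (returns - returns.mean()) / (returns.std() + eps)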

.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters

@@ -0,0 +1,45 @@

Proximal Policy Optimization
============================

**Actions space:** Discrete | Continuous

**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ppo.png
   :align: center

Algorithm Description
---------------------

Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++

Run the observation through the policy network to get the mean and standard deviation vectors for this observation.
While in the training phase, sample an action from a multi-dimensional Gaussian distribution with these mean and standard
deviation values. When testing, just take the mean values predicted by the network.
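
A minimal sketch of this rule (``mean`` and ``std`` are assumed to be the vectors produced by the policy head):

.. code-block:: python

   import numpy as np

   def choose_action(mean, std, is_training):
       """Sample from N(mean, std^2) while training; act with the mean when testing."""
       if is_training:
           return mean + std * np.random.randn(*np.shape(mean))
       return mean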

Training the network
++++++++++++++++++++

1. Collect a large chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).

2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman et al., 2015).

3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first-order optimizers,
   the L-BFGS optimizer runs on the entire dataset at once, without batching.
   It continues running until a low loss threshold is reached. To prevent overfitting to the current dataset,
   the value targets are updated in a soft manner, using an exponentially weighted moving average, based on the total
   discounted returns of each state in each episode.

4. Run several training iterations of the policy network, using the previously calculated advantages as
   targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
   starting the current set of training iterations) using a regularization term.

5. After training is done, the last sampled KL divergence value is compared with the *target KL divergence* value,
   in order to adapt the penalty coefficient used in the policy loss (see the sketch below). If the KL divergence is too high,
   the penalty is increased; if it is too low, it is reduced; otherwise, it is left unchanged.
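
Step 5 can be sketched as the following adaptive rule (the tolerance of 1.5 and the factor of 2 follow the adaptive-KL variant described in the PPO paper; Coach's actual thresholds are configured by the preset):

.. code-block:: python

   def adapt_kl_penalty(penalty, kl, target_kl, tolerance=1.5, factor=2.0):
       """Raise the penalty when the KL divergence overshoots the target, reduce it when it undershoots."""
       if kl > tolerance * target_kl:
           return penalty * factor   # the new policy drifted too far from the old one
       if kl < target_kl / tolerance:
           return penalty / factor   # the policy is overly constrained
       return penalty                # within the acceptable band - leave unchanged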

.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters