
update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
This commit is contained in:
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions


@@ -0,0 +1,40 @@
Actor-Critic
============
**Actions space:** Discrete | Continuous
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ac.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
The policy network is used to predict action probabilities. While training, an action is sampled from a categorical
distribution parameterized by these probabilities. When testing, the action with the highest probability is used.
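
As a concrete illustration, here is a minimal sketch of this action-selection rule, assuming a hypothetical
``policy_probs`` array returned by the policy network (this is not Coach's internal code):

.. code-block:: python

    import numpy as np

    def choose_action(policy_probs: np.ndarray, is_training: bool) -> int:
        """Sample from the categorical distribution while training; act greedily when testing."""
        if is_training:
            return int(np.random.choice(len(policy_probs), p=policy_probs))
        return int(np.argmax(policy_probs))

    # Example: probabilities predicted for an environment with 3 discrete actions
    action = choose_action(np.array([0.2, 0.5, 0.3]), is_training=True)
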
Training the network
++++++++++++++++++++
A batch of :math:`T_{max}` transitions is used, and the advantages are calculated over it.
Advantages can be calculated by either of the following methods (configured by the selected preset):
1. **A_VALUE** - Estimating advantage directly:
:math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
2. **GAE** - By following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.
The advantages are then used to accumulate gradients according to the loss
:math:`L = -\mathop{\mathbb{E}} [\log (\pi) \cdot A]`
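
The following is a small numpy sketch of the **A_VALUE** estimator and the resulting policy loss for a single
rollout of :math:`T_{max}` transitions. The ``rewards`` and ``values`` arrays (the latter including the bootstrap
value :math:`V(s_{T_{max}})` as its last entry) are assumptions for illustration, not part of Coach's API:

.. code-block:: python

    import numpy as np

    def n_step_advantages(rewards, values, gamma=0.99):
        """A_VALUE estimate: n-step discounted return bootstrapped with V(s_{T_max}), minus V(s_t)."""
        T = len(rewards)                                   # T_max
        advantages = np.zeros(T)
        for t in range(T):
            k = T - t                                      # k = T_max - state index
            discounted = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
            q = discounted + gamma ** k * values[T]        # Q(s_t, a_t), bootstrapped with V(s_{t+k})
            advantages[t] = q - values[t]
        return advantages

    def policy_loss(log_probs, advantages):
        # L = -E[log(pi) * A]; the advantages are treated as constants (no gradient flows through them)
        return -np.mean(log_probs * advantages)
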
.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters


@@ -0,0 +1,44 @@
Clipped Proximal Policy Optimization
====================================
**Actions space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
++++++++++++++++++++++++++++++++++++++++
Same as in PPO.
Training the network
++++++++++++++++++++
Very similar to PPO, with several small (but greatly simplifying) changes:
1. Train both the value and policy networks simultaneously, by defining a single loss function,
which is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.
2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network, as in PPO).
3. Value targets are now also calculated based on the GAE advantages.
In this method, the :math:`V` values predicted by the critic network are added to the GAE-based advantages,
yielding a :math:`Q` value for each action. Since the critic network predicts a :math:`V` value for each state,
setting these :math:`Q` action-values as targets serves, on average, as a :math:`V` state-value target.
4. Instead of adapting the KL divergence penalty coefficient as in PPO, the likelihood ratio
:math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped to achieve a similar effect.
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon-clipped
surrogate loss:
:math:`L^{CLIP}(\theta)=\mathbb{E}_{t}[\min(r_t(\theta)\cdot \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`
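
A short numpy sketch of the clipped surrogate objective above; the per-transition log-probabilities under the new
and old policies and the GAE advantages are assumed inputs, and this is not the loss code used inside Coach's
``ClippedPPOAgent``:

.. code-block:: python

    import numpy as np

    def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        ratio = np.exp(new_log_probs - old_log_probs)                    # r_t(theta)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        # Maximizing the clipped objective is the same as minimizing its negation
        return -np.mean(np.minimum(unclipped, clipped))
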
.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters


@@ -0,0 +1,50 @@
Deep Deterministic Policy Gradient
==================================
**Actions space:** Continuous
**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
During the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
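
Below is a standalone numpy sketch of an Ornstein-Uhlenbeck exploration process; the parameter names and default
values follow the common convention for this process and are not taken from Coach's exploration API:

.. code-block:: python

    import numpy as np

    class OUNoise:
        """Temporally correlated exploration noise added to the actor's action mean."""
        def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
            self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
            self.state = np.ones(action_dim) * mu

        def sample(self):
            # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
            dx = (self.theta * (self.mu - self.state) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.state)))
            self.state = self.state + dx
            return self.state

    # During training: action = action_mean + noise.sample(); when testing: action = action_mean
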
Training the network
++++++++++++++++++++
Start by sampling a batch of transitions from the experience replay.
* To train the **critic network**, use the following targets:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`
First run the actor target network, using the next states as the inputs, and get :math:`\mu (s_{t+1} )`.
Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
as the inputs, and :math:`y_t` as the targets.
* To train the **actor network**, use the following equation:
:math:`\nabla_{\theta^\mu } J \approx \mathbb{E}_{s_t \sim \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t)} \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t}]`
Use the actor's online network to get the action mean values using the current states as the inputs.
Then, use the critic online network in order to get the gradients of the critic output with respect to the
action mean values :math:`\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
Using the chain rule, calculate the gradients of the actor's output, with respect to the actor weights,
given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.
After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
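
As a rough sketch of the two updates described above, the snippet below computes the critic targets :math:`y_t`
and performs a soft (Polyak) update of the target-network weights using plain numpy. The ``target_actor`` /
``target_critic`` callables and the weight lists are hypothetical stand-ins, not Coach's ``DDPGAgent`` internals:

.. code-block:: python

    import numpy as np

    def critic_targets(rewards, next_states, target_actor, target_critic, gamma=0.99):
        # y_t = r(s_t, a_t) + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))
        next_actions = target_actor(next_states)
        return rewards + gamma * target_critic(next_states, next_actions)

    def soft_update(target_weights, online_weights, tau=0.001):
        # theta_target <- tau * theta_online + (1 - tau) * theta_target
        return [tau * w_online + (1.0 - tau) * w_target
                for w_online, w_target in zip(online_weights, target_weights)]
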
.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters


@@ -0,0 +1,24 @@
Hierarchical Actor Critic
=========================
**Actions space:** Continuous
**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
During the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++


@@ -0,0 +1,39 @@
Policy Gradient
===============
**Actions space:** Discrete | Continuous
**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
Run the current states through the network and get a policy distribution over the actions.
While training, sample from the policy distribution. When testing, take the action with the highest probability.
Training the network
++++++++++++++++++++
The policy head loss is defined as :math:`L=-\log (\pi) \cdot PolicyGradientRescaler`.
The :code:`PolicyGradientRescaler` is used to reduce the variance of the policy gradient updates,
since noisy gradient updates might destabilize the policy's convergence.
The rescaler is a configurable parameter, and there are a few options to choose from:
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - Return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
which are calculated separately for each timestep, across different episodes.
Gradients are accumulated over a number of fully played episodes. Accumulating the gradients over several episodes
serves the same purpose of reducing the update variance. After accumulating gradients for several episodes,
the gradients are applied to the network.
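
For illustration, here is a numpy sketch of the **Future Return** rescaler (the discounted return from each
transition until the end of the episode) and the resulting loss :math:`L=-\log(\pi) \cdot PolicyGradientRescaler`;
it mirrors the description above rather than Coach's ``PolicyGradientRescaler`` implementation:

.. code-block:: python

    import numpy as np

    def future_returns(rewards, gamma=0.99):
        """Discounted return from each transition until the end of the episode."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def pg_loss(log_probs, rescaler):
        # L = -log(pi) * rescaler, averaged over the accumulated transitions
        return -np.mean(log_probs * rescaler)
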
.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters


@@ -0,0 +1,45 @@
Proximal Policy Optimization
============================
**Actions space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation.
While in the training phase, sample from a multidimensional Gaussian distribution with these mean and standard deviation values.
When testing, take the mean values predicted by the network.
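
A minimal sketch of this action-selection rule, assuming hypothetical ``mean`` and ``std`` vectors predicted by the
policy network head (not a specific Coach API):

.. code-block:: python

    import numpy as np

    def choose_continuous_action(mean: np.ndarray, std: np.ndarray, is_training: bool) -> np.ndarray:
        if is_training:
            # Sample each action dimension from an independent Gaussian
            return np.random.normal(mean, std)
        # When testing, act deterministically using the predicted mean
        return mean
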
Training the network
++++++++++++++++++++
1. Collect a large chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman et al., 2015).
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first-order optimizers,
the L-BFGS optimizer runs on the entire dataset at once, without batching.
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
discounted returns of each state in each episode.
4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
starting to run the current set of training iterations) using a regularization term.
5. After training is done, the last sampled KL divergence value is compared with the *target KL divergence* value,
in order to adapt the penalty coefficient used in the policy loss: if the KL divergence went too high,
increase the penalty; if it went too low, reduce it; otherwise, leave it unchanged.
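
As a minimal sketch of step 5, the function below adapts the KL penalty coefficient after a set of policy training
iterations. The tolerance factor of 1.5 around the target and the scaling factor of 2 are common choices from the
PPO paper, used here for illustration rather than taken from Coach's ``PPOAgent``:

.. code-block:: python

    def adapt_kl_penalty(kl_divergence, target_kl, penalty_coefficient,
                         tolerance=1.5, scale=2.0):
        if kl_divergence > target_kl * tolerance:      # KL went too high: penalize more
            penalty_coefficient *= scale
        elif kl_divergence < target_kl / tolerance:    # KL went too low: penalize less
            penalty_coefficient /= scale
        return penalty_coefficient                     # otherwise left unchanged
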
.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters