
update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
This commit is contained in:
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions

View File

@@ -0,0 +1,18 @@
Additional Parameters
=====================
VisualizationParameters
-----------------------
.. autoclass:: rl_coach.base_parameters.VisualizationParameters
PresetValidationParameters
--------------------------
.. autoclass:: rl_coach.base_parameters.PresetValidationParameters
TaskParameters
--------------
.. autoclass:: rl_coach.base_parameters.TaskParameters
DistributedTaskParameters
-------------------------
.. autoclass:: rl_coach.base_parameters.DistributedTaskParameters

View File

@@ -0,0 +1,29 @@
Behavioral Cloning
==================
**Action space:** Discrete | Continuous
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as (state, action) tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.
1. Sample a batch of transitions from the replay buffer.
2. Use the current states as input to the network, and the expert actions as the targets of the network.
3. For the network head, we use the policy head, which uses the cross-entropy loss function (a minimal sketch of this update follows).
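This update can be sketched as follows (illustrative NumPy code, not Coach's implementation; the array names ``policy_logits`` and ``expert_actions`` are hypothetical):
.. code-block:: python

    import numpy as np

    def bc_loss(policy_logits, expert_actions):
        """Cross entropy between the policy head's action distribution and the expert actions.

        policy_logits:  (batch, num_actions) raw outputs of the policy head
        expert_actions: (batch,) integer action indices taken by the expert
        """
        # softmax over the action logits
        shifted = policy_logits - policy_logits.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        # negative log-likelihood of the expert actions (no rewards are involved)
        batch_idx = np.arange(len(expert_actions))
        return -np.log(probs[batch_idx, expert_actions] + 1e-8).mean()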
.. autoclass:: rl_coach.agents.bc_agent.BCAlgorithmParameters

View File

@@ -0,0 +1,36 @@
Conditional Imitation Learning
==============================
**Action space:** Discrete | Continuous
**References:** `End-to-end Driving via Conditional Imitation Learning <https://arxiv.org/abs/1710.02410>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/cil.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as (state, action) tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.
In conditional imitation learning, each transition is assigned a class, which determines the goal that was pursued
in that transition. For example, three possible classes could be: turn right, turn left, and follow lane.
1. Sample a batch of transitions from the replay buffer, where the batch is balanced, meaning that an equal number
of transitions will be sampled from each class index.
2. Use the current states as input to the network, and assign the expert actions as the targets of the network heads
corresponding to the state classes. For the other heads, set the targets to match the currently predicted values,
so that the loss for the other heads will be zeroed out.
3. We use a regression head that minimizes the MSE loss between the network's predicted values and the target values (see the sketch below).
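As a rough illustration (not Coach's implementation), the masked-target construction can be sketched as follows, where ``current_predictions`` and ``head_indices`` are hypothetical array names:
.. code-block:: python

    import numpy as np

    def cil_regression_targets(current_predictions, expert_actions, head_indices):
        """Build targets so that only the head matching each transition's class gets a non-zero MSE loss.

        current_predictions: (batch, num_heads, action_dim) current network outputs
        expert_actions:      (batch, action_dim) actions taken by the expert
        head_indices:        (batch,) class of each transition (e.g. left / right / follow lane)
        """
        targets = current_predictions.copy()  # unchanged heads -> zero MSE loss
        targets[np.arange(len(head_indices)), head_indices] = expert_actions
        return targets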
.. autoclass:: rl_coach.agents.cil_agent.CILAlgorithmParameters

View File

@@ -0,0 +1,43 @@
Agents
======
Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into three main classes -
value optimization, policy optimization, and imitation learning.
A detailed description of each algorithm can be found on its respective page.
.. image:: /_static/img/algorithms.png
:width: 600px
:align: center
.. toctree::
:maxdepth: 1
:caption: Agents
policy_optimization/ac
imitation/bc
value_optimization/bs_dqn
value_optimization/categorical_dqn
imitation/cil
policy_optimization/cppo
policy_optimization/ddpg
other/dfp
value_optimization/double_dqn
value_optimization/dqn
value_optimization/dueling_dqn
value_optimization/mmc
value_optimization/n_step
value_optimization/naf
value_optimization/nec
value_optimization/pal
policy_optimization/pg
policy_optimization/ppo
value_optimization/rainbow
value_optimization/qr_dqn
.. autoclass:: rl_coach.base_parameters.AgentParameters
.. autoclass:: rl_coach.agents.agent.Agent
:members:
:inherited-members:

View File

@@ -0,0 +1,39 @@
Direct Future Prediction
========================
**Action space:** Discrete
**References:** `Learning to Act by Predicting the Future <https://arxiv.org/abs/1611.01779>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dfp.png
:width: 600px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
1. The current states (observations and measurements) and the corresponding goal vector are passed as inputs to the network.
The output of the network is the predicted future measurements for time-steps :math:`t+1,t+2,t+4,t+8,t+16` and
:math:`t+32` for each possible action.
2. For each action, the measurements of each predicted time-step are multiplied by the goal vector,
and the result is a single vector of future values for each action.
3. Then, a weighted sum of the future values of each action is calculated, and the result is a single value for each action.
4. The action values are passed to the exploration policy to decide on the action to use.
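The following sketch (illustrative names, not Coach's API) shows how the action values are formed from the predicted measurements, the goal vector, and per-time-step weights:
.. code-block:: python

    import numpy as np

    def dfp_action_values(predicted_measurements, goal_vector, timestep_weights):
        """predicted_measurements: (num_actions, num_timesteps, num_measurements)
           goal_vector:            (num_measurements,) importance of each measurement
           timestep_weights:       (num_timesteps,)    weighting of the predicted horizons
        """
        # value of each predicted measurement vector under the current goal
        future_values = predicted_measurements @ goal_vector   # (num_actions, num_timesteps)
        # weighted sum over time-steps -> a single value per action
        return future_values @ timestep_weights                # (num_actions,)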
Training the network
++++++++++++++++++++
Given a batch of transitions, run them through the network to get the current predictions of the future measurements
per action, and set them as the initial targets for training the network. For each transition
:math:`(s_t,a_t,r_t,s_{t+1})` in the batch, the target of the network for the action that was taken is the actual
measurements that were seen in time-steps :math:`t+1,t+2,t+4,t+8,t+16` and :math:`t+32`.
For the actions that were not taken, the targets are the current values.
.. autoclass:: rl_coach.agents.dfp_agent.DFPAlgorithmParameters

View File

@@ -0,0 +1,40 @@
Actor-Critic
============
**Action space:** Discrete | Continuous
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ac.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
The policy network is used to predict action probabilities. While training, a sample is taken from a categorical
distribution parameterized by these probabilities. When testing, the action with the highest probability is used.
Training the network
++++++++++++++++++++
A batch of :math:`T_{max}` transitions is used, and the advantages are calculated over it.
Advantages can be calculated by either of the following methods (configured by the selected preset):
1. **A_VALUE** - Estimating advantage directly:
:math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
2. **GAE** - By following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.
The advantages are then used in order to accumulate gradients according to
:math:`L = -\mathop{\mathbb{E}} [log (\pi) \cdot A]`
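For illustration, the **A_VALUE** advantage estimate can be sketched as follows (illustrative code, assuming the critic values and a bootstrap value for the last state are already available):
.. code-block:: python

    import numpy as np

    def a_value_advantages(rewards, values, bootstrap_value, gamma):
        """rewards, values: (T_max,) arrays; bootstrap_value approximates V(s_{T_max})."""
        T = len(rewards)
        advantages = np.zeros(T)
        for t in range(T):
            k = T - t  # k = T_max - state index
            discounted_rewards = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
            q_estimate = discounted_rewards + gamma ** k * bootstrap_value
            advantages[t] = q_estimate - values[t]  # A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
        return advantages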
.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters

View File

@@ -0,0 +1,44 @@
Clipped Proximal Policy Optimization
====================================
**Action space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Same as in PPO.
Training the network
++++++++++++++++++++
Very similar to PPO, with several small (but very simplifying) changes:
1. Train both the value and policy networks simultaneously, by defining a single loss function,
which is the sum of the two networks' loss functions. Then, backpropagate gradients only once, from this unified loss function.
2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).
3. Value targets are now also calculated based on the GAE advantages.
In this method, the :math:`V` values are predicted by the critic network and then added to the GAE-based advantages,
in order to get a :math:`Q` value for each action. Since the critic network predicts a :math:`V` value for
each state, setting the calculated :math:`Q` action-values as targets will, on average, serve as a :math:`V` state-value target.
4. Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
:math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped, to achieve a similar effect.
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon
clipped surrogate loss:
:math:`L^{CLIP}(\theta)=E_{t}[min(r_t(\theta)\cdot \hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`
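A minimal sketch of this clipped surrogate loss (illustrative, assuming the log-probabilities under the new and old policies and the GAE advantages are already computed):
.. code-block:: python

    import numpy as np

    def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        ratio = np.exp(new_log_probs - old_log_probs)                     # r_t(theta)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        return -np.minimum(unclipped, clipped).mean()                     # negated, since we minimize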
.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters

View File

@@ -0,0 +1,50 @@
Deep Deterministic Policy Gradient
==================================
**Action space:** Continuous
**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++
Start by sampling a batch of transitions from the experience replay.
* To train the **critic network**, use the following targets:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`
First run the actor target network, using the next states as the inputs, and get :math:`\mu (s_{t+1} )`.
Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
as the inputs, and :math:`y_t` as the targets.
* To train the **actor network**, use the following equation:
:math:`\nabla_{\theta^\mu } J \approx E_{s_t \sim \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ]`
Use the actor's online network to get the action mean values using the current states as the inputs.
Then, use the critic online network in order to get the gradients of the critic output with respect to the
action mean values :math:`\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
Using the chain rule, calculate the gradients of the actor's output, with respect to the actor weights,
given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.
After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
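A rough sketch of the critic targets and the soft target update (``actor_target`` and ``critic_target`` are assumed callables returning the target networks' outputs; this is not Coach's implementation):
.. code-block:: python

    def ddpg_critic_targets(rewards, next_states, actor_target, critic_target, gamma):
        """y_t = r(s_t, a_t) + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))"""
        next_actions = actor_target(next_states)
        return rewards + gamma * critic_target(next_states, next_actions)

    def soft_update(target_weights, online_weights, tau=0.001):
        """Slowly track the online network: theta_target <- tau * theta + (1 - tau) * theta_target"""
        return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_weights, target_weights)]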
.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters

View File

@@ -0,0 +1,24 @@
Hierarchical Actor Critic
=========================
**Action space:** Continuous
**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++

View File

@@ -0,0 +1,39 @@
Policy Gradient
===============
**Action space:** Discrete | Continuous
**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
Run the current states through the network and get a policy distribution over the actions.
While training, sample from the policy distribution. When testing, take the action with the highest probability.
Training the network
++++++++++++++++++++
The policy head loss is defined as :math:`L=-\log (\pi) \cdot PolicyGradientRescaler`.
The :code:`PolicyGradientRescaler` is used to reduce the variance of the policy gradient, which can be very noisy;
noisy gradient updates might destabilize the policy's convergence.
The rescaler is a configurable parameter, and there are several options to choose from:
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - Return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
which are calculated separately for each timestep, across different episodes.
Gradients are accumulated over a number of fully played episodes. Accumulating gradients over several episodes
serves the same purpose - reducing the update variance. After accumulating gradients for several episodes,
the gradients are applied to the network.
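As an illustration, two of the rescalers can be sketched as follows (not Coach's implementation):
.. code-block:: python

    import numpy as np

    def future_returns(rewards, gamma):
        """'Future Return' rescaler: discounted return from each transition until the episode ends.
        The 'Total Episode Return' rescaler is simply returns[0] repeated for every transition."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def normalize_by_episode(returns):
        """'Future Return Normalized by Episode' rescaler."""
        return (returns - returns.mean()) / (returns.std() + 1e-8)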
.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters

View File

@@ -0,0 +1,45 @@
Proximal Policy Optimization
============================
**Action space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation.
While in training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values.
When testing, just take the mean values predicted by the network.
Training the network
++++++++++++++++++++
1. Collect a big chunk of experience (in the order of thousands of transitions, sampled from multiple episodes).
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman et al., 2015).
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
the L-BFGS optimizer runs on the entire dataset at once, without batching.
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
discounted returns of each state in each episode.
4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
starting to run the current set of training iterations) using a regularization term.
5. After training is done, the last sampled KL divergence value is compared with the *target KL divergence* value,
in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high, increase the
penalty; if it went too low, reduce it; otherwise, leave it unchanged (a sketch of this adaptation rule follows).
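The sketch below shows this rule with illustrative scaling constants (not the exact values used by Coach):
.. code-block:: python

    def adapt_kl_penalty(penalty, measured_kl, target_kl, factor=1.5, tolerance=2.0):
        """Increase the penalty when the policy moved too far, reduce it when it barely moved."""
        if measured_kl > tolerance * target_kl:
            penalty *= factor
        elif measured_kl < target_kl / tolerance:
            penalty /= factor
        return penalty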
.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters

View File

@@ -0,0 +1,43 @@
Bootstrapped DQN
================
**Action space:** Discrete
**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/bs_dqn.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current states are used as the input to the network. The network contains several :math:`Q` heads, which return
different estimates of the action :math:`Q` values. For each episode, the bootstrapped exploration policy
selects a single head to play with during the episode. According to the selected head, only the relevant
output :math:`Q` values are used. Using those :math:`Q` values, the exploration policy then selects the action for acting.
Storing the transitions
+++++++++++++++++++++++
For each transition, a binomial mask is generated according to a predefined probability and the number of output heads.
The mask is a binary vector where each element holds a 0 for heads that shouldn't train on the specific transition,
and 1 for heads that should use the transition for training. The mask is stored as part of the transition info in
the replay buffer.
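For example, the mask for a single transition can be generated as follows (a sketch, not Coach's implementation):
.. code-block:: python

    import numpy as np

    def bootstrap_mask(num_heads, probability):
        """Binary vector stored with the transition: 1 means the head trains on this transition."""
        return np.random.binomial(n=1, p=probability, size=num_heads)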
Training the network
++++++++++++++++++++
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
and for each output head, if the transition mask is 1 - change the targets of the played action to :math:`y_t`,
according to the standard DQN update rule:
:math:`y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a)`
Otherwise, leave it intact so that the transition does not affect the learning of this head.
Then, train the online network according to the calculated targets.
As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.

View File

@@ -0,0 +1,39 @@
Categorical DQN
===============
**Action space:** Discrete
**References:** `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/distributional_dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
that the :math:`i`-th component of the projected update is calculated as follows:
:math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
where:
* :math:`[ \cdot ]` bounds its argument in the range :math:`[a, b]`
* :math:`\hat{T}_{z_{j}}` is the Bellman update for atom :math:`z_j`: :math:`\hat{T}_{z_{j}} := r+\gamma z_j`
3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
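A NumPy sketch of the projection step described above (illustrative only; terminal transitions, which drop the bootstrapped term, are noted in a comment rather than handled explicitly):
.. code-block:: python

    import numpy as np

    def project_bellman_update(rewards, next_probs, gamma, v_min, v_max):
        """rewards: (batch,); next_probs: (batch, num_atoms) = p_j(s_{t+1}, pi(s_{t+1}))."""
        batch_size, num_atoms = next_probs.shape
        z = np.linspace(v_min, v_max, num_atoms)
        delta_z = (v_max - v_min) / (num_atoms - 1)
        projected = np.zeros_like(next_probs)
        rows = np.arange(batch_size)
        for j in range(num_atoms):
            # Bellman update of atom z_j, bounded to [V_MIN, V_MAX]
            # (for terminal transitions the gamma * z_j term would be dropped)
            tz_j = np.clip(rewards + gamma * z[j], v_min, v_max)
            b = (tz_j - v_min) / delta_z                  # fractional atom index
            lower = np.floor(b).astype(int)
            upper = np.ceil(b).astype(int)
            # distribute the probability mass of atom j between its two neighbouring atoms
            projected[rows, lower] += next_probs[:, j] * (upper - b + (lower == upper))
            projected[rows, upper] += next_probs[:, j] * (b - lower)
        return projected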
.. autoclass:: rl_coach.agents.categorical_dqn_agent.CategoricalDQNAlgorithmParameters

View File

@@ -0,0 +1,35 @@
Double DQN
==========
**Action space:** Discrete
**References:** `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the online network in order to find the :math:`Q`-maximizing
action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.
3. In order to zero out the updates for the actions that were not played (by zeroing their contribution to the MSE loss),
use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation for calculating the targets of the network:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a))`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.
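A minimal sketch of the target computation (illustrative array names; assumes the online and target networks' :math:`Q` values for the next states were already computed):
.. code-block:: python

    import numpy as np

    def double_dqn_targets(rewards, q_online_next, q_target_next, gamma):
        """y_t = r + gamma * Q_target(s_{t+1}, argmax_a Q_online(s_{t+1}, a))"""
        best_actions = q_online_next.argmax(axis=1)                    # action selection: online network
        rows = np.arange(len(rewards))
        return rewards + gamma * q_target_next[rows, best_actions]     # action evaluation: target network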

View File

@@ -0,0 +1,37 @@
Deep Q Networks
===============
**Action space:** Discrete
**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.
3. In order to zero out the updates for the actions that were not played (by zeroing their contribution to the MSE loss),
use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation for calculating the targets of the network:
:math:`y_t=r(s_t,a_t )+\gamma \cdot max_a Q(s_{t+1},a)`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.
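For illustration, the target computation for the played actions can be sketched as follows (assuming the target network's :math:`Q` values for the next states are already available):
.. code-block:: python

    import numpy as np

    def dqn_targets(rewards, q_target_next, gamma):
        """y_t = r(s_t, a_t) + gamma * max_a Q_target(s_{t+1}, a)"""
        return rewards + gamma * q_target_next.max(axis=1)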
.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters

View File

@@ -0,0 +1,27 @@
Dueling DQN
===========
**Action space:** Discrete
**References:** `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dueling_dqn.png
:align: center
General Description
-------------------
Dueling DQN introduces a change in the network structure compared to DQN.
Dueling DQN uses a specialized *Dueling Q Head* in order to separate :math:`Q` into an :math:`A` (advantage)
stream and a :math:`V` (value) stream. Adding this type of structure to the network head allows the network to better differentiate
actions from one another, and significantly improves the learning.
In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training
iteration, for each of the states in the batch, we update the :math:`Q` values only for the specific actions taken in
those states. This results in slower learning, as we do not learn the :math:`Q` values for actions that were not taken yet.
With the dueling architecture, on the other hand, learning is faster, as we start learning the state value even if only a
single action has been taken in this state.
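The standard aggregation of the two streams (as in the Dueling DQN paper) subtracts the mean advantage so that the decomposition is identifiable; a minimal sketch over NumPy arrays:
.. code-block:: python

    def dueling_q_values(state_value, advantages):
        """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)

        state_value: NumPy array of shape (batch, 1); advantages: (batch, num_actions)
        """
        return state_value + advantages - advantages.mean(axis=1, keepdims=True)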

View File

@@ -0,0 +1,37 @@
Mixed Monte Carlo
=================
**Action space:** Discrete
**References:** `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
The DDQN targets are calculated in the same manner as in the DDQN agent:
:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:
:math:`y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} )`
A mixing ratio :math:`\alpha` is then used to get the final targets:
:math:`y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}`
Finally, the online network is trained using the current states as inputs, and the calculated targets.
Once in every few thousand steps, copy the weights from the online network to the target network.
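A minimal sketch of the target mixing (``ddqn_targets`` would be computed as on the Double DQN page; the names are illustrative):
.. code-block:: python

    import numpy as np

    def monte_carlo_returns(rewards, gamma):
        """Total discounted return from each step until the end of the episode (y_t^MC)."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def mmc_targets(ddqn_targets, mc_returns, alpha):
        """y_t = (1 - alpha) * y_t^DDQN + alpha * y_t^MC"""
        return (1.0 - alpha) * ddqn_targets + alpha * mc_returns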
.. autoclass:: rl_coach.agents.mmc_agent.MixedMonteCarloAlgorithmParameters

View File

@@ -0,0 +1,35 @@
N-Step Q Learning
=================
**Action space:** Discrete
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The :math:`N`-step Q learning algorithm works in a similar manner to DQN, except for the following changes:
1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every
:math:`N` steps using the latest :math:`N` steps played by the agent.
2. In order to stabilize the learning, multiple workers work together to update the network.
This has a similar effect to decorrelating the samples used for training.
3. Instead of using single-step Q targets for the network, the rewards from :math:`N` consecutive steps are accumulated
to form the :math:`N`-step Q targets, according to the following equation:
:math:`R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
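A minimal sketch of the :math:`N`-step targets for the latest :math:`N` transitions, where the bootstrap value :math:`V(s_{t+k})` is taken as the maximum target-network :math:`Q` value of the state following the last step (illustrative, not Coach's implementation):
.. code-block:: python

    import numpy as np

    def n_step_q_targets(rewards, q_target_last_next, gamma):
        """rewards: (N,) rewards of the last N steps;
        q_target_last_next: (num_actions,) Q values of the state that follows the last step."""
        targets = np.zeros(len(rewards))
        running = q_target_last_next.max()      # bootstrap with V(s_{t+k}) = max_a Q(s_{t+k}, a)
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            targets[t] = running
        return targets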
.. autoclass:: rl_coach.agents.n_step_q_agent.NStepQAlgorithmParameters

View File

@@ -0,0 +1,33 @@
Normalized Advantage Functions
==============================
**Action space:** Continuous
**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/naf.png
:width: 600px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current state is used as an input to the network. The action mean :math:`\mu(s_t )` is extracted from the output head.
It is then passed to the exploration policy which adds noise in order to encourage exploration.
Training the network
++++++++++++++++++++
The network is trained by using the following targets:
:math:`y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1})`
Use the next states as the inputs to the target network and extract the :math:`V` value from the head
to get :math:`V(s_{t+1} )`. Then, update the online network using the current states and actions as inputs,
and :math:`y_t` as the targets.
After every training step, use a soft update in order to copy the weights from the online network to the target network.
.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters

View File

@@ -0,0 +1,50 @@
Neural Episodic Control
=======================
**Action space:** Discrete
**References:** `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/nec.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate
output from the middleware.
2. For each possible action :math:`a_i`, run the DND head using the state embedding and the selected action :math:`a_i` as inputs.
The DND is queried and returns the :math:`P` nearest neighbor keys and values. The keys and values are used to calculate
and return the action :math:`Q` value from the network.
3. Pass all the :math:`Q` values to the exploration policy and choose an action accordingly.
4. Store the state embeddings and actions taken during the current episode in a small buffer :math:`B`, in order to
accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
Finalizing an episode
+++++++++++++++++++++
For each step in the episode, the state embeddings and the taken actions are stored in the buffer :math:`B`.
When the episode is finished, the replay buffer calculates the :math:`N`-step total return of each transition in the
buffer, bootstrapped using the maximum :math:`Q` value of the :math:`N`-th transition. The state embeddings are then
inserted into the DND along with the corresponding returns, and the buffer :math:`B` is reset.
Training the network
++++++++++++++++++++
Train the network only when the DND has enough entries for querying.
To train the network, the current states are used as the inputs and the :math:`N`-step returns are used as the targets.
The :math:`N`-step return used takes into account :math:`N` consecutive steps, and bootstraps the last value from
the network if necessary:
:math:`y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N max_a Q(s_{t+N},a)`
.. autoclass:: rl_coach.agents.nec_agent.NECAlgorithmParameters

View File

@@ -0,0 +1,45 @@
Persistent Advantage Learning
=============================
**Action space:** Discrete
**References:** `Increasing the Action Gap: New Operators for Reinforcement Learning <https://arxiv.org/abs/1512.04860>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Start by calculating the initial target values in the same manner as they are calculated in DDQN:
:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
3. The action gap :math:`V(s_t )-Q(s_t,a_t)` should then be subtracted from each of the calculated targets.
To calculate the action gap, run the target network using the current states and get the :math:`Q` values
for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state:
:math:`V(s_t )=max_a Q(s_t,a)`
4. For *advantage learning (AL)*, subtract the action gap, weighted by a predefined parameter :math:`\alpha`, from
the targets :math:`y_t^{DDQN}`:
:math:`y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t ))`
5. For *persistent advantage learning (PAL)*, the target network is also used in order to calculate the action
gap for the next state:
:math:`V(s_{t+1} )-Q(s_{t+1},a_{t+1})`
where :math:`a_{t+1}` is chosen by running the next states through the online network and choosing the action that
has the highest predicted :math:`Q` value. Finally, the targets are defined as:
:math:`y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} ))`
6. Train the online network using the current states as inputs, and with the aforementioned targets.
7. Once in every few thousand steps, copy the weights from the online network to the target network.
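A minimal sketch of the AL/PAL target computation (illustrative names; assumes the relevant :math:`Q` values were already computed by the online and target networks):
.. code-block:: python

    import numpy as np

    def pal_targets(rewards, q_target_curr, q_target_next, q_online_next, actions, gamma, alpha,
                    persistent=True):
        """q_target_curr / q_target_next: (batch, num_actions) target-network Q values for s_t / s_{t+1};
        q_online_next: (batch, num_actions) online-network Q values for s_{t+1}; actions: (batch,) a_t."""
        rows = np.arange(len(rewards))
        next_actions = q_online_next.argmax(axis=1)                          # a_{t+1}, chosen by the online network
        y_ddqn = rewards + gamma * q_target_next[rows, next_actions]
        gap_t = q_target_curr.max(axis=1) - q_target_curr[rows, actions]     # V(s_t) - Q(s_t, a_t)
        if not persistent:
            return y_ddqn - alpha * gap_t                                    # advantage learning (AL)
        gap_t1 = q_target_next.max(axis=1) - q_target_next[rows, next_actions]
        return y_ddqn - alpha * np.minimum(gap_t, gap_t1)                    # persistent advantage learning (PAL)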
.. autoclass:: rl_coach.agents.pal_agent.PALAlgorithmParameters

View File

@@ -0,0 +1,33 @@
Quantile Regression DQN
=======================
**Action space:** Discrete
**References:** `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/qr_dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. First, the next state quantiles are predicted. These are used in order to calculate the targets for the network,
by following the Bellman equation.
Next, the quantile locations for the current states are predicted, sorted, and used for calculating the
quantile midpoint targets.
3. The network is trained with the quantile regression loss between the resulting quantile locations and the target
quantile locations. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
.. autoclass:: rl_coach.agents.qr_dqn_agent.QuantileRegressionDQNAlgorithmParameters

View File

@@ -0,0 +1,51 @@
Rainbow
=======
**Action space:** Discrete
**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/rainbow.png
:align: center
Algorithm Description
---------------------
Rainbow combines 6 recent advancements in reinforcement learning:
* N-step returns
* Distributional state-action value learning
* Dueling networks
* Noisy Networks
* Double DQN
* Prioritized Experience Replay
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
that the :math:`i`-th component of the projected update is calculated as follows:
:math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
where:
* :math:`[ \cdot ]` bounds its argument in the range :math:`[a, b]`
* :math:`\hat{T}_{z_{j}}` is the Bellman update for atom
:math:`z_j`: :math:`\hat{T}_{z_{j}} := r_t+\gamma r_{t+1} + ... + \gamma r_{t+n-1} + \gamma^{n-1} z_j`
3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
5. After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer
using the KL divergence loss that is returned from the network.
.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters

View File

@@ -0,0 +1,27 @@
Architectures
=============
Architectures contain all the classes that implement the neural network functionality for the agent.
Since Coach is intended to work with multiple neural network frameworks, each framework implements its
own components under a dedicated directory. For example, the tensorflow directory contains all the neural network
parts that are implemented using TensorFlow.
.. autoclass:: rl_coach.base_parameters.NetworkParameters
Architecture
------------
.. autoclass:: rl_coach.architectures.architecture.Architecture
:members:
:inherited-members:
NetworkWrapper
--------------
.. image:: /_static/img/distributed.png
:width: 600px
:align: center
.. autoclass:: rl_coach.architectures.network_wrapper.NetworkWrapper
:members:
:inherited-members:

View File

@@ -0,0 +1,33 @@
Core Types
==========
ActionInfo
----------
.. autoclass:: rl_coach.core_types.ActionInfo
:members:
:inherited-members:
Batch
-----
.. autoclass:: rl_coach.core_types.Batch
:members:
:inherited-members:
EnvResponse
-----------
.. autoclass:: rl_coach.core_types.EnvResponse
:members:
:inherited-members:
Episode
-------
.. autoclass:: rl_coach.core_types.Episode
:members:
:inherited-members:
Transition
----------
.. autoclass:: rl_coach.core_types.Transition
:members:
:inherited-members:

View File

@@ -0,0 +1,70 @@
Environments
============
.. autoclass:: rl_coach.environments.environment.Environment
:members:
:inherited-members:
DeepMind Control Suite
----------------------
A set of reinforcement learning environments powered by the MuJoCo physics engine.
Website: `DeepMind Control Suite <https://github.com/deepmind/dm_control>`_
.. autoclass:: rl_coach.environments.control_suite_environment.ControlSuiteEnvironment
Blizzard Starcraft II
---------------------
A popular strategy game, wrapped with a Python interface by DeepMind.
Website: `Blizzard Starcraft II <https://github.com/deepmind/pysc2>`_
.. autoclass:: rl_coach.environments.starcraft2_environment.StarCraft2Environment
ViZDoom
--------
A Doom-based AI research platform for reinforcement learning from raw visual information.
Website: `ViZDoom <http://vizdoom.cs.put.edu.pl/>`_
.. autoclass:: rl_coach.environments.doom_environment.DoomEnvironment
CARLA
-----
An open-source simulator for autonomous driving research.
Website: `CARLA <https://github.com/carla-simulator/carla>`_
.. autoclass:: rl_coach.environments.carla_environment.CarlaEnvironment
OpenAI Gym
----------
A library which consists of a set of environments, from games to robotics.
Additionally, it can be extended using the API defined by the authors.
Website: `OpenAI Gym <https://gym.openai.com/>`_
In Coach, we support all the native environments in Gym, along with several extensions such as:
* `Roboschool <https://github.com/openai/roboschool>`_ - a set of environments powered by the PyBullet engine,
offering a free alternative to MuJoCo.
* `Gym Extensions <https://github.com/Breakend/gym-extensions>`_ - a set of environments that extends Gym for
auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.)
* `PyBullet <https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet>`_ - a physics engine that
includes a set of robotics environments.
.. autoclass:: rl_coach.environments.gym_environment.GymEnvironment

View File

@@ -0,0 +1,87 @@
Exploration Policies
====================
Exploration policies allow the agent to trade off exploration and exploitation according to a
predefined policy. This is one of the most important aspects of reinforcement learning agents, and can require some
tuning to get right. Coach supports several pre-defined exploration policies, and can be easily extended with
custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action
spaces.
.. role:: green
.. role:: red
+----------------------+-----------------------+------------------+
| Exploration Policy | Discrete Action Space | Box Action Space |
+======================+=======================+==================+
| AdditiveNoise | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| Boltzmann | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| Bootstrapped | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| Categorical | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| ContinuousEntropy | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| EGreedy | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| Greedy | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| OUProcess | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| ParameterNoise | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| TruncatedNormal | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| UCB | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
ExplorationPolicy
-----------------
.. autoclass:: rl_coach.exploration_policies.ExplorationPolicy
:members:
:inherited-members:
AdditiveNoise
-------------
.. autoclass:: rl_coach.exploration_policies.AdditiveNoise
Boltzmann
---------
.. autoclass:: rl_coach.exploration_policies.Boltzmann
Bootstrapped
------------
.. autoclass:: rl_coach.exploration_policies.Bootstrapped
Categorical
-----------
.. autoclass:: rl_coach.exploration_policies.Categorical
ContinuousEntropy
-----------------
.. autoclass:: rl_coach.exploration_policies.ContinuousEntropy
EGreedy
-------
.. autoclass:: rl_coach.exploration_policies.EGreedy
Greedy
------
.. autoclass:: rl_coach.exploration_policies.Greedy
OUProcess
---------
.. autoclass:: rl_coach.exploration_policies.OUProcess
ParameterNoise
--------------
.. autoclass:: rl_coach.exploration_policies.ParameterNoise
TruncatedNormal
---------------
.. autoclass:: rl_coach.exploration_policies.TruncatedNormal
UCB
---
.. autoclass:: rl_coach.exploration_policies.UCB

View File

@@ -0,0 +1,28 @@
Filters
=======
.. toctree::
:maxdepth: 1
:caption: Filters
input_filters
output_filters
Filters are a mechanism in Coach that allows pre-processing and post-processing of the agent's internal information.
There are two filter categories -
* **Input filters** - these are filters that process the information passed **into** the agent from the environment.
This information includes the observation and the reward. Input filters therefore allow rescaling observations,
normalizing rewards, stacking observations, etc.
* **Output filters** - these are filters that process the information going **out** of the agent into the environment.
This information includes the action the agent chooses to take. Output filters therefore allow conversion of
actions from one space into another. For example, the agent can take :math:`N` discrete actions, which will be mapped by
the output filter onto :math:`N` continuous actions.
Filters can be stacked on top of each other in order to build complex processing flows of the inputs or outputs.
.. image:: /_static/img/filters.png
:width: 350px
:align: center

View File

@@ -0,0 +1,67 @@
Input Filters
=============
The input filters are separated into two categories - **observation filters** and **reward filters**.
Observation Filters
-------------------
ObservationClippingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationClippingFilter
ObservationCropFilter
+++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationCropFilter
ObservationMoveAxisFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationMoveAxisFilter
ObservationNormalizationFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationNormalizationFilter
ObservationReductionBySubPartsNameFilter
++++++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationReductionBySubPartsNameFilter
ObservationRescaleSizeByFactorFilter
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleSizeByFactorFilter
ObservationRescaleToSizeFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleToSizeFilter
ObservationRGBToYFilter
+++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRGBToYFilter
ObservationSqueezeFilter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationSqueezeFilter
ObservationStackingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationStackingFilter
ObservationToUInt8Filter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationToUInt8Filter
Reward Filters
--------------
RewardClippingFilter
++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardClippingFilter
RewardNormalizationFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardNormalizationFilter
RewardRescaleFilter
+++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardRescaleFilter

View File

@@ -0,0 +1,37 @@
Output Filters
--------------
The output filters only process the actions.
Action Filters
++++++++++++++
.. autoclass:: rl_coach.filters.action.AttentionDiscretization
.. image:: /_static/img/attention_discretization.png
:align: center
.. autoclass:: rl_coach.filters.action.BoxDiscretization
.. image:: /_static/img/box_discretization.png
:align: center
.. autoclass:: rl_coach.filters.action.BoxMasking
.. image:: /_static/img/box_masking.png
:align: center
.. autoclass:: rl_coach.filters.action.PartialDiscreteActionSpaceMap
.. image:: /_static/img/partial_discrete_action_space_map.png
:align: center
.. autoclass:: rl_coach.filters.action.FullDiscreteActionSpaceMap
.. image:: /_static/img/full_discrete_action_space_map.png
:align: center
.. autoclass:: rl_coach.filters.action.LinearBoxToBoxMap
.. image:: /_static/img/linear_box_to_box_map.png
:align: center

View File

@@ -0,0 +1,44 @@
Memories
========
Episodic Memories
-----------------
EpisodicExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicExperienceReplay
EpisodicHindsightExperienceReplay
+++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHindsightExperienceReplay
EpisodicHRLHindsightExperienceReplay
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHRLHindsightExperienceReplay
SingleEpisodeBuffer
+++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.SingleEpisodeBuffer
Non-Episodic Memories
---------------------
BalancedExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.BalancedExperienceReplay
QDND
++++
.. autoclass:: rl_coach.memories.non_episodic.QDND
ExperienceReplay
++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.ExperienceReplay
PrioritizedExperienceReplay
+++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.PrioritizedExperienceReplay
TransitionCollection
++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.TransitionCollection

View File

@@ -0,0 +1,64 @@
Spaces
======
Space
-----
.. autoclass:: rl_coach.spaces.Space
:members:
:inherited-members:
Observation Spaces
------------------
.. autoclass:: rl_coach.spaces.ObservationSpace
:members:
:inherited-members:
VectorObservationSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.VectorObservationSpace
PlanarMapsObservationSpace
++++++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.PlanarMapsObservationSpace
ImageObservationSpace
+++++++++++++++++++++
.. autoclass:: rl_coach.spaces.ImageObservationSpace
Action Spaces
-------------
.. autoclass:: rl_coach.spaces.ActionSpace
:members:
:inherited-members:
AttentionActionSpace
++++++++++++++++++++
.. autoclass:: rl_coach.spaces.AttentionActionSpace
BoxActionSpace
++++++++++++++
.. autoclass:: rl_coach.spaces.BoxActionSpace
DiscreteActionSpace
++++++++++++++++++++
.. autoclass:: rl_coach.spaces.DiscreteActionSpace
MultiSelectActionSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.MultiSelectActionSpace
CompoundActionSpace
+++++++++++++++++++
.. autoclass:: rl_coach.spaces.CompoundActionSpace
Goal Spaces
-----------
.. autoclass:: rl_coach.spaces.GoalsSpace
:members:
:inherited-members: