mirror of https://github.com/gryf/coach.git synced 2025-12-17 19:20:19 +01:00

update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
This commit is contained in:
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions

View File

@@ -0,0 +1,61 @@
/* Docs background */
.wy-side-nav-search{
background-color: #043c74;
}
/* Mobile version */
.wy-nav-top{
background-color: #043c74;
}
.green {
color: green;
}
.red {
color: red;
}
.blue {
color: blue;
}
.yellow {
color: yellow;
}
.badge {
border: 2px;
border-style: solid;
border-color: #6C8EBF;
border-radius: 5px;
padding: 3px 15px 3px 15px;
margin: 5px;
display: inline-block;
font-weight: bold;
font-size: 16px;
background: #DAE8FC;
}
.badge:hover {
cursor: pointer;
}
.badge > a {
color: black;
}
.bordered-container {
border: 0px;
border-style: solid;
border-radius: 8px;
padding: 15px;
margin-bottom: 20px;
background: #f2f2f2;
}
.questionnaire {
font-size: 1.2em;
line-height: 1.5em;
}

(Binary image files added - documentation figures and algorithm diagrams; diffs not shown. One text file's diff was suppressed because one or more of its lines are too long.)

View File

@@ -0,0 +1 @@
(New draw.io diagram file: a single line containing the compressed XML payload, omitted here.)

(Additional binary image files added; diffs not shown.)

View File

@@ -0,0 +1,4 @@
{% extends "!layout.html" %}
{% block extrahead %}
<link href="{{ pathto("_static/css/custom.css", True) }}" rel="stylesheet" type="text/css">
{% endblock %}

View File

@@ -0,0 +1,18 @@
Additional Parameters
=====================
VisualizationParameters
-----------------------
.. autoclass:: rl_coach.base_parameters.VisualizationParameters
PresetValidationParameters
--------------------------
.. autoclass:: rl_coach.base_parameters.PresetValidationParameters
TaskParameters
--------------
.. autoclass:: rl_coach.base_parameters.TaskParameters
DistributedTaskParameters
-------------------------
.. autoclass:: rl_coach.base_parameters.DistributedTaskParameters

View File

@@ -0,0 +1,29 @@
Behavioral Cloning
==================
**Actions space:** Discrete | Continuous
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as state-action tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.
1. Sample a batch of transitions from the replay buffer.
2. Use the current states as input to the network, and the expert actions as the targets of the network.
3. For the network head, we use the policy head, which uses the cross entropy loss function.
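For discrete actions, the loss in step 3 reduces to the cross entropy between the policy head's output probabilities and the expert's action labels. A minimal NumPy sketch (illustrative only, not Coach's implementation):

.. code-block:: python

    import numpy as np

    def bc_cross_entropy_loss(action_logits, expert_actions):
        """action_logits: (batch, num_actions), expert_actions: (batch,) integer labels."""
        # softmax over the policy head outputs
        shifted = action_logits - action_logits.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        # negative log likelihood of the expert action in each sampled state
        nll = -np.log(probs[np.arange(len(expert_actions)), expert_actions])
        return nll.mean()
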
.. autoclass:: rl_coach.agents.bc_agent.BCAlgorithmParameters

View File

@@ -0,0 +1,36 @@
Conditional Imitation Learning
==============================
**Actions space:** Discrete | Continuous
**References:** `End-to-end Driving via Conditional Imitation Learning <https://arxiv.org/abs/1710.02410>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/cil.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as state-action tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.
In conditional imitation learning, each transition is assigned a class, which determines the goal that was pursued
in that transition. For example, three possible classes could be: turn right, turn left and follow lane.
1. Sample a batch of transitions from the replay buffer, where the batch is balanced, meaning that an equal number
of transitions will be sampled from each class index.
2. Use the current states as input to the network, and assign the expert actions as the targets of the network heads
corresponding to the state classes. For the other heads, set the targets to match the currently predicted values,
so that the loss for the other heads will be zeroed out.
3. We use a regression head that minimizes the MSE loss between the network's predicted values and the target values.
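The target masking in step 2 can be sketched as follows (an illustrative NumPy sketch with assumed tensor shapes, not Coach's code):

.. code-block:: python

    import numpy as np

    def cil_targets(head_predictions, expert_actions, classes):
        """head_predictions: (batch, num_heads, action_dim) current network outputs,
        expert_actions: (batch, action_dim), classes: (batch,) class index per transition."""
        targets = head_predictions.copy()   # heads of the other classes keep their predictions -> zero loss
        targets[np.arange(len(classes)), classes] = expert_actions
        return targets
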
.. autoclass:: rl_coach.agents.cil_agent.CILAlgorithmParameters

View File

@@ -0,0 +1,43 @@
Agents
======
Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into three main classes -
value optimization, policy optimization and imitation learning.
A detailed description of those algorithms can be found by navigating to each of the algorithm pages.
.. image:: /_static/img/algorithms.png
:width: 600px
:align: center
.. toctree::
:maxdepth: 1
:caption: Agents
policy_optimization/ac
imitation/bc
value_optimization/bs_dqn
value_optimization/categorical_dqn
imitation/cil
policy_optimization/cppo
policy_optimization/ddpg
other/dfp
value_optimization/double_dqn
value_optimization/dqn
value_optimization/dueling_dqn
value_optimization/mmc
value_optimization/n_step
value_optimization/naf
value_optimization/nec
value_optimization/pal
policy_optimization/pg
policy_optimization/ppo
value_optimization/rainbow
value_optimization/qr_dqn
.. autoclass:: rl_coach.base_parameters.AgentParameters
.. autoclass:: rl_coach.agents.agent.Agent
:members:
:inherited-members:

View File

@@ -0,0 +1,39 @@
Direct Future Prediction
========================
**Actions space:** Discrete
**References:** `Learning to Act by Predicting the Future <https://arxiv.org/abs/1611.01779>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dfp.png
:width: 600px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
1. The current states (observations and measurements) and the corresponding goal vector are passed as an input to the network.
The output of the network is the predicted future measurements for time-steps :math:`t+1,t+2,t+4,t+8,t+16` and
:math:`t+32` for each possible action.
2. For each action, the measurements of each predicted time-step are multiplied by the goal vector,
and the result is a single vector of future values for each action.
3. Then, a weighted sum of the future values of each action is calculated, and the result is a single value for each action.
4. The action values are passed to the exploration policy to decide on the action to use.
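Steps 2 and 3 above can be sketched as follows (a minimal NumPy sketch with assumed tensor shapes, not Coach's implementation); the resulting action values are then handed to the exploration policy as in step 4:

.. code-block:: python

    import numpy as np

    def dfp_action_values(predicted_measurements, goal_vector, timestep_weights):
        """predicted_measurements: (num_actions, num_timesteps, num_measurements),
        goal_vector: (num_measurements,), timestep_weights: (num_timesteps,)."""
        # step 2: one vector of future values per action
        future_values = predicted_measurements @ goal_vector      # (num_actions, num_timesteps)
        # step 3: weighted sum over the predicted time-steps -> one value per action
        return future_values @ timestep_weights                   # (num_actions,)
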
Training the network
++++++++++++++++++++
Given a batch of transitions, run them through the network to get the current predictions of the future measurements
per action, and set them as the initial targets for training the network. For each transition
:math:`(s_t,a_t,r_t,s_{t+1} )` in the batch, the target of the network for the action that was taken is the actual
measurements that were seen at time-steps :math:`t+1,t+2,t+4,t+8,t+16` and :math:`t+32`.
For the actions that were not taken, the targets are the current values.
.. autoclass:: rl_coach.agents.dfp_agent.DFPAlgorithmParameters

View File

@@ -0,0 +1,40 @@
Actor-Critic
============
**Actions space:** Discrete | Continuous
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ac.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
The policy network is used in order to predict action probabilities. While training, an action is sampled from a categorical
distribution parameterized by these probabilities. When testing, the action with the highest probability is used.
Training the network
++++++++++++++++++++
A batch of :math:`T_{max}` transitions is used, and the advantages are calculated upon it.
Advantages can be calculated by either of the following methods (configured by the selected preset) -
1. **A_VALUE** - Estimating advantage directly:
:math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
2. **GAE** - By following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.
The advantages are then used in order to accumulate gradients according to
:math:`L = -\mathop{\mathbb{E}} [log (\pi) \cdot A]`
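The **A_VALUE** advantage estimate above can be sketched as follows (a minimal NumPy sketch for a single rollout, not Coach's implementation):

.. code-block:: python

    import numpy as np

    def a_value_advantages(rewards, values, bootstrap_value, gamma=0.99):
        """rewards, values: length T_max arrays; bootstrap_value: V(s_{T_max})."""
        returns = np.empty(len(rewards))
        running = bootstrap_value
        for t in reversed(range(len(rewards))):   # k-step return with k = T_max - t
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns - values                   # A(s_t, a_t) = Q estimate - V(s_t)
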
.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters

View File

@@ -0,0 +1,44 @@
Clipped Proximal Policy Optimization
====================================
**Actions space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Same as in PPO.
Training the network
++++++++++++++++++++
Very similar to PPO, with several small (but very simplifying) changes:
1. Train both the value and policy networks simultaneously, by defining a single loss function
which is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.
2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).
3. Value targets are now also calculated based on the GAE advantages.
In this method, the :math:`V` values are predicted by the critic network and then added to the GAE-based advantages,
in order to get a :math:`Q` value for each action. Now, since our critic network predicts a :math:`V` value for
each state, setting the calculated :math:`Q` action-values as targets will, on average, serve as a :math:`V` state-value target.
4. Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
:math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped, to achieve a similar effect.
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon
clipped surrogate loss:
:math:`L^{CLIP}(\theta)=E_{t}[min(r_t(\theta)\cdot \hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`
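The clipped surrogate loss above can be sketched as follows (a minimal NumPy sketch over a batch, not Coach's implementation; gradient computation is omitted):

.. code-block:: python

    import numpy as np

    def clipped_surrogate_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2):
        ratio = np.exp(log_probs - old_log_probs)                # r_t(theta)
        clipped_ratio = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon)
        # maximizing the clipped surrogate == minimizing its negation
        return -np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))
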
.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters

View File

@@ -0,0 +1,50 @@
Deep Deterministic Policy Gradient
==================================
**Actions space:** Continuous
**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++
Start by sampling a batch of transitions from the experience replay.
* To train the **critic network**, use the following targets:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`
First run the actor target network, using the next states as the inputs, and get :math:`\mu (s_{t+1} )`.
Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
as the inputs, and :math:`y_t` as the targets.
* To train the **actor network**, use the following equation:
:math:`\nabla_{\theta^\mu } J \approx E_{s_t \tilde{} \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ]`
Use the actor's online network to get the action mean values using the current states as the inputs.
Then, use the critic online network in order to get the gradients of the critic output with respect to the
action mean values :math:`\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
Using the chain rule, calculate the gradients of the actor's output, with respect to the actor weights,
given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.
After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
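The critic targets and the soft target update above can be sketched as follows (an illustrative NumPy sketch, not Coach's code; ``actor_target`` and ``critic_target`` stand in for the target networks' forward passes):

.. code-block:: python

    import numpy as np

    def ddpg_critic_targets(rewards, next_states, actor_target, critic_target, gamma=0.99):
        next_actions = actor_target(next_states)                       # mu(s_{t+1})
        return rewards + gamma * critic_target(next_states, next_actions)

    def soft_update(target_weights, online_weights, tau=0.001):
        # theta_target <- tau * theta_online + (1 - tau) * theta_target
        return [tau * w + (1.0 - tau) * w_t
                for w, w_t in zip(online_weights, target_weights)]
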
.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters

View File

@@ -0,0 +1,24 @@
Hierarchical Actor Critic
=========================
**Actions space:** Continuous
**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++

View File

@@ -0,0 +1,39 @@
Policy Gradient
===============
**Actions space:** Discrete | Continuous
**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
Run the current states through the network and get a policy distribution over the actions.
While training, sample from the policy distribution. When testing, take the action with the highest probability.
Training the network
++++++++++++++++++++
The policy head loss is defined as :math:`L=-log (\pi) \cdot PolicyGradientRescaler`.
The :code:`PolicyGradientRescaler` is used to reduce the variance of the policy gradient, which can be very noisy;
noisy gradient updates might destabilize the policy's convergence.
The rescaler is a configurable parameter, and there are a few options to choose from:
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - Return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
which are calculated separately for each timestep, across different episodes.
Gradients are accumulated over a number of fully played episodes. Accumulating the gradients over several episodes
serves the same purpose - reducing the update variance. After accumulating gradients for several episodes,
the gradients are then applied to the network.
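Two of the rescalers above - the future return and its per-episode normalization - can be sketched as follows (a minimal NumPy sketch, not Coach's implementation):

.. code-block:: python

    import numpy as np

    def future_returns(rewards, gamma=0.99):
        # discounted return from each transition until the end of the episode
        returns = np.empty(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def normalize_by_episode(returns, eps=1e-8):
        return (returns - returns.mean()) / (returns.std() + eps)
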
.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters

View File

@@ -0,0 +1,45 @@
Proximal Policy Optimization
============================
**Actions space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation.
While in the training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values.
When testing, just take the mean values predicted by the network.
Training the network
++++++++++++++++++++
1. Collect a big chunk of experience (in the order of thousands of transitions, sampled from multiple episodes).
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman '2015).
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
the L-BFGS optimizer runs on the entire dataset at once, without batching.
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
discounted returns of each state in each episode.
4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
starting to run the current set of training iterations) using a regularization term.
5. After training is done, the last sampled KL divergence value will be compared with the *target KL divergence* value,
in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high,
increase the penalty; if it went too low, reduce it. Otherwise, leave it unchanged.
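The GAE computation in step 2 can be sketched as follows (a minimal NumPy sketch for a single episode, not Coach's implementation; terminal-state handling is omitted):

.. code-block:: python

    import numpy as np

    def gae(rewards, values, gamma=0.99, lam=0.95):
        """rewards: length T array, values: length T + 1 array (including the value of the state
        that follows the last transition)."""
        deltas = rewards + gamma * values[1:] - values[:-1]       # TD residuals
        advantages = np.empty_like(deltas)
        running = 0.0
        for t in reversed(range(len(deltas))):
            running = deltas[t] + gamma * lam * running
            advantages[t] = running
        return advantages
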
.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters

View File

@@ -0,0 +1,43 @@
Bootstrapped DQN
================
**Actions space:** Discrete
**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/bs_dqn.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current states are used as the input to the network. The network contains several :math:`Q` heads, which are used
for returning different estimations of the action :math:`Q` values. For each episode, the bootstrapped exploration policy
selects a single head to play with during the episode. According to the selected head, only the relevant
output :math:`Q` values are used. Using those :math:`Q` values, the exploration policy then selects the action for acting.
Storing the transitions
+++++++++++++++++++++++
For each transition, a binomial mask is generated according to a predefined probability and the number of output heads.
The mask is a binary vector where each element holds a 0 for heads that shouldn't train on the specific transition,
and 1 for heads that should use the transition for training. The mask is stored as part of the transition info in
the replay buffer.
Training the network
++++++++++++++++++++
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
and for each output head, if the transition mask is 1 - change the targets of the played action to :math:`y_t`,
according to the standard DQN update rule:
:math:`y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a)`
Otherwise, leave it intact so that the transition does not affect the learning of this head.
Then, train the online network according to the calculated targets.
As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.
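The per-transition head mask can be sketched as follows (an illustrative NumPy sketch, not Coach's code; the keep probability is an assumed parameter):

.. code-block:: python

    import numpy as np

    def bootstrap_mask(num_heads, keep_probability=0.5, rng=np.random):
        # 1 -> this head trains on the transition, 0 -> its target is left intact
        return rng.binomial(n=1, p=keep_probability, size=num_heads)

    # e.g. store bootstrap_mask(num_heads=10) together with each transition in the replay buffer
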

View File

@@ -0,0 +1,39 @@
Categorical DQN
===============
**Actions space:** Discrete
**References:** `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/distributional_dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected onto the set of atoms representing the :math:`Q` value distribution, such
that the :math:`i`-th component of the projected update is calculated as follows:
:math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
where:
* :math:`[ \cdot ]^{b}_{a}` bounds its argument in the range :math:`[a, b]`
* :math:`\hat{T}_{z_{j}}` is the Bellman update for atom :math:`z_j`: :math:`\hat{T}_{z_{j}} := r+\gamma z_j`
3. The network is trained with the cross entropy loss between the resulting probability distribution and the target
probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
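The projection in step 2 can be sketched as follows (a minimal NumPy sketch, not Coach's implementation); ``next_probs`` holds the target network's atom probabilities :math:`p_j(s_{t+1}, \pi(s_{t+1}))` for the selected next action:

.. code-block:: python

    import numpy as np

    def project_distribution(rewards, next_probs, gamma=0.99, v_min=-10.0, v_max=10.0):
        """rewards: (batch,), next_probs: (batch, num_atoms)."""
        num_atoms = next_probs.shape[1]
        z = np.linspace(v_min, v_max, num_atoms)
        delta_z = (v_max - v_min) / (num_atoms - 1)
        projected = np.zeros_like(next_probs)
        batch_idx = np.arange(len(rewards))
        for j in range(num_atoms):
            tz = np.clip(rewards + gamma * z[j], v_min, v_max)    # Bellman update for atom z_j, bounded
            b = (tz - v_min) / delta_z                            # continuous atom index
            lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
            exact = (lower == upper)                              # b falls exactly on an atom
            projected[batch_idx, lower] += next_probs[:, j] * np.where(exact, 1.0, upper - b)
            projected[batch_idx, upper] += next_probs[:, j] * np.where(exact, 0.0, b - lower)
        return projected
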
.. autoclass:: rl_coach.agents.categorical_dqn_agent.CategoricalDQNAlgorithmParameters

View File

@@ -0,0 +1,35 @@
Double DQN
==========
**Actions space:** Discrete
**References:** `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the online network in order to find the :math:`Q` maximizing
action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.
3. In order to zero out the updates for the actions that were not played (so that their MSE loss terms become zero),
use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation for calculating the targets of the network:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a))`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.

View File

@@ -0,0 +1,37 @@
Deep Q Networks
===============
**Actions space:** Discrete
**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.
3. In order to zero out the updates for the actions that were not played (so that their MSE loss terms become zero),
use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation for calculating the targets of the network:
:math:`y_t=r(s_t,a_t )+\gamma \cdot max_a Q(s_{t+1},a)`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.
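Steps 2-4 above can be sketched as follows (a minimal NumPy sketch, not Coach's implementation; terminal-state handling is omitted):

.. code-block:: python

    import numpy as np

    def dqn_targets(online_q, target_q_next, actions, rewards, gamma=0.99):
        """online_q: (batch, num_actions) for s_t, target_q_next: (batch, num_actions) for s_{t+1}."""
        targets = online_q.copy()            # actions that were not played keep their predictions -> zero MSE
        batch_idx = np.arange(len(actions))
        targets[batch_idx, actions] = rewards + gamma * target_q_next.max(axis=1)
        return targets
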
.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters

View File

@@ -0,0 +1,27 @@
Dueling DQN
===========
**Actions space:** Discrete
**References:** `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dueling_dqn.png
:align: center
General Description
-------------------
Dueling DQN presents a change in the network structure compared to DQN.
Dueling DQN uses a specialized *Dueling Q Head* in order to separate :math:`Q` into an :math:`A` (advantage)
stream and a :math:`V` stream. Adding this type of structure to the network head allows the network to better differentiate
actions from one another, and significantly improves learning.
In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training
iteration, for each of the states in the batch, we update the :math:`Q` values only for the specific actions taken in
those states. This results in slower learning, as we do not learn the :math:`Q` values for actions that were not taken yet.
With the dueling architecture, on the other hand, learning is faster, as we start learning the state value even if only a
single action has been taken in this state.

View File

@@ -0,0 +1,37 @@
Mixed Monte Carlo
=================
**Actions space:** Discrete
**References:** `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
The DDQN targets are calculated in the same manner as in the DDQN agent:
:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:
:math:`y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} )`
A mixing ratio :math:`\alpha` is then used to get the final targets:
:math:`y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}`
Finally, the online network is trained using the current states as inputs, and the calculated targets.
Once in every few thousand steps, copy the weights from the online network to the target network.
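The target mixing above can be sketched as follows (an illustrative NumPy sketch, not Coach's code):

.. code-block:: python

    import numpy as np

    def mmc_targets(ddqn_targets, monte_carlo_returns, alpha=0.1):
        # y_t = (1 - alpha) * y_t^DDQN + alpha * y_t^MC
        return (1.0 - alpha) * np.asarray(ddqn_targets) + alpha * np.asarray(monte_carlo_returns)
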
.. autoclass:: rl_coach.agents.mmc_agent.MixedMonteCarloAlgorithmParameters

View File

@@ -0,0 +1,35 @@
N-Step Q Learning
=================
**Actions space:** Discrete
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The :math:`N`-step Q learning algorithm works in a similar manner to DQN, except for the following changes:
1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every
:math:`N` steps using the latest :math:`N` steps played by the agent.
2. In order to stabilize the learning, multiple workers work together to update the network.
This has an effect similar to decorrelating the samples used for training.
3. Instead of using single-step Q targets for the network, the rewards from :math:`N` consecutive steps are accumulated
to form the :math:`N`-step Q targets, according to the following equation:
:math:`R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch
.. autoclass:: rl_coach.agents.n_step_q_agent.NStepQAlgorithmParameters

View File

@@ -0,0 +1,33 @@
Normalized Advantage Functions
==============================
**Actions space:** Continuous
**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/naf.png
:width: 600px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current state is used as an input to the network. The action mean :math:`\mu(s_t )` is extracted from the output head.
It is then passed to the exploration policy which adds noise in order to encourage exploration.
Training the network
++++++++++++++++++++
The network is trained by using the following targets:
:math:`y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1})`
Use the next states as the inputs to the target network and extract the :math:`V` value, from within the head,
to get :math:`V(s_{t+1} )`. Then, update the online network using the current states and actions as inputs,
and :math:`y_t` as the targets.
After every training step, use a soft update in order to copy the weights from the online network to the target network.
.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters

View File

@@ -0,0 +1,50 @@
Neural Episodic Control
=======================
**Actions space:** Discrete
**References:** `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/nec.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate
output from the middleware.
2. For each possible action :math:`a_i`, run the DND head using the state embedding and the selected action :math:`a_i` as inputs.
The DND is queried and returns the :math:`P` nearest neighbor keys and values. The keys and values are used to calculate
and return the action :math:`Q` value from the network.
3. Pass all the :math:`Q` values to the exploration policy and choose an action accordingly.
4. Store the state embeddings and actions taken during the current episode in a small buffer :math:`B`, in order to
accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
Finalizing an episode
+++++++++++++++++++++
For each step in the episode, the state embeddings and the taken actions are stored in the buffer :math:`B`.
When the episode is finished, the replay buffer calculates the :math:`N`-step total return of each transition in the
buffer, bootstrapped using the maximum :math:`Q` value of the :math:`N`-th transition. The state embeddings are then
inserted into the DND along with these returns, and the buffer :math:`B` is reset.
Training the network
++++++++++++++++++++
Train the network only when the DND has enough entries for querying.
To train the network, the current states are used as the inputs and the :math:`N`-step returns are used as the targets.
The :math:`N`-step return used takes into account :math:`N` consecutive steps, and bootstraps the last value from
the network if necessary:
:math:`y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N max_a Q(s_{t+N},a)`
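The DND lookup in step 2 of action selection can be sketched as follows (a minimal NumPy sketch using the inverse-distance kernel from the NEC paper, not Coach's implementation):

.. code-block:: python

    import numpy as np

    def dnd_q_value(embedding, neighbor_keys, neighbor_values, delta=1e-3):
        """embedding: (d,), neighbor_keys: (P, d) nearest-neighbor keys, neighbor_values: (P,) stored returns."""
        kernel = 1.0 / (np.sum((neighbor_keys - embedding) ** 2, axis=1) + delta)
        weights = kernel / kernel.sum()
        return float(weights @ neighbor_values)   # Q(s, a_i)
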
.. autoclass:: rl_coach.agents.nec_agent.NECAlgorithmParameters

View File

@@ -0,0 +1,45 @@
Persistent Advantage Learning
=============================
**Actions space:** Discrete
**References:** `Increasing the Action Gap: New Operators for Reinforcement Learning <https://arxiv.org/abs/1512.04860>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Start by calculating the initial target values in the same manner as they are calculated in DDQN
:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
3. The action gap :math:`V(s_t )-Q(s_t,a_t)` should then be subtracted from each of the calculated targets.
To calculate the action gap, run the target network using the current states and get the :math:`Q` values
for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state:
:math:`V(s_t )=max_a Q(s_t,a)`
4. For *advantage learning (AL)*, reduce the action gap weighted by a predefined parameter :math:`\alpha` from
the targets :math:`y_t^{DDQN}`:
:math:`y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t ))`
5. For *persistent advantage learning (PAL)*, the target network is also used in order to calculate the action
gap for the next state:
:math:`V(s_{t+1} )-Q(s_{t+1},a_{t+1})`
where :math:`a_{t+1}` is chosen by running the next states through the online network and choosing the action that
has the highest predicted :math:`Q` value. Finally, the targets will be defined as -
:math:`y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} ))`
6. Train the online network using the current states as inputs, and with the aforementioned targets.
7. Once in every few thousand steps, copy the weights from the online network to the target network.
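The AL and PAL target corrections above can be sketched as follows (an illustrative NumPy sketch, not Coach's code; ``q_current`` and ``q_next`` are the target network's predictions for :math:`s_t` and :math:`s_{t+1}`, and ``next_actions`` are chosen by the online network):

.. code-block:: python

    import numpy as np

    def pal_targets(ddqn_targets, q_current, q_next, actions, next_actions,
                    alpha=0.9, persistent=True):
        batch_idx = np.arange(len(actions))
        gap = q_current.max(axis=1) - q_current[batch_idx, actions]          # V(s_t) - Q(s_t, a_t)
        if not persistent:                                                   # advantage learning (AL)
            return ddqn_targets - alpha * gap
        next_gap = q_next.max(axis=1) - q_next[batch_idx, next_actions]      # V(s_{t+1}) - Q(s_{t+1}, a_{t+1})
        return ddqn_targets - alpha * np.minimum(gap, next_gap)              # persistent advantage learning (PAL)
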
.. autoclass:: rl_coach.agents.pal_agent.PALAlgorithmParameters

View File

@@ -0,0 +1,33 @@
Quantile Regression DQN
=======================
**Actions space:** Discrete
**References:** `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/qr_dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. First, the next state quantiles are predicted. These are used in order to calculate the targets for the network,
by following the Bellman equation.
Next, the current quantile locations for the current states are predicted, sorted, and used for calculating the
quantile midpoint targets.
3. The network is trained with the quantile regression loss between the resulting quantile locations and the target
quantile locations. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
.. autoclass:: rl_coach.agents.qr_dqn_agent.QuantileRegressionDQNAlgorithmParameters

View File

@@ -0,0 +1,51 @@
Rainbow
=======
**Actions space:** Discrete
**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/rainbow.png
:align: center
Algorithm Description
---------------------
Rainbow combines 6 recent advancements in reinforcement learning:
* N-step returns
* Distributional state-action value learning
* Dueling networks
* Noisy Networks
* Double DQN
* Prioritized Experience Replay
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected onto the set of atoms representing the :math:`Q` value distribution, such
that the :math:`i`-th component of the projected update is calculated as follows:
:math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
where:
* :math:`[ \cdot ]^{b}_{a}` bounds its argument in the range :math:`[a, b]`
* :math:`\hat{T}_{z_{j}}` is the Bellman update for atom
:math:`z_j`: :math:`\hat{T}_{z_{j}} := r_t+\gamma r_{t+1} + ... + \gamma r_{t+n-1} + \gamma^{n-1} z_j`
3. The network is trained with the cross entropy loss between the resulting probability distribution and the target
probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
5. After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer
using the KL divergence loss that is returned from the network.
.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters

View File

@@ -0,0 +1,27 @@
Architectures
=============
Architectures contain all the classes that implement the neural network functionality for the agent.
Since Coach is intended to work with multiple neural network frameworks, each framework implements its
own components under a dedicated directory. For example, the TensorFlow components contain all the neural network
parts that are implemented using TensorFlow.
.. autoclass:: rl_coach.base_parameters.NetworkParameters
Architecture
------------
.. autoclass:: rl_coach.architectures.architecture.Architecture
:members:
:inherited-members:
NetworkWrapper
--------------
.. image:: /_static/img/distributed.png
:width: 600px
:align: center
.. autoclass:: rl_coach.architectures.network_wrapper.NetworkWrapper
:members:
:inherited-members:

View File

@@ -0,0 +1,33 @@
Core Types
==========
ActionInfo
----------
.. autoclass:: rl_coach.core_types.ActionInfo
:members:
:inherited-members:
Batch
-----
.. autoclass:: rl_coach.core_types.Batch
:members:
:inherited-members:
EnvResponse
-----------
.. autoclass:: rl_coach.core_types.EnvResponse
:members:
:inherited-members:
Episode
-------
.. autoclass:: rl_coach.core_types.Episode
:members:
:inherited-members:
Transition
----------
.. autoclass:: rl_coach.core_types.Transition
:members:
:inherited-members:

View File

@@ -0,0 +1,70 @@
Environments
============
.. autoclass:: rl_coach.environments.environment.Environment
:members:
:inherited-members:
DeepMind Control Suite
----------------------
A set of reinforcement learning environments powered by the MuJoCo physics engine.
Website: `DeepMind Control Suite <https://github.com/deepmind/dm_control>`_
.. autoclass:: rl_coach.environments.control_suite_environment.ControlSuiteEnvironment
Blizzard Starcraft II
---------------------
A popular strategy game which was wrapped with a python interface by DeepMind.
Website: `Blizzard Starcraft II <https://github.com/deepmind/pysc2>`_
.. autoclass:: rl_coach.environments.starcraft2_environment.StarCraft2Environment
ViZDoom
--------
A Doom-based AI research platform for reinforcement learning from raw visual information.
Website: `ViZDoom <http://vizdoom.cs.put.edu.pl/>`_
.. autoclass:: rl_coach.environments.doom_environment.DoomEnvironment
CARLA
-----
An open-source simulator for autonomous driving research.
Website: `CARLA <https://github.com/carla-simulator/carla>`_
.. autoclass:: rl_coach.environments.carla_environment.CarlaEnvironment
OpenAI Gym
----------
A library which consists of a set of environments, from games to robotics.
Additionally, it can be extended using the API defined by the authors.
Website: `OpenAI Gym <https://gym.openai.com/>`_
In Coach, we support all the native environments in Gym, along with several extensions such as:
* `Roboschool <https://github.com/openai/roboschool>`_ - a set of environments powered by the PyBullet engine,
that offer a free alternative to MuJoCo.
* `Gym Extensions <https://github.com/Breakend/gym-extensions>`_ - a set of environments that extends Gym for
auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.)
* `PyBullet <https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet>`_ - a physics engine that
includes a set of robotics environments.
.. autoclass:: rl_coach.environments.gym_environment.GymEnvironment

View File

@@ -0,0 +1,87 @@
Exploration Policies
====================
Exploration policies are the components that allow the agent to trade off exploration and exploitation according to a
predefined policy. This is one of the most important aspects of reinforcement learning agents, and can require some
tuning to get right. Coach supports several pre-defined exploration policies, and can be easily extended with
custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action
spaces.
.. role:: green
.. role:: red
+----------------------+-----------------------+------------------+
| Exploration Policy | Discrete Action Space | Box Action Space |
+======================+=======================+==================+
| AdditiveNoise | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| Boltzmann | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| Bootstrapped | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| Categorical | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| ContinuousEntropy | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| EGreedy | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| Greedy | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| OUProcess | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| ParameterNoise | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| TruncatedNormal | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| UCB | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
ExplorationPolicy
-----------------
.. autoclass:: rl_coach.exploration_policies.ExplorationPolicy
:members:
:inherited-members:
AdditiveNoise
-------------
.. autoclass:: rl_coach.exploration_policies.AdditiveNoise
Boltzmann
---------
.. autoclass:: rl_coach.exploration_policies.Boltzmann
Bootstrapped
------------
.. autoclass:: rl_coach.exploration_policies.Bootstrapped
Categorical
-----------
.. autoclass:: rl_coach.exploration_policies.Categorical
ContinuousEntropy
-----------------
.. autoclass:: rl_coach.exploration_policies.ContinuousEntropy
EGreedy
-------
.. autoclass:: rl_coach.exploration_policies.EGreedy
Greedy
------
.. autoclass:: rl_coach.exploration_policies.Greedy
OUProcess
---------
.. autoclass:: rl_coach.exploration_policies.OUProcess
ParameterNoise
--------------
.. autoclass:: rl_coach.exploration_policies.ParameterNoise
TruncatedNormal
---------------
.. autoclass:: rl_coach.exploration_policies.TruncatedNormal
UCB
---
.. autoclass:: rl_coach.exploration_policies.UCB

View File

@@ -0,0 +1,28 @@
Filters
=======
.. toctree::
:maxdepth: 1
:caption: Filters
input_filters
output_filters
Filters are a mechanism in Coach that allows pre-processing and post-processing of the information flowing into and out of the agent.
There are two filter categories -
* **Input filters** - these are filters that process the information passed **into** the agent from the environment.
This information includes the observation and the reward. Input filters therefore allow rescaling observations,
normalizing rewards, stacking observations, etc.
* **Output filters** - these are filters that process the information going **out** of the agent into the environment.
This information includes the action the agent chooses to take. Output filters therefore allow conversion of
actions from one space into another. For example, the agent can take :math:`N` discrete actions, that will be mapped by
the output filter onto :math:`N` continuous actions.
Filters can be stacked on top of each other in order to build complex processing flows of the inputs or outputs.
.. image:: /_static/img/filters.png
:width: 350px
:align: center

View File

@@ -0,0 +1,67 @@
Input Filters
=============
The input filters are separated into two categories - **observation filters** and **reward filters**.
Observation Filters
-------------------
ObservationClippingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationClippingFilter
ObservationCropFilter
+++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationCropFilter
ObservationMoveAxisFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationMoveAxisFilter
ObservationNormalizationFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationNormalizationFilter
ObservationReductionBySubPartsNameFilter
++++++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationReductionBySubPartsNameFilter
ObservationRescaleSizeByFactorFilter
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleSizeByFactorFilter
ObservationRescaleToSizeFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleToSizeFilter
ObservationRGBToYFilter
+++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRGBToYFilter
ObservationSqueezeFilter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationSqueezeFilter
ObservationStackingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationStackingFilter
ObservationToUInt8Filter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationToUInt8Filter
Reward Filters
--------------
RewardClippingFilter
++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardClippingFilter
RewardNormalizationFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardNormalizationFilter
RewardRescaleFilter
+++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardRescaleFilter

View File

@@ -0,0 +1,37 @@
Output Filters
==============
The output filters only process the actions.
Action Filters
++++++++++++++
.. autoclass:: rl_coach.filters.action.AttentionDiscretization
.. image:: /_static/img/attention_discretization.png
:align: center
.. autoclass:: rl_coach.filters.action.BoxDiscretization
.. image:: /_static/img/box_discretization.png
:align: center
.. autoclass:: rl_coach.filters.action.BoxMasking
.. image:: /_static/img/box_masking.png
:align: center
.. autoclass:: rl_coach.filters.action.PartialDiscreteActionSpaceMap
.. image:: /_static/img/partial_discrete_action_space_map.png
:align: center
.. autoclass:: rl_coach.filters.action.FullDiscreteActionSpaceMap
.. image:: /_static/img/full_discrete_action_space_map.png
:align: center
.. autoclass:: rl_coach.filters.action.LinearBoxToBoxMap
.. image:: /_static/img/linear_box_to_box_map.png
:align: center

View File

@@ -0,0 +1,44 @@
Memories
========
Episodic Memories
-----------------
EpisodicExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicExperienceReplay
EpisodicHindsightExperienceReplay
+++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHindsightExperienceReplay
EpisodicHRLHindsightExperienceReplay
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHRLHindsightExperienceReplay
SingleEpisodeBuffer
+++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.SingleEpisodeBuffer
Non-Episodic Memories
---------------------
BalancedExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.BalancedExperienceReplay
QDND
++++
.. autoclass:: rl_coach.memories.non_episodic.QDND
ExperienceReplay
++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.ExperienceReplay
PrioritizedExperienceReplay
+++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.PrioritizedExperienceReplay
TransitionCollection
++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.TransitionCollection

View File

@@ -0,0 +1,64 @@
Spaces
======
Space
-----
.. autoclass:: rl_coach.spaces.Space
:members:
:inherited-members:
Observation Spaces
------------------
.. autoclass:: rl_coach.spaces.ObservationSpace
:members:
:inherited-members:
VectorObservationSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.VectorObservationSpace
PlanarMapsObservationSpace
++++++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.PlanarMapsObservationSpace
ImageObservationSpace
+++++++++++++++++++++
.. autoclass:: rl_coach.spaces.ImageObservationSpace
Action Spaces
-------------
.. autoclass:: rl_coach.spaces.ActionSpace
:members:
:inherited-members:
AttentionActionSpace
++++++++++++++++++++
.. autoclass:: rl_coach.spaces.AttentionActionSpace
BoxActionSpace
++++++++++++++
.. autoclass:: rl_coach.spaces.BoxActionSpace
DiscreteActionSpace
++++++++++++++++++++
.. autoclass:: rl_coach.spaces.DiscreteActionSpace
MultiSelectActionSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.MultiSelectActionSpace
CompoundActionSpace
+++++++++++++++++++
.. autoclass:: rl_coach.spaces.CompoundActionSpace
Goal Spaces
-----------
.. autoclass:: rl_coach.spaces.GoalsSpace
:members:
:inherited-members:

214
docs_raw/source/conf.py Normal file
View File

@@ -0,0 +1,214 @@
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath('.'))))
# -- Project information -----------------------------------------------------
project = 'Reinforcement Learning Coach'
copyright = '2018, Intel AI Lab'
author = 'Intel AI Lab'
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
release = '0.11.0'
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.todo',
'sphinx.ext.coverage',
'sphinx.ext.mathjax',
'sphinx.ext.ifconfig',
'sphinx.ext.viewcode',
'sphinx.ext.githubpages',
'sphinxarg.ext'
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
source_parsers = {
'.md': 'recommonmark.parser.CommonMarkParser',
}
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = ['.rst', '.md']
# source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
autoclass_content = 'both'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
html_logo = './_static/img/dark_logo.png'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# html_theme_options = {
# 'canonical_url': '',
# 'analytics_id': '',
# 'logo_only': True,
# 'display_version': True,
# 'prev_next_buttons_location': 'bottom',
# 'style_external_links': False,
# # Toc options
# 'collapse_navigation': True,
# 'sticky_navigation': True,
# 'navigation_depth': 1,
# 'includehidden': True,
# 'titles_only': False
# }
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = []
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
def setup(app):
app.add_stylesheet('css/custom.css')
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'ReinforcementLearningCoachdoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'ReinforcementLearningCoach.tex', 'Reinforcement Learning Coach Documentation',
'Intel AI Lab', 'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'reinforcementlearningcoach', 'Reinforcement Learning Coach Documentation',
[author], 1)
]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'ReinforcementLearningCoach', 'Reinforcement Learning Coach Documentation',
author, 'ReinforcementLearningCoach', 'One line description of project.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# -- Extension configuration -------------------------------------------------
# -- Options for todo extension ----------------------------------------------
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True

View File

@@ -0,0 +1,80 @@
Adding a New Agent
==================
Coach's modularity makes adding an agent a simple and clean task.
We suggest using the following
`Jupyter notebook tutorial <https://github.com/NervanaSystems/coach/blob/master/tutorials/1.%20Implementing%20an%20Algorithm.ipynb>`_
to ramp up on this process. In general, it involves the following steps:
1. Implement your algorithm in a new file. The agent can inherit from base classes such as **ValueOptimizationAgent** or
**ActorCriticAgent**, or from the more generic **Agent** base class.
.. note::
**ValueOptimizationAgent**, **PolicyOptimizationAgent** and **Agent** are abstract classes.
:code:`learn_from_batch()` should be overridden with the desired behavior for the algorithm being implemented.
If inheriting directly from **Agent**, :code:`choose_action()` should be overridden as well.
.. code-block:: python
def learn_from_batch(self, batch) -> Tuple[float, List, List]:
"""
Given a batch of transitions, calculates their target values and updates the network.
:param batch: A list of transitions
:return: The total loss of the training, the loss per head and the unclipped gradients
"""
def choose_action(self, curr_state):
"""
Choose an action to act with during the current episode. Different behavior might be exhibited
when training or testing.
:param curr_state: the current state to act upon.
:return: the chosen action, together with some value describing it (a q-value, probability, etc.)
"""
2. If needed, implement your agent's specific network head in the implementation for the framework of your choice,
for example **architectures/neon_components/heads.py**. The head will inherit the generic base class Head.
A new output type should be added to configurations.py, and a mapping between the new head and the output type should
be defined in the get_output_head() function in **architectures/neon_components/general_network.py**.
3. Define a new parameters class that inherits AgentParameters.
The parameters class defines all the hyperparameters for the agent, and is initialized with 4 main components:
* **algorithm**: A class inheriting AlgorithmParameters which defines any algorithm specific parameters
* **exploration**: A class inheriting ExplorationParameters which defines the exploration policy parameters.
There are several common built-in exploration policies which you can use, defined under
the exploration sub-directory. You can also define your own custom exploration policy.
* **memory**: A class inheriting MemoryParameters which defines the memory parameters.
There are several common built-in memory types which you can use, defined under the memories
sub-directory. You can also define your own custom memory.
* **networks**: A dictionary defining all the networks that will be used by the agent. The keys of the dictionary
define the network name and will be used to access each network through the agent class.
The dictionary values are a class inheriting NetworkParameters, which define the network structure
and parameters.
Additionally, set the path property to return the path to your agent class in the following format:
:code:`<path to python module>:<name of agent class>`
For example,
.. code-block:: python
class RainbowAgentParameters(AgentParameters):
def __init__(self):
super().__init__(algorithm=RainbowAlgorithmParameters(),
exploration=RainbowExplorationParameters(),
memory=RainbowMemoryParameters(),
networks={"main": RainbowNetworkParameters()})
@property
def path(self):
return 'rainbow.rainbow_agent:RainbowAgent'
4. (Optional) Define a preset using the new agent type with a given environment, and the hyper-parameters that should
be used for training on that environment.
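For example, a preset for the hypothetical Rainbow agent above could look roughly like the following. The graph
manager, schedule and environment classes shown are the ones commonly used in Coach presets, but the exact import
paths should be verified against your Coach version:

.. code-block:: python

    from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
    from rl_coach.graph_managers.graph_manager import SimpleSchedule
    from rl_coach.environments.gym_environment import GymVectorEnvironment

    # the hypothetical agent parameters defined in step 3
    agent_params = RainbowAgentParameters()

    # train on CartPole with the default schedule
    env_params = GymVectorEnvironment(level='CartPole-v0')

    graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                        env_params=env_params,
                                        schedule_params=SimpleSchedule())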

View File

@@ -0,0 +1,93 @@
Adding a New Environment
========================
Adding a new environment to Coach is as easy as solving CartPole.
There are essentially two ways to integrate new environments to Coach:
Using the OpenAI Gym API
------------------------
If your environment already uses the OpenAI Gym API, you are good to go.
When selecting the environment parameters in the preset, use :code:`GymEnvironmentParameters()`,
and pass the path to your environment source code using the :code:`level` parameter.
You can specify additional parameters for your environment using the :code:`additional_simulator_parameters` parameter.
Take for example the definition used in the :code:`Pendulum_HAC` preset:
.. code-block:: python
env_params = GymEnvironmentParameters()
env_params.level = "rl_coach.environments.mujoco.pendulum_with_goals:PendulumWithGoals"
env_params.additional_simulator_parameters = {"time_limit": 1000}
Using the Coach API
-------------------
There are a few simple steps to follow, and we will walk through them one by one.
As an alternative, we highly recommend following the corresponding
`tutorial <https://github.com/NervanaSystems/coach/blob/master/tutorials/2.%20Adding%20an%20Environment.ipynb>`_
in the GitHub repo.
1. Create a new class for your environment, and inherit the Environment class.
2. Coach defines a simple API for implementing a new environment, which is defined in environment/environment.py.
There are several functions to implement, but only some of them are mandatory.
Here are the important ones:
.. code-block:: python
def _take_action(self, action_idx: ActionType) -> None:
"""
An environment dependent function that sends an action to the simulator.
:param action_idx: the action to perform on the environment
:return: None
"""
def _update_state(self) -> None:
"""
Updates the state from the environment.
Should update self.observation, self.reward, self.done, self.measurements and self.info
:return: None
"""
def _restart_environment_episode(self, force_environment_reset=False) -> None:
"""
Restarts the simulator episode
:param force_environment_reset: Force the environment to reset even if the episode is not done yet.
:return: None
"""
def _render(self) -> None:
"""
Renders the environment using the native simulator renderer
:return: None
"""
def get_rendered_image(self) -> np.ndarray:
"""
Return a numpy array containing the image that will be rendered to the screen.
This can be different from the observation. For example, mujoco's observation is a measurements vector.
:return: numpy array containing the image that will be rendered to the screen
"""
3. Create a new parameters class for your environment, which inherits the EnvironmentParameters class.
In the __init__ of your class, define all the parameters you used in your Environment class.
Additionally, fill the path property of the class with the path to your Environment class.
For example, take a look at the EnvironmentParameters class used for Doom:
.. code-block:: python
class DoomEnvironmentParameters(EnvironmentParameters):
def __init__(self):
super().__init__()
self.default_input_filter = DoomInputFilter
self.default_output_filter = DoomOutputFilter
self.cameras = [DoomEnvironment.CameraTypes.OBSERVATION]
@property
def path(self):
return 'rl_coach.environments.doom_environment:DoomEnvironment'
4. And that's it, you're done. Now just add a new preset with your newly created environment, and start training an agent on top of it.

View File

@@ -0,0 +1,63 @@
Coach Dashboard
===============
Reinforcement learning algorithms are neat. That is - when they work. But when they don't, RL algorithms are often quite tricky to debug.
Finding the root cause for why things break in RL is rather difficult. Moreover, different RL algorithms shine in some aspects but lack in others. Comparing algorithms faithfully is also a hard task, which requires the right tools.
Coach Dashboard is a visualization tool which simplifies the analysis of the training process. Each run of Coach extracts a lot of information from within the algorithm and stores it in the experiment directory. This information is very valuable for debugging, analyzing and comparing different algorithms, but without a good visualization tool it cannot be utilized. This is where Coach Dashboard comes in.
Visualizing Signals
-------------------
Coach Dashboard exposes a convenient user interface for visualizing the training signals. The signals are dynamically updated during agent training. Additionally, it allows selecting a subset of the available signals and overlaying them on top of each other.
.. image:: /_static/img/updating_dynamically.gif
:width: 800px
:align: center
* Holding the CTRL key while selecting signals allows visualizing more than one signal.
* Signals can be plotted against either of the Y-axes, in order to compare signals with different scales. To move a signal to the second Y-axis, select it and press the 'Toggle Second Axis' button.
Tracking Statistics
-------------------
When running parallel algorithms, such as A3C, it often helps to visualize the learning of all the workers at the same time. Coach Dashboard allows viewing multiple signals (and even smoothing them out, if required) from multiple workers. In addition, it supports viewing the mean and standard deviation of the same signal, across different workers, using Bollinger bands.
.. figure:: /_static/img/bollinger_bands.png
:width: 800px
:align: center
**Displaying Bollinger Bands**
.. figure:: /_static/img/separate_signals.png
:width: 800px
:align: center
**Displaying all the Workers**
Comparing Runs
--------------
Reinforcement learning algorithms are notoriously unstable and suffer from high run-to-run variance. This makes benchmarking and comparing different algorithms even harder. To ease this process, it is common to execute several runs of the same algorithm and average over them. This is easy to do with Coach Dashboard, by centralizing all the experiment directories in a single directory, and then loading them as a single group. Loading several groups of different algorithms then allows comparing the averaged signals, such as the total episode reward.
In RL, there are several interesting performance metrics to consider, and this is easy to do by controlling the X-axis units in Coach Dashboard. It is possible to switch between several options such as the total number of steps or the total training time.
.. figure:: /_static/img/compare_by_time.png
:width: 800px
:align: center
**Comparing Several Algorithms According to the Time Passed**
.. figure:: /_static/img/compare_by_num_episodes.png
:width: 800px
:align: center
**Comparing Several Algorithms According to the Number of Episodes Played**

View File

@@ -0,0 +1,102 @@
Control Flow
============
Coach is built in a modular way, encouraging module reuse and reducing the amount of boilerplate code needed
for developing new algorithms or integrating a new challenge as an environment.
On the other hand, it can be overwhelming for new users to ramp up on the code.
To help with that, here's a short overview of the control flow.
Graph Manager
-------------
The main entry point for Coach is :code:`coach.py`.
The main functionality of this script is to parse the command line arguments and invoke all the sub-processes needed
for the given experiment.
:code:`coach.py` executes the given **preset** file which returns a :code:`GraphManager` object.
A **preset** is a design pattern that is intended for concentrating the entire definition of an experiment in a single
file. This helps with experiment reproducibility, improves readability and prevents confusion.
The outcome of a preset is a :code:`GraphManager` which will usually be instantiated in the final lines of the preset.
A :code:`GraphManager` is an object that holds all the agents and environments of an experiment, and is mostly responsible
for scheduling their work. Why is it called a **graph** manager? Because agents and environments are structured into
a graph of interactions. For example, in hierarchical reinforcement learning schemes, there will often be a master
policy agent, that will control a sub-policy agent, which will interact with the environment. Other schemes can have
much more complex graphs of control, such as several hierarchy layers, each with multiple agents.
The graph manager's main loop is the improve loop.
.. image:: /_static/img/improve.png
:width: 400px
:align: center
The improve loop cycles between 3 main phases - heatup, training and evaluation (a schematic sketch follows the list):
* **Heatup** - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase
takes place only at the beginning of the experiment, and the agents act completely randomly during this phase.
Importantly, the agents do not train their networks during this phase. DQN, for example, uses 50k random steps in order
to initialize the replay buffers.
* **Training** - the training phase is the main phase of the experiment. This phase can change between agent types,
but essentially consists of repeated cycles of acting, collecting data from the environment, and training the agent
networks. During this phase, the agent will use its exploration policy in training mode, which will add noise to its
actions in order to improve its knowledge about the environment state space.
* **Evaluation** - the evaluation phase is intended for evaluating the current performance of the agent. The agents
will act greedily in order to exploit the knowledge aggregated so far, and the performance over multiple episodes of
evaluation is averaged in order to reduce the stochastic effects of all the components.
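Schematically, and in plain Python rather than Coach's actual implementation, the improve loop can be pictured as:

.. code-block:: python

    def improve(graph, heatup_steps, improve_steps, steps_between_evaluations):
        # 1. heatup: act randomly to fill the replay buffers, without training
        graph.heatup(heatup_steps)

        steps = 0
        while steps < improve_steps:
            # 2. training: act with the exploration policy and train the networks
            graph.train_and_act(steps_between_evaluations)
            steps += steps_between_evaluations

            # 3. evaluation: act greedily for a few episodes and average the results
            graph.evaluate()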
Level Manager
-------------
In each of the 3 phases described above, the graph manager will invoke all the hierarchy levels in the graph in a
synchronized manner. In Coach, agents do not interact directly with the environment. Instead, they go through a
*LevelManager*, which is a proxy that manages their interaction. The level manager passes the current state and reward
from the environment to the agent, and the actions from the agent to the environment.
The motivation for having a level manager is to disentangle the code of the environment and the agent, so as to allow more
complex interactions. Each level can have multiple agents which interact with the environment. Who gets to choose the
action for each step is controlled by the level manager.
Additionally, each level manager can act as an environment for the hierarchy level above it, such that each hierarchy
level can be seen as an interaction between an agent and an environment, even if the environment is just more agents in
a lower hierarchy level.
Agent
-----
The base agent class has 3 main functions that will be used during those phases - observe, act and train.
* **Observe** - this function gets the latest response from the environment as input, and updates the internal state
of the agent with the new information. The environment response will
be first passed through the agent's :code:`InputFilter` object, which will process the values in the response, according
to the specific agent definition. The environment response will then be converted into a
:code:`Transition` which will contain the information from a single step
:math:`(s_{t}, a_{t}, r_{t}, s_{t+1}, \textrm{terminal signal})`, and store it in the memory.
.. image:: /_static/img/observe.png
:width: 700px
:align: center
* **Act** - this function uses the current internal state of the agent in order to select the next action to take on
the environment. This function will call the per-agent custom function :code:`choose_action` that will use the network
and the exploration policy in order to select an action. The action will be stored, together with any additional
information (like the action value for example) in an :code:`ActionInfo` object. The ActionInfo object will then be
passed through the agent's :code:`OutputFilter` to allow any processing of the action (like discretization,
or shifting, for example), before passing it to the environment.
.. image:: /_static/img/act.png
:width: 700px
:align: center
* **Train** - this function will sample a batch from the memory and train on it. The batch of transitions will be
first wrapped into a :code:`Batch` object to allow efficient querying of the batch values. It will then be passed into
the agent specific :code:`learn_from_batch` function, that will extract network target values from the batch and will
train the networks accordingly. Lastly, if there's a target network defined for the agent, it will sync the target
network weights with the online network.
.. image:: /_static/img/train.png
:width: 700px
:align: center

View File

@@ -0,0 +1,148 @@
# Scaling out rollout workers
This document contains some options for how we could implement horizontal scaling of rollout workers in coach, though most details are not specific to coach. A few options are laid out; my current suggestion would be to start with Option 1, and move on to Option 1a or Option 1b as required.
## Off Policy Algorithms
### Option 1 - master polls file system
- one master process samples memories and updates the policy
- many worker processes execute rollouts
- coordinate using a single shared networked file system: nfs, ceph, dat, s3fs, etc.
- policy sync communication method:
- master process occasionally writes policy to shared file system
- worker processes occasionally read policy from shared file system
- prevent workers from reading a policy which has not been completely written to disk using either:
- redis lock
- write to temporary files and then rename
- rollout memories:
- sync communication method:
- worker processes write rollout memories as they are generated to shared filesystem
- master process occasionally reads rollout memories from shared file system
- master process must be resilient to corrupted or incompletely written memories
- sampling method:
- master process keeps all rollouts in memory utilizing existing coach memory classes
- control flow:
- master:
- run training updates interleaved with loading of any newly available rollouts in memory
- periodically write policy to disk
- workers:
- periodically read policy from disk
- evaluate rollouts and write them to disk
- ops:
- kubernetes yaml, kml, docker compose, etc
- a default shared file system can be provided, while allowing the user to specify something else if desired
- a default method of launching the workers and master (in kubernetes, gce, aws, etc) can be provided
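As a minimal sketch of the shared-file-system flavor (paths and naming conventions here are hypothetical):

```python
import glob
import os
import pickle

SHARED_DIR = '/mnt/shared/experiment'  # hypothetical mount point of the shared file system

# worker side: write each rollout atomically (write to a hidden temp file, then rename)
def write_rollout(worker_id, rollout_id, transitions):
    rollout_dir = os.path.join(SHARED_DIR, 'rollouts')
    tmp_path = os.path.join(rollout_dir, '.tmp_{}_{}.p'.format(worker_id, rollout_id))
    final_path = os.path.join(rollout_dir, '{}_{}.p'.format(worker_id, rollout_id))
    with open(tmp_path, 'wb') as f:
        pickle.dump(transitions, f)
    os.rename(tmp_path, final_path)

# master side: load any rollouts that appeared since the last poll
# (glob's '*' does not match the leading-dot temp files, so partially written files are skipped)
def load_new_rollouts(seen_paths):
    new_rollouts = []
    for path in glob.glob(os.path.join(SHARED_DIR, 'rollouts', '*.p')):
        if path not in seen_paths:
            with open(path, 'rb') as f:
                new_rollouts.append(pickle.load(f))
            seen_paths.add(path)
    return new_rollouts
```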
#### Pros
- very simple to implement, infrastructure already available in ai-lab-kubernetes
- fast enough for proof of concept and iteration of interface design
- rollout memories are durable and can be easily reused in later off policy training
- if designed properly, there is a clear path towards:
- decreasing latency using in-memory store (option 1a/b)
- increasing rollout memory size using distributed sampling methods (option 1c)
#### Cons
- file system interface incurs additional latency. rollout memories must be written to disk, and later read from disk, instead of going directly from memory to memory.
- will require modifying standard control flow. There will be an impact on algorithms which expect particular training regimens. Specifically, algorithms which are sensitive to the number of update steps between target/online network updates
- will not be particularly efficient in strictly on policy algorithms where each rollout must use the most recent policy available
### Option 1a - master polls (redis) list
- instead of using a file system as in Option 1, redis lists can be used
- policy is stored as a single key/value pair (locking no longer necessary)
- rollout memory communication:
- workers: redis list push
- master: redis list len, redis list range
- note: many databases are interchangeable with redis protocol: google memorystore, aws elasticache, etc.
- note: many databases can implement this interface with minimal glue: SQL, any objectstore, etc.
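A rough sketch of this option, using hypothetical key names and the standard `redis-py` client:

```python
import pickle
import redis

r = redis.Redis(host='localhost', port=6379)

# worker side: push each finished rollout onto a shared list
def publish_rollout(transitions):
    r.rpush('rollouts', pickle.dumps(transitions))

# worker side: refresh the policy weights between rollouts
def fetch_policy():
    blob = r.get('policy')
    return pickle.loads(blob) if blob is not None else None

# master side: drain any newly available rollouts before the next training step
def drain_rollouts():
    rollouts = []
    while True:
        blob = r.lpop('rollouts')
        if blob is None:
            break
        rollouts.append(pickle.loads(blob))
    return rollouts

# master side: publish updated policy weights as a single key/value pair
def publish_policy(weights):
    r.set('policy', pickle.dumps(weights))
```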
#### Pros
- lower latency than disk since it is all in memory
- clear path toward scaling to large number of workers
- no concern about reading partially written rollouts
- no synchronization or additional threads necessary, though an additional thread would be helpful for concurrent reads from redis and training
- will be slightly more efficient in the case of strictly on policy algorithms
#### Cons
- more complex to set up, especially if you are concerned about rollout memory durability
### Option 1b - master subscribes to (redis) pub sub
- instead of using a file system as in Option 1, redis pub sub can be used
- policy is stored as a single key/value pair (locking no longer necessary)
- rollout memory communication:
- workers: redis publish
- master: redis subscribe
- no synchronization necessary, though an additional listener thread may be required
- it looks like the python client might handle this already, would need further investigation
- note: many possible pub sub systems could be used with different characteristics under specific contexts: kafka, google pub/sub, aws kinesis, etc
#### Pros
- lower latency than disk since it is all in memory
- clear path toward scaling to large number of workers
- no concern about reading partially written rollouts
- will be slightly more efficient in the case of strictly on policy algorithms
#### Cons
- more complex to set up than a shared file system
- on its own, does not persist worker rollouts for future off policy training
### Option 1c - distributed rollout memory sampling
- if rollout memories do not fit in memory of a single machine, a distributed storage and sampling method would be necessary
- for example:
- rollout memory store: redis set add
- rollout memory sample: redis set randmember
#### Pros
- capable of taking advantage of rollout memory larger than the available memory of a single machine
- reduce resource constraints on training machine
#### Cons
- distributed versions of each memory type/sampling method need to be custom built
- off-the-shelf implementations may not be available for complex memory types/sampling methods
### Option 2 - master listens to workers
- rollout memories:
- workers send memories directly to master via: mpi, 0mq, etc
- master policy thread listens for new memories and stores them in shared memory
- policy sync communication method:
- master policy occasionally sends policies directly to workers via: mpi, 0mq, etc
- master and workers must synchronize so that all workers are listening when the master is ready to send a new policy
#### Pros
- lower latency than option 1 (for a small number of workers)
- will potentially be the optimal choice in the case of strictly on policy algorithms with a relatively small number of worker nodes (small enough that more complex communication topologies are not needed: rings, p2p, etc)
#### Cons
- much less robust and more difficult to debug, requiring lots of synchronization
- much more difficult to be resilient to worker failure
- more custom communication/synchronization code
- as the number of workers scales up, a larger and larger fraction of time will be spent waiting and synchronizing
### Option 3 - Ray
#### Pros
- Ray would allow us to easily convert our current algorithms to distributed versions, with minimal change to our code.
#### Cons
- performance from naïve/simple use would be very similar to Option 2
- nontrivial to replace with a higher performance system if desired. Additional performance will require significant code changes.
## On Policy Algorithms
TODO

View File

@@ -0,0 +1,56 @@
Network Design
==============
Each agent has at least one neural network, used as the function approximator for choosing actions.
The network is designed in a modular way to allow reusability in different agents.
It is separated into three main parts:
* **Input Embedders** - This is the first stage of the network, meant to convert the input into a feature vector representation.
It is possible to combine several instances of any of the supported embedders, in order to allow varied combinations of inputs.
There are two main types of input embedders:
1. Image embedder - Convolutional neural network.
2. Vector embedder - Multi-layer perceptron.
* **Middlewares** - The middleware gets the output of the input embedder, and processes it into a different representation domain,
before sending it through the output head. The goal of the middleware is to enable processing the combined outputs of
several input embedders, and pass them through some extra processing.
This, for instance, might include an LSTM or just a plain FC layer.
* **Output Heads** - The output head is used in order to predict the values required from the network.
These might include action-values, state-values or a policy. As with the input embedders,
it is possible to use several output heads in the same network. For example, the *Actor Critic* agent combines two
heads - a policy head and a state-value head.
In addition, each output head defines the loss function according to its head type.
.. image:: /_static/img/network.png
:width: 400px
:align: center
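Conceptually, and independently of any specific deep learning framework or of Coach's actual classes, a forward pass
through such a network chains the three parts:

.. code-block:: python

    import numpy as np

    def image_embedder(image):
        # stand-in for a convolutional embedder: map the image to a feature vector
        return image.reshape(-1).astype(np.float32)

    def vector_embedder(measurements):
        # stand-in for a multi-layer perceptron embedder
        return measurements.astype(np.float32)

    def middleware(embedded_inputs):
        # combine the embedder outputs and apply extra processing (e.g. an FC layer or LSTM)
        return np.tanh(np.concatenate(embedded_inputs))

    def value_head(representation, num_actions=4):
        # stand-in for an output head that predicts one value per action
        return np.full(num_actions, representation.mean())

    observation = {'camera': np.zeros((84, 84, 3)), 'measurements': np.zeros(10)}
    embedded = [image_embedder(observation['camera']), vector_embedder(observation['measurements'])]
    action_values = value_head(middleware(embedded))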
Keeping Network Copies in Sync
------------------------------
Most of the reinforcement learning agents include more than one copy of the neural network.
These copies serve as counterparts of the main network which are updated at different rates,
and are often synchronized either locally or between parallel workers. For easier synchronization of those copies,
a wrapper around these copies exposes a simplified API, which allows hiding these complexities from the agent.
In this wrapper, 3 types of networks can be defined:
* **online network** - A mandatory network which is the main network the agent will use
* **global network** - An optional network which is shared between workers in single-node multi-process distributed learning.
It is updated by all the workers directly, and holds the most up-to-date weights.
* **target network** - An optional network which is local for each worker. It can be used in order to keep a copy of
the weights stable for a long period of time. This is used in different agents, like DQN for example, in order to
have stable targets for the online network while training it.
.. image:: /_static/img/distributed.png
:width: 600px
:align: center

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,10 @@
Algorithms
==========
Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into three main classes -
value optimization, policy optimization and imitation learning.
A detailed description of those algorithms may be found in the `agents <../components/agents/index.html>`_ section.
.. image:: /_static/img/algorithms.png
:width: 600px
:align: center

View File

@@ -0,0 +1,22 @@
Benchmarks
==========
Reinforcement learning is a developing field, and so far it has been particularly difficult to reproduce some of the
results published in the original papers. Some reasons for this are:
* Reinforcement learning algorithms are notorious for having an unstable learning process.
The data the neural networks train on is dynamic, and depends on the random seed defined for the environment.
* Reinforcement learning algorithms have many moving parts. For some environments and agents, there are many
"tricks" which are needed to get the exact behavior the paper authors had seen. Also, there are **a lot** of
hyper-parameters to set.
In order for a reinforcement learning implementation to be useful for research or for data science, it must be
shown that it achieves the expected behavior. For this reason, we collected a set of benchmark results from most
of the algorithms implemented in Coach. The algorithms were tested on a subset of the same environments that were
used in the original papers, and with multiple seeds for each environment.
Additionally, Coach uses some strict testing mechanisms to try and make sure the results we show for these
benchmarks stay intact as Coach continues to develop.
To see the benchmark results, please visit the
`following GitHub page <https://github.com/NervanaSystems/coach/tree/master/benchmarks>`_.

View File

@@ -0,0 +1,31 @@
Environments
============
Coach supports a large number of environments which can be solved using reinforcement learning.
To find a detailed documentation of the environments API, see the `environments section <../components/environments/index.html>`_.
The supported environments are:
* `DeepMind Control Suite <https://github.com/deepmind/dm_control>`_ - a set of reinforcement learning environments
powered by the MuJoCo physics engine.
* `Blizzard Starcraft II <https://github.com/deepmind/pysc2>`_ - a popular strategy game which was wrapped with a
python interface by DeepMind.
* `ViZDoom <http://vizdoom.cs.put.edu.pl/>`_ - a Doom-based AI research platform for reinforcement learning
from raw visual information.
* `CARLA <https://github.com/carla-simulator/carla>`_ - an open-source simulator for autonomous driving research.
* `OpenAI Gym <https://gym.openai.com/>`_ - a library which consists of a set of environments, from games to robotics.
Additionally, it can be extended using the API defined by the authors.
In Coach, we support all the native environments in Gym, along with several extensions such as:
* `Roboschool <https://github.com/openai/roboschool>`_ - a set of environments powered by the PyBullet engine,
that offer a free alternative to MuJoCo.
* `Gym Extensions <https://github.com/Breakend/gym-extensions>`_ - a set of environments that extends Gym for
auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.)
* `PyBullet <https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet>`_ - a physics engine that
includes a set of robotics environments.

View File

@@ -0,0 +1,10 @@
Features
========
.. toctree::
:maxdepth: 1
:caption: Features
algorithms
environments
benchmarks

72
docs_raw/source/index.rst Normal file
View File

@@ -0,0 +1,72 @@
.. Reinforcement Learning Coach documentation master file, created by
sphinx-quickstart on Sun Oct 28 15:35:09 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Reinforcement Learning Coach
============================
Coach is a Python framework which models the interaction between an agent and an environment in a modular way.
With Coach, it is possible to model an agent by combining various building blocks, and training the agent on multiple environments.
The available environments allow testing the agent in different fields such as robotics, autonomous driving, games and more.
It exposes a set of easy-to-use APIs for experimenting with new RL algorithms, and allows simple integration of
new environments to solve.
Coach collects statistics from the training process and supports advanced visualization techniques for debugging the agent being trained.
.. image:: _static/img/design.png
:width: 800px
Blog posts from the Intel® AI website:
* `Release 0.8.0 <https://ai.intel.com/reinforcement-learning-coach-intel/>`_ (initial release)
* `Release 0.9.0 <https://ai.intel.com/reinforcement-learning-coach-carla-qr-dqn/>`_
* `Release 0.10.0 <https://ai.intel.com/introducing-reinforcement-learning-coach-0-10-0/>`_
* `Release 0.11.0 <https://ai.intel.com/>`_ (current release)
You can find more details in the `GitHub repository <https://github.com/NervanaSystems/coach>`_.
.. toctree::
:maxdepth: 2
:caption: Intro
:titlesonly:
usage
features/index
selecting_an_algorithm
dashboard
.. toctree::
:maxdepth: 1
:caption: Design
design/control_flow
design/network
.. toctree::
:maxdepth: 1
:caption: Contributing
contributing/add_agent
contributing/add_env
.. toctree::
:maxdepth: 1
:caption: Components
components/agents/index
components/architectures/index
components/environments/index
components/exploration_policies/index
components/filters/index
components/memories/index
components/core_types
components/spaces
components/additional_parameters

View File

@@ -0,0 +1,270 @@
Selecting an Algorithm
======================
As you have probably already noticed, Coach has a lot of algorithms implemented in it:
.. image:: /_static/img/algorithms.png
:width: 800px
:align: center
**"ok that's prefect, but I am trying to build a solution for my application, how do I select the right algorithm?"**
We collected some guidelines for how to choose the right algorithm for your application.
Answer the following questions to see which algorithms best fit your task.
The algorithms are ordered by their release date in descending order.
.. raw:: html
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
// descending order of the agent badges according to their publish year
function order_badges() {
$(".badges-wrapper").find('.algorithm').sort(function(a, b) {
// dataset.year is the concatenated year and month of the paper publishing date
return b.dataset.year - a.dataset.year;
}).appendTo($(".badges-wrapper"));
}
function update_algorithms_list() {
// show all the badges
$("input:checkbox, input:radio").each(function(){
$('.' + this.id).show();
});
// remove all that don't fit the task
$("input:checkbox").each(function(){
if (!this.checked) {
$('.' + this.id).hide();
}
});
$("input:radio").each(function(){
if (this.checked) {
$('.algorithm').not('.' + this.id).hide();
}
});
order_badges();
}
// toggle badges according to the checkbox change
$('input:checkbox, input:radio').click(update_algorithms_list);
update_algorithms_list();
});
</script>
<div class="bordered-container">
<div class="questionnaire">
What are the type of actions your task requires?
<div style="margin-left: 12px;">
<input type="radio" id="discrete" name="actions" checked>Discrete actions<br>
<input type="radio" id="continuous" name="actions">Continuous actions<br>
</div>
<input type="checkbox" id="imitation" checked="True">Do you have expert demonstrations for your task?<br>
<input type="checkbox" id="on-policy" checked="True">Can you collect new data for your task dynamically?<br>
<input type="checkbox" id="requires-multi-worker" checked="True">Do you have a simulator for your task?<br>
</div>
<br>
<div class="badges-wrapper">
<div class="algorithm discrete off-policy" data-year="201300">
<span class="badge">
<a href="components/agents/value_optimization/dqn.html">DQN</a>
<br>
Learns action values for discrete actions, and allows learning from a replay buffer with old experiences
</span>
</div>
<div class="algorithm discrete off-policy" data-year="201710">
<span class="badge">
<a href="components/agents/value_optimization/rainbow.html">Rainbow</a>
<br>
Combines multiple recent innovations on top of DQN for discrete controls, and achieves
much better results on known benchmarks
</span>
</div>
<div class="algorithm continuous off-policy" data-year="201712">
<span class="badge">
<a href="components/agents/policy_optimization/hac.html">HAC</a>
<br>
Works only for continuous actions, and uses a hierarchy of agents to make the learning
simpler
</span>
</div>
<div class="algorithm discrete off-policy data-year="201509">
<span class="badge">
<a href="components/agents/value_optimization/ddqn.html">DDQN</a>
<br>
An improvement over DQN, which learns more accurate action values, and therefore achieves better results
on known benchmarks
</span>
</div>
<div class="algorithm discrete on-policy" data-year="201611">
<span class="badge">
<a href="components/agents/other/dfp.html">DFP</a>
<br>
Works only for discrete actions, by learning to predict the future values of a set of
measurements from the environment, and then using a goal vector to weight the importance of each of the
measurements
</span>
</div>
<div class="algorithm discrete off-policy" data-year="201606">
<span class="badge">
<a href="components/agents/value_optimization/mmc.html">MMC</a>
<br>
A simple modification to DQN which, instead of learning action values only by bootstrapping the current
action value prediction, mixes in the total discounted return as well. This helps learn the correct
action values faster, and is particularly useful for environments with delayed rewards.
</span>
</div>
<div class="algorithm discrete off-policy" data-year="201512">
<span class="badge">
<a href="components/agents/value_optimization/pal.html">PAL</a>
<br>
An improvement over DQN, that tries to deal with the approximation errors present in reinforcement
learning by increasing the gap between the value of the best action and the second best action.
</span>
</div>
<div class="algorithm continuous off-policy" data-year="201603">
<span class="badge">
<a href="components/agents/value_optimization/naf.html">NAF</a>
<br>
A variant of Q learning for continuous control.
</span>
</div>
<div class="algorithm discrete off-policy" data-year="201703">
<span class="badge">
<a href="components/agents/value_optimization/ddqn.html">NEC</a>
<br>
Uses a memory to "memorize" its experience and learn much faster by querying the memory on newly
seen states.
</span>
</div>
<div class="algorithm discrete off-policy" data-year="201710">
<span class="badge">
<a href="components/agents/value_optimization/qr_dqn.html">QR DQN</a>
<br>
Uses quantile regression to learn a distribution over the action values instead of only their mean.
This boosts performance on known benchmarks.
</span>
</div>
<div class="algorithm discrete off-policy" data-year="201602">
<span class="badge">
<a href="components/agents/value_optimization/bs_dqn.html">Bootstrapped DQN</a>
<br>
Uses an ensemble of DQN networks, where each network learns from a different subset of the experience
in order to improve exploration.
</span>
</div>
<div class="algorithm discrete on-policy requires-multi-worker" data-year="201602">
<span class="badge">
<a href="components/agents/value_optimization/n_step.html">N-Step Q Learning</a>
<br>
A variant of Q learning that uses bootstrapping of N steps ahead, instead of 1 step. Doing this
makes the algorithm on-policy and therefore requires having multiple workers training in parallel in
order for it to work well.
</span>
</div>
<div class="algorithm discrete off-policy" data-year="201706">
<span class="badge">
<a href="components/agents/value_optimization/categorical_dqn.html">Categorical DQN</a>
<br>
Learns a distribution over the action values instead of only their mean. This boosts performance on
known benchmarks, but requires knowing the range of possible accumulated reward values beforehand.
</span>
</div>
<div class="algorithm continuous discrete on-policy" data-year="199200">
<span class="badge">
<a href="components/agents/policy_optimization/pg.html">Policy Gradient</a>
<br>
Based on the REINFORCE algorithm, this algorithm learns a probability distribution over the actions.
This is the simplest algorithm available in Coach, but it also has the worst results.
</span>
</div>
<div class="algorithm discrete continuous on-policy requires-multi-worker" data-year="201602">
<span class="badge">
<a href="components/agents/policy_optimization/ac.html">Actor Critic (A3C / A2C)</a>
<br>
Combines REINFORCE with a learned baseline (Critic) to improve stability of learning. It also
introduced the parallel learning of multiple workers to speed up data collection and improve the
learning stability and speed, both for discrete and continuous action spaces.
</span>
</div>
<div class="algorithm continuous off-policy" data-year="201509">
<span class="badge">
<a href="components/agents/policy_optimization/ddpg.html">DDPG</a>
<br>
An actor critic scheme for continuous action spaces which assumes that the policy is deterministic,
and therefore it is able to use a replay buffer in order to improve sample efficiency.
</span>
</div>
<div class="algorithm continuous discrete on-policy" data-year="201706">
<span class="badge">
<a href="components/agents/policy_optimization/ppo.html">PPO</a>
<br>
An actor critic scheme which uses bounded updates to the policy in order to make the learning process
very stable.
</span>
</div>
<div class="algorithm discrete continuous on-policy" data-year="201706">
<span class="badge">
<a href="components/agents/policy_optimization/cppo.html">Clipped PPO</a>
<br>
A simplification of PPO, that reduces the code complexity while achieving similar results.
</span>
</div>
<div class="algorithm discrete continuous imitation off-policy" data-year="199700">
<span class="badge">
<a href="components/agents/imitation/bc.html">BC</a>
<br>
The simplest form of imitation learning. Uses supervised learning on a dataset of expert demonstrations
in order to imitate the expert behavior.
</span>
</div>
<div class="algorithm discrete continuous imitation off-policy" data-year="201710">
<span class="badge">
<a href="components/agents/imitation/cil.html">CIL</a>
<br>
A variant of behavioral cloning, where the learned policy is disassembled to several skills
(such as turning left or right in an intersection), and each skill is learned separately from the
human demonstrations.
</span>
</div>
</div>
</div>
1. Does your environment have a discrete or continuous action space?
--------------------------------------------------------------------
Some reinforcement learning algorithms work only for discrete action spaces, where the agent needs to select
one out of several possible actions. Other algorithms work only for continuous action spaces, where there are
infinite possible actions, but there is some spatial relationship between the actions. And there are some algorithms
that can be applied in both cases. The available algorithms highly depend on the task at hand.
2. Is collecting more samples from your environment painful?
------------------------------------------------------------
Reinforcement learning algorithms are notorious for the amount of samples they need for training.
Typically, on-policy algorithms are much less sample efficient compared to off-policy algorithms. But there are
other algorithmic features that allow improving the sample efficiency even more, like using a DND in NEC, or using
Hindsight Experience Replay. It is hard to say which algorithm is the most sample efficient, but we can at least say
which ones are not sample efficient.
3. Do you have a simulator that can be parallelized across multiple processes or nodes?
---------------------------------------------------------------------------------------
Parallelizing training across multiple workers which are located on the same node or on different nodes is a technique
that has been introduced in recent years and achieved a lot of success in improving the results of multiple algorithms.
As part of this, some algorithms do not work well without multiple workers running in parallel,
which requires having a simulator for each worker.
4. Do you have human demonstrations for solving the task?
---------------------------------------------------------
If human demonstrations are available for a task, most of the time it would be better to use those instead of training
using regular reinforcement learning from scratch. To use human demonstrations we have implemented several tools and
algorithms for imitation learning in Coach.

8
docs_raw/source/test.rst Normal file
View File

@@ -0,0 +1,8 @@
test
----
.. important:: It's a note! in markdown!
.. autoclass:: rl_coach.agents.dqn_agent.DQNAgent
:members:
:inherited-members:

158
docs_raw/source/usage.rst Normal file
View File

@@ -0,0 +1,158 @@
Usage
=====
One of the mechanisms Coach uses for running experiments is the **Preset** mechanism.
As its name implies, a preset defines a set of predefined experiment parameters.
This allows defining a *complex* agent-environment interaction, with multiple parameters, and later running it through
a very *simple* command line.
The preset includes all the components that are used in the experiment, such as the agent internal components and
the environment to use.
It additionally defines general parameters for the experiment itself, such as the training schedule,
visualization parameters, and testing parameters.
Training an Agent
-----------------
Single-threaded Algorithms
++++++++++++++++++++++++++
This is the most common case. Just choose a preset using the :code:`-p` flag and press enter.
To list the available presets, use the :code:`-l` flag.
*Example:*
.. code-block:: bash
coach -p CartPole_DQN
Multi-threaded Algorithms
+++++++++++++++++++++++++
Multi-threaded algorithms are very common these days.
They typically achieve the best results, and scale gracefully with the number of threads.
In Coach, running such algorithms is done by selecting a suitable preset, and choosing the number of threads to run using the :code:`-n` flag.
*Example:*
.. code-block:: bash
coach -p CartPole_A3C -n 8
Evaluating an Agent
-------------------
There are several options for evaluating an agent during the training:
* For multi-threaded runs, an evaluation agent will constantly run in the background and evaluate the model during the training.
* For single-threaded runs, it is possible to define an evaluation period through the preset. This will run several episodes of evaluation once in a while.
Additionally, it is possible to save checkpoints of the agent's networks and then run only in evaluation mode.
Saving checkpoints can be done by specifying the number of seconds between storing checkpoints using the :code:`-s` flag.
The checkpoints will be saved into the experiment directory.
Loading a model for evaluation can be done by specifying the :code:`-crd` flag with the experiment directory, and the :code:`--evaluate` flag to disable training.
*Example:*
.. code-block:: bash
coach -p CartPole_DQN -s 60
coach -p CartPole_DQN --evaluate -crd CHECKPOINT_RESTORE_DIR
Playing with the Environment as a Human
---------------------------------------
Interacting with the environment as a human can be useful for understanding its difficulties and for collecting data for imitation learning.
In Coach, this can be easily done by selecting a preset that defines the environment to use, and specifying the :code:`--play` flag.
When the environment is loaded, the available keyboard buttons will be printed to the screen.
Pressing the escape key when finished will end the simulation and store the replay buffer in the experiment dir.
*Example:*
.. code-block:: bash
coach -et rl_coach.environments.gym_environment:Atari -lvl BreakoutDeterministic-v4 --play
Learning Through Imitation Learning
-----------------------------------
Learning through imitation of human behavior is a nice way to speed up the learning.
In Coach, this can be done in two steps -
1. Create a dataset of demonstrations by playing with the environment as a human.
After this step, a pickle of the replay buffer containing your game play will be stored in the experiment directory.
The path to this replay buffer will be printed to the screen.
To do so, you should select an environment type and level through the command line, and specify the :code:`--play` flag.
*Example:*
.. code-block:: bash
coach -et rl_coach.environments.doom_environment:DoomEnvironmentParameters -lvl Basic --play
2. Next, use an imitation learning preset and set the replay buffer path accordingly.
The path can be set either from the command line or from the preset itself.
*Example:*
.. code-block:: bash
coach -p Doom_Basic_BC -cp='agent.load_memory_from_file_path=\"<experiment dir>/replay_buffer.p\"'
Visualizations
--------------
Rendering the Environment
+++++++++++++++++++++++++
Rendering the environment can be done by using the :code:`-r` flag.
When working with multi-threaded algorithms, the rendered image will represent the game play of the evaluation worker.
When working with single-threaded algorithms, the rendered image will represent the single worker, which can be either training or evaluating.
Keep in mind that rendering the environment in single-threaded algorithms may slow the training to some extent.
When playing with the environment using the :code:`--play` flag, the environment will be rendered automatically without the need for specifying the :code:`-r` flag.
*Example:*
.. code-block:: bash
coach -p Breakout_DQN -r
Dumping GIFs
++++++++++++
Coach allows storing GIFs of the agent game play.
To dump GIF files, use the :code:`-dg` flag.
The files are dumped after every evaluation episode, and are saved into the experiment directory, under a gifs sub-directory.
*Example:*
.. code-block:: bash
coach -p Breakout_A3C -n 4 -dg
Switching Between Deep Learning Frameworks
------------------------------------------
Coach uses TensorFlow as its main backend framework, but it also supports MXNet.
MXNet is optional, and by default, TensorFlow will be used.
If MXNet is installed, it is possible to switch to it using the :code:`-f` flag.
*Example:*
.. code-block:: bash
coach -p Doom_Basic_DQN -f mxnet
Additional Flags
----------------
There are several convenient flags which are important to know about.
The most up to date description can be found by using the :code:`-h` flag.
.. argparse::
:module: rl_coach.coach
:func: create_argument_parser
:prog: coach