update of api docstrings across coach and tutorials [WIP] (#91)
* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
18
docs_raw/source/components/additional_parameters.rst
Normal file
@@ -0,0 +1,18 @@
Additional Parameters
=====================

VisualizationParameters
-----------------------
.. autoclass:: rl_coach.base_parameters.VisualizationParameters

PresetValidationParameters
--------------------------
.. autoclass:: rl_coach.base_parameters.PresetValidationParameters

TaskParameters
--------------
.. autoclass:: rl_coach.base_parameters.TaskParameters

DistributedTaskParameters
-------------------------
.. autoclass:: rl_coach.base_parameters.DistributedTaskParameters
29
docs_raw/source/components/agents/imitation/bc.rst
Normal file
@@ -0,0 +1,29 @@
Behavioral Cloning
==================

**Actions space:** Discrete | Continuous

Network Structure
-----------------

.. image:: /_static/img/design_imgs/pg.png
   :align: center


Algorithm Description
---------------------

Training the network
++++++++++++++++++++

The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as state-action tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.

1. Sample a batch of transitions from the replay buffer.
2. Use the current states as input to the network, and the expert actions as the targets of the network.
3. For the network head, we use the policy head, which uses the cross entropy loss function.


.. autoclass:: rl_coach.agents.bc_agent.BCAlgorithmParameters
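
The cross entropy loss in step 3 can be sketched as follows. This is a minimal, framework-agnostic NumPy illustration,
not the actual Coach implementation; the function name and array shapes are assumptions made for the example.

.. code-block:: python

    import numpy as np

    def bc_cross_entropy_loss(action_probs, expert_actions):
        """Cross entropy between predicted action probabilities and the expert actions.

        action_probs: (batch_size, num_actions) softmax output of the policy head.
        expert_actions: (batch_size,) integer action indices taken by the expert.
        """
        batch_size = action_probs.shape[0]
        # probability the network assigned to the action the expert actually took
        picked = action_probs[np.arange(batch_size), expert_actions]
        return -np.mean(np.log(picked + 1e-10))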
36
docs_raw/source/components/agents/imitation/cil.rst
Normal file
@@ -0,0 +1,36 @@
Conditional Imitation Learning
==============================

**Actions space:** Discrete | Continuous

**References:** `End-to-end Driving via Conditional Imitation Learning <https://arxiv.org/abs/1710.02410>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/cil.png
   :align: center


Algorithm Description
---------------------

Training the network
++++++++++++++++++++

The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as state-action tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.
In conditional imitation learning, each transition is assigned a class, which determines the goal that was pursued
in that transition. For example, 3 possible classes can be: turn right, turn left and follow lane.

1. Sample a batch of transitions from the replay buffer, where the batch is balanced, meaning that an equal number
   of transitions will be sampled from each class index.
2. Use the current states as input to the network, and assign the expert actions as the targets of the network heads
   corresponding to the state classes. For the other heads, set the targets to match the currently predicted values,
   so that the loss for the other heads will be zeroed out.
3. We use a regression head, which minimizes the MSE loss between the network's predicted values and the target values.


.. autoclass:: rl_coach.agents.cil_agent.CILAlgorithmParameters
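
Step 2 above can be sketched as follows. This is an illustrative NumPy snippet, not the Coach implementation; only the
head matching each transition's class receives the expert action as a target, so the MSE loss of the other heads is
zeroed out.

.. code-block:: python

    import numpy as np

    def build_cil_targets(predictions, expert_actions, classes):
        """Build per-head regression targets for conditional imitation learning.

        predictions: (batch_size, num_heads, action_dim) current network outputs.
        expert_actions: (batch_size, action_dim) actions taken by the expert.
        classes: (batch_size,) class index (e.g. turn left / turn right / follow lane).
        """
        targets = predictions.copy()  # inactive heads keep their own predictions -> zero loss
        batch_idx = np.arange(predictions.shape[0])
        targets[batch_idx, classes] = expert_actions  # only the active head gets the expert action
        return targets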
43
docs_raw/source/components/agents/index.rst
Normal file
@@ -0,0 +1,43 @@
Agents
======

Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into three main classes -
value optimization, policy optimization and imitation learning.
A detailed description of those algorithms can be found by navigating to each of the algorithm pages.

.. image:: /_static/img/algorithms.png
   :width: 600px
   :align: center

.. toctree::
   :maxdepth: 1
   :caption: Agents

   policy_optimization/ac
   imitation/bc
   value_optimization/bs_dqn
   value_optimization/categorical_dqn
   imitation/cil
   policy_optimization/cppo
   policy_optimization/ddpg
   other/dfp
   value_optimization/double_dqn
   value_optimization/dqn
   value_optimization/dueling_dqn
   value_optimization/mmc
   value_optimization/n_step
   value_optimization/naf
   value_optimization/nec
   value_optimization/pal
   policy_optimization/pg
   policy_optimization/ppo
   value_optimization/rainbow
   value_optimization/qr_dqn


.. autoclass:: rl_coach.base_parameters.AgentParameters

.. autoclass:: rl_coach.agents.agent.Agent
   :members:
   :inherited-members:
39
docs_raw/source/components/agents/other/dfp.rst
Normal file
@@ -0,0 +1,39 @@
Direct Future Prediction
========================

**Actions space:** Discrete

**References:** `Learning to Act by Predicting the Future <https://arxiv.org/abs/1611.01779>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dfp.png
   :width: 600px
   :align: center


Algorithm Description
---------------------

Choosing an action
++++++++++++++++++

1. The current states (observations and measurements) and the corresponding goal vector are passed as an input to the network.
   The output of the network is the predicted future measurements for time-steps :math:`t+1,t+2,t+4,t+8,t+16` and
   :math:`t+32` for each possible action.
2. For each action, the measurements of each predicted time-step are multiplied by the goal vector,
   and the result is a single vector of future values for each action.
3. Then, a weighted sum of the future values of each action is calculated, and the result is a single value for each action.
4. The action values are passed to the exploration policy to decide on the action to use.

Training the network
++++++++++++++++++++

Given a batch of transitions, run them through the network to get the current predictions of the future measurements
per action, and set them as the initial targets for training the network. For each transition
:math:`(s_t,a_t,r_t,s_{t+1})` in the batch, the target of the network for the action that was taken, is the actual
measurements that were seen in time-steps :math:`t+1,t+2,t+4,t+8,t+16` and :math:`t+32`.
For the actions that were not taken, the targets are the current values.


.. autoclass:: rl_coach.agents.dfp_agent.DFPAlgorithmParameters
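
The action scoring described above (steps 2-3) can be sketched as follows. This is an illustrative NumPy snippet with
assumed array shapes and function names, not Coach's actual tensors.

.. code-block:: python

    import numpy as np

    def dfp_action_values(predicted_measurements, goal_vector, time_step_weights):
        """Score each action from its predicted future measurements.

        predicted_measurements: (num_actions, num_time_steps, num_measurements)
            predictions for time-steps t+1, t+2, t+4, t+8, t+16, t+32.
        goal_vector: (num_measurements,) importance of each measurement.
        time_step_weights: (num_time_steps,) weighting of each predicted time-step.
        """
        # future value of each action at each predicted time-step
        future_values = predicted_measurements @ goal_vector      # (num_actions, num_time_steps)
        # weighted sum over time-steps -> one value per action
        return future_values @ time_step_weights                  # (num_actions,)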
40
docs_raw/source/components/agents/policy_optimization/ac.rst
Normal file
@@ -0,0 +1,40 @@
Actor-Critic
============

**Actions space:** Discrete | Continuous

**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ac.png
   :width: 500px
   :align: center

Algorithm Description
---------------------

Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++

The policy network is used in order to predict action probabilities. While training, a sample is taken from a categorical
distribution assigned with these probabilities. When testing, the action with the highest probability is used.

Training the network
++++++++++++++++++++
A batch of :math:`T_{max}` transitions is used, and the advantages are calculated upon it.

Advantages can be calculated by either of the following methods (configured by the selected preset) -

1. **A_VALUE** - Estimating advantage directly:
   :math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`
   where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.

2. **GAE** - By following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.

The advantages are then used in order to accumulate gradients according to
:math:`L = -\mathop{\mathbb{E}} [log (\pi) \cdot A]`


.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters
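
As an illustration of the **A_VALUE** estimator above, here is a short NumPy sketch (not the Coach implementation) that
computes the advantage for every state in a batch of :math:`T_{max}` transitions.

.. code-block:: python

    import numpy as np

    def a_value_advantages(rewards, values, bootstrap_value, gamma=0.99):
        """A(s_t, a_t) = sum_{i=t}^{T-1} gamma^{i-t} r_i + gamma^{T-t} V(s_T) - V(s_t).

        rewards: (T_max,) rewards of the batch.
        values: (T_max,) critic estimates V(s_t) for each state in the batch.
        bootstrap_value: V(s_{T_max}), the value of the state following the last transition.
        """
        T = len(rewards)
        returns = np.zeros(T)
        running = bootstrap_value
        for t in reversed(range(T)):
            running = rewards[t] + gamma * running     # k-step discounted return, k = T - t
            returns[t] = running
        return returns - values                        # advantage = Q estimate - V(s_t)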
@@ -0,0 +1,44 @@
Clipped Proximal Policy Optimization
====================================

**Actions space:** Discrete | Continuous

**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ppo.png
   :align: center

Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++

Same as in PPO.

Training the network
++++++++++++++++++++

Very similar to PPO, with several small (but very simplifying) changes:

1. Train both the value and policy networks simultaneously, by defining a single loss function,
   which is the sum of the two networks' loss functions. Then, back propagate gradients only once from this unified loss function.

2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).

3. Value targets are now also calculated based on the GAE advantages.
   In this method, the :math:`V` values are predicted from the critic network, and then added to the GAE based advantages,
   in order to get a :math:`Q` value for each action. Now, since our critic network is predicting a :math:`V` value for
   each state, setting the :math:`Q` calculated action-values as a target, will on average serve as a :math:`V` state-value target.

4. Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
   :math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped, to achieve a similar effect.
   This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon
   clipped surrogate loss:

   :math:`L^{CLIP}(\theta)=E_{t}[min(r_t(\theta)\cdot \hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`


.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters
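
A minimal NumPy sketch of the clipped surrogate objective above (illustrative only, not the Coach implementation; the
loss is negated for minimization):

.. code-block:: python

    import numpy as np

    def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        """L^CLIP = -E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)]."""
        ratio = np.exp(new_log_probs - old_log_probs)            # r_t(theta)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        return -np.mean(np.minimum(unclipped, clipped))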
@@ -0,0 +1,50 @@
Deep Deterministic Policy Gradient
==================================

**Actions space:** Continuous

**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ddpg.png
   :align: center

Algorithm Description
---------------------
Choosing an action
++++++++++++++++++

Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.

Training the network
++++++++++++++++++++

Start by sampling a batch of transitions from the experience replay.

* To train the **critic network**, use the following targets:

  :math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`

  First run the actor target network, using the next states as the inputs, and get :math:`\mu (s_{t+1} )`.
  Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
  calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
  as the inputs, and :math:`y_t` as the targets.

* To train the **actor network**, use the following equation:

  :math:`\nabla_{\theta^\mu } J \approx E_{s_t \sim \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ]`

  Use the actor's online network to get the action mean values using the current states as the inputs.
  Then, use the critic online network in order to get the gradients of the critic output with respect to the
  action mean values :math:`\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
  Using the chain rule, calculate the gradients of the actor's output, with respect to the actor weights,
  given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.

After every training step, do a soft update of the critic and actor target networks' weights from the online networks.


.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters
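
The critic target and the soft (Polyak) update of the target networks can be sketched as follows. This is an
illustrative NumPy snippet, not the Coach implementation; ``tau`` is an assumed soft-update coefficient name used only
for this example.

.. code-block:: python

    import numpy as np

    def ddpg_critic_targets(rewards, target_q_next, gamma=0.99):
        """y_t = r(s_t, a_t) + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))."""
        return rewards + gamma * target_q_next

    def soft_update(target_weights, online_weights, tau=0.001):
        """Move each target weight a small step towards the corresponding online weight."""
        return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]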
@@ -0,0 +1,24 @@
Hierarchical Actor Critic
=========================

**Actions space:** Continuous

**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ddpg.png
   :align: center

Algorithm Description
---------------------
Choosing an action
++++++++++++++++++

Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.

Training the network
++++++++++++++++++++
39
docs_raw/source/components/agents/policy_optimization/pg.rst
Normal file
@@ -0,0 +1,39 @@
Policy Gradient
===============

**Actions space:** Discrete | Continuous

**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/pg.png
   :align: center

Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
Run the current states through the network and get a policy distribution over the actions.
While training, sample from the policy distribution. When testing, take the action with the highest probability.

Training the network
++++++++++++++++++++
The policy head loss is defined as :math:`L=-log (\pi) \cdot PolicyGradientRescaler`.
The :code:`PolicyGradientRescaler` is used in order to reduce the variance of the policy gradient updates, since noisy
gradient updates might destabilize the policy's convergence. The rescaler is a configurable parameter and there are a
few options to choose from:

* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - Return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
  which are calculated separately for each timestep, across different episodes.

Gradients are accumulated over a number of full played episodes. Accumulating the gradients over several episodes
serves the same purpose - reducing the update variance. After accumulating gradients for several episodes,
the gradients are then applied to the network.


.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters
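
As an illustration of the **Future Return** rescaler described above, here is a short NumPy sketch (not the Coach
implementation) that computes the discounted return from each transition until the end of the episode, together with
its per-episode normalized variant:

.. code-block:: python

    import numpy as np

    def future_returns(rewards, gamma=0.99):
        """Discounted return from each time-step until the end of the episode."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def normalized_by_episode(returns):
        """The 'Future Return Normalized by Episode' variant."""
        return (returns - returns.mean()) / (returns.std() + 1e-8)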
@@ -0,0 +1,45 @@
Proximal Policy Optimization
============================

**Actions space:** Discrete | Continuous

**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/ppo.png
   :align: center


Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation.
While in the training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values.
When testing, just take the mean values predicted by the network.

Training the network
++++++++++++++++++++

1. Collect a big chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).

2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman et al., 2015).

3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
   the L-BFGS optimizer runs on the entire dataset at once, without batching.
   It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
   the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
   discounted returns of each state in each episode.

4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as
   targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
   starting to run the current set of training iterations) using a regularization term.

5. After training is done, the last sampled KL divergence value will be compared with the *target KL divergence* value,
   in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high,
   increase the penalty; if it went too low, reduce it. Otherwise, leave it unchanged.


.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters
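
The penalty adaptation in step 5 can be sketched as follows. This is illustrative only; the thresholds and scaling
factors below are commonly used heuristic values and are not necessarily the ones used by Coach.

.. code-block:: python

    def adapt_kl_penalty(kl_divergence, target_kl, penalty_coefficient):
        """Increase the penalty when the policy moved too far, reduce it when it barely moved."""
        if kl_divergence > 1.5 * target_kl:
            penalty_coefficient *= 2.0      # policy changed too much -> penalize deviations more
        elif kl_divergence < target_kl / 1.5:
            penalty_coefficient *= 0.5      # policy changed too little -> allow larger updates
        return penalty_coefficient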
@@ -0,0 +1,43 @@
Bootstrapped DQN
================

**Actions space:** Discrete

**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/bs_dqn.png
   :align: center

Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current states are used as the input to the network. The network contains several :math:`Q` heads, which are used
for returning different estimations of the action :math:`Q` values. For each episode, the bootstrapped exploration policy
selects a single head to play with during the episode. According to the selected head, only the relevant
output :math:`Q` values are used. Using those :math:`Q` values, the exploration policy then selects the action for acting.

Storing the transitions
+++++++++++++++++++++++
For each transition, a binomial mask is generated according to a predefined probability and the number of output heads.
The mask is a binary vector where each element holds a 0 for heads that shouldn't train on the specific transition,
and 1 for heads that should use the transition for training. The mask is stored as part of the transition info in
the replay buffer.

Training the network
++++++++++++++++++++
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
and for each output head, if the transition mask is 1 - change the targets of the played action to :math:`y_t`,
according to the standard DQN update rule:

:math:`y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a)`

Otherwise, leave it intact so that the transition does not affect the learning of this head.
Then, train the online network according to the calculated targets.

As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.
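
Sampling the per-transition head mask can be sketched as follows (NumPy, illustrative only; the sharing probability
name is an assumption made for this example):

.. code-block:: python

    import numpy as np

    def sample_bootstrap_mask(num_heads, sharing_probability=0.5):
        """Binary vector: 1 means the corresponding head trains on this transition."""
        return np.random.binomial(n=1, p=sharing_probability, size=num_heads)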
@@ -0,0 +1,39 @@
Categorical DQN
===============

**Actions space:** Discrete

**References:** `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/distributional_dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
   that the :math:`i`-th component of the projected update is calculated as follows:

   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`

   where:

   * :math:`[\cdot]^b_a` bounds its argument in the range :math:`[a, b]`
   * :math:`\hat{T}_{z_{j}}` is the Bellman update for atom :math:`z_j`: :math:`\hat{T}_{z_{j}} := r+\gamma z_j`

3. The network is trained with the cross entropy loss between the resulting probability distribution and the target
   probability distribution. Only the target of the actions that were actually taken is updated.

4. Once in every few thousand steps, weights are copied from the online network to the target network.


.. autoclass:: rl_coach.agents.categorical_dqn_agent.CategoricalDQNAlgorithmParameters
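
The projection in step 2 can be sketched as follows for a single non-terminal transition (NumPy, illustrative only,
not the Coach implementation; terminal transitions would use :math:`\gamma = 0`):

.. code-block:: python

    import numpy as np

    def project_distribution(next_probs, reward, gamma, v_min, v_max, num_atoms):
        """Project the Bellman-updated atom support back onto the fixed support z_0..z_{N-1}.

        next_probs: (num_atoms,) probabilities p_j(s_{t+1}, pi(s_{t+1})).
        reward: scalar reward r(s_t, a_t).
        """
        z = np.linspace(v_min, v_max, num_atoms)            # fixed support z_i
        delta_z = (v_max - v_min) / (num_atoms - 1)
        tz = np.clip(reward + gamma * z, v_min, v_max)      # Bellman update of each atom, bounded to [V_MIN, V_MAX]
        projected = np.zeros(num_atoms)
        for j in range(num_atoms):                          # distribute p_j between the two nearest atoms
            b = (tz[j] - v_min) / delta_z
            lower, upper = int(np.floor(b)), int(np.ceil(b))
            if lower == upper:                              # tz[j] falls exactly on an atom
                projected[lower] += next_probs[j]
            else:
                projected[lower] += next_probs[j] * (upper - b)
                projected[upper] += next_probs[j] * (b - lower)
        return projected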
@@ -0,0 +1,35 @@
Double DQN
==========

**Actions space:** Discrete

**References:** `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. Using the next states from the sampled batch, run the online network in order to find the :math:`Q` maximizing
   action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
   network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.

3. In order to zero out the updates for the actions that were not played (resulting from zeroing the MSE loss),
   use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
   Set those values as the targets for the actions that were not actually played.

4. For each action that was played, use the following equation for calculating the targets of the network:
   :math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a))`

5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.

6. Once in every few thousand steps, copy the weights from the online network to the target network.
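
The target construction in steps 2-4 can be sketched as follows (NumPy, illustrative only, ignoring episode
termination for brevity):

.. code-block:: python

    import numpy as np

    def double_dqn_targets(rewards, online_q_next, target_q_next, gamma=0.99):
        """y_t = r_t + gamma * Q_target(s_{t+1}, argmax_a Q_online(s_{t+1}, a)).

        online_q_next / target_q_next: (batch_size, num_actions) Q values for the next states.
        """
        best_actions = np.argmax(online_q_next, axis=1)                   # selection by the online network
        batch_idx = np.arange(len(rewards))
        return rewards + gamma * target_q_next[batch_idx, best_actions]   # evaluation by the target network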
37
docs_raw/source/components/agents/value_optimization/dqn.rst
Normal file
@@ -0,0 +1,37 @@
Deep Q Networks
===============

**Actions space:** Discrete

**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
   the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.

3. In order to zero out the updates for the actions that were not played (resulting from zeroing the MSE loss),
   use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
   Set those values as the targets for the actions that were not actually played.

4. For each action that was played, use the following equation for calculating the targets of the network:
   :math:`y_t=r(s_t,a_t )+\gamma \cdot max_a Q(s_{t+1},a)`

5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.

6. Once in every few thousand steps, copy the weights from the online network to the target network.


.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters
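
The target construction in steps 2-4 can be sketched as follows (NumPy, illustrative only, ignoring episode
termination for brevity):

.. code-block:: python

    import numpy as np

    def dqn_targets(rewards, target_q_next, gamma=0.99):
        """y_t = r_t + gamma * max_a Q_target(s_{t+1}, a).

        target_q_next: (batch_size, num_actions) Q values of the next states from the target network.
        """
        return rewards + gamma * np.max(target_q_next, axis=1)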
@@ -0,0 +1,27 @@
Dueling DQN
===========

**Actions space:** Discrete

**References:** `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dueling_dqn.png
   :align: center

General Description
-------------------
Dueling DQN presents a change in the network structure compared to DQN.

Dueling DQN uses a specialized *Dueling Q Head* in order to separate :math:`Q` into an :math:`A` (advantage)
stream and a :math:`V` stream. Adding this type of structure to the network head allows the network to better differentiate
actions from one another, and significantly improves the learning.

In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training
iteration, for each of the states in the batch, we update the :math:`Q` values only for the specific actions taken in
those states. This results in slower learning, as we do not learn the :math:`Q` values for actions that were not taken yet.
With the dueling architecture, on the other hand, learning is faster - as we start learning the state-value even if only a
single action has been taken at this state.
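
The dueling head combines the two streams into :math:`Q` values. A common formulation (from the dueling DQN paper,
shown here as an illustrative NumPy sketch rather than Coach's exact head implementation) subtracts the mean advantage
for identifiability:

.. code-block:: python

    import numpy as np

    def dueling_q_values(state_value, advantages):
        """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

        state_value: scalar V(s) from the value stream.
        advantages: (num_actions,) A(s, a) from the advantage stream.
        """
        return state_value + advantages - advantages.mean()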
37
docs_raw/source/components/agents/value_optimization/mmc.rst
Normal file
@@ -0,0 +1,37 @@
Mixed Monte Carlo
=================

**Actions space:** Discrete

**References:** `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------
Training the network
++++++++++++++++++++

In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).

The DDQN targets are calculated in the same manner as in the DDQN agent:

:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`

The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:

:math:`y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} )`

A mixing ratio :math:`\alpha` is then used to get the final targets:

:math:`y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}`

Finally, the online network is trained using the current states as inputs, and the calculated targets.
Once in every few thousand steps, copy the weights from the online network to the target network.


.. autoclass:: rl_coach.agents.mmc_agent.MixedMonteCarloAlgorithmParameters
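
The target mixing can be sketched as follows (illustrative only; the value of ``alpha`` is an arbitrary example, not a
Coach default):

.. code-block:: python

    def mmc_targets(ddqn_targets, monte_carlo_returns, alpha=0.1):
        """y_t = (1 - alpha) * y_t^DDQN + alpha * y_t^MC."""
        return (1.0 - alpha) * ddqn_targets + alpha * monte_carlo_returns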
@@ -0,0 +1,35 @@
N-Step Q Learning
=================

**Actions space:** Discrete

**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

The :math:`N`-step Q learning algorithm works in a similar manner to DQN, except for the following changes:

1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every
   :math:`N` steps using the latest :math:`N` steps played by the agent.

2. In order to stabilize the learning, multiple workers work together to update the network.
   This creates the same effect as decorrelating the samples used for training.

3. Instead of using single-step Q targets for the network, the rewards from :math:`N` consecutive steps are accumulated
   to form the :math:`N`-step Q targets, according to the following equation:

   :math:`R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})`

   where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch


.. autoclass:: rl_coach.agents.n_step_q_agent.NStepQAlgorithmParameters
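
The :math:`N`-step targets above can be computed recursively from the end of the rollout backwards. This is an
illustrative NumPy sketch, not the Coach implementation:

.. code-block:: python

    import numpy as np

    def n_step_q_targets(rewards, bootstrap_value, gamma=0.99):
        """R(s_t, a_t) = r_t + gamma * r_{t+1} + ... + gamma^k * V(s_{t+k}) for a rollout of N steps.

        rewards: (N,) rewards of the latest N steps.
        bootstrap_value: V(s_{t+N}), e.g. the maximum target-network Q value of the state after the rollout.
        """
        targets = np.zeros(len(rewards))
        running = bootstrap_value
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            targets[t] = running
        return targets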
33
docs_raw/source/components/agents/value_optimization/naf.rst
Normal file
@@ -0,0 +1,33 @@
Normalized Advantage Functions
==============================

**Actions space:** Continuous

**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/naf.png
   :width: 600px
   :align: center

Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current state is used as an input to the network. The action mean :math:`\mu(s_t )` is extracted from the output head.
It is then passed to the exploration policy which adds noise in order to encourage exploration.

Training the network
++++++++++++++++++++
The network is trained by using the following targets:

:math:`y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1})`

Use the next states as the inputs to the target network and extract the :math:`V` value, from within the head,
to get :math:`V(s_{t+1} )`. Then, update the online network using the current states and actions as inputs,
and :math:`y_t` as the targets.
After every training step, use a soft update in order to copy the weights from the online network to the target network.


.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters
50
docs_raw/source/components/agents/value_optimization/nec.rst
Normal file
@@ -0,0 +1,50 @@
Neural Episodic Control
=======================

**Actions space:** Discrete

**References:** `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/nec.png
   :width: 500px
   :align: center

Algorithm Description
---------------------
Choosing an action
++++++++++++++++++

1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate
   output from the middleware.

2. For each possible action :math:`a_i`, run the DND head using the state embedding and the selected action :math:`a_i` as inputs.
   The DND is queried and returns the :math:`P` nearest neighbor keys and values. The keys and values are used to calculate
   and return the action :math:`Q` value from the network.

3. Pass all the :math:`Q` values to the exploration policy and choose an action accordingly.

4. Store the state embeddings and actions taken during the current episode in a small buffer :math:`B`, in order to
   accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.

Finalizing an episode
+++++++++++++++++++++
For each step in the episode, the state embeddings and the taken actions are stored in the buffer :math:`B`.
When the episode is finished, the replay buffer calculates the :math:`N`-step total return of each transition in the
buffer, bootstrapped using the maximum :math:`Q` value of the :math:`N`-th transition. Those values are inserted
along with the total return into the DND, and the buffer :math:`B` is reset.

Training the network
++++++++++++++++++++
Train the network only when the DND has enough entries for querying.

To train the network, the current states are used as the inputs and the :math:`N`-step returns are used as the targets.
The :math:`N`-step return used takes into account :math:`N` consecutive steps, and bootstraps the last value from
the network if necessary:

:math:`y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N max_a Q(s_{t+N},a)`


.. autoclass:: rl_coach.agents.nec_agent.NECAlgorithmParameters
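
The DND lookup in step 2 can be sketched as a kernel-weighted average over the :math:`P` nearest neighbors of the state
embedding. This is an illustrative NumPy snippet; the inverse-distance kernel follows the NEC paper and is not
necessarily Coach's exact code.

.. code-block:: python

    import numpy as np

    def dnd_q_value(query_embedding, keys, values, p=50, delta=1e-3):
        """Estimate Q(s, a) from the DND entries of a single action.

        keys: (num_entries, embedding_dim) stored state embeddings.
        values: (num_entries,) stored returns for those embeddings.
        """
        distances = np.sum((keys - query_embedding) ** 2, axis=1)
        nearest = np.argsort(distances)[:p]                    # P nearest neighbors
        kernel = 1.0 / (distances[nearest] + delta)            # inverse-distance kernel
        weights = kernel / kernel.sum()
        return np.dot(weights, values[nearest])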
45
docs_raw/source/components/agents/value_optimization/pal.rst
Normal file
@@ -0,0 +1,45 @@
Persistent Advantage Learning
=============================

**Actions space:** Discrete

**References:** `Increasing the Action Gap: New Operators for Reinforcement Learning <https://arxiv.org/abs/1512.04860>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------
Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. Start by calculating the initial target values in the same manner as they are calculated in DDQN:
   :math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`

3. The action gap :math:`V(s_t )-Q(s_t,a_t)` should then be subtracted from each of the calculated targets.
   To calculate the action gap, run the target network using the current states and get the :math:`Q` values
   for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state:
   :math:`V(s_t )=max_a Q(s_t,a)`

4. For *advantage learning (AL)*, subtract the action gap, weighted by a predefined parameter :math:`\alpha`, from
   the targets :math:`y_t^{DDQN}`:
   :math:`y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t ))`

5. For *persistent advantage learning (PAL)*, the target network is also used in order to calculate the action
   gap for the next state:
   :math:`V(s_{t+1} )-Q(s_{t+1},a_{t+1})`
   where :math:`a_{t+1}` is chosen by running the next states through the online network and choosing the action that
   has the highest predicted :math:`Q` value. Finally, the targets will be defined as -
   :math:`y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} ))`

6. Train the online network using the current states as inputs, and with the aforementioned targets.

7. Once in every few thousand steps, copy the weights from the online network to the target network.


.. autoclass:: rl_coach.agents.pal_agent.PALAlgorithmParameters
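
The PAL target construction (step 5) can be sketched as follows. This is an illustrative NumPy snippet, not the Coach
implementation; the value of ``alpha`` is an example only.

.. code-block:: python

    import numpy as np

    def pal_targets(ddqn_targets, target_q_current, target_q_next, online_q_next, actions, alpha=0.9):
        """y_t = y_t^DDQN - alpha * min(gap(s_t, a_t), gap(s_{t+1}, a_{t+1})).

        target_q_current / target_q_next: (batch_size, num_actions) target-network Q values.
        online_q_next: (batch_size, num_actions) online-network Q values for the next states.
        actions: (batch_size,) actions that were actually played.
        """
        batch_idx = np.arange(len(actions))
        gap_current = np.max(target_q_current, axis=1) - target_q_current[batch_idx, actions]
        next_actions = np.argmax(online_q_next, axis=1)          # a_{t+1} chosen by the online network
        gap_next = np.max(target_q_next, axis=1) - target_q_next[batch_idx, next_actions]
        return ddqn_targets - alpha * np.minimum(gap_current, gap_next)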
@@ -0,0 +1,33 @@
Quantile Regression DQN
=======================

**Actions space:** Discrete

**References:** `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/qr_dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. First, the next state quantiles are predicted. These are used in order to calculate the targets for the network,
   by following the Bellman equation.
   Next, the current quantile locations for the current states are predicted, sorted, and used for calculating the
   quantile midpoints targets.

3. The network is trained with the quantile regression loss between the resulting quantile locations and the target
   quantile locations. Only the targets of the actions that were actually taken are updated.

4. Once in every few thousand steps, weights are copied from the online network to the target network.


.. autoclass:: rl_coach.agents.qr_dqn_agent.QuantileRegressionDQNAlgorithmParameters
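
The loss in step 3 can be sketched as follows for a single state-action pair. This is an illustrative NumPy snippet of
the plain (non-Huber) quantile regression loss, shown for clarity rather than as Coach's exact loss.

.. code-block:: python

    import numpy as np

    def quantile_regression_loss(predicted_quantiles, target_quantiles):
        """Quantile regression loss between predicted and target quantile locations.

        predicted_quantiles: (num_quantiles,) current quantile locations (sorted).
        target_quantiles: (num_quantiles,) target quantile locations from the Bellman update.
        """
        n = len(predicted_quantiles)
        tau_midpoints = (np.arange(n) + 0.5) / n                       # quantile midpoints tau_i
        # pairwise TD errors u_ij = target_j - predicted_i
        u = target_quantiles[None, :] - predicted_quantiles[:, None]
        loss = np.abs(tau_midpoints[:, None] - (u < 0).astype(float)) * np.abs(u)
        return loss.mean()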
@@ -0,0 +1,51 @@
Rainbow
=======

**Actions space:** Discrete

**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/rainbow.png
   :align: center

Algorithm Description
---------------------

Rainbow combines 6 recent advancements in reinforcement learning:

* N-step returns
* Distributional state-action value learning
* Dueling networks
* Noisy Networks
* Double DQN
* Prioritized Experience Replay

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
   that the :math:`i`-th component of the projected update is calculated as follows:

   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`

   where:

   * :math:`[\cdot]^b_a` bounds its argument in the range :math:`[a, b]`
   * :math:`\hat{T}_{z_{j}}` is the :math:`n`-step Bellman update for atom
     :math:`z_j`: :math:`\hat{T}_{z_{j}} := r_t+\gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^{n} z_j`

3. The network is trained with the cross entropy loss between the resulting probability distribution and the target
   probability distribution. Only the target of the actions that were actually taken is updated.

4. Once in every few thousand steps, weights are copied from the online network to the target network.

5. After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer
   using the KL divergence loss that is returned from the network.


.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters
27
docs_raw/source/components/architectures/index.rst
Normal file
@@ -0,0 +1,27 @@
Architectures
=============

Architectures contain all the classes that implement the neural network functionality for the agent.
Since Coach is intended to work with multiple neural network frameworks, each framework implements its
own components under a dedicated directory. For example, the TensorFlow components contain all the neural network
parts that are implemented using TensorFlow.

.. autoclass:: rl_coach.base_parameters.NetworkParameters

Architecture
------------
.. autoclass:: rl_coach.architectures.architecture.Architecture
   :members:
   :inherited-members:

NetworkWrapper
--------------

.. image:: /_static/img/distributed.png
   :width: 600px
   :align: center

.. autoclass:: rl_coach.architectures.network_wrapper.NetworkWrapper
   :members:
   :inherited-members:
33
docs_raw/source/components/core_types.rst
Normal file
@@ -0,0 +1,33 @@
Core Types
==========

ActionInfo
----------
.. autoclass:: rl_coach.core_types.ActionInfo
   :members:
   :inherited-members:

Batch
-----
.. autoclass:: rl_coach.core_types.Batch
   :members:
   :inherited-members:

EnvResponse
-----------
.. autoclass:: rl_coach.core_types.EnvResponse
   :members:
   :inherited-members:

Episode
-------
.. autoclass:: rl_coach.core_types.Episode
   :members:
   :inherited-members:

Transition
----------
.. autoclass:: rl_coach.core_types.Transition
   :members:
   :inherited-members:
70
docs_raw/source/components/environments/index.rst
Normal file
@@ -0,0 +1,70 @@
Environments
============

.. autoclass:: rl_coach.environments.environment.Environment
   :members:
   :inherited-members:

DeepMind Control Suite
----------------------

A set of reinforcement learning environments powered by the MuJoCo physics engine.

Website: `DeepMind Control Suite <https://github.com/deepmind/dm_control>`_

.. autoclass:: rl_coach.environments.control_suite_environment.ControlSuiteEnvironment


Blizzard Starcraft II
---------------------

A popular strategy game which was wrapped with a Python interface by DeepMind.

Website: `Blizzard Starcraft II <https://github.com/deepmind/pysc2>`_

.. autoclass:: rl_coach.environments.starcraft2_environment.StarCraft2Environment


ViZDoom
-------

A Doom-based AI research platform for reinforcement learning from raw visual information.

Website: `ViZDoom <http://vizdoom.cs.put.edu.pl/>`_

.. autoclass:: rl_coach.environments.doom_environment.DoomEnvironment


CARLA
-----

An open-source simulator for autonomous driving research.

Website: `CARLA <https://github.com/carla-simulator/carla>`_

.. autoclass:: rl_coach.environments.carla_environment.CarlaEnvironment

OpenAI Gym
----------

A library which consists of a set of environments, from games to robotics.
Additionally, it can be extended using the API defined by the authors.

Website: `OpenAI Gym <https://gym.openai.com/>`_

In Coach, we support all the native environments in Gym, along with several extensions such as:

* `Roboschool <https://github.com/openai/roboschool>`_ - a set of environments powered by the PyBullet engine,
  that offer a free alternative to MuJoCo.

* `Gym Extensions <https://github.com/Breakend/gym-extensions>`_ - a set of environments that extends Gym for
  auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.)

* `PyBullet <https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet>`_ - a physics engine that
  includes a set of robotics environments.


.. autoclass:: rl_coach.environments.gym_environment.GymEnvironment
87
docs_raw/source/components/exploration_policies/index.rst
Normal file
@@ -0,0 +1,87 @@
Exploration Policies
====================

Exploration policies are a component that allows the agent to trade off exploration and exploitation according to a
predefined policy. This is one of the most important aspects of reinforcement learning agents, and can require some
tuning to get it right. Coach supports several pre-defined exploration policies, and it can be easily extended with
custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action
spaces.

.. role:: green
.. role:: red

+----------------------+-----------------------+------------------+
| Exploration Policy   | Discrete Action Space | Box Action Space |
+======================+=======================+==================+
| AdditiveNoise        | :red:`X`              | :green:`V`       |
+----------------------+-----------------------+------------------+
| Boltzmann            | :green:`V`            | :red:`X`         |
+----------------------+-----------------------+------------------+
| Bootstrapped         | :green:`V`            | :red:`X`         |
+----------------------+-----------------------+------------------+
| Categorical          | :green:`V`            | :red:`X`         |
+----------------------+-----------------------+------------------+
| ContinuousEntropy    | :red:`X`              | :green:`V`       |
+----------------------+-----------------------+------------------+
| EGreedy              | :green:`V`            | :green:`V`       |
+----------------------+-----------------------+------------------+
| Greedy               | :green:`V`            | :green:`V`       |
+----------------------+-----------------------+------------------+
| OUProcess            | :red:`X`              | :green:`V`       |
+----------------------+-----------------------+------------------+
| ParameterNoise       | :green:`V`            | :green:`V`       |
+----------------------+-----------------------+------------------+
| TruncatedNormal      | :red:`X`              | :green:`V`       |
+----------------------+-----------------------+------------------+
| UCB                  | :green:`V`            | :red:`X`         |
+----------------------+-----------------------+------------------+

ExplorationPolicy
-----------------
.. autoclass:: rl_coach.exploration_policies.ExplorationPolicy
   :members:
   :inherited-members:

AdditiveNoise
-------------
.. autoclass:: rl_coach.exploration_policies.AdditiveNoise

Boltzmann
---------
.. autoclass:: rl_coach.exploration_policies.Boltzmann

Bootstrapped
------------
.. autoclass:: rl_coach.exploration_policies.Bootstrapped

Categorical
-----------
.. autoclass:: rl_coach.exploration_policies.Categorical

ContinuousEntropy
-----------------
.. autoclass:: rl_coach.exploration_policies.ContinuousEntropy

EGreedy
-------
.. autoclass:: rl_coach.exploration_policies.EGreedy

Greedy
------
.. autoclass:: rl_coach.exploration_policies.Greedy

OUProcess
---------
.. autoclass:: rl_coach.exploration_policies.OUProcess

ParameterNoise
--------------
.. autoclass:: rl_coach.exploration_policies.ParameterNoise

TruncatedNormal
---------------
.. autoclass:: rl_coach.exploration_policies.TruncatedNormal

UCB
---
.. autoclass:: rl_coach.exploration_policies.UCB
28
docs_raw/source/components/filters/index.rst
Normal file
@@ -0,0 +1,28 @@
Filters
=======

.. toctree::
   :maxdepth: 1
   :caption: Filters

   input_filters
   output_filters

Filters are a mechanism in Coach that allows pre-processing and post-processing of the internal agent information.
There are two filter categories -

* **Input filters** - these are filters that process the information passed **into** the agent from the environment.
  This information includes the observation and the reward. Input filters therefore allow rescaling observations,
  normalizing rewards, stacking observations, etc.

* **Output filters** - these are filters that process the information going **out** of the agent into the environment.
  This information includes the action the agent chooses to take. Output filters therefore allow conversion of
  actions from one space into another. For example, the agent can take :math:`N` discrete actions, that will be mapped by
  the output filter onto :math:`N` continuous actions.

Filters can be stacked on top of each other in order to build complex processing flows of the inputs or outputs.

.. image:: /_static/img/filters.png
   :width: 350px
   :align: center
67
docs_raw/source/components/filters/input_filters.rst
Normal file
@@ -0,0 +1,67 @@
Input Filters
=============

The input filters are separated into two categories - **observation filters** and **reward filters**.

Observation Filters
-------------------

ObservationClippingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationClippingFilter

ObservationCropFilter
+++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationCropFilter

ObservationMoveAxisFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationMoveAxisFilter

ObservationNormalizationFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationNormalizationFilter

ObservationReductionBySubPartsNameFilter
++++++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationReductionBySubPartsNameFilter

ObservationRescaleSizeByFactorFilter
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleSizeByFactorFilter

ObservationRescaleToSizeFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleToSizeFilter

ObservationRGBToYFilter
+++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRGBToYFilter

ObservationSqueezeFilter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationSqueezeFilter

ObservationStackingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationStackingFilter

ObservationToUInt8Filter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationToUInt8Filter


Reward Filters
--------------

RewardClippingFilter
++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardClippingFilter

RewardNormalizationFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardNormalizationFilter

RewardRescaleFilter
+++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardRescaleFilter
37
docs_raw/source/components/filters/output_filters.rst
Normal file
@@ -0,0 +1,37 @@
Output Filters
--------------

The output filters only process the actions.

Action Filters
++++++++++++++

.. autoclass:: rl_coach.filters.action.AttentionDiscretization

.. image:: /_static/img/attention_discretization.png
   :align: center

.. autoclass:: rl_coach.filters.action.BoxDiscretization

.. image:: /_static/img/box_discretization.png
   :align: center

.. autoclass:: rl_coach.filters.action.BoxMasking

.. image:: /_static/img/box_masking.png
   :align: center

.. autoclass:: rl_coach.filters.action.PartialDiscreteActionSpaceMap

.. image:: /_static/img/partial_discrete_action_space_map.png
   :align: center

.. autoclass:: rl_coach.filters.action.FullDiscreteActionSpaceMap

.. image:: /_static/img/full_discrete_action_space_map.png
   :align: center

.. autoclass:: rl_coach.filters.action.LinearBoxToBoxMap

.. image:: /_static/img/linear_box_to_box_map.png
   :align: center
44
docs_raw/source/components/memories/index.rst
Normal file
@@ -0,0 +1,44 @@
Memories
========

Episodic Memories
-----------------

EpisodicExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicExperienceReplay

EpisodicHindsightExperienceReplay
+++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHindsightExperienceReplay

EpisodicHRLHindsightExperienceReplay
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHRLHindsightExperienceReplay

SingleEpisodeBuffer
+++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.SingleEpisodeBuffer


Non-Episodic Memories
---------------------

BalancedExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.BalancedExperienceReplay

QDND
++++
.. autoclass:: rl_coach.memories.non_episodic.QDND

ExperienceReplay
++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.ExperienceReplay

PrioritizedExperienceReplay
+++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.PrioritizedExperienceReplay

TransitionCollection
++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.TransitionCollection
64
docs_raw/source/components/spaces.rst
Normal file
@@ -0,0 +1,64 @@
Spaces
======

Space
-----
.. autoclass:: rl_coach.spaces.Space
   :members:
   :inherited-members:


Observation Spaces
------------------
.. autoclass:: rl_coach.spaces.ObservationSpace
   :members:
   :inherited-members:

VectorObservationSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.VectorObservationSpace

PlanarMapsObservationSpace
++++++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.PlanarMapsObservationSpace

ImageObservationSpace
+++++++++++++++++++++
.. autoclass:: rl_coach.spaces.ImageObservationSpace


Action Spaces
-------------
.. autoclass:: rl_coach.spaces.ActionSpace
   :members:
   :inherited-members:

AttentionActionSpace
++++++++++++++++++++
.. autoclass:: rl_coach.spaces.AttentionActionSpace

BoxActionSpace
++++++++++++++
.. autoclass:: rl_coach.spaces.BoxActionSpace

DiscreteActionSpace
+++++++++++++++++++
.. autoclass:: rl_coach.spaces.DiscreteActionSpace

MultiSelectActionSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.MultiSelectActionSpace

CompoundActionSpace
+++++++++++++++++++
.. autoclass:: rl_coach.spaces.CompoundActionSpace


Goal Spaces
-----------
.. autoclass:: rl_coach.spaces.GoalsSpace
   :members:
   :inherited-members: