update of api docstrings across coach and tutorials [WIP] (#91)
* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
docs_raw/Makefile
@@ -0,0 +1,19 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docs_raw/README.md
@@ -0,0 +1,31 @@
# Coach Documentation

Coach uses Sphinx with a Read the Docs theme for its documentation website.
The website is hosted on GitHub Pages, and is automatically pulled from the repository through the built docs directory.

To build the documentation website locally, first install the following requirements:

```
pip install Sphinx
pip install recommonmark
pip install sphinx_rtd_theme
pip install sphinx-autobuild
pip install sphinx-argparse
```

Then there are two options to build:
1. Build using the make file (recommended):

```
make html
cp source/_static/css/custom.css build/html/_static/css/
rm -rf ../docs/
mkdir ../docs
cp -R build/html/* ../docs/
```

2. Build automatically after every change while editing the files:

```
sphinx-autobuild source build/html
```
@@ -1,12 +0,0 @@
Installation
=============
1. Install mkdocs by following the instructions here:
   http://www.mkdocs.org/#installation
2. Install the math extension for mkdocs:
   sudo -E pip install python-markdown-math

To build the documentation website run:
- mkdocs build
- python fix_index.py

This will create a folder named site, which contains the documentation website.
@@ -1,25 +0,0 @@
# Behavioral Cloning

**Action space:** Discrete|Continuous

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>

## Algorithm Description

### Training the network

The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as state-action tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by the expert for each state.

1. Sample a batch of transitions from the replay buffer.
2. Use the current states as input to the network, and the expert actions as the targets of the network.
3. The loss function for the network is MSE, and therefore the Q head is used to minimize this loss.
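The loss in step 3 reduces to a plain supervised regression loss. A minimal NumPy sketch of that quantity is shown below (illustrative only, not Coach's implementation); the example arrays are hypothetical continuous actions.

```python
import numpy as np

def bc_mse_loss(predicted_actions: np.ndarray, expert_actions: np.ndarray) -> float:
    """Mean squared error between the network's predicted actions and the expert's actions."""
    return float(np.mean((predicted_actions - expert_actions) ** 2))

# Example: a batch of 4 two-dimensional continuous actions
predicted = np.array([[0.1, -0.3], [0.5, 0.2], [0.0, 0.9], [-0.4, 0.4]])
expert = np.array([[0.0, -0.2], [0.6, 0.1], [0.1, 1.0], [-0.5, 0.5]])
loss = bc_mse_loss(predicted, expert)   # the value the Q head is trained to minimize
```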
@@ -1,25 +0,0 @@
# Direct Future Prediction

**Action space:** Discrete

**References:** [Learning to Act by Predicting the Future](https://arxiv.org/abs/1611.01779)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dfp.png" width=600>

</p>

## Algorithm Description
### Choosing an action

1. The current states (observations and measurements) and the corresponding goal vector are passed as input to the network. The output of the network is the predicted future measurements for time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$ for each possible action.
2. For each action, the measurements of each predicted time-step are multiplied by the goal vector, and the result is a single vector of future values for each action.
3. Then, a weighted sum of the future values of each action is calculated, and the result is a single value for each action.
4. The action values are passed to the exploration policy to decide on the action to use.

### Training the network

Given a batch of transitions, run them through the network to get the current predictions of the future measurements per action, and set them as the initial targets for training the network. For each transition $(s_t,a_t,r_t,s_{t+1})$ in the batch, the target of the network for the action that was taken is the actual measurements that were seen in time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$. For the actions that were not taken, the targets remain the current predicted values.
@@ -1,27 +0,0 @@
# Actor-Critic

**Action space:** Discrete|Continuous

**References:** [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)

## Network Structure

<p style="text-align: center;">
<img src="../../design_imgs/ac.png" width=500>
</p>

## Algorithm Description

### Choosing an action - Discrete actions

The policy network is used to predict action probabilities. During training, a sample is taken from a categorical distribution assigned with these probabilities. When testing, the action with the highest probability is used.

### Training the network
A batch of $T_{max}$ transitions is used, and the advantages are calculated over it.

Advantages can be calculated by either of the following methods (configured by the selected preset):

1. **A_VALUE** - Estimating the advantage directly:
$$ A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t) $$
where $k$ is $T_{max} - State\_Index$ for each state in the batch.
2. **GAE** - Following the [Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438) paper.

The advantages are then used in order to accumulate gradients according to the loss
$$ L = -\mathop{\mathbb{E}} [\log (\pi) \cdot A] $$
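The following is a minimal NumPy sketch of the **A_VALUE** estimator above (illustrative only, not Coach's implementation); it assumes the batch holds the rewards of the last $T_{max}$ steps, the critic's value estimates for those states, and a bootstrap value $V(s_{t+k})$ for the state that follows the batch.

```python
import numpy as np

def k_step_advantages(rewards: np.ndarray, values: np.ndarray,
                      bootstrap_value: float, gamma: float = 0.99) -> np.ndarray:
    """A(s_t) = sum_{i=t}^{T-1} gamma^(i-t) * r_i + gamma^(T-t) * V(s_T) - V(s_t)."""
    advantages = np.zeros(len(rewards))
    # Work backwards: the discounted return of step t reuses the return of step t+1
    future_return = bootstrap_value
    for t in reversed(range(len(rewards))):
        future_return = rewards[t] + gamma * future_return
        advantages[t] = future_return - values[t]
    return advantages

# Example with a batch of 4 transitions
rewards = np.array([1.0, 0.0, 1.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.7])        # V(s_t) from the critic
advantages = k_step_advantages(rewards, values, bootstrap_value=0.8)
```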
@@ -1,28 +0,0 @@
# Clipped Proximal Policy Optimization

**Action space:** Discrete|Continuous

**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)

## Network Structure

<p style="text-align: center;">
<img src="../../design_imgs/ppo.png">
</p>

## Algorithm Description
### Choosing an action - Continuous actions
Same as in PPO.
### Training the network
Very similar to PPO, with several small (but very simplifying) changes:

1. Train both the value and policy networks simultaneously, by defining a single loss function which is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.

2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network, as in PPO).

3. Value targets are now also calculated based on the GAE advantages. In this method, the $V$ values are predicted from the critic network and then added to the GAE-based advantages, in order to get a $Q$ value for each action. Now, since our critic network is predicting a $V$ value for each state, setting the calculated $Q$ action-values as a target will, on average, serve as a $V$ state-value target.

4. Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio $r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ is clipped, to achieve a similar effect. This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon-clipped surrogate loss:

$$L^{CLIP}(\theta)=\mathbb{E}_{t}[min(r_t(\theta)\cdot \hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)] $$
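A minimal NumPy sketch of the clipped surrogate objective above, written over a batch (illustrative only, not Coach's implementation); `log_prob_new` and `log_prob_old` are assumed to be the log-likelihoods of the taken actions under the current and old policies.

```python
import numpy as np

def clipped_surrogate_loss(log_prob_new: np.ndarray, log_prob_old: np.ndarray,
                           advantages: np.ndarray, epsilon: float = 0.2) -> float:
    """L^CLIP = -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], with r = pi_new / pi_old."""
    ratio = np.exp(log_prob_new - log_prob_old)              # likelihood ratio r_t(theta)
    unclipped = ratio * advantages                           # standard surrogate
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Negative sign: the objective is maximized, so the loss is minimized
    return float(-np.mean(np.minimum(unclipped, clipped)))
```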
@@ -1,32 +0,0 @@
# Deep Deterministic Policy Gradient

**Action space:** Continuous

**References:** [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/ddpg.png">

</p>

## Algorithm Description
### Choosing an action
Pass the current states through the actor network, and get an action mean vector $\mu$. During training, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process, to add exploration noise to the action. When testing, use the mean vector $\mu$ as-is.
### Training the network
Start by sampling a batch of transitions from the experience replay.

* To train the **critic network**, use the following targets:

$$ y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} )) $$
First run the actor target network, using the next states as the inputs, and get $\mu(s_{t+1})$. Next, run the critic target network using the next states and $\mu(s_{t+1})$, and use the output to calculate $y_t$ according to the equation above. To train the network, use the current states and actions as the inputs, and $y_t$ as the targets.

* To train the **actor network**, use the following equation:

$$ \nabla_{\theta^\mu } J \approx \mathbb{E}_{s_t \sim \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ] $$
Use the actor's online network to get the action mean values, using the current states as the inputs. Then, use the critic's online network to get the gradients of the critic output with respect to the action mean values, $\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t )}$. Using the chain rule, calculate the gradients of the actor's output with respect to the actor weights, given $\nabla_a Q(s,a)$. Finally, apply those gradients to the actor network.

After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
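The critic target computation above can be sketched as follows (plain NumPy, illustrative only); the actor and critic target networks are represented here by stand-in callables.

```python
import numpy as np

def ddpg_critic_targets(rewards, next_states, actor_target, critic_target, gamma=0.99):
    """y_t = r(s_t, a_t) + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))."""
    next_actions = actor_target(next_states)             # mu(s_{t+1}) from the actor target network
    next_q = critic_target(next_states, next_actions)    # Q(s_{t+1}, mu(s_{t+1})) from the critic target network
    return rewards + gamma * next_q

# Example with stand-in linear "networks"
actor_target = lambda s: np.tanh(s @ np.ones((3, 1)))                             # 3-dim states -> 1-dim actions
critic_target = lambda s, a: (s.sum(axis=1, keepdims=True) + a).squeeze(-1)       # toy Q function
rewards = np.array([1.0, 0.0])
next_states = np.random.randn(2, 3)
targets = ddpg_critic_targets(rewards, next_states, actor_target, critic_target)
```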
@@ -1,27 +0,0 @@
# Policy Gradient

**Action space:** Discrete|Continuous

**References:** [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/pg.png">

</p>

## Algorithm Description
### Choosing an action - Discrete actions
Run the current states through the network and get a policy distribution over the actions. While training, sample from the policy distribution. When testing, take the action with the highest probability.

### Training the network
The policy head loss is defined as $L=-\log (\pi) \cdot PolicyGradientRescaler$. The $PolicyGradientRescaler$ is used to reduce the variance of the policy gradient updates, since noisy gradient updates might destabilize the policy's convergence. The rescaler is a configurable parameter, and there are a few options to choose from:
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - The return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode, normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations, which are calculated separately for each timestep, across different episodes.

Gradients are accumulated over a number of fully played episodes before being applied to the network; accumulating over several episodes serves the same purpose of reducing the update variance.
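For illustration, here is a minimal NumPy sketch of the **Future Return** rescaler listed above (the discounted return from each timestep to the end of the episode); this is not Coach's implementation.

```python
import numpy as np

def future_returns(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... until the end of the episode."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
rescaler = future_returns(episode_rewards)   # multiplies -log(pi) in the policy head loss
```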
@@ -1,24 +0,0 @@
# Proximal Policy Optimization

**Action space:** Discrete|Continuous

**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/ppo.png">

</p>

## Algorithm Description
### Choosing an action - Continuous actions
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation. While in the training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values. When testing, just take the mean values predicted by the network.
### Training the network
1. Collect a big chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman, 2015).
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first-order optimizers, the L-BFGS optimizer runs on the entire dataset at once, without batching, and continues running until some low loss threshold is reached. To prevent overfitting to the current dataset, the value targets are updated in a soft manner, using an exponentially weighted moving average, based on the total discounted returns of each state in each episode.
4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before* starting to run the current set of training iterations) using a regularization term.
5. After training is done, the last sampled KL divergence value is compared with the *target KL divergence* value, in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high, increase the penalty; if it went too low, reduce it; otherwise, leave it unchanged.
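Step 5's adaptive penalty can be sketched as follows. This is a simple heuristic consistent with the description above; the factor-of-1.5 tolerance band and the multiplier of 2 are illustrative assumptions, not Coach's exact constants.

```python
def adapt_kl_penalty(kl_divergence: float, target_kl: float,
                     penalty_coefficient: float) -> float:
    """Increase the penalty when the policy moved too far, decrease it when it barely moved."""
    if kl_divergence > 1.5 * target_kl:        # KL went too high -> penalize deviations more
        penalty_coefficient *= 2.0
    elif kl_divergence < target_kl / 1.5:      # KL went too low -> allow larger policy steps
        penalty_coefficient /= 2.0
    return penalty_coefficient                 # otherwise leave it unchanged

coefficient = adapt_kl_penalty(kl_divergence=0.02, target_kl=0.01, penalty_coefficient=1.0)
```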
@@ -1,30 +0,0 @@
# Bootstrapped DQN

**Action space:** Discrete

**References:** [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/bs_dqn.png">

</p>

## Algorithm Description
### Choosing an action
The current states are used as the input to the network. The network contains several $Q$ heads, which return different estimates of the action $Q$ values. For each episode, the bootstrapped exploration policy selects a single head to play with during the episode. According to the selected head, only the relevant output $Q$ values are used. Using those $Q$ values, the exploration policy then selects the action for acting.

### Storing the transitions
For each transition, a binomial mask is generated according to a predefined probability and the number of output heads. The mask is a binary vector where each element holds a 0 for heads that shouldn't train on the specific transition, and a 1 for heads that should use the transition for training. The mask is stored as part of the transition info in the replay buffer.

### Training the network
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the current $Q$ value predictions for all the heads and all the actions. For each transition in the batch, and for each output head, if the transition mask is 1, change the target of the played action to $y_t$, according to the standard DQN update rule:

$$ y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a) $$

Otherwise, leave it intact so that the transition does not affect the learning of this head. Then, train the online network according to the calculated targets.

As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.
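A minimal NumPy sketch of the per-transition head mask described above (illustrative only; the sharing probability is a hypothetical value, not a Coach default).

```python
import numpy as np

def sample_head_mask(num_heads: int, sharing_probability: float = 0.5, rng=None) -> np.ndarray:
    """Binary vector with one entry per Q head: 1 = this head trains on the transition."""
    rng = rng or np.random.default_rng()
    return rng.binomial(n=1, p=sharing_probability, size=num_heads)

mask = sample_head_mask(num_heads=10)   # e.g. array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
# Stored with the transition; during training, head i uses the transition only if mask[i] == 1.
```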
@@ -1,33 +0,0 @@
# Categorical DQN

**Action space:** Discrete

**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/distributional_dqn.png">

</p>

## Algorithm Description

### Training the network

1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected onto the set of atoms representing the $Q$ values distribution, such that the $i$-th component of the projected update is calculated as follows:
$$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$
where:
* $[\cdot]^b_a$ bounds its argument in the range $[a, b]$
* $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: $\hat{T}_{z_{j}} := r+\gamma z_j$

3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, the weights are copied from the online network to the target network.
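A minimal NumPy sketch of the projection in step 2 for a single transition (illustrative only, not Coach's implementation); `next_probabilities` is assumed to be $p_j(s_{t+1}, \pi(s_{t+1}))$ taken from the target network.

```python
import numpy as np

def project_bellman_update(next_probabilities, reward, gamma, v_min, v_max, num_atoms):
    """Project r + gamma * z_j onto the fixed support z_0..z_{N-1} (step 2 above)."""
    z = np.linspace(v_min, v_max, num_atoms)          # the atoms z_i
    delta_z = z[1] - z[0]
    tz = np.clip(reward + gamma * z, v_min, v_max)    # Bellman update per atom, bounded to [V_MIN, V_MAX]
    projected = np.zeros(num_atoms)
    for j in range(num_atoms):
        # Each updated atom spreads its probability mass onto the neighboring atoms of the support
        weights = np.clip(1 - np.abs(tz[j] - z) / delta_z, 0, 1)
        projected += weights * next_probabilities[j]
    return projected

probs = np.full(51, 1.0 / 51)   # uniform distribution over 51 atoms
target = project_bellman_update(probs, reward=1.0, gamma=0.99, v_min=-10, v_max=10, num_atoms=51)
```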
@@ -1,33 +0,0 @@
# Distributional DQN

**Action space:** Discrete

**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/distributional_dqn.png">

</p>

## Algorithm Description

### Training the network

1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected onto the set of atoms representing the $Q$ values distribution, such that the $i$-th component of the projected update is calculated as follows:
$$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$
where:
* $[\cdot]^b_a$ bounds its argument in the range $[a, b]$
* $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: $\hat{T}_{z_{j}} := r+\gamma z_j$

3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, the weights are copied from the online network to the target network.
@@ -1,28 +0,0 @@
# Double DQN

**Action space:** Discrete

**References:** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461.pdf)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>

## Algorithm Description

### Training the network
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the online network to find the $Q$ maximizing action $argmax_a Q(s_{t+1},a)$. For these actions, use the corresponding next states and run the target network to calculate $Q(s_{t+1},argmax_a Q(s_{t+1},a))$.
3. In order to zero out the updates for the actions that were not played (resulting from zeroing the MSE loss), use the current states from the sampled batch, and run the online network to get the current $Q$ value predictions. Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation to calculate the target of the network:
$$ y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a)) $$

5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.
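A minimal NumPy sketch of steps 2-4 (illustrative only); `q_online_next` and `q_target_next` stand for the two networks' Q predictions for the next states.

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, gamma=0.99):
    """y_t = r_t + gamma * Q_target(s_{t+1}, argmax_a Q_online(s_{t+1}, a))."""
    best_actions = np.argmax(q_online_next, axis=1)        # action selection: online network
    batch_indices = np.arange(len(rewards))
    next_q = q_target_next[batch_indices, best_actions]    # action evaluation: target network
    return rewards + gamma * next_q

rewards = np.array([1.0, 0.0])
q_online_next = np.array([[0.2, 0.8], [0.5, 0.1]])
q_target_next = np.array([[0.3, 0.6], [0.4, 0.2]])
targets = double_dqn_targets(rewards, q_online_next, q_target_next)  # [1 + 0.99*0.6, 0 + 0.99*0.4]
```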
@@ -1,28 +0,0 @@
# Deep Q Networks

**Action space:** Discrete

**References:** [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>

## Algorithm Description

### Training the network

1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the target network to calculate the $Q$ values for each of the actions $Q(s_{t+1},a)$, and keep only the maximum value for each state.
3. In order to zero out the updates for the actions that were not played (resulting from zeroing the MSE loss), use the current states from the sampled batch, and run the online network to get the current $Q$ value predictions. Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation to calculate the target of the network: $$ y_t=r(s_t,a_t)+\gamma \cdot max_a {Q(s_{t+1},a)} $$

5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.
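A minimal NumPy sketch of steps 2-4 (illustrative only, not Coach's implementation): the targets start as the online network's current predictions, and only the played action in each row is replaced with $y_t$, so the untouched actions contribute nothing to the MSE loss.

```python
import numpy as np

def dqn_targets(q_current, q_target_next, actions, rewards, gamma=0.99):
    """Return a full Q-value target matrix where only the played actions are updated."""
    targets = q_current.copy()                     # untouched actions keep their current predictions
    max_next_q = q_target_next.max(axis=1)         # max_a Q(s_{t+1}, a) from the target network
    batch_indices = np.arange(len(actions))
    targets[batch_indices, actions] = rewards + gamma * max_next_q
    return targets

q_current = np.array([[0.1, 0.4], [0.3, 0.2]])      # online network, current states
q_target_next = np.array([[0.5, 0.2], [0.0, 0.7]])  # target network, next states
targets = dqn_targets(q_current, q_target_next,
                      actions=np.array([1, 0]), rewards=np.array([1.0, 0.0]))
```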
@@ -1,21 +0,0 @@
# Dueling DQN

**Action space:** Discrete

**References:** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dueling_dqn.png">

</p>

## General Description
Dueling DQN changes the network structure compared to DQN.

Dueling DQN uses a specialized _Dueling Q Head_ in order to separate $Q$ into an $A$ (advantage) stream and a $V$ (state-value) stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves learning.

In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training iteration, for each of the states in the batch, we update the $Q$ values only for the specific actions taken in those states. This results in slower learning, as we do not learn the $Q$ values for actions that were not taken yet. With the dueling architecture, on the other hand, learning is faster, since we start learning the state-value even if only a single action has been taken in this state.
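For reference, the cited paper combines the two streams as $Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'}A(s,a')$. A minimal NumPy sketch of that aggregation is shown below (illustrative only, not Coach's head implementation).

```python
import numpy as np

def dueling_q_values(state_value: float, advantages: np.ndarray) -> np.ndarray:
    """Combine the V stream and the A stream into Q values (mean-subtracted aggregation)."""
    return state_value + (advantages - advantages.mean())

q = dueling_q_values(state_value=1.2, advantages=np.array([0.1, -0.2, 0.4]))
# All actions share the learned state value; only their relative advantages differ.
```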
@@ -1,32 +0,0 @@
# Mixed Monte Carlo

**Action space:** Discrete

**References:** [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>

## Algorithm Description
### Training the network
In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).

The DDQN targets are calculated in the same manner as in the DDQN agent:

$$ y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a)) $$

The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:

$$ y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} ) $$

A mixing ratio $\alpha$ is then used to get the final targets:

$$ y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC} $$

Finally, the online network is trained using the current states as inputs, and the calculated targets.
Once in every few thousand steps, copy the weights from the online network to the target network.
@@ -1,30 +0,0 @@
# N-Step Q Learning

**Action space:** Discrete

**References:** [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>

## Algorithm Description

### Training the network

The $N$-step Q learning algorithm works in a similar manner to DQN, except for the following changes:

1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every $N$ steps using the latest $N$ steps played by the agent.

2. In order to stabilize the learning, multiple workers work together to update the network. This has a decorrelating effect on the training samples, similar to the one achieved by the replay buffer in DQN.

3. Instead of using single-step Q targets for the network, the rewards from $N$ consecutive steps are accumulated to form the $N$-step Q targets, according to the following equation:
$$R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})$$
where $k$ is $T_{max} - State\_Index$ for each state in the batch.
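A minimal NumPy sketch of the target accumulation in item 3 (illustrative only); `bootstrap_value` is assumed to be $V(s_{t+k})$ taken from the network for the last observed state.

```python
import numpy as np

def n_step_q_targets(rewards: np.ndarray, bootstrap_value: float, gamma: float = 0.99) -> np.ndarray:
    """R(s_t) = r_t + gamma * r_{t+1} + ... + gamma^(k-1) * r_{t+k-1} + gamma^k * V(s_{t+k})."""
    targets = np.zeros_like(rewards, dtype=float)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

last_n_rewards = np.array([0.0, 1.0, 0.0, 1.0])    # the latest N steps played by the agent
targets = n_step_q_targets(last_n_rewards, bootstrap_value=0.5)
```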
@@ -1,22 +0,0 @@
# Normalized Advantage Functions

**Action space:** Continuous

**References:** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748.pdf)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/naf.png" width=600>

</p>

## Algorithm Description
### Choosing an action
The current state is used as an input to the network. The action mean $\mu(s_t)$ is extracted from the output head. It is then passed to the exploration policy, which adds noise in order to encourage exploration.
### Training the network
The network is trained using the following targets:
$$ y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1}) $$
Use the next states as the inputs to the target network and extract the $V$ value from within the head, to get $V(s_{t+1})$. Then, update the online network using the current states and actions as inputs, and $y_t$ as the targets.
After every training step, use a soft update in order to copy the weights from the online network to the target network.
@@ -1,28 +0,0 @@
# Neural Episodic Control

**Action space:** Discrete

**References:** [Neural Episodic Control](https://arxiv.org/abs/1703.01988)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/nec.png" width=500>

</p>

## Algorithm Description
### Choosing an action
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate output of the middleware.
2. For each possible action $a_i$, run the DND head using the state embedding and the selected action $a_i$ as inputs. The DND is queried and returns the $P$ nearest-neighbor keys and values. The keys and values are used to calculate and return the action's $Q$ value from the network.
3. Pass all the $Q$ values to the exploration policy and choose an action accordingly.
4. Store the state embeddings and actions taken during the current episode in a small buffer $B$, in order to accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.

### Finalizing an episode
For each step in the episode, the state embeddings and the taken actions are stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $N$-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. Those values are inserted along with the total return into the DND, and the buffer $B$ is reset.
### Training the network
Train the network only when the DND has enough entries for querying.

To train the network, the current states are used as the inputs and the $N$-step returns are used as the targets. The $N$-step return takes into account $N$ consecutive steps, and bootstraps the last value from the network if necessary:
$$ y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N max_a Q(s_{t+N},a) $$
@@ -1,32 +0,0 @@
# Persistent Advantage Learning

**Action space:** Discrete

**References:** [Increasing the Action Gap: New Operators for Reinforcement Learning](https://arxiv.org/abs/1512.04860)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>

## Algorithm Description
### Training the network
1. Sample a batch of transitions from the replay buffer.

2. Start by calculating the initial target values in the same manner as they are calculated in DDQN:
$$ y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a)) $$
3. The action gap $V(s_t )-Q(s_t,a_t)$ should then be subtracted from each of the calculated targets. To calculate the action gap, run the target network using the current states and get the $Q$ values for all the actions. Then estimate $V$ as the maximum predicted $Q$ value for the current state:
$$ V(s_t )=max_a Q(s_t,a) $$
4. For _advantage learning (AL)_, subtract the action gap, weighted by a predefined parameter $\alpha$, from the targets $y_t^{DDQN}$:
$$ y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t )) $$
5. For _persistent advantage learning (PAL)_, the target network is also used to calculate the action gap for the next state:
$$ V(s_{t+1} )-Q(s_{t+1},a_{t+1}) $$
where $a_{t+1}$ is chosen by running the next states through the online network and choosing the action that has the highest predicted $Q$ value. Finally, the targets are defined as:
$$ y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} )) $$
6. Train the online network using the current states as inputs, and with the aforementioned targets.

7. Once in every few thousand steps, copy the weights from the online network to the target network.
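A minimal NumPy sketch of steps 3-5 for a batch (illustrative only); the DDQN targets and the target network's Q predictions are assumed to be precomputed, and the value of $\alpha$ here is an arbitrary placeholder.

```python
import numpy as np

def pal_targets(ddqn_targets, q_target_current, q_target_next, actions, next_actions, alpha=0.9):
    """Subtract the (persistent) action gap from the DDQN targets."""
    batch = np.arange(len(actions))
    # Action gap for the current state: V(s_t) - Q(s_t, a_t), both from the target network
    gap_current = q_target_current.max(axis=1) - q_target_current[batch, actions]
    # Action gap for the next state, using the action chosen by the online network
    gap_next = q_target_next.max(axis=1) - q_target_next[batch, next_actions]
    # PAL: subtract the smaller of the two gaps, weighted by alpha
    return ddqn_targets - alpha * np.minimum(gap_current, gap_next)

targets = pal_targets(ddqn_targets=np.array([1.4, 0.3]),
                      q_target_current=np.array([[0.5, 0.9], [0.2, 0.1]]),
                      q_target_next=np.array([[0.6, 0.4], [0.3, 0.7]]),
                      actions=np.array([0, 1]),
                      next_actions=np.array([1, 1]))
```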
@@ -1,68 +0,0 @@
<!-- language-all: python -->

Coach's modularity makes adding an agent a simple and clean task that involves the following steps:

1. Implement your algorithm in a new file. The agent can inherit base classes such as **ValueOptimizationAgent** or
**ActorCriticAgent**, or the more generic **Agent** base class.

    * **ValueOptimizationAgent**, **PolicyOptimizationAgent** and **Agent** are abstract classes.
    learn_from_batch() should be overridden with the desired behavior for the algorithm being implemented.
    When inheriting directly from **Agent**, choose_action() should also be overridden.


        def learn_from_batch(self, batch) -> Tuple[float, List, List]:
            """
            Given a batch of transitions, calculate their target values and update the network.

            :param batch: A list of transitions
            :return: The total loss of the training, the loss per head and the unclipped gradients
            """

        def choose_action(self, curr_state):
            """
            Choose an action to act with in the current episode being played. Different behavior might be
            exhibited when training or testing.

            :param curr_state: the current state to act upon.
            :return: the chosen action and some action value describing it (Q value, probability, etc.)
            """
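    For illustration only, here is a rough sketch of what such an override might look like. The class name `MyAlgorithmAgent` and the helpers `calculate_targets()` and `train_online_network()` are hypothetical placeholders, not part of Coach's actual API:

        from typing import List, Tuple

        class MyAlgorithmAgent(ValueOptimizationAgent):
            def learn_from_batch(self, batch) -> Tuple[float, List, List]:
                # Compute the targets for the sampled batch according to the algorithm's update rule
                targets = self.calculate_targets(batch)  # hypothetical helper

                # Train the online network on the batch and collect the losses and gradients
                total_loss, losses, unclipped_grads = self.train_online_network(batch, targets)  # hypothetical helper

                return total_loss, losses, unclipped_grads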
2. Implement your agent's specific network head, if needed, in the implementation for the framework of your choice,
for example **architectures/neon_components/heads.py**. The head will inherit the generic base class Head.
A new output type should be added to configurations.py, and a mapping between the new head and output type should
be defined in the get_output_head() function at **architectures/neon_components/general_network.py**.

3. Define a new parameters class that inherits AgentParameters.
The parameters class defines all the hyperparameters for the agent, and is initialized with 4 main components:
    * **algorithm**: A class inheriting AlgorithmParameters, which defines any algorithm-specific parameters.
    * **exploration**: A class inheriting ExplorationParameters, which defines the exploration policy parameters.
    There are several common exploration policies built in which you can use, and they are defined under
    the exploration sub-directory. You can also define your own custom exploration policy.
    * **memory**: A class inheriting MemoryParameters, which defines the memory parameters.
    There are several common memory types built in which you can use, and they are defined under the memories
    sub-directory. You can also define your own custom memory.
    * **networks**: A dictionary defining all the networks that will be used by the agent. The keys of the dictionary
    define the network names, and will be used to access each network through the agent class.
    The dictionary values are classes inheriting NetworkParameters, which define the network structure
    and parameters.

    Additionally, set the path property to return the path to your agent class in the following format:

    `<path to python module>:<name of agent class>`

    For example:

        class RainbowAgentParameters(AgentParameters):
            def __init__(self):
                super().__init__(algorithm=RainbowAlgorithmParameters(),
                                 exploration=RainbowExplorationParameters(),
                                 memory=RainbowMemoryParameters(),
                                 networks={"main": RainbowNetworkParameters()})

            @property
            def path(self):
                return 'rainbow.rainbow_agent:RainbowAgent'

4. (Optional) Define a preset using the new agent type with a given environment, and the hyperparameters that should
be used for training on that environment.
@@ -1,44 +0,0 @@
# Coach Features

## Supported Algorithms

Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into two main classes -
value optimization and policy optimization. A detailed description of those algorithms may be found in the algorithms
section.

<p style="text-align: center;">

<img src="../../img/algorithms.png" alt="Supported Algorithms" style="width: 600px;"/>

</p>

## Supported Environments

Coach supports a large number of environments which can be solved using reinforcement learning:

* **[DeepMind Control Suite](https://github.com/deepmind/dm_control)** - a set of reinforcement learning environments
powered by the MuJoCo physics engine.

* **[Blizzard Starcraft II](https://github.com/deepmind/pysc2)** - a popular strategy game which was wrapped with a
Python interface by DeepMind.

* **[ViZDoom](http://vizdoom.cs.put.edu.pl/)** - a Doom-based AI research platform for reinforcement learning
from raw visual information.

* **[CARLA](https://github.com/carla-simulator/carla)** - an open-source simulator for autonomous driving research.

* **[OpenAI Gym](https://gym.openai.com/)** - a library which consists of a set of environments, from games to robotics.
Additionally, it can be extended using the API defined by the authors.

In Coach, we support all the native environments in Gym, along with several extensions such as:

* **[Roboschool](https://github.com/openai/roboschool)** - a set of environments powered by the PyBullet engine,
that offer a free alternative to MuJoCo.

* **[Gym Extensions](https://github.com/Breakend/gym-extensions)** - a set of environments that extends Gym for
auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.)

* **[PyBullet](https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet)** - a physics engine that
includes a set of robotics environments.
@@ -1,116 +0,0 @@
# Filters

Filters are a mechanism in Coach that allows pre-processing and post-processing of the internal agent information.
There are two filter categories:

* **Input filters** - filters that process the information passed **into** the agent from the environment.
This information includes the observation and the reward. Input filters therefore allow rescaling observations,
normalizing rewards, stacking observations, etc.

* **Output filters** - filters that process the information going **out** of the agent into the environment.
This information includes the action the agent chooses to take. Output filters therefore allow converting
actions from one space into another. For example, the agent can take $N$ discrete actions, which will be mapped by
the output filter onto $N$ continuous actions.

Filters can be stacked on top of each other in order to build complex processing flows of the inputs or outputs.

<p style="text-align: center;">

<img src="../../img/filters.png" alt="Filters mechanism" style="width: 350px;"/>

</p>
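As a conceptual illustration of such stacking (framework-agnostic, deliberately not using Coach's actual filter API), a chain of filters is simply applied in order:

```python
import numpy as np

def apply_filter_stack(observation: np.ndarray, filters) -> np.ndarray:
    """Run an observation through a stack of filters, in order."""
    for f in filters:
        observation = f(observation)
    return observation

# Hypothetical input-filter stack: crop, rescale to [0, 1], then convert to grayscale
crop = lambda obs: obs[25:185, :]                      # e.g. crop a 210x160 Atari frame to 160x160
rescale = lambda obs: obs.astype(np.float32) / 255.0
to_gray = lambda obs: obs.mean(axis=-1, keepdims=True)

frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = apply_filter_stack(frame, [crop, rescale, to_gray])   # shape (160, 160, 1)
```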
## Input Filters

The input filters are separated into two categories - **observation filters** and **reward filters**.

### Observation Filters

* **ObservationClippingFilter** - Clips the observation values to a given range of values. For example, if the
observation consists of measurements in an arbitrary range, and we want to control the minimum and maximum values
of these observations, we can define a range and clip the values of the measurements.

* **ObservationCropFilter** - Crops the observation to a given crop window. For example, in Atari, the
observations are images with a shape of 210x160. Usually, we will want to crop the observation to a
square of 160x160 before rescaling it.

* **ObservationMoveAxisFilter** - Reorders the axes of the observation. This can be useful when the observation is an
image, and we want to move the channel axis to be the last axis instead of the first axis.

* **ObservationNormalizationFilter** - Normalizes the observation values with a running mean and standard deviation of
all the observations seen so far. The normalization is performed element-wise. Additionally, when working with
multiple workers, the statistics used for the normalization operation are accumulated over all the workers.

* **ObservationReductionBySubPartsNameFilter** - Allows keeping only parts of the observation, by specifying their
names. For example, the CARLA environment extracts multiple measurements that can be used by the agent, such as
speed and location. If we want to use only the speed, it can be done using this filter.

* **ObservationRescaleSizeByFactorFilter** - Rescales an image observation by some factor. For example, the image size
can be reduced by a factor of 2.

* **ObservationRescaleToSizeFilter** - Rescales an image observation to a given size. The target size does not
necessarily keep the aspect ratio of the original observation.

* **ObservationRGBToYFilter** - Converts a color image observation specified using the RGB encoding into a grayscale
image observation, by keeping only the luminance (Y) channel of the YUV encoding. This can be useful if the colors
in the original image are not relevant for solving the task at hand.

* **ObservationSqueezeFilter** - Removes redundant axes from the observation, which are axes with a dimension of 1.

* **ObservationStackingFilter** - Stacks several observations on top of each other. For image observations this will
create a 3D blob. The stacking is done in a lazy manner in order to reduce memory consumption. To achieve this,
a LazyStack object is used in order to wrap the observations in the stack. For this reason, the
ObservationStackingFilter **must** be the last filter in the input filters stack.

* **ObservationUint8Filter** - Converts a floating point observation into an unsigned 8-bit integer observation. This is
mostly useful for reducing memory consumption and is usually used for image observations. The filter will first
spread the observation values over the range 0-255 and then discretize them into integer values.

### Reward Filters

* **RewardClippingFilter** - Clips the reward values into a given range. For example, in DQN, the Atari rewards are
clipped into the range -1 to 1 in order to control the scale of the returns.

* **RewardNormalizationFilter** - Normalizes the reward values with a running mean and standard deviation of
all the rewards seen so far. When working with multiple workers, the statistics used for the normalization operation
are accumulated over all the workers.

* **RewardRescaleFilter** - Rescales the reward by a given factor. Rescaling the rewards of the environment has been
observed to have a large effect (negative or positive) on the behavior of the learning process.

## Output Filters

The output filters only process the actions.

### Action Filters

* **AttentionDiscretization** - Discretizes an **AttentionActionSpace**. The attention action space defines the actions
as choosing sub-boxes in a given box. For example, consider an image of size 100x100, where the action is choosing
a crop window of size 20x20 to attend to in the image. AttentionDiscretization allows discretizing the possible crop
windows into a finite number of options, and maps a discrete action space onto those crop windows.

* **BoxDiscretization** - Discretizes a continuous action space into a discrete action space, allowing the usage of
agents such as DQN for continuous environments such as MuJoCo. Given the number of bins to discretize into, the
original continuous action space is uniformly separated into the given number of bins, each mapped to a discrete
action index. For example, if the original action space is between -1 and 1 and 5 bins were selected, the new action
space will consist of 5 actions mapped to -1, -0.5, 0, 0.5 and 1.

* **BoxMasking** - Masks part of the action space to force the agent to work in a defined sub-space. For example,
if the original action space is between -1 and 1, then this filter can be used in order to constrain the agent's actions
to the range 0 to 1 instead. This essentially masks the range -1 to 0 from the agent.

* **PartialDiscreteActionSpaceMap** - A partial map between two countable action spaces. For example, consider an environment
with a MultiSelect action space (select multiple actions at the same time, such as jump and go right), with 8 actual
MultiSelect actions. If we want the agent to be able to select only 5 of those actions by their index (0-4), we can
map a discrete action space with 5 actions onto the 5 selected MultiSelect actions. This both allows the agent to
use regular discrete actions, and masks 3 of the actions from the agent.

* **FullDiscreteActionSpaceMap** - A full map between two countable action spaces. This works in a similar way to the
PartialDiscreteActionSpaceMap, but maps the entire source action space onto the entire target action space, without
masking any actions.

* **LinearBoxToBoxMap** - A linear mapping between two box action spaces. For example, if the action space of the
environment consists of continuous actions between 0 and 1, and we want the agent to choose actions between -1 and 1,
the LinearBoxToBoxMap can be used to map the range -1 to 1 to the range 0 to 1 in a linear way. This means that the
action -1 will be mapped to 0, the action 1 will be mapped to 1, and the rest of the actions will be linearly mapped
between those values.
@@ -1,36 +0,0 @@
# Network Design

Each agent has at least one neural network, used as the function approximator for choosing actions. The network is designed in a modular way to allow reusability in different agents. It is separated into three main parts:

* **Input Embedders** - This is the first stage of the network, meant to convert the input into a feature vector representation. It is possible to combine several instances of any of the supported embedders, in order to allow varied combinations of inputs.

  There are two main types of input embedders:

  1. Image embedder - a convolutional neural network.
  2. Vector embedder - a multi-layer perceptron.

* **Middlewares** - The middleware gets the output of the input embedders, and processes it into a different representation domain, before sending it through the output head. The goal of the middleware is to enable processing the combined outputs of several input embedders, and passing them through some extra processing. This, for instance, might include an LSTM or just a plain FC layer.

* **Output Heads** - The output head is used in order to predict the values required from the network. These might include action-values, state-values or a policy. As with the input embedders, it is possible to use several output heads in the same network. For example, the *Actor Critic* agent combines two heads - a policy head and a state-value head.
In addition, the output head defines the loss function according to the head type.

<p style="text-align: center;">

<img src="../../img/network.png" alt="Network Design" style="width: 400px;"/>

</p>
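As a conceptual sketch of this composition (a framework-agnostic toy in plain NumPy, not Coach's actual classes), the three parts chain as embedders -> middleware -> heads:

```python
import numpy as np

def embed_image(image):                 # input embedder #1: stands in for a convolutional network
    return image.reshape(-1)[:128]

def embed_vector(measurements):         # input embedder #2: stands in for a multi-layer perceptron
    return np.tanh(measurements)

def middleware(embeddings):             # e.g. an FC layer (or an LSTM) over the concatenated embeddings
    return np.concatenate(embeddings)

def value_head(features):               # output head: predicts a state value (MSE loss)
    return features.sum()

def policy_head(features):              # output head: predicts action probabilities (cross-entropy loss)
    logits = features[:4]
    return np.exp(logits) / np.exp(logits).sum()

# Forward pass of an Actor-Critic style network with two heads
observation, measurements = np.random.rand(84, 84), np.random.rand(10)
features = middleware([embed_image(observation), embed_vector(measurements)])
state_value, action_probs = value_head(features), policy_head(features)
```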
## Keeping Network Copies in Sync

Most reinforcement learning agents include more than one copy of the neural network. These copies serve as counterparts of the main network which are updated at different rates, and are often synchronized either locally or between parallel workers. For easier synchronization of those copies, a wrapper around them exposes a simplified API, which hides these complexities from the agent.

<p style="text-align: center;">

<img src="../../img/distributed.png" alt="Distributed Training" style="width: 600px;"/>

</p>
@@ -1,8 +0,0 @@
.wy-side-nav-search {
    background-color: #79a7a5;
}

.wy-nav-top {
    background: #79a7a5;
}
@@ -1,25 +0,0 @@
# What is Coach?

## Motivation

Train and evaluate reinforcement learning agents by harnessing the power of multi-core CPU processing to achieve state-of-the-art results. Provide a sandbox for easing the development process of new algorithms through a modular design and an elegant set of APIs.

## Solution

Coach is a Python environment which models the interaction between an agent and an environment in a modular way.
With Coach, it is possible to model an agent by combining various building blocks, and to train the agent on multiple environments.
The available environments allow testing the agent in different practical fields such as robotics, autonomous driving, games and more.
Coach collects statistics from the training process and supports advanced visualization techniques for debugging the agent being trained.

A blog post from the Intel® AI website can be found [here](https://ai.intel.com/reinforcement-learning-coach-intel/).

The GitHub repository is [here](https://github.com/NervanaSystems/coach).

## Design

<img src="img/design.png" alt="Coach Design" style="width: 800px;"/>
@@ -1,80 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# -*- coding: utf-8 -*-

'''
Math extension for Python-Markdown
==================================

Adds support for displaying math formulas using [MathJax](http://www.mathjax.org/).

Author: 2015, Dmitry Shachnev <mitya57@gmail.com>.
'''

import markdown


class MathExtension(markdown.extensions.Extension):
    def __init__(self, *args, **kwargs):
        self.config = {
            'enable_dollar_delimiter': [False, 'Enable single-dollar delimiter'],
            'render_to_span': [False,
                               'Render to span elements rather than script for fallback'],
        }
        super(MathExtension, self).__init__(*args, **kwargs)

    def extendMarkdown(self, md, md_globals):
        def handle_match_inline(m):
            if self.getConfig('render_to_span'):
                node = markdown.util.etree.Element('span')
                node.set('class', 'tex')
                node.text = ("\\\\(" + markdown.util.AtomicString(m.group(3)) +
                             "\\\\)")
            else:
                node = markdown.util.etree.Element('script')
                node.set('type', 'math/tex')
                node.text = markdown.util.AtomicString(m.group(3))
            return node

        def handle_match(m):
            node = markdown.util.etree.Element('script')
            node.set('type', 'math/tex; mode=display')
            if '\\begin' in m.group(2):
                node.text = markdown.util.AtomicString(m.group(2) + m.group(4) + m.group(5))
            else:
                node.text = markdown.util.AtomicString(m.group(3))
            return node

        inlinemathpatterns = (
            markdown.inlinepatterns.Pattern(r'(?<!\\|\$)(\$)([^\$]+)(\$)'),  # $...$
            markdown.inlinepatterns.Pattern(r'(?<!\\)(\\\()(.+?)(\\\))')     # \(...\)
        )
        mathpatterns = (
            markdown.inlinepatterns.Pattern(r'(?<!\\)(\$\$)([^\$]+)(\$\$)'),  # $$...$$
            markdown.inlinepatterns.Pattern(r'(?<!\\)(\\\[)(.+?)(\\\])'),     # \[...\]
            markdown.inlinepatterns.Pattern(r'(?<!\\)(\\begin{([a-z]+?\*?)})(.+?)(\\end{\3})')
        )
        if not self.getConfig('enable_dollar_delimiter'):
            inlinemathpatterns = inlinemathpatterns[1:]
        for i, pattern in enumerate(inlinemathpatterns):
            pattern.handleMatch = handle_match_inline
            md.inlinePatterns.add('math-inline-%d' % i, pattern, '<escape')
        for i, pattern in enumerate(mathpatterns):
            pattern.handleMatch = handle_match
            md.inlinePatterns.add('math-%d' % i, pattern, '<escape')


def makeExtension(*args, **kwargs):
    return MathExtension(*args, **kwargs)
@@ -1,42 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

#!/usr/bin/env python3

from distutils.core import setup

long_description = \
"""This extension adds math formulas support to Python-Markdown_
(works with version 2.6 or newer).

.. _Python-Markdown: https://github.com/waylan/Python-Markdown

You can find the source on GitHub_.
Please refer to the `README file`_ for details on how to use it.

.. _GitHub: https://github.com/mitya57/python-markdown-math
.. _`README file`: https://github.com/mitya57/python-markdown-math/blob/master/README.md
"""

setup(name='python-markdown-math',
      description='Math extension for Python-Markdown',
      long_description=long_description,
      author='Dmitry Shachnev',
      author_email='mitya57@gmail.com',
      version='0.2',
      url='https://github.com/mitya57/python-markdown-math',
      py_modules=['mdx_math'],
      license='BSD')
@@ -1,133 +0,0 @@
|
||||
# Coach Usage
|
||||
|
||||
## Training an Agent
|
||||
|
||||
### Single-threaded Algorithms
|
||||
|
||||
This is the most common case. Just choose a preset using the `-p` flag and press enter.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p CartPole_DQN`
|
||||
|
||||
### Multi-threaded Algorithms
|
||||
|
||||
Multi-threaded algorithms are very common these days.
|
||||
They typically achieve the best results, and scale gracefully with the number of threads.
|
||||
In Coach, running such algorithms is done by selecting a suitable preset, and choosing the number of threads to run using the `-n` flag.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p CartPole_A3C -n 8`
|
||||
|
||||
## Evaluating an Agent
|
||||
|
||||
There are several options for evaluating an agent during the training:
|
||||
|
||||
* For multi-threaded runs, an evaluation agent will constantly run in the background and evaluate the model during the training.
|
||||
|
||||
* For single-threaded runs, it is possible to define an evaluation period through the preset. This will run several episodes of evaluation once in a while.
|
||||
|
||||
Additionally, it is possible to save checkpoints of the agents networks and then run only in evaluation mode.
|
||||
Saving checkpoints can be done by specifying the number of seconds between storing checkpoints using the `-s` flag.
|
||||
The checkpoints will be saved into the experiment directory.
|
||||
Loading a model for evaluation can be done by specifying the `-crd` flag with the experiment directory, and the `--evaluate` flag to disable training.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p CartPole_DQN -s 60`
|
||||
`python coach.py -p CartPole_DQN --evaluate -crd CHECKPOINT_RESTORE_DIR`
|
||||
|
||||
## Playing with the Environment as a Human
|
||||
|
||||
Interacting with the environment as a human can be useful for understanding its difficulties and for collecting data for imitation learning.
|
||||
In Coach, this can be easily done by selecting a preset that defines the environment to use, and specifying the `--play` flag.
|
||||
When the environment is loaded, the available keyboard buttons will be printed to the screen.
|
||||
Pressing the escape key when finished will end the simulation and store the replay buffer in the experiment dir.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p Breakout_DQN --play`
|
||||
|
||||
## Learning Through Imitation Learning
|
||||
|
||||
Learning through imitation of human behavior is a nice way to speed up learning.
|
||||
In Coach, this can be done in two steps -
|
||||
|
||||
1. Create a dataset of demonstrations by playing with the environment as a human.
|
||||
After this step, a pickle of the replay buffer containing your game play will be stored in the experiment directory.
|
||||
The path to this replay buffer will be printed to the screen.
|
||||
To do so, you should select an environment type and level through the command line, and specify the `--play` flag.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -et Doom -lvl Basic --play`
|
||||
|
||||
|
||||
2. Next, use an imitation learning preset and set the replay buffer path accordingly.
|
||||
The path can be set either from the command line or from the preset itself.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p Doom_Basic_BC -cp='agent.load_memory_from_file_path=\"<experiment dir>/replay_buffer.p\"'`
|
||||
|
||||
|
||||
## Visualizations
|
||||
|
||||
### Rendering the Environment
|
||||
|
||||
Rendering the environment can be done by using the `-r` flag.
|
||||
When working with multi-threaded algorithms, the rendered image represents the game play of the evaluation worker.
When working with single-threaded algorithms, the rendered image represents the single worker, which can be either training or evaluating.
|
||||
Keep in mind that rendering the environment in single-threaded algorithms may slow the training to some extent.
|
||||
When playing with the environment using the `--play` flag, the environment will be rendered automatically without the need for specifying the `-r` flag.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p Breakout_DQN -r`
|
||||
|
||||
### Dumping GIFs
|
||||
|
||||
Coach allows storing GIFs of the agent game play.
|
||||
To dump GIF files, use the `-dg` flag.
|
||||
The files are dumped after every evaluation episode, and are saved into the experiment directory, under a gifs sub-directory.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p Breakout_A3C -n 4 -dg`
|
||||
|
||||
## Switching between deep learning frameworks
|
||||
|
||||
Coach uses TensorFlow as its main backend framework, but it also supports neon for some of the algorithms.
|
||||
By default, TensorFlow will be used. It is possible to switch to neon using the `-f` flag.
|
||||
|
||||
*Example:*
|
||||
|
||||
`python coach.py -p Doom_Basic_DQN -f neon`
|
||||
|
||||
## Additional Flags
|
||||
|
||||
There are several convenient flags which are important to know about.
|
||||
Here we will list most of the flags, but these can be updated from time to time.
|
||||
The most up to date description can be found by using the `-h` flag.
|
||||
|
||||
|
||||
|Flag |Type |Description |
|
||||
|-------------------------------|----------|--------------|
|
||||
|`-p PRESET`, `--preset PRESET`|string |Name of a preset to run (as configured in presets.py) |
|
||||
|`-l`, `--list` |flag |List all available presets|
|
||||
|`-e EXPERIMENT_NAME`, `--experiment_name EXPERIMENT_NAME`|string|Experiment name to be used to store the results.|
|
||||
|`-r`, `--render` |flag |Render environment|
|
||||
|`-f FRAMEWORK`, `--framework FRAMEWORK`|string|Neural network framework. Available values: tensorflow, neon|
|
||||
|`-n NUM_WORKERS`, `--num_workers NUM_WORKERS`|int|Number of workers for multi-process based agents, e.g. A3C|
|
||||
|`--play` |flag |Play as a human by controlling the game with the keyboard. This option will save a replay buffer with the game play.|
|
||||
|`--evaluate` |flag |Run evaluation only. This is a convenient way to disable training in order to evaluate an existing checkpoint.|
|
||||
|`-v`, `--verbose` |flag |Don't suppress TensorFlow debug prints.|
|
||||
|`-s SAVE_MODEL_SEC`, `--save_model_sec SAVE_MODEL_SEC`|int|Time in seconds between saving checkpoints of the model.|
|
||||
|`-crd CHECKPOINT_RESTORE_DIR`, `--checkpoint_restore_dir CHECKPOINT_RESTORE_DIR`|string|Path to a folder containing a checkpoint to restore the model from.|
|
||||
|`-dg`, `--dump_gifs` |flag |Enable the gif saving functionality.|
|
||||
|`-at AGENT_TYPE`, `--agent_type AGENT_TYPE`|string|Choose an agent type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command-line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|
||||
|`-et ENVIRONMENT_TYPE`, `--environment_type ENVIRONMENT_TYPE`|string|Choose an environment type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command-line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|
||||
|`-ept EXPLORATION_POLICY_TYPE`, `--exploration_policy_type EXPLORATION_POLICY_TYPE`|string|Choose an exploration policy type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command-line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|
||||
|`-lvl LEVEL`, `--level LEVEL` |string|Choose the level that will be played in the environment that was selected. This value will override the level parameter in the environment class.|
|
||||
|`-cp CUSTOM_PARAMETER`, `--custom_parameter CUSTOM_PARAMETER`|string| Semicolon separated parameters used to override specific parameters on top of the selected preset (or on top of the command-line assembled one). Whenever a parameter value is a string, it should be inputted as `'\"string\"'`. For ex.: `"visualization.render=False;` `num_training_iterations=500;` `optimizer='rmsprop'"`|
|
||||
@@ -1,37 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2017 Intel Corporation
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
|
||||
import os, fnmatch, sys
|
||||
def findReplace(directory, find, replace, filePattern):
|
||||
for path, dirs, files in os.walk(os.path.abspath(directory)):
|
||||
for filename in fnmatch.filter(files, filePattern):
|
||||
filepath = os.path.join(path, filename)
|
||||
with open(filepath) as f:
|
||||
s = f.read()
|
||||
s = s.replace(find, replace)
|
||||
with open(filepath, "w") as f:
|
||||
f.write(s)
|
||||
|
||||
if __name__=="__main__":
|
||||
findReplace('./site/', '/"', '/index.html"', "*.html")
|
||||
findReplace('./site/', '"/index.html"', '"./index.html"', "*.html")
|
||||
findReplace('./site/', '"."', '"./index.html"', "*.html")
|
||||
findReplace('./site/', '".."', '"../index.html"', "*.html")
|
||||
findReplace('./site/', '/"', '/index.html"', "search_index.json")
|
||||
findReplace('./site/', '/#', '/index.html#', "search_index.json")
|
||||
findReplace('./site/assets/javascripts/', 'search_index.json', 'search_index.txt', "*.js")
|
||||
findReplace('./site/mkdocs/js/', 'search_index.json', 'search_index.txt', "search.js")
|
||||
os.rename("./site/mkdocs/search_index.json", "./site/mkdocs/search_index.txt")
|
||||
35
docs_raw/make.bat
Normal file
@@ -0,0 +1,35 @@
|
||||
@ECHO OFF
|
||||
|
||||
pushd %~dp0
|
||||
|
||||
REM Command file for Sphinx documentation
|
||||
|
||||
if "%SPHINXBUILD%" == "" (
|
||||
set SPHINXBUILD=sphinx-build
|
||||
)
|
||||
set SOURCEDIR=source
|
||||
set BUILDDIR=build
|
||||
|
||||
if "%1" == "" goto help
|
||||
|
||||
%SPHINXBUILD% >NUL 2>NUL
|
||||
if errorlevel 9009 (
|
||||
echo.
|
||||
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
|
||||
echo.installed, then set the SPHINXBUILD environment variable to point
|
||||
echo.to the full path of the 'sphinx-build' executable. Alternatively you
|
||||
echo.may add the Sphinx directory to PATH.
|
||||
echo.
|
||||
echo.If you don't have Sphinx installed, grab it from
|
||||
echo.http://sphinx-doc.org/
|
||||
exit /b 1
|
||||
)
|
||||
|
||||
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
|
||||
goto end
|
||||
|
||||
:help
|
||||
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
|
||||
|
||||
:end
|
||||
popd
|
||||
@@ -1,44 +0,0 @@
|
||||
site_name: Reinforcement Learning Coach
|
||||
theme: readthedocs
|
||||
site_description: 'Reinforcement Learning Coach by Intel Nervana.'
|
||||
markdown_extensions:
|
||||
- mdx_math:
|
||||
enable_dollar_delimiter: True #for use of inline $..$
|
||||
|
||||
extra_javascript: ['https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML']
|
||||
extra_css: [extra.css]
|
||||
|
||||
pages:
|
||||
- Home : index.md
|
||||
- Usage: usage.md
|
||||
- Design:
|
||||
- 'Features' : design/features.md
|
||||
- 'Control Flow' : design/control_flow.md
|
||||
- 'Network' : design/network.md
|
||||
- 'Filters' : design/filters.md
|
||||
- API Reference:
|
||||
- 'Agent Parameters' : api_reference/agent_parameters/agent_parameters.md
|
||||
- Algorithms:
|
||||
- 'DQN' : algorithms/value_optimization/dqn.md
|
||||
- 'Double DQN' : algorithms/value_optimization/double_dqn.md
|
||||
- 'Dueling DQN' : algorithms/value_optimization/dueling_dqn.md
|
||||
- 'Categorical DQN' : algorithms/value_optimization/categorical_dqn.md
|
||||
- 'Mixed Monte Carlo' : algorithms/value_optimization/mmc.md
|
||||
- 'Persistent Advantage Learning' : algorithms/value_optimization/pal.md
|
||||
- 'Neural Episodic Control' : algorithms/value_optimization/nec.md
|
||||
- 'Bootstrapped DQN' : algorithms/value_optimization/bs_dqn.md
|
||||
- 'N-Step Q Learning' : algorithms/value_optimization/n_step.md
|
||||
- 'Normalized Advantage Functions' : algorithms/value_optimization/naf.md
|
||||
- 'Policy Gradient' : algorithms/policy_optimization/pg.md
|
||||
- 'Actor-Critic' : algorithms/policy_optimization/ac.md
|
||||
- 'Deep Deterministic Policy Gradients' : algorithms/policy_optimization/ddpg.md
|
||||
- 'Proximal Policy Optimization' : algorithms/policy_optimization/ppo.md
|
||||
- 'Clipped Proximal Policy Optimization' : algorithms/policy_optimization/cppo.md
|
||||
- 'Direct Future Prediction' : algorithms/other/dfp.md
|
||||
- 'Behavioral Cloning' : algorithms/imitation/bc.md
|
||||
|
||||
- Coach Dashboard : 'dashboard.md'
|
||||
- Contributing :
|
||||
- Adding a New Agent : 'contributing/add_agent.md'
|
||||
- Adding a New Environment : 'contributing/add_env.md'
|
||||
|
||||
61
docs_raw/source/_static/css/custom.css
Normal file
@@ -0,0 +1,61 @@
|
||||
/* Docs background */
|
||||
.wy-side-nav-search{
|
||||
background-color: #043c74;
|
||||
}
|
||||
|
||||
/* Mobile version */
|
||||
.wy-nav-top{
|
||||
background-color: #043c74;
|
||||
}
|
||||
|
||||
|
||||
.green {
|
||||
color: green;
|
||||
}
|
||||
|
||||
.red {
|
||||
color: red;
|
||||
}
|
||||
|
||||
.blue {
|
||||
color: blue;
|
||||
}
|
||||
|
||||
.yellow {
|
||||
color: yellow;
|
||||
}
|
||||
|
||||
.badge {
|
||||
border: 2px;
|
||||
border-style: solid;
|
||||
border-color: #6C8EBF;
|
||||
border-radius: 5px;
|
||||
padding: 3px 15px 3px 15px;
|
||||
margin: 5px;
|
||||
display: inline-block;
|
||||
font-weight: bold;
|
||||
font-size: 16px;
|
||||
background: #DAE8FC;
|
||||
}
|
||||
|
||||
.badge:hover {
|
||||
cursor: pointer;
|
||||
}
|
||||
|
||||
.badge > a {
|
||||
color: black;
|
||||
}
|
||||
|
||||
.bordered-container {
|
||||
border: 0px;
|
||||
border-style: solid;
|
||||
border-radius: 8px;
|
||||
padding: 15px;
|
||||
margin-bottom: 20px;
|
||||
background: #f2f2f2;
|
||||
}
|
||||
|
||||
.questionnaire {
|
||||
font-size: 1.2em;
|
||||
line-height: 1.5em;
|
||||
}
|
||||
|
Before Width: | Height: | Size: 49 KiB After Width: | Height: | Size: 49 KiB |
BIN
docs_raw/source/_static/img/algorithms.png
Normal file
|
After Width: | Height: | Size: 50 KiB |
BIN
docs_raw/source/_static/img/attention_discretization.png
Normal file
|
After Width: | Height: | Size: 16 KiB |
|
Before Width: | Height: | Size: 310 KiB After Width: | Height: | Size: 310 KiB |
BIN
docs_raw/source/_static/img/box_discretization.png
Normal file
|
After Width: | Height: | Size: 17 KiB |
BIN
docs_raw/source/_static/img/box_masking.png
Normal file
|
After Width: | Height: | Size: 12 KiB |
|
Before Width: | Height: | Size: 449 KiB After Width: | Height: | Size: 449 KiB |
|
Before Width: | Height: | Size: 447 KiB After Width: | Height: | Size: 447 KiB |
BIN
docs_raw/source/_static/img/dark_logo.png
Normal file
|
After Width: | Height: | Size: 37 KiB |
BIN
docs_raw/source/_static/img/design.png
Normal file
|
After Width: | Height: | Size: 106 KiB |
|
Before Width: | Height: | Size: 30 KiB After Width: | Height: | Size: 30 KiB |
|
Before Width: | Height: | Size: 13 KiB After Width: | Height: | Size: 13 KiB |
BIN
docs_raw/source/_static/img/design_imgs/cil.png
Normal file
|
After Width: | Height: | Size: 27 KiB |
|
Before Width: | Height: | Size: 23 KiB After Width: | Height: | Size: 23 KiB |
|
Before Width: | Height: | Size: 37 KiB After Width: | Height: | Size: 37 KiB |
|
Before Width: | Height: | Size: 17 KiB After Width: | Height: | Size: 17 KiB |
|
Before Width: | Height: | Size: 14 KiB After Width: | Height: | Size: 14 KiB |
|
Before Width: | Height: | Size: 25 KiB After Width: | Height: | Size: 25 KiB |
|
Before Width: | Height: | Size: 33 KiB After Width: | Height: | Size: 33 KiB |
|
Before Width: | Height: | Size: 47 KiB After Width: | Height: | Size: 47 KiB |
|
Before Width: | Height: | Size: 14 KiB After Width: | Height: | Size: 14 KiB |
|
Before Width: | Height: | Size: 38 KiB After Width: | Height: | Size: 38 KiB |
BIN
docs_raw/source/_static/img/design_imgs/qr_dqn.png
Normal file
|
After Width: | Height: | Size: 26 KiB |
BIN
docs_raw/source/_static/img/design_imgs/rainbow.png
Normal file
|
After Width: | Height: | Size: 37 KiB |
1
docs_raw/source/_static/img/diagrams.xml
Normal file
|
Before Width: | Height: | Size: 35 KiB After Width: | Height: | Size: 35 KiB |
|
Before Width: | Height: | Size: 21 KiB After Width: | Height: | Size: 21 KiB |
BIN
docs_raw/source/_static/img/full_discrete_action_space_map.png
Normal file
|
After Width: | Height: | Size: 20 KiB |
|
Before Width: | Height: | Size: 29 KiB After Width: | Height: | Size: 29 KiB |
|
Before Width: | Height: | Size: 32 KiB After Width: | Height: | Size: 32 KiB |
|
Before Width: | Height: | Size: 24 KiB After Width: | Height: | Size: 24 KiB |
BIN
docs_raw/source/_static/img/linear_box_to_box_map.png
Normal file
|
After Width: | Height: | Size: 12 KiB |
|
Before Width: | Height: | Size: 28 KiB After Width: | Height: | Size: 28 KiB |
|
Before Width: | Height: | Size: 40 KiB After Width: | Height: | Size: 40 KiB |
1
docs_raw/source/_static/img/output_filters.xml
Normal file
@@ -0,0 +1 @@
|
||||
<mxfile userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36" version="9.3.0" editor="www.draw.io" type="device"><diagram id="44beadd6-aa91-fdea-8231-a495e8c32fb3" name="Page-1">7Z1Nc+I4EIZ/DUdSluQvjklmdvewUzVVc9jZoxcU8I7BlHEmZH/9ytgyWN0MDrFkR4bDDBG2wI9eS61Wqz1hj+v971m0XX1JFzyZUGexn7BPE0oJmTniv6LktSoJCCtLllm8qMqOBd/i/3hVWJ24fI4XfNc4ME/TJI+3zcJ5utnwed4oi7IsfWke9pQmzW/dRksOCr7NowSW/hUv8lVZGnrOsfwPHi9X8puJU33yTzT/sczS5031fRPKng6v8uN1JOuqjt+tokX6clLEPk/YY5amefluvX/kSQFXYivP++3Mp/Xvzvgmb3MCZW55ys8oeebyNx9+Wf4qaRyuhxdnOBP28LKKc/5tG82LT1+EAETZKl8n4i8i3u7yLP3BH9MkzUTJJt2Iwx6e4iSRRUckojzd5FXrC4Wwh+q38Czn+7NXRGpOQoA8XfM8exWH1Oqr0Fbam3puVfBybEkvqMpWJ63IZCNGlXqWdeVHguJNBfFMC3ihZUBnCk8StORJw054BpbxDJs8a70akqdvN06Xem3V6XWB07MbZ0Ba4qxl/D6cto1GQRNnbUEY6TuDFkPRBX7RblvaV0/xvmCusns8vADoSdFbFa9uKLoebWCkSJ9JfEyVfhcYW4xA+jEW4jwpL1968HrEN4iXdk+3AyaENe9cs5JjgImo9NEBYMT15W/s8KqiKImXG/HnXPDgovyhoBWLGdB99cE6XiyKr0F72GYf3AVuxWqvp5AnuF2sn+wANhxziN2w1VHJJGxoLxXKJvbCVpXNkOmTLtjQ1id2w1aVbRI2tBIe0r0ouJ/ncboRb8qLHyb7g+lb/SinC9XLiZHsYtgMDp9EU0sQOHx+infzjOd8rO3h+u5dixZhnqYWgWMsHF/FKfF2d46PZnvPdZUphgN7DqpLr3BQhDZwr3R8p0c6cBTzh0VH1Q7zDdKBc/xwWHRU7Zikg0wuofXTKx2P9ndnIdNMNiw6oFcODdKBY5Y7LDpAOybpwDHLGxYdMGaZpAPHrGBYdFTtmOyV26w571bRtnj7lPD9fbGiLy6bbxbV20/zJNrt4nmTk+JU9Q+vcwZ8+Y18AQIBLhI89d0jhGRZxpMoj382q8ewVd/wNY3FF9cNxGQ9tXxnzSp26XM259VZR/igIpdeqCiPsiXPQUWHVqwvu13DtlhcHIAv16x/m0HHwMicue4M8tblhWHQHh6ZN9co7RmqbYs9jKq2/cAcbVnveP25RmkTQPvm0D32Moi9qM2hK2eBN4PxVwajauXM3GYVrQ1G50JF3RmMLvRAgIbtY3LUwmAMKWxJWfYurSOu8sJ1b7MJ4/VoMLrQkWG5wejNaH+04QyxUvcddB/ZQ5yoPWpojjicfhLreSsKN8obTkBHbDa6MniyF7MRnZ0q3Hv1whLFNxcY9FFLs3O46xsgGN0kHTj7g6Zpv3ScHunACdnA1lVr120fdAa/rlqPfn3Qua3+vH0yH1y9+gN6CW2rP8jOx7GH3RGXKmF3AbIDSFvYnafXADq3iaUDcPU2vh46KH/whpEX9khn8IaR5/RIR69hpPOO83ukNniDyXN7pAMNprGPq64b9Dmu+sguJtt99ETJA2Fy05iPeI1t92EquLHoRW24kYgl+530qsCNEj8TtWQz7z4VLsNKgMItjqVR9W1yb2QIB8wvz0keF2YLTwprbaSGjBqaGiI3ARqa2kmrwF5nYOsj6lRuhmRh0GV2z+BE92Z2q2Y3qldtu0gZMpdeLLm8xDTLV+ky3UTJ52OpguSkRfg+zr+fvP+7OERcX+FqFr/se3XG4Y/jZ4totzrUVpz2L8/z1yplUPScp6Lo+CP+TNPt5Ne+6fJyLruhxSUf/L2VMqvJcem6Pb2XYcu19kO3vi3Q3CRKqxyZk8sdxy9y3pxJ1NRFx+te7lmIrq6FMMRlNkYhu1DIxDnTcN0rGU388vGVjOS71KhkxFE3RiV7mJKpKSW3SSDwAZWMzFA0KhlZ5Bmjkn1MycyUklFPnxNtFuJfG3pn4iAb/DWKGtmUPUZRB5ioXVOiRneAVqK2oaMW/YNRUSP5X8Yo6hATtWdK1KjL2ipRY7HK+kSNJIgYo6hnmKh9U6LG9zOXtodV2kZWb/VpW1puww0XBo5mZBKtDw8FeBD/0KDwIDMzfXiGHx3ktDDy9fEZfqIqwAexF/XxGXyqKsjHZAZBZ/AJFiEfZPjSx0d//v96rDf6GIVpALdF6suDRBwNj6PogApVNudSwCREkHTybCMHNzdHFEU4pfBG1hWEUu8zHc9WfzV/jlnc0Op3xhViFcK+RB9tOIkgdtMGD1cySRvOSUa8y19NDjWtF9OMpPtHMoRPC+1PLRa/K0NIJHKDKS4IwXOK2MxbTSlilje0Ey3v21XcJvt2Cs3EEfftagYXw3078lidqeXaV7t2o9qHPQ0AfZKIoQLYIguDXJCqFqHkklRjQeq4PiU/ayxIlT/krQtKtWHcWFEqZQWboJ/c3vW84b3ZHYirVNRddod6p0L30jgKwHH8hgTuHEJUGRw+/cqzWPz44m4tRSTXO0UVwYnApkWdYVWgnnSloBx03b1XQdU53WXCsyvl5JO7wHcDxwtDIVLarHRG70SRK6y/8hBfn9Sgp/B+yat6h9fnnw650ufR7ZDrmRxykcTHnzc/4yzdrMfTBnBKa7YNoH9hzHdAPQMyAx8uqN5uAMNNAM3+MevfxfZ06IMPh9+b/g03AZyHjUr/avZGLL5YG3wkM+pN/4abANqgY9Z/aDJtMkGyu970b7QJKBIZArh3lTxOb8yRGvDhsvpJ1EbCjigS9fFulF3oS+b86Q8MHp9hcThMbdnWzD2MuS7vOj0To2Exceb0TBzNVmHz6hHQOA3MEkezKthMHGjcNHEY3z7i9WoiE9TI1vAcDxtXtZluSDTS2PNGeY6aBv1cq+hKHUWRkCU47Pa7yYGqlrIDVxx0bXKgBPr8oJk8LD6MQgFp40OhsTywLYzq8+7N6keqZbib9AAfs/pp8Wj021NPQsWSYl49Srw5lsG7XFd3IQsUiVOzPiU4pf1O7iiav9TurMlEZc58o8yRKDBiO3PWN3PEUWd7tmrQtxieVMsY3JEuQHlEna9Rs7NoBh13I1yDAr4M063QYsv2VTG/B1JlbiHCJueDwE9zGTVDx6+PCK9XhE4DeGm/EeGeOrdVA7nbmr3EUWtSY8s7NHqZrt0CZ+VBJiBpVS2PiZrq6kp5MFQeZ/JrG4rvVicz9UOh3qoPUJNLlZo61AcSSfwRu4/eOgWwe/9oA7213UFdJ8vNOloemk8fs
Wfoq+V9MB54V7c8iMBA6rq65cWfWZrmp4cLw2j1JV3w4oj/AQ==</diagram></mxfile>
|
||||
|
After Width: | Height: | Size: 8.4 KiB |
|
Before Width: | Height: | Size: 398 KiB After Width: | Height: | Size: 398 KiB |
|
Before Width: | Height: | Size: 39 KiB After Width: | Height: | Size: 39 KiB |
|
Before Width: | Height: | Size: 1.4 MiB After Width: | Height: | Size: 1.4 MiB |
4
docs_raw/source/_templates/layout.html
Normal file
@@ -0,0 +1,4 @@
|
||||
{% extends "!layout.html" %}
|
||||
{% block extrahead %}
|
||||
<link href="{{ pathto("_static/css/custom.css", True) }}" rel="stylesheet" type="text/css">
|
||||
{% endblock %}
|
||||
18
docs_raw/source/components/additional_parameters.rst
Normal file
@@ -0,0 +1,18 @@
|
||||
Additional Parameters
|
||||
=====================
|
||||
|
||||
VisualizationParameters
|
||||
-----------------------
|
||||
.. autoclass:: rl_coach.base_parameters.VisualizationParameters
|
||||
|
||||
PresetValidationParameters
|
||||
--------------------------
|
||||
.. autoclass:: rl_coach.base_parameters.PresetValidationParameters
|
||||
|
||||
TaskParameters
|
||||
--------------
|
||||
.. autoclass:: rl_coach.base_parameters.TaskParameters
|
||||
|
||||
DistributedTaskParameters
|
||||
-------------------------
|
||||
.. autoclass:: rl_coach.base_parameters.DistributedTaskParameters
|
||||
29
docs_raw/source/components/agents/imitation/bc.rst
Normal file
@@ -0,0 +1,29 @@
|
||||
Behavioral Cloning
|
||||
==================
|
||||
|
||||
**Actions space:** Discrete | Continuous
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/pg.png
|
||||
:align: center
|
||||
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
The replay buffer contains the expert demonstrations for the task.
|
||||
These demonstrations are given as state, action tuples, and with no reward.
|
||||
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
|
||||
the expert for each state.
|
||||
|
||||
1. Sample a batch of transitions from the replay buffer.
|
||||
2. Use the current states as input to the network, and the expert actions as the targets of the network.
|
||||
3. For the network head, we use the policy head, which uses the cross entropy loss function.
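As an illustration of the steps above, here is a minimal NumPy sketch of the discrete-action case, assuming the policy head already outputs action probabilities (the function and argument names are illustrative, not part of the Coach API):

.. code-block:: python

    import numpy as np

    def bc_cross_entropy_loss(predicted_probs, expert_actions):
        """Mean cross entropy between the policy head output and the expert actions.

        predicted_probs: (batch, num_actions) action probabilities predicted by the policy head.
        expert_actions:  (batch,) integer indices of the actions taken by the expert.
        """
        batch_size = predicted_probs.shape[0]
        # probability the network assigned to each expert action
        likelihood = predicted_probs[np.arange(batch_size), expert_actions]
        return -np.mean(np.log(likelihood + 1e-10))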
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.bc_agent.BCAlgorithmParameters
|
||||
36
docs_raw/source/components/agents/imitation/cil.rst
Normal file
@@ -0,0 +1,36 @@
|
||||
Conditional Imitation Learning
|
||||
==============================
|
||||
|
||||
**Actions space:** Discrete | Continuous
|
||||
|
||||
**References:** `End-to-end Driving via Conditional Imitation Learning <https://arxiv.org/abs/1710.02410>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/cil.png
|
||||
:align: center
|
||||
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
The replay buffer contains the expert demonstrations for the task.
|
||||
These demonstrations are given as state, action tuples, and with no reward.
|
||||
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
|
||||
the expert for each state.
|
||||
In conditional imitation learning, each transition is assigned a class, which determines the goal that was pursued
in that transition. For example, 3 possible classes can be: turn right, turn left and follow lane.
|
||||
|
||||
1. Sample a batch of transitions from the replay buffer, where the batch is balanced, meaning that an equal number
|
||||
of transitions will be sampled from each class index.
|
||||
2. Use the current states as input to the network, and assign the expert actions as the targets of the network heads
|
||||
corresponding to the state classes. For the other heads, set the targets to match the currently predicted values,
|
||||
so that the loss for the other heads will be zeroed out.
|
||||
3. We use a regression head that minimizes the MSE loss between the network's predicted values and the target values (a minimal sketch of the target construction is given below).
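A minimal NumPy sketch of this target construction, so that only the head matching each transition's class contributes to the loss (the array shapes and names are assumptions for illustration, not the Coach implementation):

.. code-block:: python

    import numpy as np

    def cil_targets(current_predictions, expert_actions, classes):
        """Build regression targets for all heads of a branched network.

        current_predictions: (batch, num_classes, action_dim) current outputs of all heads.
        expert_actions:      (batch, action_dim) expert actions for each transition.
        classes:             (batch,) class index of each transition (e.g. turn left / right / follow lane).
        """
        # heads that do not match the transition's class keep their own predictions,
        # so their MSE loss is zeroed out
        targets = current_predictions.copy()
        # only the head matching the class is supervised with the expert action
        targets[np.arange(len(classes)), classes] = expert_actions
        return targets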
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.cil_agent.CILAlgorithmParameters
|
||||
43
docs_raw/source/components/agents/index.rst
Normal file
@@ -0,0 +1,43 @@
|
||||
Agents
|
||||
======
|
||||
|
||||
Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into three main classes -
|
||||
value optimization, policy optimization and imitation learning.
|
||||
A detailed description of those algorithms can be found by navigating to each of the algorithm pages.
|
||||
|
||||
.. image:: /_static/img/algorithms.png
|
||||
:width: 600px
|
||||
:align: center
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: Agents
|
||||
|
||||
policy_optimization/ac
|
||||
imitation/bc
|
||||
value_optimization/bs_dqn
|
||||
value_optimization/categorical_dqn
|
||||
imitation/cil
|
||||
policy_optimization/cppo
|
||||
policy_optimization/ddpg
|
||||
other/dfp
|
||||
value_optimization/double_dqn
|
||||
value_optimization/dqn
|
||||
value_optimization/dueling_dqn
|
||||
value_optimization/mmc
|
||||
value_optimization/n_step
|
||||
value_optimization/naf
|
||||
value_optimization/nec
|
||||
value_optimization/pal
|
||||
policy_optimization/pg
|
||||
policy_optimization/ppo
|
||||
value_optimization/rainbow
|
||||
value_optimization/qr_dqn
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.base_parameters.AgentParameters
|
||||
|
||||
.. autoclass:: rl_coach.agents.agent.Agent
|
||||
:members:
|
||||
:inherited-members:
|
||||
|
||||
39
docs_raw/source/components/agents/other/dfp.rst
Normal file
@@ -0,0 +1,39 @@
|
||||
Direct Future Prediction
|
||||
========================
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Learning to Act by Predicting the Future <https://arxiv.org/abs/1611.01779>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/dfp.png
|
||||
:width: 600px
|
||||
:align: center
|
||||
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action
|
||||
++++++++++++++++++
|
||||
|
||||
1. The current states (observations and measurements) and the corresponding goal vector are passed as an input to the network.
|
||||
The output of the network is the predicted future measurements for time-steps :math:`t+1,t+2,t+4,t+8,t+16` and
|
||||
:math:`t+32` for each possible action.
|
||||
2. For each action, the measurements of each predicted time-step are multiplied by the goal vector,
|
||||
and the result is a single vector of future values for each action.
|
||||
3. Then, a weighted sum of the future values of each action is calculated, and the result is a single value for each action.
|
||||
4. The action values are passed to the exploration policy to decide on the action to use.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
Given a batch of transitions, run them through the network to get the current predictions of the future measurements
|
||||
per action, and set them as the initial targets for training the network. For each transition
|
||||
:math:`(s_t,a_t,r_t,s_{t+1} )` in the batch, the target of the network for the action that was taken, is the actual
|
||||
measurements that were seen in time-steps :math:`t+1,t+2,t+4,t+8,t+16` and :math:`t+32`.
|
||||
For the actions that were not taken, the targets are the current values.
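A minimal NumPy sketch of this target construction, assuming the network output is arranged as a tensor of predicted future measurements per action (the names and shapes below are illustrative, not the Coach API):

.. code-block:: python

    import numpy as np

    def dfp_targets(current_predictions, actions, observed_future_measurements):
        """Build training targets for a batch of transitions.

        current_predictions:          (batch, num_actions, num_offsets, num_measurements) network output.
        actions:                      (batch,) actions that were actually taken.
        observed_future_measurements: (batch, num_offsets, num_measurements) measurements actually seen
                                      at t+1, t+2, t+4, t+8, t+16 and t+32.
        """
        # actions that were not taken keep their current predictions as targets
        targets = current_predictions.copy()
        # the taken action is supervised with the measurements that were actually observed
        targets[np.arange(len(actions)), actions] = observed_future_measurements
        return targets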
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.dfp_agent.DFPAlgorithmParameters
|
||||
40
docs_raw/source/components/agents/policy_optimization/ac.rst
Normal file
@@ -0,0 +1,40 @@
|
||||
Actor-Critic
|
||||
============
|
||||
|
||||
**Actions space:** Discrete | Continuous
|
||||
|
||||
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/ac.png
|
||||
:width: 500px
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
|
||||
Choosing an action - Discrete actions
|
||||
+++++++++++++++++++++++++++++++++++++
|
||||
|
||||
The policy network is used to predict action probabilities. While training, an action is sampled from a categorical
distribution parameterized by these probabilities. When testing, the action with the highest probability is used.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
A batch of :math:`T_{max}` transitions is used, and the advantages are calculated upon it.
|
||||
|
||||
Advantages can be calculated by either of the following methods (configured by the selected preset) -
|
||||
|
||||
1. **A_VALUE** - Estimating advantage directly:
|
||||
:math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`
|
||||
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
|
||||
|
||||
2. **GAE** - By following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.
|
||||
|
||||
The advantages are then used in order to accumulate gradients according to
|
||||
:math:`L = -\mathop{\mathbb{E}} [log (\pi) \cdot A]`
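A minimal NumPy sketch of the **A_VALUE** estimator over a rollout of :math:`T_{max}` transitions (illustrative names only, terminal-state handling omitted; not the Coach implementation):

.. code-block:: python

    import numpy as np

    def k_step_advantages(rewards, values, bootstrap_value, discount=0.99):
        """A(s_t, a_t) for T_max consecutive transitions of a single rollout.

        rewards:         (T,) rewards r_t collected during the rollout.
        values:          (T,) V(s_t) predicted by the critic for each state.
        bootstrap_value: V(s_{t+k}) of the state that follows the last transition.
        """
        num_steps = len(rewards)
        returns = np.zeros(num_steps)
        running = bootstrap_value
        for t in reversed(range(num_steps)):
            # k = T_max - t for each state, as in the formula above
            running = rewards[t] + discount * running
            returns[t] = running          # k-step estimate of Q(s_t, a_t)
        return returns - values           # advantage = Q - V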
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters
|
||||
@@ -0,0 +1,44 @@
|
||||
Clipped Proximal Policy Optimization
|
||||
====================================
|
||||
|
||||
**Actions space:** Discrete | Continuous
|
||||
|
||||
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/ppo.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action - Continuous action
|
||||
++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
Same as in PPO.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
Very similar to PPO, with several small (but very simplifying) changes:
|
||||
|
||||
1. Train both the value and policy networks simultaneously, by defining a single loss function,
which is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.
|
||||
|
||||
2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).
|
||||
|
||||
3. Value targets are now also calculated based on the GAE advantages.
|
||||
In this method, the :math:`V` values are predicted from the critic network, and then added to the GAE based advantages,
|
||||
in order to get a :math:`Q` value for each action. Now, since our critic network is predicting a :math:`V` value for
|
||||
each state, setting the calculated :math:`Q` action-values as targets will, on average, serve as a :math:`V` state-value target.
|
||||
|
||||
4. Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
|
||||
:math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped, to achieve a similar effect.
|
||||
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon
|
||||
clipped surrogate loss:
|
||||
|
||||
:math:`L^{CLIP}(\theta)=E_{t}[min(r_t(\theta)\cdot \hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`
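A minimal NumPy sketch of the clipped surrogate objective above (a standalone illustration, not the Coach implementation; the value and entropy terms are omitted):

.. code-block:: python

    import numpy as np

    def clipped_surrogate_objective(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
        """L^CLIP, to be maximized (or negated and minimized by the optimizer)."""
        ratios = np.exp(new_log_probs - old_log_probs)               # r_t(theta)
        clipped_ratios = np.clip(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon)
        return np.mean(np.minimum(ratios * advantages, clipped_ratios * advantages))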
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters
|
||||
@@ -0,0 +1,50 @@
|
||||
Deep Deterministic Policy Gradient
|
||||
==================================
|
||||
|
||||
**Actions space:** Continuous
|
||||
|
||||
**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/ddpg.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action
|
||||
++++++++++++++++++
|
||||
|
||||
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
|
||||
While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
|
||||
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
Start by sampling a batch of transitions from the experience replay.
|
||||
|
||||
* To train the **critic network**, use the following targets:
|
||||
|
||||
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`
|
||||
|
||||
First run the actor target network, using the next states as the inputs, and get :math:`\mu (s_{t+1} )`.
|
||||
Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
|
||||
calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
|
||||
as the inputs, and :math:`y_t` as the targets.
|
||||
|
||||
* To train the **actor network**, use the following equation:
|
||||
|
||||
:math:`\nabla_{\theta^\mu } J \approx E_{s_t \tilde{} \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ]`
|
||||
|
||||
Use the actor's online network to get the action mean values using the current states as the inputs.
|
||||
Then, use the critic online network in order to get the gradients of the critic output with respect to the
|
||||
action mean values :math:`\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
|
||||
Using the chain rule, calculate the gradients of the actor's output, with respect to the actor weights,
|
||||
given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.
|
||||
|
||||
After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
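A minimal NumPy sketch of the critic targets and of the soft target update described above (illustrative names, terminal-state handling omitted; not the Coach implementation):

.. code-block:: python

    import numpy as np

    def critic_targets(rewards, next_q_values, discount=0.99):
        """y_t = r(s_t, a_t) + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))."""
        return np.asarray(rewards) + discount * np.asarray(next_q_values)

    def soft_update(target_weights, online_weights, tau=0.001):
        """Move each target parameter a small step towards its online counterpart."""
        return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]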
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters
|
||||
@@ -0,0 +1,24 @@
|
||||
Hierarchical Actor Critic
|
||||
=========================
|
||||
|
||||
**Actions space:** Continuous
|
||||
|
||||
**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/ddpg.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action
|
||||
++++++++++++++++++
|
||||
|
||||
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
|
||||
While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
|
||||
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
39
docs_raw/source/components/agents/policy_optimization/pg.rst
Normal file
@@ -0,0 +1,39 @@
|
||||
Policy Gradient
|
||||
===============
|
||||
|
||||
**Actions space:** Discrete | Continuous
|
||||
|
||||
**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/pg.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action - Discrete actions
|
||||
+++++++++++++++++++++++++++++++++++++
|
||||
Run the current states through the network and get a policy distribution over the actions.
|
||||
While training, sample from the policy distribution. When testing, take the action with the highest probability.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
The policy head loss is defined as :math:`L=-log (\pi) \cdot PolicyGradientRescaler`.
|
||||
The :code:`PolicyGradientRescaler` is used to reduce the variance of the policy gradient, since noisy gradient updates
might destabilize the policy's convergence. The rescaler is a configurable parameter and there are a few options to choose from:
|
||||
|
||||
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
|
||||
* **Future Return** - Return from each transition until the end of the episode.
|
||||
* **Future Return Normalized by Episode** - Future returns across the episode normalized by the episode's mean and standard deviation.
|
||||
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
|
||||
which are calculated seperately for each timestep, across different episodes.
|
||||
|
||||
Gradients are accumulated over a number of fully played episodes. Accumulating gradients over several episodes
serves the same purpose - reducing the update variance. After accumulating gradients for several episodes,
the gradients are applied to the network.
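A minimal NumPy sketch of the **Future Return** rescaler (illustrative only; the other rescalers normalize these values by episode or by timestep statistics):

.. code-block:: python

    import numpy as np

    def discounted_future_returns(rewards, discount=0.99):
        """Return from each transition until the end of the episode."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + discount * running
            returns[t] = running
        return returns

    # the 'Future Return Normalized by Episode' variant then normalizes these values:
    # (returns - returns.mean()) / (returns.std() + 1e-8)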
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters
|
||||
@@ -0,0 +1,45 @@
|
||||
Proximal Policy Optimization
|
||||
============================
|
||||
|
||||
**Actions space:** Discrete | Continuous
|
||||
|
||||
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/ppo.png
|
||||
:align: center
|
||||
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action - Continuous actions
|
||||
+++++++++++++++++++++++++++++++++++++++
|
||||
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation.
|
||||
While in training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values.
|
||||
When testing, just take the mean values predicted by the network.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
1. Collect a big chunk of experience (in the order of thousands of transitions, sampled from multiple episodes).
|
||||
|
||||
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman '2015).
|
||||
|
||||
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
|
||||
the L-BFGS optimizer runs on the entire dataset at once, without batching.
|
||||
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
|
||||
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
|
||||
discounted returns of each state in each episode.
|
||||
|
||||
4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as
|
||||
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
|
||||
starting to run the current set of training iterations) using a regularization term.
|
||||
|
||||
5. After training is done, the last sampled KL divergence value will be compared with the *target KL divergence* value,
|
||||
in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high,
|
||||
increase the penalty, if it went too low, reduce it. Otherwise, leave it unchanged.
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters
|
||||
@@ -0,0 +1,43 @@
|
||||
Bootstrapped DQN
|
||||
================
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/bs_dqn.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action
|
||||
++++++++++++++++++
|
||||
The current states are used as the input to the network. The network contains several :math:`Q` heads, which are used
|
||||
for returning different estimations of the action :math:`Q` values. For each episode, the bootstrapped exploration policy
|
||||
selects a single head to play with during the episode. According to the selected head, only the relevant
|
||||
output :math:`Q` values are used. Using those :math:`Q` values, the exploration policy then selects the action for acting.
|
||||
|
||||
Storing the transitions
|
||||
+++++++++++++++++++++++
|
||||
For each transition, a Binomial mask is generated according to a predefined probability, and the number of output heads.
|
||||
The mask is a binary vector where each element holds a 0 for heads that shouldn't train on the specific transition,
|
||||
and 1 for heads that should use the transition for training. The mask is stored as part of the transition info in
|
||||
the replay buffer.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
|
||||
current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
|
||||
and for each output head, if the transition mask is 1 - change the targets of the played action to :math:`y_t`,
|
||||
according to the standard DQN update rule:
|
||||
|
||||
:math:`y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a)`
|
||||
|
||||
Otherwise, leave it intact so that the transition does not affect the learning of this head.
|
||||
Then, train the online network according to the calculated targets.
|
||||
|
||||
As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.
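A minimal NumPy sketch of the mask generation and of the masked target update described above (names and shapes are illustrative, not the Coach API):

.. code-block:: python

    import numpy as np

    def sample_head_mask(num_heads, keep_probability=0.5):
        """Binary mask stored with each transition: 1 means the head trains on it, 0 means it ignores it."""
        return np.random.binomial(1, keep_probability, size=num_heads)

    def masked_targets(current_q, actions, y_t, masks):
        """current_q: (batch, num_heads, num_actions); actions, y_t and masks are per-transition."""
        targets = current_q.copy()
        for i, (action, target, mask) in enumerate(zip(actions, y_t, masks)):
            for head in np.nonzero(mask)[0]:
                # only heads selected by the mask are updated for the played action
                targets[i, head, action] = target
        return targets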
|
||||
|
||||
@@ -0,0 +1,39 @@
|
||||
Categorical DQN
|
||||
===============
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/distributional_dqn.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
1. Sample a batch of transitions from the replay buffer.
|
||||
|
||||
2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
|
||||
that the :math:`i-th` component of the projected update is calculated as follows:
|
||||
|
||||
:math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
|
||||
|
||||
where:
|
||||
* :math:`[\cdot]^{b}_{a}` bounds its argument in the range :math:`[a, b]`
|
||||
* :math:`\hat{T}_{z_{j}}` is the Bellman update for atom :math:`z_j`: :math:`\hat{T}_{z_{j}} := r+\gamma z_j`
|
||||
|
||||
|
||||
3. The network is trained with the cross entropy loss between the resulting probability distribution and the target
probability distribution. Only the targets of the actions that were actually taken are updated.
|
||||
|
||||
4. Once in every few thousand steps, weights are copied from the online network to the target network.
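A minimal NumPy sketch of the distribution projection in step 2 (illustrative only; terminal-state handling and the greedy next-action selection are omitted, and the support settings are common C51 defaults, assumed here for illustration):

.. code-block:: python

    import numpy as np

    def project_distribution(rewards, next_probs, v_min=-10.0, v_max=10.0, num_atoms=51, discount=0.99):
        """Project the Bellman update of next-state distributions back onto the fixed support.

        rewards:    (batch,) immediate rewards.
        next_probs: (batch, num_atoms) p_j(s_{t+1}, pi(s_{t+1})) for the greedy next action.
        """
        atoms = np.linspace(v_min, v_max, num_atoms)
        delta_z = (v_max - v_min) / (num_atoms - 1)
        projected = np.zeros_like(next_probs)
        for i, (reward, probs) in enumerate(zip(rewards, next_probs)):
            # Bellman update per atom, bounded to [v_min, v_max]
            tz = np.clip(reward + discount * atoms, v_min, v_max)
            b = (tz - v_min) / delta_z                   # fractional atom index
            lower = np.floor(b).astype(int)
            upper = np.ceil(b).astype(int)
            for j in range(num_atoms):
                if lower[j] == upper[j]:                 # b[j] falls exactly on an atom
                    projected[i, lower[j]] += probs[j]
                else:                                    # split the mass between the two neighbours
                    projected[i, lower[j]] += probs[j] * (upper[j] - b[j])
                    projected[i, upper[j]] += probs[j] * (b[j] - lower[j])
        return projected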
|
||||
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.categorical_dqn_agent.CategoricalDQNAlgorithmParameters
|
||||
@@ -0,0 +1,35 @@
|
||||
Double DQN
|
||||
==========
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461.pdf>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/dqn.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
1. Sample a batch of transitions from the replay buffer.
|
||||
|
||||
2. Using the next states from the sampled batch, run the online network in order to find the :math:`Q` maximizing
|
||||
action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
|
||||
network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.
|
||||
|
||||
3. In order to zero out the updates for the actions that were not played (resulting from zeroing the MSE loss),
|
||||
use the current states from the sampled batch, and run the online network to get the current Q values predictions.
|
||||
Set those values as the targets for the actions that were not actually played.
|
||||
|
||||
4. For each action that was played, use the following equation for calculating the targets of the network:
|
||||
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a))`
|
||||
|
||||
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
|
||||
|
||||
6. Once in every few thousand steps, copy the weights from the online network to the target network.
|
||||
37
docs_raw/source/components/agents/value_optimization/dqn.rst
Normal file
@@ -0,0 +1,37 @@
|
||||
Deep Q Networks
|
||||
===============
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/dqn.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
1. Sample a batch of transitions from the replay buffer.
|
||||
|
||||
2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
|
||||
the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.
|
||||
|
||||
3. In order to zero out the updates for the actions that were not played (resulting from zeroing the MSE loss),
|
||||
use the current states from the sampled batch, and run the online network to get the current Q values predictions.
|
||||
Set those values as the targets for the actions that were not actually played.
|
||||
|
||||
4. For each action that was played, use the following equation for calculating the targets of the network:
:math:`y_t=r(s_t,a_t )+\gamma \cdot max_a Q(s_{t+1},a)`
|
||||
|
||||
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
|
||||
|
||||
6. Once in every few thousand steps, copy the weights from the online network to the target network.
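A minimal NumPy sketch of steps 2-4 above (illustrative names and shapes, terminal-state handling omitted; not the Coach implementation):

.. code-block:: python

    import numpy as np

    def dqn_targets(current_q, next_q_target, rewards, actions, discount=0.99):
        """current_q:     (batch, num_actions) online network predictions for s_t.
           next_q_target: (batch, num_actions) target network predictions for s_{t+1}.
        """
        # actions that were not played keep their current predictions, so their loss is zero
        targets = current_q.copy()
        max_next_q = next_q_target.max(axis=1)     # max_a Q(s_{t+1}, a) from the target network
        targets[np.arange(len(actions)), actions] = rewards + discount * max_next_q
        return targets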
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters
|
||||
@@ -0,0 +1,27 @@
|
||||
Dueling DQN
|
||||
===========
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/dueling_dqn.png
|
||||
:align: center
|
||||
|
||||
General Description
|
||||
-------------------
|
||||
Dueling DQN presents a change in the network structure compared to DQN.
|
||||
|
||||
Dueling DQN uses a specialized *Dueling Q Head* in order to separate :math:`Q` to an :math:`A` (advantage)
|
||||
stream and a :math:`V` stream. Adding this type of structure to the network head allows the network to better differentiate
|
||||
actions from one another, and significantly improves the learning.
|
||||
|
||||
In many states, the values of the different actions are very similar, and it matters less which action is taken.
This is especially relevant in environments where there are many actions to choose from. In DQN, on each training
iteration, for each of the states in the batch, we update the :math:`Q` values only for the specific actions taken in
those states. This results in slower learning, as we do not learn the :math:`Q` values for actions that were not taken yet.
With the dueling architecture, on the other hand, learning is faster - as we start learning the state-value even if only a
|
||||
single action has been taken at this state.
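A minimal NumPy sketch of how the two streams are typically combined in the dueling head (a standalone illustration, not the Coach implementation):

.. code-block:: python

    import numpy as np

    def dueling_q_values(state_values, advantages):
        """Combine the V stream and the A stream into Q values.

        state_values: (batch, 1) output of the V stream.
        advantages:   (batch, num_actions) output of the A stream.
        """
        # subtracting the mean advantage keeps the V / A decomposition identifiable
        return state_values + advantages - advantages.mean(axis=1, keepdims=True)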
|
||||
37
docs_raw/source/components/agents/value_optimization/mmc.rst
Normal file
@@ -0,0 +1,37 @@
|
||||
Mixed Monte Carlo
|
||||
=================
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/dqn.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
|
||||
|
||||
The DDQN targets are calculated in the same manner as in the DDQN agent:
|
||||
|
||||
:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
|
||||
|
||||
The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:
|
||||
|
||||
:math:`y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} )`
|
||||
|
||||
A mixing ratio :math:`\alpha` is then used to get the final targets:
|
||||
|
||||
:math:`y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}`
|
||||
|
||||
Finally, the online network is trained using the current states as inputs, and the calculated targets.
|
||||
Once in every few thousand steps, copy the weights from the online network to the target network.
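A minimal NumPy sketch of the target mixing (the default mixing ratio value is an illustrative assumption):

.. code-block:: python

    import numpy as np

    def mixed_monte_carlo_targets(ddqn_targets, monte_carlo_returns, mixing_ratio=0.1):
        """y_t = (1 - alpha) * y_t^DDQN + alpha * y_t^MC."""
        return ((1.0 - mixing_ratio) * np.asarray(ddqn_targets)
                + mixing_ratio * np.asarray(monte_carlo_returns))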
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.mmc_agent.MixedMonteCarloAlgorithmParameters
|
||||
@@ -0,0 +1,35 @@
|
||||
N-Step Q Learning
|
||||
=================
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/dqn.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||
The :math:`N`-step Q learning algorithm works in a similar manner to DQN, except for the following changes:
|
||||
|
||||
1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every
|
||||
:math:`N` steps using the latest :math:`N` steps played by the agent.
|
||||
|
||||
2. In order to stabilize the learning, multiple workers work together to update the network.
|
||||
This creates the same effect as uncorrelating the samples used for training.
|
||||
|
||||
3. Instead of using single-step Q targets for the network, the rewards from :math:`N` consecutive steps are accumulated
|
||||
to form the :math:`N`-step Q targets, according to the following equation:
|
||||
:math:`R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})`
|
||||
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch
|
||||
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.n_step_q_agent.NStepQAlgorithmParameters
|
||||
33
docs_raw/source/components/agents/value_optimization/naf.rst
Normal file
@@ -0,0 +1,33 @@
|
||||
Normalized Advantage Functions
|
||||
==============================
|
||||
|
||||
**Actions space:** Continuous
|
||||
|
||||
**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/naf.png
|
||||
:width: 600px
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action
|
||||
++++++++++++++++++
|
||||
The current state is used as an input to the network. The action mean :math:`\mu(s_t )` is extracted from the output head.
|
||||
It is then passed to the exploration policy which adds noise in order to encourage exploration.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
The network is trained by using the following targets:
|
||||
:math:`y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1})`
|
||||
Use the next states as the inputs to the target network and extract the :math:`V` value, from within the head,
|
||||
to get :math:`V(s_{t+1} )`. Then, update the online network using the current states and actions as inputs,
|
||||
and :math:`y_t` as the targets.
|
||||
After every training step, use a soft update in order to copy the weights from the online network to the target network.
|
||||
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters
|
||||
50
docs_raw/source/components/agents/value_optimization/nec.rst
Normal file
@@ -0,0 +1,50 @@
|
||||
Neural Episodic Control
|
||||
=======================
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/nec.png
|
||||
:width: 500px
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Choosing an action
|
||||
++++++++++++++++++
|
||||
|
||||
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate
|
||||
output from the middleware.
|
||||
|
||||
2. For each possible action :math:`a_i`, run the DND head using the state embedding and the selected action :math:`a_i` as inputs.
|
||||
The DND is queried and returns the :math:`P` nearest neighbor keys and values. The keys and values are used to calculate
|
||||
and return the action :math:`Q` value from the network.
|
||||
|
||||
3. Pass all the :math:`Q` values to the exploration policy and choose an action accordingly.
|
||||
|
||||
4. Store the state embeddings and actions taken during the current episode in a small buffer :math:`B`, in order to
|
||||
accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
|
||||
|
||||
Finalizing an episode
|
||||
+++++++++++++++++++++
|
||||
For each step in the episode, the state embeddings and the taken actions are stored in the buffer :math:`B`.
|
||||
When the episode is finished, the replay buffer calculates the :math:`N`-step total return of each transition in the
|
||||
buffer, bootstrapped using the maximum :math:`Q` value of the :math:`N`-th transition. Those values are inserted
|
||||
along with the total return into the DND, and the buffer :math:`B` is reset.
|
||||
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
Train the network only when the DND has enough entries for querying.
|
||||
|
||||
To train the network, the current states are used as the inputs and the :math:`N`-step returns are used as the targets.
|
||||
The :math:`N`-step return used takes into account :math:`N` consecutive steps, and bootstraps the last value from
|
||||
the network if necessary:
|
||||
:math:`y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N max_a Q(s_{t+N},a)`
|
||||
|
||||
|
||||
|
||||
.. autoclass:: rl_coach.agents.nec_agent.NECAlgorithmParameters
|
||||
45
docs_raw/source/components/agents/value_optimization/pal.rst
Normal file
@@ -0,0 +1,45 @@
|
||||
Persistent Advantage Learning
|
||||
=============================
|
||||
|
||||
**Actions space:** Discrete
|
||||
|
||||
**References:** `Increasing the Action Gap: New Operators for Reinforcement Learning <https://arxiv.org/abs/1512.04860>`_
|
||||
|
||||
Network Structure
|
||||
-----------------
|
||||
|
||||
.. image:: /_static/img/design_imgs/dqn.png
|
||||
:align: center
|
||||
|
||||
Algorithm Description
|
||||
---------------------
|
||||
Training the network
|
||||
++++++++++++++++++++
|
||||
|
||||

1. Sample a batch of transitions from the replay buffer.

2. Start by calculating the initial target values in the same manner as they are calculated in DDQN:

   :math:`y_t^{DDQN}=r(s_t,a_t)+\gamma Q(s_{t+1},\mathrm{argmax}_a Q(s_{t+1},a))`

3. The action gap :math:`V(s_t)-Q(s_t,a_t)` should then be subtracted from each of the calculated targets.
   To calculate the action gap, run the target network using the current states and get the :math:`Q` values
   for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state:

   :math:`V(s_t)=\max_a Q(s_t,a)`

4. For *advantage learning (AL)*, subtract the action gap, weighted by a predefined parameter :math:`\alpha`, from
   the targets :math:`y_t^{DDQN}`:

   :math:`y_t=y_t^{DDQN}-\alpha \cdot (V(s_t)-Q(s_t,a_t))`

5. For *persistent advantage learning (PAL)*, the target network is also used to calculate the action gap for the
   next state, :math:`V(s_{t+1})-Q(s_{t+1},a_{t+1})`,
   where :math:`a_{t+1}` is chosen by running the next states through the online network and selecting the action
   with the highest predicted :math:`Q` value. The targets are then defined as:

   :math:`y_t=y_t^{DDQN}-\alpha \cdot \min(V(s_t)-Q(s_t,a_t),V(s_{t+1})-Q(s_{t+1},a_{t+1}))`

6. Train the online network using the current states as inputs, and with the aforementioned targets
   (a batched sketch of the target computation is given after this list).

7. Once every few thousand steps, copy the weights from the online network to the target network.
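
The following is a minimal NumPy sketch of the target computation for a batch of transitions (terminal states are
ignored for brevity). The function name, argument layout and batched indexing are assumptions made for the example;
they do not mirror Coach's implementation.

.. code-block:: python

   import numpy as np

   def pal_targets(rewards, actions, gamma, alpha,
                   q_target_current, q_online_next, q_target_next):
       # q_target_current -- target network Q values for s_t,      shape (batch, actions)
       # q_online_next    -- online network Q values for s_{t+1},  shape (batch, actions)
       # q_target_next    -- target network Q values for s_{t+1},  shape (batch, actions)
       batch = np.arange(len(rewards))

       # DDQN target: the online network selects a_{t+1}, the target network evaluates it
       a_next = np.argmax(q_online_next, axis=1)
       y_ddqn = rewards + gamma * q_target_next[batch, a_next]

       # action gaps V(s) - Q(s, a) for the current and next states
       gap_current = np.max(q_target_current, axis=1) - q_target_current[batch, actions]
       gap_next = np.max(q_target_next, axis=1) - q_target_next[batch, a_next]

       # PAL: subtract the smaller of the two gaps, weighted by alpha
       return y_ddqn - alpha * np.minimum(gap_current, gap_next)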

.. autoclass:: rl_coach.agents.pal_agent.PALAlgorithmParameters
@@ -0,0 +1,33 @@
Quantile Regression DQN
=======================

**Actions space:** Discrete

**References:** `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/qr_dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. First, the quantiles of the next states are predicted. These are used to calculate the targets for the network
   by following the Bellman equation. Next, the quantile locations for the current states are predicted, sorted,
   and used for calculating the quantile midpoint targets.

3. The network is trained with the quantile regression loss between the resulting quantile locations and the target
   quantile locations (a sketch of this loss is given after this list). Only the targets of the actions that were
   actually taken are updated.

4. Once every few thousand steps, the weights are copied from the online network to the target network.
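
A minimal NumPy sketch of the quantile regression (Huber) loss for a single state-action pair is shown below.
The threshold ``kappa`` and the function name are assumptions for the example, not Coach's implementation.

.. code-block:: python

   import numpy as np

   def quantile_huber_loss(predicted, target, kappa=1.0):
       # predicted -- current quantile locations for the taken action, shape (N,)
       # target    -- Bellman-updated target quantile locations,       shape (N,)
       #              e.g. target = reward + gamma * next_quantiles_of_best_action (step 2)
       n = len(predicted)
       tau_hat = (np.arange(n) + 0.5) / n                  # quantile midpoints
       # pairwise differences between every target and every predicted quantile
       u = target[None, :] - predicted[:, None]            # shape (N, N)
       huber = np.where(np.abs(u) <= kappa,
                        0.5 * u ** 2,
                        kappa * (np.abs(u) - 0.5 * kappa))
       # asymmetric quantile weighting of the Huber loss
       weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))
       return (weight * huber / kappa).mean(axis=1).sum()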

.. autoclass:: rl_coach.agents.qr_dqn_agent.QuantileRegressionDQNAlgorithmParameters
@@ -0,0 +1,51 @@
Rainbow
=======

**Actions space:** Discrete

**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/rainbow.png
   :align: center

Algorithm Description
---------------------

Rainbow combines 6 recent advancements in reinforcement learning:

* N-step returns
* Distributional state-action value learning
* Dueling networks
* Noisy Networks
* Double DQN
* Prioritized Experience Replay

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.

2. The Bellman update is projected onto the set of atoms representing the :math:`Q` values distribution, such
   that the :math:`i`-th component of the projected update is calculated as follows
   (a sketch of this projection is given after this list):

   :math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`

   where:

   * :math:`[\cdot]_a^b` bounds its argument in the range :math:`[a, b]`
   * :math:`\hat{T}_{z_{j}}` is the :math:`n`-step Bellman update for atom :math:`z_j`:
     :math:`\hat{T}_{z_{j}} := r_t+\gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^{n} z_j`

3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
   probability distribution. Only the targets of the actions that were actually taken are updated.

4. Once every few thousand steps, the weights are copied from the online network to the target network.

5. After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer,
   using the KL divergence loss that is returned from the network.
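
To make the projection step more concrete, here is a minimal NumPy sketch that projects a single :math:`n`-step
Bellman update back onto the fixed support of atoms, using the equivalent "split the mass between the two nearest
atoms" form of the equation above. The function name and argument layout are assumptions made for the example.

.. code-block:: python

   import numpy as np

   def project_distribution(next_probs, n_step_reward, gamma_n, support):
       # next_probs    -- p_j(s', pi(s')): probabilities over atoms at the bootstrap state, shape (N,)
       # n_step_reward -- r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1}
       # gamma_n       -- gamma ** n, the discount applied to the bootstrap atoms
       # support       -- the fixed atom locations z_0 ... z_{N-1}, shape (N,)
       v_min, v_max = support[0], support[-1]
       delta_z = support[1] - support[0]

       # Bellman update of every atom, clipped to [V_MIN, V_MAX]
       tz = np.clip(n_step_reward + gamma_n * support, v_min, v_max)

       projected = np.zeros_like(next_probs)
       for j, p in enumerate(next_probs):
           b = (tz[j] - v_min) / delta_z            # fractional index of the updated atom
           lower, upper = int(np.floor(b)), int(np.ceil(b))
           if lower == upper:                       # the update falls exactly on an atom
               projected[lower] += p
           else:                                    # otherwise split the mass between neighbors
               projected[lower] += p * (upper - b)
               projected[upper] += p * (b - lower)
       return projected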

.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters