mirror of https://github.com/gryf/coach.git synced 2025-12-17 19:20:19 +01:00

fixed some documentation typos

This commit is contained in:
Gal Novik
2017-10-22 22:21:45 +03:00
parent 2a3a6f4a68
commit 6009b73eb6
17 changed files with 58 additions and 60 deletions

View File

@@ -4,8 +4,6 @@ installation
http://www.mkdocs.org/#installation
2. install the math extension for mkdocs
sudo -E pip install python-markdown-math
3. install the material theme
sudo -E pip install mkdocs-material
to build the documentation website run:
- mkdocs build

View File

@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
1. The current states (observations and measurements) and the corresponding goal vector are passed as an input to the network. The output of the network is the predicted future measurements for time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$ for each possible action.
@@ -22,4 +22,4 @@
### Training the network
Given a batch of transitions, run them through the network to get the current predictions of the future measurements per action, and set it as the initial targets for training the network. For each transition $(s_t,a_t,r_t,s_{t+1} )$ in the batch, the target of the network for the action that was taken, is the actual measurements that were seen in time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$. For the actions that were not taken, the targets are the current values.
Given a batch of transitions, run them through the network to get the current predictions of the future measurements per action, and set them as the initial targets for training the network. For each transition $(s_t,a_t,r_t,s_{t+1} )$ in the batch, the target of the network for the action that was taken, is the actual measurements that were seen in time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$. For the actions that were not taken, the targets are the current values.
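As a rough sketch of this target construction (not Coach's implementation), the snippet below builds the training targets for a single transition; the array names, shapes, and number of measurements are illustrative assumptions.
```python
import numpy as np

# Illustrative sizes, not Coach's actual configuration.
num_actions = 3
num_offsets = 6        # predictions for t+1, t+2, t+4, t+8, t+16, t+32
num_measurements = 2   # number of measurement signals tracked per step

# Current network predictions for one state: future measurements per action.
current_predictions = np.random.randn(num_actions, num_offsets, num_measurements)

# Measurements actually observed at t+1, t+2, t+4, t+8, t+16 and t+32.
actual_future_measurements = np.random.randn(num_offsets, num_measurements)

action_taken = 1

# Targets start as the current predictions, so actions that were not taken
# contribute no error; only the taken action's slice is replaced with the
# measurements that were actually seen.
targets = current_predictions.copy()
targets[action_taken] = actual_future_measurements
```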

View File

@@ -8,7 +8,7 @@
<p style="text-align: center;">
<img src="../../design_imgs/ac.png" width=500>
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Discrete actions
@@ -17,7 +17,7 @@ The policy network is used in order to predict action probabilites. While traini
### Training the network
A batch of $ T_{max} $ transitions is used, and the advantages are calculated upon it.
Advantages can be calculated by either of the followng methods (configured by the selected preset) -
Advantages can be calculated by either of the following methods (configured by the selected preset) -
1. **A_VALUE** - Estimating advantage directly:$$ A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t) $$where $k$ is $T_{max} - State\_Index$ for each state in the batch.
2. **GAE** - By following the [Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438) paper.
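For the **A_VALUE** option above, a minimal sketch of the $k$-step advantage estimate is given below, assuming hypothetical `rewards` and `values` arrays for one batch of $T_{max}$ transitions and a `bootstrap_value` standing in for $V(s_{t+k})$; it is not Coach's implementation.
```python
import numpy as np

def a_value_advantages(rewards, values, bootstrap_value, discount=0.99):
    """Illustrative k-step advantage estimates.

    rewards         -- r_0 ... r_{T_max-1} for one batch of transitions
    values          -- V(s_0) ... V(s_{T_max-1}) from the critic
    bootstrap_value -- V(s_{T_max}), used to bootstrap the tail of the return
    """
    t_max = len(rewards)
    advantages = np.zeros(t_max)
    for t in range(t_max):
        k = t_max - t  # remaining steps for this state
        discounted_rewards = sum(discount ** (i - t) * rewards[i]
                                 for i in range(t, t_max))
        q_estimate = discounted_rewards + discount ** k * bootstrap_value
        advantages[t] = q_estimate - values[t]
    return advantages

# Example usage with made-up numbers
print(a_value_advantages(rewards=[1.0, 0.0, 2.0],
                         values=[0.5, 0.4, 0.9],
                         bootstrap_value=0.3))
```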

View File

@@ -1,8 +1,8 @@
# Clipped Proximal Policy Optimization
# Clipped Proximal Policy Optimization
**Actions space:** Discrete|Continuous
**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)
**Actions space:** Discrete|Continuous
**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)
## Network Structure
@@ -11,7 +11,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Continuous action
Same as in PPO.
### Training the network

View File

@@ -1,8 +1,8 @@
# Deep Deterministic Policy Gradient
# Deep Deterministic Policy Gradient
**Actions space:** Continuous
**References:** [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
**Actions space:** Continuous
**References:** [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
## Network Structure
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
Pass the current states through the actor network, and get an action mean vector $ \mu $. While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process, to add exploration noise to the action. When testing, use the mean vector $\mu$ as-is.
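A minimal sketch of adding Ornstein-Uhlenbeck noise to the actor's action mean follows; the class, the parameter values, and the `actor_mean` placeholder are illustrative assumptions rather than Coach's exploration-policy code.
```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise (illustrative parameters)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(action_dim, mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state

noise = OrnsteinUhlenbeckNoise(action_dim=2)
actor_mean = np.array([0.1, -0.3])          # stand-in for the actor output
train_action = actor_mean + noise.sample()  # training: mean plus exploration noise
test_action = actor_mean                    # testing: use the mean as-is
```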
### Training the network

View File

@@ -1,8 +1,8 @@
# Policy Gradient
# Policy Gradient
**Actions space:** Discrete|Continuous
**References:** [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
**Actions space:** Discrete|Continuous
**References:** [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
## Network Structure
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Discrete actions
Run the current states through the network and get a policy distribution over the actions. While training, sample from the policy distribution. When testing, take the action with the highest probability.
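As an illustration of this rule (not Coach's code), the sketch below samples from a discrete policy during training and acts greedily at test time; `action_probabilities` stands in for the softmax output of the policy network.
```python
import numpy as np

def choose_discrete_action(action_probabilities, training):
    """Sample from the policy while training, act greedily when testing."""
    action_probabilities = np.asarray(action_probabilities, dtype=float)
    if training:
        # Exploration: sample an action according to the policy distribution.
        return int(np.random.choice(len(action_probabilities), p=action_probabilities))
    # Evaluation: take the most probable action.
    return int(np.argmax(action_probabilities))

probs = [0.2, 0.5, 0.3]   # stand-in for the network's softmax output
print(choose_discrete_action(probs, training=True))
print(choose_discrete_action(probs, training=False))
```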

View File

@@ -1,8 +1,8 @@
# Proximal Policy Optimization
# Proximal Policy Optimization
**Actions space:** Discrete|Continuous
**References:** [Emergence of Locomotion Behaviours in Rich Environments](https://arxiv.org/pdf/1707.02286.pdf)
**Actions space:** Discrete|Continuous
**References:** [Emergence of Locomotion Behaviours in Rich Environments](https://arxiv.org/pdf/1707.02286.pdf)
## Network Structure
@@ -13,7 +13,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Continuous actions
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation. While in training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values. When testing, just take the mean values predicted by the network.
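A minimal sketch of this sampling rule, assuming hypothetical `mean` and `std` vectors produced by the policy head:
```python
import numpy as np

def choose_continuous_action(mean, std, training):
    """Sample a Gaussian action while training, return the mean when testing."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    if training:
        # One independent Gaussian per action dimension.
        return np.random.normal(loc=mean, scale=std)
    return mean

mean = np.array([0.2, -0.1])   # stand-ins for the policy network outputs
std = np.array([0.3, 0.5])
print(choose_continuous_action(mean, std, training=True))
print(choose_continuous_action(mean, std, training=False))
```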
### Training the network

View File

@@ -1,8 +1,8 @@
# Bootstrapped DQN
# Bootstrapped DQN
**Actions space:** Discrete
**References:** [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621)
**Actions space:** Discrete
**References:** [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621)
## Network Structure
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
The current states are used as the input to the network. The network contains several $Q$ heads, which are used for returning different estimations of the action $ Q $ values. For each episode, the bootstrapped exploration policy selects a single head to play with during the episode. According to the selected head, only the relevant output $ Q $ values are used. Using those $ Q $ values, the exploration policy then selects the action for acting.
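The per-episode head selection can be sketched as follows, assuming a hypothetical `q_values_per_head` array of shape (num_heads, num_actions) and simple greedy selection over the chosen head; this is illustrative, not Coach's exploration-policy code.
```python
import numpy as np

num_heads, num_actions = 10, 4

# Stand-in for the multi-headed network output for the current state.
q_values_per_head = np.random.randn(num_heads, num_actions)

# At the start of every episode, the bootstrapped exploration policy
# picks one head and keeps it for the whole episode.
selected_head = np.random.randint(num_heads)

# During the episode, only that head's Q values drive action selection.
action = int(np.argmax(q_values_per_head[selected_head]))
```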

View File

@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network

View File

@@ -1,8 +1,8 @@
# Double DQN
**Actions space:** Discrete
**Actions space:** Discrete
**References:** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461.pdf)
**References:** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461.pdf)
## Network Structure
@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network
1. Sample a batch of transitions from the replay buffer.

View File

@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network

View File

@@ -1,8 +1,8 @@
# Dueling DQN
# Dueling DQN
**Actions space:** Discrete
**References:** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)
**Actions space:** Discrete
**References:** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)
## Network Structure
@@ -15,7 +15,7 @@
## General Description
Dueling DQN presents a change in the network structure compared to DQN.
Dueling DQN uses a speciallized _Dueling Q Head_ in order to seperate $ Q $ to an $ A $ (advantage) stream and a $ V $ stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves the learning.
Dueling DQN uses a specialized _Dueling Q Head_ in order to separate $ Q $ to an $ A $ (advantage) stream and a $ V $ stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves the learning.
In many states, the values of the diiferent actions are very similar, and it is less important which action to take.
In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training iteration, for each of the states in the batch, we update the $Q$ values only for the specific actions taken in those states. This results in slower learning, as we do not learn the $Q$ values for actions that were not taken yet. With the dueling architecture, on the other hand, learning is faster, as we start learning the state value even if only a single action has been taken in this state.
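The dueling head combines the two streams into $Q$ values; a common aggregation, the mean-subtracted form from the referenced paper, is sketched below with hypothetical `value` and `advantages` outputs.
```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine a state-value scalar and per-action advantages into Q values.

    Uses the mean-subtracted aggregation from the Dueling DQN paper:
    Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
    """
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

# Stand-in outputs of the V and A streams for a single state
print(dueling_q_values(value=1.2, advantages=[0.5, -0.1, 0.3]))
```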

View File

@@ -1,8 +1,8 @@
# Mixed Monte Carlo
**Actions space:** Discrete
**Actions space:** Discrete
**References:** [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310)
**References:** [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310)
## Network Structure
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Training the network
In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
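A minimal sketch of this mixing for a single transition, where `ddqn_target`, `monte_carlo_return`, and the mixing ratio are illustrative stand-ins for configurable values:
```python
def mixed_monte_carlo_target(ddqn_target, monte_carlo_return, mixing_ratio=0.1):
    """Blend the bootstrapped Double DQN target with the full episode return."""
    return (1.0 - mixing_ratio) * ddqn_target + mixing_ratio * monte_carlo_return

print(mixed_monte_carlo_target(ddqn_target=2.5, monte_carlo_return=4.0))
```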

View File

@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network

View File

@@ -1,8 +1,8 @@
# Normalized Advantage Functions
# Normalized Advantage Functions
**Actions space:** Continuous
**References:** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748.pdf)
**Actions space:** Continuous
**References:** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748.pdf)
## Network Structure
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
The current state is used as an input to the network. The action mean $ \mu(s_t ) $ is extracted from the output head. It is then passed to the exploration policy which adds noise in order to encourage exploration.
### Training the network

View File

@@ -1,8 +1,8 @@
# Neural Episodic Control
# Neural Episodic Control
**Actions space:** Discrete
**References:** [Neural Episodic Control](https://arxiv.org/abs/1703.01988)
**Actions space:** Discrete
**References:** [Neural Episodic Control](https://arxiv.org/abs/1703.01988)
## Network Structure
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate output from the middleware.
2. For each possible action $a_i$, run the DND head using the state embedding and the selected action $a_i$ as inputs. The DND is queried and returns the $ P $ nearest neighbor keys and values. The keys and values are used to calculate and return the action $ Q $ value from the network.
@@ -20,7 +20,7 @@
4. Store the state embeddings and actions taken during the current episode in a small buffer $B$, in order to accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
### Finalizing an episode
For each step in the episode, the state embeddings and the taken actions where stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $ N $-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. Those values are inserted along with the total return into the DND, and the buffer $B$ is reset.
For each step in the episode, the state embeddings and the taken actions are stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $ N $-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. Those values are inserted along with the total return into the DND, and the buffer $B$ is reset.
### Training the network
Train the network only when the DND has enough entries for querying.
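As a rough sketch of the DND lookup described in the steps above, the snippet below estimates a $Q$ value from the $P$ nearest stored (key, value) pairs using the inverse-distance kernel from the referenced paper; the array contents and the kernel constant are illustrative.
```python
import numpy as np

def dnd_q_value(state_embedding, keys, values, p=5, delta=1e-3):
    """Estimate Q(s, a) from a per-action DND of (key, value) pairs.

    keys   -- stored state embeddings for this action, shape (n, embedding_dim)
    values -- stored N-step returns for those embeddings, shape (n,)
    """
    keys = np.asarray(keys, dtype=float)
    values = np.asarray(values, dtype=float)
    # P nearest neighbours by squared Euclidean distance.
    distances = np.sum((keys - state_embedding) ** 2, axis=1)
    nearest = np.argsort(distances)[:p]
    # Inverse-distance kernel weights, normalised over the neighbours.
    kernel = 1.0 / (distances[nearest] + delta)
    weights = kernel / kernel.sum()
    # The Q estimate is the weighted average of the stored values.
    return float(np.dot(weights, values[nearest]))

embedding_dim = 8
keys = np.random.randn(50, embedding_dim)   # stand-in DND contents
values = np.random.randn(50)
query = np.random.randn(embedding_dim)
print(dnd_q_value(query, keys, values))
```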

View File

@@ -12,11 +12,11 @@
</p>
## Algorithmic Description
## Algorithm Description
### Training the network
1. Sample a batch of transitions from the replay buffer.
2. Start by calculating theinitial target values in the same manner as they are calculated in DDQN
2. Start by calculating the initial target values in the same manner as they are calculated in DDQN
$$ y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a)) $$
3. The action gap $ V(s_t )-Q(s_t,a_t) $ should then be subtracted from each of the calculated targets. To calculate the action gap, run the target network using the current states and get the $ Q $ values for all the actions. Then estimate $ V $ as the maximum predicted $ Q $ value for the current state:
$$ V(s_t )=max_a Q(s_t,a) $$
@@ -24,7 +24,7 @@
$$ y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t )) $$
5. For _persistent advantage learning (PAL)_, the target network is also used in order to calculate the action gap for the next state:
$$ V(s_{t+1} )-Q(s_{t+1},a_{t+1}) $$
Where $ a_{t+1} $ is chosen by running the next states through the online network and choosing the action that has the highest predicted $ Q $ value. Finally, the targets will be defined as -
where $ a_{t+1} $ is chosen by running the next states through the online network and choosing the action that has the highest predicted $ Q $ value. Finally, the targets will be defined as -
$$ y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} )) $$
6. Train the online network using the current states as inputs, and with the aforementioned targets.
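To tie the steps above together, here is a minimal sketch of the PAL target for a single transition, assuming hypothetical $Q$-value arrays from the online and target networks and an illustrative $\alpha$:
```python
import numpy as np

def pal_target(reward, action, q_online_next, q_target_current, q_target_next,
               discount=0.99, alpha=0.9):
    """Persistent advantage learning target for one transition (illustrative).

    q_online_next    -- online network Q values for s_{t+1}
    q_target_current -- target network Q values for s_t
    q_target_next    -- target network Q values for s_{t+1}
    """
    # Double DQN target: the online network picks the action, the target network rates it.
    next_action = int(np.argmax(q_online_next))
    y_ddqn = reward + discount * q_target_next[next_action]

    # Action gap for the current state, with V(s_t) = max_a Q(s_t, a).
    gap_current = np.max(q_target_current) - q_target_current[action]

    # Action gap for the next state, using the online-selected action a_{t+1}.
    gap_next = np.max(q_target_next) - q_target_next[next_action]

    # PAL subtracts alpha times the smaller of the two gaps from the DDQN target.
    return y_ddqn - alpha * min(gap_current, gap_next)

print(pal_target(reward=1.0, action=0,
                 q_online_next=np.array([0.2, 0.8, 0.1]),
                 q_target_current=np.array([0.5, 0.3, 0.4]),
                 q_target_next=np.array([0.3, 0.6, 0.2])))
```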