From 6009b73eb60f6a79daa2a0028f8feafda3635f7d Mon Sep 17 00:00:00 2001
From: Gal Novik
Date: Sun, 22 Oct 2017 22:21:45 +0300
Subject: [PATCH] Fix some documentation typos
---
docs/README.txt | 2 --
docs/docs/algorithms/other/dfp.md | 4 ++--
docs/docs/algorithms/policy_optimization/ac.md | 4 ++--
docs/docs/algorithms/policy_optimization/cppo.md | 10 +++++-----
docs/docs/algorithms/policy_optimization/ddpg.md | 10 +++++-----
docs/docs/algorithms/policy_optimization/pg.md | 10 +++++-----
docs/docs/algorithms/policy_optimization/ppo.md | 10 +++++-----
docs/docs/algorithms/value_optimization/bs_dqn.md | 10 +++++-----
.../value_optimization/distributional_dqn.md | 2 +-
.../docs/algorithms/value_optimization/double_dqn.md | 6 +++---
docs/docs/algorithms/value_optimization/dqn.md | 2 +-
.../algorithms/value_optimization/dueling_dqn.md | 12 ++++++------
docs/docs/algorithms/value_optimization/mmc.md | 6 +++---
docs/docs/algorithms/value_optimization/n_step.md | 2 +-
docs/docs/algorithms/value_optimization/naf.md | 10 +++++-----
docs/docs/algorithms/value_optimization/nec.md | 12 ++++++------
docs/docs/algorithms/value_optimization/pal.md | 6 +++---
17 files changed, 58 insertions(+), 60 deletions(-)
diff --git a/docs/README.txt b/docs/README.txt
index a60dd37..8c7131b 100644
--- a/docs/README.txt
+++ b/docs/README.txt
@@ -4,8 +4,6 @@ installation
http://www.mkdocs.org/#installation
2. install the math extension for mkdocs
sudo -E pip install python-markdown-math
-3. install the material theme
- sudo -E pip install mkdocs-material
to build the documentation website run:
- mkdocs build
diff --git a/docs/docs/algorithms/other/dfp.md b/docs/docs/algorithms/other/dfp.md
index 4e5c110..faa5e0e 100644
--- a/docs/docs/algorithms/other/dfp.md
+++ b/docs/docs/algorithms/other/dfp.md
@@ -12,7 +12,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action
1. The current states (observations and measurements) and the corresponding goal vector are passed as inputs to the network. The output of the network is the predicted future measurements for time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$ for each possible action.
@@ -22,4 +22,4 @@
### Training the network
-Given a batch of transitions, run them through the network to get the current predictions of the future measurements per action, and set it as the initial targets for training the network. For each transition $(s_t,a_t,r_t,s_{t+1} )$ in the batch, the target of the network for the action that was taken, is the actual measurements that were seen in time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$. For the actions that were not taken, the targets are the current values.
\ No newline at end of file
+Given a batch of transitions, run them through the network to get the current predictions of the future measurements per action, and set them as the initial targets for training the network. For each transition $(s_t,a_t,r_t,s_{t+1})$ in the batch, the target of the network for the action that was taken is the vector of actual measurements seen at time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$. For the actions that were not taken, the targets are the current values.
\ No newline at end of file
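To make the target construction above concrete, here is a minimal NumPy sketch; the array shapes, the function name and the offsets list are illustrative assumptions, not taken from the actual code.

```python
import numpy as np

# Time-step offsets used in the text: t+1, t+2, t+4, t+8, t+16, t+32
OFFSETS = [1, 2, 4, 8, 16, 32]

def dfp_targets(predictions, actions, future_measurements):
    """Build the training targets for one batch of transitions (sketch).

    predictions         -- current network output, shape (batch, num_actions, measurements)
    actions             -- index of the action taken in each transition, shape (batch,)
    future_measurements -- measurements actually observed at the offsets, shape (batch, measurements)
    """
    # Start from the current predictions, so actions that were not taken keep their values
    targets = predictions.copy()
    # For the action that was taken, the target is what was actually measured
    targets[np.arange(len(actions)), actions] = future_measurements
    return targets
```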
diff --git a/docs/docs/algorithms/policy_optimization/ac.md b/docs/docs/algorithms/policy_optimization/ac.md
index d394eae..fe0e5a9 100644
--- a/docs/docs/algorithms/policy_optimization/ac.md
+++ b/docs/docs/algorithms/policy_optimization/ac.md
@@ -8,7 +8,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action - Discrete actions
@@ -17,7 +17,7 @@ The policy network is used in order to predict action probabilites. While traini
### Training the network
A batch of $ T_{max} $ transitions is used, and the advantages are calculated over it.
-Advantages can be calculated by either of the followng methods (configured by the selected preset) -
+Advantages can be calculated by either of the following methods (configured by the selected preset) -
1. **A_VALUE** - Estimating advantage directly: $$ A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t) $$ where $k$ is $T_{max} - State\_Index$ for each state in the batch.
2. **GAE** - By following the [Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438) paper.
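To make the **A_VALUE** option above concrete, here is a minimal NumPy sketch of the bootstrapped $k$-step advantage computation for a rollout of $ T_{max} $ transitions; the function name and the default discount factor are illustrative assumptions.

```python
import numpy as np

def a_value_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute A(s_t, a_t) for a rollout of T_max transitions (sketch).

    rewards         -- r_0 ... r_{T-1} collected in the rollout
    values          -- critic estimates V(s_0) ... V(s_{T-1})
    bootstrap_value -- V(s_T), used to bootstrap the tail of every return
    """
    T = len(rewards)
    returns = np.zeros(T)
    running = bootstrap_value
    # Accumulate discounted rewards backwards; every state bootstraps from V(s_T),
    # so k = T_max - state index, as in the formula above
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values)
```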
diff --git a/docs/docs/algorithms/policy_optimization/cppo.md b/docs/docs/algorithms/policy_optimization/cppo.md
index 150cdba..b684904 100644
--- a/docs/docs/algorithms/policy_optimization/cppo.md
+++ b/docs/docs/algorithms/policy_optimization/cppo.md
@@ -1,8 +1,8 @@
-# Clipped Proximal Policy Optimization
+# Clipped Proximal Policy Optimization
-**Actions space:** Discrete|Continuous
-
-**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)
+**Actions space:** Discrete|Continuous
+
+**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)
## Network Structure
@@ -11,7 +11,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action - Continuous actions
Same as in PPO.
### Training the network
diff --git a/docs/docs/algorithms/policy_optimization/ddpg.md b/docs/docs/algorithms/policy_optimization/ddpg.md
index f8ed755..213263f 100644
--- a/docs/docs/algorithms/policy_optimization/ddpg.md
+++ b/docs/docs/algorithms/policy_optimization/ddpg.md
@@ -1,8 +1,8 @@
-# Deep Deterministic Policy Gradient
+# Deep Deterministic Policy Gradient
-**Actions space:** Continuous
-
-**References:** [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
+**Actions space:** Continuous
+
+**References:** [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
## Network Structure
@@ -12,7 +12,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action
Pass the current states through the actor network, and get an action mean vector $ \mu $. While in the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process, to add exploration noise to the action. When testing, use the mean vector $\mu$ as-is.
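As an illustration of the exploration step described above, a minimal Ornstein-Uhlenbeck noise process could look like the following sketch; the parameter values are common defaults and not taken from any preset.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Mean-reverting noise process used to perturb the actor's action mean (sketch)."""

    def __init__(self, action_size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.ones(action_size) * mu

    def sample(self):
        # Drift back towards mu, plus a Gaussian perturbation
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state += dx
        return self.state

# Training: action = mu(s_t) + noise.sample();  testing: action = mu(s_t)
```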
### Training the network
diff --git a/docs/docs/algorithms/policy_optimization/pg.md b/docs/docs/algorithms/policy_optimization/pg.md
index c890510..c3b6bb8 100644
--- a/docs/docs/algorithms/policy_optimization/pg.md
+++ b/docs/docs/algorithms/policy_optimization/pg.md
@@ -1,8 +1,8 @@
-# Policy Gradient
+# Policy Gradient
-**Actions space:** Discrete|Continuous
-
-**References:** [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
+**Actions space:** Discrete|Continuous
+
+**References:** [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
## Network Structure
@@ -12,7 +12,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action - Discrete actions
Run the current states through the network and get a policy distribution over the actions. While training, sample from the policy distribution. When testing, take the action with the highest probability.
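A minimal sketch of this action-selection rule, assuming the network outputs a probability per discrete action (the function name is illustrative):

```python
import numpy as np

def choose_discrete_action(action_probabilities, training=True):
    """Sample from the policy while training, act greedily when testing (sketch)."""
    action_probabilities = np.asarray(action_probabilities)
    if training:
        return int(np.random.choice(len(action_probabilities), p=action_probabilities))
    return int(np.argmax(action_probabilities))
```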
diff --git a/docs/docs/algorithms/policy_optimization/ppo.md b/docs/docs/algorithms/policy_optimization/ppo.md
index 8e23b05..a4a4b97 100644
--- a/docs/docs/algorithms/policy_optimization/ppo.md
+++ b/docs/docs/algorithms/policy_optimization/ppo.md
@@ -1,8 +1,8 @@
-# Proximal Policy Optimization
+# Proximal Policy Optimization
-**Actions space:** Discrete|Continuous
-
-**References:** [Emergence of Locomotion Behaviours in Rich Environments](https://arxiv.org/pdf/1707.02286.pdf)
+**Actions space:** Discrete|Continuous
+
+**References:** [Emergence of Locomotion Behaviours in Rich Environments](https://arxiv.org/pdf/1707.02286.pdf)
## Network Structure
@@ -13,7 +13,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action - Continuous actions
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation. While in the training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values. When testing, just take the mean values predicted by the network.
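A minimal sketch of the continuous action selection described above, assuming the network outputs per-dimension mean and standard deviation vectors:

```python
import numpy as np

def choose_continuous_action(mean, std, training=True):
    """Sample from the predicted Gaussian while training, use the mean when testing (sketch)."""
    mean, std = np.asarray(mean), np.asarray(std)
    if training:
        # Independent Gaussian per action dimension
        return np.random.normal(mean, std)
    return mean
```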
### Training the network
diff --git a/docs/docs/algorithms/value_optimization/bs_dqn.md b/docs/docs/algorithms/value_optimization/bs_dqn.md
index 4ee1ee1..7acccd6 100644
--- a/docs/docs/algorithms/value_optimization/bs_dqn.md
+++ b/docs/docs/algorithms/value_optimization/bs_dqn.md
@@ -1,8 +1,8 @@
-# Bootstrapped DQN
+# Bootstrapped DQN
-**Actions space:** Discrete
-
-**References:** [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621)
+**Actions space:** Discrete
+
+**References:** [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621)
## Network Structure
@@ -12,7 +12,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action
The current states are used as the input to the network. The network contains several $Q$ heads, each of which returns a different estimate of the action $ Q $ values. For each episode, the bootstrapped exploration policy selects a single head to play with during that episode, and only the output $ Q $ values of the selected head are used. Using those $ Q $ values, the exploration policy then selects the action to take.
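A minimal sketch of the per-episode head selection described above; the class and method names are illustrative and do not reflect the actual exploration-policy API.

```python
import numpy as np

class BootstrappedHeadSelector:
    """Pick one Q head per episode and act greedily with respect to it (sketch)."""

    def __init__(self, num_heads):
        self.num_heads = num_heads
        self.current_head = 0

    def start_episode(self):
        # A single head is chosen at the start of each episode and used throughout it
        self.current_head = np.random.randint(self.num_heads)

    def choose_action(self, q_values_per_head):
        # q_values_per_head: array of shape (num_heads, num_actions) from the network
        return int(np.argmax(q_values_per_head[self.current_head]))
```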
diff --git a/docs/docs/algorithms/value_optimization/distributional_dqn.md b/docs/docs/algorithms/value_optimization/distributional_dqn.md
index 5dcc4c2..009a518 100644
--- a/docs/docs/algorithms/value_optimization/distributional_dqn.md
+++ b/docs/docs/algorithms/value_optimization/distributional_dqn.md
@@ -14,7 +14,7 @@
-## Algorithmic Description
+## Algorithm Description
### Training the network
diff --git a/docs/docs/algorithms/value_optimization/double_dqn.md b/docs/docs/algorithms/value_optimization/double_dqn.md
index 3ff88dc..2e81524 100644
--- a/docs/docs/algorithms/value_optimization/double_dqn.md
+++ b/docs/docs/algorithms/value_optimization/double_dqn.md
@@ -1,8 +1,8 @@
# Double DQN
-**Actions space:** Discrete
+**Actions space:** Discrete
-**References:** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461.pdf)
+**References:** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461.pdf)
## Network Structure
@@ -14,7 +14,7 @@
-## Algorithmic Description
+## Algorithm Description
### Training the network
1. Sample a batch of transitions from the replay buffer.
diff --git a/docs/docs/algorithms/value_optimization/dqn.md b/docs/docs/algorithms/value_optimization/dqn.md
index a21d19c..fe3e1eb 100644
--- a/docs/docs/algorithms/value_optimization/dqn.md
+++ b/docs/docs/algorithms/value_optimization/dqn.md
@@ -14,7 +14,7 @@
-## Algorithmic Description
+## Algorithm Description
### Training the network
diff --git a/docs/docs/algorithms/value_optimization/dueling_dqn.md b/docs/docs/algorithms/value_optimization/dueling_dqn.md
index 0b0b15d..2b7c543 100644
--- a/docs/docs/algorithms/value_optimization/dueling_dqn.md
+++ b/docs/docs/algorithms/value_optimization/dueling_dqn.md
@@ -1,8 +1,8 @@
-# Dueling DQN
+# Dueling DQN
-**Actions space:** Discrete
-
-**References:** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)
+**Actions space:** Discrete
+
+**References:** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)
## Network Structure
@@ -15,7 +15,7 @@
## General Description
Dueling DQN introduces a change in the network structure compared to DQN.
-Dueling DQN uses a speciallized _Dueling Q Head_ in order to seperate $ Q $ to an $ A $ (advantage) stream and a $ V $ stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves the learning.
+Dueling DQN uses a specialized _Dueling Q Head_ in order to separate $ Q $ into an $ A $ (advantage) stream and a $ V $ (value) stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves learning.
-In many states, the values of the diiferent actions are very similar, and it is less important which action to take.
+In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training iteration, for each of the states in the batch, we update the $Q$ values only for the specific actions taken in those states. This results in slower learning, as we do not learn the $Q$ values for actions that were not taken yet. With the dueling architecture, on the other hand, learning is faster, as we start learning the state-value even if only a single action has been taken in this state.
\ No newline at end of file
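The two streams are combined back into $ Q $ values with the mean-subtraction aggregation proposed in the referenced paper; a minimal sketch:

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine the V and A streams into Q values (sketch of the paper's aggregation).

    state_value -- scalar V(s) predicted by the value stream
    advantages  -- A(s, a) for every action, shape (num_actions,)
    """
    advantages = np.asarray(advantages, dtype=float)
    # Subtracting the mean advantage keeps the V/A decomposition identifiable
    return state_value + advantages - advantages.mean()
```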
diff --git a/docs/docs/algorithms/value_optimization/mmc.md b/docs/docs/algorithms/value_optimization/mmc.md
index 412f564..82f3814 100644
--- a/docs/docs/algorithms/value_optimization/mmc.md
+++ b/docs/docs/algorithms/value_optimization/mmc.md
@@ -1,8 +1,8 @@
# Mixed Monte Carlo
-**Actions space:** Discrete
+**Actions space:** Discrete
-**References:** [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310)
+**References:** [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310)
## Network Structure
@@ -12,7 +12,7 @@
-## Algorithmic Description
+## Algorithm Description
### Training the network
In MMC, targets are calculated as a mixture of Double DQN targets and full Monte Carlo samples (total discounted returns).
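A sketch of the target mixing; the mixing coefficient below is illustrative and would normally be set by the preset.

```python
def mmc_target(ddqn_target, monte_carlo_return, mixing_ratio=0.1):
    """Mix the one-step Double DQN target with the full discounted return (sketch)."""
    return (1.0 - mixing_ratio) * ddqn_target + mixing_ratio * monte_carlo_return
```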
diff --git a/docs/docs/algorithms/value_optimization/n_step.md b/docs/docs/algorithms/value_optimization/n_step.md
index 4fa7bd2..a6f61bc 100644
--- a/docs/docs/algorithms/value_optimization/n_step.md
+++ b/docs/docs/algorithms/value_optimization/n_step.md
@@ -14,7 +14,7 @@
-## Algorithmic Description
+## Algorithm Description
### Training the network
diff --git a/docs/docs/algorithms/value_optimization/naf.md b/docs/docs/algorithms/value_optimization/naf.md
index 8e0bffd..8b32eec 100644
--- a/docs/docs/algorithms/value_optimization/naf.md
+++ b/docs/docs/algorithms/value_optimization/naf.md
@@ -1,8 +1,8 @@
-# Normalized Advantage Functions
+# Normalized Advantage Functions
-**Actions space:** Continuous
-
-**References:** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748.pdf)
+**Actions space:** Continuous
+
+**References:** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748.pdf)
## Network Structure
@@ -12,7 +12,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action
The current state is used as an input to the network. The action mean $ \mu(s_t ) $ is extracted from the output head. It is then passed to the exploration policy, which adds noise in order to encourage exploration.
### Training the network
diff --git a/docs/docs/algorithms/value_optimization/nec.md b/docs/docs/algorithms/value_optimization/nec.md
index 87c4946..9a8caef 100644
--- a/docs/docs/algorithms/value_optimization/nec.md
+++ b/docs/docs/algorithms/value_optimization/nec.md
@@ -1,8 +1,8 @@
-# Neural Episodic Control
+# Neural Episodic Control
-**Actions space:** Discrete
-
-**References:** [Neural Episodic Control](https://arxiv.org/abs/1703.01988)
+**Actions space:** Discrete
+
+**References:** [Neural Episodic Control](https://arxiv.org/abs/1703.01988)
## Network Structure
@@ -12,7 +12,7 @@
-## Algorithmic Description
+## Algorithm Description
### Choosing an action
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate output from the middleware.
2. For each possible action $a_i$, run the DND head using the state embedding and the selected action $a_i$ as inputs. The DND is queried and returns the $ P $ nearest neighbor keys and values. The keys and values are used to calculate and return the action $ Q $ value from the network.
@@ -20,7 +20,7 @@
4. Store the state embeddings and actions taken during the current episode in a small buffer $B$, in order to accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
### Finalizing an episode
-For each step in the episode, the state embeddings and the taken actions where stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $ N $-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. Those values are inserted along with the total return into the DND, and the buffer $B$ is reset.
+For each step in the episode, the state embeddings and the actions taken are stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $ N $-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. The embeddings, along with their computed returns, are then inserted into the DND, and the buffer $B$ is reset.
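A rough sketch of the $ N $-step return computation performed when the episode ends; the function signature and the bootstrap array are assumptions.

```python
def n_step_returns(rewards, bootstrap_q_values, n, gamma=0.99):
    """N-step total returns for every transition of a finished episode (sketch).

    rewards            -- r_0 ... r_{T-1} collected during the episode
    bootstrap_q_values -- bootstrap_q_values[t] holds max_a Q(s_{t+n}, a)
    """
    T = len(rewards)
    returns = []
    for t in range(T):
        horizon = min(t + n, T)
        g = sum(gamma ** (i - t) * rewards[i] for i in range(t, horizon))
        # Bootstrap only when the N-th transition falls inside the episode
        if t + n < T:
            g += gamma ** n * bootstrap_q_values[t]
        returns.append(g)
    return returns

# Each stored embedding is then inserted into the DND of its action,
# with the corresponding return as the value, and the buffer B is reset.
```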
### Training the network
Train the network only when the DND has enough entries for querying.
diff --git a/docs/docs/algorithms/value_optimization/pal.md b/docs/docs/algorithms/value_optimization/pal.md
index 37be118..4733a89 100644
--- a/docs/docs/algorithms/value_optimization/pal.md
+++ b/docs/docs/algorithms/value_optimization/pal.md
@@ -12,11 +12,11 @@
-## Algorithmic Description
+## Algorithm Description
### Training the network
1. Sample a batch of transitions from the replay buffer.
-2. Start by calculating theinitial target values in the same manner as they are calculated in DDQN
+2. Start by calculating the initial target values in the same manner as they are calculated in DDQN
$$ y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a)) $$
3. The action gap $ V(s_t )-Q(s_t,a_t) $ should then be subtracted from each of the calculated targets. To calculate the action gap, run the target network using the current states and get the $ Q $ values for all the actions. Then estimate $ V $ as the maximum predicted $ Q $ value for the current state:
$$ V(s_t )=max_a Q(s_t,a) $$
@@ -24,7 +24,7 @@
$$ y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t )) $$
5. For _persistent advantage learning (PAL)_, the target network is also used in order to calculate the action gap for the next state:
$$ V(s_{t+1} )-Q(s_{t+1},a_{t+1}) $$
- Where $ a_{t+1} $ is chosen by running the next states through the online network and choosing the action that has the highest predicted $ Q $ value. Finally, the targets will be defined as -
+ where $ a_{t+1} $ is chosen by running the next states through the online network and choosing the action that has the highest predicted $ Q $ value. Finally, the targets will be defined as -
$$ y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} )) $$
6. Train the online network using the current states as inputs, and with the aforementioned targets.
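Putting steps 2-5 together, a vectorized sketch of the target computation might look as follows; the array names, shapes and the value of $ \alpha $ are assumptions.

```python
import numpy as np

def pal_targets(ddqn_targets, target_q_current, target_q_next, actions, next_actions, alpha=0.9):
    """Persistent advantage learning targets for one batch (sketch).

    ddqn_targets     -- y_t^DDQN from step 2, shape (batch,)
    target_q_current -- target-network Q values for the current states, shape (batch, num_actions)
    target_q_next    -- target-network Q values for the next states, shape (batch, num_actions)
    actions          -- actions taken in the current states, shape (batch,)
    next_actions     -- argmax actions chosen by the online network for the next states
    """
    target_q_current = np.asarray(target_q_current)
    target_q_next = np.asarray(target_q_next)
    batch = np.arange(len(actions))
    # Action gap for the current state: V(s_t) - Q(s_t, a_t)
    gap_current = target_q_current.max(axis=1) - target_q_current[batch, actions]
    # Action gap for the next state: V(s_{t+1}) - Q(s_{t+1}, a_{t+1})
    gap_next = target_q_next.max(axis=1) - target_q_next[batch, next_actions]
    return np.asarray(ddqn_targets) - alpha * np.minimum(gap_current, gap_next)
```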