moving the docs to github

2026-03-19 00:13:46 +01:00 · 2018-04-23 09:14:20 +03:00
parent cafa152382
commit 5d5562bf62
118 changed files with 10792 additions and 3 deletions
--- a/docs_raw/docs/algorithms/value_optimization/bs_dqn.md
+++ b/docs_raw/docs/algorithms/value_optimization/bs_dqn.md
@@ -0,0 +1,30 @@
+# Bootstrapped DQN
+
+**Actions space:** Discrete
+
+**References:** [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\bs_dqn.png">
+
+</p>
+
+## Algorithm Description
+### Choosing an action
+The current states are used as the input to the network. The network contains several $Q$ heads, which return different estimates of the action $Q$ values. At the start of each episode, the bootstrapped exploration policy selects a single head to act with for the whole episode. Only the $Q$ values output by the selected head are used, and the exploration policy selects the action from those values.
+
+### Storing the transitions
+For each transition, a binomial mask is generated according to a predefined probability and the number of output heads. The mask is a binary vector in which each element holds 0 for a head that should not train on the transition and 1 for a head that should. The mask is stored as part of the transition info in the replay buffer.
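The mask generation described above can be sketched in plain Python. This is only an illustration, not the library's actual implementation: the function name `bootstrap_mask`, the dictionary layout of the transition, and the parameter values are assumptions made for the example.

```python
import random


def bootstrap_mask(num_heads, probability, rng):
    """Return one 0/1 flag per head; 1 means the head trains on the transition."""
    # Each head is included independently with the given probability.
    return [1 if rng.random() < probability else 0 for _ in range(num_heads)]


rng = random.Random(0)  # seeded generator so the example is reproducible

# Hypothetical transition record; the mask is stored alongside the other fields.
transition = {"state": [0.0, 1.0], "action": 2, "reward": 1.0}
transition["mask"] = bootstrap_mask(num_heads=10, probability=0.5, rng=rng)
```

With a probability of 1.0 every head trains on every transition, which removes the bootstrapping effect; lower values give each head a different subset of the data.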
+
+### Training the network
+First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the current $Q$ value predictions for all the heads and all the actions. For each transition in the batch, and for each output head, if the transition mask is 1, change the target of the played action to $y_t$, according to the standard DQN update rule:
+
+$$ y_t=r(s_t,a_t)+\gamma\cdot \max_a Q(s_{t+1},a) $$
+
+Otherwise, leave it intact so that the transition does not affect the learning of this head. Then, train the online network according to the calculated targets.
+
+As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.
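As a rough sketch of the masked target computation for a single transition, assuming the per-head $Q$ values are supplied by the caller as nested lists (one list of action values per head). The function name `masked_targets` and its arguments are hypothetical, and terminal-state handling is left out for brevity.

```python
def masked_targets(q_current, q_next, action, reward, mask, gamma=0.99):
    """Build per-head targets for one transition.

    q_current, q_next: one list of action values per head.
    mask: one 0/1 flag per head, as stored with the transition.
    """
    # Start from the current predictions so untouched entries give zero error.
    targets = [list(head_q) for head_q in q_current]
    for head, use_transition in enumerate(mask):
        if use_transition == 1:
            # Standard DQN target, applied only to the played action.
            targets[head][action] = reward + gamma * max(q_next[head])
    return targets


targets = masked_targets(
    q_current=[[1.0, 2.0], [3.0, 4.0]],
    q_next=[[0.5, 1.5], [2.0, 1.0]],
    action=0,
    reward=1.0,
    mask=[1, 0],
    gamma=0.5,
)
# Head 0 is updated for action 0; head 1 keeps its own predictions.
```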
+
--- a/docs_raw/docs/algorithms/value_optimization/categorical_dqn.md
+++ b/docs_raw/docs/algorithms/value_optimization/categorical_dqn.md
@@ -0,0 +1,33 @@
+# Categorical DQN
+
+**Actions space:** Discrete
+
+**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\distributional_dqn.png">
+
+</p>
+
+
+
+## Algorithm Description
+
+### Training the network
+
+1. Sample a batch of transitions from the replay buffer. 
+2. The Bellman update is projected onto the set of atoms representing the $Q$ value distribution, such that the $i$-th component of the projected update is calculated as follows:
+   $$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$
+   where:
+   	*  $[ \cdot ]^b_a$ bounds its argument in the range $[a, b]$
+   	*  $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: &nbsp; &nbsp;   $\hat{T}_{z_{j}} := r+\gamma z_j$
+
+
+3. The network is trained with the cross-entropy loss between its predicted probability distribution and the projected target distribution. Only the targets of the actions that were actually taken are updated.
+4. Once in every few thousand steps, weights are copied from the online network to the target network.
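The projection in step 2 can be written out directly from the formula. The sketch below is a plain-Python illustration for a single transition, assuming evenly spaced atoms between `v_min` and `v_max`; the function name `project_distribution` and its arguments are hypothetical, and terminal states are not handled.

```python
def project_distribution(next_probs, reward, gamma, v_min, v_max):
    """Project the Bellman-updated atoms back onto the fixed support.

    next_probs: p_j(s_{t+1}, pi(s_{t+1})), one probability per atom.
    """
    num_atoms = len(next_probs)
    delta_z = (v_max - v_min) / (num_atoms - 1)
    atoms = [v_min + i * delta_z for i in range(num_atoms)]

    projected = [0.0] * num_atoms
    for j, p_j in enumerate(next_probs):
        # Bellman update for atom z_j, bounded to [v_min, v_max].
        tz_j = min(v_max, max(v_min, reward + gamma * atoms[j]))
        for i, z_i in enumerate(atoms):
            # Triangular weight, bounded to [0, 1].
            weight = 1.0 - abs(tz_j - z_i) / delta_z
            projected[i] += min(1.0, max(0.0, weight)) * p_j
    return projected


# All mass on the middle atom (z = 0); a reward of 0.5 splits it between z = 0 and z = 1.
projected = project_distribution([0.0, 1.0, 0.0], reward=0.5, gamma=1.0, v_min=-1.0, v_max=1.0)
```

The double loop mirrors the sum in the formula term by term; a practical implementation would normally vectorize it over the batch.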
+
+
+
--- a/docs_raw/docs/algorithms/value_optimization/distributional_dqn.md
+++ b/docs_raw/docs/algorithms/value_optimization/distributional_dqn.md
@@ -0,0 +1,33 @@
+# Distributional DQN
+
+**Actions space:** Discrete
+
+**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\distributional_dqn.png">
+
+</p>
+
+
+
+## Algorithm Description
+
+### Training the network
+
+1. Sample a batch of transitions from the replay buffer. 
+2. The Bellman update is projected onto the set of atoms representing the $Q$ value distribution, such that the $i$-th component of the projected update is calculated as follows:
+   $$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$
+   where:
+   	*  $[ \cdot ]^b_a$ bounds its argument in the range $[a, b]$
+   	*  $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: &nbsp; &nbsp;   $\hat{T}_{z_{j}} := r+\gamma z_j$
+
+
+3. The network is trained with the cross-entropy loss between its predicted probability distribution and the projected target distribution. Only the targets of the actions that were actually taken are updated.
+4. Once in every few thousand steps, weights are copied from the online network to the target network.
+
+
+
--- a/docs_raw/docs/algorithms/value_optimization/double_dqn.md
+++ b/docs_raw/docs/algorithms/value_optimization/double_dqn.md
@@ -0,0 +1,28 @@
+# Double DQN
+
+**Actions space:** Discrete
+
+**References:** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461.pdf)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\dqn.png">
+
+</p>
+
+
+
+## Algorithm Description
+
+### Training the network
+1. Sample a batch of transitions from the replay buffer. 
+2. Using the next states from the sampled batch, run the online network to find the $Q$-maximizing action $\arg\max_a Q(s_{t+1},a)$. Then run the target network on the same next states to calculate $Q(s_{t+1},\arg\max_a Q(s_{t+1},a))$ for these actions.
+3. To zero out the updates for the actions that were not played (their MSE loss becomes zero), run the online network on the current states from the sampled batch to get the current $Q$ value predictions, and set those values as the targets for the actions that were not actually played.
+4. For each action that was played, use the following equation to calculate the targets of the network:
+   $$ y_t=r(s_t,a_t)+\gamma \cdot Q(s_{t+1},\arg\max_a Q(s_{t+1},a)) $$
+
+
+5. Finally, train the online network using the current states as inputs, and with the aforementioned targets. 
+6. Once in every few thousand steps, copy the weights from the online network to the target network.
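Steps 2 to 4 can be sketched as a single target-building function. This is an illustrative plain-Python version that assumes the network outputs have already been computed and are passed in as lists of per-action values; the name `double_dqn_targets` and its argument names are hypothetical, and terminal states are ignored.

```python
def double_dqn_targets(q_online_current, q_online_next, q_target_next,
                       actions, rewards, gamma=0.99):
    """Return one target vector per transition in the batch."""
    targets = []
    for q_cur, q_on_next, q_tg_next, action, reward in zip(
            q_online_current, q_online_next, q_target_next, actions, rewards):
        # Step 3: unplayed actions keep the current prediction (zero error).
        row = list(q_cur)
        # Step 2: the online network picks the action, the target network scores it.
        best_action = max(range(len(q_on_next)), key=lambda a: q_on_next[a])
        # Step 4: overwrite only the played action with the Double DQN target.
        row[action] = reward + gamma * q_tg_next[best_action]
        targets.append(row)
    return targets


targets = double_dqn_targets(
    q_online_current=[[1.0, 2.0]],
    q_online_next=[[0.0, 5.0]],   # online network prefers action 1
    q_target_next=[[3.0, 1.0]],   # target network values action 1 at 1.0
    actions=[0],
    rewards=[1.0],
    gamma=0.5,
)
```

Note that the target network's own maximum (3.0) is not used here; decoupling action selection from evaluation is what distinguishes this from the plain DQN target.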
--- a/docs_raw/docs/algorithms/value_optimization/dqn.md
+++ b/docs_raw/docs/algorithms/value_optimization/dqn.md
@@ -0,0 +1,28 @@
+# Deep Q Networks
+
+**Actions space:** Discrete
+
+**References:** [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\dqn.png">
+
+</p>
+
+
+
+## Algorithm Description
+
+### Training the network
+
+1. Sample a batch of transitions from the replay buffer. 
+2. Using the next states from the sampled batch, run the target network to calculate the $Q$ values for each of the actions, $Q(s_{t+1},a)$, and keep only the maximum value for each state.
+3. To zero out the updates for the actions that were not played (their MSE loss becomes zero), run the online network on the current states from the sampled batch to get the current $Q$ value predictions, and set those values as the targets for the actions that were not actually played.
+4. For each action that was played, use the following equation to calculate the targets of the network:
+   $$ y_t=r(s_t,a_t)+\gamma\cdot \max_a Q(s_{t+1},a) $$
+
+
+5. Finally, train the online network using the current states as inputs, and with the aforementioned targets. 
+6. Once in every few thousand steps, copy the weights from the online network to the target network.
--- a/docs_raw/docs/algorithms/value_optimization/dueling_dqn.md
+++ b/docs_raw/docs/algorithms/value_optimization/dueling_dqn.md
@@ -0,0 +1,21 @@
+# Dueling DQN
+
+**Actions space:** Discrete
+
+**References:** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\dueling_dqn.png">
+
+</p>
+
+## General Description
+Dueling DQN changes the network structure compared to DQN.
+
+Dueling DQN uses a specialized _Dueling Q Head_ to separate $Q$ into an $A$ (advantage) stream and a $V$ (state value) stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves the learning.
+
+In many states, the values of the different actions are very similar, and it matters less which action is taken.
+This is especially relevant in environments where there are many actions to choose from. In DQN, on each training iteration, for each of the states in the batch, we update the $Q$ values only for the specific actions taken in those states. This results in slower learning, as we do not learn the $Q$ values for actions that have not been taken yet. With the dueling architecture, on the other hand, learning is faster, as we start learning the state value even if only a single action has been taken in that state.
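The text above does not spell out how the two streams are recombined. A minimal sketch, assuming the mean-subtracted aggregation proposed in the referenced paper, $Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')$; the function name `dueling_q_values` is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine the V stream and the A stream into per-action Q values."""
    # Subtracting the mean advantage keeps V and A identifiable.
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


q_values = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
```

Because every $Q$ value shares the same `state_value` term, a training signal for one action also moves the estimates for all the other actions in that state.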
--- a/docs_raw/docs/algorithms/value_optimization/mmc.md
+++ b/docs_raw/docs/algorithms/value_optimization/mmc.md
@@ -0,0 +1,32 @@
+# Mixed Monte Carlo
+
+**Actions space:** Discrete
+
+**References:** [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="../../design_imgs/dqn.png">
+
+</p>
+
+## Algorithm Description
+### Training the network
+In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
+
+The DDQN targets are calculated in the same manner as in the DDQN agent:
+
+$$ y_t^{DDQN}=r(s_t,a_t)+\gamma Q(s_{t+1},\arg\max_a Q(s_{t+1},a)) $$
+
+The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:
+
+$$ y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} ) $$
+
+A mixing ratio $\alpha$ is then used to get the final targets:
+
+$$ y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC} $$ 
+
+Finally, the online network is trained using the current states as inputs, and the calculated targets.
+Once in every few thousand steps, copy the weights from the online network to the target network.
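The Monte Carlo returns and the mixing step can be sketched as follows, assuming the DDQN targets have already been computed elsewhere. The function names `discounted_returns` and `mix_targets` are hypothetical and chosen only for this example.

```python
def discounted_returns(rewards, gamma):
    """Total discounted return from each step to the end of the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mix_targets(ddqn_targets, mc_returns, alpha):
    """Blend DDQN targets with Monte Carlo returns using mixing ratio alpha."""
    return [(1.0 - alpha) * y + alpha * g for y, g in zip(ddqn_targets, mc_returns)]


mc_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
mixed = mix_targets(ddqn_targets=[2.0], mc_returns=[4.0], alpha=0.25)
```

Setting `alpha` to 0 recovers plain Double DQN targets, while 1 gives pure Monte Carlo targets.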
--- a/docs_raw/docs/algorithms/value_optimization/n_step.md
+++ b/docs_raw/docs/algorithms/value_optimization/n_step.md
@@ -0,0 +1,30 @@
+# N-Step Q Learning
+
+**Actions space:** Discrete
+
+**References:** [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\dqn.png">
+
+</p>
+
+
+
+## Algorithm Description
+
+### Training the network
+
+The $N$-step Q learning algorithm works in a similar manner to DQN except for the following changes:
+
+1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every $N$ steps using the latest $N$ steps played by the agent.
+
+2. To stabilize the learning, multiple workers work together to update the network. This has the same effect as decorrelating the samples used for training.
+
+3. Instead of using single-step Q targets for the network, the rewards from $N$ consecutive steps are accumulated to form the $N$-step Q targets, according to the following equation:
+$$R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})$$
+where $k$ is $T_{max} - State\_Index$ for each state in the batch.
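The target equation above can be evaluated for a whole rollout with one backward pass. The sketch below assumes the bootstrap value $V(s_{t+k})$ for the state after the last step is supplied by the caller; the function name `n_step_targets` is hypothetical.

```python
def n_step_targets(rewards, bootstrap_value, gamma):
    """Compute R(s_t, a_t) for every step of a rollout of length T_max.

    The step at index t accumulates k = T_max - t rewards and then
    bootstraps from the value of the state that follows the rollout.
    """
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


targets = n_step_targets(rewards=[1.0, 2.0], bootstrap_value=4.0, gamma=0.5)
# Index 0: 1 + 0.5 * 2 + 0.25 * 4 = 3.0; index 1: 2 + 0.5 * 4 = 4.0
```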
+
--- a/docs_raw/docs/algorithms/value_optimization/naf.md
+++ b/docs_raw/docs/algorithms/value_optimization/naf.md
@@ -0,0 +1,22 @@
+# Normalized Advantage Functions
+
+**Actions space:** Continuous
+
+**References:** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748.pdf)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\naf.png" width=600>
+
+</p>
+
+## Algorithm Description
+### Choosing an action
+The current state is used as an input to the network. The action mean $ \mu(s_t ) $ is extracted from the output head. It is then passed to the exploration policy which adds noise in order to encourage exploration.
+### Training the network
+The network is trained by using the following targets:
+$$ y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1}) $$
+Use the next states as the inputs to the target network and extract the $ V $ value, from within the head, to get $ V(s_{t+1} ) $. Then, update the online network using the current states and actions as inputs, and $ y_t $ as the targets.
+After every training step, use a soft update in order to copy the weights from the online network to the target network.
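A soft update moves the target weights a small step toward the online weights instead of copying them outright. A minimal sketch over flat lists of weights, assuming an interpolation rate named `tau` (the name and the example value are assumptions, not taken from the text above):

```python
def soft_update(target_weights, online_weights, tau):
    """Return target weights nudged toward the online weights by a factor tau."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]


new_target = soft_update(target_weights=[0.0, 2.0], online_weights=[1.0, 4.0], tau=0.5)
```

In practice `tau` is usually much smaller than 0.5, so the target network changes slowly; `tau = 1` reduces to a full copy.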
--- a/docs_raw/docs/algorithms/value_optimization/nec.md
+++ b/docs_raw/docs/algorithms/value_optimization/nec.md
@@ -0,0 +1,28 @@
+# Neural Episodic Control
+
+**Actions space:** Discrete
+
+**References:** [Neural Episodic Control](https://arxiv.org/abs/1703.01988)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="..\..\design_imgs\nec.png" width=500>
+
+</p>
+
+## Algorithm Description
+### Choosing an action
+1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate output from the middleware. 
+2. For each possible action $a_i$, run the DND head using the state embedding and the selected action $a_i$ as inputs. The DND is queried and returns the $ P $ nearest neighbor keys and values. The keys and values are used to calculate and return the action $ Q $ value from the network. 
+3. Pass all the $ Q $ values to the exploration policy and choose an action accordingly. 
+4. Store the state embeddings and actions taken during the current episode in a small buffer $B$, in order to accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
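Step 2 does not say how the retrieved keys and values are turned into a $Q$ value. One possible sketch, assuming the inverse-distance kernel weighting from the referenced paper and a brute-force neighbor search; the function name `dnd_lookup`, the `delta` default, and the list-based storage are assumptions made for the example.

```python
def dnd_lookup(query, keys, values, p, delta=1e-3):
    """Estimate Q for one action from its P nearest stored embeddings."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Brute-force search for the P nearest keys (real systems use an index).
    order = sorted(range(len(keys)), key=lambda i: squared_distance(query, keys[i]))
    nearest = order[:p]

    # Inverse-distance kernel; delta avoids division by zero on exact matches.
    kernels = [1.0 / (squared_distance(query, keys[i]) + delta) for i in nearest]
    total = sum(kernels)
    # Q is the kernel-weighted average of the stored values.
    return sum(k * values[i] for k, i in zip(kernels, nearest)) / total


q_value = dnd_lookup(
    query=[0.0, 0.0],
    keys=[[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]],
    values=[1.0, 3.0, 100.0],
    p=2,
    delta=1.0,
)
```

The distant third entry is excluded by the neighbor limit, so it has no effect on the estimate.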
+
+### Finalizing an episode
+For each step in the episode, the state embeddings and the taken actions are stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $N$-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. The stored state embeddings are then inserted into the DND along with their total returns, and the buffer $B$ is reset.
+### Training the network
+Train the network only when the DND has enough entries for querying.
+
+To train the network, the current states are used as the inputs and the $N$-step returns are used as the targets. The $N$-step return used takes into account $ N $ consecutive steps, and bootstraps the last value from the network if necessary:
+$$ y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j}) +\gamma^N \max_a Q(s_{t+N},a) $$
--- a/docs_raw/docs/algorithms/value_optimization/pal.md
+++ b/docs_raw/docs/algorithms/value_optimization/pal.md
@@ -0,0 +1,32 @@
+# Persistent Advantage Learning
+
+**Actions space:** Discrete
+
+**References:** [Increasing the Action Gap: New Operators for Reinforcement Learning](https://arxiv.org/abs/1512.04860)
+
+## Network Structure
+
+<p style="text-align: center;">
+
+<img src="../../design_imgs/dqn.png">
+
+</p> 
+
+## Algorithm Description
+### Training the network
+1. Sample a batch of transitions from the replay buffer. 
+
+2. Start by calculating the initial target values in the same manner as they are calculated in DDQN:
+   $$ y_t^{DDQN}=r(s_t,a_t)+\gamma Q(s_{t+1},\arg\max_a Q(s_{t+1},a)) $$
+3. The action gap $V(s_t)-Q(s_t,a_t)$ should then be subtracted from each of the calculated targets. To calculate the action gap, run the target network using the current states and get the $Q$ values for all the actions. Then estimate $V$ as the maximum predicted $Q$ value for the current state:
+   $$ V(s_t)=\max_a Q(s_t,a) $$
+4. For _advantage learning (AL)_, subtract the action gap, weighted by a predefined parameter $\alpha$, from the targets $y_t^{DDQN}$:
+   $$ y_t=y_t^{DDQN}-\alpha \cdot (V(s_t)-Q(s_t,a_t)) $$
+5. For _persistent advantage learning (PAL)_, the target network is also used to calculate the action gap for the next state:
+   $$ V(s_{t+1})-Q(s_{t+1},a_{t+1}) $$
+   where $a_{t+1}$ is chosen by running the next states through the online network and choosing the action that has the highest predicted $Q$ value. Finally, the targets are defined as:
+   $$ y_t=y_t^{DDQN}-\alpha \cdot \min(V(s_t)-Q(s_t,a_t),V(s_{t+1})-Q(s_{t+1},a_{t+1})) $$
+6. Train the online network using the current states as inputs, and with the aforementioned targets.
+
+7. Once in every few thousand steps, copy the weights from the online network to the target network.
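The AL and PAL corrections from steps 3 to 5 can be sketched for a single transition as follows. The sketch assumes the DDQN target and the target-network $Q$ values are already available as plain lists; the function names `al_target` and `pal_target` and their arguments are hypothetical.

```python
def action_gap(q_values, action):
    """V(s) - Q(s, a), with V estimated as the maximum Q value."""
    return max(q_values) - q_values[action]


def al_target(ddqn_target, q_target_current, action, alpha):
    """Advantage learning: subtract the weighted current-state action gap."""
    return ddqn_target - alpha * action_gap(q_target_current, action)


def pal_target(ddqn_target, q_target_current, action,
               q_target_next, next_action, alpha):
    """Persistent advantage learning: use the smaller of the two action gaps."""
    gap_current = action_gap(q_target_current, action)
    gap_next = action_gap(q_target_next, next_action)
    return ddqn_target - alpha * min(gap_current, gap_next)


# Current-state gap is 3 - 1 = 2; next-state gap is 4 - 3 = 1.
al = al_target(2.0, q_target_current=[1.0, 3.0], action=0, alpha=0.5)
pal = pal_target(2.0, q_target_current=[1.0, 3.0], action=0,
                 q_target_next=[4.0, 3.0], next_action=1, alpha=0.5)
```

Here `next_action` stands for the action chosen by the online network on the next state, as described in step 5.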
+