fixed some documentation typos
@@ -4,8 +4,6 @@ installation
http://www.mkdocs.org/#installation
2. install the math extension for mkdocs
sudo -E pip install python-markdown-math
3. install the material theme
sudo -E pip install mkdocs-material
to build the documentation website run:
- mkdocs build
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
1. The current states (observations and measurements) and the corresponding goal vector are passed as an input to the network. The output of the network is the predicted future measurements for time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$ for each possible action.
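The remaining selection steps fall outside this hunk; as a rough sketch based on the DFP paper (the array shapes and the goal-weighting rule are assumptions, not Coach's API), the predicted measurements can be scored per action with the goal vector and the highest-scoring action taken:

```python
import numpy as np

# Hypothetical shapes: predictions[action, horizon, measurement] holds the network
# output for the horizons t+1, t+2, t+4, t+8, t+16 and t+32.
num_actions, num_horizons, num_measurements = 4, 6, 3
predictions = np.random.randn(num_actions, num_horizons, num_measurements)
goal = np.array([0.5, 0.5, 1.0])  # assumed: same goal weighting at every horizon

# Score each action by the goal-weighted sum of its predicted future measurements.
scores = (predictions * goal).sum(axis=(1, 2))   # shape: (num_actions,)
action = int(np.argmax(scores))                  # greedy choice; exploration may perturb this
```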
@@ -22,4 +22,4 @@
### Training the network
Given a batch of transitions, run them through the network to get the current predictions of the future measurements per action, and set it as the initial targets for training the network. For each transition $(s_t,a_t,r_t,s_{t+1} )$ in the batch, the target of the network for the action that was taken, is the actual measurements that were seen in time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$. For the actions that were not taken, the targets are the current values.
Given a batch of transitions, run them through the network to get the current predictions of the future measurements per action, and set them as the initial targets for training the network. For each transition $(s_t,a_t,r_t,s_{t+1} )$ in the batch, the target of the network for the action that was taken, is the actual measurements that were seen in time-steps $t+1,t+2,t+4,t+8,t+16$ and $t+32$. For the actions that were not taken, the targets are the current values.
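A minimal sketch of assembling such targets, assuming the network output has shape `(batch, actions, horizons * measurements)`; only the taken action's row is overwritten (illustrative names, not Coach's API):

```python
import numpy as np

batch_size, num_actions, target_dim = 32, 4, 6 * 3      # assumed: 6 horizons x 3 measurements
predictions = np.random.randn(batch_size, num_actions, target_dim)   # current network output
actual_future = np.random.randn(batch_size, target_dim)              # measurements seen at t+1..t+32
actions_taken = np.random.randint(num_actions, size=batch_size)

# Start from the current predictions so untaken actions keep their current values,
# then overwrite the row of the action that was actually taken.
targets = predictions.copy()
targets[np.arange(batch_size), actions_taken] = actual_future
# `targets` is then used as the regression target for a standard supervised update.
```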
@@ -8,7 +8,7 @@
<p style="text-align: center;">
<img src="..\..\design_imgs\ac.png" width=500>
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Discrete actions
@@ -17,7 +17,7 @@ The policy network is used in order to predict action probabilites. While traini
### Training the network
A batch of $ T_{max} $ transitions is used, and the advantages are calculated upon it.
Advantages can be calculated by either of the followng methods (configured by the selected preset) -
Advantages can be calculated by either of the following methods (configured by the selected preset) -
1. **A_VALUE** - Estimating advantage directly:$$ A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t) $$where $k$ is $T_{max} - State\_Index$ for each state in the batch.
2. **GAE** - By following the [Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438) paper.
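For reference, the **A_VALUE** estimate above amounts to a backward pass over the batch; a small illustrative sketch (not Coach's API):

```python
import numpy as np

def a_value_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """n-step advantage: discounted rewards to the end of the batch,
    bootstrapped with V of the state following the last transition."""
    advantages = np.zeros(len(rewards))
    ret = bootstrap_value                      # V(s_{t+k}) for the state after the batch
    for i in reversed(range(len(rewards))):
        ret = rewards[i] + gamma * ret         # accumulate the discounted return
        advantages[i] = ret - values[i]        # subtract the baseline V(s_i)
    return advantages

# toy usage
print(a_value_advantages(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.3], bootstrap_value=0.2))
```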
@@ -11,7 +11,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Continuous action
Same as in PPO.
### Training the network
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
Pass the current states through the actor network, and get an action mean vector $ \mu $. While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process, to add exploration noise to the action. When testing, use the mean vector $\mu$ as-is.
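As an illustration of the exploration step, a discretized Ornstein-Uhlenbeck process can be added to the actor output during training; the parameters below are common defaults, not values prescribed here:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise, commonly paired with DDPG."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.ones(action_dim) * mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

noise = OrnsteinUhlenbeckNoise(action_dim=2)
mu_action = np.array([0.1, -0.3])             # placeholder for the actor network output
train_action = mu_action + noise.sample()     # training: mean plus exploration noise
test_action = mu_action                       # testing: use the mean vector as-is
```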
### Training the network
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Discrete actions
Run the current states through the network and get a policy distribution over the actions. While training, sample from the policy distribution. When testing, take the action with the highest probability.
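A minimal sketch of that rule, with `probs` standing in for the softmax output of the policy network:

```python
import numpy as np

probs = np.array([0.1, 0.6, 0.3])                      # policy distribution over 3 actions
train_action = np.random.choice(len(probs), p=probs)   # training: sample from the distribution
test_action = int(np.argmax(probs))                    # testing: take the most probable action
```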
@@ -13,7 +13,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action - Continuous actions
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation. While in training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values. When testing, just take the mean values predicted by the network.
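Schematically, assuming an independent Gaussian per action dimension (placeholder values):

```python
import numpy as np

mean = np.array([0.2, -0.1])   # mean vector predicted for this observation (placeholder)
std = np.array([0.3, 0.5])     # standard deviation vector (placeholder)

train_action = np.random.normal(mean, std)   # training: sample each dimension independently
test_action = mean                           # testing: act with the predicted mean
```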
### Training the network
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
The current states are used as the input to the network. The network contains several $Q$ heads, which are used for returning different estimations of the action $ Q $ values. For each episode, the bootstrapped exploration policy selects a single head to play with during the episode. According to the selected head, only the relevant output $ Q $ values are used. Using those $ Q $ values, the exploration policy then selects the action for acting.
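Schematically, with illustrative names and a greedy choice standing in for the configured exploration policy:

```python
import numpy as np

num_heads, num_actions = 10, 4
q_values_per_head = np.random.randn(num_heads, num_actions)   # stand-in for the network output

# At the start of each episode, pick one head and keep it for the whole episode.
active_head = np.random.randint(num_heads)

# During the episode, only the selected head's Q values drive action selection.
action = int(np.argmax(q_values_per_head[active_head]))
```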
@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network
@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network
1. Sample a batch of transitions from the replay buffer.
@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network
@@ -15,7 +15,7 @@
## General Description
Dueling DQN presents a change in the network structure compared to DQN.
Dueling DQN uses a speciallized _Dueling Q Head_ in order to seperate $ Q $ to an $ A $ (advantage) stream and a $ V $ stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves the learning.
Dueling DQN uses a specialized _Dueling Q Head_ in order to separate $ Q $ to an $ A $ (advantage) stream and a $ V $ stream. Adding this type of structure to the network head allows the network to better differentiate actions from one another, and significantly improves the learning.
In many states, the values of the diiferent actions are very similar, and it is less important which action to take.
In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training iteration, for each of the states in the batch, we update the $Q$ values only for the specific actions taken in those states. This results in slower learning as we do not learn the $Q$ values for actions that were not taken yet. On dueling architecture, on the other hand, learning is faster - as we start learning the state-value even if only a single action has been taken at this state.
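For reference, the dueling head typically recombines the two streams by subtracting the mean advantage, as in the Dueling DQN paper (the aggregation rule below is taken from that paper, not spelled out in the text above):

```python
import numpy as np

v = np.array([[1.2]])                   # state-value stream, shape (batch, 1)
a = np.array([[0.3, -0.1, 0.5, 0.1]])   # advantage stream, shape (batch, actions)

# Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
q = v + a - a.mean(axis=1, keepdims=True)
```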
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Training the network
In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
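In other words, with a mixing coefficient (here called `mc_ratio`, an illustrative name) the target is a convex combination of the two estimates:

```python
import numpy as np

ddqn_targets = np.array([1.5, 0.7, 2.1])   # standard Double DQN targets
mc_returns = np.array([1.2, 0.9, 2.5])     # full discounted returns of the same transitions
mc_ratio = 0.1                             # assumed mixing hyperparameter

targets = (1.0 - mc_ratio) * ddqn_targets + mc_ratio * mc_returns
```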
@@ -14,7 +14,7 @@
## Algorithmic Description
## Algorithm Description
### Training the network
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
The current state is used as an input to the network. The action mean $ \mu(s_t ) $ is extracted from the output head. It is then passed to the exploration policy which adds noise in order to encourage exploration.
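Schematically, with additive Gaussian noise standing in for whichever continuous exploration policy is configured (names and parameters are illustrative):

```python
import numpy as np

mu = np.array([0.4, -0.2])     # action mean from the output head (placeholder)
noise_std = 0.1                # assumed exploration parameter
action = mu + np.random.normal(0.0, noise_std, size=mu.shape)   # exploration adds noise
```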
### Training the network
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Choosing an action
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate output from the middleware.
2. For each possible action $a_i$, run the DND head using the state embedding and the selected action $a_i$ as inputs. The DND is queried and returns the $ P $ nearest neighbor keys and values. The keys and values are used to calculate and return the action $ Q $ value from the network.
@@ -20,7 +20,7 @@
4. Store the state embeddings and actions taken during the current episode in a small buffer $B$, in order to accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
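For reference, a rough sketch of how the $ P $ nearest keys and values returned in step 2 can be turned into a $ Q $ estimate, using the inverse-distance kernel from the NEC paper (the kernel choice is an assumption here, not stated above):

```python
import numpy as np

def dnd_q_value(embedding, keys, values, delta=1e-3):
    """Kernel-weighted average of the values of the P nearest neighbors."""
    distances = np.sum((keys - embedding) ** 2, axis=1)   # squared L2 distance to each key
    kernel = 1.0 / (distances + delta)                    # inverse-distance kernel
    weights = kernel / kernel.sum()                       # normalized weighting over neighbors
    return float(np.dot(weights, values))

# toy usage: P = 3 neighbors with 4-dimensional state embeddings
keys = np.random.randn(3, 4)
values = np.array([1.0, 0.5, 2.0])
print(dnd_q_value(np.random.randn(4), keys, values))
```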
### Finalizing an episode
For each step in the episode, the state embeddings and the taken actions where stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $ N $-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. Those values are inserted along with the total return into the DND, and the buffer $B$ is reset.
For each step in the episode, the state embeddings and the taken actions are stored in the buffer $B$. When the episode is finished, the replay buffer calculates the $ N $-step total return of each transition in the buffer, bootstrapped using the maximum $Q$ value of the $N$-th transition. Those values are inserted along with the total return into the DND, and the buffer $B$ is reset.
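The $ N $-step value inserted into the DND can be sketched as follows (illustrative names, not Coach's API):

```python
import numpy as np

def n_step_return(rewards, q_values_at_n, gamma=0.99):
    """Discounted sum of the next N rewards, bootstrapped with the maximum
    Q value of the N-th transition."""
    n = len(rewards)
    discounted = sum(gamma ** j * rewards[j] for j in range(n))
    return discounted + gamma ** n * np.max(q_values_at_n)

# toy usage with N = 3 rewards and the Q values of the N-th transition
print(n_step_return(rewards=[1.0, 0.0, 0.5], q_values_at_n=np.array([0.2, 0.7])))
```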
### Training the network
Train the network only when the DND has enough entries for querying.
@@ -12,7 +12,7 @@
</p>
## Algorithmic Description
## Algorithm Description
### Training the network
1. Sample a batch of transitions from the replay buffer.
@@ -24,7 +24,7 @@
$$ y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t )) $$
5. For _persistent advantage learning (PAL)_, the target network is also used in order to calculate the action gap for the next state:
$$ V(s_{t+1} )-Q(s_{t+1},a_{t+1}) $$
Where $ a_{t+1} $ is chosen by running the next states through the online network and choosing the action that has the highest predicted $ Q $ value. Finally, the targets will be defined as -
where $ a_{t+1} $ is chosen by running the next states through the online network and choosing the action that has the highest predicted $ Q $ value. Finally, the targets will be defined as -
$$ y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} )) $$
6. Train the online network using the current states as inputs, and with the aforementioned targets.
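Putting the two corrections together, the targets could be assembled roughly as follows (array names are illustrative; `alpha` is the advantage-learning coefficient):

```python
import numpy as np

batch = 32
ddqn_targets = np.random.randn(batch)   # y_t^DDQN from the earlier steps
v_t = np.random.randn(batch)            # V(s_t) from the target network
q_t_a = np.random.randn(batch)          # Q(s_t, a_t) from the target network
v_next = np.random.randn(batch)         # V(s_{t+1}) from the target network
q_next_a = np.random.randn(batch)       # Q(s_{t+1}, a_{t+1}), a_{t+1} chosen by the online network
alpha = 0.9                             # assumed coefficient

# Advantage learning (AL): penalize by the action gap of the current state.
al_targets = ddqn_targets - alpha * (v_t - q_t_a)

# Persistent advantage learning (PAL): use the smaller of the two action gaps.
pal_targets = ddqn_targets - alpha * np.minimum(v_t - q_t_a, v_next - q_next_a)
```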