moving the docs to github
docs_raw/docs/algorithms/policy_optimization/ac.md (new file, 27 lines)
@@ -0,0 +1,27 @@
# Actor-Critic

**Action space:** Discrete | Continuous

**References:** [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)
## Network Structure

<p style="text-align: center;">
<img src="../../design_imgs/ac.png" width=500>
</p>
## Algorithm Description
### Choosing an action - Discrete actions
The policy network is used to predict action probabilities. During training, the action is sampled from a categorical distribution parameterized by these probabilities. When testing, the action with the highest probability is chosen.
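As an illustration, here is a minimal sketch of this action-selection rule (not Coach's actual implementation), assuming the policy network already outputs a probability vector over the discrete actions:

```python
import numpy as np

def choose_action(action_probs, training=True):
    """Pick a discrete action from the policy's probability vector."""
    if training:
        # Sample from the categorical distribution defined by the policy.
        return int(np.random.choice(len(action_probs), p=action_probs))
    # At test time, act greedily with respect to the policy.
    return int(np.argmax(action_probs))
```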
### Training the network
A batch of $ T_{max} $ transitions is used, and the advantages are calculated over it.

Advantages can be calculated by either of the following methods (configured by the selected preset):
1. **A_VALUE** - Estimating the advantage directly:
   $$ A(s_t, a_t) = \underbrace{\sum_{i=t}^{t + k - 1} \gamma^{i-t} r_i + \gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t) $$
   where $k$ is $T_{max} - State\_Index$ for each state in the batch.
2. **GAE** - By following the [Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438) paper.
The advantages are then used to accumulate gradients according to

$$ L = -\mathbb{E} [\log (\pi) \cdot A] $$
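For illustration, a minimal numpy sketch of the **A_VALUE** advantage calculation and the resulting policy loss (not Coach's implementation; the `bootstrap_value` argument and the discount `gamma` are assumptions):

```python
import numpy as np

def a_value_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A_VALUE advantages for a batch of T_max sequential transitions.

    `values` holds V(s_t) for each state in the batch, and `bootstrap_value`
    is V(s_{t+k}) for the state after the last transition (0 if terminal).
    """
    t_max = len(rewards)
    advantages = np.zeros(t_max)
    ret = bootstrap_value
    for t in reversed(range(t_max)):
        ret = rewards[t] + gamma * ret      # k-step estimate of Q(s_t, a_t)
        advantages[t] = ret - values[t]     # A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
    return advantages

def policy_loss(log_probs, advantages):
    """L = -E[log(pi) * A], with the advantages treated as constants."""
    return -np.mean(log_probs * advantages)
```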
docs_raw/docs/algorithms/policy_optimization/cppo.md (new file, 28 lines)
@@ -0,0 +1,28 @@
# Clipped Proximal Policy Optimization

**Action space:** Discrete | Continuous

**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)
## Network Structure

<p style="text-align: center;">
<img src="../../design_imgs/ppo.png">
</p>
## Algorithm Description
### Choosing an action - Continuous actions

Same as in PPO.
### Training the network
Very similar to PPO, with several small (but greatly simplifying) changes:

1. Train the value and policy networks simultaneously by defining a single loss function, which is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.
2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network, as in PPO).

3. Value targets are also calculated from the GAE advantages: the $ V $ values predicted by the critic network are added to the GAE-based advantages to get a $ Q $ value for each action taken. Since the critic predicts a $ V $ value for each state, using these $ Q $ action-values as targets serves, on average, as a $ V $ state-value target.
4. Instead of adapting a penalizing KL-divergence coefficient as in PPO, the likelihood ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ is clipped to achieve a similar effect. This is done by defining the policy's loss function as the minimum between the standard surrogate loss and an $\epsilon$-clipped surrogate loss:

   $$ L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left(r_t(\theta) \cdot \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right] $$
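As an illustration, a minimal numpy sketch of this clipped surrogate loss (not Coach's implementation; the default `epsilon` value is an assumption):

```python
import numpy as np

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped PPO policy loss (to be minimized)."""
    ratio = np.exp(new_log_probs - old_log_probs)                       # r_t(theta)
    unclipped = ratio * advantages                                      # standard surrogate
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Maximizing E[min(unclipped, clipped)] == minimizing the negated mean.
    return -np.mean(np.minimum(unclipped, clipped))
```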
docs_raw/docs/algorithms/policy_optimization/ddpg.md (new file, 32 lines)
@@ -0,0 +1,32 @@
# Deep Deterministic Policy Gradient

**Action space:** Continuous

**References:** [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
## Network Structure

<p style="text-align: center;">
<img src="../../design_imgs/ddpg.png">
</p>
## Algorithm Description
### Choosing an action

Pass the current states through the actor network to get an action mean vector $ \mu $. During training, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process, to add exploration noise to the action. When testing, use the mean vector $ \mu $ as-is.
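For example, a minimal sketch of an Ornstein-Uhlenbeck exploration process added on top of the actor's output (not Coach's implementation; the `theta`, `sigma`, and `dt` values are assumed defaults):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise for continuous actions."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state

# During training: action = actor(state) + noise.sample()
# During testing:  action = actor(state)
```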
### Training the network

Start by sampling a batch of transitions from the experience replay buffer.
* To train the **critic network**, use the following targets:

  $$ y_t = r(s_t, a_t) + \gamma \cdot Q(s_{t+1}, \mu(s_{t+1})) $$

  First run the actor target network, using the next states as inputs, to get $ \mu(s_{t+1}) $. Next, run the critic target network on the next states and $ \mu(s_{t+1}) $, and use the output to calculate $ y_t $ according to the equation above. Then train the critic using the current states and actions as inputs and $ y_t $ as the targets.
* To train the **actor network**, use the following equation:

  $$ \nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t \sim \rho^\beta} \left[ \nabla_a Q(s,a)|_{s=s_t, a=\mu(s_t)} \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} \right] $$

  Use the actor's online network to get the action mean values, using the current states as inputs. Then, use the critic's online network to get the gradients of the critic output with respect to the action mean values, $ \nabla_a Q(s,a)|_{s=s_t, a=\mu(s_t)} $. Using the chain rule, calculate the gradients of the actor's weights given $ \nabla_a Q(s,a) $, and apply those gradients to the actor network (see the combined sketch below).
After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
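The two updates and the soft target update can be combined into a single training step. Below is a hedged PyTorch-style sketch (Coach itself is built on TensorFlow, so this is illustrative only); `actor`, `critic`, and their target copies are assumed to be modules taking a state, and a state-action pair, respectively:

```python
import torch
import torch.nn.functional as F

def ddpg_training_step(batch, actor, critic, target_actor, target_critic,
                       actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One DDPG update on a sampled batch of transitions (dict of tensors)."""
    s, a, r, s_next = batch["states"], batch["actions"], batch["rewards"], batch["next_states"]

    # Critic targets: y_t = r + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: the chain rule through the critic is handled by autodiff,
    # so maximizing Q(s, mu(s)) is written as minimizing its negation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the critic and actor target networks
    with torch.no_grad():
        for target, online in ((target_actor, actor), (target_critic, critic)):
            for tp, p in zip(target.parameters(), online.parameters()):
                tp.mul_(1.0 - tau).add_(tau * p)
```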
docs_raw/docs/algorithms/policy_optimization/pg.md (new file, 27 lines)
@@ -0,0 +1,27 @@
# Policy Gradient

**Action space:** Discrete | Continuous

**References:** [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
## Network Structure

<p style="text-align: center;">
<img src="../../design_imgs/pg.png">
</p>
## Algorithm Description
### Choosing an action - Discrete actions

Run the current states through the network and get a policy distribution over the actions. While training, sample from the policy distribution. When testing, take the action with the highest probability.
### Training the network

The policy head loss is defined as $ L = -\log(\pi) \cdot PolicyGradientRescaler $. The $ PolicyGradientRescaler $ is used to reduce the variance of the policy gradient, since noisy gradient updates can destabilize the policy's convergence. The rescaler is a configurable parameter, and there are a few options to choose from:
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - The return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode, normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations, which are calculated separately for each timestep, across different episodes.
Gradients are accumulated over a number of fully played episodes; accumulating over several episodes serves the same purpose of reducing the update variance. Once accumulated, the gradients are applied to the network.
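As an illustration, a minimal numpy sketch of the **Future Return** and **Future Return Normalized by Episode** rescalers (not Coach's implementation; `gamma` and `eps` are assumed values):

```python
import numpy as np

def future_returns(rewards, gamma=0.99):
    """Discounted return from each timestep until the end of the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize_by_episode(returns, eps=1e-8):
    """Normalize future returns by the episode's mean and standard deviation."""
    return (returns - returns.mean()) / (returns.std() + eps)

# Policy head loss for one episode (log_probs of the actions that were taken):
# loss = -np.mean(log_probs * normalize_by_episode(future_returns(rewards)))
```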
docs_raw/docs/algorithms/policy_optimization/ppo.md (new file, 24 lines)
@@ -0,0 +1,24 @@
# Proximal Policy Optimization

**Action space:** Discrete | Continuous

**References:** [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)
## Network Structure

<p style="text-align: center;">
<img src="../../design_imgs/ppo.png">
</p>
## Algorithm Description
### Choosing an action - Continuous actions

Run the observation through the policy network to get the mean and standard deviation vectors for this observation. During training, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values. When testing, just take the mean values predicted by the network.
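As a small illustration (not Coach's implementation), assuming the policy network already outputs the mean and standard deviation vectors:

```python
import numpy as np

def choose_continuous_action(mean, std, training=True):
    """Sample from the diagonal Gaussian policy while training; use the mean at test time."""
    if training:
        return np.random.normal(loc=mean, scale=std)
    return np.asarray(mean)
```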
### Training the network

1. Collect a large chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman et al., 2015).
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first-order optimizers, the L-BFGS optimizer runs on the entire dataset at once, without batching, and continues running until some low loss threshold is reached. To prevent overfitting to the current dataset, the value targets are updated in a soft manner, using an exponentially weighted moving average, based on the total discounted returns of each state in each episode.
4. Run several training iterations of the policy network, using the previously calculated advantages as targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before* starting the current set of training iterations) using a regularization term.
5. After training is done, the last sampled KL divergence value is compared with the *target KL divergence* value, in order to adapt the penalty coefficient used in the policy loss: if the KL divergence went too high, increase the penalty; if it went too low, reduce it; otherwise, leave it unchanged. A sketch of steps 2 and 5 is given after this list.
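For illustration, here is a minimal numpy sketch of steps 2 and 5 (not Coach's implementation): Generalized Advantage Estimation over a single episode, and a simple adaptive KL penalty rule in the spirit of the PPO paper. The `lam` value and the 1.5/2.0 adaptation thresholds are assumptions:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one episode.

    `values` has one more entry than `rewards`: the bootstrap value of the
    state following the last transition (0 if the episode terminated).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def adapt_kl_penalty(beta, kl, target_kl):
    """Adapt the KL penalty coefficient after the policy training iterations."""
    if kl > 1.5 * target_kl:      # policy moved too far -> penalize more
        return beta * 2.0
    if kl < target_kl / 1.5:      # policy barely moved -> penalize less
        return beta / 2.0
    return beta                   # otherwise leave it unchanged
```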