
Enabling Coach Documentation to be run even when environments are not installed (#326)

This commit is contained in:
anabwan
2019-05-27 10:46:07 +03:00
committed by Gal Leibovich
parent 2b7d536da4
commit 342b7184bc
157 changed files with 5167 additions and 7477 deletions

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Clipped Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Clipped Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Conditional Imitation Learning" href="../imitation/cil.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -233,17 +236,14 @@
<h3>Training the network<a class="headerlink" href="#training-the-network" title="Permalink to this headline"></a></h3>
<p>Very similar to PPO, with several small (but very simplifying) changes:</p>
<ol class="arabic">
<li><p class="first">Train both the value and policy networks, simultaneously, by defining a single loss function,
which is the sum of each of the networks loss functions. Then, back propagate gradients only once from this unified loss function.</p>
</li>
<li><p class="first">The unified networks optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).</p>
</li>
<li><p class="first">Value targets are now also calculated based on the GAE advantages.
<li><p>Train both the value and policy networks, simultaneously, by defining a single loss function,
which is the sum of each of the networks' loss functions. Then, backpropagate gradients only once from this unified loss function.</p></li>
<li><p>The unified networks' optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).</p></li>
<li><p>Value targets are now also calculated based on the GAE advantages.
In this method, the <span class="math notranslate nohighlight">\(V\)</span> values are predicted from the critic network, and then added to the GAE based advantages,
in order to get a <span class="math notranslate nohighlight">\(Q\)</span> value for each action. Now, since our critic network is predicting a <span class="math notranslate nohighlight">\(V\)</span> value for
each state, setting the <span class="math notranslate nohighlight">\(Q\)</span> calculated action-values as a target, will on average serve as a <span class="math notranslate nohighlight">\(V\)</span> state-value target.</p>
</li>
<li><p class="first">Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
each state, setting the <span class="math notranslate nohighlight">\(Q\)</span> calculated action-values as a target, will on average serve as a <span class="math notranslate nohighlight">\(V\)</span> state-value target.</p></li>
<li><p>Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
<span class="math notranslate nohighlight">\(r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}\)</span> is clipped, to achieve a similar effect.
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon
clipped surrogate loss:</p>
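
The clipped objective and the GAE-based value targets described in this section can be illustrated with a short, standalone sketch (NumPy only, so it runs even when no environments are installed; the function and variable names below are illustrative and are not part of the Coach API):

import numpy as np

def clipped_surrogate_loss(ratios, advantages, epsilon=0.2):
    # ratios: r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), one entry per transition
    # advantages: GAE estimates for the same transitions
    standard = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # minimizing the negated element-wise minimum maximizes the clipped surrogate objective
    return -np.mean(np.minimum(standard, clipped))

def value_targets_from_gae(v_predictions, gae_advantages):
    # change 3 above: the predicted V values plus the GAE advantages give per-action
    # Q estimates, which on average serve as V state-value targets
    return v_predictions + gae_advantages
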
@@ -253,46 +253,42 @@ clipped surrogate loss:</p>
<dl class="class">
<dt id="rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.clipped_ppo_agent.</code><code class="descname">ClippedPPOAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/clipped_ppo_agent.html#ClippedPPOAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
This represents how the critic will be used to update the actor. The critic value function is typically used
to rescale the gradients calculated by the actor. There are several ways of doing this, such as using the
advantage of the action, or the generalized advantage estimation (GAE) value.</li>
<li><strong>gae_lambda</strong> (float)
advantage of the action, or the generalized advantage estimation (GAE) value.</p></li>
<li><p><strong>gae_lambda</strong> (float)
The <span class="math notranslate nohighlight">\(\lambda\)</span> value is used within the GAE function in order to weight different bootstrap length
estimations. Typical values are in the range 0.9-1, and define an exponential decay over the different
n-step estimations.</li>
<li><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
n-step estimations.</p></li>
<li><p><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
If not None, the likelihood ratio between the current and new policy in the PPO loss function will be
clipped to the range [1-clip_likelihood_ratio_using_epsilon, 1+clip_likelihood_ratio_using_epsilon].
This is typically used in the Clipped PPO version of PPO, and should be set to None in regular PPO
implementations.</li>
<li><strong>value_targets_mix_fraction</strong> (float)
implementations.</p></li>
<li><p><strong>value_targets_mix_fraction</strong> (float)
The targets for the value network are an exponential weighted moving average which uses this mix fraction to
define how much of the new targets will be taken into account when calculating the loss.
This value should be set within the range (0,1], where 1 means that only the new targets will be taken into account.</li>
<li><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</li>
<li><strong>use_kl_regularization</strong> (bool)
This value should be set within the range (0,1], where 1 means that only the new targets will be taken into account.</p></li>
<li><p><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</p></li>
<li><p><strong>use_kl_regularization</strong> (bool)
If set to True, the loss function will be regularized using the KL divergence between the current and new
policy, to bound the change of the policy during the network update.</li>
<li><strong>beta_entropy</strong> (float)
policy, to bound the change of the policy during the network update.</p></li>
<li><p><strong>beta_entropy</strong> (float)
An entropy regularization term can be added to the loss function in order to control exploration. This term
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</li>
<li><strong>optimization_epochs</strong> (int)
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</p></li>
<li><p><strong>optimization_epochs</strong> (int)
For each training phase, the collected dataset will be used for multiple epochs, which are defined by the
optimization_epochs value.</li>
<li><strong>optimization_epochs</strong> (Schedule)
Can be used to define a schedule over the clipping of the likelihood ratio.</li>
optimization_epochs value.</p></li>
<li><p><strong>optimization_epochs</strong> (Schedule)
Can be used to define a schedule over the clipping of the likelihood ratio.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
</div>
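
As a usage note for readers browsing the documentation without a working installation, the documented fields can be overridden on an instance of the class. This is a minimal sketch that assumes ClippedPPOAlgorithmParameters takes no constructor arguments and exposes the parameters above as plain attributes, with illustrative values (check the linked [source] if the API differs):

from rl_coach.agents.clipped_ppo_agent import ClippedPPOAlgorithmParameters

# assumption: a no-argument constructor with the documented fields as attributes;
# the numeric values are illustrative, not recommended defaults
algorithm = ClippedPPOAlgorithmParameters()
algorithm.gae_lambda = 0.95                           # weight of the n-step estimations in GAE
algorithm.clip_likelihood_ratio_using_epsilon = 0.2   # epsilon used by the clipped surrogate loss
algorithm.beta_entropy = 0.01                         # weight of the entropy regularization term
algorithm.optimization_epochs = 10                    # epochs per collected dataset
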
@@ -310,7 +306,7 @@ Can be used to define a schedule over the clipping of the likelihood ratio.</li>
<a href="ddpg.html" class="btn btn-neutral float-right" title="Deep Deterministic Policy Gradient" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="../imitation/cil.html" class="btn btn-neutral" title="Conditional Imitation Learning" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="../imitation/cil.html" class="btn btn-neutral float-left" title="Conditional Imitation Learning" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -319,7 +315,7 @@ Can be used to define a schedule over the clipping of the likelihood ratio.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -336,27 +332,16 @@ Can be used to define a schedule over the clipping of the likelihood ratio.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>