mirror of https://github.com/gryf/coach.git synced 2025-12-18 19:50:17 +01:00

Enabling Coach Documentation to be run even when environments are not installed (#326)

anabwan
2019-05-27 10:46:07 +03:00
committed by Gal Leibovich
parent 2b7d536da4
commit 342b7184bc
157 changed files with 5167 additions and 7477 deletions


@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Policy Gradient" href="pg.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -234,66 +237,62 @@ When testing, just take the mean values predicted by the network.</p>
<div class="section" id="training-the-network">
<h3>Training the network<a class="headerlink" href="#training-the-network" title="Permalink to this headline"></a></h3>
<ol class="arabic simple">
<li>Collect a big chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).</li>
<li>Calculate the advantages for each transition, using the <em>Generalized Advantage Estimation</em> method (Schulman 2015).</li>
<li>Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
<li><p>Collect a big chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).</p></li>
<li><p>Calculate the advantages for each transition, using the <em>Generalized Advantage Estimation</em> method (Schulman 2015).</p></li>
<li><p>Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
the L-BFGS optimizer runs on the entire dataset at once, without batching.
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
discounted returns of each state in each episode.</li>
<li>Run several training iterations of the policy network. This is done by using the previously calculated advantages as
discounted returns of each state in each episode.</p></li>
<li><p>Run several training iterations of the policy network. This is done by using the previously calculated advantages as
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used <em>before</em>
starting to run the current set of training iterations) using a regularization term.</li>
<li>After training is done, the last sampled KL divergence value will be compared with the <em>target KL divergence</em> value,
starting to run the current set of training iterations) using a regularization term.</p></li>
<li><p>After training is done, the last sampled KL divergence value will be compared with the <em>target KL divergence</em> value,
in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high,
increase the penalty, if it went too low, reduce it. Otherwise, leave it unchanged.</li>
increase the penalty, if it went too low, reduce it. Otherwise, leave it unchanged.</p></li>
</ol>
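To make steps 2 and 3 above concrete, here is a minimal NumPy sketch of Generalized Advantage Estimation and of the soft (exponentially weighted) value-target update. The function names, the single-episode array layout and the mix_fraction argument are illustrative assumptions, not Coach's internal implementation.

import numpy as np

def gae_advantages(rewards, values, discount=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation for one episode (illustrative sketch).

    rewards: shape (T,)   -- r_0 .. r_{T-1}
    values:  shape (T+1,) -- V(s_0) .. V(s_T), with V(s_T) = 0 for a terminal state
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals, decayed by gamma * lambda
        gae = delta + discount * gae_lambda * gae
        advantages[t] = gae
    return advantages

def soft_value_targets(old_targets, discounted_returns, mix_fraction=0.1):
    # Soft (EWMA) target update from step 3; mix_fraction plays the role of
    # value_targets_mix_fraction, so mix_fraction = 1 keeps only the new targets.
    return (1.0 - mix_fraction) * old_targets + mix_fraction * discounted_returns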
<dl class="class">
<dt id="rl_coach.agents.ppo_agent.PPOAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.ppo_agent.</code><code class="descname">PPOAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/ppo_agent.html#PPOAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.ppo_agent.PPOAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
This represents how the critic will be used to update the actor. The critic value function is typically used
to rescale the gradients calculated by the actor. There are several ways to do this, such as using the
advantage of the action, or the generalized advantage estimation (GAE) value.</li>
<li><strong>gae_lambda</strong> (float)
advantage of the action, or the generalized advantage estimation (GAE) value.</p></li>
<li><p><strong>gae_lambda</strong> (float)
The <span class="math notranslate nohighlight">\(\lambda\)</span> value is used within the GAE function in order to weight different bootstrap length
estimations. Typical values are in the range 0.9-1, and define an exponential decay over the different
n-step estimations.</li>
<li><strong>target_kl_divergence</strong> (float)
n-step estimations.</p></li>
<li><p><strong>target_kl_divergence</strong> (float)
The target kl divergence between the current policy distribution and the new policy. PPO uses a heuristic to
bring the KL divergence to this value, by adding a penalty if the kl divergence is higher.</li>
<li><strong>initial_kl_coefficient</strong> (float)
bring the KL divergence to this value, by adding a penalty if the kl divergence is higher.</p></li>
<li><p><strong>initial_kl_coefficient</strong> (float)
The initial weight that will be given to the KL divergence between the current and the new policy in the
regularization factor.</li>
<li><strong>high_kl_penalty_coefficient</strong> (float)
The penalty that will be given for KL divergence values which are higher than what was defined as the target.</li>
<li><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
regularization factor.</p></li>
<li><p><strong>high_kl_penalty_coefficient</strong> (float)
The penalty that will be given for KL divergence values which are higher than what was defined as the target.</p></li>
<li><p><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
If not None, the likelihood ratio between the current and new policy in the PPO loss function will be
clipped to the range [1-clip_likelihood_ratio_using_epsilon, 1+clip_likelihood_ratio_using_epsilon].
This is typically used in the Clipped PPO version of PPO, and should be set to None in regular PPO
implementations.</li>
<li><strong>value_targets_mix_fraction</strong> (float)
implementations.</p></li>
<li><p><strong>value_targets_mix_fraction</strong> (float)
The targets for the value network are an exponential weighted moving average which uses this mix fraction to
define how much of the new targets will be taken into account when calculating the loss.
This value should be set in the range (0,1], where 1 means that only the new targets will be taken into account.</li>
<li><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</li>
<li><strong>use_kl_regularization</strong> (bool)
This value should be set in the range (0,1], where 1 means that only the new targets will be taken into account.</p></li>
<li><p><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</p></li>
<li><p><strong>use_kl_regularization</strong> (bool)
If set to True, the loss function will be regularized using the KL divergence between the current and new
policy, to bound the change of the policy during the network update.</li>
<li><strong>beta_entropy</strong> (float)
policy, to bound the change of the policy during the network update.</p></li>
<li><p><strong>beta_entropy</strong> (float)
An entropy regularization term can be added to the loss function in order to control exploration. This term
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</li>
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
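The loss and the penalty adaptation that these parameters control can be sketched as follows. This is an illustrative NumPy version rather than Coach's exact code: the helper names are made up, and the 1.5x / 2x adaptation thresholds follow the heuristic from the PPO paper, which the text above only describes qualitatively.

import numpy as np

def ppo_penalized_loss(new_log_probs, old_log_probs, advantages,
                       kl_divergence, kl_coefficient,
                       entropy=0.0, beta_entropy=0.0, clip_epsilon=None):
    """KL-penalized (optionally clipped) PPO surrogate loss, to be minimized."""
    # Likelihood ratio between the new and the old policy
    ratio = np.exp(new_log_probs - old_log_probs)
    if clip_epsilon is not None:
        # Clipped-PPO variant: clip_likelihood_ratio_using_epsilon bounds the
        # ratio to [1 - epsilon, 1 + epsilon] before taking the pessimistic minimum
        clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon)
        surrogate = np.minimum(ratio * advantages, clipped * advantages)
    else:
        surrogate = ratio * advantages
    # Maximize the surrogate, penalize KL from the old policy, encourage entropy
    return -(surrogate.mean() - kl_coefficient * kl_divergence + beta_entropy * entropy)

def adapt_kl_coefficient(kl_coefficient, sampled_kl, target_kl):
    # Step 4 of the training procedure: if the sampled KL overshoots the target,
    # increase the penalty; if it undershoots, decrease it; otherwise leave it.
    if sampled_kl > 1.5 * target_kl:
        kl_coefficient *= 2.0
    elif sampled_kl < target_kl / 1.5:
        kl_coefficient /= 2.0
    return kl_coefficient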
</div>
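As a usage sketch, a preset would typically instantiate the parameters class documented above and override a few fields. The values below are arbitrary examples, not recommended settings.

from rl_coach.agents.ppo_agent import PPOAlgorithmParameters

# Arbitrary example values -- tune per environment.
algorithm = PPOAlgorithmParameters()
algorithm.gae_lambda = 0.96                           # weighting of the n-step GAE estimates
algorithm.target_kl_divergence = 0.01                 # KL value the penalty adaptation aims for
algorithm.initial_kl_coefficient = 1.0                # starting weight of the KL penalty
algorithm.clip_likelihood_ratio_using_epsilon = None  # None = plain (non-clipped) PPO
algorithm.beta_entropy = 0.01                         # entropy bonus weight

In a complete preset this object is normally carried inside the agent's parameters rather than used standalone; that wiring is outside the scope of this page.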
@@ -311,7 +310,7 @@ is weighted using the <span class="math notranslate nohighlight">\(eta\)</span>
<a href="../value_optimization/rainbow.html" class="btn btn-neutral float-right" title="Rainbow" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="pg.html" class="btn btn-neutral" title="Policy Gradient" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="pg.html" class="btn btn-neutral float-left" title="Policy Gradient" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -320,7 +319,7 @@ is weighted using the <span class="math notranslate nohighlight">\(eta\)</span>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -337,27 +336,16 @@ is weighted using the <span class="math notranslate nohighlight">\(eta\)</span>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>