mirror of https://github.com/gryf/coach.git synced 2025-12-18 19:50:17 +01:00

Enabling Coach Documentation to be run even when environments are not installed (#326)

This commit is contained in:
anabwan
2019-05-27 10:46:07 +03:00
committed by Gal Leibovich
parent 2b7d536da4
commit 342b7184bc
157 changed files with 5167 additions and 7477 deletions


@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>ACER &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>ACER &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Actor-Critic" href="ac.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -236,11 +239,11 @@ distribution assigned with these probabilities. When testing, the action with th
and <span class="math notranslate nohighlight">\(n\)</span> (replay ratio) off-policy updates from batches of <span class="math notranslate nohighlight">\(T_{max}\)</span> transitions sampled from the replay buffer.</p>
<p>Each update performs the following procedure:</p>
<ol class="arabic">
<li><p class="first"><strong>Calculate state values:</strong></p>
<li><p><strong>Calculate state values:</strong></p>
<div class="math notranslate nohighlight">
\[V(s_t) = \mathbb{E}_{a \sim \pi} [Q(s_t,a)]\]</div>
</li>
<li><p class="first"><strong>Calculate Q retrace:</strong></p>
<li><p><strong>Calculate Q retrace:</strong></p>
<blockquote>
<div><div class="math notranslate nohighlight">
\[Q^{ret}(s_t,a_t) = r_t +\gamma \bar{\rho}_{t+1}[Q^{ret}(s_{t+1},a_{t+1}) - Q(s_{t+1},a_{t+1})] + \gamma V(s_{t+1})\]</div>
@@ -248,7 +251,7 @@ and <span class="math notranslate nohighlight">\(n\)</span> (replay ratio) off-p
\[\text{where} \quad \bar{\rho}_{t} = \min{\left\{c,\rho_t\right\}},\quad \rho_t=\frac{\pi (a_t \mid s_t)}{\mu (a_t \mid s_t)}\]</div>
</div></blockquote>
</li>
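A minimal NumPy sketch of the Q-retrace recursion documented above, for a single trajectory that ends in a terminal state. This is illustrative only, not Coach's implementation; the array names, shapes, and the zero bootstrap at the terminal state are assumptions.

import numpy as np

def q_retrace_targets(rewards, q_sa, v_s, rho, gamma=0.99, c=1.0):
    # rewards[t] = r_t
    # q_sa[t]    = Q(s_t, a_t), the current critic estimate
    # v_s[t]     = V(s_t) = E_{a ~ pi(.|s_t)}[Q(s_t, a)]
    # rho[t]     = pi(a_t | s_t) / mu(a_t | s_t), the importance weight
    T = len(rewards)
    q_ret = np.zeros(T)
    # Quantities one step past the end are zero because s_T is terminal.
    q_ret_next, q_next, v_next, rho_next = 0.0, 0.0, 0.0, 1.0
    for t in reversed(range(T)):
        rho_bar_next = min(c, rho_next)  # truncated importance weight
        q_ret[t] = rewards[t] + gamma * rho_bar_next * (q_ret_next - q_next) + gamma * v_next
        # Shift the t+1 quantities before moving one step back.
        q_ret_next, q_next, v_next, rho_next = q_ret[t], q_sa[t], v_s[t], rho[t]
    return q_ret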
<li><p class="first"><strong>Accumulate gradients:</strong></p>
<li><p><strong>Accumulate gradients:</strong></p>
<blockquote>
<div><p><span class="math notranslate nohighlight">\(\bullet\)</span> <strong>Policy gradients (with bias correction):</strong></p>
<blockquote>
@@ -263,7 +266,7 @@ and <span class="math notranslate nohighlight">\(n\)</span> (replay ratio) off-p
</div></blockquote>
</div></blockquote>
</li>
<li><p class="first"><strong>(Optional) Trust region update:</strong> change the policy loss gradient w.r.t network output:</p>
<li><p><strong>(Optional) Trust region update:</strong> change the policy loss gradient w.r.t network output:</p>
<blockquote>
<div><div class="math notranslate nohighlight">
\[\hat{g}_t^{trust-region} = \hat{g}_t^{policy} - \max \left\{0, \frac{k^T \hat{g}_t^{policy} - \delta}{\lVert k \rVert_2^2}\right\} k\]</div>
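The trust-region correction above amounts to a one-line projection of the policy gradient. A small NumPy sketch follows, assuming k is the gradient of the KL divergence between the average policy and the current one w.r.t. the network output (as described in the parameter list further down) and delta is max_KL_divergence; the small epsilon in the denominator is an added numerical safeguard, not part of the formula.

import numpy as np

def trust_region_gradient(g_policy, k, delta):
    # g_policy: policy-loss gradient w.r.t. the policy network output (flattened vector)
    # k:        gradient of KL(average policy || current policy) w.r.t. the same output
    # delta:    the max_KL_divergence bound
    scale = max(0.0, (np.dot(k, g_policy) - delta) / (np.dot(k, k) + 1e-8))
    return g_policy - scale * k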
@@ -277,39 +280,35 @@ The goal of the trust region update is to bound the difference between the updated pol
<dl class="class">
<dt id="rl_coach.agents.acer_agent.ACERAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.acer_agent.</code><code class="descname">ACERAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/acer_agent.html#ACERAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.acer_agent.ACERAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>num_steps_between_gradient_updates</strong> (int)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>num_steps_between_gradient_updates</strong> (int)
Every num_steps_between_gradient_updates transitions will be considered as a single batch and used for
accumulating gradients. This is also the number of steps used for bootstrapping according to the n-step formulation.</li>
<li><strong>ratio_of_replay</strong> (int)
The number of off-policy training iterations in each ACER iteration.</li>
<li><strong>num_transitions_to_start_replay</strong> (int)
accumulating gradients. This is also the number of steps used for bootstrapping according to the n-step formulation.</p></li>
<li><p><strong>ratio_of_replay</strong> (int)
The number of off-policy training iterations in each ACER iteration.</p></li>
<li><p><strong>num_transitions_to_start_replay</strong> (int)
Number of environment steps until ACER starts to train off-policy from the experience replay.
This emulates a heat-up phase where the agent learns only on-policy until there are enough transitions in
the experience replay to start the off-policy training.</li>
<li><strong>rate_for_copying_weights_to_target</strong> (float)
the experience replay to start the off-policy training.</p></li>
<li><p><strong>rate_for_copying_weights_to_target</strong> (float)
The rate of the exponential moving average for the average policy which is used for the trust region optimization.
The target network in this algorithm is used as the average policy.</li>
<li><strong>importance_weight_truncation</strong> (float)
The clipping constant for the importance weight truncation (not used in the Q-retrace calculation).</li>
<li><strong>use_trust_region_optimization</strong> (bool)
The target network in this algorithm is used as the average policy.</p></li>
<li><p><strong>importance_weight_truncation</strong> (float)
The clipping constant for the importance weight truncation (not used in the Q-retrace calculation).</p></li>
<li><p><strong>use_trust_region_optimization</strong> (bool)
If set to True, the gradients of the network will be modified with a term dependent on the KL divergence between
the average policy and the current one, to bound the change of the policy during the network update.</li>
<li><strong>max_KL_divergence</strong> (float)
the average policy and the current one, to bound the change of the policy during the network update.</p></li>
<li><p><strong>max_KL_divergence</strong> (float)
The upper bound parameter for the trust region optimization; use_trust_region_optimization needs to be set to True
for this parameter to have an effect.</li>
<li><strong>beta_entropy</strong> (float)
for this parameter to have an effect.</p></li>
<li><p><strong>beta_entropy</strong> (float)
An entropy regularization term can be added to the loss function in order to control exploration. This term
is weighted using the beta value defined by beta_entropy.</li>
is weighted using the beta value defined by beta_entropy.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
</div>
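As a usage note, the parameters documented above could be set along the following lines. The import path and attribute names are taken from the class documentation in this diff; the numeric values are illustrative rather than defaults, and how the object is wired into a Coach preset is not shown here.

from rl_coach.agents.acer_agent import ACERAlgorithmParameters

algo_params = ACERAlgorithmParameters()
algo_params.num_steps_between_gradient_updates = 20    # n-step batch / bootstrap length
algo_params.ratio_of_replay = 4                        # off-policy updates per ACER iteration
algo_params.num_transitions_to_start_replay = 10000    # heat-up before off-policy training starts
algo_params.rate_for_copying_weights_to_target = 0.99  # EMA rate for the average (target) policy
algo_params.importance_weight_truncation = 10.0        # clipping constant for importance weights
algo_params.use_trust_region_optimization = True
algo_params.max_KL_divergence = 1.0                    # the delta bound in the trust-region update
algo_params.beta_entropy = 0.01                        # weight of the entropy regularization term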
@@ -327,7 +326,7 @@ is weighted using the beta value defined by beta_entropy.</li>
<a href="../imitation/bc.html" class="btn btn-neutral float-right" title="Behavioral Cloning" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="ac.html" class="btn btn-neutral" title="Actor-Critic" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="ac.html" class="btn btn-neutral float-left" title="Actor-Critic" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -336,7 +335,7 @@ is weighted using the beta value defined by beta_entropy.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -353,27 +352,16 @@ is weighted using the beta value defined by beta_entropy.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>