mirror of https://github.com/gryf/coach.git synced 2025-12-17 19:20:19 +01:00

Enabling Coach Documentation to be run even when environments are not installed (#326)

anabwan
2019-05-27 10:46:07 +03:00
committed by Gal Leibovich
parent 2b7d536da4
commit 342b7184bc
157 changed files with 5167 additions and 7477 deletions

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Actor-Critic &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Actor-Critic &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Agents" href="../index.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -235,41 +238,37 @@ distribution assigned with these probabilities. When testing, the action with th
<p>A batch of <span class="math notranslate nohighlight">\(T_{max}\)</span> transitions is used, and the advantages are calculated upon it.</p>
<p>Advantages can be calculated by either of the following methods (configured by the selected preset) -</p>
<ol class="arabic simple">
<li><strong>A_VALUE</strong> - Estimating advantage directly:
<li><p><strong>A_VALUE</strong> - Estimating advantage directly:
<span class="math notranslate nohighlight">\(A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)\)</span>
where <span class="math notranslate nohighlight">\(k\)</span> is <span class="math notranslate nohighlight">\(T_{max} - State\_Index\)</span> for each state in the batch.</li>
<li><strong>GAE</strong> - By following the <a class="reference external" href="https://arxiv.org/abs/1506.02438">Generalized Advantage Estimation</a> paper.</li>
where <span class="math notranslate nohighlight">\(k\)</span> is <span class="math notranslate nohighlight">\(T_{max} - State\_Index\)</span> for each state in the batch.</p></li>
<li><p><strong>GAE</strong> - By following the <a class="reference external" href="https://arxiv.org/abs/1506.02438">Generalized Advantage Estimation</a> paper.</p></li>
</ol>
<p>The advantages are then used in order to accumulate gradients according to
<span class="math notranslate nohighlight">\(L = -\mathop{\mathbb{E}} [log (\pi) \cdot A]\)</span></p>
<dl class="class">
<dt id="rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.actor_critic_agent.</code><code class="descname">ActorCriticAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/actor_critic_agent.html#ActorCriticAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
The value that will be used to rescale the policy gradient</li>
<li><strong>apply_gradients_every_x_episodes</strong> (int)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
The value that will be used to rescale the policy gradient</p></li>
<li><p><strong>apply_gradients_every_x_episodes</strong> (int)
The number of episodes to wait before applying the accumulated gradients to the network.
The training iterations only accumulate gradients without actually applying them.</li>
<li><strong>beta_entropy</strong> (float)
The weight that will be given to the entropy regularization which is used in order to improve exploration.</li>
<li><strong>num_steps_between_gradient_updates</strong> (int)
The training iterations only accumulate gradients without actually applying them.</p></li>
<li><p><strong>beta_entropy</strong> (float)
The weight that will be given to the entropy regularization which is used in order to improve exploration.</p></li>
<li><p><strong>num_steps_between_gradient_updates</strong> (int)
Every num_steps_between_gradient_updates transitions will be considered as a single batch and used for
accumulating gradients. This is also the number of steps used for bootstrapping according to the n-step formulation.</li>
<li><strong>gae_lambda</strong> (float)
accumulating gradients. This is also the number of steps used for bootstrapping according to the n-step formulation.</p></li>
<li><p><strong>gae_lambda</strong> (float)
If the policy gradient rescaler was defined as PolicyGradientRescaler.GAE, the generalized advantage estimation
scheme will be used, in which case the lambda value controls the decay for the different n-step lengths.</li>
<li><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value targets for the V head will be estimated using the GAE scheme.</li>
scheme will be used, in which case the lambda value controls the decay for the different n-step lengths.</p></li>
<li><p><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value targets for the V head will be estimated using the GAE scheme.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
</div>
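The A_VALUE advantage calculation described above can be illustrated with a short standalone NumPy sketch (an illustrative example with assumed inputs, not the rl_coach implementation): it bootstraps from the value of the state that follows the last transition in the T_max batch and subtracts the baseline V(s_t).

import numpy as np

def n_step_advantages(rewards, state_values, bootstrap_value, gamma=0.99):
    # A(s_t, a_t) = sum_{i=t}^{T-1} gamma^(i-t) * r_i + gamma^(T-t) * V(s_T) - V(s_t),
    # where T is the end of the T_max batch (so k = T_max - state index).
    T = len(rewards)
    returns = np.zeros(T)
    running = bootstrap_value  # V(s_{t+k}) predicted for the state after the batch
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(state_values)

# Toy usage: 4 transitions; the resulting advantages rescale the policy gradient, L = -E[log(pi) * A].
advantages = n_step_advantages([1.0, 0.0, 0.5, 1.0], [0.9, 0.8, 0.7, 0.6], bootstrap_value=0.5)
print(advantages)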
@@ -287,7 +286,7 @@ If set to True, the state value targets for the V head will be estimated using t
<a href="acer.html" class="btn btn-neutral float-right" title="ACER" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="../index.html" class="btn btn-neutral" title="Agents" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="../index.html" class="btn btn-neutral float-left" title="Agents" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -296,7 +295,7 @@ If set to True, the state value targets for the V head will be estimated using t
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -313,27 +312,16 @@ If set to True, the state value targets for the V head will be estimated using t
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>ACER &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>ACER &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Actor-Critic" href="ac.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -236,11 +239,11 @@ distribution assigned with these probabilities. When testing, the action with th
and <span class="math notranslate nohighlight">\(n\)</span> (replay ratio) off-policy updates from batches of <span class="math notranslate nohighlight">\(T_{max}\)</span> transitions sampled from the replay buffer.</p>
<p>Each update performs the following procedure:</p>
<ol class="arabic">
<li><p class="first"><strong>Calculate state values:</strong></p>
<li><p><strong>Calculate state values:</strong></p>
<div class="math notranslate nohighlight">
\[V(s_t) = \mathbb{E}_{a \sim \pi} [Q(s_t,a)]\]</div>
</li>
<li><p class="first"><strong>Calculate Q retrace:</strong></p>
<li><p><strong>Calculate Q retrace:</strong></p>
<blockquote>
<div><div class="math notranslate nohighlight">
\[Q^{ret}(s_t,a_t) = r_t +\gamma \bar{\rho}_{t+1}[Q^{ret}(s_{t+1},a_{t+1}) - Q(s_{t+1},a_{t+1})] + \gamma V(s_{t+1})\]</div>
@@ -248,7 +251,7 @@ and <span class="math notranslate nohighlight">\(n\)</span> (replay ratio) off-p
\[\text{where} \quad \bar{\rho}_{t} = \min{\left\{c,\rho_t\right\}},\quad \rho_t=\frac{\pi (a_t \mid s_t)}{\mu (a_t \mid s_t)}\]</div>
</div></blockquote>
</li>
<li><p class="first"><strong>Accumulate gradients:</strong></p>
<li><p><strong>Accumulate gradients:</strong></p>
<blockquote>
<div><p><span class="math notranslate nohighlight">\(\bullet\)</span> <strong>Policy gradients (with bias correction):</strong></p>
<blockquote>
@@ -263,7 +266,7 @@ and <span class="math notranslate nohighlight">\(n\)</span> (replay ratio) off-p
</div></blockquote>
</div></blockquote>
</li>
<li><p class="first"><strong>(Optional) Trust region update:</strong> change the policy loss gradient w.r.t network output:</p>
<li><p><strong>(Optional) Trust region update:</strong> change the policy loss gradient w.r.t network output:</p>
<blockquote>
<div><div class="math notranslate nohighlight">
\[\hat{g}_t^{trust-region} = \hat{g}_t^{policy} - \max \left\{0, \frac{k^T \hat{g}_t^{policy} - \delta}{\lVert k \rVert_2^2}\right\} k\]</div>
@@ -277,39 +280,35 @@ The goal of the trust region update is to the difference between the updated pol
<dl class="class">
<dt id="rl_coach.agents.acer_agent.ACERAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.acer_agent.</code><code class="descname">ACERAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/acer_agent.html#ACERAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.acer_agent.ACERAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>num_steps_between_gradient_updates</strong> (int)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>num_steps_between_gradient_updates</strong> (int)
Every num_steps_between_gradient_updates transitions will be considered as a single batch and used for
accumulating gradients. This is also the number of steps used for bootstrapping according to the n-step formulation.</li>
<li><strong>ratio_of_replay</strong> (int)
The number of off-policy training iterations in each ACER iteration.</li>
<li><strong>num_transitions_to_start_replay</strong> (int)
accumulating gradients. This is also the number of steps used for bootstrapping according to the n-step formulation.</p></li>
<li><p><strong>ratio_of_replay</strong> (int)
The number of off-policy training iterations in each ACER iteration.</p></li>
<li><p><strong>num_transitions_to_start_replay</strong> (int)
Number of environment steps until ACER starts to train off-policy from the experience replay.
This emulates a heat-up phase where the agent learns only on-policy until there are enough transitions in
the experience replay to start the off-policy training.</li>
<li><strong>rate_for_copying_weights_to_target</strong> (float)
the experience replay to start the off-policy training.</p></li>
<li><p><strong>rate_for_copying_weights_to_target</strong> (float)
The rate of the exponential moving average for the average policy which is used for the trust region optimization.
The target network in this algorithm is used as the average policy.</li>
<li><strong>importance_weight_truncation</strong> (float)
The clipping constant for the importance weight truncation (not used in the Q-retrace calculation).</li>
<li><strong>use_trust_region_optimization</strong> (bool)
The target network in this algorithm is used as the average policy.</p></li>
<li><p><strong>importance_weight_truncation</strong> (float)
The clipping constant for the importance weight truncation (not used in the Q-retrace calculation).</p></li>
<li><p><strong>use_trust_region_optimization</strong> (bool)
If set to True, the gradients of the network will be modified with a term dependent on the KL divergence between
the average policy and the current one, to bound the change of the policy during the network update.</li>
<li><strong>max_KL_divergence</strong> (float)
the average policy and the current one, to bound the change of the policy during the network update.</p></li>
<li><p><strong>max_KL_divergence</strong> (float)
The upper bound parameter for the trust region optimization, use_trust_region_optimization needs to be set true
for this parameter to have an effect.</li>
<li><strong>beta_entropy</strong> (float)
for this parameter to have an effect.</p></li>
<li><p><strong>beta_entropy</strong> (float)
An entropy regularization term can be added to the loss function in order to control exploration. This term
is weighted using the beta value defined by beta_entropy.</li>
is weighted using the beta value defined by beta_entropy.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
</div>
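The Q-retrace recursion from step 2 above can be sketched as a standalone backward pass over a batch of T_max transitions (a hedged illustration with assumed argument names, not the rl_coach code): rho holds the per-step importance ratios pi/mu, q_taken the Q values of the taken actions, and values the state values V(s_t) = E_{a~pi}[Q(s_t, a)].

import numpy as np

def q_retrace(rewards, q_taken, values, rho, gamma=0.99, c=1.0, bootstrap_value=0.0):
    # Q_ret(t) = r_t + gamma * min(c, rho_{t+1}) * (Q_ret(t+1) - Q(t+1)) + gamma * V(t+1),
    # computed backwards; beyond the batch we bootstrap with V(s_T) (0 for a terminal state).
    T = len(rewards)
    q_ret = np.zeros(T)
    next_q_ret = next_q = next_v = bootstrap_value
    next_rho_bar = 1.0  # irrelevant for the last step, since (Q_ret - Q) is zero there
    for t in reversed(range(T)):
        q_ret[t] = rewards[t] + gamma * next_rho_bar * (next_q_ret - next_q) + gamma * next_v
        next_q_ret, next_q, next_v = q_ret[t], q_taken[t], values[t]
        next_rho_bar = min(c, rho[t])
    return q_ret

# Toy usage with made-up numbers:
print(q_retrace([1.0, 0.0, 0.5], [0.8, 0.7, 0.9], [0.75, 0.65, 0.85], [1.2, 0.4, 2.0], bootstrap_value=0.6))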
@@ -327,7 +326,7 @@ is weighted using the beta value defined by beta_entropy.</li>
<a href="../imitation/bc.html" class="btn btn-neutral float-right" title="Behavioral Cloning" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="ac.html" class="btn btn-neutral" title="Actor-Critic" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="ac.html" class="btn btn-neutral float-left" title="Actor-Critic" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -336,7 +335,7 @@ is weighted using the beta value defined by beta_entropy.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -353,27 +352,16 @@ is weighted using the beta value defined by beta_entropy.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Clipped Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Clipped Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Conditional Imitation Learning" href="../imitation/cil.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -233,17 +236,14 @@
<h3>Training the network<a class="headerlink" href="#training-the-network" title="Permalink to this headline"></a></h3>
<p>Very similar to PPO, with several small (but very simplifying) changes:</p>
<ol class="arabic">
<li><p class="first">Train both the value and policy networks, simultaneously, by defining a single loss function,
which is the sum of each of the networks' loss functions. Then, back propagate gradients only once from this unified loss function.</p>
</li>
<li><p class="first">The unified networks' optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).</p>
</li>
<li><p class="first">Value targets are now also calculated based on the GAE advantages.
<li><p>Train both the value and policy networks, simultaneously, by defining a single loss function,
which is the sum of each of the networks' loss functions. Then, back propagate gradients only once from this unified loss function.</p></li>
<li><p>The unified networks' optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).</p></li>
<li><p>Value targets are now also calculated based on the GAE advantages.
In this method, the <span class="math notranslate nohighlight">\(V\)</span> values are predicted from the critic network, and then added to the GAE based advantages,
in order to get a <span class="math notranslate nohighlight">\(Q\)</span> value for each action. Now, since our critic network is predicting a <span class="math notranslate nohighlight">\(V\)</span> value for
each state, setting the <span class="math notranslate nohighlight">\(Q\)</span> calculated action-values as a target, will on average serve as a <span class="math notranslate nohighlight">\(V\)</span> state-value target.</p>
</li>
<li><p class="first">Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
each state, setting the <span class="math notranslate nohighlight">\(Q\)</span> calculated action-values as a target, will on average serve as a <span class="math notranslate nohighlight">\(V\)</span> state-value target.</p></li>
<li><p>Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
<span class="math notranslate nohighlight">\(r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}\)</span> is clipped, to achieve a similar effect.
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon
clipped surrogate loss:</p>
@@ -253,46 +253,42 @@ clipped surrogate loss:</p>
<dl class="class">
<dt id="rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.clipped_ppo_agent.</code><code class="descname">ClippedPPOAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/clipped_ppo_agent.html#ClippedPPOAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
This represents how the critic will be used to update the actor. The critic value function is typically used
to rescale the gradients calculated by the actor. There are several ways for doing this, such as using the
advantage of the action, or the generalized advantage estimation (GAE) value.</li>
<li><strong>gae_lambda</strong> (float)
advantage of the action, or the generalized advantage estimation (GAE) value.</p></li>
<li><p><strong>gae_lambda</strong> (float)
The <span class="math notranslate nohighlight">\(\lambda\)</span> value is used within the GAE function in order to weight different bootstrap length
estimations. Typical values are in the range 0.9-1, and define an exponential decay over the different
n-step estimations.</li>
<li><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
n-step estimations.</p></li>
<li><p><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
If not None, the likelihood ratio between the current and new policy in the PPO loss function will be
clipped to the range [1-clip_likelihood_ratio_using_epsilon, 1+clip_likelihood_ratio_using_epsilon].
This is typically used in the Clipped PPO version of PPO, and should be set to None in regular PPO
implementations.</li>
<li><strong>value_targets_mix_fraction</strong> (float)
implementations.</p></li>
<li><p><strong>value_targets_mix_fraction</strong> (float)
The targets for the value network are an exponential weighted moving average which uses this mix fraction to
define how much of the new targets will be taken into account when calculating the loss.
This value should be set to the range (0,1], where 1 means that only the new targets will be taken into account.</li>
<li><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</li>
<li><strong>use_kl_regularization</strong> (bool)
This value should be set to the range (0,1], where 1 means that only the new targets will be taken into account.</p></li>
<li><p><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</p></li>
<li><p><strong>use_kl_regularization</strong> (bool)
If set to True, the loss function will be regularized using the KL divergence between the current and new
policy, to bound the change of the policy during the network update.</li>
<li><strong>beta_entropy</strong> (float)
policy, to bound the change of the policy during the network update.</p></li>
<li><p><strong>beta_entropy</strong> (float)
An entropy regularization term can be added to the loss function in order to control exploration. This term
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</li>
<li><strong>optimization_epochs</strong> (int)
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</p></li>
<li><p><strong>optimization_epochs</strong> (int)
For each training phase, the collected dataset will be used for multiple epochs, which are defined by the
optimization_epochs value.</li>
<li><strong>optimization_epochs</strong> (Schedule)
Can be used to define a schedule over the clipping of the likelihood ratio.</li>
optimization_epochs value.</p></li>
<li><p><strong>optimization_epochs</strong> (Schedule)
Can be used to define a schedule over the clipping of the likelihood ratio.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
</div>
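The clipped-likelihood-ratio loss from step 4 above, with clip_likelihood_ratio_using_epsilon playing the role of epsilon, can be written as a minimal standalone sketch (illustrative only; the function and argument names are assumptions, not Coach's API):

import numpy as np

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, clip_epsilon=0.2):
    # r = pi_new(a|s) / pi_old(a|s); L = -mean(min(r * A, clip(r, 1 - eps, 1 + eps) * A)).
    ratio = np.exp(np.asarray(log_probs_new) - np.asarray(log_probs_old))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * adv
    # An entropy bonus weighted by beta_entropy would typically be subtracted from this loss.
    return -np.mean(np.minimum(unclipped, clipped))

print(clipped_surrogate_loss([-0.9, -1.2], [-1.0, -1.0], [0.5, -0.3]))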
@@ -310,7 +306,7 @@ Can be used to define a schedule over the clipping of the likelihood ratio.</li>
<a href="ddpg.html" class="btn btn-neutral float-right" title="Deep Deterministic Policy Gradient" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="../imitation/cil.html" class="btn btn-neutral" title="Conditional Imitation Learning" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="../imitation/cil.html" class="btn btn-neutral float-left" title="Conditional Imitation Learning" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -319,7 +315,7 @@ Can be used to define a schedule over the clipping of the likelihood ratio.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -336,27 +332,16 @@ Can be used to define a schedule over the clipping of the likelihood ratio.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Deep Deterministic Policy Gradient &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Deep Deterministic Policy Gradient &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Clipped Proximal Policy Optimization" href="cppo.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -235,14 +238,14 @@ to add exploration noise to the action. When testing, use the mean vector <span
<h3>Training the network<a class="headerlink" href="#training-the-network" title="Permalink to this headline"></a></h3>
<p>Start by sampling a batch of transitions from the experience replay.</p>
<ul>
<li><p class="first">To train the <strong>critic network</strong>, use the following targets:</p>
<li><p>To train the <strong>critic network</strong>, use the following targets:</p>
<p><span class="math notranslate nohighlight">\(y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))\)</span></p>
<p>First run the actor target network, using the next states as the inputs, and get <span class="math notranslate nohighlight">\(\mu (s_{t+1} )\)</span>.
Next, run the critic target network using the next states and <span class="math notranslate nohighlight">\(\mu (s_{t+1} )\)</span>, and use the output to
calculate <span class="math notranslate nohighlight">\(y_t\)</span> according to the equation above. To train the network, use the current states and actions
as the inputs, and <span class="math notranslate nohighlight">\(y_t\)</span> as the targets.</p>
</li>
<li><p class="first">To train the <strong>actor network</strong>, use the following equation:</p>
<li><p>To train the <strong>actor network</strong>, use the following equation:</p>
<p><span class="math notranslate nohighlight">\(\nabla_{\theta^\mu } J \approx E_{s_t \tilde{} \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ]\)</span></p>
<p>Use the actor's online network to get the action mean values using the current states as the inputs.
Then, use the critic online network in order to get the gradients of the critic output with respect to the
@@ -255,35 +258,31 @@ given <span class="math notranslate nohighlight">\(\nabla_a Q(s,a)\)</span>. Fin
<dl class="class">
<dt id="rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.ddpg_agent.</code><code class="descname">DDPGAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/ddpg_agent.html#DDPGAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>num_steps_between_copying_online_weights_to_target</strong> (StepMethod)
The number of steps between copying the online network weights to the target network weights.</li>
<li><strong>rate_for_copying_weights_to_target</strong> (float)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>num_steps_between_copying_online_weights_to_target</strong> (StepMethod)
The number of steps between copying the online network weights to the target network weights.</p></li>
<li><p><strong>rate_for_copying_weights_to_target</strong> (float)
When copying the online network weights to the target network weights, a soft update will be used, which
weights the new online network weights by rate_for_copying_weights_to_target</li>
<li><strong>num_consecutive_playing_steps</strong> (StepMethod)
The number of consecutive steps to act between every two training iterations</li>
<li><strong>use_target_network_for_evaluation</strong> (bool)
weights the new online network weights by rate_for_copying_weights_to_target</p></li>
<li><p><strong>num_consecutive_playing_steps</strong> (StepMethod)
The number of consecutive steps to act between every two training iterations</p></li>
<li><p><strong>use_target_network_for_evaluation</strong> (bool)
If set to True, the target network will be used for predicting the actions when choosing actions to act.
Since the target network weights change more slowly, the predicted actions will be more consistent.</li>
<li><strong>action_penalty</strong> (float)
Since the target network weights change more slowly, the predicted actions will be more consistent.</p></li>
<li><p><strong>action_penalty</strong> (float)
The amount by which to penalize the network on high action feature (pre-activation) values.
This can prevent the actions features from saturating the TanH activation function, and therefore prevent the
gradients from becoming very low.</li>
<li><strong>clip_critic_targets</strong> (Tuple[float, float] or None)
The range to clip the critic target to in order to prevent overestimation of the action values.</li>
<li><strong>use_non_zero_discount_for_terminal_states</strong> (bool)
gradients from becoming very low.</p></li>
<li><p><strong>clip_critic_targets</strong> (Tuple[float, float] or None)
The range to clip the critic target to in order to prevent overestimation of the action values.</p></li>
<li><p><strong>use_non_zero_discount_for_terminal_states</strong> (bool)
If set to True, the discount factor will be used for terminal states to bootstrap the next predicted state
values. If set to False, the terminal state's reward will be taken as the target return for the network.</li>
values. If set to False, the terminal state's reward will be taken as the target return for the network.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
</div>
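The critic targets and the soft target-network update described above reduce to a few lines; the sketch below is a standalone NumPy illustration under the default settings (terminal transitions get no bootstrap, clip_critic_targets optional), not the rl_coach implementation.

import numpy as np

def ddpg_critic_targets(rewards, target_q_next, terminals, gamma=0.99, clip_range=None):
    # y_t = r_t + gamma * Q_target(s_{t+1}, mu_target(s_{t+1})); the bootstrap is dropped at
    # terminal states (the use_non_zero_discount_for_terminal_states=False behaviour).
    y = np.asarray(rewards) + gamma * np.asarray(target_q_next) * (1.0 - np.asarray(terminals, dtype=float))
    if clip_range is not None:  # mirrors clip_critic_targets
        y = np.clip(y, clip_range[0], clip_range[1])
    return y

def soft_update(target_weights, online_weights, tau):
    # Exponential moving average of the weights; tau plays the role of rate_for_copying_weights_to_target.
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]

print(ddpg_critic_targets([1.0, 0.0], [0.8, 0.9], [0, 1], clip_range=(-10.0, 10.0)))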
@@ -301,7 +300,7 @@ values. If set to False, the terminal states reward will be taken as the target
<a href="sac.html" class="btn btn-neutral float-right" title="Soft Actor-Critic" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="cppo.html" class="btn btn-neutral" title="Clipped Proximal Policy Optimization" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="cppo.html" class="btn btn-neutral float-left" title="Clipped Proximal Policy Optimization" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -310,7 +309,7 @@ values. If set to False, the terminal states reward will be taken as the target
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -327,27 +326,16 @@ values. If set to False, the terminal states reward will be taken as the target
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Hierarchical Actor Critic &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Hierarchical Actor Critic &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -31,21 +39,16 @@
<link rel="search" title="Search" href="../../../search.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -212,7 +215,7 @@ to add exploration noise to the action. When testing, use the mean vector <span
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -229,27 +232,16 @@ to add exploration noise to the action. When testing, use the mean vector <span
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Policy Gradient &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Policy Gradient &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Persistent Advantage Learning" href="../value_optimization/pal.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -237,11 +240,11 @@ The <code class="code docutils literal notranslate"><span class="pre">PolicyGrad
This is done in order to reduce the variance of the updates, since noisy gradient updates might destabilize the policy's
convergence. The rescaler is a configurable parameter and there are few options to choose from:</p>
<ul class="simple">
<li><strong>Total Episode Return</strong> - The sum of all the discounted rewards during the episode.</li>
<li><strong>Future Return</strong> - Return from each transition until the end of the episode.</li>
<li><strong>Future Return Normalized by Episode</strong> - Future returns across the episode normalized by the episode's mean and standard deviation.</li>
<li><strong>Future Return Normalized by Timestep</strong> - Future returns normalized using running means and standard deviations,
which are calculated separately for each timestep, across different episodes.</li>
<li><p><strong>Total Episode Return</strong> - The sum of all the discounted rewards during the episode.</p></li>
<li><p><strong>Future Return</strong> - Return from each transition until the end of the episode.</p></li>
<li><p><strong>Future Return Normalized by Episode</strong> - Future returns across the episode normalized by the episode's mean and standard deviation.</p></li>
<li><p><strong>Future Return Normalized by Timestep</strong> - Future returns normalized using running means and standard deviations,
which are calculated separately for each timestep, across different episodes.</p></li>
</ul>
<p>Gradients are accumulated over a number of full played episodes. The gradients accumulation over several episodes
serves the same purpose - reducing the update variance. After accumulating gradients for several episodes,
@@ -249,32 +252,28 @@ the gradients are then applied to the network.</p>
<dl class="class">
<dt id="rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.policy_gradients_agent.</code><code class="descname">PolicyGradientAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/policy_gradients_agent.html#PolicyGradientAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
The rescaler type to use for the policy gradient loss. For policy gradients, we calculate log probability of
the action and then multiply it by the policy gradient rescaler. The most basic rescaler is the discounted
return, but there are other rescalers that are intended for reducing the variance of the updates.</li>
<li><strong>apply_gradients_every_x_episodes</strong> (int)
return, but there are other rescalers that are intended for reducing the variance of the updates.</p></li>
<li><p><strong>apply_gradients_every_x_episodes</strong> (int)
The number of episodes between applying the accumulated gradients to the network. After every
num_steps_between_gradient_updates steps, the agent will calculate the gradients for the collected data,
it will then accumulate it in internal accumulators, and will only apply them to the network once in every
apply_gradients_every_x_episodes episodes.</li>
<li><strong>beta_entropy</strong> (float)
apply_gradients_every_x_episodes episodes.</p></li>
<li><p><strong>beta_entropy</strong> (float)
A factor which defines the amount of entropy regularization to apply to the network. The entropy of the actions
will be added to the loss and scaled by the given beta factor.</li>
<li><strong>num_steps_between_gradient_updates</strong> (int)
will be added to the loss and scaled by the given beta factor.</p></li>
<li><p><strong>num_steps_between_gradient_updates</strong> (int)
The number of steps between calculating gradients for the collected data. In the A3C paper, this parameter is
called t_max. Since this algorithm is on-policy, only the steps collected between each two gradient calculations
are used in the batch.</li>
are used in the batch.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
</div>
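Two of the rescalers listed above are easy to write out as standalone helpers (a sketch with assumed inputs and names, not the rl_coach implementation): the discounted future return per transition, and its per-episode normalization.

import numpy as np

def discounted_future_returns(rewards, gamma=0.99):
    # 'Future Return': discounted return from each transition to the end of the episode.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize_by_episode(returns, eps=1e-8):
    # 'Future Return Normalized by Episode': standardize by the episode's mean and standard deviation.
    r = np.asarray(returns)
    return (r - r.mean()) / (r.std() + eps)

print(normalize_by_episode(discounted_future_returns([0.0, 0.0, 1.0])))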
@@ -292,7 +291,7 @@ are used in the batch.</li>
<a href="ppo.html" class="btn btn-neutral float-right" title="Proximal Policy Optimization" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="../value_optimization/pal.html" class="btn btn-neutral" title="Persistent Advantage Learning" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="../value_optimization/pal.html" class="btn btn-neutral float-left" title="Persistent Advantage Learning" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -301,7 +300,7 @@ are used in the batch.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -318,27 +317,16 @@ are used in the batch.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Proximal Policy Optimization &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Policy Gradient" href="pg.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -234,66 +237,62 @@ When testing, just take the mean values predicted by the network.</p>
<div class="section" id="training-the-network">
<h3>Training the network<a class="headerlink" href="#training-the-network" title="Permalink to this headline"></a></h3>
<ol class="arabic simple">
<li>Collect a big chunk of experience (in the order of thousands of transitions, sampled from multiple episodes).</li>
<li>Calculate the advantages for each transition, using the <em>Generalized Advantage Estimation</em> method (Schulman 2015).</li>
<li>Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
<li><p>Collect a big chunk of experience (in the order of thousands of transitions, sampled from multiple episodes).</p></li>
<li><p>Calculate the advantages for each transition, using the <em>Generalized Advantage Estimation</em> method (Schulman 2015).</p></li>
<li><p>Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
the L-BFGS optimizer runs on the entire dataset at once, without batching.
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
discounted returns of each state in each episode.</li>
<li>Run several training iterations of the policy network. This is done by using the previously calculated advantages as
discounted returns of each state in each episode.</p></li>
<li><p>Run several training iterations of the policy network. This is done by using the previously calculated advantages as
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used <em>before</em>
starting to run the current set of training iterations) using a regularization term.</li>
<li>After training is done, the last sampled KL divergence value will be compared with the <em>target KL divergence</em> value,
starting to run the current set of training iterations) using a regularization term.</p></li>
<li><p>After training is done, the last sampled KL divergence value will be compared with the <em>target KL divergence</em> value,
in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high,
increase the penalty, if it went too low, reduce it. Otherwise, leave it unchanged.</li>
increase the penalty, if it went too low, reduce it. Otherwise, leave it unchanged.</p></li>
</ol>
<dl class="class">
<dt id="rl_coach.agents.ppo_agent.PPOAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.ppo_agent.</code><code class="descname">PPOAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/ppo_agent.html#PPOAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.ppo_agent.PPOAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>policy_gradient_rescaler</strong> (PolicyGradientRescaler)
This represents how the critic will be used to update the actor. The critic value function is typically used
to rescale the gradients calculated by the actor. There are several ways for doing this, such as using the
advantage of the action, or the generalized advantage estimation (GAE) value.</li>
<li><strong>gae_lambda</strong> (float)
advantage of the action, or the generalized advantage estimation (GAE) value.</p></li>
<li><p><strong>gae_lambda</strong> (float)
The <span class="math notranslate nohighlight">\(\lambda\)</span> value is used within the GAE function in order to weight different bootstrap length
estimations. Typical values are in the range 0.9-1, and define an exponential decay over the different
n-step estimations.</li>
<li><strong>target_kl_divergence</strong> (float)
n-step estimations.</p></li>
<li><p><strong>target_kl_divergence</strong> (float)
The target kl divergence between the current policy distribution and the new policy. PPO uses a heuristic to
bring the KL divergence to this value, by adding a penalty if the kl divergence is higher.</li>
<li><strong>initial_kl_coefficient</strong> (float)
bring the KL divergence to this value, by adding a penalty if the kl divergence is higher.</p></li>
<li><p><strong>initial_kl_coefficient</strong> (float)
The initial weight that will be given to the KL divergence between the current and the new policy in the
regularization factor.</li>
<li><strong>high_kl_penalty_coefficient</strong> (float)
The penalty that will be given for KL divergence values which are higher than what was defined as the target.</li>
<li><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
regularization factor.</p></li>
<li><p><strong>high_kl_penalty_coefficient</strong> (float)
The penalty that will be given for KL divergence values which are higher than what was defined as the target.</p></li>
<li><p><strong>clip_likelihood_ratio_using_epsilon</strong> (float)
If not None, the likelihood ratio between the current and new policy in the PPO loss function will be
clipped to the range [1-clip_likelihood_ratio_using_epsilon, 1+clip_likelihood_ratio_using_epsilon].
This is typically used in the Clipped PPO version of PPO, and should be set to None in regular PPO
implementations.</li>
<li><strong>value_targets_mix_fraction</strong> (float)
implementations.</p></li>
<li><p><strong>value_targets_mix_fraction</strong> (float)
The targets for the value network are an exponential weighted moving average which uses this mix fraction to
define how much of the new targets will be taken into account when calculating the loss.
This value should be set to the range (0,1], where 1 means that only the new targets will be taken into account.</li>
<li><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</li>
<li><strong>use_kl_regularization</strong> (bool)
This value should be set within the range (0,1], where 1 means that only the new targets will be taken into account.</p></li>
<li><p><strong>estimate_state_value_using_gae</strong> (bool)
If set to True, the state value will be estimated using the GAE technique.</p></li>
<li><p><strong>use_kl_regularization</strong> (bool)
If set to True, the loss function will be regularized using the KL divergence between the current and new
policy, to bound the change of the policy during the network update.</li>
<li><strong>beta_entropy</strong> (float)
policy, to bound the change of the policy during the network update.</p></li>
<li><p><strong>beta_entropy</strong> (float)
An entropy regularization term can be added to the loss function in order to control exploration. This term
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</li>
is weighted using the <span class="math notranslate nohighlight">\(\beta\)</span> value defined by beta_entropy.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
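<p>For orientation, a minimal sketch of how these fields might be set in a preset is given below. It assumes the
usual Coach layout in which the algorithm parameters are reachable as <code>agent_params.algorithm</code>; the
chosen values are placeholders for illustration, not recommended settings.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Hypothetical preset snippet (assumed wiring): tuning PPOAlgorithmParameters
# through the agent parameters object. Field names follow the documentation
# above; the numeric values are placeholders only.
from rl_coach.agents.ppo_agent import PPOAgentParameters

agent_params = PPOAgentParameters()
algo = agent_params.algorithm                       # PPOAlgorithmParameters instance

algo.gae_lambda = 0.95                              # exponential decay over the n-step estimations
algo.target_kl_divergence = 0.01                    # KL level the penalty heuristic steers towards
algo.initial_kl_coefficient = 0.2                   # starting weight of the KL regularization term
algo.high_kl_penalty_coefficient = 10.0             # extra penalty once the KL exceeds the target
algo.clip_likelihood_ratio_using_epsilon = None     # None = regular PPO, no ratio clipping
algo.value_targets_mix_fraction = 0.1               # EWMA mix fraction for the value targets, in (0, 1]
algo.estimate_state_value_using_gae = True          # estimate the state value with GAE
algo.use_kl_regularization = True                   # bound the policy change with a KL term
algo.beta_entropy = 0.01                            # weight of the entropy regularization term
</pre></div></div>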
</div>
@@ -311,7 +310,7 @@ is weighted using the <span class="math notranslate nohighlight">\(eta\)</span>
<a href="../value_optimization/rainbow.html" class="btn btn-neutral float-right" title="Rainbow" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="pg.html" class="btn btn-neutral" title="Policy Gradient" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="pg.html" class="btn btn-neutral float-left" title="Policy Gradient" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -320,7 +319,7 @@ is weighted using the <span class="math notranslate nohighlight">\(eta\)</span>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -337,27 +336,16 @@ is weighted using the <span class="math notranslate nohighlight">\(eta\)</span>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Soft Actor-Critic &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Soft Actor-Critic &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Deep Deterministic Policy Gradient" href="ddpg.html" />
<link href="../../../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../../../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -235,19 +238,19 @@ by picking the mean value or sample from a gaussian distribution like in trainin
<h3>Training the network<a class="headerlink" href="#training-the-network" title="Permalink to this headline"></a></h3>
<p>Start by sampling a batch <span class="math notranslate nohighlight">\(B\)</span> of transitions from the experience replay.</p>
<ul>
<li><p class="first">To train the <strong>Q network</strong>, use the following targets:</p>
<li><p>To train the <strong>Q network</strong>, use the following targets:</p>
<div class="math notranslate nohighlight">
\[y_t^Q=r(s_t,a_t)+\gamma \cdot V(s_{t+1})\]</div>
<p>The state value used in the above target is acquired by running the target state value network.</p>
</li>
<li><p class="first">To train the <strong>State Value network</strong>, use the following targets:</p>
<li><p>To train the <strong>State Value network</strong>, use the following targets:</p>
<div class="math notranslate nohighlight">
\[y_t^V = \min_{i=1,2}Q_i(s_t,\tilde{a}) - \log\pi (\tilde{a} \vert s_t),\,\,\,\, \tilde{a} \sim \pi(\cdot \vert s_t)\]</div>
<p>The state value network is trained using a sample-based approximation of the relation between the state value and the state-action
values. The actions used for constructing the target are <strong>not</strong> sampled from the replay buffer, but rather sampled
from the current policy; a short sketch of how these targets can be computed is given after this list.</p>
</li>
<li><p class="first">To train the <strong>actor network</strong>, use the following equation:</p>
<li><p>To train the <strong>actor network</strong>, use the following equation:</p>
<div class="math notranslate nohighlight">
\[\nabla_{\theta} J \approx \nabla_{\theta} \frac{1}{\vert B \vert} \sum_{s_t\in B} \left( Q \left(s_t, \tilde{a}_\theta(s_t)\right) - \log\pi_{\theta}(\tilde{a}_{\theta}(s_t)\vert s_t) \right),\,\,\,\, \tilde{a} \sim \pi(\cdot \vert s_t)\]</div>
</li>
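<p>As a concrete illustration, the sketch below computes the two regression targets for a sampled batch. The
callables <code>q1</code>, <code>q2</code>, <code>v_target</code> and <code>policy_sample_and_logp</code> are
hypothetical placeholders standing in for the corresponding networks and policy; this mirrors the equations only
and is not the rl_coach implementation.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Hypothetical sketch of the SAC targets described above, written with NumPy
# and placeholder callables for the networks. Not the rl_coach implementation.
import numpy as np

def sac_targets(batch, q1, q2, v_target, policy_sample_and_logp, gamma=0.99):
    s, r, s_next = batch["states"], batch["rewards"], batch["next_states"]

    # Q-network target: y^Q = r + gamma * V_target(s_next)
    y_q = r + gamma * v_target(s_next)

    # State-value target: actions are re-sampled from the *current* policy
    # (not taken from the replay buffer) and the minimum of the two Q heads
    # is used, minus the log-probability of the sampled action.
    a_tilde, log_pi = policy_sample_and_logp(s)
    y_v = np.minimum(q1(s, a_tilde), q2(s, a_tilde)) - log_pi

    # The actor itself is updated by ascending Q(s, a_theta(s)) - log pi_theta(a_theta(s) | s),
    # which in practice is handled by the framework's automatic differentiation.
    return y_q, y_v
</pre></div></div>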
@@ -256,24 +259,20 @@ from the current policy.</p>
<dl class="class">
<dt id="rl_coach.agents.soft_actor_critic_agent.SoftActorCriticAlgorithmParameters">
<em class="property">class </em><code class="descclassname">rl_coach.agents.soft_actor_critic_agent.</code><code class="descname">SoftActorCriticAlgorithmParameters</code><a class="reference internal" href="../../../_modules/rl_coach/agents/soft_actor_critic_agent.html#SoftActorCriticAlgorithmParameters"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.agents.soft_actor_critic_agent.SoftActorCriticAlgorithmParameters" title="Permalink to this definition"></a></dt>
<dd><table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>num_steps_between_copying_online_weights_to_target</strong> (StepMethod)
The number of steps between copying the online network weights to the target network weights.</li>
<li><strong>rate_for_copying_weights_to_target</strong> (float)
<dd><dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>num_steps_between_copying_online_weights_to_target</strong> (StepMethod)
The number of steps between copying the online network weights to the target network weights.</p></li>
<li><p><strong>rate_for_copying_weights_to_target</strong> (float)
When copying the online network weights to the target network weights, a soft update will be used, which
weights the new online network weights by rate_for_copying_weights_to_target (Tau, as defined in the paper).</li>
<li><strong>use_deterministic_for_evaluation</strong> (bool)
weights the new online network weights by rate_for_copying_weights_to_target (Tau, as defined in the paper).</p></li>
<li><p><strong>use_deterministic_for_evaluation</strong> (bool)
If True, during the evaluation phase, actions are chosen deterministically according to the policy mean
and not sampled from the policy distribution.</li>
and not sampled from the policy distribution.</p></li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>
</dd></dl>
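<p>A minimal sketch of setting these fields in a preset is shown below, again assuming the usual Coach layout where
the algorithm parameters live under <code>agent_params.algorithm</code>; the values are placeholders.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Hypothetical preset snippet (assumed wiring) for SoftActorCriticAlgorithmParameters.
# Field names follow the documentation above; the values are illustrative only.
from rl_coach.agents.soft_actor_critic_agent import SoftActorCriticAgentParameters
from rl_coach.core_types import EnvironmentSteps

agent_params = SoftActorCriticAgentParameters()
algo = agent_params.algorithm

algo.num_steps_between_copying_online_weights_to_target = EnvironmentSteps(1)  # sync the target every step
algo.rate_for_copying_weights_to_target = 0.005                                # soft-update rate (Tau)
algo.use_deterministic_for_evaluation = True                                   # evaluate with the policy mean
</pre></div></div>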
</div>
@@ -291,7 +290,7 @@ and not sampled from the policy distribution.</li>
<a href="../other/dfp.html" class="btn btn-neutral float-right" title="Direct Future Prediction" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="ddpg.html" class="btn btn-neutral" title="Deep Deterministic Policy Gradient" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="ddpg.html" class="btn btn-neutral float-left" title="Deep Deterministic Policy Gradient" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -300,7 +299,7 @@ and not sampled from the policy distribution.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -317,27 +316,16 @@ and not sampled from the policy distribution.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../../_static/jquery.js"></script>
<script type="text/javascript" src="../../../_static/underscore.js"></script>
<script type="text/javascript" src="../../../_static/doctools.js"></script>
<script type="text/javascript" src="../../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>