mirror of https://github.com/gryf/coach.git synced 2025-12-18 03:30:19 +01:00

Docs changes - fixing blog-post links, removing the import of all exploration policies (#139)

* updated docs

* removing imports for all exploration policies in __init__ + setting the right blog-post link

* small cleanups
Authored by Gal Leibovich on 2018-12-05 23:16:16 +02:00
Committed by Scott Leishman
parent 155b78b995
commit f12857a8c7
33 changed files with 191 additions and 160 deletions
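Since the exploration policy classes are no longer re-exported from the package __init__, the generated Sphinx anchors (and user imports) now go through each concrete module. A minimal sketch of the implied import change; the old package-level import is an assumption based on the previous anchors:

# Old style (assumed, matching the previous anchors such as
# rl_coach.exploration_policies.AdditiveNoise):
#   from rl_coach.exploration_policies import AdditiveNoise
# New style, matching the updated anchors and the factory string
# 'rl_coach.exploration_policies.additive_noise:AdditiveNoise':
from rl_coach.exploration_policies.additive_noise import AdditiveNoise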

View File

@@ -216,7 +216,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.additive_noise:AdditiveNoise&#39;</span>
<div class="viewcode-block" id="AdditiveNoise"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.AdditiveNoise">[docs]</a><span class="k">class</span> <span class="nc">AdditiveNoise</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="AdditiveNoise"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.additive_noise.AdditiveNoise">[docs]</a><span class="k">class</span> <span class="nc">AdditiveNoise</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> AdditiveNoise is an exploration policy intended for continuous action spaces. It takes the action from the agent</span>
<span class="sd"> and adds a Gaussian distributed noise to it. The amount of noise added to the action follows the noise amount that</span>

View File

@@ -215,7 +215,7 @@
<div class="viewcode-block" id="Boltzmann"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.Boltzmann">[docs]</a><span class="k">class</span> <span class="nc">Boltzmann</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="Boltzmann"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.boltzmann.Boltzmann">[docs]</a><span class="k">class</span> <span class="nc">Boltzmann</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> The Boltzmann exploration policy is intended for discrete action spaces. It assumes that each of the possible</span>
<span class="sd"> actions has some value assigned to it (such as the Q value), and uses a softmax function to convert these values</span>

View File

@@ -218,7 +218,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.bootstrapped:Bootstrapped&#39;</span>
<div class="viewcode-block" id="Bootstrapped"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.Bootstrapped">[docs]</a><span class="k">class</span> <span class="nc">Bootstrapped</span><span class="p">(</span><span class="n">EGreedy</span><span class="p">):</span>
<div class="viewcode-block" id="Bootstrapped"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.bootstrapped.Bootstrapped">[docs]</a><span class="k">class</span> <span class="nc">Bootstrapped</span><span class="p">(</span><span class="n">EGreedy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Bootstrapped exploration policy is currently only used for discrete action spaces along with the</span>
<span class="sd"> Bootstrapped DQN agent. It assumes that there is an ensemble of network heads, where each one predicts the</span>

View File

@@ -209,7 +209,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.categorical:Categorical&#39;</span>
<div class="viewcode-block" id="Categorical"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.Categorical">[docs]</a><span class="k">class</span> <span class="nc">Categorical</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="Categorical"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.categorical.Categorical">[docs]</a><span class="k">class</span> <span class="nc">Categorical</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Categorical exploration policy is intended for discrete action spaces. It expects the action values to</span>
<span class="sd"> represent a probability distribution over the action, from which a single action will be sampled.</span>

View File

@@ -203,7 +203,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.continuous_entropy:ContinuousEntropy&#39;</span>
<div class="viewcode-block" id="ContinuousEntropy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ContinuousEntropy">[docs]</a><span class="k">class</span> <span class="nc">ContinuousEntropy</span><span class="p">(</span><span class="n">AdditiveNoise</span><span class="p">):</span>
<div class="viewcode-block" id="ContinuousEntropy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy">[docs]</a><span class="k">class</span> <span class="nc">ContinuousEntropy</span><span class="p">(</span><span class="n">AdditiveNoise</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Continuous entropy is an exploration policy that is actually implemented as part of the network.</span>
<span class="sd"> The exploration policy class is only a placeholder for choosing this policy. The exploration policy is</span>

View File

@@ -222,7 +222,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.e_greedy:EGreedy&#39;</span>
<div class="viewcode-block" id="EGreedy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.EGreedy">[docs]</a><span class="k">class</span> <span class="nc">EGreedy</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="EGreedy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.e_greedy.EGreedy">[docs]</a><span class="k">class</span> <span class="nc">EGreedy</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> e-greedy is an exploration policy that is intended for both discrete and continuous action spaces.</span>

View File

@@ -210,7 +210,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.exploration_policy:ExplorationPolicy&#39;</span>
<div class="viewcode-block" id="ExplorationPolicy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ExplorationPolicy">[docs]</a><span class="k">class</span> <span class="nc">ExplorationPolicy</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<div class="viewcode-block" id="ExplorationPolicy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy">[docs]</a><span class="k">class</span> <span class="nc">ExplorationPolicy</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> An exploration policy takes the predicted actions or action values from the agent, and selects the action to</span>
<span class="sd"> actually apply to the environment using some predefined algorithm.</span>
@@ -222,14 +222,14 @@
<span class="bp">self</span><span class="o">.</span><span class="n">phase</span> <span class="o">=</span> <span class="n">RunPhase</span><span class="o">.</span><span class="n">HEATUP</span>
<span class="bp">self</span><span class="o">.</span><span class="n">action_space</span> <span class="o">=</span> <span class="n">action_space</span>
<div class="viewcode-block" id="ExplorationPolicy.reset"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ExplorationPolicy.reset">[docs]</a> <span class="k">def</span> <span class="nf">reset</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<div class="viewcode-block" id="ExplorationPolicy.reset"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.reset">[docs]</a> <span class="k">def</span> <span class="nf">reset</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Used for resetting the exploration policy parameters when needed</span>
<span class="sd"> :return: None</span>
<span class="sd"> &quot;&quot;&quot;</span>
<span class="k">pass</span></div>
<div class="viewcode-block" id="ExplorationPolicy.get_action"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ExplorationPolicy.get_action">[docs]</a> <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">action_values</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ActionType</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">ActionType</span><span class="p">:</span>
<div class="viewcode-block" id="ExplorationPolicy.get_action"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.get_action">[docs]</a> <span class="k">def</span> <span class="nf">get_action</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">action_values</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ActionType</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">ActionType</span><span class="p">:</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Given a list of values corresponding to each action, </span>
<span class="sd"> choose one actions according to the exploration policy</span>
@@ -243,7 +243,7 @@
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&quot;The get_action function should be overridden in the inheriting exploration class&quot;</span><span class="p">)</span></div>
<div class="viewcode-block" id="ExplorationPolicy.change_phase"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ExplorationPolicy.change_phase">[docs]</a> <span class="k">def</span> <span class="nf">change_phase</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">phase</span><span class="p">):</span>
<div class="viewcode-block" id="ExplorationPolicy.change_phase"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.change_phase">[docs]</a> <span class="k">def</span> <span class="nf">change_phase</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">phase</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Change between running phases of the algorithm</span>
<span class="sd"> :param phase: Either Heatup or Train</span>
@@ -251,7 +251,7 @@
<span class="sd"> &quot;&quot;&quot;</span>
<span class="bp">self</span><span class="o">.</span><span class="n">phase</span> <span class="o">=</span> <span class="n">phase</span></div>
<div class="viewcode-block" id="ExplorationPolicy.requires_action_values"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ExplorationPolicy.requires_action_values">[docs]</a> <span class="k">def</span> <span class="nf">requires_action_values</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
<div class="viewcode-block" id="ExplorationPolicy.requires_action_values"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.requires_action_values">[docs]</a> <span class="k">def</span> <span class="nf">requires_action_values</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> Allows exploration policies to define if they require the action values for the current step.</span>
<span class="sd"> This can save up a lot of computation. For example in e-greedy, if the random value generated is smaller</span>

View File

@@ -209,7 +209,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.greedy:Greedy&#39;</span>
<div class="viewcode-block" id="Greedy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.Greedy">[docs]</a><span class="k">class</span> <span class="nc">Greedy</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="Greedy"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.greedy.Greedy">[docs]</a><span class="k">class</span> <span class="nc">Greedy</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> The Greedy exploration policy is intended for both discrete and continuous action spaces.</span>
<span class="sd"> For discrete action spaces, it always selects the action with the maximum value, as given by the agent.</span>

View File

@@ -219,7 +219,7 @@
<span class="c1"># Ornstein-Uhlenbeck process</span>
<div class="viewcode-block" id="OUProcess"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.OUProcess">[docs]</a><span class="k">class</span> <span class="nc">OUProcess</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="OUProcess"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ou_process.OUProcess">[docs]</a><span class="k">class</span> <span class="nc">OUProcess</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> OUProcess exploration policy is intended for continuous action spaces, and selects the action according to</span>
<span class="sd"> an Ornstein-Uhlenbeck process. The Ornstein-Uhlenbeck process implements the action as a Gaussian process, where</span>

View File

@@ -210,7 +210,8 @@
<span class="k">class</span> <span class="nc">ParameterNoiseParameters</span><span class="p">(</span><span class="n">ExplorationParameters</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">agent_params</span><span class="p">:</span> <span class="n">AgentParameters</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">agent_params</span><span class="p">,</span> <span class="n">DQNAgentParameters</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">agent_params</span><span class="o">.</span><span class="n">algorithm</span><span class="o">.</span><span class="n">supports_parameter_noise</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&quot;Currently only DQN variants are supported for using an exploration type of &quot;</span>
<span class="s2">&quot;ParameterNoise.&quot;</span><span class="p">)</span>
@@ -221,7 +222,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.parameter_noise:ParameterNoise&#39;</span>
<div class="viewcode-block" id="ParameterNoise"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ParameterNoise">[docs]</a><span class="k">class</span> <span class="nc">ParameterNoise</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="ParameterNoise"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.parameter_noise.ParameterNoise">[docs]</a><span class="k">class</span> <span class="nc">ParameterNoise</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> The ParameterNoise exploration policy is intended for both discrete and continuous action spaces.</span>
<span class="sd"> It applies the exploration policy by replacing all the dense network layers with noisy layers.</span>

View File

@@ -218,7 +218,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.truncated_normal:TruncatedNormal&#39;</span>
<div class="viewcode-block" id="TruncatedNormal"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.TruncatedNormal">[docs]</a><span class="k">class</span> <span class="nc">TruncatedNormal</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<div class="viewcode-block" id="TruncatedNormal"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.truncated_normal.TruncatedNormal">[docs]</a><span class="k">class</span> <span class="nc">TruncatedNormal</span><span class="p">(</span><span class="n">ExplorationPolicy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> The TruncatedNormal exploration policy is intended for continuous action spaces. It samples the action from a</span>
<span class="sd"> normal distribution, where the mean action is given by the agent, and the standard deviation can be given in t</span>

View File

@@ -222,7 +222,7 @@
<span class="k">return</span> <span class="s1">&#39;rl_coach.exploration_policies.ucb:UCB&#39;</span>
<div class="viewcode-block" id="UCB"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.UCB">[docs]</a><span class="k">class</span> <span class="nc">UCB</span><span class="p">(</span><span class="n">EGreedy</span><span class="p">):</span>
<div class="viewcode-block" id="UCB"><a class="viewcode-back" href="../../../components/exploration_policies/index.html#rl_coach.exploration_policies.ucb.UCB">[docs]</a><span class="k">class</span> <span class="nc">UCB</span><span class="p">(</span><span class="n">EGreedy</span><span class="p">):</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="sd"> UCB exploration policy is following the upper confidence bound heuristic to sample actions in discrete action spaces.</span>
<span class="sd"> It assumes that there are multiple network heads that are predicting action values, and that the standard deviation</span>