TD3 (#338)
@@ -205,7 +205,7 @@ predefined policy. This is one of the most important aspects of reinforcement le
tuning to get it right. Coach supports several pre-defined exploration policies, and it can be easily extended with
custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action
spaces.</p>
<table class="docutils align-center">
<table class="docutils align-default">
<colgroup>
<col style="width: 35%" />
<col style="width: 37%" />
@@ -268,7 +268,7 @@ spaces.</p>
<h2>ExplorationPolicy<a class="headerlink" href="#explorationpolicy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.exploration_policy.</code><code class="descname">ExplorationPolicy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.exploration_policy.</code><code class="sig-name descname">ExplorationPolicy</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy" title="Permalink to this definition">¶</a></dt>
<dd><p>An exploration policy takes the predicted actions or action values from the agent, and selects the action to
actually apply to the environment using some predefined algorithm.</p>
<dl class="field-list simple">
@@ -278,7 +278,7 @@ actually apply to the environment using some predefined algorithm.</p>
</dl>
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.change_phase">
<code class="descname">change_phase</code><span class="sig-paren">(</span><em>phase</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.change_phase"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.change_phase" title="Permalink to this definition">¶</a></dt>
<code class="sig-name descname">change_phase</code><span class="sig-paren">(</span><em class="sig-param">phase</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.change_phase"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.change_phase" title="Permalink to this definition">¶</a></dt>
<dd><p>Change between running phases of the algorithm
:param phase: Either Heatup or Train
:return: none</p>
@@ -286,16 +286,19 @@ actually apply to the environment using some predefined algorithm.</p>
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.get_action">
<code class="descname">get_action</code><span class="sig-paren">(</span><em>action_values: List[Union[int, float, numpy.ndarray, List]]</em><span class="sig-paren">)</span> → Union[int, float, numpy.ndarray, List]<a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.get_action"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.get_action" title="Permalink to this definition">¶</a></dt>
<code class="sig-name descname">get_action</code><span class="sig-paren">(</span><em class="sig-param">action_values: List[Union[int, float, numpy.ndarray, List]]</em><span class="sig-paren">)</span> → Union[int, float, numpy.ndarray, List]<a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.get_action"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.get_action" title="Permalink to this definition">¶</a></dt>
<dd><p>Given a list of values corresponding to each action,
choose one action according to the exploration policy
:param action_values: A list of action values
:return: The chosen action</p>
:return: The chosen action,</p>
<blockquote>
<div><p>The probability of the action (if available, otherwise 1 for absolute certainty in the action)</p>
</div></blockquote>
</dd></dl>
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.requires_action_values">
<code class="descname">requires_action_values</code><span class="sig-paren">(</span><span class="sig-paren">)</span> → bool<a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.requires_action_values"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.requires_action_values" title="Permalink to this definition">¶</a></dt>
<code class="sig-name descname">requires_action_values</code><span class="sig-paren">(</span><span class="sig-paren">)</span> → bool<a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.requires_action_values"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.requires_action_values" title="Permalink to this definition">¶</a></dt>
<dd><p>Allows exploration policies to define if they require the action values for the current step.
This can save a lot of computation. For example in e-greedy, if the random value generated is smaller
than epsilon, the action is completely random, and the action values don’t need to be calculated
@@ -304,7 +307,7 @@ than epsilon, the action is completely random, and the action values don’t nee
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.reset">
<code class="descname">reset</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.reset"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.reset" title="Permalink to this definition">¶</a></dt>
<code class="sig-name descname">reset</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.reset"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.reset" title="Permalink to this definition">¶</a></dt>
<dd><p>Used for resetting the exploration policy parameters when needed
:return: None</p>
</dd></dl>
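For orientation, here is a minimal sketch of a custom policy written against the interface documented above. The (action, probability) return shape follows the updated get_action docstring shown in this diff, and RandomTieBreakGreedy is purely an illustrative name, not part of the library.

# Minimal sketch of a custom exploration policy built against the interface
# documented above; exact base-class behaviour may differ between versions.
import numpy as np

from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy
from rl_coach.spaces import ActionSpace


class RandomTieBreakGreedy(ExplorationPolicy):
    """Picks the highest-valued action, breaking ties uniformly at random."""

    def __init__(self, action_space: ActionSpace):
        super().__init__(action_space)

    def requires_action_values(self) -> bool:
        # This policy always needs the action values in order to take a maximum.
        return True

    def get_action(self, action_values):
        values = np.asarray(action_values).squeeze()
        best = np.flatnonzero(values == values.max())
        action = int(np.random.choice(best))
        # Probability of the chosen action under this policy.
        return action, 1.0 / len(best)

    def reset(self):
        # No internal state to reset in this simple policy.
        pass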
@@ -316,7 +319,7 @@ than epsilon, the action is completely random, and the action values don’t nee
<h2>AdditiveNoise<a class="headerlink" href="#additivenoise" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.additive_noise.AdditiveNoise">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.additive_noise.</code><code class="descname">AdditiveNoise</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>noise_percentage_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_noise_percentage: float</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/additive_noise.html#AdditiveNoise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.additive_noise.AdditiveNoise" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.additive_noise.</code><code class="sig-name descname">AdditiveNoise</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">noise_schedule: rl_coach.schedules.Schedule</em>, <em class="sig-param">evaluation_noise: float</em>, <em class="sig-param">noise_as_percentage_from_action_space: bool = True</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/additive_noise.html#AdditiveNoise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.additive_noise.AdditiveNoise" title="Permalink to this definition">¶</a></dt>
<dd><p>AdditiveNoise is an exploration policy intended for continuous action spaces. It takes the action from the agent
and adds a Gaussian distributed noise to it. The amount of noise added to the action is determined by a noise amount that
can be given in two different ways:
@@ -327,9 +330,10 @@ be the mean of the action, and 2nd is assumed to be its standard deviation.</p>
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>noise_percentage_schedule</strong> – the schedule for the noise variance percentage relative to the absolute range
of the action space</p></li>
<li><p><strong>evaluation_noise_percentage</strong> – the noise variance percentage that will be used during evaluation phases</p></li>
<li><p><strong>noise_schedule</strong> – the schedule for the noise</p></li>
<li><p><strong>evaluation_noise</strong> – the noise variance that will be used during evaluation phases</p></li>
<li><p><strong>noise_as_percentage_from_action_space</strong> – a bool deciding whether the noise is absolute or as a percentage
from the action space</p></li>
</ul>
</dd>
</dl>
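As a rough sketch of the rule described above (not the library's code): the noise scale below is assumed to come from the schedule at the current step, and the final clipping to the action bounds is an added assumption, not taken from this page.

# Illustrative sketch of additive Gaussian noise whose scale may be interpreted
# as a fraction of the action-space range, as described above.
import numpy as np

def additive_noise_action(agent_action, noise_value, low, high,
                          noise_as_percentage_from_action_space=True):
    if noise_as_percentage_from_action_space:
        # Scale the noise by the absolute range of each action dimension.
        scale = noise_value * (np.asarray(high) - np.asarray(low))
    else:
        scale = noise_value
    noisy = np.asarray(agent_action) + np.random.normal(0.0, scale)
    return np.clip(noisy, low, high)  # keep the noisy action inside the bounds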
@@ -340,7 +344,7 @@ of the action space</p></li>
<h2>Boltzmann<a class="headerlink" href="#boltzmann" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.boltzmann.Boltzmann">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.boltzmann.</code><code class="descname">Boltzmann</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>temperature_schedule: rl_coach.schedules.Schedule</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/boltzmann.html#Boltzmann"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.boltzmann.Boltzmann" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.boltzmann.</code><code class="sig-name descname">Boltzmann</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">temperature_schedule: rl_coach.schedules.Schedule</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/boltzmann.html#Boltzmann"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.boltzmann.Boltzmann" title="Permalink to this definition">¶</a></dt>
<dd><p>The Boltzmann exploration policy is intended for discrete action spaces. It assumes that each of the possible
actions has some value assigned to it (such as the Q value), and uses a softmax function to convert these values
into a distribution over the actions. It then samples the action to play from the calculated distribution.
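A minimal sketch of the softmax sampling described above, assuming the temperature has already been read from the temperature_schedule for the current step:

# Sketch of Boltzmann (softmax) action selection over discrete action values.
import numpy as np

def boltzmann_action(action_values, temperature):
    values = np.asarray(action_values, dtype=np.float64)
    # Subtract the max for numerical stability before exponentiating.
    logits = (values - values.max()) / temperature
    probabilities = np.exp(logits) / np.exp(logits).sum()
    action = np.random.choice(len(values), p=probabilities)
    return int(action), probabilities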
@@ -360,7 +364,7 @@ An additional temperature schedule can be given by the user, and will control th
<h2>Bootstrapped<a class="headerlink" href="#bootstrapped" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.bootstrapped.Bootstrapped">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.bootstrapped.</code><code class="descname">Bootstrapped</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>epsilon_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_epsilon: float</em>, <em>architecture_num_q_heads: int</em>, <em>continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object></em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/bootstrapped.html#Bootstrapped"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.bootstrapped.Bootstrapped" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.bootstrapped.</code><code class="sig-name descname">Bootstrapped</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">epsilon_schedule: rl_coach.schedules.Schedule</em>, <em class="sig-param">evaluation_epsilon: float</em>, <em class="sig-param">architecture_num_q_heads: int</em>, <em class="sig-param">continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object></em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/bootstrapped.html#Bootstrapped"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.bootstrapped.Bootstrapped" title="Permalink to this definition">¶</a></dt>
<dd><p>Bootstrapped exploration policy is currently only used for discrete action spaces along with the
Bootstrapped DQN agent. It assumes that there is an ensemble of network heads, where each one predicts the
values for all the possible actions. For each episode, a single head is selected to lead the agent, according
@@ -390,7 +394,7 @@ if the e-greedy is used for a continuous policy</p></li>
<h2>Categorical<a class="headerlink" href="#categorical" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.categorical.Categorical">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.categorical.</code><code class="descname">Categorical</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/categorical.html#Categorical"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.categorical.Categorical" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.categorical.</code><code class="sig-name descname">Categorical</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/categorical.html#Categorical"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.categorical.Categorical" title="Permalink to this definition">¶</a></dt>
<dd><p>Categorical exploration policy is intended for discrete action spaces. It expects the action values to
represent a probability distribution over the actions, from which a single action will be sampled.
In evaluation, the action that has the highest probability will be selected. This is particularly useful for
@@ -407,7 +411,7 @@ actor-critic schemes, where the actors output is a probability distribution over
<h2>ContinuousEntropy<a class="headerlink" href="#continuousentropy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.continuous_entropy.</code><code class="descname">ContinuousEntropy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>noise_percentage_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_noise_percentage: float</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/continuous_entropy.html#ContinuousEntropy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.continuous_entropy.</code><code class="sig-name descname">ContinuousEntropy</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">noise_schedule: rl_coach.schedules.Schedule</em>, <em class="sig-param">evaluation_noise: float</em>, <em class="sig-param">noise_as_percentage_from_action_space: bool = True</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/continuous_entropy.html#ContinuousEntropy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy" title="Permalink to this definition">¶</a></dt>
<dd><p>Continuous entropy is an exploration policy that is actually implemented as part of the network.
The exploration policy class is only a placeholder for choosing this policy. The exploration policy is
implemented by adding a regularization factor to the network loss, which regularizes the entropy of the action.
@@ -422,9 +426,10 @@ There are only a few heads that actually are relevant and implement the entropy
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>noise_percentage_schedule</strong> – the schedule for the noise variance percentage relative to the absolute range
of the action space</p></li>
<li><p><strong>evaluation_noise_percentage</strong> – the noise variance percentage that will be used during evaluation phases</p></li>
<li><p><strong>noise_schedule</strong> – the schedule for the noise</p></li>
<li><p><strong>evaluation_noise</strong> – the noise variance that will be used during evaluation phases</p></li>
<li><p><strong>noise_as_percentage_from_action_space</strong> – a bool deciding whether the noise is absolute or as a percentage
from the action space</p></li>
</ul>
</dd>
</dl>
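Since the class itself is only a placeholder, the sketch below illustrates the underlying idea for a diagonal Gaussian policy head; the beta coefficient is an illustrative name, not a documented parameter.

# Sketch of entropy regularization: exploration is not applied when choosing
# the action, but as an entropy bonus subtracted from the policy loss.
import numpy as np

def gaussian_entropy(log_std):
    # Differential entropy of a diagonal Gaussian policy, summed over dimensions.
    return np.sum(0.5 * np.log(2.0 * np.pi * np.e) + log_std)

def regularized_loss(policy_loss, log_std, beta=0.01):
    # Higher entropy (a wider action distribution) lowers the loss slightly,
    # which discourages the policy from collapsing to a deterministic action.
    return policy_loss - beta * gaussian_entropy(log_std)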
@@ -435,7 +440,7 @@ of the action space</p></li>
<h2>EGreedy<a class="headerlink" href="#egreedy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.e_greedy.EGreedy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.e_greedy.</code><code class="descname">EGreedy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>epsilon_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_epsilon: float</em>, <em>continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object></em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/e_greedy.html#EGreedy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.e_greedy.EGreedy" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.e_greedy.</code><code class="sig-name descname">EGreedy</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">epsilon_schedule: rl_coach.schedules.Schedule</em>, <em class="sig-param">evaluation_epsilon: float</em>, <em class="sig-param">continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object></em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/e_greedy.html#EGreedy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.e_greedy.EGreedy" title="Permalink to this definition">¶</a></dt>
<dd><p>e-greedy is an exploration policy that is intended for both discrete and continuous action spaces.</p>
<p>For discrete action spaces, it assumes that each action is assigned a value, and it selects the action with the
highest value with probability 1 - epsilon. Otherwise, it selects an action sampled uniformly out of all the
@@ -463,7 +468,7 @@ if the e-greedy is used for a continuous policy</p></li>
<h2>Greedy<a class="headerlink" href="#greedy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.greedy.Greedy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.greedy.</code><code class="descname">Greedy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/greedy.html#Greedy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.greedy.Greedy" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.greedy.</code><code class="sig-name descname">Greedy</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/greedy.html#Greedy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.greedy.Greedy" title="Permalink to this definition">¶</a></dt>
<dd><p>The Greedy exploration policy is intended for both discrete and continuous action spaces.
For discrete action spaces, it always selects the action with the maximum value, as given by the agent.
For continuous action spaces, it always returns the exact action, as it was given by the agent.</p>
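In pseudocode terms this amounts to the following sketch, where is_discrete is an illustrative flag standing in for the action space type check done by the library:

# Sketch of greedy selection: argmax over values for discrete actions,
# pass-through for continuous actions.
import numpy as np

def greedy_action(action_values, is_discrete):
    if is_discrete:
        return int(np.argmax(action_values))
    return action_values  # continuous case: use the agent's action as-is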
@@ -479,7 +484,7 @@ For continuous action spaces, it always return the exact action, as it was given
<h2>OUProcess<a class="headerlink" href="#ouprocess" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.ou_process.OUProcess">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.ou_process.</code><code class="descname">OUProcess</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>mu: float = 0</em>, <em>theta: float = 0.15</em>, <em>sigma: float = 0.2</em>, <em>dt: float = 0.01</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/ou_process.html#OUProcess"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.ou_process.OUProcess" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.ou_process.</code><code class="sig-name descname">OUProcess</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">mu: float = 0</em>, <em class="sig-param">theta: float = 0.15</em>, <em class="sig-param">sigma: float = 0.2</em>, <em class="sig-param">dt: float = 0.01</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/ou_process.html#OUProcess"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.ou_process.OUProcess" title="Permalink to this definition">¶</a></dt>
<dd><p>OUProcess exploration policy is intended for continuous action spaces, and selects the action according to
an Ornstein-Uhlenbeck process. The Ornstein-Uhlenbeck process implements the action as a Gaussian process, where
the samples are correlated between consecutive time steps.</p>
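A standalone sketch of the process, using the default mu, theta, sigma and dt values from the signature above; how the resulting noise is combined with the agent's action is not shown on this page.

# Sketch of an Ornstein-Uhlenbeck noise process: the state drifts back towards
# mu while being perturbed, so consecutive samples are correlated.
import numpy as np

class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=0.01):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.ones(size) * mu

    def reset(self):
        # Restart the process from the mean at the beginning of an episode.
        self.state = np.ones_like(self.state) * self.mu

    def sample(self):
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state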
@@ -495,7 +500,7 @@ the samples are correlated between consequent time steps.</p>
<h2>ParameterNoise<a class="headerlink" href="#parameternoise" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.parameter_noise.ParameterNoise">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.parameter_noise.</code><code class="descname">ParameterNoise</code><span class="sig-paren">(</span><em>network_params: Dict[str, rl_coach.base_parameters.NetworkParameters], action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/parameter_noise.html#ParameterNoise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.parameter_noise.ParameterNoise" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.parameter_noise.</code><code class="sig-name descname">ParameterNoise</code><span class="sig-paren">(</span><em class="sig-param">network_params: Dict[str, rl_coach.base_parameters.NetworkParameters], action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/parameter_noise.html#ParameterNoise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.parameter_noise.ParameterNoise" title="Permalink to this definition">¶</a></dt>
<dd><p>The ParameterNoise exploration policy is intended for both discrete and continuous action spaces.
It applies the exploration policy by replacing all the dense network layers with noisy layers.
The noisy layers have both weight means and weight standard deviations, and for each forward pass of the network
@@ -514,7 +519,7 @@ values.</p>
<h2>TruncatedNormal<a class="headerlink" href="#truncatednormal" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.truncated_normal.TruncatedNormal">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.truncated_normal.</code><code class="descname">TruncatedNormal</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>noise_percentage_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_noise_percentage: float</em>, <em>clip_low: float</em>, <em>clip_high: float</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/truncated_normal.html#TruncatedNormal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.truncated_normal.TruncatedNormal" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.truncated_normal.</code><code class="sig-name descname">TruncatedNormal</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">noise_schedule: rl_coach.schedules.Schedule</em>, <em class="sig-param">evaluation_noise: float</em>, <em class="sig-param">clip_low: float</em>, <em class="sig-param">clip_high: float</em>, <em class="sig-param">noise_as_percentage_from_action_space: bool = True</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/truncated_normal.html#TruncatedNormal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.truncated_normal.TruncatedNormal" title="Permalink to this definition">¶</a></dt>
<dd><p>The TruncatedNormal exploration policy is intended for continuous action spaces. It samples the action from a
normal distribution, where the mean action is given by the agent, and the standard deviation can be given in
two different ways:
@@ -527,9 +532,10 @@ is within the bounds.</p>
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>noise_percentage_schedule</strong> – the schedule for the noise variance percentage relative to the absolute range
of the action space</p></li>
<li><p><strong>evaluation_noise_percentage</strong> – the noise variance percentage that will be used during evaluation phases</p></li>
<li><p><strong>noise_schedule</strong> – the schedule for the noise variance</p></li>
<li><p><strong>evaluation_noise</strong> – the noise variance that will be used during evaluation phases</p></li>
<li><p><strong>noise_as_percentage_from_action_space</strong> – whether to consider the noise as a percentage of the action space
or absolute value</p></li>
</ul>
</dd>
</dl>
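A rough sketch of the sampling rule, assuming the standard deviation has already been resolved from the noise schedule (and scaled by the action-space range when noise_as_percentage_from_action_space is set); the rejection loop is one possible way to keep the sample within the bounds, not necessarily the library's.

# Sketch of truncated-normal sampling: draw around the agent's action and keep
# re-drawing until the sample falls inside [clip_low, clip_high].
import numpy as np

def truncated_normal_action(mean_action, std, clip_low, clip_high, max_tries=100):
    for _ in range(max_tries):
        sample = np.random.normal(mean_action, std)
        if np.all(sample >= clip_low) and np.all(sample <= clip_high):
            return sample
    # Fall back to clipping if rejection sampling fails to land in bounds.
    return np.clip(np.random.normal(mean_action, std), clip_low, clip_high)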
@@ -540,7 +546,7 @@ of the action space</p></li>
<h2>UCB<a class="headerlink" href="#ucb" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.ucb.UCB">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.ucb.</code><code class="descname">UCB</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>epsilon_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_epsilon: float</em>, <em>architecture_num_q_heads: int</em>, <em>lamb: int</em>, <em>continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object></em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/ucb.html#UCB"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.ucb.UCB" title="Permalink to this definition">¶</a></dt>
<em class="property">class </em><code class="sig-prename descclassname">rl_coach.exploration_policies.ucb.</code><code class="sig-name descname">UCB</code><span class="sig-paren">(</span><em class="sig-param">action_space: rl_coach.spaces.ActionSpace</em>, <em class="sig-param">epsilon_schedule: rl_coach.schedules.Schedule</em>, <em class="sig-param">evaluation_epsilon: float</em>, <em class="sig-param">architecture_num_q_heads: int</em>, <em class="sig-param">lamb: int</em>, <em class="sig-param">continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object></em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/ucb.html#UCB"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.ucb.UCB" title="Permalink to this definition">¶</a></dt>
<dd><p>UCB exploration policy follows the upper confidence bound heuristic to sample actions in discrete action spaces.
It assumes that there are multiple network heads that are predicting action values, and that the standard deviation
between the heads’ predictions represents the uncertainty of the agent in each of the actions.
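The exact combination rule is not spelled out on this page; the sketch below uses the common mean-plus-scaled-standard-deviation form, with lamb taken from the signature above as the exploration coefficient (an assumption about how it is applied).

# Sketch of a UCB rule over an ensemble of Q heads: treat the spread between
# heads as uncertainty and prefer actions with the highest upper bound.
import numpy as np

def ucb_action(per_head_q_values, lamb):
    # per_head_q_values: array of shape (num_heads, num_actions)
    q = np.asarray(per_head_q_values)
    mean_q = q.mean(axis=0)
    std_q = q.std(axis=0)
    return int(np.argmax(mean_q + lamb * std_q))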