<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Exploration Policies — Reinforcement Learning Coach 0.12.1 documentation</title>
<script type="text/javascript" src="../../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../../" src="../../_static/documentation_options.js"></script>
<script type="text/javascript" src="../../_static/jquery.js"></script>
<script type="text/javascript" src="../../_static/underscore.js"></script>
<script type="text/javascript" src="../../_static/doctools.js"></script>
<script type="text/javascript" src="../../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../../_static/js/theme.js"></script>
<link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../../_static/css/custom.css" type="text/css" />
<link rel="index" title="Index" href="../../genindex.html" />
<link rel="search" title="Search" href="../../search.html" />
<link rel="next" title="Filters" href="../filters/index.html" />
<link rel="prev" title="Environments" href="../environments/index.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../../index.html" class="icon icon-home"> Reinforcement Learning Coach
<img src="../../_static/dark_logo.png" class="logo" alt="Logo"/>
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<p class="caption"><span class="caption-text">Intro</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../usage.html">Usage</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../dist_usage.html">Usage - Distributed Coach</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../features/index.html">Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../selecting_an_algorithm.html">Selecting an Algorithm</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../dashboard.html">Coach Dashboard</a></li>
</ul>
<p class="caption"><span class="caption-text">Design</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../design/control_flow.html">Control Flow</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../design/network.html">Network Design</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../design/horizontal_scaling.html">Distributed Coach - Horizontal Scale-Out</a></li>
</ul>
<p class="caption"><span class="caption-text">Contributing</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../contributing/add_agent.html">Adding a New Agent</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../contributing/add_env.html">Adding a New Environment</a></li>
</ul>
<p class="caption"><span class="caption-text">Components</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../agents/index.html">Agents</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architectures/index.html">Architectures</a></li>
<li class="toctree-l1"><a class="reference internal" href="../data_stores/index.html">Data Stores</a></li>
<li class="toctree-l1"><a class="reference internal" href="../environments/index.html">Environments</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Exploration Policies</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#explorationpolicy">ExplorationPolicy</a></li>
<li class="toctree-l2"><a class="reference internal" href="#additivenoise">AdditiveNoise</a></li>
<li class="toctree-l2"><a class="reference internal" href="#boltzmann">Boltzmann</a></li>
<li class="toctree-l2"><a class="reference internal" href="#bootstrapped">Bootstrapped</a></li>
<li class="toctree-l2"><a class="reference internal" href="#categorical">Categorical</a></li>
<li class="toctree-l2"><a class="reference internal" href="#continuousentropy">ContinuousEntropy</a></li>
<li class="toctree-l2"><a class="reference internal" href="#egreedy">EGreedy</a></li>
<li class="toctree-l2"><a class="reference internal" href="#greedy">Greedy</a></li>
<li class="toctree-l2"><a class="reference internal" href="#ouprocess">OUProcess</a></li>
<li class="toctree-l2"><a class="reference internal" href="#parameternoise">ParameterNoise</a></li>
<li class="toctree-l2"><a class="reference internal" href="#truncatednormal">TruncatedNormal</a></li>
<li class="toctree-l2"><a class="reference internal" href="#ucb">UCB</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../filters/index.html">Filters</a></li>
<li class="toctree-l1"><a class="reference internal" href="../memories/index.html">Memories</a></li>
<li class="toctree-l1"><a class="reference internal" href="../memory_backends/index.html">Memory Backends</a></li>
<li class="toctree-l1"><a class="reference internal" href="../orchestrators/index.html">Orchestrators</a></li>
<li class="toctree-l1"><a class="reference internal" href="../core_types.html">Core Types</a></li>
<li class="toctree-l1"><a class="reference internal" href="../spaces.html">Spaces</a></li>
<li class="toctree-l1"><a class="reference internal" href="../additional_parameters.html">Additional Parameters</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../../index.html">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../../index.html">Docs</a> »</li>
<li>Exploration Policies</li>
<li class="wy-breadcrumbs-aside">
<a href="../../_sources/components/exploration_policies/index.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="exploration-policies">
<h1>Exploration Policies<a class="headerlink" href="#exploration-policies" title="Permalink to this headline">¶</a></h1>
<p>Exploration policies allow the agent to trade off exploration and exploitation according to a
predefined policy. This is one of the most important aspects of reinforcement learning agents, and it can require some
tuning to get right. Coach supports several predefined exploration policies, and can easily be extended with
custom ones. Note that not all exploration policies are expected to work for both discrete and continuous action
spaces.</p>
<table class="docutils align-center">
<colgroup>
<col style="width: 35%" />
<col style="width: 37%" />
<col style="width: 29%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Exploration Policy</p></th>
<th class="head"><p>Discrete Action Space</p></th>
<th class="head"><p>Box Action Space</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>AdditiveNoise</p></td>
<td><p><span class="red">X</span></p></td>
<td><p><span class="green">V</span></p></td>
</tr>
<tr class="row-odd"><td><p>Boltzmann</p></td>
<td><p><span class="green">V</span></p></td>
<td><p><span class="red">X</span></p></td>
</tr>
<tr class="row-even"><td><p>Bootstrapped</p></td>
<td><p><span class="green">V</span></p></td>
<td><p><span class="red">X</span></p></td>
</tr>
<tr class="row-odd"><td><p>Categorical</p></td>
<td><p><span class="green">V</span></p></td>
<td><p><span class="red">X</span></p></td>
</tr>
<tr class="row-even"><td><p>ContinuousEntropy</p></td>
<td><p><span class="red">X</span></p></td>
<td><p><span class="green">V</span></p></td>
</tr>
<tr class="row-odd"><td><p>EGreedy</p></td>
<td><p><span class="green">V</span></p></td>
<td><p><span class="green">V</span></p></td>
</tr>
<tr class="row-even"><td><p>Greedy</p></td>
<td><p><span class="green">V</span></p></td>
<td><p><span class="green">V</span></p></td>
</tr>
<tr class="row-odd"><td><p>OUProcess</p></td>
<td><p><span class="red">X</span></p></td>
<td><p><span class="green">V</span></p></td>
</tr>
<tr class="row-even"><td><p>ParameterNoise</p></td>
<td><p><span class="green">V</span></p></td>
<td><p><span class="green">V</span></p></td>
</tr>
<tr class="row-odd"><td><p>TruncatedNormal</p></td>
<td><p><span class="red">X</span></p></td>
<td><p><span class="green">V</span></p></td>
</tr>
<tr class="row-even"><td><p>UCB</p></td>
<td><p><span class="green">V</span></p></td>
<td><p><span class="red">X</span></p></td>
</tr>
</tbody>
</table>
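<p>For example, an exploration policy is typically chosen in a preset by assigning its parameters class to the
agent. The following is a minimal sketch, assuming the standard Coach preset flow with a DQN agent and e-greedy
exploration; the schedule values are illustrative only:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# A minimal sketch, assuming the usual Coach preset flow; the schedule
# values below are illustrative, not recommendations.
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.exploration_policies.e_greedy import EGreedyParameters
from rl_coach.schedules import LinearSchedule

agent_params = DQNAgentParameters()
agent_params.exploration = EGreedyParameters()
# anneal epsilon from 1.0 down to 0.01 over 10000 steps
agent_params.exploration.epsilon_schedule = LinearSchedule(1.0, 0.01, 10000)
agent_params.exploration.evaluation_epsilon = 0.001
</pre></div></div>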
<div class="section" id="explorationpolicy">
<h2>ExplorationPolicy<a class="headerlink" href="#explorationpolicy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.exploration_policy.</code><code class="descname">ExplorationPolicy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy" title="Permalink to this definition">¶</a></dt>
<dd><p>An exploration policy takes the predicted actions or action values from the agent, and selects the action to
actually apply to the environment using some predefined algorithm.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>action_space</strong> – the action space used by the environment</p>
</dd>
</dl>
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.change_phase">
<code class="descname">change_phase</code><span class="sig-paren">(</span><em>phase</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.change_phase"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.change_phase" title="Permalink to this definition">¶</a></dt>
<dd><p>Change between running phases of the algorithm.
:param phase: either Heatup or Train
:return: None</p>
</dd></dl>
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.get_action">
<code class="descname">get_action</code><span class="sig-paren">(</span><em>action_values: List[Union[int, float, numpy.ndarray, List]]</em><span class="sig-paren">)</span> → Union[int, float, numpy.ndarray, List]<a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.get_action"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.get_action" title="Permalink to this definition">¶</a></dt>
<dd><p>Given a list of values corresponding to each action,
choose one action according to the exploration policy.
:param action_values: a list of action values
:return: the chosen action</p>
</dd></dl>
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.requires_action_values">
<code class="descname">requires_action_values</code><span class="sig-paren">(</span><span class="sig-paren">)</span> → bool<a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.requires_action_values"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.requires_action_values" title="Permalink to this definition">¶</a></dt>
<dd><p>Allows exploration policies to define whether they require the action values for the current step.
This can save a lot of computation. For example, in e-greedy, if the random value generated is smaller
than epsilon, the action is completely random, and the action values don’t need to be calculated.
:return: True if the action values are required, False otherwise</p>
</dd></dl>
<dl class="method">
<dt id="rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.reset">
<code class="descname">reset</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/exploration_policy.html#ExplorationPolicy.reset"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.reset" title="Permalink to this definition">¶</a></dt>
<dd><p>Used for resetting the exploration policy parameters when needed.
:return: None</p>
</dd></dl>
</dd></dl>
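<p>Custom exploration policies can be added by subclassing <code class="docutils literal notranslate"><span class="pre">ExplorationPolicy</span></code>
and overriding the methods above. The following is a minimal sketch; the policy below is a hypothetical example,
not a Coach built-in:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# A minimal sketch of a custom policy; RandomTieBreakingGreedy is a
# hypothetical example, not part of Coach.
import numpy as np

from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy


class RandomTieBreakingGreedy(ExplorationPolicy):
    """Select the highest-valued action, breaking ties at random."""

    def get_action(self, action_values):
        values = np.asarray(action_values).squeeze()
        best = np.flatnonzero(values == values.max())
        return int(np.random.choice(best))

    def requires_action_values(self):
        # the values are always needed here, since we pick the maximum
        return True
</pre></div></div>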
</div>
<div class="section" id="additivenoise">
<h2>AdditiveNoise<a class="headerlink" href="#additivenoise" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.additive_noise.AdditiveNoise">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.additive_noise.</code><code class="descname">AdditiveNoise</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>noise_percentage_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_noise_percentage: float</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/additive_noise.html#AdditiveNoise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.additive_noise.AdditiveNoise" title="Permalink to this definition">¶</a></dt>
<dd><p>AdditiveNoise is an exploration policy intended for continuous action spaces. It takes the action from the agent
and adds Gaussian-distributed noise to it. The amount of noise added can be given in two different ways:
1. Specified by the user as a noise schedule, expressed as a percentage of the action space range
2. Specified by the agent’s action. In case the agent’s action is a list with 2 values, the 1st is assumed to
be the mean of the action, and the 2nd is assumed to be its standard deviation.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>noise_percentage_schedule</strong> – the schedule for the noise variance percentage relative to the absolute range
of the action space</p></li>
<li><p><strong>evaluation_noise_percentage</strong> – the noise variance percentage that will be used during evaluation phases</p></li>
</ul>
</dd>
</dl>
</dd></dl>
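<p>The first variant can be sketched in plain numpy as follows. This illustrates the idea only, and is not
Coach’s exact implementation (clipping the result to the action bounds is an assumption made here):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustration only, not Coach's exact code; clipping to the bounds is an
# assumption made for the sketch.
import numpy as np

low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # action bounds
action = np.array([0.2, -0.5])                             # action from the agent
noise_percentage = 0.05                                    # current schedule value
stddev = noise_percentage * (high - low)
noisy_action = np.clip(action + np.random.normal(0.0, stddev), low, high)
</pre></div></div>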
</div>
<div class="section" id="boltzmann">
<h2>Boltzmann<a class="headerlink" href="#boltzmann" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.boltzmann.Boltzmann">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.boltzmann.</code><code class="descname">Boltzmann</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>temperature_schedule: rl_coach.schedules.Schedule</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/boltzmann.html#Boltzmann"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.boltzmann.Boltzmann" title="Permalink to this definition">¶</a></dt>
<dd><p>The Boltzmann exploration policy is intended for discrete action spaces. It assumes that each of the possible
actions has some value assigned to it (such as the Q value), and uses a softmax function to convert these values
into a distribution over the actions. It then samples the action to play from the calculated distribution.
A temperature schedule given by the user controls the steepness of the softmax function: high temperatures
produce a near-uniform distribution, while low temperatures approach greedy selection.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>temperature_schedule</strong> – the schedule for the temperature parameter of the softmax</p></li>
</ul>
</dd>
</dl>
</dd></dl>
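<p>The temperature-controlled softmax can be sketched in a few lines of numpy (an illustration of the idea,
not Coach’s exact implementation):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustration only: Boltzmann sampling from Q-values with a temperature.
import numpy as np

def boltzmann_sample(action_values, temperature):
    logits = np.asarray(action_values, dtype=float) / temperature
    logits -= logits.max()                        # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))

boltzmann_sample([1.0, 2.0, 0.5], temperature=0.5)   # mostly picks action 1
</pre></div></div>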
</div>
<div class="section" id="bootstrapped">
<h2>Bootstrapped<a class="headerlink" href="#bootstrapped" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.bootstrapped.Bootstrapped">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.bootstrapped.</code><code class="descname">Bootstrapped</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>epsilon_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_epsilon: float</em>, <em>architecture_num_q_heads: int</em>, <em>continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = &lt;rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object&gt;</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/bootstrapped.html#Bootstrapped"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.bootstrapped.Bootstrapped" title="Permalink to this definition">¶</a></dt>
<dd><p>The Bootstrapped exploration policy is currently only used for discrete action spaces along with the
Bootstrapped DQN agent. It assumes that there is an ensemble of network heads, where each one predicts the
values for all the possible actions. For each episode, a single head is selected to lead the agent, according
to its value predictions. In evaluation, the action is selected using a majority vote over all the heads’
predictions.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This exploration policy will only work for Discrete action spaces with Bootstrapped DQN style agents,
since it requires the agent to have a network with multiple heads.</p>
</div>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>epsilon_schedule</strong> – a schedule for the epsilon values</p></li>
<li><p><strong>evaluation_epsilon</strong> – the epsilon value to use for evaluation phases</p></li>
<li><p><strong>continuous_exploration_policy_parameters</strong> – the parameters of the continuous exploration policy to use
if the e-greedy is used for a continuous policy</p></li>
<li><p><strong>architecture_num_q_heads</strong> – the number of q heads to select from</p></li>
</ul>
</dd>
</dl>
</dd></dl>
</div>
<div class="section" id="categorical">
<h2>Categorical<a class="headerlink" href="#categorical" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.categorical.Categorical">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.categorical.</code><code class="descname">Categorical</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/categorical.html#Categorical"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.categorical.Categorical" title="Permalink to this definition">¶</a></dt>
<dd><p>The Categorical exploration policy is intended for discrete action spaces. It expects the action values to
represent a probability distribution over the actions, from which a single action will be sampled.
In evaluation, the action that has the highest probability will be selected. This is particularly useful for
actor-critic schemes, where the actor’s output is a probability distribution over the actions.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>action_space</strong> – the action space used by the environment</p>
</dd>
</dl>
</dd></dl>
</div>
<div class="section" id="continuousentropy">
<h2>ContinuousEntropy<a class="headerlink" href="#continuousentropy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.continuous_entropy.</code><code class="descname">ContinuousEntropy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>noise_percentage_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_noise_percentage: float</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/continuous_entropy.html#ContinuousEntropy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy" title="Permalink to this definition">¶</a></dt>
<dd><p>Continuous entropy is an exploration policy that is actually implemented as part of the network.
The exploration policy class is only a placeholder for choosing this policy. The exploration itself is
implemented by adding a regularization factor to the network loss, which regularizes the entropy of the action.
This exploration policy is only intended for continuous action spaces, and assumes that the entire calculation
is implemented as part of the head.</p>
<div class="admonition warning">
<p class="admonition-title">Warning</p>
<p>This exploration policy expects the agent or the network to implement the exploration functionality.
Only a few heads are relevant here and actually implement the entropy regularization factor.</p>
</div>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>noise_percentage_schedule</strong> – the schedule for the noise variance percentage relative to the absolute range
of the action space</p></li>
<li><p><strong>evaluation_noise_percentage</strong> – the noise variance percentage that will be used during evaluation phases</p></li>
</ul>
</dd>
</dl>
</dd></dl>
</div>
<div class="section" id="egreedy">
<h2>EGreedy<a class="headerlink" href="#egreedy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.e_greedy.EGreedy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.e_greedy.</code><code class="descname">EGreedy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>epsilon_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_epsilon: float</em>, <em>continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = &lt;rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object&gt;</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/e_greedy.html#EGreedy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.e_greedy.EGreedy" title="Permalink to this definition">¶</a></dt>
<dd><p>e-greedy is an exploration policy that is intended for both discrete and continuous action spaces.</p>
<p>For discrete action spaces, it assumes that each action is assigned a value, and it selects the action with the
highest value with probability 1 - epsilon. Otherwise, it selects an action sampled uniformly from all the
possible actions. The epsilon value is given by the user and can follow a schedule.
In evaluation, a different epsilon value can be specified.</p>
<p>For continuous action spaces, it assumes that the mean action is given by the agent. With probability epsilon,
it samples a random action from within the action space bounds. Otherwise, it selects the action according to a
given continuous exploration policy, which is set to AdditiveNoise by default. In evaluation, the action is
always selected according to the given continuous exploration policy (with its phase set to evaluation as well).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>epsilon_schedule</strong> – a schedule for the epsilon values</p></li>
<li><p><strong>evaluation_epsilon</strong> – the epsilon value to use for evaluation phases</p></li>
<li><p><strong>continuous_exploration_policy_parameters</strong> – the parameters of the continuous exploration policy to use
if the e-greedy is used for a continuous policy</p></li>
</ul>
</dd>
</dl>
</dd></dl>
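<p>The discrete-action decision rule can be sketched as follows (an illustration of the idea, not Coach’s
exact implementation):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustration only: the discrete e-greedy rule described above.
import numpy as np

def e_greedy_action(action_values, epsilon):
    if np.random.rand() &lt; epsilon:
        return int(np.random.randint(len(action_values)))   # explore
    return int(np.argmax(action_values))                    # exploit
</pre></div></div>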
</div>
<div class="section" id="greedy">
<h2>Greedy<a class="headerlink" href="#greedy" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.greedy.Greedy">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.greedy.</code><code class="descname">Greedy</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/greedy.html#Greedy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.greedy.Greedy" title="Permalink to this definition">¶</a></dt>
<dd><p>The Greedy exploration policy is intended for both discrete and continuous action spaces.
For discrete action spaces, it always selects the action with the maximum value, as given by the agent.
For continuous action spaces, it always returns the exact action, as it was given by the agent.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>action_space</strong> – the action space used by the environment</p>
</dd>
</dl>
</dd></dl>
</div>
<div class="section" id="ouprocess">
<h2>OUProcess<a class="headerlink" href="#ouprocess" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.ou_process.OUProcess">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.ou_process.</code><code class="descname">OUProcess</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>mu: float = 0</em>, <em>theta: float = 0.15</em>, <em>sigma: float = 0.2</em>, <em>dt: float = 0.01</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/ou_process.html#OUProcess"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.ou_process.OUProcess" title="Permalink to this definition">¶</a></dt>
<dd><p>The OUProcess exploration policy is intended for continuous action spaces, and selects the action according to
an Ornstein-Uhlenbeck process. The Ornstein-Uhlenbeck process implements the action as a Gaussian process, where
the samples are correlated between consecutive time steps.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>action_space</strong> – the action space used by the environment</p>
</dd>
</dl>
</dd></dl>
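<p>A common discretization of the process is sketched below, using the default parameters from the signature
above; whether Coach uses this exact update form is an assumption here:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustration only: one way to step an Ornstein-Uhlenbeck process. The state
# drifts back toward mu while accumulating Gaussian noise, so consecutive
# samples are correlated.
import numpy as np

mu, theta, sigma, dt = 0.0, 0.15, 0.2, 0.01      # the defaults above
x = np.zeros(2)                                  # one entry per action dimension
for _ in range(1000):
    x = x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * np.random.randn(2)
    # x is the noise added to (or used as) the action at this time step
</pre></div></div>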
</div>
<div class="section" id="parameternoise">
<h2>ParameterNoise<a class="headerlink" href="#parameternoise" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.parameter_noise.ParameterNoise">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.parameter_noise.</code><code class="descname">ParameterNoise</code><span class="sig-paren">(</span><em>network_params: Dict[str, rl_coach.base_parameters.NetworkParameters], action_space: rl_coach.spaces.ActionSpace</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/parameter_noise.html#ParameterNoise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.parameter_noise.ParameterNoise" title="Permalink to this definition">¶</a></dt>
<dd><p>The ParameterNoise exploration policy is intended for both discrete and continuous action spaces.
It applies the exploration policy by replacing all the dense network layers with noisy layers.
The noisy layers have both weight means and weight standard deviations, and for each forward pass of the network
the weights are sampled from a normal distribution that follows the learned weight mean and standard deviation
values.</p>
<div class="admonition warning">
<p class="admonition-title">Warning</p>
<p>Currently supported only by DQN variants.</p>
</div>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>action_space</strong> – the action space used by the environment</p>
</dd>
</dl>
</dd></dl>
</div>
<div class="section" id="truncatednormal">
<h2>TruncatedNormal<a class="headerlink" href="#truncatednormal" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.truncated_normal.TruncatedNormal">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.truncated_normal.</code><code class="descname">TruncatedNormal</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>noise_percentage_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_noise_percentage: float</em>, <em>clip_low: float</em>, <em>clip_high: float</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/truncated_normal.html#TruncatedNormal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.truncated_normal.TruncatedNormal" title="Permalink to this definition">¶</a></dt>
<dd><p>The TruncatedNormal exploration policy is intended for continuous action spaces. It samples the action from a
normal distribution, where the mean action is given by the agent, and the standard deviation can be given in
two different ways:
1. Specified by the user as a noise schedule, expressed as a percentage of the action space range
2. Specified by the agent’s action. In case the agent’s action is a list with 2 values, the 1st is assumed to
be the mean of the action, and the 2nd is assumed to be its standard deviation.
When the sampled action is outside of the action bounds given by the user, it is resampled until it falls
within the bounds.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>noise_percentage_schedule</strong> – the schedule for the noise variance percentage relative to the absolute range
of the action space</p></li>
<li><p><strong>evaluation_noise_percentage</strong> – the noise variance percentage that will be used during evaluation phases</p></li>
</ul>
</dd>
</dl>
</dd></dl>
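<p>The resampling behavior can be sketched as a simple rejection loop (an illustration of the idea, not Coach’s
exact implementation):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustration only: rejection-sample a normal until the action falls
# inside the given bounds.
import numpy as np

def truncated_normal_action(mean, stddev, low, high):
    while True:
        action = np.random.normal(mean, stddev)
        if np.all((action &gt;= low) &amp; (action &lt;= high)):
            return action
</pre></div></div>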
</div>
<div class="section" id="ucb">
<h2>UCB<a class="headerlink" href="#ucb" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="rl_coach.exploration_policies.ucb.UCB">
<em class="property">class </em><code class="descclassname">rl_coach.exploration_policies.ucb.</code><code class="descname">UCB</code><span class="sig-paren">(</span><em>action_space: rl_coach.spaces.ActionSpace</em>, <em>epsilon_schedule: rl_coach.schedules.Schedule</em>, <em>evaluation_epsilon: float</em>, <em>architecture_num_q_heads: int</em>, <em>lamb: int</em>, <em>continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = &lt;rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object&gt;</em><span class="sig-paren">)</span><a class="reference internal" href="../../_modules/rl_coach/exploration_policies/ucb.html#UCB"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#rl_coach.exploration_policies.ucb.UCB" title="Permalink to this definition">¶</a></dt>
<dd><p>The UCB exploration policy follows the upper confidence bound heuristic to sample actions in discrete action spaces.
It assumes that there are multiple network heads predicting action values, and that the standard deviation
across the heads’ predictions represents the uncertainty of the agent in each of the actions.
It then sets the action value estimates to mean(actions) + lambda * stdev(actions), where lambda is
given by the user. This exploration policy aims to take advantage of the uncertainty of the agent in its predictions,
and selects the action according to the tradeoff between how uncertain the agent is, and how large it predicts
the outcome of those actions to be.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>action_space</strong> – the action space used by the environment</p></li>
<li><p><strong>epsilon_schedule</strong> – a schedule for the epsilon values</p></li>
<li><p><strong>evaluation_epsilon</strong> – the epsilon value to use for evaluation phases</p></li>
<li><p><strong>architecture_num_q_heads</strong> – the number of q heads to select from</p></li>
<li><p><strong>lamb</strong> – lambda coefficient for taking the standard deviation into account</p></li>
<li><p><strong>continuous_exploration_policy_parameters</strong> – the parameters of the continuous exploration policy to use
if the e-greedy is used for a continuous policy</p></li>
</ul>
</dd>
</dl>
</dd></dl>
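<p>The scoring rule can be sketched in numpy as follows (an illustration of the idea, not Coach’s exact
implementation):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustration only: UCB scoring over per-head Q-value predictions.
import numpy as np

head_q_values = np.array([[1.0, 0.2],    # head 1: values for 2 actions
                          [0.8, 0.6],    # head 2
                          [1.2, 0.1]])   # head 3
lamb = 0.1
ucb_scores = head_q_values.mean(axis=0) + lamb * head_q_values.std(axis=0)
action = int(np.argmax(ucb_scores))      # action 0 here
</pre></div></div>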
</div>
</div>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../filters/index.html" class="btn btn-neutral float-right" title="Filters" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="../environments/index.html" class="btn btn-neutral float-left" title="Environments" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<p>
© Copyright 2018-2019, Intel AI Lab
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/rtfd/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>