mirror of https://github.com/gryf/coach.git synced 2025-12-17 19:20:19 +01:00

update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
This commit is contained in:
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions

View File

@@ -0,0 +1,325 @@
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Control Flow &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/custom.css" type="text/css" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Network Design" href="network.html" />
<link rel="prev" title="Coach Dashboard" href="../dashboard.html" />
<link href="../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href="../index.html" class="icon icon-home"> Reinforcement Learning Coach
<img src="../_static/dark_logo.png" class="logo" alt="Logo"/>
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<p class="caption"><span class="caption-text">Intro</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../usage.html">Usage</a></li>
<li class="toctree-l1"><a class="reference internal" href="../features/index.html">Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="../selecting_an_algorithm.html">Selecting an Algorithm</a></li>
<li class="toctree-l1"><a class="reference internal" href="../dashboard.html">Coach Dashboard</a></li>
</ul>
<p class="caption"><span class="caption-text">Design</span></p>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Control Flow</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#graph-manager">Graph Manager</a></li>
<li class="toctree-l2"><a class="reference internal" href="#level-manager">Level Manager</a></li>
<li class="toctree-l2"><a class="reference internal" href="#agent">Agent</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="network.html">Network Design</a></li>
</ul>
<p class="caption"><span class="caption-text">Contributing</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/add_agent.html">Adding a New Agent</a></li>
<li class="toctree-l1"><a class="reference internal" href="../contributing/add_env.html">Adding a New Environment</a></li>
</ul>
<p class="caption"><span class="caption-text">Components</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../components/agents/index.html">Agents</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/architectures/index.html">Architectures</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/environments/index.html">Environments</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/exploration_policies/index.html">Exploration Policies</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/filters/index.html">Filters</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/memories/index.html">Memories</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/core_types.html">Core Types</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/spaces.html">Spaces</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/additional_parameters.html">Additional Parameters</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li>Control Flow</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/design/control_flow.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="control-flow">
<h1>Control Flow<a class="headerlink" href="#control-flow" title="Permalink to this headline"></a></h1>
<p>Coach is built in a modular way, encouraging module reuse and reducing the amount of boilerplate code needed
for developing new algorithms or integrating a new challenge as an environment.
On the other hand, it can be overwhelming for new users to ramp up on the code.
To help with that, here's a short overview of the control flow.</p>
<div class="section" id="graph-manager">
<h2>Graph Manager<a class="headerlink" href="#graph-manager" title="Permalink to this headline"></a></h2>
<p>The main entry point for Coach is <code class="code docutils literal notranslate"><span class="pre">coach.py</span></code>.
The main functionality of this script is to parse the command line arguments and invoke all the sub-processes needed
for the given experiment.
<code class="code docutils literal notranslate"><span class="pre">coach.py</span></code> executes the given <strong>preset</strong> file which returns a <code class="code docutils literal notranslate"><span class="pre">GraphManager</span></code> object.</p>
<p>A <strong>preset</strong> is a design pattern that is intended for concentrating the entire definition of an experiment in a single
file. This helps with experiment reproducibility, improves readability, and prevents confusion.
The outcome of a preset is a <code class="code docutils literal notranslate"><span class="pre">GraphManager</span></code> which will usually be instantiated in the final lines of the preset.</p>
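<p>As a rough sketch (the module paths and parameter classes below approximate the Coach API and may differ slightly
between versions - see the Components section for the exact names and arguments), a minimal preset could look
something like this:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustrative minimal preset: builds a GraphManager for DQN on CartPole.
# Class and module names are approximations of the Coach 0.11 API.
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule

graph_manager = BasicRLGraphManager(
    agent_params=DQNAgentParameters(),
    env_params=GymVectorEnvironment(level='CartPole-v0'),
    schedule_params=SimpleSchedule())
</pre></div></div>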
<p>A <code class="code docutils literal notranslate"><span class="pre">GraphManager</span></code> is an object that holds all the agents and environments of an experiment, and is mostly responsible
for scheduling their work. Why is it called a <strong>graph</strong> manager? Because agents and environments are structured into
a graph of interactions. For example, in hierarchical reinforcement learning schemes, there will often be a master
policy agent, that will control a sub-policy agent, which will interact with the environment. Other schemes can have
much more complex graphs of control, such as several hierarchy layers, each with multiple agents.
The graph manager's main loop is the improve loop.</p>
<a class="reference internal image-reference" href="../_images/improve.png"><img alt="../_images/improve.png" class="align-center" src="../_images/improve.png" style="width: 400px;" /></a>
<p>The improve loop alternates between 3 main phases - heatup, training and evaluation:</p>
<ul class="simple">
<li><strong>Heatup</strong> - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase
takes place only at the beginning of the experiment, and the agents act completely randomly during it.
Importantly, the agents do not train their networks during this phase. DQN, for example, uses 50k random steps
to initialize the replay buffers.</li>
<li><strong>Training</strong> - the training phase is the main phase of the experiment. This phase varies between agent types,
but essentially consists of repeated cycles of acting, collecting data from the environment, and training the agent
networks. During this phase, the agent will use its exploration policy in training mode, which will add noise to its
actions in order to improve its knowledge about the environment state space.</li>
<li><strong>Evaluation</strong> - the evaluation phase is intended for evaluating the current performance of the agent. The agents
will act greedily in order to exploit the knowledge aggregated so far, and the performance over multiple episodes of
evaluation will be averaged in order to reduce the stochastic effects of all the components.</li>
</ul>
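<p>In pseudo-code, and under the assumption that the method names below simply mirror the phase names rather than
the literal Coach implementation, the improve loop can be summarized as:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustrative pseudo-code of the improve loop (not the literal Coach implementation).
def improve(graph_manager, schedule):
    # Heatup: fill the replay buffers with random actions, no training.
    graph_manager.heatup(schedule.heatup_steps)
    while not schedule.finished():
        # Training: interleave acting (with exploration noise) and network updates.
        graph_manager.train_and_act(schedule.steps_between_evaluations)
        # Evaluation: act greedily and average the reward over several episodes.
        graph_manager.evaluate(schedule.evaluation_steps)
</pre></div></div>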
</div>
<div class="section" id="level-manager">
<h2>Level Manager<a class="headerlink" href="#level-manager" title="Permalink to this headline"></a></h2>
<p>In each of the 3 phases described above, the graph manager will invoke all the hierarchy levels in the graph in a
synchronized manner. In Coach, agents do not interact directly with the environment. Instead, they go through a
<em>LevelManager</em>, which is a proxy that manages their interaction. The level manager passes the current state and reward
from the environment to the agent, and the actions from the agent to the environment.</p>
<p>The motivation for having a level manager is to disentangle the code of the environment and the agent, so as to allow more
complex interactions. Each level can have multiple agents which interact with the environment. Who gets to choose the
action for each step is controlled by the level manager.
Additionally, each level manager can act as an environment for the hierarchy level above it, such that each hierarchy
level can be seen as an interaction between an agent and an environment, even if the environment is just more agents in
a lower hierarchy level.</p>
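<p>Conceptually, and only as a sketch (the real class has a richer interface and different signatures), a level
manager mediates a single step like this:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Illustrative pseudo-code of a level manager mediating one step for its agents.
class LevelManager:
    """Sketch only - the real Coach LevelManager has a richer interface."""
    def __init__(self, agents, environment):
        self.agents = agents            # agents living at this hierarchy level
        self.environment = environment  # real environment, or the level manager below

    def step(self, env_response):
        # pass the latest state and reward to every agent at this level
        for agent in self.agents:
            agent.observe(env_response)
        # the level manager decides which agent gets to act this step
        acting_agent = self.agents[0]
        action_info = acting_agent.act()
        # the chosen action is passed down to the environment (or the lower level)
        return self.environment.step(action_info.action)
</pre></div></div>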
</div>
<div class="section" id="agent">
<h2>Agent<a class="headerlink" href="#agent" title="Permalink to this headline"></a></h2>
<p>The base agent class has 3 main functions that are used during these phases - observe, act and train.</p>
<ul class="simple">
<li><strong>Observe</strong> - this function gets the latest response from the environment as input, and updates the internal state
of the agent with the new information. The environment response will
first be passed through the agent's <code class="code docutils literal notranslate"><span class="pre">InputFilter</span></code> object, which will process the values in the response according
to the specific agent definition. The environment response will then be converted into a
<code class="code docutils literal notranslate"><span class="pre">Transition</span></code> which will contain the information from a single step
<span class="math notranslate nohighlight">\((s_{t}, a_{t}, r_{t}, s_{t+1}, \textrm{terminal signal})\)</span>, and store it in the memory.</li>
</ul>
<a class="reference internal image-reference" href="../_images/observe.png"><img alt="../_images/observe.png" class="align-center" src="../_images/observe.png" style="width: 700px;" /></a>
<ul class="simple">
<li><strong>Act</strong> - this function uses the current internal state of the agent in order to select the next action to take on
the environment. This function will call the per-agent custom function <code class="code docutils literal notranslate"><span class="pre">choose_action</span></code> that will use the network
and the exploration policy in order to select an action. The action will be stored, together with any additional
information (like the action value for example) in an <code class="code docutils literal notranslate"><span class="pre">ActionInfo</span></code> object. The ActionInfo object will then be
passed through the agent's <code class="code docutils literal notranslate"><span class="pre">OutputFilter</span></code> to allow any processing of the action (like discretization,
or shifting, for example), before passing it to the environment.</li>
</ul>
<a class="reference internal image-reference" href="../_images/act.png"><img alt="../_images/act.png" class="align-center" src="../_images/act.png" style="width: 700px;" /></a>
<ul class="simple">
<li><strong>Train</strong> - this function will sample a batch from the memory and train on it. The batch of transitions will be
first wrapped into a <code class="code docutils literal notranslate"><span class="pre">Batch</span></code> object to allow efficient querying of the batch values. It will then be passed into
the agent-specific <code class="code docutils literal notranslate"><span class="pre">learn_from_batch</span></code> function, which will extract network target values from the batch and
train the networks accordingly. Lastly, if there's a target network defined for the agent, it will sync the target
network weights with the online network.</li>
</ul>
<a class="reference internal image-reference" href="../_images/train.png"><img alt="../_images/train.png" class="align-center" src="../_images/train.png" style="width: 700px;" /></a>
</div>
</div>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="network.html" class="btn btn-neutral float-right" title="Network Design" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="../dashboard.html" class="btn btn-neutral" title="Coach Dashboard" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/rtfd/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>

View File

@@ -1,367 +0,0 @@
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Control Flow - Reinforcement Learning Coach</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="../../css/highlight.css">
<link href="../../extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Control Flow";
var mkdocs_page_input_path = "design/control_flow.md";
var mkdocs_page_url = "/design/control_flow/";
</script>
<script src="../../js/jquery-2.1.1.min.js"></script>
<script src="../../js/modernizr-2.8.3.min.js"></script>
<script type="text/javascript" src="../../js/highlight.pack.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Reinforcement Learning Coach</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="../..">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../usage/">Usage</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Design</span>
<ul class="subnav">
<li class="">
<a class="" href="../features/">Features</a>
</li>
<li class=" current">
<a class="current" href="./">Control Flow</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#coach-control-flow">Coach Control Flow</a></li>
<ul>
<li><a class="toctree-l4" href="#graph-manager">Graph Manager</a></li>
<li><a class="toctree-l4" href="#level-manager">Level Manager</a></li>
<li><a class="toctree-l4" href="#agent">Agent</a></li>
</ul>
</ul>
</li>
<li class="">
<a class="" href="../network/">Network</a>
</li>
<li class="">
<a class="" href="../filters/">Filters</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Algorithms</span>
<ul class="subnav">
<li class="">
<a class="" href="../../algorithms/value_optimization/dqn/">DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/double_dqn/">Double DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/dueling_dqn/">Dueling DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/categorical_dqn/">Categorical DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/mmc/">Mixed Monte Carlo</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/pal/">Persistent Advantage Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/nec/">Neural Episodic Control</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/bs_dqn/">Bootstrapped DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/n_step/">N-Step Q Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/naf/">Normalized Advantage Functions</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/pg/">Policy Gradient</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ac/">Actor-Critic</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ddpg/">Deep Determinstic Policy Gradients</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ppo/">Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/cppo/">Clipped Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/other/dfp/">Direct Future Prediction</a>
</li>
<li class="">
<a class="" href="../../algorithms/imitation/bc/">Behavioral Cloning</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<a class="" href="../../dashboard/">Coach Dashboard</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Contributing</span>
<ul class="subnav">
<li class="">
<a class="" href="../../contributing/add_agent/">Adding a New Agent</a>
</li>
<li class="">
<a class="" href="../../contributing/add_env/">Adding a New Environment</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>Design &raquo;</li>
<li>Control Flow</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<!-- language-all: python -->
<h1 id="coach-control-flow">Coach Control Flow</h1>
<p>Coach is built in a modular way, encouraging module reuse and reducing the amount of boilerplate code needed
for developing new algorithms or integrating a new challenge as an environment.
On the other hand, it can be overwhelming for new users to ramp up on the code.
To help with that, here's a short overview of the control flow.</p>
<h2 id="graph-manager">Graph Manager</h2>
<p>The main entry point for Coach is <strong>coach.py</strong>.
The main functionality of this script is to parse the command line arguments and invoke all the sub-processes needed
for the given experiment.
<strong>coach.py</strong> executes the given <strong>preset</strong> file which returns a <strong>GraphManager</strong> object.</p>
<p>A <strong>preset</strong> is a design pattern that is intended for concentrating the entire definition of an experiment in a single
file. This helps with experiment reproducibility, improves readability, and prevents confusion.
The outcome of a preset is a <strong>GraphManager</strong> which will usually be instantiated in the final lines of the preset.</p>
<p>A <strong>GraphManager</strong> is an object that holds all the agents and environments of an experiment, and is mostly responsible
for scheduling their work. Why is it called a <strong>graph</strong> manager? Because agents and environments are structured into
a graph of interactions. For example, in hierarchical reinforcement learning schemes, there will often be a master
policy agent, that will control a sub-policy agent, which will interact with the environment. Other schemes can have
much more complex graphs of control, such as several hierarchy layers, each with multiple agents.
The graph manager's main loop is the improve loop.</p>
<p style="text-align: center;">
<img src="../../img/improve.png" alt="Improve loop" style="width: 400px;"/>
</p>
<p>The improve loop alternates between 3 main phases - heatup, training and evaluation:</p>
<ul>
<li>
<p><strong>Heatup</strong> - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase
takes place only at the beginning of the experiment, and the agents act completely randomly during it.
Importantly, the agents do not train their networks during this phase. DQN, for example, uses 50k random steps
to initialize the replay buffers.</p>
</li>
<li>
<p><strong>Training</strong> - the training phase is the main phase of the experiment. This phase can change between agent types,
but essentially consists of repeated cycles of acting, collecting data from the environment, and training the agent
networks. During this phase, the agent will use its exploration policy in training mode, which will add noise to its
actions in order to improve its knowledge about the environment state space.</p>
</li>
<li>
<p><strong>Evaluation</strong> - the evaluation phase is intended for evaluating the current performance of the agent. The agents
will act greedily in order to exploit the knowledge aggregated so far and the performance over multiple episodes of
evaluation will be averaged in order to reduce the stochasticity effects of all the components.</p>
</li>
</ul>
<h2 id="level-manager">Level Manager</h2>
<p>In each of the 3 phases described above, the graph manager will invoke all the hierarchy levels in the graph in a
synchronized manner. In Coach, agents do not interact directly with the environment. Instead, they go through a
<em>LevelManager</em>, which is a proxy that manages their interaction. The level manager passes the current state and reward
from the environment to the agent, and the actions from the agent to the environment.</p>
<p>The motivation for having a level manager is to disentangle the code of the environment and the agent, so as to allow more
complex interactions. Each level can have multiple agents which interact with the environment. Who gets to choose the
action for each step is controlled by the level manager.
Additionally, each level manager can act as an environment for the hierarchy level above it, such that each hierarchy
level can be seen as an interaction between an agent and an environment, even if the environment is just more agents in
a lower hierarchy level.</p>
<h2 id="agent">Agent</h2>
<p>The base agent class has 3 main functions that are used during these phases - observe, act and train.</p>
<ul>
<li><strong>Observe</strong> - this function gets the latest response from the environment as input, and updates the internal state
of the agent with the new information. The environment response will
first be passed through the agent's <strong>InputFilter</strong> object, which will process the values in the response according
to the specific agent definition. The environment response will then be converted into a
<strong>Transition</strong> which will contain the information from a single step
(<script type="math/tex"> s_{t}, a_{t}, r_{t}, s_{t+1}, terminal signal </script>), and store it in the memory.</li>
</ul>
<p><img src="../../img/observe.png" alt="Observe" style="width: 700px;"/></p>
<ul>
<li><strong>Act</strong> - this function uses the current internal state of the agent in order to select the next action to take on
the environment. This function will call the per-agent custom function <strong>choose_action</strong> that will use the network
and the exploration policy in order to select an action. The action will be stored, together with any additional
information (like the action value for example) in an <strong>ActionInfo</strong> object. The ActionInfo object will then be
passed through the agent's <strong>OutputFilter</strong> to allow any processing of the action (like discretization,
or shifting, for example), before passing it to the environment.</li>
</ul>
<p><img src="../../img/act.png" alt="Act" style="width: 700px;"/></p>
<ul>
<li><strong>Train</strong> - this function will sample a batch from the memory and train on it. The batch of transitions will be
first wrapped into a <strong>Batch</strong> object to allow efficient querying of the batch values. It will then be passed into
the agent-specific <strong>learn_from_batch</strong> function, which will extract network target values from the batch and will
train the networks accordingly. Lastly, if there's a target network defined for the agent, it will sync the target
network weights with the online network.</li>
</ul>
<p><img src="../../img/train.png" alt="Train" style="width: 700px;"/></p>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../network/" class="btn btn-neutral float-right" title="Network">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../features/" class="btn btn-neutral" title="Features"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../features/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../network/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js"></script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
<script src="../../search/require.js"></script>
<script src="../../search/search.js"></script>
</body>
</html>

View File

@@ -1,328 +0,0 @@
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Features - Reinforcement Learning Coach</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="../../css/highlight.css">
<link href="../../extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Features";
var mkdocs_page_input_path = "design/features.md";
var mkdocs_page_url = "/design/features/";
</script>
<script src="../../js/jquery-2.1.1.min.js"></script>
<script src="../../js/modernizr-2.8.3.min.js"></script>
<script type="text/javascript" src="../../js/highlight.pack.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Reinforcement Learning Coach</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="../..">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../usage/">Usage</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Design</span>
<ul class="subnav">
<li class=" current">
<a class="current" href="./">Features</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#coach-features">Coach Features</a></li>
<ul>
<li><a class="toctree-l4" href="#supported-algorithms">Supported Algorithms</a></li>
<li><a class="toctree-l4" href="#supported-environments">Supported Environments</a></li>
</ul>
</ul>
</li>
<li class="">
<a class="" href="../control_flow/">Control Flow</a>
</li>
<li class="">
<a class="" href="../network/">Network</a>
</li>
<li class="">
<a class="" href="../filters/">Filters</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Algorithms</span>
<ul class="subnav">
<li class="">
<a class="" href="../../algorithms/value_optimization/dqn/">DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/double_dqn/">Double DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/dueling_dqn/">Dueling DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/categorical_dqn/">Categorical DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/mmc/">Mixed Monte Carlo</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/pal/">Persistent Advantage Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/nec/">Neural Episodic Control</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/bs_dqn/">Bootstrapped DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/n_step/">N-Step Q Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/naf/">Normalized Advantage Functions</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/pg/">Policy Gradient</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ac/">Actor-Critic</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ddpg/">Deep Determinstic Policy Gradients</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ppo/">Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/cppo/">Clipped Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/other/dfp/">Direct Future Prediction</a>
</li>
<li class="">
<a class="" href="../../algorithms/imitation/bc/">Behavioral Cloning</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<a class="" href="../../dashboard/">Coach Dashboard</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Contributing</span>
<ul class="subnav">
<li class="">
<a class="" href="../../contributing/add_agent/">Adding a New Agent</a>
</li>
<li class="">
<a class="" href="../../contributing/add_env/">Adding a New Environment</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>Design &raquo;</li>
<li>Features</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="coach-features">Coach Features</h1>
<h2 id="supported-algorithms">Supported Algorithms</h2>
<p>Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into two main classes -
value optimization and policy optimization. A detailed description of those algorithms may be found in the algorithms
section.</p>
<p style="text-align: center;">
<img src="../../img/algorithms.png" alt="Supported Algorithms" style="width: 600px;"/>
</p>
<h2 id="supported-environments">Supported Environments</h2>
<p>Coach supports a large number of environments which can be solved using reinforcement learning:</p>
<ul>
<li>
<p><strong><a href="https://github.com/deepmind/dm_control">DeepMind Control Suite</a></strong> - a set of reinforcement learning environments
powered by the MuJoCo physics engine.</p>
</li>
<li>
<p><strong><a href="https://github.com/deepmind/pysc2">Blizzard Starcraft II</a></strong> - a popular strategy game which was wrapped with a
Python interface by DeepMind.</p>
</li>
<li>
<p><strong><a href="http://vizdoom.cs.put.edu.pl/">ViZDoom</a></strong> - a Doom-based AI research platform for reinforcement learning
from raw visual information.</p>
</li>
<li>
<p><strong><a href="https://github.com/carla-simulator/carla">CARLA</a></strong> - an open-source simulator for autonomous driving research.</p>
</li>
<li>
<p><strong><a href="https://gym.openai.com/">OpenAI Gym</a></strong> - a library which consists of a set of environments, from games to robotics.
Additionally, it can be extended using the API defined by the authors.</p>
</li>
</ul>
<p>In Coach, we support all the native environments in Gym, along with several extensions such as:</p>
<ul>
<li>
<p><strong><a href="https://github.com/openai/roboschool">Roboschool</a></strong> - a set of environments powered by the PyBullet engine,
that offer a free alternative to MuJoCo.</p>
</li>
<li>
<p><strong><a href="https://github.com/Breakend/gym-extensions">Gym Extensions</a></strong> - a set of environments that extends Gym for
auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.)</p>
</li>
<li>
<p><strong><a href="https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet">PyBullet</a></strong> - a physics engine that
includes a set of robotics environments.</p>
</li>
</ul>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../control_flow/" class="btn btn-neutral float-right" title="Control Flow">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../../usage/" class="btn btn-neutral" title="Usage"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../../usage/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../control_flow/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js"></script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
<script src="../../search/require.js"></script>
<script src="../../search/search.js"></script>
</body>
</html>

View File

@@ -1,416 +0,0 @@
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Filters - Reinforcement Learning Coach</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="../../css/highlight.css">
<link href="../../extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Filters";
var mkdocs_page_input_path = "design/filters.md";
var mkdocs_page_url = "/design/filters/";
</script>
<script src="../../js/jquery-2.1.1.min.js"></script>
<script src="../../js/modernizr-2.8.3.min.js"></script>
<script type="text/javascript" src="../../js/highlight.pack.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Reinforcement Learning Coach</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="../..">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../usage/">Usage</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Design</span>
<ul class="subnav">
<li class="">
<a class="" href="../features/">Features</a>
</li>
<li class="">
<a class="" href="../control_flow/">Control Flow</a>
</li>
<li class="">
<a class="" href="../network/">Network</a>
</li>
<li class=" current">
<a class="current" href="./">Filters</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#filters">Filters</a></li>
<ul>
<li><a class="toctree-l4" href="#input-filters">Input Filters</a></li>
<li><a class="toctree-l4" href="#output-filters">Output Filters</a></li>
</ul>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Algorithms</span>
<ul class="subnav">
<li class="">
<a class="" href="../../algorithms/value_optimization/dqn/">DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/double_dqn/">Double DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/dueling_dqn/">Dueling DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/categorical_dqn/">Categorical DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/mmc/">Mixed Monte Carlo</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/pal/">Persistent Advantage Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/nec/">Neural Episodic Control</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/bs_dqn/">Bootstrapped DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/n_step/">N-Step Q Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/naf/">Normalized Advantage Functions</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/pg/">Policy Gradient</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ac/">Actor-Critic</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ddpg/">Deep Determinstic Policy Gradients</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ppo/">Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/cppo/">Clipped Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/other/dfp/">Direct Future Prediction</a>
</li>
<li class="">
<a class="" href="../../algorithms/imitation/bc/">Behavioral Cloning</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<a class="" href="../../dashboard/">Coach Dashboard</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Contributing</span>
<ul class="subnav">
<li class="">
<a class="" href="../../contributing/add_agent/">Adding a New Agent</a>
</li>
<li class="">
<a class="" href="../../contributing/add_env/">Adding a New Environment</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>Design &raquo;</li>
<li>Filters</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="filters">Filters</h1>
<p>Filters are a mechanism in Coach for pre-processing and post-processing the information flowing into and out of the agent.
There are two filter categories -</p>
<ul>
<li>
<p><strong>Input filters</strong> - these are filters that process the information passed <strong>into</strong> the agent from the environment.
This information includes the observation and the reward. Input filters therefore allow rescaling observations,
normalizing rewards, stacking observations, etc.</p>
</li>
<li>
<p><strong>Output filters</strong> - these are filters that process the information going <strong>out</strong> of the agent into the environment.
This information includes the action the agent chooses to take. Output filters therefore allow conversion of
actions from one space into another. For example, the agent can take <script type="math/tex"> N </script> discrete actions, that will be mapped by
the output filter onto <script type="math/tex"> N </script> continuous actions.</p>
</li>
</ul>
<p>Filters can be stacked on top of each other in order to build complex processing flows of the inputs or outputs.</p>
<p style="text-align: center;">
<img src="../../img/filters.png" alt="Filters mechanism" style="width: 350px;"/>
</p>
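<p>As a hypothetical example (the construction calls below only approximate the Coach filter API and are meant for
illustration; the filter classes themselves are described in the lists that follow), an input filter stack for
image observations could be composed as:</p>
<pre><code class="python"># Hypothetical input filter stack built from the filters described below
# (the add_* method names and constructor arguments are illustrative).
input_filter = InputFilter()
input_filter.add_observation_filter('observation', 'to_grayscale', ObservationRGBToYFilter())
input_filter.add_observation_filter('observation', 'stacking', ObservationStackingFilter(4))   # must be last
input_filter.add_reward_filter('clipping', RewardClippingFilter(-1.0, 1.0))
</code></pre>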
<h2 id="input-filters">Input Filters</h2>
<p>The input filters are separated into two categories - <strong>observation filters</strong> and <strong>reward filters</strong>.</p>
<h3 id="observation-filters">Observation Filters</h3>
<ul>
<li>
<p><strong>ObservationClippingFilter</strong> - Clips the observation values to a given range of values. For example, if the
observation consists of measurements in an arbitrary range, and we want to control the minimum and maximum values
of these observations, we can define a range and clip the values of the measurements.</p>
</li>
<li>
<p><strong>ObservationCropFilter</strong> - Crops the size of the observation to a given crop window. For example, in Atari, the
observations are images with a shape of 210x160. Usually, we will want to crop the size of the observation to a
square of 160x160 before rescaling them.</p>
</li>
<li>
<p><strong>ObservationMoveAxisFilter</strong> - Reorders the axes of the observation. This can be useful when the observation is an
image, and we want to move the channel axis to be the last axis instead of the first axis.</p>
</li>
<li>
<p><strong>ObservationNormalizationFilter</strong> - Normalizes the observation values with a running mean and standard deviation of
all the observations seen so far. The normalization is performed element-wise. Additionally, when working with
multiple workers, the statistics used for the normalization operation are accumulated over all the workers.</p>
</li>
<li>
<p><strong>ObservationReductionBySubPartsNameFilter</strong> - Allows keeping only parts of the observation, by specifying their
name. For example, the CARLA environment extracts multiple measurements that can be used by the agent, such as
speed and location. If we want to only use the speed, it can be done using this filter.</p>
</li>
<li>
<p><strong>ObservationRescaleSizeByFactorFilter</strong> - Rescales an image observation by some factor. For example, the image size
can be reduced by a factor of 2.</p>
</li>
<li>
<p><strong>ObservationRescaleToSizeFilter</strong> - Rescales an image observation to a given size. The target size does not
necessarily keep the aspect ratio of the original observation.</p>
</li>
<li>
<p><strong>ObservationRGBToYFilter</strong> - Converts a color image observation specified using the RGB encoding into a grayscale
image observation, by keeping only the luminance (Y) channel of the YUV encoding. This can be useful if the colors
in the original image are not relevant for solving the task at hand.</p>
</li>
<li>
<p><strong>ObservationSqueezeFilter</strong> - Removes redundant axes from the observation, which are axes with a dimension of 1.</p>
</li>
<li>
<p><strong>ObservationStackingFilter</strong> - Stacks several observations on top of each other. For image observations this will
create a 3D blob. The stacking is done in a lazy manner in order to reduce memory consumption. To achieve this,
a LazyStack object is used in order to wrap the observations in the stack. For this reason, the
ObservationStackingFilter <strong>must</strong> be the last filter in the inputs filters stack.</p>
</li>
<li>
<p><strong>ObservationUint8Filter</strong> - Converts a floating point observation into an unsigned 8-bit integer observation. This is
mostly useful for reducing memory consumption and is usually used for image observations. The filter will first
spread the observation values over the range 0-255 and then discretize them into integer values.</p>
</li>
</ul>
<h3 id="reward-filters">Reward Filters</h3>
<ul>
<li>
<p><strong>RewardClippingFilter</strong> - Clips the reward values into a given range. For example, in DQN, the Atari rewards are
clipped into the range -1 and 1 in order to control the scale of the returns.</p>
</li>
<li>
<p><strong>RewardNormalizationFilter</strong> - Normalizes the reward values with a running mean and standard deviation of
all the rewards seen so far. When working with multiple workers, the statistics used for the normalization operation
are accumulated over all the workers.</p>
</li>
<li>
<p><strong>RewardRescaleFilter</strong> - Rescales the reward by a given factor. Rescaling the rewards of the environment has been
observed to have a large effect (negative or positive) on the behavior of the learning process.</p>
</li>
</ul>
<h2 id="output-filters">Output Filters</h2>
<p>The output filters only process the actions.</p>
<h3 id="action-filters">Action Filters</h3>
<ul>
<li>
<p><strong>AttentionDiscretization</strong> - Discretizes an <strong>AttentionActionSpace</strong>. The attention action space defines the actions
as choosing sub-boxes in a given box. For example, consider an image of size 100x100, where the action is choosing
a crop window of size 20x20 to attend to in the image. AttentionDiscretization allows discretizing the possible crop
windows into a finite number of options, and maps a discrete action space onto those crop windows.</p>
</li>
<li>
<p><strong>BoxDiscretization</strong> - Discretizes a continuous action space into a discrete action space, allowing the usage of
agents such as DQN for continuous environments such as MuJoCo. Given the number of bins to discretize into, the
original continuous action space is uniformly separated into the given number of bins, each mapped to a discrete
action index. For example, if the original action space is between -1 and 1 and 5 bins were selected, the new action
space will consist of 5 actions mapped to -1, -0.5, 0, 0.5 and 1.</p>
</li>
<li>
<p><strong>BoxMasking</strong> - Masks part of the action space to constrain the agent to work in a defined sub-space. For example,
if the original action space is between -1 and 1, then this filter can be used in order to constrain the agent actions
to the range 0 and 1 instead. This essentially masks the range -1 and 0 from the agent.</p>
</li>
<li>
<p><strong>PartialDiscreteActionSpaceMap</strong> - Partial map of two countable action spaces. For example, consider an environment
with a MultiSelect action space (select multiple actions at the same time, such as jump and go right), with 8 actual
MultiSelect actions. If we want the agent to be able to select only 5 of those actions by their index (0-4), we can
map a discrete action space with 5 actions into the 5 selected MultiSelect actions. This will both allow the agent to
use regular discrete actions, and mask 3 of the actions from the agent.</p>
</li>
<li>
<p><strong>FullDiscreteActionSpaceMap</strong> - Full map of two countable action spaces. This works in a similar way to the
PartialDiscreteActionSpaceMap, but maps the entire source action space into the entire target action space, without
masking any actions.</p>
</li>
<li>
<p><strong>LinearBoxToBoxMap</strong> - A linear mapping of two box action spaces. For example, if the action space of the
environment consists of continuous actions between 0 and 1, and we want the agent to choose actions between -1 and 1,
the LinearBoxToBoxMap can be used to map the range -1 and 1 to the range 0 and 1 in a linear way. This means that the
action -1 will be mapped to 0, the action 1 will be mapped to 1, and the rest of the actions will be linearly mapped
between those values.</p>
</li>
</ul>
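<p>As a sketch (again with illustrative arguments rather than exact signatures), an output filter that lets a
discrete-action agent drive a continuous control environment could be defined as:</p>
<pre><code class="python"># Hypothetical output filter: discretize a continuous [-1, 1] action space into 5 bins,
# mapping the agent's discrete actions onto -1, -0.5, 0, 0.5 and 1.
output_filter = OutputFilter()
output_filter.add_action_filter('discretization', BoxDiscretization(num_bins_per_dimension=5))
</code></pre>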
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../../algorithms/value_optimization/dqn/" class="btn btn-neutral float-right" title="DQN">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../network/" class="btn btn-neutral" title="Network"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../network/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../../algorithms/value_optimization/dqn/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js"></script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
<script src="../../search/require.js"></script>
<script src="../../search/search.js"></script>
</body>
</html>

View File

@@ -0,0 +1,394 @@
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>&lt;no title&gt; &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/custom.css" type="text/css" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link href="../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href="../index.html" class="icon icon-home"> Reinforcement Learning Coach
<img src="../_static/dark_logo.png" class="logo" alt="Logo"/>
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<p class="caption"><span class="caption-text">Intro</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../usage.html">Usage</a></li>
<li class="toctree-l1"><a class="reference internal" href="../features/index.html">Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="../selecting_an_algorithm.html">Selecting an Algorithm</a></li>
<li class="toctree-l1"><a class="reference internal" href="../dashboard.html">Coach Dashboard</a></li>
</ul>
<p class="caption"><span class="caption-text">Design</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="control_flow.html">Control Flow</a></li>
<li class="toctree-l1"><a class="reference internal" href="network.html">Network Design</a></li>
</ul>
<p class="caption"><span class="caption-text">Contributing</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/add_agent.html">Adding a New Agent</a></li>
<li class="toctree-l1"><a class="reference internal" href="../contributing/add_env.html">Adding a New Environment</a></li>
</ul>
<p class="caption"><span class="caption-text">Components</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../components/agents/index.html">Agents</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/architectures/index.html">Architectures</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/environments/index.html">Environments</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/exploration_policies/index.html">Exploration Policies</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/filters/index.html">Filters</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/memories/index.html">Memories</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/core_types.html">Core Types</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/spaces.html">Spaces</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/additional_parameters.html">Additional Parameters</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li>&lt;no title&gt;</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/design/horizontal_scaling.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<h1>Scaling out rollout workers</h1>
<p>This document contains some options for how we could implement horizontal scaling of rollout workers in coach, though most details are not specific to coach. A few options are laid out below; my current suggestion would be to start with Option 1, and move on to Option 1a or Option 1b as required.</p>
<h2>Off Policy Algorithms</h2>
<h3>Option 1 - master polls file system</h3>
<ul>
<li><p class="first">one master process samples memories and updates the policy</p>
</li>
<li><p class="first">many worker processes execute rollouts</p>
</li>
<li><p class="first">coordinate using a single shared networked file system: nfs, ceph, dat, s3fs, etc.</p>
</li>
<li><p class="first">policy sync communication method:
- master process occasionally writes policy to shared file system
- worker processes occasionally read policy from shared file system
- prevent workers from reading a policy which has not been completely written to disk using either:</p>
<blockquote>
<div><ul class="simple">
<li>redis lock</li>
<li>write to temporary files and then rename (a sketch of this approach follows the list below)</li>
</ul>
</div></blockquote>
</li>
<li><p class="first">rollout memories:
- sync communication method:</p>
<blockquote>
<div><ul class="simple">
<li>worker processes write rollout memories to the shared file system as they are generated</li>
<li>master process occasionally reads rollout memories from shared file system</li>
<li>master process must be resilient to corrupted or incompletely written memories</li>
</ul>
</div></blockquote>
<ul class="simple">
<li>sampling method:
- master process keeps all rollouts in memory utilizing existing coach memory classes</li>
</ul>
</li>
<li><p class="first">control flow:
- master:</p>
<blockquote>
<div><ul class="simple">
<li>run training updates interleaved with loading of any newly available rollouts in memory</li>
<li>periodically write policy to disk</li>
</ul>
</div></blockquote>
<ul class="simple">
<li>workers:
- periodically read policy from disk
- evaluate rollouts and write them to disk</li>
</ul>
</li>
<li><p class="first">ops:
- kubernetes yaml, kml, docker compose, etc
- a default shared file system can be provided, while allowing the user to specify something else if desired
- a default method of launching the workers and master (in kubernetes, gce, aws, etc) can be provided</p>
</li>
</ul>
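<p>As a rough sketch of the temporary-file-and-rename approach for the policy sync (the path and helper name below are made up for illustration and are not part of coach):</p>
<div class="highlight"><pre>
import os
import tempfile

def write_policy_atomically(policy_bytes, path="/shared/policies/latest.ckpt"):
    # write the serialized policy to a temporary file on the same file system,
    # flush it to disk, and then atomically rename it into place
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(policy_bytes)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except Exception:
        os.remove(tmp_path)
        raise
</pre></div>
<p>Workers then always read from the fixed path and need no lock: a rename on the same POSIX file system is atomic, so readers see either the previous policy or the complete new one, never a half-written file. Whether the chosen networked file system preserves this guarantee is worth verifying.</p>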
<h4>Pros</h4>
<ul class="simple">
<li>very simple to implement, infrastructure already available in ai-lab-kubernetes</li>
<li>fast enough for proof of concept and iteration of interface design</li>
<li>rollout memories are durable and can be easily reused in later off policy training</li>
<li>if designed properly, there is a clear path towards:
- decreasing latency using in-memory store (option 1a/b)
- increasing rollout memory size using distributed sampling methods (option 1c)</li>
</ul>
<h4>Cons</h4>
<ul class="simple">
<li>the file system interface incurs additional latency: rollout memories must be written to disk and later read back, instead of going directly from memory to memory</li>
<li>will require modifying the standard control flow. There will be an impact on algorithms which expect particular training regimens, specifically algorithms which are sensitive to the number of update steps between target/online network updates</li>
<li>will not be particularly efficient for strictly on-policy algorithms, where each rollout must use the most recent policy available</li>
</ul>
<h3>Option 1a - master polls (redis) list</h3>
<ul class="simple">
<li>instead of using a file system as in Option 1, redis lists can be used</li>
<li>policy is stored as a single key/value pair (locking no longer necessary)</li>
<li>rollout memory communication (see the sketch after this list):
- workers: redis list push
- master: redis list len, redis list range</li>
<li>note: many hosted databases are wire-compatible with the redis protocol: google memorystore, aws elasticache, etc.</li>
<li>note: many other databases could implement this interface with minimal glue: SQL, any object store, etc.</li>
</ul>
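<p>A minimal sketch of the list-based exchange using the redis-py client (the key name and the use of pickle are arbitrary choices for illustration):</p>
<div class="highlight"><pre>
import pickle

import redis

r = redis.Redis(host="redis-master", port=6379)

# worker side: push each finished rollout onto a shared list
def push_rollout(rollout):
    r.rpush("rollouts", pickle.dumps(rollout))

# master side: drain whatever rollouts are currently available, then trim them off the list
def read_new_rollouts():
    count = r.llen("rollouts")
    if count == 0:
        return []
    raw = r.lrange("rollouts", 0, count - 1)
    r.ltrim("rollouts", count, -1)
    return [pickle.loads(item) for item in raw]
</pre></div>
<p>The master could call read_new_rollouts between training batches, or from a separate thread that feeds the existing coach memory.</p>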
<h4>Pros</h4>
<ul class="simple">
<li>lower latency than disk since it is all in memory</li>
<li>clear path toward scaling to large number of workers</li>
<li>no concern about reading partially written rollouts</li>
<li>no synchronization or additional threads necessary, though an additional thread would be helpful for concurrent reads from redis and training</li>
<li>will be slightly more efficient in the case of strictly on policy algorithms</li>
</ul>
<h4>Cons</h4>
<ul class="simple">
<li>more complex to set up, especially if you are concerned about rollout memory durability</li>
</ul>
<h3>Option 1b - master subscribes to (redis) pub sub</h3>
<ul class="simple">
<li>instead of using a file system as in Option 1, redis pub sub can be used</li>
<li>policy is stored as a single key/value pair (locking no longer necessary)</li>
<li>rollout memory communication (see the sketch after this list):
- workers: redis publish
- master: redis subscribe</li>
<li>no synchronization necessary, though an additional listener thread on the master is probably needed
- it looks like the python client might already handle this (e.g. via its run_in_thread helper), would need further investigation</li>
<li>note: many possible pub sub systems could be used with different characteristics under specific contexts: kafka, google pub/sub, aws kinesis, etc</li>
</ul>
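<p>A minimal sketch of the pub/sub variant with redis-py; the client can run the subscriber in a background thread itself (the channel name, pickle and the store_fn callback are placeholders):</p>
<div class="highlight"><pre>
import pickle

import redis

r = redis.Redis(host="redis-master", port=6379)

# worker side: publish each rollout as soon as it is produced
def publish_rollout(rollout):
    r.publish("rollouts", pickle.dumps(rollout))

# master side: register a handler and let redis-py poll the channel in a
# background thread, calling store_fn for every rollout that arrives
def start_rollout_listener(store_fn):
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(**{"rollouts": lambda msg: store_fn(pickle.loads(msg["data"]))})
    return pubsub.run_in_thread(sleep_time=0.01)
</pre></div>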
<h4>Pros</h4>
<ul class="simple">
<li>lower latency than disk since it is all in memory</li>
<li>clear path toward scaling to large number of workers</li>
<li>no concern about reading partially written rollouts</li>
<li>will be slightly more efficient in the case of strictly on policy algorithms</li>
</ul>
<h4>Cons</h4>
<ul class="simple">
<li>more complex to set up than a shared file system</li>
<li>on its own, does not persist worker rollouts for future off-policy training</li>
</ul>
<h3>Option 1c - distributed rollout memory sampling</h3>
<ul class="simple">
<li>if rollout memories do not fit in memory of a single machine, a distributed storage and sampling method would be necessary</li>
<li>for example (see the sketch after this list):
- rollout memory store: redis set add
- rollout memory sample: redis set randmember</li>
</ul>
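<p>For uniform sampling this can be pushed down to redis itself; a sketch (key name and serialization are again placeholders):</p>
<div class="highlight"><pre>
import pickle

import redis

r = redis.Redis(host="redis-master", port=6379)

# store: add a serialized transition to a shared set
def store_transition(transition):
    r.sadd("replay_memory", pickle.dumps(transition))

# sample: let redis draw a uniform random batch on the server side
def sample_batch(batch_size=32):
    return [pickle.loads(item) for item in r.srandmember("replay_memory", batch_size)]
</pre></div>
<p>This covers only uniform sampling; prioritized or episodic memories would need the custom distributed implementations mentioned below.</p>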
<h4>Pros</h4>
<ul class="simple">
<li>capable of taking advantage of rollout memory larger than the available memory of a single machine</li>
<li>reduce resource constraints on training machine</li>
</ul>
<h4>Cons</h4>
<ul class="simple">
<li>distributed versions of each memory type/sampling method need to be custom built</li>
<li>off-the-shelf implementations may not be available for complex memory types/sampling methods</li>
</ul>
<h3>Option 2 - master listens to workers</h3>
<ul class="simple">
<li>rollout memories (a rough 0mq sketch follows this list):
- workers send memories directly to master via: mpi, 0mq, etc
- master policy thread listens for new memories and stores them in shared memory</li>
<li>policy sync communication method:
- master policy occasionally sends policies directly to workers via: mpi, 0mq, etc
- master and workers must synchronize so that all workers are listening when the master is ready to send a new policy</li>
</ul>
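<p>One possible direction, sketched very roughly with 0mq PUSH/PULL sockets (host, port and serialization are placeholders, and none of the required synchronization is shown):</p>
<div class="highlight"><pre>
import pickle

import zmq

# master side: bind a PULL socket and yield rollouts as they arrive
def master_rollout_stream(port=5555):
    sock = zmq.Context().socket(zmq.PULL)
    sock.bind("tcp://*:%d" % port)
    while True:
        yield pickle.loads(sock.recv())

# worker side: connect a PUSH socket and return a function that sends rollouts to the master
def make_rollout_sender(master_host, port=5555):
    sock = zmq.Context().socket(zmq.PUSH)
    sock.connect("tcp://%s:%d" % (master_host, port))
    return lambda rollout: sock.send(pickle.dumps(rollout))
</pre></div>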
<h4>Pros</h4>
<ul class="simple">
<li>lower latency than option 1 (for a small number of workers)</li>
<li>will potentially be the optimal choice for strictly on-policy algorithms with a relatively small number of worker nodes (small enough that more complex communication topologies such as rings or p2p are not yet necessary)</li>
</ul>
<h4>Cons</h4>
<ul class="simple">
<li>much less robust and more difficult to debug, since it requires lots of synchronization</li>
<li>much more difficult to be resilient to worker failure</li>
<li>more custom communication/synchronization code</li>
<li>as the number of workers scales up, a larger and larger fraction of time will be spent waiting and synchronizing</li>
</ul>
<h3>Option 3 - Ray</h3>
<h4>Pros</h4>
<ul class="simple">
<li>Ray would allow us to easily convert our current algorithms to distributed versions, with minimal change to our code.</li>
</ul>
<h4>Cons</h4>
<ul class="simple">
<li>performance from naïve/simple use would be very similar to Option 2</li>
<li>nontrivial to replace with a higher performance system if desired. Additional performance will require significant code changes.</li>
</ul>
<h2>On Policy Algorithms</h2>
<p>TODO</p>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/rtfd/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>

290
docs/design/network.html Normal file
View File

@@ -0,0 +1,290 @@
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Network Design &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/custom.css" type="text/css" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Adding a New Agent" href="../contributing/add_agent.html" />
<link rel="prev" title="Control Flow" href="control_flow.html" />
<link href="../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href="../index.html" class="icon icon-home"> Reinforcement Learning Coach
<img src="../_static/dark_logo.png" class="logo" alt="Logo"/>
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<p class="caption"><span class="caption-text">Intro</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../usage.html">Usage</a></li>
<li class="toctree-l1"><a class="reference internal" href="../features/index.html">Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="../selecting_an_algorithm.html">Selecting an Algorithm</a></li>
<li class="toctree-l1"><a class="reference internal" href="../dashboard.html">Coach Dashboard</a></li>
</ul>
<p class="caption"><span class="caption-text">Design</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="control_flow.html">Control Flow</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Network Design</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#keeping-network-copies-in-sync">Keeping Network Copies in Sync</a></li>
</ul>
</li>
</ul>
<p class="caption"><span class="caption-text">Contributing</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/add_agent.html">Adding a New Agent</a></li>
<li class="toctree-l1"><a class="reference internal" href="../contributing/add_env.html">Adding a New Environment</a></li>
</ul>
<p class="caption"><span class="caption-text">Components</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../components/agents/index.html">Agents</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/architectures/index.html">Architectures</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/environments/index.html">Environments</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/exploration_policies/index.html">Exploration Policies</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/filters/index.html">Filters</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/memories/index.html">Memories</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/core_types.html">Core Types</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/spaces.html">Spaces</a></li>
<li class="toctree-l1"><a class="reference internal" href="../components/additional_parameters.html">Additional Parameters</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li>Network Design</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/design/network.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="network-design">
<h1>Network Design<a class="headerlink" href="#network-design" title="Permalink to this headline"></a></h1>
<p>Each agent has at least one neural network, which is used as a function approximator for choosing actions.
The network is designed in a modular way to allow reusability in different agents.
It is separated into three main parts:</p>
<ul>
<li><p class="first"><strong>Input Embedders</strong> - This is the first stage of the network, meant to convert the input into a feature vector representation.
It is possible to combine several instances of any of the supported embedders, in order to allow varied combinations of inputs.</p>
<blockquote>
<div><p>There are two main types of input embedders:</p>
<ol class="arabic simple">
<li>Image embedder - Convolutional neural network.</li>
<li>Vector embedder - Multi-layer perceptron.</li>
</ol>
</div></blockquote>
</li>
<li><p class="first"><strong>Middlewares</strong> - The middleware gets the output of the input embedder, and processes it into a different representation domain,
before sending it through the output head. The goal of the middleware is to enable processing the combined outputs of
several input embedders, and pass them through some extra processing.
This, for instance, might include an LSTM or just a plain simple FC layer.</p>
</li>
<li><p class="first"><strong>Output Heads</strong> - The output head is used in order to predict the values required from the network.
These might include action-values, state-values or a policy. As with the input embedders,
it is possible to use several output heads in the same network. For example, the <em>Actor Critic</em> agent combines two
heads - a policy head and a state-value head.
In addition, the output heads defines the loss function according to the head type.</p>
<p></p>
</li>
</ul>
<a class="reference internal image-reference" href="../_images/network.png"><img alt="../_images/network.png" class="align-center" src="../_images/network.png" style="width: 400px;" /></a>
<div class="section" id="keeping-network-copies-in-sync">
<h2>Keeping Network Copies in Sync<a class="headerlink" href="#keeping-network-copies-in-sync" title="Permalink to this headline"></a></h2>
<p>Most of the reinforcement learning agents include more than one copy of the neural network.
These copies serve as counterparts of the main network which are updated at different rates,
and are often synchronized either locally or between parallel workers. To ease the synchronization of those copies,
a wrapper around them exposes a simplified API which hides these complexities from the agent.
In this wrapper, 3 types of networks can be defined (a minimal synchronization sketch follows the figure below):</p>
<ul class="simple">
<li><strong>online network</strong> - A mandatory network which is the main network the agent will use</li>
<li><strong>global network</strong> - An optional network which is shared between workers in single-node multi-process distributed learning.
It is updated by all the workers directly, and holds the most up-to-date weights.</li>
<li><strong>target network</strong> - An optional network which is local for each worker. It can be used in order to keep a copy of
the weights stable for a long period of time. This is used in different agents, like DQN for example, in order to
have stable targets for the online network while training it.</li>
</ul>
<a class="reference internal image-reference" href="../_images/distributed.png"><img alt="../_images/distributed.png" class="align-center" src="../_images/distributed.png" style="width: 600px;" /></a>
</div>
</div>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../contributing/add_agent.html" class="btn btn-neutral float-right" title="Adding a New Agent" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="control_flow.html" class="btn btn-neutral" title="Control Flow" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/rtfd/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>

View File

@@ -1,310 +0,0 @@
<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="../../img/favicon.ico">
<title>Network - Reinforcement Learning Coach</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="../../css/highlight.css">
<link href="../../extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Network";
var mkdocs_page_input_path = "design/network.md";
var mkdocs_page_url = "/design/network/";
</script>
<script src="../../js/jquery-2.1.1.min.js"></script>
<script src="../../js/modernizr-2.8.3.min.js"></script>
<script type="text/javascript" src="../../js/highlight.pack.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../.." class="icon icon-home"> Reinforcement Learning Coach</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="../..">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../usage/">Usage</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Design</span>
<ul class="subnav">
<li class="">
<a class="" href="../features/">Features</a>
</li>
<li class="">
<a class="" href="../control_flow/">Control Flow</a>
</li>
<li class=" current">
<a class="current" href="./">Network</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#network-design">Network Design</a></li>
<ul>
<li><a class="toctree-l4" href="#keeping-network-copies-in-sync">Keeping Network Copies in Sync</a></li>
</ul>
</ul>
</li>
<li class="">
<a class="" href="../filters/">Filters</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Algorithms</span>
<ul class="subnav">
<li class="">
<a class="" href="../../algorithms/value_optimization/dqn/">DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/double_dqn/">Double DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/dueling_dqn/">Dueling DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/categorical_dqn/">Categorical DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/mmc/">Mixed Monte Carlo</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/pal/">Persistent Advantage Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/nec/">Neural Episodic Control</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/bs_dqn/">Bootstrapped DQN</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/n_step/">N-Step Q Learning</a>
</li>
<li class="">
<a class="" href="../../algorithms/value_optimization/naf/">Normalized Advantage Functions</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/pg/">Policy Gradient</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ac/">Actor-Critic</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ddpg/">Deep Determinstic Policy Gradients</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/ppo/">Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/policy_optimization/cppo/">Clipped Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../algorithms/other/dfp/">Direct Future Prediction</a>
</li>
<li class="">
<a class="" href="../../algorithms/imitation/bc/">Behavioral Cloning</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<a class="" href="../../dashboard/">Coach Dashboard</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Contributing</span>
<ul class="subnav">
<li class="">
<a class="" href="../../contributing/add_agent/">Adding a New Agent</a>
</li>
<li class="">
<a class="" href="../../contributing/add_env/">Adding a New Environment</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../..">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../..">Docs</a> &raquo;</li>
<li>Design &raquo;</li>
<li>Network</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="network-design">Network Design</h1>
<p>Each agent has at least one neural network, used as the function approximator, for choosing the actions. The network is designed in a modular way to allow reusability in different agents. It is separated into three main parts:</p>
<ul>
<li>
<p><strong>Input Embedders</strong> - This is the first stage of the network, meant to convert the input into a feature vector representation. It is possible to combine several instances of any of the supported embedders, in order to allow varied combinations of inputs. </p>
<p>There are two main types of input embedders: </p>
<ol>
<li>Image embedder - Convolutional neural network. </li>
<li>Vector embedder - Multi-layer perceptron. </li>
</ol>
</li>
<li>
<p><strong>Middlewares</strong> - The middleware gets the output of the input embedder, and processes it into a different representation domain, before sending it through the output head. The goal of the middleware is to enable processing the combined outputs of several input embedders, and pass them through some extra processing. This, for instance, might include an LSTM or just a plain simple FC layer.</p>
</li>
<li>
<p><strong>Output Heads</strong> - The output head is used in order to predict the values required from the network. These might include action-values, state-values or a policy. As with the input embedders, it is possible to use several output heads in the same network. For example, the <em>Actor Critic</em> agent combines two heads - a policy head and a state-value head.
In addition, the output heads defines the loss function according to the head type.</p>
</li>
</ul>
<p></p>
<p style="text-align: center;">
<img src="../../img/network.png" alt="Network Design" style="width: 400px;"/>
</p>
<h2 id="keeping-network-copies-in-sync">Keeping Network Copies in Sync</h2>
<p>Most of the reinforcement learning agents include more than one copy of the neural network. These copies serve as counterparts of the main network which are updated in different rates, and are often synchronized either locally or between parallel workers. For easier synchronization of those copies, a wrapper around these copies exposes a simplified API, which allows hiding these complexities from the agent. </p>
<p style="text-align: center;">
<img src="../../img/distributed.png" alt="Distributed Training" style="width: 600px;"/>
</p>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../filters/" class="btn btn-neutral float-right" title="Filters">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../control_flow/" class="btn btn-neutral" title="Control Flow"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../control_flow/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../filters/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../..';</script>
<script src="../../js/theme.js"></script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
<script src="../../search/require.js"></script>
<script src="../../search/search.js"></script>
</body>
</html>