
Enabling Coach Documentation to be run even when environments are not installed (#326)

Authored by anabwan on 2019-05-27 10:46:07 +03:00, committed by Gal Leibovich
parent 2b7d536da4
commit 342b7184bc
157 changed files with 5167 additions and 7477 deletions

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Control Flow &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Control Flow &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Coach Dashboard" href="../dashboard.html" />
<link href="../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -210,17 +213,17 @@ The graph manager's main loop is the improve loop.</p>
<a class="reference internal image-reference" href="../_images/improve.png"><img alt="../_images/improve.png" class="align-center" src="../_images/improve.png" style="width: 400px;" /></a>
<p>The improve loop alternates between 3 main phases - heatup, training and evaluation:</p>
<ul class="simple">
<li><strong>Heatup</strong> - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase
<li><p><strong>Heatup</strong> - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase
takes place only in the beginning of the experiment, and the agents will act completely randomly during this phase.
Importantly, the agents do not train their networks during this phase. DQN for example, uses 50k random steps in order
to initialize the replay buffers.</li>
<li><strong>Training</strong> - the training phase is the main phase of the experiment. This phase can change between agent types,
to initialize the replay buffers.</p></li>
<li><p><strong>Training</strong> - the training phase is the main phase of the experiment. This phase can change between agent types,
but essentially consists of repeated cycles of acting, collecting data from the environment, and training the agent
networks. During this phase, the agent will use its exploration policy in training mode, which will add noise to its
actions in order to improve its knowledge about the environment state space.</li>
<li><strong>Evaluation</strong> - the evaluation phase is intended for evaluating the current performance of the agent. The agents
actions in order to improve its knowledge about the environment state space.</p></li>
<li><p><strong>Evaluation</strong> - the evaluation phase is intended for evaluating the current performance of the agent. The agents
will act greedily in order to exploit the knowledge aggregated so far and the performance over multiple episodes of
evaluation will be averaged in order to reduce the stochasticity effects of all the components.</li>
evaluation will be averaged in order to reduce the stochasticity effects of all the components.</p></li>
</ul>
</div>
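The lengths of these three phases are set per preset through the schedule parameters. A minimal sketch, assuming the ScheduleParameters class and the step-count types used in Coach presets of this era (the step counts below are illustrative, not recommended values):

# Sketch: scheduling heatup, training and evaluation in a preset.
from rl_coach.core_types import EnvironmentEpisodes, EnvironmentSteps, TrainingSteps
from rl_coach.graph_managers.graph_manager import ScheduleParameters

schedule_params = ScheduleParameters()
schedule_params.heatup_steps = EnvironmentSteps(50000)        # random acting, no training (DQN-style)
schedule_params.improve_steps = TrainingSteps(10000000)       # total length of the improve loop
schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(10)
schedule_params.evaluation_steps = EnvironmentEpisodes(1)     # greedy episodes, averaged for the score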
<div class="section" id="level-manager">
@@ -240,29 +243,29 @@ a lower hierarchy level.</p>
<h2>Agent<a class="headerlink" href="#agent" title="Permalink to this headline"></a></h2>
<p>The base agent class has 3 main functions that will be used during those phases - observe, act and train.</p>
<ul class="simple">
<li><strong>Observe</strong> - this function gets the latest response from the environment as input, and updates the internal state
<li><p><strong>Observe</strong> - this function gets the latest response from the environment as input, and updates the internal state
of the agent with the new information. The environment response will
be first passed through the agent's <code class="code docutils literal notranslate"><span class="pre">InputFilter</span></code> object, which will process the values in the response, according
to the specific agent definition. The environment response will then be converted into a
<code class="code docutils literal notranslate"><span class="pre">Transition</span></code> which will contain the information from a single step
<span class="math notranslate nohighlight">\((s_{t}, a_{t}, r_{t}, s_{t+1}, \textrm{terminal signal})\)</span>, and store it in the memory.</li>
<span class="math notranslate nohighlight">\((s_{t}, a_{t}, r_{t}, s_{t+1}, \textrm{terminal signal})\)</span>, and store it in the memory.</p></li>
</ul>
<a class="reference internal image-reference" href="../_images/observe.png"><img alt="../_images/observe.png" class="align-center" src="../_images/observe.png" style="width: 700px;" /></a>
<ul class="simple">
<li><strong>Act</strong> - this function uses the current internal state of the agent in order to select the next action to take on
<li><p><strong>Act</strong> - this function uses the current internal state of the agent in order to select the next action to take on
the environment. This function will call the per-agent custom function <code class="code docutils literal notranslate"><span class="pre">choose_action</span></code> that will use the network
and the exploration policy in order to select an action. The action will be stored, together with any additional
information (like the action value for example) in an <code class="code docutils literal notranslate"><span class="pre">ActionInfo</span></code> object. The ActionInfo object will then be
passed through the agent's <code class="code docutils literal notranslate"><span class="pre">OutputFilter</span></code> to allow any processing of the action (like discretization,
or shifting, for example), before passing it to the environment.</li>
or shifting, for example), before passing it to the environment.</p></li>
</ul>
<a class="reference internal image-reference" href="../_images/act.png"><img alt="../_images/act.png" class="align-center" src="../_images/act.png" style="width: 700px;" /></a>
<ul class="simple">
<li><strong>Train</strong> - this function will sample a batch from the memory and train on it. The batch of transitions will be
<li><p><strong>Train</strong> - this function will sample a batch from the memory and train on it. The batch of transitions will be
first wrapped into a <code class="code docutils literal notranslate"><span class="pre">Batch</span></code> object to allow efficient querying of the batch values. It will then be passed into
the agent specific <code class="code docutils literal notranslate"><span class="pre">learn_from_batch</span></code> function, that will extract network target values from the batch and will
train the networks accordingly. Lastly, if there's a target network defined for the agent, it will sync the target
network weights with the online network.</li>
network weights with the online network.</p></li>
</ul>
<a class="reference internal image-reference" href="../_images/train.png"><img alt="../_images/train.png" class="align-center" src="../_images/train.png" style="width: 700px;" /></a>
</div>
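Putting the three functions together, a single training episode can be pictured roughly as the driver loop below. This is a schematic sketch, not Coach's actual control flow; the agent and env arguments, env.reset_internal_state() and env_response.game_over are assumptions about the surrounding API.

# Schematic sketch of how observe, act and train interact (hypothetical glue code).
from rl_coach.core_types import RunPhase

def run_training_episode(agent, env):
    env_response = env.reset_internal_state()     # assumed reset API
    done = False
    while not done:
        action_info = agent.act()                 # choose_action + exploration policy + OutputFilter
        env_response = env.step(action_info.action)
        done = env_response.game_over             # assumed termination flag
        agent.observe(env_response)               # InputFilter -> Transition -> memory
        if agent.phase == RunPhase.TRAIN:
            agent.train()                         # sample a Batch, learn_from_batch, sync target network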
@@ -279,7 +282,7 @@ network weights with the online network.</li>
<a href="network.html" class="btn btn-neutral float-right" title="Network Design" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="../dashboard.html" class="btn btn-neutral" title="Coach Dashboard" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="../dashboard.html" class="btn btn-neutral float-left" title="Coach Dashboard" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -288,7 +291,7 @@ network weights with the online network.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -305,27 +308,16 @@ network weights with the online network.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Distributed Coach - Horizontal Scale-Out &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Distributed Coach - Horizontal Scale-Out &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Network Design" href="network.html" />
<link href="../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -190,14 +193,14 @@
three interfaces for horizontal scale-out, which allows integration with different technologies and adds flexibility.
These three interfaces are orchestrator, memory backend and data store.</p>
<ul class="simple">
<li><strong>Orchestrator</strong> - The orchestrator interface provides basic interaction points for orchestration, scheduling and
<li><p><strong>Orchestrator</strong> - The orchestrator interface provides basic interaction points for orchestration, scheduling and
resource management of training and rollout workers in the distributed coach mode. The interaction points define
how Coach should deploy, undeploy and monitor the workers spawned by Coach.</li>
<li><strong>Memory Backend</strong> - This interface is used as the backing store or stream for the memory abstraction in
how Coach should deploy, undeploy and monitor the workers spawned by Coach.</p></li>
<li><p><strong>Memory Backend</strong> - This interface is used as the backing store or stream for the memory abstraction in
distributed Coach. The implementation of this module is mainly used for communicating experiences (transitions
and episodes) from the rollout to the training worker.</li>
<li><strong>Data Store</strong> - This interface is used as a backing store for the policy checkpoints. It is mainly used for
synchronizing policy checkpoints from the training to the rollout worker.</li>
and episodes) from the rollout to the training worker.</p></li>
<li><p><strong>Data Store</strong> - This interface is used as a backing store for the policy checkpoints. It is mainly used for
synchronizing policy checkpoints from the training to the rollout worker.</p></li>
</ul>
<a class="reference internal image-reference" href="../_images/horizontal-scale-out.png"><img alt="../_images/horizontal-scale-out.png" class="align-center" src="../_images/horizontal-scale-out.png" style="width: 800px;" /></a>
<div class="section" id="supported-synchronization-types">
@@ -207,12 +210,12 @@ rollout worker. For each algorithm, it is specified by using the <cite>Distribut
<cite>agent_params.algorithm.distributed_coach_synchronization_type</cite> in the preset. In distributed Coach, two types of
synchronization modes are supported: <cite>SYNC</cite> and <cite>ASYNC</cite>.</p>
<ul class="simple">
<li><strong>SYNC</strong> - In this type, the trainer waits for all the experiences to be gathered from distributed rollout workers
<li><p><strong>SYNC</strong> - In this type, the trainer waits for all the experiences to be gathered from distributed rollout workers
before training a new policy and the rollout workers wait for a new policy before gathering experiences. It is suitable
for ON policy algorithms.</li>
<li><strong>ASYNC</strong> - In this type, the trainer doesn't wait for any set of experiences to be gathered from distributed
for ON policy algorithms.</p></li>
<li><p><strong>ASYNC</strong> - In this type, the trainer doesn't wait for any set of experiences to be gathered from distributed
rollout workers and the rollout workers continuously gather experiences, loading new policies whenever they become
available. It is suitable for OFF policy algorithms.</li>
available. It is suitable for OFF policy algorithms.</p></li>
</ul>
</div>
</div>
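In a preset this is a single assignment. A sketch, assuming the DistributedCoachSynchronizationType enum from rl_coach.base_parameters and an agent_params object defined earlier in the preset:

# Sketch: choosing the synchronization mode for distributed Coach in a preset.
from rl_coach.base_parameters import DistributedCoachSynchronizationType

# on-policy algorithms -> SYNC; off-policy algorithms -> ASYNC
agent_params.algorithm.distributed_coach_synchronization_type = \
    DistributedCoachSynchronizationType.SYNC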
@@ -228,7 +231,7 @@ available. It is suitable for OFF policy algorithms.</li>
<a href="../contributing/add_agent.html" class="btn btn-neutral float-right" title="Adding a New Agent" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="network.html" class="btn btn-neutral" title="Network Design" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="network.html" class="btn btn-neutral float-left" title="Network Design" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -237,7 +240,7 @@ available. It is suitable for OFF policy algorithms.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -254,27 +257,16 @@ available. It is suitable for OFF policy algorithms.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>

View File

@@ -8,7 +8,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Network Design &mdash; Reinforcement Learning Coach 0.11.0 documentation</title>
<title>Network Design &mdash; Reinforcement Learning Coach 0.12.1 documentation</title>
@@ -17,13 +17,21 @@
<script type="text/javascript" src="../_static/js/modernizr.min.js"></script>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/custom.css" type="text/css" />
@@ -33,21 +41,16 @@
<link rel="prev" title="Control Flow" href="control_flow.html" />
<link href="../_static/css/custom.css" rel="stylesheet" type="text/css">
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<div class="wy-side-nav-search" >
@@ -190,22 +193,21 @@
The network is designed in a modular way to allow reusability in different agents.
It is separated into three main parts:</p>
<ul>
<li><p class="first"><strong>Input Embedders</strong> - This is the first stage of the network, meant to convert the input into a feature vector representation.
<li><p><strong>Input Embedders</strong> - This is the first stage of the network, meant to convert the input into a feature vector representation.
It is possible to combine several instances of any of the supported embedders, in order to allow varied combinations of inputs.</p>
<blockquote>
<div><p>There are two main types of input embedders:</p>
<ol class="arabic simple">
<li>Image embedder - Convolutional neural network.</li>
<li>Vector embedder - Multi-layer perceptron.</li>
<li><p>Image embedder - Convolutional neural network.</p></li>
<li><p>Vector embedder - Multi-layer perceptron.</p></li>
</ol>
</div></blockquote>
</li>
<li><p class="first"><strong>Middlewares</strong> - The middleware gets the output of the input embedder, and processes it into a different representation domain,
<li><p><strong>Middlewares</strong> - The middleware gets the output of the input embedder, and processes it into a different representation domain,
before sending it through the output head. The goal of the middleware is to enable processing the combined outputs of
several input embedders, and pass them through some extra processing.
This, for instance, might include an LSTM or just a plain simple FC layer.</p>
</li>
<li><p class="first"><strong>Output Heads</strong> - The output head is used in order to predict the values required from the network.
This, for instance, might include an LSTM or just a plain simple FC layer.</p></li>
<li><p><strong>Output Heads</strong> - The output head is used in order to predict the values required from the network.
These might include action-values, state-values or a policy. As with the input embedders,
it is possible to use several output heads in the same network. For example, the <em>Actor Critic</em> agent combines two
heads - a policy head and a state-value head.
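A sketch of how a preset typically shapes these three stages, assuming the input_embedders_parameters / middleware_parameters / heads_parameters fields on the agent's network wrapper and the Dense layer helper (layer sizes are illustrative):

# Sketch: customizing the embedder and middleware schemes of the 'main' network.
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.architectures.layers import Dense

agent_params = DQNAgentParameters()
main_net = agent_params.network_wrappers['main']
main_net.input_embedders_parameters['observation'].scheme = [Dense(64)]   # vector embedder (MLP)
main_net.middleware_parameters.scheme = [Dense(128)]                      # FC middleware
# the output heads come from the agent type (a Q-value head for DQN); agents such as
# Actor Critic attach both a policy head and a state-value head to the same middleware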
@@ -222,12 +224,12 @@ and are often synchronized either locally or between parallel workers. For easie
a wrapper around these copies exposes a simplified API, which allows hiding these complexities from the agent.
In this wrapper, 3 types of networks can be defined:</p>
<ul class="simple">
<li><strong>online network</strong> - A mandatory network which is the main network the agent will use</li>
<li><strong>global network</strong> - An optional network which is shared between workers in single-node multi-process distributed learning.
It is updated by all the workers directly, and holds the most up-to-date weights.</li>
<li><strong>target network</strong> - An optional network which is local for each worker. It can be used in order to keep a copy of
<li><p><strong>online network</strong> - A mandatory network which is the main network the agent will use</p></li>
<li><p><strong>global network</strong> - An optional network which is shared between workers in single-node multi-process distributed learning.
It is updated by all the workers directly, and holds the most up-to-date weights.</p></li>
<li><p><strong>target network</strong> - An optional network which is local for each worker. It can be used in order to keep a copy of
the weights stable for a long period of time. This is used in different agents, like DQN for example, in order to
have stable targets for the online network while training it.</li>
have stable targets for the online network while training it.</p></li>
</ul>
<a class="reference internal image-reference" href="../_images/distributed.png"><img alt="../_images/distributed.png" class="align-center" src="../_images/distributed.png" style="width: 600px;" /></a>
</div>
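A sketch of how an agent touches these copies; agent, states and next_states stand for an agent instance and batches of states, and the attribute and method names approximate Coach's NetworkWrapper API, so read them as assumptions rather than exact signatures:

# Sketch: the three per-agent network copies (names are approximate).
network = agent.networks['main']                        # NetworkWrapper holding the copies

q_values = network.online_network.predict(states)       # acting and training use the online copy
targets = network.target_network.predict(next_states)   # stable targets, e.g. the DQN bootstrap

# periodically copy the online weights into the target copy
network.update_target_network()

# with single-node multi-process workers, gradients are applied to the shared global
# network and the online copy is then re-synced from it
network.apply_gradients_and_sync_networks()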
@@ -244,7 +246,7 @@ have stable targets for the online network while training it.</li>
<a href="horizontal_scaling.html" class="btn btn-neutral float-right" title="Distributed Coach - Horizontal Scale-Out" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right"></span></a>
<a href="control_flow.html" class="btn btn-neutral" title="Control Flow" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
<a href="control_flow.html" class="btn btn-neutral float-left" title="Control Flow" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left"></span> Previous</a>
</div>
@@ -253,7 +255,7 @@ have stable targets for the online network while training it.</li>
<div role="contentinfo">
<p>
&copy; Copyright 2018, Intel AI Lab
&copy; Copyright 2018-2019, Intel AI Lab
</p>
</div>
@@ -270,27 +272,16 @@ have stable targets for the online network while training it.</li>
<script type="text/javascript" id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/language_data.js"></script>
<script async="async" type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</script>
</body>
</html>