<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="../../../img/favicon.ico">
<title>Policy Gradient - Reinforcement Learning Coach</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="../../../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../../../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="../../../css/highlight.css">
<link href="../../../extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Policy Gradient";
var mkdocs_page_input_path = "algorithms/policy_optimization/pg.md";
var mkdocs_page_url = "/algorithms/policy_optimization/pg/";
</script>
<script src="../../../js/jquery-2.1.1.min.js"></script>
<script src="../../../js/modernizr-2.8.3.min.js"></script>
<script type="text/javascript" src="../../../js/highlight.pack.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="../../.." class="icon icon-home"> Reinforcement Learning Coach</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="../../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="../../..">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="../../../usage/">Usage</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Design</span>
<ul class="subnav">
<li class="">
<a class="" href="../../../design/features/">Features</a>
</li>
<li class="">
<a class="" href="../../../design/control_flow/">Control Flow</a>
</li>
<li class="">
<a class="" href="../../../design/network/">Network</a>
</li>
<li class="">
<a class="" href="../../../design/filters/">Filters</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Algorithms</span>
<ul class="subnav">
<li class="">
<a class="" href="../../value_optimization/dqn/">DQN</a>
</li>
<li class="">
<a class="" href="../../value_optimization/double_dqn/">Double DQN</a>
</li>
<li class="">
<a class="" href="../../value_optimization/dueling_dqn/">Dueling DQN</a>
</li>
<li class="">
<a class="" href="../../value_optimization/categorical_dqn/">Categorical DQN</a>
</li>
<li class="">
<a class="" href="../../value_optimization/mmc/">Mixed Monte Carlo</a>
</li>
<li class="">
<a class="" href="../../value_optimization/pal/">Persistent Advantage Learning</a>
</li>
<li class="">
<a class="" href="../../value_optimization/nec/">Neural Episodic Control</a>
</li>
<li class="">
<a class="" href="../../value_optimization/bs_dqn/">Bootstrapped DQN</a>
</li>
<li class="">
<a class="" href="../../value_optimization/n_step/">N-Step Q Learning</a>
</li>
<li class="">
<a class="" href="../../value_optimization/naf/">Normalized Advantage Functions</a>
</li>
<li class=" current">
<a class="current" href="./">Policy Gradient</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#policy-gradient">Policy Gradient</a></li>
<ul>
<li><a class="toctree-l4" href="#network-structure">Network Structure</a></li>
<li><a class="toctree-l4" href="#algorithm-description">Algorithm Description</a></li>
</ul>
</ul>
</li>
<li class="">
<a class="" href="../ac/">Actor-Critic</a>
</li>
<li class="">
<a class="" href="../ddpg/">Deep Determinstic Policy Gradients</a>
</li>
<li class="">
<a class="" href="../ppo/">Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../cppo/">Clipped Proximal Policy Optimization</a>
</li>
<li class="">
<a class="" href="../../other/dfp/">Direct Future Prediction</a>
</li>
<li class="">
<a class="" href="../../imitation/bc/">Behavioral Cloning</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<a class="" href="../../../dashboard/">Coach Dashboard</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Contributing</span>
<ul class="subnav">
<li class="">
<a class="" href="../../../contributing/add_agent/">Adding a New Agent</a>
</li>
<li class="">
<a class="" href="../../../contributing/add_env/">Adding a New Environment</a>
</li>
</ul>
</li>
</ul>
</div>
&nbsp;
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../../..">Reinforcement Learning Coach</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../../..">Docs</a> &raquo;</li>
<li>Algorithms &raquo;</li>
<li>Policy Gradient</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="policy-gradient">Policy Gradient</h1>
<p><strong>Action space:</strong> Discrete|Continuous</p>
<p><strong>References:</strong> <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning</a></p>
<h2 id="network-structure">Network Structure</h2>
<p style="text-align: center;">
<img src="..\..\design_imgs\pg.png">
</p>
<h2 id="algorithm-description">Algorithm Description</h2>
<h3 id="choosing-an-action-discrete-actions">Choosing an action - Discrete actions</h3>
<p>Run the current state through the network to get a policy distribution over the actions. During training, sample an action from the policy distribution. When testing, take the action with the highest probability.</p>
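<p>As a rough sketch of this action-selection rule (the function below and its use of NumPy are illustrative for this example, not part of Coach's API):</p>
<pre><code class="python">import numpy as np

def choose_action(policy_probs, training):
    """policy_probs: the policy network's softmax output for a single state."""
    if training:
        # Exploration: sample an action according to the policy distribution
        return int(np.random.choice(len(policy_probs), p=policy_probs))
    # Evaluation: act greedily with respect to the policy
    return int(np.argmax(policy_probs))
</code></pre>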
<h3 id="training-the-network">Training the network</h3>
<p>The policy head loss is defined as <script type="math/tex"> L=-\log (\pi) \cdot PolicyGradientRescaler </script>. The <script type="math/tex">PolicyGradientRescaler</script> is used to reduce the variance of the policy gradient, since noisy gradient updates might destabilize the policy's convergence. The rescaler is a configurable parameter, and there are a few options to choose from (a sketch of the resulting loss follows the list):</p>
<ul>
<li><strong>Total Episode Return</strong> - the sum of all the discounted rewards during the episode.</li>
<li><strong>Future Return</strong> - the discounted return from each transition until the end of the episode.</li>
<li><strong>Future Return Normalized by Episode</strong> - the future returns, normalized by the episode's mean and standard deviation.</li>
<li><strong>Future Return Normalized by Timestep</strong> - the future returns, normalized using running means and standard deviations that are calculated separately for each timestep, across different episodes.</li>
</ul>
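<p>As an illustration, the following sketch computes the plain <strong>Future Return</strong> rescaler and the corresponding policy loss for one episode. The function names and the use of NumPy are assumptions made for this example, not Coach's implementation:</p>
<pre><code class="python">import numpy as np

def future_returns(rewards, discount=0.99):
    """Discounted return from each timestep until the end of the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

def policy_loss(log_probs, rescaler):
    """log_probs: log pi(a_t|s_t) for the actions that were taken.
    rescaler: the chosen PolicyGradientRescaler values, e.g. the future returns."""
    return -np.sum(np.asarray(log_probs) * np.asarray(rescaler))
</code></pre>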
<p>Gradients are accumulated over a number of fully played episodes before being applied to the network. Accumulating over several episodes serves the same purpose - reducing the variance of the updates. Once the gradients of several episodes have been accumulated, they are applied to the network in a single update step.</p>
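<p>A minimal sketch of this accumulation scheme is shown below; the two callables and the number of episodes per update are placeholders for framework-specific code, not Coach's actual interfaces:</p>
<pre><code class="python">import numpy as np

def accumulate_and_apply(episodes, compute_gradients, apply_gradients,
                         episodes_per_update=5):
    """Accumulate policy gradients over several full episodes before applying them.

    compute_gradients(episode) -> list of np.ndarray, one per network parameter
    apply_gradients(grads)     -> applies the summed gradients to the network
    """
    accumulated = None
    for i, episode in enumerate(episodes, start=1):
        grads = compute_gradients(episode)
        if accumulated is None:
            accumulated = [np.zeros_like(g) for g in grads]
        for acc, g in zip(accumulated, grads):
            acc += g
        if i % episodes_per_update == 0:
            # A single update with the gradients summed over the last few episodes
            apply_gradients(accumulated)
            accumulated = None
</code></pre>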
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="../ac/" class="btn btn-neutral float-right" title="Actor-Critic">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="../../value_optimization/naf/" class="btn btn-neutral" title="Normalized Advantage Functions"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../../value_optimization/naf/" style="color: #fcfcfc;">&laquo; Previous</a></span>
<span style="margin-left: 15px"><a href="../ac/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script>var base_url = '../../..';</script>
<script src="../../../js/theme.js"></script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
<script src="../../../search/require.js"></script>
<script src="../../../search/search.js"></script>
</body>
</html>