update of api docstrings across coach and tutorials [WIP] (#91)
* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
docs_raw/source/selecting_an_algorithm.rst (Normal file, 270 lines added)
@@ -0,0 +1,270 @@
Selecting an Algorithm
======================

As you have probably already noticed, Coach has a lot of algorithms implemented in it:

.. image:: /_static/img/algorithms.png
    :width: 800px
    :align: center

**"OK, that's perfect, but I am trying to build a solution for my application. How do I select the right algorithm?"**

We have collected some guidelines for choosing the right algorithm for your application.
Answer the following questions to see which algorithms are the best fit for your task.
The algorithms are ordered by their release date, in descending order.

.. raw:: html

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
    <script>
        $(document).ready(function() {
            // sort the agent badges in descending order of their publish year
            function order_badges() {
                $(".badges-wrapper").find('.algorithm').sort(function(a, b) {
                    // dataset.year is the concatenated year and month of the paper's publishing date
                    return b.dataset.year - a.dataset.year;
                }).appendTo($(".badges-wrapper"));
            }

            function update_algorithms_list() {
                // show all the badges
                $("input:checkbox, input:radio").each(function(){
                    $('.' + this.id).show();
                });

                // hide all the badges that don't fit the task
                $("input:checkbox").each(function(){
                    if (!this.checked) {
                        $('.' + this.id).hide();
                    }
                });
                $("input:radio").each(function(){
                    if (this.checked) {
                        $('.algorithm').not('.' + this.id).hide();
                    }
                });

                order_badges();
            }

            // update the badges whenever a checkbox or radio button changes
            $('input:checkbox, input:radio').click(update_algorithms_list);

            update_algorithms_list();
        });
    </script>
    <div class="bordered-container">
    <div class="questionnaire">
        What type of actions does your task require?
        <div style="margin-left: 12px;">
            <input type="radio" id="discrete" name="actions" checked>Discrete actions<br>
            <input type="radio" id="continuous" name="actions">Continuous actions<br>
        </div>
        <input type="checkbox" id="imitation" checked="True">Do you have expert demonstrations for your task?<br>
        <input type="checkbox" id="on-policy" checked="True">Can you collect new data for your task dynamically?<br>
        <input type="checkbox" id="requires-multi-worker" checked="True">Do you have a simulator for your task?<br>
    </div>

    <br>
    <div class="badges-wrapper">
        <div class="algorithm discrete off-policy" data-year="201300">
            <span class="badge">
                <a href="components/agents/value_optimization/dqn.html">DQN</a>
                <br>
                Learns action values for discrete actions, and allows learning from a replay buffer of old experiences.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201710">
            <span class="badge">
                <a href="components/agents/value_optimization/rainbow.html">Rainbow</a>
                <br>
                Combines multiple recent innovations on top of DQN for discrete control, and achieves
                much better results on known benchmarks.
            </span>
        </div>
        <div class="algorithm continuous off-policy" data-year="201712">
            <span class="badge">
                <a href="components/agents/policy_optimization/hac.html">HAC</a>
                <br>
                Works only for continuous actions, and uses a hierarchy of agents to make learning simpler.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201509">
            <span class="badge">
                <a href="components/agents/value_optimization/ddqn.html">DDQN</a>
                <br>
                An improvement over DQN, which learns more accurate action values, and therefore achieves better
                results on known benchmarks.
            </span>
        </div>
        <div class="algorithm discrete on-policy" data-year="201611">
            <span class="badge">
                <a href="components/agents/other/dfp.html">DFP</a>
                <br>
                Works only for discrete actions. Learns to predict the future values of a set of
                measurements from the environment, and then uses a goal vector to weight the importance of each of the
                measurements.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201606">
            <span class="badge">
                <a href="components/agents/value_optimization/mmc.html">MMC</a>
                <br>
                A simple modification to DQN which, instead of learning action values only by bootstrapping the current
                action value prediction, mixes in the total discounted return as well. This helps learn the correct
                action values faster, and is particularly useful for environments with delayed rewards.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201512">
            <span class="badge">
                <a href="components/agents/value_optimization/pal.html">PAL</a>
                <br>
                An improvement over DQN that tries to deal with the approximation errors present in reinforcement
                learning by increasing the gap between the value of the best action and the second best action.
            </span>
        </div>
        <div class="algorithm continuous off-policy" data-year="201603">
            <span class="badge">
                <a href="components/agents/value_optimization/naf.html">NAF</a>
                <br>
                A variant of Q learning for continuous control.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201703">
            <span class="badge">
                <a href="components/agents/value_optimization/ddqn.html">NEC</a>
                <br>
                Uses a memory to "memorize" its experience, and learns much faster by querying the memory on newly
                seen states.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201710">
            <span class="badge">
                <a href="components/agents/value_optimization/qr_dqn.html">QR DQN</a>
                <br>
                Uses quantile regression to learn a distribution over the action values instead of only their mean.
                This boosts performance on known benchmarks.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201602">
            <span class="badge">
                <a href="components/agents/value_optimization/bs_dqn.html">Bootstrapped DQN</a>
                <br>
                Uses an ensemble of DQN networks, where each network learns from a different subset of the experience,
                in order to improve exploration.
            </span>
        </div>
        <div class="algorithm discrete on-policy requires-multi-worker" data-year="201602">
            <span class="badge">
                <a href="components/agents/value_optimization/n_step.html">N-Step Q Learning</a>
                <br>
                A variant of Q learning that bootstraps N steps ahead instead of a single step. Doing this
                makes the algorithm on-policy, and it therefore requires multiple workers training in parallel
                in order to work well.
            </span>
        </div>
        <div class="algorithm discrete off-policy" data-year="201706">
            <span class="badge">
                <a href="components/agents/value_optimization/categorical_dqn.html">Categorical DQN</a>
                <br>
                Learns a distribution over the action values instead of only their mean. This boosts performance on
                known benchmarks, but requires knowing the range of possible accumulated rewards beforehand.
            </span>
        </div>
        <div class="algorithm continuous discrete on-policy" data-year="199200">
            <span class="badge">
                <a href="components/agents/policy_optimization/pg.html">Policy Gradient</a>
                <br>
                Based on the REINFORCE algorithm, this agent learns a probability distribution over the actions.
                This is the simplest algorithm available in Coach, but it also has the worst results.
            </span>
        </div>
        <div class="algorithm discrete continuous on-policy requires-multi-worker" data-year="201602">
            <span class="badge">
                <a href="components/agents/policy_optimization/ac.html">Actor Critic (A3C / A2C)</a>
                <br>
                Combines REINFORCE with a learned baseline (the critic) to improve the stability of learning. It also
                introduced parallel learning with multiple workers to speed up data collection and improve
                learning stability and speed, for both discrete and continuous action spaces.
            </span>
        </div>
        <div class="algorithm continuous off-policy" data-year="201509">
            <span class="badge">
                <a href="components/agents/policy_optimization/ddpg.html">DDPG</a>
                <br>
                An actor critic scheme for continuous action spaces which assumes that the policy is deterministic,
                and is therefore able to use a replay buffer to improve sample efficiency.
            </span>
        </div>
        <div class="algorithm continuous discrete on-policy" data-year="201706">
            <span class="badge">
                <a href="components/agents/policy_optimization/ppo.html">PPO</a>
                <br>
                An actor critic scheme which uses bounded updates to the policy in order to keep the learning process
                very stable.
            </span>
        </div>
        <div class="algorithm discrete continuous on-policy" data-year="201706">
            <span class="badge">
                <a href="components/agents/policy_optimization/cppo.html">Clipped PPO</a>
                <br>
                A simplification of PPO that reduces the code complexity while achieving similar results.
            </span>
        </div>
        <div class="algorithm discrete continuous imitation off-policy" data-year="199700">
            <span class="badge">
                <a href="components/agents/imitation/bc.html">BC</a>
                <br>
                The simplest form of imitation learning. Uses supervised learning on a dataset of expert demonstrations
                in order to imitate the expert's behavior.
            </span>
        </div>
        <div class="algorithm discrete continuous imitation off-policy" data-year="201710">
            <span class="badge">
                <a href="components/agents/imitation/cil.html">CIL</a>
                <br>
                A variant of behavioral cloning, where the learned policy is decomposed into several skills
                (such as turning left or right at an intersection), and each skill is learned separately from the
                human demonstrations.
            </span>
        </div>
    </div>
    </div>


1. Does your environment have a discrete or continuous action space?
---------------------------------------------------------------------

Some reinforcement learning algorithms work only for discrete action spaces, where the agent needs to select
one out of several possible actions. Other algorithms work only for continuous action spaces, where there are
infinitely many possible actions, but some spatial relationship exists between them. A few algorithms can be
applied in both cases. Which algorithms are available to you therefore depends heavily on the task at hand.
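
Once you know your action space, switching between a discrete-action and a continuous-action agent in Coach
usually amounts to swapping the agent parameters used in a preset. Below is a minimal sketch based on the
getting-started example from the Coach README; the class names follow the Coach presets and tutorials, and the
Gym levels are only examples, so adjust them to your own environment and verify the names against your Coach version.

.. code-block:: python

    # Minimal sketch: pick the agent according to the action space of the task.
    # Class names follow the Coach presets/tutorials; verify them for your version.
    from rl_coach.agents.dqn_agent import DQNAgentParameters
    from rl_coach.agents.ddpg_agent import DDPGAgentParameters
    from rl_coach.environments.gym_environment import GymVectorEnvironment
    from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
    from rl_coach.graph_managers.graph_manager import SimpleSchedule

    # discrete action space (e.g. CartPole) -> a value-based agent such as DQN
    discrete_graph = BasicRLGraphManager(
        agent_params=DQNAgentParameters(),
        env_params=GymVectorEnvironment(level='CartPole-v0'),
        schedule_params=SimpleSchedule())

    # continuous action space (e.g. Pendulum) -> an actor critic agent such as DDPG
    continuous_graph = BasicRLGraphManager(
        agent_params=DDPGAgentParameters(),
        env_params=GymVectorEnvironment(level='Pendulum-v0'),
        schedule_params=SimpleSchedule())

    # train the graph that matches your task
    discrete_graph.improve()

The presets shipped with Coach follow the same pattern, so the easiest starting point is often to copy the preset
that is closest to your task and replace the agent parameters class.
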
2. Is collecting more samples from your environment painful?
------------------------------------------------------------

Reinforcement learning algorithms are notorious for the number of samples they need for training.
Typically, on-policy algorithms are much less sample efficient than off-policy algorithms. There are also
other algorithmic features that improve sample efficiency even further, such as the DND used in NEC, or
Hindsight Experience Replay. It is hard to say which algorithm is the most sample efficient, but we can at
least say which ones are not.
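
For example, off-policy agents in Coach keep their experience in a replay buffer, so every collected sample can be
reused for many training updates. The sketch below shows how a preset typically enlarges that buffer; the
``memory.max_size`` attribute and the ``MemoryGranularity`` enum follow the Coach presets, but treat the exact names
as assumptions and verify them against your Coach version.

.. code-block:: python

    # Sketch: reuse old experience with an off-policy agent by enlarging the replay buffer.
    # Attribute names follow the Coach presets and should be verified for your version.
    from rl_coach.agents.dqn_agent import DQNAgentParameters
    from rl_coach.memories.memory import MemoryGranularity

    agent_params = DQNAgentParameters()

    # keep up to 1M transitions so that each environment step can be replayed many times
    agent_params.memory.max_size = (MemoryGranularity.Transitions, 1000000)

On-policy agents, in contrast, discard their experience after each policy update, which is one of the main reasons
they tend to need many more environment samples.
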
3. Do you have a simulator that can be parallelized across multiple processes or nodes?
----------------------------------------------------------------------------------------

Parallelizing training across multiple workers, located on the same node or on different nodes, is a technique
that was introduced in recent years and has had a lot of success in improving the results of many algorithms.
As a consequence, some algorithms do not work well unless they are parallelized across multiple workers,
which requires having a simulator instance for each worker.
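
At the time of writing, parallelization in Coach is requested from the command line: running a preset with the
``-n`` flag, for example ``coach -p CartPole_A3C -n 8``, launches the given number of workers, each with its own
simulator instance. The preset name here is only an example, and the exact flags should be checked against
``coach --help`` for your version.
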
4. Do you have human demonstrations for solving the task?
---------------------------------------------------------

If human demonstrations are available for a task, it is usually better to use them instead of training
with regular reinforcement learning from scratch. To make use of human demonstrations, Coach implements
several tools and algorithms for imitation learning.
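
As a sketch of what this looks like, the imitation presets in Coach pair an imitation agent with a replay buffer
that is pre-filled from a file of recorded demonstrations. The names below (``BCAgentParameters``,
``load_memory_from_file_path``) and the dataset path are assumptions based on the imitation presets shipped with
Coach, so check the preset files under ``rl_coach/presets`` for the exact API in your version.

.. code-block:: python

    # Sketch: behavioral cloning from recorded expert demonstrations.
    # Class and attribute names mirror the Coach imitation presets and are assumptions.
    from rl_coach.agents.bc_agent import BCAgentParameters
    from rl_coach.environments.gym_environment import GymVectorEnvironment
    from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
    from rl_coach.graph_managers.graph_manager import SimpleSchedule

    agent_params = BCAgentParameters()

    # hypothetical path to a pickled replay buffer of expert demonstrations
    agent_params.memory.load_memory_from_file_path = 'datasets/expert_demonstrations.p'

    graph_manager = BasicRLGraphManager(
        agent_params=agent_params,
        env_params=GymVectorEnvironment(level='CartPole-v0'),
        schedule_params=SimpleSchedule())

    # supervised training on the demonstrations; no new environment interaction is required
    graph_manager.improve()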