Mirror of https://github.com/gryf/coach.git
Release 0.9
Main changes are detailed below:

New features:

* CARLA 0.7 simulator integration
* Human control of the game play
* Recording of human game play and storing / loading the replay buffer
* Behavioral cloning agent and presets
* Golden tests for several presets
* Selecting between deep / shallow image embedders
* Rendering through pygame (with some boost in performance)

API changes:

* Improved environment wrapper API
* Added an evaluate flag to allow convenient evaluation of existing checkpoints
* Improved frameskip definition in Gym

Bug fixes:

* Fixed loading of checkpoints for agents with more than one network
* Fixed Python 3 compatibility of the N-Step Q learning agent
docs/docs/algorithms/imitation/bc.md (new file, 25 lines)
@@ -0,0 +1,25 @@
# Behavioral Cloning

**Actions space:** Discrete | Continuous

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>
## Algorithm Description

### Training the network

The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as (state, action) tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by the expert for each state.

1. Sample a batch of transitions from the replay buffer.
2. Use the current states as the input to the network, and the expert actions as its targets.
3. The loss function is MSE between the predicted and expert actions, and the Q head is used to minimize this loss.
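The three steps above amount to a standard supervised regression update. Below is a minimal sketch of one such update using a toy linear policy and plain NumPy (illustrative only; the dimensions, learning rate and variable names are made up, and this is not Coach's agent code):

    import numpy as np

    state_dim, action_dim, batch_size, lr = 4, 2, 32, 0.05

    # Expert demonstrations: (state, action) pairs with no reward signal.
    demo_states = np.random.randn(1000, state_dim)
    expert_weights = np.random.randn(state_dim, action_dim)
    demo_actions = demo_states @ expert_weights        # stand-in for recorded expert actions

    weights = np.zeros((state_dim, action_dim))        # the "network" being cloned

    for step in range(500):
        # 1. Sample a batch of transitions from the replay buffer.
        idx = np.random.randint(0, len(demo_states), size=batch_size)
        states, expert_actions = demo_states[idx], demo_actions[idx]

        # 2. Current states are the input, expert actions are the targets.
        predicted_actions = states @ weights

        # 3. MSE loss between predicted and expert actions, minimized by gradient descent.
        grad = states.T @ (predicted_actions - expert_actions) / batch_size
        weights -= lr * grad

    print("final MSE:", np.mean((demo_states @ weights - demo_actions) ** 2))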
@@ -0,0 +1,33 @@
# Distributional DQN

**Actions space:** Discrete

**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/distributional_dqn.png">

</p>
## Algorithmic Description

### Training the network

1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected onto the set of atoms representing the $ Q $ value distribution, such that the $i$-th component of the projected update is calculated as follows:

$$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$

where:

* $[\cdot]^b_a$ bounds its argument in the range $[a, b]$
* $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: $\hat{T}_{z_{j}} := r+\gamma z_j$

3. The network is trained with the cross entropy loss between the resulting probability distribution and the target probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once every few thousand steps, the weights are copied from the online network to the target network.
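To make the projection in step 2 concrete, here is a small NumPy sketch of projecting the Bellman-updated atoms back onto the fixed support (illustrative only; the support, rewards and function name are made up, and `next_probs` stands in for $p_j(s_{t+1}, \pi(s_{t+1}))$ of the greedy next action):

    import numpy as np

    def project_distribution(rewards, next_probs, z, gamma=0.99, v_min=-10.0, v_max=10.0):
        """Project the Bellman-updated atoms r + gamma * z back onto the fixed support z."""
        n_atoms = z.shape[0]
        delta_z = (v_max - v_min) / (n_atoms - 1)
        projected = np.zeros_like(next_probs)
        for i, (r, p) in enumerate(zip(rewards, next_probs)):
            tz = np.clip(r + gamma * z, v_min, v_max)   # Bellman update, bounded to [V_MIN, V_MAX]
            b = (tz - v_min) / delta_z                  # fractional atom index of each updated atom
            lower = np.floor(b).astype(int)
            upper = np.ceil(b).astype(int)
            for j in range(n_atoms):
                # Split the probability mass of atom j between its two neighbouring atoms.
                projected[i, lower[j]] += p[j] * (upper[j] - b[j])
                projected[i, upper[j]] += p[j] * (b[j] - lower[j])
                if lower[j] == upper[j]:                # the updated atom landed exactly on the support
                    projected[i, lower[j]] += p[j]
        return projected

    # Example: a batch of 2 transitions with 51 atoms, as in the C51 paper.
    z = np.linspace(-10.0, 10.0, 51)
    next_probs = np.full((2, 51), 1.0 / 51)             # dummy next-state distributions
    print(project_distribution(np.array([1.0, -0.5]), next_probs, z).sum(axis=1))  # each row sums to ~1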
@@ -1,33 +1,53 @@
Adding a new environment to Coach is as easy as solving CartPole.

There are a few simple steps to follow, and we will walk through them one by one.

1. Coach defines a simple API for implementing a new environment, which is defined in environments/environment_wrapper.py.
   There are several functions to implement, but only some of them are mandatory.

   Here are the important ones (a minimal end-to-end sketch is shown after these steps):

        def _take_action(self, action_idx):
            """
            An environment dependent function that sends an action to the simulator.
            :param action_idx: the action to perform on the environment.
            :return: None
            """
            pass

        def _preprocess_observation(self, observation):
            """
            Do initial observation preprocessing such as cropping, rgb2gray, rescale etc.
            Implementing this function is optional.
            :param observation: a raw observation from the environment
            :return: the preprocessed observation
            """
            return observation

        def _update_state(self):
            """
            Updates the state from the environment.
            Should update self.observation, self.reward, self.done, self.measurements and self.info
            :return: None
            """
            pass

        def _restart_environment_episode(self, force_environment_reset=False):
            """
            Restart the environment episode.
            :param force_environment_reset: Force the environment to reset even if the episode is not done yet.
            :return: None
            """
            pass

        def get_rendered_image(self):
            """
            Return a numpy array containing the image that will be rendered to the screen.
            This can be different from the observation. For example, Mujoco's observation is a measurements vector.
            :return: numpy array containing the image that will be rendered to the screen
            """
            return self.observation

2. Make sure to import the environment in environments/\_\_init\_\_.py:

        from doom_environment_wrapper import *
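For illustration only, here is a minimal sketch of what a new environment wrapper might look like when following the API above. The toy line-world logic, the class name and the plain `object` base class are assumptions made to keep the example self-contained; a real wrapper would subclass Coach's environment wrapper class from environments/environment_wrapper.py:

    import numpy as np

    class ToyLineWorldWrapper(object):
        """A hypothetical 1D 'walk to the right edge' simulator exposing the wrapper API."""

        def __init__(self):
            self._position = 0
            self.observation = None
            self.reward = 0.0
            self.done = False
            self.measurements = {}
            self.info = {}
            self._update_state()

        def _take_action(self, action_idx):
            # Send the action to the toy simulator: 0 moves left, 1 moves right.
            self._position += 1 if action_idx == 1 else -1

        def _preprocess_observation(self, observation):
            # Rescale the raw observation to the [-1, 1] range.
            return observation / 10.0

        def _update_state(self):
            # Pull the latest simulator state into the wrapper's fields.
            raw_observation = np.array([self._position], dtype=np.float32)
            self.observation = self._preprocess_observation(raw_observation)
            self.reward = 1.0 if self._position >= 10 else 0.0
            self.done = abs(self._position) >= 10

        def _restart_environment_episode(self, force_environment_reset=False):
            self._position = 0
            self._update_state()

        def get_rendered_image(self):
            # Render the position as a 1 x 21 grayscale strip.
            image = np.zeros((1, 21), dtype=np.uint8)
            image[0, np.clip(self._position, -10, 10) + 10] = 255
            return image

    # Toy usage: step once and inspect the wrapper's fields.
    env = ToyLineWorldWrapper()
    env._take_action(1)
    env._update_state()
    print(env.observation, env.reward, env.done)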
docs/docs/usage.md (new file, 133 lines)
@@ -0,0 +1,133 @@
# Coach Usage

## Training an Agent

### Single-threaded Algorithms

This is the most common case. Just choose a preset using the `-p` flag and press enter.

*Example:*

`python coach.py -p CartPole_DQN`

### Multi-threaded Algorithms

Multi-threaded algorithms are very common these days.
They typically achieve the best results, and scale gracefully with the number of threads.
In Coach, running such algorithms is done by selecting a suitable preset, and choosing the number of threads to run using the `-n` flag.

*Example:*

`python coach.py -p CartPole_A3C -n 8`
## Evaluating an Agent

There are several options for evaluating an agent during training:

* For multi-threaded runs, an evaluation agent will constantly run in the background and evaluate the model during training.

* For single-threaded runs, it is possible to define an evaluation period through the preset. This will run several evaluation episodes once in a while.

Additionally, it is possible to save checkpoints of the agent's networks and then run only in evaluation mode.
Saving checkpoints can be done by specifying the number of seconds between storing checkpoints, using the `-s` flag.
The checkpoints will be saved into the experiment directory.
Loading a model for evaluation can be done by specifying the `-crd` flag with the experiment directory, and the `--evaluate` flag to disable training.

*Example:*

`python coach.py -p CartPole_DQN -s 60`

`python coach.py -p CartPole_DQN --evaluate -crd CHECKPOINT_RESTORE_DIR`
## Playing with the Environment as a Human

Interacting with the environment as a human can be useful for understanding its difficulties and for collecting data for imitation learning.
In Coach, this can easily be done by selecting a preset that defines the environment to use, and specifying the `--play` flag.
When the environment is loaded, the available keyboard buttons will be printed to the screen.
Pressing the escape key when finished will end the simulation and store the replay buffer in the experiment directory.

*Example:*

`python coach.py -p Breakout_DQN --play`
## Learning Through Imitation Learning

Learning through imitation of human behavior is a nice way to speed up learning.
In Coach, this can be done in two steps:

1. Create a dataset of demonstrations by playing with the environment as a human.
   To do so, select an environment type and level through the command line, and specify the `--play` flag.
   After this step, a pickle of the replay buffer containing your game play will be stored in the experiment directory, and its path will be printed to the screen.

    *Example:*

    `python coach.py -et Doom -lvl Basic --play`

2. Next, use an imitation learning preset and set the replay buffer path accordingly.
   The path can be set either from the command line or from the preset itself.

    *Example:*

    `python coach.py -p Doom_Basic_BC -cp='agent.load_memory_from_file_path=\"<experiment dir>/replay_buffer.p\"'`
## Visualizations

### Rendering the Environment

Rendering the environment can be done by using the `-r` flag.
When working with multi-threaded algorithms, the rendered image will represent the game play of the evaluation worker.
When working with single-threaded algorithms, the rendered image will represent the single worker, which can be either training or evaluating.
Keep in mind that rendering the environment in single-threaded algorithms may slow down the training to some extent.
When playing with the environment using the `--play` flag, the environment will be rendered automatically, without the need to specify the `-r` flag.

*Example:*

`python coach.py -p Breakout_DQN -r`

### Dumping GIFs

Coach allows storing GIFs of the agent's game play.
To dump GIF files, use the `-dg` flag.
The files are dumped after every evaluation episode, and are saved into the experiment directory, under a gifs sub-directory.

*Example:*

`python coach.py -p Breakout_A3C -n 4 -dg`
## Switching Between Deep Learning Frameworks

Coach uses TensorFlow as its main backend framework, but it also supports neon for some of the algorithms.
By default, TensorFlow will be used. It is possible to switch to neon using the `-f` flag.

*Example:*

`python coach.py -p Doom_Basic_DQN -f neon`
## Additional Flags

There are several convenient flags which are important to know about.
Here we will list most of the flags, but these can be updated from time to time.
The most up-to-date description can be found by using the `-h` flag.

|Flag |Type |Description |
|-------------------------------|----------|--------------|
|`-p PRESET`, `--preset PRESET`|string |Name of a preset to run (as configured in presets.py) |
|`-l`, `--list` |flag |List all available presets|
|`-e EXPERIMENT_NAME`, `--experiment_name EXPERIMENT_NAME`|string|Experiment name to be used to store the results.|
|`-r`, `--render` |flag |Render the environment|
|`-f FRAMEWORK`, `--framework FRAMEWORK`|string|Neural network framework. Available values: tensorflow, neon|
|`-n NUM_WORKERS`, `--num_workers NUM_WORKERS`|int|Number of workers for multi-process based agents, e.g. A3C|
|`--play` |flag |Play as a human by controlling the game with the keyboard. This option will save a replay buffer with the game play.|
|`--evaluate` |flag |Run evaluation only. This is a convenient way to disable training in order to evaluate an existing checkpoint.|
|`-v`, `--verbose` |flag |Don't suppress TensorFlow debug prints.|
|`-s SAVE_MODEL_SEC`, `--save_model_sec SAVE_MODEL_SEC`|int|Time in seconds between saving checkpoints of the model.|
|`-crd CHECKPOINT_RESTORE_DIR`, `--checkpoint_restore_dir CHECKPOINT_RESTORE_DIR`|string|Path to a folder containing a checkpoint to restore the model from.|
|`-dg`, `--dump_gifs` |flag |Enable the gif saving functionality.|
|`-at AGENT_TYPE`, `--agent_type AGENT_TYPE`|string|Choose an agent type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|`-et ENVIRONMENT_TYPE`, `--environment_type ENVIRONMENT_TYPE`|string|Choose an environment type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|`-ept EXPLORATION_POLICY_TYPE`, `--exploration_policy_type EXPLORATION_POLICY_TYPE`|string|Choose an exploration policy type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|`-lvl LEVEL`, `--level LEVEL` |string|Choose the level that will be played in the environment that was selected. This value will override the level parameter in the environment class.|
|`-cp CUSTOM_PARAMETER`, `--custom_parameter CUSTOM_PARAMETER`|string|Semicolon separated parameters used to override specific parameters on top of the selected preset (or on top of the command-line assembled one). Whenever a parameter value is a string, it should be inputted as `'\"string\"'`. For example: `"visualization.render=False; num_training_iterations=500; optimizer='rmsprop'"`|
@@ -11,6 +11,7 @@ extra_css: [extra.css]
pages:
- Home : index.md
- Design: design.md
- Usage: usage.md
- Algorithms:
    - 'DQN' : algorithms/value_optimization/dqn.md
    - 'Double DQN' : algorithms/value_optimization/double_dqn.md
@@ -28,6 +29,7 @@ pages:
    - 'Proximal Policy Optimization' : algorithms/policy_optimization/ppo.md
    - 'Clipped Proximal Policy Optimization' : algorithms/policy_optimization/cppo.md
    - 'Direct Future Prediction' : algorithms/other/dfp.md
    - 'Behavioral Cloning' : algorithms/imitation/bc.md

- Coach Dashboard : 'dashboard.md'
- Contributing :