
Release 0.9

Main changes are detailed below:

New features -
* CARLA 0.7 simulator integration
* Human control of the game play
* Recording of human game play and storing / loading the replay buffer
* Behavioral cloning agent and presets
* Golden tests for several presets
* Selecting between deep / shallow image embedders
* Rendering through pygame (with some boost in performance)

API changes -
* Improved environment wrapper API
* Added an evaluate flag to allow convenient evaluation of existing checkpoints
* Improved frameskip definition in Gym

Bug fixes -
* Fixed loading of checkpoints for agents with more than one network
* Fixed the N Step Q learning agent python3 compatibility
Itai Caspi
2017-12-19 19:27:16 +02:00
committed by GitHub
parent 11faf19649
commit 125c7ee38d
41 changed files with 1713 additions and 260 deletions


@@ -0,0 +1,25 @@
# Behavioral Cloning
**Action space:** Discrete|Continuous
## Network Structure
<p style="text-align: center;">
<img src="../../design_imgs/dqn.png">
</p>
## Algorithm Description
### Training the network
The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as state, action tuples, and with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by the expert for each state.
1. Sample a batch of transitions from the replay buffer.
2. Use the current states as input to the network, and the expert actions as the targets of the network.
3. The loss function for the network is MSE between the predicted and expert actions, and the Q head is used to minimize this loss.
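
To make this concrete, here is a minimal, self-contained sketch of such a training step, using a small Keras network in place of the Q head and randomly generated "expert" data. Everything here (network shape, dataset, hyperparameters) is an assumption for illustration only, not Coach's actual implementation.

```python
# Illustrative behavioral cloning training step (not Coach's code).
import numpy as np
import tensorflow as tf

state_dim, num_actions, batch_size = 4, 2, 32

# Hypothetical expert demonstrations: states and one-hot encoded expert actions, no rewards.
states = np.random.randn(1000, state_dim).astype(np.float32)
expert_actions = np.eye(num_actions)[np.random.randint(num_actions, size=1000)].astype(np.float32)

# A small network standing in for the Q head described above: one output per action.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(state_dim,)),
    tf.keras.layers.Dense(num_actions),
])
model.compile(optimizer='adam', loss='mse')  # MSE between predicted and expert actions

# 1. Sample a batch of transitions from the replay buffer.
idx = np.random.randint(len(states), size=batch_size)
# 2.-3. Use the states as inputs, the expert actions as targets, and minimize the MSE loss.
loss = model.train_on_batch(states[idx], expert_actions[idx])
print('batch MSE loss:', loss)
```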


@@ -0,0 +1,33 @@
# Distributional DQN
**Action space:** Discrete
**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)
## Network Structure
<p style="text-align: center;">
<img src="../../design_imgs/distributional_dqn.png">
</p>
## Algorithmic Description
### Training the network
1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected onto the set of atoms representing the $Q$ value distribution, such that the $i$-th component of the projected update is calculated as follows:
$$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$
where:
* $[\cdot]^b_a$ bounds its argument to the range $[a, b]$
* $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: &nbsp; &nbsp; $\hat{T}_{z_{j}} := r+\gamma z_j$
* $z_i$ are the $N$ atoms spanning $[V_{MIN}, V_{MAX}]$ with spacing $\Delta z = \frac{V_{MAX}-V_{MIN}}{N-1}$
* $p_j(s, a)$ is the probability the network assigns to atom $z_j$ for the state-action pair $(s, a)$
3. The network is trained with the cross entropy loss between its predicted probability distribution and the projected target distribution from step 2. Only the target of the action that was actually taken is updated.
4. Once every few thousand steps, weights are copied from the online network to the target network.
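
For illustration, the projection in step 2 can be written as a short NumPy routine. The function below is a sketch under our own naming, not Coach internals, and it omits terminal-state handling for brevity.

```python
# Sketch of the categorical (C51-style) projection of the Bellman update onto fixed atoms.
import numpy as np

def project_distribution(rewards, next_probs, v_min=-10.0, v_max=10.0, num_atoms=51, gamma=0.99):
    """Project r + gamma * z_j onto the atom support for a batch of transitions.

    rewards:    shape (batch,)
    next_probs: shape (batch, num_atoms), i.e. p_j(s_{t+1}, pi(s_{t+1}))
    returns:    shape (batch, num_atoms), the projected target distribution
    """
    z = np.linspace(v_min, v_max, num_atoms)   # atom locations z_i
    delta_z = (v_max - v_min) / (num_atoms - 1)
    batch = rewards.shape[0]

    # Bellman update for every atom, bounded to [V_MIN, V_MAX]
    tz = np.clip(rewards[:, None] + gamma * z[None, :], v_min, v_max)  # shape (batch, num_atoms)

    # Spread the probability of each source atom j onto its neighboring target atoms i
    target = np.zeros((batch, num_atoms))
    for i in range(num_atoms):
        # [1 - |Tz_j - z_i| / delta_z] clipped to [0, 1], as in the formula above
        weight = np.clip(1.0 - np.abs(tz - z[i]) / delta_z, 0.0, 1.0)
        target[:, i] = np.sum(weight * next_probs, axis=1)
    return target
```

The cross entropy in step 3 is then taken between this projected target and the distribution the online network predicts for the action that was actually taken.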


@@ -1,33 +1,53 @@
Adding a new environment to Coach is as easy as solving CartPole.
There are a few simple steps to follow, and we will walk through them one by one.
1. Coach defines a simple API for implementing a new environment which is defined in environment/environment_wrapper.py.
   There are several functions to implement, but only some of them are mandatory.
   Here are the important ones:

        def _take_action(self, action_idx):
            """
            An environment dependent function that sends an action to the simulator.
            :param action_idx: the action to perform on the environment.
            :return: None
            """
            pass

        def _preprocess_observation(self, observation):
            """
            Do initial observation preprocessing such as cropping, rgb2gray, rescale etc.
            Implementing this function is optional.
            :param observation: a raw observation from the environment
            :return: the preprocessed observation
            """
            return observation

        def _update_state(self):
            """
            Updates the state from the environment.
            Should update self.observation, self.reward, self.done, self.measurements and self.info
            :return: None
            """
            pass

        def _restart_environment_episode(self, force_environment_reset=False):
            """
            Restart the environment on a new episode.
            :param force_environment_reset: Force the environment to reset even if the episode is not done yet.
            :return: None
            """
            pass

        def get_rendered_image(self):
            """
            Return a numpy array containing the image that will be rendered to the screen.
            This can be different from the observation. For example, mujoco's observation is a measurements vector.
            :return: numpy array containing the image that will be rendered to the screen
            """
            return self.observation
2. Make sure to import the environment in environments/\_\_init\_\_.py:
from doom_environment_wrapper import *
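
To make these steps concrete, a new wrapper might look roughly like the following skeleton. The class name, the `self.sim` calls, and the import path are hypothetical placeholders; follow the actual base class defined in environment/environment_wrapper.py.

```python
# Hypothetical skeleton of a new environment wrapper (illustrative only).
import numpy as np
from environment.environment_wrapper import EnvironmentWrapper  # assumed import path


class MySimulatorEnvironmentWrapper(EnvironmentWrapper):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.sim = ...  # create or connect to your simulator here

    def _take_action(self, action_idx):
        # Send the chosen action to the simulator; no return value.
        self.sim.act(action_idx)

    def _preprocess_observation(self, observation):
        # Optional: cropping, rgb2gray, rescaling, etc.
        return observation

    def _update_state(self):
        # Pull the latest state from the simulator into the wrapper's fields.
        self.observation = self._preprocess_observation(self.sim.frame())
        self.reward = self.sim.last_reward()
        self.done = self.sim.episode_finished()
        self.measurements = []
        self.info = {}

    def _restart_environment_episode(self, force_environment_reset=False):
        self.sim.reset()

    def get_rendered_image(self):
        # May differ from the observation (e.g. a full RGB frame).
        return np.array(self.observation)
```

Then import it in environments/\_\_init\_\_.py like the Doom example above.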


docs/docs/usage.md Normal file

@@ -0,0 +1,133 @@
# Coach Usage
## Training an Agent
### Single-threaded Algorithms
This is the most common case. Just choose a preset using the `-p` flag and press enter.
*Example:*
`python coach.py -p CartPole_DQN`
### Multi-threaded Algorithms
Multi-threaded algorithms are very common these days.
They typically achieve the best results, and scale gracefully with the number of threads.
In Coach, running such algorithms is done by selecting a suitable preset, and choosing the number of threads to run using the `-n` flag.
*Example:*
`python coach.py -p CartPole_A3C -n 8`
## Evaluating an Agent
There are several options for evaluating an agent during the training:
* For multi-threaded runs, an evaluation agent will constantly run in the background and evaluate the model during the training.
* For single-threaded runs, it is possible to define an evaluation period through the preset. This will periodically run several evaluation episodes during training.
Additionally, it is possible to save checkpoints of the agent's networks and then run only in evaluation mode.
Saving checkpoints can be done by specifying the number of seconds between storing checkpoints using the `-s` flag.
The checkpoints will be saved into the experiment directory.
Loading a model for evaluation can be done by specifying the `-crd` flag with the experiment directory, and the `--evaluate` flag to disable training.
*Example:*
`python coach.py -p CartPole_DQN -s 60`
`python coach.py -p CartPole_DQN --evaluate -crd CHECKPOINT_RESTORE_DIR`
## Playing with the Environment as a Human
Interacting with the environment as a human can be useful for understanding its difficulties and for collecting data for imitation learning.
In Coach, this can be easily done by selecting a preset that defines the environment to use, and specifying the `--play` flag.
When the environment is loaded, the available keyboard buttons will be printed to the screen.
Pressing the escape key when finished will end the simulation and store the replay buffer in the experiment directory.
*Example:*
`python coach.py -p Breakout_DQN --play`
## Learning Through Imitation Learning
Learning through imitation of human behavior is a nice way to speed up learning.
In Coach, this can be done in two steps -
1. Create a dataset of demonstrations by playing with the environment as a human.
To do so, select an environment type and level through the command line, and specify the `--play` flag.
After this step, a pickle of the replay buffer containing your game play will be stored in the experiment directory, and its path will be printed to the screen.
*Example:*
`python coach.py -et Doom -lvl Basic --play`
2. Next, use an imitation learning preset and set the replay buffer path accordingly.
The path can be set either from the command line or from the preset itself.
*Example:*
`python coach.py -p Doom_Basic_BC -cp='agent.load_memory_from_file_path=\"<experiment dir>/replay_buffer.p\"'`
## Visualizations
### Rendering the Environment
Rendering the environment can be done by using the `-r` flag.
When working with multi-threaded algorithms, the rendered image will represent the game play of the evaluation worker.
When working with single-threaded algorithms, the rendered image will represent the single worker, which can be either training or evaluating.
Keep in mind that rendering the environment in single-threaded algorithms may slow the training to some extent.
When playing with the environment using the `--play` flag, the environment will be rendered automatically without the need for specifying the `-r` flag.
*Example:*
`python coach.py -p Breakout_DQN -r`
### Dumping GIFs
Coach allows storing GIFs of the agent's game play.
To dump GIF files, use the `-dg` flag.
The files are dumped after every evaluation episode, and are saved into the experiment directory, under a gifs sub-directory.
*Example:*
`python coach.py -p Breakout_A3C -n 4 -dg`
## Switching Between Deep Learning Frameworks
Coach uses TensorFlow as its main backend framework, but it also supports neon for some of the algorithms.
By default, TensorFlow will be used. It is possible to switch to neon using the `-f` flag.
*Example:*
`python coach.py -p Doom_Basic_DQN -f neon`
## Additional Flags
There are several convenient flags which are important to know about.
Here we will list most of the flags, but these can be updated from time to time.
The most up to date description can be found by using the `-h` flag.
|Flag |Type |Description |
|-------------------------------|----------|--------------|
|`-p PRESET`, `--preset PRESET`|string |Name of a preset to run (as configured in presets.py) |
|`-l`, `--list` |flag |List all available presets|
|`-e EXPERIMENT_NAME`, `--experiment_name EXPERIMENT_NAME`|string|Experiment name to be used to store the results.|
|`-r`, `--render` |flag |Render environment|
|`-f FRAMEWORK`, `--framework FRAMEWORK`|string|Neural network framework. Available values: tensorflow, neon|
|`-n NUM_WORKERS`, `--num_workers NUM_WORKERS`|int|Number of workers for multi-process based agents, e.g. A3C|
|`--play` |flag |Play as a human by controlling the game with the keyboard. This option will save a replay buffer with the game play.|
|`--evaluate` |flag |Run evaluation only. This is a convenient way to disable training in order to evaluate an existing checkpoint.|
|`-v`, `--verbose` |flag |Don't suppress TensorFlow debug prints.|
|`-s SAVE_MODEL_SEC`, `--save_model_sec SAVE_MODEL_SEC`|int|Time in seconds between saving checkpoints of the model.|
|`-crd CHECKPOINT_RESTORE_DIR`, `--checkpoint_restore_dir CHECKPOINT_RESTORE_DIR`|string|Path to a folder containing a checkpoint to restore the model from.|
|`-dg`, `--dump_gifs` |flag |Enable the gif saving functionality.|
|`-at AGENT_TYPE`, `--agent_type AGENT_TYPE`|string|Choose an agent type class to override on top of the selected preset. If no preset is defined, a preset can be assembled from the command line by combining the `--agent_type`, `--experiment_type` and `--environment_type` settings.|
|`-et ENVIRONMENT_TYPE`, `--environment_type ENVIRONMENT_TYPE`|string|Choose an environment type class to override on top of the selected preset. If no preset is defined, a preset can be assembled from the command line by combining the `--agent_type`, `--experiment_type` and `--environment_type` settings.|
|`-ept EXPLORATION_POLICY_TYPE`, `--exploration_policy_type EXPLORATION_POLICY_TYPE`|string|Choose an exploration policy type class to override on top of the selected preset. If no preset is defined, a preset can be assembled from the command line by combining the `--agent_type`, `--experiment_type` and `--environment_type` settings.|
|`-lvl LEVEL`, `--level LEVEL` |string|Choose the level that will be played in the environment that was selected. This value will override the level parameter in the environment class.|
|`-cp CUSTOM_PARAMETER`, `--custom_parameter CUSTOM_PARAMETER`|string| Semicolon-separated parameters used to override specific parameters on top of the selected preset (or on top of the command-line assembled one). Whenever a parameter value is a string, it should be passed as `'\"string\"'`. For example: `"visualization.render=False;` `num_training_iterations=500;` `optimizer='rmsprop'"`|


@@ -11,6 +11,7 @@ extra_css: [extra.css]
pages:
- Home : index.md
- Design: design.md
- Usage: usage.md
- Algorithms:
- 'DQN' : algorithms/value_optimization/dqn.md
- 'Double DQN' : algorithms/value_optimization/double_dqn.md
@@ -28,6 +29,7 @@ pages:
- 'Proximal Policy Optimization' : algorithms/policy_optimization/ppo.md
- 'Clipped Proximal Policy Optimization' : algorithms/policy_optimization/cppo.md
- 'Direct Future Prediction' : algorithms/other/dfp.md
- 'Behavioral Cloning' : algorithms/imitation/bc.md
- Coach Dashboard : 'dashboard.md'
- Contributing :