Mirror of https://github.com/gryf/coach.git
Release 0.9
Main changes are detailed below:

New features:

* CARLA 0.7 simulator integration
* Human control of the game play
* Recording of human game play and storing / loading the replay buffer
* Behavioral cloning agent and presets
* Golden tests for several presets
* Selecting between deep / shallow image embedders
* Rendering through pygame (with some boost in performance)

API changes:

* Improved environment wrapper API
* Added an evaluate flag to allow convenient evaluation of existing checkpoints
* Improved frameskip definition in Gym

Bug fixes:

* Fixed loading of checkpoints for agents with more than one network
* Fixed Python 3 compatibility of the N-Step Q learning agent
docs/docs/algorithms/imitation/bc.md (new file, 25 lines)
@@ -0,0 +1,25 @@
# Behavioral Cloning

**Actions space:** Discrete | Continuous

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/dqn.png">

</p>
## Algorithm Description

### Training the network

The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as (state, action) tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by the expert for each state.

1. Sample a batch of transitions from the replay buffer.
2. Use the current states as the input to the network, and the expert actions as its targets.
3. The loss function is MSE between the predicted and expert actions, and the Q head is used to minimize this loss.
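The three steps above amount to a standard supervised regression update. Below is a minimal sketch of one such update using a toy linear policy and plain NumPy (illustrative only; the dimensions, learning rate and variable names are made up, and this is not Coach's agent code):

    import numpy as np

    state_dim, action_dim, batch_size, lr = 4, 2, 32, 0.05

    # Expert demonstrations: (state, action) pairs with no reward signal.
    demo_states = np.random.randn(1000, state_dim)
    expert_weights = np.random.randn(state_dim, action_dim)
    demo_actions = demo_states @ expert_weights        # stand-in for recorded expert actions

    weights = np.zeros((state_dim, action_dim))        # the "network" being cloned

    for step in range(500):
        # 1. Sample a batch of transitions from the replay buffer.
        idx = np.random.randint(0, len(demo_states), size=batch_size)
        states, expert_actions = demo_states[idx], demo_actions[idx]

        # 2. Current states are the input, expert actions are the targets.
        predicted_actions = states @ weights

        # 3. MSE loss between predicted and expert actions, minimized by gradient descent.
        grad = states.T @ (predicted_actions - expert_actions) / batch_size
        weights -= lr * grad

    print("final MSE:", np.mean((demo_states @ weights - demo_actions) ** 2))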
@@ -0,0 +1,33 @@
# Distributional DQN

**Actions space:** Discrete

**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)

## Network Structure

<p style="text-align: center;">

<img src="../../design_imgs/distributional_dqn.png">

</p>
## Algorithmic Description

### Training the network

1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected onto the set of atoms representing the $ Q $ value distribution, such that the $i$-th component of the projected update is calculated as follows:

$$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$

where:

* $[\cdot]^b_a$ bounds its argument in the range $[a, b]$
* $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: $\hat{T}_{z_{j}} := r+\gamma z_j$

3. The network is trained with the cross entropy loss between the resulting probability distribution and the target probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once every few thousand steps, the weights are copied from the online network to the target network.
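To make the projection in step 2 concrete, here is a small NumPy sketch of projecting the Bellman-updated atoms back onto the fixed support (illustrative only; the support, rewards and function name are made up, and `next_probs` stands in for $p_j(s_{t+1}, \pi(s_{t+1}))$ of the greedy next action):

    import numpy as np

    def project_distribution(rewards, next_probs, z, gamma=0.99, v_min=-10.0, v_max=10.0):
        """Project the Bellman-updated atoms r + gamma * z back onto the fixed support z."""
        n_atoms = z.shape[0]
        delta_z = (v_max - v_min) / (n_atoms - 1)
        projected = np.zeros_like(next_probs)
        for i, (r, p) in enumerate(zip(rewards, next_probs)):
            tz = np.clip(r + gamma * z, v_min, v_max)   # Bellman update, bounded to [V_MIN, V_MAX]
            b = (tz - v_min) / delta_z                  # fractional atom index of each updated atom
            lower = np.floor(b).astype(int)
            upper = np.ceil(b).astype(int)
            for j in range(n_atoms):
                # Split the probability mass of atom j between its two neighbouring atoms.
                projected[i, lower[j]] += p[j] * (upper[j] - b[j])
                projected[i, upper[j]] += p[j] * (b[j] - lower[j])
                if lower[j] == upper[j]:                # the updated atom landed exactly on the support
                    projected[i, lower[j]] += p[j]
        return projected

    # Example: a batch of 2 transitions with 51 atoms, as in the C51 paper.
    z = np.linspace(-10.0, 10.0, 51)
    next_probs = np.full((2, 51), 1.0 / 51)             # dummy next-state distributions
    print(project_distribution(np.array([1.0, -0.5]), next_probs, z).sum(axis=1))  # each row sums to ~1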
@@ -1,33 +1,53 @@
Adding a new environment to Coach is as easy as solving CartPole.

There are a few simple steps to follow, and we will walk through them one by one.

1. Coach defines a simple API for implementing a new environment, which is defined in environments/environment_wrapper.py.
   There are several functions to implement, but only some of them are mandatory.

   Here are the important ones (a minimal end-to-end sketch is shown after these steps):

        def _take_action(self, action_idx):
            """
            An environment dependent function that sends an action to the simulator.
            :param action_idx: the action to perform on the environment.
            :return: None
            """
            pass

        def _preprocess_observation(self, observation):
            """
            Do initial observation preprocessing such as cropping, rgb2gray, rescale etc.
            Implementing this function is optional.
            :param observation: a raw observation from the environment
            :return: the preprocessed observation
            """
            return observation

        def _update_state(self):
            """
            Updates the state from the environment.
            Should update self.observation, self.reward, self.done, self.measurements and self.info
            :return: None
            """
            pass

        def _restart_environment_episode(self, force_environment_reset=False):
            """
            Restart the environment episode.
            :param force_environment_reset: Force the environment to reset even if the episode is not done yet.
            :return: None
            """
            pass

        def get_rendered_image(self):
            """
            Return a numpy array containing the image that will be rendered to the screen.
            This can be different from the observation. For example, Mujoco's observation is a measurements vector.
            :return: numpy array containing the image that will be rendered to the screen
            """
            return self.observation

2. Make sure to import the environment in environments/\_\_init\_\_.py:

        from doom_environment_wrapper import *
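For illustration only, here is a minimal sketch of what a new environment wrapper might look like when following the API above. The toy line-world logic, the class name and the plain `object` base class are assumptions made to keep the example self-contained; a real wrapper would subclass Coach's environment wrapper class from environments/environment_wrapper.py:

    import numpy as np

    class ToyLineWorldWrapper(object):
        """A hypothetical 1D 'walk to the right edge' simulator exposing the wrapper API."""

        def __init__(self):
            self._position = 0
            self.observation = None
            self.reward = 0.0
            self.done = False
            self.measurements = {}
            self.info = {}
            self._update_state()

        def _take_action(self, action_idx):
            # Send the action to the toy simulator: 0 moves left, 1 moves right.
            self._position += 1 if action_idx == 1 else -1

        def _preprocess_observation(self, observation):
            # Rescale the raw observation to the [-1, 1] range.
            return observation / 10.0

        def _update_state(self):
            # Pull the latest simulator state into the wrapper's fields.
            raw_observation = np.array([self._position], dtype=np.float32)
            self.observation = self._preprocess_observation(raw_observation)
            self.reward = 1.0 if self._position >= 10 else 0.0
            self.done = abs(self._position) >= 10

        def _restart_environment_episode(self, force_environment_reset=False):
            self._position = 0
            self._update_state()

        def get_rendered_image(self):
            # Render the position as a 1 x 21 grayscale strip.
            image = np.zeros((1, 21), dtype=np.uint8)
            image[0, np.clip(self._position, -10, 10) + 10] = 255
            return image

    # Toy usage: step once and inspect the wrapper's fields.
    env = ToyLineWorldWrapper()
    env._take_action(1)
    env._update_state()
    print(env.observation, env.reward, env.done)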
docs/docs/usage.md (new file, 133 lines)
@@ -0,0 +1,133 @@
# Coach Usage

## Training an Agent

### Single-threaded Algorithms

This is the most common case. Just choose a preset using the `-p` flag and press enter.

*Example:*

`python coach.py -p CartPole_DQN`

### Multi-threaded Algorithms

Multi-threaded algorithms are very common these days.
They typically achieve the best results, and scale gracefully with the number of threads.
In Coach, running such algorithms is done by selecting a suitable preset, and choosing the number of threads to run using the `-n` flag.

*Example:*

`python coach.py -p CartPole_A3C -n 8`
## Evaluating an Agent

There are several options for evaluating an agent during training:

* For multi-threaded runs, an evaluation agent will constantly run in the background and evaluate the model during training.

* For single-threaded runs, it is possible to define an evaluation period through the preset. This will run several evaluation episodes once in a while.

Additionally, it is possible to save checkpoints of the agent's networks and then run only in evaluation mode.
Saving checkpoints can be done by specifying the number of seconds between storing checkpoints, using the `-s` flag.
The checkpoints will be saved into the experiment directory.
Loading a model for evaluation can be done by specifying the `-crd` flag with the experiment directory, and the `--evaluate` flag to disable training.

*Example:*

`python coach.py -p CartPole_DQN -s 60`

`python coach.py -p CartPole_DQN --evaluate -crd CHECKPOINT_RESTORE_DIR`
## Playing with the Environment as a Human

Interacting with the environment as a human can be useful for understanding its difficulties and for collecting data for imitation learning.
In Coach, this can easily be done by selecting a preset that defines the environment to use, and specifying the `--play` flag.
When the environment is loaded, the available keyboard buttons will be printed to the screen.
Pressing the escape key when finished will end the simulation and store the replay buffer in the experiment directory.

*Example:*

`python coach.py -p Breakout_DQN --play`
## Learning Through Imitation Learning

Learning through imitation of human behavior is a nice way to speed up learning.
In Coach, this can be done in two steps:

1. Create a dataset of demonstrations by playing with the environment as a human.
   To do so, select an environment type and level through the command line, and specify the `--play` flag.
   After this step, a pickle of the replay buffer containing your game play will be stored in the experiment directory, and its path will be printed to the screen.

    *Example:*

    `python coach.py -et Doom -lvl Basic --play`

2. Next, use an imitation learning preset and set the replay buffer path accordingly.
   The path can be set either from the command line or from the preset itself.

    *Example:*

    `python coach.py -p Doom_Basic_BC -cp='agent.load_memory_from_file_path=\"<experiment dir>/replay_buffer.p\"'`
## Visualizations

### Rendering the Environment

Rendering the environment can be done by using the `-r` flag.
When working with multi-threaded algorithms, the rendered image will represent the game play of the evaluation worker.
When working with single-threaded algorithms, the rendered image will represent the single worker, which can be either training or evaluating.
Keep in mind that rendering the environment in single-threaded algorithms may slow down the training to some extent.
When playing with the environment using the `--play` flag, the environment will be rendered automatically, without the need to specify the `-r` flag.

*Example:*

`python coach.py -p Breakout_DQN -r`

### Dumping GIFs

Coach allows storing GIFs of the agent's game play.
To dump GIF files, use the `-dg` flag.
The files are dumped after every evaluation episode, and are saved into the experiment directory, under a gifs sub-directory.

*Example:*

`python coach.py -p Breakout_A3C -n 4 -dg`
## Switching Between Deep Learning Frameworks

Coach uses TensorFlow as its main backend framework, but it also supports neon for some of the algorithms.
By default, TensorFlow will be used. It is possible to switch to neon using the `-f` flag.

*Example:*

`python coach.py -p Doom_Basic_DQN -f neon`
## Additional Flags

There are several convenient flags which are important to know about.
Here we will list most of the flags, but these can be updated from time to time.
The most up-to-date description can be found by using the `-h` flag.

|Flag |Type |Description |
|-------------------------------|----------|--------------|
|`-p PRESET`, `--preset PRESET`|string |Name of a preset to run (as configured in presets.py) |
|`-l`, `--list` |flag |List all available presets|
|`-e EXPERIMENT_NAME`, `--experiment_name EXPERIMENT_NAME`|string|Experiment name to be used to store the results.|
|`-r`, `--render` |flag |Render the environment|
|`-f FRAMEWORK`, `--framework FRAMEWORK`|string|Neural network framework. Available values: tensorflow, neon|
|`-n NUM_WORKERS`, `--num_workers NUM_WORKERS`|int|Number of workers for multi-process based agents, e.g. A3C|
|`--play` |flag |Play as a human by controlling the game with the keyboard. This option will save a replay buffer with the game play.|
|`--evaluate` |flag |Run evaluation only. This is a convenient way to disable training in order to evaluate an existing checkpoint.|
|`-v`, `--verbose` |flag |Don't suppress TensorFlow debug prints.|
|`-s SAVE_MODEL_SEC`, `--save_model_sec SAVE_MODEL_SEC`|int|Time in seconds between saving checkpoints of the model.|
|`-crd CHECKPOINT_RESTORE_DIR`, `--checkpoint_restore_dir CHECKPOINT_RESTORE_DIR`|string|Path to a folder containing a checkpoint to restore the model from.|
|`-dg`, `--dump_gifs` |flag |Enable the gif saving functionality.|
|`-at AGENT_TYPE`, `--agent_type AGENT_TYPE`|string|Choose an agent type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|`-et ENVIRONMENT_TYPE`, `--environment_type ENVIRONMENT_TYPE`|string|Choose an environment type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|`-ept EXPLORATION_POLICY_TYPE`, `--exploration_policy_type EXPLORATION_POLICY_TYPE`|string|Choose an exploration policy type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type`|
|`-lvl LEVEL`, `--level LEVEL` |string|Choose the level that will be played in the environment that was selected. This value will override the level parameter in the environment class.|
|`-cp CUSTOM_PARAMETER`, `--custom_parameter CUSTOM_PARAMETER`|string|Semicolon separated parameters used to override specific parameters on top of the selected preset (or on top of the command-line assembled one). Whenever a parameter value is a string, it should be inputted as `'\"string\"'`. For example: `"visualization.render=False; num_training_iterations=500; optimizer='rmsprop'"`|
@@ -11,6 +11,7 @@ extra_css: [extra.css]
pages:
- Home : index.md
- Design: design.md
- Usage: usage.md
- Algorithms:
    - 'DQN' : algorithms/value_optimization/dqn.md
    - 'Double DQN' : algorithms/value_optimization/double_dqn.md
@@ -28,6 +29,7 @@ pages:
    - 'Proximal Policy Optimization' : algorithms/policy_optimization/ppo.md
    - 'Clipped Proximal Policy Optimization' : algorithms/policy_optimization/cppo.md
    - 'Direct Future Prediction' : algorithms/other/dfp.md
    - 'Behavioral Cloning' : algorithms/imitation/bc.md

- Coach Dashboard : 'dashboard.md'
- Contributing :