
update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
This commit is contained in:
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions


@@ -11,7 +11,52 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a very simple graph containing a single clipped ppo agent running with the CartPole-v0 Gym environment:"
"## Using Coach from the Command Line"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When running Coach from the command line, we use a Preset module to define the experiment parameters.\n",
"As its name implies, a preset is a predefined set of parameters to run some agent on some environment.\n",
"Coach has many predefined presets that follow the algorithms definitions in the published papers, and allows training some of the existing algorithms with essentially no coding at all. This presets can easily be run from the command line. For example:\n",
"\n",
"`coach -p CartPole_DQN`\n",
"\n",
"You can find all the predefined presets under the `presets` directory, or by listing them using the following command:\n",
"\n",
"`coach -l`\n",
"\n",
"Coach can also be used with an externally defined preset by passing the absolute path to the module and the name of the graph manager object which is defined in the preset: \n",
"\n",
"`coach -p /home/my_user/my_agent_dir/my_preset.py:graph_manager`\n",
"\n",
"Some presets are generic for multiple environment levels, and therefore require defining the specific level through the command line:\n",
"\n",
"`coach -p Atari_DQN -lvl breakout`\n",
"\n",
"There are plenty of other command line arguments you can use in order to customize the experiment. A full documentation of the available arguments can be found using the following command:\n",
"\n",
"`coach -h`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Coach as a Library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, Coach can be used a library directly from python. As described above, Coach uses the presets mechanism to define the experiments. A preset is essentially a python module which instantiates a `GraphManager` object. The graph manager is a container that holds the agents and the environments, and has some additional parameters for running the experiment, such as visualization parameters. The graph manager acts as the scheduler which orchestrates the experiment.\n",
"\n",
"Let's start with some examples.\n",
"\n",
"Creating a very simple graph containing a single Clipped PPO agent running with the CartPole-v0 Gym environment:"
]
},
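{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal preset along these lines could look roughly as follows (this is only a sketch - the default `SimpleSchedule` is assumed here, and the exact code cell in the notebook may differ):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters\n",
"from rl_coach.environments.gym_environment import GymVectorEnvironment\n",
"from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager\n",
"from rl_coach.graph_managers.graph_manager import SimpleSchedule\n",
"from rl_coach.base_parameters import VisualizationParameters\n",
"\n",
"# a single Clipped PPO agent interacting with the CartPole-v0 Gym environment\n",
"graph_manager = BasicRLGraphManager(\n",
"    agent_params=ClippedPPOAgentParameters(),\n",
"    env_params=GymVectorEnvironment(level='CartPole-v0'),\n",
"    schedule_params=SimpleSchedule(),\n",
"    vis_params=VisualizationParameters()\n",
")\n",
"\n",
"# run heatup, training and evaluation according to the schedule\n",
"graph_manager.improve()"
]
},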
{
@@ -52,7 +97,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Running each phase manually:"
"### Running each phase manually"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The graph manager simplifies the scheduling process by encapsulating the calls to each of the training phases. Sometimes, it can be beneficial to have a more fine grained control over the scheduling process. This can be easily done by calling the individual phase functions directly:"
]
},
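{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch (reusing the `graph_manager` defined above; the step counts below are arbitrary placeholders):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.core_types import EnvironmentSteps\n",
"\n",
"# collect initial experience into the memory using the exploration policy\n",
"graph_manager.heatup(EnvironmentSteps(100))\n",
"\n",
"# alternate between acting on the environment and training the networks\n",
"graph_manager.train_and_act(EnvironmentSteps(100))\n",
"\n",
"# run a pure evaluation phase\n",
"graph_manager.evaluate(EnvironmentSteps(100))"
]
},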
{
@@ -77,7 +129,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Changing the default parameters"
"### Changing the default parameters\n",
"\n",
"Agents in Coach are defined along with some default parameters that follow the published paper definition. This may be sufficient when running the exact same experiments as in the paper, but otherwise, there would probably need to be some changes made to the algorithm parameters. Again, this is easily modifiable, and all the internal parameters are accessible from within the preset:"
]
},
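{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a sketch of the kind of changes that can be made (the specific attributes and values below are illustrative assumptions, not tuned recommendations):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters\n",
"\n",
"agent_params = ClippedPPOAgentParameters()\n",
"\n",
"# algorithm parameters, e.g. the discount factor\n",
"agent_params.algorithm.discount = 0.98\n",
"\n",
"# network parameters, e.g. the learning rate of the 'main' network\n",
"agent_params.network_wrappers['main'].learning_rate = 0.0001"
]
},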
{
@@ -117,11 +171,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using a custom gym environment\n",
"### Using a custom gym environment\n",
"\n",
"We can use a custom gym environment without registering it. \n",
"We just need the path to the environment module.\n",
"We can also pass custom parameters for the environment __init__"
"We can also pass custom parameters for the environment `__init__` function as `additional_simulator_parameters`."
]
},
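{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of how this could look (the module path, class name and the `time_limit` keyword are hypothetical examples):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.environments.gym_environment import GymVectorEnvironment\n",
"\n",
"# point to a custom environment module which was not registered with gym\n",
"env_params = GymVectorEnvironment(level='my_environment_module.py:MyEnvironmentClass')\n",
"\n",
"# keyword arguments that will be passed to the environment's __init__ function\n",
"env_params.additional_simulator_parameters = {'time_limit': 500}"
]
},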
{
@@ -164,12 +218,21 @@
"graph_manager.improve()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The path to the environment can also be set as an absolute path, as follows: `<absolute python module path>:<environment class>`. For example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"env_params = GymVectorEnvironment(level='/home/user/my_environment_dir/my_environment_module.py:MyEnvironmentClass')"
]
}
],
"metadata": {


@@ -4,14 +4,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial we'll build a new agent that implements the Categorical Deep Q Network algorithm (https://arxiv.org/pdf/1707.06887.pdf), and a preset that runs the agent on the breakout game of the Atari environment."
"# Implementing an Algorithm\n",
"\n",
"In this tutorial we'll build a new agent that implements the Categorical Deep Q Network (C51) algorithm (https://arxiv.org/pdf/1707.06887.pdf), and a preset that runs the agent on the 'Breakout' game of the Atari environment.\n",
"\n",
"Implementing an algorithm typically consists of 3 main parts:\n",
"\n",
"1. Implementing the agent object\n",
"2. Implementing the network head (optional)\n",
"3. Implementing a preset to run the agent on some environment\n",
"\n",
"The entire agent can be defined outside of the Coach framework, but in Coach you can find multiple predefined agents under the `agents` directory, network heads under the `architecure/tensorflow_components/heads` directory, and presets under the `presets` directory, for you to reuse.\n",
"\n",
"For more information, we recommend going over the following page in the documentation: https://nervanasystems.github.io/coach/contributing/add_agent/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Agent"
"## The Network Head"
]
},
{
@@ -22,7 +34,9 @@
"\n",
"A head is the final part of the network. It takes the embedding from the middleware embedder and passes it through a neural network to produce the output of the network. There can be multiple heads in a network, and each one has an assigned loss function. The heads are algorithm dependent.\n",
"\n",
"It will be defined in a new file - ```architectures/tensorflow_components/heads/categorical_dqn_head.py```.\n",
"The rest of the network can be reused from the predefined parts, and the input embedder and middleware structure can also be modified, but we won't go into that in this tutorial.\n",
"\n",
"The head will typically be defined in a new file - ```architectures/tensorflow_components/heads/categorical_dqn_head.py```.\n",
"\n",
"First - some imports."
]
@@ -50,7 +64,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define a class - ```CategoricalQHeadParameters``` - containing the head parameters and the head itself. "
"Now let's define a class - ```CategoricalQHead``` class. Each class in Coach has a complementary Parameters class which defines its constructor parameters. So we will additionally define the ```CategoricalQHeadParameters``` class. The network structure should be defined in the `_build_module` function, which gets the previous layer output as an argument. In this function there are several variables that should be defined:\n",
"* `self.input` - (optional) a list of any additional input to the head\n",
"* `self.output` - the output of the head, which is also one of the outputs of the network\n",
"* `self.target` - a placeholder for the targets that will be used to train the network\n",
"* `self.regularizations` - (optional) any additional regularization losses that will be applied to the network\n",
"* `self.loss` - the loss that will be used to train the network\n",
"\n",
"Categorical DQN uses the same network as DQN, and only changes the last layer to output #actions x #atoms elements with a softmax function. Additionally, we update the loss function to cross entropy."
]
},
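{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the idea before diving into the actual head class, here is a plain TensorFlow sketch of the output layer and loss (the embedding size and placeholder shapes are assumptions, and this is not the Coach head implementation itself):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"num_actions, num_atoms = 4, 51\n",
"\n",
"# embedding arriving from the middleware\n",
"middleware_output = tf.placeholder(tf.float32, [None, 512])\n",
"# projected target distribution per action, used as the training target\n",
"target_distribution = tf.placeholder(tf.float32, [None, num_actions, num_atoms])\n",
"\n",
"# the last layer outputs #actions x #atoms elements, passed through a softmax over the atoms\n",
"logits = tf.layers.dense(middleware_output, num_actions * num_atoms)\n",
"logits = tf.reshape(logits, [-1, num_actions, num_atoms])\n",
"output = tf.nn.softmax(logits)\n",
"\n",
"# cross entropy loss between the target distribution and the predicted distribution\n",
"loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=target_distribution, logits=logits))"
]
},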
{
@@ -94,7 +115,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's go ahead and define the network parameters - it will reuse the DQN network parameters but the head parameters will be our ```CategoricalQHeadParameters```"
"## The Agent\n",
"\n",
"The agent will implement the Categorical DQN algorithm. Each agent has a complementary ```AgentParameters``` class, which allows selecting the parameters of the agent sub modules: \n",
"* the **algorithm**\n",
"* the **exploration policy**\n",
"* the **memory**\n",
"* the **networks**\n",
"\n",
"Now let's go ahead and define the network parameters - it will reuse the DQN network parameters but the head parameters will be our ```CategoricalQHeadParameters```. The network parameters allows selecting any number of heads for the network by defining them in a list, but in this case we only have a single head, so we will point to its parameters class."
]
},
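{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of what this could look like (assuming the ```CategoricalQHeadParameters``` class defined above; the actual notebook cell may differ slightly):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.agents.dqn_agent import DQNNetworkParameters\n",
"\n",
"\n",
"class CategoricalDQNNetworkParameters(DQNNetworkParameters):\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"        # replace the default Q head with the categorical head defined above\n",
"        self.heads_parameters = [CategoricalQHeadParameters()]"
]
},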
{
@@ -116,7 +145,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll define the algorithm parameters, which are the same as the DQN algorithm parameters, with the addition of the Categorical DQN specific v_min, v_max and number of atoms.\n",
"Next we'll define the algorithm parameters, which are the same as the DQN algorithm parameters, with the addition of the Categorical DQN specific `v_min`, `v_max` and number of atoms.\n",
"We'll also define the parameters of the exploration policy, which is epsilon greedy with epsilon starting at a value of 1.0 and decaying to 0.01 throughout 1,000,000 steps."
]
},
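{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of these parameter classes (the values for `v_min`, `v_max`, the number of atoms and the epsilon schedule follow the paper, but treat them as illustrative assumptions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.agents.dqn_agent import DQNAlgorithmParameters\n",
"from rl_coach.exploration_policies.e_greedy import EGreedyParameters\n",
"from rl_coach.schedules import LinearSchedule\n",
"\n",
"\n",
"class CategoricalDQNAlgorithmParameters(DQNAlgorithmParameters):\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"        # support of the value distribution\n",
"        self.v_min = -10.0\n",
"        self.v_max = 10.0\n",
"        self.atoms = 51\n",
"\n",
"\n",
"class CategoricalDQNExplorationParameters(EGreedyParameters):\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"        # epsilon decays linearly from 1.0 to 0.01 over 1,000,000 steps\n",
"        self.epsilon_schedule = LinearSchedule(1.0, 0.01, 1000000)"
]
},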
@@ -150,7 +179,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define the agent parameters class which contains all the parameters to be used by the agent - the network, algorithm and exploration parameters that we defined above, and also the parameters of the memory module to be used, which is experience replay in this case."
"Now let's define the agent parameters class which contains all the parameters to be used by the agent - the network, algorithm and exploration parameters that we defined above, and also the parameters of the memory module to be used, which is the default experience replay buffer in this case. \n",
"Notice that the networks are defined as a dictionary, where the key is the name of the network and the value is the network parameters. This will allow us to later access each of the networks through `self.networks[network_name]`.\n",
"\n",
"The `path` property connects the parameters class to its corresponding class that is parameterized. In this case, it is the `CategoricalDQNAgent` class that we'll define in a moment."
]
},
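{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting it together, a sketch of the agent parameters class (assuming the parameter classes defined above; the `path` string assumes the agent will live under the `agents` directory):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.base_parameters import AgentParameters\n",
"from rl_coach.memories.non_episodic.experience_replay import ExperienceReplayParameters\n",
"\n",
"\n",
"class CategoricalDQNAgentParameters(AgentParameters):\n",
"    def __init__(self):\n",
"        super().__init__(algorithm=CategoricalDQNAlgorithmParameters(),\n",
"                         exploration=CategoricalDQNExplorationParameters(),\n",
"                         memory=ExperienceReplayParameters(),\n",
"                         networks={'main': CategoricalDQNNetworkParameters()})\n",
"\n",
"    @property\n",
"    def path(self):\n",
"        return 'agents.categorical_dqn_agent:CategoricalDQNAgent'"
]
},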
{
@@ -181,7 +213,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step is to define the agent itself - ```CategoricalDQNAgent``` - which is a type of value optimization agent so it will inherit the ```ValueOptimizationAgent``` class. Our agent will implement the ```learn_from_batch``` function which updates the agent's networks according to an input batch of transitions."
"The last step is to define the agent itself - ```CategoricalDQNAgent``` - which is a type of value optimization agent so it will inherit the ```ValueOptimizationAgent``` class. It could have also inheritted ```DQNAgent```, which would result in the same functionality. Our agent will implement the ```learn_from_batch``` function which updates the agent's networks according to an input batch of transitions.\n",
"\n",
"Agents typically need to implement the training function - `learn_from_batch`, and a function that defines which actions to select given a state - `choose_action`. In our case, we will reuse the `choose_action` function implemented by the generic `ValueOptimizationAgent`, and just update the internal function for fetching q values for each of the actions - `get_all_q_values_for_states`.\n",
"\n",
"This code may look intimidating at first glance, but basically it is just following the algorithm description in the Distributional DQN paper:\n",
"<img src=\"files/categorical_dqn.png\" width=400>"
]
},
{
@@ -245,17 +282,33 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Preset"
"Some important things to notice here:\n",
"* `self.networks['main']` is a NetworkWrapper object. It holds all the copies of the 'main' network: \n",
" - a **global network** which is shared between all the workers in distributed training\n",
" - an **online network** which is a local copy of the network intended to keep the weights static between training steps\n",
" - a **target network** which is a local, slowly updating copy of the network, intended to keep the targets of the training process more stable\n",
" In this case, we have the online network and the target network. The global network will only be created if we run the algorithm with multiple workers, as in A3C, for example. \n",
"* There are two network prediction functions available - `predict` and `parallel_prediction`. `predict` is quite straightforward - get some inputs, forward them through the network and return the output. `parallel_prediction` is an optimized variant of `predict`, which allows running a prediction on the online and target network in parallel, instead of running them sequentially.\n",
"* The network `train_and_sync_networks` function makes a single training step - running a forward pass of the online network, calculating the losses, running a backward pass to calculate the gradients and applying the gradients to the network weights. If multiple workers are used, instead of applying the gradients to the online network weights, they are applied to the global (shared) network weights, and then the weights are copied back to the online network."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The new preset will be defined in a new file - ```presets/atari_categorical_dqn.py```.\n",
"## The Preset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final part is the preset, which will run our agent on some existing environment with any custom parameters.\n",
"\n",
"The new preset will be typically be defined in a new file - ```presets/atari_categorical_dqn.py```.\n",
"\n",
"First - let's define the agent parameters"
"First - let's select the agent parameters we defined above. \n",
"It is possible to modify internal parameters such as the learning rate."
]
},
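{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, reusing the ```CategoricalDQNAgentParameters``` class defined earlier in this tutorial (the learning rate value below is only for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"agent_params = CategoricalDQNAgentParameters()\n",
"\n",
"# optionally override internal parameters, e.g. the learning rate of the 'main' network\n",
"agent_params.network_wrappers['main'].learning_rate = 0.00025"
]
},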
{
@@ -275,7 +328,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Environment parameters"
"Now, let's define the environment parameters. We will use the default Atari parameters (frame skip of 4, taking the max over subsequent frames, etc.), and we will select the 'Breakout' game level."
]
},
{
@@ -285,47 +338,16 @@
"outputs": [],
"source": [
"from rl_coach.environments.gym_environment import Atari, atari_deterministic_v4\n",
"from rl_coach.environments.environment import MaxDumpMethod, SelectedPhaseOnlyDumpMethod, SingleLevelSelection\n",
"\n",
"\n",
"env_params = Atari()\n",
"env_params.level = SingleLevelSelection(atari_deterministic_v4)"
"env_params = Atari(level='BreakoutDeterministic-v4')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Schedule and visualization parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.graph_managers.graph_manager import ScheduleParameters\n",
"from rl_coach.core_types import EnvironmentSteps, RunPhase\n",
"from rl_coach.base_parameters import VisualizationParameters\n",
"\n",
"\n",
"schedule_params = ScheduleParameters()\n",
"schedule_params.improve_steps = EnvironmentSteps(50000000)\n",
"schedule_params.steps_between_evaluation_periods = EnvironmentSteps(250000)\n",
"schedule_params.evaluation_steps = EnvironmentSteps(135000)\n",
"schedule_params.heatup_steps = EnvironmentSteps(50000)\n",
"\n",
"vis_params = VisualizationParameters()\n",
"vis_params.video_dump_methods = [SelectedPhaseOnlyDumpMethod(RunPhase.TEST), MaxDumpMethod()]\n",
"vis_params.dump_mp4 = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Connecting all the dots together - we'll define a graph manager with the Categorial DQN agent parameters, the Atari environment parameters, and the scheduling and visualization parameters defined above"
"Connecting all the dots together - we'll define a graph manager with the Categorial DQN agent parameters, the Atari environment parameters, and the scheduling and visualization parameters"
]
},
{
@@ -335,11 +357,11 @@
"outputs": [],
"source": [
"from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager\n",
"\n",
"from rl_coach.base_parameters import VisualizationParameters\n",
"from rl_coach.environments.gym_environment import atari_schedule\n",
"\n",
"graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,\n",
" schedule_params=schedule_params, vis_params=vis_params)\n",
"graph_manager.env_params.level.select('breakout')\n",
" schedule_params=atari_schedule, vis_params=VisualizationParameters())\n",
"graph_manager.visualization_parameters.render = True"
]
},
@@ -347,8 +369,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Running the Preset\n",
"(this is normally done from command line by running ```coach -p Atari_C51 ... ```)"
"## Running the Preset\n",
"(this is normally done from command line by running ```coach -p Atari_C51 -lvl breakout```)"
]
},
{
@@ -357,30 +379,9 @@
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.base_parameters import TaskParameters, Frameworks\n",
"\n",
"log_path = '../experiments/atari_categorical_dqn'\n",
"if not os.path.exists(log_path):\n",
" os.makedirs(log_path)\n",
" \n",
"task_parameters = TaskParameters(framework_type=Frameworks.tensorflow, \n",
" evaluate_only=False,\n",
" experiment_path=log_path)\n",
"\n",
"task_parameters.__dict__['checkpoint_save_secs'] = None\n",
"\n",
"graph_manager.create_graph(task_parameters)\n",
"\n",
"# let the adventure begin\n",
"graph_manager.improve()\n"
"graph_manager.improve()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {


@@ -1,20 +1,24 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"In this tutorial we'll add the DeepMind Control Suite environment to Coach, and create a preset that trains the DDPG agent on the new environment."
"# Adding an Environment \n",
"\n",
"Adding your custom environments to Coach will allow you to solve your own tasks using any of the predefined algorithms. There are two ways for adding your own environment to Coach:\n",
"1. Implementing your environment as an OpenAI Gym environment\n",
"2. Implementing a wrapper for your environment in Coach\n",
"\n",
"In this tutorial, we'll follow the 2nd option, and add the DeepMind Control Suite environment to Coach. We will then create a preset that trains a DDPG agent on one of the levels of the new environment."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup\n",
"First, follow the installation instructions here: https://github.com/deepmind/dm_control#installation-and-requirements. \n",
"## Setup\n",
"First, we will need to install the DeepMind Control Suite library. To do this, follow the installation instructions here: https://github.com/deepmind/dm_control#installation-and-requirements. \n",
"\n",
"\n",
"Make sure your ```LD_LIBRARY_PATH``` contains the path to the GLEW and LGFW libraries (https://github.com/openai/mujoco-py/issues/110).\n",
@@ -23,81 +27,22 @@
"In addition, Mujoco rendering might need to be disabled (https://github.com/deepmind/dm_control/issues/20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"#os.environ['DISABLE_MUJOCO_RENDERING'] = '1'\n",
"\n",
"import sys\n",
"module_path = os.path.abspath(os.path.join('..'))\n",
"if module_path not in sys.path:\n",
" sys.path.append(module_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Environment Wrapper\n",
"## The Environment Wrapper\n",
"\n",
"To integrate an environment with Coach, we need to implement an environment wrapper which is placed under the environments folder. In our case, we'll implement the ```control_suite_environment.py``` file.\n",
"To integrate an environment with Coach, we need to implement an environment wrapper. Coach has several predefined environment wrappers which are placed under the environments folder, but we can place our new environment wherever we want and reference it later.\n",
"\n",
"\n",
"We'll start with some helper classes - ```ObservationType``` and ```ControlSuiteEnvironmentParameters```."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from enum import Enum\n",
"from dm_control import suite\n",
"from rl_coach.environments.environment import Environment, EnvironmentParameters, LevelSelection\n",
"from rl_coach.filters.filter import NoInputFilter, NoOutputFilter\n",
"\n",
"\n",
"\n",
"class ObservationType(Enum):\n",
" Measurements = 1\n",
" Image = 2\n",
" Image_and_Measurements = 3\n",
"\n",
"\n",
"# Parameters\n",
"class ControlSuiteEnvironmentParameters(EnvironmentParameters):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.observation_type = ObservationType.Measurements\n",
" self.default_input_filter = ControlSuiteInputFilter\n",
" self.default_output_filter = ControlSuiteOutputFilter\n",
"\n",
" @property\n",
" def path(self):\n",
" return 'environments.control_suite_environment:ControlSuiteEnvironment'\n",
"\n",
"\n",
"\"\"\"\n",
"ControlSuite Environment Components\n",
"\"\"\"\n",
"ControlSuiteInputFilter = NoInputFilter()\n",
"ControlSuiteOutputFilter = NoOutputFilter()\n",
"\n",
"control_suite_envs = {':'.join(env): ':'.join(env) for env in suite.BENCHMARKING}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define the control suite's environment wrapper class.\n",
"\n",
"In the ```__init__``` function we'll load and initialize the environment, and the internal state and action space members which will make sure the states and actions are within their allowed limits."
"In the ```__init__``` function we'll load and initialize the simulator using the level given by `self.env_id`.\n",
"Additionally, we will define the state space and action space of the environment, through the `self.state_space` and `self.action_space` members.\n",
"In this case, the state space is a dictionary consisting of 2 observations:\n",
"* **'pixels'** - the image received from the mujoco camera, defined as an ImageObservationSpace.\n",
"* **'measurements'** - the joint measurements of the model, defined as a VectorObservationSpace.\n",
"The action space is a continuous space defined by the BoxActionSpace."
]
},
{
@@ -109,27 +54,26 @@
"import numpy as np\n",
"import random\n",
"from typing import Union\n",
"\n",
"from dm_control import suite\n",
"from dm_control.suite.wrappers import pixels\n",
"\n",
"from rl_coach.base_parameters import VisualizationParameters\n",
"from rl_coach.spaces import BoxActionSpace, ImageObservationSpace, VectorObservationSpace, StateSpace\n",
"from dm_control.suite.wrappers import pixels\n",
"from rl_coach.environments.environment import Environment, LevelSelection\n",
"\n",
"\n",
"# Environment\n",
"class ControlSuiteEnvironment(Environment):\n",
" def __init__(self, level: LevelSelection, frame_skip: int, visualization_parameters: VisualizationParameters,\n",
" seed: Union[None, int]=None, human_control: bool=False,\n",
" observation_type: ObservationType=ObservationType.Measurements,\n",
" custom_reward_threshold: Union[int, float]=None, **kwargs):\n",
" super().__init__(level, seed, frame_skip, human_control, custom_reward_threshold, visualization_parameters)\n",
"\n",
" self.observation_type = observation_type\n",
"\n",
" \n",
" # load and initialize environment\n",
" domain_name, task_name = self.env_id.split(\":\")\n",
" self.env = suite.load(domain_name=domain_name, task_name=task_name)\n",
"\n",
" if observation_type != ObservationType.Measurements:\n",
" self.env = pixels.Wrapper(self.env, pixels_only=observation_type == ObservationType.Image)\n",
" self.env = pixels.Wrapper(self.env, pixels_only=False)\n",
"\n",
" # seed\n",
" if self.seed is not None:\n",
@@ -139,24 +83,22 @@
" self.state_space = StateSpace({})\n",
"\n",
" # image observations\n",
" if observation_type != ObservationType.Measurements:\n",
" self.state_space['pixels'] = ImageObservationSpace(shape=self.env.observation_spec()['pixels'].shape,\n",
" high=255)\n",
" self.state_space['pixels'] = ImageObservationSpace(shape=self.env.observation_spec()['pixels'].shape,\n",
" high=255)\n",
"\n",
" # measurements observations\n",
" if observation_type != ObservationType.Image:\n",
" measurements_space_size = 0\n",
" measurements_names = []\n",
" for observation_space_name, observation_space in self.env.observation_spec().items():\n",
" if len(observation_space.shape) == 0:\n",
" measurements_space_size += 1\n",
" measurements_names.append(observation_space_name)\n",
" elif len(observation_space.shape) == 1:\n",
" measurements_space_size += observation_space.shape[0]\n",
" measurements_names.extend([\"{}_{}\".format(observation_space_name, i) for i in\n",
" range(observation_space.shape[0])])\n",
" self.state_space['measurements'] = VectorObservationSpace(shape=measurements_space_size,\n",
" measurements_names=measurements_names)\n",
" measurements_space_size = 0\n",
" measurements_names = []\n",
" for observation_space_name, observation_space in self.env.observation_spec().items():\n",
" if len(observation_space.shape) == 0:\n",
" measurements_space_size += 1\n",
" measurements_names.append(observation_space_name)\n",
" elif len(observation_space.shape) == 1:\n",
" measurements_space_size += observation_space.shape[0]\n",
" measurements_names.extend([\"{}_{}\".format(observation_space_name, i) for i in\n",
" range(observation_space.shape[0])])\n",
" self.state_space['measurements'] = VectorObservationSpace(shape=measurements_space_size,\n",
" measurements_names=measurements_names)\n",
"\n",
" # actions\n",
" self.action_space = BoxActionSpace(\n",
@@ -166,16 +108,7 @@
" )\n",
"\n",
" # initialize the state by getting a new state from the environment\n",
" self.reset_internal_state(True)\n",
"\n",
" # render\n",
" if self.is_rendered:\n",
" image = self.get_rendered_image()\n",
" scale = 1\n",
" if self.human_control:\n",
" scale = 2\n",
" if not self.native_rendering:\n",
" self.renderer.create_screen(image.shape[1]*scale, image.shape[0]*scale)"
" self.reset_internal_state(True)"
]
},
{
@@ -184,8 +117,15 @@
"source": [
"The following functions cover the API expected from a new environment wrapper:\n",
"\n",
"1. ```_update_state``` - update the internal state of the wrapper (to be queried by the agent)\n",
"2. ```_take_action``` - take an action on the environment \n",
"1. ```_update_state``` - update the internal state of the wrapper (to be queried by the agent),\n",
" which consists of:\n",
" * `self.state` - a dictionary containing all the observations from the environment and which follows the state space definition.\n",
" * `self.reward` - a float value containing the reward for the last step of the environment\n",
" * `self.done` - a boolean flag which signals if the environment episode has ended\n",
" * `self.goal` - a numpy array representing the goal the environment has set for the last step\n",
" * `self.info` - a dictionary that contains any additional information for the last step\n",
" \n",
"2. ```_take_action``` - gets the action from the agent, and make a single step on the environment\n",
"3. ```_restart_environment_episode``` - restart the environment on a new episode \n",
"4. ```get_rendered_image``` - get a rendered image of the environment in its current state"
]
@@ -200,18 +140,16 @@
" def _update_state(self):\n",
" self.state = {}\n",
"\n",
" if self.observation_type != ObservationType.Measurements:\n",
" self.pixels = self.last_result.observation['pixels']\n",
" self.state['pixels'] = self.pixels\n",
" self.pixels = self.last_result.observation['pixels']\n",
" self.state['pixels'] = self.pixels\n",
"\n",
" if self.observation_type != ObservationType.Image:\n",
" self.measurements = np.array([])\n",
" for sub_observation in self.last_result.observation.values():\n",
" if isinstance(sub_observation, np.ndarray) and len(sub_observation.shape) == 1:\n",
" self.measurements = np.concatenate((self.measurements, sub_observation))\n",
" else:\n",
" self.measurements = np.concatenate((self.measurements, np.array([sub_observation])))\n",
" self.state['measurements'] = self.measurements\n",
" self.measurements = np.array([])\n",
" for sub_observation in self.last_result.observation.values():\n",
" if isinstance(sub_observation, np.ndarray) and len(sub_observation.shape) == 1:\n",
" self.measurements = np.concatenate((self.measurements, sub_observation))\n",
" else:\n",
" self.measurements = np.concatenate((self.measurements, np.array([sub_observation])))\n",
" self.state['measurements'] = self.measurements\n",
"\n",
" self.reward = self.last_result.reward if self.last_result.reward is not None else 0\n",
"\n",
@@ -234,10 +172,41 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Preset\n",
"The new preset will be defined in a new file - ```presets\\ControlSuite_DDPG.py```. \n",
"Finally, we will need to define a parameters class corresponding to our environment class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.environments.environment import EnvironmentParameters\n",
"from rl_coach.filters.filter import NoInputFilter, NoOutputFilter\n",
"\n",
"First - let's define the agent parameters"
"# Parameters\n",
"class ControlSuiteEnvironmentParameters(EnvironmentParameters):\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.default_input_filter = NoInputFilter()\n",
" self.default_output_filter = NoInputFilter()\n",
"\n",
" @property\n",
" def path(self):\n",
" return 'environments.control_suite_environment:ControlSuiteEnvironment'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Preset\n",
"\n",
"Now that we have our new environment, we will want to use one of the predefined algorithms to try and solve it.\n",
"In this case, since the environment defines a continuous action space, we will want to use a supporting algorithm, so we will select DDPG. To run DDPG on the environment, we will need to define a preset for it.\n",
"The new preset will typically be defined in a new file - ```presets\\ControlSuite_DDPG.py```. \n",
"\n",
"First - let's define the agent parameters. We can use the default parameters for the DDPG agent, except that we need to update the networks input embedders to point to the correct environment observation. When we defined the environment, we set it to have 2 observations - 'pixels' and 'measurements'. In this case, we will want to learn only from the measurements, so we will need to modify the default input embedders to point to 'measurements' instead of the default 'observation' defined in `DDPGAgentParameters`."
]
},
{
@@ -247,32 +216,21 @@
"outputs": [],
"source": [
"from rl_coach.agents.ddpg_agent import DDPGAgentParameters\n",
"from rl_coach.architectures.tensorflow_components.architecture import Dense\n",
"from rl_coach.base_parameters import VisualizationParameters, EmbedderScheme\n",
"from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps, RunPhase\n",
"from rl_coach.environments.gym_environment import MujocoInputFilter\n",
"from rl_coach.filters.reward.reward_rescale_filter import RewardRescaleFilter\n",
"\n",
"\n",
"agent_params = DDPGAgentParameters()\n",
"# rename the input embedder key from 'observation' to 'measurements'\n",
"agent_params.network_wrappers['actor'].input_embedders_parameters['measurements'] = \\\n",
" agent_params.network_wrappers['actor'].input_embedders_parameters.pop('observation')\n",
"agent_params.network_wrappers['critic'].input_embedders_parameters['measurements'] = \\\n",
" agent_params.network_wrappers['critic'].input_embedders_parameters.pop('observation')\n",
"agent_params.network_wrappers['actor'].input_embedders_parameters['measurements'].scheme = [Dense([300])]\n",
"agent_params.network_wrappers['actor'].middleware_parameters.scheme = [Dense([200])]\n",
"agent_params.network_wrappers['critic'].input_embedders_parameters['measurements'].scheme = [Dense([400])]\n",
"agent_params.network_wrappers['critic'].middleware_parameters.scheme = [Dense([300])]\n",
"agent_params.network_wrappers['critic'].input_embedders_parameters['action'].scheme = EmbedderScheme.Empty\n",
"agent_params.input_filter = MujocoInputFilter()\n",
"agent_params.input_filter.add_reward_filter(\"rescale\", RewardRescaleFilter(1/10.))"
" agent_params.network_wrappers['critic'].input_embedders_parameters.pop('observation')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define the environment parameters"
"Now let's define the environment parameters. The DeepMind Control Suite environment has many levels to select from. The level can be selected either as a specific level name, for example 'cartpole:swingup', or by a list of level names from which a single level should be selected. The later can be done using the `SingleLevelSelection` class, and then the level can be selected from the command line using the `-lvl` flag."
]
},
{
@@ -282,21 +240,16 @@
"outputs": [],
"source": [
"from rl_coach.environments.control_suite_environment import ControlSuiteEnvironmentParameters, control_suite_envs\n",
"from rl_coach.environments.environment import MaxDumpMethod, SelectedPhaseOnlyDumpMethod, SingleLevelSelection\n",
"from rl_coach.environments.environment import SingleLevelSelection\n",
"\n",
"env_params = ControlSuiteEnvironmentParameters()\n",
"env_params.level = SingleLevelSelection(control_suite_envs)\n",
"\n",
"vis_params = VisualizationParameters()\n",
"vis_params.video_dump_methods = [SelectedPhaseOnlyDumpMethod(RunPhase.TEST), MaxDumpMethod()]\n",
"vis_params.dump_mp4 = False"
"env_params = ControlSuiteEnvironmentParameters(level='cartpole:balance')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The schedule parameters will define the number of heatup steps, periodice evaluation steps, training steps between evaluations."
"We will also need to define a schedule for the training. The schedule defines the number of steps we want run our experiment for and when to evaluate the trained model. In this case, we will use a simple predefined schedule, and just add some heatup steps to fill up the agent memory buffers with initial data."
]
},
{
@@ -305,16 +258,31 @@
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.graph_managers.graph_manager import ScheduleParameters\n",
"from rl_coach.graph_managers.graph_manager import SimpleSchedule\n",
"from rl_coach.core_types import EnvironmentSteps\n",
"\n",
"\n",
"schedule_params = ScheduleParameters()\n",
"schedule_params.improve_steps = TrainingSteps(10000000000)\n",
"schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(20)\n",
"schedule_params.evaluation_steps = EnvironmentEpisodes(1)\n",
"schedule_params = SimpleSchedule()\n",
"schedule_params.heatup_steps = EnvironmentSteps(1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also want to see the simulator in action (otherwise we will miss all the fun), so let's set the `render` flag to True in the visualization parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rl_coach.base_parameters import VisualizationParameters\n",
"\n",
"vis_params = VisualizationParameters(render=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -329,39 +297,13 @@
"outputs": [],
"source": [
"from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager\n",
"from rl_coach.base_parameters import TaskParameters, Frameworks\n",
"\n",
"\n",
"graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,\n",
" schedule_params=schedule_params, vis_params=vis_params)\n",
"\n",
"graph_manager.env_params.level.select('walker:walk')\n",
"graph_manager.visualization_parameters.render = True\n",
"\n",
"\n",
"log_path = '../experiments/control_suite_walker_ddpg'\n",
"if not os.path.exists(log_path):\n",
" os.makedirs(log_path)\n",
" \n",
"task_parameters = TaskParameters(framework_type=Frameworks.tensorflow, \n",
" evaluate_only=False,\n",
" experiment_path=log_path)\n",
"\n",
"task_parameters.__dict__['checkpoint_save_secs'] = None\n",
"\n",
"\n",
"graph_manager.create_graph(task_parameters)\n",
"\n",
"# let the adventure begin\n",
"graph_manager.improve()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {