pre-release 0.10.0
0
docs_raw/docs/__init__.py
Normal file
@@ -2,37 +2,67 @@

Coach's modularity makes adding an agent a simple and clean task that involves the following steps:

1. Implement your algorithm in a new file. The agent can inherit base classes such as **ValueOptimizationAgent** or
   **ActorCriticAgent**, or the more generic **Agent** base class.

    * **ValueOptimizationAgent**, **PolicyOptimizationAgent** and **Agent** are abstract classes.
      learn_from_batch() should be overridden with the desired behavior for the algorithm being implemented.
      If inheriting from **Agent**, choose_action() should also be overridden.

        def learn_from_batch(self, batch) -> Tuple[float, List, List]:
            """
            Given a batch of transitions, calculates their target values and updates the network.

            :param batch: A list of transitions
            :return: The total loss of the training, the loss per head and the unclipped gradients
            """
            pass

        def choose_action(self, curr_state):
            """
            Choose an action to act with in the current episode being played. Different behavior might be exhibited
            when training or testing.

            :param curr_state: the current state to act upon.
            :return: chosen action, some action value describing the action (q-value, probability, etc.)
            """
            pass

2. Implement your agent's specific network head, if needed, at the implementation for the framework of your choice.
   For example **architectures/neon_components/heads.py**. The head will inherit the generic base class Head.
   A new output type should be added to configurations.py, and a mapping between the new head and output type should
   be defined in the get_output_head() function at **architectures/neon_components/general_network.py**.

3. Define a new parameters class that inherits AgentParameters.
   The parameters class defines all the hyperparameters for the agent, and is initialized with 4 main components:

    * **algorithm**: A class inheriting AlgorithmParameters, which defines any algorithm-specific parameters.
    * **exploration**: A class inheriting ExplorationParameters, which defines the exploration policy parameters.
      There are several common exploration policies built in which you can use, and they are defined under
      the exploration sub-directory. You can also define your own custom exploration policy.
    * **memory**: A class inheriting MemoryParameters, which defines the memory parameters.
      There are several common memory types built in which you can use, and they are defined under the memories
      sub-directory. You can also define your own custom memory.
    * **networks**: A dictionary defining all the networks that will be used by the agent. The keys of the dictionary
      define the network names and will be used to access each network through the agent class.
      The dictionary values are classes inheriting NetworkParameters, which define the network structure
      and parameters.

    Additionally, set the path property to return the path to your agent class in the following format:

        <path to python module>:<name of agent class>

    For example:

        class RainbowAgentParameters(AgentParameters):
            def __init__(self):
                super().__init__(algorithm=RainbowAlgorithmParameters(),
                                 exploration=RainbowExplorationParameters(),
                                 memory=RainbowMemoryParameters(),
                                 networks={"main": RainbowNetworkParameters()})

            @property
            def path(self):
                return 'rainbow.rainbow_agent:RainbowAgent'

4. (Optional) Define a preset using the new agent type with a given environment, and the hyperparameters that should
   be used for training on that environment. An illustrative sketch of the agent side of such a preset is shown below.
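
    For illustration only, the agent side of such a preset could look like the sketch below. It reuses the
    RainbowAgentParameters example from step 3; the overridden attribute names are assumptions and may differ for
    your agent.

        # Sketch of the agent side of a preset (illustrative; attribute names are assumptions)
        agent_params = RainbowAgentParameters()

        # override any default hyperparameter defined by the parameters class
        agent_params.algorithm.discount = 0.99
        agent_params.networks["main"].learning_rate = 0.0001
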
@@ -1,70 +1,79 @@

Adding a new environment to Coach is as easy as solving CartPole.

There are essentially two ways to integrate new environments into Coach:

## Using the OpenAI Gym API

If your environment already uses the OpenAI Gym API, you are good to go.
When selecting the environment parameters in the preset, use GymEnvironmentParameters(),
and pass the path to your environment source code using the level parameter.
You can specify additional parameters for your environment using the additional_simulator_parameters parameter.
Take, for example, the definition used in the Pendulum_HAC preset:

    env_params = GymEnvironmentParameters()
    env_params.level = "rl_coach.environments.mujoco.pendulum_with_goals:PendulumWithGoals"
    env_params.additional_simulator_parameters = {"time_limit": 1000}

## Using the Coach API

There are a few simple steps to follow, and we will walk through them one by one.

1. Create a new class for your environment, and inherit the Environment class.

2. Coach defines a simple API for implementing a new environment, which is defined in environment/environment.py.
   There are several functions to implement, but only some of them are mandatory.

    Here are the important ones:

        def _take_action(self, action_idx: ActionType) -> None:
            """
            An environment dependent function that sends an action to the simulator.
            :param action_idx: the action to perform on the environment
            :return: None
            """
            pass

        def _preprocess_observation(self, observation):
            """
            Do initial observation preprocessing such as cropping, rgb2gray, rescale etc.
            Implementing this function is optional.
            :param observation: a raw observation from the environment
            :return: the preprocessed observation
            """
            return observation

        def _update_state(self) -> None:
            """
            Updates the state from the environment.
            Should update self.observation, self.reward, self.done, self.measurements and self.info
            :return: None
            """
            pass

        def _restart_environment_episode(self, force_environment_reset=False) -> None:
            """
            Restarts the simulator episode
            :param force_environment_reset: Force the environment to reset even if the episode is not done yet.
            :return: None
            """
            pass

        def _render(self) -> None:
            """
            Renders the environment using the native simulator renderer
            :return: None
            """
            pass

        def get_rendered_image(self) -> np.ndarray:
            """
            Return a numpy array containing the image that will be rendered to the screen.
            This can be different from the observation. For example, MuJoCo's observation is a measurements vector.
            :return: numpy array containing the image that will be rendered to the screen
            """
            return self.observation

3. Create a new parameters class for your environment, which inherits the EnvironmentParameters class.
   In the __init__ of your class, define all the parameters you used in your Environment class.
   Additionally, fill the path property of the class with the path to your Environment class.
   For example, take a look at the EnvironmentParameters class used for Doom:

        class DoomEnvironmentParameters(EnvironmentParameters):
            def __init__(self):
                super().__init__()
                self.default_input_filter = DoomInputFilter
                self.default_output_filter = DoomOutputFilter
                self.cameras = [DoomEnvironment.CameraTypes.OBSERVATION]

            @property
            def path(self):
                return 'rl_coach.environments.doom_environment:DoomEnvironment'

4. And that's it, you're done. Now just add a new preset with your newly created environment, and start training an
   agent on top of it. The new environment parameters are wired into the preset as sketched below.
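
    As a rough sketch (the class name here is hypothetical), the preset instantiates your new parameters class the
    same way the Gym example above instantiates GymEnvironmentParameters, and then hands it to the graph manager
    together with the agent parameters:

        # Illustrative only: MyEnvironmentParameters is the class defined in step 3.
        env_params = MyEnvironmentParameters()
        env_params.level = "basic"  # any level or custom parameter your environment defines
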
94
docs_raw/docs/design/control_flow.md
Normal file
@@ -0,0 +1,94 @@

<!-- language-all: python -->

# Coach Control Flow

Coach is built in a modular way, encouraging module reuse and reducing the amount of boilerplate code needed
for developing new algorithms or integrating a new challenge as an environment.
On the other hand, it can be overwhelming for new users to ramp up on the code.
To help with that, here's a short overview of the control flow.

## Graph Manager

The main entry point for Coach is **coach.py**.
The main functionality of this script is to parse the command line arguments and invoke all the sub-processes needed
for the given experiment.
**coach.py** executes the given **preset** file, which returns a **GraphManager** object.

A **preset** is a design pattern that is intended to concentrate the entire definition of an experiment in a single
file. This helps with experiment reproducibility, improves readability and prevents confusion.
The outcome of a preset is a **GraphManager**, which will usually be instantiated in the final lines of the preset,
as sketched below.
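
As a rough illustration, the final lines of a typical preset look something like the following. The specific class
names (BasicRLGraphManager, SimpleSchedule, DQNAgentParameters) reflect one common case and are assumptions that may
differ between presets.

    # Sketch of a preset's final lines (class names are assumptions; actual presets may differ)
    from rl_coach.agents.dqn_agent import DQNAgentParameters
    from rl_coach.environments.gym_environment import GymEnvironmentParameters
    from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
    from rl_coach.graph_managers.graph_manager import SimpleSchedule

    agent_params = DQNAgentParameters()
    env_params = GymEnvironmentParameters()
    env_params.level = "CartPole-v0"

    # the graph manager returned to coach.py
    graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                        env_params=env_params,
                                        schedule_params=SimpleSchedule())
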

A **GraphManager** is an object that holds all the agents and environments of an experiment, and is mostly responsible
for scheduling their work. Why is it called a **graph** manager? Because agents and environments are structured into
a graph of interactions. For example, in hierarchical reinforcement learning schemes, there will often be a master
policy agent that controls a sub-policy agent, which interacts with the environment. Other schemes can have
much more complex graphs of control, such as several hierarchy layers, each with multiple agents.
The graph manager's main loop is the improve loop.

<p style="text-align: center;">

<img src="../../img/improve.png" alt="Improve loop" style="width: 400px;"/>

</p>

The improve loop cycles between 3 main phases - heatup, training and evaluation (a simplified sketch of the loop
follows the list):

* **Heatup** - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase
  takes place only at the beginning of the experiment, and the agents act completely randomly during this phase.
  Importantly, the agents do not train their networks during this phase. DQN, for example, uses 50k random steps in
  order to initialize the replay buffers.

* **Training** - the training phase is the main phase of the experiment. This phase can change between agent types,
  but essentially consists of repeated cycles of acting, collecting data from the environment, and training the agent
  networks. During this phase, the agent will use its exploration policy in training mode, which will add noise to its
  actions in order to improve its knowledge about the environment state space.

* **Evaluation** - the evaluation phase is intended for evaluating the current performance of the agent. The agents
  will act greedily in order to exploit the knowledge aggregated so far, and the performance over multiple episodes of
  evaluation will be averaged in order to reduce the stochastic effects of all the components.
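
Conceptually, the improve loop can be sketched as follows. This is a simplified illustration over a duck-typed graph
manager, not the actual Coach scheduling code, and the method and parameter names are assumptions.

    # Simplified sketch of the improve loop (not the actual Coach scheduling code)
    def improve(graph_manager, num_heatup_steps, num_cycles, training_steps_per_cycle, evaluation_episodes):
        # heatup: act randomly to fill the replay buffers, without training the networks
        graph_manager.heatup(num_heatup_steps)

        for _ in range(num_cycles):
            # training: act with the exploration policy and train the agent networks
            graph_manager.train_and_act(training_steps_per_cycle)

            # evaluation: act greedily and average the performance over several episodes
            graph_manager.evaluate(evaluation_episodes)
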
## Level Manager

In each of the 3 phases described above, the graph manager invokes all the hierarchy levels in the graph in a
synchronized manner. In Coach, agents do not interact directly with the environment. Instead, they go through a
*LevelManager*, which is a proxy that manages their interaction. The level manager passes the current state and reward
from the environment to the agent, and the actions from the agent to the environment.

The motivation for having a level manager is to disentangle the code of the environment and the agent, so as to allow
more complex interactions. Each level can have multiple agents which interact with the environment. Who gets to choose
the action for each step is controlled by the level manager.
Additionally, each level manager can act as an environment for the hierarchy level above it, such that each hierarchy
level can be seen as an interaction between an agent and an environment, even if the environment is just more agents in
a lower hierarchy level.

## Agent

The base agent class has 3 main functions that will be used during those phases - observe, act and train.

* **Observe** - this function gets the latest response from the environment as input, and updates the internal state
  of the agent with the new information. The environment response will
  first be passed through the agent's **InputFilter** object, which will process the values in the response according
  to the specific agent definition. The environment response will then be converted into a
  **Transition**, which will contain the information from a single step
  ($ s_t, a_t, r_t, s_{t+1} $, terminal signal), and stored in the memory.

<img src="../../img/observe.png" alt="Observe" style="width: 700px;"/>

* **Act** - this function uses the current internal state of the agent in order to select the next action to take on
  the environment. This function will call the per-agent custom function **choose_action**, which will use the network
  and the exploration policy in order to select an action. The action will be stored, together with any additional
  information (like the action value, for example), in an **ActionInfo** object. The ActionInfo object will then be
  passed through the agent's **OutputFilter** to allow any processing of the action (like discretization
  or shifting, for example), before passing it to the environment.

<img src="../../img/act.png" alt="Act" style="width: 700px;"/>

* **Train** - this function will sample a batch from the memory and train on it. The batch of transitions will be
  first wrapped into a **Batch** object to allow efficient querying of the batch values. It will then be passed into
  the agent-specific **learn_from_batch** function, which will extract network target values from the batch and
  train the networks accordingly. Lastly, if there's a target network defined for the agent, it will sync the target
  network weights with the online network. A conceptual sketch of how these three functions fit together in a single
  interaction step is shown below.

<img src="../../img/train.png" alt="Train" style="width: 700px;"/>

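
The sketch below is a conceptual simplification of a single interaction step, not the actual Coach code; the
environment's step method and the returned objects are assumptions.

    # Conceptual sketch of the observe / act / train cycle (not the actual Coach code)
    def interaction_step(agent, environment, training_phase=True):
        # act: choose an action using the agent's current internal state; the action passes
        # through the agent's OutputFilter before it reaches the environment
        action_info = agent.act()
        env_response = environment.step(action_info.action)

        # observe: pass the environment response through the InputFilter, wrap it in a
        # Transition (s_t, a_t, r_t, s_t+1, terminal) and store it in the memory
        agent.observe(env_response)

        # train: sample a batch from the memory and update the networks
        if training_phase:
            agent.train()
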
44
docs_raw/docs/design/features.md
Normal file
@@ -0,0 +1,44 @@

# Coach Features

## Supported Algorithms

Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into two main classes -
value optimization and policy optimization. A detailed description of those algorithms may be found in the algorithms
section.

<p style="text-align: center;">

<img src="../../img/algorithms.png" alt="Supported Algorithms" style="width: 600px;"/>

</p>

## Supported Environments

Coach supports a large number of environments which can be solved using reinforcement learning:

* **[DeepMind Control Suite](https://github.com/deepmind/dm_control)** - a set of reinforcement learning environments
  powered by the MuJoCo physics engine.

* **[Blizzard Starcraft II](https://github.com/deepmind/pysc2)** - a popular strategy game which was wrapped with a
  Python interface by DeepMind.

* **[ViZDoom](http://vizdoom.cs.put.edu.pl/)** - a Doom-based AI research platform for reinforcement learning
  from raw visual information.

* **[CARLA](https://github.com/carla-simulator/carla)** - an open-source simulator for autonomous driving research.

* **[OpenAI Gym](https://gym.openai.com/)** - a library which consists of a set of environments, from games to robotics.
  Additionally, it can be extended using the API defined by the authors.

In Coach, we support all the native environments in Gym, along with several extensions such as:

* **[Roboschool](https://github.com/openai/roboschool)** - a set of environments powered by the PyBullet engine,
  offering a free alternative to MuJoCo.

* **[Gym Extensions](https://github.com/Breakend/gym-extensions)** - a set of environments that extends Gym for
  auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.).

* **[PyBullet](https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet)** - a physics engine that
  includes a set of robotics environments.

116
docs_raw/docs/design/filters.md
Normal file
@@ -0,0 +1,116 @@

# Filters

Filters are a mechanism in Coach that allows pre-processing and post-processing of the internal agent information.
There are two filter categories -

* **Input filters** - these are filters that process the information passed **into** the agent from the environment.
  This information includes the observation and the reward. Input filters therefore allow rescaling observations,
  normalizing rewards, stacking observations, etc.

* **Output filters** - these are filters that process the information going **out** of the agent into the environment.
  This information includes the action the agent chooses to take. Output filters therefore allow conversion of
  actions from one space into another. For example, the agent can take $ N $ discrete actions, which will be mapped by
  the output filter onto $ N $ continuous actions.

Filters can be stacked on top of each other in order to build complex processing flows of the inputs or outputs.

<p style="text-align: center;">

<img src="../../img/filters.png" alt="Filters mechanism" style="width: 350px;"/>

</p>
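
To make the stacking idea concrete, here is a small self-contained sketch. It uses plain functions rather than
Coach's actual filter classes; the point is only that each filter processes the output of the previous one.

    # Illustrative only - plain functions standing in for observation filters
    import numpy as np

    def rgb_to_gray(observation):
        # keep only the luminance information
        return observation.mean(axis=-1)

    def rescale_by_half(observation):
        # naive rescale by striding (a real filter would interpolate)
        return observation[::2, ::2]

    input_filter_stack = [rgb_to_gray, rescale_by_half]

    observation = np.random.randint(0, 255, size=(210, 160, 3)).astype(np.float32)
    for f in input_filter_stack:
        observation = f(observation)
    print(observation.shape)  # (105, 80)
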
## Input Filters

The input filters are separated into two categories - **observation filters** and **reward filters**.

### Observation Filters

* **ObservationClippingFilter** - Clips the observation values to a given range of values. For example, if the
  observation consists of measurements in an arbitrary range, and we want to control the minimum and maximum values
  of these observations, we can define a range and clip the values of the measurements.

* **ObservationCropFilter** - Crops the size of the observation to a given crop window. For example, in Atari, the
  observations are images with a shape of 210x160. Usually, we will want to crop the observation to a
  square of 160x160 before rescaling it.

* **ObservationMoveAxisFilter** - Reorders the axes of the observation. This can be useful when the observation is an
  image, and we want to move the channel axis to be the last axis instead of the first axis.

* **ObservationNormalizationFilter** - Normalizes the observation values with a running mean and standard deviation of
  all the observations seen so far. The normalization is performed element-wise. Additionally, when working with
  multiple workers, the statistics used for the normalization operation are accumulated over all the workers.

* **ObservationReductionBySubPartsNameFilter** - Allows keeping only parts of the observation, by specifying their
  name. For example, the CARLA environment extracts multiple measurements that can be used by the agent, such as
  speed and location. If we want to use only the speed, it can be done using this filter.

* **ObservationRescaleSizeByFactorFilter** - Rescales an image observation by some factor. For example, the image size
  can be reduced by a factor of 2.

* **ObservationRescaleToSizeFilter** - Rescales an image observation to a given size. The target size does not
  necessarily keep the aspect ratio of the original observation.

* **ObservationRGBToYFilter** - Converts a color image observation specified using the RGB encoding into a grayscale
  image observation, by keeping only the luminance (Y) channel of the YUV encoding. This can be useful if the colors
  in the original image are not relevant for solving the task at hand.

* **ObservationSqueezeFilter** - Removes redundant axes from the observation, which are axes with a dimension of 1.

* **ObservationStackingFilter** - Stacks several observations on top of each other. For image observations this will
  create a 3D blob. The stacking is done in a lazy manner in order to reduce memory consumption. To achieve this,
  a LazyStack object is used to wrap the observations in the stack. For this reason, the
  ObservationStackingFilter **must** be the last filter in the input filters stack.

* **ObservationUint8Filter** - Converts a floating point observation into an unsigned int 8 bit observation. This is
  mostly useful for reducing memory consumption and is usually used for image observations. The filter will first
  spread the observation values over the range 0-255 and then discretize them into integer values, as in the short
  sketch after this list.
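
The uint8 conversion can be illustrated with a short self-contained sketch (this is not Coach's implementation, just
the idea of spreading the values over 0-255 and discretizing them):

    import numpy as np

    def to_uint8(observation, low, high):
        # spread the values over the range 0-255, then discretize to integers
        spread = 255.0 * (observation - low) / (high - low)
        return spread.astype(np.uint8)

    print(to_uint8(np.array([0.0, 0.25, 0.5, 1.0]), low=0.0, high=1.0))  # [  0  63 127 255]
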
### Reward Filters

* **RewardClippingFilter** - Clips the reward values into a given range. For example, in DQN, the Atari rewards are
  clipped into the range -1 and 1 in order to control the scale of the returns.

* **RewardNormalizationFilter** - Normalizes the reward values with a running mean and standard deviation of
  all the rewards seen so far. When working with multiple workers, the statistics used for the normalization operation
  are accumulated over all the workers.

* **RewardRescaleFilter** - Rescales the reward by a given factor. Rescaling the rewards of the environment has been
  observed to have a large effect (negative or positive) on the behavior of the learning process.

## Output Filters

The output filters only process the actions.

### Action Filters

* **AttentionDiscretization** - Discretizes an **AttentionActionSpace**. The attention action space defines the actions
  as choosing sub-boxes in a given box. For example, consider an image of size 100x100, where the action is choosing
  a crop window of size 20x20 to attend to in the image. AttentionDiscretization allows discretizing the possible crop
  windows into a finite number of options, and maps a discrete action space onto those crop windows.

* **BoxDiscretization** - Discretizes a continuous action space into a discrete action space, allowing the usage of
  agents such as DQN for continuous environments such as MuJoCo. Given the number of bins to discretize into, the
  original continuous action space is uniformly separated into the given number of bins, each mapped to a discrete
  action index. For example, if the original action space is between -1 and 1 and 5 bins were selected, the new action
  space will consist of 5 actions mapped to -1, -0.5, 0, 0.5 and 1 (see the sketch after this list).

* **BoxMasking** - Masks part of the action space to force the agent to work in a restricted part of the space. For
  example, if the original action space is between -1 and 1, then this filter can be used in order to constrain the
  agent actions to the range 0 and 1 instead. This essentially masks the range -1 and 0 from the agent.

* **PartialDiscreteActionSpaceMap** - A partial map between two countable action spaces. For example, consider an
  environment with a MultiSelect action space (select multiple actions at the same time, such as jump and go right),
  with 8 actual MultiSelect actions. If we want the agent to be able to select only 5 of those actions by their index
  (0-4), we can map a discrete action space with 5 actions onto the 5 selected MultiSelect actions. This will both
  allow the agent to use regular discrete actions, and mask 3 of the actions from the agent.

* **FullDiscreteActionSpaceMap** - A full map between two countable action spaces. This works in a similar way to the
  PartialDiscreteActionSpaceMap, but maps the entire source action space onto the entire target action space, without
  masking any actions.

* **LinearBoxToBoxMap** - A linear mapping between two box action spaces. For example, if the action space of the
  environment consists of continuous actions between 0 and 1, and we want the agent to choose actions between -1 and 1,
  the LinearBoxToBoxMap can be used to map the range -1 and 1 to the range 0 and 1 in a linear way. This means that the
  action -1 will be mapped to 0, the action 1 will be mapped to 1, and the rest of the actions will be linearly mapped
  between those values (see the sketch after this list).
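
The two mappings referenced above (BoxDiscretization and LinearBoxToBoxMap) can be illustrated with a short
self-contained sketch; this is not Coach's implementation, only the arithmetic behind it.

    import numpy as np

    # BoxDiscretization idea: 5 bins spread uniformly over the range [-1, 1]
    bins = np.linspace(-1.0, 1.0, num=5)   # [-1. , -0.5,  0. ,  0.5,  1. ]
    discrete_action = 3
    print(bins[discrete_action])           # 0.5

    # LinearBoxToBoxMap idea: linearly map an agent action in [-1, 1] to the environment's [0, 1]
    def linear_map(action, source_low=-1.0, source_high=1.0, target_low=0.0, target_high=1.0):
        ratio = (action - source_low) / (source_high - source_low)
        return target_low + ratio * (target_high - target_low)

    print(linear_map(-1.0), linear_map(0.0), linear_map(1.0))  # 0.0 0.5 1.0
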
@@ -1,6 +1,4 @@

# Network Design

Each agent has at least one neural network, used as the function approximator for choosing actions. The network is designed in a modular way to allow reusability in different agents. It is separated into three main parts:

@@ -21,7 +19,7 @@ Each agent has at least one neural network, used as the function approximator, f

<p style="text-align: center;">

<img src="../../img/network.png" alt="Network Design" style="width: 400px;"/>

</p>

@@ -31,17 +29,7 @@ Most of the reinforcement learning agents include more than one copy of the neur

<p style="text-align: center;">

<img src="../../img/distributed.png" alt="Distributed Training" style="width: 600px;"/>

</p>

1
docs_raw/docs/diagrams.xml
Normal file
BIN  docs_raw/docs/img/act.png      Normal file  (49 KiB)
BIN  docs_raw/docs/img/filters.png  Normal file  (21 KiB)
BIN  docs_raw/docs/img/graph.png    Normal file  (29 KiB)
BIN  docs_raw/docs/img/improve.png  Normal file  (32 KiB)
BIN  docs_raw/docs/img/level.png    Normal file  (24 KiB)
BIN  docs_raw/docs/img/observe.png  Normal file  (40 KiB)
BIN  docs_raw/docs/img/train.png    Normal file  (39 KiB)
@@ -13,7 +13,7 @@ Coach collects statistics from the training process and supports advanced visual

Blog post from the Intel® AI website can be found [here](https://ai.intel.com/reinforcement-learning-coach-intel/).

GitHub repository is [here](https://github.com/NervanaSystems/coach).