Release 0.9
Main changes are detailed below:

New features:
* CARLA 0.7 simulator integration
* Human control of the game play
* Recording of human game play and storing / loading the replay buffer
* Behavioral cloning agent and presets
* Golden tests for several presets
* Selecting between deep / shallow image embedders
* Rendering through pygame (with some boost in performance)

API changes:
* Improved environment wrapper API
* Added an evaluate flag to allow convenient evaluation of existing checkpoints
* Improved the frame-skip definition in Gym

Bug fixes:
* Fixed loading of checkpoints for agents with more than one network
* Fixed the N-Step Q learning agent's Python 3 compatibility
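A quick usage sketch of the new evaluation flag (the flag is defined in the coach.py changes below; the preset is just the CartPole example the README already uses, and evaluating an existing checkpoint would normally also point Coach at its checkpoint directory):

```
python3 coach.py -p CartPole_DQN --evaluate
```

Human control and replay-buffer recording are exposed through the new `--play` and `-lvl`/`--level` flags, also shown in the coach.py diff below.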
README.md

@@ -13,10 +13,16 @@ Training an agent to solve an environment is as easy as running:

python3 coach.py -p CartPole_DQN -r
```

-<img src="img/doom.gif" alt="Doom Health Gathering" width="265" height="200"/><img src="img/minitaur.gif" alt="PyBullet Minitaur" width="265" height="200"/> <img src="img/ant.gif" alt="Gym Extensions Ant" width="250" height="200"/>
+<img src="img/doom_deathmatch.gif" alt="Doom Deathmatch" width="267" height="200"/> <img src="img/carla.gif" alt="CARLA" width="284" height="200"/> <img src="img/montezuma.gif" alt="MontezumaRevenge" width="152" height="200"/>

Blog post from the Intel® Nervana™ website can be found [here](https://www.intelnervana.com/reinforcement-learning-coach-intel).

+## Documentation
+
+Framework documentation, algorithm description and instructions on how to contribute a new agent/environment can be found [here](http://coach.nervanasys.com).
+
## Installation

Note: Coach has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.

@@ -103,6 +109,8 @@ For example:

It is easy to create new presets for different levels or environments by following the same pattern as in presets.py

+More usage examples can be found [here](http://coach.nervanasys.com/usage/index.html).
+
## Running Coach Dashboard (Visualization)
Training an agent to solve an environment can be tricky, at times.

@@ -121,11 +129,6 @@ python3 dashboard.py

<img src="img/dashboard.png" alt="Coach Design" style="width: 800px;"/>

-## Documentation
-
-Framework documentation, algoritmic description and instructions on how to contribute a new agent/environment can be found [here](http://coach.nervanasys.com).
-
## Parallelizing an Algorithm

Since the introduction of [A3C](https://arxiv.org/abs/1602.01783) in 2016, many algorithms were shown to benefit from running multiple instances in parallel, on many CPU cores. So far, these algorithms include [A3C](https://arxiv.org/abs/1602.01783), [DDPG](https://arxiv.org/pdf/1704.03073.pdf), [PPO](https://arxiv.org/pdf/1707.06347.pdf), and [NAF](https://arxiv.org/pdf/1610.00633.pdf), and this is most probably only the begining.

@@ -150,11 +153,11 @@ python3 coach.py -p Hopper_A3C -n 16

## Supported Environments

-* OpenAI Gym
+* *OpenAI Gym:*

Installed by default by Coach's installer.

-* ViZDoom:
+* *ViZDoom:*

Follow the instructions described in the ViZDoom repository -

@@ -162,13 +165,13 @@ python3 coach.py -p Hopper_A3C -n 16

Additionally, Coach assumes that the environment variable VIZDOOM_ROOT points to the ViZDoom installation directory.

-* Roboschool:
+* *Roboschool:*

Follow the instructions described in the roboschool repository -

https://github.com/openai/roboschool

-* GymExtensions:
+* *GymExtensions:*

Follow the instructions described in the GymExtensions repository -

@@ -176,10 +179,19 @@ python3 coach.py -p Hopper_A3C -n 16

Additionally, add the installation directory to the PYTHONPATH environment variable.

-* PyBullet
+* *PyBullet:*

Follow the instructions described in the [Quick Start Guide](https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA) (basically just - 'pip install pybullet')

+* *CARLA:*
+
+  Download release 0.7 from the CARLA repository -
+
+  https://github.com/carla-simulator/carla/releases
+
+  Create a new CARLA_ROOT environment variable pointing to CARLA's installation directory.
+
+  A simple CARLA settings file (```CarlaSettings.ini```) is supplied with Coach, and is located in the ```environments``` directory.

## Supported Algorithms

@@ -190,24 +202,24 @@ python3 coach.py -p Hopper_A3C -n 16

-* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
+* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) ([code](agents/dqn_agent.py))
-* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf)
+* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf) ([code](agents/ddqn_agent.py))
* [Dueling Q Network](https://arxiv.org/abs/1511.06581)
-* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310)
+* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310) ([code](agents/mmc_agent.py))
-* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860)
+* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860) ([code](agents/pal_agent.py))
-* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887)
+* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887) ([code](agents/categorical_dqn_agent.py))
-* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf)
+* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf) ([code](agents/qr_dqn_agent.py))
-* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621)
+* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621) ([code](agents/bootstrapped_dqn_agent.py))
-* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed**
+* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/n_step_q_agent.py))
-* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988)
+* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988) ([code](agents/nec_agent.py))
-* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed**
+* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed** ([code](agents/naf_agent.py))
-* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed**
+* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed** ([code](agents/policy_gradients_agent.py))
-* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed**
+* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/actor_critic_agent.py))
-* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed**
+* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed** ([code](agents/ddpg_agent.py))
-* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf)
+* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) ([code](agents/ppo_agent.py))
-* [Clipped Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed**
+* [Clipped Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed** ([code](agents/clipped_ppo_agent.py))
-* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed**
+* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed** ([code](agents/dfp_agent.py))
+* Behavioral Cloning (BC) ([code](agents/bc_agent.py))
@@ -16,6 +16,7 @@

from agents.actor_critic_agent import *
from agents.agent import *
+from agents.bc_agent import *
from agents.bootstrapped_dqn_agent import *
from agents.clipped_ppo_agent import *
from agents.ddpg_agent import *

@@ -23,6 +24,8 @@ from agents.ddqn_agent import *

from agents.dfp_agent import *
from agents.dqn_agent import *
from agents.categorical_dqn_agent import *
+from agents.human_agent import *
+from agents.imitation_agent import *
from agents.mmc_agent import *
from agents.n_step_q_agent import *
from agents.naf_agent import *
@@ -50,6 +50,7 @@ class Agent(object):

        self.task_id = task_id
        self.sess = tuning_parameters.sess
        self.env = tuning_parameters.env_instance = env
+       self.imitation = False

        # i/o dimensions
        if not tuning_parameters.env.desired_observation_width or not tuning_parameters.env.desired_observation_height:

@@ -61,6 +62,11 @@ class Agent(object):

            self.measurements_size = tuning_parameters.env.measurements_size = (self.measurements_size[0] + 1,)

        # modules
+       if tuning_parameters.agent.load_memory_from_file_path:
+           screen.log_title("Loading replay buffer from pickle. Pickle path: {}"
+                            .format(tuning_parameters.agent.load_memory_from_file_path))
+           self.memory = read_pickle(tuning_parameters.agent.load_memory_from_file_path)
+       else:
            self.memory = eval(tuning_parameters.memory + '(tuning_parameters)')
        # self.architecture = eval(tuning_parameters.architecture)

@@ -121,11 +127,12 @@ class Agent(object):

    def log_to_screen(self, phase):
        # log to screen
-       if self.current_episode > 0:
-           if phase == RunPhase.TEST:
-               exploration = self.evaluation_exploration_policy.get_control_param()
-           else:
-               exploration = self.exploration_policy.get_control_param()
+       if self.current_episode >= 0:
+           if phase == RunPhase.TRAIN:
+               exploration = self.exploration_policy.get_control_param()
+           else:
+               exploration = self.evaluation_exploration_policy.get_control_param()

            screen.log_dict(
                OrderedDict([
                    ("Worker", self.task_id),

@@ -135,7 +142,7 @@ class Agent(object):

                    ("steps", self.total_steps_counter),
                    ("training iteration", self.training_iteration)
                ]),
-               prefix="Heatup" if self.in_heatup else "Training" if phase == RunPhase.TRAIN else "Testing"
+               prefix=phase
            )

    def update_log(self, phase=RunPhase.TRAIN):

@@ -146,7 +153,7 @@ class Agent(object):

        # log all the signals to file
        logger.set_current_time(self.current_episode)
        logger.create_signal_value('Training Iter', self.training_iteration)
-       logger.create_signal_value('In Heatup', int(self.in_heatup))
+       logger.create_signal_value('In Heatup', int(phase == RunPhase.HEATUP))
        logger.create_signal_value('ER #Transitions', self.memory.num_transitions())
        logger.create_signal_value('ER #Episodes', self.memory.length())
        logger.create_signal_value('Episode Length', self.current_episode_steps_counter)

@@ -197,24 +204,6 @@ class Agent(object):

            network.curr_rnn_c_in = network.middleware_embedder.c_init
            network.curr_rnn_h_in = network.middleware_embedder.h_init

-   def stack_observation(self, curr_stack, observation):
-       """
-       Adds a new observation to an existing stack of observations from previous time-steps.
-       :param curr_stack: The current observations stack.
-       :param observation: The new observation
-       :return: The updated observation stack
-       """
-       if curr_stack == []:
-           # starting an episode
-           curr_stack = np.vstack(np.expand_dims([observation] * self.tp.env.observation_stack_size, 0))
-           curr_stack = self.switch_axes_order(curr_stack, from_type='channels_first', to_type='channels_last')
-       else:
-           curr_stack = np.append(curr_stack, np.expand_dims(np.squeeze(observation), axis=-1), axis=-1)
-           curr_stack = np.delete(curr_stack, 0, -1)
-
-       return curr_stack
-
    def preprocess_observation(self, observation):
        """
        Preprocesses the given observation.

@@ -335,26 +324,6 @@ class Agent(object):

            reward = max(reward, self.tp.env.reward_clipping_min)
        return reward

-   def switch_axes_order(self, observation, from_type='channels_first', to_type='channels_last'):
-       """
-       transpose an observation axes from channels_first to channels_last or vice versa
-       :param observation: a numpy array
-       :param from_type: can be 'channels_first' or 'channels_last'
-       :param to_type: can be 'channels_first' or 'channels_last'
-       :return: a new observation with the requested axes order
-       """
-       if from_type == to_type or len(observation.shape) == 1:
-           return observation
-       assert 2 <= len(observation.shape) <= 3, 'num axes of an observation must be 2 for a vector or 3 for an image'
-       assert type(observation) == np.ndarray, 'observation must be a numpy array'
-       if len(observation.shape) == 3:
-           if from_type == 'channels_first' and to_type == 'channels_last':
-               return np.transpose(observation, (1, 2, 0))
-           elif from_type == 'channels_last' and to_type == 'channels_first':
-               return np.transpose(observation, (2, 0, 1))
-       else:
-           return np.transpose(observation, (1, 0))

    def act(self, phase=RunPhase.TRAIN):
        """
        Take one step in the environment according to the network prediction and store the transition in memory

@@ -370,7 +339,7 @@ class Agent(object):

        is_first_transition_in_episode = (self.curr_state == [])
        if is_first_transition_in_episode:
            observation = self.preprocess_observation(self.env.observation)
-           observation = self.stack_observation([], observation)
+           observation = stack_observation([], observation, self.tp.env.observation_stack_size)

            self.curr_state = {'observation': observation}
            if self.tp.agent.use_measurements:

@@ -378,7 +347,7 @@ class Agent(object):

                if self.tp.agent.use_accumulated_reward_as_measurement:
                    self.curr_state['measurements'] = np.append(self.curr_state['measurements'], 0)

-       if self.in_heatup:  # we do not have a stacked curr_state yet
+       if phase == RunPhase.HEATUP and not self.tp.heatup_using_network_decisions:
            action = self.env.get_random_action()
        else:
            action, action_info = self.choose_action(self.curr_state, phase=phase)

@@ -394,11 +363,11 @@ class Agent(object):

        observation = self.preprocess_observation(result['observation'])

        # plot action values online
-       if self.tp.visualization.plot_action_values_online and not self.in_heatup:
+       if self.tp.visualization.plot_action_values_online and phase != RunPhase.HEATUP:
            self.plot_action_values_online()

        # initialize the next state
-       observation = self.stack_observation(self.curr_state['observation'], observation)
+       observation = stack_observation(self.curr_state['observation'], observation, self.tp.env.observation_stack_size)

        next_state = {'observation': observation}
        if self.tp.agent.use_measurements and 'measurements' in result.keys():

@@ -407,7 +376,7 @@ class Agent(object):

            next_state['measurements'] = np.append(next_state['measurements'], self.total_reward_in_current_episode)

        # store the transition only if we are training
-       if phase == RunPhase.TRAIN:
+       if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
            transition = Transition(self.curr_state, result['action'], shaped_reward, next_state, result['done'])
            for key in action_info.keys():
                transition.info[key] = action_info[key]

@@ -427,7 +396,7 @@ class Agent(object):

            self.update_log(phase=phase)
            self.log_to_screen(phase=phase)

-           if phase == RunPhase.TRAIN:
+           if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
                self.reset_game()

            self.current_episode += 1

@@ -462,9 +431,10 @@ class Agent(object):

                for network in self.networks:
                    network.sync()

-           if self.tp.visualization.dump_gifs and self.total_reward_in_current_episode > max_reward_achieved:
+           if self.total_reward_in_current_episode > max_reward_achieved:
                max_reward_achieved = self.total_reward_in_current_episode
                frame_skipping = int(5/self.tp.env.frame_skip)
+               if self.tp.visualization.dump_gifs:
                    logger.create_gif(self.last_episode_images[::frame_skipping],
                                      name='score-{}'.format(max_reward_achieved), fps=10)

@@ -496,7 +466,7 @@ class Agent(object):

        screen.log_title("Starting heatup {}".format(self.task_id))
        num_steps_required_for_one_training_batch = self.tp.batch_size * self.tp.env.observation_stack_size
        for step in range(max(self.tp.num_heatup_steps, num_steps_required_for_one_training_batch)):
-           self.act()
+           self.act(phase=RunPhase.HEATUP)

        # training phase
        self.in_heatup = False

@@ -509,7 +479,12 @@ class Agent(object):

            # evaluate
            evaluate_agent = (self.last_episode_evaluation_ran is not self.current_episode) and \
                             (self.current_episode % self.tp.evaluate_every_x_episodes == 0)
+           evaluate_agent = evaluate_agent or \
+                            (self.imitation and self.training_iteration > 0 and
+                             self.training_iteration % self.tp.evaluate_every_x_training_iterations == 0)

            if evaluate_agent:
+               self.env.reset()
                self.last_episode_evaluation_ran = self.current_episode
                self.evaluate(self.tp.evaluation_episodes)

@@ -522,6 +497,7 @@ class Agent(object):

                self.save_model(model_snapshots_periods_passed)

            # play and record in replay buffer
+           if self.tp.agent.collect_new_data:
                if self.tp.agent.step_until_collecting_full_episodes:
                    step = 0
                    while step < self.tp.agent.num_consecutive_playing_steps or self.memory.get_episode(-1).length() != 0:

@@ -537,6 +513,8 @@ class Agent(object):

                loss = self.train()
                self.loss.add_sample(loss)
                self.training_iteration += 1
+               if self.imitation:
+                   self.log_to_screen(RunPhase.TRAIN)
            self.post_training_commands()

    def save_model(self, model_id):
agents/bc_agent.py (new file)

@@ -0,0 +1,40 @@

#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from agents.imitation_agent import *


# Behavioral Cloning Agent
class BCAgent(ImitationAgent):
    def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
        ImitationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)

    def learn_from_batch(self, batch):
        current_states, _, actions, _, _, _ = self.extract_batch(batch)

        # create the inputs for the network
        input = current_states

        # the targets for the network are the actions since this is supervised learning
        if self.env.discrete_controls:
            targets = np.eye(self.env.action_space_size)[[actions]]
        else:
            targets = actions

        result = self.main_network.train_and_sync_networks(input, targets)
        total_loss = result[0]

        return total_loss
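A minimal numpy sketch of the supervised targets built above for discrete controls (the action values and action-space size are made up):

```python
import numpy as np

# One-hot encode recorded discrete actions, mirroring the np.eye(...) line above.
actions = np.array([0, 2, 1])   # example actions from a recorded batch
targets = np.eye(3)[actions]    # 3 = illustrative action space size
print(targets)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```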
agents/distributional_dqn_agent.py (new file)

@@ -0,0 +1,60 @@

#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from agents.value_optimization_agent import *


# Distributional Deep Q Network - https://arxiv.org/pdf/1707.06887.pdf
class DistributionalDQNAgent(ValueOptimizationAgent):
    def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
        ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
        self.z_values = np.linspace(self.tp.agent.v_min, self.tp.agent.v_max, self.tp.agent.atoms)

    # prediction's format is (batch,actions,atoms)
    def get_q_values(self, prediction):
        return np.dot(prediction, self.z_values)

    def learn_from_batch(self, batch):
        current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)

        # for the action we actually took, the error is calculated by the atoms distribution
        # for all other actions, the error is 0
        distributed_q_st_plus_1 = self.main_network.target_network.predict(next_states)
        # initialize with the current prediction so that we will
        TD_targets = self.main_network.online_network.predict(current_states)

        # only update the action that we have actually done in this transition
        target_actions = np.argmax(self.get_q_values(distributed_q_st_plus_1), axis=1)
        m = np.zeros((self.tp.batch_size, self.z_values.size))

        batches = np.arange(self.tp.batch_size)
        for j in range(self.z_values.size):
            tzj = np.fmax(np.fmin(rewards + (1.0 - game_overs) * self.tp.agent.discount * self.z_values[j],
                                  self.z_values[self.z_values.size - 1]),
                          self.z_values[0])
            bj = (tzj - self.z_values[0])/(self.z_values[1] - self.z_values[0])
            u = (np.ceil(bj)).astype(int)
            l = (np.floor(bj)).astype(int)
            m[batches, l] = m[batches, l] + (distributed_q_st_plus_1[batches, target_actions, j] * (u - bj))
            m[batches, u] = m[batches, u] + (distributed_q_st_plus_1[batches, target_actions, j] * (bj - l))
        # total_loss = cross entropy between actual result above and predicted result for the given action
        TD_targets[batches, actions] = m

        result = self.main_network.train_and_sync_networks(current_states, TD_targets)
        total_loss = result[0]

        return total_loss
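For reference, the projection loop above implements the categorical (C51) target distribution from https://arxiv.org/abs/1707.06887, with $z_j$ = `z_values`, $m$ = the projected distribution, and $a^*$ = `target_actions`:

$$\hat{\mathcal{T}}z_j = \mathrm{clip}\big(r + (1-d)\,\gamma z_j,\ V_{\min},\ V_{\max}\big), \qquad b_j = \frac{\hat{\mathcal{T}}z_j - V_{\min}}{\Delta z}, \quad l = \lfloor b_j \rfloor,\ u = \lceil b_j \rceil$$

$$m_l \mathrel{+}= p_j(s', a^*)\,(u - b_j), \qquad m_u \mathrel{+}= p_j(s', a^*)\,(b_j - l)$$

where $\Delta z = (V_{\max} - V_{\min})/(N_{\mathrm{atoms}} - 1)$ and $d$ is the terminal flag (`game_overs`). The resulting $m$ replaces the predicted distribution for the taken action and, as the comment in the code notes, is fit with a cross-entropy loss.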
agents/human_agent.py (new file)

@@ -0,0 +1,67 @@

#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from agents.agent import *
import pygame


class HumanAgent(Agent):
    def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
        Agent.__init__(self, env, tuning_parameters, replicated_device, thread_id)

        self.clock = pygame.time.Clock()
        self.max_fps = int(self.tp.visualization.max_fps_for_human_control)

        screen.log_title("Human Control Mode")
        available_keys = self.env.get_available_keys()
        if available_keys:
            screen.log("Use keyboard keys to move. Press escape to quit. Available keys:")
            screen.log("")
            for action, key in self.env.get_available_keys():
                screen.log("\t- {}: {}".format(action, key))
        screen.separator()

    def train(self):
        return 0

    def choose_action(self, curr_state, phase=RunPhase.TRAIN):
        action = self.env.get_action_from_user()

        # keep constant fps
        self.clock.tick(self.max_fps)

        if not self.env.renderer.is_open:
            self.save_replay_buffer_and_exit()

        return action, {"action_value": 0}

    def save_replay_buffer_and_exit(self):
        replay_buffer_path = os.path.join(logger.experiments_path, 'replay_buffer.p')
        self.memory.tp = None
        to_pickle(self.memory, replay_buffer_path)
        screen.log_title("Replay buffer was stored in {}".format(replay_buffer_path))
        exit()

    def log_to_screen(self, phase):
        # log to screen
        screen.log_dict(
            OrderedDict([
                ("Episode", self.current_episode),
                ("total reward", self.total_reward_in_current_episode),
                ("steps", self.total_steps_counter)
            ]),
            prefix="Recording"
        )
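A minimal sketch of reusing a buffer recorded this way for imitation learning (assumptions: the repository is on the Python path, `read_pickle` lives alongside the `to_pickle` helper used above, and the experiment path is illustrative):

```python
from utils import read_pickle  # assumed location of the read_pickle/to_pickle helpers

# Load a replay buffer recorded with --play; this mirrors the
# load_memory_from_file_path branch added to Agent.__init__ in this release.
memory = read_pickle('experiments/my_doom_run/replay_buffer.p')  # illustrative path
print(memory.num_transitions(), 'recorded transitions')
```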
agents/imitation_agent.py (new file)

@@ -0,0 +1,70 @@

#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from agents.agent import *


# Imitation Agent
class ImitationAgent(Agent):
    def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
        Agent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
        self.main_network = NetworkWrapper(tuning_parameters, False, self.has_global, 'main',
                                           self.replicated_device, self.worker_device)
        self.networks.append(self.main_network)
        self.imitation = True

    def extract_action_values(self, prediction):
        return prediction.squeeze()

    def choose_action(self, curr_state, phase=RunPhase.TRAIN):
        # convert to batch so we can run it through the network
        observation = np.expand_dims(np.array(curr_state['observation']), 0)
        if self.tp.agent.use_measurements:
            measurements = np.expand_dims(np.array(curr_state['measurements']), 0)
            prediction = self.main_network.online_network.predict([observation, measurements])
        else:
            prediction = self.main_network.online_network.predict(observation)

        # get action values and extract the best action from it
        action_values = self.extract_action_values(prediction)
        if self.env.discrete_controls:
            # DISCRETE
            # action = np.argmax(action_values)
            action = self.evaluation_exploration_policy.get_action(action_values)
            action_value = {"action_probability": action_values[action]}
        else:
            # CONTINUOUS
            action = action_values
            action_value = {}

        return action, action_value

    def log_to_screen(self, phase):
        # log to screen
        if phase == RunPhase.TRAIN:
            # for the training phase - we log during the episode to visualize the progress in training
            screen.log_dict(
                OrderedDict([
                    ("Worker", self.task_id),
                    ("Episode", self.current_episode),
                    ("Loss", self.loss.values[-1]),
                    ("Training iteration", self.training_iteration)
                ]),
                prefix="Training"
            )
        else:
            # for the evaluation phase - logging as in regular RL
            Agent.log_to_screen(self, phase)
@@ -45,7 +45,7 @@ class NStepQAgent(ValueOptimizationAgent, PolicyOptimizationAgent):

            # 1-Step Q learning
            q_st_plus_1 = self.main_network.target_network.predict(next_states)

-           for i in reversed(xrange(num_transitions)):
+           for i in reversed(range(num_transitions)):
                state_value_head_targets[i][actions[i]] = \
                    rewards[i] + (1.0 - game_overs[i]) * self.tp.agent.discount * np.max(q_st_plus_1[i], 0)

@@ -56,7 +56,7 @@ class NStepQAgent(ValueOptimizationAgent, PolicyOptimizationAgent):

            else:
                R = np.max(self.main_network.target_network.predict(np.expand_dims(next_states[-1], 0)))

-           for i in reversed(xrange(num_transitions)):
+           for i in reversed(range(num_transitions)):
                R = rewards[i] + self.tp.agent.discount * R
                state_value_head_targets[i][actions[i]] = R
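These two hunks only swap Python 2's `xrange` for `range` (the Python 3 fix listed in the release notes); the targets themselves are unchanged. For context, the backward loop shown here accumulates the N-step target

$$R_i = r_i + \gamma R_{i+1}, \qquad R_{\text{last}} = \max_a Q_{\text{target}}(s_{\text{last}}, a)$$

bootstrapping from the last next-state as in the `else:` branch above, while the 1-step branch targets $r_i + \gamma\,(1 - d_i)\,\max_a Q_{\text{target}}(s_{i+1}, a)$.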
@@ -58,7 +58,7 @@ class PolicyOptimizationAgent(Agent):

                ("steps", self.total_steps_counter),
                ("training iteration", self.training_iteration)
            ]),
-           prefix="Heatup" if self.in_heatup else "Training" if phase == RunPhase.TRAIN else "Testing"
+           prefix=phase
        )

    def update_episode_statistics(self, episode):
@@ -75,11 +75,14 @@ class NetworkWrapper(object):

                                               network_is_local=True)

        if not self.tp.distributed and self.tp.framework == Frameworks.TensorFlow:
-           self.model_saver = tf.train.Saver()
+           variables_to_restore = tf.global_variables()
+           variables_to_restore = [v for v in variables_to_restore if '/online' in v.name]
+           self.model_saver = tf.train.Saver(variables_to_restore)
            if self.tp.sess and self.tp.checkpoint_restore_dir:
                checkpoint = tf.train.latest_checkpoint(self.tp.checkpoint_restore_dir)
                screen.log_title("Loading checkpoint: {}".format(checkpoint))
                self.model_saver.restore(self.tp.sess, checkpoint)
+               self.update_target_network()

    def sync(self):
        """
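The change above restores only the online network's variables from a checkpoint and then syncs them into the target network. A standalone sketch of that pattern with the TF 1.x API (toy variable names; the commented restore line assumes a hypothetical checkpoint directory):

```python
import tensorflow as tf  # TensorFlow 1.x

# Toy graph mirroring the '/online' naming convention the Saver filter relies on.
with tf.variable_scope('main'):
    with tf.variable_scope('online'):
        w_online = tf.get_variable('w', shape=[2], initializer=tf.zeros_initializer())
    with tf.variable_scope('target'):
        w_target = tf.get_variable('w', shape=[2], initializer=tf.zeros_initializer())

online_vars = [v for v in tf.global_variables() if '/online' in v.name]
saver = tf.train.Saver(var_list=online_vars)   # saves/restores online weights only
sync_target = tf.assign(w_target, w_online)    # roughly what update_target_network() does

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # saver.restore(sess, tf.train.latest_checkpoint('<checkpoint_dir>'))  # hypothetical path
    sess.run(sync_target)
```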
@@ -15,15 +15,18 @@

#

import tensorflow as tf
+from configurations import EmbedderComplexity


class InputEmbedder(object):
-   def __init__(self, input_size, activation_function=tf.nn.relu, name="embedder"):
+   def __init__(self, input_size, activation_function=tf.nn.relu,
+                embedder_complexity=EmbedderComplexity.Shallow, name="embedder"):
        self.name = name
        self.input_size = input_size
        self.activation_function = activation_function
        self.input = None
        self.output = None
+       self.embedder_complexity = embedder_complexity

    def __call__(self, prev_input_placeholder=None):
        with tf.variable_scope(self.get_name()):

@@ -43,13 +46,17 @@


class ImageEmbedder(InputEmbedder):
-   def __init__(self, input_size, input_rescaler=255.0, activation_function=tf.nn.relu, name="embedder"):
-       InputEmbedder.__init__(self, input_size, activation_function, name)
+   def __init__(self, input_size, input_rescaler=255.0, activation_function=tf.nn.relu,
+                embedder_complexity=EmbedderComplexity.Shallow, name="embedder"):
+       InputEmbedder.__init__(self, input_size, activation_function, embedder_complexity, name)
        self.input_rescaler = input_rescaler

    def _build_module(self):
        # image observation
        rescaled_observation_stack = self.input / self.input_rescaler

+       if self.embedder_complexity == EmbedderComplexity.Shallow:
+           # same embedder as used in the original DQN paper
            self.observation_conv1 = tf.layers.conv2d(rescaled_observation_stack,
                                                      filters=32, kernel_size=(8, 8), strides=(4, 4),
                                                      activation=self.activation_function, data_format='channels_last')

@@ -62,12 +69,54 @@ class ImageEmbedder(InputEmbedder):

            self.output = tf.contrib.layers.flatten(self.observation_conv3)

+       elif self.embedder_complexity == EmbedderComplexity.Deep:
+           # the embedder used in the CARLA papers
+           self.observation_conv1 = tf.layers.conv2d(rescaled_observation_stack,
+                                                     filters=32, kernel_size=(5, 5), strides=(2, 2),
+                                                     activation=self.activation_function, data_format='channels_last')
+           self.observation_conv2 = tf.layers.conv2d(self.observation_conv1,
+                                                     filters=32, kernel_size=(3, 3), strides=(1, 1),
+                                                     activation=self.activation_function, data_format='channels_last')
+           self.observation_conv3 = tf.layers.conv2d(self.observation_conv2,
+                                                     filters=64, kernel_size=(3, 3), strides=(2, 2),
+                                                     activation=self.activation_function, data_format='channels_last')
+           self.observation_conv4 = tf.layers.conv2d(self.observation_conv3,
+                                                     filters=64, kernel_size=(3, 3), strides=(1, 1),
+                                                     activation=self.activation_function, data_format='channels_last')
+           self.observation_conv5 = tf.layers.conv2d(self.observation_conv4,
+                                                     filters=128, kernel_size=(3, 3), strides=(2, 2),
+                                                     activation=self.activation_function, data_format='channels_last')
+           self.observation_conv6 = tf.layers.conv2d(self.observation_conv5,
+                                                     filters=128, kernel_size=(3, 3), strides=(1, 1),
+                                                     activation=self.activation_function, data_format='channels_last')
+           self.observation_conv7 = tf.layers.conv2d(self.observation_conv6,
+                                                     filters=256, kernel_size=(3, 3), strides=(2, 2),
+                                                     activation=self.activation_function, data_format='channels_last')
+           self.observation_conv8 = tf.layers.conv2d(self.observation_conv7,
+                                                     filters=256, kernel_size=(3, 3), strides=(1, 1),
+                                                     activation=self.activation_function, data_format='channels_last')
+
+           self.output = tf.contrib.layers.flatten(self.observation_conv8)
+       else:
+           raise ValueError("The defined embedder complexity value is invalid")


class VectorEmbedder(InputEmbedder):
-   def __init__(self, input_size, activation_function=tf.nn.relu, name="embedder"):
-       InputEmbedder.__init__(self, input_size, activation_function, name)
+   def __init__(self, input_size, activation_function=tf.nn.relu,
+                embedder_complexity=EmbedderComplexity.Shallow, name="embedder"):
+       InputEmbedder.__init__(self, input_size, activation_function, embedder_complexity, name)

    def _build_module(self):
        # vector observation
        input_layer = tf.contrib.layers.flatten(self.input)

+       if self.embedder_complexity == EmbedderComplexity.Shallow:
            self.output = tf.layers.dense(input_layer, 256, activation=self.activation_function)
+       elif self.embedder_complexity == EmbedderComplexity.Deep:
+           # the embedder used in the CARLA papers
+           self.observation_fc1 = tf.layers.dense(input_layer, 128, activation=self.activation_function)
+           self.observation_fc2 = tf.layers.dense(self.observation_fc1, 128, activation=self.activation_function)
+           self.output = tf.layers.dense(self.observation_fc2, 128, activation=self.activation_function)
+       else:
+           raise ValueError("The defined embedder complexity value is invalid")
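A hedged sketch of opting into the deep (CARLA-style) embedder through the new switch; only `EmbedderComplexity`, `AgentParameters`, and the `embedder_complexity` field come from this diff, the subclass name is illustrative, and the repository is assumed to be on the Python path:

```python
from configurations import AgentParameters, EmbedderComplexity

class MyCarlaAgentParameters(AgentParameters):  # illustrative name, not part of this release
    # switch the image/vector embedders from the default DQN-style network
    # to the deeper CARLA-style network added above
    embedder_complexity = EmbedderComplexity.Deep
```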
coach.py

@@ -37,8 +37,29 @@ time_started = datetime.datetime.now()

cur_time = time_started.time()
cur_date = time_started.date()

-def get_experiment_path(general_experiments_path):
-   if not os.path.exists(general_experiments_path):
+def get_experiment_name(initial_experiment_name=''):
+   match = None
+   while match is None:
+       if initial_experiment_name == '':
+           experiment_name = screen.ask_input("Please enter an experiment name: ")
+       else:
+           experiment_name = initial_experiment_name
+
+       experiment_name = experiment_name.replace(" ", "_")
+       match = re.match("^$|^[\w -/]{1,100}$", experiment_name)
+
+       if match is None:
+           screen.error('Experiment name must be composed only of alphanumeric letters, '
+                        'underscores and dashes and should not be longer than 100 characters.')
+
+   return match.group(0)
+
+
+def get_experiment_path(experiment_name, create_path=True):
+   general_experiments_path = os.path.join('./experiments/', experiment_name)
+
+   if not os.path.exists(general_experiments_path) and create_path:
        os.makedirs(general_experiments_path)
    experiment_path = os.path.join(general_experiments_path, '{}_{}_{}-{}_{}'
                                   .format(logger.two_digits(cur_date.day), logger.two_digits(cur_date.month),

@@ -52,6 +73,7 @@ def get_experiment_path(general_experiments_path):

                                           cur_time.minute, i))
            i += 1
        else:
+           if create_path:
                os.makedirs(experiment_path)
            return experiment_path

@@ -96,55 +118,54 @@ def check_input_and_fill_run_dict(parser):

        num_workers = int(re.match("^\d+$", args.num_workers).group(0))
    except ValueError:
        screen.error("Parameter num_workers should be an integer.")
-       exit(1)

    preset_names = list_all_classes_in_module(presets)
    if args.preset is not None and args.preset not in preset_names:
        screen.error("A non-existing preset was selected. ")
-       exit(1)

    if args.checkpoint_restore_dir is not None and not os.path.exists(args.checkpoint_restore_dir):
        screen.error("The requested checkpoint folder to load from does not exist. ")
-       exit(1)

    if args.save_model_sec is not None:
        try:
            args.save_model_sec = int(args.save_model_sec)
        except ValueError:
            screen.error("Parameter save_model_sec should be an integer.")
-           exit(1)

    if args.preset is None and (args.agent_type is None or args.environment_type is None
-                               or args.exploration_policy_type is None):
+                               or args.exploration_policy_type is None) and not args.play:
        screen.error('When no preset is given for Coach to run, the user is expected to input the desired agent_type,'
                     ' environment_type and exploration_policy_type to assemble a preset. '
                     '\nAt least one of these parameters was not given.')
-       exit(1)
+   elif args.preset is None and args.play and args.environment_type is None:
+       screen.error('When no preset is given for Coach to run, and the user requests human control over the environment,'
+                    ' the user is expected to input the desired environment_type and level.'
+                    '\nAt least one of these parameters was not given.')
+   elif args.preset is None and args.play and args.environment_type:
+       args.agent_type = 'Human'
+       args.exploration_policy_type = 'ExplorationParameters'

-   experiment_name = args.experiment_name
-   if args.experiment_name == '':
-       experiment_name = screen.ask_input("Please enter an experiment name: ")
-
-   experiment_name = experiment_name.replace(" ", "_")
-   match = re.match("^$|^\w{1,100}$", experiment_name)
-
-   if match is None:
-       screen.error('Experiment name must be composed only of alphanumeric letters and underscores and should not be '
-                    'longer than 100 characters.')
-       exit(1)
-
-   experiment_path = os.path.join('./experiments/', match.group(0))
-   experiment_path = get_experiment_path(experiment_path)
+   # get experiment name and path
+   experiment_name = get_experiment_name(args.experiment_name)
+   experiment_path = get_experiment_path(experiment_name)
+
+   if args.play and num_workers > 1:
+       screen.warning("Playing the game as a human is only available with a single worker. "
+                      "The number of workers will be reduced to 1")
+       num_workers = 1

    # fill run_dict
    run_dict = dict()
    run_dict['agent_type'] = args.agent_type
    run_dict['environment_type'] = args.environment_type
    run_dict['exploration_policy_type'] = args.exploration_policy_type
+   run_dict['level'] = args.level
    run_dict['preset'] = args.preset
    run_dict['custom_parameter'] = args.custom_parameter
    run_dict['experiment_path'] = experiment_path
    run_dict['framework'] = Frameworks().get(args.framework)
+   run_dict['play'] = args.play
+   run_dict['evaluate'] = args.evaluate# or args.play

    # multi-threading parameters
    run_dict['num_threads'] = num_workers

@@ -197,6 +218,14 @@ if __name__ == "__main__":

                        help="(int) Number of workers for multi-process based agents, e.g. A3C",
                        default='1',
                        type=str)
+   parser.add_argument('--play',
+                       help="(flag) Play as a human by controlling the game with the keyboard. "
+                            "This option will save a replay buffer with the game play.",
+                       action='store_true')
+   parser.add_argument('--evaluate',
+                       help="(flag) Run evaluation only. This is a convenient way to disable "
+                            "training in order to evaluate an existing checkpoint.",
+                       action='store_true')
    parser.add_argument('-v', '--verbose',
                        help="(flag) Don't suppress TensorFlow debug prints.",
                        action='store_true')

@@ -230,6 +259,12 @@ if __name__ == "__main__":

                        ,
                        default=None,
                        type=str)
+   parser.add_argument('-lvl', '--level',
+                       help="(string) Choose the level that will be played in the environment that was selected."
+                            "This value will override the level parameter in the environment class."
+                       ,
+                       default=None,
+                       type=str)
    parser.add_argument('-cp', '--custom_parameter',
                        help="(string) Semicolon separated parameters used to override specific parameters on top of"
                             " the selected preset (or on top of the command-line assembled one). "

@@ -259,6 +294,11 @@ if __name__ == "__main__":

        tuning_parameters.task_index = 0
        env_instance = create_environment(tuning_parameters)
        agent = eval(tuning_parameters.agent.type + '(env_instance, tuning_parameters)')

+       # Start the training or evaluation
+       if tuning_parameters.evaluate:
+           agent.evaluate(sys.maxsize, keep_networks_synced=True)  # evaluate forever
+       else:
            agent.improve()

    # Multi-threaded runs
@@ -32,6 +32,11 @@ class InputTypes(object):

    TimedObservation = 5


+class EmbedderComplexity(object):
+   Shallow = 1
+   Deep = 2
+
+
class OutputTypes(object):
    Q = 1
    DuelingQ = 2

@@ -60,6 +65,7 @@ class AgentParameters(object):

    middleware_type = MiddlewareTypes.FC
    loss_weights = [1.0]
    stop_gradients_from_head = [False]
+   embedder_complexity = EmbedderComplexity.Shallow
    num_output_head_copies = 1
    use_measurements = False
    use_accumulated_reward_as_measurement = False

@@ -90,6 +96,8 @@ class AgentParameters(object):

    step_until_collecting_full_episodes = False
    targets_horizon = 'N-Step'
    replace_mse_with_huber_loss = False
+   load_memory_from_file_path = None
+   collect_new_data = True

    # PPO related params
    target_kl_divergence = 0.01

@@ -132,6 +140,7 @@ class EnvironmentParameters(object):

    reward_scaling = 1.0
    reward_clipping_min = None
    reward_clipping_max = None
+   human_control = False


class ExplorationParameters(object):

@@ -188,6 +197,7 @@ class GeneralParameters(object):

    kl_divergence_constraint = 100000
    num_training_iterations = 10000000000
    num_heatup_steps = 1000
+   heatup_using_network_decisions = False
    batch_size = 32
    save_model_sec = None
    save_model_dir = None

@@ -197,6 +207,7 @@ class GeneralParameters(object):

    learning_rate_decay_steps = 0
    evaluation_episodes = 5
    evaluate_every_x_episodes = 1000000
+   evaluate_every_x_training_iterations = 0
    rescaling_interpolation_type = 'bilinear'

    # setting a seed will only work for non-parallel algorithms. Parallel algorithms add uncontrollable noise in

@@ -224,6 +235,7 @@ class VisualizationParameters(object):

    dump_signals_to_csv_every_x_episodes = 10
    render = False
    dump_gifs = True
+   max_fps_for_human_control = 10


class Roboschool(EnvironmentParameters):

@@ -252,7 +264,7 @@ class Bullet(EnvironmentParameters):

class Atari(EnvironmentParameters):
    type = 'Gym'
-   frame_skip = 1
+   frame_skip = 4
    observation_stack_size = 4
    desired_observation_height = 84
|
desired_observation_height = 84
|
||||||
desired_observation_width = 84
|
desired_observation_width = 84
|
||||||
@@ -268,6 +280,31 @@ class Doom(EnvironmentParameters):
     desired_observation_width = 76


+class Carla(EnvironmentParameters):
+    type = 'Carla'
+    frame_skip = 1
+    observation_stack_size = 4
+    desired_observation_height = 128
+    desired_observation_width = 180
+    normalize_observation = False
+    server_height = 256
+    server_width = 360
+    config = 'environments/CarlaSettings.ini'
+    level = 'town1'
+    verbose = True
+    stereo = False
+    semantic_segmentation = False
+    depth = False
+    episode_max_time = 100000  # milliseconds for each episode
+    continuous_to_bool_threshold = 0.5
+    allow_braking = False
+
+
+class Human(AgentParameters):
+    type = 'HumanAgent'
+    num_episodes_in_experience_replay = 10000000
+
+
 class NStepQ(AgentParameters):
     type = 'NStepQAgent'
     input_types = [InputTypes.Observation]
@@ -299,10 +336,12 @@ class DQN(AgentParameters):
 class DDQN(DQN):
     type = 'DDQNAgent'


 class DuelingDQN(DQN):
     type = 'DQNAgent'
     output_types = [OutputTypes.DuelingQ]


 class BootstrappedDQN(DQN):
     type = 'BootstrappedDQNAgent'
     num_output_head_copies = 10
@@ -314,6 +353,7 @@ class CategoricalDQN(DQN):
     v_min = -10.0
     v_max = 10.0
     atoms = 51
+    neon_support = False


 class QuantileRegressionDQN(DQN):
@@ -452,6 +492,7 @@ class ClippedPPO(AgentParameters):
     step_until_collecting_full_episodes = True
     beta_entropy = 0.01


 class DFP(AgentParameters):
     type = 'DFPAgent'
     input_types = [InputTypes.Observation, InputTypes.Measurements, InputTypes.GoalVector]
@@ -485,6 +526,15 @@ class PAL(AgentParameters):
     neon_support = True


+class BC(AgentParameters):
+    type = 'BCAgent'
+    input_types = [InputTypes.Observation]
+    output_types = [OutputTypes.Q]
+    loss_weights = [1.0]
+    collect_new_data = False
+    evaluate_every_x_training_iterations = 50000
+
+
 class EGreedyExploration(ExplorationParameters):
     policy = 'EGreedy'
     initial_epsilon = 0.5
docs/docs/algorithms/imitation/bc.md (new file, 25 lines)
@@ -0,0 +1,25 @@
# Behavioral Cloning

**Action space:** Discrete | Continuous

## Network Structure

<p style="text-align: center;">

<img src="..\..\design_imgs\dqn.png">

</p>

## Algorithm Description

### Training the network

The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as (state, action) tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by the expert for each state.

1. Sample a batch of transitions from the replay buffer.
2. Use the current states as input to the network, and the expert actions as the targets of the network.
3. The loss function for the network is MSE, and therefore we use the Q head to minimize this loss.
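To make the three steps above concrete, here is a minimal, self-contained sketch of the behavioral cloning update. It is illustrative only and not taken from the Coach code base: the policy is a plain linear model trained with NumPy, and the "expert" is a random linear mapping invented purely for the example.

```python
import numpy as np

# Illustration of the BC update: fit a policy to expert (state, action) pairs with an MSE loss.
rng = np.random.default_rng(0)
state_dim, action_dim, batch_size, lr = 8, 2, 32, 0.1

# Stand-in for a replay buffer filled with expert demonstrations (no rewards needed).
expert_states = rng.normal(size=(1000, state_dim))
expert_actions = expert_states @ rng.normal(size=(state_dim, action_dim))  # the "expert" to imitate

weights = np.zeros((state_dim, action_dim))                  # the cloned policy
for step in range(500):
    idx = rng.integers(0, len(expert_states), batch_size)    # 1. sample a batch of transitions
    states, actions = expert_states[idx], expert_actions[idx]
    error = states @ weights - actions                       # 2. states as input, expert actions as targets
    grad = 2.0 * states.T @ error / error.size               # gradient of the MSE loss
    weights -= lr * grad                                     # 3. minimize the MSE loss

print("final MSE:", np.mean((expert_states @ weights - expert_actions) ** 2))
```

With a neural network instead of the linear model, the loop is the same; only the gradient step is delegated to the framework's optimizer.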
@@ -0,0 +1,33 @@
# Distributional DQN

**Action space:** Discrete

**References:** [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)

## Network Structure

<p style="text-align: center;">

<img src="..\..\design_imgs\distributional_dqn.png">

</p>

## Algorithmic Description

### Training the network

1. Sample a batch of transitions from the replay buffer.

2. The Bellman update is projected onto the set of atoms representing the $Q$ value distribution, such that the $i$-th component of the projected update is calculated as follows:

   $$ (\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{|[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i|}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1})) $$

   where:

   * $[ \cdot ]^b_a$ bounds its argument in the range $[a, b]$
   * $\hat{T}_{z_{j}}$ is the Bellman update for atom $z_j$: $\hat{T}_{z_{j}} := r+\gamma z_j$

3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target probability distribution. Only the target of the actions that were actually taken is updated.

4. Once every few thousand steps, the weights are copied from the online network to the target network.
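As a concrete illustration of the projection in step 2, the NumPy sketch below distributes each Bellman-updated atom $r + \gamma z_j$ onto its two neighboring atoms of the fixed support. It is not code from Coach; the function and variable names are made up, and the usual terminal-state masking (`dones`) is added even though the formula above omits it.

```python
import numpy as np

def project_bellman_update(rewards, dones, next_probs, z, gamma=0.99):
    """Project r + gamma * z_j onto the fixed support z, as in the formula above."""
    v_min, v_max = z[0], z[-1]
    delta_z = z[1] - z[0]
    batch_size, n_atoms = next_probs.shape
    projected = np.zeros((batch_size, n_atoms))
    batch_idx = np.arange(batch_size)
    for j in range(n_atoms):
        # Bellman update for atom z_j, bounded to [V_MIN, V_MAX]
        tz_j = np.clip(rewards + gamma * (1.0 - dones) * z[j], v_min, v_max)
        b = (tz_j - v_min) / delta_z                       # continuous position on the support
        lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
        exact = (lower == upper)                           # b landed exactly on an atom
        projected[batch_idx, lower] += next_probs[:, j] * np.where(exact, 1.0, upper - b)
        projected[batch_idx, upper] += next_probs[:, j] * np.where(exact, 0.0, b - lower)
    return projected

# usage: 51 atoms between -10 and 10, matching the CategoricalDQN parameters above
z = np.linspace(-10.0, 10.0, 51)
target = project_bellman_update(rewards=np.array([1.0, 0.0]),
                                dones=np.array([0.0, 1.0]),
                                next_probs=np.full((2, 51), 1.0 / 51),
                                z=z)
assert np.allclose(target.sum(axis=1), 1.0)   # the projection preserves probability mass
```

The cross-entropy loss in step 3 is then taken between this projected target and the online network's predicted distribution for the action that was taken.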
@@ -1,23 +1,34 @@
 Adding a new environment to Coach is as easy as solving CartPole.

-There a few simple steps to follow, and we will walk through them one by one.
+There are a few simple steps to follow, and we will walk through them one by one.

 1. Coach defines a simple API for implementing a new environment which is defined in environment/environment_wrapper.py.
    There are several functions to implement, but only some of them are mandatory.

-   Here are the mandatory ones:
+   Here are the important ones:

-       def step(self, action_idx):
+       def _take_action(self, action_idx):
            """
-           Perform a single step on the environment using the given action.
+           An environment dependent function that sends an action to the simulator.
-           :param action_idx: the action to perform on the environment
+           :param action_idx: the action to perform on the environment.
-           :return: A dictionary containing the observation, reward, done flag, action and measurements
+           :return: None
            """
            pass

-       def render(self):
+       def _preprocess_observation(self, observation):
            """
-           Call the environment function for rendering to the screen.
+           Do initial observation preprocessing such as cropping, rgb2gray, rescale etc.
+           Implementing this function is optional.
+           :param observation: a raw observation from the environment
+           :return: the preprocessed observation
+           """
+           return observation
+
+       def _update_state(self):
+           """
+           Updates the state from the environment.
+           Should update self.observation, self.reward, self.done, self.measurements and self.info
+           :return: None
            """
            pass

@@ -28,6 +39,15 @@ There a few simple steps to follow, and we will walk through them one by one.
            """
            pass

+       def get_rendered_image(self):
+           """
+           Return a numpy array containing the image that will be rendered to the screen.
+           This can be different from the observation. For example, mujoco's observation is a measurements vector.
+           :return: numpy array containing the image that will be rendered to the screen
+           """
+           return self.observation
+
+
 2. Make sure to import the environment in environments/\_\_init\_\_.py:

        from doom_environment_wrapper import *
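To make the API above concrete, here is a minimal sketch of a toy environment built against the functions listed in step 1. It is illustrative only: the class name and the random-walk dynamics are made up, and it assumes the Coach repository root is on the Python path and that a `tuning_parameters` object (as produced by a preset) is passed in.

```python
import numpy as np
from environments.environment_wrapper import EnvironmentWrapper


class ToyWalkEnvironmentWrapper(EnvironmentWrapper):
    """A 1-D random-walk toy environment, used only to illustrate the wrapper API."""
    def __init__(self, tuning_parameters):
        EnvironmentWrapper.__init__(self, tuning_parameters)
        self.position = 0
        self.actions = {0: -1, 1: +1}            # move left / move right
        self.action_space_size = 2
        self.width = self.height = 8             # the observation is an 8x8 image
        self.measurements_size = (1,)
        self.reset()

    def _take_action(self, action_idx):
        # send the chosen action to the "simulator"
        self.position += self.actions[action_idx]

    def _update_state(self):
        # refresh observation, reward, done flag and measurements from the simulator state
        self.observation = np.zeros((self.height, self.width))
        column = int(np.clip(self.position + self.width // 2, 0, self.width - 1))
        self.observation[0, column] = 1.0
        self.reward = 1.0 if self.position == self.width // 2 - 1 else 0.0
        self.done = abs(self.position) >= self.width // 2
        self.measurements = [self.position]

    def _restart_environment_episode(self, force_environment_reset=False):
        self.position = 0

    def get_rendered_image(self):
        return self.observation
```

Importing the class in environments/\_\_init\_\_.py and adding an entry to the EnvTypes enum (as the \_\_init\_\_.py diff below shows for CARLA) then makes it selectable from a preset.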
(binary image changed: 31 KiB → 35 KiB)

docs/docs/usage.md (new file, 133 lines)
@@ -0,0 +1,133 @@
# Coach Usage

## Training an Agent

### Single-threaded Algorithms

This is the most common case. Just choose a preset using the `-p` flag and press enter.

*Example:*

`python coach.py -p CartPole_DQN`

### Multi-threaded Algorithms

Multi-threaded algorithms are very common these days.
They typically achieve the best results, and scale gracefully with the number of threads.
In Coach, running such algorithms is done by selecting a suitable preset, and choosing the number of threads to run using the `-n` flag.

*Example:*

`python coach.py -p CartPole_A3C -n 8`

## Evaluating an Agent

There are several options for evaluating an agent during the training:

* For multi-threaded runs, an evaluation agent will constantly run in the background and evaluate the model during the training.

* For single-threaded runs, it is possible to define an evaluation period through the preset. This will run several episodes of evaluation once in a while.

Additionally, it is possible to save checkpoints of the agent's networks and then run only in evaluation mode.
Saving checkpoints can be done by specifying the number of seconds between storing checkpoints, using the `-s` flag.
The checkpoints will be saved into the experiment directory.
Loading a model for evaluation can be done by specifying the `-crd` flag with the experiment directory, and the `--evaluate` flag to disable training.

*Example:*

`python coach.py -p CartPole_DQN -s 60`

`python coach.py -p CartPole_DQN --evaluate -crd CHECKPOINT_RESTORE_DIR`

## Playing with the Environment as a Human

Interacting with the environment as a human can be useful for understanding its difficulties and for collecting data for imitation learning.
In Coach, this can be easily done by selecting a preset that defines the environment to use, and specifying the `--play` flag.
When the environment is loaded, the available keyboard buttons will be printed to the screen.
Pressing the escape key when finished will end the simulation and store the replay buffer in the experiment directory.

*Example:*

`python coach.py -p Breakout_DQN --play`

## Learning Through Imitation Learning

Learning through imitation of human behavior is a nice way to speed up the learning.
In Coach, this can be done in two steps:

1. Create a dataset of demonstrations by playing with the environment as a human.
   To do so, select an environment type and level through the command line, and specify the `--play` flag.
   After this step, a pickle of the replay buffer containing your game play will be stored in the experiment directory,
   and its path will be printed to the screen (a short sketch for inspecting this file is given at the end of this page).

   *Example:*

   `python coach.py -et Doom -lvl Basic --play`

2. Next, use an imitation learning preset and set the replay buffer path accordingly.
   The path can be set either from the command line or from the preset itself.

   *Example:*

   `python coach.py -p Doom_Basic_BC -cp='agent.load_memory_from_file_path=\"<experiment dir>/replay_buffer.p\"'`

## Visualizations

### Rendering the Environment

Rendering the environment can be done by using the `-r` flag.
When working with multi-threaded algorithms, the rendered image will represent the game play of the evaluation worker.
When working with single-threaded algorithms, the rendered image will represent the single worker, which can be either training or evaluating.
Keep in mind that rendering the environment in single-threaded algorithms may slow the training to some extent.
When playing with the environment using the `--play` flag, the environment will be rendered automatically, without the need for specifying the `-r` flag.

*Example:*

`python coach.py -p Breakout_DQN -r`

### Dumping GIFs

Coach allows storing GIFs of the agent's game play.
To dump GIF files, use the `-dg` flag.
The files are dumped after every evaluation episode, and are saved into the experiment directory, under a gifs sub-directory.

*Example:*

`python coach.py -p Breakout_A3C -n 4 -dg`

## Switching Between Deep Learning Frameworks

Coach uses TensorFlow as its main backend framework, but it also supports neon for some of the algorithms.
By default, TensorFlow will be used. It is possible to switch to neon using the `-f` flag.

*Example:*

`python coach.py -p Doom_Basic_DQN -f neon`

## Additional Flags

There are several convenient flags which are important to know about.
Most of the flags are listed here, but they can change from time to time; the most up-to-date description can be found by using the `-h` flag.

| Flag | Type | Description |
|------|------|-------------|
| `-p PRESET`, `--preset PRESET` | string | Name of a preset to run (as configured in presets.py) |
| `-l`, `--list` | flag | List all available presets |
| `-e EXPERIMENT_NAME`, `--experiment_name EXPERIMENT_NAME` | string | Experiment name to be used to store the results. |
| `-r`, `--render` | flag | Render the environment |
| `-f FRAMEWORK`, `--framework FRAMEWORK` | string | Neural network framework. Available values: tensorflow, neon |
| `-n NUM_WORKERS`, `--num_workers NUM_WORKERS` | int | Number of workers for multi-process based agents, e.g. A3C |
| `--play` | flag | Play as a human by controlling the game with the keyboard. This option will save a replay buffer with the game play. |
| `--evaluate` | flag | Run evaluation only. This is a convenient way to disable training in order to evaluate an existing checkpoint. |
| `-v`, `--verbose` | flag | Don't suppress TensorFlow debug prints. |
| `-s SAVE_MODEL_SEC`, `--save_model_sec SAVE_MODEL_SEC` | int | Time in seconds between saving checkpoints of the model. |
| `-crd CHECKPOINT_RESTORE_DIR`, `--checkpoint_restore_dir CHECKPOINT_RESTORE_DIR` | string | Path to a folder containing a checkpoint to restore the model from. |
| `-dg`, `--dump_gifs` | flag | Enable the gif saving functionality. |
| `-at AGENT_TYPE`, `--agent_type AGENT_TYPE` | string | Choose an agent type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type` |
| `-et ENVIRONMENT_TYPE`, `--environment_type ENVIRONMENT_TYPE` | string | Choose an environment type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type` |
| `-ept EXPLORATION_POLICY_TYPE`, `--exploration_policy_type EXPLORATION_POLICY_TYPE` | string | Choose an exploration policy type class to override on top of the selected preset. If no preset is defined, a preset can be set from the command line by combining settings which are set by using `--agent_type`, `--experiment_type`, `--environment_type` |
| `-lvl LEVEL`, `--level LEVEL` | string | Choose the level that will be played in the environment that was selected. This value will override the level parameter in the environment class. |
| `-cp CUSTOM_PARAMETER`, `--custom_parameter CUSTOM_PARAMETER` | string | Semicolon separated parameters used to override specific parameters on top of the selected preset (or on top of the command-line assembled one). Whenever a parameter value is a string, it should be inputted as `'\"string\"'`. For ex.: `"visualization.render=False;` `num_training_iterations=500;` `optimizer='rmsprop'"` |
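A small addendum to the Imitation Learning section above: before pointing `agent.load_memory_from_file_path` at a recorded session, it can be reassuring to check that the file loads. The buffer is stored as a regular pickle, so a quick look is possible. This is only a sketch: the path is the placeholder Coach prints at the end of a `--play` session, and it should be run from the Coach root so the memory classes can be un-pickled.

```python
import pickle

# replace <experiment dir> with the path printed when the --play session ends
with open('<experiment dir>/replay_buffer.p', 'rb') as f:
    replay_buffer = pickle.load(f)

print(type(replay_buffer))  # inspect the recorded demonstrations before training the BC agent
```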
@@ -11,6 +11,7 @@ extra_css: [extra.css]
 pages:
 - Home : index.md
 - Design: design.md
+- Usage: usage.md
 - Algorithms:
   - 'DQN' : algorithms/value_optimization/dqn.md
   - 'Double DQN' : algorithms/value_optimization/double_dqn.md
@@ -28,6 +29,7 @@ pages:
   - 'Proximal Policy Optimization' : algorithms/policy_optimization/ppo.md
   - 'Clipped Proximal Policy Optimization' : algorithms/policy_optimization/cppo.md
   - 'Direct Future Prediction' : algorithms/other/dfp.md
+  - 'Behavioral Cloning' : algorithms/imitation/bc.md
+
 - Coach Dashboard : 'dashboard.md'
 - Contributing :
environments/CarlaSettings.ini (new file, 62 lines)
@@ -0,0 +1,62 @@
[CARLA/Server]
; If set to false, a mock controller will be used instead of waiting for a real
; client to connect.
UseNetworking=true
; Ports to use for the server-client communication. This can be overridden by
; the command-line switch `-world-port=N`, write and read ports will be set to
; N+1 and N+2 respectively.
WorldPort=2000
; Time-out in milliseconds for the networking operations.
ServerTimeOut=10000000000
; In synchronous mode, CARLA waits every frame until the control from the client
; is received.
SynchronousMode=true
; Send info about every non-player agent in the scene every frame, the
; information is attached to the measurements message. This includes other
; vehicles, pedestrians and traffic signs. Disabled by default to improve
; performance.
SendNonPlayerAgentsInfo=false

[CARLA/LevelSettings]
; Path of the vehicle class to be used for the player. Leave empty for default.
; Paths follow the pattern "/Game/Blueprints/Vehicles/Mustang/Mustang.Mustang_C"
PlayerVehicle=
; Number of non-player vehicles to be spawned into the level.
NumberOfVehicles=15
; Number of non-player pedestrians to be spawned into the level.
NumberOfPedestrians=30
; Index of the weather/lighting presets to use. If negative, the default presets
; of the map will be used.
WeatherId=1
; Seeds for the pseudo-random number generators.
SeedVehicles=123456789
SeedPedestrians=123456789

[CARLA/SceneCapture]
; Names of the cameras to be attached to the player, comma-separated, each of
; them should be defined in its own subsection. E.g., uncomment the next line to
; add a camera called MyCamera to the vehicle

Cameras=CameraRGB

; Now, every camera we added needs to be defined in its own subsection.
[CARLA/SceneCapture/CameraRGB]
; Post-processing effect to be applied. Valid values:
; * None                  No effects applied.
; * SceneFinal            Post-processing present at scene (bloom, fog, etc).
; * Depth                 Depth map ground-truth only.
; * SemanticSegmentation  Semantic segmentation ground-truth only.
PostProcessing=SceneFinal
; Size of the captured image in pixels.
ImageSizeX=360
ImageSizeY=256
; Camera (horizontal) field of view in degrees.
CameraFOV=90
; Position of the camera relative to the car in centimeters.
CameraPositionX=200
CameraPositionY=0
CameraPositionZ=140
; Rotation of the camera relative to the car in degrees.
CameraRotationPitch=0
CameraRotationRoll=0
CameraRotationYaw=0
@@ -15,13 +15,16 @@
 #

 from logger import *
-from utils import Enum
+from utils import Enum, get_open_port
 from environments.gym_environment_wrapper import *
 from environments.doom_environment_wrapper import *
+from environments.carla_environment_wrapper import *


 class EnvTypes(Enum):
     Doom = "DoomEnvironmentWrapper"
     Gym = "GymEnvironmentWrapper"
+    Carla = "CarlaEnvironmentWrapper"


 def create_environment(tuning_parameters):
|||||||
230
environments/carla_environment_wrapper.py
Normal file
@@ -0,0 +1,230 @@
|
|||||||
|
import sys
|
||||||
|
from os import path, environ
|
||||||
|
|
||||||
|
try:
|
||||||
|
sys.path.append(path.join(environ.get('CARLA_ROOT'), 'PythonClient'))
|
||||||
|
from carla.client import CarlaClient
|
||||||
|
from carla.settings import CarlaSettings
|
||||||
|
from carla.tcp import TCPConnectionError
|
||||||
|
from carla.sensor import Camera
|
||||||
|
from carla.client import VehicleControl
|
||||||
|
except ImportError:
|
||||||
|
from logger import failed_imports
|
||||||
|
failed_imports.append("CARLA")
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import time
|
||||||
|
import logging
|
||||||
|
import subprocess
|
||||||
|
import signal
|
||||||
|
from environments.environment_wrapper import EnvironmentWrapper
|
||||||
|
from utils import *
|
||||||
|
from logger import screen, logger
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
|
||||||
|
# enum of the available levels and their path
|
||||||
|
class CarlaLevel(Enum):
|
||||||
|
TOWN1 = "/Game/Maps/Town01"
|
||||||
|
TOWN2 = "/Game/Maps/Town02"
|
||||||
|
|
||||||
|
key_map = {
|
||||||
|
'BRAKE': (274,), # down arrow
|
||||||
|
'GAS': (273,), # up arrow
|
||||||
|
'TURN_LEFT': (276,), # left arrow
|
||||||
|
'TURN_RIGHT': (275,), # right arrow
|
||||||
|
'GAS_AND_TURN_LEFT': (273, 276),
|
||||||
|
'GAS_AND_TURN_RIGHT': (273, 275),
|
||||||
|
'BRAKE_AND_TURN_LEFT': (274, 276),
|
||||||
|
'BRAKE_AND_TURN_RIGHT': (274, 275),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class CarlaEnvironmentWrapper(EnvironmentWrapper):
|
||||||
|
def __init__(self, tuning_parameters):
|
||||||
|
EnvironmentWrapper.__init__(self, tuning_parameters)
|
||||||
|
|
||||||
|
self.tp = tuning_parameters
|
||||||
|
|
||||||
|
# server configuration
|
||||||
|
self.server_height = self.tp.env.server_height
|
||||||
|
self.server_width = self.tp.env.server_width
|
||||||
|
self.port = get_open_port()
|
||||||
|
self.host = 'localhost'
|
||||||
|
self.map = CarlaLevel().get(self.tp.env.level)
|
||||||
|
|
||||||
|
# client configuration
|
||||||
|
self.verbose = self.tp.env.verbose
|
||||||
|
self.depth = self.tp.env.depth
|
||||||
|
self.stereo = self.tp.env.stereo
|
||||||
|
self.semantic_segmentation = self.tp.env.semantic_segmentation
|
||||||
|
self.height = self.server_height * (1 + int(self.depth) + int(self.semantic_segmentation))
|
||||||
|
self.width = self.server_width * (1 + int(self.stereo))
|
||||||
|
self.size = (self.width, self.height)
|
||||||
|
|
||||||
|
self.config = self.tp.env.config
|
||||||
|
if self.config:
|
||||||
|
# load settings from file
|
||||||
|
with open(self.config, 'r') as fp:
|
||||||
|
self.settings = fp.read()
|
||||||
|
else:
|
||||||
|
# hard coded settings
|
||||||
|
self.settings = CarlaSettings()
|
||||||
|
self.settings.set(
|
||||||
|
SynchronousMode=True,
|
||||||
|
SendNonPlayerAgentsInfo=False,
|
||||||
|
NumberOfVehicles=15,
|
||||||
|
NumberOfPedestrians=30,
|
||||||
|
WeatherId=1)
|
||||||
|
self.settings.randomize_seeds()
|
||||||
|
|
||||||
|
# add cameras
|
||||||
|
camera = Camera('CameraRGB')
|
||||||
|
camera.set_image_size(self.width, self.height)
|
||||||
|
camera.set_position(200, 0, 140)
|
||||||
|
camera.set_rotation(0, 0, 0)
|
||||||
|
self.settings.add_sensor(camera)
|
||||||
|
|
||||||
|
# open the server
|
||||||
|
self.server = self._open_server()
|
||||||
|
|
||||||
|
logging.disable(40)
|
||||||
|
|
||||||
|
# open the client
|
||||||
|
self.game = CarlaClient(self.host, self.port, timeout=99999999)
|
||||||
|
self.game.connect()
|
||||||
|
scene = self.game.load_settings(self.settings)
|
||||||
|
|
||||||
|
# get available start positions
|
||||||
|
positions = scene.player_start_spots
|
||||||
|
self.num_pos = len(positions)
|
||||||
|
self.iterator_start_positions = 0
|
||||||
|
|
||||||
|
# action space
|
||||||
|
self.discrete_controls = False
|
||||||
|
self.action_space_size = 2
|
||||||
|
self.action_space_high = [1, 1]
|
||||||
|
self.action_space_low = [-1, -1]
|
||||||
|
self.action_space_abs_range = np.maximum(np.abs(self.action_space_low), np.abs(self.action_space_high))
|
||||||
|
self.steering_strength = 0.5
|
||||||
|
self.gas_strength = 1.0
|
||||||
|
self.brake_strength = 0.5
|
||||||
|
self.actions = {0: [0., 0.],
|
||||||
|
1: [0., -self.steering_strength],
|
||||||
|
2: [0., self.steering_strength],
|
||||||
|
3: [self.gas_strength, 0.],
|
||||||
|
4: [-self.brake_strength, 0],
|
||||||
|
5: [self.gas_strength, -self.steering_strength],
|
||||||
|
6: [self.gas_strength, self.steering_strength],
|
||||||
|
7: [self.brake_strength, -self.steering_strength],
|
||||||
|
8: [self.brake_strength, self.steering_strength]}
|
||||||
|
self.actions_description = ['NO-OP', 'TURN_LEFT', 'TURN_RIGHT', 'GAS', 'BRAKE',
|
||||||
|
'GAS_AND_TURN_LEFT', 'GAS_AND_TURN_RIGHT',
|
||||||
|
'BRAKE_AND_TURN_LEFT', 'BRAKE_AND_TURN_RIGHT']
|
||||||
|
for idx, action in enumerate(self.actions_description):
|
||||||
|
for key in key_map.keys():
|
||||||
|
if action == key:
|
||||||
|
self.key_to_action[key_map[key]] = idx
|
||||||
|
self.num_speedup_steps = 30
|
||||||
|
|
||||||
|
# measurements
|
||||||
|
self.measurements_size = (1,)
|
||||||
|
self.autopilot = None
|
||||||
|
|
||||||
|
# env initialization
|
||||||
|
self.reset(True)
|
||||||
|
|
||||||
|
# render
|
||||||
|
if self.is_rendered:
|
||||||
|
image = self.get_rendered_image()
|
||||||
|
self.renderer.create_screen(image.shape[1], image.shape[0])
|
||||||
|
|
||||||
|
def _open_server(self):
|
||||||
|
log_path = path.join(logger.experiments_path, "CARLA_LOG_{}.txt".format(self.port))
|
||||||
|
with open(log_path, "wb") as out:
|
||||||
|
cmd = [path.join(environ.get('CARLA_ROOT'), 'CarlaUE4.sh'), self.map,
|
||||||
|
"-benchmark", "-carla-server", "-fps=10", "-world-port={}".format(self.port),
|
||||||
|
"-windowed -ResX={} -ResY={}".format(self.server_width, self.server_height),
|
||||||
|
"-carla-no-hud"]
|
||||||
|
if self.config:
|
||||||
|
cmd.append("-carla-settings={}".format(self.config))
|
||||||
|
p = subprocess.Popen(cmd, stdout=out, stderr=out)
|
||||||
|
|
||||||
|
return p
|
||||||
|
|
||||||
|
def _close_server(self):
|
||||||
|
os.killpg(os.getpgid(self.server.pid), signal.SIGKILL)
|
||||||
|
|
||||||
|
def _update_state(self):
|
||||||
|
# get measurements and observations
|
||||||
|
measurements = []
|
||||||
|
while type(measurements) == list:
|
||||||
|
measurements, sensor_data = self.game.read_data()
|
||||||
|
self.observation = sensor_data['CameraRGB'].data
|
||||||
|
|
||||||
|
self.location = (measurements.player_measurements.transform.location.x,
|
||||||
|
measurements.player_measurements.transform.location.y,
|
||||||
|
measurements.player_measurements.transform.location.z)
|
||||||
|
|
||||||
|
is_collision = measurements.player_measurements.collision_vehicles != 0 \
|
||||||
|
or measurements.player_measurements.collision_pedestrians != 0 \
|
||||||
|
or measurements.player_measurements.collision_other != 0
|
||||||
|
|
||||||
|
speed_reward = measurements.player_measurements.forward_speed - 1
|
||||||
|
if speed_reward > 30.:
|
||||||
|
speed_reward = 30.
|
||||||
|
self.reward = speed_reward \
|
||||||
|
- (measurements.player_measurements.intersection_otherlane * 5) \
|
||||||
|
- (measurements.player_measurements.intersection_offroad * 5) \
|
||||||
|
- is_collision * 100 \
|
||||||
|
- np.abs(self.control.steer) * 10
|
||||||
|
|
||||||
|
# update measurements
|
||||||
|
self.measurements = [measurements.player_measurements.forward_speed]
|
||||||
|
self.autopilot = measurements.player_measurements.autopilot_control
|
||||||
|
|
||||||
|
# action_p = ['%.2f' % member for member in [self.control.throttle, self.control.steer]]
|
||||||
|
# screen.success('REWARD: %.2f, ACTIONS: %s' % (self.reward, action_p))
|
||||||
|
|
||||||
|
if (measurements.game_timestamp >= self.tp.env.episode_max_time) or is_collision:
|
||||||
|
# screen.success('EPISODE IS DONE. GameTime: {}, Collision: {}'.format(str(measurements.game_timestamp),
|
||||||
|
# str(is_collision)))
|
||||||
|
self.done = True
|
||||||
|
|
||||||
|
def _take_action(self, action_idx):
|
||||||
|
if type(action_idx) == int:
|
||||||
|
action = self.actions[action_idx]
|
||||||
|
else:
|
||||||
|
action = action_idx
|
||||||
|
self.last_action_idx = action
|
||||||
|
|
||||||
|
self.control = VehicleControl()
|
||||||
|
self.control.throttle = np.clip(action[0], 0, 1)
|
||||||
|
self.control.steer = np.clip(action[1], -1, 1)
|
||||||
|
self.control.brake = np.abs(np.clip(action[0], -1, 0))
|
||||||
|
if not self.tp.env.allow_braking:
|
||||||
|
self.control.brake = 0
|
||||||
|
self.control.hand_brake = False
|
||||||
|
self.control.reverse = False
|
||||||
|
|
||||||
|
self.game.send_control(self.control)
|
||||||
|
|
||||||
|
def _restart_environment_episode(self, force_environment_reset=False):
|
||||||
|
self.iterator_start_positions += 1
|
||||||
|
if self.iterator_start_positions >= self.num_pos:
|
||||||
|
self.iterator_start_positions = 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
self.game.start_episode(self.iterator_start_positions)
|
||||||
|
except:
|
||||||
|
self.game.connect()
|
||||||
|
self.game.start_episode(self.iterator_start_positions)
|
||||||
|
|
||||||
|
# start the game with some initial speed
|
||||||
|
observation = None
|
||||||
|
for i in range(self.num_speedup_steps):
|
||||||
|
observation = self.step([1.0, 0])['observation']
|
||||||
|
self.observation = observation
|
||||||
|
|
||||||
|
return observation
|
||||||
|
|
||||||
@@ -25,6 +25,7 @@ import numpy as np
 from environments.environment_wrapper import EnvironmentWrapper
 from os import path, environ
 from utils import *
+from logger import *


 # enum of the available levels and their path
@@ -39,6 +40,43 @@ class DoomLevel(Enum):
     DEFEND_THE_LINE = "defend_the_line.cfg"
     DEADLY_CORRIDOR = "deadly_corridor.cfg"

+key_map = {
+    'NO-OP': 96,  # `
+    'ATTACK': 13,  # enter
+    'CROUCH': 306,  # ctrl
+    'DROP_SELECTED_ITEM': ord("t"),
+    'DROP_SELECTED_WEAPON': ord("t"),
+    'JUMP': 32,  # spacebar
+    'LAND': ord("l"),
+    'LOOK_DOWN': 274,  # down arrow
+    'LOOK_UP': 273,  # up arrow
+    'MOVE_BACKWARD': ord("s"),
+    'MOVE_DOWN': ord("s"),
+    'MOVE_FORWARD': ord("w"),
+    'MOVE_LEFT': 276,
+    'MOVE_RIGHT': 275,
+    'MOVE_UP': ord("w"),
+    'RELOAD': ord("r"),
+    'SELECT_NEXT_WEAPON': ord("q"),
+    'SELECT_PREV_WEAPON': ord("e"),
+    'SELECT_WEAPON0': ord("0"),
+    'SELECT_WEAPON1': ord("1"),
+    'SELECT_WEAPON2': ord("2"),
+    'SELECT_WEAPON3': ord("3"),
+    'SELECT_WEAPON4': ord("4"),
+    'SELECT_WEAPON5': ord("5"),
+    'SELECT_WEAPON6': ord("6"),
+    'SELECT_WEAPON7': ord("7"),
+    'SELECT_WEAPON8': ord("8"),
+    'SELECT_WEAPON9': ord("9"),
+    'SPEED': 304,  # shift
+    'STRAFE': 9,  # tab
+    'TURN180': ord("u"),
+    'TURN_LEFT': ord("a"),  # left arrow
+    'TURN_RIGHT': ord("d"),  # right arrow
+    'USE': ord("f"),
+}
+
+
 class DoomEnvironmentWrapper(EnvironmentWrapper):
     def __init__(self, tuning_parameters):
@@ -49,26 +87,42 @@ class DoomEnvironmentWrapper(EnvironmentWrapper):
         self.scenarios_dir = path.join(environ.get('VIZDOOM_ROOT'), 'scenarios')
         self.game = vizdoom.DoomGame()
         self.game.load_config(path.join(self.scenarios_dir, self.level))
-        self.game.set_window_visible(self.is_rendered)
+        self.game.set_window_visible(False)
         self.game.add_game_args("+vid_forcesurface 1")
-        if self.is_rendered:
+
+        self.wait_for_explicit_human_action = True
+        if self.human_control:
+            self.game.set_screen_resolution(vizdoom.ScreenResolution.RES_640X480)
+            self.renderer.create_screen(640, 480)
+        elif self.is_rendered:
             self.game.set_screen_resolution(vizdoom.ScreenResolution.RES_320X240)
+            self.renderer.create_screen(320, 240)
         else:
             # lower resolution since we actually take only 76x60 and we don't need to render
             self.game.set_screen_resolution(vizdoom.ScreenResolution.RES_160X120)
+
         self.game.set_render_hud(False)
         self.game.set_render_crosshair(False)
         self.game.set_render_decals(False)
         self.game.set_render_particles(False)
         self.game.init()

+        # action space
         self.action_space_abs_range = 0
         self.actions = {}
-        self.action_space_size = self.game.get_available_buttons_size()
-        for action_idx in range(self.action_space_size):
-            self.actions[action_idx] = [0] * self.action_space_size
-            self.actions[action_idx][action_idx] = 1
-        self.actions_description = [str(action) for action in self.game.get_available_buttons()]
+        self.action_space_size = self.game.get_available_buttons_size() + 1
+        self.action_vector_size = self.action_space_size - 1
+        self.actions[0] = [0] * self.action_vector_size
+        for action_idx in range(self.action_vector_size):
+            self.actions[action_idx + 1] = [0] * self.action_vector_size
+            self.actions[action_idx + 1][action_idx] = 1
+        self.actions_description = ['NO-OP']
+        self.actions_description += [str(action).split(".")[1] for action in self.game.get_available_buttons()]
+        for idx, action in enumerate(self.actions_description):
+            if action in key_map.keys():
+                self.key_to_action[(key_map[action],)] = idx
+
+        # measurement
         self.measurements_size = self.game.get_state().game_variables.shape

         self.width = self.game.get_screen_width()
@@ -77,27 +131,17 @@ class DoomEnvironmentWrapper(EnvironmentWrapper):
         self.game.set_seed(self.tp.seed)
         self.reset()

-    def _update_observation_and_measurements(self):
+    def _update_state(self):
         # extract all data from the current state
         state = self.game.get_state()
         if state is not None and state.screen_buffer is not None:
-            self.observation = self._preprocess_observation(state.screen_buffer)
+            self.observation = state.screen_buffer
             self.measurements = state.game_variables
+        self.reward = self.game.get_last_reward()
         self.done = self.game.is_episode_finished()

-    def step(self, action_idx):
-        self.reward = 0
-        for frame in range(self.tp.env.frame_skip):
-            self.reward += self.game.make_action(self._idx_to_action(action_idx))
-            self._update_observation_and_measurements()
-            if self.done:
-                break
-
-        return {'observation': self.observation,
-                'reward': self.reward,
-                'done': self.done,
-                'action': action_idx,
-                'measurements': self.measurements}
+    def _take_action(self, action_idx):
+        self.game.make_action(self._idx_to_action(action_idx), self.frame_skip)

     def _preprocess_observation(self, observation):
         if observation is None:
@@ -108,3 +152,5 @@ class DoomEnvironmentWrapper(EnvironmentWrapper):

     def _restart_environment_episode(self, force_environment_reset=False):
         self.game.new_episode()
+
+
@@ -17,6 +17,9 @@
 import numpy as np
 from utils import *
 from configurations import Preset
+from renderer import Renderer
+import operator
+import time


 class EnvironmentWrapper(object):
@@ -31,13 +34,19 @@ class EnvironmentWrapper(object):
         self.observation = []
         self.reward = 0
         self.done = False
+        self.default_action = 0
         self.last_action_idx = 0
+        self.episode_idx = 0
+        self.last_episode_time = time.time()
         self.measurements = []
+        self.info = []
         self.action_space_low = 0
         self.action_space_high = 0
         self.action_space_abs_range = 0
+        self.actions_description = {}
         self.discrete_controls = True
         self.action_space_size = 0
+        self.key_to_action = {}
         self.width = 1
         self.height = 1
         self.is_state_type_image = True
@@ -50,17 +59,11 @@ class EnvironmentWrapper(object):
         self.is_rendered = self.tp.visualization.render
         self.seed = self.tp.seed
         self.frame_skip = self.tp.env.frame_skip
-
-    def _update_observation_and_measurements(self):
-        # extract all the available measurments (ovservation, depthmap, lives, ammo etc.)
-        pass
-
-    def _restart_environment_episode(self, force_environment_reset=False):
-        """
-        :param force_environment_reset: Force the environment to reset even if the episode is not done yet.
-        :return:
-        """
-        pass
+        self.human_control = self.tp.env.human_control
+        self.wait_for_explicit_human_action = False
+        self.is_rendered = self.is_rendered or self.human_control
+        self.game_is_open = True
+        self.renderer = Renderer()

     def _idx_to_action(self, action_idx):
         """
@@ -71,13 +74,43 @@ class EnvironmentWrapper(object):
         """
         return self.actions[action_idx]

-    def _preprocess_observation(self, observation):
+    def _action_to_idx(self, action):
         """
-        Do initial observation preprocessing such as cropping, rgb2gray, rescale etc.
-        :param observation: a raw observation from the environment
-        :return: the preprocessed observation
+        Convert an environment action to one of the available actions of the wrapper.
+        For example, if the available actions are 4,5,6 then this function will map 4->0, 5->1, 6->2
+        :param action: the environment action
+        :return: an action index between 0 and self.action_space_size - 1, or -1 if the action does not exist
         """
-        pass
+        for key, val in self.actions.items():
+            if val == action:
+                return key
+        return -1
+
+    def get_action_from_user(self):
+        """
+        Get an action from the user keyboard
+        :return: action index
+        """
+        if self.wait_for_explicit_human_action:
+            while len(self.renderer.pressed_keys) == 0:
+                self.renderer.get_events()
+
+        if self.key_to_action == {}:
+            # the keys are the numbers on the keyboard corresponding to the action index
+            if len(self.renderer.pressed_keys) > 0:
+                action_idx = self.renderer.pressed_keys[0] - ord("1")
+                if 0 <= action_idx < self.action_space_size:
+                    return action_idx
+        else:
+            # the keys are mapped through the environment to more intuitive keyboard keys
+            # key = tuple(self.renderer.pressed_keys)
+            # for key in self.renderer.pressed_keys:
+            for env_keys in self.key_to_action.keys():
+                if set(env_keys) == set(self.renderer.pressed_keys):
+                    return self.key_to_action[env_keys]
+
+        # return the default action 0 so that the environment will continue running
+        return self.default_action
+
     def step(self, action_idx):
         """
@@ -85,13 +118,29 @@ class EnvironmentWrapper(object):
         :param action_idx: the action to perform on the environment
         :return: A dictionary containing the observation, reward, done flag, action and measurements
         """
-        pass
+        self.last_action_idx = action_idx
+
+        self._take_action(action_idx)
+
+        self._update_state()
+
+        if self.is_rendered:
+            self.render()
+
+        self.observation = self._preprocess_observation(self.observation)
+
+        return {'observation': self.observation,
+                'reward': self.reward,
+                'done': self.done,
+                'action': self.last_action_idx,
+                'measurements': self.measurements,
+                'info': self.info}

     def render(self):
         """
         Call the environment function for rendering to the screen
         """
-        pass
+        self.renderer.render_image(self.get_rendered_image())

     def reset(self, force_environment_reset=False):
         """
@@ -100,15 +149,25 @@ class EnvironmentWrapper(object):
         :return: A dictionary containing the observation, reward, done flag, action and measurements
         """
         self._restart_environment_episode(force_environment_reset)
+        self.last_episode_time = time.time()
         self.done = False
+        self.episode_idx += 1
         self.reward = 0.0
         self.last_action_idx = 0
-        self._update_observation_and_measurements()
+        self._update_state()
+
+        # render before the preprocessing of the observation, so that the image will be in its original quality
+        if self.is_rendered:
+            self.render()
+
+        self.observation = self._preprocess_observation(self.observation)
+
         return {'observation': self.observation,
                 'reward': self.reward,
                 'done': self.done,
                 'action': self.last_action_idx,
-                'measurements': self.measurements}
+                'measurements': self.measurements,
+                'info': self.info}

     def get_random_action(self):
         """
@@ -129,6 +188,58 @@ class EnvironmentWrapper(object):
         """
         self.phase = phase

+    def get_available_keys(self):
+        """
+        Return a list of tuples mapping between action names and the keyboard key that triggers them
+        :return: a list of tuples mapping between action names and the keyboard key that triggers them
+        """
+        available_keys = []
+        if self.key_to_action != {}:
+            for key, idx in sorted(self.key_to_action.items(), key=operator.itemgetter(1)):
+                if key != ():
+                    key_names = [self.renderer.get_key_names([k])[0] for k in key]
+                    available_keys.append((self.actions_description[idx], ' + '.join(key_names)))
+        elif self.discrete_controls:
+            for action in range(self.action_space_size):
+                available_keys.append(("Action {}".format(action + 1), action + 1))
+        return available_keys
+
+    # The following functions define the interaction with the environment.
+    # Any new environment that inherits the EnvironmentWrapper class should use these signatures.
+    # Some of these functions are optional - please read their description for more details.
+
+    def _take_action(self, action_idx):
+        """
+        An environment dependent function that sends an action to the simulator.
+        :param action_idx: the action to perform on the environment
+        :return: None
+        """
+        pass
+
+    def _preprocess_observation(self, observation):
+        """
+        Do initial observation preprocessing such as cropping, rgb2gray, rescale etc.
+        Implementing this function is optional.
+        :param observation: a raw observation from the environment
+        :return: the preprocessed observation
+        """
+        return observation
+
+    def _update_state(self):
+        """
+        Updates the state from the environment.
+        Should update self.observation, self.reward, self.done, self.measurements and self.info
+        :return: None
+        """
+        pass
+
+    def _restart_environment_episode(self, force_environment_reset=False):
+        """
+        :param force_environment_reset: Force the environment to reset even if the episode is not done yet.
+        :return:
+        """
+        pass
+
     def get_rendered_image(self):
         """
         Return a numpy array containing the image that will be rendered to the screen.
@@ -15,8 +15,10 @@
 #
 
 import sys
+from logger import *
 import gym
 import numpy as np
+import time
 try:
     import roboschool
     from OpenGL import GL
@@ -40,8 +42,6 @@ from gym import wrappers
 from utils import force_list, RunPhase
 from environments.environment_wrapper import EnvironmentWrapper
 
-i = 0
-
 
 class GymEnvironmentWrapper(EnvironmentWrapper):
     def __init__(self, tuning_parameters):
@@ -53,29 +53,30 @@ class GymEnvironmentWrapper(EnvironmentWrapper):
         self.env.seed(self.seed)
 
         # self.env_spec = gym.spec(self.env_id)
+        self.env.frameskip = self.frame_skip
         self.discrete_controls = type(self.env.action_space) != gym.spaces.box.Box
 
-        # pybullet requires rendering before resetting the environment, but other gym environments (Pendulum) will crash
-        try:
-            if self.is_rendered:
-                self.render()
-        except:
-            pass
-
-        o = self.reset(True)['observation']
+        self.observation = self.reset(True)['observation']
 
         # render
         if self.is_rendered:
-            self.render()
+            image = self.get_rendered_image()
+            scale = 1
+            if self.human_control:
+                scale = 2
+            self.renderer.create_screen(image.shape[1]*scale, image.shape[0]*scale)
 
-        self.is_state_type_image = len(o.shape) > 1
+        self.is_state_type_image = len(self.observation.shape) > 1
         if self.is_state_type_image:
-            self.width = o.shape[1]
-            self.height = o.shape[0]
+            self.width = self.observation.shape[1]
+            self.height = self.observation.shape[0]
         else:
-            self.width = o.shape[0]
+            self.width = self.observation.shape[0]
 
+        # action space
         self.actions_description = {}
+        if hasattr(self.env.unwrapped, 'get_action_meanings'):
+            self.actions_description = self.env.unwrapped.get_action_meanings()
         if self.discrete_controls:
             self.action_space_size = self.env.action_space.n
             self.action_space_abs_range = 0
@@ -85,34 +86,31 @@ class GymEnvironmentWrapper(EnvironmentWrapper):
             self.action_space_low = self.env.action_space.low
             self.action_space_abs_range = np.maximum(np.abs(self.action_space_low), np.abs(self.action_space_high))
         self.actions = {i: i for i in range(self.action_space_size)}
+        self.key_to_action = {}
+        if hasattr(self.env.unwrapped, 'get_keys_to_action'):
+            self.key_to_action = self.env.unwrapped.get_keys_to_action()
 
+        # measurements
         self.timestep_limit = self.env.spec.timestep_limit
-        self.current_ale_lives = 0
         self.measurements_size = len(self.step(0)['info'].keys())
 
-        # env intialization
-        self.observation = o
-        self.reward = 0
-        self.done = False
-        self.last_action = self.actions[0]
-
-    def render(self):
-        self.env.render()
-
-    def step(self, action_idx):
-
+    def _update_state(self):
+        if hasattr(self.env.env, 'ale'):
+            if self.phase == RunPhase.TRAIN and hasattr(self, 'current_ale_lives'):
+                # signal termination for life loss
+                if self.current_ale_lives != self.env.env.ale.lives():
+                    self.done = True
+            self.current_ale_lives = self.env.env.ale.lives()
+
+    def _take_action(self, action_idx):
         if action_idx is None:
             action_idx = self.last_action_idx
 
-        self.last_action_idx = action_idx
-
         if self.discrete_controls:
             action = self.actions[action_idx]
         else:
             action = action_idx
 
-        if hasattr(self.env.env, 'ale'):
-            prev_ale_lives = self.env.env.ale.lives()
-
         # pendulum-v0 for example expects a list
         if not self.discrete_controls:
             # catching cases where the action for continuous control is a number instead of a list the
@@ -128,42 +126,26 @@
 
         self.observation, self.reward, self.done, self.info = self.env.step(action)
 
-        if hasattr(self.env.env, 'ale') and self.phase == RunPhase.TRAIN:
-            # signal termination for breakout life loss
-            if prev_ale_lives != self.env.env.ale.lives():
-                self.done = True
-
+    def _preprocess_observation(self, observation):
         if any(env in self.env_id for env in ["Breakout", "Pong"]):
             # crop image
-            self.observation = self.observation[34:195, :, :]
-
-        if self.is_rendered:
-            self.render()
-
-        return {'observation': self.observation,
-                'reward': self.reward,
-                'done': self.done,
-                'action': self.last_action_idx,
-                'info': self.info}
+            observation = observation[34:195, :, :]
+        return observation
 
     def _restart_environment_episode(self, force_environment_reset=False):
         # prevent reset of environment if there are ale lives left
-        if "Breakout" in self.env_id and self.env.env.ale.lives() > 0 and not force_environment_reset:
+        if (hasattr(self.env.env, 'ale') and self.env.env.ale.lives() > 0) \
+                and not force_environment_reset and not self.env._past_limit():
             return self.observation
 
         if self.seed:
             self.env.seed(self.seed)
-        observation = self.env.reset()
-        while observation is None:
-            observation = self.step(0)['observation']
-
-        if "Breakout" in self.env_id:
-            # crop image
-            observation = observation[34:195, :, :]
-
-        self.observation = observation
-
-        return observation
+        self.observation = self.env.reset()
+        while self.observation is None:
+            self.step(0)
+
+        return self.observation
 
     def get_rendered_image(self):
         return self.env.render(mode='rgb_array')
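With this change, the per-game frame cropping now lives in `_preprocess_observation` and operates on the raw observation instead of on `self.observation`. A small numpy sketch of what the `[34:195, :, :]` slice does to a standard 210x160 ALE frame (the zero-filled array is just a stand-in for a real frame):

```python
import numpy as np

raw_frame = np.zeros((210, 160, 3), dtype=np.uint8)  # typical Atari (ALE) RGB frame
cropped = raw_frame[34:195, :, :]                    # keep rows 34..194 only
print(cropped.shape)                                 # -> (161, 160, 3)
```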
BIN (unnamed image updated: 31 KiB -> 35 KiB)
BIN img/ant.gif (removed; was 7.3 MiB)
BIN img/carla.gif (new file; 3.6 MiB)
BIN img/doom.gif (removed; was 4.7 MiB)
BIN img/doom_deathmatch.gif (new file; 3.0 MiB)
BIN img/minitaur.gif (removed; was 3.0 MiB)
BIN img/montezuma.gif (new file; 278 KiB)
install.sh (10 changed lines)
@@ -192,10 +192,14 @@ if [ ${INSTALL_NEON} -eq 1 ]; then
 
     # Neon
     sudo -E apt-get install libhdf5-dev libyaml-dev pkg-config clang virtualenv libcurl4-openssl-dev libopencv-dev libsox-dev -y
-    git clone https://github.com/NervanaSystems/neon.git
-    cd neon && make sysinstall -j
-    cd ..
+    pip3 install nervananeon
 fi
 
+if ! [ -x "$(command -v nvidia-smi)" ]; then
 # Intel Optimized TensorFlow
 pip3 install https://anaconda.org/intel/tensorflow/1.3.0/download/tensorflow-1.3.0-cp35-cp35m-linux_x86_64.whl
+else
+# GPU supported TensorFlow
+pip3 install tensorflow-gpu
+fi
logger.py
@@ -18,6 +18,7 @@ from pandas import *
 import os
 from pprint import pprint
 import threading
+from subprocess import Popen, PIPE
 import time
 import datetime
 from six.moves import input
@@ -61,7 +62,7 @@ class ScreenLogger(object):
         print("")
 
     def log(self, data):
-        print(self.name + ": " + data)
+        print(data)
 
     def log_dict(self, dict, prefix=""):
         str = "{}{}{} - ".format(Colors.PURPLE, prefix, Colors.END)
@@ -78,8 +79,10 @@ class ScreenLogger(object):
     def warning(self, text):
         print("{}{}{}".format(Colors.YELLOW, text, Colors.END))
 
-    def error(self, text):
+    def error(self, text, crash=True):
         print("{}{}{}".format(Colors.RED, text, Colors.END))
+        if crash:
+            exit(1)
 
     def ask_input(self, title):
         return input("{}{}{}".format(Colors.BG_CYAN, title, Colors.END))
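The new `crash` flag on `ScreenLogger.error` is what lets the test runner below report a failed preset without aborting the whole run. A small usage sketch, assuming the module-level `screen` instance that Coach's scripts import from `logger`:

```python
from logger import screen

screen.error("Failed due to a mismatch with the golden", crash=False)  # report and keep going
screen.error("Unknown preset name")  # default crash=True still exits with status 1
```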
@@ -74,7 +74,9 @@ class EpisodicExperienceReplay(Memory):
 
     def sample(self, size):
         assert self.num_transitions_in_complete_episodes() > size, \
-            'There are not enough transitions in the replay buffer'
+            'There are not enough transitions in the replay buffer. ' \
+            'Available transitions: {}. Requested transitions: {}.'\
+            .format(self.num_transitions_in_complete_episodes(), size)
         batch = []
         transitions_idx = np.random.randint(self.num_transitions_in_complete_episodes(), size=size)
         for i in transitions_idx:
@@ -73,6 +73,7 @@ class Episode(object):
         if n_step_return == -1 or n_step_return > self.length():
             n_step_return = self.length()
         rewards = np.array([t.reward for t in self.transitions])
+        rewards = rewards.astype('float')
         total_return = rewards.copy()
         current_discount = discount
         for i in range(1, n_step_return):
@@ -123,12 +124,30 @@ class Episode(object):
 
 
 class Transition(object):
-    def __init__(self, state, action, reward, next_state, game_over):
+    def __init__(self, state, action, reward=0, next_state=None, game_over=False):
+        """
+        A transition is a tuple containing the information of a single step of interaction
+        between the agent and the environment. The most basic version should contain the following values:
+        (current state, action, reward, next state, game over)
+        For imitation learning algorithms, if the reward, next state or game over is not known,
+        it is sufficient to store the current state and action taken by the expert.
+
+        :param state: The current state. Assumed to be a dictionary where the observation
+                      is located at state['observation']
+        :param action: The current action that was taken
+        :param reward: The reward received from the environment
+        :param next_state: The next state of the environment after applying the action.
+                           The next state should be similar to the state in its structure.
+        :param game_over: A boolean which should be True if the episode terminated after
+                          the execution of the action.
+        """
         self.state = copy.deepcopy(state)
         self.state['observation'] = np.array(self.state['observation'], copy=False)
         self.action = action
         self.reward = reward
         self.total_return = None
+        if not next_state:
+            next_state = state
         self.next_state = copy.deepcopy(next_state)
         self.next_state['observation'] = np.array(self.next_state['observation'], copy=False)
         self.game_over = game_over
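The new default arguments on `Transition` are what allow a recorded human demonstration to be stored when only the state and the expert's action are known, as the docstring above explains. A minimal sketch (the import path is an assumption; adjust it to wherever `Transition` lives in your checkout):

```python
import numpy as np
from memories.memory import Transition  # import path assumed

# an expert demonstration step for behavioral cloning: state and action only;
# reward defaults to 0, next_state falls back to the state itself, game_over to False
expert_state = {'observation': np.zeros((84, 84), dtype=np.float32)}
demo = Transition(state=expert_state, action=3)
```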
presets.py (103 changed lines)
@@ -38,6 +38,15 @@ def json_to_preset(json_path):
     if run_dict['exploration_policy_type'] is not None:
         tuning_parameters.exploration = eval(run_dict['exploration_policy_type'])()
 
+    # human control
+    if run_dict['play']:
+        tuning_parameters.agent.type = 'HumanAgent'
+        tuning_parameters.env.human_control = True
+        tuning_parameters.num_heatup_steps = 0
+
+    if run_dict['level']:
+        tuning_parameters.env.level = run_dict['level']
+
     if run_dict['custom_parameter'] is not None:
         unstripped_key_value_pairs = [pair.split('=') for pair in run_dict['custom_parameter'].split(';')]
         stripped_key_value_pairs = [tuple([pair[0].strip(), ast.literal_eval(pair[1].strip())]) for pair in
@@ -331,7 +340,7 @@ class CartPole_NStepQ(Preset):
         self.agent.num_steps_between_gradient_updates = 5
 
         self.test = True
-        self.test_max_step_threshold = 1000
+        self.test_max_step_threshold = 2000
         self.test_min_return_threshold = 150
         self.test_num_workers = 8
 
@@ -926,7 +935,7 @@ class CartPole_A3C(Preset):
         self.agent.middleware_type = MiddlewareTypes.FC
 
         self.test = True
-        self.test_max_step_threshold = 200
+        self.test_max_step_threshold = 1000
         self.test_min_return_threshold = 150
         self.test_num_workers = 8
 
@@ -1182,3 +1191,93 @@ class Breakout_A3C(Preset):
         self.agent.beta_entropy = 0.05
         self.clip_gradients = 40.0
         self.agent.middleware_type = MiddlewareTypes.FC
+
+
+class Carla_A3C(Preset):
+    def __init__(self):
+        Preset.__init__(self, ActorCritic, Carla, EntropyExploration)
+        self.agent.embedder_complexity = EmbedderComplexity.Deep
+        self.agent.policy_gradient_rescaler = 'GAE'
+        self.learning_rate = 0.0001
+        self.num_heatup_steps = 0
+        # self.env.reward_scaling = 1.0e9
+        self.agent.discount = 0.99
+        self.agent.apply_gradients_every_x_episodes = 1
+        self.agent.num_steps_between_gradient_updates = 30
+        self.agent.gae_lambda = 1
+        self.agent.beta_entropy = 0.01
+        self.clip_gradients = 40
+        self.agent.middleware_type = MiddlewareTypes.FC
+
+
+class Carla_DDPG(Preset):
+    def __init__(self):
+        Preset.__init__(self, DDPG, Carla, OUExploration)
+        self.agent.embedder_complexity = EmbedderComplexity.Deep
+        self.learning_rate = 0.0001
+        self.num_heatup_steps = 1000
+        self.agent.num_consecutive_training_steps = 5
+
+
+class Carla_BC(Preset):
+    def __init__(self):
+        Preset.__init__(self, BC, Carla, ExplorationParameters)
+        self.agent.embedder_complexity = EmbedderComplexity.Deep
+        self.agent.load_memory_from_file_path = 'datasets/carla_town1.p'
+        self.learning_rate = 0.0005
+        self.num_heatup_steps = 0
+        self.evaluation_episodes = 5
+        self.batch_size = 120
+        self.evaluate_every_x_training_iterations = 5000
+
+
+class Doom_Basic_BC(Preset):
+    def __init__(self):
+        Preset.__init__(self, BC, Doom, ExplorationParameters)
+        self.env.level = 'basic'
+        self.agent.load_memory_from_file_path = 'datasets/doom_basic.p'
+        self.learning_rate = 0.0005
+        self.num_heatup_steps = 0
+        self.evaluation_episodes = 5
+        self.batch_size = 120
+        self.evaluate_every_x_training_iterations = 100
+        self.num_training_iterations = 2000
+
+
+class Doom_Defend_BC(Preset):
+    def __init__(self):
+        Preset.__init__(self, BC, Doom, ExplorationParameters)
+        self.env.level = 'defend'
+        self.agent.load_memory_from_file_path = 'datasets/doom_defend.p'
+        self.learning_rate = 0.0005
+        self.num_heatup_steps = 0
+        self.evaluation_episodes = 5
+        self.batch_size = 120
+        self.evaluate_every_x_training_iterations = 100
+
+
+class Doom_Deathmatch_BC(Preset):
+    def __init__(self):
+        Preset.__init__(self, BC, Doom, ExplorationParameters)
+        self.env.level = 'deathmatch'
+        self.agent.load_memory_from_file_path = 'datasets/doom_deathmatch.p'
+        self.learning_rate = 0.0005
+        self.num_heatup_steps = 0
+        self.evaluation_episodes = 5
+        self.batch_size = 120
+        self.evaluate_every_x_training_iterations = 100
+
+
+class MontezumaRevenge_BC(Preset):
+    def __init__(self):
+        Preset.__init__(self, BC, Atari, ExplorationParameters)
+        self.env.level = 'MontezumaRevenge-v0'
+        self.agent.load_memory_from_file_path = 'datasets/montezuma_revenge.p'
+        self.learning_rate = 0.0005
+        self.num_heatup_steps = 0
+        self.evaluation_episodes = 5
+        self.batch_size = 120
+        self.evaluate_every_x_training_iterations = 100
+        self.exploration.evaluation_epsilon = 0.05
+        self.exploration.evaluation_policy = 'EGreedy'
+        self.env.frame_skip = 1
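Presets for other levels follow the same shape as the behavioral cloning presets added above. A hypothetical example, where the class name, level name and dataset path are placeholders rather than assets shipped with Coach:

```python
class Doom_HealthGathering_BC(Preset):
    def __init__(self):
        Preset.__init__(self, BC, Doom, ExplorationParameters)
        self.env.level = 'health_gathering'                                # placeholder level name
        self.agent.load_memory_from_file_path = 'datasets/my_recording.p'  # replay buffer saved from human play
        self.learning_rate = 0.0005
        self.num_heatup_steps = 0
        self.evaluation_episodes = 5
        self.batch_size = 120
        self.evaluate_every_x_training_iterations = 100
```

Such a preset would then be launched like any built-in one, e.g. `python3 coach.py -p Doom_HealthGathering_BC`.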
renderer.py (new file, 85 lines)

import pygame
from pygame.locals import *
import numpy as np


class Renderer(object):
    def __init__(self):
        self.size = (1, 1)
        self.screen = None
        self.clock = pygame.time.Clock()
        self.display = pygame.display
        self.fps = 30
        self.pressed_keys = []
        self.is_open = False

    def create_screen(self, width, height):
        """
        Creates a pygame window
        :param width: the width of the window
        :param height: the height of the window
        :return: None
        """
        self.size = (width, height)
        self.screen = self.display.set_mode(self.size, HWSURFACE | DOUBLEBUF)
        self.display.set_caption("Coach")
        self.is_open = True

    def normalize_image(self, image):
        """
        Normalize image values to be between 0 and 255
        :param image: 2D/3D array containing an image with arbitrary values
        :return: the input image with values rescaled to 0-255
        """
        image_min, image_max = image.min(), image.max()
        return 255.0 * (image - image_min) / (image_max - image_min)

    def render_image(self, image):
        """
        Render the given image to the pygame window
        :param image: a grayscale or color image in an arbitrary size. assumes that the channels are the last axis
        :return: None
        """
        if self.is_open:
            if len(image.shape) == 3:
                if image.shape[0] == 3 or image.shape[0] == 1:
                    image = np.transpose(image, (1, 2, 0))
            surface = pygame.surfarray.make_surface(image.swapaxes(0, 1))
            surface = pygame.transform.scale(surface, self.size)
            self.screen.blit(surface, (0, 0))
            self.display.flip()
            self.clock.tick()
            self.get_events()

    def get_events(self):
        """
        Get all the window events in the last tick and respond accordingly
        :return: None
        """
        for event in pygame.event.get():
            if event.type == pygame.KEYDOWN:
                self.pressed_keys.append(event.key)
                # esc pressed
                if event.key == pygame.K_ESCAPE:
                    self.close()
            elif event.type == pygame.KEYUP:
                if event.key in self.pressed_keys:
                    self.pressed_keys.remove(event.key)
            elif event.type == pygame.QUIT:
                self.close()

    def get_key_names(self, key_ids):
        """
        Get the key name for each key index in the list
        :param key_ids: a list of key id's
        :return: a list of key names corresponding to the key id's
        """
        return [pygame.key.name(key_id) for key_id in key_ids]

    def close(self):
        """
        Close the pygame window
        :return: None
        """
        self.is_open = False
        pygame.quit()
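A short standalone sketch of how the new renderer can be driven (illustrative only; inside Coach the environment wrapper owns the renderer, as the `create_screen` call in the Gym wrapper above suggests):

```python
import numpy as np
from renderer import Renderer

renderer = Renderer()
renderer.create_screen(320, 240)

# feed random RGB frames until the window is closed (ESC or the window's close button)
while renderer.is_open:
    frame = np.random.randint(0, 256, size=(240, 320, 3), dtype=np.uint8)
    renderer.render_image(frame)
```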
@@ -3,6 +3,7 @@ Pillow==4.3.0
 matplotlib==2.0.2
 numpy==1.13.0
 pandas==0.20.2
+pygame==1.9.3
 PyOpenGL==3.1.0
 scipy==0.19.0
 scikit-image==0.13.0
run_test.py (new file, 164 lines)

#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# -*- coding: utf-8 -*-
import presets
import numpy as np
import pandas as pd
from os import path
import os
import glob
import shutil
import sys
import time
from logger import screen
from utils import list_all_classes_in_module, threaded_cmd_line_run, killed_processes
from subprocess import Popen
import signal
import argparse


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--preset',
                        help="(string) Name of a preset to run (as configured in presets.py)",
                        default=None,
                        type=str)
    parser.add_argument('-itf', '--ignore_tensorflow',
                        help="(flag) Don't test TensorFlow presets.",
                        action='store_true')
    parser.add_argument('-in', '--ignore_neon',
                        help="(flag) Don't test neon presets.",
                        action='store_true')

    args = parser.parse_args()
    if args.preset is not None:
        presets_lists = [args.preset]
    else:
        presets_lists = list_all_classes_in_module(presets)
    win_size = 10
    fail_count = 0
    test_count = 0
    read_csv_tries = 70

    # create a clean experiment directory
    test_name = '__test'
    test_path = os.path.join('./experiments', test_name)
    if path.exists(test_path):
        shutil.rmtree(test_path)

    for idx, preset_name in enumerate(presets_lists):
        preset = eval('presets.{}()'.format(preset_name))
        if preset.test:
            frameworks = []
            if preset.agent.tensorflow_support and not args.ignore_tensorflow:
                frameworks.append('tensorflow')
            if preset.agent.neon_support and not args.ignore_neon:
                frameworks.append('neon')

            for framework in frameworks:
                test_count += 1

                # run the experiment in a separate thread
                screen.log_title("Running test {} - {}".format(preset_name, framework))
                cmd = 'CUDA_VISIBLE_DEVICES='' python3 coach.py -p {} -f {} -e {} -n {} -cp "seed=0" &> test_log_{}_{}.txt '\
                    .format(preset_name, framework, test_name, preset.test_num_workers, preset_name, framework)
                p = Popen(cmd, shell=True, executable="/bin/bash", preexec_fn=os.setsid)

                # get the csv with the results
                csv_path = None
                csv_paths = []

                if preset.test_num_workers > 1:
                    # we have an evaluator
                    reward_str = 'Evaluation Reward'
                    filename_pattern = 'evaluator*.csv'
                else:
                    reward_str = 'Training Reward'
                    filename_pattern = 'worker*.csv'

                initialization_error = False
                test_passed = False

                tries_counter = 0
                while not csv_paths:
                    csv_paths = glob.glob(path.join(test_path, '*', filename_pattern))
                    if tries_counter > read_csv_tries:
                        break
                    tries_counter += 1
                    time.sleep(1)

                if csv_paths:
                    csv_path = csv_paths[0]

                    # verify results
                    csv = None
                    time.sleep(1)
                    averaged_rewards = [0]

                    last_num_episodes = 0
                    while csv is None or csv['Episode #'].values[-1] < preset.test_max_step_threshold:
                        try:
                            csv = pd.read_csv(csv_path)
                        except:
                            # sometimes the csv is being written at the same time we are
                            # trying to read it. no problem -> try again
                            continue

                        if reward_str not in csv.keys():
                            continue

                        rewards = csv[reward_str].values
                        rewards = rewards[~np.isnan(rewards)]

                        if len(rewards) >= win_size:
                            averaged_rewards = np.convolve(rewards, np.ones(win_size) / win_size, mode='valid')
                        else:
                            time.sleep(1)
                            continue

                        # print progress
                        percentage = int((100*last_num_episodes)/preset.test_max_step_threshold)
                        sys.stdout.write("\rReward: ({}/{})".format(round(averaged_rewards[-1], 1), preset.test_min_return_threshold))
                        sys.stdout.write(' Episode: ({}/{})'.format(last_num_episodes, preset.test_max_step_threshold))
                        sys.stdout.write(' {}%|{}{}|  '.format(percentage, '#'*int(percentage/10), ' '*(10-int(percentage/10))))
                        sys.stdout.flush()

                        if csv['Episode #'].shape[0] - last_num_episodes <= 0:
                            continue

                        last_num_episodes = csv['Episode #'].values[-1]

                        # check if reward is enough
                        if np.any(averaged_rewards > preset.test_min_return_threshold):
                            test_passed = True
                            break
                        time.sleep(1)

                # kill test and print result
                os.killpg(os.getpgid(p.pid), signal.SIGTERM)
                if test_passed:
                    screen.success("Passed successfully")
                else:
                    screen.error("Failed due to a mismatch with the golden", crash=False)
                    fail_count += 1
                shutil.rmtree(test_path)

    screen.separator()
    if fail_count == 0:
        screen.success(" Summary: " + str(test_count) + "/" + str(test_count) + " tests passed successfully")
    else:
        screen.error(" Summary: " + str(test_count - fail_count) + "/" + str(test_count) + " tests passed successfully")
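Based on the flags defined above, the golden tests can be run over the full preset list or narrowed down: `python3 run_test.py -p CartPole_A3C` tests a single preset, while `-itf` and `-in` skip the TensorFlow or neon variants respectively.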
utils.py (63 changed lines)
@@ -20,6 +20,7 @@ import os
 import numpy as np
 import threading
 from subprocess import call, Popen
+import signal
 
 killed_processes = []
 
@@ -54,9 +55,9 @@ class Enum(object):
 
 
 class RunPhase(Enum):
-    HEATUP = 0
-    TRAIN = 1
-    TEST = 2
+    HEATUP = "Heatup"
+    TRAIN = "Training"
+    TEST = "Testing"
 
 
 def list_all_classes_in_module(module):
@@ -292,3 +293,59 @@ def get_open_port():
     s.close()
     return port
+
+
+class timeout:
+    def __init__(self, seconds=1, error_message='Timeout'):
+        self.seconds = seconds
+        self.error_message = error_message
+
+    def _handle_timeout(self, signum, frame):
+        raise TimeoutError(self.error_message)
+
+    def __enter__(self):
+        signal.signal(signal.SIGALRM, self._handle_timeout)
+        signal.alarm(self.seconds)
+
+    def __exit__(self, type, value, traceback):
+        signal.alarm(0)
+
+
+def switch_axes_order(observation, from_type='channels_first', to_type='channels_last'):
+    """
+    transpose an observation axes from channels_first to channels_last or vice versa
+    :param observation: a numpy array
+    :param from_type: can be 'channels_first' or 'channels_last'
+    :param to_type: can be 'channels_first' or 'channels_last'
+    :return: a new observation with the requested axes order
+    """
+    if from_type == to_type or len(observation.shape) == 1:
+        return observation
+    assert 2 <= len(observation.shape) <= 3, 'num axes of an observation must be 2 for a vector or 3 for an image'
+    assert type(observation) == np.ndarray, 'observation must be a numpy array'
+    if len(observation.shape) == 3:
+        if from_type == 'channels_first' and to_type == 'channels_last':
+            return np.transpose(observation, (1, 2, 0))
+        elif from_type == 'channels_last' and to_type == 'channels_first':
+            return np.transpose(observation, (2, 0, 1))
+    else:
+        return np.transpose(observation, (1, 0))
+
+
+def stack_observation(curr_stack, observation, stack_size):
+    """
+    Adds a new observation to an existing stack of observations from previous time-steps.
+    :param curr_stack: The current observations stack.
+    :param observation: The new observation
+    :param stack_size: The required stack size
+    :return: The updated observation stack
+    """
+    if curr_stack == []:
+        # starting an episode
+        curr_stack = np.vstack(np.expand_dims([observation] * stack_size, 0))
+        curr_stack = switch_axes_order(curr_stack, from_type='channels_first', to_type='channels_last')
+    else:
+        curr_stack = np.append(curr_stack, np.expand_dims(np.squeeze(observation), axis=-1), axis=-1)
+        curr_stack = np.delete(curr_stack, 0, -1)
+
+    return curr_stack
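The two new observation helpers are easiest to see with shapes. A quick sketch, assuming it runs from the Coach source tree so that `utils` resolves to the module shown above:

```python
import numpy as np
from utils import stack_observation, switch_axes_order

frame = np.zeros((84, 84))                          # one grayscale observation
stack = stack_observation([], frame, stack_size=4)  # first call fills the stack with copies
print(stack.shape)                                  # (84, 84, 4), channels last

stack = stack_observation(stack, np.ones((84, 84)), stack_size=4)
print(stack.shape)                                  # still (84, 84, 4); the oldest frame was dropped

chw = switch_axes_order(stack, from_type='channels_last', to_type='channels_first')
print(chw.shape)                                    # (4, 84, 84)
```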