mirror of https://github.com/gryf/coach.git synced 2025-12-17 11:10:20 +01:00

pre-release 0.10.0

Gal Novik
2018-08-13 17:11:34 +03:00
parent d44c329bb8
commit 19ca5c24b1
485 changed files with 33292 additions and 16770 deletions

MANIFEST.in (new file, 3 lines)

@@ -0,0 +1,3 @@
include *.txt
include rl_coach/environments/CarlaSettings.ini
include rl_coach/dashboard_components/spinner.css

README.md (212 lines changed)

@@ -1,10 +1,10 @@
 # Coach
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/NervanaSystems/coach/blob/master/LICENSE)
-[![Docs](https://readthedocs.org/projects/pip/badge/?version=latest&style=flat)](http://NervanaSystems.github.io/coach/)
+[![Docs](https://media.readthedocs.org/static/projects/badges/passing-flat.svg)](https://nervanasystems.github.io/coach/)
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1134898.svg)](https://doi.org/10.5281/zenodo.1134898)
-## Overview
+<p align="center"><img src="img/coach_logo.png" alt="Coach Logo" width="200"/></p>
 Coach is a python reinforcement learning research framework containing implementation of many state-of-the-art algorithms.
@@ -36,7 +36,6 @@ Contacting the Coach development team is also possible through the email [coach@
 * [Usage](#usage)
 + [Running Coach](#running-coach)
 + [Running Coach Dashboard (Visualization)](#running-coach-dashboard-visualization)
-+ [Parallelizing an Algorithm](#parallelizing-an-algorithm)
 * [Supported Environments](#supported-environments)
 * [Supported Algorithms](#supported-algorithms)
 * [Citation](#citation)
@@ -44,56 +43,69 @@ Contacting the Coach development team is also possible through the email [coach@
 ## Documentation
-Framework documentation, algorithm description and instructions on how to contribute a new agent/environment can be found [here](http://NervanaSystems.github.io/coach/).
+Framework documentation, algorithm description and instructions on how to contribute a new agent/environment can be found [here](https://nervanasystems.github.io/coach/).
 ## Installation
 Note: Coach has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.
-### Coach Installer
-Coach's installer will setup all the basics needed to get the user going with running Coach on top of [OpenAI Gym](https://github.com/openai/gym) environments. This can be done by running the following command and then following the on-screen printed instructions:
-```bash
-./install.sh
+For some information on installing on Ubuntu 17.10 with Python 3.6.3, please refer to the following issue: https://github.com/NervanaSystems/coach/issues/54
+In order to install coach, there are a few prerequisites required. This will setup all the basics needed to get the user going with running Coach on top of [OpenAI Gym](https://github.com/openai/gym) environments:
+```
+# General
+sudo -E apt-get install python3-pip cmake zlib1g-dev python3-tk python-opencv -y
+# Boost libraries
+sudo -E apt-get install libboost-all-dev -y
+# Scipy requirements
+sudo -E apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran -y
+# PyGame
+sudo -E apt-get install libsdl-dev libsdl-image1.2-dev libsdl-mixer1.2-dev libsdl-ttf2.0-dev
+libsmpeg-dev libportmidi-dev libavformat-dev libswscale-dev -y
+# Dashboard
+sudo -E apt-get install dpkg-dev build-essential python3.5-dev libjpeg-dev libtiff-dev libsdl1.2-dev libnotify-dev
+freeglut3 freeglut3-dev libsm-dev libgtk2.0-dev libgtk-3-dev libwebkitgtk-dev libgtk-3-dev libwebkitgtk-3.0-dev
+libgstreamer-plugins-base1.0-dev -y
+# Gym
+sudo -E apt-get install libav-tools libsdl2-dev swig cmake -y
 ```
-Coach creates a virtual environment and installs in it to avoid changes to the user's system.
-In order to activate and deactivate Coach's virtual environment:
-```bash
-source coach_env/bin/activate
+We recommend installing coach in a virtualenv:
+```
+sudo -E pip3 install virtualenv
+virtualenv -p python3 coach_env
+. coach_env/bin/activate
 ```
-```bash
-deactivate
+Finally, install coach using pip:
 ```
+pip3 install rl_coach
+```
+Or alternatively, for a development environment, install coach from the cloned repository:
+```
+cd coach
+pip3 install -e .
+```
+If a GPU is present, Coach's pip package will install tensorflow-gpu, by default. If a GPU is not present, an [Intel-Optimized TensorFlow](https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available), will be installed.
 In addition to OpenAI Gym, several other environments were tested and are supported. Please follow the instructions in the Supported Environments section below in order to install more environments.
-### TensorFlow GPU Support
-Coach's installer installs [Intel-Optimized TensorFlow](https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available), which does not support GPU, by default. In order to have Coach running with GPU, a GPU supported TensorFlow version must be installed. This can be done by overriding the TensorFlow version:
-```bash
-pip3 install tensorflow-gpu
-```
 ## Usage
 ### Running Coach
-Coach supports both TensorFlow and neon deep learning frameworks.
-Switching between TensorFlow and neon backends is possible by using the `-f` flag.
-Using TensorFlow (default): `-f tensorflow`
-Using neon: `-f neon`
-There are several available presets in presets.py.
+To allow reproducing results in Coach, we defined a mechanism called _preset_.
+There are several available presets under the `presets` directory.
 To list all the available presets use the `-l` flag.
 To run a preset, use:
@@ -103,39 +115,44 @@ python3 coach.py -r -p <preset_name>
 ```
 For example:
-1. CartPole environment using Policy Gradients:
+* CartPole environment using Policy Gradients (PG):
 ```bash
 python3 coach.py -r -p CartPole_PG
 ```
-2. Pendulum using Clipped PPO:
+* Basic level of Doom using Dueling network and Double DQN (DDQN) algorithm:
 ```bash
-python3 coach.py -r -p Pendulum_ClippedPPO -n 8
+python3 coach.py -r -p Doom_Basic_Dueling_DDQN
 ```
-3. MountainCar using A3C:
+Some presets apply to a group of environment levels, like the entire Atari or Mujoco suites for example.
+To use these presets, the requested level should be defined using the `-lvl` flag.
+For example:
+* Pong using the Neural Episodic Control (NEC) algorithm:
 ```bash
-python3 coach.py -r -p MountainCar_A3C -n 8
+python3 coach.py -r -p Atari_NEC -lvl pong
 ```
-4. Doom basic level using Dueling network and Double DQN algorithm:
+There are several types of agents that can benefit from running them in a distributed fashion with multiple workers in parallel. Each worker interacts with its own copy of the environment but updates a shared network, which improves the data collection speed and the stability of the learning process.
+To specify the number of workers to run, use the `-n` flag.
+For example:
+* Breakout using Asynchronous Advantage Actor-Critic (A3C) with 8 workers:
 ```bash
-python3 coach.py -r -p Doom_Basic_Dueling_DDQN
+python3 coach.py -r -p Atari_A3C -lvl breakout -n 8
 ```
-5. Doom health gathering level using Mixed Monte Carlo:
-```bash
-python3 coach.py -r -p Doom_Health_MMC
-```
 It is easy to create new presets for different levels or environments by following the same pattern as in presets.py
-More usage examples can be found [here](http://NervanaSystems.github.io/coach/usage/index.html).
+More usage examples can be found [here](https://nervanasystems.github.io/coach/usage/index.html).
 ### Running Coach Dashboard (Visualization)
 Training an agent to solve an environment can be tricky, at times.
@@ -152,36 +169,14 @@ python3 dashboard.py
-<img src="img/dashboard.png" alt="Coach Design" style="width: 800px;"/>
+<img src="img/dashboard.gif" alt="Coach Design" style="width: 800px;"/>
-### Parallelizing an Algorithm
-Since the introduction of [A3C](https://arxiv.org/abs/1602.01783) in 2016, many algorithms were shown to benefit from running multiple instances in parallel, on many CPU cores. So far, these algorithms include [A3C](https://arxiv.org/abs/1602.01783), [DDPG](https://arxiv.org/pdf/1704.03073.pdf), [PPO](https://arxiv.org/pdf/1707.06347.pdf), and [NAF](https://arxiv.org/pdf/1610.00633.pdf), and this is most probably only the beginning.
-Parallelizing an algorithm using Coach is straight-forward.
-The following method of NetworkWrapper parallelizes an algorithm seamlessly:
-```python
-network.train_and_sync_networks(current_states, targets)
-```
-Once a parallelized run is started, the ```train_and_sync_networks``` API will apply gradients from each local worker's network to the main global network, allowing for parallel training to take place.
-Then, it merely requires running Coach with the ``` -n``` flag and with the number of workers to run with. For instance, the following command will set 16 workers to work together to train a MuJoCo Hopper:
-```bash
-python3 coach.py -p Hopper_A3C -n 16
-```
 ## Supported Environments
 * *OpenAI Gym:*
-Installed by default by Coach's installer.
+Installed by default by Coach's installer. The version used by Coach is 0.10.5.
 * *ViZDoom:*
@@ -189,6 +184,7 @@ python3 coach.py -p Hopper_A3C -n 16
 https://github.com/mwydmuch/ViZDoom
+The version currently used by Coach is 1.1.4.
 Additionally, Coach assumes that the environment variable VIZDOOM_ROOT points to the ViZDoom installation directory.
 * *Roboschool:*
@@ -211,7 +207,7 @@ python3 coach.py -p Hopper_A3C -n 16
 * *CARLA:*
-Download release 0.7 from the CARLA repository -
+Download release 0.8.4 from the CARLA repository -
 https://github.com/carla-simulator/carla/releases
@@ -219,6 +215,22 @@ python3 coach.py -p Hopper_A3C -n 16
 A simple CARLA settings file (```CarlaSettings.ini```) is supplied with Coach, and is located in the ```environments``` directory.
+* *Starcraft:*
+Follow the instructions described in the PySC2 repository -
+https://github.com/deepmind/pysc2
+The version used by Coach is 2.0.1
+* *DeepMind Control Suite:*
+Follow the instructions described in the DeepMind Control Suite repository -
+https://github.com/deepmind/dm_control
+The version used by Coach is 0.0.0
 ## Supported Algorithms
@@ -227,25 +239,47 @@ python3 coach.py -p Hopper_A3C -n 16
+### Value Optimization Agents
-* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) ([code](agents/dqn_agent.py))
+* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) ([code](rl_coach/agents/dqn_agent.py))
-* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf) ([code](agents/ddqn_agent.py))
+* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf) ([code](rl_coach/agents/ddqn_agent.py))
 * [Dueling Q Network](https://arxiv.org/abs/1511.06581)
-* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310) ([code](agents/mmc_agent.py))
+* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310) ([code](rl_coach/agents/mmc_agent.py))
-* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860) ([code](agents/pal_agent.py))
+* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860) ([code](rl_coach/agents/pal_agent.py))
-* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887) ([code](agents/categorical_dqn_agent.py))
+* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887) ([code](rl_coach/agents/categorical_dqn_agent.py))
-* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf) ([code](agents/qr_dqn_agent.py))
+* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf) ([code](rl_coach/agents/qr_dqn_agent.py))
-* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621) ([code](agents/bootstrapped_dqn_agent.py))
+* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](rl_coach/agents/n_step_q_agent.py))
-* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/n_step_q_agent.py))
+* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988) ([code](rl_coach/agents/nec_agent.py))
-* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988) ([code](agents/nec_agent.py))
+* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed** ([code](rl_coach/agents/naf_agent.py))
-* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed** ([code](agents/naf_agent.py))
-* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed** ([code](agents/policy_gradients_agent.py))
+### Policy Optimization Agents
-* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/actor_critic_agent.py))
+* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed** ([code](rl_coach/agents/policy_gradients_agent.py))
-* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed** ([code](agents/ddpg_agent.py))
+* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](rl_coach/agents/actor_critic_agent.py))
-* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) ([code](agents/ppo_agent.py))
+* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed** ([code](rl_coach/agents/ddpg_agent.py))
-* [Clipped Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed** ([code](agents/clipped_ppo_agent.py))
+* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) ([code](rl_coach/agents/ppo_agent.py))
-* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed** ([code](agents/dfp_agent.py))
+* [Clipped Proximal Policy Optimization (CPPO)](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed** ([code](rl_coach/agents/clipped_ppo_agent.py))
-* Behavioral Cloning (BC) ([code](agents/bc_agent.py))
+* [Generalized Advantage Estimation (GAE)](https://arxiv.org/abs/1506.02438) ([code](rl_coach/agents/actor_critic_agent.py#L86))
+### General Agents
+* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed** ([code](rl_coach/agents/dfp_agent.py))
+### Imitation Learning Agents
+* Behavioral Cloning (BC) ([code](rl_coach/agents/bc_agent.py))
+### Hierarchical Reinforcement Learning Agents
+* [Hierarchical Actor Critic (HAC)](https://arxiv.org/abs/1712.00948.pdf) ([code](rl_coach/agents/ddpg_hac_agent.py))
+### Memory Types
+* [Hindsight Experience Replay (HER)](https://arxiv.org/abs/1707.01495.pdf) ([code](rl_coach/memories/episodic/episodic_hindsight_experience_replay.py))
+* [Prioritized Experience Replay (PER)](https://arxiv.org/abs/1511.05952) ([code](rl_coach/memories/non_episodic/prioritized_experience_replay.py))
+### Exploration Techniques
+* E-Greedy ([code](rl_coach/exploration_policies/e_greedy.py))
+* Boltzmann ([code](rl_coach/exploration_policies/boltzmann.py))
+* OrnsteinUhlenbeck process ([code](rl_coach/exploration_policies/ou_process.py))
+* Normal Noise ([code](rl_coach/exploration_policies/additive_noise.py))
+* Truncated Normal Noise ([code](rl_coach/exploration_policies/truncated_normal.py))
+* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621) ([code](rl_coach/agents/bootstrapped_dqn_agent.py))
+* [UCB Exploration via Q-Ensembles (UCB)](https://arxiv.org/abs/1706.01502) ([code](rl_coach/exploration_policies/ucb.py))
 ## Citation


@@ -1,38 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.actor_critic_agent import *
from agents.agent import *
from agents.bc_agent import *
from agents.bootstrapped_dqn_agent import *
from agents.clipped_ppo_agent import *
from agents.ddpg_agent import *
from agents.ddqn_agent import *
from agents.dfp_agent import *
from agents.dqn_agent import *
from agents.categorical_dqn_agent import *
from agents.human_agent import *
from agents.imitation_agent import *
from agents.mmc_agent import *
from agents.n_step_q_agent import *
from agents.naf_agent import *
from agents.nec_agent import *
from agents.pal_agent import *
from agents.policy_gradients_agent import *
from agents.policy_optimization_agent import *
from agents.ppo_agent import *
from agents.value_optimization_agent import *
from agents.qr_dqn_agent import *


@@ -1,146 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.policy_optimization_agent import *
from logger import *
from utils import *
import scipy.signal
# Actor Critic - https://arxiv.org/abs/1602.01783
class ActorCriticAgent(PolicyOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0, create_target_network = False):
PolicyOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id, create_target_network)
self.last_gradient_update_step_idx = 0
self.action_advantages = Signal('Advantages')
self.state_values = Signal('Values')
self.unclipped_grads = Signal('Grads (unclipped)')
self.value_loss = Signal('Value Loss')
self.policy_loss = Signal('Policy Loss')
self.signals.append(self.action_advantages)
self.signals.append(self.state_values)
self.signals.append(self.unclipped_grads)
self.signals.append(self.value_loss)
self.signals.append(self.policy_loss)
# Discounting function used to calculate discounted returns.
def discount(self, x, gamma):
return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]
def get_general_advantage_estimation_values(self, rewards, values):
# values contain n+1 elements (t ... t+n+1), rewards contain n elements (t ... t + n)
bootstrap_extended_rewards = np.array(rewards.tolist() + [values[-1]])
# Approximation based calculation of GAE (mathematically correct only when Tmax = inf,
# although in practice works even in much smaller Tmax values, e.g. 20)
deltas = rewards + self.tp.agent.discount * values[1:] - values[:-1]
gae = self.discount(deltas, self.tp.agent.discount * self.tp.agent.gae_lambda)
if self.tp.agent.estimate_value_using_gae:
discounted_returns = np.expand_dims(gae + values[:-1], -1)
else:
discounted_returns = np.expand_dims(np.array(self.discount(bootstrap_extended_rewards,
self.tp.agent.discount)), 1)[:-1]
return gae, discounted_returns
def learn_from_batch(self, batch):
# batch contains a list of episodes to learn from
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# get the values for the current states
result = self.main_network.online_network.predict(current_states)
current_state_values = result[0]
self.state_values.add_sample(current_state_values)
# the targets for the state value estimator
num_transitions = len(game_overs)
state_value_head_targets = np.zeros((num_transitions, 1))
# estimate the advantage function
action_advantages = np.zeros((num_transitions, 1))
if self.policy_gradient_rescaler == PolicyGradientRescaler.A_VALUE:
if game_overs[-1]:
R = 0
else:
R = self.main_network.online_network.predict(last_sample(next_states))[0]
for i in reversed(range(num_transitions)):
R = rewards[i] + self.tp.agent.discount * R
state_value_head_targets[i] = R
action_advantages[i] = R - current_state_values[i]
elif self.policy_gradient_rescaler == PolicyGradientRescaler.GAE:
# get bootstraps
bootstrapped_value = self.main_network.online_network.predict(last_sample(next_states))[0]
values = np.append(current_state_values, bootstrapped_value)
if game_overs[-1]:
values[-1] = 0
# get general discounted returns table
gae_values, state_value_head_targets = self.get_general_advantage_estimation_values(rewards, values)
action_advantages = np.vstack(gae_values)
else:
screen.warning("WARNING: The requested policy gradient rescaler is not available")
action_advantages = action_advantages.squeeze(axis=-1)
if not self.env.discrete_controls and len(actions.shape) < 2:
actions = np.expand_dims(actions, -1)
# train
result = self.main_network.online_network.accumulate_gradients({**current_states, 'output_1_0': actions},
[state_value_head_targets, action_advantages])
# logging
total_loss, losses, unclipped_grads = result[:3]
self.action_advantages.add_sample(action_advantages)
self.unclipped_grads.add_sample(unclipped_grads)
self.value_loss.add_sample(losses[0])
self.policy_loss.add_sample(losses[1])
return total_loss
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
# TODO: rename curr_state -> state
# convert to batch so we can run it through the network
curr_state = {
k: np.expand_dims(np.array(curr_state[k]), 0)
for k in curr_state.keys()
}
if self.env.discrete_controls:
# DISCRETE
state_value, action_probabilities = self.main_network.online_network.predict(curr_state)
action_probabilities = action_probabilities.squeeze()
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_probabilities)
else:
action = np.argmax(action_probabilities)
action_info = {"action_probability": action_probabilities[action], "state_value": state_value}
self.entropy.add_sample(-np.sum(action_probabilities * np.log(action_probabilities + eps)))
else:
# CONTINUOUS
state_value, action_values_mean, action_values_std = self.main_network.online_network.predict(curr_state)
action_values_mean = action_values_mean.squeeze()
action_values_std = action_values_std.squeeze()
if phase == RunPhase.TRAIN:
action = np.squeeze(np.random.randn(1, self.action_space_size) * action_values_std + action_values_mean)
else:
action = action_values_mean
action_info = {"action_probability": action, "state_value": state_value}
return action, action_info
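As an aside (not part of the original commit): the `discount()` helper above uses a `scipy.signal.lfilter` trick to compute backward discounted sums, and `get_general_advantage_estimation_values()` builds GAE on top of it. The following minimal, self-contained sketch (assuming only numpy and scipy; the reward values, value estimates and the lambda of 0.95 are made up for the example) illustrates the same computation on toy data.
```python
import numpy as np
import scipy.signal


def discount(x, gamma):
    # same filter trick as ActorCriticAgent.discount(): reversing the input turns the
    # IIR filter y[n] = x[n] + gamma * y[n-1] into the backward sum y_t = x_t + gamma * y_{t+1}
    return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]


gamma, lam = 0.9, 0.95                      # toy discount factor and GAE lambda
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3, 0.0])     # V(s_t)..V(s_{t+n}), last entry is the bootstrap

# sanity check against an explicit backward loop over discounted returns
expected, running = np.zeros_like(rewards), 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    expected[t] = running
assert np.allclose(discount(rewards, gamma), expected)

# GAE: one-step TD errors, then their (gamma * lambda)-discounted backward sum
deltas = rewards + gamma * values[1:] - values[:-1]
advantages = discount(deltas, gamma * lam)
print(advantages)
```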


@@ -1,580 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import scipy.ndimage
try:
import matplotlib.pyplot as plt
except:
from logger import failed_imports
failed_imports.append("matplotlib")
import copy
from renderer import Renderer
from configurations import Preset
from collections import deque
from utils import LazyStack
from collections import OrderedDict
from utils import RunPhase, Signal, is_empty, RunningStat
from architectures import *
from exploration_policies import *
from memories import *
from memories.memory import *
from logger import logger, screen
import random
import time
import os
import itertools
from architectures.tensorflow_components.shared_variables import SharedRunningStats
from six.moves import range
class Agent(object):
def __init__(self, env, tuning_parameters, replicated_device=None, task_id=0):
"""
:param env: An environment instance
:type env: EnvironmentWrapper
:param tuning_parameters: A Preset class instance with all the running parameters
:type tuning_parameters: Preset
:param replicated_device: A tensorflow device for distributed training (optional)
:type replicated_device: instancemethod
:param task_id: The current task id
:type task_id: int
"""
screen.log_title("Creating agent {}".format(task_id))
self.task_id = task_id
self.sess = tuning_parameters.sess
self.env = tuning_parameters.env_instance = env
self.imitation = False
# i/o dimensions
if not tuning_parameters.env.desired_observation_width or not tuning_parameters.env.desired_observation_height:
tuning_parameters.env.desired_observation_width = self.env.width
tuning_parameters.env.desired_observation_height = self.env.height
self.action_space_size = tuning_parameters.env.action_space_size = self.env.action_space_size
self.measurements_size = tuning_parameters.env.measurements_size = self.env.measurements_size
if tuning_parameters.agent.use_accumulated_reward_as_measurement:
self.measurements_size = tuning_parameters.env.measurements_size = (self.measurements_size[0] + 1,)
# modules
if tuning_parameters.agent.load_memory_from_file_path:
screen.log_title("Loading replay buffer from pickle. Pickle path: {}"
.format(tuning_parameters.agent.load_memory_from_file_path))
self.memory = read_pickle(tuning_parameters.agent.load_memory_from_file_path)
else:
self.memory = eval(tuning_parameters.memory + '(tuning_parameters)')
# self.architecture = eval(tuning_parameters.architecture)
self.has_global = replicated_device is not None
self.replicated_device = replicated_device
self.worker_device = "/job:worker/task:{}/cpu:0".format(task_id) if replicated_device is not None else "/gpu:0"
self.exploration_policy = eval(tuning_parameters.exploration.policy + '(tuning_parameters)')
self.evaluation_exploration_policy = eval(tuning_parameters.exploration.evaluation_policy
+ '(tuning_parameters)')
self.evaluation_exploration_policy.change_phase(RunPhase.TEST)
# initialize all internal variables
self.tp = tuning_parameters
self.in_heatup = False
self.total_reward_in_current_episode = 0
self.total_steps_counter = 0
self.running_reward = None
self.training_iteration = 0
self.current_episode = self.tp.current_episode = 0
self.curr_state = {}
self.current_episode_steps_counter = 0
self.episode_running_info = {}
self.last_episode_evaluation_ran = 0
self.running_observations = []
logger.set_current_time(self.current_episode)
self.main_network = None
self.networks = []
self.last_episode_images = []
self.renderer = Renderer()
# signals
self.signals = []
self.loss = Signal('Loss')
self.signals.append(self.loss)
self.curr_learning_rate = Signal('Learning Rate')
self.signals.append(self.curr_learning_rate)
if self.tp.env.normalize_observation and not self.env.is_state_type_image:
if not self.tp.distributed or not self.tp.agent.share_statistics_between_workers:
self.running_observation_stats = RunningStat((self.tp.env.desired_observation_width,))
self.running_reward_stats = RunningStat(())
if self.tp.checkpoint_restore_dir:
checkpoint_path = os.path.join(self.tp.checkpoint_restore_dir, "running_stats.p")
self.running_observation_stats = read_pickle(checkpoint_path)
else:
self.running_observation_stats = RunningStat((self.tp.env.desired_observation_width,))
self.running_reward_stats = RunningStat(())
else:
self.running_observation_stats = SharedRunningStats(self.tp, replicated_device,
shape=(self.tp.env.desired_observation_width,),
name='observation_stats')
self.running_reward_stats = SharedRunningStats(self.tp, replicated_device,
shape=(),
name='reward_stats')
# env is already reset at this point. Otherwise we're getting an error where you cannot
# reset an env which is not done
self.reset_game(do_not_reset_env=True)
# use seed
if self.tp.seed is not None:
random.seed(self.tp.seed)
np.random.seed(self.tp.seed)
def log_to_screen(self, phase):
# log to screen
if self.current_episode >= 0:
if phase == RunPhase.TRAIN:
exploration = self.exploration_policy.get_control_param()
else:
exploration = self.evaluation_exploration_policy.get_control_param()
screen.log_dict(
OrderedDict([
("Worker", self.task_id),
("Episode", self.current_episode),
("total reward", self.total_reward_in_current_episode),
("exploration", exploration),
("steps", self.total_steps_counter),
("training iteration", self.training_iteration)
]),
prefix=phase
)
def update_log(self, phase=RunPhase.TRAIN):
"""
Writes logging messages to screen and updates the log file with all the signal values.
:return: None
"""
# log all the signals to file
logger.set_current_time(self.current_episode)
logger.create_signal_value('Training Iter', self.training_iteration)
logger.create_signal_value('In Heatup', int(phase == RunPhase.HEATUP))
logger.create_signal_value('ER #Transitions', self.memory.num_transitions())
logger.create_signal_value('ER #Episodes', self.memory.length())
logger.create_signal_value('Episode Length', self.current_episode_steps_counter)
logger.create_signal_value('Total steps', self.total_steps_counter)
logger.create_signal_value("Epsilon", self.exploration_policy.get_control_param())
logger.create_signal_value("Training Reward", self.total_reward_in_current_episode
if phase == RunPhase.TRAIN else np.nan)
logger.create_signal_value('Evaluation Reward', self.total_reward_in_current_episode
if phase == RunPhase.TEST else np.nan)
logger.create_signal_value('Update Target Network', 0, overwrite=False)
logger.update_wall_clock_time(self.current_episode)
for signal in self.signals:
logger.create_signal_value("{}/Mean".format(signal.name), signal.get_mean())
logger.create_signal_value("{}/Stdev".format(signal.name), signal.get_stdev())
logger.create_signal_value("{}/Max".format(signal.name), signal.get_max())
logger.create_signal_value("{}/Min".format(signal.name), signal.get_min())
# dump
if self.current_episode % self.tp.visualization.dump_signals_to_csv_every_x_episodes == 0 \
and self.current_episode > 0:
logger.dump_output_csv()
def reset_game(self, do_not_reset_env=False):
"""
Resets all the episodic parameters and start a new environment episode.
:param do_not_reset_env: A boolean that allows prevention of environment reset
:return: None
"""
for signal in self.signals:
signal.reset()
self.total_reward_in_current_episode = 0
self.curr_state = {}
self.last_episode_images = []
self.current_episode_steps_counter = 0
self.episode_running_info = {}
if not do_not_reset_env:
self.env.reset()
self.exploration_policy.reset()
# required for online plotting
if self.tp.visualization.plot_action_values_online:
if hasattr(self, 'episode_running_info') and hasattr(self.env, 'actions_description'):
for action in self.env.actions_description:
self.episode_running_info[action] = []
plt.clf()
if self.tp.agent.middleware_type == MiddlewareTypes.LSTM:
for network in self.networks:
network.online_network.curr_rnn_c_in = network.online_network.middleware_embedder.c_init
network.online_network.curr_rnn_h_in = network.online_network.middleware_embedder.h_init
self.prepare_initial_state()
def preprocess_observation(self, observation):
"""
Preprocesses the given observation.
For images - convert to grayscale, resize and convert to int.
For measurements vectors - normalize by a running average and std.
:param observation: The agents observation
:return: A processed version of the observation
"""
if self.env.is_state_type_image:
# rescale
observation = scipy.misc.imresize(observation,
(self.tp.env.desired_observation_height,
self.tp.env.desired_observation_width),
interp=self.tp.rescaling_interpolation_type)
# rgb to y
if len(observation.shape) > 2 and observation.shape[2] > 1:
r, g, b = observation[:, :, 0], observation[:, :, 1], observation[:, :, 2]
observation = 0.2989 * r + 0.5870 * g + 0.1140 * b
# Render the processed observation which is how the agent will see it
# Warning: this cannot currently be done in parallel to rendering the environment
if self.tp.visualization.render_observation:
if not self.renderer.is_open:
self.renderer.create_screen(observation.shape[0], observation.shape[1])
self.renderer.render_image(observation)
return observation.astype('uint8')
else:
if self.tp.env.normalize_observation and self.sess is not None:
# standardize the input observation using a running mean and std
if not self.tp.distributed or not self.tp.agent.share_statistics_between_workers:
self.running_observation_stats.push(observation)
observation = (observation - self.running_observation_stats.mean) / \
(self.running_observation_stats.std + 1e-15)
observation = np.clip(observation, -5.0, 5.0)
return observation
def learn_from_batch(self, batch):
"""
Given a batch of transitions, calculates their target values and updates the network.
:param batch: A list of transitions
:return: The loss of the training
"""
pass
def train(self):
"""
A single training iteration. Sample a batch, train on it and update target networks.
:return: The training loss.
"""
batch = self.memory.sample(self.tp.batch_size)
loss = self.learn_from_batch(batch)
if self.tp.learning_rate_decay_rate != 0:
self.curr_learning_rate.add_sample(self.tp.sess.run(self.tp.learning_rate))
else:
self.curr_learning_rate.add_sample(self.tp.learning_rate)
# update the target network of every network that has a target network
if self.total_steps_counter % self.tp.agent.num_steps_between_copying_online_weights_to_target == 0:
for network in self.networks:
network.update_target_network(self.tp.agent.rate_for_copying_weights_to_target)
logger.create_signal_value('Update Target Network', 1)
else:
logger.create_signal_value('Update Target Network', 0, overwrite=False)
return loss
def extract_batch(self, batch):
"""
Extracts a single numpy array for each object in a batch of transitions (state, action, etc.)
:param batch: An array of transitions
:return: For each transition element, returns a numpy array of all the transitions in the batch
"""
current_states = {}
next_states = {}
current_states['observation'] = np.array([np.array(transition.state['observation']) for transition in batch])
next_states['observation'] = np.array([np.array(transition.next_state['observation']) for transition in batch])
actions = np.array([transition.action for transition in batch])
rewards = np.array([transition.reward for transition in batch])
game_overs = np.array([transition.game_over for transition in batch])
total_return = np.array([transition.total_return for transition in batch])
# get the entire state including measurements if available
if self.tp.agent.use_measurements:
current_states['measurements'] = np.array([transition.state['measurements'] for transition in batch])
next_states['measurements'] = np.array([transition.next_state['measurements'] for transition in batch])
return current_states, next_states, actions, rewards, game_overs, total_return
def plot_action_values_online(self):
"""
Plot an animated graph of the value of each possible action during the episode
:return: None
"""
plt.clf()
for key, data_list in self.episode_running_info.items():
plt.plot(data_list, label=key)
plt.legend()
plt.pause(0.00000001)
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
"""
choose an action to act with in the current episode being played. Different behavior might be exhibited when training
or testing.
:param curr_state: the current state to act upon.
:param phase: the current phase: training or testing.
:return: chosen action, some action value describing the action (q-value, probability, etc)
"""
pass
def preprocess_reward(self, reward):
if self.tp.env.reward_scaling:
reward /= float(self.tp.env.reward_scaling)
if self.tp.env.reward_clipping_max:
reward = min(reward, self.tp.env.reward_clipping_max)
if self.tp.env.reward_clipping_min:
reward = max(reward, self.tp.env.reward_clipping_min)
return reward
def tf_input_state(self, curr_state):
"""
convert curr_state into input tensors tensorflow is expecting.
"""
# add batch axis with length 1 onto each value
# extract values from the state based on agent.input_types
input_state = {}
for input_name in self.tp.agent.input_types.keys():
input_state[input_name] = np.expand_dims(np.array(curr_state[input_name]), 0)
return input_state
def prepare_initial_state(self):
"""
Create an initial state when starting a new episode
:return: None
"""
observation = self.preprocess_observation(self.env.state['observation'])
self.curr_stack = deque([observation]*self.tp.env.observation_stack_size, maxlen=self.tp.env.observation_stack_size)
observation = LazyStack(self.curr_stack, -1)
self.curr_state = {
'observation': observation
}
if self.tp.agent.use_measurements:
if 'measurements' in self.env.state.keys():
self.curr_state['measurements'] = self.env.state['measurements']
else:
self.curr_state['measurements'] = np.zeros(0)
if self.tp.agent.use_accumulated_reward_as_measurement:
self.curr_state['measurements'] = np.append(self.curr_state['measurements'], 0)
def act(self, phase=RunPhase.TRAIN):
"""
Take one step in the environment according to the network prediction and store the transition in memory
:param phase: Either Train or Test to specify if greedy actions should be used and if transitions should be stored
:return: A boolean value that signals an episode termination
"""
if phase != RunPhase.TEST:
self.total_steps_counter += 1
self.current_episode_steps_counter += 1
# get new action
action_info = {"action_probability": 1.0 / self.env.action_space_size, "action_value": 0, "max_action_value": 0}
if phase == RunPhase.HEATUP and not self.tp.heatup_using_network_decisions:
action = self.env.get_random_action()
else:
action, action_info = self.choose_action(self.curr_state, phase=phase)
# perform action
if type(action) == np.ndarray:
action = action.squeeze()
result = self.env.step(action)
shaped_reward = self.preprocess_reward(result['reward'])
if 'action_intrinsic_reward' in action_info.keys():
shaped_reward += action_info['action_intrinsic_reward']
# TODO: should total_reward_in_current_episode include shaped_reward?
self.total_reward_in_current_episode += result['reward']
next_state = copy.copy(result['state'])
next_state['observation'] = self.preprocess_observation(next_state['observation'])
# plot action values online
if self.tp.visualization.plot_action_values_online and phase != RunPhase.HEATUP:
self.plot_action_values_online()
# initialize the next state
# TODO: provide option to stack more than just the observation
self.curr_stack.append(next_state['observation'])
observation = LazyStack(self.curr_stack, -1)
next_state['observation'] = observation
if self.tp.agent.use_measurements:
if 'measurements' in result['state'].keys():
next_state['measurements'] = result['state']['measurements']
else:
next_state['measurements'] = np.zeros(0)
if self.tp.agent.use_accumulated_reward_as_measurement:
next_state['measurements'] = np.append(next_state['measurements'], self.total_reward_in_current_episode)
# store the transition only if we are training
if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
transition = Transition(self.curr_state, result['action'], shaped_reward, next_state, result['done'])
for key in action_info.keys():
transition.info[key] = action_info[key]
if self.tp.agent.add_a_normalized_timestep_to_the_observation:
transition.info['timestep'] = float(self.current_episode_steps_counter) / self.env.timestep_limit
self.memory.store(transition)
elif phase == RunPhase.TEST and self.tp.visualization.dump_gifs:
# we store the transitions only for saving gifs
self.last_episode_images.append(self.env.get_rendered_image())
# update the current state for the next step
self.curr_state = next_state
# deal with episode termination
if result['done']:
if self.tp.visualization.dump_csv:
self.update_log(phase=phase)
self.log_to_screen(phase=phase)
if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
self.reset_game()
self.current_episode += 1
self.tp.current_episode = self.current_episode
# return episode really ended
return result['done']
def evaluate(self, num_episodes, keep_networks_synced=False):
"""
Run in an evaluation mode for several episodes. Actions will be chosen greedily.
:param keep_networks_synced: keep the online network in sync with the global network after every episode
:param num_episodes: The number of episodes to evaluate on
:return: None
"""
max_reward_achieved = -float('inf')
average_evaluation_reward = 0
screen.log_title("Running evaluation")
self.env.change_phase(RunPhase.TEST)
for i in range(num_episodes):
# keep the online network in sync with the global network
if keep_networks_synced:
for network in self.networks:
network.sync()
episode_ended = False
while not episode_ended:
episode_ended = self.act(phase=RunPhase.TEST)
if keep_networks_synced \
and self.total_steps_counter % self.tp.agent.update_evaluation_agent_network_after_every_num_steps:
for network in self.networks:
network.sync()
if self.total_reward_in_current_episode > max_reward_achieved:
max_reward_achieved = self.total_reward_in_current_episode
frame_skipping = int(5/self.tp.env.frame_skip)
if self.tp.visualization.dump_gifs:
logger.create_gif(self.last_episode_images[::frame_skipping],
name='score-{}'.format(max_reward_achieved), fps=10)
average_evaluation_reward += self.total_reward_in_current_episode
self.reset_game()
average_evaluation_reward /= float(num_episodes)
self.env.change_phase(RunPhase.TRAIN)
screen.log_title("Evaluation done. Average reward = {}.".format(average_evaluation_reward))
def post_training_commands(self):
pass
def improve(self):
"""
Training algorithms wrapper. Heatup >> [ Evaluate >> Play >> Train >> Save checkpoint ]
:return: None
"""
# synchronize the online network weights with the global network
for network in self.networks:
network.sync()
# heatup phase
if self.tp.num_heatup_steps != 0:
self.in_heatup = True
screen.log_title("Starting heatup {}".format(self.task_id))
num_steps_required_for_one_training_batch = self.tp.batch_size * self.tp.env.observation_stack_size
for step in range(max(self.tp.num_heatup_steps, num_steps_required_for_one_training_batch)):
self.act(phase=RunPhase.HEATUP)
# training phase
self.in_heatup = False
screen.log_title("Starting training {}".format(self.task_id))
self.exploration_policy.change_phase(RunPhase.TRAIN)
training_start_time = time.time()
model_snapshots_periods_passed = -1
self.reset_game()
while self.training_iteration < self.tp.num_training_iterations:
# evaluate
evaluate_agent = (self.last_episode_evaluation_ran is not self.current_episode) and \
(self.current_episode % self.tp.evaluate_every_x_episodes == 0)
evaluate_agent = evaluate_agent or \
(self.imitation and self.training_iteration > 0 and
self.training_iteration % self.tp.evaluate_every_x_training_iterations == 0)
if evaluate_agent:
self.env.reset(force_environment_reset=True)
self.last_episode_evaluation_ran = self.current_episode
self.evaluate(self.tp.evaluation_episodes)
# snapshot model
if self.tp.save_model_sec and self.tp.save_model_sec > 0 and not self.tp.distributed:
total_training_time = time.time() - training_start_time
current_snapshot_period = (int(total_training_time) // self.tp.save_model_sec)
if current_snapshot_period > model_snapshots_periods_passed:
model_snapshots_periods_passed = current_snapshot_period
self.save_model(model_snapshots_periods_passed)
if hasattr(self, 'running_observation_state') and self.running_observation_stats is not None:
to_pickle(self.running_observation_stats,
os.path.join(self.tp.save_model_dir,
"running_stats.p".format(model_snapshots_periods_passed)))
# play and record in replay buffer
if self.tp.agent.collect_new_data:
if self.tp.agent.step_until_collecting_full_episodes:
step = 0
while step < self.tp.agent.num_consecutive_playing_steps or self.memory.get_episode(-1).length() != 0:
self.act()
step += 1
else:
for step in range(self.tp.agent.num_consecutive_playing_steps):
self.act()
# train
if self.tp.train:
for step in range(self.tp.agent.num_consecutive_training_steps):
loss = self.train()
self.loss.add_sample(loss)
self.training_iteration += 1
if self.imitation:
self.log_to_screen(RunPhase.TRAIN)
self.post_training_commands()
def save_model(self, model_id):
self.main_network.save_model(model_id)
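A side note on the observation stacking used by `prepare_initial_state()` and `act()` above: the current frame stack is a fixed-length `deque`, seeded with copies of the first preprocessed frame and then fed one new frame per step. The sketch below is not from the commit (the frame shape and stack size are arbitrary) and uses `np.stack` as an eager stand-in for Coach's `LazyStack`.
```python
from collections import deque

import numpy as np

stack_size = 4
frame = np.zeros((84, 84), dtype=np.uint8)             # first preprocessed observation

# episode start: replicate the first frame, as in prepare_initial_state()
stack = deque([frame] * stack_size, maxlen=stack_size)

# each step appends the newest frame; maxlen silently drops the oldest, as in act()
for step in range(3):
    stack.append(np.full((84, 84), step + 1, dtype=np.uint8))

# the network input stacks the frames along the last axis (LazyStack defers this)
observation = np.stack(stack, axis=-1)
print(observation.shape)  # (84, 84, 4)
```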


@@ -1,39 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import numpy as np
from agents.imitation_agent import ImitationAgent
# Behavioral Cloning Agent
class BCAgent(ImitationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ImitationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
def learn_from_batch(self, batch):
current_states, _, actions, _, _, _ = self.extract_batch(batch)
# the targets for the network are the actions since this is supervised learning
if self.env.discrete_controls:
targets = np.eye(self.env.action_space_size)[[actions]]
else:
targets = actions
result = self.main_network.train_and_sync_networks(current_states, targets)
total_loss = result[0]
return total_loss
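For discrete controls, `learn_from_batch()` above turns the demonstrated actions into one-hot supervised targets via `np.eye`. A tiny standalone illustration (toy action indices, not taken from the commit):
```python
import numpy as np

num_actions = 4
actions = np.array([0, 2, 3, 1])        # demonstrated actions from the replay buffer

targets = np.eye(num_actions)[actions]  # one row per transition, 1.0 at the taken action
print(targets)
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]]
```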


@@ -1,58 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
# Bootstrapped DQN - https://arxiv.org/pdf/1602.04621.pdf
class BootstrappedDQNAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
def reset_game(self, do_not_reset_env=False):
ValueOptimizationAgent.reset_game(self, do_not_reset_env)
self.exploration_policy.select_head()
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# for the action we actually took, the error is:
# TD error = r + discount*max(q_st_plus_1) - q_st
# for all other actions, the error is 0
q_st_plus_1 = self.main_network.target_network.predict(next_states)
# initialize with the current prediction so that we will
TD_targets = self.main_network.online_network.predict(current_states)
# only update the action that we have actually done in this transition
for i in range(self.tp.batch_size):
mask = batch[i].info['mask']
for head_idx in range(self.tp.exploration.architecture_num_q_heads):
if mask[head_idx] == 1:
TD_targets[head_idx][i, actions[i]] = rewards[i] + \
(1.0 - game_overs[i]) * self.tp.agent.discount * np.max(
q_st_plus_1[head_idx][i], 0)
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
def act(self, phase=RunPhase.TRAIN):
ValueOptimizationAgent.act(self, phase)
mask = np.random.binomial(1, self.tp.exploration.bootstrapped_data_sharing_probability,
self.tp.exploration.architecture_num_q_heads)
self.memory.update_last_transition_info({'mask': mask})
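The bootstrapping in `act()` above comes from the per-transition head mask: each transition is shared with each Q head independently with some probability, so every head ends up training on its own bootstrap sample of the replay buffer. A minimal sketch of that masking (the head count and probability are illustrative, not Coach defaults):
```python
import numpy as np

num_q_heads = 10              # corresponds to architecture_num_q_heads
sharing_probability = 0.5     # corresponds to bootstrapped_data_sharing_probability

# one binary mask per stored transition: head i trains on it only if mask[i] == 1
mask = np.random.binomial(1, sharing_probability, num_q_heads)
print(mask)                   # e.g. [1 0 1 1 0 0 1 0 1 1]
```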


@@ -1,60 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
# Categorical Deep Q Network - https://arxiv.org/pdf/1707.06887.pdf
class CategoricalDQNAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.z_values = np.linspace(self.tp.agent.v_min, self.tp.agent.v_max, self.tp.agent.atoms)
# prediction's format is (batch,actions,atoms)
def get_q_values(self, prediction):
return np.dot(prediction, self.z_values)
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# for the action we actually took, the error is calculated by the atoms distribution
# for all other actions, the error is 0
distributed_q_st_plus_1 = self.main_network.target_network.predict(next_states)
# initialize with the current prediction so that we will
TD_targets = self.main_network.online_network.predict(current_states)
# only update the action that we have actually done in this transition
target_actions = np.argmax(self.get_q_values(distributed_q_st_plus_1), axis=1)
m = np.zeros((self.tp.batch_size, self.z_values.size))
batches = np.arange(self.tp.batch_size)
for j in range(self.z_values.size):
tzj = np.fmax(np.fmin(rewards + (1.0 - game_overs) * self.tp.agent.discount * self.z_values[j],
self.z_values[self.z_values.size - 1]),
self.z_values[0])
bj = (tzj - self.z_values[0])/(self.z_values[1] - self.z_values[0])
u = (np.ceil(bj)).astype(int)
l = (np.floor(bj)).astype(int)
m[batches, l] = m[batches, l] + (distributed_q_st_plus_1[batches, target_actions, j] * (u - bj))
m[batches, u] = m[batches, u] + (distributed_q_st_plus_1[batches, target_actions, j] * (bj - l))
# total_loss = cross entropy between actual result above and predicted result for the given action
TD_targets[batches, actions] = m
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
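The inner loop of `learn_from_batch()` above is the C51 distributional projection: the shifted support `r + gamma * z` is clipped to `[v_min, v_max]` and its probability mass is split between the two nearest atoms. The standalone sketch below (not part of the commit) reproduces that projection for a single transition, using a uniform distribution as a stand-in for the target network output; all numbers are illustrative.
```python
import numpy as np

v_min, v_max, atoms = -10.0, 10.0, 51
z = np.linspace(v_min, v_max, atoms)          # fixed support of the return distribution
gamma, reward, terminal = 0.99, 1.0, 0.0

next_probs = np.full(atoms, 1.0 / atoms)      # stand-in for the target network's softmax

# project the shifted support back onto the fixed atoms, splitting mass between
# the floor and ceil neighbours, mirroring the loop over j in learn_from_batch()
tz = np.clip(reward + (1.0 - terminal) * gamma * z, v_min, v_max)
b = (tz - v_min) / (z[1] - z[0])
l, u = np.floor(b).astype(int), np.ceil(b).astype(int)

m = np.zeros(atoms)
for j in range(atoms):
    m[l[j]] += next_probs[j] * (u[j] - b[j])
    m[u[j]] += next_probs[j] * (b[j] - l[j])

print(m.round(3))                             # cross-entropy target for the taken action
```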


@@ -1,212 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.actor_critic_agent import *
from random import shuffle
# Clipped Proximal Policy Optimization - https://arxiv.org/abs/1707.06347
class ClippedPPOAgent(ActorCriticAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ActorCriticAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id,
create_target_network=True)
# signals definition
self.value_loss = Signal('Value Loss')
self.signals.append(self.value_loss)
self.policy_loss = Signal('Policy Loss')
self.signals.append(self.policy_loss)
self.total_kl_divergence_during_training_process = 0.0
self.unclipped_grads = Signal('Grads (unclipped)')
self.signals.append(self.unclipped_grads)
self.value_targets = Signal('Value Targets')
self.signals.append(self.value_targets)
self.kl_divergence = Signal('KL Divergence')
self.signals.append(self.kl_divergence)
def fill_advantages(self, batch):
current_states, next_states, actions, rewards, game_overs, total_return = self.extract_batch(batch)
current_state_values = self.main_network.online_network.predict(current_states)[0]
current_state_values = current_state_values.squeeze()
self.state_values.add_sample(current_state_values)
# calculate advantages
advantages = []
value_targets = []
if self.policy_gradient_rescaler == PolicyGradientRescaler.A_VALUE:
advantages = total_return - current_state_values
elif self.policy_gradient_rescaler == PolicyGradientRescaler.GAE:
# get bootstraps
episode_start_idx = 0
advantages = np.array([])
value_targets = np.array([])
for idx, game_over in enumerate(game_overs):
if game_over:
# get advantages for the rollout
value_bootstrapping = np.zeros((1,))
rollout_state_values = np.append(current_state_values[episode_start_idx:idx+1], value_bootstrapping)
rollout_advantages, gae_based_value_targets = \
self.get_general_advantage_estimation_values(rewards[episode_start_idx:idx+1],
rollout_state_values)
episode_start_idx = idx + 1
advantages = np.append(advantages, rollout_advantages)
value_targets = np.append(value_targets, gae_based_value_targets)
else:
screen.warning("WARNING: The requested policy gradient rescaler is not available")
# standardize
advantages = (advantages - np.mean(advantages)) / (np.std(advantages) + 1e-8)
for transition, advantage, value_target in zip(batch, advantages, value_targets):
transition.info['advantage'] = advantage
transition.info['gae_based_value_target'] = value_target
self.action_advantages.add_sample(advantages)
def train_network(self, dataset, epochs):
loss = []
for j in range(epochs):
loss = {
'total_loss': [],
'policy_losses': [],
'unclipped_grads': [],
'fetch_result': []
}
shuffle(dataset)
for i in range(int(len(dataset) / self.tp.batch_size)):
batch = dataset[i * self.tp.batch_size:(i + 1) * self.tp.batch_size]
current_states, _, actions, _, _, total_return = self.extract_batch(batch)
advantages = np.array([t.info['advantage'] for t in batch])
gae_based_value_targets = np.array([t.info['gae_based_value_target'] for t in batch])
if not self.tp.env_instance.discrete_controls and len(actions.shape) == 1:
actions = np.expand_dims(actions, -1)
# get old policy probabilities and distribution
result = self.main_network.target_network.predict(current_states)
old_policy_distribution = result[1:]
# calculate gradients and apply on both the local policy network and on the global policy network
fetches = [self.main_network.online_network.output_heads[1].kl_divergence,
self.main_network.online_network.output_heads[1].entropy]
total_return = np.expand_dims(total_return, -1)
value_targets = gae_based_value_targets if self.tp.agent.estimate_value_using_gae else total_return
inputs = copy.copy(current_states)
# TODO: why is this output 0 and not output 1?
inputs['output_0_0'] = actions
# TODO: does old_policy_distribution really need to be represented as a list?
# A: yes it does, in the event of discrete controls, it has just a mean
# otherwise, it has both a mean and standard deviation
for input_index, input in enumerate(old_policy_distribution):
inputs['output_0_{}'.format(input_index + 1)] = input
total_loss, policy_losses, unclipped_grads, fetch_result =\
self.main_network.online_network.accumulate_gradients(
inputs, [total_return, advantages], additional_fetches=fetches)
self.value_targets.add_sample(value_targets)
if self.tp.distributed:
self.main_network.apply_gradients_to_global_network()
self.main_network.update_online_network()
else:
self.main_network.apply_gradients_to_online_network()
self.main_network.online_network.reset_accumulated_gradients()
loss['total_loss'].append(total_loss)
loss['policy_losses'].append(policy_losses)
loss['unclipped_grads'].append(unclipped_grads)
loss['fetch_result'].append(fetch_result)
self.unclipped_grads.add_sample(unclipped_grads)
for key in loss.keys():
loss[key] = np.mean(loss[key], 0)
if self.tp.learning_rate_decay_rate != 0:
curr_learning_rate = self.main_network.online_network.get_variable_value(self.tp.learning_rate)
self.curr_learning_rate.add_sample(curr_learning_rate)
else:
curr_learning_rate = self.tp.learning_rate
# log training parameters
screen.log_dict(
OrderedDict([
("Surrogate loss", loss['policy_losses'][0]),
("KL divergence", loss['fetch_result'][0]),
("Entropy", loss['fetch_result'][1]),
("training epoch", j),
("learning_rate", curr_learning_rate)
]),
prefix="Policy training"
)
self.total_kl_divergence_during_training_process = loss['fetch_result'][0]
self.entropy.add_sample(loss['fetch_result'][1])
self.kl_divergence.add_sample(loss['fetch_result'][0])
return policy_losses
def post_training_commands(self):
# clean memory
self.memory.clean()
def train(self):
self.main_network.sync()
dataset = self.memory.transitions
self.fill_advantages(dataset)
# take only the requested number of steps
dataset = dataset[:self.tp.agent.num_consecutive_playing_steps]
if self.tp.distributed and self.tp.agent.share_statistics_between_workers:
self.running_observation_stats.push(np.array([np.array(t.state['observation']) for t in dataset]))
losses = self.train_network(dataset, 10)
self.value_loss.add_sample(losses[0])
self.policy_loss.add_sample(losses[1])
self.update_log() # should be done in order to update the data that has been accumulated * while not playing *
return np.append(losses[0], losses[1])
def choose_action(self, current_state, phase=RunPhase.TRAIN):
if self.env.discrete_controls:
# DISCRETE
_, action_values = self.main_network.online_network.predict(self.tf_input_state(current_state))
action_values = action_values.squeeze()
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_values)
else:
action = np.argmax(action_values)
action_info = {"action_probability": action_values[action]}
# self.entropy.add_sample(-np.sum(action_values * np.log(action_values)))
else:
# CONTINUOUS
_, action_values_mean, action_values_std = self.main_network.online_network.predict(self.tf_input_state(current_state))
action_values_mean = action_values_mean.squeeze()
action_values_std = action_values_std.squeeze()
if phase == RunPhase.TRAIN:
action = np.squeeze(np.random.randn(1, self.action_space_size) * action_values_std + action_values_mean)
# if self.current_episode % 5 == 0 and self.current_episode_steps_counter < 5:
# print action
else:
action = action_values_mean
action_info = {"action_probability": action_values_mean}
return action, action_info
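# A standalone illustrative sketch (not part of the original agent code) of the Generalized
# Advantage Estimation step performed by fill_advantages() above. The discount `gamma` and
# GAE parameter `lam` are illustrative defaults; the agent reads its own values from the
# tuning parameters.
import numpy as np

def gae(rewards, state_values, gamma=0.99, lam=0.95):
    # state_values carries one extra entry: the bootstrap value appended after the rollout
    deltas = rewards + gamma * state_values[1:] - state_values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    value_targets = advantages + state_values[:-1]  # the GAE-based value targets
    return advantages, value_targets

# example: a 3-step rollout that ends in a terminal state (bootstrap value of 0)
adv, targets = gae(np.array([1.0, 0.0, 1.0]), np.array([0.5, 0.4, 0.3, 0.0]))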

View File

@@ -1,109 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.actor_critic_agent import *
from configurations import *
# Deep Deterministic Policy Gradients Network - https://arxiv.org/pdf/1509.02971.pdf
class DDPGAgent(ActorCriticAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ActorCriticAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id,
create_target_network=True)
# define critic network
self.critic_network = self.main_network
# self.networks.append(self.critic_network)
# define actor network
tuning_parameters.agent.input_types = {'observation': InputTypes.Observation}
tuning_parameters.agent.output_types = [OutputTypes.Pi]
self.actor_network = NetworkWrapper(tuning_parameters, True, self.has_global, 'actor',
self.replicated_device, self.worker_device)
self.networks.append(self.actor_network)
self.q_values = Signal("Q")
self.signals.append(self.q_values)
self.reset_game(do_not_reset_env=True)
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# TD error = r + discount*max(q_st_plus_1) - q_st
next_actions = self.actor_network.target_network.predict(next_states)
inputs = copy.copy(next_states)
inputs['action'] = next_actions
q_st_plus_1 = self.critic_network.target_network.predict(inputs)
TD_targets = np.expand_dims(rewards, -1) + \
(1.0 - np.expand_dims(game_overs, -1)) * self.tp.agent.discount * q_st_plus_1
# get the gradients of the critic output with respect to the action
actions_mean = self.actor_network.online_network.predict(current_states)
critic_online_network = self.critic_network.online_network
# TODO: convert into call to predict, current method ignores lstm middleware for example
action_gradients = self.critic_network.sess.run(critic_online_network.gradients_wrt_inputs['action'],
feed_dict=critic_online_network._feed_dict({
**current_states,
'action': actions_mean,
}))[0]
# train the critic
if len(actions.shape) == 1:
actions = np.expand_dims(actions, -1)
result = self.critic_network.train_and_sync_networks({**current_states, 'action': actions}, TD_targets)
total_loss = result[0]
# apply the gradients from the critic to the actor
actor_online_network = self.actor_network.online_network
gradients = self.actor_network.sess.run(actor_online_network.weighted_gradients,
feed_dict=actor_online_network._feed_dict({
**current_states,
actor_online_network.gradients_weights_ph: -action_gradients,
}))
if self.actor_network.has_global:
self.actor_network.global_network.apply_gradients(gradients)
self.actor_network.update_online_network()
else:
self.actor_network.online_network.apply_gradients(gradients)
return total_loss
def train(self):
return Agent.train(self)
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
assert not self.env.discrete_controls, 'DDPG works only for continuous control problems'
result = self.actor_network.online_network.predict(self.tf_input_state(curr_state))
action_values = result[0].squeeze()
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_values)
else:
action = action_values
action = np.clip(action, self.env.action_space_low, self.env.action_space_high)
# get q value
action_batch = np.expand_dims(action, 0)
if type(action) != np.ndarray:
action_batch = np.array([[action]])
inputs = self.tf_input_state(curr_state)
inputs['action'] = action_batch
q_value = self.critic_network.online_network.predict(inputs)[0]
self.q_values.add_sample(q_value)
action_info = {"action_value": q_value}
return action, action_info
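# A minimal numpy sketch (not part of the original agent code) of the DDPG critic target
# computed in learn_from_batch() above: y = r + (1 - done) * gamma * Q_target(s', mu_target(s')).
# The `target_actor` and `target_critic` callables are illustrative stand-ins for the target networks.
import numpy as np

def ddpg_critic_targets(rewards, game_overs, next_states, target_actor, target_critic, gamma=0.99):
    next_actions = target_actor(next_states)           # mu_target(s')
    q_next = target_critic(next_states, next_actions)  # Q_target(s', mu_target(s'))
    return np.expand_dims(rewards, -1) + \
        (1.0 - np.expand_dims(game_overs, -1)) * gamma * q_next

# example with dummy stand-in networks
targets = ddpg_critic_targets(
    rewards=np.array([1.0, 0.0]),
    game_overs=np.array([0.0, 1.0]),
    next_states=np.zeros((2, 4)),
    target_actor=lambda s: np.zeros((len(s), 1)),
    target_critic=lambda s, a: np.ones((len(s), 1)),
)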

View File

@@ -1,42 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
# Double DQN - https://arxiv.org/abs/1509.06461
class DDQNAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
selected_actions = np.argmax(self.main_network.online_network.predict(next_states), 1)
q_st_plus_1 = self.main_network.target_network.predict(next_states)
TD_targets = self.main_network.online_network.predict(current_states)
# initialize with the current prediction so that we will
# only update the action that we have actually done in this transition
for i in range(self.tp.batch_size):
TD_targets[i, actions[i]] = rewards[i] \
+ (1.0 - game_overs[i]) * self.tp.agent.discount * q_st_plus_1[i][
selected_actions[i]]
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
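# A standalone sketch (not part of the original agent code) of the Double DQN target built
# in the loop above: the online network selects the next action, the target network evaluates it.
import numpy as np

def double_dqn_targets(q_online_current, q_online_next, q_target_next,
                       actions, rewards, game_overs, gamma=0.99):
    targets = q_online_current.copy()            # only the taken action gets updated
    selected = np.argmax(q_online_next, axis=1)  # action selection by the online network
    batch = np.arange(len(actions))
    targets[batch, actions] = rewards + (1.0 - game_overs) * gamma * \
        q_target_next[batch, selected]           # action evaluation by the target network
    return targets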

View File

@@ -1,86 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.agent import *
# Direct Future Prediction Agent - http://vladlen.info/papers/learning-to-act.pdf
class DFPAgent(Agent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
Agent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.current_goal = self.tp.agent.goal_vector
self.main_network = NetworkWrapper(tuning_parameters, False, self.has_global, 'main',
self.replicated_device, self.worker_device)
self.networks.append(self.main_network)
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, total_returns = self.extract_batch(batch)
# create the inputs for the network
input = current_states
input['goal'] = np.repeat(np.expand_dims(self.current_goal, 0), self.tp.batch_size, 0)
# get the current outputs of the network
targets = self.main_network.online_network.predict(input)
# change the targets for the taken actions
for i in range(self.tp.batch_size):
targets[i, actions[i]] = batch[i].info['future_measurements'].flatten()
result = self.main_network.train_and_sync_networks(input, targets)
total_loss = result[0]
return total_loss
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
# convert to batch so we can run it through the network
observation = np.expand_dims(np.array(curr_state['observation']), 0)
measurements = np.expand_dims(np.array(curr_state['measurements']), 0)
goal = np.expand_dims(self.current_goal, 0)
# predict the future measurements
measurements_future_prediction = self.main_network.online_network.predict({
"observation": observation,
"measurements": measurements,
"goal": goal})[0]
action_values = np.zeros((self.action_space_size,))
num_steps_used_for_objective = len(self.tp.agent.future_measurements_weights)
# calculate the score of each action by multiplying its future measurements with the goal vector
for action_idx in range(self.action_space_size):
action_measurements = measurements_future_prediction[action_idx]
action_measurements = np.reshape(action_measurements,
(self.tp.agent.num_predicted_steps_ahead, self.measurements_size[0]))
future_steps_values = np.dot(action_measurements, self.current_goal)
action_values[action_idx] = np.dot(future_steps_values[-num_steps_used_for_objective:],
self.tp.agent.future_measurements_weights)
# choose action according to the exploration policy and the current phase (evaluating or training the agent)
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_values)
else:
action = np.argmax(action_values)
action_values = action_values.squeeze()
# store information for plotting interactively (actual plotting is done in agent)
if self.tp.visualization.plot_action_values_online:
for idx, action_name in enumerate(self.env.actions_description):
self.episode_running_info[action_name].append(action_values[idx])
action_info = {"action_probability": 0, "action_value": action_values[action]}
return action, action_info
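# A standalone sketch (not part of the original agent code) of the DFP action-scoring rule
# in choose_action() above: each action's predicted future measurements are projected onto
# the goal vector, and the last predicted steps are weighted by future_measurements_weights.
# Array shapes and values below are illustrative.
import numpy as np

def dfp_action_values(future_measurements, goal, step_weights):
    # future_measurements: (num_actions, num_predicted_steps, num_measurements)
    per_step_values = future_measurements @ goal                    # (num_actions, num_predicted_steps)
    return per_step_values[:, -len(step_weights):] @ step_weights   # (num_actions,)

values = dfp_action_values(np.random.rand(3, 6, 2), goal=np.array([1.0, 0.5]),
                           step_weights=np.array([0.25, 0.25, 0.5]))
best_action = int(np.argmax(values))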

View File

@@ -1,60 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
# Distributional Deep Q Network - https://arxiv.org/pdf/1707.06887.pdf
class DistributionalDQNAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.z_values = np.linspace(self.tp.agent.v_min, self.tp.agent.v_max, self.tp.agent.atoms)
# prediction's format is (batch,actions,atoms)
def get_q_values(self, prediction):
return np.dot(prediction, self.z_values)
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# for the action we actually took, the error is calculated by the atoms distribution
# for all other actions, the error is 0
distributed_q_st_plus_1 = self.main_network.target_network.predict(next_states)
# initialize with the current prediction so that we will
# only update the action that we have actually done in this transition
TD_targets = self.main_network.online_network.predict(current_states)
target_actions = np.argmax(self.get_q_values(distributed_q_st_plus_1), axis=1)
m = np.zeros((self.tp.batch_size, self.z_values.size))
batches = np.arange(self.tp.batch_size)
for j in range(self.z_values.size):
tzj = np.fmax(np.fmin(rewards + (1.0 - game_overs) * self.tp.agent.discount * self.z_values[j],
self.z_values[self.z_values.size - 1]),
self.z_values[0])
bj = (tzj - self.z_values[0])/(self.z_values[1] - self.z_values[0])
u = (np.ceil(bj)).astype(int)
l = (np.floor(bj)).astype(int)
m[batches, l] = m[batches, l] + (distributed_q_st_plus_1[batches, target_actions, j] * (u - bj))
m[batches, u] = m[batches, u] + (distributed_q_st_plus_1[batches, target_actions, j] * (bj - l))
# total_loss = cross entropy between actual result above and predicted result for the given action
TD_targets[batches, actions] = m
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
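# A compact sketch (not part of the original agent code) of the categorical projection
# performed in learn_from_batch() above, for a single transition: each target atom
# r + gamma * z_j is clipped to the support and its probability split between the two
# neighbouring atoms (atoms that land exactly on a support point keep their full mass).
import numpy as np

def project_distribution(next_probs, reward, game_over, z_values, gamma=0.99):
    m = np.zeros_like(z_values)
    delta_z = z_values[1] - z_values[0]
    for p, z in zip(next_probs, z_values):
        tz = np.clip(reward + (1.0 - game_over) * gamma * z, z_values[0], z_values[-1])
        b = (tz - z_values[0]) / delta_z
        l, u = int(np.floor(b)), int(np.ceil(b))
        if l == u:
            m[l] += p
        else:
            m[l] += p * (u - b)
            m[u] += p * (b - l)
    return m

# example: a uniform next-state distribution over 11 atoms spread across [-10, 10]
z = np.linspace(-10, 10, 11)
projected = project_distribution(np.full(11, 1.0 / 11), reward=1.0, game_over=0.0, z_values=z)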

View File

@@ -1,43 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
# Deep Q Network - https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
class DQNAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# for the action we actually took, the error is:
# TD error = r + discount*max(q_st_plus_1) - q_st
# for all other actions, the error is 0
q_st_plus_1 = self.main_network.target_network.predict(next_states)
# initialize with the current prediction so that we will
# only update the action that we have actually done in this transition
TD_targets = self.main_network.online_network.predict(current_states)
for i in range(self.tp.batch_size):
TD_targets[i, actions[i]] = rewards[i] + (1.0 - game_overs[i]) * self.tp.agent.discount * np.max(
q_st_plus_1[i], 0)
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
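# The same target as the loop above, written in vectorized form as a standalone sketch
# (not part of the original agent code): y_i = r_i + (1 - done_i) * gamma * max_a Q_target(s'_i, a),
# applied only to the action that was actually taken.
import numpy as np

def dqn_targets(q_online_current, q_target_next, actions, rewards, game_overs, gamma=0.99):
    targets = q_online_current.copy()
    batch = np.arange(len(actions))
    targets[batch, actions] = rewards + (1.0 - game_overs) * gamma * q_target_next.max(axis=1)
    return targets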

View File

@@ -1,67 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.agent import *
import pygame
class HumanAgent(Agent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
Agent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.clock = pygame.time.Clock()
self.max_fps = int(self.tp.visualization.max_fps_for_human_control)
screen.log_title("Human Control Mode")
available_keys = self.env.get_available_keys()
if available_keys:
screen.log("Use keyboard keys to move. Press escape to quit. Available keys:")
screen.log("")
for action, key in self.env.get_available_keys():
screen.log("\t- {}: {}".format(action, key))
screen.separator()
def train(self):
return 0
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
action = self.env.get_action_from_user()
# keep constant fps
self.clock.tick(self.max_fps)
if not self.env.renderer.is_open:
self.save_replay_buffer_and_exit()
return action, {"action_value": 0}
def save_replay_buffer_and_exit(self):
replay_buffer_path = os.path.join(logger.experiments_path, 'replay_buffer.p')
self.memory.tp = None
to_pickle(self.memory, replay_buffer_path)
screen.log_title("Replay buffer was stored in {}".format(replay_buffer_path))
exit()
def log_to_screen(self, phase):
# log to screen
screen.log_dict(
OrderedDict([
("Episode", self.current_episode),
("total reward", self.total_reward_in_current_episode),
("steps", self.total_steps_counter)
]),
prefix="Recording"
)

View File

@@ -1,65 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.agent import *
# Imitation Agent
class ImitationAgent(Agent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
Agent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.main_network = NetworkWrapper(tuning_parameters, False, self.has_global, 'main',
self.replicated_device, self.worker_device)
self.networks.append(self.main_network)
self.imitation = True
def extract_action_values(self, prediction):
return prediction.squeeze()
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
# convert to batch so we can run it through the network
prediction = self.main_network.online_network.predict(self.tf_input_state(curr_state))
# get action values and extract the best action from it
action_values = self.extract_action_values(prediction)
if self.env.discrete_controls:
# DISCRETE
# action = np.argmax(action_values)
action = self.evaluation_exploration_policy.get_action(action_values)
action_value = {"action_probability": action_values[action]}
else:
# CONTINUOUS
action = action_values
action_value = {}
return action, action_value
def log_to_screen(self, phase):
# log to screen
if phase == RunPhase.TRAIN:
# for the training phase - we log during the episode to visualize the progress in training
screen.log_dict(
OrderedDict([
("Worker", self.task_id),
("Episode", self.current_episode),
("Loss", self.loss.values[-1]),
("Training iteration", self.training_iteration)
]),
prefix="Training"
)
else:
# for the evaluation phase - logging as in regular RL
Agent.log_to_screen(self, phase)

View File

@@ -1,42 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
class MixedMonteCarloAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.mixing_rate = tuning_parameters.agent.monte_carlo_mixing_rate
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, total_return = self.extract_batch(batch)
TD_targets = self.main_network.online_network.predict(current_states)
selected_actions = np.argmax(self.main_network.online_network.predict(next_states), 1)
q_st_plus_1 = self.main_network.target_network.predict(next_states)
# initialize with the current prediction so that we will
# only update the action that we have actually done in this transition
for i in range(self.tp.batch_size):
one_step_target = rewards[i] + (1.0 - game_overs[i]) * self.tp.agent.discount * q_st_plus_1[i][
selected_actions[i]]
monte_carlo_target = total_return[i]
TD_targets[i, actions[i]] = (1 - self.mixing_rate) * one_step_target + self.mixing_rate * monte_carlo_target
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
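# The mixed Monte Carlo target from the loop above as a single expression (an illustrative
# sketch, not part of the original agent code): a convex combination of the one-step
# bootstrap target and the observed episode return, weighted by the monte_carlo_mixing_rate.
def mixed_mc_target(one_step_target, monte_carlo_return, mixing_rate=0.1):
    return (1.0 - mixing_rate) * one_step_target + mixing_rate * monte_carlo_return

# example: bootstrap target of 2.0 and observed return of 5.0 with a 0.1 mixing rate -> 2.3
target = mixed_mc_target(2.0, 5.0, 0.1)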

View File

@@ -1,88 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import numpy as np
import scipy.signal
from agents.value_optimization_agent import ValueOptimizationAgent
from agents.policy_optimization_agent import PolicyOptimizationAgent
from logger import logger
from utils import Signal, last_sample
# N Step Q Learning Agent - https://arxiv.org/abs/1602.01783
class NStepQAgent(ValueOptimizationAgent, PolicyOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id, create_target_network=True)
self.last_gradient_update_step_idx = 0
self.q_values = Signal('Q Values')
self.unclipped_grads = Signal('Grads (unclipped)')
self.value_loss = Signal('Value Loss')
self.signals.append(self.q_values)
self.signals.append(self.unclipped_grads)
self.signals.append(self.value_loss)
def learn_from_batch(self, batch):
# batch contains a list of episodes to learn from
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# get the values for the current states
state_value_head_targets = self.main_network.online_network.predict(current_states)
# the targets for the state value estimator
num_transitions = len(game_overs)
if self.tp.agent.targets_horizon == '1-Step':
# 1-Step Q learning
q_st_plus_1 = self.main_network.target_network.predict(next_states)
for i in reversed(range(num_transitions)):
state_value_head_targets[i][actions[i]] = \
rewards[i] + (1.0 - game_overs[i]) * self.tp.agent.discount * np.max(q_st_plus_1[i], 0)
elif self.tp.agent.targets_horizon == 'N-Step':
# N-Step Q learning
if game_overs[-1]:
R = 0
else:
R = np.max(self.main_network.target_network.predict(last_sample(next_states)))
for i in reversed(range(num_transitions)):
R = rewards[i] + self.tp.agent.discount * R
state_value_head_targets[i][actions[i]] = R
else:
assert False, 'The available values for targets_horizon are: 1-Step, N-Step'
# train
result = self.main_network.online_network.accumulate_gradients(current_states, [state_value_head_targets])
# logging
total_loss, losses, unclipped_grads = result[:3]
self.unclipped_grads.add_sample(unclipped_grads)
self.value_loss.add_sample(losses[0])
return total_loss
def train(self):
# update the target network of every network that has a target network
if self.total_steps_counter % self.tp.agent.num_steps_between_copying_online_weights_to_target == 0:
for network in self.networks:
network.update_target_network(self.tp.agent.rate_for_copying_weights_to_target)
logger.create_signal_value('Update Target Network', 1)
else:
logger.create_signal_value('Update Target Network', 0, overwrite=False)
return PolicyOptimizationAgent.train(self)
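# A standalone sketch (not part of the original agent code) of the 'N-Step' targets branch
# above: returns are accumulated backwards from a bootstrap value R, which is 0 on terminal
# states and max_a Q_target(s_last, a) otherwise.
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# example: rewards [1, 0, 1] bootstrapped from a value of 2.0
targets = n_step_returns(np.array([1.0, 0.0, 1.0]), bootstrap_value=2.0)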

View File

@@ -1,81 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import numpy as np
from agents.value_optimization_agent import ValueOptimizationAgent
from utils import RunPhase, Signal
# Normalized Advantage Functions - https://arxiv.org/pdf/1603.00748.pdf
class NAFAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.l_values = Signal("L")
self.a_values = Signal("Advantage")
self.mu_values = Signal("Action")
self.v_values = Signal("V")
self.signals += [self.l_values, self.a_values, self.mu_values, self.v_values]
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# TD error = r + discount*v_st_plus_1 - q_st
v_st_plus_1 = self.main_network.target_network.predict(
next_states,
self.main_network.target_network.output_heads[0].V,
squeeze_output=False,
)
TD_targets = np.expand_dims(rewards, -1) + (1.0 - np.expand_dims(game_overs, -1)) * self.tp.agent.discount * v_st_plus_1
if len(actions.shape) == 1:
actions = np.expand_dims(actions, -1)
result = self.main_network.train_and_sync_networks({**current_states, 'output_0_0': actions}, TD_targets)
total_loss = result[0]
return total_loss
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
assert not self.env.discrete_controls, 'NAF works only for continuous control problems'
# convert to batch so we can run it through the network
# observation = np.expand_dims(np.array(curr_state['observation']), 0)
naf_head = self.main_network.online_network.output_heads[0]
action_values = self.main_network.online_network.predict(
self.tf_input_state(curr_state),
outputs=naf_head.mu,
squeeze_output=False,
)
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_values)
else:
action = action_values
Q, L, A, mu, V = self.main_network.online_network.predict(
{**self.tf_input_state(curr_state), 'output_0_0': action_values},
outputs=[naf_head.Q, naf_head.L, naf_head.A, naf_head.mu, naf_head.V],
)
# store the q values statistics for logging
self.q_values.add_sample(Q)
self.l_values.add_sample(L)
self.a_values.add_sample(A)
self.mu_values.add_sample(mu)
self.v_values.add_sample(V)
action_value = {"action_value": Q}
return action, action_value
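# A small numpy illustration (not part of the original agent code) of the decomposition
# behind the NAF head used above: Q(s, a) = V(s) + A(s, a) with
# A(s, a) = -0.5 (a - mu)^T P (a - mu) and P = L L^T for a lower-triangular L, so that mu
# is always the argmax of Q. The matrix below is illustrative.
import numpy as np

def naf_q_value(action, mu, L, V):
    P = L @ L.T                        # positive semi-definite by construction
    diff = action - mu
    advantage = -0.5 * diff @ P @ diff
    return V + advantage

L = np.array([[1.0, 0.0], [0.5, 1.0]])                                               # lower-triangular
q_at_mu = naf_q_value(np.array([0.2, -0.1]), mu=np.array([0.2, -0.1]), L=L, V=3.0)   # equals V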

View File

@@ -1,96 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import numpy as np
import os, pickle
from agents.value_optimization_agent import ValueOptimizationAgent
from logger import screen
from utils import RunPhase
# Neural Episodic Control - https://arxiv.org/pdf/1703.01988.pdf
class NECAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id,
create_target_network=False)
self.current_episode_state_embeddings = []
self.training_started = False
def learn_from_batch(self, batch):
if not self.main_network.online_network.output_heads[0].DND.has_enough_entries(self.tp.agent.number_of_knn):
return 0
else:
if not self.training_started:
self.training_started = True
screen.log_title("Finished collecting initial entries in DND. Starting to train network...")
current_states, next_states, actions, rewards, game_overs, total_return = self.extract_batch(batch)
TD_targets = self.main_network.online_network.predict(current_states)
# only update the action that we have actually done in this transition
for i in range(self.tp.batch_size):
TD_targets[i, actions[i]] = total_return[i]
# train the neural network
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
def act(self, phase=RunPhase.TRAIN):
if self.in_heatup:
# get embedding in heatup (otherwise we get it through choose_action)
embedding = self.main_network.online_network.predict(
self.tf_input_state(self.curr_state),
outputs=self.main_network.online_network.state_embedding)
self.current_episode_state_embeddings.append(embedding)
return super().act(phase)
def get_prediction(self, curr_state):
# get the actions q values and the state embedding
embedding, actions_q_values = self.main_network.online_network.predict(
self.tf_input_state(curr_state),
outputs=[self.main_network.online_network.state_embedding,
self.main_network.online_network.output_heads[0].output]
)
# store the state embedding for inserting it to the DND later
self.current_episode_state_embeddings.append(embedding.squeeze())
actions_q_values = actions_q_values[0][0]
return actions_q_values
def reset_game(self, do_not_reset_env=False):
super().reset_game(do_not_reset_env)
# get the last full episode that we have collected
episode = self.memory.get_last_complete_episode()
if episode is not None:
# the indexing is only necessary because the heatup can end in the middle of an episode
# this won't be required after fixing this so that when the heatup is ended, the episode is closed
returns = episode.get_transitions_attribute('total_return')[:len(self.current_episode_state_embeddings)]
actions = episode.get_transitions_attribute('action')[:len(self.current_episode_state_embeddings)]
self.main_network.online_network.output_heads[0].DND.add(self.current_episode_state_embeddings,
actions, returns)
self.current_episode_state_embeddings = []
def save_model(self, model_id):
self.main_network.save_model(model_id)
with open(os.path.join(self.tp.save_model_dir, str(model_id) + '.dnd'), 'wb') as f:
pickle.dump(self.main_network.online_network.output_heads[0].DND, f, pickle.HIGHEST_PROTOCOL)
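# A standalone sketch (not part of the original agent code) of the DND lookup behind the Q
# head used above, following the NEC paper: the query embedding is compared to the stored
# embeddings with an inverse-distance kernel, and the stored returns are combined with the
# normalized kernel weights. The delta smoothing constant and the data are illustrative.
import numpy as np

def dnd_lookup(query, stored_embeddings, stored_returns, delta=1e-3):
    distances = np.sum((stored_embeddings - query) ** 2, axis=1)
    kernel = 1.0 / (distances + delta)
    weights = kernel / kernel.sum()
    return float(weights @ stored_returns)

q_estimate = dnd_lookup(np.array([0.1, 0.2]),
                        stored_embeddings=np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0]]),
                        stored_returns=np.array([0.5, 2.0, -1.0]))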

View File

@@ -1,65 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
# Persistent Advantage Learning - https://arxiv.org/pdf/1512.04860.pdf
class PALAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.alpha = tuning_parameters.agent.pal_alpha
self.persistent = tuning_parameters.agent.persistent_advantage_learning
self.monte_carlo_mixing_rate = tuning_parameters.agent.monte_carlo_mixing_rate
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, total_return = self.extract_batch(batch)
selected_actions = np.argmax(self.main_network.online_network.predict(next_states), 1)
# next state values
q_st_plus_1_target = self.main_network.target_network.predict(next_states)
v_st_plus_1_target = np.max(q_st_plus_1_target, 1)
# current state values according to online network
q_st_online = self.main_network.online_network.predict(current_states)
# current state values according to target network
q_st_target = self.main_network.target_network.predict(current_states)
v_st_target = np.max(q_st_target, 1)
# calculate TD error
TD_targets = np.copy(q_st_online)
for i in range(self.tp.batch_size):
TD_targets[i, actions[i]] = rewards[i] + (1.0 - game_overs[i]) * self.tp.agent.discount * \
q_st_plus_1_target[i][selected_actions[i]]
advantage_learning_update = v_st_target[i] - q_st_target[i, actions[i]]
next_advantage_learning_update = v_st_plus_1_target[i] - q_st_plus_1_target[i, selected_actions[i]]
# Persistent Advantage Learning or Regular Advantage Learning
if self.persistent:
TD_targets[i, actions[i]] -= self.alpha * min(advantage_learning_update, next_advantage_learning_update)
else:
TD_targets[i, actions[i]] -= self.alpha * advantage_learning_update
# mixing monte carlo updates
monte_carlo_target = total_return[i]
TD_targets[i, actions[i]] = (1 - self.monte_carlo_mixing_rate) * TD_targets[i, actions[i]] \
+ self.monte_carlo_mixing_rate * monte_carlo_target
result = self.main_network.train_and_sync_networks(current_states, TD_targets)
total_loss = result[0]
return total_loss
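# The scalar form of the correction applied in the loop above, as an illustrative sketch
# (not part of the original agent code): the double-DQN style target is reduced by alpha
# times the current advantage gap (or, with persistent advantage learning, the smaller of
# the current and next-state gaps) and then mixed with the Monte Carlo return. The default
# parameter values are illustrative only.
def pal_target(double_dqn_target, advantage_gap, next_advantage_gap,
               monte_carlo_return, alpha=0.9, persistent=True, mc_mixing_rate=0.1):
    correction = min(advantage_gap, next_advantage_gap) if persistent else advantage_gap
    corrected = double_dqn_target - alpha * correction
    return (1.0 - mc_mixing_rate) * corrected + mc_mixing_rate * monte_carlo_return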

View File

@@ -1,93 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.policy_optimization_agent import *
import numpy as np
from logger import *
import tensorflow as tf
try:
import matplotlib.pyplot as plt
except:
from logger import failed_imports
failed_imports.append("matplotlib")
from utils import *
class PolicyGradientsAgent(PolicyOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
PolicyOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.returns_mean = Signal('Returns Mean')
self.returns_variance = Signal('Returns Variance')
self.signals.append(self.returns_mean)
self.signals.append(self.returns_variance)
self.last_gradient_update_step_idx = 0
def learn_from_batch(self, batch):
# batch contains a list of episodes to learn from
current_states, next_states, actions, rewards, game_overs, total_returns = self.extract_batch(batch)
for i in reversed(range(len(total_returns))):
if self.policy_gradient_rescaler == PolicyGradientRescaler.TOTAL_RETURN:
total_returns[i] = total_returns[0]
elif self.policy_gradient_rescaler == PolicyGradientRescaler.FUTURE_RETURN:
# just take the total return as it is
pass
elif self.policy_gradient_rescaler == PolicyGradientRescaler.FUTURE_RETURN_NORMALIZED_BY_EPISODE:
# we can get a single transition episode while playing Doom Basic, causing the std to be 0
if self.std_discounted_return != 0:
total_returns[i] = (total_returns[i] - self.mean_discounted_return) / self.std_discounted_return
else:
total_returns[i] = 0
elif self.policy_gradient_rescaler == PolicyGradientRescaler.FUTURE_RETURN_NORMALIZED_BY_TIMESTEP:
total_returns[i] -= self.mean_return_over_multiple_episodes[i]
else:
screen.warning("WARNING: The requested policy gradient rescaler is not available")
targets = total_returns
if not self.env.discrete_controls and len(actions.shape) < 2:
actions = np.expand_dims(actions, -1)
self.returns_mean.add_sample(np.mean(total_returns))
self.returns_variance.add_sample(np.std(total_returns))
result = self.main_network.online_network.accumulate_gradients({**current_states, 'output_0_0': actions}, targets)
total_loss = result[0]
return total_loss
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
# convert to batch so we can run it through the network
if self.env.discrete_controls:
# DISCRETE
action_values = self.main_network.online_network.predict(self.tf_input_state(curr_state)).squeeze()
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_values)
else:
action = np.argmax(action_values)
action_value = {"action_probability": action_values[action]}
self.entropy.add_sample(-np.sum(action_values * np.log(action_values + eps)))
else:
# CONTINUOUS
result = self.main_network.online_network.predict(self.tf_input_state(curr_state))
action_values = result[0].squeeze()
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_values)
else:
action = action_values
action_value = {}
return action, action_value
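# A standalone sketch (not part of the original agent code) of the
# FUTURE_RETURN_NORMALIZED_BY_EPISODE rescaler above: the per-step returns are standardized
# with the episode's own mean and standard deviation, and zeroed out when the std is 0
# (e.g. a single-transition episode).
import numpy as np

def normalize_returns_by_episode(total_returns):
    mean, std = np.mean(total_returns), np.std(total_returns)
    if std == 0:
        return np.zeros_like(total_returns)
    return (total_returns - mean) / std

scaled = normalize_returns_by_episode(np.array([3.0, 2.0, 1.0]))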

View File

@@ -1,123 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.agent import *
from memories.memory import Episode
class PolicyGradientRescaler(Enum):
TOTAL_RETURN = 0
FUTURE_RETURN = 1
FUTURE_RETURN_NORMALIZED_BY_EPISODE = 2
FUTURE_RETURN_NORMALIZED_BY_TIMESTEP = 3 # baselined
Q_VALUE = 4
A_VALUE = 5
TD_RESIDUAL = 6
DISCOUNTED_TD_RESIDUAL = 7
GAE = 8
class PolicyOptimizationAgent(Agent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0, create_target_network=False):
Agent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.main_network = NetworkWrapper(tuning_parameters, create_target_network, self.has_global, 'main',
self.replicated_device, self.worker_device)
self.networks.append(self.main_network)
self.policy_gradient_rescaler = PolicyGradientRescaler().get(self.tp.agent.policy_gradient_rescaler)
# statistics for variance reduction
self.last_gradient_update_step_idx = 0
self.max_episode_length = 100000
self.mean_return_over_multiple_episodes = np.zeros(self.max_episode_length)
self.num_episodes_where_step_has_been_seen = np.zeros(self.max_episode_length)
self.entropy = Signal('Entropy')
self.signals.append(self.entropy)
self.reset_game(do_not_reset_env=True)
def log_to_screen(self, phase):
# log to screen
if self.current_episode > 0:
screen.log_dict(
OrderedDict([
("Worker", self.task_id),
("Episode", self.current_episode),
("total reward", self.total_reward_in_current_episode),
("steps", self.total_steps_counter),
("training iteration", self.training_iteration)
]),
prefix=phase
)
def update_episode_statistics(self, episode):
episode_discounted_returns = []
for i in range(episode.length()):
transition = episode.get_transition(i)
episode_discounted_returns.append(transition.total_return)
self.num_episodes_where_step_has_been_seen[i] += 1
self.mean_return_over_multiple_episodes[i] -= self.mean_return_over_multiple_episodes[i] / \
self.num_episodes_where_step_has_been_seen[i]
self.mean_return_over_multiple_episodes[i] += transition.total_return / \
self.num_episodes_where_step_has_been_seen[i]
self.mean_discounted_return = np.mean(episode_discounted_returns)
self.std_discounted_return = np.std(episode_discounted_returns)
def train(self):
if self.memory.length() == 0:
return 0
episode = self.memory.get_episode(0)
# check if we should calculate gradients or skip
episode_ended = self.memory.num_complete_episodes() >= 1
num_steps_passed_since_last_update = episode.length() - self.last_gradient_update_step_idx
is_t_max_steps_passed = num_steps_passed_since_last_update >= self.tp.agent.num_steps_between_gradient_updates
if not (is_t_max_steps_passed or episode_ended):
return 0
total_loss = 0
if num_steps_passed_since_last_update > 0:
# we need to update the returns of the episode until now
episode.update_returns(self.tp.agent.discount)
# get t_max transitions, or fewer if we reached a terminal state.
# Will be used for both actor-critic and vanilla PG.
# In order to get full episodes, vanilla PG will set the end_idx to a very big value.
transitions = []
start_idx = self.last_gradient_update_step_idx
end_idx = episode.length()
for idx in range(start_idx, end_idx):
transitions.append(episode.get_transition(idx))
self.last_gradient_update_step_idx = end_idx
# update the statistics for the variance reduction techniques
if self.tp.agent.type == 'PolicyGradientsAgent':
self.update_episode_statistics(episode)
# accumulate the gradients and apply them once in every apply_gradients_every_x_episodes episodes
total_loss = self.learn_from_batch(transitions)
if self.current_episode % self.tp.agent.apply_gradients_every_x_episodes == 0:
self.main_network.apply_gradients_and_sync_networks()
# move the pointer to the next episode start and discard the episode. we use it only once
if episode_ended:
self.memory.remove_episode(0)
self.last_gradient_update_step_idx = 0
return total_loss
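# update_episode_statistics() above maintains an incremental per-timestep mean of the
# returns, which backs the FUTURE_RETURN_NORMALIZED_BY_TIMESTEP baseline. The same
# incremental-mean update in isolation (an illustrative sketch, not part of the original
# agent code):
def update_running_mean(current_mean, count, new_value):
    count += 1
    current_mean += (new_value - current_mean) / count
    return current_mean, count

# example: feeding returns 4, 2, 3 leaves the running mean at 3.0
mean, n = 0.0, 0
for ret in (4.0, 2.0, 3.0):
    mean, n = update_running_mean(mean, n, ret)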

View File

@@ -1,289 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.actor_critic_agent import *
from random import shuffle
# Proximal Policy Optimization - https://arxiv.org/pdf/1707.06347.pdf
class PPOAgent(ActorCriticAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ActorCriticAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id,
create_target_network=True)
self.critic_network = self.main_network
# define the policy network
tuning_parameters.agent.input_types = {'observation': InputTypes.Observation}
tuning_parameters.agent.output_types = [OutputTypes.PPO]
tuning_parameters.agent.optimizer_type = 'Adam'
tuning_parameters.agent.l2_regularization = 0
self.policy_network = NetworkWrapper(tuning_parameters, True, self.has_global, 'policy',
self.replicated_device, self.worker_device)
self.networks.append(self.policy_network)
# signals definition
self.value_loss = Signal('Value Loss')
self.signals.append(self.value_loss)
self.policy_loss = Signal('Policy Loss')
self.signals.append(self.policy_loss)
self.kl_divergence = Signal('KL Divergence')
self.signals.append(self.kl_divergence)
self.total_kl_divergence_during_training_process = 0.0
self.unclipped_grads = Signal('Grads (unclipped)')
self.signals.append(self.unclipped_grads)
self.reset_game(do_not_reset_env=True)
def fill_advantages(self, batch):
current_states, next_states, actions, rewards, game_overs, total_return = self.extract_batch(batch)
# * Found not to have any impact *
# current_states_with_timestep = self.concat_state_and_timestep(batch)
current_state_values = self.critic_network.online_network.predict(current_states).squeeze()
# calculate advantages
advantages = []
if self.policy_gradient_rescaler == PolicyGradientRescaler.A_VALUE:
advantages = total_return - current_state_values
elif self.policy_gradient_rescaler == PolicyGradientRescaler.GAE:
# get bootstraps
episode_start_idx = 0
advantages = np.array([])
# current_state_values[game_overs] = 0
for idx, game_over in enumerate(game_overs):
if game_over:
# get advantages for the rollout
value_bootstrapping = np.zeros((1,))
rollout_state_values = np.append(current_state_values[episode_start_idx:idx+1], value_bootstrapping)
rollout_advantages, _ = \
self.get_general_advantage_estimation_values(rewards[episode_start_idx:idx+1],
rollout_state_values)
episode_start_idx = idx + 1
advantages = np.append(advantages, rollout_advantages)
else:
screen.warning("WARNING: The requested policy gradient rescaler is not available")
# standardize
advantages = (advantages - np.mean(advantages)) / (np.std(advantages) + 1e-8)
for transition, advantage in zip(self.memory.transitions, advantages):
transition.info['advantage'] = advantage
self.action_advantages.add_sample(advantages)
def train_value_network(self, dataset, epochs):
loss = []
current_states, _, _, _, _, total_return = self.extract_batch(dataset)
# * Found not to have any impact *
# add a timestep to the observation
# current_states_with_timestep = self.concat_state_and_timestep(dataset)
total_return = np.expand_dims(total_return, -1)
mix_fraction = self.tp.agent.value_targets_mix_fraction
for j in range(epochs):
batch_size = len(dataset)
if self.critic_network.online_network.optimizer_type != 'LBFGS':
batch_size = self.tp.batch_size
for i in range(len(dataset) // batch_size):
# split to batches for first order optimization techniques
current_states_batch = {
k: v[i * batch_size:(i + 1) * batch_size]
for k, v in current_states.items()
}
total_return_batch = total_return[i * batch_size:(i + 1) * batch_size]
old_policy_values = force_list(self.critic_network.target_network.predict(
current_states_batch).squeeze())
if self.critic_network.online_network.optimizer_type != 'LBFGS':
targets = total_return_batch
else:
current_values = self.critic_network.online_network.predict(current_states_batch)
targets = current_values * (1 - mix_fraction) + total_return_batch * mix_fraction
inputs = copy.copy(current_states_batch)
for input_index, input in enumerate(old_policy_values):
name = 'output_0_{}'.format(input_index)
if name in self.critic_network.online_network.inputs:
inputs[name] = input
value_loss = self.critic_network.online_network.accumulate_gradients(inputs, targets)
self.critic_network.apply_gradients_to_online_network()
if self.tp.distributed:
self.critic_network.apply_gradients_to_global_network()
self.critic_network.online_network.reset_accumulated_gradients()
loss.append([value_loss[0]])
loss = np.mean(loss, 0)
return loss
def concat_state_and_timestep(self, dataset):
current_states_with_timestep = [np.append(transition.state['observation'], transition.info['timestep'])
for transition in dataset]
current_states_with_timestep = np.expand_dims(current_states_with_timestep, -1)
return current_states_with_timestep
def train_policy_network(self, dataset, epochs):
loss = []
for j in range(epochs):
loss = {
'total_loss': [],
'policy_losses': [],
'unclipped_grads': [],
'fetch_result': []
}
#shuffle(dataset)
for i in range(len(dataset) // self.tp.batch_size):
batch = dataset[i * self.tp.batch_size:(i + 1) * self.tp.batch_size]
current_states, _, actions, _, _, total_return = self.extract_batch(batch)
advantages = np.array([t.info['advantage'] for t in batch])
if not self.tp.env_instance.discrete_controls and len(actions.shape) == 1:
actions = np.expand_dims(actions, -1)
# get old policy probabilities and distribution
old_policy = force_list(self.policy_network.target_network.predict(current_states))
# calculate gradients and apply on both the local policy network and on the global policy network
fetches = [self.policy_network.online_network.output_heads[0].kl_divergence,
self.policy_network.online_network.output_heads[0].entropy]
inputs = copy.copy(current_states)
# TODO: why is this output 0 and not output 1?
inputs['output_0_0'] = actions
# TODO: does old_policy_distribution really need to be represented as a list?
# A: yes it does, in the event of discrete controls, it has just a mean
# otherwise, it has both a mean and standard deviation
for input_index, input in enumerate(old_policy):
inputs['output_0_{}'.format(input_index + 1)] = input
total_loss, policy_losses, unclipped_grads, fetch_result =\
self.policy_network.online_network.accumulate_gradients(
inputs, [advantages], additional_fetches=fetches)
self.policy_network.apply_gradients_to_online_network()
if self.tp.distributed:
self.policy_network.apply_gradients_to_global_network()
self.policy_network.online_network.reset_accumulated_gradients()
loss['total_loss'].append(total_loss)
loss['policy_losses'].append(policy_losses)
loss['unclipped_grads'].append(unclipped_grads)
loss['fetch_result'].append(fetch_result)
self.unclipped_grads.add_sample(unclipped_grads)
for key in loss.keys():
loss[key] = np.mean(loss[key], 0)
if self.tp.learning_rate_decay_rate != 0:
curr_learning_rate = self.main_network.online_network.get_variable_value(self.tp.learning_rate)
self.curr_learning_rate.add_sample(curr_learning_rate)
else:
curr_learning_rate = self.tp.learning_rate
# log training parameters
screen.log_dict(
OrderedDict([
("Surrogate loss", loss['policy_losses'][0]),
("KL divergence", loss['fetch_result'][0]),
("Entropy", loss['fetch_result'][1]),
("training epoch", j),
("learning_rate", curr_learning_rate)
]),
prefix="Policy training"
)
self.total_kl_divergence_during_training_process = loss['fetch_result'][0]
self.entropy.add_sample(loss['fetch_result'][1])
self.kl_divergence.add_sample(loss['fetch_result'][0])
return loss['total_loss']
def update_kl_coefficient(self):
# Following John Schulman's implementation, the mean KL divergence is taken only over the
# last epoch; this is unusual, but it is known to work well in practice
screen.log_title("KL = {}".format(self.total_kl_divergence_during_training_process))
# update kl coefficient
kl_target = self.tp.agent.target_kl_divergence
kl_coefficient = self.policy_network.online_network.get_variable_value(
self.policy_network.online_network.output_heads[0].kl_coefficient)
new_kl_coefficient = kl_coefficient
if self.total_kl_divergence_during_training_process > 1.3 * kl_target:
# kl too high => increase regularization
new_kl_coefficient *= 1.5
elif self.total_kl_divergence_during_training_process < 0.7 * kl_target:
# kl too low => decrease regularization
new_kl_coefficient /= 1.5
# update the kl coefficient variable
if kl_coefficient != new_kl_coefficient:
self.policy_network.online_network.set_variable_value(
self.policy_network.online_network.output_heads[0].assign_kl_coefficient,
new_kl_coefficient,
self.policy_network.online_network.output_heads[0].kl_coefficient_ph)
screen.log_title("KL penalty coefficient change = {} -> {}".format(kl_coefficient, new_kl_coefficient))
def post_training_commands(self):
if self.tp.agent.use_kl_regularization:
self.update_kl_coefficient()
# clean memory
self.memory.clean()
def train(self):
self.policy_network.sync()
self.critic_network.sync()
dataset = self.memory.transitions
self.fill_advantages(dataset)
# take only the requested number of steps
dataset = dataset[:self.tp.agent.num_consecutive_playing_steps]
value_loss = self.train_value_network(dataset, 1)
policy_loss = self.train_policy_network(dataset, 10)
self.value_loss.add_sample(value_loss)
self.policy_loss.add_sample(policy_loss)
self.update_log() # should be done in order to update the data that has been accumulated * while not playing *
return np.append(value_loss, policy_loss)
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
if self.env.discrete_controls:
# DISCRETE
action_values = self.policy_network.online_network.predict(self.tf_input_state(curr_state)).squeeze()
if phase == RunPhase.TRAIN:
action = self.exploration_policy.get_action(action_values)
else:
action = np.argmax(action_values)
action_info = {"action_probability": action_values[action]}
# self.entropy.add_sample(-np.sum(action_values * np.log(action_values)))
else:
# CONTINUOUS
action_values_mean, action_values_std = self.policy_network.online_network.predict(self.tf_input_state(curr_state))
action_values_mean = action_values_mean.squeeze()
action_values_std = action_values_std.squeeze()
if phase == RunPhase.TRAIN:
action = np.squeeze(np.random.randn(1, self.action_space_size) * action_values_std + action_values_mean)
else:
action = action_values_mean
action_info = {"action_probability": action_values_mean}
return action, action_info
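# The adaptive KL penalty rule from update_kl_coefficient() above, in isolation (an
# illustrative sketch, not part of the original agent code): the coefficient grows when the
# measured KL overshoots the target and shrinks when it undershoots, using the 1.3x / 0.7x
# bands and the factor of 1.5.
def adapt_kl_coefficient(kl_coefficient, measured_kl, kl_target):
    if measured_kl > 1.3 * kl_target:
        kl_coefficient *= 1.5      # KL too high -> increase regularization
    elif measured_kl < 0.7 * kl_target:
        kl_coefficient /= 1.5      # KL too low -> decrease regularization
    return kl_coefficient

new_coefficient = adapt_kl_coefficient(1.0, measured_kl=0.05, kl_target=0.01)  # -> 1.5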

View File

@@ -1,66 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from agents.value_optimization_agent import *
# Quantile Regression Deep Q Network - https://arxiv.org/pdf/1710.10044v1.pdf
class QuantileRegressionDQNAgent(ValueOptimizationAgent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0):
ValueOptimizationAgent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.quantile_probabilities = np.ones(self.tp.agent.atoms) / float(self.tp.agent.atoms)
# prediction's format is (batch,actions,atoms)
def get_q_values(self, quantile_values):
return np.dot(quantile_values, self.quantile_probabilities)
def learn_from_batch(self, batch):
current_states, next_states, actions, rewards, game_overs, _ = self.extract_batch(batch)
# get the quantiles of the next states and current states
next_state_quantiles = self.main_network.target_network.predict(next_states)
current_quantiles = self.main_network.online_network.predict(current_states)
# get the optimal actions to take for the next states
target_actions = np.argmax(self.get_q_values(next_state_quantiles), axis=1)
# calculate the Bellman update
batch_idx = list(range(self.tp.batch_size))
rewards = np.expand_dims(rewards, -1)
game_overs = np.expand_dims(game_overs, -1)
TD_targets = rewards + (1.0 - game_overs) * self.tp.agent.discount \
* next_state_quantiles[batch_idx, target_actions]
# get the locations of the selected actions within the batch for indexing purposes
actions_locations = [[b, a] for b, a in zip(batch_idx, actions)]
# calculate the cumulative quantile probabilities and reorder them to fit the sorted quantiles order
cumulative_probabilities = np.array(range(self.tp.agent.atoms+1))/float(self.tp.agent.atoms) # tau_i
quantile_midpoints = 0.5*(cumulative_probabilities[1:] + cumulative_probabilities[:-1]) # tau^hat_i
quantile_midpoints = np.tile(quantile_midpoints, (self.tp.batch_size, 1))
sorted_quantiles = np.argsort(current_quantiles[batch_idx, actions])
for idx in range(self.tp.batch_size):
quantile_midpoints[idx, :] = quantile_midpoints[idx, sorted_quantiles[idx]]
# train
result = self.main_network.train_and_sync_networks({
**current_states,
'output_0_0': actions_locations,
'output_0_1': quantile_midpoints,
}, TD_targets)
total_loss = result[0]
return total_loss
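As a standalone illustration of the target computation in learn_from_batch above, the following NumPy sketch (hypothetical shapes and names) builds the per-sample quantile Bellman targets and the tau-hat midpoints from the QR-DQN paper.

```python
import numpy as np

def qr_dqn_targets(next_quantiles, rewards, dones, discount):
    """Per-sample quantile Bellman targets of shape (batch, atoms) -- illustrative sketch."""
    batch_size, num_actions, num_atoms = next_quantiles.shape
    # greedy next action under the mean of the quantiles (uniform quantile probabilities)
    next_q = next_quantiles.mean(axis=2)                             # (batch, actions)
    target_actions = next_q.argmax(axis=1)                           # (batch,)
    chosen = next_quantiles[np.arange(batch_size), target_actions]   # (batch, atoms)
    return rewards[:, None] + (1.0 - dones[:, None]) * discount * chosen

def quantile_midpoints(num_atoms):
    """tau_hat_i = (tau_i + tau_{i+1}) / 2 for uniform quantile fractions."""
    taus = np.arange(num_atoms + 1) / float(num_atoms)
    return 0.5 * (taus[1:] + taus[:-1])

targets = qr_dqn_targets(np.random.rand(4, 2, 5), np.ones(4), np.zeros(4), 0.99)
tau_hat = quantile_midpoints(5)   # [0.1, 0.3, 0.5, 0.7, 0.9]
```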


@@ -1,77 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import numpy as np
from agents.agent import Agent
from architectures.network_wrapper import NetworkWrapper
from utils import RunPhase, Signal
class ValueOptimizationAgent(Agent):
def __init__(self, env, tuning_parameters, replicated_device=None, thread_id=0, create_target_network=True):
Agent.__init__(self, env, tuning_parameters, replicated_device, thread_id)
self.main_network = NetworkWrapper(tuning_parameters, create_target_network, self.has_global, 'main',
self.replicated_device, self.worker_device)
self.networks.append(self.main_network)
self.q_values = Signal("Q")
self.signals.append(self.q_values)
self.reset_game(do_not_reset_env=True)
# Algorithms for which q_values are calculated from predictions will override this function
def get_q_values(self, prediction):
return prediction
def get_prediction(self, curr_state):
return self.main_network.online_network.predict(self.tf_input_state(curr_state))
def _validate_action(self, policy, action):
if np.array(action).shape != ():
raise ValueError((
'The exploration_policy {} returned a vector of actions '
'instead of a single action. ValueOptimizationAgents '
'require exploration policies which return a single action.'
).format(policy.__class__.__name__))
def choose_action(self, curr_state, phase=RunPhase.TRAIN):
prediction = self.get_prediction(curr_state)
actions_q_values = self.get_q_values(prediction)
# choose action according to the exploration policy and the current phase (evaluating or training the agent)
if phase == RunPhase.TRAIN:
exploration_policy = self.exploration_policy
else:
exploration_policy = self.evaluation_exploration_policy
action = exploration_policy.get_action(actions_q_values)
self._validate_action(exploration_policy, action)
# this is for bootstrapped dqn
if type(actions_q_values) == list and len(actions_q_values) > 0:
actions_q_values = actions_q_values[self.exploration_policy.selected_head]
actions_q_values = actions_q_values.squeeze()
# store the q values statistics for logging
self.q_values.add_sample(actions_q_values)
# store information for plotting interactively (actual plotting is done in agent)
if self.tp.visualization.plot_action_values_online:
for idx, action_name in enumerate(self.env.actions_description):
self.episode_running_info[action_name].append(actions_q_values[idx])
action_value = {"action_value": actions_q_values[action], "max_action_value": np.max(actions_q_values)}
return action, action_value
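The exploration policies referenced above are defined elsewhere in Coach; purely as an illustration of how a discrete exploration policy can map the Q-values to a single action, here is a minimal epsilon-greedy sketch with hypothetical names.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    q_values = np.asarray(q_values).squeeze()
    if rng.rand() < epsilon:
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))

action = epsilon_greedy_action([0.1, 0.7, 0.2], epsilon=0.05)
```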


@@ -1,31 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from architectures.architecture import *
from logger import failed_imports
try:
from architectures.tensorflow_components.general_network import *
from architectures.tensorflow_components.architecture import *
except ImportError:
failed_imports.append("TensorFlow")
try:
from architectures.neon_components.general_network import *
from architectures.neon_components.architecture import *
except ImportError:
failed_imports.append("Neon")
from architectures.network_wrapper import *
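The module above makes each backend optional by catching ImportError and recording the failure; a minimal standalone sketch of that pattern follows (the module name is hypothetical).

```python
failed_imports = []

try:
    import some_optional_backend  # hypothetical optional dependency
except ImportError:
    failed_imports.append("some_optional_backend")

if failed_imports:
    print("Optional backends not available: {}".format(", ".join(failed_imports)))
```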


@@ -1,129 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import sys
import copy
from ngraph.frontends.neon import *
import ngraph as ng
from architectures.architecture import *
import numpy as np
from utils import *
class NeonArchitecture(Architecture):
def __init__(self, tuning_parameters, name="", global_network=None, network_is_local=True):
Architecture.__init__(self, tuning_parameters, name)
assert tuning_parameters.agent.neon_support, 'Neon is not supported for this agent'
self.clip_error = tuning_parameters.clip_gradients
self.total_loss = None
self.epoch = 0
self.inputs = []
self.outputs = []
self.targets = []
self.losses = []
self.transformer = tuning_parameters.sess
self.network = self.get_model(tuning_parameters)
self.accumulated_gradients = []
# training and inference ops
train_output = ng.sequential([
self.optimizer(self.total_loss),
self.total_loss
])
placeholders = self.inputs + self.targets
self.train_op = self.transformer.add_computation(
ng.computation(
train_output, *placeholders
)
)
self.predict_op = self.transformer.add_computation(
ng.computation(
self.outputs, self.inputs[0]
)
)
# update weights from array op
self.weights = [ng.placeholder(w.axes) for w in self.total_loss.variables()]
self.set_weights_ops = []
for target_variable, variable in zip(self.total_loss.variables(), self.weights):
self.set_weights_ops.append(self.transformer.add_computation(
ng.computation(
ng.assign(target_variable, variable), variable
)
))
# get weights op
self.get_variables = self.transformer.add_computation(
ng.computation(
self.total_loss.variables()
)
)
def predict(self, inputs):
batch_size = inputs.shape[0]
# move batch axis to the end
inputs = inputs.swapaxes(0, -1)
prediction = self.predict_op(inputs) # TODO: problem with multiple inputs
if type(prediction) != tuple:
prediction = (prediction,)  # wrap a single output in a tuple so the loop below can iterate over it
# process all the outputs from the network
output = []
for p in prediction:
output.append(p.transpose()[:batch_size].copy())
# if there is only one output then we don't need a list
if len(output) == 1:
output = output[0]
return output
def train_on_batch(self, inputs, targets):
loss = self.accumulate_gradients(inputs, targets)
self.apply_and_reset_gradients(self.accumulated_gradients)
return loss
def get_weights(self):
return self.get_variables()
def set_weights(self, weights, rate=1.0):
if rate != 1:
current_weights = self.get_weights()
updated_weights = [(1 - rate) * t + rate * o for t, o in zip(current_weights, weights)]
else:
updated_weights = weights
for update_function, variable in zip(self.set_weights_ops, updated_weights):
update_function(variable)
def accumulate_gradients(self, inputs, targets):
# Neon doesn't currently allow separating the gradient calculation and gradient apply operations,
# so this feature is not available here. Instead, we do a full training iteration.
inputs = force_list(inputs)
targets = force_list(targets)
for idx, input in enumerate(inputs):
inputs[idx] = input.swapaxes(0, -1)
for idx, target in enumerate(targets):
targets[idx] = np.rollaxis(target, 0, len(target.shape))
all_inputs = inputs + targets
loss = np.mean(self.train_op(*all_inputs))
return [loss]
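A small NumPy sketch of the weight-mixing rule that set_weights above applies: rate=1 copies the new weights exactly, while smaller rates give a soft (Polyak-style) update.

```python
import numpy as np

def mix_weights(current_weights, new_weights, rate=1.0):
    """Return (1 - rate) * current + rate * new for each weight tensor."""
    return [(1.0 - rate) * c + rate * n
            for c, n in zip(current_weights, new_weights)]

current = [np.zeros((2, 2)), np.zeros(3)]
target = [np.ones((2, 2)), np.ones(3)]
softly_updated = mix_weights(current, target, rate=0.1)  # move 10% of the way to target
```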


@@ -1,88 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import ngraph.frontends.neon as neon
import ngraph as ng
from ngraph.util.names import name_scope
class InputEmbedder(object):
def __init__(self, input_size, batch_size=None, activation_function=neon.Rectlin(), name="embedder"):
self.name = name
self.input_size = input_size
self.batch_size = batch_size
self.activation_function = activation_function
self.weights_init = neon.GlorotInit()
self.biases_init = neon.ConstantInit()
self.input = None
self.output = None
def __call__(self, prev_input_placeholder=None):
with name_scope(self.get_name()):
# create the input axes
axes = []
if len(self.input_size) == 2:
axis_names = ['H', 'W']
else:
axis_names = ['C', 'H', 'W']
for axis_size, axis_name in zip(self.input_size, axis_names):
axes.append(ng.make_axis(axis_size, name=axis_name))
batch_axis_full = ng.make_axis(self.batch_size, name='N')
input_axes = ng.make_axes(axes)
if prev_input_placeholder is None:
self.input = ng.placeholder(input_axes + [batch_axis_full])
else:
self.input = prev_input_placeholder
self._build_module()
return self.input, self.output(self.input)
def _build_module(self):
pass
def get_name(self):
return self.name
class ImageEmbedder(InputEmbedder):
def __init__(self, input_size, batch_size=None, input_rescaler=255.0, activation_function=neon.Rectlin(), name="embedder"):
InputEmbedder.__init__(self, input_size, batch_size, activation_function, name)
self.input_rescaler = input_rescaler
def _build_module(self):
# image observation
self.output = neon.Sequential([
neon.Preprocess(functor=lambda x: x / self.input_rescaler),
neon.Convolution((8, 8, 32), strides=4, activation=self.activation_function,
filter_init=self.weights_init, bias_init=self.biases_init),
neon.Convolution((4, 4, 64), strides=2, activation=self.activation_function,
filter_init=self.weights_init, bias_init=self.biases_init),
neon.Convolution((3, 3, 64), strides=1, activation=self.activation_function,
filter_init=self.weights_init, bias_init=self.biases_init)
])
class VectorEmbedder(InputEmbedder):
def __init__(self, input_size, batch_size=None, activation_function=neon.Rectlin(), name="embedder"):
InputEmbedder.__init__(self, input_size, batch_size, activation_function, name)
def _build_module(self):
# vector observation
self.output = neon.Sequential([
neon.Affine(nout=256, activation=self.activation_function,
weight_init=self.weights_init, bias_init=self.biases_init)
])
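The image embedder above uses the classic DQN convolution stack of (8x8, stride 4), (4x4, stride 2) and (3x3, stride 1) filters. Assuming an 84x84 input as in the DQN paper (the actual resolution depends on the preset), a short sketch of the 'valid'-padding output-size arithmetic:

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a 'valid'-padded convolution."""
    return (size - kernel) // stride + 1

size = 84                      # assumed input resolution (DQN-style preprocessing)
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    print(size)                # 20, 9, 7
print(7 * 7 * 64)              # 3136 features after flattening the final 64-channel map
```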


@@ -1,192 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from architectures.neon_components.embedders import *
from architectures.neon_components.heads import *
from architectures.neon_components.middleware import *
from architectures.neon_components.architecture import *
from configurations import InputTypes, OutputTypes, MiddlewareTypes
class GeneralNeonNetwork(NeonArchitecture):
def __init__(self, tuning_parameters, name="", global_network=None, network_is_local=True):
self.global_network = global_network
self.network_is_local = network_is_local
self.num_heads_per_network = 1 if tuning_parameters.agent.use_separate_networks_per_head else \
len(tuning_parameters.agent.output_types)
self.num_networks = 1 if not tuning_parameters.agent.use_separate_networks_per_head else \
len(tuning_parameters.agent.output_types)
self.input_embedders = []
self.output_heads = []
self.activation_function = self.get_activation_function(
tuning_parameters.agent.hidden_layers_activation_function)
NeonArchitecture.__init__(self, tuning_parameters, name, global_network, network_is_local)
def get_activation_function(self, activation_function_string):
activation_functions = {
'relu': neon.Rectlin(),
'tanh': neon.Tanh(),
'sigmoid': neon.Logistic(),
'elu': neon.Explin(),
'selu': None,
'none': None
}
assert activation_function_string in activation_functions.keys(), \
"Activation function must be one of the following {}".format(activation_functions.keys())
return activation_functions[activation_function_string]
def get_input_embedder(self, embedder_type):
# the observation can be either an image or a vector
def get_observation_embedding(with_timestep=False):
if self.input_height > 1:
return ImageEmbedder((self.input_depth, self.input_height, self.input_width), self.batch_size,
name="observation")
else:
return VectorEmbedder((self.input_depth, self.input_width + int(with_timestep)), self.batch_size,
name="observation")
input_mapping = {
InputTypes.Observation: get_observation_embedding(),
InputTypes.Measurements: VectorEmbedder(self.measurements_size, self.batch_size, name="measurements"),
InputTypes.GoalVector: VectorEmbedder(self.measurements_size, self.batch_size, name="goal_vector"),
InputTypes.Action: VectorEmbedder((self.num_actions,), self.batch_size, name="action"),
InputTypes.TimedObservation: get_observation_embedding(with_timestep=True),
}
return input_mapping[embedder_type]
def get_middleware_embedder(self, middleware_type):
return {MiddlewareTypes.LSTM: None, # LSTM over Neon is currently not supported in Coach
MiddlewareTypes.FC: FC_Embedder}.get(middleware_type)(self.activation_function)
def get_output_head(self, head_type, head_idx, loss_weight=1.):
output_mapping = {
OutputTypes.Q: QHead,
OutputTypes.DuelingQ: DuelingQHead,
OutputTypes.V: None, # Policy Optimization algorithms over Neon are currently not supported in Coach
OutputTypes.Pi: None, # Policy Optimization algorithms over Neon are currently not supported in Coach
OutputTypes.MeasurementsPrediction: None, # DFP over Neon is currently not supported in Coach
OutputTypes.DNDQ: None, # NEC over Neon is currently not supported in Coach
OutputTypes.NAF: None, # NAF over Neon is currently not supported in Coach
OutputTypes.PPO: None, # PPO over Neon is currently not supported in Coach
OutputTypes.PPO_V: None # PPO over Neon is currently not supported in Coach
}
return output_mapping[head_type](self.tp, head_idx, loss_weight, self.network_is_local)
def get_model(self, tuning_parameters):
"""
:param tuning_parameters: A Preset class instance with all the running parameters
:type tuning_parameters: Preset
:return: A model
"""
assert len(self.tp.agent.input_types) > 0, "At least one input type should be defined"
assert len(self.tp.agent.output_types) > 0, "At least one output type should be defined"
assert self.tp.agent.middleware_type is not None, "Exactly one middleware type should be defined"
assert len(self.tp.agent.loss_weights) > 0, "At least one loss weight should be defined"
assert len(self.tp.agent.output_types) == len(self.tp.agent.loss_weights), \
"Number of loss weights should match the number of output types"
local_network_in_distributed_training = self.global_network is not None and self.network_is_local
tuning_parameters.activation_function = self.activation_function
done_creating_input_placeholders = False
for network_idx in range(self.num_networks):
with name_scope('network_{}'.format(network_idx)):
####################
# Input Embeddings #
####################
state_embedding = []
for idx, input_type in enumerate(self.tp.agent.input_types):
# get the class of the input embedder
self.input_embedders.append(self.get_input_embedder(input_type))
# in the case each head uses a different network, we still reuse the input placeholders
prev_network_input_placeholder = self.inputs[idx] if done_creating_input_placeholders else None
# create the input embedder instance and store the input placeholder and the embedding
input_placeholder, embedding = self.input_embedders[-1](prev_network_input_placeholder)
if len(self.inputs) < len(self.tp.agent.input_types):
self.inputs.append(input_placeholder)
state_embedding.append(embedding)
done_creating_input_placeholders = True
##############
# Middleware #
##############
state_embedding = ng.concat_along_axis(state_embedding, state_embedding[0].axes[0]) \
if len(state_embedding) > 1 else state_embedding[0]
self.middleware_embedder = self.get_middleware_embedder(self.tp.agent.middleware_type)
_, self.state_embedding = self.middleware_embedder(state_embedding)
################
# Output Heads #
################
for head_idx in range(self.num_heads_per_network):
for head_copy_idx in range(self.tp.agent.num_output_head_copies):
if self.tp.agent.use_separate_networks_per_head:
# if we use separate networks per head, then the head type corresponds to the network idx
head_type_idx = network_idx
else:
# if we use a single network with multiple heads, then the head type is the current head idx
head_type_idx = head_idx
self.output_heads.append(self.get_output_head(self.tp.agent.output_types[head_type_idx],
head_copy_idx,
self.tp.agent.loss_weights[head_type_idx]))
if self.network_is_local:
output, target_placeholder, input_placeholder = self.output_heads[-1](self.state_embedding)
self.targets.extend(target_placeholder)
else:
output, input_placeholder = self.output_heads[-1](self.state_embedding)
self.outputs.extend(output)
self.inputs.extend(input_placeholder)
# Losses
self.losses = []
for output_head in self.output_heads:
self.losses += output_head.loss
self.total_loss = sum(self.losses)
# Learning rate
if self.tp.learning_rate_decay_rate != 0:
raise Exception("learning rate decay is not supported in neon")
# Optimizer
if local_network_in_distributed_training and \
hasattr(self.tp.agent, "shared_optimizer") and self.tp.agent.shared_optimizer:
# distributed training and this is the local network instantiation
self.optimizer = self.global_network.optimizer
else:
if tuning_parameters.agent.optimizer_type == 'Adam':
self.optimizer = neon.Adam(
learning_rate=tuning_parameters.learning_rate,
gradient_clip_norm=tuning_parameters.clip_gradients
)
elif tuning_parameters.agent.optimizer_type == 'RMSProp':
self.optimizer = neon.RMSProp(
learning_rate=tuning_parameters.learning_rate,
gradient_clip_norm=tuning_parameters.clip_gradients,
decay_rate=0.9,
epsilon=0.01
)
elif tuning_parameters.agent.optimizer_type == 'LBFGS':
raise Exception("LBFGS optimizer is not supported in neon")
else:
raise Exception("{} is not a valid optimizer type".format(tuning_parameters.agent.optimizer_type))


@@ -1,194 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import ngraph as ng
from ngraph.util.names import name_scope
import ngraph.frontends.neon as neon
import numpy as np
from utils import force_list
from architectures.neon_components.losses import *
class Head(object):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
self.head_idx = head_idx
self.name = "head"
self.output = []
self.loss = []
self.loss_type = []
self.regularizations = []
self.loss_weight = force_list(loss_weight)
self.weights_init = neon.GlorotInit()
self.biases_init = neon.ConstantInit()
self.target = []
self.input = []
self.is_local = is_local
self.batch_size = tuning_parameters.batch_size
def __call__(self, input_layer):
"""
Wrapper for building the module graph including scoping and loss creation
:param input_layer: the input to the graph
:return: the output of the last layer and the target placeholder
"""
with name_scope(self.get_name()):
self._build_module(input_layer)
self.output = force_list(self.output)
self.target = force_list(self.target)
self.input = force_list(self.input)
self.loss_type = force_list(self.loss_type)
self.loss = force_list(self.loss)
self.regularizations = force_list(self.regularizations)
if self.is_local:
self.set_loss()
if self.is_local:
return self.output, self.target, self.input
else:
return self.output, self.input
def _build_module(self, input_layer):
"""
Builds the graph of the module
:param input_layer: the input to the graph
:return: None
"""
pass
def get_name(self):
"""
Get a formatted name for the module
:return: the formatted name
"""
return '{}_{}'.format(self.name, self.head_idx)
def set_loss(self):
"""
Creates a target placeholder and loss function for each loss_type and regularization
:return: None
"""
# add losses and target placeholder
for idx in range(len(self.loss_type)):
# output_axis = ng.make_axis(self.num_actions, name='q_values')
batch_axis_full = ng.make_axis(self.batch_size, name='N')
target = ng.placeholder(ng.make_axes([self.output[0].axes[0], batch_axis_full]))
self.target.append(target)
loss = self.loss_type[idx](self.target[-1], self.output[idx],
weights=self.loss_weight[idx], scope=self.get_name())
self.loss.append(loss)
# add regularizations
for regularization in self.regularizations:
self.loss.append(regularization)
class QHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'q_values_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
if tuning_parameters.agent.replace_mse_with_huber_loss:
raise Exception("huber loss is not supported in neon")
else:
self.loss_type = mean_squared_error
def _build_module(self, input_layer):
# Standard Q Network
self.output = neon.Sequential([
neon.Affine(nout=self.num_actions,
weight_init=self.weights_init, bias_init=self.biases_init)
])(input_layer)
class DuelingQHead(QHead):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
QHead.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
def _build_module(self, input_layer):
# Dueling Network
# state value tower - V
output_axis = ng.make_axis(self.num_actions, name='q_values')
state_value = neon.Sequential([
neon.Affine(nout=256, activation=neon.Rectlin(),
weight_init=self.weights_init, bias_init=self.biases_init),
neon.Affine(nout=1,
weight_init=self.weights_init, bias_init=self.biases_init)
])(input_layer)
# action advantage tower - A
action_advantage_unnormalized = neon.Sequential([
neon.Affine(nout=256, activation=neon.Rectlin(),
weight_init=self.weights_init, bias_init=self.biases_init),
neon.Affine(axes=output_axis,
weight_init=self.weights_init, bias_init=self.biases_init)
])(input_layer)
action_advantage = action_advantage_unnormalized - ng.mean(action_advantage_unnormalized)
repeated_state_value = ng.expand_dims(ng.slice_along_axis(state_value, state_value.axes[0], 0), output_axis, 0)
# merge to state-action value function Q
self.output = repeated_state_value + action_advantage
class MeasurementsPredictionHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'future_measurements_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.num_measurements = tuning_parameters.env.measurements_size[0] \
if tuning_parameters.env.measurements_size else 0
self.num_prediction_steps = tuning_parameters.agent.num_predicted_steps_ahead
self.multi_step_measurements_size = self.num_measurements * self.num_prediction_steps
if tuning_parameters.agent.replace_mse_with_huber_loss:
raise Exception("huber loss is not supported in neon")
else:
self.loss_type = mean_squared_error
def _build_module(self, input_layer):
# This is almost exactly the same as the Dueling Network, but we predict the future measurements for each action
multistep_measurements_size = self.multi_step_measurements_size
output_axis = ng.make_axis(self.num_actions, name='q_values')  # actions axis used to tile the expectation stream below
# actions expectation tower (expectation stream) - E
with name_scope("expectation_stream"):
expectation_stream = neon.Sequential([
neon.Affine(nout=256, activation=neon.Rectlin(),
weight_init=self.weights_init, bias_init=self.biases_init),
neon.Affine(nout=multistep_measurements_size,
weight_init=self.weights_init, bias_init=self.biases_init)
])(input_layer)
# action fine differences tower (action stream) - A
with name_scope("action_stream"):
action_stream_unnormalized = neon.Sequential([
neon.Affine(nout=256, activation=neon.Rectlin(),
weight_init=self.weights_init, bias_init=self.biases_init),
neon.Affine(nout=self.num_actions * multistep_measurements_size,
weight_init=self.weights_init, bias_init=self.biases_init),
neon.Reshape((self.num_actions, multistep_measurements_size))
])(input_layer)
action_stream = action_stream_unnormalized - ng.mean(action_stream_unnormalized)
repeated_expectation_stream = ng.slice_along_axis(expectation_stream, expectation_stream.axes[0], 0)
repeated_expectation_stream = ng.expand_dims(repeated_expectation_stream, output_axis, 0)
# merge to future measurements predictions
self.output = repeated_expectation_stream + action_stream
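A NumPy sketch of the dueling aggregation used by DuelingQHead above (and, per action, by the measurements head): Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a)).

```python
import numpy as np

def dueling_q(state_value, advantages):
    """Combine a scalar state value with per-action advantages into Q-values."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return state_value + (advantages - advantages.mean())

q = dueling_q(state_value=1.0, advantages=[0.5, -0.5, 0.0])  # [1.5, 0.5, 1.0]
```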


@@ -1,50 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import ngraph as ng
import ngraph.frontends.neon as neon
from ngraph.util.names import name_scope
import numpy as np
class MiddlewareEmbedder(object):
def __init__(self, activation_function=neon.Rectlin(), name="middleware_embedder"):
self.name = name
self.input = None
self.output = None
self.weights_init = neon.GlorotInit()
self.biases_init = neon.ConstantInit()
self.activation_function = activation_function
def __call__(self, input_layer):
with name_scope(self.get_name()):
self.input = input_layer
self._build_module()
return self.input, self.output(self.input)
def _build_module(self):
pass
def get_name(self):
return self.name
class FC_Embedder(MiddlewareEmbedder):
def _build_module(self):
self.output = neon.Sequential([
neon.Affine(nout=512, activation=self.activation_function,
weight_init=self.weights_init, bias_init=self.biases_init)])
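The embedders, middleware and heads all follow the same callable-module convention: the constructor stores configuration and __call__ builds the layers and returns (input, output). A toy, framework-free sketch of that convention (the "layer" here is only a stand-in):

```python
class SquareMiddleware(object):
    """Toy stand-in for a middleware embedder: __call__ wires input to output."""
    def __init__(self, name="middleware"):
        self.name = name
        self.input = None
        self.output = None

    def __call__(self, input_layer):
        self.input = input_layer
        self.output = lambda x: x * x          # placeholder for the real layers
        return self.input, self.output(self.input)

inp, out = SquareMiddleware()(3)               # (3, 9)
```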


@@ -1,187 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from collections import OrderedDict
from configurations import Preset, Frameworks
from logger import *
try:
import tensorflow as tf
from architectures.tensorflow_components.general_network import GeneralTensorFlowNetwork
except ImportError:
failed_imports.append("TensorFlow")
try:
from architectures.neon_components.general_network import GeneralNeonNetwork
except ImportError:
failed_imports.append("Neon")
class NetworkWrapper(object):
"""
Contains multiple networks and manages syncing and gradient updates
between them.
"""
def __init__(self, tuning_parameters, has_target, has_global, name, replicated_device=None, worker_device=None):
"""
:param tuning_parameters:
:type tuning_parameters: Preset
:param has_target:
:param has_global:
:param name:
:param replicated_device:
:param worker_device:
"""
self.tp = tuning_parameters
self.has_target = has_target
self.has_global = has_global
self.name = name
self.sess = tuning_parameters.sess
if self.tp.framework == Frameworks.TensorFlow:
general_network = GeneralTensorFlowNetwork
elif self.tp.framework == Frameworks.Neon:
general_network = GeneralNeonNetwork
else:
raise Exception("{} Framework is not supported".format(Frameworks().to_string(self.tp.framework)))
# Global network - the main network shared between threads
self.global_network = None
if self.has_global:
with tf.device(replicated_device):
self.global_network = general_network(tuning_parameters, '{}/global'.format(name),
network_is_local=False)
# Online network - local copy of the main network used for playing
self.online_network = None
with tf.device(worker_device):
self.online_network = general_network(tuning_parameters, '{}/online'.format(name),
self.global_network, network_is_local=True)
# Target network - a local, slow updating network used for stabilizing the learning
self.target_network = None
if self.has_target:
with tf.device(worker_device):
self.target_network = general_network(tuning_parameters, '{}/target'.format(name),
network_is_local=True)
if not self.tp.distributed and self.tp.framework == Frameworks.TensorFlow:
variables_to_restore = tf.global_variables()
variables_to_restore = [v for v in variables_to_restore if '/online' in v.name]
self.model_saver = tf.train.Saver(variables_to_restore)
#, max_to_keep=None) # uncomment to remove the limit on the number of stored checkpoints
if self.tp.sess and self.tp.checkpoint_restore_dir:
checkpoint = tf.train.latest_checkpoint(self.tp.checkpoint_restore_dir)
screen.log_title("Loading checkpoint: {}".format(checkpoint))
self.model_saver.restore(self.tp.sess, checkpoint)
self.update_target_network()
def sync(self):
"""
Initializes the weights of the networks to match each other
:return:
"""
self.update_online_network()
self.update_target_network()
def update_target_network(self, rate=1.0):
"""
Copy weights: online network >>> target network
:param rate: the rate of copying the weights - 1 for copying exactly
"""
if self.target_network:
self.target_network.set_weights(self.online_network.get_weights(), rate)
def update_online_network(self, rate=1.0):
"""
Copy weights: global network >>> online network
:param rate: the rate of copying the weights - 1 for copying exactly
"""
if self.global_network:
self.online_network.set_weights(self.global_network.get_weights(), rate)
def apply_gradients_to_global_network(self):
"""
Apply gradients from the online network on the global network
:return:
"""
self.global_network.apply_gradients(self.online_network.accumulated_gradients)
def apply_gradients_to_online_network(self):
"""
Apply gradients from the online network on itself
:return:
"""
self.online_network.apply_gradients(self.online_network.accumulated_gradients)
def train_and_sync_networks(self, inputs, targets, additional_fetches=[]):
"""
A generic training function that enables multi-threading training using a global network if necessary.
:param inputs: The inputs for the network.
:param targets: The targets corresponding to the given inputs
:param additional_fetches: Any additional tensor the user wants to fetch
:return: The loss of the training iteration
"""
result = self.online_network.accumulate_gradients(inputs, targets, additional_fetches=additional_fetches)
self.apply_gradients_and_sync_networks()
return result
def apply_gradients_and_sync_networks(self):
"""
Applies the gradients accumulated in the online network to the global network or to itself and syncs the
networks if necessary
"""
if self.global_network:
self.apply_gradients_to_global_network()
self.online_network.reset_accumulated_gradients()
self.update_online_network()
else:
self.online_network.apply_and_reset_gradients(self.online_network.accumulated_gradients)
def get_local_variables(self):
"""
Get all the variables that are local to the thread
:return: a list of all the variables that are local to the thread
"""
local_variables = [v for v in tf.global_variables() if self.online_network.name in v.name]
if self.has_target:
local_variables += [v for v in tf.global_variables() if self.target_network.name in v.name]
return local_variables
def get_global_variables(self):
"""
Get all the variables that are shared between threads
:return: a list of all the variables that are shared between threads
"""
global_variables = [v for v in tf.global_variables() if self.global_network.name in v.name]
return global_variables
def set_session(self, sess):
self.sess = sess
self.online_network.sess = sess
if self.global_network:
self.global_network.sess = sess
if self.target_network:
self.target_network.sess = sess
def save_model(self, model_id):
saved_model_path = self.model_saver.save(self.tp.sess, os.path.join(self.tp.save_model_dir,
str(model_id) + '.ckpt'))
screen.log_dict(
OrderedDict([
("Saving model", saved_model_path),
]),
prefix="Checkpoint"
)
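A framework-free sketch of the synchronization flow behind train_and_sync_networks: accumulate gradients on the online copy, apply them to the global copy when one exists (otherwise to itself), then pull the fresh weights back into the online copy. The classes and learning rate below are illustrative only.

```python
import numpy as np

class TinyNet(object):
    """Single-weight 'network' used only to illustrate the sync flow."""
    def __init__(self, w=0.0):
        self.w = np.array(w, dtype=np.float64)
        self.accumulated_grad = 0.0

    def accumulate_gradients(self, grad):
        self.accumulated_grad += grad

    def apply_gradients(self, grad, lr=0.1):
        self.w -= lr * grad

def train_and_sync(online, global_net, grad):
    online.accumulate_gradients(grad)
    if global_net is not None:
        global_net.apply_gradients(online.accumulated_grad)   # update the shared copy
        online.accumulated_grad = 0.0
        online.w = global_net.w.copy()                        # sync online <- global
    else:
        online.apply_gradients(online.accumulated_grad)       # plain local training
        online.accumulated_grad = 0.0

online, shared = TinyNet(), TinyNet()
train_and_sync(online, shared, grad=1.0)   # shared.w == online.w == -0.1
```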


@@ -1,367 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import time
import numpy as np
import tensorflow as tf
from architectures.architecture import Architecture
from utils import force_list, squeeze_list
from configurations import Preset, MiddlewareTypes
def variable_summaries(var):
"""Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
with tf.name_scope('summaries'):
layer_weight_name = '_'.join(var.name.split('/')[-3:])[:-2]
with tf.name_scope(layer_weight_name):
mean = tf.reduce_mean(var)
tf.summary.scalar('mean', mean)
with tf.name_scope('stddev'):
stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
tf.summary.scalar('stddev', stddev)
tf.summary.scalar('max', tf.reduce_max(var))
tf.summary.scalar('min', tf.reduce_min(var))
tf.summary.histogram('histogram', var)
class TensorFlowArchitecture(Architecture):
def __init__(self, tuning_parameters, name="", global_network=None, network_is_local=True):
"""
:param tuning_parameters: The parameters used for running the algorithm
:type tuning_parameters: Preset
:param name: The name of the network
"""
Architecture.__init__(self, tuning_parameters, name)
self.middleware_embedder = None
self.network_is_local = network_is_local
assert tuning_parameters.agent.tensorflow_support, 'TensorFlow is not supported for this agent'
self.sess = tuning_parameters.sess
self.inputs = {}
self.outputs = []
self.targets = []
self.losses = []
self.total_loss = None
self.trainable_weights = []
self.weights_placeholders = []
self.curr_rnn_c_in = None
self.curr_rnn_h_in = None
self.gradients_wrt_inputs = []
self.train_writer = None
self.optimizer_type = self.tp.agent.optimizer_type
if self.tp.seed is not None:
tf.set_random_seed(self.tp.seed)
with tf.variable_scope(self.name, initializer=tf.contrib.layers.xavier_initializer()):
self.global_step = tf.train.get_or_create_global_step()
# build the network
self.get_model(tuning_parameters)
# model weights
self.trainable_weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.name)
# locks for synchronous training
if self.tp.distributed and not self.tp.agent.async_training and not self.network_is_local:
self.lock_counter = tf.get_variable("lock_counter", [], tf.int32,
initializer=tf.constant_initializer(0, dtype=tf.int32),
trainable=False)
self.lock = self.lock_counter.assign_add(1, use_locking=True)
self.lock_init = self.lock_counter.assign(0)
self.release_counter = tf.get_variable("release_counter", [], tf.int32,
initializer=tf.constant_initializer(0, dtype=tf.int32),
trainable=False)
self.release = self.release_counter.assign_add(1, use_locking=True)
self.release_init = self.release_counter.assign(0)
# local network does the optimization so we need to create all the ops we are going to use to optimize
for idx, var in enumerate(self.trainable_weights):
placeholder = tf.placeholder(tf.float32, shape=var.get_shape(), name=str(idx) + '_holder')
self.weights_placeholders.append(placeholder)
if self.tp.visualization.tensorboard:
variable_summaries(var)
self.update_weights_from_list = [weights.assign(holder) for holder, weights in
zip(self.weights_placeholders, self.trainable_weights)]
# gradients ops
self.tensor_gradients = tf.gradients(self.total_loss, self.trainable_weights)
self.gradients_norm = tf.global_norm(self.tensor_gradients)
if self.tp.clip_gradients is not None and self.tp.clip_gradients != 0:
self.clipped_grads, self.grad_norms = tf.clip_by_global_norm(self.tensor_gradients,
tuning_parameters.clip_gradients)
# gradients of the outputs w.r.t. the inputs
# at the moment, this is only used by ddpg
if len(self.outputs) == 1:
self.gradients_wrt_inputs = {name: tf.gradients(self.outputs[0], input_ph) for name, input_ph in self.inputs.items()}
self.gradients_weights_ph = tf.placeholder('float32', self.outputs[0].shape, 'output_gradient_weights')
self.weighted_gradients = tf.gradients(self.outputs[0], self.trainable_weights, self.gradients_weights_ph)
# L2 regularization
if self.tp.agent.l2_regularization != 0:
self.l2_regularization = [tf.add_n([tf.nn.l2_loss(v) for v in self.trainable_weights])
* self.tp.agent.l2_regularization]
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, self.l2_regularization)
self.inc_step = self.global_step.assign_add(1)
# defining the optimization process (for LBFGS we have less control over the optimizer)
if self.optimizer_type != 'LBFGS':
# no global network, this is a plain simple centralized training
self.update_weights_from_batch_gradients = self.optimizer.apply_gradients(
zip(self.weights_placeholders, self.trainable_weights), global_step=self.global_step)
if self.tp.visualization.tensorboard:
current_scope_summaries = tf.get_collection(tf.GraphKeys.SUMMARIES,
scope=tf.contrib.framework.get_name_scope())
self.merged = tf.summary.merge(current_scope_summaries)
# initialize or restore model
if not self.tp.distributed:
# Merge all the summaries
self.init_op = tf.global_variables_initializer()
if self.sess:
if self.tp.visualization.tensorboard:
# Write the merged summaries to the current experiment directory
self.train_writer = tf.summary.FileWriter(self.tp.experiment_path + '/tensorboard',
self.sess.graph)
self.sess.run(self.init_op)
self.accumulated_gradients = None
def reset_accumulated_gradients(self):
"""
Reset the gradients accumulation placeholder
"""
if self.accumulated_gradients is None:
self.accumulated_gradients = self.tp.sess.run(self.trainable_weights)
for ix, grad in enumerate(self.accumulated_gradients):
self.accumulated_gradients[ix] = grad * 0
def accumulate_gradients(self, inputs, targets, additional_fetches=None):
"""
Runs a forward pass & backward pass, clips gradients if needed and accumulates them into the accumulation
placeholders
:param additional_fetches: Optional tensors to fetch during gradients calculation
:param inputs: The input batch for the network
:param targets: The targets corresponding to the input batch
:return: A list containing the total loss and the individual network heads losses
"""
if self.accumulated_gradients is None:
self.reset_accumulated_gradients()
# feed inputs
if additional_fetches is None:
additional_fetches = []
feed_dict = self._feed_dict(inputs)
# feed targets
targets = force_list(targets)
for placeholder_idx, target in enumerate(targets):
feed_dict[self.targets[placeholder_idx]] = target
if self.optimizer_type != 'LBFGS':
# set the fetches
fetches = [self.gradients_norm]
if self.tp.clip_gradients:
fetches.append(self.clipped_grads)
else:
fetches.append(self.tensor_gradients)
fetches += [self.total_loss, self.losses]
if self.tp.agent.middleware_type == MiddlewareTypes.LSTM:
fetches.append(self.middleware_embedder.state_out)
additional_fetches_start_idx = len(fetches)
fetches += additional_fetches
# feed the lstm state if necessary
if self.tp.agent.middleware_type == MiddlewareTypes.LSTM:
# we can't always assume that we are starting from scratch here can we?
feed_dict[self.middleware_embedder.c_in] = self.middleware_embedder.c_init
feed_dict[self.middleware_embedder.h_in] = self.middleware_embedder.h_init
if self.tp.visualization.tensorboard:
fetches += [self.merged]
# get grads
result = self.tp.sess.run(fetches, feed_dict=feed_dict)
if hasattr(self, 'train_writer') and self.train_writer is not None:
self.train_writer.add_summary(result[-1], self.tp.current_episode)
# extract the fetches
norm_unclipped_grads, grads, total_loss, losses = result[:4]
if self.tp.agent.middleware_type == MiddlewareTypes.LSTM:
(self.curr_rnn_c_in, self.curr_rnn_h_in) = result[4]
fetched_tensors = []
if len(additional_fetches) > 0:
fetched_tensors = result[additional_fetches_start_idx:additional_fetches_start_idx +
len(additional_fetches)]
# accumulate the gradients
for idx, grad in enumerate(grads):
self.accumulated_gradients[idx] += grad
return total_loss, losses, norm_unclipped_grads, fetched_tensors
else:
self.optimizer.minimize(session=self.tp.sess, feed_dict=feed_dict)
return [0]
def apply_and_reset_gradients(self, gradients, scaler=1.):
"""
Applies the given gradients to the network weights and resets the accumulation placeholder
:param gradients: The gradients to use for the update
:param scaler: A scaling factor that allows rescaling the gradients before applying them
"""
self.apply_gradients(gradients, scaler)
self.reset_accumulated_gradients()
def apply_gradients(self, gradients, scaler=1.):
"""
Applies the given gradients to the network weights
:param gradients: The gradients to use for the update
:param scaler: A scaling factor that allows rescaling the gradients before applying them
"""
if self.tp.agent.async_training or not self.tp.distributed:
if hasattr(self, 'global_step') and not self.network_is_local:
self.tp.sess.run(self.inc_step)
if self.optimizer_type != 'LBFGS':
# lock barrier
if hasattr(self, 'lock_counter'):
self.tp.sess.run(self.lock)
while self.tp.sess.run(self.lock_counter) % self.tp.num_threads != 0:
time.sleep(0.00001)
# rescale the gradients so that they average out with the gradients from the other workers
scaler /= float(self.tp.num_threads)
# apply gradients
if scaler != 1.:
for gradient in gradients:
gradient /= scaler
feed_dict = dict(zip(self.weights_placeholders, gradients))
_ = self.tp.sess.run(self.update_weights_from_batch_gradients, feed_dict=feed_dict)
# release barrier
if hasattr(self, 'release_counter'):
self.tp.sess.run(self.release)
while self.tp.sess.run(self.release_counter) % self.tp.num_threads != 0:
time.sleep(0.00001)
def _feed_dict(self, inputs):
feed_dict = {}
for input_name, input_value in inputs.items():
if isinstance(input_name, str):
if input_name not in self.inputs:
raise ValueError((
'input name {input_name} was provided to create a feed '
'dictionary, but there is no placeholder with that name. '
'placeholder names available include: {placeholder_names}'
).format(
input_name=input_name,
placeholder_names=', '.join(self.inputs.keys())
))
feed_dict[self.inputs[input_name]] = input_value
elif isinstance(input_name, tf.Tensor) and input_name.op.type == 'Placeholder':
feed_dict[input_name] = input_value
else:
raise ValueError((
'input dictionary expects strings or placeholders as keys, '
'but found key {key} of type {type}'
).format(
key=input_name,
type=type(input_name),
))
return feed_dict
def predict(self, inputs, outputs=None, squeeze_output=True):
"""
Run a forward pass of the network using the given input
:param inputs: The input for the network
:param outputs: The output for the network, defaults to self.outputs
:param squeeze_output: call squeeze_list on output
:return: The network output
WARNING: must only call once per state since each call is assumed by LSTM to be a new time step.
"""
feed_dict = self._feed_dict(inputs)
if outputs is None:
outputs = self.outputs
if self.tp.agent.middleware_type == MiddlewareTypes.LSTM:
feed_dict[self.middleware_embedder.c_in] = self.curr_rnn_c_in
feed_dict[self.middleware_embedder.h_in] = self.curr_rnn_h_in
output, (self.curr_rnn_c_in, self.curr_rnn_h_in) = self.tp.sess.run([outputs, self.middleware_embedder.state_out], feed_dict=feed_dict)
else:
output = self.tp.sess.run(outputs, feed_dict)
if squeeze_output:
output = squeeze_list(output)
return output
def get_weights(self):
"""
:return: a list of tensors containing the network weights for each layer
"""
return self.trainable_weights
def set_weights(self, weights, new_rate=1.0):
"""
Sets the network weights from the given list of weights tensors
"""
feed_dict = {}
old_weights, new_weights = self.tp.sess.run([self.get_weights(), weights])
for placeholder_idx, new_weight in enumerate(new_weights):
feed_dict[self.weights_placeholders[placeholder_idx]]\
= new_rate * new_weight + (1 - new_rate) * old_weights[placeholder_idx]
self.tp.sess.run(self.update_weights_from_list, feed_dict)
def write_graph_to_logdir(self, summary_dir):
"""
Writes the tensorflow graph to the logdir for tensorboard visualization
:param summary_dir: the path to the logdir
"""
summary_writer = tf.summary.FileWriter(summary_dir)
summary_writer.add_graph(self.sess.graph)
def get_variable_value(self, variable):
"""
Get the value of a variable from the graph
:param variable: the variable
:return: the value of the variable
"""
return self.sess.run(variable)
def set_variable_value(self, assign_op, value, placeholder=None):
"""
Updates the value of a variable.
This requires having an assign operation for the variable, and a placeholder which will provide the value
:param assign_op: an assign operation for the variable
:param value: a value to set the variable to
:param placeholder: a placeholder to hold the given value for injecting it into the variable
"""
self.sess.run(assign_op, feed_dict={placeholder: value})
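A NumPy sketch of the accumulate-then-apply cycle that this class implements with gradient placeholders: gradients from several batches are summed into a buffer, optionally rescaled, applied once, and the buffer is reset. Names and the learning rate are illustrative.

```python
import numpy as np

weights = [np.ones(3), np.ones((2, 2))]
accumulated = [np.zeros_like(w) for w in weights]      # like reset_accumulated_gradients

def accumulate(grads):
    for buf, g in zip(accumulated, grads):
        buf += g

def apply_and_reset(lr=0.01, scaler=1.0):
    for w, buf in zip(weights, accumulated):
        w -= lr * buf / scaler                          # rescale, then apply
        buf[:] = 0.0                                    # reset the accumulation buffer

accumulate([np.full(3, 0.5), np.full((2, 2), 0.5)])
accumulate([np.full(3, 0.5), np.full((2, 2), 0.5)])
apply_and_reset(scaler=2.0)                             # average over the 2 batches
```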


@@ -1,144 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import tensorflow as tf
from configurations import EmbedderDepth, EmbedderWidth
class InputEmbedder(object):
def __init__(self, input_size, activation_function=tf.nn.relu,
embedder_depth=EmbedderDepth.Shallow, embedder_width=EmbedderWidth.Wide,
name="embedder"):
self.name = name
self.input_size = input_size
self.activation_function = activation_function
self.input = None
self.output = None
self.embedder_depth = embedder_depth
self.embedder_width = embedder_width
def __call__(self, prev_input_placeholder=None):
with tf.variable_scope(self.get_name()):
if prev_input_placeholder is None:
self.input = tf.placeholder("float", shape=(None,) + self.input_size, name=self.get_name())
else:
self.input = prev_input_placeholder
self._build_module()
return self.input, self.output
def _build_module(self):
pass
def get_name(self):
return self.name
class ImageEmbedder(InputEmbedder):
def __init__(self, input_size, input_rescaler=255.0, activation_function=tf.nn.relu,
embedder_depth=EmbedderDepth.Shallow, embedder_width=EmbedderWidth.Wide,
name="embedder"):
InputEmbedder.__init__(self, input_size, activation_function, embedder_depth, embedder_width, name)
self.input_rescaler = input_rescaler
def _build_module(self):
# image observation
rescaled_observation_stack = self.input / self.input_rescaler
if self.embedder_depth == EmbedderDepth.Shallow:
# same embedder as used in the original DQN paper
self.observation_conv1 = tf.layers.conv2d(rescaled_observation_stack,
filters=32, kernel_size=(8, 8), strides=(4, 4),
activation=self.activation_function, data_format='channels_last',
name='conv1')
self.observation_conv2 = tf.layers.conv2d(self.observation_conv1,
filters=64, kernel_size=(4, 4), strides=(2, 2),
activation=self.activation_function, data_format='channels_last',
name='conv2')
self.observation_conv3 = tf.layers.conv2d(self.observation_conv2,
filters=64, kernel_size=(3, 3), strides=(1, 1),
activation=self.activation_function, data_format='channels_last',
name='conv3'
)
self.output = tf.contrib.layers.flatten(self.observation_conv3)
elif self.embedder_depth == EmbedderDepth.Deep:
# the embedder used in the CARLA papers
self.observation_conv1 = tf.layers.conv2d(rescaled_observation_stack,
filters=32, kernel_size=(5, 5), strides=(2, 2),
activation=self.activation_function, data_format='channels_last',
name='conv1')
self.observation_conv2 = tf.layers.conv2d(self.observation_conv1,
filters=32, kernel_size=(3, 3), strides=(1, 1),
activation=self.activation_function, data_format='channels_last',
name='conv2')
self.observation_conv3 = tf.layers.conv2d(self.observation_conv2,
filters=64, kernel_size=(3, 3), strides=(2, 2),
activation=self.activation_function, data_format='channels_last',
name='conv3')
self.observation_conv4 = tf.layers.conv2d(self.observation_conv3,
filters=64, kernel_size=(3, 3), strides=(1, 1),
activation=self.activation_function, data_format='channels_last',
name='conv4')
self.observation_conv5 = tf.layers.conv2d(self.observation_conv4,
filters=128, kernel_size=(3, 3), strides=(2, 2),
activation=self.activation_function, data_format='channels_last',
name='conv5')
self.observation_conv6 = tf.layers.conv2d(self.observation_conv5,
filters=128, kernel_size=(3, 3), strides=(1, 1),
activation=self.activation_function, data_format='channels_last',
name='conv6')
self.observation_conv7 = tf.layers.conv2d(self.observation_conv6,
filters=256, kernel_size=(3, 3), strides=(2, 2),
activation=self.activation_function, data_format='channels_last',
name='conv7')
self.observation_conv8 = tf.layers.conv2d(self.observation_conv7,
filters=256, kernel_size=(3, 3), strides=(1, 1),
activation=self.activation_function, data_format='channels_last',
name='conv8')
self.output = tf.contrib.layers.flatten(self.observation_conv8)
else:
raise ValueError("The defined embedder complexity value is invalid")
class VectorEmbedder(InputEmbedder):
def __init__(self, input_size, activation_function=tf.nn.relu,
embedder_depth=EmbedderDepth.Shallow, embedder_width=EmbedderWidth.Wide,
name="embedder"):
InputEmbedder.__init__(self, input_size, activation_function, embedder_depth, embedder_width, name)
def _build_module(self):
# vector observation
input_layer = tf.contrib.layers.flatten(self.input)
width = 128 if self.embedder_width == EmbedderWidth.Wide else 32
if self.embedder_depth == EmbedderDepth.Shallow:
self.output = tf.layers.dense(input_layer, 2*width, activation=self.activation_function,
name='fc1')
elif self.embedder_depth == EmbedderDepth.Deep:
# the embedder used in the CARLA papers
self.observation_fc1 = tf.layers.dense(input_layer, width, activation=self.activation_function,
name='fc1')
self.observation_fc2 = tf.layers.dense(self.observation_fc1, width, activation=self.activation_function,
name='fc2')
self.output = tf.layers.dense(self.observation_fc2, width, activation=self.activation_function,
name='fc3')
else:
raise ValueError("The defined embedder complexity value is invalid")


@@ -1,206 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from architectures.tensorflow_components.embedders import *
from architectures.tensorflow_components.heads import *
from architectures.tensorflow_components.middleware import *
from architectures.tensorflow_components.architecture import *
from configurations import InputTypes, OutputTypes, MiddlewareTypes
class GeneralTensorFlowNetwork(TensorFlowArchitecture):
"""
A generalized version of all possible networks implemented using tensorflow.
"""
def __init__(self, tuning_parameters, name="", global_network=None, network_is_local=True):
self.global_network = global_network
self.network_is_local = network_is_local
self.num_heads_per_network = 1 if tuning_parameters.agent.use_separate_networks_per_head else \
len(tuning_parameters.agent.output_types)
self.num_networks = 1 if not tuning_parameters.agent.use_separate_networks_per_head else \
len(tuning_parameters.agent.output_types)
self.input_embedders = []
self.output_heads = []
self.activation_function = self.get_activation_function(
tuning_parameters.agent.hidden_layers_activation_function)
self.embedder_width = tuning_parameters.agent.embedder_width
TensorFlowArchitecture.__init__(self, tuning_parameters, name, global_network, network_is_local)
def get_activation_function(self, activation_function_string):
activation_functions = {
'relu': tf.nn.relu,
'tanh': tf.nn.tanh,
'sigmoid': tf.nn.sigmoid,
'elu': tf.nn.elu,
'selu': tf.nn.selu,
'none': None
}
assert activation_function_string in activation_functions.keys(), \
"Activation function must be one of the following {}".format(activation_functions.keys())
return activation_functions[activation_function_string]
def get_input_embedder(self, embedder_type):
# the observation can be either an image or a vector
def get_observation_embedding(with_timestep=False):
if self.input_height > 1:
return ImageEmbedder((self.input_height, self.input_width, self.input_depth), name="observation",
input_rescaler=self.tp.agent.input_rescaler, embedder_width=self.embedder_width)
else:
return VectorEmbedder((self.input_width + int(with_timestep), self.input_depth), name="observation",
embedder_width=self.embedder_width)
input_mapping = {
InputTypes.Observation: get_observation_embedding(),
InputTypes.Measurements: VectorEmbedder(self.measurements_size, name="measurements",
embedder_width=self.embedder_width),
InputTypes.GoalVector: VectorEmbedder(self.measurements_size, name="goal_vector",
embedder_width=self.embedder_width),
InputTypes.Action: VectorEmbedder((self.num_actions,), name="action",
embedder_width=self.embedder_width),
InputTypes.TimedObservation: get_observation_embedding(with_timestep=True),
}
return input_mapping[embedder_type]
def get_middleware_embedder(self, middleware_type):
return {MiddlewareTypes.LSTM: LSTM_Embedder,
MiddlewareTypes.FC: FC_Embedder}.get(middleware_type)(self.activation_function, self.embedder_width)
def get_output_head(self, head_type, head_idx, loss_weight=1.):
output_mapping = {
OutputTypes.Q: QHead,
OutputTypes.DuelingQ: DuelingQHead,
OutputTypes.V: VHead,
OutputTypes.Pi: PolicyHead,
OutputTypes.MeasurementsPrediction: MeasurementsPredictionHead,
OutputTypes.DNDQ: DNDQHead,
OutputTypes.NAF: NAFHead,
OutputTypes.PPO: PPOHead,
OutputTypes.PPO_V: PPOVHead,
OutputTypes.CategoricalQ: CategoricalQHead,
OutputTypes.QuantileRegressionQ: QuantileRegressionQHead
}
return output_mapping[head_type](self.tp, head_idx, loss_weight, self.network_is_local)
def get_model(self, tuning_parameters):
"""
        :param tuning_parameters: A Preset class instance with all the running parameters
:type tuning_parameters: Preset
:return: A model
"""
assert len(self.tp.agent.input_types) > 0, "At least one input type should be defined"
assert len(self.tp.agent.output_types) > 0, "At least one output type should be defined"
assert self.tp.agent.middleware_type is not None, "Exactly one middleware type should be defined"
assert len(self.tp.agent.loss_weights) > 0, "At least one loss weight should be defined"
assert len(self.tp.agent.output_types) == len(self.tp.agent.loss_weights), \
"Number of loss weights should match the number of output types"
local_network_in_distributed_training = self.global_network is not None and self.network_is_local
tuning_parameters.activation_function = self.activation_function
for network_idx in range(self.num_networks):
with tf.variable_scope('network_{}'.format(network_idx)):
####################
# Input Embeddings #
####################
state_embedding = []
for input_name, input_type in self.tp.agent.input_types.items():
# get the class of the input embedder
input_embedder = self.get_input_embedder(input_type)
self.input_embedders.append(input_embedder)
# input placeholders are reused between networks. on the first network, store the placeholders
# generated by the input_embedders in self.inputs. on the rest of the networks, pass
# the existing input_placeholders into the input_embedders.
if network_idx == 0:
input_placeholder, embedding = input_embedder()
self.inputs[input_name] = input_placeholder
else:
input_placeholder, embedding = input_embedder(self.inputs[input_name])
state_embedding.append(embedding)
##############
# Middleware #
##############
state_embedding = tf.concat(state_embedding, axis=-1) if len(state_embedding) > 1 else state_embedding[0]
self.middleware_embedder = self.get_middleware_embedder(self.tp.agent.middleware_type)
_, self.state_embedding = self.middleware_embedder(state_embedding)
################
# Output Heads #
################
for head_idx in range(self.num_heads_per_network):
for head_copy_idx in range(self.tp.agent.num_output_head_copies):
if self.tp.agent.use_separate_networks_per_head:
                            # if we use separate networks per head, then the head type corresponds to the network idx
head_type_idx = network_idx
else:
# if we use a single network with multiple heads, then the head type is the current head idx
head_type_idx = head_idx
self.output_heads.append(self.get_output_head(self.tp.agent.output_types[head_type_idx],
head_copy_idx,
self.tp.agent.loss_weights[head_type_idx]))
if self.tp.agent.stop_gradients_from_head[head_idx]:
head_input = tf.stop_gradient(self.state_embedding)
else:
head_input = self.state_embedding
# build the head
if self.network_is_local:
output, target_placeholder, input_placeholders = self.output_heads[-1](head_input)
self.targets.extend(target_placeholder)
else:
output, input_placeholders = self.output_heads[-1](head_input)
self.outputs.extend(output)
# TODO: use head names as well
for placeholder_index, input_placeholder in enumerate(input_placeholders):
self.inputs['output_{}_{}'.format(head_idx, placeholder_index)] = input_placeholder
# Losses
self.losses = tf.losses.get_losses(self.name)
self.losses += tf.losses.get_regularization_losses(self.name)
self.total_loss = tf.losses.compute_weighted_loss(self.losses, scope=self.name)
if self.tp.visualization.tensorboard:
tf.summary.scalar('total_loss', self.total_loss)
# Learning rate
if self.tp.learning_rate_decay_rate != 0:
self.tp.learning_rate = tf.train.exponential_decay(
self.tp.learning_rate, self.global_step, decay_steps=self.tp.learning_rate_decay_steps,
decay_rate=self.tp.learning_rate_decay_rate, staircase=True)
# Optimizer
if local_network_in_distributed_training and \
hasattr(self.tp.agent, "shared_optimizer") and self.tp.agent.shared_optimizer:
# distributed training and this is the local network instantiation
self.optimizer = self.global_network.optimizer
else:
if tuning_parameters.agent.optimizer_type == 'Adam':
self.optimizer = tf.train.AdamOptimizer(learning_rate=tuning_parameters.learning_rate)
elif tuning_parameters.agent.optimizer_type == 'RMSProp':
self.optimizer = tf.train.RMSPropOptimizer(tuning_parameters.learning_rate, decay=0.9, epsilon=0.01)
elif tuning_parameters.agent.optimizer_type == 'LBFGS':
self.optimizer = tf.contrib.opt.ScipyOptimizerInterface(self.total_loss, method='L-BFGS-B',
options={'maxiter': 25})
else:
raise Exception("{} is not a valid optimizer type".format(tuning_parameters.agent.optimizer_type))
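# ----------------------------------------------------------------------------
# Illustrative sketch (not from the original Coach code): with staircase decay,
# the learning rate schedule built above follows
#     lr(step) = initial_lr * decay_rate ** (step // decay_steps)
# A tiny pure-python check of that schedule, using made-up values:
def _lr_staircase_decay_sketch(initial_lr=0.001, decay_rate=0.5, decay_steps=1000):
    for step in [0, 999, 1000, 2500]:
        lr = initial_lr * decay_rate ** (step // decay_steps)
        print("step {:>4}: lr = {:.6f}".format(step, lr))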


@@ -1,558 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import tensorflow as tf
import numpy as np
from utils import force_list
# Used to initialize weights for policy and value output layers
def normalized_columns_initializer(std=1.0):
def _initializer(shape, dtype=None, partition_info=None):
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
return tf.constant(out)
return _initializer
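# Illustrative sketch (not from the original Coach code): the initializer above
# rescales each column of a random normal matrix so that its L2 norm equals
# `std`, which keeps the initial outputs of policy/value layers small.
# A quick NumPy check of that property, with arbitrary shape and std:
def _normalized_columns_sketch(std=0.01, shape=(128, 4)):
    import numpy as np
    out = np.random.randn(*shape).astype(np.float32)
    out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
    print(np.linalg.norm(out, axis=0))  # expected: approximately [std, std, std, std]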
class Head(object):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
self.head_idx = head_idx
self.name = "head"
self.output = []
self.loss = []
self.loss_type = []
self.regularizations = []
self.loss_weight = force_list(loss_weight)
self.target = []
self.input = []
self.is_local = is_local
def __call__(self, input_layer):
"""
Wrapper for building the module graph including scoping and loss creation
:param input_layer: the input to the graph
:return: the output of the last layer and the target placeholder
"""
with tf.variable_scope(self.get_name(), initializer=tf.contrib.layers.xavier_initializer()):
self._build_module(input_layer)
self.output = force_list(self.output)
self.target = force_list(self.target)
self.input = force_list(self.input)
self.loss_type = force_list(self.loss_type)
self.loss = force_list(self.loss)
self.regularizations = force_list(self.regularizations)
if self.is_local:
self.set_loss()
self._post_build()
if self.is_local:
return self.output, self.target, self.input
else:
return self.output, self.input
def _build_module(self, input_layer):
"""
Builds the graph of the module
This method is called early on from __call__. It is expected to store the graph
in self.output.
:param input_layer: the input to the graph
:return: None
"""
pass
def _post_build(self):
"""
Optional function that allows adding any extra definitions after the head has been fully defined
For example, this allows doing additional calculations that are based on the loss
:return: None
"""
pass
def get_name(self):
"""
Get a formatted name for the module
:return: the formatted name
"""
return '{}_{}'.format(self.name, self.head_idx)
def set_loss(self):
"""
        Creates a target placeholder and a loss function for each loss type, and adds the head's regularization losses
:return: None
"""
# add losses and target placeholder
for idx in range(len(self.loss_type)):
target = tf.placeholder('float', self.output[idx].shape, '{}_target'.format(self.get_name()))
self.target.append(target)
loss = self.loss_type[idx](self.target[-1], self.output[idx],
weights=self.loss_weight[idx], scope=self.get_name())
self.loss.append(loss)
# add regularizations
for regularization in self.regularizations:
self.loss.append(regularization)
class QHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'q_values_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
if tuning_parameters.agent.replace_mse_with_huber_loss:
self.loss_type = tf.losses.huber_loss
else:
self.loss_type = tf.losses.mean_squared_error
def _build_module(self, input_layer):
# Standard Q Network
self.output = tf.layers.dense(input_layer, self.num_actions, name='output')
class DuelingQHead(QHead):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
QHead.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
def _build_module(self, input_layer):
# state value tower - V
with tf.variable_scope("state_value"):
state_value = tf.layers.dense(input_layer, 256, activation=tf.nn.relu, name='fc1')
state_value = tf.layers.dense(state_value, 1, name='fc2')
# state_value = tf.expand_dims(state_value, axis=-1)
# action advantage tower - A
with tf.variable_scope("action_advantage"):
action_advantage = tf.layers.dense(input_layer, 256, activation=tf.nn.relu, name='fc1')
action_advantage = tf.layers.dense(action_advantage, self.num_actions, name='fc2')
action_advantage = action_advantage - tf.reduce_mean(action_advantage)
# merge to state-action value function Q
self.output = tf.add(state_value, action_advantage, name='output')
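# Illustrative sketch (not from the original Coach code): the dueling head above
# combines the two streams as Q(s, a) = V(s) + (A(s, a) - mean(A)), so the
# advantage stream is only identified up to its mean. A tiny NumPy example with
# a single state and three actions (made-up numbers):
def _dueling_aggregation_sketch():
    import numpy as np
    state_value = np.array([[2.0]])            # V(s), shape (batch=1, 1)
    advantage = np.array([[1.0, 0.0, -1.0]])   # A(s, a), shape (batch=1, actions=3)
    q = state_value + (advantage - advantage.mean())
    print(q)  # [[3., 2., 1.]] -- the relative ordering of actions comes from A alone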
class VHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'v_values_head'
if tuning_parameters.agent.replace_mse_with_huber_loss:
self.loss_type = tf.losses.huber_loss
else:
self.loss_type = tf.losses.mean_squared_error
def _build_module(self, input_layer):
# Standard V Network
self.output = tf.layers.dense(input_layer, 1, name='output',
kernel_initializer=normalized_columns_initializer(1.0))
class PolicyHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'policy_values_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.output_scale = np.max(tuning_parameters.env_instance.action_space_abs_range)
self.discrete_controls = tuning_parameters.env_instance.discrete_controls
self.exploration_policy = tuning_parameters.exploration.policy
self.exploration_variance = 2*self.output_scale*tuning_parameters.exploration.initial_noise_variance_percentage
if not self.discrete_controls and not self.output_scale:
raise ValueError("For continuous controls, an output scale for the network must be specified")
self.beta = tuning_parameters.agent.beta_entropy
def _build_module(self, input_layer):
eps = 1e-15
if self.discrete_controls:
self.actions = tf.placeholder(tf.int32, [None], name="actions")
else:
self.actions = tf.placeholder(tf.float32, [None, self.num_actions], name="actions")
self.input = [self.actions]
# Policy Head
if self.discrete_controls:
policy_values = tf.layers.dense(input_layer, self.num_actions, name='fc')
self.policy_mean = tf.nn.softmax(policy_values, name="policy")
# define the distributions for the policy and the old policy
# (the + eps is to prevent probability 0 which will cause the log later on to be -inf)
self.policy_distribution = tf.contrib.distributions.Categorical(probs=(self.policy_mean + eps))
self.output = self.policy_mean
else:
# mean
policy_values_mean = tf.layers.dense(input_layer, self.num_actions, activation=tf.nn.tanh, name='fc_mean')
self.policy_mean = tf.multiply(policy_values_mean, self.output_scale, name='output_mean')
self.output = [self.policy_mean]
# std
if self.exploration_policy == 'ContinuousEntropy':
policy_values_std = tf.layers.dense(input_layer, self.num_actions,
kernel_initializer=normalized_columns_initializer(0.01), name='fc_std')
self.policy_std = tf.nn.softplus(policy_values_std, name='output_variance') + eps
self.output.append(self.policy_std)
else:
self.policy_std = tf.constant(self.exploration_variance, dtype='float32', shape=(self.num_actions,))
# define the distributions for the policy and the old policy
self.policy_distribution = tf.contrib.distributions.MultivariateNormalDiag(self.policy_mean,
self.policy_std)
if self.is_local:
# add entropy regularization
if self.beta:
self.entropy = tf.reduce_mean(self.policy_distribution.entropy())
self.regularizations = -tf.multiply(self.beta, self.entropy, name='entropy_regularization')
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, self.regularizations)
# calculate loss
self.action_log_probs_wrt_policy = self.policy_distribution.log_prob(self.actions)
self.advantages = tf.placeholder(tf.float32, [None], name="advantages")
self.target = self.advantages
self.loss = -tf.reduce_mean(self.action_log_probs_wrt_policy * self.advantages)
tf.losses.add_loss(self.loss_weight[0] * self.loss)
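# Illustrative sketch (not from the original Coach code): the loss above is the
# vanilla policy-gradient objective, -E[log pi(a|s) * advantage]. A tiny NumPy
# version for a discrete policy, with made-up probabilities and advantages:
def _policy_gradient_loss_sketch():
    import numpy as np
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.3, 0.6, 0.1]])        # pi(a|s) for a batch of 2 states
    actions = np.array([0, 1])                 # actions that were taken
    advantages = np.array([1.5, -0.5])         # estimated advantages
    log_probs = np.log(probs[np.arange(len(actions)), actions])
    loss = -np.mean(log_probs * advantages)
    print(loss)  # positive advantages push the taken action's probability up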
class MeasurementsPredictionHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'future_measurements_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.num_measurements = tuning_parameters.env.measurements_size[0] \
if tuning_parameters.env.measurements_size else 0
self.num_prediction_steps = tuning_parameters.agent.num_predicted_steps_ahead
self.multi_step_measurements_size = self.num_measurements * self.num_prediction_steps
if tuning_parameters.agent.replace_mse_with_huber_loss:
self.loss_type = tf.losses.huber_loss
else:
self.loss_type = tf.losses.mean_squared_error
def _build_module(self, input_layer):
# This is almost exactly the same as Dueling Network but we predict the future measurements for each action
# actions expectation tower (expectation stream) - E
with tf.variable_scope("expectation_stream"):
expectation_stream = tf.layers.dense(input_layer, 256, activation=tf.nn.elu, name='fc1')
expectation_stream = tf.layers.dense(expectation_stream, self.multi_step_measurements_size, name='output')
expectation_stream = tf.expand_dims(expectation_stream, axis=1)
# action fine differences tower (action stream) - A
with tf.variable_scope("action_stream"):
action_stream = tf.layers.dense(input_layer, 256, activation=tf.nn.elu, name='fc1')
action_stream = tf.layers.dense(action_stream, self.num_actions * self.multi_step_measurements_size,
name='output')
action_stream = tf.reshape(action_stream,
(tf.shape(action_stream)[0], self.num_actions, self.multi_step_measurements_size))
action_stream = action_stream - tf.reduce_mean(action_stream, reduction_indices=1, keep_dims=True)
# merge to future measurements predictions
self.output = tf.add(expectation_stream, action_stream, name='output')
class DNDQHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'dnd_q_values_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.DND_size = tuning_parameters.agent.dnd_size
self.DND_key_error_threshold = tuning_parameters.agent.DND_key_error_threshold
self.l2_norm_added_delta = tuning_parameters.agent.l2_norm_added_delta
self.new_value_shift_coefficient = tuning_parameters.agent.new_value_shift_coefficient
self.number_of_nn = tuning_parameters.agent.number_of_knn
if tuning_parameters.agent.replace_mse_with_huber_loss:
self.loss_type = tf.losses.huber_loss
else:
self.loss_type = tf.losses.mean_squared_error
self.tp = tuning_parameters
self.dnd_embeddings = [None]*self.num_actions
self.dnd_values = [None]*self.num_actions
self.dnd_indices = [None]*self.num_actions
def _build_module(self, input_layer):
# DND based Q head
from memories import differentiable_neural_dictionary
if self.tp.checkpoint_restore_dir:
self.DND = differentiable_neural_dictionary.load_dnd(self.tp.checkpoint_restore_dir)
else:
self.DND = differentiable_neural_dictionary.QDND(
self.DND_size, input_layer.get_shape()[-1], self.num_actions, self.new_value_shift_coefficient,
key_error_threshold=self.DND_key_error_threshold, learning_rate=self.tp.learning_rate)
# Retrieve info from DND dictionary
# We assume that all actions have enough entries in the DND
self.output = tf.transpose([
self._q_value(input_layer, action)
for action in range(self.num_actions)
])
def _q_value(self, input_layer, action):
result = tf.py_func(self.DND.query,
[input_layer, action, self.number_of_nn],
[tf.float64, tf.float64, tf.int64])
self.dnd_embeddings[action] = tf.to_float(result[0])
self.dnd_values[action] = tf.to_float(result[1])
self.dnd_indices[action] = result[2]
# DND calculation
square_diff = tf.square(self.dnd_embeddings[action] - tf.expand_dims(input_layer, 1))
distances = tf.reduce_sum(square_diff, axis=2) + [self.l2_norm_added_delta]
weights = 1.0 / distances
normalised_weights = weights / tf.reduce_sum(weights, axis=1, keep_dims=True)
return tf.reduce_sum(self.dnd_values[action] * normalised_weights, axis=1)
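# Illustrative sketch (not from the original Coach code): the DND lookup above
# turns the k nearest neighbours' stored values into a Q estimate using an
# inverse-distance kernel, w_i = 1 / (||h - h_i||^2 + delta), normalised to sum
# to 1. A small NumPy version with made-up neighbour values and distances:
def _dnd_kernel_sketch(delta=0.001):
    import numpy as np
    neighbour_values = np.array([10.0, 0.0, 5.0])      # values stored in the DND
    squared_distances = np.array([0.01, 1.0, 0.25])    # squared distances to the query embedding
    weights = 1.0 / (squared_distances + delta)
    weights /= weights.sum()
    print(np.sum(neighbour_values * weights))  # dominated by the closest neighbour (~10)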
class NAFHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'naf_q_values_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.output_scale = np.max(tuning_parameters.env_instance.action_space_abs_range)
if tuning_parameters.agent.replace_mse_with_huber_loss:
self.loss_type = tf.losses.huber_loss
else:
self.loss_type = tf.losses.mean_squared_error
def _build_module(self, input_layer):
# NAF
self.action = tf.placeholder(tf.float32, [None, self.num_actions], name="action")
self.input = self.action
# V Head
self.V = tf.layers.dense(input_layer, 1, name='V')
# mu Head
mu_unscaled = tf.layers.dense(input_layer, self.num_actions, activation=tf.nn.tanh, name='mu_unscaled')
self.mu = tf.multiply(mu_unscaled, self.output_scale, name='mu')
# A Head
        # l_vector is a vector holding the entries of a lower-triangular matrix, packed column by column
        self.l_vector = tf.layers.dense(input_layer, (self.num_actions * (self.num_actions + 1)) // 2, name='l_vector')
# Convert l to a lower triangular matrix and exponentiate its diagonal
i = 0
columns = []
for col in range(self.num_actions):
start_row = col
num_non_zero_elements = self.num_actions - start_row
zeros_column_part = tf.zeros_like(self.l_vector[:, 0:start_row])
diag_element = tf.expand_dims(tf.exp(self.l_vector[:, i]), 1)
non_zeros_non_diag_column_part = self.l_vector[:, (i + 1):(i + num_non_zero_elements)]
columns.append(tf.concat([zeros_column_part, diag_element, non_zeros_non_diag_column_part], axis=1))
i += num_non_zero_elements
self.L = tf.transpose(tf.stack(columns, axis=1), (0, 2, 1))
# P = L*L^T
self.P = tf.matmul(self.L, tf.transpose(self.L, (0, 2, 1)))
# A = -1/2 * (u - mu)^T * P * (u - mu)
action_diff = tf.expand_dims(self.action - self.mu, -1)
a_matrix_form = -0.5 * tf.matmul(tf.transpose(action_diff, (0, 2, 1)), tf.matmul(self.P, action_diff))
self.A = tf.reshape(a_matrix_form, [-1, 1])
# Q Head
self.Q = tf.add(self.V, self.A, name='Q')
self.output = self.Q
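# Illustrative sketch (not from the original Coach code): for num_actions = 2,
# l_vector holds 3 entries [l00, l10, l11]; the diagonal entries are
# exponentiated, P = L L^T is positive definite, and A(u) = -1/2 (u - mu)^T P (u - mu) <= 0,
# so Q = V + A is maximised at u = mu. A NumPy version with made-up numbers:
def _naf_quadratic_advantage_sketch():
    import numpy as np
    l_vector = np.array([0.1, 0.5, -0.2])              # [l00, l10, l11]
    L = np.array([[np.exp(l_vector[0]), 0.0],
                  [l_vector[1], np.exp(l_vector[2])]])
    P = L @ L.T                                        # positive definite by construction
    mu = np.array([0.3, -0.1])
    u = np.array([1.0, 0.5])
    advantage = -0.5 * (u - mu) @ P @ (u - mu)
    print(advantage)  # negative; equals 0 only when u == mu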
class PPOHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'ppo_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.discrete_controls = tuning_parameters.env_instance.discrete_controls
self.output_scale = np.max(tuning_parameters.env_instance.action_space_abs_range)
# kl coefficient and its corresponding assignment operation and placeholder
self.kl_coefficient = tf.Variable(tuning_parameters.agent.initial_kl_coefficient,
trainable=False, name='kl_coefficient')
self.kl_coefficient_ph = tf.placeholder('float', name='kl_coefficient_ph')
self.assign_kl_coefficient = tf.assign(self.kl_coefficient, self.kl_coefficient_ph)
self.kl_cutoff = 2*tuning_parameters.agent.target_kl_divergence
self.high_kl_penalty_coefficient = tuning_parameters.agent.high_kl_penalty_coefficient
self.clip_likelihood_ratio_using_epsilon = tuning_parameters.agent.clip_likelihood_ratio_using_epsilon
self.use_kl_regularization = tuning_parameters.agent.use_kl_regularization
self.beta = tuning_parameters.agent.beta_entropy
def _build_module(self, input_layer):
eps = 1e-15
if self.discrete_controls:
self.actions = tf.placeholder(tf.int32, [None], name="actions")
else:
self.actions = tf.placeholder(tf.float32, [None, self.num_actions], name="actions")
self.old_policy_mean = tf.placeholder(tf.float32, [None, self.num_actions], "old_policy_mean")
self.old_policy_std = tf.placeholder(tf.float32, [None, self.num_actions], "old_policy_std")
# Policy Head
if self.discrete_controls:
self.input = [self.actions, self.old_policy_mean]
policy_values = tf.layers.dense(input_layer, self.num_actions, name='policy_fc')
self.policy_mean = tf.nn.softmax(policy_values, name="policy")
# define the distributions for the policy and the old policy
self.policy_distribution = tf.contrib.distributions.Categorical(probs=(self.policy_mean + eps))
self.old_policy_distribution = tf.contrib.distributions.Categorical(probs=self.old_policy_mean)
self.output = self.policy_mean
else:
self.input = [self.actions, self.old_policy_mean, self.old_policy_std]
self.policy_mean = tf.layers.dense(input_layer, self.num_actions, name='policy_mean')
self.policy_logstd = tf.Variable(np.zeros((1, self.num_actions)), dtype='float32')
self.policy_std = tf.tile(tf.exp(self.policy_logstd), [tf.shape(input_layer)[0], 1], name='policy_std')
# define the distributions for the policy and the old policy
self.policy_distribution = tf.contrib.distributions.MultivariateNormalDiag(self.policy_mean,
self.policy_std)
self.old_policy_distribution = tf.contrib.distributions.MultivariateNormalDiag(self.old_policy_mean,
self.old_policy_std)
self.output = [self.policy_mean, self.policy_std]
self.action_probs_wrt_policy = tf.exp(self.policy_distribution.log_prob(self.actions))
self.action_probs_wrt_old_policy = tf.exp(self.old_policy_distribution.log_prob(self.actions))
self.entropy = tf.reduce_mean(self.policy_distribution.entropy())
# add kl divergence regularization
self.kl_divergence = tf.reduce_mean(tf.contrib.distributions.kl_divergence(self.old_policy_distribution,
self.policy_distribution))
if self.use_kl_regularization:
# no clipping => use kl regularization
self.weighted_kl_divergence = tf.multiply(self.kl_coefficient, self.kl_divergence)
self.regularizations = self.weighted_kl_divergence + self.high_kl_penalty_coefficient * \
tf.square(tf.maximum(0.0, self.kl_divergence - self.kl_cutoff))
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, self.regularizations)
# calculate surrogate loss
self.advantages = tf.placeholder(tf.float32, [None], name="advantages")
self.target = self.advantages
self.likelihood_ratio = self.action_probs_wrt_policy / (self.action_probs_wrt_old_policy + eps)
if self.clip_likelihood_ratio_using_epsilon is not None:
max_value = 1 + self.clip_likelihood_ratio_using_epsilon
min_value = 1 - self.clip_likelihood_ratio_using_epsilon
self.clipped_likelihood_ratio = tf.clip_by_value(self.likelihood_ratio, min_value, max_value)
self.scaled_advantages = tf.minimum(self.likelihood_ratio * self.advantages,
self.clipped_likelihood_ratio * self.advantages)
else:
self.scaled_advantages = self.likelihood_ratio * self.advantages
# minus sign is in order to set an objective to minimize (we actually strive for maximizing the surrogate loss)
self.surrogate_loss = -tf.reduce_mean(self.scaled_advantages)
if self.is_local:
# add entropy regularization
if self.beta:
self.entropy = tf.reduce_mean(self.policy_distribution.entropy())
self.regularizations = -tf.multiply(self.beta, self.entropy, name='entropy_regularization')
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, self.regularizations)
self.loss = self.surrogate_loss
tf.losses.add_loss(self.loss)
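# Illustrative sketch (not from the original Coach code): the clipped surrogate
# objective above is L = E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], where
# r = pi(a|s) / pi_old(a|s). A tiny NumPy version with made-up ratios:
def _clipped_surrogate_sketch(epsilon=0.2):
    import numpy as np
    likelihood_ratio = np.array([0.5, 1.0, 3.0])   # new / old action probabilities
    advantages = np.array([1.0, 1.0, 1.0])
    clipped = np.clip(likelihood_ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = np.minimum(likelihood_ratio * advantages, clipped * advantages)
    print(-np.mean(surrogate))  # the large ratio (3.0) is clipped to 1.2, limiting the update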
class PPOVHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'ppo_v_head'
self.clip_likelihood_ratio_using_epsilon = tuning_parameters.agent.clip_likelihood_ratio_using_epsilon
def _build_module(self, input_layer):
self.old_policy_value = tf.placeholder(tf.float32, [None], "old_policy_values")
self.input = [self.old_policy_value]
self.output = tf.layers.dense(input_layer, 1, name='output',
kernel_initializer=normalized_columns_initializer(1.0))
self.target = self.total_return = tf.placeholder(tf.float32, [None], name="total_return")
value_loss_1 = tf.square(self.output - self.target)
value_loss_2 = tf.square(self.old_policy_value +
tf.clip_by_value(self.output - self.old_policy_value,
-self.clip_likelihood_ratio_using_epsilon,
self.clip_likelihood_ratio_using_epsilon) - self.target)
self.vf_loss = tf.reduce_mean(tf.maximum(value_loss_1, value_loss_2))
self.loss = self.vf_loss
tf.losses.add_loss(self.loss)
class CategoricalQHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'categorical_dqn_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.num_atoms = tuning_parameters.agent.atoms
def _build_module(self, input_layer):
self.actions = tf.placeholder(tf.int32, [None], name="actions")
self.input = [self.actions]
values_distribution = tf.layers.dense(input_layer, self.num_actions * self.num_atoms, name='output')
values_distribution = tf.reshape(values_distribution, (tf.shape(values_distribution)[0], self.num_actions, self.num_atoms))
# softmax on atoms dimension
self.output = tf.nn.softmax(values_distribution)
# calculate cross entropy loss
self.distributions = tf.placeholder(tf.float32, shape=(None, self.num_actions, self.num_atoms), name="distributions")
self.target = self.distributions
self.loss = tf.nn.softmax_cross_entropy_with_logits(labels=self.target, logits=values_distribution)
tf.losses.add_loss(self.loss)
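# Illustrative sketch (not from the original Coach code): the categorical head
# above outputs, per action, a softmax distribution over num_atoms fixed value
# bins, and is trained with a cross-entropy loss against a projected target
# distribution. A tiny NumPy version of that loss for one (state, action), with made-up numbers:
def _categorical_cross_entropy_sketch():
    import numpy as np
    logits = np.array([2.0, 0.5, -1.0])            # per-atom logits for one (state, action)
    target = np.array([0.7, 0.2, 0.1])             # projected target distribution over the atoms
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(-np.sum(target * np.log(probs)))         # cross-entropy, as computed by softmax_cross_entropy_with_logits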
class QuantileRegressionQHead(Head):
def __init__(self, tuning_parameters, head_idx=0, loss_weight=1., is_local=True):
Head.__init__(self, tuning_parameters, head_idx, loss_weight, is_local)
self.name = 'quantile_regression_dqn_head'
self.num_actions = tuning_parameters.env_instance.action_space_size
self.num_atoms = tuning_parameters.agent.atoms # we use atom / quantile interchangeably
self.huber_loss_interval = 1 # k
def _build_module(self, input_layer):
self.actions = tf.placeholder(tf.int32, [None, 2], name="actions")
self.quantile_midpoints = tf.placeholder(tf.float32, [None, self.num_atoms], name="quantile_midpoints")
self.input = [self.actions, self.quantile_midpoints]
# the output of the head is the N unordered quantile locations {theta_1, ..., theta_N}
quantiles_locations = tf.layers.dense(input_layer, self.num_actions * self.num_atoms, name='output')
quantiles_locations = tf.reshape(quantiles_locations, (tf.shape(quantiles_locations)[0], self.num_actions, self.num_atoms))
self.output = quantiles_locations
self.quantiles = tf.placeholder(tf.float32, shape=(None, self.num_atoms), name="quantiles")
self.target = self.quantiles
# only the quantiles of the taken action are taken into account
quantiles_for_used_actions = tf.gather_nd(quantiles_locations, self.actions)
# reorder the output quantiles and the target quantiles as a preparation step for calculating the loss
# the output quantiles vector and the quantile midpoints are tiled as rows of a NxN matrix (N = num quantiles)
# the target quantiles vector is tiled as column of a NxN matrix
theta_i = tf.tile(tf.expand_dims(quantiles_for_used_actions, -1), [1, 1, self.num_atoms])
T_theta_j = tf.tile(tf.expand_dims(self.target, -2), [1, self.num_atoms, 1])
tau_i = tf.tile(tf.expand_dims(self.quantile_midpoints, -1), [1, 1, self.num_atoms])
# Huber loss of T(theta_j) - theta_i
error = T_theta_j - theta_i
abs_error = tf.abs(error)
quadratic = tf.minimum(abs_error, self.huber_loss_interval)
huber_loss = self.huber_loss_interval * (abs_error - quadratic) + 0.5 * quadratic ** 2
# Quantile Huber loss
quantile_huber_loss = tf.abs(tau_i - tf.cast(error < 0, dtype=tf.float32)) * huber_loss
# Quantile regression loss (the probability for each quantile is 1/num_quantiles)
quantile_regression_loss = tf.reduce_sum(quantile_huber_loss) / float(self.num_atoms)
self.loss = quantile_regression_loss
tf.losses.add_loss(self.loss)
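# Illustrative sketch (not from the original Coach code): the quantile Huber loss
# above is rho_tau^k(e) = |tau - 1{e < 0}| * huber_k(e), averaged over target
# quantiles and summed over output quantiles. A small NumPy version for a
# single state-action with 2 quantiles (made-up numbers):
def _quantile_huber_loss_sketch(k=1.0):
    import numpy as np
    theta = np.array([0.0, 1.0])          # predicted quantile locations
    target = np.array([0.5, 2.0])         # target quantile locations T(theta)
    tau = np.array([0.25, 0.75])          # quantile midpoints
    error = target[None, :] - theta[:, None]      # error[i, j] = T(theta_j) - theta_i
    abs_error = np.abs(error)
    quadratic = np.minimum(abs_error, k)
    huber = k * (abs_error - quadratic) + 0.5 * quadratic ** 2
    loss = np.abs(tau[:, None] - (error < 0)) * huber
    print(loss.sum() / len(tau))          # matches the reduce_sum / num_atoms above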


@@ -1,77 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import tensorflow as tf
import numpy as np
from configurations import EmbedderWidth
class MiddlewareEmbedder(object):
def __init__(self, activation_function=tf.nn.relu, embedder_width=EmbedderWidth.Wide, name="middleware_embedder"):
self.name = name
self.input = None
self.output = None
self.embedder_width = embedder_width
self.activation_function = activation_function
def __call__(self, input_layer):
with tf.variable_scope(self.get_name()):
self.input = input_layer
self._build_module()
return self.input, self.output
def _build_module(self):
pass
def get_name(self):
return self.name
class LSTM_Embedder(MiddlewareEmbedder):
def _build_module(self):
"""
self.state_in: tuple of placeholders containing the initial state
self.state_out: tuple of output state
todo: it appears that the shape of the output is batch, feature
the code here seems to be slicing off the first element in the batch
which would definitely be wrong. need to double check the shape
"""
middleware = tf.layers.dense(self.input, 512, activation=self.activation_function, name='fc1')
lstm_cell = tf.contrib.rnn.BasicLSTMCell(256, state_is_tuple=True)
self.c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
self.h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
self.state_init = [self.c_init, self.h_init]
self.c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
self.h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
self.state_in = (self.c_in, self.h_in)
rnn_in = tf.expand_dims(middleware, [0])
step_size = tf.shape(middleware)[:1]
state_in = tf.contrib.rnn.LSTMStateTuple(self.c_in, self.h_in)
lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
lstm_cell, rnn_in, initial_state=state_in, sequence_length=step_size, time_major=False)
lstm_c, lstm_h = lstm_state
self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
self.output = tf.reshape(lstm_outputs, [-1, 256])
class FC_Embedder(MiddlewareEmbedder):
def _build_module(self):
width = 512 if self.embedder_width == EmbedderWidth.Wide else 64
self.output = tf.layers.dense(self.input, width, activation=self.activation_function, name='fc1')


@@ -1,82 +0,0 @@
#
# Copyright (c) 2017 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import tensorflow as tf
import numpy as np
class SharedRunningStats(object):
def __init__(self, tuning_parameters, replicated_device, epsilon=1e-2, shape=(), name=""):
self.tp = tuning_parameters
with tf.device(replicated_device):
with tf.variable_scope(name):
self._sum = tf.get_variable(
dtype=tf.float64,
shape=shape,
initializer=tf.constant_initializer(0.0),
name="running_sum", trainable=False)
self._sum_squared = tf.get_variable(
dtype=tf.float64,
shape=shape,
initializer=tf.constant_initializer(epsilon),
name="running_sum_squared", trainable=False)
self._count = tf.get_variable(
dtype=tf.float64,
shape=(),
initializer=tf.constant_initializer(epsilon),
name="count", trainable=False)
self._shape = shape
self._mean = self._sum / self._count
self._std = tf.sqrt(tf.maximum((self._sum_squared - self._count*tf.square(self._mean))
/ tf.maximum(self._count-1, 1), epsilon))
self.new_sum = tf.placeholder(shape=self.shape, dtype=tf.float64, name='sum')
self.new_sum_squared = tf.placeholder(shape=self.shape, dtype=tf.float64, name='var')
self.newcount = tf.placeholder(shape=[], dtype=tf.float64, name='count')
self._inc_sum = tf.assign_add(self._sum, self.new_sum, use_locking=True)
self._inc_sum_squared = tf.assign_add(self._sum_squared, self.new_sum_squared, use_locking=True)
self._inc_count = tf.assign_add(self._count, self.newcount, use_locking=True)
def push(self, x):
x = x.astype('float64')
self.tp.sess.run([self._inc_sum, self._inc_sum_squared, self._inc_count],
feed_dict={
self.new_sum: x.sum(axis=0).ravel(),
self.new_sum_squared: np.square(x).sum(axis=0).ravel(),
self.newcount: np.array(len(x), dtype='float64')
})
@property
def n(self):
return self.tp.sess.run(self._count)
@property
def mean(self):
return self.tp.sess.run(self._mean)
@property
def var(self):
return self.std ** 2
@property
def std(self):
return self.tp.sess.run(self._std)
@property
def shape(self):
return self._shape
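# Illustrative sketch (not from the original Coach code): the statistics above are
# derived from three running accumulators (sum, sum of squares, count), with
#     mean = sum / count
#     std  = sqrt((sum_sq - count * mean^2) / max(count - 1, 1))
# A NumPy check against the direct computation, using made-up data:
def _running_stats_sketch(epsilon=1e-2):
    import numpy as np
    x = np.random.randn(1000, 3) * 2.0 + 5.0
    running_sum = x.sum(axis=0)
    running_sum_sq = np.square(x).sum(axis=0) + epsilon   # same epsilon initialisation as above
    count = len(x) + epsilon
    mean = running_sum / count
    std = np.sqrt(np.maximum((running_sum_sq - count * np.square(mean)) / np.maximum(count - 1, 1), epsilon))
    print(mean, x.mean(axis=0))          # should be close
    print(std, x.std(axis=0, ddof=1))    # should be close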


@@ -1,172 +1,44 @@
# Coach Benchmarks
The following table represents the current status of algorithms implemented in Coach relative to the results reported in the original papers. The detailed results for each algorithm can be seen by clicking on its name.
The X axis in all the figures is the total steps (for multi-threaded runs, this is the number of steps per worker).
The Y axis in all the figures is the average episode reward with an averaging window of 100 timesteps.
For each algorithm, there is a command line for reproducing the results of each graph.
These are the results you can expect to get when running the pre-defined presets in Coach.
## Summary
![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) *Reproducing paper's results*
![#ceffad](https://placehold.it/15/ceffad/000000?text=+) *Reproducing paper's results for some of the environments*
![#FFA500](https://placehold.it/15/FFA500/000000?text=+) *Training but not reproducing paper's results*
![#FF4040](https://placehold.it/15/FF4040/000000?text=+) *Not training*
| |**Status** |**Environments**|**Comments**|
| ----------------------- |:--------------------------------------------------------:|:--------------:|:--------:|
|**[DQN](dqn)** | ![#ceffad](https://placehold.it/15/ceffad/000000?text=+) |Atari | Pong is not training |
|**[Dueling DDQN](dueling_ddqn)**| ![#ceffad](https://placehold.it/15/ceffad/000000?text=+) |Atari | Pong is not training |
|**[Dueling DDQN with PER](dueling_ddqn_with_per)**| ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
|**[Bootstrapped DQN](bootstrapped_dqn)**| ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
|**[QR-DQN](qr_dqn)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
|**[A3C](a3c)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari, Mujoco | |
|**[Clipped PPO](clipped_ppo)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
|**[DDPG](ddpg)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Mujoco | |
|**[NEC](nec)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Atari | |
|**[HER](ddpg_her)** | ![#2E8B57](https://placehold.it/15/2E8B57/000000?text=+) |Fetch | |
|**[HAC](hac)** | ![#969696](https://placehold.it/15/969696/000000?text=+) |Pendulum | |
|**[DFP](dfp)** | ![#ceffad](https://placehold.it/15/ceffad/000000?text=+) |Doom | Doom Battle was not verified |
**Click on each algorithm to see detailed benchmarking results**
# Coach Benchmarks
The following figures are training curves of some of the presets available through Coach.
The X axis in all the figures is the total steps (for multi-threaded runs, this is the accumulated number of steps over all the workers).
The Y axis in all the figures is the average episode reward with an averaging window of 11 episodes.
These are the results you can expect to get when running the pre-defined presets in Coach.
The environments that were used for testing include:
* **Atari** - Breakout, Pong and Space Invaders
* **Mujoco** - Inverted Pendulum, Inverted Double Pendulum, Reacher, Hopper, Half Cheetah, Walker 2D, Ant, Swimmer and Humanoid.
* **Doom** - Basic, Health Gathering (D1: Basic), Health Gathering Supreme (D2: Navigation), Battle (D3: Battle)
* **Fetch** - Reach, Slide, Push, Pick-and-Place
## A3C
### Breakout_A3C with 16 workers
```bash
python3 coach.py -p Breakout_A3C -n 16 -r
```
<img src="img/Breakout_A3C_16_workers.png" alt="Breakout_A3C_16_workers" width="400"/>
### InvertedPendulum_A3C with 16 workers
```bash
python3 coach.py -p InvertedPendulum_A3C -n 16 -r
```
<img src="img/Inverted_Pendulum_A3C_16_workers.png" alt="Inverted_Pendulum_A3C_16_workers" width="400"/>
### Hopper_A3C with 16 workers
```bash
python3 coach.py -p Hopper_A3C -n 16 -r
```
<img src="img/Hopper_A3C_16_workers.png" alt="Hopper_A3C_16_workers" width="400"/>
### Ant_A3C with 16 workers
```bash
python3 coach.py -p Ant_A3C -n 16 -r
```
<img src="img/Ant_A3C_16_workers.png" alt="Ant_A3C_16_workers" width="400"/>
## Clipped PPO
### InvertedPendulum_ClippedPPO with 16 workers
```bash
python3 coach.py -p InvertedPendulum_ClippedPPO -n 16 -r
```
<img src="img/InvertedPendulum_ClippedPPO_16_workers.png" alt="InvertedPendulum_ClippedPPO_16_workers" width="400"/>
### Hopper_ClippedPPO with 16 workers
```bash
python3 coach.py -p Hopper_ClippedPPO -n 16 -r
```
<img src="img/Hopper_ClippedPPO_16_workers.png" alt="Hopper_Clipped_PPO_16_workers" width="400"/>
### Humanoid_ClippedPPO with 16 workers
```bash
python3 coach.py -p Humanoid_ClippedPPO -n 16 -r
```
<img src="img/Humanoid_ClippedPPO_16_workers.png" alt="Humanoid_ClippedPPO_16_workers" width="400"/>
## DQN
### Pong_DQN
```bash
python3 coach.py -p Pong_DQN -r
```
<img src="img/Pong_DQN.png" alt="Pong_DQN" width="400"/>
### Doom_Basic_DQN
```bash
python3 coach.py -p Doom_Basic_DQN -r
```
<img src="img/Doom_Basic_DQN.png" alt="Doom_Basic_DQN" width="400"/>
## Dueling DDQN
### Doom_Basic_Dueling_DDQN
```bash
python3 coach.py -p Doom_Basic_Dueling_DDQN -r
```
<img src="img/Doom_Basic_Dueling_DDQN.png" alt="Doom_Basic_Dueling_DDQN" width="400"/>
## DFP
### Doom_Health_DFP
```bash
python3 coach.py -p Doom_Health_DFP -r
```
<img src="img/Doom_Health_DFP.png" alt="Doom_Health_DFP" width="400"/>
## MMC
### Doom_Health_MMC
```bash
python3 coach.py -p Doom_Health_MMC -r
```
<img src="img/Doom_Health_MMC.png" alt="Doom_Health_MMC" width="400"/>
## NEC
### Pong_NEC
```bash
python3 coach.py -p Pong_NEC -r
```
<img src="img/Pong_NEC.png" alt="Pong_NEC" width="400"/>
### Doom_Basic_NEC
```bash
python3 coach.py -p Doom_Basic_NEC -r
```
<img src="img/Doom_Basic_NEC.png" alt="Doom_Basic_NEC" width="400"/>
## PG
### CartPole_PG
```bash
python3 coach.py -p CartPole_PG -r
```
<img src="img/CartPole_PG.png" alt="CartPole_PG" width="400"/>
## DDPG
### Pendulum_DDPG
```bash
python3 coach.py -p Pendulum_DDPG -r
```
<img src="img/Pendulum_DDPG.png" alt="Pendulum_DDPG" width="400"/>
## NAF
### InvertedPendulum_NAF
```bash
python3 coach.py -p InvertedPendulum_NAF -r
```
<img src="img/InvertedPendulum_NAF.png" alt="InvertedPendulum_NAF" width="400"/>
### Pendulum_NAF
```bash
python3 coach.py -p Pendulum_NAF -r
```
<img src="img/Pendulum_NAF.png" alt="Pendulum_NAF" width="400"/>

benchmarks/a3c/README.md

@@ -0,0 +1,43 @@
# A3C
Each experiment uses 3 seeds.
The parameters used for A3C are the same parameters as described in the [original paper](https://arxiv.org/abs/1602.01783).
### Inverted Pendulum A3C - 1/2/4/8/16 workers
```bash
python3 coach.py -p Mujoco_A3C -lvl inverted_pendulum -n 1
python3 coach.py -p Mujoco_A3C -lvl inverted_pendulum -n 2
python3 coach.py -p Mujoco_A3C -lvl inverted_pendulum -n 4
python3 coach.py -p Mujoco_A3C -lvl inverted_pendulum -n 8
python3 coach.py -p Mujoco_A3C -lvl inverted_pendulum -n 16
```
<img src="inverted_pendulum_a3c.png" alt="Inverted Pendulum A3C" width="800"/>
### Hopper A3C - 16 workers
```bash
python3 coach.py -p Mujoco_A3C -lvl hopper -n 16
```
<img src="hopper_a3c_16_workers.png" alt="Hopper A3C 16 workers" width="800"/>
### Walker2D A3C - 16 workers
```bash
python3 coach.py -p Mujoco_A3C -lvl walker2d -n 16
```
<img src="walker2d_a3c_16_workers.png" alt="Walker2D A3C 16 workers" width="800"/>
### Space Invaders A3C - 16 workers
```bash
python3 coach.py -p Atari_A3C -lvl space_invaders -n 16
```
<img src="space_invaders_a3c_16_workers.png" alt="Space Invaders A3C 16 workers" width="800"/>


@@ -0,0 +1,31 @@
# Bootstrapped DQN
Each experiment uses 3 seeds.
The parameters used for Bootstrapped DQN are the same parameters as described in the [original paper](https://arxiv.org/abs/1602.04621).
### Breakout Bootstrapped DQN - single worker
```bash
python3 coach.py -p Atari_Bootstrapped_DQN -lvl breakout
```
<img src="breakout_bootstrapped_dqn.png" alt="Breakout Bootstrapped DQN" width="800"/>
### Pong Bootstrapped DQN - single worker
```bash
python3 coach.py -p Atari_Bootstrapped_DQN -lvl pong
```
<img src="pong_bootstrapped_dqn.png" alt="Pong Bootstrapped DQN" width="800"/>
### Space Invaders Bootstrapped DQN - single worker
```bash
python3 coach.py -p Atari_Bootstrapped_DQN -lvl space_invaders
```
<img src="space_invaders_bootstrapped_dqn.png" alt="Space Invaders Bootstrapped DQN" width="800"/>


@@ -0,0 +1,84 @@
# Clipped PPO
Each experiment uses 3 seeds and is trained for 10k environment steps.
The parameters used for Clipped PPO are the same parameters as described in the [original paper](https://arxiv.org/abs/1707.06347).
### Inverted Pendulum Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl inverted_pendulum
```
<img src="inverted_pendulum_clipped_ppo.png" alt="Inverted Pendulum Clipped PPO" width="800"/>
### Inverted Double Pendulum Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl inverted_double_pendulum
```
<img src="inverted_double_pendulum_clipped_ppo.png" alt="Inverted Double Pendulum Clipped PPO" width="800"/>
### Reacher Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl reacher
```
<img src="reacher_clipped_ppo.png" alt="Reacher Clipped PPO" width="800"/>
### Hopper Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl hopper
```
<img src="hopper_clipped_ppo.png" alt="Hopper Clipped PPO" width="800"/>
### Half Cheetah Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl half_cheetah
```
<img src="half_cheetah_clipped_ppo.png" alt="Half Cheetah Clipped PPO" width="800"/>
### Walker 2D Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl walker2d
```
<img src="walker2d_clipped_ppo.png" alt="Walker 2D Clipped PPO" width="800"/>
### Ant Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl ant
```
<img src="ant_clipped_ppo.png" alt="Ant Clipped PPO" width="800"/>
### Swimmer Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl swimmer
```
<img src="swimmer_clipped_ppo.png" alt="Swimmer Clipped PPO" width="800"/>
### Humanoid Clipped PPO - single worker
```bash
python3 coach.py -p Mujoco_ClippedPPO -lvl humanoid
```
<img src="humanoid_clipped_ppo.png" alt="Humanoid Clipped PPO" width="800"/>

benchmarks/ddpg/README.md

@@ -0,0 +1,84 @@
# DDPG
Each experiment uses 3 seeds and is trained for 2k environment steps.
The parameters used for DDPG are the same parameters as described in the [original paper](https://arxiv.org/abs/1509.02971).
### Inverted Pendulum DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl inverted_pendulum
```
<img src="inverted_pendulum_ddpg.png" alt="Inverted Pendulum DDPG" width="800"/>
### Inverted Double Pendulum DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl inverted_double_pendulum
```
<img src="inverted_double_pendulum_ddpg.png" alt="Inverted Double Pendulum DDPG" width="800"/>
### Reacher DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl reacher
```
<img src="reacher_ddpg.png" alt="Reacher DDPG" width="800"/>
### Hopper DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl hopper
```
<img src="hopper_ddpg.png" alt="Hopper DDPG" width="800"/>
### Half Cheetah DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl half_cheetah
```
<img src="half_cheetah_ddpg.png" alt="Half Cheetah DDPG" width="800"/>
### Walker 2D DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl walker2d
```
<img src="walker2d_ddpg.png" alt="Walker 2D DDPG" width="800"/>
### Ant DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl ant
```
<img src="ant_ddpg.png" alt="Ant DDPG" width="800"/>
### Swimmer DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl swimmer
```
<img src="swimmer_ddpg.png" alt="Swimmer DDPG" width="800"/>
### Humanoid DDPG - single worker
```bash
python3 coach.py -p Mujoco_DDPG -lvl humanoid
```
<img src="humanoid_ddpg.png" alt="Humanoid DDPG" width="800"/>


@@ -0,0 +1,40 @@
# DDPG with Hindsight Experience Replay
Each experiment uses 3 seeds.
The parameters used for DDPG HER are the same parameters as described in the [following paper](https://arxiv.org/abs/1802.09464).
### Fetch Reach DDPG HER - single worker
```bash
python3 coach.py -p Fetch_DDPG_HER_baselines -lvl reach
```
<img src="fetch_ddpg_her_reach_1_worker.png" alt="Fetch DDPG HER Reach 1 Worker" width="800"/>
### Fetch Push DDPG HER - 8 workers
```bash
python3 coach.py -p Fetch_DDPG_HER_baselines -lvl push -n 8
```
<img src="fetch_ddpg_her_push_8_workers.png" alt="Fetch DDPG HER Push 8 Worker" width="800"/>
### Fetch Slide DDPG HER - 8 workers
```bash
python3 coach.py -p Fetch_DDPG_HER_baselines -lvl slide -n 8
```
<img src="fetch_ddpg_her_slide_8_workers.png" alt="Fetch DDPG HER Slide 8 Worker" width="800"/>
### Fetch Pick And Place DDPG HER - 8 workers
```bash
python3 coach.py -p Fetch_DDPG_HER -lvl pick_and_place -n 8
```
<img src="fetch_ddpg_her_pick_and_place_8_workers.png" alt="Fetch DDPG HER Pick And Place 8 Workers" width="800"/>

benchmarks/dfp/README.md

@@ -0,0 +1,31 @@
# DFP
Each experiment uses 3 seeds.
The parameters used for DFP are the same parameters as described in the [original paper](https://arxiv.org/abs/1611.01779).
### Doom Basic DFP - 8 workers
```bash
python3 coach.py -p Doom_Basic_DFP -n 8
```
<img src="doom_basic_dfp_8_workers.png" alt="Doom Basic DFP 8 workers" width="800"/>
### Doom Health (D1: Basic) DFP - 8 workers
```bash
python3 coach.py -p Doom_Health_DFP -n 8
```
<img src="doom_health_dfp_8_workers.png" alt="Doom Health DFP 8 workers" width="800"/>
### Doom Health Supreme (D2: Navigation) DFP - 8 workers
```bash
python3 coach.py -p Doom_Health_Supreme_DFP -n 8
```
<img src="doom_health_supreme_dfp_8_workers.png" alt="Doom Health Supreme DFP 8 workers" width="800"/>

benchmarks/dqn/README.md

@@ -0,0 +1,14 @@
# DQN
Each experiment uses 3 seeds.
The parameters used for DQN are the same parameters as described in the [original paper](https://arxiv.org/abs/1312.5602).
### Breakout DQN - single worker
```bash
python3 coach.py -p Atari_DQN -lvl breakout
```
<img src="breakout_dqn.png" alt="Breakout DQN" width="800"/>


@@ -0,0 +1,14 @@
# Dueling DDQN
Each experiment uses 3 seeds and is trained for 10k environment steps.
The parameters used for Dueling DDQN are the same parameters as described in the [original paper](https://arxiv.org/abs/1511.06581).
### Breakout Dueling DDQN - single worker
```bash
python3 coach.py -p Atari_Dueling_DDQN -lvl breakout
```
<img src="breakout_dueling_ddqn.png" alt="Breakout Dueling DDQN" width="800"/>


@@ -0,0 +1,31 @@
# Dueling DDQN with Prioritized Experience Replay
Each experiment uses 3 seeds and is trained for 10k environment steps.
The parameters used for Dueling DDQN with PER are the same parameters as described in the [following paper](https://arxiv.org/abs/1511.05952).
### Breakout Dueling DDQN with PER - single worker
```bash
python3 coach.py -p Atari_Dueling_DDQN_with_PER_OpenAI -lvl breakout
```
<img src="breakout_dueling_ddqn_with_per.png" alt="Breakout Dueling DDQN with PER" width="800"/>
### Pong Dueling DDQN with PER - single worker
```bash
python3 coach.py -p Atari_Dueling_DDQN_with_PER_OpenAI -lvl pong
```
<img src="pong_dueling_ddqn_with_per.png" alt="Pong Dueling DDQN with PER" width="800"/>
### Space Invaders Dueling DDQN with PER - single worker
```bash
python3 coach.py -p Atari_Dueling_DDQN_with_PER_OpenAI -lvl space_invaders
```
<img src="space_invaders_dueling_ddqn_with_per.png" alt="Space Invaders Dueling DDQN with PER" width="800"/>
