Mirror of https://github.com/gryf/coach.git
synced 2025-12-17 19:20:19 +01:00
Robosuite exploration (#478)
* Add Robosuite parameters for all env types + initialize env flow
* Init flow done
* Rest of Environment API complete for RobosuiteEnvironment
* RobosuiteEnvironment changes
* Observation stacking filter
* Add proper frame_skip in addition to control_freq
* Hardcode Coach rendering to the 'frontview' camera
* Robosuite_Lift_DDPG preset + Robosuite env updates
* Move observation stacking filter from env to preset
* Pre-process observation: concatenate depth map (if it exists) to the image, and object state (if it exists) to the robot state
* Preset parameters based on Surreal DDPG parameters, taken from: https://github.com/SurrealAI/surreal/blob/master/surreal/main/ddpg_configs.py
* RobosuiteEnvironment fixes - now working with PyGame rendering
* Preset minor modifications
* ObservationStackingFilter - option to concat non-vector observations
* Consider frame skip when setting the horizon in the Robosuite env
* Robosuite lift preset - update heatup length and training interval
* Robosuite env - change control_freq to 10 to match Surreal usage
* Robosuite clipped PPO preset
* Distribute multiple workers (-n #) over multiple GPUs
* Clipped PPO memory optimization from @shadiendrawis
* Fixes to evaluation-only workers
* RoboSuite_ClippedPPO: update training interval
* Undo last commit (update training interval)
* Fix "double-negative" if conditions
* Multi-agent single-trainer (MAST) clipped PPO training with CartPole
* Cleanups (not done yet) + roughly tuned hyper-parameters for MAST
* Switch to Robosuite v1 APIs
* Change presets to the IK controller
* More cleanups + enable evaluation worker + better logging
* RoboSuite_Lift_ClippedPPO updates
* Fix major bug in obs normalization filter setup
* Reduce coupling between the Robosuite API and the Coach environment:
  * Only non task-specific parameters are explicitly defined in Coach
  * Removed a bunch of enums of Robosuite elements, using simple strings instead
  * With this change, new environments/robots/controllers in Robosuite can be used immediately in Coach
* MAST: better logging of actor-trainer interaction + bug fixes + performance improvements. Still missing: fixed pub/sub for obs normalization running stats + logging for trainer signals
* LSTM support for PPO
* Set JOINT_VELOCITY action space by default + fix for EveryNEpisodes video dump filter + new TaskIDDumpFilter + allow OR between video dump filters
* Separate Robosuite clipped PPO preset for the non-MAST case
* Add flatten layer to architectures and use it in Robosuite presets. This is required for embedders that mix conv and dense layers (TODO: add MXNet implementation)
* Publish running_stats together with the published policy + hyper-parameter for when to publish a policy + cleanups
* Bug fix for a memory leak in MAST
* Bugfix: return value in TF BatchnormActivationDropout.to_tf_instance
* Explicit activations in the embedder scheme so there's no ReLU after flatten
* Add clipped PPO heads with configurable dense layers at the beginning:
  * This is a workaround needed to mimic Surreal-PPO, where the CNN and LSTM are shared between actor and critic but the FC layers are not
  * Added a "SchemeBuilder" class, currently only used for the new heads, but Middleware and Embedder implementations could be changed to use it as well
* Video dump setting fix in the basic preset
* Log screen output to a file
* Coach starts the redis-server for a MAST run
* Trainer drops off-policy data + the old policy in ClippedPPO updates only after a policy was published + log free memory stats + actors check for a new policy only at the beginning of a new episode + fixed a bug where the trainer logged "Training Reward = 0", causing dashboard to incorrectly display the signal
* Add missing set_internal_state function in TFSharedRunningStats
* Robosuite preset - use SingleLevelSelection instead of a hard-coded level
* Policy ID published directly on Redis
* Small fix when writing to the log file
* Major bugfix in Robosuite presets - pass dense sizes to heads
* RoboSuite_Lift_ClippedPPO hyper-parameters update
* Add horizon and value bootstrap to the GAE calculation, fix A3C with LSTM
* Adam hyper-parameters from MuJoCo
* Updated MAST preset with the IK_POSE_POS controller
* Configurable initialization for the policy stdev + custom extra noise per actor + logging of the policy stdev to dashboard
* Value loss weighting of 0.5
* Minor fixes + presets
* Bug fix for MAST where the old policy in the trainer kept updating every training iteration while it should only update after every policy publish
* Bug fix: reset_internal_state was not called by the trainer
* Bug fixes in the LSTM flow + some hyper-parameter adjustments for CartPole_ClippedPPO_LSTM -> training and sometimes reaches 200
* Adding back the horizon hyper-parameter - a messy commit
* Another bug fix missing from the previous commit
* Set control_freq=2 to match action_scale=0.125
* ClippedPPO with MAST cleanups and some preparations for TD3 with MAST
* TD3 presets. RoboSuite_Lift_TD3 seems to work well with multi-process runs (-n 8)
* Set termination on collision to be on by default
* Bug fix following the prev-prev commit
* Initial cube exploration environment with TD3
* Bug fix + minor refactoring
* Several parameter changes and RND debugging
* Robosuite Gym wrapper + rename TD3_Random* -> Random*
* Algorithm update
* Add Robosuite v1 env + presets (to eventually replace the non-v1 ones)
* Remove grasping presets, keep only the v1 exploration presets (without the V1 tag)
* Keep just the Robosuite v1 env as the 'robosuite_environment' module
* Exclude Robosuite and MAST presets from integration tests
* Exclude LSTM and MAST presets from golden tests
* Fix mistakenly removed import
* Revert debug changes in ReaderWriterLock
* Try another way to exclude LSTM/MAST golden tests
* Remove debug prints
* Remove PreDense heads, unused in the end
* Missed removing an instance of the PreDense head
* Remove MAST, not required for this PR
* Undo unused concat option in ObservationStackingFilter
* Remove LSTM updates, not required in this PR
* Update README.md
* Code changes for the exploration flow to work with the robosuite master branch
* Code cleanup + documentation
* Jupyter tutorial for the goal-based exploration + scatter plot
* Typo fix
* Update README.md
* Separate parameter for the obs-goal observation + small fixes
* Code clarity fixes
* Adjustment in tutorial 5
* Update tutorial
* Update tutorial

Co-authored-by: Guy Jacob <guy.jacob@intel.com>
Co-authored-by: Gal Leibovich <gal.leibovich@intel.com>
Co-authored-by: shadi.endrawis <sendrawi@aipg-ra-skx-03.ra.intel.com>
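For context, the presets added in this PR are launched like any other Coach preset. A hedged example based only on the preset names and the `-n` flag mentioned above (exact preset names may differ in the final tree):

    coach -p RoboSuite_Lift_TD3 -n 8          # multi-process TD3 exploration run
    coach -p RoboSuite_Lift_ClippedPPO -n 8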
This commit is contained in:

Changed files (shown below): README.md (15 lines changed)
@@ -45,6 +45,7 @@ coach -p CartPole_DQN -r
   * [Distributed Multi-Node Coach](#distributed-multi-node-coach)
   * [Batch Reinforcement Learning](#batch-reinforcement-learning)
 - [Supported Environments](#supported-environments)
+  * [Note on MuJoCo version](#note-on-mujoco-version)
 - [Supported Algorithms](#supported-algorithms)
 - [Citation](#citation)
 - [Contact](#contact)
@@ -202,7 +203,7 @@ There are [example](https://github.com/IntelLabs/coach/blob/master/rl_coach/pres
 
 * *OpenAI Gym:*
 
-  Installed by default by Coach's installer
+  Installed by default by Coach's installer (see note on MuJoCo version [below](#note-on-mujoco-version)).
 
 * *ViZDoom:*
 
@@ -258,6 +259,18 @@ There are [example](https://github.com/IntelLabs/coach/blob/master/rl_coach/pres
 
   https://github.com/deepmind/dm_control
 
+* *Robosuite:*<a name="robosuite"></a>
+
+  **__Note:__ To use Robosuite-based environments, please install Coach from the latest cloned repository. It is not yet available as part of the `rl_coach` package on PyPI.**
+
+  Follow the instructions described in the [robosuite documentation](https://robosuite.ai/docs/installation.html) (see note on MuJoCo version [below](#note-on-mujoco-version)).
+
+### Note on MuJoCo version
+
+OpenAI Gym supports MuJoCo only up to version 1.5 (and the corresponding mujoco-py version 1.50.x.x). The Robosuite simulation framework, however, requires MuJoCo version 2.0 (and the corresponding mujoco-py version 2.0.2.9, as of robosuite version 1.2). Therefore, if you wish to run both Gym-based MuJoCo environments and Robosuite environments, it's recommended to have a separate virtual environment for each.
+
+Please note that all Gym-based MuJoCo presets in Coach (`rl_coach/presets/Mujoco_*.py`) have been validated _**only**_ with MuJoCo 1.5 (including the reported [benchmark results](benchmarks)).
+
 
 ## Supported Algorithms
 
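A quick way to confirm which MuJoCo binding is active in a given virtual environment (useful when keeping the Gym/MuJoCo 1.5 and Robosuite/MuJoCo 2.0 setups separate, as the note above recommends); a minimal sketch, not part of the diff:

    # Prints the installed mujoco-py version: 1.50.x.x for the Gym-based presets,
    # 2.0.2.x for Robosuite (as of robosuite 1.2).
    import pkg_resources
    print(pkg_resources.get_distribution('mujoco-py').version)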
@@ -14,3 +14,4 @@ redis>=2.10.6
 minio>=4.0.5
 pytest>=3.8.2
 psutil>=5.5.0
+joblib>=0.17.0
@@ -257,7 +257,6 @@ class Agent(AgentInterface):

        :return: None
        """
-
        # Loading a memory from a CSV file, requires an input filter to filter through the data.
        # The filter needs a session before it can be used.
        if self.ap.memory.load_memory_from_file_path:

@@ -418,6 +417,7 @@ class Agent(AgentInterface):
        self.num_successes_across_evaluation_episodes = 0
        self.num_evaluation_episodes_completed = 0

+       if self.ap.task_parameters.evaluate_only is None:
            # TODO verbosity was mistakenly removed from task_parameters on release 0.11.0, need to bring it back
            # if self.ap.is_a_highest_level_agent or self.ap.task_parameters.verbosity == "high":
            if self.ap.is_a_highest_level_agent:

@@ -439,6 +439,7 @@ class Agent(AgentInterface):
                                        "Success Rate",
                                        success_rate)

+       if self.ap.task_parameters.evaluate_only is None:
            # TODO verbosity was mistakenly removed from task_parameters on release 0.11.0, need to bring it back
            # if self.ap.is_a_highest_level_agent or self.ap.task_parameters.verbosity == "high":
            if self.ap.is_a_highest_level_agent:

@@ -568,7 +569,7 @@ class Agent(AgentInterface):
        for transition in self.current_episode_buffer.transitions:
            self.discounted_return.add_sample(transition.n_step_discounted_rewards)

-       if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only:
+       if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only is not None:
            self.current_episode += 1

        if self.phase != RunPhase.TEST:

@@ -828,7 +829,7 @@ class Agent(AgentInterface):
            return None

        # count steps (only when training or if we are in the evaluation worker)
-       if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only:
+       if self.phase != RunPhase.TEST or self.ap.task_parameters.evaluate_only is not None:
            self.total_steps_counter += 1
            self.current_episode_steps_counter += 1
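The hunks above replace truthiness checks on `task_parameters.evaluate_only` with explicit `is not None` checks (the "double-negative" fix from the commit message). A minimal sketch of the difference, assuming `evaluate_only` is None on training workers and an integer step budget (possibly 0) on evaluation-only workers:

    # Assumed semantics: None -> regular training worker,
    # integer -> evaluation-only worker (0 could mean "evaluate indefinitely").
    def is_evaluation_worker(evaluate_only):
        buggy = bool(evaluate_only)          # wrongly treats evaluate_only == 0 as "not evaluation"
        fixed = evaluate_only is not None    # any non-None value marks an evaluation worker
        return buggy, fixed

    print(is_evaluation_worker(None))  # (False, False)
    print(is_evaluation_worker(10))    # (True, True)
    print(is_evaluation_worker(0))     # (False, True) -> only the explicit check is correct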
@@ -15,6 +15,7 @@
#

import copy
+import math
from collections import OrderedDict
from random import shuffle
from typing import Union

@@ -156,8 +157,17 @@ class ClippedPPOAgent(ActorCriticAgent):
    def fill_advantages(self, batch):
        network_keys = self.ap.network_wrappers['main'].input_embedders_parameters.keys()

-       current_state_values = self.networks['main'].online_network.predict(batch.states(network_keys))[0]
-       current_state_values = current_state_values.squeeze()
+       state_values = []
+       for i in range(int(batch.size / self.ap.network_wrappers['main'].batch_size) + 1):
+           start = i * self.ap.network_wrappers['main'].batch_size
+           end = (i + 1) * self.ap.network_wrappers['main'].batch_size
+           if start == batch.size:
+               break
+
+           state_values.append(self.networks['main'].online_network.predict(
+               {k: v[start:end] for k, v in batch.states(network_keys).items()})[0])
+
+       current_state_values = np.concatenate(state_values)
        self.state_values.add_sample(current_state_values)

        # calculate advantages

@@ -213,9 +223,7 @@ class ClippedPPOAgent(ActorCriticAgent):
                        self.networks['main'].online_network.output_heads[1].likelihood_ratio,
                        self.networks['main'].online_network.output_heads[1].clipped_likelihood_ratio]

-           # TODO-fixme if batch.size / self.ap.network_wrappers['main'].batch_size is not an integer, we do not train on
-           # some of the data
-           for i in range(int(batch.size / self.ap.network_wrappers['main'].batch_size)):
+           for i in range(math.ceil(batch.size / self.ap.network_wrappers['main'].batch_size)):
                start = i * self.ap.network_wrappers['main'].batch_size
                end = (i + 1) * self.ap.network_wrappers['main'].batch_size
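The `fill_advantages` change above is the "Clipped PPO memory optimization": state values are predicted in network-sized mini-batches instead of one large forward pass, and the training loop now uses `math.ceil` so the final partial mini-batch is no longer dropped. A standalone sketch of the chunking pattern (toy `predict`, sizes made up):

    import math
    import numpy as np

    def predict(states):
        # stand-in for online_network.predict(); returns one value per state
        return states.sum(axis=1, keepdims=True)

    batch_states = np.random.rand(1050, 4)   # batch.size = 1050
    batch_size = 256                         # network batch size

    chunks = []
    for i in range(math.ceil(len(batch_states) / batch_size)):   # 5 chunks: 4 full + 1 remainder of 26
        start, end = i * batch_size, (i + 1) * batch_size
        chunks.append(predict(batch_states[start:end]))

    state_values = np.concatenate(chunks)
    assert len(state_values) == len(batch_states)   # int() instead of ceil() would have silently lost 26 samples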
rl_coach/agents/td3_exp_agent.py (new file, 410 lines)
@@ -0,0 +1,410 @@
#
# Copyright (c) 2019 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import copy
from typing import Union
from collections import OrderedDict
from random import shuffle
import os
from PIL import Image
import joblib

import numpy as np

from rl_coach.agents.agent import Agent
from rl_coach.agents.td3_agent import TD3Agent, TD3CriticNetworkParameters, TD3ActorNetworkParameters, \
    TD3AlgorithmParameters, TD3AgentExplorationParameters
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters
from rl_coach.base_parameters import NetworkParameters, AgentParameters, MiddlewareScheme
from rl_coach.core_types import Transition, Batch
from rl_coach.memories.episodic.episodic_experience_replay import EpisodicExperienceReplayParameters
from rl_coach.architectures.middleware_parameters import FCMiddlewareParameters
from rl_coach.architectures.head_parameters import RNDHeadParameters
from rl_coach.utilities.shared_running_stats import NumpySharedRunningStats
from rl_coach.logger import screen
from rl_coach.exploration_policies.e_greedy import EGreedyParameters
from rl_coach.schedules import LinearSchedule


class RNDNetworkParameters(NetworkParameters):
    def __init__(self):
        super().__init__()
        self.input_embedders_parameters = {'observation': InputEmbedderParameters(activation_function='leaky_relu',
                                                                                   input_rescaling={'image': 1.0})}
        self.middleware_parameters = FCMiddlewareParameters(scheme=MiddlewareScheme.Empty)
        self.heads_parameters = [RNDHeadParameters()]
        self.create_target_network = False
        self.optimizer_type = 'Adam'
        self.batch_size = 100
        self.learning_rate = 0.0001
        self.should_get_softmax_probabilities = False


class TD3ExplorationAlgorithmParameters(TD3AlgorithmParameters):
    """
    :param rnd_sample_size: (int)
        The number of states in each RND training iteration.

    :param rnd_batch_size: (int)
        Batch size for the RND optimization cycle.

    :param rnd_optimization_epochs: (int)
        Number of epochs for the RND optimization cycle.

    :param td3_training_ratio: (float)
        The ratio between TD3 training steps and the number of steps in each episode (must be a positive number).

    :param identity_goal_sample_rate: (float)
        For the goal-based agent, this number indicates the probability to sample a goal that is the identity
        (must be a number between 0 and 1).

    :param env_obs_key: (str)
        The name of the state key for the camera observation from the environment.

    :param agent_obs_key: (str)
        The name of the state key for the camera observation for the agent. This key has to be different
        from env_obs_key in case the agent modifies the observation from the environment. For example,
        the goal-based agent concatenates a goal image to the image observation from the environment.

    :param replay_buffer_save_steps: (int)
        The number of steps to periodically save the replay buffer.

    :param replay_buffer_save_path: (str or None)
        A path to save the replay buffer to. If set to None, the replay buffer will be saved in the
        experiment directory.
    """
    def __init__(self):
        super().__init__()
        self.rnd_sample_size = 2000
        self.rnd_batch_size = 500
        self.rnd_optimization_epochs = 4
        self.td3_training_ratio = 1.0
        self.identity_goal_sample_rate = 0.0
        self.env_obs_key = 'camera'
        self.agent_obs_key = 'camera'
        self.replay_buffer_save_steps = 25000
        self.replay_buffer_save_path = None


class TD3ExplorationAgentParameters(AgentParameters):
    def __init__(self):
        td3_exp_algorithm_params = TD3ExplorationAlgorithmParameters()
        super().__init__(algorithm=td3_exp_algorithm_params,
                         exploration=TD3AgentExplorationParameters(),
                         memory=EpisodicExperienceReplayParameters(),
                         networks=OrderedDict([("actor", TD3ActorNetworkParameters()),
                                               ("critic",
                                                TD3CriticNetworkParameters(td3_exp_algorithm_params.num_q_networks)),
                                               ("predictor", RNDNetworkParameters()),
                                               ("constant", RNDNetworkParameters())]))

    @property
    def path(self):
        return 'rl_coach.agents.td3_exp_agent:TD3ExplorationAgent'


class TD3ExplorationAgent(TD3Agent):
    def __init__(self, agent_parameters, parent: Union['LevelManager', 'CompositeAgent']=None):
        super().__init__(agent_parameters, parent)
        self.rnd_stats = NumpySharedRunningStats(name='RND_normalization', epsilon=1e-8)
        self.rnd_stats.set_params()
        self.rnd_obs_stats = NumpySharedRunningStats(name='RND_observation_normalization', epsilon=1e-8)
        self.intrinsic_returns_estimate = None

    def update_intrinsic_returns_estimate(self, rewards):
        returns = np.zeros_like(rewards)
        for i, r in enumerate(rewards):
            if self.intrinsic_returns_estimate is None:
                self.intrinsic_returns_estimate = r
            else:
                self.intrinsic_returns_estimate = \
                    self.intrinsic_returns_estimate * self.ap.algorithm.discount + r
            returns[i] = self.intrinsic_returns_estimate
        return returns

    def prepare_rnd_inputs(self, batch):
        env_obs_key = self.ap.algorithm.env_obs_key
        next_states = batch.next_states([env_obs_key])
        inputs = {env_obs_key: self.rnd_obs_stats.normalize(next_states[env_obs_key])}
        return inputs

    def handle_self_supervised_reward(self, batch):
        """
        Allows agents to update the batch for self supervised learning

        :param batch: original training batch
        :return: updated training batch
        """
        return batch

    def update_transition_before_adding_to_replay_buffer(self, transition: Transition) -> Transition:
        """
        Allows agents to update the transition just before adding it to the replay buffer.
        Can be useful for agents that want to tweak the reward, termination signal, etc.

        :param transition: the transition to update
        :return: the updated transition
        """
        transition = super().update_transition_before_adding_to_replay_buffer(transition)
        image = np.array(transition.state[self.ap.algorithm.env_obs_key])
        if self.rnd_obs_stats.n < 1:
            self.rnd_obs_stats.set_params(shape=image.shape, clip_values=[-5, 5])
        self.rnd_obs_stats.push_val(np.expand_dims(image, 0))
        return transition

    def train_rnd(self):
        if self.memory.num_transitions() == 0:
            return

        transitions = self.memory.transitions[-self.ap.algorithm.rnd_sample_size:]
        dataset = Batch(transitions)
        dataset_order = list(range(dataset.size))
        batch_size = self.ap.algorithm.rnd_batch_size
        for epoch in range(self.ap.algorithm.rnd_optimization_epochs):
            shuffle(dataset_order)
            total_loss = 0
            total_grads = 0
            for i in range(int(dataset.size / batch_size)):
                start = i * batch_size
                end = (i + 1) * batch_size

                batch = Batch(list(np.array(dataset.transitions)[dataset_order[start:end]]))
                inputs = self.prepare_rnd_inputs(batch)

                const_embedding = self.networks['constant'].online_network.predict(inputs)

                res = self.networks['predictor'].train_and_sync_networks(inputs, [const_embedding])

                total_loss += res[0]
                total_grads += res[2]

            screen.log_dict(
                OrderedDict([
                    ("training epoch", epoch),
                    ("dataset size", dataset.size),
                    ("mean loss", total_loss / dataset.size),
                    ("mean gradients", total_grads / dataset.size)
                ]),
                prefix="RND Training"
            )

    def learn_from_batch(self, batch):
        batch = self.handle_self_supervised_reward(batch)
        return super().learn_from_batch(batch)

    def train(self):
        self.ap.algorithm.num_consecutive_training_steps = \
            int(self.current_episode_steps_counter * self.ap.algorithm.td3_training_ratio)
        return Agent.train(self)

    def calculate_novelty(self, batch):
        inputs = self.prepare_rnd_inputs(batch)
        embedding = self.networks['constant'].online_network.predict(inputs)
        prediction = self.networks['predictor'].online_network.predict(inputs)
        prediction_error = np.mean((embedding - prediction) ** 2, axis=1)
        return prediction_error

    def save_replay_buffer(self, dir_path=None):
        if dir_path is None:
            dir_path = os.path.join(self.parent_level_manager.parent_graph_manager.task_parameters.experiment_path,
                                    'replay_buffer')
        if not os.path.exists(dir_path):
            os.mkdir(dir_path)

        path = os.path.join(dir_path, 'RB_{}.joblib.bz2'.format(type(self).__name__))
        joblib.dump(self.memory.get_all_complete_episodes(), path, compress=('bz2', 1))

        screen.log('Saved replay buffer to: \"{}\" - Number of transitions: {}'.format(path,
                                                                                       self.memory.num_transitions()))

    def handle_episode_ended(self) -> None:
        super().handle_episode_ended()

        if self.total_steps_counter % self.ap.algorithm.rnd_sample_size == 0:
            self.train_rnd()

        if self.total_steps_counter % self.ap.algorithm.replay_buffer_save_steps == 0:
            self.save_replay_buffer(self.ap.algorithm.replay_buffer_save_path)
            self.save_rnd_images(self.ap.algorithm.replay_buffer_save_path)

    def save_rnd_images(self, dir_path=None):
        if dir_path is None:
            dir_path = os.path.join(self.parent_level_manager.parent_graph_manager.task_parameters.experiment_path,
                                    'rnd_images')
        else:
            dir_path = os.path.join(dir_path, 'rnd_images')
        if not os.path.exists(dir_path):
            os.mkdir(dir_path)
        transitions = self.memory.transitions
        dataset = Batch(transitions)
        batch_size = self.ap.algorithm.rnd_batch_size
        novelties = []
        for i in range(int(dataset.size / batch_size)):
            start = i * batch_size
            end = (i + 1) * batch_size

            batch = Batch(dataset[start:end])
            novelty = self.calculate_novelty(batch)
            novelties.append(novelty)
        novelties = np.concatenate(novelties)
        sorted_indices = np.argsort(novelties)
        sample_indices = sorted_indices[np.round(np.linspace(0, len(sorted_indices) - 1, 100)).astype(np.uint32)]
        images = []
        for si in sample_indices:
            images.append(np.flip(transitions[si].next_state[self.ap.algorithm.env_obs_key], 0))
        rows = []
        for i in range(10):
            rows.append(np.hstack(images[(i * 10):((i + 1) * 10)]))
        image = np.vstack(rows)
        image = Image.fromarray(image)
        image.save('{}/{}_{}.jpeg'.format(dir_path, 'rnd_samples', len(transitions)))


class TD3IntrinsicRewardAgentParameters(TD3ExplorationAgentParameters):
    @property
    def path(self):
        return 'rl_coach.agents.td3_exp_agent:TD3IntrinsicRewardAgent'


class TD3IntrinsicRewardAgent(TD3ExplorationAgent):
    def __init__(self, agent_parameters, parent: Union['LevelManager', 'CompositeAgent']=None):
        super().__init__(agent_parameters, parent)

    def handle_self_supervised_reward(self, batch):
        novelty = self.calculate_novelty(batch)

        for i, t in enumerate(batch.transitions):
            t.reward = novelty[i] / self.rnd_stats.std[0]

        return batch

    def handle_episode_ended(self) -> None:
        super().handle_episode_ended()
        novelty = self.calculate_novelty(Batch(self.memory.get_last_complete_episode().transitions))
        self.rnd_stats.push_val(np.expand_dims(self.update_intrinsic_returns_estimate(novelty), -1))


class RandomAgentParameters(TD3ExplorationAgentParameters):
    def __init__(self):
        super().__init__()
        self.exploration = EGreedyParameters()
        self.exploration.epsilon_schedule = LinearSchedule(1.0, 1.0, 500000000)

    @property
    def path(self):
        return 'rl_coach.agents.td3_exp_agent:RandomAgent'


class RandomAgent(TD3ExplorationAgent):
    def __init__(self, agent_parameters, parent: Union['LevelManager', 'CompositeAgent']=None):
        super().__init__(agent_parameters, parent)
        self.ap.algorithm.periodic_exploration_noise = None
        self.ap.algorithm.rnd_sample_size = 100000000000

    def train(self):
        return 0


class TD3GoalBasedAgentParameters(TD3ExplorationAgentParameters):
    @property
    def path(self):
        return 'rl_coach.agents.td3_exp_agent:TD3GoalBasedAgent'


class TD3GoalBasedAgent(TD3ExplorationAgent):
    def __init__(self, agent_parameters, parent: Union['LevelManager', 'CompositeAgent']=None):
        super().__init__(agent_parameters, parent)
        self.goal = None
        self.ap.algorithm.use_non_zero_discount_for_terminal_states = False

    def concat_goal(self, state, goal_state):
        ret = np.concatenate([state[self.ap.algorithm.env_obs_key], goal_state[self.ap.algorithm.env_obs_key]], axis=2)
        return ret

    def handle_self_supervised_reward(self, batch):
        batch_size = self.ap.network_wrappers['actor'].batch_size
        episode_indices = np.random.randint(self.memory.num_complete_episodes(), size=batch_size)
        transitions = []
        for e_idx in episode_indices:
            episode = self.memory.get_all_complete_episodes()[e_idx]
            transition_idx = np.random.randint(episode.length())
            t = copy.copy(episode[transition_idx])
            if np.random.rand(1) < self.ap.algorithm.identity_goal_sample_rate:
                t.state[self.ap.algorithm.agent_obs_key] = self.concat_goal(t.state, t.state)
                # this doesn't matter for learning but is set anyway so that the agent can pass it through the network
                t.next_state[self.ap.algorithm.agent_obs_key] = self.concat_goal(t.next_state, t.state)
                t.game_over = True
                t.reward = 0
                t.action = np.zeros_like(t.action)
            else:
                if transition_idx == episode.length() - 1:
                    goal = t
                    t.state[self.ap.algorithm.agent_obs_key] = self.concat_goal(t.state, t.next_state)
                    t.next_state[self.ap.algorithm.agent_obs_key] = self.concat_goal(t.next_state, t.next_state)
                else:
                    goal_idx = np.random.randint(transition_idx, episode.length())
                    goal = episode.transitions[goal_idx]
                    t.state[self.ap.algorithm.agent_obs_key] = self.concat_goal(t.state, episode.transitions[goal_idx].next_state)
                    t.next_state[self.ap.algorithm.agent_obs_key] = self.concat_goal(t.next_state,
                                                                                     episode.transitions[goal_idx].next_state)

                camera_equal = np.alltrue(np.equal(t.next_state[self.ap.algorithm.env_obs_key],
                                                   goal.next_state[self.ap.algorithm.env_obs_key]))
                measurements_equal = np.alltrue(np.isclose(t.next_state['measurements'],
                                                           goal.next_state['measurements']))
                t.game_over = camera_equal and measurements_equal
                t.reward = -1

            transitions.append(t)

        return Batch(transitions)

    def choose_action(self, curr_state):
        if self.goal:
            curr_state[self.ap.algorithm.agent_obs_key] = self.concat_goal(curr_state, self.goal.next_state)
        else:
            curr_state[self.ap.algorithm.agent_obs_key] = self.concat_goal(curr_state, curr_state)

        return super().choose_action(curr_state)

    def generate_goal(self):
        if self.memory.num_transitions() == 0:
            return

        transitions = list(np.random.choice(self.memory.transitions,
                                            min(self.ap.algorithm.rnd_sample_size,
                                                self.memory.num_transitions()),
                                            replace=False))
        dataset = Batch(transitions)
        batch_size = self.ap.algorithm.rnd_batch_size
        self.goal = dataset[0]

        max_novelty = 0
        for i in range(int(dataset.size / batch_size)):
            start = i * batch_size
            end = (i + 1) * batch_size

            novelty = self.calculate_novelty(Batch(dataset[start:end]))

            curr_max = np.max(novelty)
            if curr_max > max_novelty:
                max_novelty = curr_max
                idx = start + np.argmax(novelty)
                self.goal = dataset[idx]

    def handle_episode_ended(self) -> None:
        super().handle_episode_ended()
        self.generate_goal()
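`calculate_novelty` above is standard Random Network Distillation: a trainable 'predictor' network is regressed onto the output of a fixed, randomly initialized 'constant' network, and the per-state prediction error serves as the novelty / intrinsic-reward signal (normalized by `rnd_stats` in `TD3IntrinsicRewardAgent`). A minimal numpy sketch of the idea, with toy linear networks standing in for the Coach network wrappers:

    import numpy as np

    rng = np.random.default_rng(0)
    obs_dim, emb_dim = 16, 8

    W_const = rng.normal(size=(obs_dim, emb_dim))   # fixed random "constant" network
    W_pred = np.zeros((obs_dim, emb_dim))           # trainable "predictor" network

    def novelty(states):
        target = states @ W_const
        prediction = states @ W_pred
        return np.mean((target - prediction) ** 2, axis=1)   # same reduction as calculate_novelty

    states = rng.normal(size=(32, obs_dim))
    print(novelty(states).mean())                   # high error -> states look novel

    # a few gradient steps on the MSE shrink the error on states the predictor has been trained on
    for _ in range(200):
        grad = states.T @ (states @ W_pred - states @ W_const) / len(states)
        W_pred -= 0.05 * grad
    print(novelty(states).mean())                   # much lower error on familiar states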
@@ -258,3 +258,9 @@ class TD3VHeadParameters(HeadParameters):
                         loss_weight=loss_weight)
        self.initializer = initializer
        self.output_bias_initializer = output_bias_initializer
+
+
+class RNDHeadParameters(HeadParameters):
+    def __init__(self, name: str = 'rnd_head_params', dense_layer=None, is_predictor=False):
+        super().__init__(parameterized_class_name="RNDHead", name=name, dense_layer=dense_layer)
+        self.is_predictor = is_predictor
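A hedged usage sketch of the new parameters class. This is hypothetical: in the agent code above both RND networks are created with the default arguments, and the extra FC layers controlled by `is_predictor` appear in the RNDHead module later in this diff.

    from rl_coach.architectures.head_parameters import RNDHeadParameters

    target_head = RNDHeadParameters()                       # conv trunk + single 512-unit output layer
    predictor_head = RNDHeadParameters(is_predictor=True)   # hypothetical: adds two 512-unit FC layers before the output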
rl_coach/architectures/tensorflow_components/heads/RND_head.py (new file)
@@ -0,0 +1,54 @@
#
# Copyright (c) 2019 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import tensorflow as tf
import numpy as np

from rl_coach.architectures.tensorflow_components.layers import Conv2d, BatchnormActivationDropout
from rl_coach.architectures.tensorflow_components.heads.head import Head, Orthogonal
from rl_coach.base_parameters import AgentParameters
from rl_coach.core_types import Embedding
from rl_coach.spaces import SpacesDefinition


class RNDHead(Head):
    def __init__(self, agent_parameters: AgentParameters, spaces: SpacesDefinition, network_name: str,
                 head_idx: int = 0, is_local: bool = True, is_predictor: bool = False):
        super().__init__(agent_parameters, spaces, network_name, head_idx, is_local)
        self.name = 'rnd_head'
        self.return_type = Embedding
        self.is_predictor = is_predictor
        self.activation_function = tf.nn.leaky_relu

        self.loss_type = tf.losses.mean_squared_error

    def _build_module(self, input_layer):
        weight_init = Orthogonal(gain=np.sqrt(2))
        input_layer = Conv2d(num_filters=32, kernel_size=8, strides=4)(input_layer, kernel_initializer=weight_init)
        input_layer = BatchnormActivationDropout(activation_function=self.activation_function)(input_layer)[-1]
        input_layer = Conv2d(num_filters=64, kernel_size=4, strides=2)(input_layer, kernel_initializer=weight_init)
        input_layer = BatchnormActivationDropout(activation_function=self.activation_function)(input_layer)[-1]
        input_layer = Conv2d(num_filters=64, kernel_size=3, strides=1)(input_layer, kernel_initializer=weight_init)
        input_layer = BatchnormActivationDropout(activation_function=self.activation_function)(input_layer)[-1]
        input_layer = tf.contrib.layers.flatten(input_layer)

        if self.is_predictor:
            input_layer = self.dense_layer(512)(input_layer, kernel_initializer=weight_init)
            input_layer = BatchnormActivationDropout(activation_function=tf.nn.relu)(input_layer)[-1]
            input_layer = self.dense_layer(512)(input_layer, kernel_initializer=weight_init)
            input_layer = BatchnormActivationDropout(activation_function=tf.nn.relu)(input_layer)[-1]

        self.output = self.dense_layer(512)(input_layer, name='output', kernel_initializer=weight_init)
@@ -19,6 +19,7 @@ from .cil_head import RegressionHead
from .td3_v_head import TD3VHead
from .ddpg_v_head import DDPGVHead
from .wolpertinger_actor_head import WolpertingerActorHead
+from .RND_head import RNDHead

__all__ = [
    'CategoricalQHead',

@@ -41,5 +42,6 @@ __all__ = [
    'RegressionHead',
    'TD3VHead',
    'DDPGVHead',
-   'WolpertingerActorHead'
+   'WolpertingerActorHead',
+   'RNDHead'
]
@@ -23,6 +23,7 @@ from rl_coach.spaces import SpacesDefinition
from rl_coach.utils import force_list
from rl_coach.architectures.tensorflow_components.utils import squeeze_tensor

+

# Used to initialize weights for policy and value output layers
def normalized_columns_initializer(std=1.0):
    def _initializer(shape, dtype=None, partition_info=None):

@@ -32,6 +33,29 @@ def normalized_columns_initializer(std=1.0):
    return _initializer


+# Used to initialize RND network parameters
+class Orthogonal(tf.initializers.orthogonal):
+    def __init__(self, gain=1.0):
+        super().__init__(gain=gain)
+
+    def __call__(self, shape, dtype=None, partition_info=None):
+        shape = tuple(shape)
+        if len(shape) == 2:
+            flat_shape = shape
+        elif len(shape) == 4:  # assumes NHWC
+            flat_shape = (np.prod(shape[:-1]), shape[-1])
+        else:
+            raise NotImplementedError
+        a = np.random.normal(0.0, 1.0, flat_shape)
+        u, _, v = np.linalg.svd(a, full_matrices=False)
+        q = u if u.shape == flat_shape else v  # pick the one with the correct shape
+        q = q.reshape(shape)
+        return (self.gain * q[:shape[0], :shape[1]]).astype(np.float32)
+
+    def get_config(self):
+        return {"gain": self.gain}
+
+
class Head(object):
    """
    A head is the final part of the network. It takes the embedding from the middleware embedder and passes it through
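The `Orthogonal` initializer added above builds a (semi-)orthogonal weight matrix via SVD and scales it by `gain`, the usual initialization for RND networks. A small numpy-only check of that property, mirroring the 2-D branch of `__call__` (not part of the diff):

    import numpy as np

    def orthogonal(shape, gain=np.sqrt(2)):
        # same construction as Orthogonal.__call__ for the 2-D case
        a = np.random.normal(0.0, 1.0, shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        q = u if u.shape == shape else v
        return (gain * q.reshape(shape)[:shape[0], :shape[1]]).astype(np.float32)

    w = orthogonal((512, 64))
    gram = w.T @ w                                          # should be gain^2 * identity
    print(np.allclose(gram, 2.0 * np.eye(64), atol=1e-3))   # True: columns are orthonormal, scaled by sqrt(2)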
@@ -109,7 +109,7 @@ class Conv2d(layers.Conv2d):
    def __init__(self, num_filters: int, kernel_size: int, strides: int):
        super(Conv2d, self).__init__(num_filters=num_filters, kernel_size=kernel_size, strides=strides)

-   def __call__(self, input_layer, name: str=None, is_training=None):
+   def __call__(self, input_layer, name: str=None, is_training=None, kernel_initializer=None):
        """
        returns a tensorflow conv2d layer
        :param input_layer: previous layer

@@ -117,7 +117,8 @@ class Conv2d(layers.Conv2d):
        :return: conv2d layer
        """
        return tf.layers.conv2d(input_layer, filters=self.num_filters, kernel_size=self.kernel_size,
-                               strides=self.strides, data_format='channels_last', name=name)
+                               strides=self.strides, data_format='channels_last', name=name,
+                               kernel_initializer=kernel_initializer)

    @staticmethod
    @reg_to_tf_instance(layers.Conv2d)

@@ -153,7 +154,7 @@ class BatchnormActivationDropout(layers.BatchnormActivationDropout):
    @staticmethod
    @reg_to_tf_instance(layers.BatchnormActivationDropout)
    def to_tf_instance(base: layers.BatchnormActivationDropout):
-       return BatchnormActivationDropout, BatchnormActivationDropout(
+       return BatchnormActivationDropout(
            batchnorm=base.batchnorm,
            activation_function=base.activation_function,
            dropout_rate=base.dropout_rate)
@@ -37,7 +37,8 @@ import subprocess
from glob import glob

from rl_coach.graph_managers.graph_manager import HumanPlayScheduleParameters, GraphManager
-from rl_coach.utils import list_all_presets, short_dynamic_import, get_open_port, SharedMemoryScratchPad, get_base_dir
+from rl_coach.utils import list_all_presets, short_dynamic_import, get_open_port, SharedMemoryScratchPad, \
+    get_base_dir, set_gpu
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.environments.environment import SingleLevelSelection
from rl_coach.memories.backend.redis import RedisPubSubMemoryBackendParameters

@@ -49,12 +50,40 @@ from rl_coach.data_stores.redis_data_store import RedisDataStoreParameters
from rl_coach.data_stores.data_store_impl import get_data_store, construct_data_store_params
from rl_coach.training_worker import training_worker
from rl_coach.rollout_worker import rollout_worker
+from rl_coach.schedules import *
+from rl_coach.exploration_policies.e_greedy import *


if len(set(failed_imports)) > 0:
    screen.warning("Warning: failed to import the following packages - {}".format(', '.join(set(failed_imports))))


+def _get_cuda_available_devices():
+    import ctypes
+
+    try:
+        devices = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
+        return [] if devices[0] == '' else [int(i) for i in devices]
+    except KeyError:
+        pass
+
+    try:
+        cuda_lib = ctypes.CDLL('libcuda.so')
+    except OSError:
+        return []
+
+    CUDA_SUCCESS = 0
+
+    num_gpus = ctypes.c_int()
+    result = cuda_lib.cuInit(0)
+    if result != CUDA_SUCCESS:
+        return []
+    result = cuda_lib.cuDeviceGetCount(ctypes.byref(num_gpus))
+    if result != CUDA_SUCCESS:
+        return []
+    return list(range(num_gpus.value))
+
+
def add_items_to_dict(target_dict, source_dict):
    updated_task_parameters = copy.copy(source_dict)
    updated_task_parameters.update(target_dict)

@@ -215,6 +244,8 @@ class CoachLauncher(object):
    and handle absolutely everything for a job.
    """

+   gpus = _get_cuda_available_devices()
+
    def launch(self):
        """
        Main entry point for the class, and the standard way to run coach from the command line.

@@ -440,6 +471,9 @@ class CoachLauncher(object):
            screen.warning("Exporting ONNX graphs requires setting the --checkpoint_save_secs flag. "
                           "The --export_onnx_graph will have no effect.")

+       if args.use_cpu or not CoachLauncher.gpus:
+           CoachLauncher.gpus = [None]
+
        return args

    def get_argument_parser(self) -> argparse.ArgumentParser:

@@ -609,9 +643,9 @@ class CoachLauncher(object):

        # Single-threaded runs
        if args.num_workers == 1:
-           self.start_single_threaded(task_parameters, graph_manager, args)
+           self.start_single_process(task_parameters, graph_manager, args)
        else:
-           self.start_multi_threaded(graph_manager, args)
+           self.start_multi_process(graph_manager, args)

    @staticmethod
    def create_task_parameters(graph_manager: 'GraphManager', args: argparse.Namespace):

@@ -669,12 +703,12 @@ class CoachLauncher(object):
        return task_parameters

    @staticmethod
-   def start_single_threaded(task_parameters, graph_manager: 'GraphManager', args: argparse.Namespace):
+   def start_single_process(task_parameters, graph_manager: 'GraphManager', args: argparse.Namespace):
        # Start the training or evaluation
        start_graph(graph_manager=graph_manager, task_parameters=task_parameters)

    @staticmethod
-   def start_multi_threaded(graph_manager: 'GraphManager', args: argparse.Namespace):
+   def start_multi_process(graph_manager: 'GraphManager', args: argparse.Namespace):
        total_tasks = args.num_workers
        if args.evaluation_worker:
            total_tasks += 1

@@ -695,7 +729,8 @@ class CoachLauncher(object):
                           "and not from a file. ")

        def start_distributed_task(job_type, task_index, evaluation_worker=False,
-                                  shared_memory_scratchpad=shared_memory_scratchpad):
+                                  shared_memory_scratchpad=shared_memory_scratchpad,
+                                  gpu_id=None):
            task_parameters = DistributedTaskParameters(
                framework_type=args.framework,
                parameters_server_hosts=ps_hosts,

@@ -715,6 +750,8 @@ class CoachLauncher(object):
                export_onnx_graph=args.export_onnx_graph,
                apply_stop_condition=args.apply_stop_condition
            )
+           if gpu_id is not None:
+               set_gpu(gpu_id)
            # we assume that only the evaluation workers are rendering
            graph_manager.visualization_parameters.render = args.render and evaluation_worker
            p = Process(target=start_graph, args=(graph_manager, task_parameters))

@@ -723,25 +760,30 @@ class CoachLauncher(object):
            return p

        # parameter server
-       parameter_server = start_distributed_task("ps", 0)
+       parameter_server = start_distributed_task("ps", 0, gpu_id=CoachLauncher.gpus[0])

        # training workers
        # wait a bit before spawning the non chief workers in order to make sure the session is already created
+       curr_gpu_idx = 0
        workers = []
-       workers.append(start_distributed_task("worker", 0))
+       workers.append(start_distributed_task("worker", 0, gpu_id=CoachLauncher.gpus[curr_gpu_idx]))
        time.sleep(2)
        for task_index in range(1, args.num_workers):
-           workers.append(start_distributed_task("worker", task_index))
+           curr_gpu_idx = (curr_gpu_idx + 1) % len(CoachLauncher.gpus)
+           workers.append(start_distributed_task("worker", task_index, gpu_id=CoachLauncher.gpus[curr_gpu_idx]))

        # evaluation worker
        if args.evaluation_worker or args.render:
-           evaluation_worker = start_distributed_task("worker", args.num_workers, evaluation_worker=True)
+           curr_gpu_idx = (curr_gpu_idx + 1) % len(CoachLauncher.gpus)
+           evaluation_worker = start_distributed_task("worker", args.num_workers, evaluation_worker=True,
+                                                      gpu_id=CoachLauncher.gpus[curr_gpu_idx])

        # wait for all workers
        [w.join() for w in workers]
        if args.evaluation_worker:
            evaluation_worker.terminate()
+       parameter_server.terminate()


class CoachInterface(CoachLauncher):
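These launcher changes implement the "distribute multiple workers (-n #) over multiple GPUs" item from the commit message: CUDA devices are detected once via `_get_cuda_available_devices`, `CoachLauncher.gpus` falls back to `[None]` on CPU-only machines, and the parameter server plus workers are assigned device ids round-robin through `set_gpu`. A standalone sketch of just the assignment logic (hypothetical helper, mirroring the loop above):

    def assign_gpus(num_workers, gpus, use_cpu=False):
        if use_cpu or not gpus:
            gpus = [None]                     # same fallback as CoachLauncher.gpus = [None]
        assignment = {"ps": gpus[0]}
        curr = 0
        assignment["worker_0"] = gpus[curr]
        for task_index in range(1, num_workers):
            curr = (curr + 1) % len(gpus)     # round-robin over the detected devices
            assignment["worker_{}".format(task_index)] = gpus[curr]
        return assignment

    print(assign_gpus(num_workers=5, gpus=[0, 1]))
    # {'ps': 0, 'worker_0': 0, 'worker_1': 1, 'worker_2': 0, 'worker_3': 1, 'worker_4': 0}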
@@ -47,20 +47,21 @@ class LevelSelection(object):


class SingleLevelSelection(LevelSelection):
-   def __init__(self, levels: Union[str, List[str], Dict[str, str]]):
+   def __init__(self, levels: Union[str, List[str], Dict[str, str]], force_lower=True):
        super().__init__(None)
        self.levels = levels
        if isinstance(levels, list):
            self.levels = {level: level for level in levels}
        if isinstance(levels, str):
            self.levels = {levels: levels}
+       self.force_lower = force_lower

    def __str__(self):
        if self.selected_level is None:
            logger.screen.error("No level has been selected. Please select a level using the -lvl command line flag, "
                                "or change the level in the preset. \nThe available levels are: \n{}"
                                .format(', '.join(sorted(self.levels.keys()))), crash=True)
-       selected_level = self.selected_level.lower()
+       selected_level = self.selected_level.lower() if self.force_lower else self.selected_level
        if selected_level not in self.levels.keys():
            logger.screen.error("The selected level ({}) is not part of the available levels ({})"
                                .format(selected_level, ', '.join(self.levels.keys())), crash=True)
rl_coach/environments/robosuite/cube_exp.py (new file, 187 lines; listing truncated below)
@@ -0,0 +1,187 @@
#
# Copyright (c) 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import numpy as np

from robosuite.utils.mjcf_utils import CustomMaterial

from robosuite.environments.manipulation.single_arm_env import SingleArmEnv
from robosuite.environments.manipulation.lift import Lift

from robosuite.models.arenas import TableArena
from robosuite.models.objects import BoxObject
from robosuite.models.tasks import ManipulationTask
from robosuite.utils.placement_samplers import UniformRandomSampler

TABLE_TOP_SIZE = (0.84, 1.25, 0.05)
TABLE_OFFSET = (0, 0, 0.82)


class CubeExp(Lift):
    """
    This class corresponds to multi-colored cube exploration for a single robot arm.
    """

    def __init__(
        self,
        robots,
        table_full_size=TABLE_TOP_SIZE,
        table_offset=TABLE_OFFSET,
        placement_initializer=None,
        penalize_reward_on_collision=False,
        end_episode_on_collision=False,
        **kwargs
    ):
        """
        Args:
            robots (str or list of str): Specification for specific robot arm(s) to be instantiated within this env
                (e.g: "Sawyer" would generate one arm; ["Panda", "Panda", "Sawyer"] would generate three robot arms)
                Note: Must be a single single-arm robot!

            table_full_size (3-tuple): x, y, and z dimensions of the table.

            placement_initializer (ObjectPositionSampler instance): if provided, will
                be used to place objects on every reset, else a UniformRandomSampler
                is used by default.

            Rest of kwargs follow the Lift class arguments
        """
        if placement_initializer is None:
            placement_initializer = UniformRandomSampler(
                name="ObjectSampler",
                x_range=[0.0, 0.0],
                y_range=[0.0, 0.0],
                rotation=(0.0, 0.0),
                ensure_object_boundary_in_range=False,
                ensure_valid_placement=True,
                reference_pos=table_offset,
                z_offset=0.9,
            )

        super().__init__(
            robots=robots,
            table_full_size=table_full_size,
            placement_initializer=placement_initializer,
            initialization_noise=None,
            **kwargs
        )

        self._max_episode_steps = self.horizon

    def _load_model(self):
        """
        Loads an xml model, puts it in self.model
        """
        SingleArmEnv._load_model(self)

        # Adjust base pose accordingly
        xpos = self.robots[0].robot_model.base_xpos_offset["table"](self.table_full_size[0])
        self.robots[0].robot_model.set_base_xpos(xpos)

        # load model for table top workspace
        mujoco_arena = TableArena(
            table_full_size=self.table_full_size,
            table_friction=self.table_friction,
            table_offset=self.table_offset,
        )

        # Arena always gets set to zero origin
        mujoco_arena.set_origin([0, 0, 0])

        cube_material = self._get_cube_material()
        self.cube = BoxObject(
            name="cube",
            size_min=(0.025, 0.025, 0.025),
            size_max=(0.025, 0.025, 0.025),
            rgba=[1, 0, 0, 1],
            material=cube_material,
        )

        self.placement_initializer.reset()
        self.placement_initializer.add_objects(self.cube)

        # task includes arena, robot, and objects of interest
        self.model = ManipulationTask(
            mujoco_arena=mujoco_arena,
            mujoco_robots=[robot.robot_model for robot in self.robots],
            mujoco_objects=self.cube,
        )

    @property
    def action_spec(self):
        """
        Action space (low, high) for this environment
        """
        low, high = super().action_spec
        return low[:3], high[:3]

    def _get_cube_material(self):
        from robosuite.utils.mjcf_utils import array_to_string
        rgba = (1, 0, 0, 1)
        cube_material = CustomMaterial(
            texture=rgba,
            tex_name="solid",
            mat_name="solid_mat",
        )
        cube_material.tex_attrib.pop('file')
|
||||||
|
cube_material.tex_attrib["type"] = "cube"
|
||||||
|
cube_material.tex_attrib["builtin"] = "flat"
|
||||||
|
cube_material.tex_attrib["rgb1"] = array_to_string(rgba[:3])
|
||||||
|
cube_material.tex_attrib["rgb2"] = array_to_string(rgba[:3])
|
||||||
|
cube_material.tex_attrib["width"] = "100"
|
||||||
|
cube_material.tex_attrib["height"] = "100"
|
||||||
|
|
||||||
|
return cube_material
|
||||||
|
|
||||||
|
def _reset_internal(self):
|
||||||
|
"""
|
||||||
|
Resets simulation internal configurations.
|
||||||
|
"""
|
||||||
|
from robosuite.utils.mjmod import Texture
|
||||||
|
|
||||||
|
super()._reset_internal()
|
||||||
|
|
||||||
|
self._action_dim = 3
|
||||||
|
|
||||||
|
geom_id = self.sim.model.geom_name2id('cube_g0_vis')
|
||||||
|
mat_id = self.sim.model.geom_matid[geom_id]
|
||||||
|
tex_id = self.sim.model.mat_texid[mat_id]
|
||||||
|
texture = Texture(self.sim.model, tex_id)
|
||||||
|
bitmap_to_set = texture.bitmap
|
||||||
|
bitmap = np.zeros_like(bitmap_to_set)
|
||||||
|
bitmap[:100, :, :] = 255
|
||||||
|
bitmap[100:200, :, 0] = 255
|
||||||
|
bitmap[200:300, :, 1] = 255
|
||||||
|
bitmap[300:400, :, 2] = 255
|
||||||
|
bitmap[400:500, :, :2] = 255
|
||||||
|
bitmap[500:, :, 1:] = 255
|
||||||
|
bitmap_to_set[:] = bitmap
|
||||||
|
for render_context in self.sim.render_contexts:
|
||||||
|
render_context.upload_texture(texture.id)
|
||||||
|
|
||||||
|
def _pre_action(self, action, policy_step=False):
|
||||||
|
""" explicitly shut the gripper """
|
||||||
|
joined_action = np.append(action, [0., 0., 0., 1.])
|
||||||
|
self._action_dim = 7
|
||||||
|
super()._pre_action(joined_action, policy_step)
|
||||||
|
|
||||||
|
def _post_action(self, action):
|
||||||
|
ret = super()._post_action(action)
|
||||||
|
self._action_dim = 3
|
||||||
|
return ret
|
||||||
|
|
||||||
|
def reward(self, action=None):
|
||||||
|
return 0
|
||||||
|
|
||||||
|
def _check_success(self):
|
||||||
|
return False
|
||||||
rl_coach/environments/robosuite/osc_pose.json (new file, 18 lines)
@@ -0,0 +1,18 @@
{
    "type": "OSC_POSE",
    "input_max": 1,
    "input_min": -1,
    "output_max": [0.125, 0.125, 0.125, 0.5, 0.5, 0.5],
    "output_min": [-0.125, -0.125, -0.125, -0.5, -0.5, -0.5],
    "kp": 150,
    "damping_ratio": 1,
    "impedance_mode": "fixed",
    "kp_limits": [0, 300],
    "damping_ratio_limits": [0, 10],
    "position_limits": [[-0.22, -0.35, 0.82], [0.22, 0.35, 1.3]],
    "orientation_limits": null,
    "uncouple_pos_ori": true,
    "control_delta": true,
    "interpolation": null,
    "ramp_ratio": 0.2
}
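Note (not part of the diff): a minimal sketch of how a custom controller file like this one is consumed. It mirrors the load_controller_config() call made by RobosuiteEnvironment below; the file path and the 'Lift'/'Panda' choices are assumptions for illustration only.

    import robosuite

    # Load the OSC_POSE controller definition from the JSON above and hand it to robosuite.make(),
    # the same way RobosuiteEnvironment passes it via the 'controller_configs' kwarg.
    controller_cfg = robosuite.controllers.load_controller_config(
        custom_fpath='./rl_coach/environments/robosuite/osc_pose.json')
    env = robosuite.make('Lift', robots='Panda', controller_configs=controller_cfg,
                         has_renderer=False, has_offscreen_renderer=False, use_camera_obs=False)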
rl_coach/environments/robosuite_environment.py (new file, 321 lines)
@@ -0,0 +1,321 @@
#
# Copyright (c) 2020 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from typing import Union, Dict, Any
from enum import Enum, Flag, auto
from copy import deepcopy
import numpy as np
import random
from collections import namedtuple

try:
    import robosuite
    from robosuite.wrappers import Wrapper, DomainRandomizationWrapper
except ImportError:
    from rl_coach.logger import failed_imports
    failed_imports.append("Robosuite")

from rl_coach.base_parameters import Parameters, VisualizationParameters
from rl_coach.environments.environment import Environment, EnvironmentParameters, LevelSelection
from rl_coach.spaces import BoxActionSpace, VectorObservationSpace, StateSpace, PlanarMapsObservationSpace

# Importing our custom Robosuite environments here so that they are properly
# registered in Robosuite, and so recognized by 'robosuite.make()' and included
# in 'robosuite.ALL_ENVIRONMENTS'
import rl_coach.environments.robosuite.cube_exp


robosuite_environments = list(robosuite.ALL_ENVIRONMENTS)
robosuite_robots = list(robosuite.ALL_ROBOTS)
robosuite_controllers = list(robosuite.ALL_CONTROLLERS)


def get_robosuite_env_extra_parameters(env_name: str):
    import inspect
    assert env_name in robosuite_environments

    env_params = inspect.signature(robosuite.environments.REGISTERED_ENVS[env_name]).parameters
    base_params = list(RobosuiteBaseParameters().env_kwargs_dict().keys()) + ['robots', 'controller_configs']
    return {n: p.default for n, p in env_params.items() if n not in base_params}


class OptionalObservations(Flag):
    NONE = 0
    CAMERA = auto()
    OBJECT = auto()


class RobosuiteBaseParameters(Parameters):
    def __init__(self, optional_observations: OptionalObservations = OptionalObservations.NONE):
        super(RobosuiteBaseParameters, self).__init__()

        # NOTE: Attribute names should exactly match the attribute names in Robosuite

        self.horizon = 1000         # Every episode lasts for exactly horizon timesteps
        self.ignore_done = True     # True if never terminating the environment (ignore horizon)
        self.reward_shaping = True  # if True, use dense rewards.

        # How many control signals to receive in every simulated second. This sets the amount of simulation time
        # that passes between every action input (this is NOT the same as frame_skip)
        self.control_freq = 10

        # Optional observations (robot state is always returned)
        # if True, every observation includes a rendered image
        self.use_camera_obs = bool(optional_observations & OptionalObservations.CAMERA)
        # if True, include object (cube/etc.) information in the observation
        self.use_object_obs = bool(optional_observations & OptionalObservations.OBJECT)

        # Camera parameters
        self.has_renderer = False           # Set to true to use Mujoco native viewer for on-screen rendering
        self.render_camera = 'frontview'    # name of camera to use for on-screen rendering
        self.has_offscreen_renderer = self.use_camera_obs
        self.render_collision_mesh = False  # True if rendering collision meshes in camera. False otherwise
        self.render_visual_mesh = True      # True if rendering visual meshes in camera. False otherwise
        self.camera_names = 'agentview'     # name of camera for rendering camera observations
        self.camera_heights = 84            # height of camera frame.
        self.camera_widths = 84             # width of camera frame.
        self.camera_depths = False          # True if rendering RGB-D, and RGB otherwise.

        # Collision
        self.penalize_reward_on_collision = True
        self.end_episode_on_collision = False

    @property
    def optional_observations(self):
        flag = OptionalObservations.NONE
        if self.use_camera_obs:
            flag = OptionalObservations.CAMERA
            if self.use_object_obs:
                flag |= OptionalObservations.OBJECT
        elif self.use_object_obs:
            flag = OptionalObservations.OBJECT
        return flag

    @optional_observations.setter
    def optional_observations(self, value):
        self.use_camera_obs = bool(value & OptionalObservations.CAMERA)
        if self.use_camera_obs:
            self.has_offscreen_renderer = True
        self.use_object_obs = bool(value & OptionalObservations.OBJECT)

    def env_kwargs_dict(self):
        res = {k: (v.value if isinstance(v, Enum) else v) for k, v in vars(self).items()}
        return res


class RobosuiteEnvironmentParameters(EnvironmentParameters):
    def __init__(self, level, robot=None, controller=None, apply_dr: bool = False,
                 dr_every_n_steps_min: int = 10, dr_every_n_steps_max: int = 20,
                 use_joint_vel_obs=False):
        super().__init__(level=level)
        self.base_parameters = RobosuiteBaseParameters()
        self.extra_parameters = {}
        self.robot = robot
        self.controller = controller
        self.apply_dr = apply_dr
        self.dr_every_n_steps_min = dr_every_n_steps_min
        self.dr_every_n_steps_max = dr_every_n_steps_max
        self.use_joint_vel_obs = use_joint_vel_obs
        self.custom_controller_config_fpath = None

    @property
    def path(self):
        return 'rl_coach.environments.robosuite_environment:RobosuiteEnvironment'


DEFAULT_REWARD_SCALES = {
    'Lift': 2.25,
    'LiftLab': 2.25,
}


RobosuiteStepResult = namedtuple('RobosuiteStepResult', ['observation', 'reward', 'done', 'info'])


# Environment
class RobosuiteEnvironment(Environment):
    def __init__(self, level: LevelSelection,
                 seed: int, frame_skip: int, human_control: bool, custom_reward_threshold: Union[int, float, None],
                 visualization_parameters: VisualizationParameters,
                 base_parameters: RobosuiteBaseParameters,
                 extra_parameters: Dict[str, Any],
                 robot: str, controller: str,
                 target_success_rate: float = 1.0, apply_dr: bool = False,
                 dr_every_n_steps_min: int = 10, dr_every_n_steps_max: int = 20, use_joint_vel_obs=False,
                 custom_controller_config_fpath=None, **kwargs):
        super(RobosuiteEnvironment, self).__init__(level, seed, frame_skip, human_control, custom_reward_threshold,
                                                   visualization_parameters, target_success_rate)

        # Validate arguments

        self.frame_skip = max(1, self.frame_skip)

        def validate_input(input, supported, name):
            if input not in supported:
                raise ValueError("Unknown Robosuite {0} passed: '{1}' ; Supported {0}s are: {2}".format(
                    name, input, ' | '.join(supported)
                ))

        validate_input(self.env_id, robosuite_environments, 'environment')
        validate_input(robot, robosuite_robots, 'robot')
        self.robot = robot
        if controller is not None:
            validate_input(controller, robosuite_controllers, 'controller')
        self.controller = controller

        self.base_parameters = base_parameters
        self.base_parameters.has_renderer = self.is_rendered and self.native_rendering
        self.base_parameters.has_offscreen_renderer = self.base_parameters.use_camera_obs or \
            (self.is_rendered and not self.native_rendering)

        # Seed
        if self.seed is not None:
            np.random.seed(self.seed)
            random.seed(self.seed)

        # Load and initialize environment
        env_args = self.base_parameters.env_kwargs_dict()
        env_args.update(extra_parameters)

        if 'reward_scale' not in env_args and self.env_id in DEFAULT_REWARD_SCALES:
            env_args['reward_scale'] = DEFAULT_REWARD_SCALES[self.env_id]

        env_args['robots'] = self.robot
        controller_cfg = None
        if self.controller is not None:
            controller_cfg = robosuite.controllers.load_controller_config(default_controller=self.controller)
        elif custom_controller_config_fpath is not None:
            controller_cfg = robosuite.controllers.load_controller_config(custom_fpath=custom_controller_config_fpath)

        env_args['controller_configs'] = controller_cfg

        self.env = robosuite.make(self.env_id, **env_args)

        # TODO: Generalize this to filter any observation by name
        if not use_joint_vel_obs:
            self.env.modify_observable('robot0_joint_vel', 'active', False)

        # Wrap with a dummy wrapper so we get a consistent API (there are subtle changes between
        # wrappers and actual environments in Robosuite, for example action_spec as property vs. function)
        self.env = Wrapper(self.env)
        if apply_dr:
            self.env = DomainRandomizationWrapper(self.env, seed=self.seed,
                                                  randomize_every_n_steps_min=dr_every_n_steps_min,
                                                  randomize_every_n_steps_max=dr_every_n_steps_max)

        # State space
        self.state_space = self._setup_state_space()

        # Action space
        low, high = self.env.unwrapped.action_spec
        self.action_space = BoxActionSpace(low.shape, low=low, high=high)

        self.reset_internal_state()

        if self.is_rendered:
            image = self.get_rendered_image()
            self.renderer.create_screen(image.shape[1], image.shape[0])
        # TODO: Other environments call rendering here, why? reset_internal_state does it

    def _setup_state_space(self):
        state_space = StateSpace({})
        dummy_obs = self._process_observation(self.env.observation_spec())

        state_space['measurements'] = VectorObservationSpace(dummy_obs['measurements'].shape[0])

        if self.base_parameters.use_camera_obs:
            state_space['camera'] = PlanarMapsObservationSpace(dummy_obs['camera'].shape, 0, 255)

        return state_space

    def _process_observation(self, raw_obs):
        new_obs = {}

        # TODO: Support multiple cameras, this assumes a single camera
        camera_name = self.base_parameters.camera_names

        camera_obs = raw_obs.get(camera_name + '_image', None)
        if camera_obs is not None:
            depth_obs = raw_obs.get(camera_name + '_depth', None)
            if depth_obs is not None:
                depth_obs = np.expand_dims(depth_obs, axis=2)
                camera_obs = np.concatenate([camera_obs, depth_obs], axis=2)
            new_obs['camera'] = camera_obs

        measurements = raw_obs['robot0_proprio-state']
        object_obs = raw_obs.get('object-state', None)
        if object_obs is not None:
            measurements = np.concatenate([measurements, object_obs])
        new_obs['measurements'] = measurements

        return new_obs

    def _take_action(self, action):
        action = self.action_space.clip_action_to_space(action)

        # We mimic the "action_repeat" mechanism of RobosuiteWrapper in Surreal.
        # Same concept as frame_skip, only returning the average reward across repeated actions instead
        # of the total reward.
        rewards = []
        for _ in range(self.frame_skip):
            obs, reward, done, info = self.env.step(action)
            rewards.append(reward)
            if done:
                break
        reward = np.mean(rewards)
        self.last_result = RobosuiteStepResult(obs, reward, done, info)

    def _update_state(self):
        obs = self._process_observation(self.last_result.observation)
        self.state = {k: obs[k] for k in self.state_space.sub_spaces}
        self.reward = self.last_result.reward or 0
        self.done = self.last_result.done
        self.info = self.last_result.info

    def _restart_environment_episode(self, force_environment_reset=False):
        reset_obs = self.env.reset()
        self.last_result = RobosuiteStepResult(reset_obs, 0.0, False, {})

    def _render(self):
        self.env.render()

    def get_rendered_image(self):
        img: np.ndarray = self.env.sim.render(camera_name=self.base_parameters.render_camera,
                                              height=512, width=512, depth=False)
        return np.flip(img, 0)

    def close(self):
        self.env.close()


class RobosuiteGoalBasedExpEnvironmentParameters(RobosuiteEnvironmentParameters):
    @property
    def path(self):
        return 'rl_coach.environments.robosuite_environment:RobosuiteGoalBasedExpEnvironment'


class RobosuiteGoalBasedExpEnvironment(RobosuiteEnvironment):
    def _process_observation(self, raw_obs):
        new_obs = super()._process_observation(raw_obs)
        new_obs['obs-goal'] = None
        return new_obs

    def _setup_state_space(self):
        state_space = super()._setup_state_space()
        goal_based_shape = list(state_space['camera'].shape)
        goal_based_shape[2] *= 2
        state_space['obs-goal'] = PlanarMapsObservationSpace(tuple(goal_based_shape), 0, 255)
        return state_space
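Note (not part of the diff): a minimal sketch of how the parameter classes above are meant to be wired up in a preset. The 'Lift'/'Panda'/'OSC_POSE' values echo identifiers that appear elsewhere in this change; everything else is an assumption for illustration.

    from rl_coach.environments.robosuite_environment import \
        RobosuiteEnvironmentParameters, OptionalObservations

    # CAMERA and OBJECT are Flag members, so they can be OR-ed to request both the rendered
    # image and the object state on top of the always-present robot proprioception.
    env_params = RobosuiteEnvironmentParameters(level='Lift', robot='Panda', controller='OSC_POSE')
    env_params.base_parameters.optional_observations = \
        OptionalObservations.CAMERA | OptionalObservations.OBJECT
    env_params.base_parameters.camera_names = 'agentview'  # single camera, as _process_observation() assumes
    env_params.base_parameters.camera_depths = True        # depth map gets concatenated to the RGB image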
@@ -114,7 +114,8 @@ class StarCraft2Environment(Environment):
                 observation_type: StarcraftObservationType=StarcraftObservationType.Features,
                 disable_fog: bool=False, auto_select_all_army: bool=True,
                 use_full_action_space: bool=False, **kwargs):
-        super().__init__(level, seed, frame_skip, human_control, custom_reward_threshold, visualization_parameters, target_success_rate)
+        super().__init__(level, seed, frame_skip, human_control, custom_reward_threshold, visualization_parameters,
+                         target_success_rate)

        self.screen_size = screen_size
        self.minimap_size = minimap_size
@@ -222,7 +222,8 @@ class GraphManager(object):
        if isinstance(task_parameters, DistributedTaskParameters):
            # the distributed tensorflow setting
            from rl_coach.architectures.tensorflow_components.distributed_tf_utils import create_monitored_session
-           if hasattr(self.task_parameters, 'checkpoint_restore_path') and self.task_parameters.checkpoint_restore_path:
+           if hasattr(self.task_parameters,
+                      'checkpoint_restore_path') and self.task_parameters.checkpoint_restore_path:
                checkpoint_dir = os.path.join(task_parameters.experiment_path, 'checkpoint')
                if os.path.exists(checkpoint_dir):
                    remove_tree(checkpoint_dir)
@@ -438,7 +439,8 @@ class GraphManager(object):
        # perform several steps of playing
        count_end = self.current_step_counter + steps
        result = None
-       while self.current_step_counter < count_end or (wait_for_full_episodes and result is not None and not result.game_over):
+       while self.current_step_counter < count_end or (
+               wait_for_full_episodes and result is not None and not result.game_over):
            # reset the environment if the previous episode was terminated
            if self.reset_required:
                self.reset_internal_state()
@@ -506,8 +508,14 @@ class GraphManager(object):
        # act for at least `steps`, though don't interrupt an episode
        count_end = self.current_step_counter + steps
        while self.current_step_counter < count_end:
+           # In case of an evaluation-only worker, fake a phase transition before and after every
+           # episode to make sure results are logged correctly
+           if self.task_parameters.evaluate_only is not None:
+               self.phase = RunPhase.TEST
            self.act(EnvironmentEpisodes(1))
            self.sync()
+           if self.task_parameters.evaluate_only is not None:
+               self.phase = RunPhase.TRAIN
            if self.should_stop():
                self.flush_finished()
                screen.success("Reached required success rate. Exiting.")
@@ -717,7 +725,8 @@ class GraphManager(object):
        self.memory_backend = get_memory_backend(self.agent_params.memory.memory_backend_params)

    def should_stop(self) -> bool:
-       return self.task_parameters.apply_stop_condition and all([manager.should_stop() for manager in self.level_managers])
+       return self.task_parameters.apply_stop_condition and all(
+           [manager.should_stop() for manager in self.level_managers])

    def get_data_store(self, param):
        if self.data_store:
@@ -70,7 +70,7 @@ class ScreenLogger(object):
        """
        if not self.log_file:
            self.log_file = open(os.path.join(experiment_path, "log.txt"), "a")
-       self.log_file.write(",".join([t for t in text]))
+       self.log_file.write(",".join([str(t) for t in text]))
        self.log_file.write("\n")
        self.log_file.flush()
        print(*text, flush=True)
rl_coach/presets/RoboSuite_CubeExp_Random.py (new file, 98 lines)
@@ -0,0 +1,98 @@
from rl_coach.agents.td3_exp_agent import RandomAgentParameters
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters
from rl_coach.architectures.layers import Dense, Conv2d, BatchnormActivationDropout, Flatten
from rl_coach.base_parameters import EmbedderScheme
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.robosuite_environment import RobosuiteGoalBasedExpEnvironmentParameters, \
    OptionalObservations
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.architectures.head_parameters import RNDHeadParameters

####################
# Graph Scheduling #
####################

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(300000)
schedule_params.steps_between_evaluation_periods = TrainingSteps(300000)
schedule_params.evaluation_steps = EnvironmentEpisodes(0)
schedule_params.heatup_steps = EnvironmentSteps(0)

#########
# Agent #
#########

agent_params = RandomAgentParameters()
agent_params.algorithm.use_non_zero_discount_for_terminal_states = True

agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()

# Camera observation pre-processing network scheme
camera_obs_scheme = [
    Conv2d(32, 8, 4),
    BatchnormActivationDropout(activation_function='relu'),
    Conv2d(64, 4, 2),
    BatchnormActivationDropout(activation_function='relu'),
    Conv2d(64, 3, 1),
    BatchnormActivationDropout(activation_function='relu'),
    Flatten(),
    Dense(256),
    BatchnormActivationDropout(activation_function='relu')
]

# Actor
actor_network = agent_params.network_wrappers['actor']
actor_network.input_embedders_parameters = {
    'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')
}

actor_network.middleware_parameters.scheme = [Dense(300), Dense(200)]
actor_network.learning_rate = 1e-4

# Critic
critic_network = agent_params.network_wrappers['critic']
critic_network.input_embedders_parameters = {
    'action': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')
}

critic_network.middleware_parameters.scheme = [Dense(400), Dense(300)]
critic_network.learning_rate = 1e-4

# RND
agent_params.network_wrappers['predictor'].input_embedders_parameters = \
    {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,
                                                                 input_rescaling={'image': 1.0},
                                                                 flatten=False)}
agent_params.network_wrappers['constant'].input_embedders_parameters = \
    {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,
                                                                 input_rescaling={'image': 1.0},
                                                                 flatten=False)}
agent_params.network_wrappers['predictor'].heads_parameters = [RNDHeadParameters(is_predictor=True)]

###############
# Environment #
###############
env_params = RobosuiteGoalBasedExpEnvironmentParameters(level='CubeExp')
env_params.robot = 'Panda'
env_params.custom_controller_config_fpath = './rl_coach/environments/robosuite/osc_pose.json'
env_params.base_parameters.optional_observations = OptionalObservations.CAMERA
env_params.base_parameters.render_camera = 'frontview'
env_params.base_parameters.camera_names = 'agentview'
env_params.base_parameters.camera_depths = False
env_params.base_parameters.horizon = 200
env_params.base_parameters.ignore_done = False
env_params.base_parameters.use_object_obs = True
env_params.frame_skip = 1
env_params.base_parameters.control_freq = 2
env_params.base_parameters.camera_heights = 84
env_params.base_parameters.camera_widths = 84
env_params.extra_parameters = {'hard_reset': False}


graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params, schedule_params=schedule_params)
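Usage note (assumption, following the convention shown in the tutorial notebook at the end of this diff, which launches presets by name): this preset can be run from the command line with

    coach -p RoboSuite_CubeExp_Random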
rl_coach/presets/RoboSuite_CubeExp_TD3_Goal_Based.py (new file, 111 lines)
@@ -0,0 +1,111 @@
from rl_coach.agents.td3_exp_agent import TD3GoalBasedAgentParameters
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters
from rl_coach.architectures.layers import Dense, Conv2d, BatchnormActivationDropout, Flatten
from rl_coach.base_parameters import EmbedderScheme
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.robosuite_environment import RobosuiteGoalBasedExpEnvironmentParameters, \
    OptionalObservations
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.architectures.head_parameters import RNDHeadParameters
from rl_coach.schedules import LinearSchedule

####################
# Graph Scheduling #
####################

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(300000)
schedule_params.steps_between_evaluation_periods = TrainingSteps(300000)
schedule_params.evaluation_steps = EnvironmentEpisodes(0)
schedule_params.heatup_steps = EnvironmentSteps(1000)

#########
# Agent #
#########

agent_params = TD3GoalBasedAgentParameters()
agent_params.algorithm.use_non_zero_discount_for_terminal_states = False
agent_params.algorithm.identity_goal_sample_rate = 0.04
agent_params.exploration.noise_schedule = LinearSchedule(1.5, 0.5, 300000)

agent_params.algorithm.rnd_sample_size = 2000
agent_params.algorithm.rnd_batch_size = 500
agent_params.algorithm.rnd_optimization_epochs = 4
agent_params.algorithm.td3_training_ratio = 1.0
agent_params.algorithm.identity_goal_sample_rate = 0.0
agent_params.algorithm.env_obs_key = 'camera'
agent_params.algorithm.agent_obs_key = 'obs-goal'
agent_params.algorithm.replay_buffer_save_steps = 25000
agent_params.algorithm.replay_buffer_save_path = './tutorials'

agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()

# Camera observation pre-processing network scheme
camera_obs_scheme = [
    Conv2d(32, 8, 4),
    BatchnormActivationDropout(activation_function='relu'),
    Conv2d(64, 4, 2),
    BatchnormActivationDropout(activation_function='relu'),
    Conv2d(64, 3, 1),
    BatchnormActivationDropout(activation_function='relu'),
    Flatten(),
    Dense(256),
    BatchnormActivationDropout(activation_function='relu')
]

# Actor
actor_network = agent_params.network_wrappers['actor']
actor_network.input_embedders_parameters = {
    'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')
}

actor_network.middleware_parameters.scheme = [Dense(300), Dense(200)]
actor_network.learning_rate = 1e-4

# Critic
critic_network = agent_params.network_wrappers['critic']
critic_network.input_embedders_parameters = {
    'action': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')
}

critic_network.middleware_parameters.scheme = [Dense(400), Dense(300)]
critic_network.learning_rate = 1e-4

# RND
agent_params.network_wrappers['predictor'].input_embedders_parameters = \
    {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,
                                                                 input_rescaling={'image': 1.0},
                                                                 flatten=False)}
agent_params.network_wrappers['constant'].input_embedders_parameters = \
    {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,
                                                                 input_rescaling={'image': 1.0},
                                                                 flatten=False)}
agent_params.network_wrappers['predictor'].heads_parameters = [RNDHeadParameters(is_predictor=True)]

###############
# Environment #
###############
env_params = RobosuiteGoalBasedExpEnvironmentParameters(level='CubeExp')
env_params.robot = 'Panda'
env_params.custom_controller_config_fpath = './rl_coach/environments/robosuite/osc_pose.json'
env_params.base_parameters.optional_observations = OptionalObservations.CAMERA
env_params.base_parameters.render_camera = 'frontview'
env_params.base_parameters.camera_names = 'agentview'
env_params.base_parameters.camera_depths = False
env_params.base_parameters.horizon = 200
env_params.base_parameters.ignore_done = False
env_params.base_parameters.use_object_obs = True
env_params.frame_skip = 1
env_params.base_parameters.control_freq = 2
env_params.base_parameters.camera_heights = 84
env_params.base_parameters.camera_widths = 84
env_params.extra_parameters = {'hard_reset': False}


graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params, schedule_params=schedule_params)
rl_coach/presets/RoboSuite_CubeExp_TD3_Intrinsic_Reward.py (new file, 100 lines)
@@ -0,0 +1,100 @@
from rl_coach.agents.td3_exp_agent import TD3IntrinsicRewardAgentParameters
from rl_coach.architectures.embedder_parameters import InputEmbedderParameters
from rl_coach.architectures.layers import Dense, Conv2d, BatchnormActivationDropout, Flatten
from rl_coach.base_parameters import EmbedderScheme
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps
from rl_coach.environments.robosuite_environment import RobosuiteGoalBasedExpEnvironmentParameters, \
    OptionalObservations
from rl_coach.filters.filter import NoInputFilter, NoOutputFilter
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.architectures.head_parameters import RNDHeadParameters
from rl_coach.schedules import LinearSchedule

####################
# Graph Scheduling #
####################

schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(300000)
schedule_params.steps_between_evaluation_periods = TrainingSteps(300000)
schedule_params.evaluation_steps = EnvironmentEpisodes(0)
schedule_params.heatup_steps = EnvironmentSteps(1000)

#########
# Agent #
#########

agent_params = TD3IntrinsicRewardAgentParameters()
agent_params.algorithm.use_non_zero_discount_for_terminal_states = True
agent_params.exploration.noise_schedule = LinearSchedule(1.5, 0.5, 300000)

agent_params.input_filter = NoInputFilter()
agent_params.output_filter = NoOutputFilter()

# Camera observation pre-processing network scheme
camera_obs_scheme = [
    Conv2d(32, 8, 4),
    BatchnormActivationDropout(activation_function='relu'),
    Conv2d(64, 4, 2),
    BatchnormActivationDropout(activation_function='relu'),
    Conv2d(64, 3, 1),
    BatchnormActivationDropout(activation_function='relu'),
    Flatten(),
    Dense(256),
    BatchnormActivationDropout(activation_function='relu')
]

# Actor
actor_network = agent_params.network_wrappers['actor']
actor_network.input_embedders_parameters = {
    'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')
}

actor_network.middleware_parameters.scheme = [Dense(300), Dense(200)]
actor_network.learning_rate = 1e-4

# Critic
critic_network = agent_params.network_wrappers['critic']
critic_network.input_embedders_parameters = {
    'action': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),
    agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')
}

critic_network.middleware_parameters.scheme = [Dense(400), Dense(300)]
critic_network.learning_rate = 1e-4

# RND
agent_params.network_wrappers['predictor'].input_embedders_parameters = \
    {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,
                                                                 input_rescaling={'image': 1.0},
                                                                 flatten=False)}
agent_params.network_wrappers['constant'].input_embedders_parameters = \
    {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,
                                                                 input_rescaling={'image': 1.0},
                                                                 flatten=False)}
agent_params.network_wrappers['predictor'].heads_parameters = [RNDHeadParameters(is_predictor=True)]

###############
# Environment #
###############
env_params = RobosuiteGoalBasedExpEnvironmentParameters(level='CubeExp')
env_params.robot = 'Panda'
env_params.custom_controller_config_fpath = './rl_coach/environments/robosuite/osc_pose.json'
env_params.base_parameters.optional_observations = OptionalObservations.CAMERA
env_params.base_parameters.render_camera = 'frontview'
env_params.base_parameters.camera_names = 'agentview'
env_params.base_parameters.camera_depths = False
env_params.base_parameters.horizon = 200
env_params.base_parameters.ignore_done = False
env_params.base_parameters.use_object_obs = True
env_params.frame_skip = 1
env_params.base_parameters.control_freq = 2
env_params.base_parameters.camera_heights = 84
env_params.base_parameters.camera_widths = 84
env_params.extra_parameters = {'hard_reset': False}


graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params, schedule_params=schedule_params)
@@ -22,6 +22,9 @@ FAILING_PRESETS = [
    'CARLA_3_Cameras_DDPG',
    'Starcraft_CollectMinerals_A3C',
    'Starcraft_CollectMinerals_Dueling_DDQN',
+   'RoboSuite_CubeExp_Random',
+   'RoboSuite_CubeExp_TD3_Goal_Based',
+   'RoboSuite_CubeExp_TD3_Intrinsic_Reward',
]


def all_presets():
448
tutorials/5. Goal-Based Data Collection.ipynb
Normal file
448
tutorials/5. Goal-Based Data Collection.ipynb
Normal file
@@ -0,0 +1,448 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Goal-Based Data Collection\n",
|
||||||
|
"A practical approach to robot reinforcement learning is to first collect a large batch of real or simulated robot interaction data, \n",
|
||||||
|
"using some data collection policy, and then learn from this data to perform various tasks, using offline learning algorithms.\n",
|
||||||
|
"\n",
|
||||||
|
"In this notebook, we will demonstrate how to collect diverse dataset for a simple robotics manipulation task\n",
|
||||||
|
"using the algorithms detailed in the following paper:\n",
|
||||||
|
"[Efficient Self-Supervised Data Collection for Offline Robot Learning](https://arxiv.org/abs/2105.04607).\n",
|
||||||
|
"\n",
|
||||||
|
"The implementation is based on the Robosuite simulator, which should be installed before running this notebook. Follow the instructions in the Coach readme [here](https://github.com/IntelLabs/coach#robosuite).\n",
|
||||||
|
"\n",
|
||||||
|
"Presets with predefined parameters for all three algorithms shown in the paper can be found here:\n",
|
||||||
|
"\n",
|
||||||
|
"* Random Agent: ```presets/RoboSuite_CubeExp_Random.py```\n",
|
||||||
|
"\n",
|
||||||
|
"* Intrinsic Reward Agent: ```presets/RoboSuite_CubeExp_TD3_Intrinsic_Reward.py```\n",
|
||||||
|
"\n",
|
||||||
|
"* Goal-Based Agent: ```presets/RoboSuite_CubeExp_TD3_Goal_Based.py```\n",
|
||||||
|
"\n",
|
||||||
|
"You can run those presets using the command line:\n",
|
||||||
|
"\n",
|
||||||
|
"`coach -p RoboSuite_CubeExp_TD3_Goal_Based`\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Preliminaries\n",
|
||||||
|
"First, get the required imports and other general settings we need for this notebook.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%%\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"from rl_coach.agents.td3_exp_agent import TD3GoalBasedAgentParameters\n",
|
||||||
|
"from rl_coach.architectures.embedder_parameters import InputEmbedderParameters\n",
|
||||||
|
"from rl_coach.architectures.layers import Dense, Conv2d, BatchnormActivationDropout, Flatten\n",
|
||||||
|
"from rl_coach.base_parameters import EmbedderScheme\n",
|
||||||
|
"from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps\n",
|
||||||
|
"from rl_coach.environments.robosuite_environment import RobosuiteGoalBasedExpEnvironmentParameters, \\\n",
|
||||||
|
" OptionalObservations\n",
|
||||||
|
"from rl_coach.filters.filter import NoInputFilter, NoOutputFilter\n",
|
||||||
|
"from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager\n",
|
||||||
|
"from rl_coach.graph_managers.graph_manager import ScheduleParameters\n",
|
||||||
|
"from rl_coach.architectures.head_parameters import RNDHeadParameters\n",
|
||||||
|
"from rl_coach.schedules import LinearSchedule\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Then, we define the training schedule for the agent. `improve_steps` dictates the number of samples in the final data-set.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%%\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"####################\n",
|
||||||
|
"# Graph Scheduling #\n",
|
||||||
|
"####################\n",
|
||||||
|
"\n",
|
||||||
|
"schedule_params = ScheduleParameters()\n",
|
||||||
|
"schedule_params.improve_steps = TrainingSteps(300000)\n",
|
||||||
|
"schedule_params.steps_between_evaluation_periods = TrainingSteps(300000)\n",
|
||||||
|
"schedule_params.evaluation_steps = EnvironmentEpisodes(0)\n",
|
||||||
|
"schedule_params.heatup_steps = EnvironmentSteps(1000)\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"In this example, we will be using the goal-based algorithm for data-collection. Therefore, we populate\n",
|
||||||
|
"the `TD3GoalBasedAgentParameters` class with our desired algorithm specific parameters.\n",
|
||||||
|
"\n",
|
||||||
|
"The goal-based data collected is based on TD3, using this class you can change the TD3 specific parameters as well.\n",
|
||||||
|
"\n",
|
||||||
|
"A detailed description of the goal-based and TD3 algorithm specific parameters can be found in \n",
|
||||||
|
"```agents/td3_exp_agent.py``` and ```agents/td3_agent.py``` respectively.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%%\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"#########\n",
|
||||||
|
"# Agent #\n",
|
||||||
|
"#########\n",
|
||||||
|
"\n",
|
||||||
|
"agent_params = TD3GoalBasedAgentParameters()\n",
|
||||||
|
"agent_params.algorithm.use_non_zero_discount_for_terminal_states = False\n",
|
||||||
|
"agent_params.algorithm.identity_goal_sample_rate = 0.04\n",
|
||||||
|
"agent_params.exploration.noise_schedule = LinearSchedule(1.5, 0.5, 300000)\n",
|
||||||
|
"\n",
|
||||||
|
"agent_params.algorithm.rnd_sample_size = 2000\n",
|
||||||
|
"agent_params.algorithm.rnd_batch_size = 500\n",
|
||||||
|
"agent_params.algorithm.rnd_optimization_epochs = 4\n",
|
||||||
|
"agent_params.algorithm.td3_training_ratio = 1.0\n",
|
||||||
|
"agent_params.algorithm.identity_goal_sample_rate = 0.0\n",
|
||||||
|
"agent_params.algorithm.env_obs_key = 'camera'\n",
|
||||||
|
"agent_params.algorithm.agent_obs_key = 'obs-goal'\n",
|
||||||
|
"agent_params.algorithm.replay_buffer_save_steps = 25000\n",
|
||||||
|
"agent_params.algorithm.replay_buffer_save_path = './Resources'\n",
|
||||||
|
"\n",
|
||||||
|
"agent_params.input_filter = NoInputFilter()\n",
|
||||||
|
"agent_params.output_filter = NoOutputFilter()\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Next, we'll define the networks' architecture and parameters as they appear in the paper.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%%\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Camera observation pre-processing network scheme\n",
|
||||||
|
"camera_obs_scheme = [\n",
|
||||||
|
" Conv2d(32, 8, 4),\n",
|
||||||
|
" BatchnormActivationDropout(activation_function='relu'),\n",
|
||||||
|
" Conv2d(64, 4, 2),\n",
|
||||||
|
" BatchnormActivationDropout(activation_function='relu'),\n",
|
||||||
|
" Conv2d(64, 3, 1),\n",
|
||||||
|
" BatchnormActivationDropout(activation_function='relu'),\n",
|
||||||
|
" Flatten(),\n",
|
||||||
|
" Dense(256),\n",
|
||||||
|
" BatchnormActivationDropout(activation_function='relu')\n",
|
||||||
|
"]\n",
|
||||||
|
"\n",
|
||||||
|
"# Actor\n",
|
||||||
|
"actor_network = agent_params.network_wrappers['actor']\n",
|
||||||
|
"actor_network.input_embedders_parameters = {\n",
|
||||||
|
" 'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),\n",
|
||||||
|
" agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"actor_network.middleware_parameters.scheme = [Dense(300), Dense(200)]\n",
|
||||||
|
"actor_network.learning_rate = 1e-4\n",
|
||||||
|
"\n",
|
||||||
|
"# Critic\n",
|
||||||
|
"critic_network = agent_params.network_wrappers['critic']\n",
|
||||||
|
"critic_network.input_embedders_parameters = {\n",
|
||||||
|
" 'action': InputEmbedderParameters(scheme=EmbedderScheme.Empty),\n",
|
||||||
|
" 'measurements': InputEmbedderParameters(scheme=EmbedderScheme.Empty),\n",
|
||||||
|
" agent_params.algorithm.agent_obs_key: InputEmbedderParameters(scheme=camera_obs_scheme, activation_function='none')\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"critic_network.middleware_parameters.scheme = [Dense(400), Dense(300)]\n",
|
||||||
|
"critic_network.learning_rate = 1e-4\n",
|
||||||
|
"\n",
|
||||||
|
"# RND\n",
|
||||||
|
"agent_params.network_wrappers['predictor'].input_embedders_parameters = \\\n",
|
||||||
|
" {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,\n",
|
||||||
|
" input_rescaling={'image': 1.0},\n",
|
||||||
|
" flatten=False)}\n",
|
||||||
|
"agent_params.network_wrappers['constant'].input_embedders_parameters = \\\n",
|
||||||
|
" {agent_params.algorithm.env_obs_key: InputEmbedderParameters(scheme=EmbedderScheme.Empty,\n",
|
||||||
|
" input_rescaling={'image': 1.0},\n",
|
||||||
|
" flatten=False)}\n",
|
||||||
|
"agent_params.network_wrappers['predictor'].heads_parameters = [RNDHeadParameters(is_predictor=True)]\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"The last thing we need to define is the environment parameters for the manipulation task.\n",
|
||||||
|
"This environment is a 7DoF Franka Panda robotic arm with a closed gripper and cartesian\n",
|
||||||
|
"position control of the end-effector. The robot is positioned on a table, and a cube object with colored sides is placed in\n",
|
||||||
|
"front of it.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%%\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"###############\n",
|
||||||
|
"# Environment #\n",
|
||||||
|
"###############\n",
|
||||||
|
"env_params = RobosuiteGoalBasedExpEnvironmentParameters(level='CubeExp')\n",
|
||||||
|
"env_params.robot = 'Panda'\n",
|
||||||
|
"env_params.custom_controller_config_fpath = '../rl_coach/environments/robosuite/osc_pose.json'\n",
|
||||||
|
"env_params.base_parameters.optional_observations = OptionalObservations.CAMERA\n",
|
||||||
|
"env_params.base_parameters.render_camera = 'frontview'\n",
|
||||||
|
"env_params.base_parameters.camera_names = 'agentview'\n",
|
||||||
|
"env_params.base_parameters.camera_depths = False\n",
|
||||||
|
"env_params.base_parameters.horizon = 200\n",
|
||||||
|
"env_params.base_parameters.ignore_done = False\n",
|
||||||
|
"env_params.base_parameters.use_object_obs = True\n",
|
||||||
|
"env_params.frame_skip = 1\n",
|
||||||
|
"env_params.base_parameters.control_freq = 2\n",
|
||||||
|
"env_params.base_parameters.camera_heights = 84\n",
|
||||||
|
"env_params.base_parameters.camera_widths = 84\n",
|
||||||
|
"env_params.extra_parameters = {'hard_reset': False}\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Finally, we create the graph manager and call `graph_manager.improve()` in order to start the data collection.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%%\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params, schedule_params=schedule_params)\n",
|
||||||
|
"graph_manager.improve()\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Once the data collection is complete, the data-set will saved to path specified by `agent_params.algorithm.replay_buffer_save_path`.\n",
|
||||||
|
"\n",
|
||||||
|
"At this point, the data can be used to learn any downstream task you define on that environment.\n",
|
||||||
|
"\n",
|
||||||
|
"The script below shows a visualization of the data-set. The dots represent a position of the cube on the table as seen in the data-set, and the color corresponds to the color of the face at the top. The number at the top signifies that number of dots a plot contains for a certain color.\n",
"\n",
"First we load the data-set from disk. Note that this can take several minutes to complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"\n",
"print('Loading data-set (this can take several minutes)...')\n",
"rb_path = os.path.join('./Resources', 'RB_TD3GoalBasedAgent.joblib.bz2')\n",
"episodes = joblib.load(rb_path)\n",
"print('Done')"
]
},
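{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the visualization, it can help to sanity-check what was loaded. The cell below is a minimal, illustrative sketch (not part of the original flow); it only assumes what the visualization code itself relies on, namely that `episodes` is a sequence of episodes, each iterable and indexable over transitions that expose `state['measurements']`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sanity check of the loaded data-set (illustrative sketch).\n",
"# Assumes each episode is iterable/indexable over transitions exposing\n",
"# state['measurements'], as the visualization code below also relies on.\n",
"num_episodes = len(episodes)\n",
"num_transitions = sum(len(list(episode)) for episode in episodes)\n",
"print('Episodes:', num_episodes)\n",
"print('Transitions:', num_transitions)\n",
"if num_episodes > 0:\n",
"    print('Measurements vector length:', len(episodes[0][0].state['measurements']))"
]
},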
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can run the visualization script:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import os\n",
"import numpy as np\n",
"from collections import OrderedDict\n",
"from enum import IntEnum\n",
"from pylab import subplot\n",
"from gym.envs.robotics.rotations import quat2euler, mat2euler, quat2mat\n",
"import matplotlib.pyplot as plt\n",
"\n",
"\n",
"class CubeColor(IntEnum):\n",
" YELLOW = 0\n",
" CYAN = 1\n",
" WHITE = 2\n",
" RED = 3\n",
" GREEN = 4\n",
" BLUE = 5\n",
" UNKNOWN = 6\n",
"\n",
"\n",
"x_range = [-0.3, 0.3]\n",
"y_range = [-0.3, 0.3]\n",
"\n",
"COLOR_MAP = OrderedDict([\n",
" (int(CubeColor.YELLOW), 'yellow'),\n",
" (int(CubeColor.CYAN), 'cyan'),\n",
" (int(CubeColor.WHITE), 'white'),\n",
" (int(CubeColor.RED), 'red'),\n",
" (int(CubeColor.GREEN), 'green'),\n",
" (int(CubeColor.BLUE), 'blue'),\n",
" (int(CubeColor.UNKNOWN), 'black'),\n",
"])\n",
"\n",
"# Mapping between (subset of) euler angles to top face color, based on the initial cube rotation\n",
"COLOR_ROTATION_MAP = OrderedDict([\n",
" (CubeColor.YELLOW, (0, 2, [np.array([0, 0]),\n",
" np.array([np.pi, np.pi]), np.array([-np.pi, -np.pi]),\n",
" np.array([-np.pi, np.pi]), np.array([np.pi, -np.pi])])),\n",
" (CubeColor.CYAN, (0, 2, [np.array([0, np.pi]), np.array([0, -np.pi]),\n",
" np.array([np.pi, 0]), np.array([-np.pi, 0])])),\n",
" (CubeColor.WHITE, (1, 2, [np.array([-np.pi / 2])])),\n",
" (CubeColor.RED, (1, 2, [np.array([np.pi / 2])])),\n",
" (CubeColor.GREEN, (0, 2, [np.array([np.pi / 2, 0])])),\n",
" (CubeColor.BLUE, (0, 2, [np.array([-np.pi / 2, 0])])),\n",
"])\n",
"\n",
"\n",
"def get_cube_top_color(cube_quat, atol):\n",
" euler = mat2euler(quat2mat(cube_quat))\n",
" for color, (start_dim, end_dim, xy_rotations) in COLOR_ROTATION_MAP.items():\n",
" if any(list(np.allclose(euler[start_dim:end_dim], xy_rotation, atol=atol) for xy_rotation in xy_rotations)):\n",
" return color\n",
" return CubeColor.UNKNOWN\n",
"\n",
"\n",
"def pos2cord(x, y):\n",
" x = max(min(x, x_range[1]), x_range[0])\n",
" y = max(min(y, y_range[1]), y_range[0])\n",
" x = int(((x - x_range[0])/(x_range[1] - x_range[0]))*99)\n",
" y = int(((y - y_range[0])/(y_range[1] - y_range[0]))*99)\n",
" return x, y\n",
"\n",
"\n",
"pos_idx = 25\n",
"quat_idx = 28\n",
"positions = []\n",
"colors = []\n",
"print('Extracting cube positions and colors...')\n",
"for episode in episodes:\n",
" for transition in episode:\n",
" x, y = transition.state['measurements'][pos_idx:pos_idx+2]\n",
" positions.append([x, y])\n",
" angle = quat2euler(transition.state['measurements'][quat_idx:quat_idx+4])\n",
" colors.append(int(get_cube_top_color(transition.state['measurements'][quat_idx:quat_idx+4], np.pi / 4)))\n",
"\n",
" x_cord, y_cord = pos2cord(x, y)\n",
"\n",
" x, y = episode[-1].next_state['measurements'][pos_idx:pos_idx+2]\n",
" positions.append([x, y])\n",
" colors.append(int(get_cube_top_color(episode[-1].next_state['measurements'][quat_idx:quat_idx+4], np.pi / 4)))\n",
"\n",
" x_cord, y_cord = pos2cord(x, y)\n",
"print('Done')\n",
"\n",
"\n",
"fig = plt.figure(figsize=(15.0, 5.0))\n",
"axes = []\n",
"for j in range(6):\n",
" axes.append(subplot(1, 6, j + 1))\n",
" xy = np.array(positions)[np.array(colors) == list(COLOR_MAP.keys())[j]]\n",
" axes[-1].scatter(xy[:, 1], xy[:, 0], c=COLOR_MAP[j], alpha=0.01, edgecolors='black')\n",
" plt.xlim(y_range)\n",
" plt.ylim(x_range)\n",
" plt.xticks([])\n",
" plt.yticks([])\n",
" axes[-1].set_aspect('equal', adjustable='box')\n",
" title = 'N=' + str(xy.shape[0])\n",
" plt.title(title)\n",
"\n",
"for ax in axes:\n",
" ax.invert_yaxis()\n",
"\n",
"plt.show()\n"
]
},
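{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted earlier, the collected data can serve as input for downstream tasks. The cell below is an illustrative sketch of flattening the episodes into NumPy arrays of states, actions and next states; it assumes each transition also exposes an `action` attribute (as in Coach's `Transition` class), in addition to the `state`/`next_state` dictionaries used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: flatten the exploration data-set into arrays that a\n",
"# downstream learner could consume. Assumes transitions expose an `action`\n",
"# attribute in addition to the state/next_state dictionaries used above.\n",
"flat_states, flat_actions, flat_next_states = [], [], []\n",
"for episode in episodes:\n",
"    for transition in episode:\n",
"        flat_states.append(transition.state['measurements'])\n",
"        flat_actions.append(transition.action)\n",
"        flat_next_states.append(transition.next_state['measurements'])\n",
"\n",
"flat_states = np.array(flat_states)\n",
"flat_actions = np.array(flat_actions)\n",
"flat_next_states = np.array(flat_next_states)\n",
"print('states:', flat_states.shape, 'actions:', flat_actions.shape, 'next_states:', flat_next_states.shape)"
]
},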
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}