{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we'll build a new agent that implements the Categorical Deep Q Network algorithm (https://arxiv.org/pdf/1707.06887.pdf), and a preset that runs the agent on the breakout game of the Atari environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Agent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll start by defining a new head for the neural network used by this algorithm - ```CategoricalQHead```. \n", "\n", "A head is the final part of the network. It takes the embedding from the middleware embedder and passes it through a neural network to produce the output of the network. There can be multiple heads in a network, and each one has an assigned loss function. The heads are algorithm dependent.\n", "\n", "It will be defined in a new file - ```architectures/tensorflow_components/heads/categorical_dqn_head.py```.\n", "\n", "First - some imports." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "module_path = os.path.abspath(os.path.join('..'))\n", "if module_path not in sys.path:\n", " sys.path.append(module_path)\n", "\n", "import tensorflow as tf\n", "from rl_coach.architectures.tensorflow_components.heads.head import Head, HeadParameters\n", "from rl_coach.base_parameters import AgentParameters\n", "from rl_coach.core_types import QActionStateValue\n", "from rl_coach.spaces import SpacesDefinition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's define a class - ```CategoricalQHeadParameters``` - containing the head parameters and the head itself. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class CategoricalQHeadParameters(HeadParameters):\n", " def __init__(self, activation_function: str ='relu', name: str='categorical_q_head_params'):\n", " super().__init__(parameterized_class=CategoricalQHead, activation_function=activation_function, name=name)\n", "\n", "class CategoricalQHead(Head):\n", " def __init__(self, agent_parameters: AgentParameters, spaces: SpacesDefinition, network_name: str,\n", " head_idx: int = 0, loss_weight: float = 1., is_local: bool = True, activation_function: str ='relu'):\n", " super().__init__(agent_parameters, spaces, network_name, head_idx, loss_weight, is_local, activation_function)\n", " self.name = 'categorical_dqn_head'\n", " self.num_actions = len(self.spaces.action.actions)\n", " self.num_atoms = agent_parameters.algorithm.atoms\n", " self.return_type = QActionStateValue\n", "\n", " def _build_module(self, input_layer):\n", " self.actions = tf.placeholder(tf.int32, [None], name=\"actions\")\n", " self.input = [self.actions]\n", "\n", " values_distribution = tf.layers.dense(input_layer, self.num_actions * self.num_atoms, name='output')\n", " values_distribution = tf.reshape(values_distribution, (tf.shape(values_distribution)[0], self.num_actions,\n", " self.num_atoms))\n", " # softmax on atoms dimension\n", " self.output = tf.nn.softmax(values_distribution)\n", "\n", " # calculate cross entropy loss\n", " self.distributions = tf.placeholder(tf.float32, shape=(None, self.num_actions, self.num_atoms),\n", " name=\"distributions\")\n", " self.target = self.distributions\n", " self.loss = tf.nn.softmax_cross_entropy_with_logits(labels=self.target, logits=values_distribution)\n", " tf.losses.add_loss(self.loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now 
let's define the network parameters. They reuse the DQN network parameters, except that the head parameters are replaced with our ```CategoricalQHeadParameters```." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.agents.dqn_agent import DQNNetworkParameters\n", "\n", "\n", "class CategoricalDQNNetworkParameters(DQNNetworkParameters):\n", "    def __init__(self):\n", "        super().__init__()\n", "        self.heads_parameters = [CategoricalQHeadParameters()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll define the algorithm parameters, which are the same as the DQN algorithm parameters, with the addition of the Categorical DQN specific parameters: ```v_min```, ```v_max``` and the number of atoms.\n", "We'll also define the parameters of the exploration policy - epsilon greedy, with epsilon starting at 1.0 and decaying to 0.01 over 1,000,000 steps, and a fixed epsilon of 0.001 during evaluation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.agents.dqn_agent import DQNAlgorithmParameters\n", "from rl_coach.exploration_policies.e_greedy import EGreedyParameters\n", "from rl_coach.schedules import LinearSchedule\n", "\n", "\n", "class CategoricalDQNAlgorithmParameters(DQNAlgorithmParameters):\n", "    def __init__(self):\n", "        super().__init__()\n", "        self.v_min = -10.0\n", "        self.v_max = 10.0\n", "        self.atoms = 51\n", "\n", "\n", "class CategoricalDQNExplorationParameters(EGreedyParameters):\n", "    def __init__(self):\n", "        super().__init__()\n", "        self.epsilon_schedule = LinearSchedule(1, 0.01, 1000000)\n", "        self.evaluation_epsilon = 0.001" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's define the agent parameters class, which contains all the parameters to be used by the agent - the network, algorithm and exploration parameters that we defined above, as well as the parameters of the memory module, which in this case is experience replay." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.agents.value_optimization_agent import ValueOptimizationAgent\n", "from rl_coach.base_parameters import AgentParameters\n", "from rl_coach.core_types import StateType\n", "from rl_coach.memories.non_episodic.experience_replay import ExperienceReplayParameters\n", "\n", "\n", "class CategoricalDQNAgentParameters(AgentParameters):\n", "    def __init__(self):\n", "        super().__init__(algorithm=CategoricalDQNAlgorithmParameters(),\n", "                         exploration=CategoricalDQNExplorationParameters(),\n", "                         memory=ExperienceReplayParameters(),\n", "                         networks={\"main\": CategoricalDQNNetworkParameters()})\n", "\n", "    @property\n", "    def path(self):\n", "        return 'agents.categorical_dqn_agent:CategoricalDQNAgent'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last step is to define the agent itself - ```CategoricalDQNAgent```. It is a value optimization agent, so it inherits from the ```ValueOptimizationAgent``` class, and it implements the ```learn_from_batch``` function, which updates the agent's networks according to an input batch of transitions."
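, "\n", "\n", "The heart of ```learn_from_batch``` is projecting the target distribution onto the fixed support of atoms (the ```z_values```). Before looking at the full agent, here is a minimal standalone numpy sketch of just that projection step. The function and argument names used here (```project_distribution```, ```rewards```, ```game_overs```, ```next_distribution```) are hypothetical and not part of the Coach API - the sketch only illustrates the computation that the loop inside ```learn_from_batch``` performs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "\n", "# A minimal sketch of the C51 projection step, under assumed inputs:\n", "#   rewards: (batch,) rewards\n", "#   game_overs: (batch,) episode termination flags (0 or 1)\n", "#   next_distribution: (batch, atoms) probabilities of the greedy next action\n", "#   z_values: (atoms,) the fixed support of the value distribution\n", "def project_distribution(rewards, game_overs, next_distribution, z_values, discount=0.99):\n", "    batch_size = rewards.shape[0]\n", "    batches = np.arange(batch_size)\n", "    delta_z = z_values[1] - z_values[0]\n", "    m = np.zeros((batch_size, z_values.size))\n", "    for j in range(z_values.size):\n", "        # apply the Bellman update to atom j and clip it to the support\n", "        tzj = np.clip(rewards + (1.0 - game_overs) * discount * z_values[j], z_values[0], z_values[-1])\n", "        bj = (tzj - z_values[0]) / delta_z\n", "        u, l = np.ceil(bj).astype(int), np.floor(bj).astype(int)\n", "        # split the probability mass of atom j between its two neighboring support points\n", "        m[batches, l] += next_distribution[:, j] * (u - bj)\n", "        m[batches, u] += next_distribution[:, j] * (bj - l)\n", "    return m\n", "\n", "\n", "# toy usage with a 51-atom support between -10 and 10\n", "z_values = np.linspace(-10.0, 10.0, 51)\n", "next_distribution = np.ones((2, 51)) / 51\n", "projected = project_distribution(np.array([1.0, 0.5]), np.array([0.0, 0.0]), next_distribution, z_values)\n", "print(projected.shape)  # (2, 51)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that in mind, here is the full agent."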
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from typing import Union\n", "\n", "import numpy as np\n", "\n", "\n", "# Categorical Deep Q Network - https://arxiv.org/pdf/1707.06887.pdf\n", "class CategoricalDQNAgent(ValueOptimizationAgent):\n", "    def __init__(self, agent_parameters, parent: Union['LevelManager', 'CompositeAgent']=None):\n", "        super().__init__(agent_parameters, parent)\n", "        self.z_values = np.linspace(self.ap.algorithm.v_min, self.ap.algorithm.v_max, self.ap.algorithm.atoms)\n", "\n", "    def distribution_prediction_to_q_values(self, prediction):\n", "        return np.dot(prediction, self.z_values)\n", "\n", "    # prediction's format is (batch, actions, atoms)\n", "    def get_all_q_values_for_states(self, states: StateType):\n", "        prediction = self.get_prediction(states)\n", "        return self.distribution_prediction_to_q_values(prediction)\n", "\n", "    def learn_from_batch(self, batch):\n", "        network_keys = self.ap.network_wrappers['main'].input_embedders_parameters.keys()\n", "\n", "        # for the action we actually took, the error is calculated from the atoms distribution\n", "        # for all other actions, the error is 0\n", "        distributed_q_st_plus_1, TD_targets = self.networks['main'].parallel_prediction([\n", "            (self.networks['main'].target_network, batch.next_states(network_keys)),\n", "            (self.networks['main'].online_network, batch.states(network_keys))\n", "        ])\n", "\n", "        # only update the action that we have actually taken in this transition\n", "        target_actions = np.argmax(self.distribution_prediction_to_q_values(distributed_q_st_plus_1), axis=1)\n", "        m = np.zeros((self.ap.network_wrappers['main'].batch_size, self.z_values.size))\n", "\n", "        batches = np.arange(self.ap.network_wrappers['main'].batch_size)\n", "        for j in range(self.z_values.size):\n", "            tzj = np.fmax(np.fmin(batch.rewards() +\n", "                                  (1.0 - batch.game_overs()) * self.ap.algorithm.discount * self.z_values[j],\n", "                                  self.z_values[self.z_values.size - 1]),\n", "                          self.z_values[0])\n", "            bj = (tzj - self.z_values[0]) / (self.z_values[1] - self.z_values[0])\n", "            u = (np.ceil(bj)).astype(int)\n", "            l = (np.floor(bj)).astype(int)\n", "            m[batches, l] = m[batches, l] + (distributed_q_st_plus_1[batches, target_actions, j] * (u - bj))\n", "            m[batches, u] = m[batches, u] + (distributed_q_st_plus_1[batches, target_actions, j] * (bj - l))\n", "        # total_loss = cross entropy between the projected distribution above and the prediction for the given action\n", "        TD_targets[batches, batch.actions()] = m\n", "\n", "        result = self.networks['main'].train_and_sync_networks(batch.states(network_keys), TD_targets)\n", "        total_loss, losses, unclipped_grads = result[:3]\n", "\n", "        return total_loss, losses, unclipped_grads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Preset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new preset will be defined in a new file - ```presets/atari_categorical_dqn.py```.\n", "\n", "First - let's define the agent parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.agents.categorical_dqn_agent import CategoricalDQNAgentParameters\n", "\n", "\n", "agent_params = CategoricalDQNAgentParameters()\n", "agent_params.network_wrappers['main'].learning_rate = 0.00025" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Environment parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.environments.gym_environment import Atari, 
atari_deterministic_v4\n", "from rl_coach.environments.environment import MaxDumpMethod, SelectedPhaseOnlyDumpMethod, SingleLevelSelection\n", "\n", "\n", "env_params = Atari()\n", "env_params.level = SingleLevelSelection(atari_deterministic_v4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Schedule and visualization parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.graph_managers.graph_manager import ScheduleParameters\n", "from rl_coach.core_types import EnvironmentSteps, RunPhase\n", "from rl_coach.base_parameters import VisualizationParameters\n", "\n", "\n", "schedule_params = ScheduleParameters()\n", "schedule_params.improve_steps = EnvironmentSteps(50000000)\n", "schedule_params.steps_between_evaluation_periods = EnvironmentSteps(250000)\n", "schedule_params.evaluation_steps = EnvironmentSteps(135000)\n", "schedule_params.heatup_steps = EnvironmentSteps(50000)\n", "\n", "vis_params = VisualizationParameters()\n", "vis_params.video_dump_methods = [SelectedPhaseOnlyDumpMethod(RunPhase.TEST), MaxDumpMethod()]\n", "vis_params.dump_mp4 = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Connecting all the dots together - we'll define a graph manager with the Categorical DQN agent parameters, the Atari environment parameters, and the schedule and visualization parameters defined above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager\n", "\n", "\n", "graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,\n", "                                    schedule_params=schedule_params, vis_params=vis_params)\n", "graph_manager.env_params.level.select('breakout')\n", "graph_manager.visualization_parameters.render = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Running the Preset\n", "(this is normally done from the command line by running ```coach -p Atari_C51 ... ```)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_coach.base_parameters import TaskParameters, Frameworks\n", "\n", "log_path = '../experiments/atari_categorical_dqn'\n", "if not os.path.exists(log_path):\n", "    os.makedirs(log_path)\n", "\n", "task_parameters = TaskParameters(framework_type=\"tensorflow\",\n", "                                 evaluate_only=False,\n", "                                 experiment_path=log_path)\n", "\n", "task_parameters.__dict__['save_checkpoint_secs'] = None\n", "\n", "graph_manager.create_graph(task_parameters)\n", "\n", "# let the adventure begin\n", "graph_manager.improve()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }