{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial we'll build a new agent that implements the Categorical Deep Q Network algorithm (https://arxiv.org/pdf/1707.06887.pdf), and a preset that runs the agent on the breakout game of the Atari environment."
   ]
  },
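  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a brief, informal recap of the idea behind the algorithm (see the paper for the full details): instead of predicting a single Q-value per action, the network predicts a categorical distribution over returns, supported on $N$ fixed atoms spaced evenly between $v_{min}$ and $v_{max}$:\n",
    "\n",
    "$$z_i = v_{min} + i \\cdot \\frac{v_{max} - v_{min}}{N - 1}, \\qquad i = 0, \\ldots, N - 1$$\n",
    "\n",
    "The Q-value of an action is then simply the expectation over the atoms, $Q(s, a) = \\sum_i z_i \\, p_i(s, a)$, and the training target is the reward-shifted and discounted next-state distribution, projected back onto this fixed support - which is exactly what the ```learn_from_batch``` function we'll write below implements. This tutorial uses the paper's defaults of $v_{min} = -10$, $v_{max} = 10$ and $N = 51$ atoms (hence the name C51)."
   ]
  },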
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Agent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll start by defining a new head for the neural network used by this algorithm - ```CategoricalQHead```.\n",
    "\n",
    "A head is the final part of the network. It takes the embedding from the middleware embedder and passes it through a neural network to produce the output of the network. There can be multiple heads in a network, and each one has an assigned loss function. The heads are algorithm-dependent.\n",
    "\n",
    "The new head will be defined in a new file - ```architectures/tensorflow_components/heads/categorical_dqn_head.py```.\n",
    "\n",
    "First - some imports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "module_path = os.path.abspath(os.path.join('..'))\n",
    "if module_path not in sys.path:\n",
    "    sys.path.append(module_path)\n",
    "\n",
    "import tensorflow as tf\n",
    "from rl_coach.architectures.tensorflow_components.heads.head import Head, HeadParameters\n",
    "from rl_coach.base_parameters import AgentParameters\n",
    "from rl_coach.core_types import QActionStateValue\n",
    "from rl_coach.spaces import SpacesDefinition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's define two classes - ```CategoricalQHeadParameters```, which holds the head's parameters, and the head itself, ```CategoricalQHead```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CategoricalQHeadParameters(HeadParameters):\n",
    "    def __init__(self, activation_function: str = 'relu', name: str = 'categorical_q_head_params'):\n",
    "        super().__init__(parameterized_class=CategoricalQHead, activation_function=activation_function, name=name)\n",
    "\n",
    "\n",
    "class CategoricalQHead(Head):\n",
    "    def __init__(self, agent_parameters: AgentParameters, spaces: SpacesDefinition, network_name: str,\n",
    "                 head_idx: int = 0, loss_weight: float = 1., is_local: bool = True, activation_function: str = 'relu'):\n",
    "        super().__init__(agent_parameters, spaces, network_name, head_idx, loss_weight, is_local, activation_function)\n",
    "        self.name = 'categorical_dqn_head'\n",
    "        self.num_actions = len(self.spaces.action.actions)\n",
    "        self.num_atoms = agent_parameters.algorithm.atoms\n",
    "        self.return_type = QActionStateValue\n",
    "\n",
    "    def _build_module(self, input_layer):\n",
    "        self.actions = tf.placeholder(tf.int32, [None], name=\"actions\")\n",
    "        self.input = [self.actions]\n",
    "\n",
    "        values_distribution = tf.layers.dense(input_layer, self.num_actions * self.num_atoms, name='output')\n",
    "        values_distribution = tf.reshape(values_distribution, (tf.shape(values_distribution)[0], self.num_actions,\n",
    "                                                               self.num_atoms))\n",
    "        # softmax on atoms dimension\n",
    "        self.output = tf.nn.softmax(values_distribution)\n",
    "\n",
    "        # calculate cross entropy loss\n",
    "        self.distributions = tf.placeholder(tf.float32, shape=(None, self.num_actions, self.num_atoms),\n",
    "                                            name=\"distributions\")\n",
    "        self.target = self.distributions\n",
    "        self.loss = tf.nn.softmax_cross_entropy_with_logits(labels=self.target, logits=values_distribution)\n",
    "        tf.losses.add_loss(self.loss)"
   ]
  },
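  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feel for what this head produces, here is a small self-contained numpy sketch (the shapes and numbers below are made up for illustration and don't use the head class above): a batch of per-action probability distributions over the atoms, and the Q-values recovered by taking the expectation over the support - the same operation the agent will later perform in its ```distribution_prediction_to_q_values``` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# hypothetical sizes, for illustration only\n",
    "batch_size, num_actions, num_atoms = 2, 4, 51\n",
    "v_min, v_max = -10.0, 10.0\n",
    "z_values = np.linspace(v_min, v_max, num_atoms)    # the fixed support of the distribution\n",
    "\n",
    "# a fake head output: softmax over the atoms dimension, shape (batch, actions, atoms)\n",
    "logits = np.random.randn(batch_size, num_actions, num_atoms)\n",
    "distributions = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)\n",
    "\n",
    "# the expectation over the atoms recovers a regular Q-value per action\n",
    "q_values = np.dot(distributions, z_values)         # shape (batch, actions)\n",
    "print(q_values.shape)\n",
    "print(distributions.sum(axis=-1))                  # each distribution sums to 1"
   ]
  },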
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's go ahead and define the network parameters. They reuse the DQN network parameters, replacing only the head parameters with our ```CategoricalQHeadParameters```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.agents.dqn_agent import DQNNetworkParameters\n",
    "\n",
    "\n",
    "class CategoricalDQNNetworkParameters(DQNNetworkParameters):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.heads_parameters = [CategoricalQHeadParameters()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll define the algorithm parameters, which are the same as the DQN algorithm parameters, with the addition of the Categorical DQN-specific v_min, v_max and number of atoms.\n",
    "We'll also define the parameters of the exploration policy - epsilon-greedy, with epsilon starting at 1.0 and decaying to 0.01 over 1,000,000 steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.agents.dqn_agent import DQNAlgorithmParameters\n",
    "from rl_coach.exploration_policies.e_greedy import EGreedyParameters\n",
    "from rl_coach.schedules import LinearSchedule\n",
    "\n",
    "\n",
    "class CategoricalDQNAlgorithmParameters(DQNAlgorithmParameters):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.v_min = -10.0\n",
    "        self.v_max = 10.0\n",
    "        self.atoms = 51\n",
    "\n",
    "\n",
    "class CategoricalDQNExplorationParameters(EGreedyParameters):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.epsilon_schedule = LinearSchedule(1, 0.01, 1000000)\n",
    "        self.evaluation_epsilon = 0.001"
   ]
  },
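  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the exploration schedule, the sketch below illustrates a plain linear decay from 1.0 to 0.01 over 1,000,000 steps - it is only meant to show what ```LinearSchedule(1, 0.01, 1000000)``` is configured to do, and is not Coach's actual implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a minimal linear-decay sketch, assuming simple linear interpolation between the two endpoints\n",
    "def linear_epsilon(step, initial=1.0, final=0.01, decay_steps=1000000):\n",
    "    fraction = min(step, decay_steps) / decay_steps\n",
    "    return initial + fraction * (final - initial)\n",
    "\n",
    "for step in [0, 250000, 500000, 1000000, 2000000]:\n",
    "    print(step, round(linear_epsilon(step), 4))"
   ]
  },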
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's define the agent parameters class, which collects all the parameters used by the agent - the network, algorithm and exploration parameters defined above, along with the parameters of the memory module, which in this case is an experience replay buffer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.agents.value_optimization_agent import ValueOptimizationAgent\n",
    "from rl_coach.base_parameters import AgentParameters\n",
    "from rl_coach.core_types import StateType\n",
    "from rl_coach.memories.non_episodic.experience_replay import ExperienceReplayParameters\n",
    "\n",
    "\n",
    "class CategoricalDQNAgentParameters(AgentParameters):\n",
    "    def __init__(self):\n",
    "        super().__init__(algorithm=CategoricalDQNAlgorithmParameters(),\n",
    "                         exploration=CategoricalDQNExplorationParameters(),\n",
    "                         memory=ExperienceReplayParameters(),\n",
    "                         networks={\"main\": CategoricalDQNNetworkParameters()})\n",
    "\n",
    "    @property\n",
    "    def path(self):\n",
    "        return 'agents.categorical_dqn_agent:CategoricalDQNAgent'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last step is to define the agent itself - ```CategoricalDQNAgent```. It is a value optimization agent, so it inherits from the ```ValueOptimizationAgent``` class. Our agent implements the ```learn_from_batch``` function, which updates the agent's networks according to an input batch of transitions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from typing import Union\n",
    "\n",
    "\n",
    "# Categorical Deep Q Network - https://arxiv.org/pdf/1707.06887.pdf\n",
    "class CategoricalDQNAgent(ValueOptimizationAgent):\n",
    "    def __init__(self, agent_parameters, parent: Union['LevelManager', 'CompositeAgent']=None):\n",
    "        super().__init__(agent_parameters, parent)\n",
    "        self.z_values = np.linspace(self.ap.algorithm.v_min, self.ap.algorithm.v_max, self.ap.algorithm.atoms)\n",
    "\n",
    "    def distribution_prediction_to_q_values(self, prediction):\n",
    "        return np.dot(prediction, self.z_values)\n",
    "\n",
    "    # prediction's format is (batch, actions, atoms)\n",
    "    def get_all_q_values_for_states(self, states: StateType):\n",
    "        prediction = self.get_prediction(states)\n",
    "        return self.distribution_prediction_to_q_values(prediction)\n",
    "\n",
    "    def learn_from_batch(self, batch):\n",
    "        network_keys = self.ap.network_wrappers['main'].input_embedders_parameters.keys()\n",
    "\n",
    "        # for the action we actually took, the error is calculated by the atoms distribution\n",
    "        # for all other actions, the error is 0\n",
    "        distributed_q_st_plus_1, TD_targets = self.networks['main'].parallel_prediction([\n",
    "            (self.networks['main'].target_network, batch.next_states(network_keys)),\n",
    "            (self.networks['main'].online_network, batch.states(network_keys))\n",
    "        ])\n",
    "\n",
    "        # only update the action that we have actually done in this transition\n",
    "        target_actions = np.argmax(self.distribution_prediction_to_q_values(distributed_q_st_plus_1), axis=1)\n",
    "        m = np.zeros((self.ap.network_wrappers['main'].batch_size, self.z_values.size))\n",
    "\n",
    "        # project the shifted and discounted next-state distribution back onto the fixed support (z_values)\n",
    "        batches = np.arange(self.ap.network_wrappers['main'].batch_size)\n",
    "        for j in range(self.z_values.size):\n",
    "            tzj = np.fmax(np.fmin(batch.rewards() +\n",
    "                                  (1.0 - batch.game_overs()) * self.ap.algorithm.discount * self.z_values[j],\n",
    "                                  self.z_values[self.z_values.size - 1]),\n",
    "                          self.z_values[0])\n",
    "            bj = (tzj - self.z_values[0]) / (self.z_values[1] - self.z_values[0])\n",
    "            u = (np.ceil(bj)).astype(int)\n",
    "            l = (np.floor(bj)).astype(int)\n",
    "            m[batches, l] = m[batches, l] + (distributed_q_st_plus_1[batches, target_actions, j] * (u - bj))\n",
    "            m[batches, u] = m[batches, u] + (distributed_q_st_plus_1[batches, target_actions, j] * (bj - l))\n",
    "        # total_loss = cross entropy between actual result above and predicted result for the given action\n",
    "        TD_targets[batches, batch.actions()] = m\n",
    "\n",
    "        result = self.networks['main'].train_and_sync_networks(batch.states(network_keys), TD_targets)\n",
    "        total_loss, losses, unclipped_grads = result[:3]\n",
    "\n",
    "        return total_loss, losses, unclipped_grads"
   ]
  },
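  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The projection loop in ```learn_from_batch``` is the trickiest part of the algorithm, so here is a small standalone numpy example of the same computation for a single transition, using a toy support of 5 atoms and made-up reward, discount and next-state distribution. It mirrors the ```tzj```/```bj```/```u```/```l``` computation above: each shifted atom has its probability mass split between the two neighbouring atoms of the fixed support."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# toy support: 5 atoms between -2 and 2\n",
    "z_values = np.linspace(-2.0, 2.0, 5)               # [-2., -1., 0., 1., 2.]\n",
    "delta_z = z_values[1] - z_values[0]\n",
    "\n",
    "# made-up transition and next-state distribution for the chosen target action\n",
    "reward, discount, game_over = 0.25, 0.8, 0.0\n",
    "next_state_distribution = np.array([0.1, 0.2, 0.4, 0.2, 0.1])\n",
    "\n",
    "m = np.zeros(z_values.size)\n",
    "for j in range(z_values.size):\n",
    "    # shift, discount and clip the atom onto [v_min, v_max], like the tzj computation above\n",
    "    tzj = np.clip(reward + (1.0 - game_over) * discount * z_values[j], z_values[0], z_values[-1])\n",
    "    bj = (tzj - z_values[0]) / delta_z              # position of the shifted atom on the support\n",
    "    u, l = int(np.ceil(bj)), int(np.floor(bj))\n",
    "    # split the probability mass between the two neighbouring atoms\n",
    "    m[l] += next_state_distribution[j] * (u - bj)\n",
    "    m[u] += next_state_distribution[j] * (bj - l)\n",
    "\n",
    "print(m, m.sum())                                   # the projected target distribution still sums to 1 (up to floating point)"
   ]
  },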
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Preset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The new preset will be defined in a new file - ```presets/atari_categorical_dqn.py```.\n",
    "\n",
    "First - let's define the agent parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.agents.categorical_dqn_agent import CategoricalDQNAgentParameters\n",
    "\n",
    "\n",
    "agent_params = CategoricalDQNAgentParameters()\n",
    "agent_params.network_wrappers['main'].learning_rate = 0.00025"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Environment parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.environments.gym_environment import Atari, atari_deterministic_v4\n",
    "from rl_coach.environments.environment import MaxDumpMethod, SelectedPhaseOnlyDumpMethod, SingleLevelSelection\n",
    "\n",
    "\n",
    "env_params = Atari()\n",
    "env_params.level = SingleLevelSelection(atari_deterministic_v4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Schedule and visualization parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.graph_managers.graph_manager import ScheduleParameters\n",
    "from rl_coach.core_types import EnvironmentSteps, RunPhase\n",
    "from rl_coach.base_parameters import VisualizationParameters\n",
    "\n",
    "\n",
    "schedule_params = ScheduleParameters()\n",
    "schedule_params.improve_steps = EnvironmentSteps(50000000)\n",
    "schedule_params.steps_between_evaluation_periods = EnvironmentSteps(250000)\n",
    "schedule_params.evaluation_steps = EnvironmentSteps(135000)\n",
    "schedule_params.heatup_steps = EnvironmentSteps(50000)\n",
    "\n",
    "vis_params = VisualizationParameters()\n",
    "vis_params.video_dump_methods = [SelectedPhaseOnlyDumpMethod(RunPhase.TEST), MaxDumpMethod()]\n",
    "vis_params.dump_mp4 = False"
   ]
  },
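  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To put these numbers in perspective: the agent first takes 50,000 heatup steps to fill the replay memory with initial experience before any training, and then alternates between training and evaluation - an evaluation phase of 135,000 environment steps runs after every 250,000 training steps, which works out to roughly 50,000,000 / 250,000 = 200 evaluation phases over the full run."
   ]
  },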
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Connecting all the dots together - we'll define a graph manager with the Categorical DQN agent parameters, the Atari environment parameters, and the schedule and visualization parameters defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager\n",
    "\n",
    "\n",
    "graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,\n",
    "                                    schedule_params=schedule_params, vis_params=vis_params)\n",
    "graph_manager.env_params.level.select('breakout')\n",
    "graph_manager.visualization_parameters.render = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Running the Preset\n",
    "(this is normally done from the command line by running ```coach -p Atari_C51 ...```)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rl_coach.base_parameters import TaskParameters, Frameworks\n",
    "\n",
    "log_path = '../experiments/atari_categorical_dqn'\n",
    "if not os.path.exists(log_path):\n",
    "    os.makedirs(log_path)\n",
    "\n",
    "task_parameters = TaskParameters(framework_type=Frameworks.tensorflow,\n",
    "                                 evaluate_only=False,\n",
    "                                 experiment_path=log_path)\n",
    "\n",
    "task_parameters.__dict__['checkpoint_save_secs'] = None\n",
    "\n",
    "graph_manager.create_graph(task_parameters)\n",
    "\n",
    "# let the adventure begin\n",
    "graph_manager.improve()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}