Bootstrapped DQN
================
**Action space:** Discrete

**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/bs_dqn.png
   :align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current states are used as the input to the network. The network contains several :math:`Q` heads, each of which
returns a different estimate of the action :math:`Q` values. At the start of each episode, the bootstrapped exploration
policy selects a single head to play with for the entire episode, and only the :math:`Q` values of that head are used.
Based on those :math:`Q` values, the exploration policy then selects the action to act with, as sketched below.
Storing the transitions
+++++++++++++++++++++++
For each transition, a binary mask is generated with one element per output head, where each element is sampled
independently from a Bernoulli distribution with a predefined probability. An element holds 0 for a head that should
not train on this transition, and 1 for a head that should use the transition for training. The mask is stored as
part of the transition info in the replay buffer.
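A minimal sketch of the mask generation, assuming ``num_heads`` output heads and a predefined keep probability
``mask_probability`` (both names are illustrative, not Coach's parameter names):

.. code-block:: python

    import numpy as np

    def make_bootstrap_mask(num_heads: int, mask_probability: float) -> np.ndarray:
        # Each element is an independent Bernoulli sample: 1 means the
        # corresponding head trains on this transition, 0 means it ignores it.
        return np.random.binomial(n=1, p=mask_probability, size=num_heads)

    # Example: a 10-head network where each head sees roughly 80% of the transitions.
    mask = make_bootstrap_mask(num_heads=10, mask_probability=0.8)
    # e.g. array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1]); stored with the transition info.
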
Training the network
++++++++++++++++++++
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
and for each output head, if the mask value for that head is 1, change the target of the played action to
:math:`y_t`, according to the standard DQN update rule, where the bootstrap term is computed using the target network:

:math:`y_t=r(s_t,a_t)+\gamma \cdot \max_a Q(s_{t+1},a)`

Otherwise, leave the target equal to the current prediction, so that the transition does not affect the learning of
this head.
Then, train the online network according to the calculated targets.
As in DQN, once every few thousand steps, copy the weights from the online network to the target network.
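Putting the training step together, here is a rough NumPy sketch of how the masked targets could be assembled.
The array names and shapes are assumptions for illustration, not Coach's internal API:

.. code-block:: python

    import numpy as np

    # online_q : (batch, num_heads, num_actions) Q values from the online network
    # target_q : (batch, num_heads, num_actions) Q values from the target network
    # actions, rewards, dones, masks : arrays taken from the sampled batch
    def build_targets(online_q, target_q, actions, rewards, dones, masks, gamma=0.99):
        targets = online_q.copy()                  # start from the current predictions
        batch_size, num_heads, _ = online_q.shape
        for i in range(batch_size):
            for h in range(num_heads):
                if masks[i, h] == 1:               # this head trains on transition i
                    bootstrap = (1.0 - dones[i]) * gamma * target_q[i, h].max()
                    targets[i, h, actions[i]] = rewards[i] + bootstrap
                # if the mask is 0 the target equals the prediction, so this head's
                # loss for the transition is zero and its weights are left intact
        return targets

Fitting the online network's per-head outputs to these targets with a mean-squared-error loss reproduces the
per-head update described above.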