Deep Q Networks
===============

**Actions space:** Discrete

**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/dqn.png
   :align: center

Algorithm Description
---------------------

Training the network
++++++++++++++++++++

1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.
3. To zero out the updates for the actions that were not played (so that their contribution to the MSE loss is zero),
   run the online network on the current states from the sampled batch to get its current :math:`Q` value predictions,
   and set those predictions as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation for calculating the targets of the network:

   :math:`y_t=r(s_t,a_t)+\gamma \cdot \max_a Q(s_{t+1},a)`
5. Finally, train the online network using the current states as inputs and the targets computed above
   (a minimal sketch of this target construction follows this list).
6. Once every few thousand steps, copy the weights from the online network to the target network.
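
The following is a minimal, framework-agnostic NumPy sketch of the target construction in steps 2-4. The function
and argument names (``online_q``, ``target_q``, the ``batch`` dictionary) are illustrative assumptions for this
example only and are not part of Coach's API.

.. code-block:: python

   import numpy as np

   def dqn_targets(online_q, target_q, batch, discount=0.99):
       """Build regression targets for a sampled batch of transitions.

       online_q / target_q: callables mapping a (batch, state_dim) array of
       states to a (batch, num_actions) array of Q values.
       batch: dict with 'states', 'actions', 'rewards', 'next_states', 'dones'.
       """
       # Step 3: start from the online network's current predictions, so the
       # MSE loss (and therefore the update) is zero for every action that
       # was not actually played.
       targets = online_q(batch['states']).copy()

       # Step 2: run the target network on the next states and keep only the
       # maximum Q value for each state.
       bootstrap = np.max(target_q(batch['next_states']), axis=1)

       # Step 4: y_t = r(s_t, a_t) + gamma * max_a Q(s_{t+1}, a).
       # Terminal transitions get no bootstrap term.
       y = batch['rewards'] + discount * bootstrap * (1.0 - batch['dones'])

       # Overwrite only the entries of the played actions with these targets.
       targets[np.arange(len(y)), batch['actions']] = y
       return targets

Step 5 then fits the online network to ``targets`` with an MSE loss on the current states, and step 6 periodically
copies the online weights into the target network.
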
.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters