Action space: Discrete
References: Deep Reinforcement Learning with Double Q-learning
Network Structure
Algorithm Description
Training the network
Sample a batch of transitions from the replay buffer.
Using the next states from the sampled batch, run the online network to find the Q-maximizing action $\arg\max_a Q(s_{t+1}, a)$. For these actions, run the target network on the same next states to calculate $Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$.
To zero out the updates for the actions that were not played (so that their contribution to the MSE loss is zero), use the current states from the sampled batch and run the online network to get the current Q-value predictions. Set those values as the targets for the actions that were not actually played.
For each action that was played, use the following equation for calculating the targets of the network: $y_t = r(s_t, a_t) + \gamma \cdot Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$
Finally, train the online network using the current states as inputs, and with the aforementioned targets (a code sketch of this target construction follows this procedure).
Once in every few thousand steps, copy the weights from the online network to the target network.
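As a rough illustration of the target construction described above, here is a minimal NumPy sketch. The function and argument names are hypothetical and not taken from any particular library; it assumes the Q-value arrays for the batch have already been computed by the online and target networks.

```python
import numpy as np

def double_dqn_targets(online_q_current, online_q_next, target_q_next,
                       actions, rewards, dones, gamma=0.99):
    """Build per-action training targets for Double DQN.

    The *_q_* arrays have shape (batch_size, num_actions); `actions`,
    `rewards`, and `dones` have shape (batch_size,). All names here are
    illustrative assumptions, not part of any library's API.
    """
    batch_idx = np.arange(len(actions))

    # Start from the online network's current predictions so that the
    # MSE loss is zero for every action that was not actually played.
    targets = online_q_current.copy()

    # Select the next action with the online network, evaluate it with the
    # target network (the Double DQN decoupling).
    best_next_actions = np.argmax(online_q_next, axis=1)
    next_q = target_q_next[batch_idx, best_next_actions]

    # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, argmax_a Q(s_{t+1}, a)),
    # with no bootstrapping on terminal transitions.
    targets[batch_idx, actions] = rewards + gamma * (1.0 - dones) * next_q
    return targets
```

Starting the targets from the online network's own predictions is what makes the loss vanish for the non-played actions, so only the played action in each transition contributes to the gradient.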