update of api docstrings across coach and tutorials [WIP] (#91)
* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
docs_raw/source/components/agents/value_optimization/naf.rst (new file, 33 lines added)
@@ -0,0 +1,33 @@
Normalized Advantage Functions
==============================

**Action space:** Continuous

**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_

Network Structure
-----------------

.. image:: /_static/img/design_imgs/naf.png
   :width: 600px
   :align: center

Algorithm Description
---------------------

Choosing an action
++++++++++++++++++
The current state is used as an input to the network. The action mean :math:`\mu(s_t)` is extracted from the output head.
It is then passed to the exploration policy, which adds noise to encourage exploration.
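
A minimal sketch of this step, assuming the action mean is available as a callable and using additive Gaussian noise as a stand-in exploration policy (the names and values below are illustrative, not Coach's API):

.. code-block:: python

   import numpy as np

   def choose_action(mean_head_fn, state, noise_std=0.1, action_bounds=(-1.0, 1.0)):
       """Illustrative NAF action selection: a noisy version of mu(s_t)."""
       mu = mean_head_fn(state)                             # action mean from the output head
       noise = np.random.normal(0.0, noise_std, mu.shape)   # exploration noise
       return np.clip(mu + noise, *action_bounds)           # keep the action within bounds

   # Example with a dummy stand-in for the network's mean head:
   dummy_mean_head = lambda s: np.tanh(s[:2])
   action = choose_action(dummy_mean_head, np.array([0.3, -0.5, 0.1]))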

Training the network
++++++++++++++++++++
The network is trained using the following targets:

:math:`y_t=r(s_t,a_t)+\gamma \cdot V(s_{t+1})`

Use the next states as the inputs to the target network and extract the :math:`V` value from its head
to get :math:`V(s_{t+1})`. Then, update the online network using the current states and actions as inputs,
and :math:`y_t` as the targets.
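
A minimal sketch of the target computation, with the usual terminal-state masking added (which the text above does not spell out); the function and argument names are illustrative, not Coach's API:

.. code-block:: python

   import numpy as np

   def naf_targets(rewards, dones, next_states, target_v_fn, discount=0.99):
       """Illustrative NAF targets: y_t = r(s_t, a_t) + gamma * V(s_{t+1})."""
       next_v = target_v_fn(next_states)                     # V(s_{t+1}) from the target network's V head
       return rewards + discount * (1.0 - dones) * next_v    # no bootstrapping past terminal states

   # Example with dummy batch data and a dummy target V head:
   rewards = np.array([1.0, 0.0])
   dones = np.array([0.0, 1.0])
   next_states = np.zeros((2, 3))
   targets = naf_targets(rewards, dones, next_states, lambda s: 0.5 * np.ones(len(s)))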

After every training step, use a soft update to copy the weights from the online network to the target network.
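
The soft (Polyak) update blends the online weights into the target weights rather than copying them outright; a minimal sketch, assuming the weights are held as parallel lists of arrays and using an illustrative rate :math:`\tau`:

.. code-block:: python

   def soft_update(online_weights, target_weights, tau=0.001):
       """Illustrative soft update: target <- tau * online + (1 - tau) * target."""
       return [tau * w_o + (1.0 - tau) * w_t
               for w_o, w_t in zip(online_weights, target_weights)]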


.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters