Action space: Discrete

References: Increasing the Action Gap: New Operators for Reinforcement Learning

Network Structure

(Network structure diagram: /_static/img/design_imgs/dqn.png)

Algorithm Description

Training the network

  1. Sample a batch of transitions from the replay buffer.

  2. Start by calculating the initial target values in the same manner as they are calculated in DDQN: :math:`y_t^{DDQN}=r(s_t,a_t)+\gamma Q(s_{t+1},\text{argmax}_a Q(s_{t+1},a))`

  3. The action gap :math:`V(s_t)-Q(s_t,a_t)` should then be subtracted from each of the calculated targets. To calculate the action gap, run the target network using the current states and get the :math:`Q` values for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state: :math:`V(s_t)=\max_a Q(s_t,a)`

  4. For advantage learning (AL), subtract the action gap, weighted by a predefined parameter :math:`\alpha`, from the targets :math:`y_t^{DDQN}`: :math:`y_t=y_t^{DDQN}-\alpha\cdot(V(s_t)-Q(s_t,a_t))`

  5. For persistent advantage learning (PAL), the target network is also used to calculate the action gap for the next state, :math:`V(s_{t+1})-Q(s_{t+1},a_{t+1})`, where :math:`a_{t+1}` is chosen by running the next states through the online network and selecting the action with the highest predicted :math:`Q` value. Finally, the targets are defined as :math:`y_t=y_t^{DDQN}-\alpha\cdot\min\left(V(s_t)-Q(s_t,a_t),\ V(s_{t+1})-Q(s_{t+1},a_{t+1})\right)` (a batched sketch of this computation follows the list).

  6. Train the online network using the current states as inputs, and with the aforementioned targets.

  7. Once every few thousand steps, copy the weights from the online network to the target network.
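
The following is a minimal NumPy sketch of the target computation in steps 2-5, assuming the :math:`Q` values have already been predicted by the online and target networks for a batch of transitions. The function name, argument layout, and the ``alpha``, ``discount`` and ``game_overs`` parameters are illustrative assumptions rather than Coach's actual API; masking out terminal next states is a standard detail that the steps above do not spell out.

.. code-block:: python

   import numpy as np

   def al_pal_targets(rewards, actions, game_overs,
                      q_online_next, q_target_current, q_target_next,
                      alpha=0.9, discount=0.99, persistent=True):
       # rewards          -- shape (batch,), r(s_t, a_t)
       # actions          -- shape (batch,), int, a_t
       # game_overs       -- shape (batch,), bool, termination flags (assumed detail)
       # q_online_next    -- shape (batch, num_actions), online-network Q(s_{t+1}, .)
       # q_target_current -- shape (batch, num_actions), target-network Q(s_t, .)
       # q_target_next    -- shape (batch, num_actions), target-network Q(s_{t+1}, .)
       batch = np.arange(rewards.shape[0])

       # Step 2: DDQN target -- next action selected by the online network,
       # evaluated by the target network.
       a_next = np.argmax(q_online_next, axis=1)
       y_ddqn = rewards + (1.0 - game_overs) * discount * q_target_next[batch, a_next]

       # Step 3: action gap for the current state, V(s_t) - Q(s_t, a_t),
       # with V(s_t) = max_a Q(s_t, a) taken from the target network.
       v_current = np.max(q_target_current, axis=1)
       gap_current = v_current - q_target_current[batch, actions]

       if not persistent:
           # Step 4: advantage learning (AL).
           return y_ddqn - alpha * gap_current

       # Step 5: persistent advantage learning (PAL) -- also compute the action
       # gap for the next state, with a_{t+1} chosen by the online network.
       v_next = np.max(q_target_next, axis=1)
       gap_next = v_next - q_target_next[batch, a_next]
       return y_ddqn - alpha * np.minimum(gap_current, gap_next)

The returned values then serve as the regression targets for the online network in step 6.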