# Normalized Advantage Functions

**Action space:** Continuous

**References:** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748)

## Network Structure
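For context, in the referenced paper the output head decomposes the Q value into a state value and a quadratic advantage term built around the action mean:

$$ Q(s_t, a_t) = V(s_t) + A(s_t, a_t), \qquad A(s_t, a_t) = -\frac{1}{2}\big(a_t - \mu(s_t)\big)^T P(s_t) \big(a_t - \mu(s_t)\big) $$

where $P(s_t) = L(s_t) L(s_t)^T$ and $L(s_t)$ is a lower-triangular matrix predicted by the network. Since the advantage is non-positive and zero at $a_t = \mu(s_t)$, the mean is the greedy action and $V(s_t) = \max_a Q(s_t, a)$.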

## Algorithm Description

### Choosing an action

The current state is used as an input to the network, and the action mean $\mu(s_t)$ is extracted from the output head. It is then passed to the exploration policy, which adds noise in order to encourage exploration.
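A minimal sketch of this step, assuming additive Gaussian exploration noise and a hypothetical `network` callable that returns the head outputs (Coach's actual exploration policy is configurable; none of these names are its real API):

```python
import numpy as np

def choose_action(network, state, noise_std=0.1, low=-1.0, high=1.0):
    # `network` is assumed to map a state to its output-head values,
    # including the action mean 'mu' (the state value 'v' is unused here).
    mu = network(state)['mu']
    # Exploration policy (assumed): add Gaussian noise to the mean action,
    # then clip to the valid continuous action range.
    noise = np.random.normal(0.0, noise_std, size=np.shape(mu))
    return np.clip(mu + noise, low, high)
```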

### Training the network

The network is trained using the following targets:

$$ y_t = r(s_t, a_t) + \gamma \cdot V(s_{t+1}) $$

Use the next states as the inputs to the target network and extract the $V$ value from the head to get $V(s_{t+1})$. Then, update the online network using the current states and actions as inputs, and $y_t$ as the targets. After every training step, use a soft update to copy the weights from the online network to the target network.
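Putting these steps together, here is a minimal sketch of one training step. The network interface (`predict_v`, `fit`, `weights`, `set_weights`), the terminal masking via `dones`, and the soft-update rate `tau` are all assumptions for illustration, not Coach's actual API:

```python
def train_step(online_net, target_net, batch, gamma=0.99, tau=0.001):
    states, actions, rewards, next_states, dones = batch

    # Use the next states as inputs to the target network and
    # extract V(s_{t+1}) from the head.
    v_next = target_net.predict_v(next_states)

    # Targets: y_t = r(s_t, a_t) + gamma * V(s_{t+1}).
    # The (1 - dones) factor masks terminal transitions -- standard
    # practice, though not spelled out in the text above.
    y = rewards + gamma * (1.0 - dones) * v_next

    # Update the online network using the current states and actions
    # as inputs, and y_t as the targets.
    online_net.fit([states, actions], y)

    # Soft update: the target network slowly tracks the online network.
    target_net.set_weights([
        tau * w_online + (1.0 - tau) * w_target
        for w_online, w_target in zip(online_net.weights, target_net.weights)
    ])
```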