Neural Episodic Control

Actions space: Discrete

References: Neural Episodic Control

Network Structure

Diagram: /_static/img/design_imgs/nec.png

Algorithm Description

Choosing an action

  1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate output from the middleware.

  2. For each possible action a_i, run the DND head using the state embedding and the selected action a_i as inputs. The DND is queried and returns the P nearest neighbor keys and values, which are used to calculate and return the action's Q value from the network.

  3. Pass all the Q values to the exploration policy and choose an action accordingly.

  4. Store the state embeddings and actions taken during the current episode in a small buffer B, in order to accumulate transitions until the total discounted returns over the entire episode can be calculated (see the sketch after this list).
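The sketch below illustrates steps 1-4. It is only a schematic outline, not Coach's implementation: the online_network, dnd, exploration_policy and episode_buffer objects and their methods (get_embedding, query, get_action) are assumed names, and the Q value is computed with the inverse-distance kernel over the P nearest neighbors described in the NEC paper:

    import numpy as np

    def nec_q_value(dnd, embedding, action, p=50, delta=1e-3):
        # Query the DND of this action for the P nearest neighbors of the embedding
        # (hypothetical DND interface: returns arrays of keys and values).
        keys, values = dnd.query(action, embedding, k=p)
        # Inverse-distance kernel weighting of the neighbors' values.
        kernels = 1.0 / (np.sum((np.asarray(keys) - embedding) ** 2, axis=1) + delta)
        weights = kernels / kernels.sum()
        return float(np.dot(weights, values))

    def choose_action(online_network, dnd, exploration_policy, state, actions, episode_buffer):
        # 1. Extract the state embedding (the intermediate middleware output).
        embedding = online_network.get_embedding(state)
        # 2. Estimate Q(s, a_i) for every possible action through the DND head.
        q_values = np.array([nec_q_value(dnd, embedding, a) for a in actions])
        # 3. Let the exploration policy pick an action given the Q values.
        action = exploration_policy.get_action(q_values)
        # 4. Keep the embedding and chosen action until the episode ends.
        episode_buffer.append((embedding, action))
        return action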

Finalizing an episode

For each step in the episode, the state embedding and the taken action are stored in the buffer B. When the episode is finished, the replay buffer calculates the N-step total return of each transition in the buffer, bootstrapped using the maximum Q value of the N-th transition. Each state embedding is then inserted into the DND of its corresponding action, with the computed total return as its value, and the buffer B is reset.
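As a rough sketch of this bookkeeping (again with assumed names; dnd.insert and the buffer layout are hypothetical), the N-step bootstrapped returns could be computed and written to the DND as follows:

    def finalize_episode(episode_buffer, rewards, bootstrap_q, dnd, n_steps, gamma=0.99):
        # episode_buffer: list of (embedding, action) pairs, one per step.
        # rewards[t]: reward received at step t.
        # bootstrap_q[t]: max_a Q(s_{t+N}, a) for step t, where that state exists.
        episode_length = len(episode_buffer)
        for t, (embedding, action) in enumerate(episode_buffer):
            horizon = min(n_steps, episode_length - t)
            # Discounted sum of up to N consecutive rewards ...
            total_return = sum(gamma ** j * rewards[t + j] for j in range(horizon))
            # ... bootstrapped with the maximum Q value of the N-th transition,
            # if the episode did not end before then.
            if t + n_steps < episode_length:
                total_return += gamma ** n_steps * bootstrap_q[t]
            # Key = state embedding, value = N-step return, stored in the DND
            # of the action that was taken.
            dnd.insert(action, embedding, total_return)
        episode_buffer.clear()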

Training the network

Train the network only when the DND has enough entries for querying.

To train the network, the current states are used as the inputs and the N-step returns are used as the targets. The N-step return takes into account N consecutive steps, and bootstraps the last value from the network if necessary:

y_t = \sum_{j=0}^{N-1} \gamma^j r(s_{t+j}, a_{t+j}) + \gamma^N \max_a Q(s_{t+N}, a)
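A minimal sketch of the resulting training step, with the same assumed object names (has_enough_entries, sample and train_on_batch are hypothetical), might look like this:

    def train_step(online_network, dnd, replay_buffer, batch_size=32):
        # Train only once the DND holds enough entries for a meaningful
        # P-nearest-neighbor query.
        if not dnd.has_enough_entries():
            return None
        # Each sampled transition already carries the N-step return y_t
        # computed when its episode was finalized.
        states, actions, n_step_returns = replay_buffer.sample(batch_size)
        # The states are the network inputs and the N-step returns are the
        # regression targets for Q(s_t, a_t); since the DND lookup is
        # differentiable, the gradients also update the stored keys and values.
        loss = online_network.train_on_batch(states, actions, targets=n_step_returns)
        return loss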