Neural Episodic Control

Actions space: Discrete

References: Neural Episodic Control

Network Structure

Diagram: /_static/img/design_imgs/nec.png

Algorithm Description

Choosing an action

  1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate output from the middleware.

  2. For each possible action a_i, run the DND head using the state embedding and the selected action a_i as inputs. The DND is queried and returns the P nearest neighbor keys and values, which are used to calculate and return the action's Q value from the network.

  3. Pass all the Q values to the exploration policy and choose an action accordingly.

  4. Store the state embeddings and actions taken during the current episode in a small buffer B, in order to accumulate transitions until the total discounted returns over the entire episode can be calculated (see the sketch after this list).
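The sketch below illustrates steps 1-4. It is only a schematic outline, not Coach's implementation: the online_network, dnd, exploration_policy and episode_buffer objects and their methods (get_embedding, query, get_action) are assumed names, and the Q value is computed with the inverse-distance kernel over the P nearest neighbors described in the NEC paper:

    import numpy as np

    def nec_q_value(dnd, embedding, action, p=50, delta=1e-3):
        # Query the DND of this action for the P nearest neighbors of the embedding
        # (hypothetical DND interface: returns arrays of keys and values).
        keys, values = dnd.query(action, embedding, k=p)
        # Inverse-distance kernel weighting of the neighbors' values.
        kernels = 1.0 / (np.sum((np.asarray(keys) - embedding) ** 2, axis=1) + delta)
        weights = kernels / kernels.sum()
        return float(np.dot(weights, values))

    def choose_action(online_network, dnd, exploration_policy, state, actions, episode_buffer):
        # 1. Extract the state embedding (the intermediate middleware output).
        embedding = online_network.get_embedding(state)
        # 2. Estimate Q(s, a_i) for every possible action through the DND head.
        q_values = np.array([nec_q_value(dnd, embedding, a) for a in actions])
        # 3. Let the exploration policy pick an action given the Q values.
        action = exploration_policy.get_action(q_values)
        # 4. Keep the embedding and chosen action until the episode ends.
        episode_buffer.append((embedding, action))
        return action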

Finalizing an episode

For each step in the episode, the state embedding and the taken action are stored in the buffer B. When the episode is finished, the replay buffer calculates the N-step total return of each transition in the buffer, bootstrapped using the maximum Q value of the N-th transition. Each state embedding is then inserted into the DND of its corresponding action, with the computed total return as its value, and the buffer B is reset.
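As a rough sketch of this bookkeeping (again with assumed names; dnd.insert and the buffer layout are hypothetical), the N-step bootstrapped returns could be computed and written to the DND as follows:

    def finalize_episode(episode_buffer, rewards, bootstrap_q, dnd, n_steps, gamma=0.99):
        # episode_buffer: list of (embedding, action) pairs, one per step.
        # rewards[t]: reward received at step t.
        # bootstrap_q[t]: max_a Q(s_{t+N}, a) for step t, where that state exists.
        episode_length = len(episode_buffer)
        for t, (embedding, action) in enumerate(episode_buffer):
            horizon = min(n_steps, episode_length - t)
            # Discounted sum of up to N consecutive rewards ...
            total_return = sum(gamma ** j * rewards[t + j] for j in range(horizon))
            # ... bootstrapped with the maximum Q value of the N-th transition,
            # if the episode did not end before then.
            if t + n_steps < episode_length:
                total_return += gamma ** n_steps * bootstrap_q[t]
            # Key = state embedding, value = N-step return, stored in the DND
            # of the action that was taken.
            dnd.insert(action, embedding, total_return)
        episode_buffer.clear()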

Training the network

Train the network only when the DND has enough entries for querying.

To train the network, the current states are used as the inputs and the N-step returns are used as the targets. The N-step return takes into account N consecutive steps, and bootstraps the last value from the network if necessary:

y_t = \sum_{j=0}^{N-1} \gamma^j r(s_{t+j}, a_{t+j}) + \gamma^N \max_a Q(s_{t+N}, a)
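A minimal sketch of the resulting training step, with the same assumed object names (has_enough_entries, sample and train_on_batch are hypothetical), might look like this:

    def train_step(online_network, dnd, replay_buffer, batch_size=32):
        # Train only once the DND holds enough entries for a meaningful
        # P-nearest-neighbor query.
        if not dnd.has_enough_entries():
            return None
        # Each sampled transition already carries the N-step return y_t
        # computed when its episode was finalized.
        states, actions, n_step_returns = replay_buffer.sample(batch_size)
        # The states are the network inputs and the N-step returns are the
        # regression targets for Q(s_t, a_t); since the DND lookup is
        # differentiable, the gradients also update the stored keys and values.
        loss = online_network.train_on_batch(states, actions, targets=n_step_returns)
        return loss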