mirror of https://github.com/gryf/coach.git synced 2025-12-17 11:10:20 +01:00
coach/docs/docs/algorithms/value_optimization/mmc.md
Gal Leibovich 1d4c3455e7 coach v0.8.0
2017-10-19 13:10:15 +03:00


Mixed Monte Carlo

Action space: Discrete

Paper

Network Structure

Algorithmic Description

Training the network

In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).

The DDQN targets are calculated in the same manner as in the DDQN agent:

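y_t^{DDQN}=r(s_t,a_t)+\gamma Q(s_{t+1},\arg\max_a Q(s_{t+1},a))

Here the maximizing action is selected with the online network, and its value is then evaluated with the target network.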
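As an illustration of the DDQN target for a single transition, here is a minimal sketch in plain Python. The function name `ddqn_target`, its parameters, and the handling of terminal transitions are assumptions made for this example and are not taken from the Coach code base.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, terminal=False):
    """Double DQN target for one transition (hypothetical helper).

    next_q_online / next_q_target: Q-values for s_{t+1}, one entry per action,
    as produced by the online and target networks respectively.
    """
    if terminal:
        # Assumption: no bootstrapping past the end of an episode.
        return reward
    # Select the action with the online network...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...and evaluate that action with the target network.
    return reward + gamma * next_q_target[best_action]


# The online network prefers action 1; the target network values it at 0.3.
y = ddqn_target(reward=1.0, next_q_online=[0.2, 0.9], next_q_target=[0.5, 0.3], gamma=0.5)
```

Selecting the action with one network and evaluating it with the other is what distinguishes this target from the plain DQN target, which takes the maximum over the target network's own estimates.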

The Monte Carlo targets are calculated by summing the discounted rewards from step t to the end of the episode, where T is the index of the episode's final step:

y_t^{MC}=\sum_{j=0}^{T-t}\gamma^j r(s_{t+j},a_{t+j})
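The Monte Carlo targets for every step of a finished episode can be computed in one backward pass, since each return equals the current reward plus the discounted return of the next step. The sketch below assumes the episode's rewards are available as a plain list; `discounted_returns` is a hypothetical name used only for this example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return the discounted return from each step to the end of the episode.

    rewards: list of per-step rewards for one complete episode.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with reward 1 each and gamma = 0.5.
mc_targets = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```

Because the return depends on all later rewards, these targets can only be computed once the episode has ended.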

A mixing ratio \alpha is then used to get the final targets:

y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}
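The mixing step is a per-sample linear interpolation between the two targets. The following sketch assumes both target lists are aligned sample by sample; the name `mixed_targets` and the default value of `alpha` are illustrative assumptions, not values from the source.

```python
def mixed_targets(ddqn_targets, mc_targets, alpha=0.1):
    """Blend DDQN and Monte Carlo targets with mixing ratio alpha (hypothetical helper)."""
    if len(ddqn_targets) != len(mc_targets):
        raise ValueError("target lists must have the same length")
    # alpha = 0 gives pure DDQN targets, alpha = 1 gives pure Monte Carlo targets.
    return [(1 - alpha) * d + alpha * m for d, m in zip(ddqn_targets, mc_targets)]


# With alpha = 0.25: 0.75 * 2.0 + 0.25 * 4.0 = 2.5
y = mixed_targets([2.0], [4.0], alpha=0.25)  # [2.5]
```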

Finally, the online network is trained with the current states as inputs and the mixed targets y_t as regression targets. Every few thousand training steps, the weights of the online network are copied to the target network.
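The periodic weight copy can be sketched as follows. This assumes, purely for illustration, that network weights are held in a dictionary of lists; the function `maybe_sync_target` and the `sync_every` parameter are hypothetical and do not correspond to Coach's actual API.

```python
import copy


def maybe_sync_target(step, online_weights, target_weights, sync_every=5000):
    """Return the weights the target network should hold after this step.

    Every `sync_every` steps the online weights are copied over; otherwise the
    target weights are returned unchanged.
    """
    if step % sync_every == 0:
        # Deep copy so later updates to the online network do not leak
        # into the target network.
        return copy.deepcopy(online_weights)
    return target_weights


online = {"w": [1.0, 2.0]}
target = {"w": [0.0, 0.0]}
target = maybe_sync_target(step=10, online_weights=online, target_weights=target, sync_every=5)
```

Keeping the target network fixed between copies holds the bootstrapped part of the target stable while the online network is being updated.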