Action space: Discrete
References: Sample Efficient Actor-Critic with Experience Replay
Network Structure
Algorithm Description
Choosing an action - Discrete actions
The policy network is used to predict action probabilities. During training, an action is sampled from the categorical distribution defined by these probabilities. During testing, the action with the highest probability is chosen.
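The following NumPy sketch illustrates these two action-selection modes (an illustrative example, not the library's exact code), assuming `probs` holds the softmax output of the policy head:

```python
import numpy as np

def choose_action(probs: np.ndarray, training: bool) -> int:
    """Select a discrete action from the policy probabilities `probs`."""
    if training:
        # Sample from the categorical distribution defined by `probs`.
        return int(np.random.choice(len(probs), p=probs))
    # Greedy choice at test time: the highest-probability action.
    return int(np.argmax(probs))

# Example: policy output for a 4-action environment.
action = choose_action(np.array([0.1, 0.6, 0.2, 0.1]), training=True)
```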
Training the network
In each iteration, perform one on-policy update with a batch of the last $T_{max}$ transitions, and $n$ (the replay ratio) off-policy updates with batches of $T_{max}$ transitions sampled from the replay buffer.
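As a rough sketch of this schedule (hypothetical `memory` and `update` helpers, not Coach's actual API), a single training iteration could look like:

```python
def train_iteration(memory, update, replay_ratio, t_max):
    """One ACER iteration: one on-policy update, then `replay_ratio`
    off-policy updates from the replay buffer (hypothetical helpers)."""
    on_policy_batch = memory.last(t_max)       # the most recent T_max transitions
    update(on_policy_batch)

    for _ in range(replay_ratio):              # n off-policy updates
        replay_batch = memory.sample(t_max)    # a stored segment of T_max transitions
        update(replay_batch)
```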
Each update performs the following procedure:
Calculate state values:

$$V(s_t) = \mathbb{E}_{a \sim \pi}\left[Q(s_t, a)\right]$$

Calculate Q retrace (a sketch of this backward recursion follows this procedure):

$$Q^{ret}(s_t, a_t) = r_t + \gamma \bar{\rho}_{t+1}\left[Q^{ret}(s_{t+1}, a_{t+1}) - Q(s_{t+1}, a_{t+1})\right] + \gamma V(s_{t+1})$$

where $\bar{\rho}_t = \min\{c, \rho_t\}$ and $\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}$.

Accumulate gradients:
• Policy gradients (with bias correction; see the loss sketch after this procedure):

$$\hat{g}^{policy}_t = \bar{\rho}_t \nabla \log \pi(a_t \mid s_t)\left[Q^{ret}(s_t, a_t) - V(s_t)\right] + \mathbb{E}_{a \sim \pi}\left[\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_{+} \nabla \log \pi(a \mid s_t)\left[Q(s_t, a) - V(s_t)\right]\right]$$

• Q-Head gradients (MSE):

$$\hat{g}^{Q}_t = \left(Q^{ret}(s_t, a_t) - Q(s_t, a_t)\right)\nabla Q(s_t, a_t)$$
(Optional) Trust region update: change the policy loss gradient w.r.t. the network output (see the projection sketch after this procedure):

$$\hat{g}^{trust\text{-}region}_t = \hat{g}^{policy}_t - \max\left\{0, \frac{k^T \hat{g}^{policy}_t - \delta}{\|k\|_2^2}\right\} k$$

where $k = \nabla D_{KL}\left[\pi_{avg} \,\|\, \pi\right]$. The average policy network is an exponential moving average of the network parameters ($\theta_{avg} = \alpha \theta_{avg} + (1 - \alpha)\theta$). The goal of the trust region update is to limit the difference between the updated policy and the average policy, which helps ensure stability.
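The Q-retrace targets from step 2 are typically computed with a backward pass over the sampled trajectory. A minimal NumPy sketch, assuming per-step arrays `rewards`, `q_taken`, `values`, and truncated importance weights `rho_bar` (all hypothetical names):

```python
import numpy as np

def q_retrace(rewards, q_taken, values, rho_bar, gamma, bootstrap_value):
    """Backward recursion for Q^ret over a trajectory of length T.

    rewards[t]      : r_t
    q_taken[t]      : Q(s_t, a_t)
    values[t]       : V(s_t) = E_{a~pi}[Q(s_t, a)]
    rho_bar[t]      : min(c, rho_t), the truncated importance weight
    bootstrap_value : V of the state after the last transition (0 if terminal)
    """
    T = len(rewards)
    targets = np.zeros(T)
    q_ret = bootstrap_value                     # bootstrap the recursion with V(s_T)
    for t in reversed(range(T)):
        q_ret = rewards[t] + gamma * q_ret      # r_t + gamma * (corrected next-step return)
        targets[t] = q_ret                      # Q^ret(s_t, a_t)
        # Prepare the term passed back to step t-1:
        # rho_bar_t * [Q^ret(s_t, a_t) - Q(s_t, a_t)] + V(s_t)
        q_ret = rho_bar[t] * (q_ret - q_taken[t]) + values[t]
    return targets
```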
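In practice the gradients in step 3 are obtained by differentiating per-step losses; for discrete actions the expectation over $a \sim \pi$ becomes a sum over action probabilities. A hedged NumPy sketch of the loss terms for a single transition (hypothetical names; the truncation constant `c` is an assumed default):

```python
import numpy as np

def acer_losses(probs, mu_probs, q_values, q_ret, action, c=10.0):
    """Per-step ACER loss terms for one transition with discrete actions.

    probs    : pi(.|s_t), current policy probabilities
    mu_probs : mu(.|s_t), behaviour-policy probabilities stored in the buffer
    q_values : Q(s_t, .) from the Q head
    q_ret    : Q^ret(s_t, a_t) from the retrace recursion
    action   : a_t, the action actually taken
    """
    v = np.dot(probs, q_values)                  # V(s_t) = E_{a~pi}[Q(s_t, a)]
    rho = probs / (mu_probs + 1e-8)              # importance weights rho_t(a)
    rho_bar = min(c, rho[action])                # truncated weight for a_t

    # Truncated policy-gradient term plus the bias-correction term.
    policy_loss = -rho_bar * np.log(probs[action] + 1e-8) * (q_ret - v)
    correction = np.maximum(0.0, (rho - c) / (rho + 1e-8))
    policy_loss -= np.sum(correction * probs * np.log(probs + 1e-8) * (q_values - v))

    # Q-head loss: MSE between the retrace target and Q(s_t, a_t).
    q_loss = 0.5 * (q_ret - q_values[action]) ** 2
    return policy_loss, q_loss
```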
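The trust region step is a projection of the policy gradient: if its component along $k$ exceeds $\delta$, the excess is removed. A minimal sketch of that projection on flattened gradient vectors (hypothetical inputs):

```python
import numpy as np

def trust_region_adjust(g_policy, k, delta):
    """Subtract from g_policy its excess component along k.

    g_policy : flattened policy-loss gradient w.r.t. the network output
    k        : flattened gradient of KL(pi_avg || pi) w.r.t. the same output
    delta    : trust-region threshold
    """
    scale = max(0.0, (np.dot(k, g_policy) - delta) / (np.dot(k, k) + 1e-8))
    return g_policy - scale * k
```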