Action space: Discrete
References: Sample Efficient Actor-Critic with Experience Replay
Network Structure
Algorithm Description
Choosing an action - Discrete actions
The policy network is used to predict action probabilities. While training, an action is sampled from a categorical distribution defined by these probabilities. When testing, the action with the highest probability is chosen.
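As a minimal illustration (not Coach's actual implementation), the following NumPy sketch shows the two action-selection modes; `action_probs` stands in for the softmax output of the policy head:

```python
import numpy as np

def choose_action(action_probs: np.ndarray, training: bool) -> int:
    """action_probs is assumed to be the softmax output of the policy head."""
    if training:
        # Sample from the categorical distribution defined by the policy.
        return int(np.random.default_rng().choice(len(action_probs), p=action_probs))
    # At evaluation time, act greedily on the predicted probabilities.
    return int(np.argmax(action_probs))

probs = np.array([0.2, 0.5, 0.3])
choose_action(probs, training=True)    # stochastic sample: 0, 1 or 2
choose_action(probs, training=False)   # always 1 (highest probability)
```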
Training the network
Each iteration performs one on-policy update with a batch of the last $T_{max}$ transitions, and $n$ (the replay ratio) off-policy updates with batches of $T_{max}$ transitions sampled from the replay buffer.
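A rough sketch of this update schedule is shown below; `update` is only a placeholder for the per-batch procedure described next, and the trajectory-level buffer layout is illustrative rather than Coach's actual data structure:

```python
import random
from collections import deque

# A toy trajectory-level replay buffer; in practice Coach manages this internally.
replay_buffer: deque = deque(maxlen=5_000)

def update(trajectory) -> None:
    pass  # placeholder for the per-batch procedure described below

def train_iteration(latest_trajectory, replay_ratio: int = 4) -> None:
    replay_buffer.append(latest_trajectory)

    # One on-policy update on the batch of the last T_max transitions.
    update(latest_trajectory)

    # n = replay_ratio off-policy updates on trajectories sampled from the buffer.
    for _ in range(replay_ratio):
        update(random.choice(list(replay_buffer)))
```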
Each update performs the following procedure (short code sketches of some of the steps follow the list):
1. Calculate state values:

   $V(s_t) = \mathbb{E}_{a \sim \pi}[Q(s_t, a)]$

2. Calculate Q retrace:

   $Q^{ret}(s_t, a_t) = r_t + \gamma \bar{\rho}_{t+1} \left[ Q^{ret}(s_{t+1}, a_{t+1}) - Q(s_{t+1}, a_{t+1}) \right] + \gamma V(s_{t+1})$

   where $\bar{\rho}_t = \min\{c, \rho_t\}$ and $\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}$

3. Accumulate gradients:

   • Policy gradients (with bias correction):

   $\hat{g}^{policy}_t = \bar{\rho}_t \nabla \log \pi(a_t \mid s_t) \left[ Q^{ret}(s_t, a_t) - V(s_t) \right] + \mathbb{E}_{a \sim \pi} \left[ \left[ \frac{\rho_t(a) - c}{\rho_t(a)} \right]_+ \nabla \log \pi(a \mid s_t) \left[ Q(s_t, a) - V(s_t) \right] \right]$

   • Q-Head gradients (MSE):

   $\hat{g}^{Q}_t = \left( Q^{ret}(s_t, a_t) - Q(s_t, a_t) \right) \nabla Q(s_t, a_t)$

4. (Optional) Trust region update: change the policy loss gradient w.r.t. the network output:
   $\hat{g}^{\text{trust-region}}_t = \hat{g}^{policy}_t - \max\left\{ 0, \frac{k^T \hat{g}^{policy}_t - \delta}{\|k\|_2^2} \right\} k$

   where $k = \nabla D_{KL}\left[ \pi_{avg} \,\|\, \pi \right]$

   The average policy network is an exponential moving average of the network parameters ($\theta_{avg} = \alpha \theta_{avg} + (1 - \alpha) \theta$). The goal of the trust region update is to limit the difference between the updated policy and the average policy, in order to ensure stability.
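The following NumPy sketch illustrates steps 1-2 (state values and Q-retrace targets) for a single trajectory that is assumed to end in a terminal state; all names and the default truncation threshold are illustrative, not Coach's API:

```python
import numpy as np

def q_retrace(rewards, q_values, pi_probs, mu_probs, actions, gamma=0.99, c=1.0):
    """rewards: (T,); q_values, pi_probs: (T, num_actions);
    mu_probs: (T,) behaviour-policy probabilities of the taken actions;
    actions: (T,) integer action indices. c is an illustrative truncation threshold."""
    T = len(rewards)
    v = (pi_probs * q_values).sum(axis=1)                  # V(s_t) = E_{a~pi}[Q(s_t, a)]
    rho = pi_probs[np.arange(T), actions] / mu_probs       # importance weights rho_t
    rho_bar = np.minimum(c, rho)                           # truncated weights
    q_taken = q_values[np.arange(T), actions]

    q_ret = np.zeros(T)
    next_ret = next_q = next_v = next_rho = 0.0            # trajectory assumed terminal
    for t in reversed(range(T)):
        # Q_ret(s_t, a_t) = r_t + gamma * rho_bar_{t+1} * [Q_ret(s_{t+1}, a_{t+1})
        #                   - Q(s_{t+1}, a_{t+1})] + gamma * V(s_{t+1})
        q_ret[t] = rewards[t] + gamma * next_rho * (next_ret - next_q) + gamma * next_v
        next_ret, next_q, next_v, next_rho = q_ret[t], q_taken[t], v[t], rho_bar[t]
    return v, q_ret

# Toy usage: 3 steps, 2 actions, uniform behaviour policy.
v, q_ret = q_retrace(rewards=np.array([0.0, 0.0, 1.0]),
                     q_values=np.zeros((3, 2)),
                     pi_probs=np.full((3, 2), 0.5),
                     mu_probs=np.full(3, 0.5),
                     actions=np.array([0, 1, 0]))
```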
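Step 4 amounts to scaling the policy gradient back along $k$ whenever it would move the policy too far from the average policy. A minimal sketch, assuming a hypothetical default for $\delta$, is:

```python
import numpy as np

def trust_region_gradient(g_policy: np.ndarray, k: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Scale back g_policy along k = grad D_KL[pi_avg || pi] whenever the
    constraint k^T g <= delta is violated (the delta value here is illustrative)."""
    scale = max(0.0, (float(k @ g_policy) - delta) / (float(k @ k) + 1e-10))
    return g_policy - scale * k
```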