TD3 (#338)
@@ -391,6 +391,16 @@ $(document).ready(function() {
                         and therefore it is able to use a replay buffer in order to improve sample efficiency.
                     </span>
                 </div>
+                <div class="algorithm continuous off-policy" data-year="201802">
+                    <span class="badge">
+                        <a href="components/agents/policy_optimization/td3.html">TD3</a>
+                        <br>
+                        Very similar to DDPG, i.e. an actor-critic algorithm for continuous action spaces that uses a replay
+                        buffer to improve sample efficiency. TD3 uses two critic networks to mitigate the overestimation
+                        in the Q state-action value prediction, delays the actor updates to increase stability, and
+                        adds noise to the target actions while training the critic to smooth out the critic's predictions.
+                    </span>
+                </div>
                 <div class="algorithm continuous discrete on-policy" data-year="201706">
                     <span class="badge">
                         <a href="components/agents/policy_optimization/ppo.html">PPO</a>
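The description added in this commit names the three ways TD3 departs from DDPG: two critics whose minimum tames Q-value overestimation, actor updates that run less often than critic updates, and noise on the target action when forming the critic's target. The sketch below shows how those pieces fit into a single update step. It is illustrative only: it uses PyTorch rather than Coach's own framework backends, tiny made-up networks, and placeholder hyperparameters (policy_delay, noise_std, noise_clip); the names mlp and td3_update are hypothetical and do not come from the Coach code linked above.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Small illustrative network; Coach configures its own architectures.
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005

actor, actor_targ = mlp(obs_dim, act_dim), mlp(obs_dim, act_dim)
# Two critics (plus target copies), used to mitigate Q-value overestimation.
critics = [mlp(obs_dim + act_dim, 1) for _ in range(2)]
critics_targ = [mlp(obs_dim + act_dim, 1) for _ in range(2)]
for targ, net in zip([actor_targ] + critics_targ, [actor] + critics):
    targ.load_state_dict(net.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam([p for c in critics for p in c.parameters()], lr=1e-3)

def td3_update(batch, step, policy_delay=2, noise_std=0.2, noise_clip=0.5):
    obs, act, rew, next_obs, done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped noise on the target action makes the
        # critic target an estimate over nearby actions rather than a single point.
        noise = (torch.randn_like(act) * noise_std).clamp(-noise_clip, noise_clip)
        next_act = (torch.tanh(actor_targ(next_obs)) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q learning: take the minimum of the two target critics.
        q1 = critics_targ[0](torch.cat([next_obs, next_act], dim=-1))
        q2 = critics_targ[1](torch.cat([next_obs, next_act], dim=-1))
        target = rew + gamma * (1.0 - done) * torch.min(q1, q2)

    # Both critics regress toward the same pessimistic target.
    critic_loss = sum(((c(torch.cat([obs, act], dim=-1)) - target) ** 2).mean()
                      for c in critics)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy updates: the actor and the target networks are updated less
    # frequently than the critics, which increases stability.
    if step % policy_delay == 0:
        actor_loss = -critics[0](
            torch.cat([obs, torch.tanh(actor(obs))], dim=-1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        with torch.no_grad():
            for targ, net in zip([actor_targ] + critics_targ, [actor] + critics):
                for p_t, p in zip(targ.parameters(), net.parameters()):
                    p_t.mul_(1.0 - tau).add_(tau * p)

# Example call with a random mini-batch of 32 transitions (shapes only).
batch = (torch.randn(32, obs_dim), torch.rand(32, act_dim) * 2 - 1,
         torch.randn(32, 1), torch.randn(32, obs_dim), torch.zeros(32, 1))
td3_update(batch, step=0)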