
update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
This commit is contained in:
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions


@@ -0,0 +1,40 @@
Actor-Critic
============
**Actions space:** Discrete | Continuous
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ac.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
The policy network is used to predict action probabilities. While training, an action is sampled from a categorical
distribution parameterized by these probabilities. When testing, the action with the highest probability is used.
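
As a concrete illustration, here is a minimal sketch of this action-selection rule, assuming a hypothetical
``policy_probs`` array returned by the policy network (this is not Coach's internal code):

.. code-block:: python

    import numpy as np

    def choose_action(policy_probs: np.ndarray, is_training: bool) -> int:
        """Sample from the categorical distribution while training; act greedily when testing."""
        if is_training:
            return int(np.random.choice(len(policy_probs), p=policy_probs))
        return int(np.argmax(policy_probs))

    # Example: probabilities predicted for an environment with 3 discrete actions
    action = choose_action(np.array([0.2, 0.5, 0.3]), is_training=True)
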
Training the network
++++++++++++++++++++
A batch of :math:`T_{max}` transitions is used, and the advantages are calculated over it.
Advantages can be calculated by either of the following methods (configured by the selected preset):
1. **A_VALUE** - Estimating advantage directly:
:math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
2. **GAE** - By following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.
The advantages are then used to accumulate gradients according to the loss
:math:`L = -\mathop{\mathbb{E}} [\log (\pi) \cdot A]`
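
The following is a small numpy sketch of the **A_VALUE** estimator and the resulting policy loss for a single
rollout of :math:`T_{max}` transitions. The ``rewards`` and ``values`` arrays (the latter including the bootstrap
value :math:`V(s_{T_{max}})` as its last entry) are assumptions for illustration, not part of Coach's API:

.. code-block:: python

    import numpy as np

    def n_step_advantages(rewards, values, gamma=0.99):
        """A_VALUE estimate: n-step discounted return bootstrapped with V(s_{T_max}), minus V(s_t)."""
        T = len(rewards)                                   # T_max
        advantages = np.zeros(T)
        for t in range(T):
            k = T - t                                      # k = T_max - state index
            discounted = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
            q = discounted + gamma ** k * values[T]        # Q(s_t, a_t), bootstrapped with V(s_{t+k})
            advantages[t] = q - values[t]
        return advantages

    def policy_loss(log_probs, advantages):
        # L = -E[log(pi) * A]; the advantages are treated as constants (no gradient flows through them)
        return -np.mean(log_probs * advantages)
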
.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters


@@ -0,0 +1,44 @@
Clipped Proximal Policy Optimization
====================================
**Actions space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
++++++++++++++++++++++++++++++++++++++++
Same as in PPO.
Training the network
++++++++++++++++++++
Very similar to PPO, with several small (but greatly simplifying) changes:
1. Train both the value and policy networks simultaneously, by defining a single loss function,
which is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.
2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network, as in PPO).
3. Value targets are now also calculated based on the GAE advantages.
In this method, the :math:`V` values predicted by the critic network are added to the GAE-based advantages,
yielding a :math:`Q` value for each action. Since the critic network predicts a :math:`V` value for each state,
setting these :math:`Q` action-values as targets serves, on average, as a :math:`V` state-value target.
4. Instead of adapting the KL divergence penalty coefficient as in PPO, the likelihood ratio
:math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped to achieve a similar effect.
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon-clipped
surrogate loss:
:math:`L^{CLIP}(\theta)=\mathbb{E}_{t}[\min(r_t(\theta)\cdot \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`
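
A short numpy sketch of the clipped surrogate objective above; the per-transition log-probabilities under the new
and old policies and the GAE advantages are assumed inputs, and this is not the loss code used inside Coach's
``ClippedPPOAgent``:

.. code-block:: python

    import numpy as np

    def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        ratio = np.exp(new_log_probs - old_log_probs)                    # r_t(theta)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        # Maximizing the clipped objective is the same as minimizing its negation
        return -np.mean(np.minimum(unclipped, clipped))
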
.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters


@@ -0,0 +1,50 @@
Deep Deterministic Policy Gradient
==================================
**Actions space:** Continuous
**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
During the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
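
Below is a standalone numpy sketch of an Ornstein-Uhlenbeck exploration process; the parameter names and default
values follow the common convention for this process and are not taken from Coach's exploration API:

.. code-block:: python

    import numpy as np

    class OUNoise:
        """Temporally correlated exploration noise added to the actor's action mean."""
        def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
            self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
            self.state = np.ones(action_dim) * mu

        def sample(self):
            # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
            dx = (self.theta * (self.mu - self.state) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.state)))
            self.state = self.state + dx
            return self.state

    # During training: action = action_mean + noise.sample(); when testing: action = action_mean
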
Training the network
++++++++++++++++++++
Start by sampling a batch of transitions from the experience replay.
* To train the **critic network**, use the following targets:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`
First run the actor target network, using the next states as the inputs, and get :math:`\mu (s_{t+1} )`.
Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
as the inputs, and :math:`y_t` as the targets.
* To train the **actor network**, use the following equation:
:math:`\nabla_{\theta^\mu } J \approx \mathbb{E}_{s_t \sim \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t)} \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t}]`
Use the actor's online network to get the action mean values using the current states as the inputs.
Then, use the critic online network in order to get the gradients of the critic output with respect to the
action mean values :math:`\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
Using the chain rule, calculate the gradients of the actor's output, with respect to the actor weights,
given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.
After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
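
As a rough sketch of the two updates described above, the snippet below computes the critic targets :math:`y_t`
and performs a soft (Polyak) update of the target-network weights using plain numpy. The ``target_actor`` /
``target_critic`` callables and the weight lists are hypothetical stand-ins, not Coach's ``DDPGAgent`` internals:

.. code-block:: python

    import numpy as np

    def critic_targets(rewards, next_states, target_actor, target_critic, gamma=0.99):
        # y_t = r(s_t, a_t) + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))
        next_actions = target_actor(next_states)
        return rewards + gamma * target_critic(next_states, next_actions)

    def soft_update(target_weights, online_weights, tau=0.001):
        # theta_target <- tau * theta_online + (1 - tau) * theta_target
        return [tau * w_online + (1.0 - tau) * w_target
                for w_online, w_target in zip(online_weights, target_weights)]
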
.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters


@@ -0,0 +1,24 @@
Hierarchical Actor Critic
=========================
**Actions space:** Continuous
**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
During the training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++


@@ -0,0 +1,39 @@
Policy Gradient
===============
**Actions space:** Discrete | Continuous
**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
Run the current states through the network and get a policy distribution over the actions.
While training, sample from the policy distribution. When testing, take the action with the highest probability.
Training the network
++++++++++++++++++++
The policy head loss is defined as :math:`L=-\log (\pi) \cdot PolicyGradientRescaler`.
The :code:`PolicyGradientRescaler` is used to reduce the variance of the policy gradient updates,
since noisy gradient updates might destabilize the policy's convergence.
The rescaler is a configurable parameter, and there are a few options to choose from:
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - Return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
which are calculated separately for each timestep, across different episodes.
Gradients are accumulated over a number of fully played episodes. Accumulating the gradients over several episodes
serves the same purpose of reducing the update variance. After accumulating gradients for several episodes,
the gradients are applied to the network.
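
For illustration, here is a numpy sketch of the **Future Return** rescaler (the discounted return from each
transition until the end of the episode) and the resulting loss :math:`L=-\log(\pi) \cdot PolicyGradientRescaler`;
it mirrors the description above rather than Coach's ``PolicyGradientRescaler`` implementation:

.. code-block:: python

    import numpy as np

    def future_returns(rewards, gamma=0.99):
        """Discounted return from each transition until the end of the episode."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def pg_loss(log_probs, rescaler):
        # L = -log(pi) * rescaler, averaged over the accumulated transitions
        return -np.mean(log_probs * rescaler)
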
.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters


@@ -0,0 +1,45 @@
Proximal Policy Optimization
============================
**Actions space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation.
While in the training phase, sample from a multidimensional Gaussian distribution with these mean and standard deviation values.
When testing, take the mean values predicted by the network.
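
A minimal sketch of this action-selection rule, assuming hypothetical ``mean`` and ``std`` vectors predicted by the
policy network head (not a specific Coach API):

.. code-block:: python

    import numpy as np

    def choose_continuous_action(mean: np.ndarray, std: np.ndarray, is_training: bool) -> np.ndarray:
        if is_training:
            # Sample each action dimension from an independent Gaussian
            return np.random.normal(mean, std)
        # When testing, act deterministically using the predicted mean
        return mean
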
Training the network
++++++++++++++++++++
1. Collect a large chunk of experience (on the order of thousands of transitions, sampled from multiple episodes).
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman et al., 2015).
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first-order optimizers,
the L-BFGS optimizer runs on the entire dataset at once, without batching.
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
discounted returns of each state in each episode.
4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
starting to run the current set of training iterations) using a regularization term.
5. After training is done, the last sampled KL divergence value is compared with the *target KL divergence* value,
in order to adapt the penalty coefficient used in the policy loss: if the KL divergence went too high,
increase the penalty; if it went too low, reduce it; otherwise, leave it unchanged.
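
As a minimal sketch of step 5, the function below adapts the KL penalty coefficient after a set of policy training
iterations. The tolerance factor of 1.5 around the target and the scaling factor of 2 are common choices from the
PPO paper, used here for illustration rather than taken from Coach's ``PPOAgent``:

.. code-block:: python

    def adapt_kl_penalty(kl_divergence, target_kl, penalty_coefficient,
                         tolerance=1.5, scale=2.0):
        if kl_divergence > target_kl * tolerance:      # KL went too high: penalize more
            penalty_coefficient *= scale
        elif kl_divergence < target_kl / tolerance:    # KL went too low: penalize less
            penalty_coefficient /= scale
        return penalty_coefficient                     # otherwise left unchanged
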
.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters