
update of api docstrings across coach and tutorials [WIP] (#91)

* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
This commit is contained in:
Itai Caspi
2018-11-15 15:00:13 +02:00
committed by Gal Novik
parent 524f8436a2
commit 6d40ad1650
517 changed files with 71034 additions and 12834 deletions

View File

@@ -0,0 +1,18 @@
Additional Parameters
=====================
VisualizationParameters
-----------------------
.. autoclass:: rl_coach.base_parameters.VisualizationParameters
PresetValidationParameters
--------------------------
.. autoclass:: rl_coach.base_parameters.PresetValidationParameters
TaskParameters
--------------
.. autoclass:: rl_coach.base_parameters.TaskParameters
DistributedTaskParameters
-------------------------
.. autoclass:: rl_coach.base_parameters.DistributedTaskParameters

View File

@@ -0,0 +1,29 @@
Behavioral Cloning
==================
**Action space:** Discrete | Continuous
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as (state, action) tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.
1. Sample a batch of transitions from the replay buffer.
2. Use the current states as input to the network, and the expert actions as the targets of the network.
3. For the network head, we use the policy head, which uses the cross-entropy loss function (a minimal sketch of this update follows).
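This update can be sketched as follows (illustrative NumPy code, not Coach's implementation; the array names ``policy_logits`` and ``expert_actions`` are hypothetical):
.. code-block:: python

    import numpy as np

    def bc_loss(policy_logits, expert_actions):
        """Cross entropy between the policy head's action distribution and the expert actions.

        policy_logits:  (batch, num_actions) raw outputs of the policy head
        expert_actions: (batch,) integer action indices taken by the expert
        """
        # softmax over the action logits
        shifted = policy_logits - policy_logits.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        # negative log-likelihood of the expert actions (no rewards are involved)
        batch_idx = np.arange(len(expert_actions))
        return -np.log(probs[batch_idx, expert_actions] + 1e-8).mean()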
.. autoclass:: rl_coach.agents.bc_agent.BCAlgorithmParameters

View File

@@ -0,0 +1,36 @@
Conditional Imitation Learning
==============================
**Action space:** Discrete | Continuous
**References:** `End-to-end Driving via Conditional Imitation Learning <https://arxiv.org/abs/1710.02410>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/cil.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The replay buffer contains the expert demonstrations for the task.
These demonstrations are given as (state, action) tuples, with no reward.
The training goal is to reduce the difference between the actions predicted by the network and the actions taken by
the expert for each state.
In conditional imitation learning, each transition is assigned a class, which determines the goal that was pursued
in that transition. For example, three possible classes could be: turn right, turn left, and follow lane.
1. Sample a batch of transitions from the replay buffer, where the batch is balanced, meaning that an equal number
of transitions will be sampled from each class index.
2. Use the current states as input to the network, and assign the expert actions as the targets of the network heads
corresponding to the state classes. For the other heads, set the targets to match the currently predicted values,
so that the loss for the other heads will be zeroed out.
3. We use a regression head that minimizes the MSE loss between the network's predicted values and the target values (see the sketch below).
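As a rough illustration (not Coach's implementation), the masked-target construction can be sketched as follows, where ``current_predictions`` and ``head_indices`` are hypothetical array names:
.. code-block:: python

    import numpy as np

    def cil_regression_targets(current_predictions, expert_actions, head_indices):
        """Build targets so that only the head matching each transition's class gets a non-zero MSE loss.

        current_predictions: (batch, num_heads, action_dim) current network outputs
        expert_actions:      (batch, action_dim) actions taken by the expert
        head_indices:        (batch,) class of each transition (e.g. left / right / follow lane)
        """
        targets = current_predictions.copy()  # unchanged heads -> zero MSE loss
        targets[np.arange(len(head_indices)), head_indices] = expert_actions
        return targets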
.. autoclass:: rl_coach.agents.cil_agent.CILAlgorithmParameters

View File

@@ -0,0 +1,43 @@
Agents
======
Coach supports many state-of-the-art reinforcement learning algorithms, which are separated into three main classes -
value optimization, policy optimization, and imitation learning.
A detailed description of each algorithm can be found on its respective page.
.. image:: /_static/img/algorithms.png
:width: 600px
:align: center
.. toctree::
:maxdepth: 1
:caption: Agents
policy_optimization/ac
imitation/bc
value_optimization/bs_dqn
value_optimization/categorical_dqn
imitation/cil
policy_optimization/cppo
policy_optimization/ddpg
other/dfp
value_optimization/double_dqn
value_optimization/dqn
value_optimization/dueling_dqn
value_optimization/mmc
value_optimization/n_step
value_optimization/naf
value_optimization/nec
value_optimization/pal
policy_optimization/pg
policy_optimization/ppo
value_optimization/rainbow
value_optimization/qr_dqn
.. autoclass:: rl_coach.base_parameters.AgentParameters
.. autoclass:: rl_coach.agents.agent.Agent
:members:
:inherited-members:

View File

@@ -0,0 +1,39 @@
Direct Future Prediction
========================
**Action space:** Discrete
**References:** `Learning to Act by Predicting the Future <https://arxiv.org/abs/1611.01779>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dfp.png
:width: 600px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
1. The current states (observations and measurements) and the corresponding goal vector are passed as inputs to the network.
The output of the network is the predicted future measurements for time-steps :math:`t+1,t+2,t+4,t+8,t+16` and
:math:`t+32` for each possible action.
2. For each action, the measurements of each predicted time-step are multiplied by the goal vector,
and the result is a single vector of future values for each action.
3. Then, a weighted sum of the future values of each action is calculated, and the result is a single value for each action.
4. The action values are passed to the exploration policy to decide on the action to use.
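The following sketch (illustrative names, not Coach's API) shows how the action values are formed from the predicted measurements, the goal vector, and per-time-step weights:
.. code-block:: python

    import numpy as np

    def dfp_action_values(predicted_measurements, goal_vector, timestep_weights):
        """predicted_measurements: (num_actions, num_timesteps, num_measurements)
           goal_vector:            (num_measurements,) importance of each measurement
           timestep_weights:       (num_timesteps,)    weighting of the predicted horizons
        """
        # value of each predicted measurement vector under the current goal
        future_values = predicted_measurements @ goal_vector   # (num_actions, num_timesteps)
        # weighted sum over time-steps -> a single value per action
        return future_values @ timestep_weights                # (num_actions,)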
Training the network
++++++++++++++++++++
Given a batch of transitions, run them through the network to get the current predictions of the future measurements
per action, and set them as the initial targets for training the network. For each transition
:math:`(s_t,a_t,r_t,s_{t+1})` in the batch, the target of the network for the action that was taken is the actual
measurements that were seen in time-steps :math:`t+1,t+2,t+4,t+8,t+16` and :math:`t+32`.
For the actions that were not taken, the targets are the current values.
.. autoclass:: rl_coach.agents.dfp_agent.DFPAlgorithmParameters

View File

@@ -0,0 +1,40 @@
Actor-Critic
============
**Action space:** Discrete | Continuous
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ac.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
The policy network is used to predict action probabilities. While training, a sample is taken from a categorical
distribution parameterized by these probabilities. When testing, the action with the highest probability is used.
Training the network
++++++++++++++++++++
A batch of :math:`T_{max}` transitions is used, and the advantages are calculated over it.
Advantages can be calculated by either of the following methods (configured by the selected preset):
1. **A_VALUE** - Estimating advantage directly:
:math:`A(s_t, a_t) = \underbrace{\sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})}_{Q(s_t, a_t)} - V(s_t)`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
2. **GAE** - By following the `Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_ paper.
The advantages are then used in order to accumulate gradients according to
:math:`L = -\mathop{\mathbb{E}} [log (\pi) \cdot A]`
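For illustration, the **A_VALUE** advantage estimate can be sketched as follows (illustrative code, assuming the critic values and a bootstrap value for the last state are already available):
.. code-block:: python

    import numpy as np

    def a_value_advantages(rewards, values, bootstrap_value, gamma):
        """rewards, values: (T_max,) arrays; bootstrap_value approximates V(s_{T_max})."""
        T = len(rewards)
        advantages = np.zeros(T)
        for t in range(T):
            k = T - t  # k = T_max - state index
            discounted_rewards = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
            q_estimate = discounted_rewards + gamma ** k * bootstrap_value
            advantages[t] = q_estimate - values[t]  # A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
        return advantages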
.. autoclass:: rl_coach.agents.actor_critic_agent.ActorCriticAlgorithmParameters

View File

@@ -0,0 +1,44 @@
Clipped Proximal Policy Optimization
====================================
**Action space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Same as in PPO.
Training the network
++++++++++++++++++++
Very similar to PPO, with several small (but very simplifying) changes:
1. Train both the value and policy networks simultaneously, by defining a single loss function,
which is the sum of the two networks' loss functions. Then, backpropagate gradients only once, from this unified loss function.
2. The unified network's optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).
3. Value targets are now also calculated based on the GAE advantages.
In this method, the :math:`V` values are predicted by the critic network and then added to the GAE-based advantages,
in order to get a :math:`Q` value for each action. Since the critic network predicts a :math:`V` value for
each state, setting the calculated :math:`Q` action-values as targets will, on average, serve as a :math:`V` state-value target.
4. Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio
:math:`r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}` is clipped, to achieve a similar effect.
This is done by defining the policy's loss function to be the minimum between the standard surrogate loss and an epsilon
clipped surrogate loss:
:math:`L^{CLIP}(\theta)=E_{t}[min(r_t(\theta)\cdot \hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t)]`
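A minimal sketch of this clipped surrogate loss (illustrative, assuming the log-probabilities under the new and old policies and the GAE advantages are already computed):
.. code-block:: python

    import numpy as np

    def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        ratio = np.exp(new_log_probs - old_log_probs)                     # r_t(theta)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        return -np.minimum(unclipped, clipped).mean()                     # negated, since we minimize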
.. autoclass:: rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters

View File

@@ -0,0 +1,50 @@
Deep Deterministic Policy Gradient
==================================
**Action space:** Continuous
**References:** `Continuous control with deep reinforcement learning <https://arxiv.org/abs/1509.02971>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++
Start by sampling a batch of transitions from the experience replay.
* To train the **critic network**, use the following targets:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},\mu(s_{t+1} ))`
First run the actor target network, using the next states as the inputs, and get :math:`\mu (s_{t+1} )`.
Next, run the critic target network using the next states and :math:`\mu (s_{t+1} )`, and use the output to
calculate :math:`y_t` according to the equation above. To train the network, use the current states and actions
as the inputs, and :math:`y_t` as the targets.
* To train the **actor network**, use the following equation:
:math:`\nabla_{\theta^\mu } J \approx E_{s_t \sim \rho^\beta } [\nabla_a Q(s,a)|_{s=s_t,a=\mu (s_t ) } \cdot \nabla_{\theta^\mu} \mu(s)|_{s=s_t} ]`
Use the actor's online network to get the action mean values using the current states as the inputs.
Then, use the critic online network in order to get the gradients of the critic output with respect to the
action mean values :math:`\nabla _a Q(s,a)|_{s=s_t,a=\mu(s_t ) }`.
Using the chain rule, calculate the gradients of the actor's output, with respect to the actor weights,
given :math:`\nabla_a Q(s,a)`. Finally, apply those gradients to the actor network.
After every training step, do a soft update of the critic and actor target networks' weights from the online networks.
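A rough sketch of the critic targets and the soft target update (``actor_target`` and ``critic_target`` are assumed callables returning the target networks' outputs; this is not Coach's implementation):
.. code-block:: python

    def ddpg_critic_targets(rewards, next_states, actor_target, critic_target, gamma):
        """y_t = r(s_t, a_t) + gamma * Q_target(s_{t+1}, mu_target(s_{t+1}))"""
        next_actions = actor_target(next_states)
        return rewards + gamma * critic_target(next_states, next_actions)

    def soft_update(target_weights, online_weights, tau=0.001):
        """Slowly track the online network: theta_target <- tau * theta + (1 - tau) * theta_target"""
        return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_weights, target_weights)]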
.. autoclass:: rl_coach.agents.ddpg_agent.DDPGAlgorithmParameters

View File

@@ -0,0 +1,24 @@
Hierarchical Actor Critic
=========================
**Action space:** Continuous
**References:** `Hierarchical Reinforcement Learning with Hindsight <https://arxiv.org/abs/1805.08180>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ddpg.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
Pass the current states through the actor network, and get an action mean vector :math:`\mu`.
While in training phase, use a continuous exploration policy, such as the Ornstein-Uhlenbeck process,
to add exploration noise to the action. When testing, use the mean vector :math:`\mu` as-is.
Training the network
++++++++++++++++++++

View File

@@ -0,0 +1,39 @@
Policy Gradient
===============
**Action space:** Discrete | Continuous
**References:** `Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning <http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/pg.png
:align: center
Algorithm Description
---------------------
Choosing an action - Discrete actions
+++++++++++++++++++++++++++++++++++++
Run the current states through the network and get a policy distribution over the actions.
While training, sample from the policy distribution. When testing, take the action with the highest probability.
Training the network
++++++++++++++++++++
The policy head loss is defined as :math:`L=-\log (\pi) \cdot PolicyGradientRescaler`.
The :code:`PolicyGradientRescaler` is used to reduce the variance of the policy gradient, which can be very noisy;
noisy gradient updates might destabilize the policy's convergence.
The rescaler is a configurable parameter, and there are several options to choose from:
* **Total Episode Return** - The sum of all the discounted rewards during the episode.
* **Future Return** - Return from each transition until the end of the episode.
* **Future Return Normalized by Episode** - Future returns across the episode normalized by the episode's mean and standard deviation.
* **Future Return Normalized by Timestep** - Future returns normalized using running means and standard deviations,
which are calculated separately for each timestep, across different episodes.
Gradients are accumulated over a number of fully played episodes. Accumulating gradients over several episodes
serves the same purpose - reducing the update variance. After accumulating gradients for several episodes,
the gradients are applied to the network.
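As an illustration, two of the rescalers can be sketched as follows (not Coach's implementation):
.. code-block:: python

    import numpy as np

    def future_returns(rewards, gamma):
        """'Future Return' rescaler: discounted return from each transition until the episode ends.
        The 'Total Episode Return' rescaler is simply returns[0] repeated for every transition."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def normalize_by_episode(returns):
        """'Future Return Normalized by Episode' rescaler."""
        return (returns - returns.mean()) / (returns.std() + 1e-8)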
.. autoclass:: rl_coach.agents.policy_gradients_agent.PolicyGradientAlgorithmParameters

View File

@@ -0,0 +1,45 @@
Proximal Policy Optimization
============================
**Action space:** Discrete | Continuous
**References:** `Proximal Policy Optimization Algorithms <https://arxiv.org/pdf/1707.06347.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/ppo.png
:align: center
Algorithm Description
---------------------
Choosing an action - Continuous actions
+++++++++++++++++++++++++++++++++++++++
Run the observation through the policy network, and get the mean and standard deviation vectors for this observation.
While in training phase, sample from a multi-dimensional Gaussian distribution with these mean and standard deviation values.
When testing, just take the mean values predicted by the network.
Training the network
++++++++++++++++++++
1. Collect a big chunk of experience (in the order of thousands of transitions, sampled from multiple episodes).
2. Calculate the advantages for each transition, using the *Generalized Advantage Estimation* method (Schulman et al., 2015).
3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers,
the L-BFGS optimizer runs on the entire dataset at once, without batching.
It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset,
the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total
discounted returns of each state in each episode.
4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as
targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used *before*
starting to run the current set of training iterations) using a regularization term.
5. After training is done, the last sampled KL divergence value is compared with the *target KL divergence* value,
in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high, increase the
penalty; if it went too low, reduce it; otherwise, leave it unchanged (a sketch of this adaptation rule follows).
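The sketch below shows this rule with illustrative scaling constants (not the exact values used by Coach):
.. code-block:: python

    def adapt_kl_penalty(penalty, measured_kl, target_kl, factor=1.5, tolerance=2.0):
        """Increase the penalty when the policy moved too far, reduce it when it barely moved."""
        if measured_kl > tolerance * target_kl:
            penalty *= factor
        elif measured_kl < target_kl / tolerance:
            penalty /= factor
        return penalty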
.. autoclass:: rl_coach.agents.ppo_agent.PPOAlgorithmParameters

View File

@@ -0,0 +1,43 @@
Bootstrapped DQN
================
**Action space:** Discrete
**References:** `Deep Exploration via Bootstrapped DQN <https://arxiv.org/abs/1602.04621>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/bs_dqn.png
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current states are used as the input to the network. The network contains several :math:`Q` heads, which return
different estimates of the action :math:`Q` values. For each episode, the bootstrapped exploration policy
selects a single head to play with during the episode. According to the selected head, only the relevant
output :math:`Q` values are used. Using those :math:`Q` values, the exploration policy then selects the action for acting.
Storing the transitions
+++++++++++++++++++++++
For each transition, a binomial mask is generated according to a predefined probability and the number of output heads.
The mask is a binary vector where each element holds a 0 for heads that shouldn't train on the specific transition,
and 1 for heads that should use the transition for training. The mask is stored as part of the transition info in
the replay buffer.
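For example, the mask for a single transition can be generated as follows (a sketch, not Coach's implementation):
.. code-block:: python

    import numpy as np

    def bootstrap_mask(num_heads, probability):
        """Binary vector stored with the transition: 1 means the head trains on this transition."""
        return np.random.binomial(n=1, p=probability, size=num_heads)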
Training the network
++++++++++++++++++++
First, sample a batch of transitions from the replay buffer. Run the current states through the network and get the
current :math:`Q` value predictions for all the heads and all the actions. For each transition in the batch,
and for each output head, if the transition mask is 1 - change the targets of the played action to :math:`y_t`,
according to the standard DQN update rule:
:math:`y_t=r(s_t,a_t )+\gamma\cdot max_a Q(s_{t+1},a)`
Otherwise, leave it intact so that the transition does not affect the learning of this head.
Then, train the online network according to the calculated targets.
As in DQN, once in every few thousand steps, copy the weights from the online network to the target network.

View File

@@ -0,0 +1,39 @@
Categorical DQN
===============
**Action space:** Discrete
**References:** `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/distributional_dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
that the :math:`i`-th component of the projected update is calculated as follows:
:math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
where:
* :math:`[ \cdot ]` bounds its argument in the range :math:`[a, b]`
* :math:`\hat{T}_{z_{j}}` is the Bellman update for atom :math:`z_j`: :math:`\hat{T}_{z_{j}} := r+\gamma z_j`
3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
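A NumPy sketch of the projection step described above (illustrative only; terminal transitions, which drop the bootstrapped term, are noted in a comment rather than handled explicitly):
.. code-block:: python

    import numpy as np

    def project_bellman_update(rewards, next_probs, gamma, v_min, v_max):
        """rewards: (batch,); next_probs: (batch, num_atoms) = p_j(s_{t+1}, pi(s_{t+1}))."""
        batch_size, num_atoms = next_probs.shape
        z = np.linspace(v_min, v_max, num_atoms)
        delta_z = (v_max - v_min) / (num_atoms - 1)
        projected = np.zeros_like(next_probs)
        rows = np.arange(batch_size)
        for j in range(num_atoms):
            # Bellman update of atom z_j, bounded to [V_MIN, V_MAX]
            # (for terminal transitions the gamma * z_j term would be dropped)
            tz_j = np.clip(rewards + gamma * z[j], v_min, v_max)
            b = (tz_j - v_min) / delta_z                  # fractional atom index
            lower = np.floor(b).astype(int)
            upper = np.ceil(b).astype(int)
            # distribute the probability mass of atom j between its two neighbouring atoms
            projected[rows, lower] += next_probs[:, j] * (upper - b + (lower == upper))
            projected[rows, upper] += next_probs[:, j] * (b - lower)
        return projected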
.. autoclass:: rl_coach.agents.categorical_dqn_agent.CategoricalDQNAlgorithmParameters

View File

@@ -0,0 +1,35 @@
Double DQN
==========
**Action space:** Discrete
**References:** `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the online network in order to find the :math:`Q`-maximizing
action :math:`argmax_a Q(s_{t+1},a)`. For these actions, use the corresponding next states and run the target
network to calculate :math:`Q(s_{t+1},argmax_a Q(s_{t+1},a))`.
3. In order to zero out the updates for the actions that were not played (by zeroing their contribution to the MSE loss),
use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation for calculating the targets of the network:
:math:`y_t=r(s_t,a_t )+\gamma \cdot Q(s_{t+1},argmax_a Q(s_{t+1},a))`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.
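A minimal sketch of the target computation (illustrative array names; assumes the online and target networks' :math:`Q` values for the next states were already computed):
.. code-block:: python

    import numpy as np

    def double_dqn_targets(rewards, q_online_next, q_target_next, gamma):
        """y_t = r + gamma * Q_target(s_{t+1}, argmax_a Q_online(s_{t+1}, a))"""
        best_actions = q_online_next.argmax(axis=1)                    # action selection: online network
        rows = np.arange(len(rewards))
        return rewards + gamma * q_target_next[rows, best_actions]     # action evaluation: target network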

View File

@@ -0,0 +1,37 @@
Deep Q Networks
===============
**Action space:** Discrete
**References:** `Playing Atari with Deep Reinforcement Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Using the next states from the sampled batch, run the target network to calculate the :math:`Q` values for each of
the actions :math:`Q(s_{t+1},a)`, and keep only the maximum value for each state.
3. In order to zero out the updates for the actions that were not played (by zeroing their contribution to the MSE loss),
use the current states from the sampled batch, and run the online network to get the current :math:`Q` value predictions.
Set those values as the targets for the actions that were not actually played.
4. For each action that was played, use the following equation for calculating the targets of the network:
:math:`y_t=r(s_t,a_t )+\gamma \cdot max_a Q(s_{t+1},a)`
5. Finally, train the online network using the current states as inputs, and with the aforementioned targets.
6. Once in every few thousand steps, copy the weights from the online network to the target network.
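For illustration, the target computation for the played actions can be sketched as follows (assuming the target network's :math:`Q` values for the next states are already available):
.. code-block:: python

    import numpy as np

    def dqn_targets(rewards, q_target_next, gamma):
        """y_t = r(s_t, a_t) + gamma * max_a Q_target(s_{t+1}, a)"""
        return rewards + gamma * q_target_next.max(axis=1)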
.. autoclass:: rl_coach.agents.dqn_agent.DQNAlgorithmParameters

View File

@@ -0,0 +1,27 @@
Dueling DQN
===========
**Action space:** Discrete
**References:** `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dueling_dqn.png
:align: center
General Description
-------------------
Dueling DQN introduces a change in the network structure compared to DQN.
Dueling DQN uses a specialized *Dueling Q Head* in order to separate :math:`Q` into an :math:`A` (advantage)
stream and a :math:`V` (value) stream. Adding this type of structure to the network head allows the network to better differentiate
actions from one another, and significantly improves the learning.
In many states, the values of the different actions are very similar, and it is less important which action to take.
This is especially important in environments where there are many actions to choose from. In DQN, on each training
iteration, for each of the states in the batch, we update the :math:`Q` values only for the specific actions taken in
those states. This results in slower learning, as we do not learn the :math:`Q` values for actions that were not taken yet.
With the dueling architecture, on the other hand, learning is faster, as we start learning the state value even if only a
single action has been taken in this state.
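The standard aggregation of the two streams (as in the Dueling DQN paper) subtracts the mean advantage so that the decomposition is identifiable; a minimal sketch over NumPy arrays:
.. code-block:: python

    def dueling_q_values(state_value, advantages):
        """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)

        state_value: NumPy array of shape (batch, 1); advantages: (batch, num_actions)
        """
        return state_value + advantages - advantages.mean(axis=1, keepdims=True)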

View File

@@ -0,0 +1,37 @@
Mixed Monte Carlo
=================
**Action space:** Discrete
**References:** `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
In MMC, targets are calculated as a mixture between Double DQN targets and full Monte Carlo samples (total discounted returns).
The DDQN targets are calculated in the same manner as in the DDQN agent:
:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
The Monte Carlo targets are calculated by summing up the discounted rewards across the entire episode:
:math:`y_t^{MC}=\sum_{j=0}^T\gamma^j r(s_{t+j},a_{t+j} )`
A mixing ratio :math:`\alpha` is then used to get the final targets:
:math:`y_t=(1-\alpha)\cdot y_t^{DDQN}+\alpha \cdot y_t^{MC}`
Finally, the online network is trained using the current states as inputs, and the calculated targets.
Once in every few thousand steps, copy the weights from the online network to the target network.
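A minimal sketch of the target mixing (``ddqn_targets`` would be computed as on the Double DQN page; the names are illustrative):
.. code-block:: python

    import numpy as np

    def monte_carlo_returns(rewards, gamma):
        """Total discounted return from each step until the end of the episode (y_t^MC)."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def mmc_targets(ddqn_targets, mc_returns, alpha):
        """y_t = (1 - alpha) * y_t^DDQN + alpha * y_t^MC"""
        return (1.0 - alpha) * ddqn_targets + alpha * mc_returns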
.. autoclass:: rl_coach.agents.mmc_agent.MixedMonteCarloAlgorithmParameters

View File

@@ -0,0 +1,35 @@
N-Step Q Learning
=================
**Action space:** Discrete
**References:** `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
The :math:`N`-step Q learning algorithm works in a similar manner to DQN, except for the following changes:
1. No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every
:math:`N` steps using the latest :math:`N` steps played by the agent.
2. In order to stabilize the learning, multiple workers work together to update the network.
This has a similar effect to decorrelating the samples used for training.
3. Instead of using single-step Q targets for the network, the rewards from :math:`N` consecutive steps are accumulated
to form the :math:`N`-step Q targets, according to the following equation:
:math:`R(s_t, a_t) = \sum_{i=t}^{i=t + k - 1} \gamma^{i-t}r_i +\gamma^{k} V(s_{t+k})`
where :math:`k` is :math:`T_{max} - State\_Index` for each state in the batch.
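A minimal sketch of the :math:`N`-step targets for the latest :math:`N` transitions, where the bootstrap value :math:`V(s_{t+k})` is taken as the maximum target-network :math:`Q` value of the state following the last step (illustrative, not Coach's implementation):
.. code-block:: python

    import numpy as np

    def n_step_q_targets(rewards, q_target_last_next, gamma):
        """rewards: (N,) rewards of the last N steps;
        q_target_last_next: (num_actions,) Q values of the state that follows the last step."""
        targets = np.zeros(len(rewards))
        running = q_target_last_next.max()      # bootstrap with V(s_{t+k}) = max_a Q(s_{t+k}, a)
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            targets[t] = running
        return targets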
.. autoclass:: rl_coach.agents.n_step_q_agent.NStepQAlgorithmParameters

View File

@@ -0,0 +1,33 @@
Normalized Advantage Functions
==============================
**Action space:** Continuous
**References:** `Continuous Deep Q-Learning with Model-based Acceleration <https://arxiv.org/abs/1603.00748.pdf>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/naf.png
:width: 600px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
The current state is used as an input to the network. The action mean :math:`\mu(s_t )` is extracted from the output head.
It is then passed to the exploration policy which adds noise in order to encourage exploration.
Training the network
++++++++++++++++++++
The network is trained by using the following targets:
:math:`y_t=r(s_t,a_t )+\gamma\cdot V(s_{t+1})`
Use the next states as the inputs to the target network and extract the :math:`V` value from the head
to get :math:`V(s_{t+1} )`. Then, update the online network using the current states and actions as inputs,
and :math:`y_t` as the targets.
After every training step, use a soft update in order to copy the weights from the online network to the target network.
.. autoclass:: rl_coach.agents.naf_agent.NAFAlgorithmParameters

View File

@@ -0,0 +1,50 @@
Neural Episodic Control
=======================
**Action space:** Discrete
**References:** `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/nec.png
:width: 500px
:align: center
Algorithm Description
---------------------
Choosing an action
++++++++++++++++++
1. Use the current state as an input to the online network and extract the state embedding, which is the intermediate
output from the middleware.
2. For each possible action :math:`a_i`, run the DND head using the state embedding and the selected action :math:`a_i` as inputs.
The DND is queried and returns the :math:`P` nearest neighbor keys and values. The keys and values are used to calculate
and return the action :math:`Q` value from the network.
3. Pass all the :math:`Q` values to the exploration policy and choose an action accordingly.
4. Store the state embeddings and actions taken during the current episode in a small buffer :math:`B`, in order to
accumulate transitions until it is possible to calculate the total discounted returns over the entire episode.
Finalizing an episode
+++++++++++++++++++++
For each step in the episode, the state embeddings and the taken actions are stored in the buffer :math:`B`.
When the episode is finished, the replay buffer calculates the :math:`N`-step total return of each transition in the
buffer, bootstrapped using the maximum :math:`Q` value of the :math:`N`-th transition. The state embeddings are then
inserted into the DND along with the corresponding returns, and the buffer :math:`B` is reset.
Training the network
++++++++++++++++++++
Train the network only when the DND has enough entries for querying.
To train the network, the current states are used as the inputs and the :math:`N`-step returns are used as the targets.
The :math:`N`-step return used takes into account :math:`N` consecutive steps, and bootstraps the last value from
the network if necessary:
:math:`y_t=\sum_{j=0}^{N-1}\gamma^j r(s_{t+j},a_{t+j} ) +\gamma^N max_a Q(s_{t+N},a)`
.. autoclass:: rl_coach.agents.nec_agent.NECAlgorithmParameters

View File

@@ -0,0 +1,45 @@
Persistent Advantage Learning
=============================
**Action space:** Discrete
**References:** `Increasing the Action Gap: New Operators for Reinforcement Learning <https://arxiv.org/abs/1512.04860>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. Start by calculating the initial target values in the same manner as they are calculated in DDQN:
:math:`y_t^{DDQN}=r(s_t,a_t )+\gamma Q(s_{t+1},argmax_a Q(s_{t+1},a))`
3. The action gap :math:`V(s_t )-Q(s_t,a_t)` should then be subtracted from each of the calculated targets.
To calculate the action gap, run the target network using the current states and get the :math:`Q` values
for all the actions. Then estimate :math:`V` as the maximum predicted :math:`Q` value for the current state:
:math:`V(s_t )=max_a Q(s_t,a)`
4. For *advantage learning (AL)*, subtract the action gap, weighted by a predefined parameter :math:`\alpha`, from
the targets :math:`y_t^{DDQN}`:
:math:`y_t=y_t^{DDQN}-\alpha \cdot (V(s_t )-Q(s_t,a_t ))`
5. For *persistent advantage learning (PAL)*, the target network is also used in order to calculate the action
gap for the next state:
:math:`V(s_{t+1} )-Q(s_{t+1},a_{t+1})`
where :math:`a_{t+1}` is chosen by running the next states through the online network and choosing the action that
has the highest predicted :math:`Q` value. Finally, the targets are defined as:
:math:`y_t=y_t^{DDQN}-\alpha \cdot min(V(s_t )-Q(s_t,a_t ),V(s_{t+1} )-Q(s_{t+1},a_{t+1} ))`
6. Train the online network using the current states as inputs, and with the aforementioned targets.
7. Once in every few thousand steps, copy the weights from the online network to the target network.
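A minimal sketch of the AL/PAL target computation (illustrative names; assumes the relevant :math:`Q` values were already computed by the online and target networks):
.. code-block:: python

    import numpy as np

    def pal_targets(rewards, q_target_curr, q_target_next, q_online_next, actions, gamma, alpha,
                    persistent=True):
        """q_target_curr / q_target_next: (batch, num_actions) target-network Q values for s_t / s_{t+1};
        q_online_next: (batch, num_actions) online-network Q values for s_{t+1}; actions: (batch,) a_t."""
        rows = np.arange(len(rewards))
        next_actions = q_online_next.argmax(axis=1)                          # a_{t+1}, chosen by the online network
        y_ddqn = rewards + gamma * q_target_next[rows, next_actions]
        gap_t = q_target_curr.max(axis=1) - q_target_curr[rows, actions]     # V(s_t) - Q(s_t, a_t)
        if not persistent:
            return y_ddqn - alpha * gap_t                                    # advantage learning (AL)
        gap_t1 = q_target_next.max(axis=1) - q_target_next[rows, next_actions]
        return y_ddqn - alpha * np.minimum(gap_t, gap_t1)                    # persistent advantage learning (PAL)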
.. autoclass:: rl_coach.agents.pal_agent.PALAlgorithmParameters

View File

@@ -0,0 +1,33 @@
Quantile Regression DQN
=======================
**Action space:** Discrete
**References:** `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/qr_dqn.png
:align: center
Algorithm Description
---------------------
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. First, the next state quantiles are predicted. These are used in order to calculate the targets for the network,
by following the Bellman equation.
Next, the quantile locations for the current states are predicted, sorted, and used for calculating the
quantile midpoint targets.
3. The network is trained with the quantile regression loss between the resulting quantile locations and the target
quantile locations. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
.. autoclass:: rl_coach.agents.qr_dqn_agent.QuantileRegressionDQNAlgorithmParameters

View File

@@ -0,0 +1,51 @@
Rainbow
=======
**Action space:** Discrete
**References:** `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_
Network Structure
-----------------
.. image:: /_static/img/design_imgs/rainbow.png
:align: center
Algorithm Description
---------------------
Rainbow combines 6 recent advancements in reinforcement learning:
* N-step returns
* Distributional state-action value learning
* Dueling networks
* Noisy Networks
* Double DQN
* Prioritized Experience Replay
Training the network
++++++++++++++++++++
1. Sample a batch of transitions from the replay buffer.
2. The Bellman update is projected to the set of atoms representing the :math:`Q` values distribution, such
that the :math:`i`-th component of the projected update is calculated as follows:
:math:`(\Phi \hat{T} Z_{\theta}(s_t,a_t))_i=\sum_{j=0}^{N-1}\Big[1-\frac{\lvert[\hat{T}_{z_{j}}]^{V_{MAX}}_{V_{MIN}}-z_i\rvert}{\Delta z}\Big]^1_0 \ p_j(s_{t+1}, \pi(s_{t+1}))`
where:
* :math:`[ \cdot ]` bounds its argument in the range :math:`[a, b]`
* :math:`\hat{T}_{z_{j}}` is the Bellman update for atom
:math:`z_j`: :math:`\hat{T}_{z_{j}} := r_t+\gamma r_{t+1} + ... + \gamma r_{t+n-1} + \gamma^{n-1} z_j`
3. The network is trained with the cross-entropy loss between the resulting probability distribution and the target
probability distribution. Only the targets of the actions that were actually taken are updated.
4. Once in every few thousand steps, weights are copied from the online network to the target network.
5. After every training step, the priorities of the batch transitions are updated in the prioritized replay buffer
using the KL divergence loss that is returned from the network.
.. autoclass:: rl_coach.agents.rainbow_dqn_agent.RainbowDQNAlgorithmParameters

View File

@@ -0,0 +1,27 @@
Architectures
=============
Architectures contain all the classes that implement the neural network functionality for the agent.
Since Coach is intended to work with multiple neural network frameworks, each framework implements its
own components under a dedicated directory. For example, the tensorflow directory contains all the neural network
parts that are implemented using TensorFlow.
.. autoclass:: rl_coach.base_parameters.NetworkParameters
Architecture
------------
.. autoclass:: rl_coach.architectures.architecture.Architecture
:members:
:inherited-members:
NetworkWrapper
--------------
.. image:: /_static/img/distributed.png
:width: 600px
:align: center
.. autoclass:: rl_coach.architectures.network_wrapper.NetworkWrapper
:members:
:inherited-members:

View File

@@ -0,0 +1,33 @@
Core Types
==========
ActionInfo
----------
.. autoclass:: rl_coach.core_types.ActionInfo
:members:
:inherited-members:
Batch
-----
.. autoclass:: rl_coach.core_types.Batch
:members:
:inherited-members:
EnvResponse
-----------
.. autoclass:: rl_coach.core_types.EnvResponse
:members:
:inherited-members:
Episode
-------
.. autoclass:: rl_coach.core_types.Episode
:members:
:inherited-members:
Transition
----------
.. autoclass:: rl_coach.core_types.Transition
:members:
:inherited-members:

View File

@@ -0,0 +1,70 @@
Environments
============
.. autoclass:: rl_coach.environments.environment.Environment
:members:
:inherited-members:
DeepMind Control Suite
----------------------
A set of reinforcement learning environments powered by the MuJoCo physics engine.
Website: `DeepMind Control Suite <https://github.com/deepmind/dm_control>`_
.. autoclass:: rl_coach.environments.control_suite_environment.ControlSuiteEnvironment
Blizzard Starcraft II
---------------------
A popular strategy game, wrapped with a Python interface by DeepMind.
Website: `Blizzard Starcraft II <https://github.com/deepmind/pysc2>`_
.. autoclass:: rl_coach.environments.starcraft2_environment.StarCraft2Environment
ViZDoom
--------
A Doom-based AI research platform for reinforcement learning from raw visual information.
Website: `ViZDoom <http://vizdoom.cs.put.edu.pl/>`_
.. autoclass:: rl_coach.environments.doom_environment.DoomEnvironment
CARLA
-----
An open-source simulator for autonomous driving research.
Website: `CARLA <https://github.com/carla-simulator/carla>`_
.. autoclass:: rl_coach.environments.carla_environment.CarlaEnvironment
OpenAI Gym
----------
A library which consists of a set of environments, from games to robotics.
Additionally, it can be extended using the API defined by the authors.
Website: `OpenAI Gym <https://gym.openai.com/>`_
In Coach, we support all the native environments in Gym, along with several extensions such as:
* `Roboschool <https://github.com/openai/roboschool>`_ - a set of environments powered by the PyBullet engine,
offering a free alternative to MuJoCo.
* `Gym Extensions <https://github.com/Breakend/gym-extensions>`_ - a set of environments that extends Gym for
auxiliary tasks (multitask learning, transfer learning, inverse reinforcement learning, etc.)
* `PyBullet <https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet>`_ - a physics engine that
includes a set of robotics environments.
.. autoclass:: rl_coach.environments.gym_environment.GymEnvironment

View File

@@ -0,0 +1,87 @@
Exploration Policies
====================
Exploration policies allow the agent to trade off exploration and exploitation according to a
predefined policy. This is one of the most important aspects of reinforcement learning agents, and can require some
tuning to get right. Coach supports several pre-defined exploration policies, and can be easily extended with
custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action
spaces.
.. role:: green
.. role:: red
+----------------------+-----------------------+------------------+
| Exploration Policy | Discrete Action Space | Box Action Space |
+======================+=======================+==================+
| AdditiveNoise | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| Boltzmann | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| Bootstrapped | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| Categorical | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
| ContinuousEntropy | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| EGreedy | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| Greedy | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| OUProcess | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| ParameterNoise | :green:`V` | :green:`V` |
+----------------------+-----------------------+------------------+
| TruncatedNormal | :red:`X` | :green:`V` |
+----------------------+-----------------------+------------------+
| UCB | :green:`V` | :red:`X` |
+----------------------+-----------------------+------------------+
ExplorationPolicy
-----------------
.. autoclass:: rl_coach.exploration_policies.ExplorationPolicy
:members:
:inherited-members:
AdditiveNoise
-------------
.. autoclass:: rl_coach.exploration_policies.AdditiveNoise
Boltzmann
---------
.. autoclass:: rl_coach.exploration_policies.Boltzmann
Bootstrapped
------------
.. autoclass:: rl_coach.exploration_policies.Bootstrapped
Categorical
-----------
.. autoclass:: rl_coach.exploration_policies.Categorical
ContinuousEntropy
-----------------
.. autoclass:: rl_coach.exploration_policies.ContinuousEntropy
EGreedy
-------
.. autoclass:: rl_coach.exploration_policies.EGreedy
Greedy
------
.. autoclass:: rl_coach.exploration_policies.Greedy
OUProcess
---------
.. autoclass:: rl_coach.exploration_policies.OUProcess
ParameterNoise
--------------
.. autoclass:: rl_coach.exploration_policies.ParameterNoise
TruncatedNormal
---------------
.. autoclass:: rl_coach.exploration_policies.TruncatedNormal
UCB
---
.. autoclass:: rl_coach.exploration_policies.UCB

View File

@@ -0,0 +1,28 @@
Filters
=======
.. toctree::
:maxdepth: 1
:caption: Filters
input_filters
output_filters
Filters are a mechanism in Coach that allows pre-processing and post-processing of the agent's internal information.
There are two filter categories -
* **Input filters** - these are filters that process the information passed **into** the agent from the environment.
This information includes the observation and the reward. Input filters therefore allow rescaling observations,
normalizing rewards, stacking observations, etc.
* **Output filters** - these are filters that process the information going **out** of the agent into the environment.
This information includes the action the agent chooses to take. Output filters therefore allow conversion of
actions from one space into another. For example, the agent can take :math:`N` discrete actions, which will be mapped by
the output filter onto :math:`N` continuous actions.
Filters can be stacked on top of each other in order to build complex processing flows of the inputs or outputs.
.. image:: /_static/img/filters.png
:width: 350px
:align: center

View File

@@ -0,0 +1,67 @@
Input Filters
=============
The input filters are separated into two categories - **observation filters** and **reward filters**.
Observation Filters
-------------------
ObservationClippingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationClippingFilter
ObservationCropFilter
+++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationCropFilter
ObservationMoveAxisFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationMoveAxisFilter
ObservationNormalizationFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationNormalizationFilter
ObservationReductionBySubPartsNameFilter
++++++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationReductionBySubPartsNameFilter
ObservationRescaleSizeByFactorFilter
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleSizeByFactorFilter
ObservationRescaleToSizeFilter
++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRescaleToSizeFilter
ObservationRGBToYFilter
+++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationRGBToYFilter
ObservationSqueezeFilter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationSqueezeFilter
ObservationStackingFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationStackingFilter
ObservationToUInt8Filter
++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.observation.ObservationToUInt8Filter
Reward Filters
--------------
RewardClippingFilter
++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardClippingFilter
RewardNormalizationFilter
+++++++++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardNormalizationFilter
RewardRescaleFilter
+++++++++++++++++++
.. autoclass:: rl_coach.filters.reward.RewardRescaleFilter

View File

@@ -0,0 +1,37 @@
Output Filters
--------------
The output filters only process the actions.
Action Filters
++++++++++++++
.. autoclass:: rl_coach.filters.action.AttentionDiscretization
.. image:: /_static/img/attention_discretization.png
:align: center
.. autoclass:: rl_coach.filters.action.BoxDiscretization
.. image:: /_static/img/box_discretization.png
:align: center
.. autoclass:: rl_coach.filters.action.BoxMasking
.. image:: /_static/img/box_masking.png
:align: center
.. autoclass:: rl_coach.filters.action.PartialDiscreteActionSpaceMap
.. image:: /_static/img/partial_discrete_action_space_map.png
:align: center
.. autoclass:: rl_coach.filters.action.FullDiscreteActionSpaceMap
.. image:: /_static/img/full_discrete_action_space_map.png
:align: center
.. autoclass:: rl_coach.filters.action.LinearBoxToBoxMap
.. image:: /_static/img/linear_box_to_box_map.png
:align: center

View File

@@ -0,0 +1,44 @@
Memories
========
Episodic Memories
-----------------
EpisodicExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicExperienceReplay
EpisodicHindsightExperienceReplay
+++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHindsightExperienceReplay
EpisodicHRLHindsightExperienceReplay
++++++++++++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.EpisodicHRLHindsightExperienceReplay
SingleEpisodeBuffer
+++++++++++++++++++
.. autoclass:: rl_coach.memories.episodic.SingleEpisodeBuffer
Non-Episodic Memories
---------------------
BalancedExperienceReplay
++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.BalancedExperienceReplay
QDND
++++
.. autoclass:: rl_coach.memories.non_episodic.QDND
ExperienceReplay
++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.ExperienceReplay
PrioritizedExperienceReplay
+++++++++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.PrioritizedExperienceReplay
TransitionCollection
++++++++++++++++++++
.. autoclass:: rl_coach.memories.non_episodic.TransitionCollection

View File

@@ -0,0 +1,64 @@
Spaces
======
Space
-----
.. autoclass:: rl_coach.spaces.Space
:members:
:inherited-members:
Observation Spaces
------------------
.. autoclass:: rl_coach.spaces.ObservationSpace
:members:
:inherited-members:
VectorObservationSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.VectorObservationSpace
PlanarMapsObservationSpace
++++++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.PlanarMapsObservationSpace
ImageObservationSpace
+++++++++++++++++++++
.. autoclass:: rl_coach.spaces.ImageObservationSpace
Action Spaces
-------------
.. autoclass:: rl_coach.spaces.ActionSpace
:members:
:inherited-members:
AttentionActionSpace
++++++++++++++++++++
.. autoclass:: rl_coach.spaces.AttentionActionSpace
BoxActionSpace
++++++++++++++
.. autoclass:: rl_coach.spaces.BoxActionSpace
DiscreteActionSpace
++++++++++++++++++++
.. autoclass:: rl_coach.spaces.DiscreteActionSpace
MultiSelectActionSpace
++++++++++++++++++++++
.. autoclass:: rl_coach.spaces.MultiSelectActionSpace
CompoundActionSpace
+++++++++++++++++++
.. autoclass:: rl_coach.spaces.CompoundActionSpace
Goal Spaces
-----------
.. autoclass:: rl_coach.spaces.GoalsSpace
:members:
:inherited-members: