update of api docstrings across coach and tutorials [WIP] (#91)
* updating the documentation website
* adding the built docs
* update of api docstrings across coach and tutorials 0-2
* added some missing api documentation
* New Sphinx based documentation
102  docs/_sources/design/control_flow.rst.txt  Normal file
@@ -0,0 +1,102 @@
Control Flow
============

Coach is built in a modular way, encouraging module reuse and reducing the amount of boilerplate code needed
for developing new algorithms or integrating a new challenge as an environment.
On the other hand, it can be overwhelming for new users to ramp up on the code.
To help with that, here's a short overview of the control flow.

Graph Manager
-------------

The main entry point for Coach is :code:`coach.py`.
Its main job is to parse the command line arguments and invoke all the sub-processes needed
for the given experiment.
:code:`coach.py` executes the given **preset** file, which returns a :code:`GraphManager` object.

A **preset** is a design pattern intended to concentrate the entire definition of an experiment in a single
file. This helps with experiment reproducibility, improves readability and prevents confusion.
The outcome of a preset is a :code:`GraphManager`, which will usually be instantiated in the final lines of the preset.

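For illustration, a minimal preset might look roughly like the sketch below. The module paths,
parameter classes and step values here are assumptions for illustration, and may not exactly match
the preset files shipped with Coach.

.. code-block:: python

   # Hypothetical minimal preset: define agent, environment and schedule
   # parameters, then instantiate a GraphManager in the final lines.
   from rl_coach.agents.dqn_agent import DQNAgentParameters
   from rl_coach.environments.gym_environment import GymVectorEnvironment
   from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
   from rl_coach.graph_managers.graph_manager import ScheduleParameters
   from rl_coach.core_types import EnvironmentEpisodes, EnvironmentSteps, TrainingSteps

   schedule_params = ScheduleParameters()
   schedule_params.heatup_steps = EnvironmentSteps(1000)
   schedule_params.improve_steps = TrainingSteps(100000)
   schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(10)
   schedule_params.evaluation_steps = EnvironmentEpisodes(1)

   graph_manager = BasicRLGraphManager(agent_params=DQNAgentParameters(),
                                       env_params=GymVectorEnvironment(level='CartPole-v0'),
                                       schedule_params=schedule_params)

:code:`coach.py` then loads this graph manager and schedules the experiment according to these parameters.
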
A :code:`GraphManager` is an object that holds all the agents and environments of an experiment, and is mostly responsible
for scheduling their work. Why is it called a **graph** manager? Because agents and environments are structured into
a graph of interactions. For example, in hierarchical reinforcement learning schemes, there will often be a master
policy agent that controls a sub-policy agent, which in turn interacts with the environment. Other schemes can have
much more complex graphs of control, such as several hierarchy layers, each with multiple agents.
The graph manager's main loop is the improve loop.

.. image:: /_static/img/improve.png
   :width: 400px
   :align: center

The improve loop cycles between 3 main phases - heatup, training and evaluation:

* **Heatup** - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase
  takes place only at the beginning of the experiment, and the agents act completely randomly during it.
  Importantly, the agents do not train their networks during this phase. DQN, for example, uses 50k random steps
  to initialize the replay buffers.

* **Training** - the training phase is the main phase of the experiment. Its details can differ between agent types,
  but it essentially consists of repeated cycles of acting, collecting data from the environment, and training the
  agent's networks. During this phase, the agent uses its exploration policy in training mode, which adds noise to its
  actions in order to improve its knowledge of the environment state space.

* **Evaluation** - the evaluation phase is intended for evaluating the current performance of the agent. The agents
  act greedily in order to exploit the knowledge aggregated so far, and the performance over multiple evaluation
  episodes is averaged in order to reduce the effects of stochasticity in all the components.

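Schematically, and glossing over many details of the real implementation, the improve loop can be
pictured as the following sketch. The method names and the plain integer step counts below are
illustrative stand-ins rather than Coach's exact API.

.. code-block:: python

   def improve(graph_manager, schedule):
       # Heatup: act randomly to populate the replay buffers, without training.
       graph_manager.heatup(schedule.heatup_steps)

       # Then alternate between training and evaluation until the step budget is spent.
       steps_done = 0
       while steps_done < schedule.improve_steps:
           # Training: act with the exploration policy and train the networks.
           graph_manager.train_and_act(schedule.steps_between_evaluation_periods)
           # Evaluation: act greedily and average the return over several episodes.
           graph_manager.evaluate(schedule.evaluation_steps)
           steps_done += schedule.steps_between_evaluation_periods
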
Level Manager
-------------

In each of the 3 phases described above, the graph manager will invoke all the hierarchy levels in the graph in a
synchronized manner. In Coach, agents do not interact directly with the environment. Instead, they go through a
*LevelManager*, which is a proxy that manages their interaction. The level manager passes the current state and reward
from the environment to the agent, and the actions from the agent to the environment.

The motivation for having a level manager is to disentangle the code of the environment and the agent, so as to allow more
complex interactions. Each level can have multiple agents which interact with the environment. Which agent gets to choose
the action at each step is controlled by the level manager.
Additionally, each level manager can act as an environment for the hierarchy level above it, such that each hierarchy
level can be seen as an interaction between an agent and an environment, even if the environment is just more agents in
a lower hierarchy level.

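A stripped-down sketch of that proxy role is shown below, for a single agent and a single
environment. Coach's real LevelManager handles multiple agents and full hierarchies; the class and
attribute names here are hypothetical.

.. code-block:: python

   class SimpleLevelManager:
       """Hypothetical single-agent proxy between an agent and its environment."""

       def __init__(self, agent, environment):
           self.agent = agent
           self.environment = environment

       def step(self):
           # Pass the current state and reward from the environment to the agent...
           self.agent.observe(self.environment.last_env_response)
           # ...and the action chosen by the agent back to the environment.
           action = self.agent.act()
           self.environment.step(action)
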
Agent
-----

The base agent class has 3 main functions that are used during those phases - observe, act and train.

* **Observe** - this function gets the latest response from the environment as input, and updates the internal state
  of the agent with the new information. The environment response is
  first passed through the agent's :code:`InputFilter` object, which processes the values in the response according
  to the specific agent definition. The environment response is then converted into a
  :code:`Transition`, which contains the information from a single step
  :math:`(s_{t}, a_{t}, r_{t}, s_{t+1}, \textrm{terminal signal})`, and is stored in the memory.

.. image:: /_static/img/observe.png
   :width: 700px
   :align: center

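The observe flow can be sketched as follows, with simplified stand-ins for the
:code:`InputFilter`, :code:`Transition` and memory objects mentioned above.

.. code-block:: python

   from dataclasses import dataclass

   @dataclass
   class Transition:
       """Simplified stand-in for Coach's Transition class."""
       state: object
       action: object
       reward: float
       next_state: object
       game_over: bool

   def observe(agent, env_response):
       # 1. Run the raw environment response through the agent's input filter
       #    (e.g. observation rescaling or reward clipping).
       filtered = agent.input_filter.filter(env_response)
       # 2. Wrap the single-step information (s_t, a_t, r_t, s_t+1, terminal)
       #    in a Transition.
       transition = Transition(state=agent.current_state,
                               action=agent.last_action,
                               reward=filtered.reward,
                               next_state=filtered.next_state,
                               game_over=filtered.game_over)
       # 3. Store the transition in the agent's memory (e.g. a replay buffer).
       agent.memory.store(transition)
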
* **Act** - this function uses the current internal state of the agent in order to select the next action to take on
  the environment. It calls the per-agent custom function :code:`choose_action`, which uses the network
  and the exploration policy in order to select an action. The action is stored, together with any additional
  information (like the action value, for example), in an :code:`ActionInfo` object. The ActionInfo object is then
  passed through the agent's :code:`OutputFilter` to allow any processing of the action (like discretization
  or shifting, for example), before passing it to the environment.

.. image:: /_static/img/act.png
   :width: 700px
   :align: center

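The act flow can be sketched in the same spirit; :code:`ActionInfo` below is a simplified
stand-in, and the attribute names on the agent are assumptions for illustration.

.. code-block:: python

   from dataclasses import dataclass

   @dataclass
   class ActionInfo:
       """Simplified stand-in for Coach's ActionInfo class."""
       action: object
       action_value: float = 0.0  # e.g. Q(s, a) for value-based agents

   def act(agent, environment):
       # 1. The agent-specific choose_action uses the network and the
       #    exploration policy to pick an action for the current state.
       action_info = agent.choose_action(agent.current_state)
       # 2. The chosen action passes through the output filter (e.g.
       #    discretization or shifting) before reaching the environment.
       env_action = agent.output_filter.filter(action_info.action)
       environment.step(env_action)
       return action_info
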
* **Train** - this function samples a batch from the memory and trains on it. The batch of transitions is
  first wrapped into a :code:`Batch` object to allow efficient querying of the batch values. It is then passed into
  the agent-specific :code:`learn_from_batch` function, which extracts the network target values from the batch and
  trains the networks accordingly. Lastly, if there is a target network defined for the agent, the agent syncs the
  target network weights with the online network.

.. image:: /_static/img/train.png
   :width: 700px
   :align: center

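The train flow follows the same pattern; again, the attribute and method names below are
illustrative stand-ins rather than Coach's exact API.

.. code-block:: python

   def train(agent, batch_size=32, steps_between_target_sync=1000):
       # 1. Sample a batch of transitions from the memory; in Coach this gets
       #    wrapped into a Batch object for convenient column-wise access.
       batch = agent.memory.sample(batch_size)
       # 2. The agent-specific learn_from_batch extracts the network target
       #    values from the batch and trains the online network on them.
       loss = agent.learn_from_batch(batch)
       # 3. Periodically copy the online weights into the target network,
       #    if the agent defines one.
       agent.training_iteration += 1
       if agent.target_network is not None and \
               agent.training_iteration % steps_between_target_sync == 0:
           agent.target_network.set_weights(agent.online_network.get_weights())
       return loss
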
148  docs/_sources/design/horizontal_scaling.rst.txt  Normal file
@@ -0,0 +1,148 @@
# Scaling out rollout workers

This document lays out some options for implementing horizontal scaling of rollout workers in coach, though most details are not specific to coach. My current suggestion is to start with Option 1, and move on to Option 1a or Option 1b as required.

## Off Policy Algorithms

### Option 1 - master polls file system

- one master process samples memories and updates the policy
- many worker processes execute rollouts
- coordinate using a single shared networked file system: nfs, ceph, dat, s3fs, etc.
- policy sync communication method:
  - master process occasionally writes policy to shared file system
  - worker processes occasionally read policy from shared file system
  - prevent workers from reading a policy which has not been completely written to disk using either:
    - redis lock
    - write to temporary files and then rename (see the sketch after this list)
- rollout memories:
  - sync communication method:
    - worker processes write rollout memories as they are generated to the shared file system
    - master process occasionally reads rollout memories from the shared file system
    - master process must be resilient to corrupted or incompletely written memories
  - sampling method:
    - master process keeps all rollouts in memory utilizing existing coach memory classes
- control flow:
  - master:
    - run training updates interleaved with loading of any newly available rollouts into memory
    - periodically write policy to disk
  - workers:
    - periodically read policy from disk
    - evaluate rollouts and write them to disk
- ops:
  - kubernetes yaml, kml, docker compose, etc.
  - a default shared file system can be provided, while allowing the user to specify something else if desired
  - a default method of launching the workers and master (in kubernetes, gce, aws, etc.) can be provided

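As a concrete example of the "write to temporary files and then rename" item above, here is a
minimal sketch. The checkpoint format is an arbitrary choice; any serialized policy blob would work.

```python
import os
import tempfile

def write_policy_atomically(policy_bytes, target_path):
    """Write a new policy so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(policy_bytes)
            f.flush()
            os.fsync(f.fileno())
        # Rename is atomic on the same file system: readers see either the old
        # policy or the complete new one, never a partial write.
        os.replace(tmp_path, target_path)
    except BaseException:
        os.remove(tmp_path)
        raise
```
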
#### Pros

- very simple to implement, infrastructure already available in ai-lab-kubernetes
- fast enough for proof of concept and iteration of interface design
- rollout memories are durable and can be easily reused in later off policy training
- if designed properly, there is a clear path towards:
  - decreasing latency using an in-memory store (Option 1a/b)
  - increasing rollout memory size using distributed sampling methods (Option 1c)

#### Cons

- file system interface incurs additional latency. Rollout memories must be written to disk, and later read from disk, instead of going directly from memory to memory.
- will require modifying the standard control flow. There will be an impact on algorithms which expect particular training regimens, specifically algorithms which are sensitive to the number of update steps between target/online network updates.
- will not be particularly efficient for strictly on policy algorithms, where each rollout must use the most recent policy available

### Option 1a - master polls (redis) list

- instead of using a file system as in Option 1, redis lists can be used
- policy is stored as a single key/value pair (locking no longer necessary)
- rollout memory communication (see the sketch after this list):
  - workers: redis list push
  - master: redis list len, redis list range
- note: many databases are interchangeable with the redis protocol: google memorystore, aws elasticache, etc.
- note: many databases can implement this interface with minimal glue: SQL, any object store, etc.

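A rough sketch of this variant using the redis-py client; the key name and pickle serialization
are arbitrary choices for illustration.

```python
import pickle
import redis

r = redis.Redis(host='localhost', port=6379)

def push_rollout(rollout):
    """Worker side: append a finished rollout to the shared list."""
    r.rpush('rollouts', pickle.dumps(rollout))

def read_new_rollouts():
    """Master side: consume whatever rollouts arrived since the last poll."""
    count = r.llen('rollouts')
    if count == 0:
        return []
    raw = r.lrange('rollouts', 0, count - 1)
    r.ltrim('rollouts', count, -1)  # drop the entries we just read
    return [pickle.loads(item) for item in raw]
```
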
#### Pros

- lower latency than disk since it is all in memory
- clear path toward scaling to a large number of workers
- no concern about reading partially written rollouts
- no synchronization or additional threads necessary, though an additional thread would be helpful for concurrent reads from redis and training
- will be slightly more efficient in the case of strictly on policy algorithms

#### Cons

- more complex to set up, especially if you are concerned about rollout memory durability

### Option 1b - master subscribes to (redis) pub sub

- instead of using a file system as in Option 1, redis pub/sub can be used
- policy is stored as a single key/value pair (locking no longer necessary)
- rollout memory communication (see the sketch after this list):
  - workers: redis publish
  - master: redis subscribe
- no synchronization necessary, however an additional thread would be necessary?
  - it looks like the python client might handle this already, would need further investigation
- note: many possible pub/sub systems could be used, with different characteristics in specific contexts: kafka, google pub/sub, aws kinesis, etc.

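A similar sketch for the pub/sub variant, again using redis-py. The channel name and serialization
are arbitrary, and the polling loop could also run in a dedicated thread, as noted above.

```python
import pickle
import redis

r = redis.Redis(host='localhost', port=6379)

def publish_rollout(rollout):
    """Worker side: publish a finished rollout."""
    r.publish('rollouts', pickle.dumps(rollout))

# Master side: subscribe once, then poll for new messages between training steps.
pubsub = r.pubsub(ignore_subscribe_messages=True)
pubsub.subscribe('rollouts')

def poll_rollouts():
    rollouts = []
    message = pubsub.get_message()
    while message is not None:
        rollouts.append(pickle.loads(message['data']))
        message = pubsub.get_message()
    return rollouts
```
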
#### Pros

- lower latency than disk since it is all in memory
- clear path toward scaling to a large number of workers
- no concern about reading partially written rollouts
- will be slightly more efficient in the case of strictly on policy algorithms

#### Cons

- more complex to set up than a shared file system
- on its own, does not persist worker rollouts for future off policy training

### Option 1c - distributed rollout memory sampling

- if rollout memories do not fit in the memory of a single machine, a distributed storage and sampling method would be necessary
- for example:
  - rollout memory store: redis set add
  - rollout memory sample: redis set randmember

#### Pros

- capable of taking advantage of a rollout memory larger than the available memory of a single machine
- reduces resource constraints on the training machine

#### Cons

- distributed versions of each memory type/sampling method need to be custom built
- off-the-shelf implementations may not be available for complex memory types/sampling methods

### Option 2 - master listens to workers

- rollout memories:
  - workers send memories directly to the master via: mpi, 0mq, etc.
  - a master policy thread listens for new memories and stores them in shared memory
- policy sync communication method:
  - the master policy occasionally sends policies directly to workers via: mpi, 0mq, etc.
  - master and workers must synchronize so that all workers are listening when the master is ready to send a new policy

#### Pros

- lower latency than Option 1 (for a small number of workers)
- will potentially be the optimal choice in the case of strictly on policy algorithms with a relatively small number of worker nodes (small enough that more complex communication topologies - rings, p2p, etc. - are not yet necessary)

#### Cons

- much less robust and more difficult to debug, requiring lots of synchronization
- much more difficult to be resilient to worker failure
- more custom communication/synchronization code
- as the number of workers scales up, a larger and larger fraction of time will be spent waiting and synchronizing

### Option 3 - Ray

#### Pros

- Ray would allow us to easily convert our current algorithms to distributed versions, with minimal change to our code.

#### Cons

- performance from naïve/simple use would be very similar to Option 2
- nontrivial to replace with a higher performance system if desired; additional performance will require significant code changes

## On Policy Algorithms

TODO

56  docs/_sources/design/network.rst.txt  Normal file
@@ -0,0 +1,56 @@
Network Design
==============

Each agent has at least one neural network, used as the function approximator for choosing actions.
The network is designed in a modular way to allow reuse in different agents.
It is separated into three main parts:

* **Input Embedders** - This is the first stage of the network, meant to convert the input into a feature vector representation.
  It is possible to combine several instances of any of the supported embedders, in order to allow varied combinations of inputs.

  There are two main types of input embedders:

  1. Image embedder - Convolutional neural network.
  2. Vector embedder - Multi-layer perceptron.

* **Middlewares** - The middleware gets the output of the input embedders, and processes it into a different representation domain,
  before sending it through the output head. The goal of the middleware is to combine the outputs of
  several input embedders and pass them through some extra processing.
  This might include, for instance, an LSTM or just a plain FC layer.

* **Output Heads** - The output head is used in order to predict the values required from the network.
  These might include action-values, state-values or a policy. As with the input embedders,
  it is possible to use several output heads in the same network. For example, the *Actor Critic* agent combines two
  heads - a policy head and a state-value head.
  In addition, each output head defines the loss function according to the head type.

.. image:: /_static/img/network.png
   :width: 400px
   :align: center

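To make the three-part structure concrete, here is a schematic sketch of the
embedder -> middleware -> heads composition. The classes below are toy numpy stand-ins for
illustration only, not Coach's actual network implementation.

.. code-block:: python

   import numpy as np

   class VectorEmbedder:
       """Toy stand-in for an MLP input embedder: one linear layer + ReLU."""
       def __init__(self, in_dim, out_dim):
           self.w = np.random.randn(in_dim, out_dim) * 0.01

       def __call__(self, x):
           return np.maximum(x @ self.w, 0)

   class Middleware:
       """Processes the concatenated embedder outputs (here a single FC layer)."""
       def __init__(self, in_dim, out_dim):
           self.w = np.random.randn(in_dim, out_dim) * 0.01

       def __call__(self, x):
           return np.maximum(x @ self.w, 0)

   class QHead:
       """Output head predicting one value per action."""
       def __init__(self, in_dim, num_actions):
           self.w = np.random.randn(in_dim, num_actions) * 0.01

       def __call__(self, x):
           return x @ self.w

   # Two input embedders (e.g. one per observation type), one middleware, one head.
   obs_embedder = VectorEmbedder(in_dim=8, out_dim=32)
   goal_embedder = VectorEmbedder(in_dim=4, out_dim=32)
   middleware = Middleware(in_dim=64, out_dim=64)
   q_head = QHead(in_dim=64, num_actions=3)

   def forward(observation, goal):
       embedding = np.concatenate([obs_embedder(observation), goal_embedder(goal)])
       return q_head(middleware(embedding))

   q_values = forward(np.random.randn(8), np.random.randn(4))  # shape: (3,)
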
Keeping Network Copies in Sync
------------------------------

Most of the reinforcement learning agents include more than one copy of the neural network.
These copies serve as counterparts of the main network which are updated at different rates,
and are often synchronized either locally or between parallel workers. For easier synchronization of those copies,
a wrapper around them exposes a simplified API, which hides these complexities from the agent.
In this wrapper, 3 types of networks can be defined:

* **online network** - A mandatory network which is the main network the agent will use.

* **global network** - An optional network which is shared between workers in single-node multi-process distributed learning.
  It is updated by all the workers directly, and holds the most up-to-date weights.

* **target network** - An optional network which is local to each worker. It can be used in order to keep a copy of
  the weights stable for a long period of time. This is used by different agents, like DQN for example, in order to
  have stable targets for the online network while training it.

.. image:: /_static/img/distributed.png
   :width: 600px
   :align: center

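A minimal sketch of this copy mechanism is shown below, with plain numpy arrays standing in for
real network weights. Coach's actual wrapper also covers the global network and distributed
workers, which are omitted here.

.. code-block:: python

   import numpy as np

   class NetworkCopy:
       """Toy stand-in for one copy of a network's weights."""
       def __init__(self, weights):
           self.weights = [w.copy() for w in weights]

       def set_weights(self, new_weights, rate=1.0):
           # rate=1.0 is a hard copy; rate<1.0 gives a soft (Polyak-averaged) update.
           self.weights = [rate * new + (1.0 - rate) * old
                           for new, old in zip(new_weights, self.weights)]

   initial = [np.zeros((4, 4)), np.zeros(4)]
   online_network = NetworkCopy(initial)
   target_network = NetworkCopy(initial)

   # ... training updates modify online_network.weights ...

   target_network.set_weights(online_network.weights)         # hard sync (e.g. DQN)
   target_network.set_weights(online_network.weights, 0.001)  # soft sync (Polyak averaging)
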