* tests: new mxnet test + fix utils
new test added:
- test_restore_checkpoint[tensorflow, mxnet]
fix failing tests in CI
improve utils
* tests: fix comments for mxnet checkpoint test and utils
* introduce dockerfiles.
* ensure golden tests are run not just collected.
* Skip CI download of dockerfiles.
* add StarCraft environment and tests.
* add minimaps starcraft validation parameters.
* Add functional test running (from Ayoob)
* pin mujoco_py version to a 1.5 compatible release.
* fix config syntax issue.
* pin remaining mujoco_py install calls.
* Relax pin of gym version in gym Dockerfile.
* update makefile based on functional test filtering.
* integration test changes: override heatup to 1000 steps and run each preset for 30 seconds (to make sure we reach the training part); see the sketch below
* fixes to failing presets uncovered by this change + changes to the golden tests to properly test BatchRL
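As an illustration of the heatup override mentioned above, here is a minimal sketch in Coach's preset style. The exact values and where the override is applied in the test harness are assumptions; only the schedule-parameter names follow Coach's preset conventions.

```python
from rl_coach.core_types import EnvironmentSteps, EnvironmentEpisodes, TrainingSteps
from rl_coach.graph_managers.graph_manager import ScheduleParameters

# Illustrative schedule override: a short, fixed heatup so an integration
# run is guaranteed to reach the training phase within its time budget.
schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(10000000)
schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(10)
schedule_params.evaluation_steps = EnvironmentEpisodes(1)
schedule_params.heatup_steps = EnvironmentSteps(1000)
```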
* fix for Rainbow DQN
* fix to gym_environment (due to a change in Gym 0.12.1) + fix for Rainbow DQN + a bug fix in utils.squeeze_list
* fix for NEC agent
Allows the last training batch drawn to be smaller than batch_size, adds support for more agents in BatchRL by adding softmax with temperature to the corresponding heads (sketched below), adds a CartPole_QR_DQN preset with a golden test, plus cleanups.
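For reference, a minimal NumPy sketch of softmax with temperature; this is illustrative and not the actual head implementation:

```python
import numpy as np

# Softmax with temperature: temperature > 1 flattens the distribution,
# temperature -> 0 approaches a greedy argmax over the Q-values.
def softmax_with_temperature(q_values, temperature=1.0):
    z = np.asarray(q_values, dtype=np.float64) / temperature
    z -= z.max()  # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```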
* initial ACER commit
* Code cleanup + several fixes
* Q-retrace bug fix + small clean-ups
* added documentation for acer
* ACER benchmarks
* update benchmarks table
* Add nightly running of golden and trace tests. (#202)
Resolves #200
* comment out nightly trace tests until values reset.
* remove redundant observe ignore (#168)
* ensure nightly test env containers exist. (#205)
Also bump integration test timeout
* wxPython removal (#207)
Replacing wxPython with Python's Tkinter.
Also removing the option to choose multiple files as it is unused and causes errors, and fixing the load file/directory spinner.
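A rough sketch of the Tkinter-based single-file picker this implies; the dialog title and surrounding code are assumptions, not the actual dashboard implementation:

```python
import tkinter as tk
from tkinter import filedialog

# Hide the empty root window and open a single-file picker, mirroring the
# removal of multi-file selection.
root = tk.Tk()
root.withdraw()
path = filedialog.askopenfilename(title="Select an experiment file")
root.destroy()
```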
* Create CONTRIBUTING.md (#210)
* Create CONTRIBUTING.md. Resolves #188
* run nightly golden tests sequentially. (#217)
Should reduce resource requirements and potential CPU contention, at the cost of
longer overall execution time.
* tests: added new setup configuration + test args (#211)
- added utils for future tests and conftest
- added test args
* new docs build
* golden test update
* add additional info when EKS runs raise exceptions.
* ensure we refresh k8s config after long calls.
The Kubernetes client on EKS has a 10-minute token time to live, so long jobs will
hit unauthorized errors if tokens are not refreshed.
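A hedged sketch of the refresh idea, assuming the official kubernetes Python client; the helper name and TTL bookkeeping are illustrative, not the actual implementation:

```python
import time
from kubernetes import config

TOKEN_TTL_SECONDS = 10 * 60  # EKS tokens live roughly 10 minutes
_last_load = 0.0

def refresh_kube_config_if_stale():
    """Reload kube credentials before they expire on long-running jobs."""
    global _last_load
    if time.time() - _last_load > 0.8 * TOKEN_TTL_SECONDS:
        config.load_kube_config()  # re-reads the (possibly refreshed) token
        _last_load = time.time()
```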
ISSUE: When we restore checkpoints, we create new nodes in the TensorFlow graph.
This happens when we assign a new value (an op node) to a RefVariable in
GlobalVariableSaver. With every restore the size of the TF graph increases,
because new nodes are created and old unused nodes are not removed from the
graph. This causes a memory leak in the restore_checkpoint codepath.
FIX: We use a TF placeholder to update the variables, which avoids the
memory leak.
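A minimal TF 1.x sketch of the difference; the variable and function names are illustrative, not the actual GlobalVariableSaver code:

```python
import tensorflow as tf

var = tf.Variable(tf.zeros([3]), name="weights")
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Leaky pattern: each restore builds a new assign op (plus a new constant),
# so the graph grows on every call:
#   sess.run(tf.assign(var, restored_value))

# Fix: build one placeholder + assign op up front and reuse them on every restore.
value_ph = tf.placeholder(var.dtype.base_dtype, shape=var.get_shape())
assign_op = var.assign(value_ph)

def restore(restored_value):
    sess.run(assign_op, feed_dict={value_ph: restored_value})
```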
ISSUE: Same as above: restoring a checkpoint assigns new values (op nodes) to
RefVariables in GlobalVariableSaver, so the TF graph grows with every restore,
old unused nodes are never removed, and memory leaks in the restore_checkpoint
codepath.
FIX: We reset the TensorFlow graph and recreate the Global, Online and
Target networks on every restore. This ensures that the old unused nodes
in the TF graph are dropped.
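A hedged sketch of this approach using standard TF 1.x calls; `build_networks` is a hypothetical caller-supplied function standing in for the Global/Online/Target network creation, not the Coach API:

```python
import tensorflow as tf

def restore_checkpoint(checkpoint_path, build_networks):
    """Rebuild the graph from scratch, then restore weights from a checkpoint."""
    tf.reset_default_graph()   # drop nodes accumulated by earlier restores
    build_networks()           # recreate all variables in the fresh graph
    sess = tf.Session()
    saver = tf.train.Saver()   # maps the new variables to checkpoint tensors
    saver.restore(sess, checkpoint_path)
    return sess
```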
During heatup we may want to add agent-generated noise (i.e. not "simple" random noise).
This is enabled by setting 'heatup_using_network_decisions' to True. For example:
agent_params = DDPGAgentParameters()
agent_params.algorithm.heatup_using_network_decisions = True
The fix ensures that the correct noise is added not just while in the TRAINING phase, but
also during the HEATUP phase.
No one had enabled 'heatup_using_network_decisions' until now, which explains why this
problem only surfaced now (my configuration does enable 'heatup_using_network_decisions').
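A hypothetical sketch of the phase check the fix implies; this is not the actual Coach exploration code, and the function name and parameters are assumptions (RunPhase is Coach's phase enum):

```python
from rl_coach.core_types import RunPhase

def use_agent_generated_noise(phase, algorithm_params):
    # Add the agent's own exploration noise during TRAIN, and also during
    # HEATUP when heatup_using_network_decisions is enabled.
    if phase == RunPhase.TRAIN:
        return True
    return phase == RunPhase.HEATUP and algorithm_params.heatup_using_network_decisions
```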
Currently, in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore_checkpoint function calls the checkpoint saver, which uses GlobalVariablesSaver; GlobalVariablesSaver does not release references to the previous model's variables, so memory keeps growing until the rollout worker crashes.
This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.
Also added a test to easily reproduce the issue using the CartPole example. We were also seeing this issue in the AWS DeepRacer implementation, and the current change avoids the memory leak there as well.
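A hedged sketch of the kind of reproduction test meant here, using graph growth as a proxy for the leak; the helper name and `restore_fn` argument are illustrative, not the actual test:

```python
import tensorflow as tf

def assert_restore_does_not_grow_graph(restore_fn, iterations=5):
    """Fail if repeated restores keep adding nodes to the default graph."""
    restore_fn()  # the first restore may legitimately add ops
    baseline = len(tf.get_default_graph().get_operations())
    for _ in range(iterations):
        restore_fn()
    assert len(tf.get_default_graph().get_operations()) == baseline
```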