* tests: stabilizing CI
* tests: fix failing tests - stabilizing CI
* fix getting CSV files.
- fixed seed test
* fix clres in conftest - paths can now be modified during the test run.
- this fixed the mxnet checkpoint test
* tests: fix comments
* tests: new mxnet test + fix utils
new test added:
- test_restore_checkpoint[tensorflow, mxnet]
fix failing tests in CI
improve utils
* tests: fix comments for mxnet checkpoint test and utils
* integration test changes to override heatup to 1000 steps + run each preset for 30 sec (to make sure we reach the training phase)
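A minimal sketch of the kind of heatup override described above, assuming rl_coach's standard preset schedule types; the exact test harness may differ:

```python
# a minimal sketch, assuming rl_coach's standard preset schedule types; the
# integration tests described above override heatup roughly like this
from rl_coach.core_types import EnvironmentSteps
from rl_coach.graph_managers.graph_manager import ScheduleParameters

schedule_params = ScheduleParameters()
schedule_params.heatup_steps = EnvironmentSteps(1000)  # long enough to reach the training phase
```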
* fixes to failing presets uncovered with this change + changes in the golden testing to properly test BatchRL
* fix for Rainbow DQN
* fix to gym_environment (due to a change in Gym 0.12.1) + fix for Rainbow DQN + a bug fix in utils.squeeze_list
* fix for NEC agent
* initial ACER commit
* Code cleanup + several fixes
* Q-retrace bug fix + small clean-ups
* added documentation for ACER
* ACER benchmarks
* update benchmarks table
* Add nightly running of golden and trace tests. (#202)
Resolves #200
* comment out nightly trace tests until values are reset.
* remove redundant observe ignore (#168)
* ensure nightly test env containers exist. (#205)
Also bump integration test timeout
* wxPython removal (#207)
Replacing wxPython with Python's Tkinter.
Also removing the option to choose multiple files as it is unused and causes errors, and fixing the load file/directory spinner.
* Create CONTRIBUTING.md (#210)
* Create CONTRIBUTING.md. Resolves #188
* run nightly golden tests sequentially. (#217)
Should reduce resource requirements and potential CPU contention but increases
overall execution time.
* tests: added new setup configuration + test args (#211)
- added utils for future tests and conftest
- added test args
* new docs build
* golden test update
* add additional info when EKS runs raise exceptions.
* ensure we refresh k8s config after long calls.
The Kubernetes client on EKS has a 10-minute token time-to-live, so long
jobs will hit unauthorized errors if tokens are not refreshed.
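A minimal sketch of the refresh pattern, assuming the official kubernetes Python client; the helper name is hypothetical and this is not coach's actual orchestrator code:

```python
# hypothetical helper: rebuild the Kubernetes API client before the cached
# EKS token expires (the 10-minute TTL comes from the note above)
import time
from kubernetes import client, config

TOKEN_TTL_SECONDS = 10 * 60

class RefreshingCoreV1(object):
    def __init__(self):
        self._api = None
        self._loaded_at = 0.0

    def get(self):
        # reload kubeconfig if the token may be close to expiry
        if self._api is None or time.time() - self._loaded_at > TOKEN_TTL_SECONDS - 60:
            config.load_kube_config()      # re-runs the auth plugin, producing a fresh token
            self._api = client.CoreV1Api()
            self._loaded_at = time.time()
        return self._api

# usage: call RefreshingCoreV1().get() before each long-running API call
```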
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (op node) to a
RefVariable in GlobalVariableSaver. With every restore, the size of the TF
graph increases, as new nodes are created and old unused nodes are not
removed from the graph. This causes a memory leak in the
restore_checkpoint codepath.
FIX: We use a TF placeholder to update the variables, which avoids the
memory leak.
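A minimal TF 1.x sketch of the placeholder-feed pattern the fix describes (not the actual GlobalVariableSaver code): the assign op is built once, so repeated restores feed new values without adding graph nodes:

```python
import tensorflow as tf

# build the assign op once, fed through a placeholder, instead of creating a
# new assign node on every restore (which is what grows the graph)
var = tf.get_variable("weights", shape=(3,), initializer=tf.zeros_initializer())
new_value = tf.placeholder(dtype=var.dtype.base_dtype, shape=var.shape)
assign_op = var.assign(new_value)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for restored_values in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]):
        # graph size stays constant across restores
        sess.run(assign_op, feed_dict={new_value: restored_values})
```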
ISSUE: Same memory leak in the restore_checkpoint codepath as described
above: every restore adds new nodes to the TensorFlow graph without
removing the old, unused ones.
FIX: We reset the TensorFlow graph and recreate the Global, Online and
Target networks on every restore. This ensures that the old unused nodes
in the TF graph are dropped.
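A minimal sketch of the graph-reset approach, with a hypothetical build_networks callback standing in for recreating the Global/Online/Target networks:

```python
import tensorflow as tf

def restore_into_fresh_graph(checkpoint_prefix, build_networks):
    """Hypothetical sketch of the graph-reset approach described above:
    drop all existing nodes, rebuild the networks, then restore weights."""
    tf.reset_default_graph()
    build_networks()                 # recreate the global / online / target networks
    saver = tf.train.Saver()
    sess = tf.Session()
    saver.restore(sess, checkpoint_prefix)
    return sess
```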
Currently in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore_checkpoint function calls the checkpoint saver. The checkpoint saver uses GlobalVariablesSaver, which does not release the references to the previous model's variables. This leads to memory growing steadily until the rollout worker crashes.
This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.
Also added a test to easily reproduce the issue using the CartPole example. We were also seeing this issue with the AWS DeepRacer implementation, and the current implementation avoids the memory leak there as well.
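A rough sketch of what such a reproduction test can check, using a hypothetical restore_fn rather than coach's actual rollout-worker code:

```python
import tensorflow as tf

def assert_restore_does_not_grow_graph(restore_fn, checkpoint_prefix, num_restores=5):
    """Hypothetical sketch of the regression test described above: restore
    repeatedly and check that the default graph stops growing."""
    restore_fn(checkpoint_prefix)                                    # warm-up restore
    baseline = len(tf.get_default_graph().get_operations())
    for _ in range(num_restores):
        restore_fn(checkpoint_prefix)
    assert len(tf.get_default_graph().get_operations()) == baseline
```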
* Adding a checkpointing framework as well as an MXNet checkpointing implementation.
- The MXNet checkpoint for each network is saved in a separate file.
* Adding checkpoint restore for MXNet to the graph manager
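A minimal sketch of the one-file-per-network scheme, using Gluon's save_parameters/load_parameters; the helper names and file naming are assumptions, not coach's actual implementation:

```python
import os

def save_mxnet_checkpoint(networks, checkpoint_dir, checkpoint_id):
    """Hypothetical sketch: one parameter file per network, as described above;
    `networks` is assumed to map names to Gluon blocks."""
    for name, net in networks.items():
        net.save_parameters(os.path.join(checkpoint_dir, "{}_{}.params".format(checkpoint_id, name)))

def restore_mxnet_checkpoint(networks, checkpoint_dir, checkpoint_id):
    for name, net in networks.items():
        net.load_parameters(os.path.join(checkpoint_dir, "{}_{}.params".format(checkpoint_id, name)))
```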
* Add unit-test for get_checkpoint_state()
* Added match.group() to fix a unit test failing on CI
* Added ONNX export support for MXNet
* Added NormalizedRSSInitializer, using the same method as the TensorFlow backend, but changed the name since ‘columns’ have a different meaning in the dense-layer weight matrix in MXNet.
* Added unit test for NormalizedRSSInitializer.
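A rough sketch of a row-normalizing MXNet initializer along those lines; the class below is an assumption for illustration, not coach's actual NormalizedRSSInitializer:

```python
import mxnet as mx

class NormalizedRSSInitializerSketch(mx.initializer.Initializer):
    """Hypothetical sketch only: scale each output row of a dense weight matrix
    to a fixed root-sum-of-squares, mirroring the TensorFlow backend's behavior."""
    def __init__(self, rss=1.0):
        super(NormalizedRSSInitializerSketch, self).__init__(rss=rss)
        self.rss = rss

    def _init_weight(self, name, arr):
        # sample standard normal weights, then rescale each row so its
        # root-sum-of-squares equals self.rss
        data = mx.nd.random.normal(shape=arr.shape)
        rows = data.reshape((data.shape[0], -1))
        norms = mx.nd.sqrt((rows ** 2).sum(axis=1, keepdims=True))
        arr[:] = (rows / norms * self.rss).reshape(arr.shape)
```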
* create per-environment Dockerfiles.
Adjust CI setup to better parallelize runs.
Fix a couple of issues in golden and trace tests.
Update a few of the docs.
* bug fix in the MMC agent.
Also install kubectl for CI, update badge branch.
* remove integration test parallelism.
* Adding MXNet components to rl_coach architectures.
- Supports PPO and DQN
- Tested with CartPole_PPO and CartPole_DQN
- Normalizing filters don't work right now (see #49) and are disabled in CartPole_PPO preset
- Checkpointing is disabled for MXNet
NOTE: the TensorFlow framework works fine if MXNet is not installed in the environment, but MXNet will not work if TensorFlow is not installed, because of the code in network_wrapper.
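To illustrate the asymmetry the NOTE describes, a generic sketch of an unconditional versus guarded framework import (not the actual network_wrapper code):

```python
# generic sketch only: an unconditional `import tensorflow` makes TensorFlow a
# hard dependency, while a guarded import lets a missing backend fail only when
# that backend is actually selected
try:
    import tensorflow  # noqa: F401
    TENSORFLOW_AVAILABLE = True
except ImportError:
    TENSORFLOW_AVAILABLE = False

try:
    import mxnet  # noqa: F401
    MXNET_AVAILABLE = True
except ImportError:
    MXNET_AVAILABLE = False
```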