mirror of https://github.com/gryf/coach.git synced 2025-12-18 03:30:19 +01:00

69 Commits

Author SHA1 Message Date
Gal Novik
2697142d5a Release 1.0.0 (#382)
* Updating README
* Shortening test cycles
2019-07-24 16:10:58 +03:00
Gal Novik
b82414138d Workaround the OSError due to bad address failure on the CI runs (#370)
workaround the OSError due to bad address failure on the CI runs
2019-07-07 17:11:19 +03:00
Gal Leibovich
587b74e04a Remove double call to reset_internal_state() on gym environments (#364) 2019-07-02 13:43:23 +03:00
anabwan
a576ab5659 tests: Removed mxnet from functional tests + minor fix on rewards (#362)
* ci: change workflow

* changed timeout

* fix function reach reward

* print logs

* removing mxnet

* res'
2019-06-27 18:52:29 +03:00
Gal Leibovich
d6795bd524 batchnorm fixes + disabling batchnorm in DDPG (#353)
Co-authored-by: James Casbon <casbon+gh@gmail.com>
2019-06-23 11:28:22 +03:00
anabwan
7b5d6a3f03 tests: stabling functional tests (#355)
* tests: stabling functional tests

* functional removed
2019-06-20 15:30:47 +03:00
Timo Kaufmann
8df3c46756 Do not hardcode path to bash (#332) 2019-06-10 20:10:28 +03:00
anabwan
0aa5359d63 tests: added assert for cp param and changing test args order (#342) 2019-06-05 00:16:50 +03:00
anabwan
f5ba14575c tests: print logs on failure + fix -cp param (#327)
* tests: print logs on failure

* fix import

* added job to circleci

* fix functional

* removed debug job
2019-05-28 13:45:43 +03:00
Gal Leibovich
251dc9ccc0 Preset dependent number of csv read attempts in golden testing (#334) 2019-05-28 12:19:57 +03:00
Gal Leibovich
9e9c4fd332 Create a dataset using an agent (#306)
Generate a dataset using an agent (allowing to select between this and a random dataset)
2019-05-28 09:34:49 +03:00
anabwan
3b6e413532 tests: fix traces and changing workflow jobs (#316)
* tests: fix traces export presets

* tests: increase time for traces

* tests

* remove approval

* fix approval

* fix ap

* change workflow jobs

* fix path

* fix repo path

* change run traces

* adding assert

* fix assert
2019-05-26 15:27:36 +03:00
Gal Leibovich
30c2b2fc45 moving to skimage.transform.resize (#321) 2019-05-23 13:38:01 +03:00
anabwan
ffb55b4142 tests: update traces (#302)
* Traces folder removed from repo and moved to S3
* Traces jobs and update will use directly the S3 files
2019-05-07 10:04:05 +03:00
anabwan
740359587d tests: fixed nightly (#301)
* tests: fixed nightly

* tests: temp testing functional tests

* tests: temp testing functional tests

* tests: add seed to -cp

* test: last fix
2019-05-05 08:28:57 +03:00
anabwan
b3db9ce77d tests: fixed failed tests - stabling CI (#298)
* tests: stabling CI

* tests: fix failed tests - stabling CI

* fix get csv files.
  - fixed seed test
* fix clres on conftest - now can modify paths during test run.
  - this fixed the mxnet checkpoint test

* tests: fix comments
2019-04-23 15:12:11 +03:00
anabwan
20a8dea0dd tests: minor fix for functional tests (#289)
* tests: minor fix for functional tests

* tests: fix value
2019-04-15 12:28:23 +03:00
zach dwiel
2cb078b4c2 add __truediv__, __rtruediv__ and __eq__ to StepMethod 2019-04-09 12:14:27 -04:00
anabwan
881f78f45a tests: new checkpoint mxnet test + fix utils (#273)
* tests: new mxnet test + fix utils

new test added:
- test_restore_checkpoint[tensorflow, mxnet]

fix failed tests in CI
improve utils

* tests: fix comments for mxnet checkpoint test and utils
2019-04-07 07:36:44 +03:00
Zach Dwiel
2291cee2c6 allow serializing from/to arrays/str from GlobalVariableSaver (#285) 2019-04-04 11:09:19 -04:00
shadiendrawis
0b808f0794 remove -ept flag (#283) 2019-04-03 16:32:24 +03:00
anabwan
869bd421a3 tests: added new checkpoint and functional tests (#265)
* added new tests
- test_preset_n_and_ew
- test_preset_n_and_ew_and_onnx

* code utils improvements (all utils)
* improve checkpoint_test
* new functionality for functional_test markers and presets lists
* removed special environment container
* add xfail to certain tests
2019-03-28 13:57:31 -07:00
Gal Leibovich
310d31c227 integration test changes to reach the train part (#254)
* integration test changes to override heatup to 1000 steps + run each preset for 30 sec (to make sure we reach the train part)

* fixes to failing presets uncovered with this change + changes in the golden testing to properly test BatchRL

* fix for rainbow dqn

* fix to gym_environment (due to a change in Gym 0.12.1) + fix for rainbow DQN + some bug-fix in utils.squeeze_list

* fix for NEC agent
2019-03-27 21:14:19 +02:00
anabwan
4a8451ff02 tests: added new tests + utils code improved (#221)
* tests: added new tests + utils code improved

* new tests:
- test_preset_args_combination
- test_preset_mxnet_framework

* added more flags to test_preset_args
* added validation for flags in utils

* tests: added new tests + fixed utils

* tests: added new checkpoint test

* tests: added checkpoint test improve utils

* tests: added tests + improve validations

* bump integration CI run timeout.

* tests: improve timerun + add functional test marker
2019-03-18 11:21:43 +02:00
Gal Leibovich
d6158a5cfc restoring from a checkpoint file (#247) 2019-03-17 16:28:09 +02:00
Ajay Deshpande
2c1a9dbf20 Adding framework for multinode tests (#149)
* Currently runs CartPole_ClippedPPO and Mujoco_ClippedPPO with inverted_pendulum level.
2019-02-26 13:53:12 -08:00
shadiendrawis
2b5d1dabe6 ACER algorithm (#184)
* initial ACER commit

* Code cleanup + several fixes

* Q-retrace bug fix + small clean-ups

* added documentation for acer

* ACER benchmarks

* update benchmarks table

* Add nightly running of golden and trace tests. (#202)

Resolves #200

* comment out nightly trace tests until values reset.

* remove redundant observe ignore (#168)

* ensure nightly test env containers exist. (#205)

Also bump integration test timeout

* wxPython removal (#207)

Replacing wxPython with Python's Tkinter.
Also removing the option to choose multiple files as it is unused and causes errors, and fixing the load file/directory spinner.

* Create CONTRIBUTING.md (#210)

* Create CONTRIBUTING.md.  Resolves #188

* run nightly golden tests sequentially. (#217)

Should reduce resource requirements and potential CPU contention but increases
overall execution time.

* tests: added new setup configuration + test args (#211)

- added utils for future tests and conftest
- added test args

* new docs build

* golden test update
2019-02-20 23:52:34 +02:00
anabwan
7253f511ed tests: added new setup configuration + test args (#211)
- added utils for future tests and conftest
- added test args
2019-02-13 07:43:59 -05:00
Zach Dwiel
8672f8b542 Fix golden tests (#199)
* remove unused functions utils.read_json and utils.write_json
* increase verbosity of golden tests; detect errors in golden tests
2019-01-16 17:38:11 -08:00
Zach Dwiel
cd812b0d25 more clear names for methods of Space (#181)
* rename Space.val_matches_space_definition -> contains; Space.is_point_in_space_shape -> valid_index
* rename valid_index -> is_valid_index
2019-01-14 15:02:53 -05:00
Scott Leishman
053adf0ca9 prevent long job CI timeouts owing to lack of EKS token refresh (#183)
* add additional info during exception of eks runs.

* ensure we refresh k8s config after long calls.

Kubernetes client on EKS has a 10 minute token time to live, so will
result in unauthorized errors if tokens are not refreshed on long jobs.
2019-01-09 15:12:00 -08:00
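The commit above notes that a token with a 10 minute time to live must be refreshed during long jobs. A minimal sketch of that idea follows. It is not the repository's code: `RefreshingConfig`, `load_config`, `safety_margin` and the injected `clock` are hypothetical names introduced only for illustration, and the reload callable stands in for whatever re-reads the cluster credentials.

```python
import time
from typing import Callable, Optional

# The 10 minute time to live quoted in the commit message.
TOKEN_TTL_SECONDS = 10 * 60


class RefreshingConfig:
    """Hypothetical helper: reloads credentials before they can expire."""

    def __init__(self,
                 load_config: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic,
                 safety_margin: float = 60.0) -> None:
        self._load_config = load_config  # assumed: re-reads credentials when called
        self._clock = clock              # injectable so the logic can be tested
        self._max_age = TOKEN_TTL_SECONDS - safety_margin
        self._loaded_at: Optional[float] = None

    def ensure_fresh(self) -> bool:
        """Reload if never loaded or too old. Returns True when a reload happened."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._load_config()
            self._loaded_at = now
            return True
        return False
```

Calling `ensure_fresh()` before each API request that follows a long-running step keeps the age of the credentials below the time to live, at the cost of an occasional extra reload.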
Gourav Roy
619ea0944e Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (op node) to
a RefVariable in GlobalVariableSaver. With every restore, the size of the TF
graph increases as new nodes are created and old unused nodes are not
removed from the graph. This causes the memory leak in the
restore_checkpoint codepath.

FIX: We use a TF placeholder to update the variables, which avoids the
memory leak.
2019-01-02 23:09:09 -08:00
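The difference between the leak and the fix described in the commit message above can be illustrated with a toy model. This is a sketch in plain Python, not TensorFlow and not the code from the commit: `ToyGraph`, `leaky_restore` and `PlaceholderRestorer` are hypothetical names, and the only property modelled is that graph nodes are appended and never removed.

```python
class ToyGraph:
    """Toy stand-in for an append-only computation graph (not TensorFlow)."""

    def __init__(self):
        self.nodes = []      # nodes are only ever appended, never removed
        self.variables = {}  # variable name -> current value

    def add_node(self, kind: str) -> int:
        self.nodes.append(kind)
        return len(self.nodes) - 1  # id of the new node


def leaky_restore(graph: ToyGraph, name: str, value) -> None:
    """Builds a new constant and a new assign node on every call."""
    graph.add_node("const")
    graph.add_node("assign")
    graph.variables[name] = value


class PlaceholderRestorer:
    """Builds one placeholder and one assign node up front, then reuses them."""

    def __init__(self, graph: ToyGraph, name: str) -> None:
        self.graph = graph
        self.name = name
        graph.add_node("placeholder")
        graph.add_node("assign")

    def restore(self, value) -> None:
        # Feeding a value through the existing placeholder adds no nodes.
        self.graph.variables[self.name] = value


leaky = ToyGraph()
for step in range(100):
    leaky_restore(leaky, "w", step)

fixed = ToyGraph()
restorer = PlaceholderRestorer(fixed, "w")
for step in range(100):
    restorer.restore(step)

print(len(leaky.nodes), len(fixed.nodes))  # prints: 200 2
```

In the toy model the first approach grows by two nodes per restore, while the second stays at two nodes regardless of how many times the values are restored.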
Gourav Roy
c377363e50 Revert "Changes to avoid memory leak in rollout worker"
This reverts commit 801aed5e10.
2019-01-02 23:09:09 -08:00
Gourav Roy
779d3694b4 Revert "comment out the part of test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop"
This reverts commit b8d21c73bf.
2019-01-02 23:09:09 -08:00
Gourav Roy
6dd7ae2343 Revert "Avoid Memory Leak in Rollout worker"
This reverts commit c694766fad.
2019-01-02 23:09:09 -08:00
Gourav Roy
c694766fad Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (op node) to
a RefVariable in GlobalVariableSaver. With every restore, the size of the TF
graph increases as new nodes are created and old unused nodes are not
removed from the graph. This causes the memory leak in the
restore_checkpoint codepath.

FIX: We reset the TensorFlow graph and recreate the Global, Online and
Target networks on every restore. This ensures that the old unused nodes
in the TF graph are dropped.
2018-12-25 21:04:21 -08:00
gouravr
b8d21c73bf comment out the part of test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop 2018-12-16 10:56:40 -08:00
gouravr
801aed5e10 Changes to avoid memory leak in rollout worker
Currently, in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore checkpoint function calls checkpoint_saver. The checkpoint saver uses GlobalVariableSaver, which does not release the references to the previous model's variables. As a result, memory keeps growing until the rollout worker crashes.

This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.

Also added a test to easily reproduce the issue using the CartPole example. We were also seeing this issue with the AWS DeepRacer implementation, and the current implementation avoids the memory leak there as well.
2018-12-15 12:26:31 -08:00
Sina Afrooze
19a68812f6 Added ONNX compatible broadcast_like function (#152)
- Also simplified the hybrid_clip implementation.
2018-11-25 11:23:18 +02:00
Sina Afrooze
5332013bd1 Implement frame-work agnostic rollout and training workers (#137)
* Added checkpoint state file to coach checkpointing.

* Removed TF specific code from rollout_worker, training_worker, and s3_data_store
2018-11-23 18:05:44 -08:00
Sina Afrooze
16cdd9a9c1 Tf checkpointing using saver mechanism (#134) 2018-11-22 14:08:10 +02:00
Sina Afrooze
67eb9e4c28 Adding checkpointing framework (#74)
* Adding checkpointing framework as well as mxnet checkpointing implementation.

- MXNet checkpoint for each network is saved in a separate file.

* Adding checkpoint restore for mxnet to graph-manager

* Add unit-test for get_checkpoint_state()

* Added match.group() to fix unit-test failing on CI

* Added ONNX export support for MXNet
2018-11-19 19:45:49 +02:00
Thom Lane
7ba1a4393f Channel order transpose, for image embedder. Updated unit test. (#87) 2018-11-19 15:39:03 +02:00
Thom Lane
81bac050d7 Added Custom Initialisation for MXNet Heads (#86)
* Added NormalizedRSSInitializer, using the same method as the TensorFlow backend, but changed the name since ‘columns’ has a different meaning in a dense layer weight matrix in MXNet.

* Added unit test for NormalizedRSSInitializer.
2018-11-16 08:15:43 -08:00
Scott Leishman
524f8436a2 create per environment Dockerfiles. (#70)
* create per environment Dockerfiles.

Adjust CI setup to better parallelize runs.
Fix a couple of issues in golden and trace tests.
Update a few of the docs.

* bugfix in mmc agent.

Also install kubectl for CI, update badge branch.

* remove integration test parallelism.
2018-11-14 07:40:22 -08:00
Gal Leibovich
49dea39d34 N-step returns for rainbow (#67)
* n_step returns for rainbow
* Rename CartPole_PPO -> CartPole_ClippedPPO
2018-11-07 18:33:08 +02:00
Sina Afrooze
5fadb9c18e Adding mxnet components to rl_coach/architectures (#60)
Adding mxnet components to rl_coach architectures.

- Supports PPO and DQN
- Tested with CartPole_PPO and CartPole_DQN
- Normalizing filters don't work right now (see #49) and are disabled in CartPole_PPO preset
- Checkpointing is disabled for MXNet
2018-11-07 17:07:15 +02:00
Sina Afrooze
95b4fc6888 Added ability to switch between tensorflow and mxnet using -f commandline argument. (#48)
NOTE: tensorflow framework works fine if mxnet is not installed in env, but mxnet will not work if tensorflow is not installed because of the code in network_wrapper.
2018-10-30 15:29:34 -07:00
Ajay Deshpande
16b3e99f37 Setup basic CI flow (#38)
Adds automated running of unit, integration tests (and optionally longer running tests)
2018-10-24 18:27:58 -07:00
zach dwiel
430ca198e5 convert golden tests into pytest format 2018-10-23 19:58:17 -04:00