mirror of https://github.com/gryf/coach.git synced 2025-12-18 03:30:19 +01:00
Commit Graph

418 Commits

Author SHA1 Message Date
anabwan
881f78f45a tests: new checkpoint mxnet test + fix utils (#273)
* tests: new mxnet test + fix utils

new test added:
- test_restore_checkpoint[tensorflow, mxnet]

fix failed tests in CI
improve utils

* tests: fix comments for mxnet checkpoint test and utils
2019-04-07 07:36:44 +03:00
anabwan
e1e335a4ef disabled Starcraft from nightly (#286)
* disabled Starcraft from nightly

* tests: added comments
2019-04-04 22:26:25 +03:00
Zach Dwiel
2291cee2c6 allow serializing from/to arrays/str from GlobalVariableSaver (#285) 2019-04-04 11:09:19 -04:00
anabwan
cdb8d9e518 tests: fix multi environment variables in configci (#284)
* tests: fix multi environment variables in configci

- fix multi environment variables in configci
- removing bitflip from mujoco tests
- add bitflip to gym

* tests: disable mujoco_a3c_lstm + fix timeout and fix docker
2019-04-04 16:11:41 +03:00
Scott Leishman
f173e69187 introduce dockerfiles. (#169)
* introduce dockerfiles.

* ensure golden tests are run not just collected.

* Skip CI download of dockerfiles.

* add StarCraft environment and tests.

* add minimaps starcraft validation parameters.

* Add functional test running (from Ayoob)

* pin mujoco_py version to a 1.5 compatible release.

* fix config syntax issue.

* pin remaining mujoco_py install calls.

* Relax pin of gym version in gym Dockerfile.

* update makefile based on functional test filtering.
2019-04-03 19:33:17 +03:00
shadiendrawis
0b808f0794 remove -ept flag (#283) 2019-04-03 16:32:24 +03:00
shadiendrawis
a543f10c1a fix Intel tensorflow installation issue (#281)
* fix intel tensorflow installation issue

* update version
2019-04-03 13:03:30 +03:00
anabwan
5d4b9c7399 added functional environments to CircleCI (#268)
added functional environments to CircleCI
2019-03-28 15:45:19 -07:00
anabwan
869bd421a3 tests: added new checkpoint and functional tests (#265)
* added new tests
- test_preset_n_and_ew
- test_preset_n_and_ew_and_onnx

* code utils improvements (all utils)
* improve checkpoint_test
* new functionality for functional_test markers and presets lists
* removed special environment container
* add xfail to certain tests
2019-03-28 13:57:31 -07:00
Gal Leibovich
310d31c227 integration test changes to reach the train part (#254)
* integration test changes to override heatup to 1000 steps +  run each preset for 30 sec (to make sure we reach the train part)

* fixes to failing presets uncovered with this change + changes in the golden testing to properly test BatchRL

* fix for rainbow dqn

* fix to gym_environment (due to a change in Gym 0.12.1) + fix for rainbow DQN + some bug-fix in utils.squeeze_list

* fix for NEC agent
2019-03-27 21:14:19 +02:00
Gal Leibovich
6e08c55ad5 Enabling-more-agents-for-Batch-RL-and-cleanup (#258)
* allowing for the last training batch drawn to be smaller than batch_size
* adding support for more agents in BatchRL by adding softmax with temperature to the corresponding heads
* adding a CartPole_QR_DQN preset with a golden test
* cleanups
2019-03-21 16:10:29 +02:00
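Note on 6e08c55ad5: the message mentions adding softmax with temperature to the heads. The snippet below is only a generic, standalone sketch of that operation, not the code from this commit; the function name `softmax_with_temperature` and the sample values are assumptions made for illustration.

```python
import math


def softmax_with_temperature(values, temperature=1.0):
    """Turn raw scores (e.g. Q-values) into a probability distribution.

    Hypothetical helper, not taken from the repository. A low temperature
    sharpens the distribution towards the largest score; a high temperature
    flattens it towards uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in values]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


q_values = [1.0, 2.0, 3.0]  # made-up scores for three actions
sharp = softmax_with_temperature(q_values, temperature=0.1)
flat = softmax_with_temperature(q_values, temperature=10.0)
```

With these inputs, `sharp` puts almost all probability on the last action, while `flat` stays close to uniform.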
Gal Leibovich
abec59f367 fixes to rainbow dqn + a cartpole based golden test (#253) 2019-03-21 12:57:56 +02:00
anabwan
83741fa92a tests: added function tests to nightly CircleCI (#252) 2019-03-20 15:39:22 -07:00
Gal Leibovich
e3c7e526c7 Batch RL (#238) 2019-03-19 18:07:09 +02:00
anabwan
4a8451ff02 tests: added new tests + utils code improved (#221)
* tests: added new tests + utils code improved

* new tests:
- test_preset_args_combination
- test_preset_mxnet_framework

* added more flags to test_preset_args
* added validation for flags in utils

* tests: added new tests + fixed utils

* tests: added new checkpoint test

* tests: added checkpoint test improve utils

* tests: added tests + improve validations

* bump integration CI run timeout.

* tests: improve timerun + add functional test marker
2019-03-18 11:21:43 +02:00
Gal Leibovich
d6158a5cfc restoring from a checkpoint file (#247) 2019-03-17 16:28:09 +02:00
shadiendrawis
f03bd7ad93 benchmark update (#250) 2019-03-17 15:33:28 +02:00
Nikhil Barhate
537b549e1d fixed broken url in README (#246) 2019-03-13 22:38:33 -07:00
Scott Leishman
9c449507e0 update CARLA install docs to note python client. (#234) 2019-03-13 22:21:44 -07:00
Gal Leibovich
8be9ea5dc9 Update setup.py (#245) 2019-03-12 11:08:10 +02:00
Gal Leibovich
c02333b1ba fix dashboard to allow connections from a remote machine. (#231) 2019-03-10 13:15:14 +02:00
Gal Leibovich
9a895a1ac7 bug-fix for l2_regularization not in use (#230)
* bug-fix for l2_regularization not in use
* removing not in use TF REGULARIZATION_LOSSES collection
2019-03-03 15:11:06 +02:00
Gal Novik
10220be9be Adding support for evaluation only mode with predefined number of steps (#225) 2019-03-03 10:03:45 +02:00
Ajay Deshpande
2c1a9dbf20 Adding framework for multinode tests (#149)
* Currently runs CartPole_ClippedPPO and Mujoco_ClippedPPO with inverted_pendulum level.
2019-02-26 13:53:12 -08:00
shadiendrawis
b461a1b8ab readme fix (#228) 2019-02-24 13:46:21 +02:00
shadiendrawis
2b5d1dabe6 ACER algorithm (#184)
* initial ACER commit

* Code cleanup + several fixes

* Q-retrace bug fix + small clean-ups

* added documentation for acer

* ACER benchmarks

* update benchmarks table

* Add nightly running of golden and trace tests. (#202)

Resolves #200

* comment out nightly trace tests until values reset.

* remove redundant observe ignore (#168)

* ensure nightly test env containers exist. (#205)

Also bump integration test timeout

* wxPython removal (#207)

Replacing wxPython with Python's Tkinter.
Also removing the option to choose multiple files as it is unused and causes errors, and fixing the load file/directory spinner.

* Create CONTRIBUTING.md (#210)

* Create CONTRIBUTING.md.  Resolves #188

* run nightly golden tests sequentially. (#217)

Should reduce resource requirements and potential CPU contention but increases
overall execution time.

* tests: added new setup configuration + test args (#211)

- added utils for future tests and conftest
- added test args

* new docs build

* golden test update
2019-02-20 23:52:34 +02:00
anabwan
7253f511ed tests: added new setup configuration + test args (#211)
- added utils for future tests and conftest
- added test args
2019-02-13 07:43:59 -05:00
Scott Leishman
9d0fed84a3 run nightly golden tests sequentially. (#217)
Should reduce resource requirements and potential CPU contention but increases
overall execution time.
2019-02-04 17:18:35 +02:00
Gal Novik
b4fd1b3c93 Create CONTRIBUTING.md (#210)
* Create CONTRIBUTING.md.  Resolves #188
2019-01-29 13:47:22 -08:00
Gal Novik
135f02fb46 wxPython removal (#207)
Replacing wxPython with Python's Tkinter.
Also removing the option to choose multiple files as it is unused and causes errors, and fixing the load file/directory spinner.
2019-01-23 20:49:37 +02:00
Scott Leishman
516547e3df ensure nightly test env containers exist. (#205)
Also bump integration test timeout
2019-01-18 13:43:42 -08:00
Cody Hsieh
bf0a65eefd remove redundant observe ignore (#168) 2019-01-17 14:08:05 -08:00
Scott Leishman
a048024bf5 Add nightly running of golden and trace tests. (#202)
Resolves #200

* comment out nightly trace tests until values reset.
2019-01-17 11:52:50 -08:00
Zach Dwiel
8672f8b542 Fix golden tests (#199)
* remove unused functions utils.read_json and utils.write_json
* increase verbosity of golden tests; detect errors in golden tests
2019-01-16 17:38:11 -08:00
Zach Dwiel
fedb4cbd7c Cleanup and refactoring (#171) 2019-01-15 10:04:53 +02:00
Zach Dwiel
cd812b0d25 more clear names for methods of Space (#181)
* rename Space.val_matches_space_definition -> contains; Space.is_point_in_space_shape -> valid_index
* rename valid_index -> is_valid_index
2019-01-14 15:02:53 -05:00
Zach Dwiel
0ccc333d77 raise value error if there is an invalid action space (#179) 2019-01-13 11:06:48 +02:00
Scott Leishman
053adf0ca9 prevent long job CI timeouts owing to lack of EKS token refresh (#183)
* add additional info during exception of eks runs.

* ensure we refresh k8s config after long calls.

Kubernetes client on EKS has a 10 minute token time to live, so will
result in unauthorized errors if tokens are not refreshed on long jobs.
2019-01-09 15:12:00 -08:00
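Note on 053adf0ca9: the message says tokens with a 10 minute time to live must be refreshed on long jobs. The snippet below is a minimal sketch of that refresh-before-expiry pattern, assuming a caller-supplied loader. `RefreshingCredentials`, `load_token`, and the 60 second safety margin are hypothetical names and values, not the repository's code and not the Kubernetes client API.

```python
import time


class RefreshingCredentials:
    """Cache a short-lived token and reload it shortly before it expires.

    Hypothetical helper: `load_token` stands in for whatever call re-reads
    the cluster configuration in a real setup.
    """

    def __init__(self, load_token, ttl_seconds=600, margin_seconds=60,
                 clock=time.monotonic):
        self._load_token = load_token
        self._max_age = ttl_seconds - margin_seconds
        self._clock = clock
        self._token = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        expired = (self._token is None
                   or now - self._loaded_at >= self._max_age)
        if expired:
            self._token = self._load_token()
            self._loaded_at = now
        return self._token


# Demonstration with a fake clock so the example runs instantly.
fake_now = [0.0]
load_calls = []


def fake_loader():
    load_calls.append(fake_now[0])
    return f"token-{len(load_calls)}"


creds = RefreshingCredentials(fake_loader, clock=lambda: fake_now[0])
first = creds.get()    # first use loads a token
fake_now[0] = 300.0
second = creds.get()   # 5 minutes in: cached token is reused
fake_now[0] = 900.0
third = creds.get()    # 15 minutes in: token is reloaded
```

Calling `get()` before every API request after a long-running call keeps the refresh decision in one place.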
Gal Novik
0fa9d8e602 Update README.md (#182) 2019-01-08 13:48:17 +02:00
Ajay Deshpande
8a1ea3d915 Merge pull request #161 from x77a1/master
Changes to avoid memory leak in Rollout worker
2019-01-03 21:15:04 -08:00
Gourav Roy
b1e9ea48d8 Refactored GlobalVariableSaver 2019-01-03 15:08:34 -08:00
Gourav Roy
619ea0944e Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
Tensorflow graph. This happens when we assign new value (op node) to
RefVariable in GlobalVariableSaver. With every restore the size of TF
graph increases as new nodes are created and old unused nodes are not
removed from the graph. This causes the memory leak in
restore_checkpoint codepath.

FIX: We use TF placeholder to update the variables which avoids the
memory leak.
2019-01-02 23:09:09 -08:00
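Note on 619ea0944e: the ISSUE/FIX text describes a graph that grows on every restore, and a fix that reuses placeholder-fed assignments. The snippet below is a toy model of that difference in plain Python, not TensorFlow and not the GlobalVariableSaver code. `ToyGraph`, `LeakySaver`, and `PlaceholderSaver` are hypothetical stand-ins that only count nodes.

```python
class ToyGraph:
    """Stand-in for a static computation graph: nodes are only ever added."""

    def __init__(self):
        self.nodes = []

    def add_node(self, name):
        self.nodes.append(name)
        return len(self.nodes) - 1


class LeakySaver:
    """Adds a fresh assign node per variable on every restore."""

    def __init__(self, graph, variables):
        self.graph = graph
        self.variables = variables

    def restore(self, new_values):
        for name, value in new_values.items():
            self.graph.add_node(f"assign_const/{name}")  # graph grows here
            self.variables[name] = value


class PlaceholderSaver:
    """Builds one placeholder and one assign node per variable, once."""

    def __init__(self, graph, variables):
        self.graph = graph
        self.variables = variables
        self.assign_ops = {}
        for name in variables:
            graph.add_node(f"placeholder/{name}")
            self.assign_ops[name] = graph.add_node(f"assign/{name}")

    def restore(self, new_values):
        for name, value in new_values.items():
            _ = self.assign_ops[name]  # reuse the existing node, feed a value
            self.variables[name] = value


def nodes_after_restores(saver_cls, restores):
    graph = ToyGraph()
    saver = saver_cls(graph, {"w": 0.0, "b": 0.0})
    for step in range(restores):
        saver.restore({"w": float(step), "b": float(step)})
    return len(graph.nodes)


leaky_nodes = nodes_after_restores(LeakySaver, 100)        # grows with restores
fixed_nodes = nodes_after_restores(PlaceholderSaver, 100)  # stays constant
```

In the toy model the leaky variant ends with a node count proportional to the number of restores, while the placeholder variant keeps the count it had after construction.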
Gourav Roy
c377363e50 Revert "Changes to avoid memory leak in rollout worker"
This reverts commit 801aed5e10.
2019-01-02 23:09:09 -08:00
Gourav Roy
779d3694b4 Revert "comment out the part of test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop"
This reverts commit b8d21c73bf.
2019-01-02 23:09:09 -08:00
Gourav Roy
6dd7ae2343 Revert "Avoid Memory Leak in Rollout worker"
This reverts commit c694766fad.
2019-01-02 23:09:09 -08:00
Gourav Roy
2461892c9e Revert "Updated comments"
This reverts commit 740f7937cd.
2019-01-02 23:09:09 -08:00
Gourav Roy
740f7937cd Updated comments 2018-12-25 21:52:07 -08:00
x77a1
73c4c850a5 Merge branch 'master' into master 2018-12-25 21:05:41 -08:00
Gourav Roy
c694766fad Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
Tensorflow graph. This happens when we assign new value (op node) to
RefVariable in GlobalVariableSaver. With every restore the size of TF
graph increases as new nodes are created and old unused nodes are not
removed from the graph. This causes the memory leak in
restore_checkpoint codepath.

FIX: We reset the Tensorflow graph and recreate the Global, Online and
Target networks on every restore. This ensures that the old unused nodes
in the TF graph are dropped.
2018-12-25 21:04:21 -08:00
Gal Novik
56735624ca Merge pull request #160 from NervanaSystems/tf_version_bump
Bump intel optimized tensorflow to 1.12.0
2018-12-25 10:51:58 +02:00