mirror of https://github.com/gryf/coach.git synced 2026-04-20 06:33:31 +02:00
Commit Graph

427 Commits

Author SHA1 Message Date
Gourav Roy 619ea0944e Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node)
to a RefVariable in GlobalVariableSaver. With every restore the size of
the TF graph increases, as new nodes are created and old unused nodes
are not removed from the graph. This causes a memory leak in the
restore_checkpoint code path.

FIX: We use a TF placeholder to update the variables, which avoids the
memory leak.
2019-01-02 23:09:09 -08:00
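A minimal mock of the mechanism this commit describes (hypothetical `Graph` and `Variable` stand-ins, not Coach's or TensorFlow's actual classes): assigning a constant on every restore bakes the value into a fresh op node, while a placeholder-style assign op is built once and only fed new values afterwards.

```python
# Hypothetical stand-ins for TF graph machinery, to show why repeated
# constant assigns grow the graph while a placeholder-fed assign does not.

class Graph:
    def __init__(self):
        self.ops = []  # every created op node lives in the graph forever

class Variable:
    def __init__(self, graph, value=0):
        self.graph = graph
        self.value = value

    def assign(self, value):
        # Like assigning a constant to a RefVariable: each call bakes the
        # value into a NEW op node, and old nodes are never removed.
        self.graph.ops.append(('assign_const', value))
        self.value = value

    def make_placeholder_assign(self):
        # Like var.assign(placeholder): ONE op node, created once; later
        # restores only feed new values through it.
        self.graph.ops.append('assign_from_placeholder')
        def run(feed_value):
            self.value = feed_value
        return run

# Leaky pattern: one new graph node per restore.
g1 = Graph()
v1 = Variable(g1)
for step in range(100):
    v1.assign(step)
print(len(g1.ops))  # 100 nodes after 100 restores

# Fixed pattern: graph size stays constant across restores.
g2 = Graph()
v2 = Variable(g2)
assign_op = v2.make_placeholder_assign()
for step in range(100):
    assign_op(step)
print(len(g2.ops))  # 1 node, regardless of restore count
```

Both variables end up holding the same final value; only the graph growth differs.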
Gourav Roy c377363e50 Revert "Changes to avoid memory leak in rollout worker"
This reverts commit 801aed5e10.
2019-01-02 23:09:09 -08:00
Gourav Roy 779d3694b4 Revert "comment out the part of test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop"
This reverts commit b8d21c73bf.
2019-01-02 23:09:09 -08:00
Gourav Roy 6dd7ae2343 Revert "Avoid Memory Leak in Rollout worker"
This reverts commit c694766fad.
2019-01-02 23:09:09 -08:00
Gourav Roy 2461892c9e Revert "Updated comments"
This reverts commit 740f7937cd.
2019-01-02 23:09:09 -08:00
Gourav Roy 740f7937cd Updated comments 2018-12-25 21:52:07 -08:00
x77a1 73c4c850a5 Merge branch 'master' into master 2018-12-25 21:05:41 -08:00
Gourav Roy c694766fad Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node)
to a RefVariable in GlobalVariableSaver. With every restore the size of
the TF graph increases, as new nodes are created and old unused nodes
are not removed from the graph. This causes a memory leak in the
restore_checkpoint code path.

FIX: We reset the TensorFlow graph and recreate the Global, Online and
Target networks on every restore. This ensures that the old unused
nodes in the TF graph are dropped.
2018-12-25 21:04:21 -08:00
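The "reset and rebuild" strategy in this earlier fix can be sketched with a small mock (hypothetical names, not Coach's actual API): each restore first drops every node in the graph, then recreates the networks, so the graph size is bounded no matter how many restores run.

```python
# Mock of the reset-and-rebuild restore strategy described above.

class Graph:
    def __init__(self):
        self.ops = []

def build_networks(graph):
    # Stand-in for recreating the Global, Online and Target networks:
    # building them adds a fixed number of op nodes to the graph.
    graph.ops.extend(['global_net', 'online_net', 'target_net'])

def restore_with_reset(graph):
    # Analogous to resetting the default graph: drop every old node...
    graph.ops.clear()
    # ...then recreate the networks and load weights into the fresh graph.
    build_networks(graph)

g = Graph()
build_networks(g)
for _ in range(100):
    restore_with_reset(g)
print(len(g.ops))  # stays at 3, however many restores run
```

The trade-off versus the later placeholder fix is that rebuilding all three networks on every restore is much more expensive than feeding new values through a single pre-built assign op.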
Gal Novik 56735624ca Merge pull request #160 from NervanaSystems/tf_version_bump
Bump intel optimized tensorflow to 1.12.0
2018-12-25 10:51:58 +02:00
Gal Novik 85fae0f626 Merge branch 'master' into tf_version_bump 2018-12-24 15:50:55 +02:00
Gal Novik d7c138342b Merge pull request #170 from NervanaSystems/ci_badge
add CI status badge.
2018-12-24 14:39:38 +02:00
Scott Leishman 0823d30839 Merge branch 'master' into tf_version_bump 2018-12-21 10:58:41 -05:00
Scott Leishman 7cda5179c6 add CI status badge. 2018-12-21 10:50:28 -05:00
Zach Dwiel 8e3ee818f8 update circle ci config to match new golden test presets (#167) 2018-12-21 10:10:31 -05:00
x77a1 02f2db1264 Merge branch 'master' into master 2018-12-17 12:44:27 -08:00
Gal Leibovich 4c914c057c fix for finding the right filter checkpoint to restore + do not update internal filter state when evaluating + fix SharedRunningStats checkpoint filenames (#147) 2018-12-17 21:36:27 +02:00
Neta Zmora b4bc8a476c Bug fix: when enabling 'heatup_using_network_decisions', we should add the configured noise (#162)
During heatup we may want to add agent-generated noise (i.e., not "simple" random noise).
This is enabled by setting 'heatup_using_network_decisions' to True.  For example:
	agent_params = DDPGAgentParameters()
	agent_params.algorithm.heatup_using_network_decisions = True

The fix ensures that the correct noise is added not just while in the TRAINING phase, but
also during the HEATUP phase.

No one has enabled 'heatup_using_network_decisions' yet, which explains why this problem
arose only now (in my configuration I do enable 'heatup_using_network_decisions').
2018-12-17 10:08:54 +02:00
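The behavior this fix describes can be sketched as a phase check (hypothetical names and values; Coach's real agent classes and phases differ): before the fix, exploration noise was added only in the TRAINING phase, so heatup with network decisions produced un-noised actions.

```python
import random
from enum import Enum

class RunPhase(Enum):
    HEATUP = 0
    TRAIN = 1
    TEST = 2

def choose_action(phase, heatup_using_network_decisions,
                  network_action=1.0, noise=0.5):
    # Plain heatup: ignore the network and act randomly.
    if phase == RunPhase.HEATUP and not heatup_using_network_decisions:
        return random.uniform(-1, 1)
    # The fix: add the configured noise during TRAIN *and* during a
    # heatup that is driven by network decisions (previously TRAIN only).
    if phase in (RunPhase.TRAIN, RunPhase.HEATUP):
        return network_action + noise
    return network_action  # evaluation: no noise

print(choose_action(RunPhase.HEATUP, heatup_using_network_decisions=True))
# network action plus the configured noise, even during heatup
```

With `heatup_using_network_decisions=False` the heatup branch still returns a plain random action, so existing presets are unaffected.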
gouravr b8d21c73bf comment out the part of test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that run in infinite loop 2018-12-16 10:56:40 -08:00
x77a1 1f0980c448 Merge branch 'master' into master 2018-12-16 09:37:00 -08:00
Gal Leibovich f9ee526536 Fix for issue #128 - circular DQN import (#130) 2018-12-16 16:06:44 +02:00
gouravr 801aed5e10 Changes to avoid memory leak in rollout worker
Currently in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore_checkpoint function calls the checkpoint saver. The checkpoint saver uses GlobalVariablesSaver, which does not release the references to the previous model's variables. This leads to a situation where memory keeps growing until the rollout worker crashes.

This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.

Also added a test to easily reproduce the issue using the CartPole example. We were also seeing this issue with the AWS DeepRacer implementation, and the current change avoids the memory leak there as well.
2018-12-15 12:26:31 -08:00
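A mock of the leak path in the rollout worker (hypothetical classes, not Coach's real ones): every restore_checkpoint call builds a saver that pins the freshly loaded model's variables, so earlier models are never freed; skipping the saver on this path leaves only the current model referenced.

```python
# Simulate reference retention across repeated checkpoint restores.

class GlobalVariablesSaver:
    def __init__(self, variables):
        self.variables = variables  # strong references, never released

class Worker:
    def __init__(self, use_saver):
        self.use_saver = use_saver
        self.savers = []  # accumulated savers, one per restore
        self.model = None

    def restore_checkpoint(self, step):
        new_model = ['weights_v%d' % step]  # stand-in for loaded variables
        if self.use_saver:
            # Leaky path: each new saver keeps its model alive forever.
            self.savers.append(GlobalVariablesSaver(new_model))
        self.model = new_model  # only the latest model is actually needed

leaky = Worker(use_saver=True)
fixed = Worker(use_saver=False)
for step in range(50):
    leaky.restore_checkpoint(step)
    fixed.restore_checkpoint(step)

print(len(leaky.savers))  # 50 pinned models: memory grows per restore
print(len(fixed.savers))  # 0: old models become garbage-collectable
```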
Scott Leishman aa1dfd7599 Bump intel optimized tensorflow to 1.12.0 2018-12-14 10:15:19 -05:00
zach dwiel e08accdc22 allow case insensitive selected level name matching 2018-12-11 12:35:30 -05:00
Zach Dwiel d0248e03c6 add meaningful error message in the event that the action space is not one that can be used (#151) 2018-12-11 09:09:24 +02:00
Gal Leibovich f12857a8c7 Docs changes - fixing blogpost links, removing importing all exploration policies (#139)
* updated docs

* removing imports for all exploration policies in __init__ + setting the right blog-post link

* small cleanups
2018-12-05 16:16:16 -05:00
Sina Afrooze 155b78b995 Fix warning on import TF or MxNet, when only one of the frameworks is installed (#140) 2018-12-05 11:52:24 +02:00
Ryan Peach 9e66bb653e Enable creating custom tensorflow heads, embedders, and middleware. (#135)
Allowing components to have a path property.
2018-12-05 11:40:06 +02:00
Ryan Peach 3c58ed740b 'CompositeAgent' object has no attribute 'handle_episode_ended' (#136) 2018-12-05 11:28:16 +02:00
Ryan Peach 436b16016e Added num_transitions to Memory interface (#137) 2018-12-05 10:33:25 +02:00
Gal Leibovich 3e281b467b Update docs_raw README.md (#138)
* Update README.md
2018-12-03 05:39:17 -08:00
Ryan Peach 28e5b8b612 Minor bugfix on RewardFilter in Readme (#133) 2018-11-30 16:02:08 -08:00
Scott Leishman 3e67eac9e6 Merge pull request #131 from ryanpeach/patch-2
NoOutputFilter isn't set in tutorial.
2018-11-30 15:55:34 -08:00
Ryan Peach f678ae7cb8 NoOutputFilter isn't set in tutorial. 2018-11-29 17:50:50 -05:00
Ajay Deshpande 0dd39b20ca Removing badge 2018-11-28 09:59:08 -08:00
Ajay Deshpande 15fabf6ec3 Removing badge 2018-11-28 09:19:32 -08:00
Gal Novik 533bb43720 Merge pull request #125 from NervanaSystems/0.11.0-release
0.11.0 release
2018-11-28 01:16:01 +02:00
Ajay Deshpande e877920dd5 Merge pull request #126 from NervanaSystems/ci_updates
CI related updates
2018-11-27 14:58:26 -08:00
Scott Leishman 3601d9bc45 CI related updates 2018-11-27 21:53:46 +00:00
Gal Novik 4e0d018d5f updated algorithms image in README 2018-11-27 23:12:13 +02:00
Gal Novik fc6604c09c added missing license headers 2018-11-27 22:43:40 +02:00
Gal Novik 1e618647ab adding .nojekyll file for github pages to function properly 2018-11-27 22:35:16 +02:00
Gal Novik 7e3aca22eb Documentation fix 2018-11-27 22:32:46 +02:00
Gal Novik 05c1005e94 Updated README and added .nojekyll file for github pages to work properly 2018-11-27 22:11:28 +02:00
Balaji Subramaniam d06197f663 Add documentation on distributed Coach. (#158)
* Added documentation on distributed Coach.
2018-11-27 12:26:15 +02:00
Scott Leishman e3ecf445e2 ensure we pull from main coach container layers as cache. (#106) 2018-11-26 17:09:02 -08:00
Gal Leibovich 5674749ed5 workaround for resolving the issue of restoring a multi-node training checkpoint to single worker (#156) 2018-11-26 00:08:43 +02:00
Gal Leibovich ab10852ad9 hacky way to resolve the checkpointing issue (#154) 2018-11-25 16:14:15 +02:00
Gal Leibovich 11170d5ba3 fix dist. tf (#153) 2018-11-25 14:02:24 +02:00
Sina Afrooze 19a68812f6 Added ONNX compatible broadcast_like function (#152)
- Also simplified the hybrid_clip implementation.
2018-11-25 11:23:18 +02:00
Balaji Subramaniam 8df425b6e1 Update how save checkpoint secs arg is handled in distributed Coach. (#151) 2018-11-25 00:05:24 -08:00