mirror of https://github.com/gryf/coach.git synced 2025-12-28 17:32:27 +01:00
Commit Graph

11 Commits

Author SHA1 Message Date
Gal Leibovich
d6158a5cfc restoring from a checkpoint file (#247) 2019-03-17 16:28:09 +02:00
Gourav Roy
619ea0944e Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node)
to a RefVariable in GlobalVariableSaver. With every restore, the TF
graph grows, because new nodes are created and the old, unused nodes are
never removed. This causes a memory leak in the restore_checkpoint code
path.

FIX: We use a TF placeholder to update the variables, which avoids the
memory leak.
2019-01-02 23:09:09 -08:00
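The placeholder technique this fix describes can be sketched in a few lines of TF1-style code. This is a minimal, illustrative sketch, not Coach's actual GlobalVariableSaver internals; the variable name and values are made up:

```python
import numpy as np
import tensorflow as tf

# Build one placeholder and one assign op per variable, exactly once.
var = tf.Variable(np.zeros(4, dtype=np.float32), name='weights')
placeholder = tf.placeholder(tf.float32, shape=var.shape)
assign_op = var.assign(placeholder)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Every "restore" feeds new values through the same pre-built op,
    # so the graph does not grow; calling var.assign(new_value) each
    # time would instead add fresh nodes on every restore.
    for checkpoint_values in (np.ones(4), np.full(4, 2.0)):
        sess.run(assign_op, feed_dict={placeholder: checkpoint_values})
```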
Gourav Roy
c377363e50 Revert "Changes to avoid memory leak in rollout worker"
This reverts commit 801aed5e10.
2019-01-02 23:09:09 -08:00
Gourav Roy
779d3694b4 Revert "comment out the part of the test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that runs in an infinite loop"
This reverts commit b8d21c73bf.
2019-01-02 23:09:09 -08:00
Gourav Roy
6dd7ae2343 Revert "Avoid Memory Leak in Rollout worker"
This reverts commit c694766fad.
2019-01-02 23:09:09 -08:00
Gourav Roy
c694766fad Avoid Memory Leak in Rollout worker
ISSUE: When we restore checkpoints, we create new nodes in the
TensorFlow graph. This happens when we assign a new value (an op node)
to a RefVariable in GlobalVariableSaver. With every restore, the TF
graph grows, because new nodes are created and the old, unused nodes are
never removed. This causes a memory leak in the restore_checkpoint code
path.

FIX: We reset the TensorFlow graph and recreate the Global, Online, and
Target networks on every restore. This ensures that the old, unused
nodes in the TF graph are dropped.
2018-12-25 21:04:21 -08:00
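The reset-and-rebuild approach this commit describes looks roughly like the following TF1-style sketch. The function and the build_networks_fn callback are hypothetical names used only to illustrate the idea:

```python
import tensorflow as tf

def restore_with_fresh_graph(checkpoint_path, build_networks_fn):
    """Reset the default graph so stale nodes from earlier restores are
    dropped, rebuild the networks, then load weights from the checkpoint."""
    tf.reset_default_graph()
    build_networks_fn()           # recreate the Global/Online/Target networks
    saver = tf.train.Saver()      # covers all freshly (re)created variables
    sess = tf.Session()
    saver.restore(sess, checkpoint_path)
    return sess
```

The trade-off versus the placeholder fix above is that rebuilding the whole graph on every restore is heavier, which is presumably why this approach was later reverted in favor of placeholders.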
gouravr
b8d21c73bf comment out the part of the test in 'test_basic_rl_graph_manager_with_cartpole_dqn_and_repeated_checkpoint_restore' that runs in an infinite loop 2018-12-16 10:56:40 -08:00
gouravr
801aed5e10 Changes to avoid memory leak in rollout worker
Currently in the rollout worker, we call restore_checkpoint repeatedly to load the latest model into memory. The restore_checkpoint function calls the checkpoint saver. The checkpoint saver uses GlobalVariablesSaver, which does not release references to the previous model's variables. This leads to a situation where memory keeps growing until the rollout worker crashes.

This change avoids using the checkpoint saver in the rollout worker, as I believe it is not needed in this code path.

Also added a test to easily reproduce the issue using the CartPole example. We were also seeing this issue with the AWS DeepRacer implementation, and the current change avoids the memory leak there as well.
2018-12-15 12:26:31 -08:00
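The unbounded growth described here is easy to reproduce with a tiny TF1-style loop, unrelated to Coach's code and purely illustrative: every call that assigns a plain Python value builds fresh nodes in the default graph, so the op count (and memory) climbs with each simulated "restore".

```python
import tensorflow as tf

tf.reset_default_graph()
var = tf.Variable(0.0)
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Each iteration builds a brand-new constant and assign op, so the
# graph grows on every simulated restore instead of staying fixed.
for step in range(3):
    sess.run(tf.assign(var, float(step)))
    print(len(tf.get_default_graph().get_operations()))  # keeps increasing
```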
Sina Afrooze
95b4fc6888 Added the ability to switch between tensorflow and mxnet using the -f command-line argument. (#48)
NOTE: the tensorflow framework works fine if mxnet is not installed in the env, but mxnet will not work if tensorflow is not installed, because of the code in network_wrapper.
2018-10-30 15:29:34 -07:00
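A flag like the one this commit describes can be sketched with argparse. This is a hypothetical recreation for illustration; Coach's real CLI parser and option names may differ:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--framework',
                    choices=['tensorflow', 'mxnet'],
                    default='tensorflow',
                    help='deep learning framework to run with')

# Simulate invoking the CLI with "-f mxnet".
args = parser.parse_args(['-f', 'mxnet'])
print(args.framework)  # -> mxnet
```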
Zach Dwiel
517aac163a introduce graph_manager.phase_context; make sure that calls to graph_manager.train automatically set training phase 2018-10-23 16:57:43 -04:00
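The idea of a phase context that train() enters automatically can be sketched with a context manager. This stand-in class is illustrative only, not Coach's actual GraphManager:

```python
from contextlib import contextmanager

class GraphManager:
    """Illustrative stand-in for the phase-switching idea."""
    def __init__(self):
        self.phase = 'undefined'

    @contextmanager
    def phase_context(self, phase):
        # Swap in the new phase, and always restore the old one on exit.
        previous, self.phase = self.phase, phase
        try:
            yield
        finally:
            self.phase = previous

    def train(self):
        # train() wraps its body, so callers never set the phase manually.
        with self.phase_context('train'):
            assert self.phase == 'train'

gm = GraphManager()
gm.train()
assert gm.phase == 'undefined'  # phase restored after training
```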
Gal Novik
19ca5c24b1 pre-release 0.10.0 2018-08-13 17:11:34 +03:00