
Itaicaspi/episode reset refactoring (#105)

* reordering of the episode reset operation and allowing episodes to be stored only when they are terminated (see the first sketch after this list)

* revert tensorflow-gpu to 1.9.0 + bug fix in should_train()

* tests README file and refactoring of the policy optimization agent's train function

* Update README.md

* additional policy optimization train function simplifications

* Updated the traces after the reordering of the environment reset

* Docker and Jenkins files

* updated the traces to the ones generated from within the Docker container

* updated the traces and added the DeepMind Control Suite to the Docker image

* updated the Jenkins file with the Intel proxy + updated the Doom Basic A3C test params

* updated line breaks in the Jenkins file

* added a missing line break in the Jenkins file

* refining the presets ignored by the trace tests + adding a configurable entropy beta value (see the second sketch after this list)

* switch the order of the trace and golden tests in Jenkins + fix an issue where golden test processes were not killed

* updated benchmarks for Dueling DDQN on Breakout and Pong

* allowing dynamic updates to the loss weights + bug fix in episode.update_returns

* remove the Docker and Jenkins files
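
To make the first bullet above concrete, here is a minimal sketch of storing episodes into the replay memory only once they terminate, together with a simple update_returns. Everything here except the episode.update_returns name from the commit message is a hypothetical illustration, not Coach's actual API.

```python
# Hypothetical sketch of "store episodes only when they are terminated":
# transitions accumulate in a working Episode, and the episode is moved
# into the replay memory only once its terminal flag is set, so training
# never samples a half-finished episode.
from collections import namedtuple

Transition = namedtuple('Transition', ['state', 'action', 'reward', 'game_over'])


class Episode:
    def __init__(self, discount=0.99):
        self.transitions = []
        self.returns = []
        self.discount = discount

    def insert(self, transition):
        self.transitions.append(transition)

    def is_terminated(self):
        return bool(self.transitions) and self.transitions[-1].game_over

    def update_returns(self):
        # Walk the episode backwards and accumulate the discounted return
        # G_t = r_t + discount * G_{t+1} for every transition.
        total_return = 0.0
        returns = []
        for transition in reversed(self.transitions):
            total_return = transition.reward + self.discount * total_return
            returns.append(total_return)
        self.returns = list(reversed(returns))


class EpisodicReplayMemory:
    def __init__(self):
        self.episodes = []
        self.current_episode = Episode()

    def store(self, transition):
        self.current_episode.insert(transition)
        if self.current_episode.is_terminated():
            # Only a completed episode gets its returns computed and is
            # committed to long-term storage.
            self.current_episode.update_returns()
            self.episodes.append(self.current_episode)
            self.current_episode = Episode()
```

With this shape, the episode reset path only has to swap in a fresh Episode; nothing partial ever reaches the stored episodes, which is what makes the reordering of the reset safe.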
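
And a rough illustration of where a configurable entropy beta fits in a policy optimization loss: the function, its NumPy formulation, and the default value are assumptions for this sketch, not Coach's implementation.

```python
# Hypothetical illustration of a configurable entropy "beta": the
# coefficient scales an entropy bonus subtracted from the policy
# gradient loss, so a larger beta keeps the policy more exploratory.
import numpy as np

def policy_gradient_loss(probs, actions, advantages, beta=0.01):
    # probs: (batch, n_actions) action probabilities from the policy
    # actions: (batch,) indices of the actions actually taken
    # advantages: (batch,) advantage estimates A_t
    log_taken = np.log(probs[np.arange(len(actions)), actions] + 1e-10)
    pg_loss = -np.mean(log_taken * advantages)
    # Entropy of the action distribution, averaged over the batch.
    entropy = -np.mean(np.sum(probs * np.log(probs + 1e-10), axis=1))
    # The configurable beta weights the entropy regularization term.
    return pg_loss - beta * entropy
```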
Itai Caspi
2018-09-04 15:07:54 +03:00
committed by GitHub
parent 7086492127
commit 72a1d9d426
92 changed files with 9803 additions and 9740 deletions

```diff
@@ -1,6 +1,6 @@
 Episode #,Training Iter,In Heatup,ER #Transitions,ER #Episodes,Episode Length,Total steps,Epsilon,Shaped Training Reward,Training Reward,Update Target Network,Evaluation Reward,Shaped Evaluation Reward,Success Rate,Loss/Mean,Loss/Stdev,Loss/Max,Loss/Min,Learning Rate/Mean,Learning Rate/Stdev,Learning Rate/Max,Learning Rate/Min,Grads (unclipped)/Mean,Grads (unclipped)/Stdev,Grads (unclipped)/Max,Grads (unclipped)/Min,Entropy/Mean,Entropy/Stdev,Entropy/Max,Entropy/Min,Advantages/Mean,Advantages/Stdev,Advantages/Max,Advantages/Min,Values/Mean,Values/Stdev,Values/Max,Values/Min,Value Loss/Mean,Value Loss/Stdev,Value Loss/Max,Value Loss/Min,Policy Loss/Mean,Policy Loss/Stdev,Policy Loss/Max,Policy Loss/Min,Q/Mean,Q/Stdev,Q/Max,Q/Min,TD targets/Mean,TD targets/Stdev,TD targets/Max,TD targets/Min,actions/Mean,actions/Stdev,actions/Max,actions/Min
 1,0.0,1.0,97.0,1.0,25.0,25.0,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
 2,0.0,1.0,194.0,2.0,25.0,50.0,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
-3,0.0,0.0,291.0,3.0,25.0,75.0,-0.03819695695002292,-1000.0,-1000.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.05867912,0.040427182,0.038633604,-0.13119522,,,,,-0.5875804651149715,0.9883034640881114,0.2924503923099136,-3.1955509185791016
-4,0.0,0.0,388.0,4.0,25.0,100.0,0.008508156342542239,-1000.0,-1000.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.04915462,0.027965656000000002,0.015574882,-0.11603892,,,,,-0.5310139374222866,0.9150246753002113,0.2726461971315715,-2.9480751131842533
-5,0.0,0.0,485.0,5.0,25.0,125.0,0.0,-1000.0,-1000.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.047291752,0.027684617999999998,0.030320742999999997,-0.11130883,,,,,-0.5612901779256286,0.929480152044698,0.23112091422080994,-2.8455907461559957
+3,0.0,0.0,291.0,3.0,25.0,75.0,-0.013705192291281485,-1000.0,-1000.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.51026434,0.22476047,-0.15544460000000002,-0.9295912000000001,,,,,2.0812359514743166,3.3284790187301674,12.234674698678914,-0.08146359109321984
+4,0.0,0.0,388.0,4.0,25.0,100.0,-0.02430443169727376,-1000.0,-1000.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.42551166,0.15804265,-0.14439134,-0.71600544,,,,,1.7233822661852551,2.691847085563749,10.017017240560527,-0.08547367510074966
+5,0.0,0.0,485.0,5.0,25.0,125.0,0.0,-1000.0,-1000.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.4319562,0.17422763,-0.1460396,-0.7337566999999999,,,,,1.742798057982355,2.725836758125469,10.305663257960603,-0.09830476343631744
```