🌞 Reference

[1] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), pages 387-395, 2014.

[2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[3] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), pages 1057-1063, 2000.

[4] Richard S Sutton, Csaba Szepesvári, and Hamid Reza Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems (NIPS), 21(21):1609-1616, 2008.

[5] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning (ICML), pages 993-1000, 2009.