A Fast Q(λ) Algorithm Based on Second-Order TD Error
FU Qi-Ming¹, LIU Quan¹,², SUN Hong-Kun¹, GAO Long¹, LI Jing¹, WANG Hui¹
1. School of Computer Science and Technology, Soochow University, Suzhou 215006, China
2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
Abstract The Q(λ) algorithm is a classic model-free, off-policy, multi-step reinforcement learning algorithm that combines value iteration and stochastic approximation. To address the low efficiency and slow convergence of the traditional Q(λ) algorithm, the n-order TD error is defined by generalizing the one-step TD error used in the traditional Q(λ) algorithm, and a fast Q(λ) algorithm based on the second-order TD error (SOE-FQ(λ)) is presented. The algorithm adjusts the Q values with the second-order TD error and broadcasts the TD error to the whole state-action space, which speeds up convergence. In addition, the convergence rate is analyzed: under one-step updating, the number of iterations depends mainly on 1/(1-γ) and 1/ε. Finally, the SOE-FQ(λ) algorithm is applied to the random walk and mountain car problems, and the experimental results show that it achieves faster convergence and better convergence performance.
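The following is a minimal sketch of the kind of update the abstract describes: a tabular Q(λ)-style learner with eligibility traces on a small random-walk chain. The state space, parameter values, and in particular the form of the second-order TD error used here (the current one-step TD error plus the discounted TD error of the previous step) are illustrative assumptions for exposition, not the exact SOE-FQ(λ) definition from the paper.

```python
import random
from collections import defaultdict

# Sketch of a tabular Q(lambda)-style learner on a small random-walk chain.
# The "second-order TD error" below is an assumed, illustrative form; the
# precise n-order TD error and SOE-FQ(lambda) update are defined in the paper.

N_STATES = 7            # states 0..6; states 0 and 6 are terminal
ACTIONS = (-1, +1)      # move left or right
GAMMA = 0.95            # discount factor
ALPHA = 0.1             # learning rate
LAMBDA = 0.8            # trace-decay parameter
EPSILON = 0.1           # epsilon-greedy exploration rate

def step(state, action):
    """Random-walk dynamics: reward 1 only when the right end is reached."""
    next_state = state + action
    if next_state == N_STATES - 1:
        return next_state, 1.0, True
    if next_state == 0:
        return next_state, 0.0, True
    return next_state, 0.0, False

def epsilon_greedy(Q, state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def run_episode(Q):
    traces = defaultdict(float)      # eligibility traces e(s, a)
    state = N_STATES // 2
    prev_delta = 0.0                 # previous one-step TD error
    done = False
    while not done:
        action = epsilon_greedy(Q, state)
        next_state, reward, done = step(state, action)

        # First-order (one-step) TD error of Q-learning.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        delta = reward + GAMMA * best_next - Q[(state, action)]

        # Assumed second-order TD error: current TD error plus the discounted
        # TD error from the previous step (illustrative only).
        delta2 = delta + GAMMA * LAMBDA * prev_delta

        # Broadcast the correction to all visited (s, a) pairs via traces.
        traces[(state, action)] += 1.0
        for sa in list(traces):
            Q[sa] += ALPHA * delta2 * traces[sa]
            traces[sa] *= GAMMA * LAMBDA

        prev_delta = delta
        state = next_state
    return Q

if __name__ == "__main__":
    Q = defaultdict(float)
    for _ in range(200):
        run_episode(Q)
    # Print the greedy state values of the non-terminal states.
    print({s: max(Q[(s, a)] for a in ACTIONS) for s in range(1, N_STATES - 1)})
```

For simplicity this sketch uses a naive (Peng-style) trace update rather than cutting traces on exploratory actions as Watkins's Q(λ) does; the choice only affects how the TD error is propagated backward along the trajectory.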