A Fast Q(λ) Algorithm Based on Second-Order TD Error
FU Qi-Ming (1), LIU Quan (1,2), SUN Hong-Kun (1), GAO Long (1), LI Jing (1), WANG Hui (1)
1. School of Computer Science and Technology, Soochow University, Suzhou 215006
2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012
Abstract: Q(λ) is a classic model-free, off-policy, multi-step reinforcement learning algorithm that combines value iteration with stochastic approximation. To address the low efficiency and slow convergence of the traditional Q(λ) algorithm, the n-th order TD error is defined by extending the TD error used in traditional Q(λ), and a fast Q(λ) algorithm based on the second-order TD error (SOE-FQ(λ)) is presented. The algorithm adjusts Q values with the second-order TD error and broadcasts the TD error over the whole state-action space, which speeds up convergence. In addition, the convergence rate is analyzed: under one-step updates, the number of iterations depends mainly on 1/(1-γ) and 1/ε. Finally, the SOE-FQ(λ) algorithm is applied to the random walk and mountain car problems, and the experimental results show that it achieves a faster convergence rate and better convergence performance.
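To make the idea concrete, the following is a minimal Python sketch of a tabular Q(λ)-style learner on a 19-state random walk, with eligibility traces broadcasting the TD error over visited state-action pairs. The environment (`step`), the hyperparameters, and especially the `second_order` correction (re-computing the TD error after the first-order update and applying it again through the traces) are illustrative assumptions for exposition only; they are not the authors' exact SOE-FQ(λ) definition.

```python
import numpy as np

# Minimal sketch, not the authors' code: Watkins-style tabular Q(lambda) with
# replacing eligibility traces on a 1-D random walk, plus an *assumed*
# second-order TD-error correction. The exact n-th order TD error of the
# paper is not reproduced here.

N_STATES = 19          # non-terminal states of the random walk
ACTIONS = (-1, +1)     # move left / move right

def step(state, action):
    """Random-walk dynamics: reward +1 only when exiting on the right."""
    nxt = state + action
    if nxt < 0:
        return None, 0.0          # left terminal, reward 0
    if nxt >= N_STATES:
        return None, 1.0          # right terminal, reward 1
    return nxt, 0.0

def run_episode(Q, alpha=0.1, gamma=0.95, lam=0.8, eps=0.1, second_order=True):
    E = np.zeros_like(Q)                       # eligibility traces
    s = N_STATES // 2                          # start in the middle
    while s is not None:
        a = np.random.randint(2) if np.random.rand() < eps else int(Q[s].argmax())
        s2, r = step(s, ACTIONS[a])
        q_next = 0.0 if s2 is None else Q[s2].max()
        delta = r + gamma * q_next - Q[s, a]   # first-order (one-step) TD error
        E[s, a] = 1.0                          # replacing trace
        Q += alpha * delta * E                 # broadcast update over traced (s, a) pairs
        if second_order:
            # assumed second-order correction: TD error recomputed after the
            # first-order update, then applied through the same traces
            q_next = 0.0 if s2 is None else Q[s2].max()
            delta2 = r + gamma * q_next - Q[s, a]
            Q += alpha * delta2 * E
        greedy = int(Q[s].argmax())
        E *= (gamma * lam) if a == greedy else 0.0   # cut traces after an exploratory action
        s = s2
    return Q

Q = np.zeros((N_STATES, len(ACTIONS)))
for _ in range(200):
    run_episode(Q)
print(Q.max(axis=1))   # learned greedy state values along the walk
```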