Abstract:Since action strategy learning is time-consuming for the reinforcement learning algorithm,a heuristic reinforcement learning algorithm is presented based on state backtracking. By analyzing the repetitive states and comparing the action policies of the reinforcement learning,a cost function is defined to indicate the importance of repetitive actions. A probability-based heuristic function is presented by combining an action reward with an action cost. The proposed algorithm reinforces the importance of an action to speed up learning by the heuristic function and measures the feasibility of an action to reduce unnecessary exploration by the cost function at the same time,thus the learning efficiency is steadily improve. This cost-based action strategy is proved to be reasonable. Two simulation scenarios are built and the experimental results of robot games prove that the proposed algorithm can learn by the tradeoff between rewards and costs,and effectively improve the convergence of Q-learning.
[1] Bianchi R A C,Ribeiro C H C ,Costa A H R. Heuristically Accelerated Q-Learning: A New Approach to Speed up Reinforcement Learning // Proc of the 17th Brazilian Symposinm on Artificial Intelligence. Maranhao,Brazil,2004: 245-254 [2] Barto A G,Mahadevan S. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems: Theory and Applications,2003,13(1/2): 41-77 [3] Marthi B. Automatic Shaping and Decomposition of Reward Functions // Proc of the 24th International Conference on Machine Learning. Corvallis,USA,2007: 601-608 [4] Torrey L,Shavlik J,Walker T,et al. Skill Acquisition via Transfer Learning and Advice Taking // Proc of the 17th European Conference on Machine Learning. Berlin,Germany,2006: 425-436 [5] Bianchi R A C,Ribeiro C H C,Costa A H R. Accelerating Autonomous Learning by Using Heuristic Selection of Actions. Journal of Heuristics,2008,14(2): 135-168 [6] Liu Quan,Gao Yang,Chen Daoxu,et al. A Logical Reinforcement Learning Method Based on Heuristic Contour List. Journal of Computer Research and Development,2008,45(11): 1824-1830 (in Chinese) (刘 全,高 阳,陈道蓄,等.一种基于启发式轮廓表的逻辑强化学习方法.计算机研究与发展,2008,45(11): 1824-1830) [7] Liu Quan,Fu Qiming,Gong Shengrong,et al. Reinforcement Learning Algorithm Based on Minimum State Method and Average Reward. Journal on Communications,2011,32(1): 66-71 (in Chinese) (刘 全,傅启明,龚声蓉,等.最小状态变元平均奖赏的强化学习方法.通信学报,2011,32(1): 66-71) [8] Wei Yingzi,Zhao Mingyang. Design and Convergence Analysis of Heuristic Reward Function for Reinforcement Learning Algorithms. Computer Science,2005,32(3): 190-193 (in Chinese) (魏英姿,赵明扬.强化学习算法中启发式回报函数的设计及其收敛性分析.计算机科学,2005,32(3): 190-193) [9] Zhao Jin,Liu Weiyi,Jian Jinjian. State-Clusters Shared Cooperative Multiagent Reinforcement Learning // Proc of the 7th Asian Control Conference. Hong Kong,China,2009: 129-135 [10] Van Seijen H,Whiteson S,van Hasselt H. Exploiting Best-Match Equations for Efficient Reinforcement Learning. Machine Learning Research,2011,12(6): 2045-2094 [11] Fang Min,Li Hao,Zhang Xiaosong. A Heuristic Reinforcement Learning Based on State Backtracking Method // Proc of the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology. Macau,China,2012: 673-678 [12] Gao Yang,Chen Shifu,Lu Xin. A Survey of Reinforcement Learning. Acta Automatiea Sinica,2004,30(1): 86-100 (in Chinese) (高 阳,陈世福,陆 鑫.强化学习研究综述.自动化学报,2004,30(1): 86-100) [13] Busoniu L,Babuska R,de Schutter B. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Trans on Systems,Man,and Cybernetics,2008,38(2): 156-172 [14] Stone P,Sutton R S,Kuhlmann G. Reinforcement Learning for RoboCup-Soccer Keepaway. Adaptive Behavior,2005,13(3): 165-188 [15] Ota J. Multiagent Robot Systems as Distributed Autonomous Systems. Advanced Engineering Informatics,2006,20(1): 59-70