Hybrid Heuristic Value Iteration POMDP Algorithm
LIU Feng
Software Institute, Nanjing University, Nanjing 210093; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093
Abstract Point-based value iteration methods are a class of algorithms for effectively solving the partially observable Markov decision process (POMDP) model. In most of these algorithms, however, efficiency is limited because the belief point set is explored by a single heuristic criterion. This paper presents a hybrid heuristic value iteration algorithm (HHVI) for exploring the belief point set. Upper and lower bounds on the value function are maintained, and only belief points whose bound gap exceeds a threshold are selected for expansion. Among the successor belief points whose bound gap also exceeds the threshold, the one farthest from the explored point set is chosen. By keeping the explored point set well distributed over the reachable belief space, HHVI guarantees its convergence effect. Experimental results on four benchmarks show that HHVI guarantees convergence efficiency and obtains a better globally optimal solution.
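The hybrid selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of L1 distance between beliefs, and the representation of bounds as callables are all assumptions made for the sketch. It filters successor beliefs by the upper/lower bound gap and, among those that pass, expands the one farthest from the already-explored set.

```python
import numpy as np

def select_expansion_point(successors, explored, lower, upper, epsilon):
    """Hypothetical sketch of HHVI's hybrid selection rule.

    successors: candidate successor belief points (lists of probabilities)
    explored:   belief points already in the explored set
    lower/upper: callables giving the value-function bounds at a belief
    epsilon:    bound-gap threshold
    """
    # Keep only successors whose bound gap exceeds the threshold.
    candidates = [b for b in successors if upper(b) - lower(b) > epsilon]
    if not candidates:
        return None  # every successor is already well approximated

    # Distance of a belief from the explored set: minimum L1 distance
    # to any explored point (an assumed metric for this sketch).
    def dist_to_set(b):
        return min(np.abs(np.asarray(b) - np.asarray(e)).sum() for e in explored)

    # Expand the farthest candidate, spreading coverage over the
    # reachable belief space.
    return max(candidates, key=dist_to_set)
```

For example, with the explored set {(1.0, 0.0)} and successors (0.9, 0.1) and (0.2, 0.8) both having a large bound gap, the rule picks (0.2, 0.8), the successor farther from the explored set.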
Received: 04 May 2016
Fund: Supported by the General Project of the State Key Laboratory for Novel Software Technology (No. ZZKT2016B07)
About author: LIU Feng, born in 1976, Ph.D., lecturer. His research interests include intelligent planning and reinforcement learning.