Abstract: Reinforcement learning enables robots to learn optimal action policies through interaction with the environment, making it an important frontier in the field of robotics. In this paper, the formal modeling of the robot task planning problem is briefly introduced, and the main reinforcement learning methods are analyzed, including model-free reinforcement learning, model-based reinforcement learning and hierarchical reinforcement learning. Research progress in reinforcement-learning-based robot task planning is reviewed, and the various methods and their applications are discussed. Finally, the key problems of applying reinforcement learning to robots are summarized, and future research directions are outlined.
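As a concrete illustration of the model-free setting discussed in the survey, the sketch below formalizes a toy robot navigation task as a Markov decision process (S, A, P, R, gamma) and solves it with tabular Q-learning. This is a minimal sketch under stated assumptions, not the paper's method: the 4x4 grid world, goal cell, and all hyperparameter values are illustrative choices.

# Minimal sketch (assumptions: 4x4 grid, deterministic moves, goal reward 1).
# State = (row, col); the agent learns Q(s, a) from interaction alone,
# without access to the transition model, i.e. model-free reinforcement learning.
import random

N = 4                                          # grid is N x N
GOAL = (3, 3)                                  # hypothetical goal cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2              # learning rate, discount, exploration

Q = {((r, c), a): 0.0 for r in range(N) for c in range(N) for a in range(4)}

def step(state, a):
    """Environment: deterministic moves clipped to the grid; reward 1 at the goal."""
    dr, dc = ACTIONS[a]
    nxt = (min(max(state[0] + dr, 0), N - 1), min(max(state[1] + dc, 0), N - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

for episode in range(500):
    s, done = (0, 0), False
    while not done:
        # epsilon-greedy action selection balances exploration and exploitation
        if random.random() < EPS:
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # Q-learning temporal-difference update:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + GAMMA * max(Q[(s2, x)] for x in range(4))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

After training, the greedy policy pi(s) = argmax_a Q(s, a) reaches the goal along a shortest path; a model-based variant would additionally learn or exploit the transition function inside step, and a hierarchical variant would decompose the task into subgoals with their own low-level policies.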