Multi-agent Reinforcement Learning Algorithm Based on State Space Exploration in Sparse Reward Scenarios
FANG Baofu1,2, YU Tingting1,2, WANG Hao1,2, WANG Zaijun3
1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601; 2. Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, Hefei University of Technology, Hefei 230601; 3. Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Flight University of China, Guanghan 618307
Abstract: In multi-agent task scenarios, the state space is often large and diverse, and the reward information provided by the external environment can be extremely limited, exhibiting sparse reward characteristics. Most existing multi-agent reinforcement learning algorithms show limited effectiveness in such sparse reward scenarios, since relying only on accidentally discovered reward sequences leads to a slow and inefficient learning process. To address this issue, a multi-agent reinforcement learning algorithm based on state space exploration (MASSE) in sparse reward scenarios is proposed. MASSE constructs a subset of the state space, maps a state from this subset to an intrinsic goal, and thereby enables agents to exploit the state space more fully and reduce unnecessary exploration. Agent states are decomposed into self-states and environmental states, and intrinsic rewards based on mutual information are generated by combining these two types of states with the intrinsic goals. By constructing the state subset space and generating mutual-information-based intrinsic rewards, states close to the goal states and states that improve the agents' understanding of the environment are appropriately rewarded. Consequently, agents are motivated to move towards the goal more actively while deepening their understanding of the environment, guiding them to adapt flexibly to sparse reward scenarios. Experimental results indicate that MASSE achieves superior performance in multi-agent cooperative scenarios with varying degrees of reward sparsity.
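As a rough illustration of the mechanism sketched in the abstract, and not the paper's actual implementation, the following Python snippet shows one way an intrinsic goal could be drawn from a subset of visited states and combined with decomposed self-/environmental states to produce an intrinsic reward, using a small MINE-style mutual-information critic. All names and hyperparameters (MINEEstimator, select_intrinsic_goal, intrinsic_reward, alpha, beta) are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of MASSE-style intrinsic rewards; names are illustrative.
import numpy as np
import torch
import torch.nn as nn


class MINEEstimator(nn.Module):
    """Tiny Donsker-Varadhan style critic T(x, y) for mutual-information estimation."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1))

    def mi_lower_bound(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Donsker-Varadhan bound: joint samples vs. samples with y shuffled
        # (an approximation of the product of marginals). Used for training the critic.
        joint = self.forward(x, y).mean()
        y_perm = y[torch.randperm(y.size(0))]
        marginal = torch.logsumexp(self.forward(x, y_perm), dim=0) - np.log(y.size(0))
        return joint - marginal


def select_intrinsic_goal(visited_states: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Draw an intrinsic goal from a subset of visited states.

    A uniform draw is used here as a stand-in for the paper's mapping from the
    constructed state subset space to an intrinsic goal.
    """
    return visited_states[rng.integers(len(visited_states))]


def intrinsic_reward(self_state: np.ndarray, env_state: np.ndarray,
                     goal: np.ndarray, mine: MINEEstimator,
                     alpha: float = 1.0, beta: float = 0.1) -> float:
    """Combine goal proximity of the self-state with an MI bonus on the environmental state."""
    # Reward being close to the intrinsic goal (self-state component).
    proximity = -np.linalg.norm(self_state - goal[: self_state.shape[0]])
    # Use the critic score as a cheap per-sample proxy for pointwise mutual
    # information between the environmental state and the goal.
    x = torch.as_tensor(env_state, dtype=torch.float32).unsqueeze(0)
    g = torch.as_tensor(goal, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        mi_bonus = mine.forward(x, g).item()
    return alpha * proximity + beta * mi_bonus
```

In this sketch the intrinsic reward would simply be added to the (sparse) extrinsic team reward during training, with the weights alpha and beta trading off goal-directed movement against environment understanding; how the paper actually balances these terms is not specified in the abstract.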
方宝富, 余婷婷, 王浩, 王在俊. 稀疏奖励场景下基于状态空间探索的多智能体强化学习算法[J]. 模式识别与人工智能, 2024, 37(5): 435-446.
FANG Baofu, YU Tingting, WANG Hao, WANG Zaijun. Multi-agent Reinforcement Learning Algorithm Based on State Space Exploration in Sparse Reward Scenarios. Pattern Recognition and Artificial Intelligence, 2024, 37(5): 435-446.