Abstract: In two-player zero-sum Markov games, the traditional policy gradient theorem only supports alternating training of the two players, because each player's policy gradient depends on the opponent's policy. To train both players simultaneously, a policy gradient theorem for two-player zero-sum Markov games is established. Building on this theorem, an extragradient-based REINFORCE algorithm is proposed to drive the joint policy of the two players toward an approximate Nash equilibrium. The advantages of the proposed algorithm are analyzed from several perspectives. First, comparative experiments on a simultaneous-move game show that the proposed algorithm converges more reliably and more quickly than the baselines. Second, the joint policies it learns are analyzed and verified to constitute approximate Nash equilibria. Finally, comparative experiments on simultaneous-move games of varying difficulty show that the proposed algorithm maintains good convergence speed as the difficulty increases.
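To make the "extragradient-based REINFORCE" idea concrete, the following is a minimal sketch, not the paper's algorithm: it applies the extragradient scheme to REINFORCE (score-function) gradient estimates on a one-shot zero-sum matrix game (rock-paper-scissors) rather than a full Markov game. The softmax parameterization, payoff matrix, learning rate, and batch size are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: extragradient update applied to REINFORCE gradient
# estimates in a one-shot zero-sum matrix game (rock-paper-scissors).
# Player 1 maximizes the payoff below; player 2 minimizes it.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])   # payoff to player 1

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(theta1, theta2, batch=256):
    """Score-function (REINFORCE) estimates of d E[payoff] / d theta for both players."""
    p1, p2 = softmax(theta1), softmax(theta2)
    g1, g2 = np.zeros_like(theta1), np.zeros_like(theta2)
    for _ in range(batch):
        a1 = rng.choice(3, p=p1)
        a2 = rng.choice(3, p=p2)
        r = A[a1, a2]                        # reward to player 1 (player 2 receives -r)
        score1 = -p1; score1[a1] += 1.0      # grad of log pi_1(a1) w.r.t. theta1
        score2 = -p2; score2[a2] += 1.0      # grad of log pi_2(a2) w.r.t. theta2
        g1 += r * score1
        g2 += r * score2
    return g1 / batch, g2 / batch

theta1, theta2 = np.zeros(3), np.zeros(3)
lr = 0.5
for t in range(2000):
    # Extrapolation step: compute a look-ahead point from gradients at the current point.
    g1, g2 = reinforce_grad(theta1, theta2)
    th1_half = theta1 + lr * g1              # player 1 ascends on the payoff
    th2_half = theta2 - lr * g2              # player 2 descends on the payoff
    # Update step: move the actual parameters using gradients at the look-ahead point.
    g1, g2 = reinforce_grad(th1_half, th2_half)
    theta1 += lr * g1
    theta2 -= lr * g2

print("player 1 policy:", np.round(softmax(theta1), 3))  # approaches uniform (the Nash equilibrium)
print("player 2 policy:", np.round(softmax(theta2), 3))
```

The extrapolation step is what distinguishes the extragradient scheme from plain simultaneous gradient ascent-descent, which tends to cycle around the equilibrium in zero-sum games; evaluating the gradient at the look-ahead point damps this rotation and allows both players to be updated at the same time.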