Weight Adaptive Generative Adversarial Imitation Learning Based on Noise Contrastive Estimation
GUAN Weifan1,2, ZHANG Xi1
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract: Traditional imitation learning requires expert demonstrations of extremely high quality. This restriction not only increases the difficulty of data collection but also limits the application scenarios of such algorithms. To address this problem, weight adaptive generative adversarial imitation learning based on noise contrastive estimation (GLANCE) is proposed to maintain high performance in scenarios where the quality of expert demonstrations is inconsistent. Firstly, a feature extractor is trained by noise contrastive estimation to improve the feature distribution of suboptimal expert demonstrations. Then, weight coefficients are assigned to the expert demonstrations, and generative adversarial imitation learning is performed on the demonstrations after they are redistributed according to these weight coefficients. Finally, a ranking loss is computed from evaluation data with known relative rankings, and the weight coefficients are optimized by gradient descent to improve the data distribution. Experiments on multiple continuous control tasks show that GLANCE achieves superior performance while requiring only 5% of the expert demonstration dataset as evaluation data, even when the quality of the expert demonstrations is inconsistent.
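The abstract outlines three components: an NCE-trained feature extractor, weight coefficients over the expert demonstrations used inside the generative adversarial imitation learning objective, and a ranking loss on a small ranked evaluation subset that drives gradient-descent updates of those weights. The PyTorch sketch below illustrates only the last two ideas under stated assumptions; the names (Discriminator, weighted_gail_disc_loss, ranking_loss) are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """GAIL-style discriminator over pre-extracted (state, action) features."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):
        return self.net(feats).squeeze(-1)  # logits

def weighted_gail_disc_loss(disc, expert_feats, policy_feats, weight_logits):
    """Logistic discriminator loss in which each expert sample is scaled by a
    learnable weight coefficient (normalized into a distribution), so that
    low-quality demonstrations can be down-weighted."""
    w = torch.softmax(weight_logits, dim=0)            # redistribute the demonstrations
    expert_term = -(w * F.logsigmoid(disc(expert_feats))).sum()
    policy_term = -F.logsigmoid(-disc(policy_feats)).mean()
    return expert_term + policy_term

def ranking_loss(weight_logits, better_idx, worse_idx, margin=0.1):
    """Margin ranking loss on the weight coefficients: demonstrations known from
    the small evaluation subset to be higher quality should receive larger weights."""
    diff = weight_logits[better_idx] - weight_logits[worse_idx]
    return F.relu(margin - diff).mean()

if __name__ == "__main__":
    feat_dim, n_expert, n_policy = 8, 32, 32
    disc = Discriminator(feat_dim)
    weight_logits = torch.zeros(n_expert, requires_grad=True)  # one weight per demonstration
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
    opt_w = torch.optim.Adam([weight_logits], lr=1e-2)

    expert_feats = torch.randn(n_expert, feat_dim)  # stand-ins for NCE-extracted features
    policy_feats = torch.randn(n_policy, feat_dim)
    better_idx = torch.tensor([0, 1, 2])            # toy ranked evaluation pairs
    worse_idx = torch.tensor([3, 4, 5])

    # Alternate: update the discriminator on the reweighted data, then update the
    # weight coefficients by gradient descent on the ranking loss.
    for _ in range(10):
        opt_d.zero_grad()
        weighted_gail_disc_loss(disc, expert_feats, policy_feats,
                                weight_logits.detach()).backward()
        opt_d.step()

        opt_w.zero_grad()
        ranking_loss(weight_logits, better_idx, worse_idx).backward()
        opt_w.step()
```

In the full method the expert and policy features would come from the NCE-trained extractor and the discriminator output would serve as the reward signal for policy optimization; those pieces are omitted from this sketch.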