Abstract: Adaptive moment estimation algorithms, which combine momentum with adaptive step-size techniques, are widely applied in deep learning. However, existing algorithms fail to achieve optimal performance in both theory and practice. To address this problem, an AdaBelief-based heavy-ball momentum method, AdaBHB, is proposed. The AdaBelief technique of flexibly adjusting the step size is introduced to improve empirical performance, and the heavy-ball momentum method, with its step size adjusted by an exponential moving average strategy, is employed to accelerate convergence. Following the convergence analysis techniques of the AdaBelief and heavy-ball momentum methods, a time-varying step size and momentum coefficient are carefully selected, and a momentum term and an adaptive matrix are incorporated. It is proved that AdaBHB attains the optimal individual convergence rate for non-smooth general convex optimization problems. Finally, experiments on convex optimization problems and deep neural networks verify the theoretical analysis and show that AdaBHB achieves the theoretically optimal convergence rate while improving empirical performance.
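The abstract describes AdaBHB only at a high level: an AdaBelief-style adaptive step size combined with a Polyak heavy-ball momentum term under time-varying schedules. The following Python sketch illustrates how such a combination could look; it is not the authors' exact update rule, and the schedules alpha_t and beta_t, the function name adabhb_sketch, and the use of the raw gradient in the adaptive step are all assumptions made for illustration.

import numpy as np

def adabhb_sketch(grad_fn, x0, T=1000, alpha=0.1, beta=0.9, eps=1e-8):
    """Illustrative AdaBelief + heavy-ball sketch (not the paper's exact method).

    grad_fn(x) returns a (sub)gradient of the objective at x.
    The time-varying step size alpha_t and momentum coefficient beta_t
    below are assumed schedules; in the paper they are chosen so that the
    individual iterate attains the optimal rate for non-smooth convex problems.
    """
    x = x0.copy()
    x_prev = x0.copy()        # needed for the heavy-ball term x_t - x_{t-1}
    m = np.zeros_like(x0)     # exponential moving average of gradients
    s = np.zeros_like(x0)     # AdaBelief-style EMA of (g_t - m_t)^2, the "belief"
    for t in range(1, T + 1):
        g = grad_fn(x)
        m = beta * m + (1 - beta) * g
        s = beta * s + (1 - beta) * (g - m) ** 2
        alpha_t = alpha / np.sqrt(t)     # assumed time-varying step size
        beta_t = t / (t + 2)             # assumed momentum-coefficient schedule
        # adaptive step scaled by the belief matrix, plus heavy-ball momentum
        x_new = x - alpha_t * g / (np.sqrt(s) + eps) + beta_t * (x - x_prev)
        x_prev, x = x, x_new
    return x

For instance, adabhb_sketch(lambda x: np.sign(x), np.array([5.0])) runs the sketch on the non-smooth convex objective f(x) = |x|, whose subgradient is sign(x).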