Optimal Individual Convergence Rate of Heavy-Ball Based Momentum Methods Based on Adaptive Step-Size Strategy
HUANG Jianzhi1, LONG Sheng1, TAO Qing1
1. Department of Information Engineering, Artillery and Air Defense Academy of the Chinese People's Liberation Army, Hefei 230031
Abstract: AMSGrad combines two optimization techniques, adaptive step-size and momentum. However, compared with pure adaptive step-size algorithms, the convergence analysis of AMSGrad carries an extra logarithmic factor. To address this problem, non-smooth convex problems are studied in this paper. By suitably choosing time-varying step-size and momentum parameters, it is proved that Heavy-ball-based momentum methods with adaptive step-size attain the optimal individual convergence rate, which indicates that they enjoy the advantages of both adaptation and acceleration. The hinge loss problem under an L1-norm constraint is solved to verify the correctness of the theoretical analysis.
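The method described in the abstract combines a Heavy-ball momentum update, an adaptive per-coordinate step-size, and projection onto an L1 ball for the hinge loss problem. The following is a minimal Python sketch of that combination. It assumes an AdaGrad-style accumulated-gradient scaling, the illustrative schedules beta_t = t/(t+2) for the momentum parameter, and a plain Euclidean projection onto the L1 ball; the exact time-varying step-size and momentum choices proved optimal in the paper, and its weighted projection (if any), may differ.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the L1 ball {w : ||w||_1 <= radius}
    via the standard sorting procedure."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def heavy_ball_adaptive_hinge(X, y, T=1000, radius=10.0, alpha=1.0, eps=1e-8, seed=0):
    """Heavy-ball momentum with an AdaGrad-style per-coordinate step-size
    applied to the hinge loss under an L1-norm constraint.
    Parameter schedules here are illustrative, not the paper's exact choices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_prev = np.zeros(d)
    w = np.zeros(d)
    G = np.zeros(d)                              # running sum of squared subgradients
    for t in range(1, T + 1):
        i = rng.integers(n)                      # stochastic sample
        margin = y[i] * X[i].dot(w)
        g = -y[i] * X[i] if margin < 1.0 else np.zeros(d)   # hinge subgradient
        G += g ** 2
        step = alpha / (np.sqrt(G) + eps)        # adaptive per-coordinate step-size
        beta = t / (t + 2.0)                     # time-varying momentum parameter
        w_next = w - step * g + beta * (w - w_prev)
        w_next = project_l1_ball(w_next, radius) # enforce the L1-norm constraint
        w_prev, w = w, w_next
    return w                                     # individual (last) iterate
```

Returning the last iterate, rather than an average of iterates, is what "individual convergence" refers to in this line of work.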