Abstract: For convex problems, the convergence analysis of dual averaging (DA) has to be carried out in the dual space, which makes its individual convergence difficult to establish. In this paper, a simple convergence analysis of DA is presented, and DA is proved to attain the same optimal O(ln t/t) individual convergence rate as gradient descent (GD). Unlike GD, the individual convergence of DA is shown to be flexible with respect to the step size, which is demonstrated by analyzing two typical step-size strategies. Furthermore, the derived convergence result is extended to the stochastic setting to solve large-scale machine learning problems. Experiments on L1-norm constrained hinge loss problems verify the correctness of the theoretical analysis.
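To make the setting of the experiments concrete, the following is a minimal sketch of stochastic dual averaging applied to an L1-norm constrained hinge loss problem. It uses the standard DA form with a quadratic prox term, so each iterate is the projection of the scaled negative dual sum onto the L1 ball; the function names, the ball radius, and the beta_t = sqrt(t) policy are illustrative assumptions, not the specific step-size strategies analyzed in the paper.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the L1 ball {x : ||x||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def stochastic_da_hinge(X, y, radius=1.0, T=1000, seed=0):
    """Stochastic DA sketch for the L1-ball constrained hinge loss.

    At step t, a hinge-loss subgradient on one random sample is added to the
    running dual sum z; the next iterate minimizes <z, w> + (beta_t/2)||w||^2
    over the L1 ball, i.e. the projection of -z / beta_t.  beta_t = sqrt(t)
    is an assumed (illustrative) step-size policy.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = np.zeros(d)                              # running sum of subgradients
    w = np.zeros(d)                              # current (individual) iterate
    for t in range(1, T + 1):
        i = rng.integers(n)
        if y[i] * X[i].dot(w) < 1.0:             # hinge loss max(0, 1 - y w.x) is active
            z += -y[i] * X[i]                    # subgradient w.r.t. w
        beta_t = np.sqrt(t)
        w = project_l1_ball(-z / beta_t, radius)
    return w                                     # last (individual) iterate, not an average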