AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for training deep neural networks.

Keywords: Adaptive learning rate; Linear speedup; Momentum acceleration; Non-convex optimization; Sharpness-aware minimization

Journal

Neural networks : the official journal of the International Neural Network Society
ISSN: 1879-2782
Abbreviated title: Neural Netw
Country: United States
NLM ID: 8805018

Publication information

Publication date:
01 Nov 2023
History:
received: 05 Jun 2023
revised: 25 Sep 2023
accepted: 26 Oct 2023
medline: 10 Nov 2023
pubmed: 10 Nov 2023
entrez: 09 Nov 2023
Status: ahead of print

Abstract

The sharpness-aware minimization (SAM) optimizer has been extensively explored because it generalizes better when training deep neural networks, introducing an extra perturbation step that flattens the loss landscape of deep learning models. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without theoretical guarantees, owing to the three coupled difficulties of analyzing the perturbation step, the adaptive learning rate, and the momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves the linear speedup property with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient step from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term so that they become independent when taking expectations in the analysis. We then bound these terms by showing that the adaptive learning rate has a limited range, which makes the analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, showing that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
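The abstract describes AdaSAM only informally. Below is a minimal Python sketch of how a SAM-style perturbation step can be combined with an Adam-style adaptive learning rate and momentum, assuming a generic stochastic gradient oracle grad_fn(w, batch). The function and hyperparameter names (rho, beta1, beta2, lr, eps) are illustrative assumptions and do not reproduce the paper's exact notation or algorithmic details (for example, bias correction is omitted).

import numpy as np

def adasam_step(w, batch, grad_fn, m, v, lr=1e-3, rho=0.05,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaSAM-style step: SAM perturbation + Adam-style adaptive update.

    w: parameter vector; m, v: first/second moment estimates, initialized to
    zero vectors of the same shape as w; grad_fn(w, batch): stochastic
    gradient oracle (illustrative assumption, not from the paper).
    """
    # 1. SAM perturbation: ascend to an approximate worst-case point
    #    within an L2 ball of radius rho around w.
    g = grad_fn(w, batch)
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)

    # 2. Stochastic gradient evaluated at the perturbed point.
    g_adv = grad_fn(w_adv, batch)

    # 3. Momentum (first moment) and adaptive second moment.
    m = beta1 * m + (1 - beta1) * g_adv
    v = beta2 * v + (1 - beta2) * g_adv ** 2

    # 4. Adaptive update applied to the original (unperturbed) parameters.
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v

In this sketch, the first moment m provides the momentum acceleration, the second moment v provides the coordinate-wise adaptive learning rate, and the perturbation radius rho controls the sharpness-aware ascent step analyzed in the paper.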

Identifiers

pubmed: 37944247
pii: S0893-6080(23)00606-8
doi: 10.1016/j.neunet.2023.10.044

Publication type

Journal Article

Languages

eng

Citation subset

IM

Pagination

506-519

Copyright information

Copyright © 2023 Elsevier Ltd. All rights reserved.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Hao Sun (H)

School of Computer Science, University of Science and Technology of China, Hefei, 230026, Anhui, China.

Li Shen (L)

JD.com, Beijing, 100000, China. Electronic address: mathshenli@gmail.com.

Qihuang Zhong (Q)

School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.

Liang Ding (L)

JD.com, Beijing, 100000, China.

Shixiang Chen (S)

School of Mathematical Science, University of Science and Technology of China, Hefei, 230026, Anhui, China.

Jingwei Sun (J)

School of Computer Science, University of Science and Technology of China, Hefei, 230026, Anhui, China.

Jing Li (J)

School of Computer Science, University of Science and Technology of China, Hefei, 230026, Anhui, China.

Guangzhong Sun (G)

School of Computer Science, University of Science and Technology of China, Hefei, 230026, Anhui, China.

Dacheng Tao (D)

School of Computer Science, University of Sydney, Sydney, 2006, New South Wales, Australia.

MeSH classifications