AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for training deep neural networks.
Adaptive learning rate
Linear speedup
Momentum acceleration
Non-convex optimization
Sharpness-aware minimization
Journal
Neural networks : the official journal of the International Neural Network Society
ISSN: 1879-2782
Abbreviated title: Neural Netw
Country: United States
NLM ID: 8805018
Publication information
Publication date:
01 Nov 2023
History:
received: 05 Jun 2023
revised: 25 Sep 2023
accepted: 26 Oct 2023
medline: 10 Nov 2023
pubmed: 10 Nov 2023
entrez: 09 Nov 2023
Status:
aheadofprint
Abstract
The sharpness-aware minimization (SAM) optimizer has been extensively explored because it improves generalization when training deep neural networks by introducing an extra perturbation step that flattens the loss landscape. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without a theoretical guarantee, owing to the difficulty of jointly analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves the linear speedup property with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient step from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term that makes them independent when taking expectations in the analysis. We then bound these terms by showing that the adaptive learning rate lies in a limited range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
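Since the abstract describes AdaSAM only in words, the following is a minimal sketch of one AdaSAM-style update: a SAM ascent perturbation followed by an AMSGrad-like momentum and adaptive learning-rate step applied to the perturbed gradient. The function name adasam_step, the state dictionary layout, and the default hyperparameters (rho, lr, beta1, beta2, eps) are illustrative assumptions, not the authors' reference implementation.

    import numpy as np

    def adasam_step(w, grad_fn, state, rho=0.05, lr=1e-3,
                    beta1=0.9, beta2=0.999, eps=1e-8):
        # One AdaSAM-style update on the parameter vector w.
        # grad_fn(w) returns a stochastic mini-batch gradient at w.
        g = grad_fn(w)                                  # gradient at current point
        e = rho * g / (np.linalg.norm(g) + 1e-12)       # SAM ascent perturbation
        g_sam = grad_fn(w + e)                          # gradient at perturbed point

        # AMSGrad-style momentum and adaptive learning rate on the SAM gradient
        state["m"] = beta1 * state["m"] + (1 - beta1) * g_sam
        state["v"] = beta2 * state["v"] + (1 - beta2) * g_sam ** 2
        state["v_hat"] = np.maximum(state.get("v_hat", state["v"]), state["v"])

        w_new = w - lr * state["m"] / (np.sqrt(state["v_hat"]) + eps)
        return w_new, state

Usage would initialize the (assumed) optimizer state as state = {"m": np.zeros_like(w), "v": np.zeros_like(w)} and call adasam_step once per mini-batch.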
Identifiers
pubmed: 37944247
pii: S0893-6080(23)00606-8
doi: 10.1016/j.neunet.2023.10.044
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
506-519
Copyright information
Copyright © 2023 Elsevier Ltd. All rights reserved.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.