High-dimensional dynamics of generalization error in neural networks.
Generalization error
Neural networks
Random matrix theory
Journal
Neural networks : the official journal of the International Neural Network Society
ISSN: 1879-2782
Abbreviated title: Neural Netw
Country: United States
NLM ID: 8805018
Publication information
Publication date:
Dec 2020
History:
received: 2020-01-04
revised: 2020-08-18
accepted: 2020-08-24
pubmed: 2020-10-07
medline: 2021-01-20
entrez: 2020-10-06
Status:
ppublish
Abstract
We perform an analysis of the average generalization dynamics of large neural networks trained using gradient descent. We study the practically relevant "high-dimensional" regime where the number of free parameters in the network is on the order of or even larger than the number of examples in the dataset. Using random matrix theory and exact solutions in linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of data and the signal-to-noise ratio of the learning problem. We find that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks. Overtraining is worst at intermediate network sizes, when the effective number of free parameters equals the number of samples, and thus can be reduced by making a network smaller or larger. Additionally, in the high-dimensional regime, low generalization error requires starting with small initial weights. We then turn to non-linear neural networks, and show that making networks very large does not harm their generalization performance. On the contrary, it can in fact reduce overtraining, even without early stopping or regularization of any sort. We identify two novel phenomena underlying this behavior in overcomplete models: first, there is a frozen subspace of the weights in which no learning occurs under gradient descent; and second, the statistical properties of the high-dimensional regime yield better-conditioned input correlations which protect against overtraining. We demonstrate that standard application of theories such as Rademacher complexity is inaccurate in predicting the generalization performance of deep neural networks, and derive an alternative bound which incorporates the frozen subspace and conditioning effects and qualitatively matches the behavior observed in simulation.
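The linear-model setting described in the abstract can be explored numerically. The following is a minimal sketch, not the authors' code: it assumes a noisy linear teacher, Gaussian inputs with variance 1/N per component, full-batch gradient descent on squared error, and a step size set from the largest curvature of the loss. The function name `learning_curves` and all parameter names and defaults are hypothetical choices made for this illustration.

```python
import numpy as np


def learning_curves(n_features, n_samples, snr=5.0, steps=500,
                    init_scale=0.0, seed=0):
    """Train a linear student by gradient descent on a noisy linear teacher.

    Assumed setup (illustrative, not taken from the paper's text):
      x ~ N(0, I / n_features), y = x . w_teacher + noise,
      noise variance 1, teacher weight variance snr.
    Returns (training_error, generalization_error), each of length steps + 1.
    """
    rng = np.random.default_rng(seed)
    noise_std = 1.0

    w_teacher = np.sqrt(snr) * noise_std * rng.normal(size=n_features)
    X = rng.normal(size=(n_samples, n_features)) / np.sqrt(n_features)
    y = X @ w_teacher + noise_std * rng.normal(size=n_samples)

    # Student initialization; init_scale = 0 starts from zero weights.
    w = init_scale * rng.normal(size=n_features)

    # Largest eigenvalue of X^T X / n_samples sets a stable step size.
    top_curvature = np.linalg.norm(X, ord=2) ** 2 / n_samples
    lr = 1.0 / top_curvature

    train = np.empty(steps + 1)
    gen = np.empty(steps + 1)
    for t in range(steps + 1):
        residual = X @ w - y
        train[t] = np.mean(residual ** 2)
        # Expected squared error on a fresh example under the assumed model.
        gen[t] = np.sum((w - w_teacher) ** 2) / n_features + noise_std ** 2
        w = w - lr * (X.T @ residual) / n_samples
    return train, gen


if __name__ == "__main__":
    n_samples = 100
    for n_features in (25, 100, 400):
        train, gen = learning_curves(n_features, n_samples)
        print(n_features, "best:", gen.min(), "final:", gen[-1])
```

Comparing the minimum of the generalization curve with its final value, across several ratios of `n_features` to `n_samples` and several values of `init_scale`, gives a rough way to look at the overtraining and initialization effects the abstract describes. The numbers depend on the assumed scaling and on the random seed.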
Identifiers
pubmed: 33022471
pii: S0893-6080(20)30311-7
doi: 10.1016/j.neunet.2020.08.022
pmc: PMC7685244
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
428-446
Copyright information
Copyright © 2020 The Authors. Published by Elsevier Ltd. All rights reserved.
Conflict of interest statement
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.