High-dimensional dynamics of generalization error in neural networks.

Keywords: Deep learning; Generalization error; Neural networks; Random matrix theory

Journal

Neural networks : the official journal of the International Neural Network Society

ISSN: 1879-2782

Abbreviated title: Neural Netw

Country: United States

NLM ID: 8805018

Publication information

Publication date:
Dec 2020

History:

received: 2020-01-04

revised: 2020-08-18

accepted: 2020-08-24

pubmed: 2020-10-07

medline: 2021-01-20

entrez: 2020-10-06

Status: ppublish

Abstract

We analyze the average generalization dynamics of large neural networks trained using gradient descent. We study the practically relevant "high-dimensional" regime where the number of free parameters in the network is on the order of or even larger than the number of examples in the dataset. Using random matrix theory and exact solutions in linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of the data and the signal-to-noise ratio of the learning problem. We find that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks. Overtraining is worst at intermediate network sizes, when the effective number of free parameters equals the number of samples, and thus can be reduced by making a network smaller or larger. Additionally, in the high-dimensional regime, low generalization error requires starting with small initial weights. We then turn to non-linear neural networks, and show that making networks very large does not harm their generalization performance. On the contrary, it can in fact reduce overtraining, even without early stopping or regularization of any sort. We identify two novel phenomena underlying this behavior in overcomplete models: first, there is a frozen subspace of the weights in which no learning occurs under gradient descent; and second, the statistical properties of the high-dimensional regime yield better-conditioned input correlations which protect against overtraining. We demonstrate that standard applications of theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks, and derive an alternative bound which incorporates the frozen subspace and conditioning effects and qualitatively matches the behavior observed in simulation.
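The frozen-subspace claim and the role of small initial weights can be illustrated in the simplest setting the abstract mentions, a linear model with more parameters than samples. The sketch below is not the authors' code: the teacher-student setup, the sizes, the noise level, the learning rate, and all function and variable names are assumptions chosen for illustration. It relies only on the fact that the squared-error gradient is a combination of the training inputs, so any weight component orthogonal to all training inputs never changes under gradient descent.

```python
import numpy as np


def gradient_descent_linear(X, y, w0, lr=1.0, steps=2000):
    """Full-batch gradient descent on the mean squared error of y ~ X @ w."""
    w = w0.copy()
    n_samples = X.shape[0]
    for _ in range(steps):
        # The gradient always lies in the span of the training inputs (rows of X).
        grad = X.T @ (X @ w - y) / n_samples
        w = w - lr * grad
    return w


def frozen_projector(X):
    """Projector onto weight directions orthogonal to every training input."""
    # pinv(X) @ X projects onto the row space of X; the remainder is frozen.
    return np.eye(X.shape[1]) - np.linalg.pinv(X) @ X


def generalization_error(w, w_teacher):
    """Noise-free test error for isotropic inputs with variance 1/n_params."""
    return np.sum((w - w_teacher) ** 2) / w.size


# Hypothetical problem sizes: more parameters than samples (overcomplete).
rng = np.random.default_rng(0)
n_samples, n_params = 20, 50
X = rng.standard_normal((n_samples, n_params)) / np.sqrt(n_params)
w_teacher = rng.standard_normal(n_params)
y = X @ w_teacher + 0.1 * rng.standard_normal(n_samples)

P_frozen = frozen_projector(X)

# Same training data, two initializations that differ only in scale.
w0_small = 1e-3 * rng.standard_normal(n_params)
w0_large = 3.0 * rng.standard_normal(n_params)
w_small = gradient_descent_linear(X, y, w0_small)
w_large = gradient_descent_linear(X, y, w0_large)

# How far the frozen component moved during training (zero up to round-off).
frozen_drift_small = np.linalg.norm(P_frozen @ (w_small - w0_small))
frozen_drift_large = np.linalg.norm(P_frozen @ (w_large - w0_large))

err_small = generalization_error(w_small, w_teacher)
err_large = generalization_error(w_large, w_teacher)

print("frozen drift (small init):", frozen_drift_small)
print("frozen drift (large init):", frozen_drift_large)
print("generalization error (small init):", err_small)
print("generalization error (large init):", err_large)
```

Because the frozen component of the initial weights is carried unchanged into the final solution, a large initialization leaves a large untrained residual in the test error, which is one way to read the abstract's statement that low generalization error in this regime requires small initial weights.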

Identifiers

DOI: 10.1016/j.neunet.2020.08.022

PMID: 33022471

PMC: PMC7685244

PII: S0893-6080(20)30311-7

Publication types

Journal Article

Languages

eng

Pagination

428-446

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Madhu S Advani (MS)

Andrew M Saxe (AM)

Haim Sompolinsky (H)
