Autoencoders for sample size estimation for fully connected neural network classifiers.


Journal

NPJ digital medicine
ISSN: 2398-6352
Abbreviated title: NPJ Digit Med
Country: England
NLM ID: 101731738

Publication information

Publication date:
13 Dec 2022
History:
received: 02 03 2022
accepted: 29 11 2022
entrez: 13 12 2022
pubmed: 14 12 2022
medline: 14 12 2022
Status: epublish

Abstract

Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired performance is largely based on prior experience with similar models and problems or on untested heuristics. In many supervised machine learning applications, data labeling can be expensive and time-consuming and would benefit from a more rigorous means of estimating labeling requirements. Here, we study the problem of estimating the minimum sample size of labeled training data necessary for training computer vision models as an exemplar for other deep learning problems. We consider the problem of identifying the minimal number of labeled data points to achieve a generalizable representation of the data, a minimum converging sample (MCS). We use autoencoder loss to estimate the MCS for fully connected neural network classifiers. At sample sizes smaller than the MCS estimate, fully connected networks fail to distinguish classes, and at sample sizes above the MCS estimate, generalizability strongly correlates with the loss function of the autoencoder. We provide an easily accessible, code-free, and dataset-agnostic tool to estimate sample sizes for fully connected networks. Taken together, our findings suggest that MCS and convergence estimation are promising methods to guide sample size estimates for data collection and labeling prior to training deep learning models in computer vision.
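
The general idea of reading a sample size off an autoencoder loss curve can be illustrated with a minimal sketch. It is not the authors' method or tool: it assumes a linear autoencoder (computed in closed form via SVD, equivalent to PCA) in place of a neural autoencoder, synthetic data, and a hypothetical stopping rule that picks the first sample size after which the held-out reconstruction loss improves by less than a tolerance `tol`. The function names, the `sizes` grid, and all parameter values are illustrative assumptions.

```python
import numpy as np

def linear_autoencoder_loss(train, test, k):
    """Held-out reconstruction MSE of a rank-k linear autoencoder (PCA stand-in)."""
    mean = train.mean(axis=0)
    # Rows of vt are principal directions; the top k serve as tied encoder/decoder weights.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    w = vt[:k]
    recon = (test - mean) @ w.T @ w + mean
    return float(np.mean((test - recon) ** 2))

def estimate_convergence_size(x, sizes, k=2, tol=0.05, seed=0):
    """Return (size, losses): the first size after which the loss stops improving.

    Assumed rule (not from the paper): stop when the relative improvement in
    held-out loss between consecutive sizes is at most `tol`. Returns None as
    the size if no such point is found on the grid.
    """
    rng = np.random.default_rng(seed)
    x = x[rng.permutation(len(x))]
    n_test = len(x) // 5                      # hold out 20% for evaluation
    test, pool = x[:n_test], x[n_test:]
    losses = [linear_autoencoder_loss(pool[:n], test, k) for n in sizes]
    for i in range(1, len(sizes)):
        if losses[i - 1] - losses[i] <= tol * losses[i - 1]:
            return sizes[i - 1], losses
    return None, losses

# Synthetic demo data: a 2-dimensional latent signal embedded in 10 dimensions plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 10))
data = latent @ mixing + 0.1 * rng.normal(size=(1000, 10))

sizes = [5, 10, 20, 50, 100, 200, 400]
size, losses = estimate_convergence_size(data, sizes)
```

On real image data the linear model would be replaced by a trained autoencoder, and the stopping rule would need to be validated against classifier performance, as the abstract describes.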

Identifiers

pubmed: 36513729
doi: 10.1038/s41746-022-00728-0
pii: 10.1038/s41746-022-00728-0
pmc: PMC9747810

Publication types

Journal Article

Languages

eng

Pagination

180

Copyright information

© 2022. The Author(s).

Authors

Faris F Gulamali (FF)

Icahn School of Medicine, New York, NY, 10029, USA. faris.gulamali@icahn.mssm.edu.

Ashwin S Sawant (AS)

Icahn School of Medicine, New York, NY, 10029, USA.

Patricia Kovatch (P)

Icahn School of Medicine, New York, NY, 10029, USA.

Benjamin Glicksberg (B)

Icahn School of Medicine, New York, NY, 10029, USA.

Alexander Charney (A)

Icahn School of Medicine, New York, NY, 10029, USA.

Girish N Nadkarni (GN)

Icahn School of Medicine, New York, NY, 10029, USA.

Eric Oermann (E)

New York University, New York, NY, 10016, USA.

MeSH classifications