Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling).


Journal

PloS one
ISSN: 1932-6203
Abbreviated title: PLoS One
Country: United States
NLM ID: 101285081

Publication information

Publication date:
2021
History:
received: 21 03 2021
accepted: 24 07 2021
entrez: 5 8 2021
pubmed: 6 8 2021
medline: 15 12 2021
Status: epublish

Abstract

The size of today's biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this step can be optimized to obtain samples that reflect the entire data set better than those obtained with the current standard method. By repeating the random sampling and comparing the distribution of each drawn sample with the distribution of the original data, it was possible to establish a method for obtaining data subsets that reflect the entire data set better than the first randomly selected subsample, which is what the current standard uses. Experiments on artificial and real biomedical data sets showed that the reconstruction of the remaining (non-sampled) data of the original data set from the downsampled data improved significantly. This was observed with both principal component analysis and autoencoding neural networks. The fidelity depended on both the number of cases drawn from the original data and the number of samples drawn. Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data better than those obtained with the standard method. Because distributional similarity is the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
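The procedure described in the abstract (repeat the class-proportional random draw, score each draw by how closely its distribution matches the original data, keep the best one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the opdisDownsampling package itself: the two-sample Kolmogorov-Smirnov statistic as the distance measure, the rule of taking the worst (largest) statistic across variables, and all function and parameter names are hypothetical choices made here for illustration.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs.

    Assumption: the abstract does not name a distance measure; KS is used here
    only as one plausible measure of distributional similarity.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def class_proportional_sample(labels, fraction, rng):
    """Draw row indices uniformly at random, keeping the class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    chosen = []
    for members in by_class.values():
        k = max(1, round(fraction * len(members)))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


def distribution_distance(data, indices):
    """Worst-case KS statistic over all variables (hypothetical aggregation rule)."""
    n_vars = len(data[0])
    return max(
        ks_statistic([data[i][j] for i in indices], [row[j] for row in data])
        for j in range(n_vars)
    )


def best_of_n_downsample(data, labels, fraction, n_trials, seed=0):
    """Repeat the random draw n_trials times and keep the closest-matching sample.

    Returns (indices, distance of best draw, distance of first draw). The first
    draw corresponds to the standard single-shot downsampling.
    """
    rng = random.Random(seed)
    best_idx, best_dist, first_dist = None, float("inf"), None
    for _ in range(n_trials):
        idx = class_proportional_sample(labels, fraction, rng)
        dist = distribution_distance(data, idx)
        if first_dist is None:
            first_dist = dist
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, first_dist


# Illustrative synthetic data: two classes, two variables.
demo_rng = random.Random(42)
demo_labels = [0] * 600 + [1] * 400
demo_data = [[demo_rng.gauss(2.0 * lab, 1.0), demo_rng.expovariate(1.0)]
             for lab in demo_labels]
indices, best, first = best_of_n_downsample(demo_data, demo_labels,
                                            fraction=0.1, n_trials=50)
```

Because the first draw is one of the candidates, the selected sample can never match the original distribution worse than the single-shot standard under the chosen distance. The class labels and the downstream analysis play no role in the scoring, which mirrors the abstract's point that distributional similarity is the only selection criterion.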

Identifiers

pubmed: 34352006
doi: 10.1371/journal.pone.0255838
pii: PONE-D-21-09268
pmc: PMC8341664

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

e0255838

Conflict of interest statement

The authors have declared that no further conflicts of interest exist.

Authors

Jörn Lötsch (J)

Institute of Clinical Pharmacology, Goethe-University, Frankfurt am Main, Germany.
Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, Frankfurt am Main, Germany.

Sebastian Malkusch (S)

Institute of Clinical Pharmacology, Goethe-University, Frankfurt am Main, Germany.

Alfred Ultsch (A)

DataBionics Research Group, University of Marburg, Marburg, Germany.
