Gaussian noise up-sampling is better suited than SMOTE and ADASYN for clinical decision making.

Clinical data; Data augmentation; Machine learning; Synthetic data

Journal

BioData Mining
ISSN: 1756-0381
Abbreviated title: BioData Min
Country: England
NLM ID: 101319161

Publication information

Publication date:
29 Nov 2021
History:
received: 17 May 2021
accepted: 10 Nov 2021
entrez: 30 Nov 2021
pubmed: 1 Dec 2021
medline: 1 Dec 2021
Status: epublish

Abstract

Clinical data sets have very special properties that pose many challenges for machine learning. They typically show a high class imbalance, have a small number of samples and a large number of parameters, and contain missing values. While feature selection approaches and imputation techniques address the dimensionality and missing-value problems, class imbalance is typically addressed using augmentation techniques. However, these techniques have been developed for big data analytics, and their suitability for clinical data sets is unclear. This study analyzed different augmentation techniques for use in clinical data sets and subsequent machine learning-based classification. Gaussian Noise Up-Sampling (GNUS) turned out to be generally, though not always, as good as SMOTE and ADASYN, and even outperformed them on some data sets. However, in some cases augmentation did not improve classification at all.
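As a rough illustration of the idea behind Gaussian noise up-sampling, the following Python sketch draws minority-class samples with replacement and perturbs each feature with zero-mean Gaussian noise. The abstract does not describe the procedure in detail, so the function name `gaussian_noise_upsample`, the `noise_scale` parameter, and the choice to scale the noise by the per-feature standard deviation of the minority class are assumptions made here for illustration, not the authors' implementation.

```python
import random
import statistics


def gaussian_noise_upsample(rows, labels, minority_label, n_new,
                            noise_scale=0.1, seed=0):
    """Append n_new noisy copies of minority-class rows (illustrative sketch).

    noise_scale is a hypothetical parameter: the noise standard deviation
    for each feature is noise_scale times that feature's standard deviation
    within the minority class.
    """
    rng = random.Random(seed)
    minority = [r for r, lab in zip(rows, labels) if lab == minority_label]
    if not minority:
        raise ValueError("no samples with the requested minority label")

    # Per-feature noise level, derived from the minority class only.
    sigmas = [noise_scale * statistics.pstdev(col) for col in zip(*minority)]

    new_rows = []
    for _ in range(n_new):
        base = rng.choice(minority)  # sample an existing row with replacement
        new_rows.append([v + rng.gauss(0.0, s) for v, s in zip(base, sigmas)])

    # Original data first, synthetic minority rows appended at the end.
    return list(rows) + new_rows, list(labels) + [minority_label] * n_new


# Toy data: four majority rows (label 0) and two minority rows (label 1).
X = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [1.1, 10.5],
     [3.0, 20.0], [3.2, 21.0]]
y = [0, 0, 0, 0, 1, 1]

X_aug, y_aug = gaussian_noise_upsample(X, y, minority_label=1, n_new=2)
```

In a classification experiment, a step like this would normally be applied to the training portion of each split only, so that synthetic samples do not leak into the evaluation data.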

Identifiers

pubmed: 34844620
doi: 10.1186/s13040-021-00283-6
pii: 10.1186/s13040-021-00283-6
pmc: PMC8628399

Publication types

Journal Article

Languages

eng

Pagination

49

Grants

Agency: LOEWE
ID: Diffusible Signals

Copyright information

© 2021. The Author(s).

Authors

Jacqueline Beinecke (J)

Department of Mathematics and Computer Science, Philipps-University of Marburg, Hans-Meerwein-Str. 6, 35043, Marburg, Germany.

Dominik Heider (D)

Department of Mathematics and Computer Science, Philipps-University of Marburg, Hans-Meerwein-Str. 6, 35043, Marburg, Germany. dominik.heider@uni-marburg.de.

MeSH classifications