The impact of imputation quality on machine learning classifiers for datasets with missing values.

Journal

Communications medicine

ISSN: 2730-664X

Abbreviated title: Commun Med (Lond)

Country: England

NLM ID: 9918250414506676

Publication information

Publication date:
06 Oct 2023

History:

received: 18 Jul 2022

accepted: 13 Sep 2023

medline: 07 Oct 2023

pubmed: 07 Oct 2023

entrez: 06 Oct 2023

Status: epublish

Abstract

BACKGROUND

Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance.

METHODS

We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.

RESULTS

The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.

CONCLUSIONS

It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.

Plain-language summary

Many artificial intelligence (AI) methods aim to classify samples of data into groups, e.g., patients with disease vs. those without. This often requires datasets to be complete, i.e., that all data has been collected for all samples. However, in clinical practice this is often not the case and some data can be missing. One solution is to ‘complete’ the dataset using a technique called imputation to replace those missing values. However, assessing how well the imputation method performs is challenging. In this work, we demonstrate why people should care about imputation, develop a new method for assessing imputation quality, and demonstrate that if we build AI models on poorly imputed data, the model can give different results to those we would hope for. Our findings may improve the utility and quality of AI models in the clinic.
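The abstract refers to discrepancy scores based on the sliced Wasserstein distance without defining them. As a rough illustration only, the sketch below estimates a generic sliced 1-Wasserstein distance between two equally sized samples by projecting them onto random directions and comparing the sorted projections. It is not the authors' score: the function name `sliced_wasserstein`, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def sliced_wasserstein(x, y, n_projections=100, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    x and y are equally sized lists of feature vectors (lists of floats)
    with the same number of features. Hypothetical helper, not from the paper.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Project both samples onto the direction and sort the projections.
        px = sorted(sum(a * b for a, b in zip(row, d)) for row in x)
        py = sorted(sum(a * b for a, b in zip(row, d)) for row in y)
        # For equal sample sizes, the 1-D Wasserstein-1 distance is the
        # mean absolute gap between matching order statistics.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections


# Toy data (assumed): a complete sample, a fresh sample from the same
# distribution, and a copy whose second feature is collapsed to a constant,
# mimicking what a mean-style imputation does to a feature's distribution.
rng = random.Random(1)
original = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
resampled = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
collapsed = [[row[0], 0.0] for row in original]

print("same distribution:", sliced_wasserstein(original, resampled))
print("collapsed feature:", sliced_wasserstein(original, collapsed))
```

In this toy setting the collapsed sample should score a larger distance than the resampled one, which is the kind of distributional mismatch that a per-value error measure can miss.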

Other abstracts

Type: plain-language-summary (eng)

Identifiers

DOI: 10.1038/s43856-023-00356-z

PMID: 37803172

PMC: PMC10558448

pii: 10.1038/s43856-023-00356-z

Publication types

Journal Article

Languages

eng

Pagination

139

Investigators

Ian Selby (I)

Anna Breger (A)

Jonathan R Weir-McCall (JR)

Effrossyni Gkrania-Klotsas (E)

Anna Korhonen (A)

Emily Jefferson (E)

Georg Langs (G)

Guang Yang (G)

Helmut Prosch (H)

Judith Babar (J)

Lorena Escudero Sánchez (L)

Marcel Wassin (M)

Markus Holzer (M)

Nicholas Walton (N)

Pietro Lió (P)

Authors

Tolou Shadbahr (T)

Michael Roberts (M)

Jan Stanczuk (J)

Julian Gilbey (J)

Philip Teare (P)

Sören Dittmer (S)

Matthew Thorpe (M)

Ramon Viñas Torné (RV)

Evis Sala (E)

Pietro Lió (P)

Mishal Patel (M)

Jacobus Preller (J)

James H F Rudd (JHF)

Tuomas Mirtti (T)

Antti Sakari Rannikko (AS)

John A D Aston (JAD)

Jing Tang (J)

Carola-Bibiane Schönlieb (CB)
