The impact of imputation quality on machine learning classifiers for datasets with missing values.


Journal

Communications medicine
ISSN: 2730-664X
Abbreviated title: Commun Med (Lond)
Country: England
NLM ID: 9918250414506676

Publication information

Publication date:
06 Oct 2023
History:
received: 18 07 2022
accepted: 13 09 2023
medline: 7 10 2023
pubmed: 7 10 2023
entrez: 6 10 2023
Status: epublish

Abstract sections

BACKGROUND
Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance.
METHODS
We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.
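
As an illustration of the idea behind a sliced Wasserstein discrepancy, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how their class of scores is constructed, so everything here is an assumption: the function name `sliced_wasserstein`, the number of random projections, the quantile-grid approximation of the one-dimensional Wasserstein-1 distance, and the toy mean-imputation example.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=200, n_quantiles=101, seed=0):
    """Approximate sliced Wasserstein-1 distance between two samples.

    Both samples are projected onto random unit directions; along each
    direction the 1D Wasserstein-1 distance is approximated by the mean
    absolute difference between matching quantiles. (Hypothetical helper,
    written for illustration only.)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Random directions, normalised to lie on the unit sphere.
    directions = rng.normal(size=(n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Project both samples: result has shape (n_samples, n_projections).
    proj_x = x @ directions.T
    proj_y = y @ directions.T

    # Compare the projected distributions on a common quantile grid.
    grid = np.linspace(0.0, 1.0, n_quantiles)
    q_x = np.quantile(proj_x, grid, axis=0)
    q_y = np.quantile(proj_y, grid, axis=0)
    return float(np.mean(np.abs(q_x - q_y)))


# Toy example: mean imputation of one feature with 40% of values missing.
rng = np.random.default_rng(1)
x_true = rng.normal(size=(500, 3))
missing = rng.random(500) < 0.4

x_mean = x_true.copy()
x_mean[missing, 0] = x_true[~missing, 0].mean()

score_mean = sliced_wasserstein(x_true, x_mean)
print(f"discrepancy after mean imputation: {score_mean:.4f}")
```

A score of zero means the projected distributions agree on the quantile grid; larger values indicate that the imputed data departs further from the reference distribution. In this sketch the reference is the fully observed data, which is only available in simulation.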
RESULTS
The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.
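
The dependence on test-set missingness can be explored with a toy simulation. The sketch below is an assumption-laden illustration, not a reproduction of the study: the synthetic two-class Gaussian data, the nearest-centroid classifier, mean imputation with training means, the completely-at-random missingness mechanism, and the helper names `make_data` and `accuracy_at` are all hypothetical choices made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n, d=5, shift=1.0):
    """Two Gaussian classes whose means differ by `shift` in every feature."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + shift * y[:, None]
    return x, y


x_train, y_train = make_data(1000)
x_test, y_test = make_data(1000)

# Nearest-centroid classifier fitted on the complete training data.
centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in (0, 1)])
train_means = x_train.mean(axis=0)


def accuracy_at(rate):
    """Test accuracy when a fraction `rate` of test entries is missing
    completely at random and replaced by the training feature means."""
    mask = rng.random(x_test.shape) < rate
    x_imputed = np.where(mask, train_means, x_test)
    dists = ((x_imputed[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == y_test).mean())


results = {rate: accuracy_at(rate) for rate in (0.0, 0.25, 0.5, 0.75)}
for rate, acc in results.items():
    print(f"test missingness {rate:.0%}: accuracy {acc:.3f}")
```

In this setting each imputed entry carries no class information, so accuracy is expected to fall as the test missingness rate rises. The size of the drop depends entirely on the assumed data and imputation method.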
CONCLUSIONS
It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.

Other abstracts

Type: plain-language-summary (eng)
Many artificial intelligence (AI) methods aim to classify samples of data into groups, e.g., patients with disease vs. those without. This often requires datasets to be complete, i.e., that all data has been collected for all samples. However, in clinical practice this is often not the case and some data can be missing. One solution is to ‘complete’ the dataset using a technique called imputation to replace those missing values. However, assessing how well the imputation method performs is challenging. In this work, we demonstrate why people should care about imputation, develop a new method for assessing imputation quality, and demonstrate that if we build AI models on poorly imputed data, the model can give different results to those we would hope for. Our findings may improve the utility and quality of AI models in the clinic.

Identifiers

pubmed: 37803172
doi: 10.1038/s43856-023-00356-z
pii: 10.1038/s43856-023-00356-z
pmc: PMC10558448

Publication types

Journal Article

Languages

eng

Pagination

139

Investigators

Ian Selby (I)
Anna Breger (A)
Jonathan R Weir-McCall (JR)
Effrossyni Gkrania-Klotsas (E)
Anna Korhonen (A)
Emily Jefferson (E)
Georg Langs (G)
Guang Yang (G)
Helmut Prosch (H)
Judith Babar (J)
Lorena Escudero Sánchez (L)
Marcel Wassin (M)
Markus Holzer (M)
Nicholas Walton (N)
Pietro Lió (P)

Copyright information

© 2023. Springer Nature Limited.

Authors

Tolou Shadbahr (T)

Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland.

Michael Roberts (M)

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK. michael.roberts@maths.cam.ac.uk.
Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK. michael.roberts@maths.cam.ac.uk.

Jan Stanczuk (J)

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK.

Julian Gilbey (J)

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK.

Philip Teare (P)

Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK.

Sören Dittmer (S)

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK.
ZeTeM, University of Bremen, Bremen, Germany.

Matthew Thorpe (M)

Department of Mathematics, University of Manchester, Manchester, UK.

Ramon Viñas Torné (RV)

Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.

Evis Sala (E)

Department of Radiology, University of Cambridge, Cambridge, UK.

Pietro Lió (P)

Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.

Mishal Patel (M)

Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK.
Clinical Pharmacology & Safety Sciences, AstraZeneca, Cambridge, UK.

Jacobus Preller (J)

Addenbrooke's Hospital, Cambridge University Hospitals NHS Trust, Cambridge, UK.

James H F Rudd (JHF)

Department of Medicine, University of Cambridge, Cambridge, UK.

Tuomas Mirtti (T)

Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland.
Department of Pathology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland.
iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland.

Antti Sakari Rannikko (AS)

Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland.
iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland.
Department of Urology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland.

John A D Aston (JAD)

Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK.

Jing Tang (J)

Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland.

Carola-Bibiane Schönlieb (CB)

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK.

MeSH classifications