Dealing with confounders and outliers in classification medical studies: The Autism Spectrum Disorders case study.

Keywords: Autism Spectrum Disorders; Autoencoder; Confounders; Confounding Index; MRI; Machine learning; Outliers; Reproducibility

Journal

Artificial intelligence in medicine
ISSN: 1873-2860
Abbreviated title: Artif Intell Med
Country: Netherlands
NLM ID: 8915031

Publication information

Publication date:
08 2020
History:
received: 19 07 2019
revised: 13 12 2019
accepted: 02 07 2020
entrez: 25 9 2020
pubmed: 26 9 2020
medline: 19 8 2021
Status: ppublish

Abstract

Machine learning (ML) approaches have been widely applied to medical data to find reliable classifiers that improve diagnosis and detect candidate biomarkers of a disease. However, as a powerful, multivariate, data-driven approach, ML can be misled by biases and outliers in the training set and find sample-dependent classification patterns. This often occurs in biomedical applications, where outliers and biases are very common because data are scarce, heterogeneous, and acquired through complex processes. In this work we present a new workflow for ML-based biomedical research that maximizes the generalizability of the classification. The workflow relies on two data selection tools: an autoencoder to identify outliers, and the Confounding Index to understand which characteristics of the sample can mislead classification. As a case study we adopt the controversial research on extracting brain structural biomarkers of Autism Spectrum Disorders (ASD) from magnetic resonance images. A classifier trained on a dataset of 86 subjects, selected using this framework, obtained an area under the receiver operating characteristic curve of 0.79. The feature pattern identified by this classifier still captures the mean differences between the ASD and Typically Developing Control classes in 1460 new subjects in the same age range as the training set, thus providing new insights into the brain characteristics of ASD. We show that the proposed workflow makes it possible to find generalizable patterns even when the dataset is limited, whereas skipping the two steps above and using a larger but poorly designed training set would have produced a sample-dependent classifier.
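The outlier-identification step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe their autoencoder, so this example assumes a linear autoencoder (whose optimal reconstruction under squared loss coincides with PCA, computed here by SVD) and flags samples whose reconstruction error is unusually large. The function names, the synthetic data, and the median-plus-MAD threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def reconstruction_errors(X, n_components):
    """Per-sample squared reconstruction error of a linear autoencoder.

    Assumption: a linear bottleneck of size n_components trained with
    squared loss reconstructs the data like PCA does, so the weights are
    taken from an SVD instead of gradient-based training.
    """
    Xc = X - X.mean(axis=0)                 # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]                   # encoder weights; decoder is W.T
    codes = Xc @ W.T                        # encode into the bottleneck
    recon = codes @ W                       # decode back to feature space
    return np.sum((Xc - recon) ** 2, axis=1)


def flag_outliers(errors, n_mads=5.0):
    """Flag samples whose error exceeds median + n_mads * MAD (hypothetical rule)."""
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    return errors > med + n_mads * mad


# Synthetic demonstration (made-up data, not from the study):
# 86 samples lying close to a 2-D subspace of a 10-D feature space.
rng = np.random.default_rng(0)
latent = rng.normal(scale=5.0, size=(86, 2))
basis = rng.normal(size=(2, 10))
X = latent @ basis + rng.normal(scale=0.1, size=(86, 10))
X[:2] += rng.normal(scale=3.0, size=(2, 10))   # corrupt the first two samples

errors = reconstruction_errors(X, n_components=2)
flags = flag_outliers(errors)
```

In this sketch the two corrupted samples sit far from the subspace that describes the rest of the data, so their reconstruction error dominates. A nonlinear autoencoder would follow the same pattern, with a trained network replacing the SVD step.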

Identifiers

pubmed: 32972657
pii: S0933-3657(19)30608-6
doi: 10.1016/j.artmed.2020.101926

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

101926

Copyright information

Copyright © 2020 Elsevier B.V. All rights reserved.

Authors

Elisa Ferrari (E)

Scuola Normale Superiore, Pisa, Italy. Electronic address: elisa.ferrari@sns.it.

Paolo Bosco (P)

IRCCS Fondazione Stella Maris, Pisa, Italy.

Sara Calderoni (S)

IRCCS Fondazione Stella Maris, Pisa, Italy; Department of Clinical and Experimental Medicine, University of Pisa, Pisa, Italy.

Piernicola Oliva (P)

University of Sassari, Sassari, Italy; INFN - Cagliari Division, Italy.

Letizia Palumbo (L)

INFN - Pisa Division, Italy.

Giovanna Spera (G)

INFN - Pisa Division, Italy.

Maria Evelina Fantacci (ME)

INFN - Pisa Division, Italy; Department of Physics, University of Pisa, Pisa, Italy.

Alessandra Retico (A)

INFN - Pisa Division, Italy.
