Effect of data harmonization of multicentric dataset in ASD/TD classification.
ABIDE
Autism spectrum disorder
Harmonization
Machine learning
Multi-site data
Journal
Brain informatics
ISSN: 2198-4018
Abbreviated title: Brain Inform
Country: Germany
NLM ID: 101673751
Publication information
Publication date:
25 Nov 2023
History:
received: 05 Jul 2023
accepted: 16 Oct 2023
medline: 26 Nov 2023
pubmed: 26 Nov 2023
entrez: 25 Nov 2023
Status:
epublish
Abstract
Machine Learning (ML) is nowadays an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular in the identification of brain correlates in neurological and neurodevelopmental disorders. ML requires datasets of appropriate size for training, which in neuroimaging are typically obtained by collecting data from multiple acquisition centers. However, analyzing large multicentric datasets can introduce bias due to differences between acquisition centers. ComBat harmonization is commonly used to address batch effects, but it can lead to data leakage when the entire dataset is used to estimate the model parameters. In this study, structural and functional MRI data from the Autism Brain Imaging Data Exchange (ABIDE) collection were used to classify subjects with Autism Spectrum Disorders (ASD) versus Typically Developing controls (TD). We compared the classical approach (external harmonization), in which harmonization is performed before the train/test split, with a harmonization calculated only on the train set (internal harmonization) and with the non-harmonized dataset. The results showed that harmonization using the whole dataset achieved higher discrimination performance, while non-harmonized data and harmonization using only the train set showed similar results, for both structural and connectivity features. We also showed that the higher performance of external harmonization is not due to the larger sample available for estimating the model; hence, the improved performance obtained with the entire dataset may be ascribed to data leakage. To prevent this leakage, it is recommended to define the harmonization model using only the train set.
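The internal harmonization described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it swaps in a simple per-site location-scale adjustment for ComBat (no empirical Bayes shrinkage, no preservation of biological covariates), and it runs on synthetic data. The function names `fit_site_params` and `apply_site_params`, the site offsets, and the 80/20 split are all assumptions made for the example. The point it shows is only the ordering: the harmonization parameters are estimated on the train set and then applied unchanged to the test set.

```python
import numpy as np

def fit_site_params(X, sites):
    """Estimate pooled and per-site mean/std per feature (simplified stand-in for ComBat)."""
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0) + 1e-8
    params = {}
    for s in np.unique(sites):
        Xs = X[sites == s]
        params[s] = (Xs.mean(axis=0), Xs.std(axis=0) + 1e-8)
    return grand_mean, grand_std, params

def apply_site_params(X, sites, fitted):
    """Map each site's features onto the pooled location and scale using fitted parameters."""
    grand_mean, grand_std, params = fitted
    X_out = np.empty_like(X, dtype=float)
    for s in np.unique(sites):
        mask = sites == s
        mu, sd = params[s]  # every test site must have been seen in the train set
        X_out[mask] = (X[mask] - mu) / sd * grand_std + grand_mean
    return X_out

# Synthetic multi-site data: three hypothetical sites with additive site offsets.
rng = np.random.default_rng(0)
n_per_site, n_feat = 100, 5
sites = np.repeat(np.array([0, 1, 2]), n_per_site)
site_shift = np.array([0.0, 2.0, -1.5])
X = rng.normal(size=(sites.size, n_feat)) + site_shift[sites][:, None]

# Train/test split performed BEFORE any harmonization.
idx = rng.permutation(sites.size)
n_test = sites.size // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Internal harmonization: fit on the train set only, then apply to both sets.
fitted = fit_site_params(X[train_idx], sites[train_idx])
X_train_h = apply_site_params(X[train_idx], sites[train_idx], fitted)
X_test_h = apply_site_params(X[test_idx], sites[test_idx], fitted)
```

External harmonization would instead call `fit_site_params(X, sites)` on the full dataset before splitting, so that test subjects contribute to the estimated parameters; that is the step the abstract identifies as a possible source of leakage.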
Identifiers
pubmed: 38006422
doi: 10.1186/s40708-023-00210-x
pii: 10.1186/s40708-023-00210-x
pmc: PMC10676338
Publication types
Journal Article
Languages
eng
Pagination
32
Copyright information
© 2023. The Author(s).