Deep feature batch correction using ComBat for machine learning applications in computational pathology.

Keywords: Artificial intelligence; Batch effects; Computational pathology; Histopathology; The Cancer Genome Atlas (TCGA)

Journal

Journal of pathology informatics
ISSN: 2229-5089
Abbreviated title: J Pathol Inform
Country: United States
NLM ID: 101528849

Publication information

Publication date:
Dec 2024
History:
received: 8 Jul 2024
revised: 2 Sep 2024
accepted: 4 Sep 2024
medline: 14 Oct 2024
pubmed: 14 Oct 2024
entrez: 14 Oct 2024
Status: epublish

Abstract

Background
Developing artificial intelligence (AI) models for digital pathology requires large datasets from multiple sources. However, without careful implementation, AI models risk learning confounding site-specific features in datasets instead of clinically relevant information, leading to overestimated performance, poor generalizability to real-world data, and potential misdiagnosis.
Methods
Whole-slide images (WSIs) from The Cancer Genome Atlas (TCGA) colon (COAD) and stomach adenocarcinoma datasets were selected for inclusion in this study. Patch embeddings were obtained using three feature extraction models, followed by ComBat harmonization. Attention-based multiple instance learning models were trained to predict tissue-source site (TSS), as well as clinical and genetic attributes, using raw, Macenko-normalized, and ComBat-harmonized patch embeddings.
Results
TSS was predicted with high performance (AUROC > 0.95) with all three feature extraction models. ComBat harmonization significantly reduced the AUROC for TSS prediction, with mean AUROCs dropping to approximately 0.5 for most models, indicating successful mitigation of batch effects (e.g., CCL-ResNet50 in TCGA-COAD: Pre-ComBat AUROC = 0.960, Post-ComBat AUROC = 0.506).
Conclusion
ComBat harmonization of deep learning-derived histology features effectively reduces the risk of AI models learning confounding features in WSIs, ensuring more reliable performance estimates. This approach is promising for the integration of large-scale digital pathology datasets.
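
The idea behind the harmonization step and the TSS check can be illustrated with a small sketch. This is not the authors' pipeline: it is a simplified location/scale adjustment that aligns each site's per-feature mean and standard deviation to pooled values, whereas ComBat additionally applies empirical Bayes shrinkage to the per-batch estimates and can preserve covariates of interest. The function names `location_scale_harmonize` and `auroc`, the synthetic two-site data, and the use of a single embedding dimension as a site score are all assumptions made for illustration; the study itself used attention-based multiple instance learning models on real patch embeddings.

```python
import numpy as np


def location_scale_harmonize(features, batch):
    """Align each batch's per-feature mean and std to pooled values.

    Simplified stand-in for ComBat (assumption): no empirical Bayes
    shrinkage and no covariates protected from the adjustment.
    """
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    batches = np.unique(batch)
    grand_mean = features.mean(axis=0)

    # Pooled within-batch standard deviation, computed from residuals
    # after removing each batch's own mean.
    residuals = features.copy()
    for b in batches:
        idx = batch == b
        residuals[idx] -= features[idx].mean(axis=0)
    pooled_std = residuals.std(axis=0)

    harmonized = np.empty_like(features)
    for b in batches:
        idx = batch == b
        batch_mean = features[idx].mean(axis=0)
        batch_std = features[idx].std(axis=0)
        batch_std[batch_std == 0] = 1.0  # avoid division by zero
        z = (features[idx] - batch_mean) / batch_std
        harmonized[idx] = z * pooled_std + grand_mean
    return harmonized


def auroc(scores, labels):
    """Rank-based AUROC: share of (positive, negative) pairs ordered
    correctly, with ties counted as half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


# Synthetic embeddings from two hypothetical sites; site 1 carries an
# artificial shift and rescaling that mimics a batch effect.
rng = np.random.default_rng(0)
n_per_site, n_features = 500, 8
site = np.repeat([0, 1], n_per_site)
embeddings = rng.normal(size=(2 * n_per_site, n_features))
embeddings[site == 1] = embeddings[site == 1] * 1.5 + 4.0

harmonized = location_scale_harmonize(embeddings, site)

# Use one embedding dimension as a crude site score.
auroc_before = auroc(embeddings[:, 0], site == 1)
auroc_after = auroc(harmonized[:, 0], site == 1)
print(f"site AUROC before: {auroc_before:.3f}, after: {auroc_after:.3f}")
```

In this toy setting the site is easy to recover from the raw embeddings and close to chance after adjustment, which mirrors the pattern reported above for TSS prediction. A real analysis would use an established ComBat implementation and evaluate site predictability on held-out data.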

Identifiers

pubmed: 39398947
doi: 10.1016/j.jpi.2024.100396
pii: S2153-3539(24)00035-X
pmc: PMC11470259

Publication types

Journal Article

Languages

eng

Pagination

100396

Copyright information

© 2024 The Authors.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Pierre Murchan (P)

Department of Histopathology and Morbid Anatomy, Trinity Translational Medicine Institute, Trinity College Dublin, Dublin D08 W9RT, Ireland.
The SFI Centre for Research Training in Genomics Data Science, Dublin, Ireland.

Pilib Ó Broin (P)

The SFI Centre for Research Training in Genomics Data Science, Dublin, Ireland.
School of Mathematical & Statistical Sciences, University of Galway, Galway H91 TK33, Ireland.

Anne-Marie Baird (AM)

School of Medicine, Trinity Translational Medicine Institute, Trinity College Dublin, Dublin D02 A440, Ireland.

Orla Sheils (O)

School of Medicine, Trinity Translational Medicine Institute, Trinity College Dublin, Dublin D02 A440, Ireland.

Stephen P Finn (S)

Department of Histopathology and Morbid Anatomy, Trinity Translational Medicine Institute, Trinity College Dublin, Dublin D08 W9RT, Ireland.
Department of Histopathology, St. James's Hospital, James's Street, Dublin D08 X4RX, Ireland.

MeSH classifications