From code sharing to sharing of implementations: Advancing reproducible AI development for medical imaging through federated testing.

Keywords: Artificial intelligence; Deployment environment; Federated testing; PET segmentation; Postprocessing; Preprocessing; Reproducibility

Journal

Journal of medical imaging and radiation sciences
ISSN: 1876-7982
Abbreviated title: J Med Imaging Radiat Sci
Country: United States
NLM ID: 101469694

Publication information

Publication date:
28 Aug 2024
History:
received: 12 May 2024
revised: 22 Jul 2024
accepted: 1 Aug 2024
medline: 31 Aug 2024
pubmed: 31 Aug 2024
entrez: 29 Aug 2024
Status: ahead of print

Abstract

BACKGROUND
The reproducibility crisis in AI research remains a significant concern. While code sharing has been acknowledged as a step toward addressing this issue, our focus extends beyond this paradigm. In this work, we explore "federated testing" as an avenue for advancing reproducible AI research and development, especially in medical imaging. Unlike federated learning, where a model is developed and refined on data from different centers, federated testing involves models developed by one team being deployed and evaluated by others, addressing reproducibility across different implementations.
METHODS
Our study follows an exploratory design aimed at systematically evaluating the sources of discrepancy that arise when a shared medical imaging model is executed on the same input data at different sites, independent of any generalizability analysis. We distributed the same model code to multiple independent centers, monitoring execution in different runtime environments while considering various real-world scenarios for pre- and post-processing steps. We analyzed deployment infrastructure by comparing the impact of different computational resources (GPU vs. CPU) on model performance. To assess federated testing of AI models for medical imaging, we performed a comparative evaluation across centers, each with distinct pre- and post-processing steps and deployment environments, specifically targeting AI-driven positron emission tomography (PET) image segmentation. More specifically, we studied federated testing of an AI-based model for surrogate total metabolic tumor volume (sTMTV) segmentation in PET imaging: the algorithm, trained on maximum intensity projection (MIP) data, segments lymphoma regions and estimates sTMTV.
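By way of illustration, the minimal sketch below (not the authors' code) computes a coronal maximum intensity projection from a 3D PET SUV volume and measures the segmented area on that MIP. The threshold stands in for the shared model, and the calibration from MIP area to sTMTV is deliberately omitted because it is not specified here.

    # Minimal sketch (illustrative, synthetic data): coronal MIP from a
    # (z, y, x) SUV volume, then the area of a segmented MIP region.
    import numpy as np

    def coronal_mip(suv_volume: np.ndarray) -> np.ndarray:
        """Project a (z, y, x) SUV volume along the anterior-posterior axis."""
        return suv_volume.max(axis=1)

    def segmented_mip_area_cm2(mip_mask: np.ndarray, pixel_spacing_mm) -> float:
        """Area of the segmented MIP region in cm^2; the mapping from this
        area to sTMTV used by the authors is not reproduced here."""
        area_mm2 = float(mip_mask.sum()) * pixel_spacing_mm[0] * pixel_spacing_mm[1]
        return area_mm2 / 100.0

    suv = np.zeros((64, 64, 64))
    suv[20:30, 25:35, 30:40] = 8.0                   # synthetic "lesion"
    mip = coronal_mip(suv)
    mask = mip > 2.5                                 # placeholder for model inference
    print(segmented_mip_area_cm2(mask, (4.0, 4.0)))  # segmented MIP area in cm^2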
RESULTS
Our study reveals that relying solely on open-source code sharing does not guarantee reproducible results, due to variations in code execution, runtime environments, and incomplete input specifications. Deploying the segmentation model on local or virtual GPUs versus in Docker containers had no effect on reproducibility. However, significant sources of variability were found in data preparation and pre-/post-processing techniques for PET imaging. These findings underscore the limitations of code sharing alone in achieving consistent and accurate results in federated testing.
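To make the reported pre-/post-processing variability concrete, the sketch below (illustrative only, with synthetic data) shows how two centers resampling the same PET volume with different interpolation orders hand different inputs to an otherwise identical model.

    # Minimal sketch of one pre-processing variability source: the same scan,
    # resampled with different interpolation orders at two centers.
    import numpy as np
    from scipy.ndimage import zoom

    def resample_isotropic(volume, spacing_mm, target_mm=4.0, order=1):
        """Resample an isotropically spaced volume to target_mm voxels."""
        return zoom(volume, spacing_mm / target_mm, order=order)

    rng = np.random.default_rng(0)
    pet = rng.gamma(2.0, 1.5, size=(48, 48, 48))      # synthetic SUV-like volume

    center_a = resample_isotropic(pet, 3.0, order=1)  # linear interpolation
    center_b = resample_isotropic(pet, 3.0, order=3)  # cubic interpolation

    # Nonzero: the same scan yields different model inputs at the two centers.
    print(float(np.abs(center_a - center_b).max()))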
CONCLUSION
Achieving consistently precise results in federated testing requires more than just sharing models through open-source code. Comprehensive pipeline sharing, including pre- and post-processing steps, is essential. Cloud-based platforms that automate these processes can streamline AI model testing across diverse locations. Standardizing protocols and sharing complete pipelines can significantly enhance the robustness and reproducibility of AI models.
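One way to read "comprehensive pipeline sharing" is sketched below: a single entry point, supplied and versioned by the developing team, that fixes the order of pre-processing, projection, inference, and post-processing. All names are hypothetical and the code is a sketch of the idea, not the authors' implementation.

    # Minimal sketch of pipeline sharing (all names hypothetical): the
    # developers ship one entry point so every testing center runs the
    # pre-processing, inference, and post-processing steps identically.
    from typing import Callable
    import numpy as np

    def run_shared_pipeline(pet_volume: np.ndarray,
                            spacing_mm: float,
                            preprocess: Callable[[np.ndarray, float], np.ndarray],
                            project: Callable[[np.ndarray], np.ndarray],
                            predict: Callable[[np.ndarray], np.ndarray],
                            postprocess: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
        """Apply the shared steps in a fixed order; the callables are provided
        (and versioned) by the model developers, not re-implemented per center."""
        suv = preprocess(pet_volume, spacing_mm)
        mip = project(suv)
        mask = predict(mip)
        return postprocess(mask)

In practice such an entry point would be frozen inside a container image or exposed through a cloud platform, which is the automation the conclusion points to.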

Identifiers

pubmed: 39208523
pii: S1939-8654(24)00476-4
doi: 10.1016/j.jmir.2024.101745

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

101745

Copyright information

Copyright © 2024. Published by Elsevier Inc.

Authors

Fereshteh Yousefirizi (F)

Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, Canada. Electronic address: frizi@bccrc.ca.

Annudesh Liyanage (A)

Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, Canada; Department of Physics and Astronomy, University of British Columbia, Vancouver, Canada.

Ivan S Klyuzhin (IS)

Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, Canada.

Arman Rahmim (A)

Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, Canada; Department of Physics and Astronomy, University of British Columbia, Vancouver, Canada; Department of Radiology, University of British Columbia, Vancouver, BC, Canada; Department of Biomedical Engineering, University of British Columbia, Vancouver, Canada.

MeSH classifications