Distributed learning on 20 000+ lung cancer patients - The Personal Health Train.
Big data
Distributed learning
FAIR data
Federated learning
Lung cancer
Machine learning
Prediction modeling
Survival analysis
Journal
Radiotherapy and oncology : journal of the European Society for Therapeutic Radiology and Oncology
ISSN: 1879-0887
Titre abrégé: Radiother Oncol
Pays: Ireland
ID NLM: 8407192
Informations de publication
Date de publication:
03 2020
03 2020
Historique:
received:
22
06
2019
revised:
18
11
2019
accepted:
19
11
2019
pubmed:
9
1
2020
medline:
15
4
2021
entrez:
9
1
2020
Statut:
ppublish
Résumé
Access to healthcare data is indispensable for scientific progress and innovation. Sharing healthcare data is time-consuming and notoriously difficult due to privacy and regulatory concerns. The Personal Health Train (PHT) provides a privacy-by-design infrastructure connecting FAIR (Findable, Accessible, Interoperable, Reusable) data sources and allows distributed data analysis and machine learning. Patient data never leaves a healthcare institute. Lung cancer patient-specific databases (tumor staging and post-treatment survival information) of oncology departments were translated according to a FAIR data model and stored locally in a graph database. Software was installed locally to enable deployment of distributed machine learning algorithms via a central server. Algorithms (MATLAB, code and documentation publicly available) are patient privacy-preserving as only summary statistics and regression coefficients are exchanged with the central server. A logistic regression model to predict post-treatment two-year survival was trained and evaluated by receiver operating characteristic curves (ROC), root mean square prediction error (RMSE) and calibration plots. In 4 months, we connected databases with 23 203 patient cases across 8 healthcare institutes in 5 countries (Amsterdam, Cardiff, Maastricht, Manchester, Nijmegen, Rome, Rotterdam, Shanghai) using the PHT. Summary statistics were computed across databases. A distributed logistic regression model predicting post-treatment two-year survival was trained on 14 810 patients treated between 1978 and 2011 and validated on 8 393 patients treated between 2012 and 2015. The PHT infrastructure demonstrably overcomes patient privacy barriers to healthcare data sharing and enables fast data analyses across multiple institutes from different countries with different regulatory regimens. This infrastructure promotes global evidence-based medicine while prioritizing patient privacy.
Sections du résumé
BACKGROUND AND PURPOSE
Access to healthcare data is indispensable for scientific progress and innovation. Sharing healthcare data is time-consuming and notoriously difficult due to privacy and regulatory concerns. The Personal Health Train (PHT) provides a privacy-by-design infrastructure connecting FAIR (Findable, Accessible, Interoperable, Reusable) data sources and allows distributed data analysis and machine learning. Patient data never leaves a healthcare institute.
MATERIALS AND METHODS
Lung cancer patient-specific databases (tumor staging and post-treatment survival information) of oncology departments were translated according to a FAIR data model and stored locally in a graph database. Software was installed locally to enable deployment of distributed machine learning algorithms via a central server. Algorithms (MATLAB, code and documentation publicly available) are patient privacy-preserving as only summary statistics and regression coefficients are exchanged with the central server. A logistic regression model to predict post-treatment two-year survival was trained and evaluated by receiver operating characteristic curves (ROC), root mean square prediction error (RMSE) and calibration plots.
RESULTS
In 4 months, we connected databases with 23 203 patient cases across 8 healthcare institutes in 5 countries (Amsterdam, Cardiff, Maastricht, Manchester, Nijmegen, Rome, Rotterdam, Shanghai) using the PHT. Summary statistics were computed across databases. A distributed logistic regression model predicting post-treatment two-year survival was trained on 14 810 patients treated between 1978 and 2011 and validated on 8 393 patients treated between 2012 and 2015.
CONCLUSION
The PHT infrastructure demonstrably overcomes patient privacy barriers to healthcare data sharing and enables fast data analyses across multiple institutes from different countries with different regulatory regimens. This infrastructure promotes global evidence-based medicine while prioritizing patient privacy.
Identifiants
pubmed: 31911366
pii: S0167-8140(19)33489-9
doi: 10.1016/j.radonc.2019.11.019
pii:
doi:
Types de publication
Journal Article
Research Support, Non-U.S. Gov't
Langues
eng
Sous-ensembles de citation
IM
Pagination
189-200Informations de copyright
Copyright © 2019 The Authors. Published by Elsevier B.V. All rights reserved.