Federated horizontally partitioned principal component analysis for biomedical applications.


Journal

Bioinformatics advances
ISSN: 2635-0041
Abbreviated title: Bioinform Adv
Country: England
NLM ID: 9918282081306676

Publication information

Publication date:
2022
History:
received: 13 09 2021
revised: 07 04 2022
entrez: 26 1 2023
pubmed: 27 1 2023
medline: 27 1 2023
Status: epublish

Abstract

Federated learning enables privacy-preserving machine learning in the medical domain because the sensitive patient data remain with the owner and only parameters are exchanged between the data holders. The federated scenario introduces specific challenges related to the decentralized nature of the data, such as batch effects and differences in study population between the sites. Here, we investigate the challenges of moving classical analysis methods to the federated domain, specifically principal component analysis (PCA), a versatile and widely used tool that often serves as an initial step in machine learning and visualization workflows. We provide implementations of different federated PCA algorithms and evaluate their accuracy on high-dimensional biological data, using realistic sample distributions over multiple data sites, as well as their ability to preserve downstream analyses. Federated subspace iteration converges to the centralized solution even for unfavorable data distributions, while approximate methods introduce error. Larger sample sizes at the study sites lead to better accuracy of the approximate methods. Approximate methods may be sufficient for coarse data visualization, but are vulnerable to outliers and batch effects. Before the analysis, the PCA algorithm and the number of eigenvectors should be chosen carefully to avoid unnecessary communication overhead. Simulation code and notebooks for federated PCA can be found at https://gitlab.com/roettgerlab/federatedPCA; the code for the federated app is available at https://github.com/AnneHartebrodt/fc-federated-pca. Supplementary data are available at
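
To make the idea of federated subspace iteration on horizontally partitioned data more concrete, the following is a minimal single-process sketch in Python with NumPy. It is not the authors' implementation (that is in the repositories linked in the abstract). It assumes that every site holds different samples (rows) of the same features (columns), that sites share only feature-wise sums, sample counts, and d x k update matrices with an aggregator, and that a fixed number of iterations is enough. The function names `federated_center` and `federated_subspace_iteration` and all parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def federated_center(site_data):
    """Center every site's data on the global feature means.

    Assumption: each site shares only its per-feature sums and its
    sample count, never individual rows.
    """
    total = sum(X.sum(axis=0) for X in site_data)
    n_samples = sum(X.shape[0] for X in site_data)
    global_mean = total / n_samples
    return [X - global_mean for X in site_data]


def federated_subspace_iteration(site_data, k, n_iter=200, seed=0):
    """Estimate the top-k principal subspace of row-partitioned data.

    site_data: list of (n_i x d) arrays, already centered globally.
    Returns a (d x k) matrix with orthonormal columns.
    """
    rng = np.random.default_rng(seed)
    d = site_data[0].shape[1]
    # Aggregator draws a random orthonormal starting basis.
    G, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(n_iter):
        # Each site multiplies the current basis by its local
        # covariance contribution X_i^T X_i; only a d x k matrix
        # would leave the site.
        local_updates = [X.T @ (X @ G) for X in site_data]
        # Aggregator sums the updates, which equals X^T X G for the
        # pooled data, and re-orthonormalizes.
        H = np.sum(local_updates, axis=0)
        G, _ = np.linalg.qr(H)
    return G


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic data with three dominant directions (illustration only).
    scales = np.array([10.0, 8.0, 6.0] + [1.0] * 17)
    X = rng.standard_normal((300, 20)) * scales
    # Deliberately unbalanced sites: 200, 80 and 20 samples.
    sites = federated_center(np.split(X, [200, 280]))
    G = federated_subspace_iteration(sites, k=3)
    print(G.shape)
```

Because the summed local updates equal the update computed on the pooled data, this sketch follows the same iteration as centralized subspace iteration, which is consistent with the abstract's statement that the federated variant converges to the centralized solution. The communication cost per round grows with the number of eigenvectors k, which illustrates why that number should be chosen carefully.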

Identifiers

pubmed: 36699354
doi: 10.1093/bioadv/vbac026
pii: vbac026
pmc: PMC9710634

Publication types

Journal Article

Languages

eng

Pagination

vbac026

Copyright information

© The Author(s) 2022. Published by Oxford University Press.

Authors

Anne Hartebrodt (A)

Department of Mathematics and Computer Science, University of Southern Denmark, Odense 5230, Denmark.

Richard Röttger (R)

Department of Mathematics and Computer Science, University of Southern Denmark, Odense 5230, Denmark.

MeSH classifications