Fast and accurate out-of-core PCA framework for large scale biobank data.
Journal
Genome research
ISSN: 1549-5469
Abbreviated title: Genome Res
Country: United States
NLM ID: 9518021
Publication information
Publication date:
09 2023
History:
received: 21 11 2022
accepted: 18 08 2023
pmc-release: 01 03 2024
medline: 23 10 2023
pubmed: 25 8 2023
entrez: 24 8 2023
Status:
ppublish
Abstract
Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and for uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm, implemented in PCAone, featuring a window-based optimization scheme that accelerates convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations of the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets from different fields, we show the advantage of PCAone over existing methods. The new algorithm is significantly faster while maintaining accuracy comparable to that of the slower IRAM method. Notably, our analyses of the UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h, using <20 GB of memory and 20 CPU threads. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
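For readers unfamiliar with RSVD, the following is a minimal sketch of the generic randomized SVD with power iterations, on which methods of this kind build. It is not the window-based scheme proposed in the paper, nor PCAone's implementation; the function name `randomized_svd` and its parameters (`oversample`, `n_power_iter`, `seed`) are hypothetical, and the matrix is assumed to fit in memory, whereas an out-of-core variant would stream it in blocks.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k SVD of A with a generic randomized scheme.

    Illustrative only: all names and defaults here are assumptions,
    not taken from PCAone.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # Sketch size: target rank plus a few extra columns for stability.
    l = min(k + oversample, n, m)

    # Project A onto a random subspace to sample its range.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((m, l)))

    # Power iterations sharpen the separation between singular values;
    # re-orthonormalizing at each step keeps the basis well conditioned.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Solve the small (l x m) problem exactly, then map back.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Usage: recover the spectrum of an exactly rank-5 matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 80))
U, s, Vt = randomized_svd(X, k=5)
```

Each power iteration requires two passes over the data matrix, which is the dominant cost when the matrix must be read from disk; reducing the number of such passes is the usual motivation for refinements of this basic scheme.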
Identifiers
pubmed: 37620119
pii: gr.277525.122
doi: 10.1101/gr.277525.122
pmc: PMC10620046
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
1599-1608
Copyright information
© 2023 Li et al.; Published by Cold Spring Harbor Laboratory Press.