scTab: Scaling cross-tissue single-cell annotation models.


Journal

Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
04 Aug 2024
History:
received: 26 11 2023
accepted: 25 07 2024
medline: 5 8 2024
pubmed: 5 8 2024
entrez: 4 8 2024
Status: epublish

Abstract

Identifying cellular identities is a key use case in single-cell transcriptomics. While machine learning has been leveraged to automate cell annotation predictions for some time, there has been little progress in scaling neural networks to large data sets and in constructing models that generalize well across diverse tissues. Here, we propose scTab, an automated cell type prediction model specific to tabular data, and train it using a novel data augmentation scheme across a large corpus of single-cell RNA-seq observations (22.2 million cells). In this context, we show that cross-tissue annotation requires nonlinear models and that the performance of scTab scales both in terms of training dataset size and model size. Additionally, we show that the proposed data augmentation schema improves model generalization. In summary, we introduce a de novo cell type prediction model for single-cell RNA-seq data that can be trained across a large-scale collection of curated datasets and demonstrate the benefits of using deep learning methods in this paradigm.

Identifiers

pubmed: 39098889
doi: 10.1038/s41467-024-51059-5
pii: 10.1038/s41467-024-51059-5

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

6611

Copyright information

© 2024. The Author(s).

Authors

Felix Fischer (F)

Department of Computational Health, Institute of Computational Biology, Helmholtz, Munich, Germany.
School of Computing, Information and Technology, Technical University of Munich, Munich, Germany.

David S Fischer (DS)

Department of Computational Health, Institute of Computational Biology, Helmholtz, Munich, Germany.
Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.

Roman Mukhin (R)

eBook Applications LLC, Boston, MA, 02467, USA.

Andrey Isaev (A)

eBook Applications LLC, Boston, MA, 02467, USA.

Evan Biederstedt (E)

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA.
Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA, 02129, USA.
Krantz Family Center for Cancer Research, Massachusetts General Hospital, Boston, MA, 02114, USA.

Alexandra-Chloé Villani (AC)

Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA, 02129, USA.
Krantz Family Center for Cancer Research, Massachusetts General Hospital, Boston, MA, 02114, USA.
Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.

Fabian J Theis (FJ)

Department of Computational Health, Institute of Computational Biology, Helmholtz, Munich, Germany. fabian.theis@helmholtz-munich.de.
School of Computing, Information and Technology, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-munich.de.
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany. fabian.theis@helmholtz-munich.de.
