Improving replicability in single-cell RNA-Seq cell type discovery with Dune.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
24 May 2024
History:
received: 14 09 2023
accepted: 17 05 2024
medline: 25 5 2024
pubmed: 25 5 2024
entrez: 24 5 2024
Status: epublish

Abstract

Background: Single-cell transcriptome sequencing (scRNA-Seq) has allowed new types of investigations at unprecedented levels of resolution. Among the primary goals of scRNA-Seq is the classification of cells into distinct types. Many approaches build on the existing clustering literature to develop tools specific to single-cell data. However, almost all of these methods rely on heuristics or user-supplied parameters to control the number of clusters. This affects both the resolution of the clusters within the original dataset and their replicability across datasets. While many recommendations exist, in general there is little assurance that any given set of parameters represents an optimal choice in the trade-off between cluster resolution and replicability. For instance, another set of parameters may result in more clusters that are also more replicable.

Results: Here, we propose Dune, a new method for optimizing the trade-off between the resolution of the clusters and their replicability. Our method takes as input a set of clustering results, or partitions, on a single dataset and iteratively merges clusters within each partition in order to maximize the concordance between partitions. As demonstrated on multiple datasets from different platforms, Dune outperforms existing techniques that rely on hierarchical merging to reduce the number of clusters, both in the replicability of the resulting merged clusters and in their concordance with ground truth. Dune is available as an R package on Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/Dune.html).

Conclusions: Cluster refinement by Dune helps improve the robustness of any clustering analysis and reduces the reliance on tuning parameters. This method provides an objective approach for borrowing information across multiple clusterings to generate replicable clusters most likely to represent common biological features across multiple datasets.
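The merging idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Dune package's implementation: the function names (`adjusted_rand_index`, `mean_ari`, `greedy_merge`) are hypothetical, the adjusted Rand index is assumed as the concordance measure (the abstract does not name one), and the stopping rule (stop when no single merge raises the mean pairwise concordance) is likewise an assumption. Each partition is a list of cluster labels, one per cell, and at least two partitions of at least two cells are expected.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same cells."""
    n_pairs = comb(len(a), 2)
    sum_ab = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case: both labelings are all-in-one or all-singletons.
        return 1.0
    return (sum_ab - expected) / (max_index - expected)


def mean_ari(partitions):
    """Average ARI over all pairs of partitions (assumed concordance score)."""
    scores = [adjusted_rand_index(p, q) for p, q in combinations(partitions, 2)]
    return sum(scores) / len(scores)


def merge(labels, keep, drop):
    """Relabel every cell of cluster `drop` as cluster `keep`."""
    return [keep if x == drop else x for x in labels]


def greedy_merge(partitions):
    """Repeatedly apply the single within-partition merge that most
    increases the mean ARI; stop when no merge improves it."""
    partitions = [list(p) for p in partitions]
    current = mean_ari(partitions)
    while True:
        best = None
        for i, labels in enumerate(partitions):
            for keep, drop in combinations(sorted(set(labels)), 2):
                candidate = (
                    partitions[:i] + [merge(labels, keep, drop)] + partitions[i + 1:]
                )
                score = mean_ari(candidate)
                if best is None or score > best[0]:
                    best = (score, candidate)
        if best is None or best[0] <= current:
            return partitions, current
        current, partitions = best


# Toy input: the first partition splits a group that the second keeps whole.
p1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
p2 = [0, 0, 0, 0, 0, 0, 1, 1, 1]
merged, score = greedy_merge([p1, p2])
```

On this toy input the only merge that raises concordance joins clusters 0 and 1 of the first partition, after which both partitions describe the same two groups and merging stops. The exhaustive search over cluster pairs keeps the sketch short but would be slow for many partitions with many clusters.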


Identifiers

pubmed: 38789920
doi: 10.1186/s12859-024-05814-6
pii: 10.1186/s12859-024-05814-6

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

198

Grants

Agency: Fonds Wetenschappelijk Onderzoek
ID: 1246220N
Agency: NIH HHS
ID: U19MH114821
Country: United States
Agency: NIH HHS
ID: U19MH114830
Country: United States

Copyright information

© 2024. The Author(s).

Authors

Hector Roux de Bézieux (H)

Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.
Center for Computational Biology, University of California, Berkeley, CA, USA.

Kelly Street (K)

Division of Biostatistics, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.

Stephan Fischer (S)

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.

Koen Van den Berge (K)

Department of Statistics, University of California, Berkeley, CA, USA.
Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium.

Rebecca Chance (R)

Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA.

Davide Risso (D)

Department of Statistical Sciences, University of Padova, Padova, Italy.

Jesse Gillis (J)

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.

John Ngai (J)

Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA.

Elizabeth Purdom (E)

Department of Statistics, University of California, Berkeley, CA, USA.
Center for Computational Biology, University of California, Berkeley, CA, USA.

Sandrine Dudoit (S)

Department of Statistics, University of California, Berkeley, CA, USA. sandrine@berkeley.edu.
Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA. sandrine@berkeley.edu.
Center for Computational Biology, University of California, Berkeley, CA, USA. sandrine@berkeley.edu.
