An information-theoretic approach to single cell sequencing analysis.
Journal
BMC bioinformatics
ISSN: 1471-2105
Titre abrégé: BMC Bioinformatics
Pays: England
ID NLM: 100965194
Informations de publication
Date de publication:
12 Aug 2023
12 Aug 2023
Historique:
received:
22
03
2022
accepted:
18
07
2023
medline:
14
8
2023
pubmed:
13
8
2023
entrez:
12
8
2023
Statut:
epublish
Résumé
Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information. Here, we formally quantify the information obtained from a sc-Seq experiment and show that it corresponds to an intuitive notion of gene expression heterogeneity. We demonstrate a natural relation between our notion of heterogeneity and that of cell type, decomposing heterogeneity into that component attributable to differential expression between cell types (inter-cluster heterogeneity) and that remaining (intra-cluster heterogeneity). We test our definition of heterogeneity as the objective function of a clustering algorithm, and show that it is a useful descriptor for gene expression patterns associated with different cell types. Thus, our definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression. Our measure of heterogeneity, and its decomposition into inter- and intra-cluster, is non-parametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns. Based on this theory, we develop an efficient method for the automatic unsupervised clustering of cells from sc-Seq data, and provide an R package implementation.
Sections du résumé
BACKGROUND
BACKGROUND
Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information.
RESULTS
RESULTS
Here, we formally quantify the information obtained from a sc-Seq experiment and show that it corresponds to an intuitive notion of gene expression heterogeneity. We demonstrate a natural relation between our notion of heterogeneity and that of cell type, decomposing heterogeneity into that component attributable to differential expression between cell types (inter-cluster heterogeneity) and that remaining (intra-cluster heterogeneity). We test our definition of heterogeneity as the objective function of a clustering algorithm, and show that it is a useful descriptor for gene expression patterns associated with different cell types.
CONCLUSIONS
CONCLUSIONS
Thus, our definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression. Our measure of heterogeneity, and its decomposition into inter- and intra-cluster, is non-parametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns. Based on this theory, we develop an efficient method for the automatic unsupervised clustering of cells from sc-Seq data, and provide an R package implementation.
Identifiants
pubmed: 37573291
doi: 10.1186/s12859-023-05424-8
pii: 10.1186/s12859-023-05424-8
pmc: PMC10422744
doi:
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
311Subventions
Organisme : Engineering and Physical Sciences Research Council
ID : EP/N510129/1
Informations de copyright
© 2023. BioMed Central Ltd., part of Springer Nature.
Références
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nate Genet. 2000;25(1):25–9.
doi: 10.1038/75556
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc: Ser B (Methodol). 1995;57(1):289–300.
Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008;2008(10):P10008.
doi: 10.1088/1742-5468/2008/10/P10008
Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, et al. Accounting for technical noise in single-cell rna-seq experiments. Nat Meth. 2013;10(11):1093–5.
doi: 10.1038/nmeth.2645
Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput. 1995;16(5):1190–208.
doi: 10.1137/0916069
Chan TE, Stumpf MP, Babtie AC. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 2017;5(3):251–67.
pubmed: 28957658
pmcid: 5624513
doi: 10.1016/j.cels.2017.08.014
Cover TM, Thomas JA. Elements of information theory. Wiley; 2012.
Fisher R. Statistical methods for research workers. Gyan Books; 2017.
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in r for medium-sized 10x genomics single-cell RNA-sequencing data. F1000Research. 2018;7:1297.
pubmed: 30228881
pmcid: 6124389
doi: 10.12688/f1000research.15809.1
Consortium Gene Ontology. The gene ontology resource: enriching a gold mine. Nucl Acids Res. 2021;49(D1):D325–34.
doi: 10.1093/nar/gkaa1113
Greulich P, Smith R, MacArthur BD. The physics of cell fate. In: Levine H, Jolly MK, Kulkarni P, Nanjundiah V, editors. Phenotypic switching. NewYork: Academic Press; 2020. p. 189–206.
doi: 10.1016/B978-0-12-817996-3.00003-7
Grün D, Kester L. Van, Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Meth. 2014;11(6):637–40.
doi: 10.1038/nmeth.2930
Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H. Van, Oudenaarden A. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251–5.
pubmed: 26287467
doi: 10.1038/nature14966
Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20(1):1–15.
doi: 10.1186/s13059-019-1874-1
Hashimshony T, Senderovich N, Avital G, Klochendler A, De Leeuw Y, Anavy L, Gennert D, Li S, Livak KJ, Rozenblatt-Rosen O, et al. Cel-seq2: sensitive highly-multiplexed single-cell RNA-seq. Genome Biol. 2016;17:1–7.
doi: 10.1186/s13059-016-0938-8
Hausser J, Strimmer K. Entropy inference and the James–Stein estimator, with application to nonlinear gene association networks. J Mach Learn Res. 2009;10(7):1469–84.
Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19(4):562–78.
pubmed: 29121214
doi: 10.1093/biostatistics/kxx053
Jaccard P. The distribution of the flora in the alpine zone. 1. NewPhytol. 1912;11(2):37–50.
Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Meth. 2014;11(7):740–2.
doi: 10.1038/nmeth.2967
Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20(5):273–82.
pubmed: 30617341
doi: 10.1038/s41576-018-0088-9
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. Sc3: consensus clustering of single-cell RNA-seq data. Nat Meth. 2017;14(5):483–6.
doi: 10.1038/nmeth.4236
Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86.
doi: 10.1214/aoms/1177729694
Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, Kong SL, Chua C, Hon LK, Tan WS, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017;49(5):708–18.
pubmed: 28319088
doi: 10.1038/ng.3818
Liu B, Li C, Li Z, Wang D, Ren X, Zhang Z. An entropy-based metric for assessing the purity of single cell populations. Nat Commun. 2020;11(1):1–13.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biol. 2014;15(12):1–21.
doi: 10.1186/s13059-014-0550-8
Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6): e8746.
pubmed: 31217225
pmcid: 6582955
doi: 10.15252/msb.20188746
Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with bioconductor. F1000Research. 2016;5:2122.
pubmed: 27909575
pmcid: 5112579
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.
pubmed: 26000488
pmcid: 4481139
doi: 10.1016/j.cell.2015.05.002
McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 . 2018
Ortega A. Introduction to graph signal processing. Cambridge University Press; 2022.
doi: 10.1017/9781108552349
Peterman N, Levine E. Sort-seq under the hood: implications of design choices on large-scale characterization of sequence-function relations. BMC Genom. 2016;17:1–17.
doi: 10.1186/s12864-016-2533-5
R Core Team R: a language and environment for statistical computing, R foundation for statistical computing, Vienna. https://www.R-project.org/ , 2020
Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–50.
doi: 10.1080/01621459.1971.10482356
Risso D, Purvis L, Fletcher RB, Das D, Ngai J, Dudoit S, Purdom E. Clusterexperiment and rsec: a bioconductor package and framework for clustering of single-cell and other large gene expression datasets. PLoS Comput Biol. 2018;14(9): e1006378.
pubmed: 30180157
pmcid: 6138422
doi: 10.1371/journal.pcbi.1006378
Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
pubmed: 19910308
doi: 10.1093/bioinformatics/btp616
Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: clustering, classification and density estimation using gaussian finite mixture models. R J. 2016;8(1):289.
pubmed: 27818791
pmcid: 5096736
doi: 10.32614/RJ-2016-021
Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423.
doi: 10.1002/j.1538-7305.1948.tb01338.x
Shorrocks AF. The class of additively decomposable inequality measures. Econom: J Econom Soc. 1980;48(3):613–25.
doi: 10.2307/1913126
Smith RC, MacArthur BD. Information-theoretic approaches to understanding stem cell variability. Curr Stem Cell Rep. 2017;3(3):225–31.
doi: 10.1007/s40778-017-0093-5
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM III, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–902.
pubmed: 31178118
pmcid: 6687398
doi: 10.1016/j.cell.2019.05.031
Stumpf PS, Du X, Imanishi H, Kunisaki Y, Semba Y, Noble T, Smith RC, Rose-Zerili M, West JJ, Oreffo RO, et al. Transfer learning efficiently maps bone marrow cell types from mouse to human using single-cell RNA sequencing. Commun Biol. 2020;3(1):1–11.
doi: 10.1038/s42003-020-01463-6
Svensson V, Natarajan KN, Ly L-H, Miragaia RJ, Labalette C, Macaulay IC, Cvejic A, Teichmann SA. Power analysis of single-cell RNA-sequencing experiments. Nat Meth. 2017;14(4):381.
doi: 10.1038/nmeth.4220
Svensson V, Vento-Tormo R, Teichmann SA. Exponential scaling of single-cell RNA-seq in the past decade. Nat Protoc. 2018;13(4):599–604.
pubmed: 29494575
doi: 10.1038/nprot.2017.149
Consortium Tabula Muris. Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature. 2018;562(7727):367.
doi: 10.1038/s41586-018-0590-4
Theil H. Economics and Information, theory studies in mathematical and managerial economics. North-Holland Publishing Company; 1967.
Tian L, Dong X, Freytag S, Lê Cao K-A, Su S, JalalAbadi A, Amann-Zalcenstein D, Weber TS, Seidi A, Jabbari JS, et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Meth. 2019;16(6):479–87.
doi: 10.1038/s41592-019-0425-8
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 2019;20(1):1–16.
doi: 10.1186/s13059-019-1861-6
Trapnell C. Defining cell types and states with single-cell genomics. Genome Res. 2015;25(10):1491–8.
pubmed: 26430159
pmcid: 4579334
doi: 10.1101/gr.190595.115
Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–29.
pubmed: 29961576
pmcid: 6771278
doi: 10.1016/j.cell.2018.05.061
Van Rossum G, Drake FL. Python 3 reference manual. Scotts Valley: CreateSpace; 2009.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nat Meth. 2020;17(3):261–72.
doi: 10.1038/s41592-019-0686-2
Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12.
doi: 10.1038/ncomms14049
Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw (TOMS). 1997;23(4):550–60.
doi: 10.1145/279232.279236