kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq.


Journal

Nature protocols
ISSN: 1750-2799
Abbreviated title: Nat Protoc
Country: England
NLM ID: 101284307

Publication information

Publication date:
10 Oct 2024
History:
received: 18 Jan 2024
accepted: 29 Jul 2024
medline: 11 Oct 2024
pubmed: 11 Oct 2024
entrez: 10 Oct 2024
Status: aheadofprint

Abstract

The term 'RNA-seq' refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, single cells or single nuclei. The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data. Execution of this protocol requires basic familiarity with a command line environment. With this protocol, quantification of a moderately sized RNA-seq dataset can be completed within minutes.

Identifiers

pubmed: 39390263
doi: 10.1038/s41596-024-01057-0
pii: 10.1038/s41596-024-01057-0

Publication types

Journal Article; Review

Languages

eng

Citation subsets

IM

Grants

Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)
ID: U19MH114830
Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)
ID: 5UM1HG012077-02
Agency: NIGMS NIH HHS
ID: T32 GM008042
Country: United States

Copyright information

© 2024. Springer Nature Limited.

Authors

Delaney K Sullivan (DK)

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.
UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA.

Kristján Eldjárn Hjörleifsson (KE)

Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA.

Laura Luebbert (L)

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.

Guillaume Holley (G)

deCODE Genetics/Amgen Inc., Reykjavik, Iceland.

Lambda Moses (L)

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.

Johan Gustafsson (J)

Broad Institute of Harvard and MIT, Cambridge, MA, USA.

Nicolas L Bray (NL)

Broad Institute of Harvard and MIT, Cambridge, MA, USA.

Harold Pimentel (H)

Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA.
Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA.

A Sina Booeshaghi (AS)

Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA. sinab@berkeley.edu.

Páll Melsted (P)

deCODE Genetics/Amgen Inc., Reykjavik, Iceland. pmelsted@hi.is.
School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland. pmelsted@hi.is.

Lior Pachter (L)

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA. lpachter@caltech.edu.
Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA. lpachter@caltech.edu.

MeSH classifications