kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq.

Journal

Nature protocols

ISSN: 1750-2799

Abbreviated title: Nat Protoc

Country: England

NLM ID: 101284307

Publication information

Publication date:
10 Oct 2024

History:

received: 18 Jan 2024

accepted: 29 Jul 2024

medline: 11 Oct 2024

pubmed: 11 Oct 2024

entrez: 10 Oct 2024

Status: ahead of print

Abstract

The term 'RNA-seq' refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, single cells or single nuclei. The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data. Execution of this protocol requires basic familiarity with a command line environment. With this protocol, quantification of a moderately sized RNA-seq dataset can be completed within minutes.

Identifiers

DOI: 10.1038/s41596-024-01057-0 PMID: 39390263

pubmed: 39390263

doi: 10.1038/s41596-024-01057-0

pii: 10.1038/s41596-024-01057-0


Publication types

Journal Article; Review

Languages

eng

Grants

Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)

ID: U19MH114830

Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)

ID: 5UM1HG012077-02

Agency: NIGMS NIH HHS

ID: T32 GM008042

Country: United States

Authors

Delaney K Sullivan (DK)

Kyung Hoi Joseph Min (KHJ)

Kristján Eldjárn Hjörleifsson (KE)

Laura Luebbert (L)

Guillaume Holley (G)

Lambda Moses (L)

Johan Gustafsson (J)

Nicolas L Bray (NL)

Harold Pimentel (H)

A Sina Booeshaghi (AS)

Páll Melsted (P)

Lior Pachter (L)
