kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq.
Journal
Nature protocols
ISSN: 1750-2799
Abbreviated title: Nat Protoc
Country: England
NLM ID: 101284307
Publication information
Publication date:
10 Oct 2024
History:
received: 18 Jan 2024
accepted: 29 Jul 2024
medline: 11 Oct 2024
pubmed: 11 Oct 2024
entrez: 10 Oct 2024
Status:
aheadofprint
Abstract
The term 'RNA-seq' refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, single cells or single nuclei. The kallisto, bustools and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data. Execution of this protocol requires basic familiarity with a command line environment. With this protocol, quantification of a moderately sized RNA-seq dataset can be completed within minutes.
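As a rough illustration of the wrapper-based workflow the abstract describes, the following minimal sketch assembles the two kb-python command lines typically used for a standard run: `kb ref` to build an index and `kb count` to quantify reads. It is not taken from the protocol itself. The file names, the `10xv3` technology string, the thread count and the helper function names are all assumptions chosen for illustration, and the commands are only printed, not executed.

```python
import shlex


def kb_ref_command(genome_fasta, gtf, index="index.idx",
                   t2g="t2g.txt", cdna="cdna.fa"):
    """Build the argv for `kb ref` (standard workflow).

    -i: index file to create, -g: transcript-to-gene map to create,
    -f1: cDNA FASTA to create. All file names here are hypothetical.
    """
    return ["kb", "ref", "-i", index, "-g", t2g, "-f1", cdna,
            genome_fasta, gtf]


def kb_count_command(fastqs, technology, index="index.idx",
                     t2g="t2g.txt", out_dir="kb_out", threads=4):
    """Build the argv for `kb count`.

    -x: assay technology string, -o: output directory, -t: threads.
    The FASTQ paths are appended in the order given.
    """
    return ["kb", "count", "-i", index, "-g", t2g, "-x", technology,
            "-o", out_dir, "-t", str(threads), *fastqs]


if __name__ == "__main__":
    # Hypothetical inputs; substitute real reference and read files.
    ref_cmd = kb_ref_command("genome.fa", "genes.gtf")
    count_cmd = kb_count_command(["R1.fastq.gz", "R2.fastq.gz"], "10xv3")
    # Print shell-ready command lines instead of running them.
    print(shlex.join(ref_cmd))
    print(shlex.join(count_cmd))
```

Building the argument lists separately from execution makes it easy to inspect or log the exact commands before passing them to `subprocess.run` on a machine where kb-python is installed.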
Identifiers
pubmed: 39390263
doi: 10.1038/s41596-024-01057-0
pii: 10.1038/s41596-024-01057-0
Publication types
Journal Article
Review
Languages
eng
Citation subsets
IM
Grants
Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)
ID: U19MH114830
Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)
ID: 5UM1HG012077-02
Agency: NIGMS NIH HHS
ID: T32 GM008042
Country: United States
Copyright information
© 2024. Springer Nature Limited.