Transipedia.org: k-mer-based exploration of large RNA sequencing datasets and application to cancer data.


Journal

Genome biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660

Publication information

Publication date:
10 Oct 2024
History:
received: 26 03 2024
accepted: 01 10 2024
medline: 11 10 2024
pubmed: 11 10 2024
entrez: 10 10 2024
Status: epublish

Abstract

Indexing techniques relying on k-mers have proven effective in searching for RNA sequences across thousands of RNA-seq libraries, but without enabling direct RNA quantification. We show here that arbitrary RNA sequences can be quantified in seconds through their decomposition into k-mers, with a precision akin to that of conventional RNA quantification methods. Using an index of the Cancer Cell Line Encyclopedia (CCLE) collection consisting of 1019 RNA-seq samples, we show that k-mer indexing offers a powerful means to reveal non-reference sequences, and variant RNAs induced by specific gene alterations, for instance in splicing factors.
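
The idea of quantifying a sequence through its k-mers can be illustrated with a minimal sketch. It is not the Transipedia implementation: the in-memory index, the function names (`kmers`, `build_index`, `quantify`), the toy reads, the small value of k, and the choice of the median as summary statistic are all assumptions made for illustration. It also ignores reverse complements and normalization by library size, which a real index would have to handle.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(samples, k):
    """Toy index (assumption): one Counter of k-mer occurrences per sample."""
    index = {}
    for name, reads in samples.items():
        counts = Counter()
        for read in reads:
            counts.update(kmers(read, k))
        index[name] = counts
    return index


def quantify(query, index, k):
    """Summarize a query per sample as the median count of its k-mers."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        raise ValueError("query is shorter than k")
    return {
        name: median(counts[km] for km in query_kmers)
        for name, counts in index.items()
    }


# Hypothetical toy data: three "samples", each a short list of reads.
K = 5
samples = {
    "sample_A": ["ACGTTGCAAG", "ACGTTGCAAG", "TTTTGGGGCC"],
    "sample_B": ["ACGTTGCAAG"],
    "sample_C": ["TTTTGGGGCC"],
}
index = build_index(samples, K)
print(quantify("CGTTGCA", index, K))
# {'sample_A': 2, 'sample_B': 1, 'sample_C': 0}
```

Because the query is looked up k-mer by k-mer, it does not need to match any annotated transcript, so a non-reference sequence such as a fusion or splice junction could be queried the same way.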

Identifiers

pubmed: 39390592
doi: 10.1186/s13059-024-03413-5
pii: 10.1186/s13059-024-03413-5

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

266

Grants

Agency: Agence Nationale de la Recherche
ID: ANR-22-CE45-0007
Agency: Agence Nationale de la Recherche
ID: ANR-19-CE45-0008
Agency: Agence Nationale de la Recherche
ID: ANR-19-P3IA-0001

Copyright information

© 2024. The Author(s).

References

Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun. 2018;9(1):1366.
doi: 10.1038/s41467-018-03751-6 pubmed: 29636450 pmcid: 5893633
Clough E, Barrett T. The gene expression omnibus database. Stat Genomics Methods Protocol. 2016;1418:93–110.
Morillon A, Gautheret D. Bridging the gap between reference and real transcriptomes. Genome Biol. 2019;20(1):1–7.
doi: 10.1186/s13059-019-1710-7
Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 2021;22(1):1–40.
doi: 10.1186/s13059-021-02533-6
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 2021;31(1):1–12.
doi: 10.1101/gr.260604.119 pubmed: 33328168 pmcid: 7849385
Darvish M, Seiler E, Mehringer S, Rahn R, Reinert K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics. 2022;38(17):4100–8.
doi: 10.1093/bioinformatics/btac492 pubmed: 35801930 pmcid: 9438961
Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de bruijn graphs. Genome Res. 2022;32(9):1754–64.
doi: 10.1101/gr.276607.122 pubmed: 35609994 pmcid: 9528980
Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics. 2020;36(Supplement_1):i177–85.
Consortium SI. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903–14.
doi: 10.1038/nbt.2957
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525–7.
doi: 10.1038/nbt.3519 pubmed: 27043002
Cancer Cell Line Encyclopedia Consortium, Genomics of Drug Sensitivity in Cancer Consortium. Pharmacogenomic agreement between two cancer cell line data sets. Nature. 2015;528:84–7.
Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 2019;47(D1):D941–7.
doi: 10.1093/nar/gky1015 pubmed: 30371878
Philippe N, Salson M, Commes T, Rivals E. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biol. 2013;14:1–16.
doi: 10.1186/gb-2013-14-3-r30
Gillani R, Seong BKA, Crowdis J, Conway JR, Dharia NV, Alimohamed S, et al. Gene fusions create partner and collateral dependencies essential to cancer cell survival. Cancer Res. 2021;81(15):3971–84.
doi: 10.1158/0008-5472.CAN-21-0791 pubmed: 34099491 pmcid: 8338889
Davidson NM, Chen Y, Sadras T, Ryland GL, Blombery P, Ekert PG, et al. JAFFAL: detecting fusion genes with long-read transcriptome sequencing. Genome Biol. 2022;23(1):1–20.
doi: 10.1186/s13059-021-02588-5
Gioiosa S, Bolis M, Flati T, Massini A, Garattini E, Chillemi G, et al. Massive NGS data analysis reveals hundreds of potential novel gene fusions in human cell lines. GigaScience. 2018;7(10):giy062.
Bendall ML, De Mulder M, Iñiguez LP, Lecanda-Sánchez A, Pérez-Losada M, Ostrowski MA, et al. Telescope: Characterization of the retrotranscriptome by accurate estimation of transposable element expression. PLoS Comput Biol. 2019;15(9):e1006453.
doi: 10.1371/journal.pcbi.1006453 pubmed: 31568525 pmcid: 6786656
Kong Y, Rose CM, Cass AA, Williams AG, Darwish M, Lianoglou S, et al. Transposable element expression in tumors is associated with immune infiltration and increased antigenicity. Nat Commun. 2019;10(1):5228.
doi: 10.1038/s41467-019-13035-2 pubmed: 31745090 pmcid: 6864081
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14(4):417–9.
doi: 10.1038/nmeth.4197 pubmed: 28263959 pmcid: 5600148
Alsafadi S, Houy A, Battistella A, Popova T, Wassef M, Henry E, et al. Cancer-associated SF3B1 mutations affect alternative splicing by promoting alternative branchpoint usage. Nat Commun. 2016;7(1):10615.
doi: 10.1038/ncomms10615 pubmed: 26842708 pmcid: 4743009
Zhou Z, Gong Q, Wang Y, Li M, Wang L, Ding H, et al. The biological function and clinical significance of SF3B1 mutations in cancer. Biomark Res. 2020;8(1):1–14.
doi: 10.1186/s40364-020-00220-5
Alsafadi S, Dayot S, Tarin M, Houy A, Bellanger D, Cornella M, et al. Genetic alterations of SUGP1 mimic mutant-SF3B1 splice pattern in lung adenocarcinoma and other cancers. Oncogene. 2021;40(1):85–96.
doi: 10.1038/s41388-020-01507-5 pubmed: 33057152
Seo JS, Ju YS, Lee WC, Shin JY, Lee JK, Bleazard T, et al. The transcriptional landscape and mutational profile of lung adenocarcinoma. Genome Res. 2012;22(11):2109–19.
doi: 10.1101/gr.145144.112 pubmed: 22975805 pmcid: 3483540
MacRae T, Sargeant T, Lemieux S, Hebert J, Deneault E, Sauvageau G. RNA-Seq reveals spliceosome and proteasome genes as most consistent transcripts in human cancer cells. PLoS ONE. 2013;8(9):e72884.
doi: 10.1371/journal.pone.0072884 pubmed: 24069164 pmcid: 3775772
Pabst C, Bergeron A, Lavallée VP, Yeh J, Gendron P, Norddahl GL, et al. GPR56 identifies primary human acute myeloid leukemia cells with high repopulating potential in vivo. Blood J Am Soc Hematol. 2016;127(16):2018–27.
Lavallée VP, Lemieux S, Boucher G, Gendron P, Boivin I, Armstrong RN, et al. RNA-sequencing analysis of core binding factor AML identifies recurrent ZBTB7A mutations and defines RUNX1-CBFA2T3 fusion signature. Blood J Am Soc Hematol. 2016;127(20):2498–501.
Riquier S, Bessiere C, Guibert B, Bouge AL, Boureux A, Ruffle F, et al. Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-seq datasets. NAR Genomics Bioinforma. 2021;3(3):lqab058.
Chisanga D, Liao Y, Shi W. Impact of gene annotation choice on the quantification of RNA-seq data. BMC Bioinformatics. 2022;23(1):1–21.
doi: 10.1186/s12859-022-04644-8
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research. 2015;4:1521.
Ghandi M, Huang FW, Jané-Valbuena J, Kryukov GV, Lo CC, McDonald ER III, et al. Next-generation characterization of the cancer cell line encyclopedia. Nature. 2019;569(7757):503–8.
doi: 10.1038/s41586-019-1186-3 pubmed: 31068700 pmcid: 6697103
Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, et al. Mutational landscape and significance across 12 major cancer types. Nature. 2013;502(7471):333–9.
doi: 10.1038/nature12634 pubmed: 24132290 pmcid: 3927368
Döhner H, Estey E, Grimwade D, Amadori S, Appelbaum FR, Büchner T, et al. Diagnosis and management of AML in adults: 2017 ELN recommendations from an international expert panel. Blood J Am Soc Hematol. 2017;129(4):424–47.
Kent WJ. BLAT-the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
pubmed: 11932250 pmcid: 187518
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.
doi: 10.1093/bioinformatics/btq033 pubmed: 20110278 pmcid: 2832824
Bousquet M, De Clara E. LncRNAs specific signature in acute myeloid leukemia with intermediate risk. Gene Expression Omnibus; 2016. Datasets. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62852 . Accessed 1 Jan 2021.
Shi L, Wang C, Mason C, Fischer M, Peng Z, Auerbach S, et al. SEQC Project. Gene Expression Omnibus; 2014. Datasets. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47792 . Accessed 1 Dec 2023.
Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. SNP and Expression data from the Cancer Cell Line Encyclopedia (CCLE). Gene Expression Omnibus; 2012. Datasets. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36139 . Accessed 31 Jan 2021.
Seo J, Ju Y, Lee W, Shin J, Lee J, Bleazard T, et al. The transcriptional landscape and mutational profile of lung adenocarcinoma. Gene Expression Omnibus; 2012. Datasets. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40419 . Accessed 10 Sep 2024.
Simon C, Chagraoui J, Krosl J, Gendron P, Wilhelm B, Lemieux S, et al. Leucegene: AML sequencing (part 1). Gene Expression Omnibus; 2013. Datasets. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49642 . Accessed 1 Jan 2021.
Simon C, Chagraoui J, Krosl J, Gendron P, Wilhelm B, Lemieux S, et al. Leucegene: AML sequencing (part 2). Gene Expression Omnibus; 2014. Datasets. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52656 . Accessed 1 Jan 2021.
Simon C, Chagraoui J, Krosl J, Gendron P, Wilhelm B, Lemieux S, et al. Leucegene: AML sequencing (part 3). Gene Expression Omnibus; 2015. Datasets. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62190 . Accessed 1 Jan 2021.
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. 2020. Datasets. dbGaP. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v8.p2 . Accessed 1 Apr 2024.
Guibert B, Bessiere C, Boureux A, Xue H, Commes T, Gautheret D. Code for Exploring a large cancer cell line RNA-sequencing dataset with k-mers. Datasets Zenodo. 2024. https://doi.org/10.5281/zenodo.13819530 .
doi: 10.5281/zenodo.13819530
Guibert B, Bessiere C, Boureux A, Xue H, Commes T, Gautheret D. Code for Exploring a large cancer cell line RNA-sequencing dataset with k-mers. Github; 2024. https://github.com/Transipedia/publication-ccle .

Authors

Chloé Bessière (C)

IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France.
CRCT, Inserm, CNRS, Université Toulouse III-Paul Sabatier, Centre de Recherches en Cancérologie de Toulouse, Toulouse, France.

Haoliang Xue (H)

I2BC, Université Paris-Saclay, CNRS, CEA, Gif sur Yvette, France.

Benoit Guibert (B)

IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France.

Anthony Boureux (A)

IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France.

Florence Rufflé (F)

IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France.

Julien Viot (J)

Department of Medical Oncology, Biotechnology and Immuno-Oncology Platform, University Hospital of Besançon, Besançon, France.
INSERM, EFS BFC, UMR1098, RIGHT, University of Franche-Comté, Interactions Greffon-Hôte-Tumeur/Ingénierie Cellulaire et Génique, Besançon, France.

Rayan Chikhi (R)

Institut Pasteur, Université Paris Cité, Paris, France.

Mikaël Salson (M)

Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France.

Camille Marchet (C)

Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France.

Thérèse Commes (T)

IRMB, INSERM U1183, Hopital Saint-Eloi, Universite de Montpellier, Montpellier, France. therese.commes@inserm.fr.

Daniel Gautheret (D)

I2BC, Université Paris-Saclay, CNRS, CEA, Gif sur Yvette, France. daniel.gautheret@universite-paris-saclay.fr.


MeSH classifications