Consistent RNA sequencing contamination in GTEx and other data sets.
Journal
Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date:
22 04 2020
History:
received: 19 09 2019
accepted: 23 03 2020
entrez: 24 04 2020
pubmed: 24 04 2020
medline: 01 08 2020
Status:
epublish
Abstract
A challenge of next generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata along with RNA-seq datasets from other studies to understand factors that contribute to contamination. Here we report, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting bulk and single cell (scRNA-seq) data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses.
Identifiers
pubmed: 32321923
doi: 10.1038/s41467-020-15821-9
pii: 10.1038/s41467-020-15821-9
pmc: PMC7176728
Chemical substances
RNA
63231-63-0
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
1933
Grants
Agency: NIGMS NIH HHS
ID: R01 GM130564
Country: United States
Agency: NHLBI NIH HHS
ID: R01 HL137811
Country: United States
Agency: NCATS NIH HHS
ID: UL1 TR002001
Country: United States