Consistent RNA sequencing contamination in GTEx and other data sets.


Journal

Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
22 04 2020
History:
received: 19 09 2019
accepted: 23 03 2020
entrez: 24 4 2020
pubmed: 24 4 2020
medline: 1 8 2020
Status: epublish

Abstract

A challenge of next generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata along with RNA-seq datasets from other studies to understand factors that contribute to contamination. Here we report, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting bulk and single cell (scRNA-seq) data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses.
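
The association the abstract describes, between a contaminated sample and same-day sequencing alongside a tissue that natively expresses the genes, can be illustrated with a small sketch. This is not the authors' pipeline. The expression threshold, the `same_day_as_pancreas` field, the toy samples, and the choice of a one-sided Fisher's exact test are all assumptions made for illustration; only the four marker gene names come from the abstract.

```python
from math import comb

# Pancreas-enriched marker genes named in the abstract.
PANCREAS_MARKERS = ("PRSS1", "PNLIP", "CLPS", "CELA3A")


def is_flagged(expr, threshold=10.0):
    """True if any marker gene exceeds the threshold (hypothetical cutoff)."""
    return any(expr.get(gene, 0.0) > threshold for gene in PANCREAS_MARKERS)


def contingency(samples, threshold=10.0):
    """Build the 2x2 table (a, b, c, d) for non-pancreas samples.

    a: flagged, same day      b: flagged, different day
    c: not flagged, same day  d: not flagged, different day
    """
    a = b = c = d = 0
    for sample in samples:
        flagged = is_flagged(sample["expr"], threshold)
        same_day = sample["same_day_as_pancreas"]
        if flagged and same_day:
            a += 1
        elif flagged:
            b += 1
        elif same_day:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value: P(top-left cell >= a) with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Toy, made-up samples from non-pancreas tissues (values are illustrative only).
samples = [
    {"tissue": "lung", "same_day_as_pancreas": True, "expr": {"PRSS1": 55.0}},
    {"tissue": "esophagus", "same_day_as_pancreas": True, "expr": {"PNLIP": 32.0}},
    {"tissue": "skin", "same_day_as_pancreas": True, "expr": {"CLPS": 18.5}},
    {"tissue": "lung", "same_day_as_pancreas": False, "expr": {"PRSS1": 0.4}},
    {"tissue": "heart", "same_day_as_pancreas": False, "expr": {}},
    {"tissue": "skin", "same_day_as_pancreas": False, "expr": {"CELA3A": 2.1}},
]

table = contingency(samples)
p_value = fisher_one_sided(*table)
print(table, p_value)
```

With real data, the same table could be built per contaminating gene and per sequencing date. A regression that adjusts for tissue and batch covariates would likely be more appropriate than a single 2x2 test.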

Identifiers

pubmed: 32321923
doi: 10.1038/s41467-020-15821-9
pii: 10.1038/s41467-020-15821-9
pmc: PMC7176728

Chemical substances

RNA 63231-63-0

Publication types

Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

1933

Grants

Agency: NIGMS NIH HHS
ID: R01 GM130564
Country: United States
Agency: NHLBI NIH HHS
ID: R01 HL137811
Country: United States
Agency: NCATS NIH HHS
ID: UL1 TR002001
Country: United States

Authors

Tim O Nieuwenhuis (TO)

Department of Pathology, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.
McKusick-Nathans Institute, Department of Genetic Medicine, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.

Stephanie Y Yang (SY)

McKusick-Nathans Institute, Department of Genetic Medicine, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.

Rohan X Verma (RX)

Department of Pathology, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.

Vamsee Pillalamarri (V)

McKusick-Nathans Institute, Department of Genetic Medicine, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.

Dan E Arking (DE)

McKusick-Nathans Institute, Department of Genetic Medicine, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.

Avi Z Rosenberg (AZ)

Department of Pathology, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.

Matthew N McCall (MN)

Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, 14642, USA.

Marc K Halushka (MK)

Department of Pathology, Johns Hopkins University SOM, Baltimore, MD, 21205, USA. mhalush1@jhmi.edu.
