Consistent RNA sequencing contamination in GTEx and other data sets.

DNA Contamination; Genotype; High-Throughput Nucleotide Sequencing; Humans; Polymorphism, Single Nucleotide; RNA / genetics; Sequence Analysis, RNA; Single-Cell Analysis

Journal

Nature communications

ISSN: 2041-1723

Abbreviated title: Nat Commun

Country: England

NLM ID: 101528555

Publication information

Publication date:
22 April 2020

History:

received: 19 September 2019

accepted: 23 March 2020

entrez: 24 April 2020

pubmed: 24 April 2020

medline: 1 August 2020

Status: epublish

Abstract

A challenge of next generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata along with RNA-seq datasets from other studies to understand factors that contribute to contamination. Here we report, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting bulk and single cell (scRNA-seq) data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses.
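The same-day association described in the abstract can be illustrated with a minimal sketch. It is not the authors' method (the abstract describes variant co-expression clusters and SNP validation); it only shows the idea of flagging non-pancreas samples that express pancreas-enriched genes and tabulating them against sequencing date. The `Sample` record, the 10 TPM threshold, and all toy values are assumptions made for illustration.

```python
from dataclasses import dataclass

# Pancreas-enriched genes named in the abstract.
PANCREAS_GENES = ("PRSS1", "PNLIP", "CLPS", "CELA3A")


@dataclass
class Sample:
    # Hypothetical record layout, not a GTEx schema.
    sample_id: str
    tissue: str
    seq_date: str   # sequencing date in any consistent string format
    tpm: dict       # gene -> expression value (toy TPM numbers)


def flag_contaminated(samples, threshold=10.0):
    """IDs of non-pancreas samples where any marker gene exceeds the
    (assumed, arbitrary) threshold."""
    return {
        s.sample_id
        for s in samples
        if s.tissue != "Pancreas"
        and any(s.tpm.get(g, 0.0) > threshold for g in PANCREAS_GENES)
    }


def same_day_table(samples, flagged):
    """2x2 counts over non-pancreas samples, keyed by
    (sequenced on a pancreas day?, flagged?)."""
    pancreas_dates = {s.seq_date for s in samples if s.tissue == "Pancreas"}
    table = {(True, True): 0, (True, False): 0,
             (False, True): 0, (False, False): 0}
    for s in samples:
        if s.tissue == "Pancreas":
            continue
        table[(s.seq_date in pancreas_dates, s.sample_id in flagged)] += 1
    return table


def odds_ratio(table):
    """Odds ratio with a 0.5 continuity correction so empty cells
    do not cause division by zero."""
    a = table[(True, True)] + 0.5
    b = table[(True, False)] + 0.5
    c = table[(False, True)] + 0.5
    d = table[(False, False)] + 0.5
    return (a * d) / (b * c)


# Toy data, invented for this example only.
samples = [
    Sample("S1", "Pancreas", "2014-01-10", {"PRSS1": 50000.0}),
    Sample("S2", "Lung", "2014-01-10", {"PRSS1": 35.0}),
    Sample("S3", "Heart", "2014-01-10", {"PNLIP": 22.0}),
    Sample("S4", "Lung", "2014-02-03", {"PRSS1": 0.4}),
    Sample("S5", "Skin", "2014-02-03", {"CLPS": 1.1}),
    Sample("S6", "Heart", "2014-01-10", {"CELA3A": 2.0}),
]

flagged = flag_contaminated(samples)
table = same_day_table(samples, flagged)
print(sorted(flagged), table, round(odds_ratio(table), 2))
```

An odds ratio well above 1 in such a table would be consistent with contamination tracking the sequencing day; a real analysis would need many more samples, a formal test, and adjustment for other technical covariates.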

Identifiers

DOI: 10.1038/s41467-020-15821-9

PMID: 32321923

PMC: PMC7176728

Chemical substances

RNA 63231-63-0

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

1933

Grants

Agency: NIGMS NIH HHS

ID: R01 GM130564

Country: United States

Agency: NHLBI NIH HHS

ID: R01 HL137811

Country: United States

Agency: NCATS NIH HHS

ID: UL1 TR002001

Country: United States

Authors

Tim O Nieuwenhuis (TO)

Stephanie Y Yang (SY)

Rohan X Verma (RX)

Vamsee Pillalamarri (V)

Dan E Arking (DE)

Avi Z Rosenberg (AZ)

Matthew N McCall (MN)

Marc K Halushka (MK)
