GC bias affects genomic and metagenomic reconstructions, underrepresenting GC-poor organisms.


Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
01 02 2020
History:
received: 16 07 2019
revised: 25 11 2019
accepted: 14 01 2020
entrez: 14 2 2020
pubmed: 14 2 2020
medline: 28 1 2021
Status: ppublish

Abstract

Metagenomic sequencing is a well-established tool in the modern biosciences. While it promises unparalleled insights into the genetic content of the biological samples studied, conclusions drawn are at risk from biases inherent to the DNA sequencing methods, including inaccurate abundance estimates as a function of genomic guanine-cytosine (GC) contents. We explored such GC biases across many commonly used platforms in experiments sequencing multiple genomes (with mean GC contents ranging from 28.9% to 62.4%) and metagenomes. GC bias profiles varied among different library preparation protocols and sequencing platforms. We found that our workflows using MiSeq and NextSeq were hindered by major GC biases, with problems becoming increasingly severe outside the 45-65% GC range, leading to a falsely low coverage in GC-rich and especially GC-poor sequences, where genomic windows with 30% GC content had >10-fold less coverage than windows close to 50% GC content. We also showed that GC content correlates tightly with coverage biases. The PacBio and HiSeq platforms also evidenced similar profiles of GC biases to each other, which were distinct from those seen in the MiSeq and NextSeq workflows. The Oxford Nanopore workflow was not afflicted by GC bias. These findings indicate potential sources of difficulty, arising from GC biases, in genome sequencing that could be pre-emptively addressed with methodological optimizations provided that the GC biases inherent to the relevant workflow are understood. Furthermore, it is recommended that a more critical approach be taken in quantitative abundance estimates in metagenomic studies. In the future, metagenomic studies should take steps to account for the effects of GC bias before drawing conclusions, or they should use a demonstrably unbiased workflow.
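The comparison of coverage between genomic windows of different GC content, as described in the abstract, could be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the function names, the window size, the 5% bin width, and the toy sequence and depth values are all assumptions made here for demonstration. It bins fixed-size windows by GC percentage and expresses the mean depth in each bin relative to the bin containing 50% GC.

```python
from collections import defaultdict


def gc_percent(seq):
    """Return GC content (0-100) of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        return None  # window made only of ambiguous bases
    return 100.0 * gc / acgt


def window_gc_and_depth(genome, depth, window=500):
    """Return (gc_percent, mean_depth) for consecutive non-overlapping windows.

    `depth` is assumed to be a per-base list of read depths with the same
    length as `genome` (hypothetical input; any depth source could be used).
    """
    results = []
    for start in range(0, len(genome) - window + 1, window):
        gc = gc_percent(genome[start:start + window])
        if gc is None:
            continue
        mean_depth = sum(depth[start:start + window]) / window
        results.append((gc, mean_depth))
    return results


def relative_coverage_by_gc(windows, bin_width=5, reference_gc=50):
    """Mean depth per GC bin, divided by the mean depth of the reference bin."""
    bins = defaultdict(list)
    for gc, mean_depth in windows:
        bins[int(gc // bin_width) * bin_width].append(mean_depth)
    means = {b: sum(v) / len(v) for b, v in bins.items()}
    ref_bin = int(reference_gc // bin_width) * bin_width
    if ref_bin not in means:
        raise ValueError("no windows fall in the reference GC bin")
    return {b: m / means[ref_bin] for b, m in sorted(means.items())}


# Toy data (invented): one 30% GC window at depth 2, one 50% GC window at depth 20.
toy_genome = "AAAATTTGCC" + "AATTAGGCCC"
toy_depth = [2] * 10 + [20] * 10
toy_windows = window_gc_and_depth(toy_genome, toy_depth, window=10)
toy_profile = relative_coverage_by_gc(toy_windows)
print(toy_profile)  # {30: 0.1, 50: 1.0}
```

In this toy case the 30% GC bin has one tenth of the reference coverage, which corresponds to the 10-fold boundary of the ">10-fold less coverage" reported for the MiSeq and NextSeq workflows. Real data would require many windows per bin for the means to be informative.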

Identifiers

pubmed: 32052832
pii: 5735313
doi: 10.1093/gigascience/giaa008
pmc: PMC7016772

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2020. Published by Oxford University Press.

Authors

Patrick Denis Browne (PD)

Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, Frederiksberg C, 1871, Denmark.
Department of Environmental Science, Aarhus University, Frederiksborgvej 399, Roskilde, 4000, Denmark.

Tue Kjærgaard Nielsen (TK)

Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, Frederiksberg C, 1871, Denmark.
Department of Environmental Science, Aarhus University, Frederiksborgvej 399, Roskilde, 4000, Denmark.

Witold Kot (W)

Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, Frederiksberg C, 1871, Denmark.
Department of Environmental Science, Aarhus University, Frederiksborgvej 399, Roskilde, 4000, Denmark.

Anni Aggerholm (A)

Department of Hematology, Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, Aarhus N, 8200, Denmark.

M Thomas P Gilbert (MTP)

The GLOBE Institute, Faculty of Health and Biomedical Sciences, University of Copenhagen, Blegdamsvej 3B, Copenhagen N, 2200, Denmark.

Lara Puetz (L)

The GLOBE Institute, Faculty of Health and Biomedical Sciences, University of Copenhagen, Blegdamsvej 3B, Copenhagen N, 2200, Denmark.

Morten Rasmussen (M)

Department of Genetics, School of Medicine, Stanford University, 291 Campus Drive, Stanford, CA 94305-5051, USA.

Athanasios Zervas (A)

Department of Environmental Science, Aarhus University, Frederiksborgvej 399, Roskilde, 4000, Denmark.

Lars Hestbjerg Hansen (LH)

Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, Frederiksberg C, 1871, Denmark.
Department of Environmental Science, Aarhus University, Frederiksborgvej 399, Roskilde, 4000, Denmark.
