Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph.
Ancient DNA
Reference bias
Sequence alignment
Variation graph
Journal
Genome biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660
Publication information
Publication date: 17 September 2020
History:
received: 27 September 2019
accepted: 27 August 2020
entrez: 18 September 2020
pubmed: 19 September 2020
medline: 12 June 2021
Status:
epublish
Abstract
BACKGROUND
During the last decade, the analysis of ancient DNA (aDNA) sequence has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA molecules are short and frequently mutated by post-mortem chemical modifications. These features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Alternative approaches have been developed to replace the linear reference with a variation graph which includes known alternative variants at each genetic locus. Here, we evaluate the use of variation graph software vg to avoid reference bias for aDNA and compare with existing methods.
RESULTS
We use vg to align simulated and real aDNA samples to a variation graph containing 1000 Genomes Project variants and compare the results with the same data aligned with bwa to the human linear reference genome. Using vg leads to a balanced allelic representation at polymorphic sites, effectively removing reference bias, and to more sensitive variant detection than bwa, especially for insertions and deletions (indels). Alternative approaches that use relaxed bwa parameter settings or filter bwa alignments can also reduce bias but can have lower sensitivity than vg, particularly for indels.
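As an illustration of what "balanced allelic representation" means in practice, the following is a minimal sketch, not the authors' pipeline. It assumes that reference-allele and alternate-allele read counts at known heterozygous sites have already been extracted from the alignments. The function name `alt_allele_fraction` and the read counts are hypothetical and chosen only for illustration. A pooled alternate-allele fraction close to 0.5 suggests little reference bias, whereas a value below 0.5 suggests that reads carrying the alternate allele are being lost.

```python
from typing import Iterable, Tuple


def alt_allele_fraction(counts: Iterable[Tuple[int, int]]) -> float:
    """Pooled fraction of reads carrying the alternate allele.

    counts: one (ref_reads, alt_reads) pair per heterozygous site.
    """
    ref_total = 0
    alt_total = 0
    for ref_reads, alt_reads in counts:
        ref_total += ref_reads
        alt_total += alt_reads
    total = ref_total + alt_total
    if total == 0:
        raise ValueError("no reads observed at the supplied sites")
    return alt_total / total


# Hypothetical read counts at three heterozygous sites (illustrative only,
# not taken from the study).
linear_counts = [(12, 8), (10, 6), (9, 7)]   # e.g. linear-reference alignment
graph_counts = [(10, 10), (8, 8), (8, 8)]    # e.g. graph alignment

print(alt_allele_fraction(linear_counts))
print(alt_allele_fraction(graph_counts))
```

Pooling counts across sites, rather than averaging per-site fractions, keeps low-coverage sites from dominating the estimate, which matters for typical low-coverage aDNA data.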
CONCLUSIONS
Our findings demonstrate that aligning aDNA sequences to variation graphs effectively mitigates the impact of reference bias when analyzing aDNA, while retaining mapping sensitivity and allowing detection of variation, in particular indel variation, that was previously missed.
Identifiers
pubmed: 32943086
doi: 10.1186/s13059-020-02160-7
pii: 10.1186/s13059-020-02160-7
pmc: PMC7499850
Chemical substances
DNA, Ancient
0
Publication types
Evaluation Study
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
250
Grants
Agency: Wellcome Trust
ID: WT206194
Country: United Kingdom
Agency: Wellcome Trust
ID: WT207492
Country: United Kingdom