Recurrent miscalling of missense variation from short-read genome sequence data.


Journal

BMC genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258

Publication information

Publication date:
16 Jul 2019
History:
entrez: 17 Jul 2019
pubmed: 17 Jul 2019
medline: 18 Dec 2019
Status: epublish

Abstract

BACKGROUND
Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation.
RESULTS
We find that sequence variation called from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and they are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2-300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3-5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation.
CONCLUSIONS
Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. The variant miscalls that arise are highly sequence-intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether a given false variant will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.
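
The simulation-based screen described in the abstract (repeated rounds of read resampling, realignment and recalling) can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that each simulation round has already produced a set of called variants, and that the variants truly present in the simulated genome are known, so any other call is a false positive. The `Variant` tuple, the function names and the `min_rounds` threshold are hypothetical choices made for illustration only.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant key: chromosome, position, ref and alt alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def recurrent_false_positives(
    rounds: Iterable[Iterable[Variant]],
    truth: set[Variant],
    min_rounds: int = 2,
) -> set[Variant]:
    """Return variants miscalled in at least `min_rounds` simulation rounds.

    `rounds` holds the calls from each simulated resample/realign/recall
    round; `truth` holds the variants known to be present in the simulated
    genome. Any call outside `truth` is counted as a false positive.
    """
    counts: Counter[Variant] = Counter()
    for calls in rounds:
        # Deduplicate within a round so each round contributes at most once.
        counts.update(set(calls) - truth)
    return {variant for variant, n in counts.items() if n >= min_rounds}


def filter_calls(calls: Iterable[Variant], blacklist: set[Variant]) -> list[Variant]:
    """Drop calls that match a recurrent false positive, preserving order."""
    return [variant for variant in calls if variant not in blacklist]
```

Because the abstract describes the miscalls as intermittent, a variant need not appear in every round to be flagged. The threshold is therefore left as a parameter rather than fixed, and the resulting set would be built per genome and per read length, since the abstract reports sensitivity to both.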

Identifiers

pubmed: 31307400
doi: 10.1186/s12864-019-5863-2
pii: 10.1186/s12864-019-5863-2
pmc: PMC6631443

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

546

Authors

Matthew A Field (MA)

Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia.
Australian Institute of Tropical Health and Medicine, Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Cairns, Queensland, Australia.

Gaetan Burgio (G)

Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia.

Aaron Chuah (A)

Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia.

Jalila Al Shekaili (J)

Department of Microbiology and Immunology, Sultan Qaboos University Hospital, Seeb, Oman.

Batool Hassan (B)

Department of Medicine, Sultan Qaboos University Hospital, Muscat, Oman.

Nashat Al Sukaiti (N)

Department of Paediatrics, Allergy, and Clinical Immunology Unit, Royal Hospital, Muscat, Oman.

Simon J Foote (SJ)

Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia.

Matthew C Cook (MC)

Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia.
Department of Immunology, Canberra Hospital, Canberra, Australian Capital Territory, Australia.

T Daniel Andrews (TD)

Department of Immunology and Infectious Disease, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia. dan.andrews@anu.edu.au.
