ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest.
Journal
PLoS computational biology
ISSN: 1553-7358
Abbreviated title: PLoS Comput Biol
Country: United States
NLM ID: 101238922
Publication information
Publication date:
12 2019
History:
received: 2019-06-29
accepted: 2019-11-21
revised: 2020-01-01
pubmed: 2019-12-19
medline: 2020-03-17
entrez: 2019-12-19
Status:
epublish
Abstract
Next-generation sequencing (NGS) technology enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses information on sequencing quality, such as sequencing depth, genotyping quality, and GC content, to predict whether a particular variant is likely to be a false positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets, one consisting of related individuals from families and the other of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants, such as VQSR of GATK, by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is a practical way to perform quality control on genetic variants from sequencing data.
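The combination of a filtering step and a machine learning step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from ForestQC: the feature order (mean depth, mean genotype quality, GC content), the thresholds in `filter_label`, the idea that the filter's confident calls serve as training labels, and the helper names. To stay dependency-free, a small bagged ensemble of one-split trees stands in for a full random forest.

```python
import random
from collections import Counter

# Hypothetical feature order: (mean depth, mean genotype quality, GC content).

def filter_label(v):
    """Hard-filter step. Thresholds are illustrative assumptions only."""
    depth, gq, _gc = v
    if depth >= 30 and gq >= 60:
        return "good"
    if depth < 10 or gq < 20:
        return "bad"
    return None  # undetermined: left for the classifier

def fit_stump(rows, labels, rng):
    """One-split tree on a randomly chosen feature, minimizing training errors."""
    f = rng.randrange(len(rows[0]))
    best = None
    for t in sorted({r[f] for r in rows}):
        for hi, lo in (("good", "bad"), ("bad", "good")):
            errs = sum((hi if r[f] >= t else lo) != y
                       for r, y in zip(rows, labels))
            if best is None or errs < best[0]:
                best = (errs, f, t, hi, lo)
    return best[1:]  # (feature index, threshold, label if >= t, label if < t)

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Bagging: each stump is fit on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], rng))
    return forest

def predict(forest, v):
    """Majority vote over all stumps."""
    votes = Counter(hi if v[f] >= t else lo for f, t, hi, lo in forest)
    return votes.most_common(1)[0][0]

# Toy variants (made-up numbers, not real sequencing data).
variants = [(35, 80, 0.41), (40, 90, 0.45), (32, 70, 0.50),
            (5, 10, 0.70), (8, 15, 0.65), (6, 12, 0.72),
            (20, 40, 0.48)]
labels = [filter_label(v) for v in variants]

# Train on variants the filter could label; classify the rest.
train = [(v, y) for v, y in zip(variants, labels) if y is not None]
forest = fit_forest([v for v, _ in train], [y for _, y in train])
undetermined = [v for v, y in zip(variants, labels) if y is None]
calls = [predict(forest, v) for v in undetermined]
```

In this sketch the filter handles the clear-cut variants and the ensemble only decides the ambiguous ones, which is one plausible reading of how the two approaches could be combined.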
Identifiers
pubmed: 31851693
doi: 10.1371/journal.pcbi.1007556
pii: PCOMPBIOL-D-19-00961
pmc: PMC6938691
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
e1007556
Grants
Agency: NIEHS NIH HHS
ID: K01 ES028064
Country: United States
Conflict of interest statement
The authors have declared that no competing interests exist.