Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.
Journal
PloS one
ISSN: 1932-6203
Abbreviated title: PLoS One
Country: United States
NLM ID: 101285081
Publication information
Publication date: 2021
History:
received: 30 April 2021
accepted: 2 October 2021
entrez: 14 October 2021
pubmed: 15 October 2021
medline: 26 November 2021
Status: epublish
Abstract
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
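As a rough illustration of the quantities the abstract refers to, the sketch below computes k-mer sets, their pairwise Jaccard similarity, and the fraction of the 4^k sequence space a genome covers. It is not the authors' pipeline: the function names, the toy sequences, the choice to count forward-strand k-mers only, and the skipping of windows containing non-ACGT symbols are all assumptions made for this example.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def kmer_set(seq, k):
    """Return the distinct k-mers of seq (forward strand only, an assumption).

    Windows containing any symbol other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    }


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def space_coverage(kmers, k):
    """Fraction of the 4**k possible k-mers that are present in the set."""
    return len(kmers) / 4 ** k


# Hypothetical toy sequences standing in for genomes.
genomes = {
    "toy_A": "ACGTACGTGGCA",
    "toy_B": "ACGTACGTGGCT",
    "toy_C": "TTTTGGGGCCCC",
}
k = 4  # far shorter than the 11 to 41 range in the abstract, to suit toy data

sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
similarity = {
    (x, y): jaccard(sets[x], sets[y])
    for x, y in combinations(sorted(sets), 2)
}

for pair, value in similarity.items():
    print(pair, round(value, 3))
```

A distance of 1 minus the Jaccard similarity could then be passed to a hierarchical clustering routine. For genome-scale inputs, exact sets of 21- or 31-mers become large, so sketching or hashing schemes are commonly used in practice instead of the plain Python sets shown here.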
Identifiers
pubmed: 34648558
doi: 10.1371/journal.pone.0258693
pii: PONE-D-21-14340
pmc: PMC8516232
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
e0258693
Conflict of interest statement
The authors have declared that no competing interests exist.