Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.


Journal

PloS one
ISSN: 1932-6203
Abbreviated title: PLoS One
Country: United States
NLM ID: 101285081

Publication information

Publication date:
2021
History:
received: 30 04 2021
accepted: 02 10 2021
entrez: 14 10 2021
pubmed: 15 10 2021
medline: 26 11 2021
Status: epublish

Abstract

Information-theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxa at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
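
As a rough illustration of the pairwise k-mer Jaccard similarity mentioned in the abstract, the following minimal Python sketch computes it for two short sequences. It is not the authors' pipeline: the function names `kmer_set` and `jaccard_similarity` are hypothetical, the toy sequences are invented, and the sketch assumes plain (non-canonical) k-mers held in memory as Python sets, which would not scale to whole genomes at k = 21 or 31.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of distinct k-mers (length-k substrings) in a sequence.

    Assumption: k-mers are taken from the given strand only; reverse
    complements are not merged into canonical k-mers.
    """
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A intersect B| / |A union B| of two k-mer sets."""
    union = a | b
    if not union:
        # Both sets empty: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Toy example with k = 3 (the study itself used k = 11, 21, 31 and 41).
genome_x = "ACGTAC"
genome_y = "ACGTTT"
similarity = jaccard_similarity(kmer_set(genome_x, 3), kmer_set(genome_y, 3))
print(f"3-mer Jaccard similarity: {similarity:.3f}")
```

A matrix of such pairwise similarities (or the corresponding distances, 1 minus similarity) is the kind of input a hierarchical clustering step would take.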

Identifiers

pubmed: 34648558
doi: 10.1371/journal.pone.0258693
pii: PONE-D-21-14340
pmc: PMC8516232

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

e0258693

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

Yuval Bussi (Y)

Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel.
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.
Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel.

Ruti Kapon (R)

Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel.

Ziv Reich (Z)

Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel.
