Predicting the pathogenicity of bacterial genomes using widely spread protein families.
Commensal bacteria
Comparative genomics
Opportunistic bacteria
Pathogenic bacteria
Protein families
Random forest
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
24 Jun 2022
History:
received: 08 Sep 2021
accepted: 13 Apr 2022
entrez: 24 Jun 2022
pubmed: 25 Jun 2022
medline: 29 Jun 2022
Status:
epublish
Abstract
The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria can invade their hosts and cause disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for detecting emerging pathogens and for developing new treatments. Previous methods for classifying bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads made the resulting model more interpretable. However, the accuracy of protein-family-based classifiers can still be improved. We developed a wide scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, each belonging to a different species. Our comparative analysis shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins involved in the ability of bacteria to survive and replicate during an infection, rather than proteins directly involved in damaging or invading the host.
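As an illustration of what "protein families as features" could look like in practice, the following is a minimal sketch that encodes each genome as a binary presence/absence vector over protein families and measures how widely each family is spread across genomes. The abstract does not specify the encoding WSPC uses, so the binary encoding, the toy genomes, the family IDs, the labels, and the helper names `build_feature_matrix` and `family_spread` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy data (invented for illustration, not from the paper):
# genome name -> set of protein-family IDs found in that genome.
genomes = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam3"},
    "genome_C": {"fam2", "fam4"},
    "genome_D": {"fam1", "fam4"},
}
# Hypothetical labels: 1 = pathogenic, 0 = non-pathogenic.
labels = {"genome_A": 1, "genome_B": 1, "genome_C": 0, "genome_D": 0}


def build_feature_matrix(genomes):
    """Return (families, rows): a sorted family list and, per genome,
    a 0/1 vector marking which families are present (assumed encoding)."""
    families = sorted(set().union(*genomes.values()))
    rows = {
        name: [1 if fam in content else 0 for fam in families]
        for name, content in genomes.items()
    }
    return families, rows


def family_spread(genomes):
    """Return the fraction of genomes in which each family occurs."""
    counts = Counter(fam for content in genomes.values() for fam in content)
    return {fam: counts[fam] / len(genomes) for fam in counts}


families, rows = build_feature_matrix(genomes)
spread = family_spread(genomes)
```

A matrix of this shape, together with the labels, is the kind of input a random forest implementation would accept; the spread values give a rough way to ask whether a given feature is widely distributed or confined to a few genomes.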
Identifiers
pubmed: 35751023
doi: 10.1186/s12859-022-04777-w
pii: 10.1186/s12859-022-04777-w
pmc: PMC9233384
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
253
Grants
Agency: Israel Science Foundation
ID: 939/18
Agency: Israel Science Foundation
ID: 988/19
Agency: Israel Ministry of Science and Technology
ID: 316841
Copyright information
© 2022. The Author(s).