Predicting the pathogenicity of bacterial genomes using widely spread protein families.

Keywords: Commensal bacteria; Comparative genomics; Opportunistic bacteria; Pathogenic bacteria; Protein families; Random forest

Journal

BMC Bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
24 Jun 2022
History:
received: 8 Sep 2021
accepted: 13 Apr 2022
entrez: 24 Jun 2022
pubmed: 25 Jun 2022
medline: 29 Jun 2022
Status: epublish

Abstract

The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria can invade their hosts and cause disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for the detection of emerging pathogens and for the development of new treatments. Previous methods for classifying bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads made the resulting model easier to interpret. However, the accuracy of protein-family-based classifiers can still be improved. We developed a wide scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, where each genome belongs to a different species. A comparative analysis we conducted shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins that are involved in the ability of bacteria to survive and replicate during an infection, rather than proteins that are directly involved in damaging or invading the host.
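
To make the idea of protein-family features concrete, the following minimal sketch encodes genomes as presence/absence vectors over protein families and ranks each family by how differently it occurs in the two classes. It is an illustration only and not the WSPC method: the record lists random forest as a keyword, whereas this sketch uses a plain prevalence difference as a stand-in for feature importance. The genome ids, family ids, labels, and function names are all hypothetical.

```python
# Hypothetical toy data: genome id -> (label, set of protein-family ids).
# The genomes, labels, and family ids are invented for illustration only.
GENOMES = {
    "g1": ("pathogenic", {"PF_A", "PF_B", "PF_C"}),
    "g2": ("pathogenic", {"PF_A", "PF_C"}),
    "g3": ("non-pathogenic", {"PF_B", "PF_D"}),
    "g4": ("non-pathogenic", {"PF_B", "PF_C", "PF_D"}),
}


def presence_matrix(genomes):
    """Encode each genome as a 0/1 vector over the sorted family vocabulary."""
    families = sorted(set().union(*(fams for _, fams in genomes.values())))
    rows = {
        gid: [1 if fam in fams else 0 for fam in families]
        for gid, (_, fams) in genomes.items()
    }
    return families, rows


def prevalence_gap(genomes, positive="pathogenic"):
    """Map each family to (class prevalence gap, overall prevalence).

    The gap is the fraction of positive-class genomes carrying the family
    minus the fraction of the remaining genomes carrying it. This is a
    simple stand-in for a feature-importance score, not a random forest.
    """
    families, rows = presence_matrix(genomes)
    pos = [rows[g] for g, (label, _) in genomes.items() if label == positive]
    neg = [rows[g] for g, (label, _) in genomes.items() if label != positive]
    result = {}
    for j, fam in enumerate(families):
        p_pos = sum(r[j] for r in pos) / len(pos)
        p_neg = sum(r[j] for r in neg) / len(neg)
        overall = sum(r[j] for r in rows.values()) / len(rows)
        result[fam] = (p_pos - p_neg, overall)
    return result


if __name__ == "__main__":
    ranked = sorted(prevalence_gap(GENOMES).items(), key=lambda kv: -abs(kv[1][0]))
    for fam, (gap, overall) in ranked:
        print(f"{fam}: gap={gap:+.2f}, prevalence={overall:.2f}")
```

A family with a large absolute gap separates the classes well in this toy setting, and its overall prevalence indicates how widely it is spread across the genomes.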

Identifiers

pubmed: 35751023
doi: 10.1186/s12859-022-04777-w
pii: 10.1186/s12859-022-04777-w
pmc: PMC9233384

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

253

Grants

Agency: Israel Science Foundation
ID: 939/18
Agency: Israel Science Foundation
ID: 988/19
Agency: Israel Ministry of Science and Technology
ID: 316841

Copyright information

© 2022. The Author(s).

Authors

Shaked Naor-Hoffmann (S)

Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel.

Dina Svetlitsky (D)

Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel.

Neta Sal-Man (N)

The Shraga Segal Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Be'er Sheva, Israel.

Yaron Orenstein (Y)

School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be'er Sheva, Israel.

Michal Ziv-Ukelson (M)

Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel. michaluz@cs.bgu.ac.il.
