Classification of bioinformatical subject
Machine learning
Microbial genomics
Journal
iScience
ISSN: 2589-0042
Abbreviated title: iScience
Country: United States
NLM ID: 101724038
Publication information
Publication date:
15 Mar 2024
History:
received: 27 Jul 2023
revised: 13 Dec 2023
accepted: 13 Feb 2024
medline: 5 Mar 2024
pubmed: 5 Mar 2024
entrez: 5 Mar 2024
Status:
epublish
Abstract
Whole genome sequencing of bacteria is important to enable strain classification. Using entire genomes as an input to machine learning (ML) models would allow rapid classification of strains while using information from multiple genetic elements. We developed a "bag-of-words" approach to encode, using SentencePiece or k-mer tokenization, entire bacterial genomes and analyze these with ML. Initial model selection identified SentencePiece with 8,000 and 32,000 words as the best approach for genome tokenization. We then classified in
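The "bag-of-words" encoding mentioned in the abstract can be illustrated with a minimal sketch of the k-mer variant only. The abstract does not give the k value, the vocabulary construction, or the ML model, so the function names (`kmer_tokens`, `bag_of_words`), the choice k = 3, and the toy sequence below are all assumptions made for illustration, not the authors' implementation. SentencePiece tokenization is not shown.

```python
from collections import Counter


def kmer_tokens(sequence, k):
    """Split a nucleotide sequence into overlapping k-mers (hypothetical helper)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(sequence, k, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(kmer_tokens(sequence, k))
    return [counts[word] for word in vocabulary]


# Toy sequence standing in for a genome (assumption; real genomes are megabases long).
toy_genome = "ATGCGATGCA"
k = 3  # assumed value, not taken from the paper

# Build the vocabulary from the toy sequence itself, sorted for a stable column order.
vocabulary = sorted(set(kmer_tokens(toy_genome, k)))
vector = bag_of_words(toy_genome, k, vocabulary)

for word, count in zip(vocabulary, vector):
    print(word, count)
```

The resulting fixed-length count vector is the kind of input a standard classifier could consume. In practice the vocabulary would be fixed across all genomes so that every vector has the same columns.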
Identifiers
pubmed: 38439962
doi: 10.1016/j.isci.2024.109257
pii: S2589-0042(24)00478-4
pmc: PMC10910294
Publication types
Journal Article
Languages
eng
Pagination
109257
Copyright information
© 2024 GSK.
Conflict of interest statement
M.P. and S.B. disclose that their postdoctoral grant at the University of Pisa, Italy, is funded by GSK. Outside of the submitted work, S.B. also discloses having received grants from the University of Siena, Italy and the University of Tuscia, Viterbo, Italy; payment or honoraria for lectures, presentations, speakers’ bureaus, manuscript writing or educational events from the University of Siena, Italy, University of Bari, Italy, and the Italica Academy srl. A.P., A.B., G.R., A.M., and M.B. are employed by GSK. M.S. was an intern at GSK during the time of the study. G.R. holds shares in GSK and in Novartis AG. A.M. holds shares in GSK. C.P. and A.S. disclose that GSK commissioned this research. The authors declare no other financial and non-financial relationships and activities and no other conflicts of interest.