Classification of

Classification of bioinformatical subject; Machine learning; Microbial genomics

Journal

iScience
ISSN: 2589-0042
Abbreviated title: iScience
Country: United States
NLM ID: 101724038

Publication information

Publication date:
15 Mar 2024
History:
received: 27 Jul 2023
revised: 13 Dec 2023
accepted: 13 Feb 2024
medline: 5 Mar 2024
pubmed: 5 Mar 2024
entrez: 5 Mar 2024
Status: epublish

Abstract

Whole genome sequencing of bacteria is important to enable strain classification. Using entire genomes as an input to machine learning (ML) models would allow rapid classification of strains while using information from multiple genetic elements. We developed a "bag-of-words" approach to encode, using SentencePiece or k-mer tokenization, entire bacterial genomes and analyze these with ML. Initial model selection identified SentencePiece with 8,000 and 32,000 words as the best approach for genome tokenization. We then classified in
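As a rough illustration of the k-mer variant of the "bag-of-words" encoding mentioned in the abstract, the sketch below turns a DNA string into a fixed-length vector of k-mer frequencies. It is not the authors' implementation: the overlapping stride-1 tokenization, the relative-frequency normalization, the value of k, the function names `kmer_tokenize` and `bag_of_words`, and the toy sequence are all assumptions made for this example. The SentencePiece variant is not shown.

```python
from collections import Counter
from itertools import product


def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(sequence: str, k: int = 3) -> list[float]:
    """Encode a sequence as relative k-mer frequencies.

    The vocabulary is every possible k-mer over A/C/G/T, in a fixed order,
    so all genomes map to vectors of the same length (4**k).
    K-mers containing other symbols (e.g. N) are ignored.
    """
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(kmer_tokenize(sequence, k))
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


# Hypothetical toy sequence, standing in for a whole genome.
toy_genome = "ATGCGATGCA"
vector = bag_of_words(toy_genome, k=2)
```

A vector like this, computed once per genome, could then be passed as a feature row to any standard classifier; word order within the genome is discarded, which is what makes it a bag-of-words representation.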

Identifiers

pubmed: 38439962
doi: 10.1016/j.isci.2024.109257
pii: S2589-0042(24)00478-4
pmc: PMC10910294

Publication types

Journal Article

Languages

eng

Pagination

109257

Copyright information

© 2024 GSK.

Conflict of interest statement

M.P. and S.B. disclose that their postdoctoral grant at the University of Pisa, Italy, is funded by GSK. Outside of the submitted work, S.B. also discloses having received grants from the University of Siena, Italy and the University of Tuscia, Viterbo, Italy; payment or honoraria for lectures, presentations, speakers’ bureaus, manuscript writing or educational events from the University of Siena, Italy, University of Bari, Italy, and the Italica Academy srl. A.P., A.B., G.R., A.M., and M.B. are employed by GSK. M.S. was an intern at GSK during the time of the study. G.R. holds shares in GSK and in Novartis AG. A.M. holds shares in GSK. C.P. and A.S. disclose that GSK commissioned this research. The authors declare no other financial and non-financial relationships and activities and no other conflicts of interest.

Authors

Marco Podda (M)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.

Simone Bonechi (S)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.
Department of Computer Science, University of Pisa, 56127 Pisa, Italy.

Andrea Palladino (A)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.

Mattia Scaramuzzino (M)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.

Alessandro Brozzi (A)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.

Guglielmo Roma (G)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.

Alessandro Muzzi (A)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.

Corrado Priami (C)

Department of Computer Science, University of Pisa, 56127 Pisa, Italy.

Alina Sîrbu (A)

Department of Computer Science, University of Pisa, 56127 Pisa, Italy.

Margherita Bodini (M)

Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy.

MeSH classifications