AutoPeptideML: A study on how to build more trustworthy peptide bioactivity predictors.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
18 Sep 2024
History:
received: 24 06 2024
revised: 08 08 2024
accepted: 17 09 2024
medline: 18 9 2024
pubmed: 18 9 2024
entrez: 18 9 2024
Status: ahead of print

Abstract

Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build their own custom models. We examine the steps in the development life-cycle of binary peptide bioactivity predictors and identify key steps where automation can yield not only a more accessible method, but also a more robust and interpretable evaluation, leading to more trustworthy models. We present a new automated method for drawing negative peptides that achieves a better balance between specificity and generalisation than current alternatives. We study the effect of homology-based partitioning for generating the training and testing data subsets and demonstrate that model performance is overestimated when no such homology correction is used, which indicates that prior studies may have overestimated how well their models perform on new peptide sequences. We also conduct a systematic analysis of different protein language models as peptide representation methods and find that they can serve as better descriptors than a naive alternative, but that there is no significant difference across models with different sizes or algorithms. Finally, we demonstrate that an ensemble of optimised traditional machine learning algorithms can compete with more complex neural network models, while being more computationally efficient. We integrate these findings into AutoPeptideML, an easy-to-use AutoML tool that allows researchers without a computational background to build new predictive models for peptide bioactivity in a matter of minutes. Availability: source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML, and a dedicated web-server at http://peptide.ucd.ie/AutoPeptideML. Contact: raul.fernandezdiaz@ucdconnect.ie or denis.shields@ucd.ie. Supplementary information can be accessed from Zenodo: https://zenodo.org/records/13363975.
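The idea of homology-based partitioning mentioned in the abstract can be illustrated with a minimal sketch. This is not the AutoPeptideML implementation: the function names are hypothetical, and `difflib.SequenceMatcher` is used only as a crude stand-in for a real sequence identity measure, which in practice would come from an alignment tool. The sketch groups similar peptides into clusters and assigns whole clusters to the test set, so that no test peptide is similar to any training peptide above the chosen threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Crude stand-in for pairwise sequence identity (assumption, not an alignment)."""
    a, b = sorted((a, b))  # fix the argument order so the score is symmetric
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_by_similarity(seqs, threshold=0.5):
    """Single-linkage clustering: pairs at or above `threshold` share a cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def homology_aware_split(seqs, test_fraction=0.2, threshold=0.5):
    """Move whole clusters into the test set, smallest first, up to the target size."""
    clusters = sorted(cluster_by_similarity(seqs, threshold), key=len)
    target = round(test_fraction * len(seqs))
    test = []
    for members in clusters:
        if len(test) + len(members) <= target:
            test.extend(members)
    test_set = set(test)
    train = [i for i in range(len(seqs)) if i not in test_set]
    return train, sorted(test)
```

Calling `homology_aware_split(peptides, test_fraction=0.2, threshold=0.5)` returns two lists of indices into `peptides`. A random split, by contrast, can place near-duplicate peptides on both sides, which is the situation the abstract associates with overestimated performance.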

Identifiers

pubmed: 39292535
pii: 7760207
doi: 10.1093/bioinformatics/btae555

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2024. Published by Oxford University Press.

Authors

Raúl Fernández-Díaz (R)

IBM Research, Dublin, IBM Technology Campus, Damastown Industrial Park, Mulhuddart, Dublin, D15 HN66, Ireland.
School of Medicine, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.
Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.
The SFI Centre for Research Training in Genomics Data Science.

Rodrigo Cossio-Pérez (R)

School of Medicine, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.
Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.
Department of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, B1876, Provincia de Buenos Aires, Argentina.

Clement Agoni (C)

School of Medicine, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.
Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.
Discipline of Pharmaceutical Sciences, School of Health Sciences, University of KwaZulu-Natal, Durban, 4000, South Africa.

Thanh Lam Hoang (TL)

IBM Research, Dublin, IBM Technology Campus, Damastown Industrial Park, Mulhuddart, Dublin, D15 HN66, Ireland.

Vanessa Lopez (V)

IBM Research, Dublin, IBM Technology Campus, Damastown Industrial Park, Mulhuddart, Dublin, D15 HN66, Ireland.

Denis C Shields (DC)

School of Medicine, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.
Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Belfield, Dublin, D04 C1P1, Ireland.

MeSH classifications