GASP: A Pan-Specific Predictor of Family 1 Glycosyltransferase Acceptor Specificity Enabled by a Pipeline for Substrate Feature Generation and Large-Scale Experimental Screening.


Journal

ACS omega
ISSN: 2470-1343
Abbreviated title: ACS Omega
Country: United States
NLM ID: 101691658

Publication information

Publication date:
25 Jun 2024
History:
received: 19 02 2024
revised: 27 05 2024
accepted: 29 05 2024
medline: 1 7 2024
pubmed: 1 7 2024
entrez: 1 7 2024
Status: epublish

Abstract

Glycosylation represents a major chemical challenge: although it is one of the most common reactions in nature, conventional chemistry struggles with stereochemistry, regioselectivity, and solubility issues. In contrast, family 1 glycosyltransferase (GT1) enzymes can glycosylate virtually any given nucleophilic group with perfect control over stereochemistry and regioselectivity. However, the appropriate catalyst for a given reaction must be identified among the tens of thousands of available sequences. Here, we present the glycosyltransferase acceptor specificity predictor (GASP) model, a data-driven approach to the identification of reactive GT1:acceptor pairs. We trained a random forest-based acceptor predictor on literature data and validated it on independent in-house generated data on 1001 GT1:acceptor pairs, obtaining an AUROC of 0.79 and a balanced accuracy of 72%. The performance was stable even for completely new GT1s and acceptors not present in the training data set, highlighting the pan-specificity of GASP. Moreover, the model is capable of parsing all known GT1 sequences, as well as all chemicals, the latter through a pipeline that generates 153 chemical features for a given molecule, taking the CID or SMILES as input (freely available at https://github.com/degnbol/GASP). To investigate the power of GASP, the model prediction probability scores were compared to GT1 substrate conversion yields from a newly published data set, with the top 50% of GASP predictions corresponding to reactions with >50% synthetic yields. The model was also tested in two comparative case studies: glycosylation of the anthelmintic drug niclosamide and the plant defensive compound DIBOA. In the first study, the model achieved an 83% hit rate, outperforming the 53% hit rate of a random selection assay. In the second case study, the hit rate of GASP was 50%; although lower than the 83% hit rate obtained with expert-selected enzymes, it provides reasonable performance for cases in which an expert opinion is unavailable. The hierarchical importance of the generated chemical features was investigated by negative feature selection, revealing properties related to cyclization and atom hybridization status to be the most important characteristics for accurate prediction. Our study provides a GT1:acceptor predictor that can be trained on other data sets, enabled by the automated feature generation pipelines. We also release the new in-house generated data set used for testing GASP to facilitate the future development of GT1 activity predictors and their robust benchmarking.
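
For readers unfamiliar with the two validation metrics quoted in the abstract, the following is a minimal sketch of how AUROC and balanced accuracy can be computed for a binary reactive/non-reactive classifier. It is not taken from the GASP repository; the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration only.

```python
# Illustrative sketch only: toy data, not results from the GASP study.

def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random
    negative (ties count as half), i.e. the Mann-Whitney U formulation."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity
    (recall on negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Hypothetical GT1:acceptor pairs: 1 = reactive, 0 = non-reactive.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical predicted probabilities of reactivity.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_auroc = auroc(y_true, scores)
toy_bacc = balanced_accuracy(y_true, y_pred)
print(f"AUROC = {toy_auroc:.3f}, balanced accuracy = {toy_bacc:.3f}")
```

AUROC is threshold-free and summarizes how well the scores rank reactive pairs above non-reactive ones, whereas balanced accuracy depends on the chosen threshold and weights both classes equally, which matters when reactive and non-reactive pairs are unevenly represented.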

Identifiers

pubmed: 38947828
doi: 10.1021/acsomega.4c01583
pmc: PMC11209901

Publication types

Journal Article

Languages

eng

Pagination

27278-27288

Copyright information

© 2024 The Authors. Published by American Chemical Society.

Conflict of interest statement

The authors declare no competing financial interest.

Authors

David Harding-Larsen (D)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Christian Degnbol Madsen (CD)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.
The University of Melbourne Faculty of Science, Melbourne Integrative Genomics, University of Melbourne, Building 184, Royal Parade, Parkville 3010, Melbourne, VIC 3052, Australia.

David Teze (D)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Tiia Kittilä (T)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Mads Rosander Langhorn (MR)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Hani Gharabli (H)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Mandy Hobusch (M)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Felipe Mejia Otalvaro (FM)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Onur Kırtel (O)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Gonzalo Nahuel Bidart (GN)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Stanislav Mazurenko (S)

Department of Experimental Biology and RECETOX, Faculty of Science, Masarykova Univerzita, Kamenice 5/A4, Brno 625 00, Czech Republic.
International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, Brno 656 91, Czech Republic.

Evelyn Travnik (E)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

Ditte Hededam Welner (DH)

DTU Biosustain, Technical University of Denmark, Kemitorvet 220, Lyngby, Denmark 2800.

MeSH classifications