canSAR chemistry registration and standardization pipeline.

Keywords: Canonicalization; Compound hierarchy; FDA-approved drugs; KNIME; Standardization; Tautomerism; canSAR

Journal

Journal of Cheminformatics
ISSN: 1758-2946
Abbreviated title: J Cheminform
Country: England
NLM ID: 101516718

Publication information

Publication date:
28 May 2022
History:
received: 4 February 2022
accepted: 4 April 2022
entrez: 1 June 2022
pubmed: 2 June 2022
medline: 2 June 2022
Status: epublish

Abstract

BACKGROUND
Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it brings a wealth of knowledge about compounds into one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers or racemates), which highlights the need to organise related compounds in hierarchies that alert the user to potentially relevant bioactivity data. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.
RESULTS
We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline registers compounds and creates the compound hierarchy in five steps: (1) structure checker, (2) standardization, (3) generation of canonical tautomers and representative structures, (4) salt strip, and (5) generation of the abstract structure used to build the compound hierarchy. Unlike ChEMBL's RDKit pipeline, and similar to PubChem's OpenEye pipeline, we carry out compound canonicalization before deriving the parent structure. canSARchem has a lower rejection rate than both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared with the majority of bioactive compounds, which demonstrates the importance of this step.
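The ordering of the five steps can be illustrated with a minimal Python sketch. This is not the canSARchem KNIME workflow: every function name, the lookup table, and the string-based rules below are hypothetical stand-ins chosen so the example runs without a cheminformatics toolkit. It only shows the sequence described in the abstract (tautomer canonicalization before salt stripping) and how a final abstract-structure key might group related forms into one hierarchy.

```python
from collections import defaultdict

# Hypothetical lookup standing in for a real tautomer canonicalizer (toy data).
TOY_CANONICAL_TAUTOMER = {
    "Oc1ccccn1": "O=c1cccc[nH]1",  # 2-hydroxypyridine mapped to 2-pyridone
}


def check_structure(smiles):
    """Step 1 (toy): reject empty input or unbalanced parentheses/brackets."""
    if not smiles:
        return False
    return (smiles.count("(") == smiles.count(")")
            and smiles.count("[") == smiles.count("]"))


def standardize(smiles):
    """Step 2 (toy): only strips surrounding whitespace."""
    return smiles.strip()


def canonical_tautomer(smiles):
    """Step 3 (toy): map each dot-separated fragment through the lookup."""
    return ".".join(TOY_CANONICAL_TAUTOMER.get(f, f) for f in smiles.split("."))


def strip_salt(smiles):
    """Step 4 (toy): keep the longest fragment as the parent structure."""
    return max(smiles.split("."), key=len)


def abstract_structure(smiles):
    """Step 5 (toy): drop stereo markers so stereoisomers share one key."""
    for marker in ("@", "/", "\\"):
        smiles = smiles.replace(marker, "")
    return smiles


def register(smiles):
    """Run the five steps in order; return None for rejected input."""
    if not check_structure(smiles):
        return None
    std = standardize(smiles)
    taut = canonical_tautomer(std)   # canonicalize before taking the parent
    parent = strip_salt(taut)
    return {
        "input": smiles,
        "standardized": std,
        "canonical_tautomer": taut,
        "parent": parent,
        "abstract": abstract_structure(parent),
    }


def build_hierarchy(smiles_list):
    """Group accepted inputs by their abstract-structure key."""
    groups = defaultdict(list)
    for smiles in smiles_list:
        record = register(smiles)
        if record is not None:
            groups[record["abstract"]].append(record["input"])
    return dict(groups)


compounds = [
    "Oc1ccccn1",            # one tautomer
    "O=c1cccc[nH]1.Cl",     # the other tautomer, as a hydrochloride
    "C[C@H](N)C(=O)O",      # one enantiomer of alanine
    "C[C@@H](N)C(=O)O",     # the other enantiomer
    "CC(N",                 # malformed input, rejected at step 1
]
hierarchy = build_hierarchy(compounds)
```

In this toy run the two tautomeric inputs end up under one key and the two enantiomers under another, while the malformed string is rejected. A real pipeline would replace each placeholder with toolkit-based chemistry; the string rules here would not, for example, match a stereo-free input against its stereo-annotated forms.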
CONCLUSIONS
We use canSARchem to standardize all the compounds uploaded to canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. canSARchem is available at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Identifiers

pubmed: 35643512
doi: 10.1186/s13321-022-00606-7
pii: 10.1186/s13321-022-00606-7
pmc: PMC9148294

Publication types

Journal Article

Languages

eng

Pagination

28

Grants

Agency: Cancer Research UK
ID: C35696/A23187
Country: United Kingdom
Agency: Cancer Research UK
ID: C309/A11566
Country: United Kingdom
Agency: Wellcome Trust
ID: 212969/Z/18/Z
Country: United Kingdom
Agency: Wellcome Trust
ID: 204735/Z/16/Z
Country: United Kingdom
Agency: FP7 People: Marie-Curie Actions
ID: FP7/2007-2013

Copyright information

© 2022. The Author(s).

Authors

Daniela Dolciami (D)

Department of Data Science, The Institute of Cancer Research, London, SM2 5NG, UK.
Cancer Research UK Cancer Therapeutics Unit, The Institute of Cancer Research, London, SM2 5NG, UK.
BenevolentAI, London, W1T 5HD, UK.

Eloy Villasclaras-Fernandez (E)

Department of Data Science, The Institute of Cancer Research, London, SM2 5NG, UK.

Christos Kannas (C)

Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden.

Mirco Meniconi (M)

Cancer Research UK Cancer Therapeutics Unit, The Institute of Cancer Research, London, SM2 5NG, UK.
Dunad Therapeutics, Cambridge, UK.

Bissan Al-Lazikani (B)

MD Anderson Cancer Center, Houston, TX, 77054, USA. ballazikani@mdanderson.org.

Albert A Antolin (AA)

Department of Data Science, The Institute of Cancer Research, London, SM2 5NG, UK. Albert.Antolin@icr.ac.uk.
Cancer Research UK Cancer Therapeutics Unit, The Institute of Cancer Research, London, SM2 5NG, UK. Albert.Antolin@icr.ac.uk.

MeSH classifications