A data science roadmap for open science organizations engaged in early-stage drug discovery.

Drug Discovery / methods Machine Learning Data Science / methods Humans Artificial Intelligence Information Dissemination / methods Data Mining / methods Cloud Computing Databases, Factual

Journal

Nature communications

ISSN: 2041-1723

Titre abrégé: Nat Commun

Pays: England

ID NLM: 101528555

Informations de publication

Date de publication:
05 Jul 2024

Historique:

received: 13 12 2023

accepted: 12 06 2024

medline: 5 7 2024

pubmed: 5 7 2024

entrez: 4 7 2024

Statut: epublish

Résumé

The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, as many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how to best benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data-sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles openly, and at scale and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.

Identifiants

DOI: 10.1038/s41467-024-49777-x PMID: 38965235

pubmed: 38965235

doi: 10.1038/s41467-024-49777-x

pii: 10.1038/s41467-024-49777-x

doi:

Types de publication

Journal Article Review

Langues

eng

Sous-ensembles de citation

Pagination

5640

Subventions

Organisme : Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada (NSERC Canadian Network for Research and Innovation in Machining Technology)

ID : RGPIN-2019-04416

Informations de copyright

Références

Carter, A. J. et al. Target 2035: probing the human proteome. Drug Discov. Today 24, 2111–2115 (2019).

doi: 10.1016/j.drudis.2019.06.020 pubmed: 31278990

For chemists, the AI revolution has yet to happen. Nature 617, 438 (2023).

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).

doi: 10.1038/sdata.2016.18 pubmed: 26978244 pmcid: 4792175

Guarino, N. Formal Ontology and Information Systems. (IOS Press 1998).

Zdrazil, B. et al. The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 52, D1180–D1192 (2024).

doi: 10.1093/nar/gkad1004 pubmed: 37933841

Tom, G. et al. Self-driving laboratories for chemistry and materials science. Preprint at https://doi.org/10.26434/chemrxiv-2024-rj946 (2024).

Hohman, M. et al. Novel web-based tools combining chemistry informatics, biology and social networks for drug discovery. Drug Discov. Today 14, 261–270 (2009).

doi: 10.1016/j.drudis.2008.11.015 pubmed: 19231313

Muresan, S. et al. Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data. Drug Discov. Today 16, 1019–1030 (2011).

doi: 10.1016/j.drudis.2011.10.005 pubmed: 22024215

Sielemann, K., Hafner, A. & Pucker, B. The reuse of public datasets in the life sciences: potential risks and rewards. PeerJ 8, e9954 (2020).

doi: 10.7717/peerj.9954 pubmed: 33024631 pmcid: 7518187

Liu, R., Li, X. & Lam, K. S. Combinatorial chemistry in drug discovery. Curr. Opin. Chem. Biol. 38, 117–126 (2017).

doi: 10.1016/j.cbpa.2017.03.017 pubmed: 28494316 pmcid: 5645069

Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).

doi: 10.1038/nrg.2016.49 pubmed: 27184599 pmcid: 10373632

Brenner, S. & Lerner, R. A. Encoded combinatorial chemistry. Proc. Natl. Acad. Sci. USA 89, 5381–5383 (1992).

doi: 10.1073/pnas.89.12.5381 pubmed: 1608946 pmcid: 49295

Clark, M. A. et al. Design, synthesis and selection of DNA-encoded small-molecule libraries. Nat. Chem. Biol. 5, 647–654 (2009).

doi: 10.1038/nchembio.211 pubmed: 19648931

Goodnow, R. A., Dumelin, C. E. & Keefe, A. D. DNA-encoded chemistry: enabling the deeper sampling of chemical space. Nat. Rev. Drug Discov. 16, 131–147 (2017).

doi: 10.1038/nrd.2016.213 pubmed: 27932801

Harris, P. A. et al. DNA-Encoded library screening identifies Benzo[b][1,4]oxazepin-4-ones as highly potent and Monoselective Receptor Interacting protein 1 Kinase inhibitors. J. Med. Chem. 59, 2163–2178 (2016).

doi: 10.1021/acs.jmedchem.5b01898 pubmed: 26854747

Gironda-Martínez, A., Donckele, E. J., Samain, F. & Neri, D. DNA-Encoded chemical libraries: A comprehensive review with succesful stories and future challenges. ACS Pharmacol. Transl. Sci. 4, 1265–1279 (2021).

doi: 10.1021/acsptsci.1c00118 pubmed: 34423264 pmcid: 8369695

Satz, A. L., Kuai, L. & Peng, X. Selections and screenings of DNA-encoded chemical libraries against enzyme and cellular targets. Bioorg. Med. Chem. Lett. 39, 127851 (2021).

doi: 10.1016/j.bmcl.2021.127851 pubmed: 33631371

McCloskey, K. et al. Machine learning on DNA-Encoded libraries: A new paradigm for hit finding. J. Med. Chem. 63, 8857–8866 (2020).

doi: 10.1021/acs.jmedchem.0c00452 pubmed: 32525674

REAL Space—Enamine. https://enamine.net/compound-collections/real-compounds/real-space-navigator .

Ahn, S. et al. Allosteric “beta-blocker” isolated from a DNA-encoded small molecule library. Proc. Natl. Acad. Sci. USA 114, 1708–1713 (2017).

doi: 10.1073/pnas.1620645114 pubmed: 28130548 pmcid: 5321036

Ahn, S. et al. Small-molecule positive allosteric modulators of the β2-Adrenoceptor isolated from DNA-encoded libraries. Mol. Pharmacol. 94, 850–861 (2018).

doi: 10.1124/mol.118.111948 pubmed: 29769246 pmcid: 6022804

Cai, B., El Daibani, A., Bai, Y., Che, T. & Krusemark, C. J. Direct selection of DNA-Encoded libraries for biased agonists of GPCRs on live cells. JACS Au 3, 1076–1088 (2023).

doi: 10.1021/jacsau.2c00674 pubmed: 37124302 pmcid: 10131204

Fourches, D., Muratov, E. & Tropsha, A. Trust, but verify: On the importance of chemical structure curation in cheminformatics and QSAR modeling research. J. Chem. Inf. Model. 50, 1189–1204 (2010).

doi: 10.1021/ci100176x pubmed: 20572635 pmcid: 2989419

Understanding open science—UNESCO Digital Library. https://unesdoc.unesco.org/ark:/48223/pf0000383323 .

Mammoliti, A. et al. Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat. Commun. 12, 5797 (2021).

doi: 10.1038/s41467-021-25974-w pubmed: 34608132 pmcid: 8490371

Press, G. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Forbes https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/ .

BioCompute Portal. https://www.biocomputeobject.org/ .

Simonyan, V., Goecks, J. & Mazumder, R. Biocompute objects-A step towards evaluation and validation of biomedical scientific computations. PDA J. Pharm. Sci. Technol. 71, 136–146 (2017).

doi: 10.5731/pdajpst.2016.006734 pubmed: 27974626

Holland, S., Hosny, A., Newman, S., Joseph, J. & Chmielinski, K. The dataset nutrition label: a framework to drive higher data quality standards. In: Data Protection and Privacy (eds Hallian, D. et al.) 1–26 (Bloosmbury Publishing, 2020).

George, D. G. et al. The protein information resource (PIR) and the PIR-international protein sequence database. Nucleic Acids Res. 25, 24–28 (1997).

doi: 10.1093/nar/25.1.24 pubmed: 9016497 pmcid: 146415

wwPDB consortium Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528 (2019).

doi: 10.1093/nar/gky949

Kim, S. et al. PubChem 2023 update. Nucleic Acids Res. 51, D1373–D1380 (2023).

doi: 10.1093/nar/gkac956 pubmed: 36305812

Data Submission and Release Expectations | Data Sharing. https://sharing.nih.gov/genomic-data-sharing-policy/submitting-genomic-data/data-submission-and-release-expectations .

Ackloo, S. et al. CACHE (Critical Assessment of Computational Hit-finding Experiments): A public-private partnership benchmarking initiative to enable the development of computational methods for hit-finding. Nat. Rev. Chem. 6, 287–295 (2022).

doi: 10.1038/s41570-022-00363-z pubmed: 35783295 pmcid: 9246350

van Dijk, W., Schatschneider, C. & Hart, S. A. Open science in education sciences. J. Learn. Disabil. 54, 139–152 (2021).

doi: 10.1177/0022219420945267 pubmed: 32734821

Guinney, J. & Saez-Rodriguez, J. Alternative models for sharing confidential biomedical data. Nat. Biotechnol. 36, 391–392 (2018).

doi: 10.1038/nbt.4128 pubmed: 29734317

Göller, A. H. et al. Bayer’s in silico ADMET platform: a journey of machine learning over the past two decades. Drug Discov. Today 25, 1702–1709 (2020).

doi: 10.1016/j.drudis.2020.07.001 pubmed: 32652309

Montanari, F., Kuhnke, L., Ter Laak, A. & Clevert, D.-A. Modeling physico-chemical ADMET endpoints with multitask graph convolutional networks. Mol. Basel Switz. 25, 44 (2019).

Zankov, D. V. et al. QSAR Modeling based on conformation ensembles using a multi-instance learning approach. J. Chem. Inf. Model. 61, 4913–4923 (2021).

doi: 10.1021/acs.jcim.1c00692 pubmed: 34554736

Winter, R., Montanari, F., Noé, F. & Clevert, D.-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).

doi: 10.1039/C8SC04175J pubmed: 30842833

Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

doi: 10.1021/ci100050t pubmed: 20426451

Le, T., Noe, F. & Clevert, D.-A. Representation learning on biomolecular structures using equivariant graph attention. In Proceedings of the First Learning on Graphs Conference 30:1–30:17 (PMLR, 2022).

David, L., Thakkar, A., Mercado, R. & Engkvist, O. Molecular representations in AI-driven drug discovery: a review and practical guide. J. Cheminf. 12, 56 (2020).

doi: 10.1186/s13321-020-00460-5

Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and Harnessing Adversarial Examples. Preprint at https://doi.org/10.48550/ARXIV.1412.6572 . (2014)

Mervin, L. H. et al. Probabilistic random forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty. J. Cheminf.13, 62 (2021).

doi: 10.1186/s13321-021-00539-7

Begoli, E., Bhattacharya, T. & Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. Nat. Mach. Intell. 1, 20–23 (2019).

doi: 10.1038/s42256-018-0004-1

Bishop, C. M. Mixture density networks. Mix. Density Netw. 1–25 (1994).

Gal, Y. & Ghahramani, Z. Dropout as a bayesian approximation: representing model uncertainty in deep learning.

Seung, H. S., Opper, M. & Sompolinsky, H. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory 287–294 (ACM, Pittsburgh Pennsylvania USA, 1992).

Guha, R. & Velegol, D. Harnessing Shannon entropy-based descriptors in machine learning models to enhance the prediction accuracy of molecular properties. J. Cheminf. 15, 54 (2023).

doi: 10.1186/s13321-023-00712-0

Gregori-Puigjané, E. & Mestres, J. SHED: Shannon entropy descriptors from topological feature distributions. J. Chem. Inf. Model. 46, 1615–1622 (2006).

doi: 10.1021/ci0600509 pubmed: 16859293