A data science roadmap for open science organizations engaged in early-stage drug discovery.


Journal

Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
05 Jul 2024
History:
received: 13 Dec 2023
accepted: 12 Jun 2024
medline: 05 Jul 2024
pubmed: 05 Jul 2024
entrez: 04 Jul 2024
Status: epublish

Abstract

The Structural Genomics Consortium is an international open science research organization focused on accelerating early-stage drug discovery, namely hit discovery and optimization. We, like many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how best to benefit from recent advances in AI and how to generate, format, and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset in which data scientists and experimentalists work as a unified team and data science is incorporated into experimental design.
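Two of the modeling considerations named in the abstract, defining training and test sets and estimating prediction uncertainty, can be illustrated with a minimal sketch. It is not taken from the article. It assumes each record already carries a hypothetical `scaffold` label, a single numeric descriptor `x`, and a measured value `y`; the toy data, the function names, and the straight-line model are all illustrative stand-ins for a real dataset and a real machine-learning model. The split keeps every scaffold group on one side only, and the uncertainty estimate is the spread of predictions across a bootstrap ensemble.

```python
import random
import statistics
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both training and test sets."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])
    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


def fit_line(points):
    """Ordinary least-squares fit y = a*x + b for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: fall back to a constant model
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    return a, my - a * mx


def ensemble_predict(train_points, x_new, n_models=50, seed=0):
    """Return the mean and standard deviation of bootstrap-ensemble predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Resample the training set with replacement and refit the model.
        sample = [rng.choice(train_points) for _ in train_points]
        a, b = fit_line(sample)
        preds.append(a * x_new + b)
    return statistics.fmean(preds), statistics.stdev(preds)


# Hypothetical toy data: 15 compounds in 5 scaffold groups, noisy linear response.
records = [
    {"scaffold": f"S{i // 3}", "x": float(i),
     "y": 2.0 * i + 1.0 + (0.5 if i % 2 else -0.5)}
    for i in range(15)
]

train, test = group_split(records, "scaffold")
train_points = [(r["x"], r["y"]) for r in train]

# Compare a query inside the training range with one far outside it.
mean_near, spread_near = ensemble_predict(train_points, 7.0)
mean_far, spread_far = ensemble_predict(train_points, 100.0)
print(f"x=7:   {mean_near:.2f} +/- {spread_near:.2f}")
print(f"x=100: {mean_far:.2f} +/- {spread_far:.2f}")
```

In this sketch the ensemble spread grows for the query far from the training data, which is the behavior one would want from an uncertainty estimate before acting on a prediction. A random split would instead let closely related compounds land on both sides and tend to overstate model performance.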

Identifiers

pubmed: 38965235
doi: 10.1038/s41467-024-49777-x
pii: 10.1038/s41467-024-49777-x

Publication types

Journal Article; Review

Languages

eng

Citation subsets

IM

Pagination

5640

Grants

Agency: Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada (NSERC Canadian Network for Research and Innovation in Machining Technology)
Grant ID: RGPIN-2019-04416

Copyright information

© 2024. The Author(s).

Authors

Kristina Edfeldt (K)

Structural Genomics Consortium, Department of Medicine, Karolinska University Hospital and Karolinska Institutet, Stockholm, Sweden.

Aled M Edwards (AM)

Structural Genomics Consortium, University of Toronto, Toronto, ON, Canada.

Ola Engkvist (O)

Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden & Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden.

Judith Günther (J)

Bayer AG Research and Development, Computational Molecular Design, Berlin, Germany.

Matthew Hartley (M)

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK.

David G Hulcoop (DG)

Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK.
European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK.

Andrew R Leach (AR)

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK.

Brian D Marsden (BD)

Centre for Medicines Discovery, NDM, University of Oxford, Oxford, UK.

Amelie Menge (A)

Institute of Pharmaceutical Chemistry, Johann Wolfgang Goethe University, Frankfurt am Main, 60438, Germany & Structural Genomics Consortium (SGC), Buchmann Institute for Life Sciences, Johann Wolfgang Goethe University, Frankfurt am Main, Germany.

Leonie Misquitta (L)

National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Susanne Müller (S)

Institute of Pharmaceutical Chemistry, Johann Wolfgang Goethe University, Frankfurt am Main, 60438, Germany & Structural Genomics Consortium (SGC), Buchmann Institute for Life Sciences, Johann Wolfgang Goethe University, Frankfurt am Main, Germany.

Dafydd R Owen (DR)

Pfizer Worldwide Research, Development & Medical, Cambridge, MA, USA.

Kristof T Schütt (KT)

Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences, Berlin, Germany.

Nicholas Skelton (N)

Department of Discovery Chemistry, Genentech, Inc., South San Francisco, CA, USA.

Andreas Steffen (A)

Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences, Berlin, Germany.

Alexander Tropsha (A)

Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, North Carolina, USA.

Erik Vernet (E)

Digital Science & Innovation, Novo Nordisk A/S, Maaloev, Denmark.

Yanli Wang (Y)

National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

James Wellnitz (J)

Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, North Carolina, USA.

Timothy M Willson (TM)

Structural Genomics Consortium, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.

Djork-Arné Clevert (DA)

Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences, Berlin, Germany. Djork-Arne.Clevert@pfizer.com.

Benjamin Haibe-Kains (B)

Structural Genomics Consortium, University of Toronto, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.
Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.
Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.
Vector Institute for Artificial Intelligence, Toronto, ON, Canada. benjamin.haibe.kains@utoronto.ca.

Lovisa Holmberg Schiavone (LH)

Discovery Biology, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden. Lovisa.Holmberg.Schiavone@astrazeneca.com.

Matthieu Schapira (M)

Structural Genomics Consortium, University of Toronto, Toronto, ON, Canada. matthieu.schapira@utoronto.ca.
Department of Pharmacology & Toxicology, University of Toronto, Toronto, ON, Canada. matthieu.schapira@utoronto.ca.
