A data science roadmap for open science organizations engaged in early-stage drug discovery.
Journal
Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date:
05 Jul 2024
History:
received: 13 Dec 2023
accepted: 12 Jun 2024
medline: 5 Jul 2024
pubmed: 5 Jul 2024
entrez: 4 Jul 2024
Status:
epublish
Abstract
The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, like many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how best to benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data-sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.
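One way to read the abstract's point about "defining the right training and test sets" is that closely related compounds should not sit on both sides of the split, since that tends to inflate apparent model performance. The following is a minimal sketch of a group-aware split, not the working group's method. The record layout, the `series` field, and the function name `group_split` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2):
    """Split records so that no group appears in both train and test.

    Whole groups are assigned, largest first, to the training set while
    they still fit under the training quota; every group that does not
    fit goes to the test set.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Largest groups first; ties broken by group name for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))

    n_test = round(test_fraction * len(records))
    train_quota = len(records) - n_test

    train, test = [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_quota:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical toy data: ten compounds in four chemical series.
compounds = (
    [{"id": f"a{i}", "series": "A"} for i in range(4)]
    + [{"id": f"b{i}", "series": "B"} for i in range(3)]
    + [{"id": f"c{i}", "series": "C"} for i in range(2)]
    + [{"id": "d0", "series": "D"}]
)

train, test = group_split(compounds, "series", test_fraction=0.3)
train_series = {rec["series"] for rec in train}
test_series = {rec["series"] for rec in test}
```

In this sketch the test set may end up larger or smaller than requested when group sizes do not divide evenly; the guarantee is only that the two sets share no group. In practice the grouping key would more plausibly be a molecular scaffold or a cluster label than a hand-assigned series name.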
Identifiers
pubmed: 38965235
doi: 10.1038/s41467-024-49777-x
pii: 10.1038/s41467-024-49777-x
Publication types
Journal Article
Review
Languages
eng
Citation subsets
IM
Pagination
5640
Grants
Agency: Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada (NSERC Canadian Network for Research and Innovation in Machining Technology)
ID: RGPIN-2019-04416
Copyright information
© 2024. The Author(s).