MISATO: machine learning dataset of protein-ligand complexes for structure-based drug discovery.


Journal

Nature computational science
ISSN: 2662-8457
Abbreviated title: Nat Comput Sci
Country: United States
NLM ID: 101775476

Publication information

Publication date:
10 May 2024
History:
received: 30 05 2023
accepted: 11 04 2024
medline: 11 5 2024
pubmed: 11 5 2024
entrez: 10 5 2024
Status: aheadofprint

Abstract

Large language models have greatly enhanced our ability to understand biology and chemistry, yet robust methods for structure-based drug discovery, quantum chemistry and structural biology are still sparse. Precise biomolecule-ligand interaction datasets are urgently needed for large language models. To address this, we present MISATO, a dataset that combines quantum mechanical properties of small molecules and associated molecular dynamics simulations of ~20,000 experimental protein-ligand complexes with extensive validation of experimental data. Starting from the existing experimental structures, semi-empirical quantum mechanics was used to systematically refine these structures. A large collection of molecular dynamics traces of protein-ligand complexes in explicit water is included, accumulating over 170 μs. We give examples of machine learning (ML) baseline models proving an improvement of accuracy by employing our data. An easy entry point for ML experts is provided to enable the next generation of drug discovery artificial intelligence models.

Identifiers

pubmed: 38730184
doi: 10.1038/s43588-024-00627-2
pii: 10.1038/s43588-024-00627-2

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: Bundesministerium für Bildung und Forschung (Federal Ministry of Education and Research)
ID: SUPREME, 031L0268

Copyright information

© 2024. The Author(s).

Authors

Till Siebenmorgen (T)

Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
TUM School of Natural Sciences, Department of Bioscience, Bayerisches NMR Zentrum, Technical University of Munich, Garching, Germany.

Filipe Menezes (F)

Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
TUM School of Natural Sciences, Department of Bioscience, Bayerisches NMR Zentrum, Technical University of Munich, Garching, Germany.

Sabrina Benassou (S)

Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany.

Erinc Merdivan (E)

Helmholtz AI, Helmholtz Munich, Neuherberg, Germany.

Kieran Didi (K)

Computer Laboratory, Cambridge University, Cambridge, UK.

André Santos Dias Mourão (ASD)

Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
TUM School of Natural Sciences, Department of Bioscience, Bayerisches NMR Zentrum, Technical University of Munich, Garching, Germany.

Radosław Kitel (R)

Faculty of Chemistry, Jagiellonian University, Krakow, Poland.

Pietro Liò (P)

Computer Laboratory, Cambridge University, Cambridge, UK.

Stefan Kesselheim (S)

Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany.

Marie Piraud (M)

Helmholtz AI, Helmholtz Munich, Neuherberg, Germany.

Fabian J Theis (FJ)

Helmholtz AI, Helmholtz Munich, Neuherberg, Germany.
Computational Health Center, Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany.
TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.

Michael Sattler (M)

Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
TUM School of Natural Sciences, Department of Bioscience, Bayerisches NMR Zentrum, Technical University of Munich, Garching, Germany.

Grzegorz M Popowicz (GM)

Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany. grzegorz.popowicz@helmholtz-munich.de.
TUM School of Natural Sciences, Department of Bioscience, Bayerisches NMR Zentrum, Technical University of Munich, Garching, Germany. grzegorz.popowicz@helmholtz-munich.de.

MeSH classifications