A workflow for deriving chemical entities from crystallographic data and its application to the Crystallography Open Database.

Chemical structure assignment; Crystallography Open Database; Molecular perception; PubChem

Journal

Journal of Cheminformatics
ISSN: 1758-2946
Abbreviated title: J Cheminform
Country: England
NLM ID: 101516718

Publication information

Publication date:
19 Dec 2023
History:
received: 28 Aug 2023
accepted: 09 Nov 2023
medline: 20 Dec 2023
pubmed: 20 Dec 2023
entrez: 20 Dec 2023
Status: epublish

Abstract

Knowledge about the 3-dimensional structure, orientation and interaction of chemical compounds is important in many areas of science and technology. X-ray crystallography is one of the experimental techniques capable of providing a large amount of structural information for a given compound, and it is widely used for the characterisation of organic and metal-organic molecules. The method provides precise 3D coordinates of atoms inside crystals; however, it does not directly deliver information about certain chemical characteristics such as bond orders, delocalization, charges, lone electron pairs or lone electrons. These aspects of a molecular model have to be derived from crystallographic data using refined information about interatomic distances and atom types, as well as general chemical knowledge. This publication describes a curated automatic pipeline for the derivation of chemical attributes of molecules from crystallographic models. The method is applied to build a catalogue of chemical entities in an open-access crystallographic database, the Crystallography Open Database (COD). The catalogue of such chemical entities is provided openly as a derived database. The content of this catalogue and the problems arising in the fully automated pipeline are discussed, along with the possibilities of introducing manual data curation into the process.

Identifiers

pubmed: 38115123
doi: 10.1186/s13321-023-00780-2
pii: 10.1186/s13321-023-00780-2

Publication types

Journal Article

Languages

eng

Pagination

123

Grants

Agency: Research Council of Lithuania
ID: MIP-23-87

Copyright information

© 2023. This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply.

Authors

Antanas Vaitkus (A)

Section of Crystallography and Chemical Informatics, Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio al. 7, Vilnius, LT-10257, Lithuania. antanas.vaitkus@bti.vu.lt.

Andrius Merkys (A)

Section of Crystallography and Chemical Informatics, Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio al. 7, Vilnius, LT-10257, Lithuania.

Thomas Sander (T)

Scientific Computing Drug Discovery, Idorsia Pharmaceuticals Ltd., Hegenheimermattweg 89, Allschwil, 4123, Switzerland.

Miguel Quirós (M)

Departamento de Química Inorgánica, Universidad de Granada, Granada, 18071, Spain.

Paul A Thiessen (PA)

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Evan E Bolton (EE)

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. bolton@ncbi.nlm.nih.gov.

Saulius Gražulis (S)

Section of Crystallography and Chemical Informatics, Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio al. 7, Vilnius, LT-10257, Lithuania.
Faculty of Mathematics and Informatics, Vilnius University, Naugarduko g. 24, Vilnius, LT-03225, Lithuania.

MeSH classifications