Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations.


Journal

Database : the journal of biological databases and curation
ISSN: 1758-0463
Abbreviated title: Database (Oxford)
Country: England
NLM ID: 101517697

Publication information

Publication date:
2023-11-28
History:
received: 2022-11-25
revised: 2023-09-22
accepted: 2023-10-30
medline: 2023-11-30
pubmed: 2023-11-28
entrez: 2023-11-28
Status: ppublish

Abstract

Efficiently exploiting the drug-related information described in the growing scientific literature is becoming increasingly challenging. For drug-gene/protein interactions, the challenge is even greater, given the scattered information sources and the variety of interaction types. Yet their systematic, large-scale exploitation is key to developing tools that impact knowledge fields as diverse as drug design and metabolic pathway research. Previous efforts to extract drug-gene/protein interactions from the literature did not address these scalability and granularity issues. To tackle them, we organized the DrugProt track at BioCreative VII. In the context of the track, we released the DrugProt Gold Standard corpus, a collection of 5000 PubMed abstracts manually annotated with granular drug-gene/protein interactions. We proposed a novel large-scale track to evaluate the capacity of natural language processing systems to scale to millions of documents and to generate, from their predictions, a silver standard knowledge graph of 53 993 602 nodes and 19 367 406 edges. Its use extends beyond the shared task and points toward pharmacological and biological applications such as drug discovery or continuous database curation. We also created a persistent evaluation scenario on CodaLab to continuously evaluate new relation extraction systems as they arise. Thirty teams from four continents, involving 110 people, sent 107 submission runs for the Main DrugProt track, and nine teams submitted 21 runs for the Large Scale DrugProt track. Most participants implemented deep learning approaches based on pretrained transformer-like language models (LMs) such as BERT or BioBERT, reaching precision and recall values as high as 0.9167 and 0.9542 for some relation types.
Finally, initial explorations of the applicability of the knowledge graph have shown its potential for exploring the chemical-protein relations described in the literature, or chemical compound-enzyme interactions. Database URL: https://doi.org/10.5281/zenodo.4955410.
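
To illustrate how per-relation-type precision and recall figures such as those quoted in the abstract can be obtained, the following minimal sketch compares predicted relation tuples against gold-standard ones. The tuple layout (document ID, relation type, chemical mention ID, gene/protein mention ID), the example labels and IDs, and the function name per_type_scores are assumptions made for illustration; this is not the official DrugProt evaluation script.

```python
from collections import defaultdict


def per_type_scores(gold, predicted):
    """Return {relation_type: (precision, recall, f1)}.

    Each relation is a tuple (doc_id, relation_type, chem_id, gene_id);
    this layout is an assumption, not the official DrugProt file format.
    """
    gold, predicted = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})

    # A prediction is a true positive only if the exact tuple is in gold.
    for rel in predicted:
        counts[rel[1]]["tp" if rel in gold else "fp"] += 1
    # Gold relations that were never predicted are false negatives.
    for rel in gold - predicted:
        counts[rel[1]]["fn"] += 1

    scores = {}
    for rel_type, c in counts.items():
        found = c["tp"] + c["fp"]
        expected = c["tp"] + c["fn"]
        precision = c["tp"] / found if found else 0.0
        recall = c["tp"] / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[rel_type] = (precision, recall, f1)
    return scores


# Hypothetical toy data (labels and IDs are illustrative only).
gold = {
    ("doc1", "INHIBITOR", "T1", "T5"),
    ("doc1", "SUBSTRATE", "T2", "T6"),
    ("doc2", "INHIBITOR", "T1", "T3"),
}
pred = {
    ("doc1", "INHIBITOR", "T1", "T5"),
    ("doc1", "SUBSTRATE", "T2", "T6"),
    ("doc2", "INHIBITOR", "T2", "T3"),  # wrong chemical argument
}
scores = per_type_scores(gold, pred)
for rel_type in sorted(scores):
    print(rel_type, scores[rel_type])
```

Exact tuple matching is the strictest reasonable criterion: a prediction with the right relation type but a wrong argument counts as both a false positive and a false negative, as in the second toy document.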

Identifiers

pubmed: 38015956
pii: 7453369
doi: 10.1093/database/baad080
pmc: PMC10683943

Chemical substances

Proteins 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2023. Published by Oxford University Press.

Authors

Antonio Miranda-Escalada (A)

Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.

Farrokh Mehryary (F)

TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland.

Jouni Luoma (J)

TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland.

Darryl Estrada-Zavala (D)

Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.

Luis Gasco (L)

Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.

Sampo Pyysalo (S)

TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland.

Alfonso Valencia (A)

Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.

Martin Krallinger (M)

Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain.
