Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII.


Journal

Database: the journal of biological databases and curation
ISSN: 1758-0463
Abbreviated title: Database (Oxford)
Country: England
NLM ID: 101517697

Publication information

Publication date:
07 03 2023
History:
received: 19 08 2022
revised: 06 01 2023
accepted: 15 02 2023
entrez: 7 3 2023
pubmed: 8 3 2023
medline: 10 3 2023
Status: ppublish

Abstract

The BioCreative National Library of Medicine (NLM)-Chem track calls for a community effort to fine-tune automated recognition of chemical names in the biomedical literature. Chemicals are one of the most searched biomedical entities in PubMed, and, as highlighted during the coronavirus disease 2019 pandemic, their identification may significantly advance research in multiple biomedical subfields. While previous community challenges focused on identifying chemical names mentioned in titles and abstracts, the full text contains valuable additional detail. We, therefore, organized the BioCreative NLM-Chem track as a community effort to address automated chemical entity recognition in full-text articles. The track consisted of two tasks: (i) chemical identification and (ii) chemical indexing. The chemical identification task required predicting all chemicals mentioned in recently published full-text articles, both span [i.e. named entity recognition (NER)] and normalization (i.e. entity linking), using Medical Subject Headings (MeSH). The chemical indexing task required identifying which chemicals reflect topics for each article and should therefore appear in the listing of MeSH terms for the document in the MEDLINE article indexing. This manuscript summarizes the BioCreative NLM-Chem track and post-challenge experiments. We received a total of 85 submissions from 17 teams worldwide. The highest performance achieved for the chemical identification task was 0.8672 F-score (0.8759 precision and 0.8587 recall) for strict NER performance and 0.8136 F-score (0.8621 precision and 0.7702 recall) for strict normalization performance. The highest performance achieved for the chemical indexing task was 0.6073 F-score (0.7417 precision and 0.5141 recall).
This community challenge demonstrated that (i) the current substantial achievements in deep learning technologies can be utilized to improve automated prediction accuracy further and (ii) the chemical indexing task is substantially more challenging. We look forward to further developing biomedical text-mining methods to respond to the rapid growth of biomedical literature. The NLM-Chem track dataset and other challenge materials are publicly available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/. Database URL https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/.
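
The F-scores reported in the abstract can be related to their precision and recall values with a short calculation. The sketch below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall); the abstract does not state the variant explicitly, and the function name `f_score` and the dictionary `reported` are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is F1; the abstract does not say so.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the abstract
reported = {
    "strict NER": (0.8759, 0.8587),
    "strict normalization": (0.8621, 0.7702),
    "chemical indexing": (0.7417, 0.5141),
}

for task, (p, r) in reported.items():
    print(f"{task}: F = {f_score(p, r):.4f}")
```

Under this assumption the computed values agree, to four decimal places, with the F-scores given in the abstract (0.8672, 0.8136 and 0.6073).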

Identifiers

pubmed: 36882099
pii: 7071696
doi: 10.1093/database/baad005
pmc: PMC9991492

Publication types

Journal Article; Research Support, N.I.H., Intramural; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

Published by Oxford University Press 2023. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Authors

Robert Leaman (R)

National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Rezarta Islamaj (R)

National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Virginia Adams (V)

NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA.

Mohammed A Alliheedi (MA)

Department of Computer Science, Al Baha University, 4781 King Fahd Rd, Al Aqiq 65779, Saudi Arabia.

João Rafael Almeida (JR)

Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal.
Department of Information and Communications Technologies, University of A Coruña, Camiño do Lagar de Castro, A Coruña 15008, Spain.

Rui Antunes (R)

Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal.

Robert Bevan (R)

Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK.

Yung-Chun Chang (YC)

Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Da'an District, Taipei City, Taipei 106, Taiwan.

Arslan Erdengasileng (A)

Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA.

Matthew Hodgskiss (M)

Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK.

Ryuki Ida (R)

Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan.

Hyunjae Kim (H)

Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea.

Keqiao Li (K)

Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA.

Robert E Mercer (RE)

Department of Computer Science, The University of Western Ontario, Room 355, Middlesex College, Ontario, London N6A 5B7, Canada.

Lukrécia Mertová (L)

Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany.

Ghadeer Mobasher (G)

Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany.
Institute of Computer Science, Heidelberg University, Im Neuenheimer Feld 205, Heidelberg 69120, Germany.

Hoo-Chang Shin (HC)

NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA.

Mujeen Sung (M)

Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea.

Tomoki Tsujimura (T)

Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan.

Wen-Chao Yeh (WC)

Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan.

Zhiyong Lu (Z)

National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
