Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature.

Active learning; Deep learning; Drug-drug interaction; Information retrieval; Positive sampling; Random negative sampling; Similarity sampling; Uncertainty sampling

Journal

Journal of biomedical semantics
ISSN: 2041-1480
Abbreviated title: J Biomed Semantics
Country: England
NLM ID: 101531992

Publication information

Publication date:
30 05 2023
History:
received: 09 03 2022
accepted: 29 04 2023
medline: 31 5 2023
pubmed: 30 5 2023
entrez: 29 5 2023
Status: epublish

Abstract

Abstract sections

BACKGROUND
Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis of PubMed abstracts is challenging because positive DDI samples are relatively few among an overwhelmingly large number of negative samples. Random negative sampling and positive sampling are designed specifically to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper.
RESULTS
PubMed abstracts are divided into two pools: the screened pool contains all abstracts that pass the DDI keyword query in PubMed, while the unscreened pool includes all other abstracts. DDI IR precision is evaluated and compared at a prespecified recall of 0.95. In the screened pool, using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool, integrating random negative sampling, positive sampling, and similarity sampling improves precision over uncertainty sampling alone, from 0.72 to 0.81. When the SVM is replaced with a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened and unscreened pools. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool and 0.90 vs. 0.81 in the unscreened pool.
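The evaluation criterion above, precision at a fixed recall of 0.95, can be illustrated with a minimal sketch. The function name `precision_at_recall` and the toy scores are hypothetical and not taken from the paper; the sketch assumes each abstract has a classifier score and a binary relevance label, and it ignores ties between scores.

```python
import math


def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the shortest ranked prefix whose recall reaches target_recall.

    scores: classifier scores, higher means more likely DDI-relevant.
    labels: 1 for a DDI-relevant abstract, 0 otherwise.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank abstracts from most to least likely relevant.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Number of positives that must be retrieved to reach the target recall.
    needed = math.ceil(target_recall * n_pos)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos >= needed:
            return true_pos / k  # precision among the top-k retrieved abstracts
    raise RuntimeError("unreachable: all positives are in the ranking")


# Toy example (made-up numbers): three relevant abstracts among five.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.1]
toy_labels = [1, 1, 0, 1, 0]
print(precision_at_recall(toy_scores, toy_labels))
```

In the toy example all three positives must be retrieved to reach a recall of 0.95, which takes the top four abstracts, so the reported precision is 3/4.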
CONCLUSIONS
Integrating various sampling schemes and deep learning algorithms into AL significantly improves DDI IR analysis from the literature. Random negative sampling and positive sampling are highly effective for improving AL analysis when positive and negative samples are extremely imbalanced.
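Two of the sampling schemes named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how each scheme is computed, so uncertainty sampling is shown in its common form (query the items whose predicted probability is closest to 0.5), and random negative sampling is read as drawing random abstracts from the unscreened pool and presuming them negative because positives are rare. The function names are hypothetical.

```python
import random


def uncertainty_query(probs, batch_size):
    """Indices of the unlabeled abstracts whose predicted probability is closest to 0.5."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:batch_size]


def random_negative_sample(unscreened_ids, n, seed=0):
    """Draw n abstracts at random from the unscreened pool.

    Assumption: with extreme class imbalance, a random draw is almost always
    a negative, so these can serve as presumed negative training examples.
    """
    rng = random.Random(seed)
    return rng.sample(list(unscreened_ids), n)


# Toy example (made-up probabilities for five unlabeled abstracts).
probs = [0.02, 0.48, 0.97, 0.55, 0.30]
print(uncertainty_query(probs, batch_size=2))   # the two least certain abstracts
print(random_negative_sample(range(100), n=5))  # five presumed negatives
```

In an AL round, the queried abstracts would go to an annotator for labeling, while the presumed negatives would be added to the training set without annotation; whether that matches the paper's procedure should be checked against the full text.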

Identifiers

pubmed: 37248476
doi: 10.1186/s13326-023-00287-7
pii: 10.1186/s13326-023-00287-7
pmc: PMC10228061

Publication types

Journal Article; Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

5

Grants

Agency: NICHD NIH HHS
ID: P30 HD106451
Country: United States
Agency: NIDA NIH HHS
ID: R01 DA048001
Country: United States
Agency: NCI NIH HHS
ID: U01 CA248240
Country: United States
Agency: NLM NIH HHS
ID: R01 LM011945
Country: United States

Copyright information

© 2023. The Author(s).

Authors

Weixin Xie (W)

Department of Biomedical Informatics, Ohio State University, Columbus, OH, 43210, USA.

Kunjie Fan (K)

Department of Biomedical Informatics, Ohio State University, Columbus, OH, 43210, USA.

Shijun Zhang (S)

Department of Biomedical Informatics, Ohio State University, Columbus, OH, 43210, USA.

Lang Li (L)

Department of Biomedical Informatics, Ohio State University, Columbus, OH, 43210, USA. lang.li@osumc.edu.

MeSH classifications