TermInformer: unsupervised term mining and analysis in biomedical literature.

Biomedical literature; GloVe; Sequence labelling; Term embeddings; Term mining; Unsupervised learning

Journal

Neural computing & applications
ISSN: 0941-0643
Abbreviated title: Neural Comput Appl
Country: England
NLM ID: 9313239

Publication information

Publication date:
16 Sep 2020
History:
received: 17 06 2020
accepted: 02 09 2020
entrez: 22 9 2020
pubmed: 23 9 2020
medline: 23 9 2020
Status: aheadofprint

Abstract

Terminology is the most basic information that researchers and literature analysis systems need to understand. Mining terms and revealing the semantic relationships between them can help biomedical researchers find solutions to major health problems and motivate them to explore new biomedical research questions. However, mining terms from biomedical literature remains a challenge. Research on text segmentation in natural language processing (NLP) has not yet been applied well in the biomedical field. Named entity recognition models usually require a large training corpus, and the types of entities they can recognize are limited. Dictionary-based methods match the text against pre-established vocabularies, but they can only match terms in a specific field, and collecting the terms is time-consuming and labour-intensive. Many scenarios in biomedical research are unsupervised: the corpora are unlabelled, and the system may have little prior knowledge. This paper proposes the TermInformer project, which aims to mine the meaning of terms in an open fashion by computing over terms, and to find solutions to some of the significant problems in our society. We propose an unsupervised method that automatically mines terms from text without relying on external resources. The method can generally be applied to any document data. Combined with a word vector training algorithm, it yields reusable term embeddings that can be used in any downstream NLP application. This paper compares the term embeddings with existing word embeddings. The results show that our method better reflects the semantic relationships between terms. Finally, we use the proposed method to find potential factors and treatments for lung cancer, breast cancer, and coronavirus.
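
The abstract does not describe the mining algorithm itself, so the following is only a rough sketch of the general idea of unsupervised term mining followed by embedding training. It swaps in a common heuristic, pointwise mutual information (PMI) over adjacent word pairs, which is an assumption and not the TermInformer method. The function names `mine_bigram_terms` and `merge_terms`, the thresholds, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def mine_bigram_terms(sentences, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs that are frequent and have high PMI.

    This is a generic collocation heuristic, used here as a stand-in
    for an unsupervised term miner; it needs no labels or dictionary.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    terms = {}
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_left = unigrams[left] / total_unigrams
        p_right = unigrams[right] / total_unigrams
        pmi = math.log2(p_pair / (p_left * p_right))
        if pmi >= min_pmi:
            terms[(left, right)] = pmi
    return terms


def merge_terms(tokens, terms, joiner="_"):
    """Greedily rewrite each mined pair as a single token, left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in terms:
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy, made-up corpus (not data from the paper).
corpus = [
    "lung cancer is linked to smoking".split(),
    "smoking raises the risk of lung cancer".split(),
    "breast cancer screening reduces mortality".split(),
    "the lung is damaged by smoking".split(),
]

terms = mine_bigram_terms(corpus)
merged_corpus = [merge_terms(tokens, terms) for tokens in corpus]
```

Under these assumptions, the merged corpus could then be passed to a word vector trainer such as GloVe, so that a mined term like `lung_cancer` receives one vector of its own instead of being represented only by the separate vectors for its component words.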

Identifiers

pubmed: 32958982
doi: 10.1007/s00521-020-05335-2
pii: 5335
pmc: PMC7494250

Publication types

Journal Article

Languages

eng

Pagination

1-14

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2020.

Conflict of interest statement

Conflict of interest: The authors declare that they have no conflict of interest.

Authors

Prayag Tiwari (P)

Department of Information Engineering, University of Padova, Padua, Italy.

Sagar Uprety (S)

The Open University, London, UK.

Shahram Dehdashti (S)

School of Information Systems, Science and Engineering Faculty, Queensland University of Technology, Brisbane, Australia.

M Shamim Hossain (MS)

Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, 11543 Saudi Arabia.

MeSH classifications