Improving the CONTES method for normalizing biomedical text entities with concepts from an ontology with (almost) no training data.
biomedical text mining
entity normalization
ontology
word embedding
Journal
Genomics & informatics
ISSN: 1598-866X
Abbreviated title: Genomics Inform
Country: Korea (South)
NLM ID: 101223836
Publication information
Publication date: Jun 2019
History:
received: 2019-03-14
accepted: 2019-05-31
entrez: 2019-07-16
pubmed: 2019-07-16
medline: 2019-07-16
Status: ppublish
Abstract
Entity normalization, known as entity linking in the general domain, is an information extraction task that aims to link words or multi-word expressions in raw text to semantic references, such as the concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy of terms that captures the knowledge of a domain. Presently, machine-learning methods, often coupled with distributional representations, achieve good performance. However, they require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has several limitations: it does not scale well to very large ontologies, it tends to overgeneralize its predictions, and it lacks valid representations for out-of-vocabulary words. Here, we propose to assess different methods for reducing the dimensionality of the ontology representation. We also propose to calibrate parameters to make the predictions more accurate, and to address the problem of out-of-vocabulary words with a specific method.
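The abstract does not spell out the mechanics, so the following is only a minimal sketch of the general idea of embedding-based normalization with a small training set. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy ontology, the 3-D embedding values, the ancestor-based concept vectors, the linear map fitted by plain gradient descent, and the simple fallback of skipping out-of-vocabulary tokens. All function names are hypothetical.

```python
import math

# Toy ontology (hypothetical): each concept maps to its direct parents.
PARENTS = {
    "habitat": [],
    "soil": ["habitat"],
    "water": ["habitat"],
    "seawater": ["water"],
}
CONCEPTS = sorted(PARENTS)

# Toy word embeddings (hypothetical 3-D values, not real vectors).
EMBEDDINGS = {
    "soil": [1.0, 0.0, 0.0],
    "ground": [0.9, 0.1, 0.0],
    "water": [0.0, 1.0, 0.0],
    "sea": [0.0, 0.8, 0.6],
    "marine": [0.0, 0.7, 0.7],
}


def ancestors(concept):
    """Return the concept together with all of its ancestors."""
    seen = {concept}
    stack = list(PARENTS[concept])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def concept_vector(concept):
    """Binary vector with a 1 for the concept and each of its ancestors."""
    closure = ancestors(concept)
    return [1.0 if c in closure else 0.0 for c in CONCEPTS]


def term_vector(term):
    """Average the embeddings of known tokens; None if every token is OOV."""
    known = [EMBEDDINGS[t] for t in term.lower().split() if t in EMBEDDINGS]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def project(weights, x):
    """Apply the linear map (one weight row per concept dimension)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def fit(pairs, epochs=2000, lr=0.5):
    """Fit a linear map from term space to concept space (assumed setup)."""
    dim = len(next(iter(EMBEDDINGS.values())))
    weights = [[0.0] * dim for _ in CONCEPTS]
    data = [(term_vector(t), concept_vector(c)) for t, c in pairs]
    for _ in range(epochs):
        for x, y in data:
            pred = project(weights, x)
            for i, row in enumerate(weights):
                err = pred[i] - y[i]
                for j in range(dim):
                    row[j] -= lr * err * x[j]
    return weights


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def normalize(term, weights):
    """Return the concept whose vector is closest to the projected term."""
    x = term_vector(term)
    if x is None:  # no usable representation for this term
        return None
    projected = project(weights, x)
    return max(CONCEPTS, key=lambda c: cosine(projected, concept_vector(c)))


# Three labeled examples stand in for a small training dataset.
TRAINING = [("soil", "soil"), ("water", "water"), ("sea water", "seawater")]
WEIGHTS = fit(TRAINING)
print(normalize("marine water", WEIGHTS))
print(normalize("unknownword ground", WEIGHTS))
```

In this sketch a term made up entirely of unknown words gets no prediction at all, which illustrates why out-of-vocabulary handling matters. The concept vectors also have one dimension per concept, which hints at why such a representation could become costly for very large ontologies.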
Identifiers
pubmed: 31307135
pii: GI.2019.17.2.e20
doi: 10.5808/GI.2019.17.2.e20
pmc: PMC6808633
Publication types
Journal Article
Languages
eng
Pagination
e20