Improving the CONTES method for normalizing biomedical text entities with concepts from an ontology with (almost) no training data.

Keywords: biomedical text mining; entity normalization; ontology; word embedding

Journal

Genomics & informatics
ISSN: 1598-866X
Abbreviated title: Genomics Inform
Country: Korea (South)
NLM ID: 101223836

Publication information

Publication date:
Jun 2019
History:
received: 2019-03-14
accepted: 2019-05-31
entrez: 2019-07-16
pubmed: 2019-07-16
medline: 2019-07-16
Status: ppublish

Abstract

Entity normalization, known as entity linking in the general domain, is an information extraction task that aims to annotate (bind) words or expressions in raw text with semantic references, such as concepts of an ontology. At a minimum, an ontology consists of a formally organized vocabulary or hierarchy of terms that captures the knowledge of a domain. Presently, machine-learning methods, often coupled with distributional representations, achieve good performance on this task. However, these methods require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has some limitations: it does not scale well to very large ontologies, it tends to overgeneralize its predictions, and it lacks valid representations for out-of-vocabulary words. Here, we propose to assess different methods for reducing the dimensionality of the ontology representation. We also propose to calibrate parameters to make the predictions more accurate, and to address the problem of out-of-vocabulary words with a specific method.

Identifiers

pubmed: 31307135
pii: GI.2019.17.2.e20
doi: 10.5808/GI.2019.17.2.e20
pmc: PMC6808633

Publication types

Journal Article

Languages

eng

Pagination

e20

Authors

Arnaud Ferré (A)

MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France.
LIMSI, CNRS, Paris-Saclay University, 91405 Orsay, France.

Mouhamadou Ba (M)

MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France.

Robert Bossy (R)

MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France.
