Deep contextualized embeddings for quantifying the informative content in biomedical text summarization.

Algorithms; Data Mining / methods; Deep Learning; Humans; Medical Informatics; Natural Language Processing; Semantics; Unified Medical Language System

Biomedical text mining; Clustering; Contextualized embeddings; Deep learning; Domain knowledge; Text summarization

Journal

Computer methods and programs in biomedicine

ISSN: 1872-7565

Abbreviated title: Comput Methods Programs Biomed

Country: Ireland

NLM ID: 8506513

Publication information

Publication date:
Feb 2020

History:

received: 14 07 2019

revised: 19 09 2019

accepted: 03 10 2019

pubmed: 19 10 2019

medline: 7 1 2021

entrez: 19 10 2019

Status: ppublish

Abstract

Capturing the context of text is a challenging task in biomedical text summarization. The objective of this research is to show how contextualized embeddings produced by a deep bidirectional language model can be utilized to quantify the informative content of sentences in biomedical text summarization. We propose a novel summarization method that utilizes contextualized embeddings generated by the Bidirectional Encoder Representations from Transformers (BERT) model, a deep learning model that recently demonstrated state-of-the-art results in several natural language processing tasks. We combine different versions of BERT with a clustering method to identify the most relevant and informative sentences of input documents. Using the ROUGE toolkit, we evaluate the summarizer against several methods previously described in the literature. The summarizer obtains state-of-the-art results and significantly improves the performance of biomedical text summarization in comparison to a set of domain-specific and domain-independent methods. The largest language model not specifically pretrained on biomedical text outperformed the other models. However, among language models of the same size, the one further pretrained on biomedical text obtained the best results. We demonstrate that a hybrid system combining a deep bidirectional language model and a clustering method yields state-of-the-art results without requiring labor-intensive creation of annotated features or knowledge bases, or computationally demanding domain-specific pretraining. This study provides a starting point for investigating deep contextualized language models for biomedical text summarization.

Abstract sections

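BACKGROUND AND OBJECTIVE

Capturing the context of text is a challenging task in biomedical text summarization. The objective of this research is to show how contextualized embeddings produced by a deep bidirectional language model can be utilized to quantify the informative content of sentences in biomedical text summarization.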

METHODS

We propose a novel summarization method that utilizes contextualized embeddings generated by the Bidirectional Encoder Representations from Transformers (BERT) model, a deep learning model that recently demonstrated state-of-the-art results in several natural language processing tasks. We combine different versions of BERT with a clustering method to identify the most relevant and informative sentences of input documents. Using the ROUGE toolkit, we evaluate the summarizer against several methods previously described in the literature.
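As a rough illustration of how sentence embeddings and clustering can be combined for extractive summarization, the sketch below groups sentence vectors with a plain k-means and keeps the sentence nearest each cluster centre. It is not the authors' implementation: the abstract does not name the clustering algorithm or the selection rule, so both are assumptions here, as are the function names `kmeans` and `select_summary`. The embeddings are assumed to have been computed beforehand (for example by a BERT model); the toy 2-D vectors in the usage example are hypothetical stand-ins.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors: List[Vector], k: int,
           iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means (an assumed choice of clustering method).

    Centroids are seeded with k evenly spaced input vectors so that
    the result is deterministic.
    """
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


def select_summary(sentences: List[str], embeddings: List[Vector],
                   k: int) -> List[str]:
    """Keep, for each cluster, the sentence closest to the centroid.

    Selected sentences are returned in their original document order.
    """
    labels, centroids = kmeans(embeddings, k)
    chosen = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            chosen.append(min(
                members,
                key=lambda i: squared_distance(embeddings[i], centroids[c]),
            ))
    return [sentences[i] for i in sorted(chosen)]


# Usage with hypothetical 2-D vectors standing in for real embeddings.
toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_embeddings = [[0, 0], [1, 0], [2, 0], [10, 10], [11, 10], [12, 10]]
print(select_summary(toy_sentences, toy_embeddings, k=2))  # ['s1', 's4']
```

In this toy run the two groups of vectors are well separated, so each cluster contributes its middle sentence. With real embeddings the number of clusters would typically be tied to the desired summary length.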

RESULTS

The summarizer obtains state-of-the-art results and significantly improves the performance of biomedical text summarization in comparison to a set of domain-specific and domain-independent methods. The largest language model not specifically pretrained on biomedical text outperformed the other models. However, among language models of the same size, the one further pretrained on biomedical text obtained the best results.
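The evaluation relies on the ROUGE toolkit, which scores a generated summary by its n-gram overlap with a reference summary. The sketch below shows a simplified ROUGE-1 (unigram) computation to make that idea concrete. It is an illustration only, not the ROUGE toolkit itself: tokenization is a bare lowercase whitespace split, with no stemming or stop-word handling, and the function name `rouge_1` is hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge_1(candidate: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Usage: 3 of the 6 reference words are covered by the candidate.
scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores["recall"], scores["precision"])  # 0.5 1.0
```

Recall rewards covering the reference, precision penalizes padding, and the F1 value balances the two.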

CONCLUSIONS

We demonstrate that a hybrid system combining a deep bidirectional language model and a clustering method yields state-of-the-art results without requiring labor-intensive creation of annotated features or knowledge bases or computationally demanding domain-specific pretraining. This study provides a starting point towards investigating deep contextualized language models for biomedical text summarization.

Identifiers

DOI: 10.1016/j.cmpb.2019.105117

PMID: 31627150

pii: S0169-2607(19)31137-X

Publication types

Journal Article

Languages

eng

Pagination

105117

Conflict of interest statement

Declaration of Competing Interest: The authors have no conflicts of interest to declare.

Authors

Milad Moradi (M)

Georg Dorffner (G)

Matthias Samwald (M)
