Deep contextualized embeddings for quantifying the informative content in biomedical text summarization.

Biomedical text mining; Clustering; Contextualized embeddings; Deep learning, domain knowledge; Text summarization

Journal

Computer methods and programs in biomedicine
ISSN: 1872-7565
Abbreviated title: Comput Methods Programs Biomed
Country: Ireland
NLM ID: 8506513

Publication information

Publication date:
Feb 2020
History:
received: 14 07 2019
revised: 19 09 2019
accepted: 03 10 2019
pubmed: 19 10 2019
medline: 7 1 2021
entrez: 19 10 2019
Status: ppublish

Abstract

Capturing the context of text is a challenging task in biomedical text summarization. The objective of this research is to show how contextualized embeddings produced by a deep bidirectional language model can be used to quantify the informative content of sentences in biomedical text summarization. We propose a novel summarization method that uses contextualized embeddings generated by the Bidirectional Encoder Representations from Transformers (BERT) model, a deep learning model that recently demonstrated state-of-the-art results in several natural language processing tasks. We combine different versions of BERT with a clustering method to identify the most relevant and informative sentences of input documents. Using the ROUGE toolkit, we evaluate the summarizer against several methods previously described in the literature. The summarizer obtains state-of-the-art results and significantly improves the performance of biomedical text summarization in comparison to a set of domain-specific and domain-independent methods. The largest language model not specifically pretrained on biomedical text outperformed the other models. However, among language models of the same size, the one further pretrained on biomedical text obtained the best results. We demonstrate that a hybrid system combining a deep bidirectional language model and a clustering method yields state-of-the-art results without requiring labor-intensive creation of annotated features or knowledge bases, or computationally demanding domain-specific pretraining. This study provides a starting point for investigating deep contextualized language models for biomedical text summarization.
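
The abstract describes the method only at a high level: sentence embeddings from BERT are clustered, and the clusters are used to pick informative sentences. The sketch below illustrates that general idea under stated assumptions and is not the authors' implementation. It assumes the sentence embeddings have already been computed (for example by a BERT model) and are passed in as plain lists of floats; it uses a basic k-means as a stand-in for the unspecified clustering method; and it picks the sentence nearest each cluster centroid, which is one common selection heuristic rather than the rule reported in the paper. The names `kmeans` and `select_summary` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, max_iterations=50):
    """Basic k-means with deterministic farthest-point initialisation.

    Assumption: k-means stands in for the clustering method, which the
    abstract does not name. Returns (labels, centroids).
    """
    # Start from the first vector, then repeatedly add the vector that is
    # farthest from every centroid chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(euclidean(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, centroids


def select_summary(embeddings, k):
    """Return indices of one representative sentence per cluster.

    Assumption: the representative is the sentence closest to the
    cluster centroid. Indices are returned in document order.
    """
    labels, centroids = kmeans(embeddings, k)
    chosen = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            chosen.append(
                min(members, key=lambda i: euclidean(embeddings[i], centroids[j]))
            )
    return sorted(chosen)


# Toy 2-D "embeddings" for six sentences forming two obvious groups.
# Real BERT embeddings would have hundreds of dimensions.
toy_embeddings = [
    [0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.2, 5.0], [5.1, 5.1],
]
print(select_summary(toy_embeddings, k=2))
```

The number of clusters `k` effectively sets the summary length in this sketch. Returning the indices in document order keeps the selected sentences in their original sequence.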


Identifiers

pubmed: 31627150
pii: S0169-2607(19)31137-X
doi: 10.1016/j.cmpb.2019.105117

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

105117

Copyright information

Copyright © 2019. Published by Elsevier B.V.

Conflict of interest statement

The authors have no conflicts of interest to declare.

Authors

Milad Moradi (M)

Institute for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria. Electronic address: milad.moradivastegani@meduniwien.ac.at.

Georg Dorffner (G)

Institute for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria.

Matthias Samwald (M)

Institute for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria.
