Hybrid Ensemble-Rule Algorithm for Improved MEDLINE® Sentence Boundary Detection.


Journal

AMIA ... Annual Symposium proceedings. AMIA Symposium
ISSN: 1942-597X
Abbreviated title: AMIA Annu Symp Proc
Country: United States
NLM ID: 101209213

Publication information

Publication date:
2021
History:
entrez: 2022-03-21
pubmed: 2022-03-22
medline: 2022-04-12
Status: epublish

Abstract

Sentence boundary detection (SBD) is a fundamental building block of the Natural Language Processing (NLP) pipeline. Incorrect SBD may propagate errors to subsequent processing stages and degrade their performance. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient to detect sentence boundaries. However, a corpus such as MEDLINE citations presents challenges for SBD because of several syntactic ambiguities, e.g., periods in abbreviations and capital letters in the first words of sentences. In this manuscript we present an algorithm that addresses these challenges through majority voting among three SBD engines (Python NLTK, pySBD, and Syntok), followed by custom post-processing algorithms that rely on spaCy part-of-speech tagging, abbreviation and capital-letter detection, and general sentence statistics. Experiments on several thousand MEDLINE citations show that the proposed combination of multiple SBD engines and post-processing rules performs better than each individual engine.
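
The majority-voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three engines below are hypothetical rule-based stand-ins (in practice each would wrap a real segmenter such as NLTK, pySBD, or Syntok), and the names `boundary_offsets`, `majority_vote_split`, `min_votes`, and `ABBREVIATIONS` are assumptions made for illustration. Each engine's output is mapped to character offsets where sentences end, and an offset is kept as a boundary only if at least two engines agree on it.

```python
import re
from collections import Counter
from typing import Callable, List, Set

Segmenter = Callable[[str], List[str]]

# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = ("Dr.", "vs.", "et al.", "e.g.", "i.e.")


def naive_period_engine(text: str) -> List[str]:
    """Stand-in engine 1: split after every ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


def capital_aware_engine(text: str) -> List[str]:
    """Stand-in engine 2: split only when the next sentence starts uppercase."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def abbreviation_aware_engine(text: str) -> List[str]:
    """Stand-in engine 3: naive split, then re-join after known abbreviations."""
    merged: List[str] = []
    for chunk in naive_period_engine(text):
        if merged and merged[-1].endswith(ABBREVIATIONS):
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


def boundary_offsets(text: str, sentences: List[str]) -> Set[int]:
    """Map an engine's sentences to the character offsets where they end."""
    offsets: Set[int] = set()
    cursor = 0
    for sent in sentences:
        sent = sent.strip()
        if not sent:
            continue
        start = text.find(sent, cursor)
        if start == -1:  # the engine altered the text; skip this sentence
            continue
        cursor = start + len(sent)
        offsets.add(cursor)
    return offsets


def majority_vote_split(text: str, engines: List[Segmenter],
                        min_votes: int = 2) -> List[str]:
    """Keep a boundary only if at least `min_votes` engines propose it."""
    votes: Counter = Counter()
    for engine in engines:
        votes.update(boundary_offsets(text, engine(text)))
    kept = sorted(off for off, n in votes.items() if n >= min_votes)
    end_of_text = len(text.rstrip())
    if not kept or kept[-1] != end_of_text:
        kept.append(end_of_text)  # always close the final sentence
    sentences, start = [], 0
    for end in kept:
        chunk = text[start:end].strip()
        if chunk:
            sentences.append(chunk)
        start = end
    return sentences


example = "The dose was 5 mg vs. placebo. Results were significant."
engines = [naive_period_engine, capital_aware_engine, abbreviation_aware_engine]
print(majority_vote_split(example, engines))
# ['The dose was 5 mg vs. placebo.', 'Results were significant.']
```

In this example only the naive engine splits after "vs.", so that boundary is outvoted. Voting alone does not resolve every case: two of these stand-in engines would still split after "Dr." before a capitalized name, which is the kind of residual error that post-processing rules such as those mentioned in the abstract are meant to address.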

Identifiers

pubmed: 35308957
pii: 3576693
pmc: PMC8861722

Publication types

Journal Article
Research Support, N.I.H., Intramural

Languages

eng

Citation subsets

IM

Pagination

677-686

Copyright information

©2021 AMIA - All rights reserved.

Authors

Daniel X Le (DX)

Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894.

James G Mork (JG)

Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894.

Sameer Antani (S)

Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894.
