Hybrid Ensemble-Rule Algorithm for Improved MEDLINE® Sentence Boundary Detection.
Journal
AMIA ... Annual Symposium proceedings. AMIA Symposium
ISSN: 1942-597X
Abbreviated title: AMIA Annu Symp Proc
Country: United States
NLM ID: 101209213
Publication information
Publication date:
2021
History:
entrez: 21 March 2022
pubmed: 22 March 2022
medline: 12 April 2022
Status:
epublish
Abstract
Sentence boundary detection (SBD) is a fundamental building block of the natural language processing (NLP) pipeline. Incorrect SBD can degrade the performance of subsequent processing stages. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient to detect sentence boundaries. However, a corpus such as MEDLINE citations presents challenges for SBD because of several syntactic ambiguities, e.g., periods in abbreviations and capital letters in the first words of sentences. In this manuscript we present an algorithm that addresses these challenges through majority voting among three SBD engines (Python NLTK, pySBD, and Syntok), followed by custom post-processing algorithms that rely on spaCy part-of-speech information, abbreviation and capital-letter detection, and general sentence statistics. Experiments on several thousand MEDLINE citations show that the proposed combination of multiple SBD engines and post-processing rules performs better than each individual engine.
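The majority-voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three splitter functions below are hypothetical regex-based stand-ins for NLTK, pySBD, and Syntok, the abbreviation list and the `min_votes` parameter are assumptions, and the post-processing rules are omitted. Each engine proposes sentence-end character offsets, and an offset is kept only if enough engines agree on it.

```python
import re
from collections import Counter
from typing import Callable, List, Set

Splitter = Callable[[str], List[str]]

# Hypothetical abbreviation list for the toy engine below (not from the paper).
ABBREVIATIONS = {"vs.", "approx.", "dr.", "e.g.", "i.e.", "al."}


def split_on_punctuation(text: str) -> List[str]:
    """Stand-in engine 1: split after every '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_before_capital(text: str) -> List[str]:
    """Stand-in engine 2: split only when the next token starts with a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def split_with_abbreviations(text: str) -> List[str]:
    """Stand-in engine 3: like engine 2, but re-join pieces that end in a known abbreviation."""
    merged: List[str] = []
    for piece in split_before_capital(text):
        if merged and merged[-1].split()[-1].lower() in ABBREVIATIONS:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


def boundaries(text: str, sentences: List[str]) -> Set[int]:
    """Map an engine's sentences to the character offsets where each sentence ends."""
    offsets: Set[int] = set()
    cursor = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        start = text.find(sentence, cursor)
        if start == -1:
            continue  # the engine altered the text (e.g. whitespace); ignore this sentence
        cursor = start + len(sentence)
        offsets.add(cursor)
    return offsets


def majority_vote_split(text: str, engines: List[Splitter], min_votes: int = 2) -> List[str]:
    """Keep a boundary only if at least `min_votes` engines propose it."""
    votes: Counter = Counter()
    for engine in engines:
        votes.update(boundaries(text, engine(text)))
    # The end of the text is always treated as a boundary.
    kept = sorted({off for off, n in votes.items() if n >= min_votes} | {len(text)})
    sentences, start = [], 0
    for end in kept:
        chunk = text[start:end].strip()
        if chunk:
            sentences.append(chunk)
        start = end
    return sentences


if __name__ == "__main__":
    sample = ("The dose was 5 mg vs. placebo. No adverse events were seen "
              "in approx. 20 patients. Results were significant.")
    engines = [split_on_punctuation, split_before_capital, split_with_abbreviations]
    for sentence in majority_vote_split(sample, engines):
        print(sentence)
```

Voting on character offsets rather than on sentence strings lets engines that disagree in one place still reinforce each other elsewhere. In practice the stand-ins would be replaced by thin wrappers around the real engines, each returning a list of sentence strings.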
Publication types
Journal Article
Research Support, N.I.H., Intramural
Languages
eng
Citation subsets
IM
Pagination
677-686
Copyright information
©2021 AMIA - All rights reserved.