Improving Information Extraction from Pathology Reports using Named Entity Recognition.


Journal

Research Square
Abbreviated title: Res Sq
Country: United States
NLM ID: 101768035

Publication information

Publication date:
03 Jul 2023
History:
pubmed: 18 Jul 2023
medline: 18 Jul 2023
entrez: 18 Jul 2023
Status: epublish

Abstract

Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two significant limitations. First, they typically frame their tasks as report classification, which restricts the granularity of extracted information. Second, they often fail to generalize to unseen reports due to variations in language, negation, and human error. To overcome these challenges, we propose a BERT (bidirectional encoder representations from transformers) named entity recognition (NER) system to extract key diagnostic elements from pathology reports. We also introduce four data augmentation methods to improve the robustness of our model. Trained and evaluated on 1438 annotated breast pathology reports, acquired from a large medical center in the United States, our BERT model trained with data augmentation achieves an entity F1-score of 0.916 on an internal test set, surpassing the BERT baseline (0.843). We further assessed the model's generalizability using an external validation dataset from the United Arab Emirates, where our model maintained satisfactory performance (F1-score 0.860). Our findings demonstrate that our NER systems can effectively extract fine-grained information from widely diverse medical reports, offering the potential for large-scale information extraction in a wide range of medical and AI research. We publish our code at https://github.com/nyukat/pathology_extraction.
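
The abstract reports an entity-level F1-score. As an illustration of what that metric measures, here is a minimal sketch in Python, assuming BIO-tagged token sequences and exact span matching. The entity labels (`DIAGNOSIS`, `LATERALITY`) and the function names are hypothetical, and this is not necessarily the scoring procedure used by the authors; their published repository is the reference for that.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    Assumption: an I- tag that does not continue a span of the same
    type is ignored rather than treated as the start of a new span.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((label, start, i))
            start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a prediction counts only if type and boundaries match."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for illustration only.
gold = ["B-DIAGNOSIS", "I-DIAGNOSIS", "O", "B-LATERALITY"]
pred = ["B-DIAGNOSIS", "I-DIAGNOSIS", "O", "O"]
score = entity_f1(gold, pred)
```

In this toy case the prediction recovers one of the two gold entities with no false positives, so precision is 1.0 and recall is 0.5. Scoring whole spans rather than individual tokens penalizes partially extracted entities, which is why entity-level F1 is stricter than token-level accuracy.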

Identifiers

pubmed: 37461545
doi: 10.21203/rs.3.rs-3035772/v1
pmc: PMC10350195

Publication types

Preprint

Languages

eng

Grants

Agency: NCATS NIH HHS
ID: UL1 TR001445
Country: United States

Conflict of interest statement

The authors declare no competing interests.

Authors

Ken G Zeng (KG)

New York University, New York, NY, USA.

Tarun Dutt (T)

New York University Grossman School of Medicine, New York, NY, USA.

Jan Witowski (J)

New York University Grossman School of Medicine, New York, NY, USA.

G V Kranthi Kiran (GV)

New York University, New York, NY, USA.

Frank Yeung (F)

New York University Grossman School of Medicine, New York, NY, USA.

Michelle Kim (M)

New York University Grossman School of Medicine, New York, NY, USA.

Jesi Kim (J)

New York University Grossman School of Medicine, New York, NY, USA.

Mitchell Pleasure (M)

New York University Grossman School of Medicine, New York, NY, USA.

Christopher Moczulski (C)

New York University Grossman School of Medicine, New York, NY, USA.

L Julian Lechuga Lopez (LJL)

New York University Abu Dhabi, Abu Dhabi, United Arab Emirates.

Hao Zhang (H)

New York University Grossman School of Medicine, New York, NY, USA.

Mariam Al Harbi (MA)

Abu Dhabi Health Services, United Arab Emirates.

Farah E Shamout (FE)

New York University Abu Dhabi, Abu Dhabi, United Arab Emirates.

Vincent J Major (VJ)

New York University Grossman School of Medicine, New York, NY, USA.

Laura Heacock (L)

New York University Grossman School of Medicine, New York, NY, USA.

Linda Moy (L)

New York University Grossman School of Medicine, New York, NY, USA.

Freya Schnabel (F)

New York University Grossman School of Medicine, New York, NY, USA.

Linda M Pak (LM)

New York University Grossman School of Medicine, New York, NY, USA.

Yiqiu Shen (Y)

New York University, New York, NY, USA.

Krzysztof J Geras (KJ)

New York University Grossman School of Medicine, New York, NY, USA.

MeSH classifications