Benchmark Pathology Report Text Corpus with Cancer Type Classification.
BERT
TCGA
cancer pathology
cancer type
classification
large language models
machine learning
pathology reports
resource
transformer model
Journal
medRxiv: the preprint server for health sciences
Abbreviated title: medRxiv
Country: United States
NLM ID: 101767986
Publication information
Publication date:
08 Aug 2023
History:
pubmed: 23 Aug 2023
medline: 23 Aug 2023
entrez: 23 Aug 2023
Status:
epublish
Abstract
In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using AI allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to publicly available report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. We perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
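The abstract reports an average AU-ROC across 32 tissues but does not say how the average was taken. As an illustration only, the sketch below assumes an unweighted (macro) mean of one-vs-rest AU-ROC values, which is one common reading. The function names, the three class labels, and all scores are hypothetical toy values, not data or code from the paper.

```python
def binary_auroc(scores, labels):
    """AU-ROC for one class, computed as the probability that a positive
    example outscores a negative one (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auroc(class_scores, true_labels):
    """Unweighted mean of one-vs-rest AU-ROC over all classes.

    class_scores maps each class name to a list of per-report scores;
    true_labels holds the true class name of each report, in the same order.
    """
    aucs = []
    for cls, scores in class_scores.items():
        is_cls = [t == cls for t in true_labels]  # one-vs-rest labels
        aucs.append(binary_auroc(scores, is_cls))
    return sum(aucs) / len(aucs)


# Hypothetical toy example: six reports, three cancer types.
true_labels = ["BRCA", "BRCA", "LUAD", "LUAD", "KIRC", "KIRC"]
class_scores = {
    "BRCA": [0.9, 0.6, 0.2, 0.7, 0.1, 0.1],
    "LUAD": [0.05, 0.3, 0.7, 0.2, 0.2, 0.1],
    "KIRC": [0.05, 0.1, 0.1, 0.1, 0.7, 0.8],
}
toy_auroc = macro_ovr_auroc(class_scores, true_labels)
print(f"macro one-vs-rest AU-ROC on toy data: {toy_auroc:.4f}")
```

A macro average weights every cancer type equally regardless of how many reports it has; a prevalence-weighted average would be an equally plausible interpretation and could give a different number.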
Identifiers
pubmed: 37609238
doi: 10.1101/2023.08.03.23293618
pmc: PMC10441484
Publication types
Preprint
Languages
eng
Grants
Agency: NIGMS NIH HHS
ID: R35 GM131905
Country: United States
Conflict of interest statement
Declaration of interests: The authors declare no competing interests.