Benchmark Pathology Report Text Corpus with Cancer Type Classification.

BERT; TCGA; cancer pathology; cancer type classification; large language models; machine learning; pathology reports; resource; transformer model

Journal

medRxiv : the preprint server for health sciences

Abbreviated title: medRxiv

Country: United States

NLM ID: 101767986

Publication information

Publication date:
08 Aug 2023

History:

pubmed: 23 Aug 2023

medline: 23 Aug 2023

entrez: 23 Aug 2023

Status: epublish

Abstract

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using AI allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to publicly available report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. We perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.

Identifiers

DOI: 10.1101/2023.08.03.23293618 PMID: 37609238 PMC: PMC10441484

pubmed: 37609238

doi: 10.1101/2023.08.03.23293618

pmc: PMC10441484

Publication types

Preprint

Languages

eng

Grants

Agency: NIGMS NIH HHS

ID: R35 GM131905

Country: United States

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Authors

Jenna Kefeli (J)

Nicholas Tatonetti (N)

MeSH classifications