Generalizable and automated classification of TNM stage from pathology reports with external validation.


Journal

Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
16 Oct 2024
History:
received: 13 Jul 2023
accepted: 04 Oct 2024
medline: 17 Oct 2024
pubmed: 17 Oct 2024
entrez: 16 Oct 2024
Status: epublish

Abstract

Cancer stage is an essential clinical attribute that informs patient prognosis and clinical trial eligibility, yet it is not routinely recorded in structured electronic health records. Here, we present BB-TEN (Big Bird - TNM staging Extracted from Notes), a generalizable method for automatically classifying TNM stage directly from pathology report text. We train a BERT-based model on publicly available pathology reports from approximately 7000 patients across 23 cancer types, and we compare models that differ in input size, parameter count, and architecture. Our final model goes beyond term extraction: it infers TNM stage from context when the stage is not stated explicitly in the report text. As external validation, we test the model on almost 8000 pathology reports from Columbia University Medical Center, where it achieves an AU-ROC of 0.815-0.942. This suggests that the model can be applied broadly to other institutions without additional institution-specific fine-tuning.
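
The AU-ROC figure quoted in the abstract can be illustrated with a small worked example. The abstract does not say how the metric was computed, so the sketch below assumes a one-vs-rest AU-ROC per stage class; the class names, labels, and scores are hypothetical toy values, not data from the study, and the function names `auroc` and `one_vs_rest_auroc` are illustrative only.

```python
from typing import Dict, List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AU-ROC as the probability that a randomly chosen positive
    example scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auroc(
    true_classes: Sequence[str], class_scores: Dict[str, List[float]]
) -> Dict[str, float]:
    """AU-ROC for each class against all other classes (assumed scheme)."""
    return {
        cls: auroc([1 if t == cls else 0 for t in true_classes], scores)
        for cls, scores in class_scores.items()
    }


# Hypothetical toy data: six reports, three T-stage classes.
true_classes = ["T1", "T2", "T3", "T1", "T2", "T3"]
class_scores = {
    "T1": [0.8, 0.3, 0.1, 0.6, 0.7, 0.2],
    "T2": [0.1, 0.6, 0.2, 0.3, 0.2, 0.3],
    "T3": [0.1, 0.1, 0.7, 0.1, 0.1, 0.5],
}

per_class = one_vs_rest_auroc(true_classes, class_scores)
print(per_class)
```

The pairwise definition is quadratic in the number of reports, which is acceptable for a demonstration; on a validation set of several thousand reports a rank-based library implementation would be the more practical choice.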

Identifiers

pubmed: 39414770
doi: 10.1038/s41467-024-53190-9
pii: 10.1038/s41467-024-53190-9

Publication types

Journal Article; Validation Study

Languages

eng

Citation subsets

IM

Pagination

8916

Copyright information

© 2024. The Author(s).

Authors

Jenna Kefeli (J)

Department of Systems Biology, Columbia University, New York, NY, USA.

Jacob Berkowitz (J)

Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA.

Jose M Acitores Cortina (JM)

Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA.

Kevin K Tsang (KK)

Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA.

Nicholas P Tatonetti (NP)

Department of Systems Biology, Columbia University, New York, NY, USA. nicholas.tatonetti@cshs.org.
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
Department of Biomedical Informatics, Columbia University, New York, NY, USA.
