Applying computer text mining algorithms for oversampling tumor mutation status in medical records for the NCI Patterns of Care studies.
ALK
EGFR
Non-small cell lung cancer
Text mining algorithm
Tumor mutation
Journal
International journal of medical informatics
ISSN: 1872-8243
Titre abrégé: Int J Med Inform
Pays: Ireland
ID NLM: 9711057
Informations de publication
Date de publication:
09 2023
09 2023
Historique:
received:
25
02
2023
revised:
29
06
2023
accepted:
16
07
2023
medline:
14
8
2023
pubmed:
22
7
2023
entrez:
22
7
2023
Statut:
ppublish
Résumé
The National Cancer Institute (NCI) conducts Patterns of Care (POC) studies for selected cancer sites under a Congressional Mandate. These studies aim to collect treatment information beyond what is typically collected by the NCI's Surveillance, Epidemiology, and End Results (SEER) Program. The 2019 POC study focused on non-small cell lung cancer (NSCLC) and melanoma cancer sites. For the NSCLC cases, one of the primary sampling objectives was to oversample patients who tested positive for EGFR/ALK mutations, but initial information on mutation test results was unavailable prior to selecting the study sample. To address this, text mining algorithms were developed to screen all eligible NSCLC cases from the SEER database. These algorithms were designed to identify the mutation test status, allowing for stratified sampling based on SEER registry, sex, race/ethnicity, and tumor mutation test results. The final NSCLC sample included 2,434 patients aged 20+ with advanced stage (IIIB-IVB) NSCLC diagnosed in 2017 and 2018. Among this sample, 692 cases (13.2%) tested positive for EGFR/ALK mutations. An evaluation of the text mining algorithms performance, based on cases where both algorithm results and known EGFR/ALK status from medical chart abstraction were available, showed good results: sensitivity of 77.6%, specificity of 90.8%, and an overall accuracy 84.8%. The adaption of text mining algorithm proved effective in oversample patients with uncommon conditions in studies where electronic medical records are accessible. The 2019 POC study provides valuable data for researchers to evaluate cancer therapy details and patient characteristics, particularly among those with EGFR/ALK test positive cases.
Sections du résumé
BACKGROUNDS
The National Cancer Institute (NCI) conducts Patterns of Care (POC) studies for selected cancer sites under a Congressional Mandate. These studies aim to collect treatment information beyond what is typically collected by the NCI's Surveillance, Epidemiology, and End Results (SEER) Program. The 2019 POC study focused on non-small cell lung cancer (NSCLC) and melanoma cancer sites. For the NSCLC cases, one of the primary sampling objectives was to oversample patients who tested positive for EGFR/ALK mutations, but initial information on mutation test results was unavailable prior to selecting the study sample.
METHODS
To address this, text mining algorithms were developed to screen all eligible NSCLC cases from the SEER database. These algorithms were designed to identify the mutation test status, allowing for stratified sampling based on SEER registry, sex, race/ethnicity, and tumor mutation test results.
RESULTS
The final NSCLC sample included 2,434 patients aged 20+ with advanced stage (IIIB-IVB) NSCLC diagnosed in 2017 and 2018. Among this sample, 692 cases (13.2%) tested positive for EGFR/ALK mutations. An evaluation of the text mining algorithms performance, based on cases where both algorithm results and known EGFR/ALK status from medical chart abstraction were available, showed good results: sensitivity of 77.6%, specificity of 90.8%, and an overall accuracy 84.8%.
CONCLUSIONS
The adaption of text mining algorithm proved effective in oversample patients with uncommon conditions in studies where electronic medical records are accessible. The 2019 POC study provides valuable data for researchers to evaluate cancer therapy details and patient characteristics, particularly among those with EGFR/ALK test positive cases.
Identifiants
pubmed: 37480595
pii: S1386-5056(23)00175-2
doi: 10.1016/j.ijmedinf.2023.105157
pii:
doi:
Substances chimiques
Anaplastic Lymphoma Kinase
EC 2.7.10.1
ErbB Receptors
EC 2.7.10.1
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
105157Informations de copyright
Copyright © 2023. Published by Elsevier B.V.
Déclaration de conflit d'intérêts
Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.