Comparison of Methods for Estimating Temporal Topic Models From Primary Care Clinical Text Data: Retrospective Closed Cohort Study.

BERTopic; clinical text data; latent Dirichlet allocation; nonnegative matrix factorization; structural topic model; temporal topic model; text mining

Journal

JMIR medical informatics
ISSN: 2291-9694
Abbreviated title: JMIR Med Inform
Country: Canada
NLM ID: 101645109

Publication information

Publication date:
19 Dec 2022
History:
received: 2022-06-06
accepted: 2022-09-18
revised: 2022-09-01
entrez: 2022-12-19
pubmed: 2022-12-20
medline: 2022-12-20
Status: epublish

Abstract

BACKGROUND
Health care organizations are collecting increasing volumes of clinical text data. Topic models are a class of unsupervised machine learning algorithms for discovering latent thematic patterns in these large unstructured document collections.
OBJECTIVE
We aimed to comparatively evaluate several methods for estimating temporal topic models using clinical notes obtained from primary care electronic medical records from Ontario, Canada.
METHODS
We used a retrospective closed cohort design. The study spanned from January 01, 2011, through December 31, 2015, discretized into 20 quarterly periods. Patients were included in the study if they generated at least 1 primary care clinical note in each of the 20 quarterly periods. These patients represented a unique cohort of individuals engaging in high-frequency use of the primary care system. The following temporal topic modeling algorithms were fitted to the clinical note corpus: nonnegative matrix factorization, latent Dirichlet allocation, the structural topic model, and the BERTopic model.
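The inclusion rule described above can be illustrated with a minimal sketch. It assumes each note is available as a (patient identifier, note date) pair; the function names `quarter_index` and `closed_cohort` and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from datetime import date

STUDY_START_YEAR = 2011
N_QUARTERS = 20  # 2011-Q1 through 2015-Q4


def quarter_index(d):
    """Map a date to a 0-based quarter index in the study window (None if outside it)."""
    idx = (d.year - STUDY_START_YEAR) * 4 + (d.month - 1) // 3
    return idx if 0 <= idx < N_QUARTERS else None


def closed_cohort(notes):
    """Return patient ids with at least one note in every one of the 20 quarters.

    `notes` is an iterable of (patient_id, note_date) pairs (an assumed layout).
    """
    quarters_seen = defaultdict(set)
    for patient_id, note_date in notes:
        q = quarter_index(note_date)
        if q is not None:
            quarters_seen[patient_id].add(q)
    return {p for p, qs in quarters_seen.items() if len(qs) == N_QUARTERS}


def quarter_start(q):
    """First day of the q-th study quarter (helper for building toy data)."""
    return date(STUDY_START_YEAR + q // 4, 3 * (q % 4) + 1, 1)


# Toy data: patient A has a note in every quarter, patient B misses one quarter.
notes = [("A", quarter_start(q)) for q in range(N_QUARTERS)]
notes += [("B", quarter_start(q)) for q in range(N_QUARTERS) if q != 7]
print(closed_cohort(notes))  # {'A'}
```

Only patient A satisfies the criterion, because patient B has no note in one of the 20 quarters.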
RESULTS
Temporal topic models consistently identified latent topical patterns in the clinical note corpus. The learned topical bases identified meaningful activities conducted by the primary health care system. Latent topics displaying near-constant temporal dynamics were consistently estimated across models (eg, pain, hypertension, diabetes, sleep, mood, anxiety, and depression). Several topics displayed predictable seasonal patterns over the study period (eg, respiratory disease and influenza immunization programs).
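One simple way to inspect how topics evolve over the quarterly periods is to average per-document topic weights within each period. The sketch below assumes a fitted model has already produced a nonnegative weight vector per note; this aggregation is an illustrative choice and is not claimed to be the procedure used in the study. The function name `topic_prevalence_by_period` and the toy numbers are hypothetical.

```python
def topic_prevalence_by_period(doc_topic_weights, doc_periods, n_periods):
    """Average normalized topic weights over the documents in each period.

    doc_topic_weights: list of nonnegative weight vectors, one per document.
    doc_periods: 0-based period index for each document.
    Returns an n_periods x n_topics list of mean topic proportions.
    """
    n_topics = len(doc_topic_weights[0])
    sums = [[0.0] * n_topics for _ in range(n_periods)]
    counts = [0] * n_periods
    for weights, period in zip(doc_topic_weights, doc_periods):
        total = sum(weights)
        if total == 0:
            continue  # a document with no topic mass carries no information
        counts[period] += 1
        for k, w in enumerate(weights):
            sums[period][k] += w / total
    return [
        [s / counts[t] for s in sums[t]] if counts[t] else [0.0] * n_topics
        for t in range(n_periods)
    ]


# Toy example: 4 documents, 2 topics, one document per period.
weights = [[1, 1], [3, 1], [1, 1], [1, 3]]
periods = [0, 1, 2, 3]
prevalence = topic_prevalence_by_period(weights, periods, n_periods=4)
for t, row in enumerate(prevalence):
    print(t, row)
```

A topic with near-constant dynamics would show a roughly flat column in this table, while a seasonal topic would rise and fall with a period of four quarters.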
CONCLUSIONS
Nonnegative matrix factorization, latent Dirichlet allocation, structural topic model, and BERTopic are based on different underlying statistical frameworks (eg, linear algebra and optimization, Bayesian graphical models, and neural embeddings), require tuning unique hyperparameters (optimizers, priors, etc), and have distinct computational requirements (data structures, computational hardware, etc). Despite the heterogeneity in statistical methodology, the learned latent topical summarizations and their temporal evolution over the study period were consistently estimated. Temporal topic models represent an interesting class of models for characterizing and monitoring the primary health care system.
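To make the "linear algebra and optimization" framing concrete, the following is a minimal sketch of nonnegative matrix factorization using the standard multiplicative update rules for squared error. It is written in plain Python for illustration only; the tiny document-term matrix, the function names, and the settings (`n_topics`, `n_iter`, `seed`, `eps`) are assumptions, and this is not the implementation used by the authors.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative document-term matrix V (docs x terms) as W (docs x topics) times H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H), elementwise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # Update W: W <- W * (V H^T) / (W H H^T), elementwise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy document-term counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
w, h = nmf(v, n_topics=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

The rows of `h` play the role of topics (weights over terms) and the rows of `w` give each document's loading on those topics. Latent Dirichlet allocation, the structural topic model, and BERTopic arrive at analogous document-topic and topic-term summaries through Bayesian graphical models or neural embeddings rather than this kind of matrix factorization.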

Identifiers

pubmed: 36534443
pii: v10i12e40102
doi: 10.2196/40102
pmc: PMC9808604

Publication types

Journal Article

Languages

eng

Pagination

e40102

Copyright information

©Christopher Meaney, Michael Escobar, Therese A Stukel, Peter C Austin, Liisa Jaakkimainen. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 19.12.2022.

Authors

Christopher Meaney (C)

Dalla Lana School of Public Health, Division of Biostatistics, University of Toronto, Toronto, ON, Canada.
Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada.

Michael Escobar (M)

Dalla Lana School of Public Health, Division of Biostatistics, University of Toronto, Toronto, ON, Canada.

Therese A Stukel (TA)

ICES, Toronto, ON, Canada.
Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada.

Peter C Austin (PC)

ICES, Toronto, ON, Canada.
Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada.

Liisa Jaakkimainen (L)

Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada.
ICES, Toronto, ON, Canada.
Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada.

MeSH classifications