Handling Missing Data in COVID-19 Incidence Estimation: Secondary Data Analysis.

Keywords: COVID-19; incidence rate; Vietnam; analytical method; crude RMSE; crude bias; imputation method; pandemic; percentage change; population health; root mean square error; surveillance

Journal

JMIR public health and surveillance
ISSN: 2369-2960
Abbreviated title: JMIR Public Health Surveill
Country: Canada
NLM ID: 101669345

Publication information

Publication date:
20 Aug 2024
History:
received: 17 Oct 2023
revised: 5 Jun 2024
accepted: 12 Jun 2024
medline: 21 Aug 2024
pubmed: 21 Aug 2024
entrez: 21 Aug 2024
Status: epublish

Abstract sections

Background
The COVID-19 pandemic has revealed significant challenges in disease forecasting and in developing a public health response, emphasizing the need to manage missing data from various sources in making accurate forecasts.
Objective
We aimed to show how handling missing data can affect estimates of the COVID-19 incidence rate (CIR) in different pandemic situations.
Methods
This study used data from the COVID-19/SARS-CoV-2 surveillance system at the National Institute of Hygiene and Epidemiology, Vietnam. We separated the available data set into 3 distinct periods: zero COVID-19, transition, and new normal. To simulate data missing completely at random, we randomly removed 5% to 30% of the values, in 5% increments, from the daily COVID-19 caseload variable. We selected 7 analytical methods to assess the effects of handling missing data and calculated statistical and epidemiological indices to measure the effectiveness of each method.
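The simulation design described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the case counts and the population figure are made up, the function names (`mask_mcar`, `impute_median`, `incidence_rate`, `rmse`, `absolute_percentage_change`) are hypothetical, and median imputation stands in for any one of the 7 methods, whose exact configuration the abstract does not specify.

```python
import random
import statistics


def mask_mcar(values, fraction, seed=0):
    """Set a uniformly random `fraction` of entries to None (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_median(values):
    """Fill every missing entry with the median of the observed entries."""
    fill = statistics.median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def incidence_rate(cases, population, per=100_000):
    """Cumulative cases per `per` population over the series."""
    return sum(cases) / population * per


def rmse(truth, estimate):
    """Root mean square error between two equal-length series."""
    return (sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth)) ** 0.5


def absolute_percentage_change(true_value, estimate):
    """Absolute difference from the true value, as a percentage of it."""
    return abs(estimate - true_value) / abs(true_value) * 100


# Hypothetical daily case counts and population (illustrative only).
daily_cases = [12, 15, 9, 20, 31, 28, 40, 35, 22, 18,
               25, 30, 27, 19, 14, 16, 21, 33, 29, 24]
population = 1_000_000
true_cir = incidence_rate(daily_cases, population)

# Remove 5% to 30% of values in 5% steps, impute, and compare with the truth.
for pct in range(5, 35, 5):
    masked = mask_mcar(daily_cases, pct / 100, seed=pct)
    filled = impute_median(masked)
    apc = absolute_percentage_change(true_cir, incidence_rate(filled, population))
    print(f"{pct}% missing: APC in CIR = {apc:.1f}%, "
          f"RMSE = {rmse(daily_cases, filled):.2f}")
```

Because the population is held constant, the APC in the incidence rate here equals the APC in total cases; a real analysis would repeat the masking many times per missingness level and average the indices.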
Results
Our study examined missing data imputation performance across 3 study periods: zero COVID-19 (n=3149), transition (n=1290), and new normal (n=9288). K-nearest neighbor (KNN) imputation had the lowest mean absolute percentage change (APC) in CIR across the range (5% to 30%) of missing data. For instance, with 15% missing data, KNN resulted in average biases of 10.6%, 10.6%, and 9.7% across the zero COVID-19, transition, and new normal periods, respectively, compared to 39.9%, 51.9%, and 289.7% with the maximum likelihood method. The autoregressive integrated moving average model showed the greatest mean APC in the mean number of confirmed cases of COVID-19 during each COVID-19 containment cycle (CCC) when we imputed the missing data in the zero COVID-19 period, rising from 226.3% at the 5% missing level to 6955.7% at the 30% missing level. Median imputation had the lowest bias in the average number of confirmed cases in each CCC at all levels of missing data. Specifically, in the 20% missing scenario, median imputation had an average bias of 16.3% for confirmed cases in each CCC, which was lower than the KNN figure, whereas maximum likelihood imputation had the highest average bias, at 92.4%. During the new normal period in the 25% and 30% missing data scenarios, KNN imputation had average biases ranging from 21% to 32% for both CIR and confirmed cases in each CCC, while maximum likelihood and moving average imputation showed average biases above 250% for both.
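To make the KNN and median methods named in the Results concrete, here is a small sketch of one way a nearest-neighbor idea can be applied to a univariate daily series: each missing day is filled with the mean of the k observed days closest to it in time. This is an assumption made for illustration; the abstract does not state which features or distance the study's KNN imputation used, and the function names and the toy series are hypothetical.

```python
import statistics


def impute_knn_time(values, k=3):
    """Fill each missing day with the mean of the k observed days nearest in time."""
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("series has no observed values to impute from")
    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            continue
        # Sort observed points by distance in days; ties keep the earlier day first.
        nearest = sorted(observed, key=lambda point: abs(point[0] - i))[:k]
        filled.append(statistics.mean([value for _, value in nearest]))
    return filled


def impute_median(values):
    """Fill each missing day with the median of all observed days."""
    fill = statistics.median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


# Toy rising series with one missing day (illustrative values, not study data).
series = [2, 4, 8, 16, 32, None, 128]
print("KNN (k=2):", impute_knn_time(series, k=2))
print("Median:   ", impute_median(series))
```

The neighbor-based fill follows the local level of the series, while the median fill uses one global value for every gap; which behavior yields less bias depends on the index being estimated and on the epidemic period.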
Conclusions
Our study emphasizes that investigators should tailor the imputation method to the specific epidemiological context and data collection environment to ensure reliable estimates of the CIR.

Identifiers

pubmed: 39166439
pii: v10i1e53719
doi: 10.2196/53719

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

e53719

Copyright information

© Hai-Thanh Pham, Toan Do, Jonggyu Baek, Cong-Khanh Nguyen, Quang-Thai Pham, Hoa L Nguyen, Robert Goldberg, Quang Loc Pham, Le Minh Giang. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org).

Authors

Hai-Thanh Pham (HT)

School of Preventive Medicine and Public Health, Hanoi Medical University, 1 Ton That Tung Street, Kim Lien Ward, Dong Da District, Hanoi, 100000, Vietnam, 84 368-577-4236.

Toan Do (T)

School of Preventive Medicine and Public Health, Hanoi Medical University, 1 Ton That Tung Street, Kim Lien Ward, Dong Da District, Hanoi, 100000, Vietnam, 84 368-577-4236.

Jonggyu Baek (J)

UMass Chan Medical School, University of Massachusetts Medical School, Worcester, MA, United States.

Cong-Khanh Nguyen (CK)

National Institute of Hygiene and Epidemiology, Hanoi, Vietnam.

Quang-Thai Pham (QT)

School of Preventive Medicine and Public Health, Hanoi Medical University, 1 Ton That Tung Street, Kim Lien Ward, Dong Da District, Hanoi, 100000, Vietnam, 84 368-577-4236.
National Institute of Hygiene and Epidemiology, Hanoi, Vietnam.

Hoa L Nguyen (HL)

UMass Chan Medical School, University of Massachusetts Medical School, Worcester, MA, United States.

Robert Goldberg (R)

UMass Chan Medical School, University of Massachusetts Medical School, Worcester, MA, United States.

Quang Loc Pham (QL)

School of Preventive Medicine and Public Health, Hanoi Medical University, 1 Ton That Tung Street, Kim Lien Ward, Dong Da District, Hanoi, 100000, Vietnam, 84 368-577-4236.

Le Minh Giang (LM)

School of Preventive Medicine and Public Health, Hanoi Medical University, 1 Ton That Tung Street, Kim Lien Ward, Dong Da District, Hanoi, 100000, Vietnam, 84 368-577-4236.
