Multiple imputation using auxiliary imputation variables that only predict missingness can increase bias due to data missing not at random.

Humans; Bias; Data Interpretation, Statistical; Models, Statistical; Computer Simulation; Algorithms; Logistic Models; Research Design / statistics & numerical data

ALSPAC; Auxiliary variable; Bias amplification; Missing data; Multiple imputation

Journal

BMC medical research methodology

ISSN: 1471-2288

Abbreviated title: BMC Med Res Methodol

Country: England

NLM ID: 100968545

Publication information

Publication date:
07 Oct 2024

History:

received: 06 Nov 2023

accepted: 26 Sep 2024

medline: 08 Oct 2024

pubmed: 08 Oct 2024

entrez: 07 Oct 2024

Status: epublish

Abstract

BACKGROUND: Epidemiological and clinical studies often have missing data, which are frequently analysed using multiple imputation (MI). In general, MI estimates will be biased if data are missing not at random (MNAR). Bias due to data MNAR can be reduced by including other variables ("auxiliary variables") in imputation models, in addition to those required for the substantive analysis. Common advice is to take an inclusive approach to auxiliary variable selection (i.e. include all variables thought to be predictive of missingness and/or the missing values). There are no clear guidelines about the impact of this strategy when data may be MNAR.

METHODS: We explore the impact of including an auxiliary variable predictive of missingness but, in truth, unrelated to the partially observed variable, when data are MNAR. We quantify, algebraically and by simulation, the magnitude of the additional bias of the MI estimator for the exposure coefficient (fitting either a linear or logistic regression model), when the (continuous or binary) partially observed variable is either the analysis outcome or the exposure. Here, "additional bias" refers to the difference in magnitude of the MI estimator when the imputation model includes (i) the auxiliary variable and the other analysis model variables; (ii) just the other analysis model variables, noting that both will be biased due to data MNAR. We illustrate the extent of this additional bias by re-analysing data from a birth cohort study.

RESULTS: The additional bias can be relatively large when the outcome is partially observed and missingness is caused by the outcome itself, and even larger if missingness is caused by both the outcome and the exposure (when either the outcome or exposure is partially observed).

CONCLUSIONS: When using MI, the naïve and commonly used strategy of including all available auxiliary variables should be avoided. We recommend including the variables most predictive of the partially observed variable as auxiliary variables, where these can be identified through consideration of the plausible causal diagrams and missingness mechanisms, as well as data exploration (noting that associations with the partially observed variable in the complete records may be distorted due to selection bias).
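The scenario in the abstract can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not the authors' simulation design: the function name `simulate`, the parameter values, the probit-style missingness rule, and the simplified imputation step (regression imputation with residual noise, without drawing the imputation-model parameters from their posterior) are all hypothetical choices made for illustration. A continuous outcome `y` is partially observed, its missingness depends on `y` itself (MNAR) and on an auxiliary variable `z` that is independent of `y`, and the exposure coefficient is estimated after imputing `y` with and without `z` in the imputation model.

```python
import numpy as np


def simulate(n=50_000, beta=1.0, a=2.0, c=2.0, m=5, seed=0):
    """Compare MI slope estimates with and without a missingness-only auxiliary.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)               # fully observed exposure
    z = rng.normal(size=n)               # auxiliary: independent of x and y
    y = beta * x + rng.normal(size=n)    # outcome; true exposure coefficient is beta
    # MNAR: y is observed only when a latent score driven by y and z is positive.
    observed = a * y + c * z + rng.normal(size=n) > 0

    def ols(design, target):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return coef

    analysis_design = np.column_stack([np.ones(n), x])

    def mi_slope(predictors):
        # Fit the imputation model on the complete records only.
        design = np.column_stack([np.ones(n)] + predictors)
        coef = ols(design[observed], y[observed])
        resid_sd = np.std(y[observed] - design[observed] @ coef)
        n_mis = int((~observed).sum())
        slopes = []
        for _ in range(m):
            y_imp = y.copy()
            # Simplified imputation: fitted value plus residual noise.
            y_imp[~observed] = design[~observed] @ coef + rng.normal(
                scale=resid_sd, size=n_mis
            )
            slopes.append(ols(analysis_design, y_imp)[1])
        # Pooled point estimate: mean across the m imputed data sets.
        return float(np.mean(slopes))

    return {
        "true": beta,
        "without_aux": mi_slope([x]),
        "with_aux": mi_slope([x, z]),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions both estimates should fall below the true coefficient, with the estimate from the imputation model that includes `z` further from the truth. This is the direction of "additional bias" the abstract describes, although the size of the gap depends entirely on the assumed parameter values.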

Identifiers

DOI: 10.1186/s12874-024-02353-9 PMID: 39375597

pii: 10.1186/s12874-024-02353-9

Publication types

Journal Article

Languages

eng

Pagination

231

Authors

Elinor Curnow (E)

Rosie P Cornish (RP)

Jon E Heron (JE)

James R Carpenter (JR)

Kate Tilling (K)
