Multiple imputation using auxiliary imputation variables that only predict missingness can increase bias due to data missing not at random.
ALSPAC
Auxiliary variable
Bias amplification
Missing data
Multiple imputation
Journal
BMC medical research methodology
ISSN: 1471-2288
Abbreviated title: BMC Med Res Methodol
Country: England
NLM ID: 100968545
Publication information
Publication date:
07 Oct 2024
History:
received: 06 Nov 2023
accepted: 26 Sep 2024
medline: 08 Oct 2024
pubmed: 08 Oct 2024
entrez: 07 Oct 2024
Status:
epublish
Abstract
BACKGROUND
Epidemiological and clinical studies often have missing data, which are frequently analysed using multiple imputation (MI). In general, MI estimates will be biased if data are missing not at random (MNAR). Bias due to data MNAR can be reduced by including other variables ("auxiliary variables") in imputation models, in addition to those required for the substantive analysis. Common advice is to take an inclusive approach to auxiliary variable selection (i.e. include all variables thought to be predictive of missingness and/or the missing values). There are no clear guidelines about the impact of this strategy when data may be MNAR.
METHODS
We explore the impact of including an auxiliary variable that is predictive of missingness but, in truth, unrelated to the partially observed variable, when data are MNAR. We quantify, algebraically and by simulation, the magnitude of the additional bias of the MI estimator for the exposure coefficient (fitting either a linear or logistic regression model), when the (continuous or binary) partially observed variable is either the analysis outcome or the exposure. Here, "additional bias" refers to the difference in magnitude of the MI estimator when the imputation model includes (i) the auxiliary variable and the other analysis model variables; (ii) just the other analysis model variables, noting that both will be biased due to data MNAR. We illustrate the extent of this additional bias by re-analysing data from a birth cohort study.
RESULTS
The additional bias can be relatively large when the outcome is partially observed and missingness is caused by the outcome itself, and even larger if missingness is caused by both the outcome and the exposure (when either the outcome or exposure is partially observed).
CONCLUSIONS
When using MI, the naïve and commonly used strategy of including all available auxiliary variables should be avoided. We recommend including the variables most predictive of the partially observed variable as auxiliary variables, where these can be identified through consideration of the plausible causal diagrams and missingness mechanisms, as well as data exploration (noting that associations with the partially observed variable in the complete records may be distorted due to selection bias).
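The scenario described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a linear analysis model with a standard normal exposure `x`, a partially observed outcome `y`, and an auxiliary variable `z` that is independent of both but affects the probability that `y` is observed. The function names, sample size, coefficients (`beta`, `a_y`, `a_z`), and the logistic missingness model are all hypothetical choices made for illustration. Imputation uses a basic Bayesian normal linear regression written with NumPy only, and the MI point estimate is the average of the per-imputation slopes.

```python
import numpy as np


def ols(design, y):
    """Least-squares coefficients for y regressed on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def impute_once(rng, design, y, observed):
    """Return a completed copy of y, with missing values drawn from a normal
    linear imputation model fitted to the complete records (assumed setup)."""
    xo, yo = design[observed], y[observed]
    coef = ols(xo, yo)
    resid = yo - xo @ coef
    dof = xo.shape[0] - xo.shape[1]
    # Draw the residual variance and coefficients from their approximate posterior.
    sigma2 = resid @ resid / rng.chisquare(dof)
    chol = np.linalg.cholesky(sigma2 * np.linalg.inv(xo.T @ xo))
    coef_draw = coef + chol @ rng.standard_normal(coef.size)
    completed = y.copy()
    n_missing = int((~observed).sum())
    noise = rng.normal(0.0, np.sqrt(sigma2), n_missing)
    completed[~observed] = design[~observed] @ coef_draw + noise
    return completed


def simulate(n=100_000, beta=1.0, a_y=1.0, a_z=1.0, m=5, seed=2024):
    """Compare MI estimates of the exposure coefficient with and without z.
    All parameter values are illustrative assumptions, not taken from the paper."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = rng.normal(size=n)               # auxiliary: unrelated to x and y
    y = beta * x + rng.normal(size=n)
    # MNAR: the chance of observing y depends on y itself and on z.
    p_obs = 1.0 / (1.0 + np.exp(-(a_y * y + a_z * z)))
    observed = rng.random(n) < p_obs

    ones = np.ones(n)
    analysis = np.column_stack([ones, x])
    designs = {
        "without_aux": analysis,
        "with_aux": np.column_stack([ones, x, z]),
    }
    estimates = {}
    for name, design in designs.items():
        slopes = [
            ols(analysis, impute_once(rng, design, y, observed))[1]
            for _ in range(m)
        ]
        estimates[name] = float(np.mean(slopes))  # pooled MI point estimate

    # Coefficient of z in the complete records (selection induces an association).
    aux_coef = float(ols(designs["with_aux"][observed], y[observed])[2])
    return estimates, aux_coef


estimates, aux_coef = simulate()
print("true exposure coefficient: 1.0")
print("MI estimates:", estimates)
print("complete-records coefficient of z:", aux_coef)
```

Comparing the two entries of `estimates` against the true coefficient gives the "additional bias" in the sense used above for this particular set of assumed parameters. Varying `a_y` and `a_z` changes the strength of the missingness mechanism and therefore how far the two estimates differ.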
Identifiers
pubmed: 39375597
doi: 10.1186/s12874-024-02353-9
pii: 10.1186/s12874-024-02353-9
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
231
Copyright information
© 2024. The Author(s).