Accounting for careless and insufficient effort responding in large-scale survey data: development, evaluation, and application of a screen-time-based weighting procedure.

Careless responding; Data screening; Finite mixture modeling; Item response theory; Maximum pseudo-likelihood estimation; Screen times

Journal

Behavior Research Methods
ISSN: 1554-3528
Abbreviated title: Behav Res Methods
Country: United States
NLM ID: 101244316

Publication information

Publication date:
03 Mar 2023
History:
accepted: 09 Dec 2022
entrez: 03 Mar 2023
pubmed: 04 Mar 2023
medline: 04 Mar 2023
Status: aheadofprint

Abstract

Careless and insufficient effort responding (C/IER) poses a major threat to the quality of large-scale survey data. Traditional indicator-based procedures for its detection are limited in that they are sensitive only to specific types of C/IER behavior, such as straightlining or rapid responding, rely on arbitrary threshold settings, and cannot account for the uncertainty of C/IER classification. To overcome these limitations, we develop a two-step screen-time-based weighting procedure for computer-administered surveys. The procedure accounts for the uncertainty in C/IER identification, is agnostic towards the specific types of C/IE response patterns, and can feasibly be integrated with common analysis workflows for large-scale survey data. In Step 1, we draw on mixture modeling to identify subcomponents of log screen time distributions presumably stemming from C/IER. In Step 2, the analysis model of choice is applied to item response data, with respondents' posterior class probabilities used to downweight response patterns according to their probability of stemming from C/IER. We illustrate the approach on a sample of more than 400,000 respondents who were administered 48 scales of the PISA 2018 background questionnaire. We gather supporting validity evidence by investigating relationships between C/IER proportions and screen characteristics that entail higher cognitive burden, such as screen position and text length, by relating identified C/IER proportions to other indicators of C/IER, and by investigating rank-order consistency in C/IER behavior across screens. Finally, in a re-analysis of the PISA 2018 background questionnaire data, we investigate the impact of the C/IER adjustments on country-level comparisons.
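
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-component Gaussian mixture on log screen times for a single screen, assumes that the faster component is the one attributed to C/IER, and uses a weighted mean as a stand-in for the "analysis model of choice". The simulated data and all function and variable names are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_first(x, pi, mu, sigma):
    """Posterior probability that log screen time x belongs to component 0."""
    d0 = pi[0] * normal_pdf(x, mu[0], sigma[0])
    d1 = pi[1] * normal_pdf(x, mu[1], sigma[1])
    total = d0 + d1
    return d0 / total if total > 0.0 else 0.5


def fit_two_component_mixture(log_times, n_iter=200):
    """Fit a two-component Gaussian mixture to log screen times with EM."""
    xs = sorted(log_times)
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    # Crude starting values: one mean in the lower tail, one at the median.
    pi, mu, sigma = [0.5, 0.5], [xs[n // 10], xs[n // 2]], [sd, sd]
    for _ in range(n_iter):
        # E-step: posterior membership in component 0 for every respondent.
        r0 = [posterior_first(x, pi, mu, sigma) for x in log_times]
        # M-step: update mixing proportions, means, and standard deviations.
        for k, resp in enumerate((r0, [1.0 - r for r in r0])):
            nk = sum(resp)
            mu[k] = sum(r * x for r, x in zip(resp, log_times)) / nk
            var = sum(r * (x - mu[k]) ** 2 for r, x in zip(resp, log_times)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
            pi[k] = nk / n
    return pi, mu, sigma


# Hypothetical data: 200 fast, careless respondents and 800 attentive ones.
random.seed(1)
log_times = ([random.gauss(1.5, 0.3) for _ in range(200)]
             + [random.gauss(3.5, 0.4) for _ in range(800)])
responses = [1.0] * 200 + [4.0] * 800  # illustrative scale scores

# Step 1: mixture model on log screen times.
pi, mu, sigma = fit_two_component_mixture(log_times)
fast = 0 if mu[0] <= mu[1] else 1  # assumption: faster component = C/IER
pi_cier = pi[fast]
p_first = [posterior_first(x, pi, mu, sigma) for x in log_times]
p_cier = [p if fast == 0 else 1.0 - p for p in p_first]

# Step 2: downweight each respondent by the posterior probability of C/IER.
weights = [1.0 - p for p in p_cier]
unweighted_mean = sum(responses) / len(responses)
weighted_mean = sum(w * y for w, y in zip(weights, responses)) / sum(weights)

print(f"estimated C/IER proportion: {pi_cier:.3f}")
print(f"unweighted mean: {unweighted_mean:.3f}, weighted mean: {weighted_mean:.3f}")
```

Because the weights are continuous posterior probabilities rather than a hard cutoff, respondents whose screen times fall between the two components are only partially downweighted, which is how the uncertainty of classification enters the analysis in this sketch.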

Identifiers

pubmed: 36867339
doi: 10.3758/s13428-022-02053-6
pii: 10.3758/s13428-022-02053-6

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© 2023. The Author(s).

Authors

Esther Ulitzsch (E)

IPN-Leibniz Institute for Science and Mathematics Education, Educational Measurement, Olshausenstraße 62, 24118, Kiel, Germany. ulitzsch@leibniz-ipn.de.
Centre for International Student Assessment, Munich, Germany. ulitzsch@leibniz-ipn.de.

Hyo Jeong Shin (HJ)

Sogang University, Seoul, South Korea.

Oliver Lüdtke (O)

IPN-Leibniz Institute for Science and Mathematics Education, Educational Measurement, Olshausenstraße 62, 24118, Kiel, Germany.
Centre for International Student Assessment, Munich, Germany.

MeSH classifications