Regression without regrets - initial data analysis is a prerequisite for multivariable regression.
Data screening
Functional form
IDA framework
Initial data analysis
Regression models
Reporting
STRATOS Initiative
Variable selection
Variable transformation
Journal
BMC medical research methodology
ISSN: 1471-2288
Abbreviated title: BMC Med Res Methodol
Country: England
NLM ID: 100968545
Publication information
Publication date:
08 Aug 2024
History:
received: 08 Nov 2023
revised: 01 Jul 2024
accepted: 24 Jul 2024
entrez: 08 Aug 2024
pubmed: 09 Aug 2024
medline: 09 Aug 2024
Status:
epublish
Abstract
Statistical regression models are used for predicting outcomes based on the values of some predictor variables or for describing the association of an outcome with predictors. With a data set at hand, a regression model can be easily fit with standard software packages. This bears the risk that data analysts may rush to perform sophisticated analyses without sufficient knowledge of the basic properties of their data, the associations within them, and their errors, leading to wrong interpretation of the modeling results and to a presentation that lacks clarity. Ignorance about special features of the data such as redundancies or particular distributions may even invalidate the chosen analysis strategy. Initial data analysis (IDA) is a prerequisite to regression analyses as it provides the knowledge about the data needed to confirm the appropriateness of or to refine a chosen model building strategy, to interpret the modeling results correctly, and to guide the presentation of modeling results. To facilitate reproducibility, IDA needs to be preplanned, an IDA plan should be included in the general statistical analysis plan of a research project, and results should be well documented. A key principle of IDA is that it abstains from evaluating associations of outcome and predictors, which minimizes bias in the statistical inference of the final regression model. We give advice on which aspects to consider in an IDA plan for data screening in the context of regression modeling to supplement the statistical analysis plan. We illustrate this IDA plan for data screening in an example of a typical diagnostic modeling project and give recommendations for data visualizations.
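The key principle stated in the abstract (screen the data without evaluating outcome-predictor associations) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' IDA plan: the function `screen`, the helper `pearson`, the variable names `age`, `crp`, `wbc`, `outcome`, and the toy values are all hypothetical. The sketch reports missing values, univariate summaries, and pairwise correlations among predictors, and never reads the outcome column.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen(rows, predictors):
    """Screen predictors only; None marks a missing value.

    The outcome is deliberately never accessed, so no outcome-predictor
    association is evaluated at this stage.
    """
    report = {"missing": {}, "univariate": {}, "correlation": {}}
    for p in predictors:
        observed = [r[p] for r in rows if r[p] is not None]
        report["missing"][p] = len(rows) - len(observed)
        report["univariate"][p] = {
            "n": len(observed),
            "min": min(observed),
            "median": median(observed),
            "max": max(observed),
        }
    # Pairwise correlations among predictors, using complete pairs only.
    for i, p in enumerate(predictors):
        for q in predictors[i + 1:]:
            pairs = [(r[p], r[q]) for r in rows
                     if r[p] is not None and r[q] is not None]
            x, y = zip(*pairs)
            report["correlation"][(p, q)] = pearson(x, y)
    return report


# Hypothetical toy data; "outcome" is present but not passed to screen().
rows = [
    {"age": 34, "crp": 1.2, "wbc": 6.1, "outcome": 0},
    {"age": 51, "crp": None, "wbc": 7.4, "outcome": 1},
    {"age": 67, "crp": 8.9, "wbc": 11.0, "outcome": 1},
    {"age": 45, "crp": 3.3, "wbc": 8.0, "outcome": 0},
]
report = screen(rows, ["age", "crp", "wbc"])
```

Passing the predictor names explicitly, rather than screening every column, is one simple way to keep the outcome out of any association check until the prespecified regression analysis.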
Identifiers
pubmed: 39117997
doi: 10.1186/s12874-024-02294-3
pii: 10.1186/s12874-024-02294-3
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
178
Grants
Agency: Deutsche Forschungsgemeinschaft
ID: SA 580/10-3
Agency: US National Center for Advancing Translational Sciences
ID: CTSA award No. UL1 TR002243
Copyright information
© 2024. The Author(s).