Regression without regrets - initial data analysis is a prerequisite for multivariable regression.

Data screening; Functional form; IDA framework; Initial data analysis; Regression models; Reporting; STRATOS Initiative; Variable selection; Variable transformation

Journal

BMC Medical Research Methodology
ISSN: 1471-2288
Abbreviated title: BMC Med Res Methodol
Country: England
NLM ID: 100968545

Publication information

Publication date:
08 Aug 2024
History:
received: 08 Nov 2023
accepted: 24 Jul 2024
revised: 01 Jul 2024
medline: 09 Aug 2024
pubmed: 09 Aug 2024
entrez: 08 Aug 2024
Status: epublish

Abstract

Statistical regression models are used for predicting outcomes based on the values of some predictor variables or for describing the association of an outcome with predictors. With a data set at hand, a regression model can easily be fit with standard software packages. This bears the risk that data analysts may rush to perform sophisticated analyses without sufficient knowledge of the basic properties of, associations in, and errors in their data, leading to incorrect interpretation of the modeling results and to a presentation that lacks clarity. Ignorance of special features of the data, such as redundancies or particular distributions, may even invalidate the chosen analysis strategy. Initial data analysis (IDA) is a prerequisite for regression analyses because it provides the knowledge about the data needed to confirm the appropriateness of, or to refine, a chosen model building strategy, to interpret the modeling results correctly, and to guide the presentation of modeling results. To facilitate reproducibility, IDA needs to be preplanned: an IDA plan should be included in the general statistical analysis plan of a research project, and its results should be well documented. Bias in statistical inference from the final regression model can be minimized if IDA abstains from evaluating associations between the outcome and the predictors, a key principle of IDA. We give advice on which aspects to consider in an IDA plan for data screening in the context of regression modeling to supplement the statistical analysis plan. We illustrate this IDA plan for data screening in an example of a typical diagnostic modeling project and give recommendations for data visualizations.
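
As an illustration of the key principle stated in the abstract, the following is a minimal sketch of an outcome-blind data screening step. It is not the authors' IDA plan. The function `screen_predictors`, the cutoff `redundancy_cutoff`, the variable names, and the toy values are all hypothetical and chosen only for this example. The sketch counts missing values, summarizes each predictor's distribution, and flags highly correlated predictor pairs as possible redundancies, without ever reading the outcome column.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable has no defined correlation
    return sxy / sqrt(sxx * syy)


def screen_predictors(rows, predictors, redundancy_cutoff=0.9):
    """Screen predictors only; the outcome is never passed in or inspected.

    `redundancy_cutoff` is an arbitrary illustrative threshold, not a
    recommendation taken from the source.
    """
    report = {"missing": {}, "univariate": {}, "redundant_pairs": []}
    for name in predictors:
        observed = [r[name] for r in rows if r.get(name) is not None]
        report["missing"][name] = len(rows) - len(observed)
        report["univariate"][name] = {
            "min": min(observed),
            "median": median(observed),
            "max": max(observed),
        }
    for a, b in combinations(predictors, 2):
        # Use only rows in which both predictors are observed.
        pairs = [(r[a], r[b]) for r in rows
                 if r.get(a) is not None and r.get(b) is not None]
        xs, ys = zip(*pairs)
        r_ab = pearson(xs, ys)
        if abs(r_ab) >= redundancy_cutoff:
            report["redundant_pairs"].append((a, b, round(r_ab, 3)))
    return report


# Hypothetical toy data; "bacteremia" stands in for the outcome and is
# deliberately left out of the list of screened variables.
rows = [
    {"age": 70, "wbc": 6.1, "neutrophils": 3.9, "bacteremia": 0},
    {"age": 35, "wbc": 14.2, "neutrophils": 11.8, "bacteremia": 1},
    {"age": 58, "wbc": None, "neutrophils": 7.0, "bacteremia": 0},
    {"age": 81, "wbc": 9.5, "neutrophils": 6.7, "bacteremia": 0},
    {"age": 44, "wbc": 18.0, "neutrophils": 15.9, "bacteremia": 1},
]
report = screen_predictors(rows, ["age", "wbc", "neutrophils"])
print(report["missing"])
print(report["redundant_pairs"])
```

Passing the predictor names explicitly, rather than screening every column, is one simple way to make the outcome-blind constraint visible in code. In a real project the list of screened variables and every summary would be fixed in advance in the IDA plan.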

Identifiers

pubmed: 39117997
doi: 10.1186/s12874-024-02294-3
pii: 10.1186/s12874-024-02294-3

Publication types

Journal Article

Languages

English

Citation subsets

IM

Pagination

178

Grants

Agency: Deutsche Forschungsgemeinschaft
ID: SA 580/10-3
Agency: US National Center for Advancing Translational Sciences
ID: CTSA award No. UL1 TR002243

Copyright information

© 2024. The Author(s).

Authors

Georg Heinze (G)

Center for Medical Data Science, Institute of Clinical Biometrics, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria. georg.heinze@meduniwien.ac.at.

Mark Baillie (M)

Novartis Pharma AG, Basel, Switzerland.

Lara Lusa (L)

Faculty of Mathematics, Department of Mathematics, University of Primorska, Natural Sciences and Information Technology, Koper, Slovenia.
Faculty of Medicine, Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia.

Willi Sauerbrei (W)

Faculty of Medicine and Medical Center, Institute of Medical Biometry and Statistics, University of Freiburg, Freiburg, Germany.

Carsten Oliver Schmidt (CO)

Institute of Community Medicine, University Medicine of Greifswald, SHIP-KEF, Greifswald, Germany.

Frank E Harrell (FE)

School of Medicine, Department of Biostatistics, Vanderbilt University, Nashville, TN, USA.

Marianne Huebner (M)

Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA.
