Random forests for the analysis of matched case-control studies.

CLogitForest; Conditional logistic regression; Conditional logistic regression forests; Machine learning; Matched case–control studies; Random forest

Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
01 Aug 2024
History:
received: 03 Jan 2024
accepted: 22 Jul 2024
medline: 02 Aug 2024
pubmed: 02 Aug 2024
entrez: 01 Aug 2024
Status: epublish

Abstract

BACKGROUND
Conditional logistic regression trees have been proposed as a flexible alternative to the standard method of conditional logistic regression for the analysis of matched case-control studies. While they avoid the strict assumption of linearity and automatically incorporate interactions, conditional logistic regression trees may suffer from relatively high variability. Further machine learning methods for the analysis of matched case-control studies are lacking because conventional machine learning methods cannot handle the matched structure of the data.
RESULTS
A random forest method for the analysis of matched case-control studies, based on conditional logistic regression trees, is proposed; it overcomes the issue of high variability. It provides accurate estimation of exposure effects while being more flexible in the functional form of covariate effects. The efficacy of the method is illustrated in a simulation study and in an application to real-world data from a matched case-control study on the effect of regular participation in cervical cancer screening on the development of cervical cancer.
CONCLUSIONS
The proposed random forest method is a promising addition to the toolbox for the analysis of matched case-control studies and addresses the need for machine learning methods in this field. It is more flexible than both the standard method of conditional logistic regression and conditional logistic regression trees. It allows for non-linearity and the automatic inclusion of interaction effects and is suitable for both exploratory and explanatory analyses.
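Illustrative note (not part of the original record): the abstract takes standard conditional logistic regression as its baseline. The sketch below shows that baseline in its simplest form, 1:1 matched pairs with a single covariate, where each pair contributes the probability that the observed case, rather than its matched control, is the case. It is not the authors' CLogitForest method and not the API of the CLogitTree R package; the function names `conditional_loglik` and `fit_beta` and the toy pair counts are assumptions made up for illustration.

```python
import math


def conditional_loglik(beta, diffs):
    """Conditional log-likelihood for 1:1 matched pairs and one covariate.

    diffs[i] = x_case - x_control within pair i (hypothetical input format).
    Each pair contributes log(1 / (1 + exp(-beta * d))).
    """
    return -sum(math.log1p(math.exp(-beta * d)) for d in diffs)


def fit_beta(diffs, n_iter=50, tol=1e-10):
    """Maximise the conditional likelihood with Newton-Raphson steps."""
    beta = 0.0
    for _ in range(n_iter):
        # p = modelled probability that the observed case is the case in its pair
        probs = [1.0 / (1.0 + math.exp(-beta * d)) for d in diffs]
        score = sum(d * (1.0 - p) for d, p in zip(diffs, probs))
        info = sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
        if info == 0.0:  # only concordant pairs: beta is not identifiable
            break
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Toy data (invented): binary exposure, 30 pairs with only the case exposed,
# 15 pairs with only the control exposed, 40 concordant pairs.
diffs = [1] * 30 + [-1] * 15 + [0] * 40
beta_hat = fit_beta(diffs)
print(f"estimated log odds ratio: {beta_hat:.4f}")
print(f"discordant-pair ratio log(30/15): {math.log(30 / 15):.4f}")
```

For a binary exposure, the estimate coincides with the log ratio of discordant pairs, and concordant pairs carry no information. A tree or forest variant, as described in the abstract, would replace the single linear term with a more flexible structure; that part is not sketched here.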

Identifiers

pubmed: 39090608
doi: 10.1186/s12859-024-05877-5
pii: 10.1186/s12859-024-05877-5

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

253

Grants

Agency: Bundesministerium für Gesundheit
ID: NKP-332-049
Agency: Deutsche Forschungsgemeinschaft
ID: BE 7543/1-1

Copyright information

© 2024. The Author(s).

Authors

Gunther Schauberger (G)

Chair of Epidemiology, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany. gunther.schauberger@tum.de.

Stefanie J Klug (SJ)

Chair of Epidemiology, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany.

Moritz Berger (M)

Institute of Medical Biometry, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Bonn, Germany.
