Block Forests: random forests for blocks of clinical and omics covariate data.

Cancer Machine learning Multi-omics data Prediction Random forest Statistics Survival analysis

Journal

BMC Bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
27 Jun 2019
History:
received: 15 Jan 2019
accepted: 07 Jun 2019
entrez: 29 Jun 2019
pubmed: 30 Jun 2019
medline: 16 Aug 2019
Status: epublish

Abstract

In recent years, multi-omics data, that is, data comprising measurements of several omics data types for each patient, have become increasingly available. Using multi-omics data as covariates in outcome prediction is both promising and challenging because of their complex structure. Random forest is a prediction method known for its ability to capture complex dependency patterns between the outcome and the covariates. Against this background, we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous, and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcomes, we compared the prediction performance of the block forest variants with that of alternatives. We also considered the common special case in which clinical covariates and measurements of a single omics data type are available. One variant, termed "block forest", outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best-performing variants have in common that the block choice is randomized in the split point selection procedure. When clinical covariates and a single omics data type were available, the improvements of the variants over random survival forest were larger than in the multi-omics case. The degree of improvement over random survival forest varied strongly across data sets. Moreover, mandatorily considering all clinical covariates improved performance. This result should, however, be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application.
The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed the alternatives in the comparison. Block forest is particularly effective in the special case of using clinical covariates in combination with measurements of a single omics data type.
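
The abstract states that the variants change split point selection to respect the block structure, and that the two best variants randomize the block choice. The following Python fragment is a rough sketch of what randomized block choice at a single split could look like. It is not the authors' implementation: the function name `sample_split_candidates`, the inclusion probability `block_prob`, the square-root rule for the number of covariates tried per block, and the toy block sizes are all assumptions made for illustration.

```python
import math
import random


def sample_split_candidates(blocks, rng, block_prob=0.5):
    """Return candidate covariate indices for one split, grouped by block.

    blocks: dict mapping a block name to the list of covariate indices in it.
    block_prob: assumed probability of including each block (illustrative).
    """
    names = list(blocks)
    # Randomized block choice: each block is included independently.
    chosen = [name for name in names if rng.random() < block_prob]
    if not chosen:
        # Guarantee at least one block so the split has candidates.
        chosen = [rng.choice(names)]
    candidates = {}
    for name in chosen:
        indices = blocks[name]
        # Assumed rule: try about sqrt(block size) covariates per block.
        k = max(1, math.isqrt(len(indices)))
        candidates[name] = rng.sample(indices, k)
    return candidates


# Toy block layout (hypothetical sizes): a small clinical block and two omics blocks.
blocks = {
    "clinical": list(range(0, 5)),
    "mrna": list(range(5, 105)),
    "methylation": list(range(105, 505)),
}
rng = random.Random(0)
candidates = sample_split_candidates(blocks, rng)
for name, indices in candidates.items():
    print(name, len(indices))
```

Sampling within each block, rather than from the pooled covariates, keeps a small block such as the clinical covariates from being swamped by much larger omics blocks whenever that block is chosen. How the candidate splits are then scored and compared across blocks is not described in the abstract and is left out of this sketch.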


Identifiers

pubmed: 31248362
doi: 10.1186/s12859-019-2942-y
pii: 10.1186/s12859-019-2942-y
pmc: PMC6598279

Publication types

Journal Article

Languages

eng

Pagination

358

Grants

Agency: Deutsche Forschungsgemeinschaft
ID: BO3139/6-1
Agency: Deutsche Forschungsgemeinschaft
ID: BO3139/4-3
Agency: Deutsche Forschungsgemeinschaft
ID: HO6422/1-2

Authors

Roman Hornung (R)

Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany. hornung@ibe.med.uni-muenchen.de.

Marvin N Wright (MN)

Leibniz Institute for Prevention Research and Epidemiology - BIPS, Achterstr. 30, Bremen, 28359, Germany.
Section of Biostatistics, Department of Public Health, University of Copenhagen, Øster Farimagsgade 5, Copenhagen, 1014, Denmark.
