Unsupervised random forests.
Impurity
sidClustering
staggered interaction data
unsupervised learning
Journal
Statistical analysis and data mining
ISSN: 1932-1864
Abbreviated title: Stat Anal Data Min
Country: United States
NLM ID: 101492808
Publication information
Publication date:
Apr 2021
History:
entrez: 9 4 2021
pubmed: 10 4 2021
medline: 10 4 2021
Status:
ppublish
Abstract
sidClustering is a new unsupervised machine learning algorithm based on random forests. The first step in sidClustering is sidification of the features: staggering the features so that they have mutually exclusive ranges (called the staggered interaction data [SID] main features) and then forming all pairwise interactions (called the SID interaction features). A multivariate random forest (able to handle both continuous and categorical variables) is then used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in-depth case studies, one from a large multi-institutional study of esophageal cancer and the other involving hospital charges for cardiovascular patients.
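The sidification step described in the abstract can be illustrated with a minimal sketch for purely numeric features. This is not the authors' implementation: the function name `sidify`, the `gap` parameter, the choice to stagger by simple translation, and the use of products as the pairwise interactions are all assumptions made for illustration. Categorical features and the multivariate random forest step are not covered.

```python
from itertools import combinations


def sidify(rows, gap=1.0):
    """Stagger numeric columns into disjoint ranges, then add pairwise products.

    rows: list of equal-length lists of numbers, one inner list per observation.
    gap:  hypothetical spacing between consecutive feature ranges (an assumption).
    Returns (main, interactions), both as lists of rows.
    """
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]

    staggered = []
    floor = gap  # start above zero so every staggered value is positive
    for col in columns:
        lo, hi = min(col), max(col)
        # Translate the column so it spans [floor, floor + (hi - lo)].
        staggered.append([x - lo + floor for x in col])
        # The next feature starts strictly above the top of this one.
        floor = floor + (hi - lo) + gap

    # Back to row-major layout: these play the role of the SID main features.
    main = [list(vals) for vals in zip(*staggered)]

    # All pairwise interactions, taken here as products (an assumption).
    pairs = list(combinations(range(n_features), 2))
    interactions = [[row[a] * row[b] for a, b in pairs] for row in main]
    return main, interactions


if __name__ == "__main__":
    data = [[0.0, 10.0, -1.0], [1.0, 12.0, 0.0], [2.0, 11.0, 1.0]]
    main, inter = sidify(data)
    print(main)
    print(inter)
```

In this sketch the staggered columns occupy non-overlapping intervals, so each value identifies which feature it came from. The `main` rows would then serve as the multivariate response that the forest predicts.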
Identifiers
pubmed: 33833846
doi: 10.1002/sam.11498
pmc: PMC8025042
mid: NIHMS1683277
Publication types
Journal Article
Languages
eng
Pagination
144-167
Grants
Agency: NIGMS NIH HHS
ID: R01 GM125072
Country: United States
Conflict of interest statement
None declared.