Unsupervised random forests.

Keywords: impurity; sidClustering; staggered interaction data; unsupervised learning

Journal

Statistical analysis and data mining

ISSN: 1932-1864

Abbreviated title: Stat Anal Data Min

Country: United States

NLM ID: 101492808

Publication information

Publication date:
Apr 2021

History:

entrez: 9 Apr 2021

pubmed: 10 Apr 2021

medline: 10 Apr 2021

Status: ppublish

Abstract

sidClustering is a new unsupervised machine learning algorithm based on random forests. The first step in sidClustering is what is called sidification of the features: staggering the features so that they have mutually exclusive ranges (called the staggered interaction data [SID] main features) and then forming all pairwise interactions (called the SID interaction features). A multivariate random forest (able to handle both continuous and categorical variables) is then used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in-depth case studies, one from a large multi-institutional study of esophageal cancer and the other involving hospital charges for cardiovascular patients.
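
The sidification step described above can be illustrated with a minimal sketch for continuous features only. The abstract does not give the exact construction, so everything specific here is an assumption made for illustration: the function name `sidify`, the gap parameter `delta`, shifting each feature by its minimum, and using products as the pairwise interactions. It is not the authors' implementation, and categorical features are not handled.

```python
from itertools import combinations


def sidify(rows, delta=1.0):
    """Hypothetical sketch of sidification for continuous features.

    rows  : list of observations, each a list of numeric feature values.
    delta : assumed gap placed between consecutive staggered ranges.

    Returns (main, inter, pairs), where main and inter are lists of columns.
    """
    n_features = len(rows[0])
    # Work column by column.
    columns = [[float(r[j]) for r in rows] for j in range(n_features)]

    main = []
    offset = delta  # start above zero so every staggered value is positive
    for col in columns:
        lowest = min(col)
        # Shift the feature so that its minimum sits at the current offset.
        shifted = [v - lowest + offset for v in col]
        main.append(shifted)
        # The next feature starts strictly above this feature's maximum.
        offset = max(shifted) + delta

    # All pairwise interactions; the product is an assumed choice.
    pairs = list(combinations(range(n_features), 2))
    inter = [[a * b for a, b in zip(main[i], main[j])] for i, j in pairs]
    return main, inter, pairs


if __name__ == "__main__":
    data = [[0.0, 10.0, -5.0], [1.0, 12.0, -3.0], [2.0, 11.0, -4.0]]
    main, inter, pairs = sidify(data)
    for j, col in enumerate(main):
        print("SID main feature", j, "range:", (min(col), max(col)))
    print("interaction pairs:", pairs)
```

In this sketch the staggered columns play the role of the SID main features and the products play the role of the SID interaction features. A multi-output forest would then be fit to predict the former, which is the step the abstract attributes to the multivariate random forest.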

Identifiers

DOI: 10.1002/sam.11498 PMID: 33833846 PMC: PMC8025042

mid: NIHMS1683277

Publication types

Journal Article

Languages

eng

Pagination

144-167

Grants

Agency: NIGMS NIH HHS

ID: R01 GM125072

Country: United States

Conflict of interest statement

None declared.
