Controlling costs: Feature selection on a budget.

Keywords: feature cost; feature selection; multiple knockoffs; weighted false discovery proportion

Journal

Stat
ISSN: 0038-9986
Abbreviated title: Stat
Country: United States
NLM ID: 9875635

Publication information

Publication date:
Dec 2022
History:
medline: 1 12 2022
pubmed: 1 12 2022
entrez: 22 1 2024
Status: ppublish

Abstract

The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
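As an illustration of the quantity the abstract bounds, the sketch below computes a weighted false discovery proportion as the share of the selected features' total cost that goes to unimportant features. This is a minimal sketch, not the authors' implementation: the function name `weighted_fdp`, its arguments, and the convention of returning 0 for an empty selection are assumptions made for this example, and the set of truly important features is known here only because the data are made up.

```python
from typing import Collection, Sequence


def weighted_fdp(
    selected: Collection[int],
    costs: Sequence[float],
    important: Collection[int],
) -> float:
    """Fraction of the selected features' cost spent on unimportant features.

    selected  -- indices of the features chosen by some selection procedure
    costs     -- cost of measuring each feature, indexed by feature
    important -- indices of the truly important features (hypothetical ground truth)
    """
    total_cost = sum(costs[j] for j in selected)
    if total_cost == 0:
        # Assumed convention: nothing selected (or zero total cost) wastes nothing.
        return 0.0
    wasted_cost = sum(costs[j] for j in selected if j not in important)
    return wasted_cost / total_cost


# Made-up example: four features with unequal costs.
costs = [1.0, 5.0, 2.0, 10.0]
selected = [0, 1, 3]   # features picked by the procedure
important = {0, 1}     # feature 3 is selected but unimportant
print(weighted_fdp(selected, costs, important))  # 10 / 16 = 0.625
```

With equal costs this reduces to the ordinary false discovery proportion; with unequal costs, wrongly selecting the expensive feature 3 is penalized more heavily than wrongly selecting a cheap one would be.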

Identifiers

pubmed: 38250253
doi: 10.1002/sta4.427
pmc: PMC10798788

Publication types

Journal Article

Languages

eng

Authors

Guo Yu (G)

Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, California, USA.

Daniela Witten (D)

Department of Statistics and Biostatistics, University of Washington, Seattle, Washington, USA.

Jacob Bien (J)

Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California, USA.

MeSH classifications