This research paper discusses the motivation and background of grouped and hierarchical model selection using composite absolute penalties (CAP). It covers different penalization methods such as L1-penalty and provides algorithms, interpretation, and examples for the CAP approach. The paper also explores the challenges and opportunities of statistics in the IT age.
Grouped and Hierarchical Model Selection through Composite Absolute Penalties (CAP). Bin Yu, Department of Statistics, UC Berkeley. Joint work with Peng Zhao and Guilherme Rocha.
Outline • Motivation • Background • Penalization methods • L1-penalty • Composite Absolute Penalties (CAP) • Building Blocks – $L_\gamma$-norm Regularization • Definition • Interpretation • Algorithms • Examples and Results
IT is Providing a Golden Time for Statistics: Computer technology advances in the IT age have tremendously increased our data collection abilities. Diverse origins of IT data: • IT core areas (IT phenomenon): information retrieval, information extraction, natural language processing, web search • IT systems areas (CS hard-core): chip design, program debugging, network tomography • IT influence areas (impacted by IT): remote sensing, astronomy, neuroscience, finance, …
Characteristics of Modern Data Set Problems • Goal is efficient use of data for: • Prediction • Interpretation • Larger number of variables: • Number of variables (p) in data sets is large • Sample sizes (n) have not increased at the same pace • Scientific opportunities: • New findings in different scientific fields
Cyberinfrastructure challenges/opportunities for statistics: Cyberinfrastructure (NSF CS Div. Report in 2003): integrating computer technology into the very fabric of science to extract useful information from the data deluge. For statistics to be part of the making of this cyberinfrastructure to solve problems outside statistics, it is necessary to integrate into our statistical framework the considerations or constraints of • storage (databases) • communication (streaming data, sensor networks) • computation (including memory usage). Feature selection (data reduction) is useful for the first two areas and for interpretation, and it needs fast computation.
Automatic Feature Selection: Model selection is too expensive: Jornsten and Yu (2003) use the Minimum Description Length (MDL) principle for simultaneous gene selection and sample classification with competitive results. They pre-selected about 100 genes from 6000. There were still 2^100 ≈ 1.3 × 10^30 possible subsets. Combinatorial search over all subsets is too expensive. Recent alternatives: continuous embedding into a convex optimization problem through Lasso -- third generation computational methods in statistics.
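To get a feel for the size of this search space, the subset count can be checked directly. This is only an illustrative sketch; the variable names are not from the talk.

```python
# Each of the 100 pre-selected genes is either in or out of a candidate model,
# so the number of subsets is 2 to the power 100.
n_genes = 100
n_subsets = 2 ** n_genes

print(n_subsets)            # exact integer count
print(f"{n_subsets:.1e}")   # same count in scientific notation
```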
Computation for Statistical Inference First generation computation in statistics before computers: use parametric models with closed form solutions for maximum likelihood estimators or Bayes estimators. Second generation computation with computers: design statistically optimal procedures and worry about computation later. Call optimization routines. Third generation computation: form statistical goals with computation in mind and take advantage of special features of statistical computation.
Lasso: L1-norm as a penalty • The L1 penalty is defined for coefficients $\beta$: $T(\beta) = \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ • Used initially with L2 loss: • Signal processing: Basis Pursuit (Chen & Donoho, 1994) • Statistics: LASSO (Tibshirani, 1996) • Properties: • Sparsity (variable selection) • Convexity (convex relaxation of L0-penalty) • “Stability” (non-negative garrote, Breiman, 1995)
Lasso: L1-norm as a penalty. Computation: the “right” tuning parameter $\lambda$ is unknown, so a “path” is needed (discretized or continuous) • Initially: a quadratic program (QP) is solved for each $\lambda$ on a grid • Later: path-following algorithms: homotopy by Osborne et al (2000), LARS by Efron et al (2004). Theoretical studies: much work recently…
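A discretized path can be sketched in a few lines. The example below is an assumption-laden illustration, not the QP, homotopy, or LARS algorithms named above: it swaps in plain coordinate descent with soft-thresholding for the objective 0.5·||y − Xβ||² + λ·||β||₁, and warm-starts each grid point from the previous solution. The function names `soft_threshold`, `lasso_cd`, and `lasso_path` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, beta=None, n_sweeps=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1 (illustrative)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0    # correlation of column j with the partial residual
            norm2 = 0.0  # squared norm of column j
            for i in range(n):
                fit_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_others)
                norm2 += X[i][j] ** 2
            beta[j] = soft_threshold(rho, lam) / norm2 if norm2 > 0 else 0.0
    return beta


def lasso_path(X, y, lambdas):
    """Solve on a grid from largest to smallest lambda, warm-starting each fit."""
    path, beta = [], None
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam, beta)
        path.append((lam, beta))
    return path


# Toy data with two orthogonal columns (made up for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 1.0, 3.0, 1.0]
for lam, beta in lasso_path(X, y, [0.0, 2.0, 6.0]):
    print(lam, beta)
```

Running the grid from large to small λ shows coefficients entering the model one at a time, which is the behavior the path-following algorithms trace exactly.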
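General Penalization Methods • Given data $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$: • $X_i$: a p-dimensional predictor • $Y_i$: response variable • The parameters are defined by the penalized problem: $\hat{\beta}(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda \, T(\beta) \right]$ • where • $L(Z, \beta)$ is the empirical loss function • $T(\beta)$ is a penalty function • $\lambda \ge 0$ is a tuning parameter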
Beyond Sparsity of Individual Predictors: Natural Structures among Predictors. Rationale: side information might be available and/or additional regularization is needed beyond Lasso for p >> n • Groups: • Genes belonging to the same pathway; • Categorical variables represented by “dummies”; • Polynomial terms from the same variable; • Noisy measurements of the same variable. • Hierarchy: • Multi-resolution/wavelet models; • Interaction terms in factorial analysis (ANOVA); • Order selection in Markov Chain models.
Composite Absolute Penalties (CAP): Overview • The CAP family of penalties: • Highly customizable: • ability to perform grouped selection • ability to perform hierarchical selection • Computational considerations: • Feasibility: Convexity • Efficiency: Piecewise linearity in some cases • Define groups according to structure • Combine properties of $L_\gamma$-norm penalties • Encompass and go beyond existing works: • Elastic Net (Zou & Hastie, 2005) • GLASSO (Yuan & Lin, 2006) • Blockwise Sparse Regression (Kim, Kim & Kim, 2006)
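Composite Absolute Penalties: Review of $L_\gamma$ Regularization. Given data $Z$ and loss function $L(Z, \beta)$: • $L_\gamma$ Regularization: • Penalty: $T(\beta) = \|\beta\|_\gamma = \left( \sum_{j=1}^{p} |\beta_j|^\gamma \right)^{1/\gamma}$ • Estimate: $\hat{\beta}(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda \, T(\beta) \right]$ • where $\lambda > 0$ is a tuning parameter • For the squared error loss function: • Hoerl & Kennard (1970): Ridge ($\gamma = 2$) • Frank & Friedman (1993): Bridge (general $\gamma$) • Tibshirani (1996): LASSO ($\gamma = 1$)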
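Composite Absolute Penalties: Definition • The CAP parameter estimate is given by: $\hat{\beta}(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda \, T(\beta) \right]$ • $G_k$, $k = 1, \dots, K$: indices of the k-th pre-defined group • $\beta_{G_k}$: corresponding vector of coefficients • $\|\cdot\|_{\gamma_k}$: group $L_{\gamma_k}$ norm, $N_k = \|\beta_{G_k}\|_{\gamma_k}$ • $\|\cdot\|_{\gamma_0}$: overall norm, $T(\beta) = \|N\|_{\gamma_0}$ with $N = (N_1, \dots, N_K)$ • groups may overlap (hierarchical selection)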
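The two-level construction can be written out as a short sketch: a norm within each group, then an overall norm across the group norms. The helper names `l_norm` and `cap_penalty`, the toy coefficients, and the choice to represent an infinite norm order with `math.inf` are assumptions made for illustration.

```python
import math


def l_norm(values, gamma):
    """L_gamma norm of a list of numbers; gamma may be math.inf (max norm)."""
    if math.isinf(gamma):
        return max(abs(v) for v in values)
    return sum(abs(v) ** gamma for v in values) ** (1.0 / gamma)


def cap_penalty(beta, groups, group_gammas, gamma0=1.0):
    """Return (T, N): overall penalty T and the list N of group norms.

    groups       -- list of index lists G_k (they may overlap)
    group_gammas -- norm order gamma_k used inside each group
    gamma0       -- order of the overall norm applied to the group norms
    """
    group_norms = [
        l_norm([beta[j] for j in g], gk) for g, gk in zip(groups, group_gammas)
    ]
    return l_norm(group_norms, gamma0), group_norms


# Toy coefficients split into two non-overlapping groups (made-up values).
beta = [3.0, -4.0, 0.0, 2.0]
groups = [[0, 1], [2, 3]]

print(cap_penalty(beta, groups, [2, 2]))                # L2 within groups
print(cap_penalty(beta, groups, [math.inf, math.inf]))  # max norm within groups
```

With the overall order set to 1, the penalty is a plain sum of group norms, which is the setting used for both grouped and hierarchical selection.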
Composite Absolute Penalties: A Bayesian interpretation • For non-overlapping groups: • Prior on group norms: • Prior on individual coefficients:
Composite Absolute Penalties: Group selection • Tailoring $T(\beta)$ for group selection: • Define non-overlapping groups • Set $\gamma_k > 1$, $\forall k \neq 0$: • Group norm $\gamma_k$ tunes similarity within its group • $\gamma_k > 1$ causes all variables in group k to be included/excluded together • Set $\gamma_0 = 1$: • This yields grouped sparsity • $\gamma_k = 2$ has been studied by Yuan and Lin (Grouped Lasso, 2005). [Figure: contour plot for $\gamma_0 = 1$, $\gamma_1 = 2$, $\gamma_2 = 2$]
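The all-in/all-out behavior is easiest to see in a special case. The sketch below assumes an orthonormal design, so that the problem reduces to minimizing 0.5·||z − β||² + λ·Σ_k ||β_{G_k}||₂ for a vector z of least-squares coefficients; under that assumption the standard closed-form solution scales each group as a block. The function name `group_soft_threshold` and the numbers are illustrative, not taken from the talk.

```python
import math


def group_soft_threshold(z, groups, lam):
    """Block soft-thresholding: shrink each group's L2 norm by lam.

    A group whose norm is at most lam is set to zero as a whole; otherwise
    every coefficient in the group is kept and scaled by the same factor.
    """
    beta = [0.0] * len(z)
    for g in groups:
        norm = math.sqrt(sum(z[j] ** 2 for j in g))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for j in g:
            beta[j] = scale * z[j]
    return beta


# One strong group and one weak group (made-up values).
z = [3.0, 4.0, 0.3, -0.4]
groups = [[0, 1], [2, 3]]
print(group_soft_threshold(z, groups, lam=1.0))
```

The first group survives with both coefficients shrunk by the same factor, while the second group is removed entirely, so no group is ever partially selected.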
Composite Absolute Penalties: Hierarchical Structures • Tailoring $T(\beta)$ for Hierarchical Structure: • Set $\gamma_0 = 1$ • Set $\gamma_i > 1$, $\forall i$ • Groups overlap: • If $\beta_2$ appears in all groups where $\beta_1$ is included • Then $X_2$ enters the model after $X_1$ • As an example:
Composite Absolute Penalties: Hierarchical Structures • Represent Hierarchy by a directed graph: [Figure: directed graph with nodes X1 and X2] • Then construct penalty by: • For graph above, $\gamma_0 = 1$, $\gamma_r = \infty$:
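One way to turn a directed graph into overlapping groups is sketched here. It assumes each node's group consists of the node plus all of its descendants, that the overall norm is a sum, and that the within-group norm is the max norm; the two-node graph with an edge from X1 to X2 and the helper names (`descendants`, `hierarchical_groups`, `hierarchical_penalty`) are illustrative assumptions.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen = []
    stack = list(children.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.append(cur)
            stack.extend(children.get(cur, []))
    return seen


def hierarchical_groups(children):
    """One group per node: the node itself plus all of its descendants."""
    return {node: [node] + descendants(children, node) for node in children}


def hierarchical_penalty(beta, groups):
    """Sum over groups (overall order 1) of the max |coefficient| in the group."""
    return sum(max(abs(beta[j]) for j in g) for g in groups.values())


# Assumed two-node hierarchy: X1 is the parent of X2.
children = {"X1": ["X2"], "X2": []}
groups = hierarchical_groups(children)
print(groups)

# Same coefficient size placed on the parent versus on the child.
print(hierarchical_penalty({"X1": 0.5, "X2": 0.0}, groups))
print(hierarchical_penalty({"X1": 0.0, "X2": 0.5}, groups))
```

Because the child's coefficient is charged in every group that contains its parent, a nonzero child costs more than a nonzero parent of the same size, which is what pushes X2 to enter the model after X1.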
Composite Absolute Penalties: Computation • CAP with general $L_\gamma$ norms • Approximate algorithms available for tracing the regularization path • Two examples: • Rosset (2004) • Boosted Lasso (Zhao and Yu, 2004): BLASSO • CAP with $L_1$–$L_\infty$ norms • Exact algorithms for tracing the regularization path • Some applications: • Grouped Selection: iCAP • Hierarchical Selection: hiCAP for ANOVA and wavelets
iCAP: Degrees of Freedom (DFs) for tuning parameter selection. Two ways of selecting the tuning parameter in iCAP: 1. Cross-validation 2. Model selection criterion AICc, where the DF used is a generalization of Zou et al (2004)’s df for Lasso to iCAP.
Simulation Studies: Summary of Results • Good prediction accuracy: • Extra structure results in lower model errors • Sparsity/Parsimony: • Less sparse models in the $L_0$ sense • Sparser in terms of degrees of freedom • Estimated degrees of freedom (Group, iCAP only) • Good choices for regularization parameter • AICc: model errors close to CV
Grouping examples: Case 1 Settings • Goals: • Comparison of different group norms • Comparison of CV against AICc • $Y = X\beta + \varepsilon$ • Settings: [Figure: coefficient profile]
Grouping example, Case 1: LASSO vs. iCAP sample paths [Figure: LASSO path and iCAP path; x-axis: number of steps; y-axis: normalized coefficients]
Grouping example, Case 1: Comparison of norms and clusterings [Figure: 10-fold CV model error for LASSO and for iCAP with $\gamma_k = 2$, $4$, $\infty$ under 0.5K, 1.0K, and 1.5K clusters]
Grouping example, Case 1: Comparison of selection [Figure: 10-fold CV model error under CV and AICc selection for LASSO and for iCAP with 0.5K, 1.0K, and 1.5K clusters]
Grouping examples: Case 2 Settings • Goals: • Comparison of performance when number of predictors (p) grows • Comparison of performance when number of groups (K) grows • $Y = X\beta + \varepsilon$ • Settings: • Coefficients are randomly selected: • Grouped Laplacian: • K coefficients drawn independently from a double exponential with parameter G • Coefficients are repeated within groups • Individual Laplacian: • p coefficients drawn independently from a double exponential with parameter I
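The two coefficient schemes can be sketched as follows. The slide does not say whether the double exponential parameter is a rate or a scale, so the sketch treats it as a scale; the group sizes, seed, and function names (`laplace`, `grouped_laplacian`, `individual_laplacian`) are likewise assumptions.

```python
import random


def laplace(rng, scale):
    """Draw from a double exponential: exponential magnitude with a random sign."""
    magnitude = rng.expovariate(1.0 / scale)  # expovariate takes a rate
    return magnitude if rng.random() < 0.5 else -magnitude


def grouped_laplacian(rng, n_groups, group_size, scale):
    """One draw per group, repeated for every coefficient in that group."""
    beta = []
    for _ in range(n_groups):
        value = laplace(rng, scale)
        beta.extend([value] * group_size)
    return beta


def individual_laplacian(rng, p, scale):
    """An independent draw for each of the p coefficients."""
    return [laplace(rng, scale) for _ in range(p)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
beta_grouped = grouped_laplacian(rng, n_groups=25, group_size=10, scale=1.0)
beta_individual = individual_laplacian(rng, p=250, scale=1.0)
print(len(beta_grouped), len(set(beta_grouped)))
print(len(beta_individual), len(set(beta_individual)))
```

In the grouped scheme the true model respects the group structure exactly, while the individual scheme gives grouping nothing to exploit, so comparing the two separates the benefit of correct side information from the cost of imposing structure that is absent.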
Grouping examples, Case 2: Comparison of model errors [Figure: model error under AICc selection]
Grouping examples, Case 2: Comparison of “group sparsity” [Figure: number of selected groups under AICc selection]
Grouping examples, Case 2: Comparison of degrees of freedom [Figure: degrees of freedom under AICc selection]
Grouping 2: 250 predictors, 25 groups, Ind. Exp.: Paired t-test statistics [Table: pairwise comparisons among the 0.5K, 1.0K, and 1.5K clusterings; values not recoverable]
ANOVA Hierarchical Selection: Simulation Setup • 55 variables (10 main effects, 45 interactions) • 121 observations • 200 replications in results that follow
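The term counts follow from taking all pairs of the 10 main effects. The sketch below enumerates the terms and builds one plausible set of overlapping groups in which an interaction can enter only after its main effects; whether the simulation used exactly this grouping is an assumption, as are the variable names.

```python
from itertools import combinations

n_factors = 10
main_effects = [(i,) for i in range(n_factors)]
interactions = list(combinations(range(n_factors), 2))  # all unordered pairs
terms = main_effects + interactions

# Assumed grouping: a main effect's group holds the main effect and every
# interaction it takes part in; each interaction also forms its own group.
groups = {}
for term in terms:
    if len(term) == 1:
        groups[term] = [term] + [t for t in interactions if term[0] in t]
    else:
        groups[term] = [term]

print(len(main_effects), len(interactions), len(terms))
print(len(groups[(0,)]))  # size of the group rooted at the first main effect
```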
ANOVA Hierarchical Selection: Number of Terms in Complete Graph
Wavelet Tree Hierarchical Selection: Simulation Setup • 15 basis functions • 80 observations, 5 in each of the 16 time slots • 5-fold cross-validation (“balanced”)
CAP: Group and Hierarchical Sparsity • CAP penalties: • Are built from $L_\gamma$ “blocks” • Allow incorporation of different structures into the fitted model: • Group of variables • Hierarchy among predictors • Algorithms: • Approximation using BLASSO for general CAP penalties • Exact and efficient for particular cases (L2 loss, $L_1$ and $L_\infty$ norms) • Choice of regularization parameter $\lambda$: • Cross-validation • AICc for particular cases (L2 loss, $L_1$ and $L_\infty$ norms)
The Road Ahead • Extension of algorithms: • GLMs: Park and Hastie (2006)’s algorithm for iCAP • Boosted version of algorithms: • Steps in “groups” or “hierarchical” directions • Model Selection Consistency for CAP ($\gamma_0 = 1$): • Can CAP select groups consistently? • Is an irrepresentable condition at the “group” level sufficient/necessary? • More general CAP penalties: • The “Rigid Net”: • Two groups containing all variables, $\gamma_0 = 1$, $\gamma_1 = 1$, $\gamma_2 = \infty$ • One example: $\gamma_0 = \infty$, $\gamma_k = 1$: • Sparsity within groups? • Similarity across groups?
Codes: www.stat.berkeley.edu/~yugroup. Paper: www.stat.berkeley.edu/~binyu. Funding acknowledgements: NSF, ARO, Guggenheim Foundation.