Prediction using Side Information Bin Yu Department of Statistics, UC Berkeley Joint work with Peng Zhao, Guilherme Rocha, and Vince Vu
Outline • Motivation • Background • Penalization methods (building in side information through penalty) • L1-penalty (sparsity as the side information) • Group and hierarchy as side information: Composite Absolute Penalty (CAP) • Building blocks – Lγ-norm regularization • Definition • Interpretation • Algorithms • Examples and results • Unlabeled data as side information: semi-supervised learning • Motivating example: image-fMRI problem in neuroscience • Penalty based on population covariance matrix • Theoretical result to compare with OLS • Experimental results on image-fMRI data
Characteristics of Modern Data Set Problems • Goal: efficient use of data for: • Prediction • Interpretation • Larger number of variables: • Number of variables (p) in data sets is large • Sample sizes (n) have not increased at same pace • Scientific opportunities: • New findings in different scientific fields
Regression and classification Data example: image-fMRI problem • Predictor: 11,000 features of an image • Response: (preprocessed) fMRI signal at a voxel • n = 1750 samples Minimization of an empirical loss (e.g. L2) leads to • an ill-posed computational problem, and • bad prediction
Regularization improves prediction • Penalization -- linked to computation L2 (numerical stability: ridge; SVM) Model selection (sparsity, combinatorial search) L1 (sparsity, convex optimization) • Early stopping: tuning parameter is computational Neural nets Boosting • Hierarchical modeling (computational considerations)
Lasso: L1-norm as a penalty • The L1 penalty is defined for coefficients β as $T(\beta) = \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ • Used initially with L2 loss: • Signal processing: Basis Pursuit (Chen & Donoho, 1994) • Statistics: Non-Negative Garrote (Breiman, 1995) • Statistics: LASSO (Tibshirani, 1996) • Properties of Lasso • Sparsity (variable selection) • Convexity (convex relaxation of L0-penalty)
Lasso: L1-norm as a penalty Computation: the “right” tuning parameter λ is unknown, so a “path” of solutions is needed (discretized or continuous) • Initially: a quadratic program (QP) is solved for each value of λ on a grid • Later: path-following algorithms: homotopy by Osborne et al. (2000), LARS by Efron et al. (2004) Theoretical studies: much work recently on Lasso …
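The "discretized path" idea above can be sketched as follows. This is a minimal illustration, not the QP, homotopy, or LARS algorithms cited on the slide: it swaps in cyclic coordinate descent with warm starts, assumes the objective is scaled as (1/2n)||y − Xβ||² + λ||β||₁, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta0=None, n_iter=500):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumption: this scaling of the loss is chosen for the sketch only.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_sq = (X ** 2).sum(axis=0) / n        # per-column curvature
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (j removed)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep residual in sync
            beta[j] = new
    return beta

def lasso_path(X, y, lambdas):
    """Solve on a grid from large to small lambda, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam, beta0=beta)
        path.append((lam, beta.copy()))
    return path
```

Starting from the largest λ, where the solution is all zeros, and warm-starting each subsequent fit keeps the grid search cheap; path-following methods such as LARS avoid the grid altogether.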
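General Penalization Methods • Given data $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$: • $X_i$: a p-dimensional predictor • $Y_i$: response variable • The parameters are defined by the penalized problem: $\hat\beta(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda\, T(\beta) \right]$ • where • $L(Z, \beta)$ is the empirical loss function • $T(\beta)$ is a penalty function • $\lambda \ge 0$ is a tuning parameter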
Beyond Sparsity of Individual Predictors: Natural Structures among Predictors Rationale: side information might be available and/or additional regularization is needed beyond Lasso for p >> n • Groups: • Genes belonging to the same pathway; • Categorical variables represented by “dummies”; • Polynomial terms from the same variable; • Noisy measurements of the same variable. • Hierarchy: • Multi-resolution/wavelet models; • Interaction terms in factorial analysis (ANOVA); • Order selection in Markov chain models.
Composite Absolute Penalties (CAP): Overview • The CAP family of penalties: • Highly customizable: • ability to perform grouped selection • ability to perform hierarchical selection • Computational considerations: • Feasibility: convexity • Efficiency: piecewise linearity in some cases • Define groups according to structure • Combine properties of Lγ-norm penalties • Encompass and go beyond existing works: • Elastic Net (Zou & Hastie, 2005) • GLASSO (Yuan & Lin, 2006) • Blockwise Sparse Regression (Kim, Kim & Kim, 2006)
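Composite Absolute Penalties: Review of Lγ Regularization Given data Z and loss function $L(Z, \beta)$: • Lγ regularization: • Penalty: $T(\beta) = \|\beta\|_\gamma^\gamma = \sum_{j=1}^{p} |\beta_j|^\gamma$ • Estimate: $\hat\beta_\gamma(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda \|\beta\|_\gamma^\gamma \right]$ • where λ > 0 is a tuning parameter • For the squared error loss function: • Hoerl & Kennard (1970): Ridge (γ = 2) • Frank & Friedman (1993): Bridge (general γ) • LASSO (1996): (γ = 1) • SCAD (Fan and Li, 1999): (γ < 1)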
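Composite Absolute Penalties: Definition • The CAP parameter estimate is given by: $\hat\beta(\lambda) = \arg\min_{\beta} \left[ L(Z, \beta) + \lambda\, T(\beta) \right]$ • $G_k$, k = 1, …, K: indices of the k-th pre-defined group • $\beta_{G_k}$: corresponding vector of coefficients • $\|\cdot\|_{\gamma_k}$: group $L_{\gamma_k}$ norm, $N_k = \|\beta_{G_k}\|_{\gamma_k}$ • $\|\cdot\|_{\gamma_0}$: overall norm, $T(\beta) = \|N\|_{\gamma_0}$ with $N = (N_1, \dots, N_K)$ • Groups may overlap (hierarchical selection)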
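To make the two-level construction concrete, here is a minimal sketch that evaluates the penalty as stated on this slide (group norms first, then an overall norm over the vector of group norms). The function name `cap_penalty` and its argument layout are hypothetical, chosen only for illustration.

```python
import numpy as np

def cap_penalty(beta, groups, gammas, gamma0=1.0):
    """Evaluate T(beta) = || (N_1, ..., N_K) ||_{gamma0},
    where N_k is the L_{gamma_k} norm of the coefficients in group k.

    groups : list of index lists (groups may overlap)
    gammas : one norm order per group (np.inf is allowed)
    """
    beta = np.asarray(beta, dtype=float)
    group_norms = np.array([
        np.linalg.norm(beta[list(g)], ord=gk)
        for g, gk in zip(groups, gammas)
    ])
    return np.linalg.norm(group_norms, ord=gamma0)

# Non-overlapping groups with L2 group norms and gamma0 = 1
# (the grouped-Lasso-style special case).
t_group = cap_penalty([3.0, 4.0, 0.0], [[0, 1], [2]], [2, 2])

# Overlapping groups with L-infinity group norms (a hierarchical layout).
t_hier = cap_penalty([1.0, -2.0], [[0, 1], [1]], [np.inf, np.inf])
```

With singleton groups and γ0 = 1 the function reduces to the ordinary L1 norm, which is one way to see the Lasso as a member of the family.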
Composite Absolute Penalties: A Bayesian interpretation • For non-overlapping groups: • Prior on group norms: • Prior on individual coefficients:
Composite Absolute Penalties: Group selection [Figure: contour plot for γ0 = 1, γ1 = 2, γ2 = 2] • Tailoring T(β) for group selection: • Define non-overlapping groups • Set γk > 1 for all k ≠ 0: • Group norm γk tunes similarity within its group • γk > 1 causes all variables in group k to be included/excluded together • Set γ0 = 1: • This yields grouped sparsity • γk = 2 has been studied by Yuan and Lin (Grouped Lasso, 2006).
Composite Absolute Penalties: Hierarchical Structures • Tailoring T(β) for hierarchical structure: • Set γ0 = 1 • Set γi > 1 for all i ≠ 0 • Groups overlap: • If β2 appears in all groups where β1 is included • Then X2 enters the model after X1 • As an example:
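A two-predictor illustration of the overlap rule, written out under the assumption that the groups are G1 = {1, 2} and G2 = {2} with γ0 = 1 (this specific layout is chosen for illustration and may differ from the example on the original slide):

```latex
% Assumed groups: G_1 = \{1, 2\}, G_2 = \{2\}; overall norm \gamma_0 = 1.
T(\beta) = \|(\beta_1, \beta_2)\|_{\gamma_1} + |\beta_2|

% Cost of activating each coefficient alone:
%   only \beta_1 nonzero:  T = |\beta_1|
%   only \beta_2 nonzero:  T = |\beta_2| + |\beta_2| = 2|\beta_2|
% For \gamma_1 = \infty:   T = \max(|\beta_1|, |\beta_2|) + |\beta_2|
```

Because β2 is charged in every group that contains β1 and once more on its own, moving β2 away from zero is always at least as expensive as moving β1, which is why X1 tends to enter the model first.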
Composite Absolute Penalties: Hierarchical Structures • Represent hierarchy by a directed graph: [Figure: directed graph with nodes X1 and X2] • Then construct penalty by: • For graph above, γ0 = 1, γr = :
Composite Absolute Penalties: Computation • CAP with general Lγ norms • Approximate algorithms available for tracing the regularization path • Two examples: • Rosset (2004) • Boosted Lasso (Zhao and Yu, 2004): BLASSO • CAP with L1–L∞ norms • Exact algorithms for tracing the regularization path • Some applications: • Grouped selection: iCAP • Hierarchical selection: hiCAP for ANOVA and wavelets
iCAP: Degrees of Freedom (DFs) for tuning parameter selection Two ways of selecting the tuning parameter in iCAP: 1. Cross-validation 2. Model selection criterion AICc, where the DF used is a generalization of the df of Zou et al. (2004) for Lasso to iCAP.
Simulation Studies (p > n) (partially adaptive grouping): Summary of Results • Good prediction accuracy • Extra structure results in non-trivial reduction of model error • Sparsity/parsimony • Less sparse models in the L0 sense • Sparser in terms of degrees of freedom • Estimated degrees of freedom (Group, iCAP only) • Good choices for regularization parameter • AICc: model errors close to CV
ANOVA Hierarchical Selection: Simulation Setup • 55 variables (10 main effects, 45 interactions) • 121 observations • 200 replications in results that follow
Summary on CAP: Group and Hierarchical Sparsity • CAP penalties: • Are built from Lγ-norm “blocks” • Allow incorporation of different structures into the fitted model: • Groups of variables • Hierarchy among predictors • Algorithms: • Approximation using BLASSO for general CAP penalties • Exact and efficient for particular cases (L2 loss, L1 and L∞ norms) • Choice of regularization parameter λ: • Cross-validation • AICc for particular cases (L2 loss, L1 and L∞ norms)
Regularization using unlabeled data: semi-supervised learning Motivating example: image-fMRI problem in neuroscience (Gallant Lab at UCB) Goal: to understand how natural images relate to fMRI signals
Stimuli Natural image stimuli
Stimulus to response • Natural image stimuli drawn randomly from a database of 11,499 images • Experiment designed so that responses from different presentations are nearly independent • Response is pre-processed and roughly Gaussian
Linear model Separate linear model for each voxel: Y = Xβ + ε Model fitting • X: p = 10,921 dimensions (features) • n = 1750 training samples Fitted model tested on 120 validation samples • Performance measured by correlation
Ordinary Least Squares (OLS) Minimize empirical squared error risk: $\hat\beta_{OLS} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|^2 = \hat\Sigma_{xx}^{-1}\hat\Sigma_{xy}$ Notice that the OLS estimate is a function of the estimated covariance of X ($\hat\Sigma_{xx}$) and the estimated covariance of X with Y ($\hat\Sigma_{xy}$)
OLS The sample covariance matrix of X is often nearly singular, so inversion is ill-posed. Some existing solutions • Ridge regression • Pseudo-inverse (or truncated SVD) • Lasso (closely related to L2 boosting -- current method at Gallant Lab)
Semi-supervised Abundant unlabeled data available • samples from the marginal distribution of X • Book on “semi-supervised learning” (2006) (eds. Chapelle, Schölkopf, and Zien) • Statistical Science article (2007) (Liang, Mukherjee and West) Image-fMRI: images in the database are unlabeled data Semi-supervised linear regression • Use • labeled data (Xi, Yi), i = 1, …, n, and • unlabeled data Xi, i = n+1, …, n+m, to fit
Semi-supervised Does the marginal distribution of X play a role? • For fixed design X, the marginal distribution of X plays no role. • Brown (1990) shows that the OLS estimate of the intercept is inadmissible if X is assumed random.
Refining OLS The unknown parameter β satisfies $\Sigma_{xx}\beta = \Sigma_{xy}$, so OLS can be seen as a plug-in estimate for this equation. Can we plug in an improved estimate of Σxx?
A first approach Suppose population covariance of X is known • (infinite amount of unlabeled data) Use a linear combination of the sample and population covariances. (Ledoit and Wolf 2004) considered convex combinations of sample covariance and another matrix from a parametric model
Semi-supervised OLS Plugging in the improved estimate of Σxx, we get “semi-OLS”:
Semi-supervised OLS Equivalent to penalized least squares Equivalent to ridge regression in pre-whitened covariates
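The two equivalences can be checked in a few lines. The derivation below is a reconstruction under stated assumptions, not the slide's own formula: Σ denotes the known population covariance of X, the improved covariance estimate is taken to be $\hat\Sigma_{xx} + \alpha\Sigma$ with a weight α ≥ 0 (one particular parametrization of the linear combination), and $\hat\Sigma_{xx} = X^T X / n$, $\hat\Sigma_{xy} = X^T Y / n$.

```latex
% Assumed form of semi-OLS (plug-in with the combined covariance estimate):
\hat\beta_\alpha = (\hat\Sigma_{xx} + \alpha\Sigma)^{-1}\hat\Sigma_{xy}

% 1. Penalized least squares: setting the gradient of
%    \tfrac{1}{n}\|Y - X\beta\|^2 + \alpha\,\beta^T\Sigma\beta  to zero gives
-\tfrac{2}{n}X^T(Y - X\beta) + 2\alpha\Sigma\beta = 0
\;\Longleftrightarrow\;
(\hat\Sigma_{xx} + \alpha\Sigma)\,\beta = \hat\Sigma_{xy}

% 2. Ridge in pre-whitened covariates: with W = X\Sigma^{-1/2} and
%    \theta = \Sigma^{1/2}\beta, the same objective becomes
\tfrac{1}{n}\|Y - W\theta\|^2 + \alpha\|\theta\|^2,
\qquad
\hat\theta_\alpha = \left(\tfrac{1}{n}W^T W + \alpha I\right)^{-1}\tfrac{1}{n}W^T Y,
\qquad
\hat\beta_\alpha = \Sigma^{-1/2}\hat\theta_\alpha
```

The penalty $\beta^T\Sigma\beta$ is the population variance of the linear predictor $X\beta$, so the unlabeled data enter only through the shape of the ridge-type penalty.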
Spectrally semi-supervised OLS Ridge regression in (W,Y) is just a transformation of Λ, where W has spectral decomposition: More generally, can consider arbitrary transformations of the spectrum of W Resulting estimator
Spectrally semi-supervised OLS Examples: • OLS h(s) = 1/s • Semi-OLS = Ridge on pre-whitened predictors: h(s) = 1/(s+α) • Truncated SVD on pre-whitened predictors (PCA reg): h(s) = 1/s if s>c, otherwise 0
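The three examples differ only in the function h applied to the spectrum of the pre-whitened sample covariance. A minimal sketch, assuming the population covariance `Sigma` is known and positive definite, and that the estimator takes the form $\Sigma^{-1/2}\, U\, h(\Lambda)\, U^T\, W^T Y / n$ with $W^T W / n = U \Lambda U^T$; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def spectral_semi_ols(X, y, Sigma, h):
    """Spectrally transformed estimator on pre-whitened predictors.

    X     : (n, p) design matrix
    y     : (n,) response
    Sigma : (p, p) population covariance of X (assumed known)
    h     : vectorized function applied to the eigenvalues
    """
    n = X.shape[0]
    # Sigma^{-1/2} via the eigendecomposition of the symmetric matrix Sigma
    evals, evecs = np.linalg.eigh(Sigma)
    sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    W = X @ sigma_inv_half                       # pre-whitened predictors
    lam, U = np.linalg.eigh(W.T @ W / n)         # spectrum of sample covariance
    lam = np.clip(lam, 0.0, None)                # guard against tiny negatives
    theta = U @ (h(lam) * (U.T @ (W.T @ y / n)))
    return sigma_inv_half @ theta                # map back to original scale

def h_ols(s):
    return 1.0 / s

def h_ridge(alpha):
    return lambda s: 1.0 / (s + alpha)

def h_tsvd(c):
    def h(s):
        out = np.zeros_like(s)
        keep = s > c
        out[keep] = 1.0 / s[keep]                # invert only the large eigenvalues
        return out
    return h
```

For example, `spectral_semi_ols(X, y, Sigma, h_ridge(0.5))` gives the semi-OLS fit with α = 0.5, and passing `h_ols` recovers ordinary least squares whenever the sample covariance is invertible.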
Large n,p asymptotic MSPE Assumptions • Σ non-degenerate • $Z = X\Sigma^{-1/2}$ is n-by-p with IID entries satisfying: • mean 0, variance 1 • finite 4th moment • h is a bounded function • $\beta^T \Sigma_{xx} \beta / \sigma^2$ has finite limit SNR² as p, n tend to ∞ • p/n has finite, strictly positive limit r
Large n,p MSPE Theorem The Mean Squared Prediction Error satisfies where Fr is the Marchenko-Pastur law with index r and
Consequences • Asymptotically optimal h • Asymptotically better than OLS and truncated SVD • Reminiscent of shrinkage factor in James-Stein estimate • SNR might be easily estimated
Back to image-fMRI problem Fitting details: • Regularization parameter selected by 5-fold cross-validation • L2 boosting applied to all 10,000+ features (L2 boosting is the method of choice in Gallant Lab) • Other methods applied to 500 features pre-selected by correlation
Other methods k = 1: semi OLS (theoretically better than OLS) k = 0: ridge k = -1: semi OLS (inverse)
Features used by L2 boosting [Figure: features used by L2 boosting]
Comparison of the feature locations [Figure panels: semi methods; L2 boosting]
Further work Image-fMRI problem based on a linear model: • Compare methods for other voxels • Use fewer features for semi-methods? (average number of features for L2 boosting = 120; number of features for semi-methods = 500, by design) • Interpretation of the results of different methods • Theoretical results for ridge and semi inverse OLS? Image-fMRI problem, non-linear modeling: • Understanding the image space (clusters? manifolds?) • Different linear models on different clusters (manifolds)? • Non-linear models on different clusters (manifolds)? • …
CAP Codes: www.stat.berkeley.edu/~yugroup Paper: www.stat.berkeley.edu/~binyu to appear in Annals of Statistics Thanks: Gallant Lab at UC Berkeley
Proof Ingredients Can show that MSPE decomposes as: Results in random matrix theory can be applied: • BIAS term is a quadratic form in sample covariance matrix • VARIANCE term is an integral wrt empirical spectral distribution of sample covariance matrix