Seeking Interpretable Models for High Dimensional Data. Bin Yu, Statistics Department and EECS Department, University of California, Berkeley. http://www.stat.berkeley.edu/~binyu
Characteristics of Modern Data Problems • Goal: efficient use of data for: • Prediction • Interpretation (proxy: sparsity) • Larger number of variables: • Number of variables (p) in data sets is large • Sample sizes (n) have not increased at same pace • Scientific opportunities: • New findings in different scientific fields
Today’s Talk • Understanding early visual cortex area V1 through fMRI • Occam’s Razor • Lasso: linear regression and Gaussian graphical models • Learning about V1 through sparse models (pre-selection via corr, linear, nonlinear, graphical) • Future work
Understanding the visual pathway The Gallant Lab at UC Berkeley is a leading vision lab. [Figures: Da Vinci (1452-1519); mapping of different visual cortex areas, Polyak (1957). The small grey area at the middle left is V1.]
Understanding visual pathway through fMRI One goal at Gallant Lab: understand how natural images relate to fMRI signals
Gallant Lab in Nature News Published online 5 March 2008 | Nature | doi:10.1038/news.2008.650 Mind-reading with a brain scan Brain activity can be decoded using magnetic resonance imaging. Kerri Smith Scientists have developed a way of ‘decoding’ someone’s brain activity to determine what they are looking at. “The problem is analogous to the classic ‘pick a card, any card’ magic trick,” says Jack Gallant, a neuroscientist at the University of California in Berkeley, who led the study.
Stimuli Natural image stimuli
Stimulus to fMRI response Natural image stimuli drawn randomly from a database of 11,499 images The experiment is designed so that responses from different presentations are nearly independent The fMRI response is pre-processed and roughly Gaussian
“Neural” (fMRI) encoding for visual cortex V1 • Predictor: p = 10,921 features of an image • Response: (preprocessed) fMRI signal at a voxel • n = 1750 samples • Goal: understanding the human visual system; an interpretable (sparse) model is desired, and good prediction is necessary • With p much larger than n, minimizing an empirical loss (e.g. L2) alone leads to • an ill-posed computational problem, and • bad prediction
Linear Encoding Model by Gallant Lab Data • X: p=10921 dimensions (features) • Y: fMRI signal • n = 1750 training samples Separate linear model for each voxel via e-L2boosting (or Lasso) Fitted model tested on 120 validation samples • Performance measured by correlation
Modeling “history” at Gallant Lab • Prediction on validation set is the benchmark • Methods tried: neural nets, SVMs, e-L2boosting (Lasso) • Among models with similar predictions, simpler (sparser) models by e-L2boosting are preferred for interpretation This practice reflects a general trend in statistical machine learning -- moving from prediction to simpler/sparser models for interpretation, faster computation or data transmission.
Occam’s Razor 14th-century English logician and Franciscan friar, William of Ockham Principle of Parsimony: Entities must not be multiplied beyond necessity. (Wikipedia)
Occam’s Razor via Model Selection in Linear Regression • Maximum likelihood (ML) is least squares (LS) under the Gaussian assumption • There are 2^p submodels • ML goes for the largest submodel, with all predictors • The largest model often gives bad prediction when p is large
Model Selection Criteria Akaike (1973, 1974) and Mallows’ Cp used estimated prediction error to choose a model: AIC/Cp select the submodel minimizing RSS + 2·k·σ̂², where k is the number of parameters in the submodel. Schwarz (1978): BIC selects the submodel minimizing RSS + (log n)·k·σ̂². Both are penalized LS, with penalty proportional to the model dimension k. Rissanen’s Minimum Description Length (MDL) principle gives rise to many different criteria; the two-part code leads to BIC.
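To make the criteria concrete, here is a minimal sketch (not from the original slides) that scores candidate OLS submodels with the unknown-variance (Gaussian log-likelihood) forms of AIC and BIC; the function name aic_bic and the synthetic data are illustrative.

```python
# Minimal sketch: scoring linear submodels with AIC and BIC.
# Assumes Gaussian errors; names and data are illustrative.
import numpy as np

def aic_bic(X, y, support):
    """Fit OLS on the columns in `support` and return (AIC, BIC)."""
    n = len(y)
    Xs = X[:, support]
    beta, _, _, _ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = len(support)
    # Gaussian log-likelihood up to constants: -(n/2) * log(rss / n)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

# Example: compare two candidate submodels on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
print(aic_bic(X, y, [0, 1]))        # true support
print(aic_bic(X, y, [0, 1, 2, 3]))  # over-specified model
```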
Model Selection for the image-fMRI problem For the linear encoding model, the number of submodels is 2^p = 2^10921. Combinatorial search: too expensive and often not necessary. A recent alternative: continuous embedding into a convex optimization problem through L1-penalized LS (Lasso), a third-generation computational method in statistics and machine learning.
Lasso: L1-norm as a penalty • The L1 penalty on the coefficients β is ||β||_1 = Σ_j |β_j| • Used initially with the L2 loss: min_β ||Y − Xβ||² + λ ||β||_1 • Signal processing: Basis Pursuit (Chen & Donoho, 1994) • Statistics: Non-Negative Garrote (Breiman, 1995) • Statistics: LASSO (Tibshirani, 1996) • Properties of Lasso • Sparsity (variable selection) and regularization • Convexity (convex relaxation of the L0-penalty)
Lasso: computation and evaluation The “right” tuning parameter λ is unknown, so the whole path over λ is needed (discretized or continuous). Initially: a quadratic program (QP) solved for each λ on a grid. Later: path-following algorithms such as the homotopy method of Osborne et al (2000) and LARS by Efron et al (2004). Theoretical studies: much recent work on the Lasso in terms of L2 prediction error, L2 error of the parameter estimate, and model selection consistency.
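As an illustration of computing the path and choosing the tuning parameter, the sketch below uses scikit-learn's coordinate-descent solvers rather than the homotopy/LARS algorithms cited above; the synthetic data and settings are assumptions.

```python
# Sketch: a Lasso path over a grid of lambda, and a cross-validated choice
# of the tuning parameter, using scikit-learn (coordinate descent).
import numpy as np
from sklearn.linear_model import lasso_path, LassoCV

rng = np.random.default_rng(0)
n, p, s = 200, 50, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = [3, -2, 1.5, 1, -1]
y = X @ beta + rng.normal(size=n)

# Entire regularization path over a grid of lambda (called alpha here).
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Cross-validated choice of the tuning parameter.
cv_fit = LassoCV(cv=5).fit(X, y)
print("chosen alpha:", cv_fit.alpha_)
print("selected variables:", np.flatnonzero(cv_fit.coef_))
```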
Model Selection Consistency of Lasso Set-up: linear regression model Y = Xβ + ε, with n observations and p predictors. Assume (A): XᵀX/n converges to a positive definite matrix C (plus standard conditions on the noise). Knight and Fu (2000) showed L2 estimation consistency under (A).
Model Selection Consistency of Lasso • p small, n large (Zhao and Yu, 2006): assume (A), and write the true coefficients as β = (β_1, 0) with β_1 the s nonzero entries. Then, roughly*, the Lasso is model selection consistent if and only if the irrepresentable condition holds: |C_21 C_11^{-1} sign(β_1)| ≤ 1 elementwise (a vector of length p − s), where C_11 and C_21 are the blocks of C corresponding to the s relevant and the p − s irrelevant predictors. * Some ambiguity when equality holds. • Related work: Tropp (06), Meinshausen and Buhlmann (06), Zou (06), Wainwright (06) • This is a population version of model selection consistency.
Irrepresentable condition (s = 2, p = 3): geometry. [Figure: geometry of the condition for correlation r = 0.4 and r = 0.6.]
Consistency of Lasso for Model Selection • Interpretation of the condition: regress the irrelevant predictors on the relevant predictors. If the L1 norm of the resulting regression coefficients (*) is • larger than 1, Lasso cannot distinguish the irrelevant predictor from the relevant predictors for some parameter values; • smaller than 1, Lasso can distinguish the irrelevant predictor from the relevant predictors. • Sufficient conditions (verifiable) • Constant correlation • Power-decay correlation • Bounded correlation*
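A small numerical check of the irrepresentable condition, assuming a known population covariance C, true support, and sign pattern; the constant-correlation design and the values r = 0.4 and r = 0.6 are used purely for illustration.

```python
# Sketch: checking the irrepresentable condition for a given population
# covariance C, true support, and sign pattern of the true coefficients.
import numpy as np

def irrepresentable_ok(C, support, signs, eta=0.0):
    """True if |C_21 C_11^{-1} sign(beta_1)| <= 1 - eta elementwise."""
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(C.shape[0]), S)
    C11 = C[np.ix_(S, S)]
    C21 = C[np.ix_(Sc, S)]
    vals = np.abs(C21 @ np.linalg.solve(C11, signs))
    return bool(np.all(vals <= 1 - eta)), vals

# Example: constant-correlation design with s = 2, p = 3.
for r in (0.4, 0.6):
    C = (1 - r) * np.eye(3) + r * np.ones((3, 3))
    ok, vals = irrepresentable_ok(C, support=[0, 1], signs=np.array([1.0, 1.0]))
    print(f"r = {r}: condition values {vals}, satisfied: {ok}")
```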
Model Selection Consistency of Lasso (p >> n) • Consistency also holds when s and p grow with n, assuming • bounds on the max and min eigenvalues of the design matrix, • the smallest nonzero coefficient bounded away from zero, • the irrepresentable condition, and • appropriate scaling of (n, p, s): for Gaussian noise (Wainwright, 06) and for noise with finite 2k-th moments (Zhao & Yu, 06).
Gaussian Graphical Model • The Gaussian inverse covariance (precision matrix) characterizes conditional independencies, and hence the graphical model: a zero entry (i, j) in the inverse covariance means variables i and j are conditionally independent given all the others. [Figure: an example graph and its inverse covariance.]
L1-penalized log Gaussian likelihood Given n iid observations of X ~ N(0, Σ), Yuan and Lin (06) proposed estimating the inverse covariance by minimizing, over positive definite Θ, tr(SΘ) − log det(Θ) + λ Σ_{i≠j} |Θ_ij|, where S is the sample covariance. Banerjee et al (08), Friedman et al (07): fast algorithms based on block descent. Ravikumar, Wainwright, Raskutti, Yu (2008): sufficient conditions for model selection consistency (p >> n) for sub-Gaussian and also heavy-tailed distributions.
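A brief sketch of fitting this penalized likelihood with scikit-learn's GraphicalLasso (one of several available block-descent implementations); the chain-graph precision matrix and penalty level are illustrative.

```python
# Sketch: L1-penalized Gaussian likelihood (graphical lasso) via scikit-learn.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# A sparse precision matrix on 5 variables (a chain graph).
Theta = np.eye(5)
for i in range(4):
    Theta[i, i + 1] = Theta[i + 1, i] = 0.4
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(5), Sigma, size=500)

model = GraphicalLasso(alpha=0.05).fit(X)
# Zero pattern of the estimated precision matrix ~ estimated graph.
print(np.round(model.precision_, 2))
```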
Success probability’s dependence on n and p (Gaussian) [Figure: probability of correct model selection; each point is an average over 100 trials.] • Curves for different p stack up in the second plot, so that n/log p controls model selection.
Success probability’s dependence on “model complexity” K and n Chain graph with p = 120 nodes. • Curves from left to right have increasing values of K. • Models with larger K thus require more samples n for the same probability of success.
Back to image-fMRI problem: Linear sparse encoding model on complex “cells” Gallant Lab’s approach: Separate linear model for each voxel Y = Xb + e Model fitting via e-L2boosting and stopping by CV • X: p=10921 dimensions (features or complex “cells”) • n = 1750 training samples Fitted model tested on 120 validation samples (not used in fitting) Performance measured by correlation (cc)
Our story on the image-fMRI problem Episode 1: linear model, pre-selection by correlation, localization Fitting details: • Regularization parameter selected by 5-fold cross-validation • L2 boosting applied to all 10,000+ features (L2 boosting is the method of choice in the Gallant Lab; a sketch is given below) • Other methods applied to 500 features pre-selected by correlation
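For concreteness, here is a minimal sketch of componentwise epsilon-L2 boosting (forward stagewise with a small step): at each step, nudge the coefficient of the feature most correlated with the current residual. The step size, number of steps, and synthetic data are assumptions; the Gallant Lab's actual implementation and its CV-based stopping may differ.

```python
# Minimal sketch of componentwise epsilon-L2 boosting (forward stagewise).
import numpy as np

def e_l2_boost(X, y, eps=0.01, n_steps=2000):
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    for _ in range(n_steps):
        corr = X.T @ resid                    # correlations with the residual
        j = np.argmax(np.abs(corr))           # most correlated feature
        step = eps * np.sign(corr[j])
        beta[j] += step
        resid -= step * X[:, j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
X /= np.linalg.norm(X, axis=0)               # standardize columns
beta_true = np.zeros(100)
beta_true[:3] = [2, -1.5, 1]
y = X @ beta_true + 0.1 * rng.normal(size=200)
beta_hat = e_l2_boost(X, y)
print("largest coefficients at:", np.argsort(-np.abs(beta_hat))[:5])
```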
Other methods (applied to the 500 pre-selected features) • k = 1: semi-OLS (theoretically better than OLS) • k = 0: ridge • k = −1: semi-OLS (inverse) • k = 1, −1 are semi-supervised because they use the population covariance of the features, which is estimated from the whole image database.
Comparison of the feature locations [Figure: locations of features selected by the semi methods vs. L2 boosting.]
Our Story (cont.) Episode 2: nonlinearity via iV-SpAM (machine learning stage) Additive models (Hastie and Tibshirani, 1990): Y = f_1(X_1) + … + f_p(X_p) + ε • Sparse: f_j ≡ 0 for most j • High dimensional: p >>> n SpAM (Sparse Additive Models) by Ravikumar, Lafferty, Liu, Wasserman (2007) Related work: COSSO, Lin and Zhang (2006)
SpAM V1 encoding model (1331 voxels from V1) (Ravikumar, Vu, Gallant Lab, B. Yu, NIPS 08) For each voxel: • Start with the 10,921+ complex cells (features) • Pre-select 100 complex cells via correlation • Fit SpAM to the 100 complex cells, with AIC stopping • Pool the SpAM-fitted complex cells according to location and frequency to form pooled complex cells • Fit SpAM to the 100 complex cells and the pooled complex cells, with AIC stopping
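The sketch below illustrates the SpAM backfitting idea (smooth each partial residual, then soft-threshold the fitted component function by its empirical norm). The kernel smoother, bandwidth, penalty level, and fixed iteration count here are simplifications for illustration, not the exact procedure or AIC stopping used in the paper.

```python
# Minimal sketch of SpAM backfitting with a simple kernel smoother.
import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    """Nadaraya-Watson smoother of residual r against feature x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def spam_backfit(X, y, lam=0.1, n_iter=20, bandwidth=0.3):
    n, p = X.shape
    y = y - y.mean()                           # center the response
    F = np.zeros((n, p))                       # fitted component functions
    for _ in range(n_iter):
        for j in range(p):
            resid = y - F.sum(axis=1) + F[:, j]            # partial residual
            Pj = kernel_smooth(X[:, j], resid, bandwidth)  # smoothed fit
            sj = np.sqrt(np.mean(Pj ** 2))                 # empirical norm
            Fj = max(0.0, 1 - lam / sj) * Pj if sj > 0 else 0 * Pj
            F[:, j] = Fj - Fj.mean()                       # center component
    return F

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(150, 10))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=150)
F = spam_backfit(X, y)
print("component norms:", np.round(np.sqrt(np.mean(F ** 2, axis=0)), 2))
```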
Prediction performance (R²) [Figure: SpAM vs. linear prediction performance; in the inset region, the median improvement is 12%.]
Nonlinearities • Compressive effect (finite energy supply) • A common nonlinearity for each voxel?
Identical Nonlinearity Identical V-SpAM model (iV-SpAM) (Ravikumar et al, 08): the additive components share an identical nonlinearity, and the remaining voxel-specific departure from that shared nonlinearity is small or sparse.
Identical nonlinearity vs. linearity: R² prediction. Median improvement 16%.
Episode 3: nonlinearity via power transformations (classical stage) Power transformation of features. Plotting a predictor vs. the response: the feature is skewed, while the response is roughly Gaussian. After transformation (4th root): better correlation! [Figure: scatterplots of response vs. feature before and after the transformation.]
Make it Gaussian [Figure: histograms of Gabor wavelet features by spatial frequency (1, 2, 4, 8, 16, 32) under four transformations: none, square root, 4th root, log.]
Gaussianization of features improves prediction Gaussianized features used to train (using L2 epsilon-boosting): • F_T = F^(1/2) if F is in levels 2, 3 (4 × 4 and 8 × 8) • F_T = F^(1/4) if F is in levels 4, 5 (16 × 16 and 32 × 32)
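A small sketch of the level-dependent Gaussianizing transform described above; the features_by_level container and the synthetic skewed features are assumptions.

```python
# Sketch of the Gaussianizing power transforms: square root for levels 2-3,
# fourth root for levels 4-5. `features_by_level` is an assumed dict
# {level: array of nonnegative Gabor feature magnitudes}.
import numpy as np

def gaussianize(features_by_level):
    transformed = {}
    for level, F in features_by_level.items():
        F = np.asarray(F, dtype=float)
        if level in (2, 3):          # 4 x 4 and 8 x 8 pyramid levels
            transformed[level] = np.sqrt(F)
        elif level in (4, 5):        # 16 x 16 and 32 x 32 pyramid levels
            transformed[level] = F ** 0.25
        else:
            transformed[level] = F   # leave other levels unchanged
    return transformed

# Example with synthetic skewed (chi-square-like) feature magnitudes.
rng = np.random.default_rng(0)
demo = {lev: rng.chisquare(2, size=1000) for lev in (2, 3, 4, 5)}
out = gaussianize(demo)
print({lev: round(float(np.mean(v)), 2) for lev, v in out.items()})
```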
Episode 4: localized predictors Correlation of predictors with voxels, multi-resolution: the max correlation value at each location (over 8 orientations); purple = strong (> 0.4), red = weak. For each voxel there is a specific strong location across resolutions. [Figure: correlation maps for Voxel 1 and Voxel 2.]
Localized prediction Each voxel is related to an area of the image by the Localization Algorithm: • For each level 2 to 5, for each location: find the orientation giving the max correlation. This generates 4 maps (4 × 4 to 32 × 32). • For each map: smooth using a Gaussian kernel and find the peak correlation. • Levels 2, 3: use only features from the peak of correlation (8 features each). • Levels 4, 5: include 4 and 12 neighboring locations (40 and 104 features). • On these features, train a linear prediction function using L2 epsilon-boosting. 10 times faster than the global fit. Similar prediction on almost all voxels.
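An illustrative sketch of the per-level localization step (max over orientations, Gaussian smoothing, peak finding); the array shapes, smoothing width, and the fake correlation map are assumptions.

```python
# Sketch of localization for one pyramid level: take the max correlation over
# orientations at each location, smooth the map, and return the peak location.
import numpy as np
from scipy.ndimage import gaussian_filter

def localize(corr, sigma=1.0):
    """corr: array (n_orient, H, W) of feature-voxel correlations at one level.
    Returns the (row, col) of the peak of the smoothed max-orientation map."""
    max_map = np.abs(corr).max(axis=0)          # max over the 8 orientations
    smooth = gaussian_filter(max_map, sigma=sigma)
    return np.unravel_index(np.argmax(smooth), smooth.shape)

# Example: a fake 8-orientation, 16 x 16 correlation map with one hot spot.
rng = np.random.default_rng(0)
corr = 0.02 * rng.normal(size=(8, 16, 16))
corr[3, 10, 5] = 0.6                            # strong local correlation
print("peak location:", localize(corr))
```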
After localization: view responses of a single voxel It is easy to distinguish stimuli that induce strong vs. weak responses in the voxel when we look at the local area. [Figure: the voxel’s location in the full image, with example stimuli giving strong and weak responses.]
Localization does not work for all voxels Generally, prediction does not deteriorate. However, there are a few voxels for which local prediction is not as good. [Figure: red = global − local > 0.05; blue = local − global > 0.05.]
Who is responsible? (ongoing) Large regions of features are correlated with the outcome. The image features have correlations: • between close locations • between levels of the pyramid • between neighboring gradients Which of the features are “responsible”, and which are tagging along? Can we distinguish between correlation between features and causation of the voxel response?
Some unlocalized voxels Two neighboring voxels are highly correlated. Either each is influenced by both areas, or one influences the other. [Figure: diagram of Area A with voxel 1 and voxel 2.]
Episode 5: sparse graphical models (ongoing) Estimated apart, each voxel depends on both areas; estimated together, each depends on only one area. [Figure: estimated inverse covariance entries for Voxel 1 and Voxel 2 with respect to Area A and Area B, estimated separately vs. together.]
Episode 6: building a neighbor model (ongoing) Voxels with poor localization are probably influenced by neighboring voxels. • For each voxel 0, take its 6 neighboring voxels (v = 1, …, 6). • Let L(v) be the local (transformed) predictors for voxel v. • Build a sparse linear model for voxel v based on L(v). • Add the 6 predicted values from the neighbor models to the local (transformed) predictors for voxel 0. We call this the neighbor model.
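A rough sketch of the neighbor model under stated assumptions: the containers L and y holding each voxel's local predictors and response are hypothetical, and LassoCV stands in for the sparse linear fit. Note that the neighbor predictions here are in-sample; in practice, held-out predictions would be safer.

```python
# Sketch of the neighbor model: fit a sparse local model for each of the six
# neighboring voxels, then augment voxel 0's local predictors with the six
# neighbor predictions before fitting voxel 0's model.
import numpy as np
from sklearn.linear_model import LassoCV

def neighbor_model(L, y, target=0, neighbors=(1, 2, 3, 4, 5, 6)):
    """L[v]: local (transformed) predictor matrix for voxel v; y[v]: response."""
    neighbor_preds = []
    for v in neighbors:
        fit_v = LassoCV(cv=5).fit(L[v], y[v])       # sparse model for neighbor v
        neighbor_preds.append(fit_v.predict(L[v]))  # its predicted responses
    X_aug = np.column_stack([L[target]] + neighbor_preds)
    return LassoCV(cv=5).fit(X_aug, y[target])      # model for voxel 0

# Tiny synthetic example: 7 voxels, 50 local features each, 300 samples.
rng = np.random.default_rng(0)
L = {v: rng.normal(size=(300, 50)) for v in range(7)}
y = {v: L[v][:, 0] + 0.5 * rng.normal(size=300) for v in range(7)}
fit0 = neighbor_model(L, y)
print("nonzero coefficients:", np.count_nonzero(fit0.coef_))
```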