Smoothing Proximal Gradient Method for General Structured Sparse Learning Eric Xing epxing@cs.cmu.edu School of Computer Science Carnegie Mellon University Joint work with: Xi Chen, Qihang Lin, Seyoung Kim, Jaime Carbonell
Modern High-Dimensional Problems
• Genomics: 3 billion bases, millions of "mutations" in a linear ordering; 20K genes, related by a network
• Computer Vision: 3.6 billion photos with million-dimensional features, labeled with ~10^4 classes, organized by a gigantic taxonomy
Genome-Wide Association (GWA)
GWA mapping: Single Nucleotide Polymorphism (SNP) genotyping became affordable thanks to high-throughput sequencing technologies.
This problem is extremely difficult:
• Number of samples << number of SNPs
• Complex genetic architecture, population structure, other confounders
What would you do if you needed to perform GWA for multiple traits, e.g., ALL gene expressions, ALL clinical phenotypes, on MANY SNPs? And you knew their STRUCTURES?
It is desirable to incorporate the rich structures of multiple traits and SNPs via Sparse Structured I/O Models to improve GWA mapping!
Human-Level Image Classification
The knowledge ontology: large scale in 3 dimensions
• Data: 12 million images
• Features: ~1 million (from the top-performing system in ILSVRC10 [Lin et al. 2011])
• Classes: 17K classes
Courtesy L. Fei-Fei
Toward Large Scale Problems: • Large data size: • Stochastic/online methods • Parallel computation, e.g., Map-Reduce • Large feature dimension: • Sparsity-inducing regularization • Structured sparsity • Sparse coding • Large concept space: • Multi-task and transfer learning • Structured sparsity
Outline Structured Sparse Learning Problems Smoothing Proximal-Gradient Descent Method Extension: Multi-task Structured Sparse Learning Experimental Results: Simulation Study Experimental Results: Real Genetic Data Analysis
Sparse Learning
• Linear Model
• Sparse Linear Regression (Lasso) [R. Tibshirani, 96]: a regression loss plus a penalty giving individual-feature-level sparsity; extensions add group structure or graph structure
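In symbols (standard forms, with notation chosen to match the labels above):

    Linear model:  y = X\beta + \epsilon
    Lasso:  \min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1

The first term is the regression loss, the \ell_1 term induces individual-feature-level sparsity, and the structured variants on the following slides replace it with penalties that encode group or graph structure.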
Structured Prediction
• Binary classification: black-and-white decisions
• Multi-class classification: the world of technicolor
• can be reduced to several binary decisions, but...
• often better to handle multiple classes directly
• how many classes? 2? 5? exponentially many?
• Structured prediction: many classes, strongly interdependent
• Example: sequence labeling (number of classes exponential in the sequence length)
Multivariate Regression for Multi-task Classification
[Figure: input features x mapped to multiple classes (Shepherd, Husky, Bulldog, Penguin, Duck) arranged in a dogs/birds taxonomy; the feature strength between feature j and class i is β_{j,i}]
• How can we combine information across multiple classes to increase statistical power?
• We introduce a graph- or tree-guided penalty that couples the β_{j,i} across related classes.
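In symbols (a standard multi-task linear model, consistent with the β_{j,i} notation above):

    Y = XB + E,   Y \in R^{n \times K},  X \in R^{n \times J},  B = [\beta_{j,i}] \in R^{J \times K}

Each column of B holds the feature strengths for one class; the structured penalties below act on B to share information across the K related classes.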
Graph-Guided Fusion Penalty
• Fusion penalty: |β_{jk} − β_{jm}| for each pair of concepts (k, m) connected in the network
• For two correlated concepts (connected in the network), the association strengths should have similar values.
• The fusion effect propagates through the entire network: associations emerge between features and subnetworks of concepts.
[Figure: feature j linked to concepts k and m, with strengths β_{jk} and β_{jm}]
Kim and Xing, PLoS Genetics 2009
Tree-Guided Group Lasso
• For a general tree: should the child nodes be selected jointly or separately?
• Tree-guided group lasso interpolates between joint selection and separate selection.
[Figure: a tree with internal nodes h1, h2 over groups of leaf coefficients]
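A sketch of the penalty's standard form, following Kim and Xing's tree-guided group lasso (the node-weight scheme is simplified here):

    \Omega_{tree}(\beta) = \lambda \sum_{v \in V} w_v \|\beta_{G_v}\|_2

where V ranges over the nodes of the tree, G_v is the set of coefficients under node v, and the weights w_v are derived from each internal node's mixing parameter, so that weight on a parent node encourages joint selection of its children while weight on the children permits separate selection.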
Sparse Coding (unsupervised)
• Let x be a signal, e.g., speech, image, etc.
• Let b be a set of normalized "basis vectors"; we call it a dictionary.
• b is "adapted" to x if it can represent x with a few basis vectors: there exists a sparse vector q such that x ≈ bq.
• We call q the sparse code.
[Figure: an image decomposed into sailboat, bear, and water responses over the dictionary]
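The optimization this alludes to, in its usual Lasso-style form (not spelled out on the slide):

    \min_{q} \tfrac{1}{2}\|x - bq\|_2^2 + \lambda\|q\|_1

with the columns of b constrained to unit norm; the \ell_1 term keeps the code q sparse.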
Hierarchical Image Coding
• Unsupervised or supervised feature learning over a structured object dictionary, with pooling
[Figure: hierarchical coding pipeline producing sailboat, bear, and water responses]
L.-J. Li, J. Zhu, H. Su, E.P. Xing, & L. Fei-Fei. Under preparation.
Challenge
• How do we solve the optimization problem for the overlapping group lasso and the graph-guided fused lasso?
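For concreteness, the two penalties in the forms used in the SPG paper (Chen et al.), with γ the regularization parameter and w_g, τ(r_{ml}) penalty weights:

    Overlapping group lasso:   \Omega(\beta) = \gamma \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2,  with possibly overlapping groups g
    Graph-guided fused lasso:  \Omega(\beta) = \gamma \sum_{(m,l) \in E} \tau(r_{ml}) |\beta_m - \mathrm{sign}(r_{ml})\beta_l|

Both are convex but non-smooth and non-separable, which is exactly what makes them hard for standard proximal methods.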
Optimization
• Existing methods: generic subgradient descent (slow convergence rate), interior-point methods for SOCP/QP reformulations (memory-bound, as the experiments below show), and proximal-gradient methods, whose proximal step has no closed form for these overlapping and fused penalties
Smoothing Proximal Gradient (SPG) Descent
• Fast and scalable algorithm: gradient method
• Difficulty: non-separability and non-smoothness of the structured sparsity-inducing penalty
• Idea:
• Reformulate the structured sparsity-inducing penalty (via the dual norm)
• Introduce its smooth approximation
• Plug the smooth approximation back into the problem and solve it with an accelerated gradient method (FISTA: fast iterative shrinkage-thresholding algorithm) [Y. Nesterov 05] [Beck and Teboulle, 09]
Reformulation of Fusion Penalty
• Graph-structured sparsity: write the penalty in matrix form using the edge-vertex incidence matrix, then apply the dual norm.
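Concretely, following the SPG construction (C is the weighted edge-vertex incidence matrix, one signed row per edge):

    \Omega(\beta) = \gamma \|C\beta\|_1 = \gamma \max_{\|\alpha\|_\infty \le 1} \alpha^\top C\beta

where row (m,l) of C \in R^{|E| \times J} carries \tau(r_{ml}) in column m and -\mathrm{sign}(r_{ml})\tau(r_{ml}) in column l. The maximization over the \ell_\infty ball is the dual-norm form that the smoothing step exploits.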
Reformulation of Group Penalty
• Group-structured sparsity: the same trick applies, with the rows of the matrix indexed by (group, coefficient) pairs and the columns indexed by coefficients.
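In the same notation (again following the SPG construction):

    \Omega(\beta) = \gamma \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2 = \gamma \max_{\alpha \in \mathcal{Q}} \alpha^\top C\beta,   \mathcal{Q} = \{\alpha : \|\alpha_g\|_2 \le 1 \ \text{for all } g \in \mathcal{G}\}

where C stacks one block of rows per group g, with entry w_g in the columns belonging to g. The dual feasible set is now a product of per-group \ell_2 balls rather than an \ell_\infty ball.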
Approximation to the Penalty
• Subtract a quadratic proximity term inside the dual maximization: the result is a smooth lower bound of the penalty, controlled by a smoothing parameter μ, with a maximum gap of μD.
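The smoothed penalty and its gap (Nesterov's smoothing, as used by SPG):

    f_\mu(\beta) = \max_{\alpha \in \mathcal{Q}} \left( \alpha^\top C\beta - \tfrac{\mu}{2}\|\alpha\|_2^2 \right),
    f_\mu(\beta) \le \Omega(\beta) \le f_\mu(\beta) + \mu D,   D = \max_{\alpha \in \mathcal{Q}} \tfrac{1}{2}\|\alpha\|_2^2

For the graph penalty D = |E|/2 and for the group penalty D = |\mathcal{G}|/2, since each dual coordinate or block lies in a unit ball.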
Geometric Interpretation
[Figure: the non-smooth penalty is the uppermost line over a family of linear functions in α; subtracting the quadratic proximity term yields a smooth uppermost envelope]
Proximal Gradient Descent
• Original problem: a smooth loss, a non-smooth penalty with complicated structure, and a non-smooth ℓ1 term with good separability
• Approximation problem: replace the structured penalty by its smooth approximation, so the loss plus the smoothed penalty form a single smooth function h with an explicit gradient
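Written out (same objects as above; g is the squared-error loss from the earlier slides):

    Original:       \min_\beta f(\beta) = g(\beta) + \Omega(\beta) + \lambda\|\beta\|_1
    Approximation:  \min_\beta \tilde{f}(\beta) = h(\beta) + \lambda\|\beta\|_1,   h(\beta) = g(\beta) + f_\mu(\beta)
    Gradient:       \nabla h(\beta) = \nabla g(\beta) + C^\top \alpha^*(\beta)

where \alpha^*(\beta) is the unique maximizer in the definition of f_\mu, computable in closed form as the projection of C\beta/\mu onto \mathcal{Q}.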
Accelerated Gradient Descent (FISTA) [Beck and Teboulle, 09]
• Apply FISTA to the approximation: a smooth part plus a non-smooth ℓ1 part with good separability
• The proximal step has a closed-form solution
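A minimal sketch of the resulting loop in Python/NumPy (not the authors' code: grad_h stands for the smoothed gradient \nabla g + C^\top \alpha^*, and L, lam, and the iteration count are placeholders):

    import numpy as np

    def soft_threshold(v, t):
        """Closed-form proximal operator of t * ||.||_1 (soft-thresholding)."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def fista(grad_h, beta0, L, lam, n_iters=500):
        """FISTA on h(beta) + lam * ||beta||_1, where grad_h(beta) returns the
        gradient of the smooth part h and L is its Lipschitz constant."""
        beta = beta0.copy()
        w = beta0.copy()          # extrapolated (momentum) point
        theta = 1.0
        for _ in range(n_iters):
            # Proximal step at the extrapolated point.
            beta_next = soft_threshold(w - grad_h(w) / L, lam / L)
            # Momentum update (standard FISTA schedule).
            theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
            w = beta_next + ((theta - 1.0) / theta_next) * (beta_next - beta)
            beta, theta = beta_next, theta_next
        return beta

For SPG one would pass grad_h(beta) = X.T @ (X @ beta - y) + C.T @ alpha_star(beta), with alpha_star the projection of C beta / mu onto the dual set Q.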
Convergence Rate
• If we require the objective gap to be at most ε and set μ = ε/(2D), the number of iterations is upper bounded by O(1/ε).
• Proof idea: combine FISTA's rate on the smoothed problem with the μD approximation gap.
• Subgradient method: O(1/ε²) iterations for the same accuracy.
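A sketch of where the rate comes from (constants simplified relative to the paper): the Lipschitz constant of h is

    L_\mu = \lambda_{\max}(X^\top X) + \frac{\|C\|^2}{\mu}

FISTA needs O(\sqrt{L_\mu/\epsilon}) iterations for an ε-accurate solution of the smoothed problem, and with μ = ε/(2D) we get L_\mu = O(1/ε), hence O(1/ε) iterations overall, versus O(1/ε²) for the subgradient method.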
Time Complexity
• Pre-compute: X^T X and X^T y once
• Per-iteration complexity (computing the gradient): linear in the penalty size, i.e., in the total group size for the group penalty or in the number of edges for the graph penalty, plus the proximal-gradient step
• Independent of the sample size
Multi-Task Time Complexity
• Pre-compute: X^T X and X^T Y once
• Per-iteration complexity (computing the gradient): as in the single-task case for the group and graph penalties, plus the proximal-gradient step
• Independent of the sample size; linear in the number of tasks
Experiment
• Multi-task Overlapping Group Lasso (tree-structured)
[Figure: recovered coefficient matrices for the ground truth, Lasso, and L1/L2 multi-task Lasso, under a binary-tree group structure]
Experiment
• Multi-task Overlapping Group Lasso (tree-structured)
• SOCP (interior-point) runs out of memory storing the Newton linear system and cannot scale up.
Experiment
• Multi-task Graph-guided Fused Lasso
• Input: SNPs in the HapMap CEU panel
[Figure: recovered coefficients for the ground truth, Lasso, and graph-fused L1/L2]
Experiment
• Multi-task Graph-guided Fused Lasso
• SOCP/QP solvers run out of memory storing the Newton linear system and cannot scale up.
The ImageNet Problem
• ILSVRC10: 1.2 million images / 1,000 categories
• 1,000 visual words in the dictionary
• Locality-constrained linear coding
• Max pooling on a spatial pyramid
• Each image represented as a vector in a 21,000-dimensional space
Zhao, Fei-Fei and Xing, in preparation
Classification Results • Flat error & hierarchical error
Effects of Augmented Loss Function
• APPLET vs. LR: the classification results of APPLET are significantly more informative
Summary
• Smoothing Proximal Gradient (SPG) Descent:
• Reformulate the structured sparsity-inducing penalty (via the dual norm)
• Introduce its smooth approximation
• Plug the smooth approximation back into the problem and solve it with an accelerated gradient method (FISTA: fast iterative shrinkage-thresholding algorithm)
Thank You! Q & A
Accelerated Gradient Descent (FISTA)
• Generalized gradient descent step (projection step)
• Closed-form solution (soft-thresholding): a Euclidean-distance proximal problem whose solution is exactly sparse (exact zeros)
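The closed form referred to here (standard soft-thresholding; the same operator as in the code sketch earlier):

    \mathrm{prox}_{t\|\cdot\|_1}(v) = \arg\min_{\beta} \tfrac{1}{2}\|\beta - v\|_2^2 + t\|\beta\|_1 = \mathrm{sign}(v)\max(|v| - t, 0)

Because the max clips small entries to exactly zero, the iterates themselves are sparse.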
Biological Applications
• Genome-Wide Association Studies (GWAS): 1,260 genotypes (inputs), expression levels (outputs) of 3,684 genes, 114 yeast strains
• Multi-task overlapping group lasso: groups defined among genes by a hierarchical clustering tree
• Training:Test = 2:1 (5 folds); 368 iterations, 1,366 seconds
• The previous method could handle no more than 100 genotypes [S. Kim 10]
Multi-Task Time Complexity
• Pre-compute: X^T X and X^T Y once
• Per-iteration complexity (computing the gradient): as in the single-task case for the tree and graph penalties, plus the proximal-gradient step
• Independent of the sample size; linear in the number of concepts; parallelizable
Proximal Gradient Descent
• Original problem, approximation problem, and the gradient of the approximation: as given on the main Proximal Gradient Descent slide above