Smoothing Proximal Gradient Method for General Structured Sparse Learning Eric Xing epxing@cs.cmu.edu School of Computer Science Carnegie Mellon University Joint work with: Xi Chen, Qihang Lin, Seyoung Kim, Jaime Carbonell
Modern High-Dimensional Problems
• Genomics: 3 billion bases, millions of "mutations" in a linear ordering; 20K genes, related by a network
• Computer Vision: 3.6 billion photos with million-dimensional features, labeled with ~10^4 classes, organized by a gigantic taxonomy
Genome-Wide Association (GWA)
GWA mapping: Single Nucleotide Polymorphism (SNP) genotyping became affordable thanks to high-throughput sequencing technologies.
This problem is extremely difficult:
• Number of samples << number of SNPs
• Complex genetic architecture, population structure, other confounders
What would you do if you needed to perform GWA for multiple traits, e.g., ALL gene expressions, ALL clinical phenotypes, on MANY SNPs? And you knew their STRUCTURES?
It is desirable to incorporate the rich structures of multiple traits and SNPs via Sparse Structured I/O Models to improve GWA mapping!
Human-Level Image Classification
The knowledge ontology: large scale in 3 dimensions
• Data: 12 million images
• Features: ~1 million (from the top-performing system in ILSVRC10 [Lin et al. 2011])
• Classes: 17K classes
Courtesy L. Fei-Fei
Toward Large Scale Problems: • Large data size: • Stochastic/online methods • Parallel computation, e.g., Map-Reduce • Large feature dimension: • Sparsity-inducing regularization • Structured sparsity • Sparse coding • Large concept space: • Multi-task and transfer learning • Structured sparsity
Outline Structured Sparse Learning Problems Smoothing Proximal-Gradient Descent Method Extension: Multi-task Structured Sparse Learning Experimental Results: Simulation Study Experimental Results: Real Genetic Data Analysis
Sparse Learning
• Linear Model
• Sparse Linear Regression (Lasso) [R. Tibshirani, 96]: a regression loss plus a penalty giving individual-feature-level sparsity; extensions add group structure or graph structure
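In symbols (standard forms, with notation chosen to match the labels above):

    Linear model:  y = X\beta + \epsilon
    Lasso:  \min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1

The first term is the regression loss, the \ell_1 term induces individual-feature-level sparsity, and the structured variants on the following slides replace it with penalties that encode group or graph structure.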
Structured Prediction
• Binary classification: black-and-white decisions
• Multi-class classification: the world of technicolor
• can be reduced to several binary decisions, but...
• often better to handle multiple classes directly
• how many classes? 2? 5? exponentially many?
• Structured prediction: many classes, strongly interdependent
• Example: sequence labeling (number of classes exponential in the sequence length)
Multivariate Regression for Multi-task Classification
[Figure: input features x mapped to multiple classes (Shepherd, Husky, Bulldog, Penguin, Duck) arranged in a dogs/birds taxonomy; the feature strength between feature j and class i is β_{j,i}]
• How can we combine information across multiple classes to increase statistical power?
• We introduce a graph- or tree-guided penalty that couples the β_{j,i} across related classes.
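In symbols (a standard multi-task linear model, consistent with the β_{j,i} notation above):

    Y = XB + E,   Y \in R^{n \times K},  X \in R^{n \times J},  B = [\beta_{j,i}] \in R^{J \times K}

Each column of B holds the feature strengths for one class; the structured penalties below act on B to share information across the K related classes.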
Graph-Guided Fusion Penalty
• Fusion penalty: |β_{jk} − β_{jm}| for each pair of concepts (k, m) connected in the network
• For two correlated concepts (connected in the network), the association strengths should have similar values.
• The fusion effect propagates through the entire network: associations emerge between features and subnetworks of concepts.
[Figure: feature j linked to concepts k and m, with strengths β_{jk} and β_{jm}]
Kim and Xing, PLoS Genetics 2009
Tree-Guided Group Lasso
• For a general tree: should the child nodes be selected jointly or separately?
• Tree-guided group lasso interpolates between joint selection and separate selection.
[Figure: a tree with internal nodes h1, h2 over groups of leaf coefficients]
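A sketch of the penalty's standard form, following Kim and Xing's tree-guided group lasso (the node-weight scheme is simplified here):

    \Omega_{tree}(\beta) = \lambda \sum_{v \in V} w_v \|\beta_{G_v}\|_2

where V ranges over the nodes of the tree, G_v is the set of coefficients under node v, and the weights w_v are derived from each internal node's mixing parameter, so that weight on a parent node encourages joint selection of its children while weight on the children permits separate selection.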
Sparse Coding (unsupervised)
• Let x be a signal, e.g., speech, image, etc.
• Let b be a set of normalized "basis vectors"; we call it a dictionary.
• b is "adapted" to x if it can represent x with a few basis vectors: there exists a sparse vector q such that x ≈ bq.
• We call q the sparse code.
[Figure: an image decomposed into sailboat, bear, and water responses over the dictionary]
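The optimization this alludes to, in its usual Lasso-style form (not spelled out on the slide):

    \min_{q} \tfrac{1}{2}\|x - bq\|_2^2 + \lambda\|q\|_1

with the columns of b constrained to unit norm; the \ell_1 term keeps the code q sparse.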
Hierarchical Image Coding
• Unsupervised or supervised feature learning over a structured object dictionary, with pooling
[Figure: hierarchical coding pipeline producing sailboat, bear, and water responses]
L.-J. Li, J. Zhu, H. Su, E.P. Xing, & L. Fei-Fei. Under preparation.
Challenge
• How do we solve the optimization problem for the overlapping group lasso and the graph-guided fused lasso?
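For concreteness, the two penalties in the forms used in the SPG paper (Chen et al.), with γ the regularization parameter and w_g, τ(r_{ml}) penalty weights:

    Overlapping group lasso:   \Omega(\beta) = \gamma \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2,  with possibly overlapping groups g
    Graph-guided fused lasso:  \Omega(\beta) = \gamma \sum_{(m,l) \in E} \tau(r_{ml}) |\beta_m - \mathrm{sign}(r_{ml})\beta_l|

Both are convex but non-smooth and non-separable, which is exactly what makes them hard for standard proximal methods.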
Optimization
• Existing methods: generic subgradient descent (slow convergence rate), interior-point methods for SOCP/QP reformulations (memory-bound, as the experiments below show), and proximal-gradient methods, whose proximal step has no closed form for these overlapping and fused penalties
Smoothing Proximal Gradient (SPG) Descent
• Fast and scalable algorithm: gradient method
• Difficulty: non-separability and non-smoothness of the structured sparsity-inducing penalty
• Idea:
• Reformulate the structured sparsity-inducing penalty (via the dual norm)
• Introduce its smooth approximation
• Plug the smooth approximation back into the problem and solve it with an accelerated gradient method (FISTA: fast iterative shrinkage-thresholding algorithm) [Y. Nesterov 05] [Beck and Teboulle, 09]
Reformulation of Fusion Penalty
• Graph-structured sparsity: write the penalty in matrix form using the edge-vertex incidence matrix, then apply the dual norm.
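Concretely, following the SPG construction (C is the weighted edge-vertex incidence matrix, one signed row per edge):

    \Omega(\beta) = \gamma \|C\beta\|_1 = \gamma \max_{\|\alpha\|_\infty \le 1} \alpha^\top C\beta

where row (m,l) of C \in R^{|E| \times J} carries \tau(r_{ml}) in column m and -\mathrm{sign}(r_{ml})\tau(r_{ml}) in column l. The maximization over the \ell_\infty ball is the dual-norm form that the smoothing step exploits.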
Reformulation of Group Penalty
• Group-structured sparsity: the same trick applies, with the rows of the matrix indexed by (group, coefficient) pairs and the columns indexed by coefficients.
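In the same notation (again following the SPG construction):

    \Omega(\beta) = \gamma \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2 = \gamma \max_{\alpha \in \mathcal{Q}} \alpha^\top C\beta,   \mathcal{Q} = \{\alpha : \|\alpha_g\|_2 \le 1 \ \text{for all } g \in \mathcal{G}\}

where C stacks one block of rows per group g, with entry w_g in the columns belonging to g. The dual feasible set is now a product of per-group \ell_2 balls rather than an \ell_\infty ball.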
Approximation to the Penalty
• Subtract a quadratic proximity term inside the dual maximization: the result is a smooth lower bound of the penalty, controlled by a smoothing parameter μ, with a maximum gap of μD.
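The smoothed penalty and its gap (Nesterov's smoothing, as used by SPG):

    f_\mu(\beta) = \max_{\alpha \in \mathcal{Q}} \left( \alpha^\top C\beta - \tfrac{\mu}{2}\|\alpha\|_2^2 \right),
    f_\mu(\beta) \le \Omega(\beta) \le f_\mu(\beta) + \mu D,   D = \max_{\alpha \in \mathcal{Q}} \tfrac{1}{2}\|\alpha\|_2^2

For the graph penalty D = |E|/2 and for the group penalty D = |\mathcal{G}|/2, since each dual coordinate or block lies in a unit ball.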
Geometric Interpretation
[Figure: the non-smooth penalty is the uppermost line over a family of linear functions in α; subtracting the quadratic proximity term yields a smooth uppermost envelope]
Proximal Gradient Descent
• Original problem: a smooth loss, a non-smooth penalty with complicated structure, and a non-smooth ℓ1 term with good separability
• Approximation problem: replace the structured penalty by its smooth approximation, so the loss plus the smoothed penalty form a single smooth function h with an explicit gradient
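Written out (same objects as above; g is the squared-error loss from the earlier slides):

    Original:       \min_\beta f(\beta) = g(\beta) + \Omega(\beta) + \lambda\|\beta\|_1
    Approximation:  \min_\beta \tilde{f}(\beta) = h(\beta) + \lambda\|\beta\|_1,   h(\beta) = g(\beta) + f_\mu(\beta)
    Gradient:       \nabla h(\beta) = \nabla g(\beta) + C^\top \alpha^*(\beta)

where \alpha^*(\beta) is the unique maximizer in the definition of f_\mu, computable in closed form as the projection of C\beta/\mu onto \mathcal{Q}.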
Accelerated Gradient Descent (FISTA) [Beck and Teboulle, 09]
• Apply FISTA to the approximation: a smooth part plus a non-smooth ℓ1 part with good separability
• The proximal step has a closed-form solution
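A minimal sketch of the resulting loop in Python/NumPy (not the authors' code: grad_h stands for the smoothed gradient \nabla g + C^\top \alpha^*, and L, lam, and the iteration count are placeholders):

    import numpy as np

    def soft_threshold(v, t):
        """Closed-form proximal operator of t * ||.||_1 (soft-thresholding)."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def fista(grad_h, beta0, L, lam, n_iters=500):
        """FISTA on h(beta) + lam * ||beta||_1, where grad_h(beta) returns the
        gradient of the smooth part h and L is its Lipschitz constant."""
        beta = beta0.copy()
        w = beta0.copy()          # extrapolated (momentum) point
        theta = 1.0
        for _ in range(n_iters):
            # Proximal step at the extrapolated point.
            beta_next = soft_threshold(w - grad_h(w) / L, lam / L)
            # Momentum update (standard FISTA schedule).
            theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
            w = beta_next + ((theta - 1.0) / theta_next) * (beta_next - beta)
            beta, theta = beta_next, theta_next
        return beta

For SPG one would pass grad_h(beta) = X.T @ (X @ beta - y) + C.T @ alpha_star(beta), with alpha_star the projection of C beta / mu onto the dual set Q.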
Convergence Rate
• If we require the objective gap to be at most ε and set μ = ε/(2D), the number of iterations is upper bounded by O(1/ε).
• Proof idea: combine FISTA's rate on the smoothed problem with the μD approximation gap.
• Subgradient method: O(1/ε²) iterations for the same accuracy.
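A sketch of where the rate comes from (constants simplified relative to the paper): the Lipschitz constant of h is

    L_\mu = \lambda_{\max}(X^\top X) + \frac{\|C\|^2}{\mu}

FISTA needs O(\sqrt{L_\mu/\epsilon}) iterations for an ε-accurate solution of the smoothed problem, and with μ = ε/(2D) we get L_\mu = O(1/ε), hence O(1/ε) iterations overall, versus O(1/ε²) for the subgradient method.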
Time Complexity
• Pre-compute: X^T X and X^T y once
• Per-iteration complexity (computing the gradient): linear in the penalty size, i.e., in the total group size for the group penalty or in the number of edges for the graph penalty, plus the proximal-gradient step
• Independent of the sample size
Multi-Task Time Complexity
• Pre-compute: X^T X and X^T Y once
• Per-iteration complexity (computing the gradient): as in the single-task case for the group and graph penalties, plus the proximal-gradient step
• Independent of the sample size; linear in the number of tasks
Experiment
• Multi-task Overlapping Group Lasso (tree-structured)
[Figure: recovered coefficient matrices for the ground truth, Lasso, and L1/L2 multi-task Lasso, under a binary-tree group structure]
Experiment
• Multi-task Overlapping Group Lasso (tree-structured)
• SOCP (interior-point) runs out of memory storing the Newton linear system and cannot scale up.
Experiment
• Multi-task Graph-guided Fused Lasso
• Input: SNPs in the HapMap CEU panel
[Figure: recovered coefficients for the ground truth, Lasso, and graph-fused L1/L2]
Experiment
• Multi-task Graph-guided Fused Lasso
• SOCP/QP solvers run out of memory storing the Newton linear system and cannot scale up.
The ImageNet Problem
• ILSVRC10: 1.2 million images / 1,000 categories
• 1,000 visual words in the dictionary
• Locality-constrained linear coding
• Max pooling on a spatial pyramid
• Each image represented as a vector in a 21,000-dimensional space
Zhao, Fei-Fei and Xing, in preparation
Classification Results • Flat error & hierarchical error
Effects of Augmented Loss Function
• APPLET vs. LR: the classification results of APPLET are significantly more informative
Summary
• Smoothing Proximal Gradient (SPG) Descent:
• Reformulate the structured sparsity-inducing penalty (via the dual norm)
• Introduce its smooth approximation
• Plug the smooth approximation back into the problem and solve it with an accelerated gradient method (FISTA: fast iterative shrinkage-thresholding algorithm)
Thank You! Q & A
Accelerated Gradient Descent (FISTA)
• Generalized gradient descent step (projection step)
• Closed-form solution (soft-thresholding): a Euclidean-distance proximal problem whose solution is exactly sparse (exact zeros)
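The closed form referred to here (standard soft-thresholding; the same operator as in the code sketch earlier):

    \mathrm{prox}_{t\|\cdot\|_1}(v) = \arg\min_{\beta} \tfrac{1}{2}\|\beta - v\|_2^2 + t\|\beta\|_1 = \mathrm{sign}(v)\max(|v| - t, 0)

Because the max clips small entries to exactly zero, the iterates themselves are sparse.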
Biological Applications
• Genome-Wide Association Studies (GWAS): 1,260 genotypes (inputs), expression levels (outputs) of 3,684 genes, 114 yeast strains
• Multi-task overlapping group lasso: groups defined among genes by a hierarchical clustering tree
• Training:Test = 2:1 (5 folds); 368 iterations, 1,366 seconds
• The previous method could handle no more than 100 genotypes [S. Kim 10]
Multi-Task Time Complexity
• Pre-compute: X^T X and X^T Y once
• Per-iteration complexity (computing the gradient): as in the single-task case for the tree and graph penalties, plus the proximal-gradient step
• Independent of the sample size; linear in the number of concepts; parallelizable
Proximal Gradient Descent
• Original problem, approximation problem, and the gradient of the approximation: as given on the main Proximal Gradient Descent slide above