Thesis Proposal Learning with Sparsity: Structures, Optimization and Applications. Xi Chen Committee Members: Jaime Carbonell (chair), Tom Mitchell, Larry Wasserman, Robert Tibshirani Machine Learning Department Carnegie Mellon University. Modern Data Analysis.
Thesis Proposal: Learning with Sparsity: Structures, Optimization and Applications. Xi Chen. Committee Members: Jaime Carbonell (chair), Tom Mitchell, Larry Wasserman, Robert Tibshirani. Machine Learning Department, Carnegie Mellon University
Modern Data Analysis
• Web-text data: both high-dimensional and massive in volume; word features have structure (e.g., synonyms)
• Gene expression data for tumor classification: high-dimensional; very few samples; complex structure
• Climate data: dynamic, complex structure
Challenges: high dimensionality; complex and dynamic structures
Solutions: Sparse Learning
• Sparse regression for feature selection & prediction: smooth convex loss + L1-regularization [Tibshirani 96]
• Incorporating structural prior knowledge: structured penalty (e.g., group, hierarchical tree, graph) [Jenatton et al., 09; Peng et al., 09; Tibshirani et al., 05; Friedman et al., 10; Kim et al., 10]
• Nonparametric sparse regression (flexible model): additive model [Ravikumar et al., 09]
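As a reading aid, the two regression setups named on this slide can be written in their standard textbook form. The symbols below (coefficient vector \beta, smooth convex loss f, structured penalty \Omega, tuning parameters \lambda and \gamma, group weights w_g) are notation assumed here, not taken from the slides.

```latex
% L1-regularized (lasso-type) sparse regression
\min_{\beta \in \mathbb{R}^p} \; f(\beta) + \lambda \|\beta\|_1

% With an additional structured penalty, e.g. a group penalty over groups g \in G
\min_{\beta \in \mathbb{R}^p} \; f(\beta) + \Omega(\beta) + \lambda \|\beta\|_1,
\qquad
\Omega(\beta) = \gamma \sum_{g \in G} w_g \, \|\beta_g\|_2
```

The group penalty is only one instance; tree and graph penalties replace \Omega with a different sum of norms over overlapping groups or edges.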
Sparse Learning in Graphical Models. Undirected Graphical Model (Markov Random Fields): pairwise model for images, gene graph. Learn Sparse Structure of Graphical Models: Graphical Lasso (gLasso) (Yuan et al. 06, Friedman et al. 07, Banerjee et al. 08); Iterated Lasso (Meinshausen and Bühlmann, 06); Forest Density Estimator (Liu et al. 10)
Thesis Overview: High-dimensional Sparse Learning with Structures
• Sparse Single/Multi-task Regression with General Structured Penalty. Challenge: computation. Completed work: unified optimization framework, Smoothing Proximal Gradient [UAI 11, AOAS]. Future work: (1) online learning for massive data; (2) incorporate structured penalty in other models (e.g., PCA, CCA).
• Learning Sparse Structures for Undirected Graphical Models. Existing: static or time-varying graph. Challenge: dynamic structures. Completed work: conditional Gaussian graphical model: kernel smoothing method for spatial-temporal graphs [AAAI 10]; partition-based method [NIPS 10]. Future work: relax the conditional Gaussian assumption (continuous & discrete).
• Nonparametric Sparse Regression. Existing: additive models. Challenge: (1) generalized models, (2) structures. Completed work: generalized forward regression [NIPS 09]; penalized tree regression [NIPS 10]. Future work: incorporating rich structures.
Application areas: tumor classification using gene expression data [UAI 11, AOAS], climate data analysis [AAAI 10, NIPS 10], web-text mining [ICDM 10, SDM 10]
Roadmap Smoothing Proximal Gradient for Structured Sparse Regression Structure Learning in Graphical Models Nonparametric Sparse Regression Summary and Timeline Q & A
Useful Structures and Structured Penalty Application: pathway selection for gene-expression data in tumor classification [Yuan 06] [Peng et al 09, Kim et al 10] Example: WordNet [Bach et al., 09] Group Structure (group-wise selection)
Useful Structure and Structured Penalty Piece-wise constant Graph smoothness [Kim et al., 10] • Graph Structure (to enforce smoothness) [Tibshirani 05]
Challenge: Single-task Regression; Multi-task Regression. The structured penalties are nonsmooth and nonseparable. Goal: a unified, efficient and scalable optimization framework for solving all these structured penalties
Existing Optimization Proximal Operator: [Nesterov 07, Beck and Teboulle, 09]
Overview: Smoothing Proximal Gradient (SPG) [Nesterov 05] • First-order Method (only gradient info): fast and scalable • No exact solution for proximal operator • Idea: • 1) Reformulate the structured penalty (via the dual norm) • 2) Introduce its smooth approximation • 3) Plug the smooth approximation back into the original problem and solve it by accelerated proximal gradient methods • Convergence Results:
Why Is the Approximation Smooth? Geometric interpretation: the uppermost line (before smoothing) is nonsmooth; the uppermost line (after smoothing) is smooth
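A minimal sketch of the smoothing idea for a single Euclidean-norm term, assuming the standard Nesterov-style construction f_mu(x) = max over ||a||_2 <= 1 of <a, x> - (mu/2)||a||_2^2. The function name and the choice of a single unweighted group are illustrative assumptions, not the exact formulation on the slides.

```python
import math

def smoothed_l2_norm(x, mu):
    """Return (value, gradient) of a smoothed Euclidean norm.

    f_mu(x) = max_{||a||_2 <= 1} <a, x> - (mu / 2) * ||a||_2^2, with mu > 0.
    """
    norm = math.sqrt(sum(v * v for v in x))
    if norm >= mu:
        # The maximizer sits on the unit sphere: a* = x / ||x||.
        a_star = [v / norm for v in x]
        value = norm - mu / 2.0
    else:
        # The maximizer sits inside the ball: a* = x / mu (quadratic region).
        a_star = [v / mu for v in x]
        value = norm * norm / (2.0 * mu)
    # The gradient of f_mu at x equals the maximizer a* (Danskin's theorem).
    return value, a_star
```

Near the origin the kink of the norm is replaced by a quadratic bowl, which is why the gradient exists everywhere; away from the origin the value differs from the true norm by exactly mu/2.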
Smoothing Proximal Gradient (SPG). Original Problem: convex smooth loss + non-smooth penalty with complex structure + non-smooth penalty with good separability. Approximated Problem: the complex penalty is replaced by a smooth function. Gradient of the Approximation (Danskin's Theorem). Proximal Operator: Soft-thresholding [Nesterov 07, Beck and Teboulle, 09]
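The proximal step named on this slide can be sketched as follows, assuming the separable part of the penalty is lam * ||beta||_1 and that the gradient of the smooth part (loss plus smoothed penalty) has already been computed. This is a plain, unaccelerated step for illustration; the function names and arguments are hypothetical.

```python
def soft_threshold(v, t):
    """Elementwise proximal operator of t * ||.||_1."""
    out = []
    for x in v:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)  # small entries are set exactly to zero
    return out

def proximal_gradient_step(beta, grad_smooth, step, lam):
    """One step: gradient move on the smooth part, then soft-thresholding."""
    shifted = [b - step * g for b, g in zip(beta, grad_smooth)]
    return soft_threshold(shifted, step * lam)
```

An accelerated variant would apply the same two operations at an extrapolated point instead of at `beta`; the thresholding is what produces exact zeros in the iterate.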
Simulation Study: Multi-task Graph-guided Fused Lasso. Data: SNP sequence (e.g., ACGTTTTACTGTACAATTTAC) and gene-expression data
Biological Application: SPG for Overlapping Group Lasso. Breast Cancer Tumor Classification: gene expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic; logistic regression loss). Canonical pathways from MSigDB containing 637 groups of genes. Training:Test = 2:1. Regularization path (20 parameters): 331 seconds. Important pathways: proteasome, nicotinate (ENPP1)
Proposed Research
• More applications for SPG: a complex structured penalty (handled by the smoothing technique) combined with a simple penalty with good separability (closed-form solution in the proximal operator), e.g., low rank + sparse
• Web-scale learning: massive amounts of data; inputs arrive sequentially at a high rate; need to provide real-time service. Solution: stochastic optimization for online learning
Proposed Research • Stochastic Optimization. Deterministic: Stochastic: Existing methods: RDA [Lin 10], Accelerated Stochastic Gradient Descent [Lan et al. 10]; these ruin the sparsity pattern. Goal: sparsity-preserving stochastic optimization for large-scale online learning • Structured Sparsity: Beyond Regression • Canonical Correlation Analysis and its Application in Genome-wide Association Study
Roadmap Smoothing Proximal Gradient for Structured Sparse Regression Structure Learning in Graphical Models Nonparametric Sparse Regression Summary and Timeline Q & A
Gaussian Graphical Model [Lauritzen 96] [Yuan et al., 06, Friedman et al., 07 Banerjee et al., 08] gLasso Gaussian Graphical Model Graphical Lasso (gLasso) Challenge: Dynamic Graph Structure
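For reference, the graphical lasso estimator is usually stated as the following penalized log-likelihood problem. The notation (sample covariance S, precision matrix \Theta, tuning parameter \lambda) is assumed here rather than taken from the slide, and some formulations penalize only the off-diagonal entries.

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0} \;
\log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \lambda \,\|\Theta\|_1
```

In a Gaussian graphical model, \hat{\Theta}_{jk} = 0 corresponds to no edge between variables j and k, so the L1 penalty yields a sparse graph.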
Idea: Graph-Valued Regression Multivariate Regression Undirected Graphical Model Input data: Graph-Valued Regression: Application: [Zhou et al., 08 Song et al., 09]
Applications for higher dimensional X Y: Gene expression levels X: Patient Symptoms Characterization
Kernel Smoothing Estimator Conditional Gaussian Assumption Kernel Smoothing Estimator Cons: (1) Unstable when the dimension of x is high (2) Computationally heavy and difficult to analyze (3) Hard to Visualize
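A minimal sketch of the kernel smoothing idea, assuming a scalar covariate x, a Gaussian kernel, and a locally weighted covariance of Y that would then be passed to a sparse precision-matrix estimator such as gLasso. The function name, the kernel choice, and the bandwidth argument are illustrative assumptions.

```python
import math

def kernel_weighted_covariance(xs, ys, x0, bandwidth):
    """Kernel-weighted mean and covariance of the rows of ys around x0.

    xs: list of scalar covariates; ys: list of response vectors.
    Assumes at least one sample lies close enough to x0 to get nonzero weight.
    """
    # Gaussian kernel weights, normalized to sum to one.
    raw = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(raw)
    w = [r / total for r in raw]
    d = len(ys[0])
    # Locally weighted mean of the responses.
    mean = [sum(wi * y[j] for wi, y in zip(w, ys)) for j in range(d)]
    # Locally weighted covariance around that mean.
    cov = [[sum(wi * (y[j] - mean[j]) * (y[k] - mean[k])
                for wi, y in zip(w, ys))
            for k in range(d)]
           for j in range(d)]
    return mean, cov
```

Every query point x0 needs its own weighted covariance and its own sparse fit, which illustrates the computational cost listed among the drawbacks.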
Partition Based Estimator. CART (Classification and Regression Tree) [Breiman 84, Tibshirani et al., 09]. Graphical model: difficult to search for the split point. Partition Based Estimator: Graph-Optimized CART (Go-CART)
Dyadic Partitioning Tree [Scott and Nowak,04] Dyadic Partitioning Tree (DPT) Assumptions and Notations:
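A minimal sketch of a single dyadic split, assuming each cell of the partition is an axis-aligned hyperrectangle stored as a list of (low, high) intervals. The representation and the function name are illustrative assumptions; the slide's own notation is not reproduced here.

```python
def dyadic_split(cell, dim):
    """Split a hyperrectangle at the midpoint of one coordinate.

    cell: list of (low, high) intervals, one per dimension.
    Returns the two child cells (lower half, upper half) along dimension dim.
    """
    low, high = cell[dim]
    mid = (low + high) / 2.0
    left = list(cell)
    left[dim] = (low, mid)
    right = list(cell)
    right[dim] = (mid, high)
    return left, right
```

Because splits are only allowed at midpoints, choosing a split reduces to choosing a dimension, which keeps the search space small compared with searching over arbitrary split points.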
Graph-Optimized CART (Go-CART) • Go-CART: penalized risk minimization estimator • Go-CART: held-out risk minimization estimator • Split the data: • Practical algorithm: greedy learning using held-out data
Statistical Property. We do not assume that the underlying partition is dyadic: Oracle Risk; Oracle Inequality: bound the oracle excess risk. Adding the assumption that the underlying partition is dyadic: Tree Partitioning Consistency (might obtain a finer partition)
Real Climate Data Analysis [Lozano et al., 09, IBM]. Data Description: 125 locations in the U.S.; 1990 to 2002 (13 years); monthly observations of 18 variables/factors: CO2, UV, CH4, DIR, CO, ETRN, H2, ETR, WET, GLO, CLD, TMX, VAP, TMP, TMN, PRE, FRS, DTR
Real Climate Data Analysis: glasso. Observations: (1) For graphical lasso, no edge connects greenhouse gases (CO2, CH4, CO, H2) with solar radiation factors (GLO, DIR), which contradicts the IPCC report; for Go-CART, there is such an edge. (2) Graphs along the coasts are sparser than the ones in the mainland.
Proposed Research • Limitations of Go-CART: (1) Conditional Gaussian assumption; (2) Only for continuous Y. For discrete Y: approximate likelihood • Forest Graphical Model [Chow and Liu, 68, Tan et al., 09, Liu et al., 11] • Density only involves univariate and bivariate marginals • Compute mutual information for each pair of variables • Greedily learn the tree structure via Chow-Liu algorithm • Handle both continuous and discrete data • Forest-Valued Regression
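The two Chow-Liu steps listed above (pairwise mutual information, then a greedy maximum-weight spanning tree) can be sketched for discrete data as follows. This is a minimal illustration with empirical plug-in estimates; the function names are hypothetical, and a forest estimator would additionally stop adding edges early rather than always returning a full tree.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information between discrete columns i and j."""
    n = len(data)
    joint = Counter((row[i], row[j]) for row in data)
    count_i = Counter(row[i] for row in data)
    count_j = Counter(row[j] for row in data)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts.
        mi += (c / n) * math.log(c * n / (count_i[a] * count_j[b]))
    return mi

def chow_liu_tree(data):
    """Greedy maximum-weight spanning tree over pairwise mutual information."""
    d = len(data[0])
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))  # union-find structure to detect cycles

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

For example, `chow_liu_tree([(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)])` places the edge between the two identical columns first, since that pair has the largest mutual information.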
Roadmap Smoothing Proximal Gradient for Structured Sparse Regression Structure Learning in Graphical Models Nonparametric Sparse Regression Summary and Timeline Q & A
Nonparametric Regression. Parametric Models; Additive Models [Hastie et al., 90]; Sparse Additive Models [Ravikumar et al., 09]; Generalized Nonparametric Models: model interaction between variables. Bottleneck: Computation
My Work and Proposed Research [Tropp et al., 06] • Greedy Learning Method • Additive Forward Regression (AFR) • Generalization of Orthogonal Matching Pursuit to Non-parametric setting • Generalized Forward Regression (GFR) • Penalized Regression Tree Method • Proposed Research: • Formulate the functional forms for structured penalties • Develop efficient algorithms for solving the corresponding nonparametric structured sparse regression
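As background for the greedy methods above, here is a minimal sketch of plain linear Orthogonal Matching Pursuit, the parametric procedure that the slide says Additive Forward Regression generalizes. It is not the nonparametric method itself; the function names, the column-list data layout, and the Gram-Schmidt update are illustrative assumptions. Columns are assumed nonzero and k is assumed to be at most the number of columns.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_select(columns, y, k):
    """Greedy OMP-style selection of up to k columns (features) to explain y."""
    residual = list(y)
    basis = []     # orthonormal basis spanning the selected columns
    selected = []
    for _ in range(k):
        # Pick the unused column most correlated with the current residual.
        scores = {j: abs(dot(c, residual)) / math.sqrt(dot(c, c))
                  for j, c in enumerate(columns) if j not in selected}
        j = max(scores, key=scores.get)
        # Orthogonalize the chosen column against the basis (Gram-Schmidt).
        q = list(columns[j])
        for b in basis:
            proj = dot(q, b)
            q = [qi - proj * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm < 1e-12:
            break  # the column is numerically in the span already
        q = [qi / norm for qi in q]
        # Removing the component along q matches a least-squares refit
        # on all selected columns.
        coef = dot(residual, q)
        residual = [r - coef * qi for r, qi in zip(residual, q)]
        basis.append(q)
        selected.append(j)
    return selected, residual
```

With `columns = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]`, `y = [3, 0, 1]` and `k = 2`, the procedure picks columns 0 and 2 and leaves a zero residual.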
Roadmap Smoothing Proximal Gradient for Structured Sparse Regression Structure Learning in Graphical Models Nonparametric Sparse Regression Summary and Timeline Q & A
Acknowledgements. Feedback: Xi Chen (xichen@cs.cmu.edu). My Committee Members: Jaime Carbonell (advisor), Tom Mitchell, Larry Wasserman, Robert Tibshirani. Acknowledgements: Eric P. Xing, John Lafferty, Seyoung Kim, Manuel Blum, Aarti Singh, Jeff Schneider, Javier Pena, Han Liu, Qihang Lin, Junming Yin, Xiong Liang, Tzu-Kuo Huang, Min Xu, Mladen Kolar, Yan Liu, Jingrui He, Yanjun Qi, Bing Bai. IBM Fellowship