Study of Sparse Classifier Design Algorithms. Sachin Nagargoje, 08449. Advisor: Prof. Shirish Shevade. 20th June 2013
Outline • Introduction • Sparsity w.r.t. features • Using regularizer/penalty • Traditional regularizer/penalty • Other regularizer/penalty • SparseNet • Sparsity w.r.t. support vectors / basis points • Various Techniques • SVM with L1 regularizer • Greedy Methods • Proposed Methods • Experimental Results • Conclusion / Future Work
What is Sparsity? • Sparsity w.r.t. features in the model • e.g., #non-zero coefficients of the model • Sparsity w.r.t. support vectors (Figure: support vectors x1, …, xd; sparser w.r.t. #training points but not w.r.t. #features. Vapnik 1992; Vapnik et al. 1995)
Need for Sparsity? • Faster prediction • Decreases complexity of the model • In the case of sparsity w.r.t. features • To remove • Redundant features • Irrelevant features • Noisy features • As the number of features increases • Data becomes sparse in high dimensions • Difficult to achieve low generalization error
Traditional ways to achieve Sparsity • Filter • Select features before ML Algorithm is run • E.g. Rank features and eliminate • Wrapper • Find best subset of features using ML techniques • E.g. Forward Selection, Random Selection • Embedded • Feature selection as part of ML Algorithm • E.g. L1 regularized linear regression
Using Regularizer/Penalty • Data x = [x1, x2, …, xn], labels y = [y1, y2, …, yn]^T, model w = [w1, w2, …, wp] • A type of embedded approach • E.g., in the case of linear least-squares regression (objective reconstructed below) • R represents the regularizer, e.g. L0 or L1. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267, 1994.
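The objective itself does not survive in the export; a minimal reconstruction of the penalized least-squares criterion, assuming the standard lasso-style setup described on the slide:

```latex
\min_{w \in \mathbb{R}^p}\; \frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i\bigr)^2
\;+\; \lambda\, R(w),
\qquad R(w) \in \bigl\{\, \|w\|_0,\ \|w\|_1 \,\bigr\}
```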
Traditional regularizers • L0 penalty (L0 norm): not continuous, non-convex, not differentiable at 0 • L1 penalty (L1 norm)
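For reference (the slide's formulas are not in the export), the two norms are:

```latex
\|w\|_0 \;=\; \#\{\, j : w_j \neq 0 \,\},
\qquad
\|w\|_1 \;=\; \sum_{j=1}^{p} |w_j|
```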
Traditional regularizers contd. • Example: rainfall prediction problem • Assume both models have the same training error • Model 1: coefficients (3, -5, 8, -4, 1) • L0 penalty = 1 + 1 + 1 + 1 + 1 = 5 • L1 penalty = |3| + |-5| + |8| + |-4| + |1| = 21 • Model 2: coefficients (-20, 0, 7, 18, 0) • L0 penalty = 1 + 0 + 1 + 1 + 0 = 3 • L1 penalty = |-20| + |0| + |7| + |18| + |0| = 45 • So the L0 norm chooses Model 2 while the L1 norm chooses Model 1: since L1 shrinks as well as selects, it often selects the denser model.
Other regularizers: MC+
MC+ (Figure: the MC+ penalty family; one end of the family is closer to the L1 norm, the other closer to the L0 norm.)
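The MC+ penalty itself is not reproduced in the export; following Mazumder, Friedman and Hastie, the family indexed by gamma > 1 is:

```latex
P_{\lambda,\gamma}(w) \;=\; \lambda \sum_{j=1}^{p} \int_{0}^{|w_j|}
\Bigl(1 - \frac{x}{\gamma\lambda}\Bigr)_{+} dx , \qquad \gamma > 1
% gamma -> infinity : approaches the L1 penalty lambda * ||w||_1
% gamma -> 1+       : approaches hard thresholding (L0-like behaviour)
```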
SparseNet • Uses coordinate descent with a non-convex penalty • Consider the least-squares problem for a single-feature data matrix • It has a closed-form solution (reconstructed below) • Our goal is to minimize the penalized objective. Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
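A hedged reconstruction of the omitted formulas, assuming (as in the SparseNet paper) that the single feature x is standardized so that the sum of x_i^2 equals 1:

```latex
\tilde{w} \;=\; \langle x, y\rangle \;=\; \sum_{i=1}^{n} x_i y_i
\quad\text{(unpenalized closed-form solution)},
\qquad
\min_{w}\; \frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - x_i w\bigr)^2 \;+\; \lambda\, P(w;\lambda,\gamma)
```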
SparseNet (cont.) • Define a soft-threshold operator (reconstructed below) • There are three cases: w > 0, w < 0, w = 0 • Convert the multiple-feature objective into a sequence of single-feature problems • Apply coordinate descent. Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
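For the L1 penalty the operator is the familiar soft-thresholding rule (a reconstruction; the MC+ version uses the corresponding non-convex threshold operator):

```latex
S(\tilde{w}, \lambda) \;=\; \operatorname{sign}(\tilde{w})\,\bigl(|\tilde{w}| - \lambda\bigr)_{+}
\;=\;
\begin{cases}
\tilde{w} - \lambda, & \tilde{w} > \lambda \\
\tilde{w} + \lambda, & \tilde{w} < -\lambda \\
0, & |\tilde{w}| \le \lambda
\end{cases}
```

covering the three cases w > 0, w < 0, w = 0.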
SparseNet (cont.) • Now extend the problem to a data matrix with multiple features: each coordinate update fits one feature to the partial residual, treating the remaining terms as constant • The soft-threshold operator is then applied coordinate-wise. Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009. Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Technical report, Annals of Applied Statistics, 2007.
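A minimal sketch of the coordinate-descent loop described on these slides, using the L1 soft-threshold rule (plain NumPy; the helper names are mine, not the authors' code; swapping soft_threshold for the MC+ threshold operator gives the SparseNet update):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent_lasso(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1.

    Assumes the columns of X are standardized to unit squared norm,
    matching the single-feature derivation above.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r_j = y - X @ w + X[:, j] * w[j]
            # Univariate least-squares fit on the partial residual,
            # then shrink with the soft-threshold operator.
            w[j] = soft_threshold(X[:, j] @ r_j, lam)
    return w
```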
SparseNet with L1 Penalty (Figure: results using the L1 penalty; an annotated point marks the chosen model.) Slice Localization dataset: A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.
SparseNet with MC+ Penalty (Figure: results using the MC+ penalty.) Slice Localization dataset: A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.
Sparsity w.r.t. Support Vectors • Kernel-based learning algorithms • f(x) is a linear combination of terms of the form K(x, xi), one per training point (see the expansion below)
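That is, the standard kernel expansion (the slide's formula does not survive the export; this is the usual form):

```latex
f(x) \;=\; \sum_{i=1}^{n} \alpha_i\, K(x, x_i) \;+\; b
```

Sparsity w.r.t. support vectors means most of the coefficients alpha_i are zero.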
Various techniques • Support Vector Machine (SVM) • SVM with L1 penalty • Greedy methods (wrapper): • Kernel Matching Pursuit (KMP) • Building SVMs with reduced classifier complexity (Keerthi et al.) • Proposed method: • Preprocessing the training points using filtering and then applying the wrapper methods
SVM with L1 regularizer • Setting: data (xi, yi), i = 1, …, n • SVM optimization problem (formula omitted in the export) • SVM with L1 penalty: solved using linear programming (a sketch follows) • Settings used: lambda in {1/100, 1/10, 1, 10, 100}, sigma in {1/16, 1/4, 1, 4, 16}
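The exact objective is not in the export; a common L1-penalized kernel SVM that reduces to a linear program looks like the following (a sketch of what was likely solved, not necessarily the thesis's exact formulation):

```latex
\min_{\alpha,\, b,\, \xi}\;\; \lambda \sum_{i=1}^{n} |\alpha_i| \;+\; \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.}\quad
y_i\Bigl(\sum_{j=1}^{n} \alpha_j K(x_i, x_j) + b\Bigr) \;\ge\; 1 - \xi_i,
\quad \xi_i \ge 0,\ \ i = 1,\dots,n
```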
SVM with L1 regularizer (Figures: decision boundaries and support vectors; RBF kernel on dummy data, polynomial and RBF kernels on the Banana data.)
SVM with L1 regularizer (Tables: results on benchmark datasets.) Our formulation gave sparser results than the standard SVM.
Kernel Matching Pursuit • Inspired by the signal-processing community • Decomposes a signal into a linear expansion of waveforms selected from a dictionary of functions • The set of basis points is constructed in a greedy fashion • Removes the requirement that the kernel matrix be positive definite • Allows us to directly control the sparsity (in terms of the number of support vectors). Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.
Kernel Matching Pursuit • Setup: D, a finite dictionary of functions (kernel functions centred on the training points); l = #training points; n = #support vectors chosen so far • At the (n+1)-th step, a weight and a dictionary element are chosen to best reduce the residual error (reconstructed below) • Predictor: a weighted sum of the chosen kernel functions, whose indexes are the indexes of the SVs. Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.
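A reconstruction of the greedy step and the predictor, following Vincent and Bengio's basic KMP:

```latex
(\gamma_{n+1},\, d_{n+1}) \;=\;
\arg\min_{\gamma \in \mathbb{R},\ d \in \mathcal{D}}
\bigl\|\, y - (f_n + \gamma\, d) \,\bigr\|^2 ,
\qquad
f_n(x) \;=\; \sum_{k=1}^{n} \gamma_k\, K(x, x_{i_k})
```

where i_1, …, i_n are the indexes of the selected support vectors.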
Basis points versus Support Vectors (Figure: basis points selected by the greedy method compared with SVM support vectors on the Banana dataset.) Dataset: http://mldata.org/repository/data/viewslug/banana-ida/. S. Sathiya Keerthi, et al. Building support vector machines with reduced classifier complexity. JMLR, 2006. Vladimir Vapnik, Steven E. Golowich, and Alex J. Smola. Support vector method for function approximation, regression estimation and signal processing. NIPS, 1996.
Proposed methods • Two-step process: • Step 1: choose a subset of the training set via • Modified BIRCH clustering • K-means clustering • GMM clustering • Step 2: apply a greedy algorithm • Kernel Matching Pursuit (KMP) • Building SVMs with reduced classifier complexity (Keerthi et al.) • (Pipeline: training points → clustering (modified BIRCH / k-means / GMM) → basis points → KMP / Keerthi et al. → model; a sketch follows this slide.) S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 2006. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: analysis and implementation, 2002.
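A minimal sketch of the two-step pipeline, using k-means for step 1 and a basic squared-loss KMP without back-fitting for step 2 (illustrative only; the RBF kernel, parameter values, and function names are assumptions, not the thesis code):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_kernel(X, Z, sigma=1.0):
    """RBF kernel matrix between the rows of X and the rows of Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cluster_then_kmp(X, y, n_clusters=50, n_basis=10, sigma=1.0):
    """Step 1: cluster the training points and keep the centroids as the
    candidate basis set.  Step 2: greedy kernel matching pursuit on the
    squared loss, picking n_basis centroids."""
    centroids = KMeans(n_clusters=n_clusters, n_init=10).fit(X).cluster_centers_
    K = rbf_kernel(X, centroids, sigma)            # n x n_clusters dictionary
    residual = y.astype(float).copy()
    chosen, weights = [], []
    for _ in range(n_basis):
        # Best single dictionary column (and weight) for the current residual.
        norms = (K ** 2).sum(axis=0)
        gammas = K.T @ residual / np.maximum(norms, 1e-12)
        scores = (gammas ** 2) * norms             # squared-error reduction
        scores[chosen] = -np.inf                   # do not reselect a column
        j = int(np.argmax(scores))
        chosen.append(j)
        weights.append(gammas[j])
        residual -= gammas[j] * K[:, j]
    return centroids[chosen], np.array(weights)
```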
BIRCH basics • Balanced Iterative Reducing and Clustering using Hierarchies • Uses one scan over the dataset, so it suits large datasets • Each cluster is summarized by a CF vector (N, LS, SS): N = #data points, LS = linear sum, SS = squared sum • Merging of two clusters: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) • CF tree: a height-balanced tree with two factors: • B (branching factor): each non-leaf node contains at most B entries [CFi, childi], i = 1..B, where CFi is the sub-cluster represented by childi; a leaf node contains at most L entries [CFi], i = 1..L • T (threshold): radius/diameter of a cluster
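A small illustrative sketch of the CF-vector bookkeeping described above (not the thesis implementation; class and method names are mine):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class ClusteringFeature:
    """CF vector (N, LS, SS) summarizing a sub-cluster."""
    n: int = 0
    ls: np.ndarray = field(default_factory=lambda: np.zeros(0))
    ss: float = 0.0

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        if self.n == 0:
            self.ls = np.zeros_like(x)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        """CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
        if self.n == 0:
            self.ls = np.zeros_like(other.ls)
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of member points from the centroid,
        # computable from (N, LS, SS) alone.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))
```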
BIRCH example (Figure: CF tree with B = 3, L = 3): a new subcluster sc8 is inserted, descending from the root into leaf node LN1. www.cs.uvm.edu/~xwu/kdd/Birch-09.ppt. Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96).
BIRCH example (cont.): the branching factor of the leaf node exceeds 3, so LN1 is split into LN1' and LN1''. (Figure: CF tree after the leaf split.)
BIRCH example (cont.): the branching factor of the non-leaf node exceeds 3, so the root is split and the height of the CF tree increases by one (new non-leaf nodes NLN1 and NLN2). (Figure: CF tree after the root split.)
BIRCH example (cont.): a new point is inserted into the CF tree. (Figure: CF tree with the new point.)
BIRCH example (cont.): here an alien point falls inside a leaf node, so its subcluster is broken into parts; the branching factor of the leaf node then exceeds 3, so LN3 is split. (Figure: CF tree after this split.)
Clusters using modified BIRCH (Figure: the resulting clusters and their centroids.)
Modified BIRCH with KMP (Table: results; our formulation, highlighted in red, gave decent results.)
Multi-class modified BIRCH with KMP (Table: results; all multi-class datasets gave better results.)
K-means and GMM with KMP (Table: results.) These gave sparse models but lower test-set accuracy (except for the entries highlighted in blue).
Conclusion • Studied various sparse classifier design algorithms • Better results were obtained using SVM with the L1 penalty • Modified BIRCH with KMP: • gave decent results on binary datasets • gave good results on multi-class datasets • saved kernel calculations (and time), taking roughly 1/5th of the original time • Clustering is an easy (though time-consuming) way to choose basis points, but not very effective • Future work: • Explore greedy embedded sparse multi-class classification with different loss functions, e.g. logistic loss • Explore such techniques for semi-supervised learning