Efficient and Numerically Stable Sparse Learning
Sihong Xie¹, Wei Fan², Olivier Verscheure², and Jiangtao Ren³
¹University of Illinois at Chicago, USA  ²IBM T.J. Watson Research Center, New York, USA  ³Sun Yat-Sen University, Guangzhou, China
Applications • Signal processing (compressive sensing, MRI, coding, etc.) • Computational biology (DNA array sensing, gene expression pattern annotation) • Geophysical data analysis • Machine learning
Algorithms • Greedy selection • Via L0 regularization • Boosting and forward feature selection: not suitable for large-scale problems • Convex optimization • Via L1 regularization (e.g., Lasso) • IPM (interior point method): suited to medium-size problems • Homotopy methods: compute the full regularization path • Gradient descent • Online algorithms (stochastic gradient descent)
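For the convex-optimization route, a minimal sketch (assuming a standard Lasso objective 0.5·||Xw − y||² + λ||w||₁; the function names and parameters are illustrative, not from the paper) of proximal gradient descent, where soft-thresholding is the proximal map of the L1 penalty:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_lasso(X, y, lam, step, n_iters=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient descent (ISTA)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                  # gradient of the smooth least-squares part
        w = soft_threshold(w - step * grad, step * lam)  # gradient step, then L1 prox
    return w
```

The step size must be at most 1 / L, where L is the largest eigenvalue of XᵀX, for the iteration to converge.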
Rising awareness of numerical problems in ML • Efficiency • SVMs: going beyond black-box optimization solvers • Large-scale problems: parallelization • Eigenvalue problems: randomization • Stability • Gaussian process computation: solving large systems of linear equations, matrix inversion • Convergence of gradient descent: matrix iteration computation • For more topics in numerical mathematics for ML, see the ICML Workshop on Numerical Methods in Machine Learning 2009
Stability in Sparse Learning • Iterative Hard Thresholding (IHT) • Solves the sparsity-constrained least-squares problem: minimize ||y − Xw||² subject to ||w||₀ ≤ k • Combines gradient descent with hard thresholding of the iterate
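A minimal sketch of IHT for this formulation; the step size, iteration count, and function names are illustrative and not tied to the paper's GraDes variant:

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w, zero out the rest."""
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]
    out[idx] = w[idx]
    return out

def iht(X, y, k, step, n_iters=200):
    """Iterative hard thresholding: gradient step on ||y - Xw||^2, then project onto k-sparse vectors."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = hard_threshold(w + step * X.T @ (y - X @ w), k)
    return w
```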
Stability in Sparse Learning • Iterative Hard Thresholding (IHT) • Simple and scalable • Under the RIP assumption, previous work [BDIHT09, GK09] shows that iterative hard thresholding converges • Without an assumption on the spectral radius of the iteration matrix, such methods may diverge
Stability in Sparse Learning • Gradient descent viewed as a matrix iteration on the error vector • Error vector of gradient descent • Error vector of IHT (the recursions are written out below)
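The recursion behind this slide, reconstructed from the standard analysis (assuming a noiseless target y = Xw*; η is the learning rate and H_k the hard-thresholding operator):

```latex
% Gradient step: w^{t+1} = w^t + \eta X^\top (y - X w^t).
% With e^t := w^t - w^\ast and y = X w^\ast, the error obeys a matrix iteration:
\begin{aligned}
e^{t+1} &= \bigl(I - \eta X^\top X\bigr)\, e^t, \\
\text{IHT:}\quad w^{t+1} &= H_k\!\bigl(w^t + \eta X^\top (y - X w^t)\bigr),
\end{aligned}
% which converges only if the spectral radius satisfies \rho(I - \eta X^\top X) < 1
% (for IHT, the analogous condition is imposed on supports of size O(k) via RIP).
```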
Stability in Sparse Learning • Mirror Descent Algorithm for sparse learning (SMIDAS) • The algorithm maintains a dual vector • Step 1: recover the primal predictor from the dual vector • Step 2: take a gradient step and soft-threshold the dual vector
Stability in Sparse Learning • Elements of the primal vector are exponentially sensitive to the corresponding elements of the dual vector (d is the dimensionality of the data); the primal vector is needed for prediction • Due to limited floating-point precision, small components are omitted when computing the prediction
Stability in Sparse Learning • Example • Suppose the data are … (a numerical sketch of the same effect follows)
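A small numerical illustration of this sensitivity, assuming the p-norm link w_j = sign(θ_j)|θ_j|^{p−1}/||θ||_p^{p−2} used by SMIDAS-style mirror descent; the numbers below are made up for illustration and are not the slide's example:

```python
import numpy as np

def pnorm_link(theta, p):
    """Map the dual vector to the primal predictor:
       w_j = sign(theta_j) * |theta_j|**(p-1) / ||theta||_p**(p-2)."""
    a = np.abs(theta)
    return np.sign(theta) * a**(p - 1) / (np.linalg.norm(theta, p)**(p - 2))

d = 10**7                       # an illustrative high dimensionality
p = 2 * np.log(d)               # p = O(ln d), here about 32
theta = np.array([1.0, 0.5])    # two dual components differing by only a factor of 2
w = pnorm_link(theta, p)
print(w[1] / w[0])              # ~ 0.5**(p-1) ~ 4e-10: the smaller feature nearly vanishes in the primal
```

A factor-of-2 gap in the dual becomes roughly nine orders of magnitude in the primal, so such components are easily lost against the largest one in a double-precision dot product.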
Efficiency of Sparse Learning • Overly complicated models are produced, with lower accuracy • Sparse models • Less computational cost • Lower generalization bound • Existing sparse learning algorithms may not trade off well between sparsity and accuracy • Can we get accurate models with higher sparsity? • For a theoretical treatment of the trade-off between accuracy and sparsity, see S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.
The proposed method: perceptron + soft-thresholding • Motivation • Soft-thresholding: L1 regularization for a sparse model • Perceptron: avoids updates when the current features already predict well ("don't complicate the model when unnecessary") • Convergence under soft-thresholding and limited precision (Lemma 2 and Theorem 1) • Compression (Theorem 2) • Generalization error bound (Theorem 3)
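A minimal sketch of the perceptron-plus-soft-thresholding idea; it illustrates the general scheme (mistake-driven update followed by the L1 proximal step) rather than the paper's exact algorithm, and all names, thresholds, and step sizes are illustrative:

```python
import numpy as np

def soft_threshold(w, tau):
    """Element-wise soft-thresholding (the L1 proximal map)."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def sparse_perceptron(stream, dim, lam, lr=1.0):
    """Mistake-driven online learner: the model is updated and re-thresholded
       only when the current sparse model gets an example wrong."""
    w = np.zeros(dim)
    for x, y in stream:                    # x: feature vector, y in {-1, +1}
        if y * np.dot(w, x) <= 0:          # perceptron condition: act only on mistakes
            w = soft_threshold(w + lr * y * x, lam)
    return w
```

Because updates happen only on mistakes, the model stays unchanged (and sparse) whenever the already-selected features predict well, which is the "don't complicate the model when unnecessary" principle on the slide.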
Experiments • Datasets: Large Scale Contest (http://largescale.first.fraunhofer.de/instructions/)
Experiments: divergence of IHT • For IHT to converge, the spectral radius of the iteration matrix must be less than 1 • The iteration matrices found in practice do not meet this condition • For IHT (GraDes) with the learning rate set to 1/3 and 1/100, respectively, we found …
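One way to check the spectral-radius condition on a given design matrix (a sketch on random data; GraDes' actual step size is defined via restricted-isometry constants, which are not computed here):

```python
import numpy as np

def iht_iteration_spectral_radius(X, step):
    """Spectral radius of the unthresholded iteration matrix I - step * X^T X.
       A value >= 1 means the plain gradient iteration can diverge."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of X^T X (all >= 0)
    return np.max(np.abs(1.0 - step * eigvals))

# Illustrative check on random data (not the contest datasets):
X = np.random.randn(100, 500)
for step in (1 / 3, 1 / 100):
    print(step, iht_iteration_spectral_radius(X, step))
```

When d > n, XᵀX has zero eigenvalues and the full iteration matrix always has spectral radius at least 1, which is why convergence proofs restrict attention to sparse supports via RIP.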
Experiments: numerical problems of MDA • Train models with 40% density • The parameter p is set to 2 ln(d) (p = 33) and 0.5 ln(d), respectively • Percentage of model elements within [e_m − 52, e_m] (binary exponents relative to the largest element e_m), indicating how many features will be lost during prediction • The dynamic range indicates how widely the elements of the model can vary
Experiments: numerical problems of MDA • How the parameter p = O(ln(d)) affects performance • With smaller p, the algorithm behaves more like ordinary stochastic gradient descent [GL1999] • With larger p, truncation occurs during prediction • When the dimensionality is high, MDA becomes numerically unstable • [GL1999] Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pages 1–11. ACM Press, New York, NY, 1999.
Experiments: overall comparison • The proposed algorithm vs. 3 baseline sparse learning algorithms (all with the logistic loss) • SMIDAS (MDA-based [ST2009]) • TG (Truncated Gradient [LLZ2009]) • SCD (Stochastic Coordinate Descent [ST2009]) • Parameter tuning: run each algorithm 10 times and report the average accuracy on the validation set • [ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929–936, 2009. • [LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
Experiments: overall comparison • Accuracy under the same model density • First 7 datasets: select at most 40% of features • Webspam: select at most 0.1% of features • Stop running once the maximum percentage of features has been selected
Experiments: overall comparison • Accuracy vs. sparsity • Generalizability: the proposed algorithm consistently outperforms the baselines • Convergence: on 3 out of 5 tasks, it stopped updating the model before reaching the maximum density (40% of features) • Sparsity: on task 1, it outperforms the others with 10% fewer features; on task 3, it ties with the best baseline while using 20% fewer features • On tasks 1–7, SMIDAS is numerically unstable: the smaller p, the better its accuracy, but it is beaten by all other algorithms