Efficient and Numerically Stable Sparse Learning
Sihong Xie [1], Wei Fan [2], Olivier Verscheure [2], and Jiangtao Ren [3]
[1] University of Illinois at Chicago, USA; [2] IBM T.J. Watson Research Center, New York, USA; [3] Sun Yat-Sen University, Guangzhou, China
Sparse Linear Model • Input: training data from the Large Scale Learning Contest (http://largescale.first.fraunhofer.de/instructions/) • Output: sparse linear model • Learning formulation: minimize a loss over the training data plus a sparse regularization term
Objectives • Sparsity • Accuracy • Numerical stability (friendly to limited floating-point precision) • Scalability to large-scale training data (many rows and columns)
Outline • Numerical instability of two popular approaches • Proposed sparse linear model: online, numerically stable, parallelizable, with good sparsity (don't take features unless necessary) • Experimental results
Stability in Sparse learning • Numerical problems of direct iterative methods • Numerical problems of mirror descent
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Solve the following optimization problem: minimize the error ||y - Xw||^2 subject to the L-0 constraint ||w||_0 <= s, where X is the data matrix, y the label vector, w the linear model, and s the sparsity degree
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Incorporates gradient descent with hard thresholding; at each iteration: (1) take a step along the negative gradient, w <- w + X^T (y - Xw); (2) apply hard thresholding, keeping only the s top significant elements of w (see the sketch below)
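A minimal sketch of this iteration in Python, assuming a unit step size and dense NumPy arrays (illustrative only, not the authors' implementation):

```python
import numpy as np

def hard_threshold(w, s):
    """Keep the s largest-magnitude elements of w and zero out the rest."""
    out = np.zeros_like(w)
    top = np.argsort(np.abs(w))[-s:]   # indices of the s top significant elements
    out[top] = w[top]
    return out

def iht(X, y, s, n_iter=100):
    """Approximately solve: minimize ||y - Xw||^2 subject to ||w||_0 <= s."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = w + X.T @ (y - X @ w)   # step along the negative gradient
        w = hard_threshold(w, s)    # hard thresholding
    return w
```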
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Advantages: simple and scalable • Convergence of IHT: condition on the next slide
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • For the IHT algorithm to converge, the iteration matrix must have spectral radius less than 1 • Spectral radius: the largest absolute value of the eigenvalues of the iteration matrix
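A small numerical illustration of this condition (a sketch: with a unit step size the IHT iteration is affine with matrix I - X^T X, and the restriction to the active support is ignored here):

```python
import numpy as np

def spectral_radius(A):
    """Largest absolute eigenvalue of A."""
    return np.max(np.abs(np.linalg.eigvals(A)))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))        # raw, unnormalized data matrix
M = np.eye(20) - X.T @ X                 # iteration matrix for a unit step size
print(spectral_radius(M))                # far above 1: the iteration can diverge

X2 = X / (1.01 * np.linalg.norm(X, 2))   # rescale so that ||X||_2 < 1
M2 = np.eye(20) - X2.T @ X2
print(spectral_radius(M2))               # below 1: the iteration contracts
```

This is why, in practice, IHT-style iterations typically require the data matrix to be normalized or the step size reduced before iterating.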
Stability in Sparse learning • [Figure] Example of error growth of IHT
Stability in Sparse learning • Numerical problems of direct iterative methods • Numerical problems of mirror descent
Stability in Sparse learning • Mirror Descent Algorithm (MDA) • Solves the L-1 regularized formulation: minimize the loss plus lambda * ||w||_1 • Maintains two vectors: primal and dual
Stability in Sparse learning • MDA maintains a dual vector and a primal vector: (1) a gradient step with soft-thresholding in the dual space yields a sparse dual vector; (2) a link function, parameterized by p, maps the dual vector back to the primal space • Illustration adopted from Peter Bartlett's lecture slides, http://www.cs.berkeley.edu/~bartlett/courses/281b-sp08/
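A simplified sketch of this primal/dual loop, following the SMIDAS-style updates of [ST2009]; the logistic loss, fixed step size, and variable names here are illustrative assumptions:

```python
import numpy as np

def soft_threshold(theta, lam):
    """Shrink each entry toward zero by lam (soft-thresholding)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def link(theta, p):
    """p-norm link function (gradient of 0.5 * ||theta||_p^2): dual -> primal."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def mda_step(theta, x, y, p, eta, lam):
    """One online update on example (x, y) with y in {-1, +1}, logistic loss."""
    w = link(theta, p)                            # map the dual vector to the primal
    grad = -y * x / (1.0 + np.exp(y * (w @ x)))   # gradient of the logistic loss
    theta = theta - eta * grad                    # gradient step in the dual space
    return soft_threshold(theta, eta * lam)       # keep the dual vector sparse
```

Note that the link function raises dual entries to the power p - 1; when p is on the order of 2 ln(d), entries with magnitude below 1 are pushed toward the underflow threshold, which is exactly the precision issue discussed next.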
Stability in Sparse learning • Floating-point number system: a number is stored as a significand with a limited number of significant digits times a power of the base; the MDA link function pushes values toward the limits of this representation • Example: on a computer with only 4 significant digits, 0.1 + 0.00001 = 0.1
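The same effect can be reproduced with half-precision floats (roughly 3 significant decimal digits), used here as a stand-in for the slide's 4-digit machine:

```python
import numpy as np

a = np.float16(0.1)
b = np.float16(1e-5)
print(a + b == a)   # True: the small addend is absorbed completely
```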
Stability in Sparse learning • The difference between elements is amplified by the link function: the gap between two elements of the primal vector is far larger than the gap between the corresponding elements of the dual vector
Experiments Numerical problem of MDA • Experimental settings • Train models with 40% density • Parameter p is set to 2 ln(d) (p = 33) and 0.5 ln(d), respectively [ST2009] [ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss minimization. In ICML, pages 929–936, 2009.
Experiments Numerical problem of MDA • Performance criteria • Percentage of elements that are truncated during prediction • Dynamic range (see the sketch below)
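One possible way to compute these two criteria for a learned weight vector w, assuming truncation means underflowing to zero when weights are stored at lower precision (the paper's exact definitions may differ):

```python
import numpy as np

def truncated_fraction(w, dtype=np.float32):
    """Fraction of nonzero weights that become zero at lower precision."""
    nz = w[w != 0]
    return float(np.mean(nz.astype(dtype) == 0)) if nz.size else 0.0

def dynamic_range(w):
    """Ratio between the largest and smallest nonzero magnitudes of w."""
    mags = np.abs(w[w != 0])
    return float(mags.max() / mags.min()) if mags.size else 0.0
```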
Objectives of a Simple Approach • Numerically Stable • Computationally efficient • Online, parallelizable • Accurate models with higher sparsity • Costly to obtain too many features (e.g. medical diagnostics) For an excellent theoretical treatment of trading off between accuracy and sparsity see S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.
The proposed method • Algorithm: an online, perceptron-style update combined with soft-thresholding; the model is updated only when an SVM-like margin condition is violated (sketch below)
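An illustrative sketch of this update pattern; the margin threshold, step size, and regularization constant are assumed hyperparameters, and this is not the paper's verbatim pseudocode:

```python
import numpy as np

def soft_threshold(w, lam):
    """Shrink each weight toward zero by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def train(examples, dim, eta=0.1, lam=0.01, margin=1.0):
    """Online pass over (x, y) pairs with y in {-1, +1}."""
    w = np.zeros(dim)
    for x, y in examples:
        if y * (w @ x) < margin:              # SVM-like margin check
            w = w + eta * y * x               # perceptron-style correction
            w = soft_threshold(w, eta * lam)  # L1 shrinkage keeps w sparse
    return w
```

Because the update is skipped whenever the margin is already satisfied, new features are taken only when the current ones cannot predict well enough, which is how sparsity is obtained beyond the soft-thresholding alone.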
Numerical Stability and Scalability Considerations • Numerical stability • Fewer conditions on the data matrix: no spectral-radius requirement and no change of scales • Less demanding on precision (works under limited precision, Theorem 1) • Under mild conditions, the proposed method converges even for a large number of iterations, with a bound stated in terms of the machine precision
Numerical Stability and Scalability Considerations • Online fashion: one example at a time • Parallelization for intensive data access: data can be distributed across computers, each computing its part of the inner product • Small network communication (only partial inner products and model-update signals are exchanged), as sketched below
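A toy sketch of the feature-partitioned inner product described above (single-process, purely to illustrate the partitioning; in a real deployment each block would live on a separate machine):

```python
import numpy as np

def partial_dot(w_block, x_block):
    """Computed locally on the worker that stores this block of features."""
    return float(w_block @ x_block)

def distributed_margin(w_blocks, x_blocks):
    """The coordinator sums the partial products to recover w @ x."""
    return sum(partial_dot(wb, xb) for wb, xb in zip(w_blocks, x_blocks))

# Example: split a weight vector and an example into two feature blocks.
w = np.arange(6, dtype=float)
x = np.ones(6)
print(distributed_margin(np.split(w, 2), np.split(x, 2)))  # 15.0, equal to w @ x
```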
The proposed method properties • Soft-thresholding • L1-regularization for a sparse model • Perceptron: avoids updates when the current features already predict well, i.e., sparsity (don't complicate the model when unnecessary) • Convergence under soft-thresholding and limited precision (Lemma 2 and Theorem 1): numerical stability • Generalization error bound (Theorem 3)
The proposed method • A toy example: after the 1st, 2nd, and 3rd updates, the proposed method keeps a sparse model that is enough to predict well (the margin indicates a good-enough model, so enough features), whereas TG (truncated gradient) ends up with a relatively dense model
Experiments Overall comparison • The proposed algorithm + 3 baseline sparse learning algorithms (all with logistic loss function) • SMIDAS (MDA based [ST2009]): p = 0.5log(d) (cannot run with bigger p due to numerical problem) • TG (Truncated Gradient [LLZ2009]) • SCD (Stochastic Coordinate Descent [ST2009]) [ST2009] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for l1 regularized loss minimization. Proceedings of the 26th International Conference on Machine Learning, pages 929-936, 2009. [LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
Experiments Overall comparison • Accuracy under the same model density • First 7 datasets: at most 40% of the features are selected • Webspam: at most 0.1% of the features are selected • Training stops when the maximum percentage of features has been selected
Experiments Overall comparison • Accuracy vs. sparsity • The proposed algorithm works consistently better than the other baselines • On 5 out of 8 tasks, it stopped updating the model before reaching the maximum density (40% of features), i.e., it converged to a sparse model • On task 1, it outperforms the others with 10% of the features • On task 3, it ties with the best baseline using 20% of the features
Conclusion • Numerical stability of sparse learning • Gradient descent using matrix iteration may diverge without the spectral-radius assumption • When the dimensionality is high, MDA produces many infinitesimal elements • Trading off sparsity and accuracy • Other methods (TG, SCD) are unable to train accurate models at high sparsity • The proposed approach is numerically stable, online, parallelizable, and convergent; sparsity is controlled by the margin together with L-1 regularization and soft-thresholding • Experimental code is available at www.weifan.info