Using Analytic QP and Sparseness to Speed Training of Support Vector Machines John C. Platt Presented by: Travis Desell
Overview • Introduction • Motivation • General SVMs • General SVM training • Related Work • Sequential Minimal Optimization (SMO) • Choosing the smallest optimization problem • Solving the smallest optimization problem • Benchmarks • Conclusion • Remarks & Future Work • References
Motivation • Traditional SVM training algorithms • Require a quadratic programming (QP) package • SVM training is slow, especially for large problems • Sequential Minimal Optimization (SMO) • Requires no QP package • Easy to implement • Often faster • Good scalability properties
General SVMs u = Σi αi yi K(xi, x) − b (1) • u : SVM output • αi : weight of training example i (a Lagrange multiplier) • yi ∈ {-1, +1} : desired output for example i • b : threshold • xi : stored training example (vector) • x : input (vector) • K : kernel function measuring the similarity of xi to x
General SVMs (2) • For linear SVMs, K is linear, so (1) can be expressed as the dot product of w and x minus the threshold: u = w · x − b (2) w = Σi αi yi xi (3) • Where w, x, and xi are vectors
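As a small illustration of (1)-(3), the sketch below evaluates the SVM output for a kernel SVM and for the folded linear form. This is an illustrative Python/NumPy sketch; the function names (svm_output, linear_svm_output) are mine, not from the paper.

```python
import numpy as np

def svm_output(x, X_train, alphas, ys, b, kernel):
    """Eq. (1): u = sum_i alpha_i * y_i * K(x_i, x) - b."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, ys, X_train)) - b

def linear_kernel(u, v):
    """Linear kernel: a plain dot product."""
    return float(np.dot(u, v))

def linear_svm_output(x, w, b):
    """Eqs. (2)-(3): for a linear kernel the sum folds into one weight vector w."""
    return float(np.dot(w, x)) - b
```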
General SVM Training • Training an SVM means finding the αi, expressed as minimizing a dual quadratic form: minα Ψ(α) = minα ½ Σi Σj yi yj K(xi, xj) αi αj − Σi αi (4) • Subject to the box constraints: 0 <= αi <= C, for all i (5) • And the linear equality constraint: Σi yi αi = 0 (6) • The αi are Lagrange multipliers of a primal QP problem: there is a one-to-one correspondence between each αi and each training example xi
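To make (4)-(6) concrete, here is a minimal sketch of the dual objective and a feasibility check, assuming the n x n kernel (Gram) matrix K has been precomputed; the names dual_objective and is_feasible are illustrative, not part of SMO itself.

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Eq. (4): Psi(alpha) = 1/2 * sum_ij y_i y_j K_ij alpha_i alpha_j - sum_i alpha_i."""
    return 0.5 * alpha @ (np.outer(y, y) * K) @ alpha - alpha.sum()

def is_feasible(alpha, y, C, tol=1e-12):
    """Constraints (5)-(6): box constraint and the linear equality constraint."""
    return bool(np.all(alpha >= -tol) and np.all(alpha <= C + tol)
                and abs(float(np.dot(y, alpha))) <= tol)
```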
General SVM Training (2) • SMO solves the QP expressed in (4-6) • Terminates when all of the Karush-Kuhn-Tucker (KKT) optimality conditions are fulfilled: αi = 0 -> yi ui >= 1 (7) 0 < αi < C -> yi ui = 1 (8) αi = C -> yi ui <= 1 (9) • Where ui is the SVM output for the ith training example
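In practice (7)-(9) are checked to within a small tolerance. A sketch of that check for a single example (the function name and the exact tolerance handling are my own):

```python
def violates_kkt(alpha_i, y_i, u_i, C, tol=1e-3):
    """Check conditions (7)-(9) for one example, within tolerance tol.
    With r_i = y_i*u_i - 1: r_i >= 0 is required when alpha_i = 0,
    r_i = 0 when 0 < alpha_i < C, and r_i <= 0 when alpha_i = C."""
    r_i = y_i * u_i - 1.0
    return (r_i < -tol and alpha_i < C) or (r_i > tol and alpha_i > 0)
```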
Related Work • “Chunking” [9] • Removing training examples whose αi = 0 does not change the solution. • Breaks the large QP problem down into smaller QP sub-problems in order to identify the non-zero αi. • Each QP sub-problem consists of every non-zero αi from the previous sub-problem combined with the M worst examples that violate (7-9), for some M [1]. • The last step solves the entire QP problem, since all of the non-zero αi have been found. • Cannot handle large-scale training problems if standard QP techniques are used; Kaufman [3] describes a QP algorithm that overcomes this.
Related Work (2) • Decomposition [6]: • Breaks the large QP problem into smaller QP sub-problems. • Osuna et al. [6] suggest using a fixed-size matrix for every sub-problem, which allows very large training sets. • Joachims [2] suggests adding and subtracting examples according to heuristics for rapid convergence. • Until SMO, these methods required a numerical QP library, which can be costly or slow.
Sequential Minimal Optimization • SMO decomposes the overall QP problem (4-6) into fixed-size QP sub-problems. • It chooses the smallest optimization problem (SOP) at each step. • Because of the linear equality constraint (6), the smallest possible problem involves two elements of α. • SMO repeatedly chooses two elements of α to jointly optimize until the overall QP problem is solved.
Choosing the SOP • Heuristic-based approach • Terminates when the entire training set obeys (7-9) to within ε (typically ε <= 10^-3) • Repeatedly finds α1 and α2 and optimizes them until termination
Finding α1 • “First choice heuristic” • Searches through the examples most likely to violate the conditions (the non-bound subset) • αi at the bounds are likely to stay there, while non-bound αi will move as others are optimized • “Shrinking heuristic” • Ignores examples which fulfill (7-9) by more than the worst example violates them • These examples are skipped until a final pass at the end, which ensures that all examples fulfill (7-9)
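Putting these heuristics together, the outer loop can be sketched as alternating full passes over the training set with passes over the non-bound subset, stopping only when a full pass changes nothing. This is a sketch under assumptions: the shrinking of well-satisfied examples is omitted, and examine_example is an assumed helper that applies the second-choice heuristic and takes one joint optimization step, returning 1 if it changed anything.

```python
def smo_outer_loop(n, alphas, C, examine_example):
    """Alternate full passes with passes over the non-bound subset (0 < alpha_i < C)."""
    num_changed = 0
    examine_all = True
    while num_changed > 0 or examine_all:
        if examine_all:
            candidates = range(n)                                 # pass over every example
        else:
            candidates = [i for i in range(n) if 0 < alphas[i] < C]  # non-bound subset only
        num_changed = sum(examine_example(i) for i in candidates)
        if examine_all:
            examine_all = False
        elif num_changed == 0:
            examine_all = True                                    # one last full pass to verify (7)-(9) everywhere
    return alphas
```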
Finding α2 • Chosen to maximize the size of the step taken during the joint optimization of α1 and α2 • The step size is approximated by |E1 − E2|, where each non-bound example has a cached error value E • If E1 is positive, chooses the α2 with minimum E2 • If E1 is negative, chooses the α2 with maximum E2
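A sketch of this second-choice heuristic over the cached errors of the non-bound examples (choose_second is an illustrative name; the fallback heuristics Platt uses when this choice makes no progress are not shown):

```python
def choose_second(E, i1, non_bound):
    """Pick i2 from the non-bound set to maximize the approximate step |E1 - E2|,
    given the cached errors E (indexable by example index)."""
    E1 = E[i1]
    candidates = [i for i in non_bound if i != i1]
    if not candidates:
        return None
    if E1 > 0:
        return min(candidates, key=lambda j: E[j])   # most negative E2
    return max(candidates, key=lambda j: E[j])       # most positive E2
```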
Solving the SOP • Computes the minimum along the direction of the linear equality constraint: α2new = α2 + y2 (E1 − E2) / (K(x1,x1) + K(x2,x2) − 2 K(x1,x2)) (10) Ei = ui − yi (11) • Clips α2new to [L, H]: L = max(0, α2 + s α1 − ½(s+1)C) (12) H = min(C, α2 + s α1 − ½(s−1)C) (13) s = y1 y2 (14) • Calculates α1new: α1new = α1 + s (α2 − α2new,clipped) (15)
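A sketch of the analytic two-variable step, following (10)-(15); take_step is my own name, and the degenerate case η <= 0 (which Platt handles separately) is simply skipped here.

```python
def take_step(alpha1, alpha2, y1, y2, E1, E2, K11, K12, K22, C):
    """Analytically solve the two-variable QP sub-problem (eqs. 10-15).
    Returns (alpha1_new, alpha2_new_clipped), or None if eta <= 0."""
    s = y1 * y2                                    # (14)
    eta = K11 + K22 - 2.0 * K12                    # curvature along the constraint line
    if eta <= 0:
        return None                                # non-positive curvature: handled separately in the paper
    a2_new = alpha2 + y2 * (E1 - E2) / eta         # unconstrained minimum (10)
    # clip to the feasible segment [L, H] imposed by 0 <= alpha <= C, eqs. (12)-(13)
    L = max(0.0, alpha2 + s * alpha1 - 0.5 * (s + 1) * C)
    H = min(C,   alpha2 + s * alpha1 - 0.5 * (s - 1) * C)
    a2_clipped = min(max(a2_new, L), H)
    a1_new = alpha1 + s * (alpha2 - a2_clipped)    # (15): preserves sum_i y_i alpha_i
    return a1_new, a2_clipped
```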
Benchmarks • UCI Adult: the SVM is given 14 attributes of a census form and is asked to predict whether the household income is greater than $50k. The 8 categorical and 6 continuous attributes are converted into 123 binary attributes. • Web: classify whether a web page belongs to a category or not. 300 sparse binary keyword attributes. • MNIST: one classifier is trained. 784-dimensional, non-binary vectors, stored as sparse vectors.
Description of Benchmarks • Web and Adult are trained with both linear and Gaussian SVMs. • Experiments are performed with and without sparse inputs, and with and without kernel caching • PCG chunking always uses a kernel cache
Conclusions • PCG chunking is slower than SMO; SMO ignores examples whose Lagrange multipliers are at C. • The overhead of PCG chunking is not in the kernel (kernel optimizations do not greatly affect its time).
Conclusions (2) • SVMlight solves 10-dimensional QP sub-problems. • Differences are mostly due to kernel optimizations and numerical QP overhead. • SMO is faster on linear problems due to linear SVM folding, but SVMlight could potentially use this as well. • SVMlight benefits from its complex kernel cache at large problem sizes, while SMO does not have one and therefore does not benefit.
Remarks & Future Work • Heuristic-based approach to finding α1 and α2 to optimize: • Is it possible to determine an optimal choice strategy that minimizes the number of steps? • Is there a proof that SMO always minimizes the QP problem?
References • [1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998. • [2] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184. MIT Press, 1998.
References (2) • [3] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 147–168. MIT Press, 1998. • [6] E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Neural Networks in Signal Processing ’97, 1997.
References (3) • [9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.