Support Vector Regression. David R. Musicant and O.L. Mangasarian International Symposium on Mathematical Programming Thursday, August 10, 2000 http://www.cs.wisc.edu/~musicant. Outline. Robust Regression Huber M-Estimator loss function New quadratic programming formulation
Support Vector Regression. David R. Musicant and O.L. Mangasarian. International Symposium on Mathematical Programming, Thursday, August 10, 2000. http://www.cs.wisc.edu/~musicant
Outline • Robust Regression • Huber M-Estimator loss function • New quadratic programming formulation • Numerical comparisons • Nonlinear kernels • Tolerant Regression • New formulation of Support Vector Regression (SVR) • Numerical comparisons • Massive regression: Row-column chunking • Conclusions & Future Work
Focus 1: Robust Regression, a.k.a. Huber Regression [Figure: loss function plot with axis marks at -γ and γ]
“Standard” Linear Regression Find w, b such that: m points in R^n, represented by an m × n matrix A. y in R^m is the vector to be approximated.
Find w, b such that: Optimization problem • Bound the error by s: • Minimize the error: Traditional approach: minimize squared error.
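The equations on these two slides were images and did not survive in the text. A plausible reconstruction from the surrounding definitions is sketched below. The symbol \mathbf{1} for a vector of ones in R^m is an assumption introduced here, not notation taken from the slides.

```latex
% Goal of "standard" linear regression: approximate y
Aw + b\mathbf{1} \approx y

% Bound the error componentwise by s
-s \;\le\; Aw + b\mathbf{1} - y \;\le\; s

% Traditional approach: minimize the squared error
\min_{w,\,b,\,s} \; \sum_{i=1}^{m} s_i^2
\quad \text{s.t.} \quad -s \le Aw + b\mathbf{1} - y \le s
```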
Examining the loss function • Standard regression uses a squared error loss function. • Points which are far from the predicted line (outliers) are overemphasized.
Alternative loss function • Instead of squared error, try absolute value of the error: This is the 1-norm loss function.
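For the simplest possible model, a constant fit, the squared-error minimizer is the mean and the 1-norm minimizer is the median. The minimal sketch below uses made-up data to show how a single outlier pulls the first but not the second.

```python
from statistics import mean, median

# Made-up data for illustration: five well-behaved points and one outlier.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

def squared_loss(c, ys):
    """Sum of squared errors of the constant fit c."""
    return sum((v - c) ** 2 for v in ys)

def abs_loss(c, ys):
    """Sum of absolute errors (1-norm loss) of the constant fit c."""
    return sum(abs(v - c) for v in ys)

best_squared = mean(y)   # minimizes squared_loss; dragged toward the outlier
best_abs = median(y)     # minimizes abs_loss; stays with the bulk of the data

print(best_squared, best_abs)
```

The same effect carries over to fitting a line: the squared loss lets a few distant points dominate the fit.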
1-Norm Problems And Solution • Overemphasizes error on points close to the predicted line • Solution: Huber loss function, a hybrid approach (quadratic for small errors, linear for large errors) Many practitioners prefer the Huber loss function.
Mathematical Formulation • γ indicates switchover from quadratic to linear [Figure: axis marks at -γ and γ] Larger γ means “more quadratic.”
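The slide's own formula was not preserved. As a hedged illustration, the sketch below uses the standard textbook definition of the Huber loss, with `gamma` as the switchover parameter. It is assumed, not confirmed, that the talk uses this exact scaling.

```python
def huber_loss(t, gamma):
    """Standard Huber loss: quadratic for |t| <= gamma, linear beyond."""
    if abs(t) <= gamma:
        return 0.5 * t * t
    # The linear branch is shifted so both pieces meet at |t| == gamma.
    return gamma * abs(t) - 0.5 * gamma * gamma

small = huber_loss(0.5, 1.0)   # inside the quadratic region
large = huber_loss(3.0, 1.0)   # inside the linear region
print(small, large)
```

Increasing `gamma` widens the quadratic region, which matches the remark that a larger value means "more quadratic".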
Regression Approach Summary • Quadratic Loss Function • Standard method in statistics • Over-emphasizes outliers • Linear Loss Function (1-norm) • Formulates well as a linear program • Over-emphasizes small errors • Huber Loss Function (hybrid approach) • Appropriate emphasis on large and small errors
Previous attempts complicated • Earlier efforts to solve Huber regression: • Huber: Gauss-Seidel method • Madsen/Nielsen: Newton Method • Li: Conjugate Gradient Method • Smola: Dual Quadratic Program • Our new approach: convex quadratic program Our new approach is simpler and faster.
Experimental Results: Census20k (20,000 points, 11 features) [Plot: time (CPU sec) versus γ; annotation: “Faster!”]
Experimental Results: CPUSmall (8,192 points, 12 features) [Plot: time (CPU sec) versus γ; annotation: “Faster!”]
Introduce nonlinear kernel • Begin with previous formulation: • Substitute w = A’α and minimize over α instead: • Substitute K(A,A’) for AA’:
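The substitution can be sketched in a few lines. With the linear kernel, predicting through the kernel row gives the same value as predicting with the substituted weight vector directly; swapping in another kernel changes only the matrix. The Gaussian kernel and its `width` parameter are assumptions chosen for illustration, since the slide does not name a specific kernel.

```python
import math

def linear_kernel(A, B):
    """K(A, B')[i][j] = <A_i, B_j>, i.e. the plain product A B'."""
    return [[sum(a * b for a, b in zip(ai, bj)) for bj in B] for ai in A]

def gaussian_kernel(A, B, width=1.0):
    """K(A, B')[i][j] = exp(-width * ||A_i - B_j||^2) (illustrative choice)."""
    return [[math.exp(-width * sum((a - b) ** 2 for a, b in zip(ai, bj)))
             for bj in B] for ai in A]

def predict(kernel_row, alpha, b):
    """Prediction for one point x from its kernel row K(x', A')."""
    return sum(k * a for k, a in zip(kernel_row, alpha)) + b

A = [[1.0, 2.0], [3.0, 4.0]]   # made-up training points
alpha = [0.5, -1.0]            # made-up dual coefficients
x = [[1.0, 1.0]]               # one new point

# Linear kernel: identical to x'w + b with w built from A and alpha.
linear_prediction = predict(linear_kernel(x, A)[0], alpha, 0.0)
# Nonlinear kernel: same code path, different kernel matrix.
gaussian_prediction = predict(gaussian_kernel(x, A)[0], alpha, 0.0)
```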
Nonlinear results Nonlinear kernels improve accuracy.
Find w, b such that: Optimization problem • Bound the error by s: • Minimize the error: Minimize the magnitude of the error.
The overfitting issue • Noisy training data can be fitted “too well” • leads to poor generalization on future data • Prefer simpler regressions, i.e. where • some w coefficients are zero • line is “flatter”
Reducing overfitting • To achieve both goals • minimize magnitude of w vector • C is a parameter to balance the two goals • Chosen by experimentation • Reduces overfitting due to points far from surface
Overfitting again: “close” points • “Close points” may be wrong due to noise only • Line should be influenced by “real” data, not noise • Ignore errors from those points which are close!
Tolerant regression • Allow an interval of size ε with uniform error • How large should ε be? • As large as possible, while preserving accuracy
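One common way to express this idea is the ε-insensitive loss from support vector regression, which charges nothing inside the tolerance interval and the excess absolute error outside it. The sketch below assumes that this standard loss is the one intended here; `eps` is the tolerance.

```python
def tolerant_loss(residual, eps):
    """Zero inside the tolerance interval, excess absolute error beyond it."""
    return max(0.0, abs(residual) - eps)

inside = tolerant_loss(0.05, 0.1)    # close point: error is ignored
outside = tolerant_loss(-0.5, 0.1)   # far point: only the excess counts
print(inside, outside)
```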
Introduce nonlinear kernel • Begin with previous formulation: • Substitute w = A’α and minimize over α instead: • Substitute K(A,A’) for AA’: K(A,A’) = nonlinear kernel function
Equivalent to Smola, Schölkopf, Rätsch (SSR) Formulation • Our formulation: tolerance as a constraint, single error bound • Smola, Schölkopf, Rätsch: multiple error bounds • Reduction in: • Variables: 4m+2 --> 3m+2 • Solution time
Natural interpretation for μ • Our linear program is equivalent to the classical stabilized least 1-norm approximation problem • Perturbation theory results show there exists a fixed such that: • For all • we solve the above stabilized least 1-norm problem • additionally we maximize ε, the least error component • As μ goes from 0 to 1, • the least error component ε is a monotonically nondecreasing function of μ.
Numerical Testing • Two sets of tests • Compare computational times of our method (MM) and the SSR method • Row-column chunking for massive datasets • Datasets: • US Census Bureau Adult Dataset: 300,000 points in R^11 • Delve Comp-Activ Dataset: 8192 points in R^13 • UCI Boston Housing Dataset: 506 points in R^13 • Gaussian noise was added to each of these datasets. • Hardware: Locop2: Dell PowerEdge 6300 server with: • Four gigabytes of memory, 36 gigabytes of disk space • Windows NT Server 4.0 • CPLEX 6.5 solver
Experimental Process • μ is a parameter which needs to be determined experimentally • Use a hold-out tuning set to determine the optimal value for μ • Algorithm:
    μ = 0
    while (tuning set accuracy continues to improve) {
        Solve LP
        μ = μ + 0.1
    }
• Run for both our method and the SSR method and compare times
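The tuning loop above can be sketched as runnable code. The callables `solve_lp` and `tuning_error` are hypothetical placeholders for the LP solve and the hold-out evaluation; neither name comes from the slides. The upper limit of 1 is an assumption based on the stated range of the parameter.

```python
def tune_mu(solve_lp, tuning_error, step=0.1, mu_max=1.0):
    """Increase mu from 0 while the tuning-set error keeps improving.

    solve_lp(mu) -> model and tuning_error(model) -> float are
    hypothetical callables supplied by the caller.
    """
    best_mu, best_model, best_err = None, None, float("inf")
    k = 0
    while k * step <= mu_max + 1e-12:
        mu = k * step                 # avoids accumulating float error
        model = solve_lp(mu)
        err = tuning_error(model)
        if err >= best_err:           # accuracy stopped improving
            break
        best_mu, best_model, best_err = mu, model, err
        k += 1
    return best_mu, best_model, best_err

# Stand-in callables so the sketch runs: the "model" is just mu itself,
# and the tuning error is smallest near mu = 0.3.
best_mu, _, best_err = tune_mu(lambda mu: mu, lambda model: (model - 0.3) ** 2)
```

Timing this loop once per formulation gives the kind of comparison described on the slide.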
Linear Programming Row Chunking • Basic approach: (PSB/OLM) for classification problems • Classification problem is solved for a subset, or chunk, of constraints (data points) • Those constraints with positive multipliers are preserved and integrated into next chunk (support vectors) • Objective function is monotonically nondecreasing • Dataset is repeatedly scanned until objective function stops increasing
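The row-chunking loop can be sketched generically. The callables `solve` and `is_active` are hypothetical: a real implementation would call an LP solver and keep the constraints with positive multipliers. To stay self-contained, the demonstration uses a toy problem (minimize t subject to t >= a_i for every row), which is not the classification LP from the talk.

```python
def row_chunking(rows, chunk_size, solve, is_active, max_passes=100):
    """Solve chunk by chunk, carrying active constraints forward.

    solve(chunk) -> (objective, solution) and
    is_active(row, solution) -> bool are hypothetical callables.
    """
    active = set()                        # indices of rows kept between chunks
    objective, solution = float("-inf"), None
    for _ in range(max_passes):
        previous = objective
        for start in range(0, len(rows), chunk_size):
            stop = min(start + chunk_size, len(rows))
            idx = sorted(active | set(range(start, stop)))
            objective, solution = solve([rows[i] for i in idx])
            active = {i for i in idx if is_active(rows[i], solution)}
        if objective <= previous:         # a full scan brought no increase
            break
    return objective, solution, sorted(active)

# Toy problem: minimize t subject to t >= a_i for every row a_i.
def solve_toy(chunk):
    t = max(chunk)
    return t, t

def is_active_toy(a, t):
    return a == t                         # the binding constraint

objective, solution, kept = row_chunking(
    [3, 1, 4, 1, 5, 9, 2, 6], 3, solve_toy, is_active_toy)
```

Because the active rows are carried into every later chunk, the objective never decreases from one chunk to the next, and the scan stops once a full pass leaves it unchanged.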
Innovation: Simultaneous Row-Column Chunking • Row Chunking • Cannot handle problems with large numbers of variables • Therefore: Linear kernel only • Row-Column Chunking • New data increase the dimensionality of K(A,A’) by adding both rows and columns (variables) to the problem. • We handle this with row-column chunking. • General nonlinear kernel
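The kernel matrix has one row and one column per data point, so its size grows quadratically with the dataset. The short calculation below illustrates why the full matrix quickly outgrows memory. The dataset size `m` is a hypothetical value chosen only for illustration, and dense 8-byte storage is an assumption.

```python
def dense_kernel_bytes(m, bytes_per_entry=8):
    """Memory for a dense m-by-m kernel matrix K(A, A') of doubles."""
    return m * m * bytes_per_entry

m = 32_000                                  # hypothetical dataset size
entries = m * m                             # 1,024,000,000 entries
gigabytes = dense_kernel_bytes(m) / 1e9     # 8.192 GB

print(entries, gigabytes)
```

A matrix of that size would not fit in the four gigabytes of memory listed for the test hardware, which is the situation chunking over both rows and columns is meant to address.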
Row-Column Chunking Algorithm
while (problem termination criteria not satisfied) {
    choose set of rows as row chunk
    while (row chunk termination criteria not satisfied) {
        from row chunk, select set of columns
        solve LP allowing only these columns to vary
        add columns with nonzero values to next column chunk
    }
    add rows with nonzero multipliers to next row chunk
}
Objective Value & Tuning Set Error for Billion-Element Matrix
Conclusions and Future Work • Conclusions • Robust regression can be modeled simply and efficiently as a quadratic program • Tolerant Regression can be handled more efficiently using improvements on previous formulations • Row-column chunking is a new approach which can handle massive regression problems • Future work • Chunking via parallel and distributed approaches • Scaling Huber regression to larger problems
LP Perturbation Regime #1 • Our LP is given by: • When μ = 0, the solution is the stabilized least 1-norm solution. • Therefore, by LP Perturbation Theory, there exists a such that • The solution to the LP with is a solution to the least 1-norm problem that also maximizes ε.
LP Perturbation Regime #2 • Our LP can be rewritten as: • Similarly, by LP Perturbation Theory, there exists a such that • The solution to the LP with is the solution that minimizes least error (ε) among all minimizers of average tolerated error.
Motivation for dual variable substitution • Primal: • Dual: