Machine Learning Seminar: Support Vector Regression
Presented by: Heng Ji, 10/08/03
Outline
• Regression Background
• Linear ε-Insensitive Loss Algorithm
  • Primal Formulation
  • Dual Formulation
  • Kernel Formulation
• Quadratic ε-Insensitive Loss Algorithm
• Kernel Ridge Regression & Gaussian Process
Regression = find a function that fits the observations. The observations are (x, y) pairs: (1949,100) (1950,117) ... (1996,1462) (1997,1469) (1998,1467) (1999,1474)
Linear fit... Not so good...
Better linear fit... Take logarithm of y and fit a straight line
Transform back to original So so...
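A minimal sketch of this log-transform fit, using numpy (only the data points shown on the slide are used; the middle of the series is elided there):

```python
import numpy as np

# (x, y) observations from the slide (the elided middle years are omitted)
x = np.array([1949, 1950, 1996, 1997, 1998, 1999], dtype=float)
y = np.array([100, 117, 1462, 1469, 1467, 1474], dtype=float)

# Fit a straight line to log(y), then transform back: y ≈ exp(a*x + b)
a, b = np.polyfit(x, np.log(y), 1)
y_hat = np.exp(a * x + b)
print(np.round(y_hat, 1))
```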
So what is regression about? Construct a model of a process, using examples of the process. Input: x (possibly a vector). Output: y (generated by the process). Examples: pairs of input and output {(x, y)}. Our model: the function f(x) is our estimate of the true function g(x).
Assumption about the process: the "fixed regressor model"

y(n) = g[x(n)] + e(n)

where x(n) is the observed input, y(n) is the observed output, g[x(n)] is the true underlying function, and e(n) is an i.i.d. noise process with zero mean. Data set: D = {(x(n), y(n)), n = 1, ..., N}.
Example: noise bounded by 0 ≤ e ≤ 2.
Model Sets (examples). True function: g(x) = 0.5 + x + x² + 6x³. Candidate families: F1 = {a + bx} (linear); F2 = {a + bx + cx²} (quadratic); F3 = {a + bx + cx² + dx³} (cubic).
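A small sketch of fitting these three families to noisy samples of g(x), using numpy (the sample design and the noise draw are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # True underlying function from the slide
    return 0.5 + x + x**2 + 6 * x**3

# Noisy observations y(n) = g[x(n)] + e(n), with 0 <= e <= 2 as in the example
x = np.linspace(-1, 1, 30)
y = g(x) + rng.uniform(0, 2, size=x.shape)

# Fit F1 (linear), F2 (quadratic), F3 (cubic) by least squares
for degree, name in [(1, "F1 linear"), (2, "F2 quadratic"), (3, "F3 cubic")]:
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"{name}: training MSE = {mse:.3f}")
```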
Idealized regression: find an appropriate model family F, and find f(x) ∈ F with minimum "distance" to g(x) (the "error"). [Figure: the model set F (our hypothesis set) contains f_opt(x); the error is the distance from f_opt(x) to the true function g(x).]
How to measure "distance"? • Q: What is the distance (difference) between the functions f and g?
Margin Slack Variable. For an example (xi, yi) and a function f, the margin slack variable is

ξi = max(0, |yi − f(xi)| − (θ − γ))

θ: target accuracy in testing; γ: difference between the target accuracy and the margin achieved in training.
ε-Insensitive Loss Function
• Let ε = θ − γ; the margin slack variable becomes ξi = max(0, |yi − f(xi)| − ε)
• Linear ε-insensitive loss: L(y, f(x)) = |y − f(x)|_ε = max(0, |y − f(x)| − ε)
• Quadratic ε-insensitive loss: L(y, f(x)) = (|y − f(x)|_ε)²
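Both losses are one-liners; a sketch in numpy (the function names are mine):

```python
import numpy as np

def linear_eps_insensitive(y, f_x, eps):
    # max(0, |y - f(x)| - eps): zero inside the eps-tube, linear outside it
    return np.maximum(0.0, np.abs(y - f_x) - eps)

def quadratic_eps_insensitive(y, f_x, eps):
    # Square of the linear eps-insensitive loss
    return linear_eps_insensitive(y, f_x, eps) ** 2

print(linear_eps_insensitive(1.0, 0.2, eps=0.5))     # 0.3
print(quadratic_eps_insensitive(1.0, 0.2, eps=0.5))  # 0.09 (= 0.3**2)
```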
[Figure: linear ε-insensitive loss for a linear SV machine. The loss is plotted against yi − <w, xi>: zero inside the ±ε band, growing linearly with the slack ξ outside it.]
Basic Idea of SV Regression
• Starting point: we have input data X = {(x1, y1), ..., (xN, yN)}
• Goal: we want to find a robust function f(x) that has at most ε deviation from the targets y, while at the same time being as flat as possible
• Idea: simple regression problem + optimization + kernel trick
Primal Regression Problem (linear ε-insensitive loss). Introducing slack variables ξi, ξi* for points above and below the ε-tube, we solve

min (1/2)||w||² + C Σi (ξi + ξi*)

subject to
yi − <w, xi> − b ≤ ε + ξi
<w, xi> + b − yi ≤ ε + ξi*
ξi, ξi* ≥ 0

ε decides the width of the insensitive zone; C is a trade-off between the error and ||w||. Note that ε and C must be tuned simultaneously. Is regression therefore more difficult than classification?
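A minimal sketch of this primal QP, using the cvxpy library on made-up toy data (the solver choice, the data, and all variable names are illustrative assumptions, not from the slides):

```python
import cvxpy as cp
import numpy as np

# Toy linear data with a little noise (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.5, -0.5]) + 0.3 + 0.1 * rng.normal(size=40)

n, d = X.shape
eps, C = 0.1, 1.0

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)       # slack above the eps-tube
xi_star = cp.Variable(n, nonneg=True)  # slack below the eps-tube

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star))
constraints = [
    y - X @ w - b <= eps + xi,
    X @ w + b - y <= eps + xi_star,
]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```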
Dual Formulation
• The Lagrangian function will help us formulate the dual problem. With multipliers βi, βi* ≥ 0 for the two ε-band constraints and ηi, ηi* ≥ 0 for ξi, ξi* ≥ 0:

L = (1/2)||w||² + C Σi (ξi + ξi*) − Σi βi (ε + ξi − yi + <w, xi> + b) − Σi βi* (ε + ξi* + yi − <w, xi> − b) − Σi (ηi ξi + ηi* ξi*)

• ε: insensitive-loss parameter; βi, βi*: Lagrange multipliers; ξi: slack for points above the ε band; ξi*: slack for points below the ε band
• Optimality conditions (partial derivatives of L set to zero):
∂L/∂w = w − Σi (βi − βi*) xi = 0
∂L/∂b = Σi (βi − βi*) = 0
∂L/∂ξi = C − βi − ηi = 0, ∂L/∂ξi* = C − βi* − ηi* = 0
Dual Formulation (Cont')
• Dual problem:

max Σi yi (βi − βi*) − ε Σi (βi + βi*) − (1/2) Σi Σj (βi − βi*)(βj − βj*) <xi, xj>

subject to Σi (βi − βi*) = 0 and 0 ≤ βi, βi* ≤ C
• Solving: w = Σi (βi − βi*) xi, so f(x) = Σi (βi − βi*) <xi, x> + b
KKT Optimality Conditions and b
• KKT optimality (complementarity) conditions:
βi (ε + ξi − yi + <w, xi> + b) = 0
βi* (ε + ξi* + yi − <w, xi> − b) = 0
(C − βi) ξi = 0, (C − βi*) ξi* = 0
• b can be computed as follows: pick a point with 0 < βi < C (so ξi = 0), then b = yi − <w, xi> − ε; for a point with 0 < βi* < C, b = yi − <w, xi> + ε
This means that the Lagrange multipliers will only be non-zero for points on or outside the ε band. These points are the support vectors.
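Given a dual solution, b follows directly from these conditions; a sketch of the computation (the function and all names are mine, and it assumes at least one margin SV exists):

```python
import numpy as np

def compute_b(X, y, beta, beta_star, eps, C, tol=1e-6):
    """Average b over all margin SVs, per the KKT conditions above."""
    w = (beta - beta_star) @ X            # w = sum_i (beta_i - beta_i*) x_i
    bs = []
    for i in range(len(y)):
        if tol < beta[i] < C - tol:       # on the upper eps-boundary (xi_i = 0)
            bs.append(y[i] - X[i] @ w - eps)
        if tol < beta_star[i] < C - tol:  # on the lower eps-boundary
            bs.append(y[i] - X[i] @ w + eps)
    return float(np.mean(bs))             # assumes bs is non-empty
```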
The Idea of SVM: map the data from the input space into a (usually higher-dimensional) feature space, and perform linear regression there. [Figure: input space mapped to feature space via Φ.]
Kernel Version
• Why can we use a kernel? The complexity of a function's representation depends only on the number of SVs, and the complete algorithm can be described in terms of inner products, i.e., an implicit mapping to the feature space.
• Mapping via kernel: K(x, x') = <Φ(x), Φ(x')>, so the solution becomes f(x) = Σi (βi − βi*) K(xi, x) + b (see the sketch below).
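As an illustration, scikit-learn's SVR implements this kernelized ε-insensitive regression (a sketch; the sine-wave data is made up):

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D data: a noisy sine wave (illustrative)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

# RBF-kernel SVR: eps-tube width 0.1, error/flatness trade-off C = 1.0
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, y)

print("number of support vectors:", len(svr.support_vectors_))
print("prediction at x = 2.5:", svr.predict([[2.5]]))
```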
Quadratic ε-Insensitive Loss Regression. Problem:

min (1/2)||w||² + C Σi (ξi² + ξi*²)

subject to
yi − <w, xi> − b ≤ ε + ξi
<w, xi> + b − yi ≤ ε + ξi*

Kernel formulation: the dual is the same as in the linear-loss case, with <xi, xj> replaced by K(xi, xj) + δij/(2C), the box constraint 0 ≤ βi, βi* ≤ C relaxed to βi, βi* ≥ 0, and Σi (βi − βi*) = 0 retained.
Kernel Ridge Regression & Gaussian Processes
• With ε = 0 and quadratic loss we recover least-squares linear regression; the weight-decay factor is controlled by C:

min λ||w||² + Σi ξi²  (λ ~ 1/C)

subject to yi − <w, xi> = ξi
• Kernel formulation (I: identity matrix):

f(x) = y^T (K + λI)^(−1) k(x), where Kij = K(xi, xj) and k(x)i = K(xi, x)

f(x) is also the mean of a Gaussian (process posterior) distribution.
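A minimal numpy sketch of this closed-form solution (the RBF kernel choice and the data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=50)

lam = 0.1  # lambda ~ 1/C
K = rbf_kernel(X, X)

# alpha = (K + lambda*I)^(-1) y, so f(x) = k(x)^T alpha
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.array([[2.5]])
f_test = rbf_kernel(X_test, X) @ alpha
print(f_test)  # also the mean of the corresponding GP posterior
```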
Architecture of the SV Regression Machine. [Figure: the input x is compared to each support vector via the kernel, the results are weighted by (βi − βi*) and summed together with the bias b.] Similar to regression in a three-layered neural network!?
Conclusion
• SVM is a useful alternative to neural networks
• Two key concepts of SVM: optimization and the kernel trick
• Advantages of SV regression:
  • Represents the solution by a small subset of training points
  • Ensures the existence of a global minimum
  • Ensures the optimization of a reliable generalization bound
Discussion 1: Influence of the insensitivity band on regression quality
• 17 measured training data points are used.
• Left: ε = 0.1, 15 SVs are chosen
• Right: ε = 0.5, the 6 chosen SVs produce a much better regression function
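This sparsity effect is easy to reproduce (a sketch; the data and the C value are illustrative, not the slide's):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(17, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=17)

# A wider eps-tube leaves more points inside the tube, hence fewer SVs
for eps in (0.1, 0.5):
    svr = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon = {eps}: {len(svr.support_vectors_)} support vectors")
```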
Discussion 2: ε-Insensitive Loss
• Enables sparseness within the SVs, but does it guarantee sparseness?
• Robust (robust to small changes in the data/model)
• Less sensitive to outliers