Machine Learning Seminar: Support Vector Regression Presented by: Heng Ji 10/08/03
Outline • Regression Background • Linear ε-Insensitive Loss Algorithm • Primal Formulation • Dual Formulation • Kernel Formulation • Quadratic ε-Insensitive Loss Algorithm • Kernel Ridge Regression & Gaussian Process
Regression = find a function that fits the observations. Observations: (x, y) pairs, e.g. (1949, 100), (1950, 117), ..., (1996, 1462), (1997, 1469), (1998, 1467), (1999, 1474)
Linear fit... Not so good...
Better linear fit... Take logarithm of y and fit a straight line
Transform back to the original scale. So-so...
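A minimal numpy sketch of this log-transform trick (not in the original slides); it uses only the six (x, y) observations listed on the earlier slide:

```python
import numpy as np

# The six (year, value) observations listed on the earlier slide
x = np.array([1949, 1950, 1996, 1997, 1998, 1999], dtype=float)
y = np.array([100.0, 117.0, 1462.0, 1469.0, 1467.0, 1474.0])

# Fit a straight line to log(y) rather than y itself
a, b = np.polyfit(x, np.log(y), deg=1)

# Transform back to the original scale
y_hat = np.exp(a * x + b)
print(np.round(y_hat, 1))
```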
So what is regression about? Construct a model of a process, using examples of the process. Input: x (possibly a vector). Output: f(x) (generated by the process). Examples: pairs of input and output {(x, y)}. Our model: the function f(x) is our estimate of the true function g(x)
Assumption about the process: the "fixed regressor model" y(n) = g[x(n)] + e(n), where x(n) is the observed input, y(n) the observed output, g the true underlying function, and e(n) an i.i.d. noise process with zero mean. Data set: D = {(x(n), y(n)), n = 1, ..., N}
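A small sketch of sampling from this fixed regressor model; the cubic g(x) is taken from the next slide, while the Gaussian noise and sample size are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # True underlying function (the cubic from the next slide)
    return 0.5 + x + x**2 + 6 * x**3

N = 50
x = rng.uniform(-1.0, 1.0, size=N)   # observed inputs x(n)
e = rng.normal(0.0, 0.1, size=N)     # i.i.d. zero-mean noise e(n)
y = g(x) + e                         # observed outputs y(n) = g[x(n)] + e(n)
```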
Example: 0 ≤ e ≤ 2
Model Sets (examples). True function: g(x) = 0.5 + x + x² + 6x³. Candidate families: F1 = {a + bx} (linear); F2 = {a + bx + cx²} (quadratic); F3 = {a + bx + cx² + dx³} (cubic)
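A sketch of fitting the three families by least squares with numpy, on synthetic data generated as above (the noise level is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 0.5 + x + x**2 + 6 * x**3 + rng.normal(0.0, 0.1, size=50)

# Fit each model family by least squares and compare training error
for degree, name in [(1, "F1 (linear)"), (2, "F2 (quadratic)"), (3, "F3 (cubic)")]:
    coeffs = np.polyfit(x, y, deg=degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"{name}: training MSE = {mse:.4f}")
```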
Idealized regression: find an appropriate model family F, and find f(x) ∈ F with minimum "distance" to g(x) (the "error"). [figure: the true g(x) lies outside the model set F (our hypothesis set); f_opt(x) is the member of F closest to it]
How do we measure "distance"? • Q: What is the distance (difference) between the functions f and g?
Margin Slack Variable. For example (xᵢ, yᵢ) and function f, the margin slack variable is ξᵢ = max(0, |yᵢ − f(xᵢ)| − (θ − γ)), where θ is the target accuracy in testing and γ is the difference between the target accuracy and the margin achieved in training
ε-Insensitive Loss Function • Let ε = θ − γ; the margin slack variable becomes ξᵢ = max(0, |yᵢ − f(xᵢ)| − ε) • Linear ε-insensitive loss: L(y, f(x)) = max(0, |y − f(x)| − ε) • Quadratic ε-insensitive loss: L(y, f(x)) = max(0, |y − f(x)| − ε)²
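Both losses are one-liners in numpy; a sketch:

```python
import numpy as np

def linear_eps_insensitive(y, f_x, eps):
    # max(0, |y - f(x)| - eps): zero inside the eps-tube, linear outside
    return np.maximum(0.0, np.abs(y - f_x) - eps)

def quadratic_eps_insensitive(y, f_x, eps):
    # The same slack, squared: zero inside the tube, quadratic outside
    return np.maximum(0.0, np.abs(y - f_x) - eps) ** 2
```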
Linear ε-Insensitive Loss of a Linear SV Machine [figure: loss plotted against yᵢ − ⟨w, xᵢ⟩; the loss is zero inside the ±ε tube and grows linearly with slack ξ outside it]
Basic Idea of SV Regression • Starting point: we have input data X = {(x₁, y₁), ..., (x_N, y_N)} • Goal: we want to find a robust function f(x) that deviates at most ε from the targets y, while at the same time being as flat as possible • Idea: simple regression problem + optimization + kernel trick
Thus, setting ε = θ − γ, we arrive at the primal regression problem below.
Linear ε-Insensitive Loss Regression • min ½‖w‖² + C Σᵢ (ξᵢ + ξᵢ*) • subject to yᵢ − ⟨w, xᵢ⟩ − b ≤ ε + ξᵢ; ⟨w, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ*; ξᵢ, ξᵢ* ≥ 0 • ε decides the width of the insensitive zone; C sets a trade-off between the training error and ‖w‖ • ε and C must be tuned simultaneously. Is regression therefore more difficult than classification?
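As a practical aside (not from the original slides), a minimal scikit-learn sketch showing the two knobs, C and epsilon, being set together; the data is synthetic:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0.0, 0.1, size=50)

# C trades training error against flatness (||w||); epsilon sets the tube width
model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X, y)

print("number of support vectors:", len(model.support_))
print("prediction at x = 0.5:", model.predict([[0.5]]))
```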
Dual Formulation • The Lagrangian function helps us formulate the dual problem • Notation: ε is the width of the insensitive zone; βᵢ, βᵢ* are Lagrange multipliers; ξᵢ is the slack for points above the ε-band and ξᵢ* the slack for points below it • Optimality conditions: ∂L/∂w = 0 gives w = Σᵢ (βᵢ − βᵢ*) xᵢ, and ∂L/∂b = 0 gives Σᵢ (βᵢ − βᵢ*) = 0, with βᵢ, βᵢ* ∈ [0, C]
Dual Formulation (Cont'd) • Dual problem: maximize Σᵢ yᵢ (βᵢ − βᵢ*) − ε Σᵢ (βᵢ + βᵢ*) − ½ Σᵢⱼ (βᵢ − βᵢ*)(βⱼ − βⱼ*) ⟨xᵢ, xⱼ⟩, subject to Σᵢ (βᵢ − βᵢ*) = 0 and 0 ≤ βᵢ, βᵢ* ≤ C • Solving yields the regression function f(x) = Σᵢ (βᵢ − βᵢ*) ⟨xᵢ, x⟩ + b
KKT Optimality Conditions and b • KKT conditions: βᵢ (ε + ξᵢ − yᵢ + ⟨w, xᵢ⟩ + b) = 0 and βᵢ* (ε + ξᵢ* + yᵢ − ⟨w, xᵢ⟩ − b) = 0 • b can be computed from any point with 0 < βᵢ < C: b = yᵢ − ⟨w, xᵢ⟩ − ε • The Lagrange multipliers are non-zero only for points on or outside the ε-band; these points are the support vectors
The Idea of SVM [figure: a nonlinear map Φ takes the data from input space to feature space, where a linear function can be used]
Kernel Version • Why can we use a kernel? The complexity of the function's representation depends only on the number of SVs, and the complete algorithm can be described in terms of inner products; this gives an implicit mapping to the feature space • Mapping via kernel: K(x, z) = ⟨Φ(x), Φ(z)⟩, so f(x) = Σᵢ (βᵢ − βᵢ*) K(xᵢ, x) + b
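A sketch of the kernel-form prediction, using a Gaussian (RBF) kernel as an illustrative choice; beta stands for the collected dual coefficients (βᵢ − βᵢ*), here assumed already solved for:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2): an implicit feature-space inner product
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def svr_predict(X_train, beta, b, X_new, gamma=1.0):
    # f(x) = sum_i (beta_i - beta_i*) K(x_i, x) + b
    return rbf_kernel(X_new, X_train, gamma) @ beta + b
```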
Quadratic ε-Insensitive Loss Regression • Problem: min ½‖w‖² + C Σᵢ (ξᵢ² + ξᵢ*²), subject to yᵢ − ⟨w, xᵢ⟩ − b ≤ ε + ξᵢ and ⟨w, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ* • Kernel formulation: the same dual as before, with ⟨xᵢ, xⱼ⟩ replaced by K(xᵢ, xⱼ) + (1/C) δᵢⱼ, i.e. an extra 1/C on the diagonal of the kernel matrix
Kernel Ridge Regression & Gaussian Processes • With ε = 0 and quadratic loss we recover least-squares (ridge) regression; the weight decay factor is controlled by C (λ ~ 1/C) • min λ‖w‖² + Σᵢ ξᵢ², subject to yᵢ − ⟨w, xᵢ⟩ = ξᵢ • Kernel formulation: α = (K + λI)⁻¹ y (I: identity matrix), giving f(x) = Σᵢ αᵢ K(xᵢ, x), which is also the mean of a Gaussian process posterior
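A minimal numpy sketch of the closed-form kernel ridge solution, again with an RBF kernel; the default λ and γ values are assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge(X_train, y_train, X_new, lam=0.1, gamma=1.0):
    # Solve (K + lam*I) alpha = y, then predict f(x) = sum_i alpha_i K(x_i, x);
    # this prediction is also the mean of the corresponding GP posterior
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```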
Architecture of an SV Regression Machine [figure: the input x feeds kernel units K(xᵢ, x), whose outputs are weighted by (βᵢ − βᵢ*) and summed with the bias b]; similar to regression in a three-layered neural network!?
Conclusion • SVM is a useful alternative to neural networks • Two key concepts of SVM: optimization and the kernel trick • Advantages of SV regression: represents the solution by a small subset of training points; ensures the existence of a global minimum; ensures the optimization of a reliable generalization bound
Discussion 1: Influence of the insensitivity band on regression quality • 17 measured training data points are used • Left: ε = 0.1; 15 SVs are chosen • Right: ε = 0.5; the 6 chosen SVs produce a much better regression function
Discussion 2: ε-Insensitive Loss • Enables sparseness in the SVs, but does it guarantee sparseness? • Robust (to small changes in the data/model) • Less sensitive to outliers