Kernel Regression Prof. Bennett Math Model of Learning and Discovery 1/28/05 Based on Chapter 2 of Shawe-Taylor and Cristianini
Outline • Review Ridge Regression • LS-SVM=KRR • Dual Derivation • Bias Issue • Summary
Ridge Regression Review • Use the least norm solution for fixed λ • Regularized problem • Optimality condition (sketched below): requires O(n³) operations
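A sketch of the slide's missing equations in standard notation (the exact weighting of the penalty term may differ from the lecture's convention):

```latex
\min_{w}\;\; \lambda\,\|w\|^2 + \|y - Xw\|^2
\quad\Longrightarrow\quad
(X'X + \lambda I_n)\,w = X'y,
\qquad
w = (X'X + \lambda I_n)^{-1} X'y
```

Forming and solving the n × n system in X'X + λIₙ is what costs O(n³).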
Dual Representation • The inverse always exists for any λ > 0 • Alternative representation (sketched below): solving the ℓ × ℓ system is O(ℓ³)
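Presumably the missing algebra is the standard rearrangement; as a sketch: from λw = X'(y − Xw), set α = λ⁻¹(y − Xw), so w = X'α; substituting back gives λα = y − Gα, hence

```latex
\alpha = (G + \lambda I_\ell)^{-1} y,
\qquad G = XX'
```

G + λI_ℓ is positive definite for any λ > 0, so the inverse always exists.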
Dual Ridge Regression • To predict a new point (see the formula below): • Note we need only compute G, the Gram matrix • Ridge regression requires only inner products between data points
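The prediction formula the slide refers to is presumably the standard dual form:

```latex
f(x) = \langle w, x\rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x\rangle
     = y'(G + \lambda I_\ell)^{-1} k,
\qquad k_i = \langle x_i, x\rangle
```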
Linear Regression in Feature Space Key Idea: Map the data to a higher dimensional space (the feature space) and perform linear regression in the embedded space. Embedding map: φ : x ↦ φ(x)
Kernel Function • A kernel is a function K such that K(x, z) = ⟨φ(x), φ(z)⟩ for some embedding φ • There are many possible kernels. The simplest is the linear kernel (examples below).
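The lecture names only the linear kernel; the other examples here are standard ones added for reference:

```latex
K(x,z)=\langle x,z\rangle \;(\text{linear}),\qquad
K(x,z)=(\langle x,z\rangle + c)^d \;(\text{polynomial}),\qquad
K(x,z)=\exp\!\big(-\|x-z\|^2/(2\sigma^2)\big) \;(\text{Gaussian})
```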
Ridge Regression in Feature Space • To predict a new point: • To compute the Gram matrix, use the kernel to compute the inner products (a sketch follows below)
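A minimal numpy sketch of kernel ridge regression as described on these slides; the function names and the choice of a Gaussian kernel are illustrative, not from the lecture:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix of the Gaussian kernel between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit_krr(X, y, lam=0.1, sigma=1.0):
    """Solve (G + lam*I) alpha = y for the dual coefficients."""
    G = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_test, sigma=1.0):
    """f(x) = sum_i alpha_i K(x_i, x) for each test point."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha
```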
Alternative Dual Derivation • Original math model • Equivalent math model • Construct dual using Wolfe Duality
Lagrangian Function • Consider the problem • Lagrangian function is
Wolfe Dual Problem • Primal • Dual
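For reference, a hedged statement of the construction being used (Wolfe duality is usually stated for convex differentiable problems with inequality constraints and α ≥ 0; with equality constraints the multipliers are free):

```latex
\text{Primal:}\quad \min_x f(x) \;\;\text{s.t.}\;\; g(x) = 0
\qquad
\text{Wolfe dual:}\quad \max_{x,\alpha}\; L(x,\alpha) = f(x) + \alpha' g(x)
\;\;\text{s.t.}\;\; \nabla_x L(x,\alpha) = 0
```

The stationarity constraint ∇ₓL = 0 is what lets the primal variables be eliminated on the next slides.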
Lagrangian Function • Primal • Lagrangian
Wolfe Dual Problem Construct the Wolfe dual, then simplify by eliminating z
Simplified Problem Get rid of z. Simplify by eliminating w = X'α
Simplified Problem Get rid of w
Optimal Solution • Problem in matrix notation with G = XX' • Solution satisfies the linear system sketched below
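A sketch of the elimination under one common convention, min_{w,z} (λ/2)w'w + (1/2)z'z subject to z = y − Xw (the scaling of α varies across texts):

```latex
L = \tfrac{\lambda}{2}w'w + \tfrac{1}{2}z'z + \alpha'(y - Xw - z)
\quad\Rightarrow\quad
z = \alpha,\qquad w = \lambda^{-1}X'\alpha
```

Substituting into the constraint gives α = y − λ⁻¹Gα, so the solution satisfies

```latex
(G + \lambda I_\ell)\,\alpha = \lambda\, y,
\qquad
w = \lambda^{-1}X'\alpha = X'(G + \lambda I_\ell)^{-1} y
```

which agrees with the dual representation derived earlier.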
What about Bias • Limiting the regression function to f(x) = w'x means that the solution must pass through the origin. • Many models require a bias or constant term: f(x) = w'x + b
Eliminate Bias • One way to eliminate the bias is to "center" the response: make the response have mean 0
Center y • y now has sample mean 0 • It is frequently good to also make y have standard (unit) length:
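Presumably the formulas behind this slide are:

```latex
\bar{y} = \tfrac{1}{\ell}\,e'y,
\qquad
y_c = y - \bar{y}\,e,
\qquad
\hat{y} = y_c / \|y_c\|
```

where e is the ℓ-vector of ones; e'y_c = 0, so y_c has sample mean 0, and ŷ additionally has unit length.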
Centering X may be a good idea • Mean X • Center X
Scaling X may be a good idea • Compute the standard deviation • Scale columns/variables (a sketch covering both steps follows)
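An illustrative numpy version of centering and scaling a data matrix; the use of the sample standard deviation (ddof=1) is an assumption, since the lecture does not specify it:

```python
import numpy as np

def center_scale(X):
    """Center each column of X to mean 0, then scale to unit std dev."""
    mu = X.mean(axis=0)          # column means
    Xc = X - mu                  # centered data
    sd = Xc.std(axis=0, ddof=1)  # sample standard deviations
    return Xc / sd
```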
You Try • Consider a data matrix with 3 points in 4 dimensions • Compute the centered X by hand and with the formula below, then scale
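The formula the exercise refers to was lost in extraction; presumably it is the projection form of centering:

```latex
X_c = \Big(I_\ell - \tfrac{1}{\ell}\,e\,e'\Big)\,X
```

Each column of X_c then sums to 0, which you can verify against the hand computation.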
Center φ(X) in Feature Space • We cannot center φ(X) directly in feature space. • Instead, center G = XX' • The same centering works in feature space when G is computed with a kernel
Centering the Kernel • Practical computation (sketched below):
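The practical computation is presumably the standard one: with C = I − (1/ℓ)ee',

```latex
G_c = \Big(I - \tfrac{1}{\ell}ee'\Big)\, G\, \Big(I - \tfrac{1}{\ell}ee'\Big),
\qquad
(G_c)_{ij} = G_{ij} - \tfrac{1}{\ell}\sum_k G_{ik} - \tfrac{1}{\ell}\sum_k G_{kj} + \tfrac{1}{\ell^2}\sum_{k,m} G_{km}
```

Since G_c = (CX)(CX)' when G = XX', this is exactly the Gram matrix of the centered data, and the identity carries over to any kernel-induced feature map.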
Ridge Regression in Feature Space • Original way • Predicted normalized y • Predicted original y
Worksheet • Normalized y • Invert to get the unnormalized y
Centering Test Data Center the test data just like the training data: The prediction on test data becomes:
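For a test point x with kernel vector k, kᵢ = K(xᵢ, x), the centering consistent with the training centering is presumably

```latex
k_c = \Big(I - \tfrac{1}{\ell}ee'\Big)\Big(k - \tfrac{1}{\ell}\,G\,e\Big)
```

i.e., the training means in feature space are subtracted from both arguments of the kernel; the prediction is then f(x) = α'k_c, with the mean of y added back if the response was centered.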
Alternate Approach • Directly add bias to the model: • Optimization problem becomes:
Lagrangian Function • Consider the problem • Lagrangian function is
Lagrangian Function • Primal
Wolfe Dual Problem Simplify by eliminating z and using e'α = 0
Simplified Problem Simplify by eliminating w = X'α
Simplified Problem Get rid of w
New Problem to be Solved • Problem in matrix notation with G = XX' • This is a constrained optimization problem. The solution is again a system of equations, but not as simple (see the sketch below).
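A sketch of the resulting system, under the same conventions as before (the scaling of α again depends on how the objective is weighted): stationarity of the Lagrangian in b adds the constraint e'α = 0, and eliminating w and z as before yields

```latex
\begin{pmatrix} G + \lambda I_\ell & e \\ e' & 0 \end{pmatrix}
\begin{pmatrix} \alpha \\ b \end{pmatrix}
=
\begin{pmatrix} y \\ 0 \end{pmatrix},
\qquad
f(x) = \alpha' k + b
```

This is a saddle-point (KKT) system rather than a plain positive definite solve, hence "not as simple", though it is still a single linear system.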
Kernel Ridge Regression • The centered algorithm just requires centering the kernel and solving one linear system. • Bias can also be added directly. • + Many fast equation solvers are available. • + Theory supports generalization. • - Requires the full training kernel to compute. • - Requires kernel evaluations against all training points to predict future points.