Support Vector Machines for Structured Classification and The Kernel Trick William Cohen 3-6-2007
Announcements • Don’t miss this one: • Lise Getoor, 2:30 in Newell-Simon 3305
The voted perceptron (instance xi)
• A sends B the instance xi
• B computes ŷi = vk . xi and sends the prediction to A
• A sends B the true label yi
• If mistake: vk+1 = vk + yixi
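A minimal sketch of that update loop, assuming a NumPy representation with labels in {-1, +1}; the function name and data layout are illustrative, not the lecture's own code, and the voted-perceptron bookkeeping (keeping every intermediate vk with its survival count) is omitted.

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Perceptron loop as on the slide: predict with the current weight
    vector v, and on a mistake update v <- v + y_i * x_i."""
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):       # y_i in {-1, +1}
            y_hat = np.sign(v @ x_i)     # B's prediction from v_k . x_i
            if y_hat != y_i:             # mistake
                v = v + y_i * x_i        # v_{k+1} = v_k + y_i x_i
    return v
```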
[Figure: (3a) the guess v2 after the two positive examples, v2 = v1 + x2; (3b) the guess v2 after one positive and one negative example, v2 = v1 - x2. Both panels show the target direction u and the margin γ.]
Perceptrons vs SVMs • For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 such that, for all i, u.xi.yi > γ
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: γ, (x1,y1), (x2,y2), (x3,y3), … • Find: some w where • ||w||=1 and • for all i, w.xi.yi> γ
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Find: some w and γ such that • ||w||=1 and • for all i, w.xi.yi> γ The best possible w and γ
Perceptrons vs SVMs • Question: why not use this assumption directly in the learning algorithm? i.e. • Given: (x1,y1), (x2,y2), (x3,y3), … • Maximize γ under the constraints • ||w||=1 and • for all i, w.xi.yi > γ • Equivalently: minimize ||w||² under the constraint • for all i, w.xi.yi > 1 • (Units are arbitrary: rescaling w rescales γ along with ||w||, so we can fix the margin at 1 and minimize ||w|| instead.) This is almost Thorsten’s eq (5)-(6), SVM0
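For concreteness, here is the equivalence the slide asserts, written out in LaTeX (my rendering; the strict vs non-strict inequality is immaterial here):

```latex
\begin{align*}
&\max_{w,\,\gamma}\ \gamma
   && \text{s.t. } \|w\| = 1,\ \ y_i\,(w \cdot x_i) \ge \gamma \ \ \forall i \\
&\quad\text{substituting } w' = w/\gamma \text{ gives } y_i\,(w' \cdot x_i) \ge 1 \text{ and } \|w'\| = 1/\gamma,
 \text{ so this is equivalent to} \\
&\min_{w'}\ \|w'\|^2
   && \text{s.t. } y_i\,(w' \cdot x_i) \ge 1 \ \ \forall i
\end{align*}
```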
The voted perceptron for ranking (instances x1, x2, x3, x4, …)
• A sends B a set of instances x1, x2, x3, x4, …
• B computes the scores ŷi = vk . xi and returns the index b* of the “best” xi
• A sends B the index b of the correct “best” instance
• If mistake (b* ≠ b): vk+1 = vk + xb - xb*
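A minimal sketch of that ranking update, under the assumption that each training example is an array of candidate vectors plus the index of the correct one (the representation and names are mine, not from the slides):

```python
import numpy as np

def ranking_perceptron_step(v, candidates, b):
    """One step of the ranking perceptron from the slide.
    candidates: array of shape (n_candidates, n_features); b: index of the correct one."""
    scores = candidates @ v            # y_hat_i = v_k . x_i for every candidate
    b_star = int(np.argmax(scores))    # B's guess at the "best" candidate
    if b_star != b:                    # mistake
        v = v + candidates[b] - candidates[b_star]   # v_{k+1} = v_k + x_b - x_{b*}
    return v
```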
[Figure: panel (3a) again, the guess v2 after the two positive examples, v2 = v1 + x2, shown with u and the margin γ.]
The voted perceptron for NER (instances z1, z2, z3, z4, …)
• Compute ŷi = vk . zi and return the index b* of the “best” zi, i.e. zb* = F(xi, y*); if mistake: vk+1 = vk + zb - zb*, where zb = F(xi, yi)
• A sends B the Sha & Pereira paper and instructions for creating the instances:
• A sends a word vector xi. Then B could create the instances F(xi, y) for every possible label sequence y…
• …but instead B just returns the y* that gives the best score for the dot product vk . F(xi, y*), found using Viterbi.
• A sends B the correct label sequence yi.
• On errors, B sets vk+1 = vk + zb - zb* = vk + F(xi, yi) - F(xi, y*)
The voted perceptron for NER (summary)
• A sends a word vector xi.
• B returns the y* that gives the best score for vk . F(xi, y*) (via Viterbi).
• A sends B the correct label sequence yi.
• On errors, B sets vk+1 = vk + zb - zb* = vk + F(xi, yi) - F(xi, y*)
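A compact sketch of that structured perceptron step; `viterbi_decode` and `extract_features` are hypothetical helpers standing in for the Viterbi search and the feature map F(x, y) from Sha & Pereira, not functions defined in the lecture.

```python
def structured_perceptron_step(v, x_i, y_i, viterbi_decode, extract_features):
    """One update of the perceptron for NER as sketched on the slide.
    viterbi_decode(v, x) returns the label sequence y* maximizing v . F(x, y);
    extract_features(x, y) returns the feature vector F(x, y).
    v and the feature vectors are assumed to be NumPy arrays; label sequences
    are plain lists/tuples so != compares whole sequences."""
    y_star = viterbi_decode(v, x_i)                  # B's best-scoring label sequence
    if y_star != y_i:                                # mistake
        z_b = extract_features(x_i, y_i)             # F(x_i, y_i), the correct instance
        z_b_star = extract_features(x_i, y_star)     # F(x_i, y*), the guessed instance
        v = v + z_b - z_b_star                       # v_{k+1} = v_k + F(x_i, y_i) - F(x_i, y*)
    return v
```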
SVM for ranking: assumptions
• Recall: minimizing ||w||² under the constraint “for all i, w.xi.yi > 1” (almost Thorsten’s eq (5)-(6), SVM0) suggests an algorithm
• Assumption:
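One plausible form of the missing ranking assumption, reconstructed by analogy with the binary case (the exact statement on the slide may differ): a unit vector u separates the correct “best” candidate from every other candidate by a margin γ.

```latex
\exists\, u,\ \|u\| = 1,\ \gamma > 0:\qquad
u \cdot x_b \;\ge\; u \cdot x_{b'} + \gamma
\quad \text{for all } b' \ne b \text{ in every example, where } b \text{ is the correct index.}
```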
The voted perceptron (recap, instance xi)
• A sends B the instance xi
• B computes ŷi = vk . xi and sends the prediction to A
• A sends B the true label yi
• If mistake: vk+1 = vk + yixi
The kernel trick
• Remember: vk+1 = yi1 xi1 + yi2 xi2 + … + yik xik, where i1, …, ik are the mistakes…
• so: vk+1 . x = yi1 (xi1 . x) + yi2 (xi2 . x) + … + yik (xik . x)
The kernel trick – con’t
• Since vk+1 = yi1 xi1 + … + yik xik, where i1, …, ik are the mistakes…
• then vk+1 . x = yi1 (xi1 . x) + … + yik (xik . x)
• Consider a preprocessor that replaces every x with x' to include, directly in the example, all the pairwise variable interactions (x1x2, x1x3, …, x1², …), so what is learned is a vector v' over these expanded features:
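A small sketch of such a preprocessor, an explicit quadratic feature map; the exact feature ordering and the inclusion of squared terms are my assumptions.

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_expand(x):
    """Map x = (x1, ..., xn) to x' containing the original variables plus
    all pairwise products xi*xj (including squares), so a linear classifier
    over x' is a quadratic function of x."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, pairs])

# Example: a 3-dimensional x becomes a 3 + 6 = 9-dimensional x'.
print(quadratic_expand([1.0, 2.0, 3.0]))
```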
The kernel trick – con’t
• A voted perceptron over vectors like u, v is a linear function…
• Replacing u with u' would lead to non-linear functions of the original variables: f(x, y, xy, x², …)
The kernel trick – con’t
• But notice… if we replace u.v with (u.v + 1)², the expansion contains all the pairwise products ui uj vi vj, plus the linear terms ui vi, plus a constant…
• Compare this to the inner product u'.v' of the expanded vectors.
The kernel trick – con’t
• So, up to constants on the cross-product terms, (u.v + 1)² = u'.v'
• Why not replace the computation of v' . x' = Σj yij (x'ij . x') with the computation of Σj yij K(xij, x), where K(u, v) = (u.v + 1)²?
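For concreteness, the expansion the slide is pointing at, written out here as standard algebra rather than taken from the slide image (v' is defined analogously to u'):

```latex
(u \cdot v + 1)^2
  = \Big(\textstyle\sum_i u_i v_i + 1\Big)^2
  = \sum_{i,j} (u_i u_j)(v_i v_j) \;+\; 2\sum_i u_i v_i \;+\; 1
  = u' \cdot v',
\quad\text{where } u' = \big(\ldots,\, u_i u_j,\, \ldots,\ \sqrt{2}\,u_1, \ldots, \sqrt{2}\,u_n,\ 1\big)
```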
The kernel trick – con’t
• General idea: replace an expensive preprocessor x → x' and an ordinary inner product with no preprocessor and a function K(x, xi), where K(x, xi) = x' . xi'
• This is really useful when you want to learn over objects x with some non-trivial structure… as in the two Mooney papers.
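A sketch of a kernelized perceptron built on that idea: store the mistakes and their labels, and score new points with K instead of an explicit feature map. The kernel choice and data layout here are illustrative assumptions.

```python
import numpy as np

def poly_kernel(u, v, degree=2):
    """K(u, v) = (u . v + 1)^d: the inner product in the expanded feature space."""
    return (np.dot(u, v) + 1.0) ** degree

def kernel_perceptron_train(X, y, kernel=poly_kernel, epochs=10):
    """Kernel perceptron: never build x'. Keep the mistake examples and labels;
    the prediction on x is sign(sum_j y_j K(x_j, x)) over the stored mistakes."""
    mistakes, labels = [], []
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            score = sum(l * kernel(m, x_i) for m, l in zip(mistakes, labels))
            if np.sign(score) != y_i:    # mistake (score 0 also counts, since y_i is +/-1)
                mistakes.append(x_i)
                labels.append(y_i)
    return mistakes, labels
```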
The kernel trick – con’t
• Even more general idea: use any function K that is
• Continuous
• Symmetric, i.e., K(u,v) = K(v,u)
• “Positive semidefinite”, i.e., for any finite set of points x1, …, xm, the matrix M[i,j] = K(xi,xj) is positive semidefinite
• Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e., K(u,v) = u' . v' for some mapping x → x'
• Terminology: K is a Mercer kernel. The set of all x' is a reproducing kernel Hilbert space (RKHS). The matrix M[i,j] = K(xi,xj) is a Gram matrix.
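A small numerical illustration of the positive-semidefiniteness condition (my example, using the polynomial kernel from the sketch above and an eigenvalue check; the tolerance absorbs floating-point error):

```python
import numpy as np

def gram_matrix(X, kernel):
    """M[i, j] = K(x_i, x_j) for all pairs of rows of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def is_positive_semidefinite(M, tol=1e-8):
    """A symmetric matrix is PSD iff all its eigenvalues are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

X = np.random.randn(20, 5)
M = gram_matrix(X, lambda u, v: (np.dot(u, v) + 1.0) ** 2)
print(is_positive_semidefinite(M))   # True for a Mercer kernel like (u.v + 1)^2
```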