Discriminative Machine Learning, Topic 2: Linear Classification (SVM) M. Pawan Kumar (Based on Prof. A. Zisserman's course material) Slides available online at http://mpawankumar.info
Outline • Classification • Binary Classification • Multiclass Classification
Binary Classification Input: x Output: y ∈ {-1,+1} x can be an image, a sentence, a DNA sequence, …
Example "I applied for Oxford engineering" Will she get a 1st class degree, if admitted?
Features GCSE grades, AS-level grades, A-level grades, interview scores, PAT scores, … Feature Vector Φ(x)
Example Is this an energy-efficient house?
Features Number of people, number of electrical items, annual income, political inclination, country, state, city, … Feature Vector Φ(x)
Example Is this a spam email?
Features Spelling mistakes, word count, URLs, sender, recipients, … Feature Vector Φ(x)
Multiclass Classification Input: x Output: y ∈ {1,2,…,C} x can be an image, a sentence, a DNA sequence, …
Example "I applied for Oxford engineering" What class degree will she get, if admitted?
Features GCSE grades, AS-level grades, A-level grades, interview scores, PAT scores, … Feature Vector Φ(x)
Example Which digit does the image depict?
Features Scale the image to a canonical size, say 28x28 Binarize the intensity values Concatenate the binary values Feature Vector Φ(x)
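A minimal sketch of this pipeline, assuming the input is a grey-level image given as a NumPy array; the function name, the nearest-neighbour rescaling, and the binarization threshold are illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def digit_features(image, size=28, threshold=0.5):
    """Phi(x) for a grey-level digit image: scale, binarize, concatenate."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    # Scale the image to a canonical size (say 28x28) by nearest-neighbour sampling.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    scaled = image[np.ix_(rows, cols)]
    # Binarize the intensity values (the 0.5 fraction of the maximum is an arbitrary threshold).
    binary = (scaled > threshold * scaled.max()).astype(np.float64)
    # Concatenate the binary values into a single feature vector Phi(x).
    return binary.reshape(-1)

phi = digit_features(np.random.rand(64, 64))   # 784-dimensional Phi(x)
```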
Outline • Classification • Binary Classification • Multiclass Classification
Binary Classification Dataset D = {(xi, yi), i = 1, …, n}, yi ∈ {-1,+1} Classification via regression Loss function ∑i (wTΦ(xi) – yi)² Builds on a known method Not suitable for classification
Example "I applied for Oxford engineering" What class degree will she get, if admitted?
Example Say, an interview score of 8 or above implies a 1st class Consider a positive sample Loss = (w * interview_score – 1)² If w = 1/8, non-zero loss for 8.5, 9, 9.5, … If w = 1/9, non-zero loss for 8, 8.5, 9.5, …
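A small numerical sketch of the problem; the scores and the choice w = 1/8 are the slide's running example, and the snippet itself is only illustrative:

```python
# Squared loss of "classification via regression" on positive samples (y = +1).
w = 1 / 8
for score in [8, 8.5, 9, 9.5]:
    loss = (w * score - 1) ** 2
    print(f"interview score {score}: squared loss {loss:.4f}")
# Only score 8 gives zero loss; the more confidently correct the prediction
# (8.5, 9, 9.5), the larger the penalty -- the wrong behaviour for classification.
```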
Outline • Classification • Binary Classification • Formulation • Support Vector Machine (SVM) • Max-Margin Interpretation • Optimization • Application • Multiclass Classification
Formulation We first consider prediction Given a classifier, how do we use it to classify? Then we'll move on to learning Given training data, how do we learn a classifier?
Prediction Given x, compute a score for each y ∈ {-1,+1} Function f: (x,y) → Real For example, f(x,y) = (wN)TΦ(x) if y = -1, and (wP)TΦ(x) if y = +1 Prediction y(f) = argmaxy f(x,y) Let us make this a bit more abstract
Prediction – Joint Feature Vector Given input x and output y Joint feature vector Ψ(x,y) For example, Ψ(x,-1) = [Φ(x); 0], where 0 is a vector of zeros of the same size as Φ(x)
Prediction – Joint Feature Vector Given input x and output y Joint feature vector Ψ(x,y) For example, Ψ(x,+1) = [0; Φ(x)]
Prediction – Score Function Function f: Ψ(x,y) → Real For example, wTΨ(x,y) Linear classifier With Ψ(x,-1) = [Φ(x); 0], Ψ(x,+1) = [0; Φ(x)] and w = [wN; wP]: -ve score = (wN)TΦ(x), +ve score = (wP)TΦ(x)
Prediction – Score Function Function f: Ψ(x,y) → Real For example, wTΨ(x,y) Linear classifier Predicted class y(w) = argmaxy wTΨ(x,y)
Prediction – Summary Given an input x, for each output y Define the joint feature vector Ψ(x,y) Score function wTΨ(x,y) y(w) = argmaxy wTΨ(x,y)
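A minimal NumPy sketch of this prediction rule; the two-block form of Ψ(x,y) and the reading w = [wN; wP] are taken from the slides above, while the variable and function names are mine:

```python
import numpy as np

def joint_feature(phi, y):
    """Psi(x, y): Phi(x) stacked against a zero vector of the same size."""
    zeros = np.zeros_like(phi)
    return np.concatenate([phi, zeros]) if y == -1 else np.concatenate([zeros, phi])

def predict(w, phi):
    """y(w) = argmax_y w^T Psi(x, y) over y in {-1, +1}."""
    scores = {y: w @ joint_feature(phi, y) for y in (-1, +1)}
    return max(scores, key=scores.get)

# With w = [w_N; w_P], the two scores reduce to w_N^T Phi(x) and w_P^T Phi(x).
phi = np.array([1.0, 2.0])
w = np.concatenate([np.array([0.1, -0.2]), np.array([0.3, 0.4])])  # [w_N; w_P], toy values
print(predict(w, phi))   # +1, since w_P^T Phi(x) = 1.1 > w_N^T Phi(x) = -0.3
```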
Learning D = {(xi, yi), i = 1, …, n}, yi ∈ {-1,+1} minw ∑i Loss(w; xi, yi) + λ Reg(w) Loss function + Regularization What is a suitable loss function for classification?
Learning – Loss Function Consider one sample (xi,yi) yi(w): Prediction using parameters w
Learning – Loss Function Consider one sample (xi, yi) yi(w) = argmaxy wTΨ(xi,y) Loss Δ(yi, yi(w)) = 0 if yi = yi(w), and 1 if yi ≠ yi(w) The 0-1 loss function
Learning – Objective minw ∑i Δ(yi, yi(w)) + λ||w||² Is this a sensible learning objective? The loss is highly non-convex in w Regularization plays no role Minimize a convex upper bound instead
Learning – Objective Δ(yi, yi(w)) = wTΨ(xi, yi(w)) + Δ(yi, yi(w)) - wTΨ(xi, yi(w)) ≤ wTΨ(xi, yi(w)) + Δ(yi, yi(w)) - wTΨ(xi, yi) Why? Because yi(w) maximizes wTΨ(xi, y), so wTΨ(xi, yi(w)) ≥ wTΨ(xi, yi)
Learning – Objective Δ(yi, yi(w)) = wTΨ(xi, yi(w)) + Δ(yi, yi(w)) - wTΨ(xi, yi(w)) ≤ wTΨ(xi, yi(w)) + Δ(yi, yi(w)) - wTΨ(xi, yi) ≤ maxy { wTΨ(xi, y) + Δ(yi, y) } - wTΨ(xi, yi)
Learning – Objective maxy { wTΨ(xi, y) + Δ(yi, y) } - wTΨ(xi, yi) Convex? Yes, a maximum of functions linear in w Regularization sensitive? Yes Replace the loss function with this upper bound Minimize the objective to obtain a linear classifier
Learning – Summary D = {(xi, yi), i = 1, …, n}, yi ∈ {-1,+1} minw ∑i [ maxy { wTΨ(xi, y) + Δ(yi, y) } - wTΨ(xi, yi) ] + λ||w||² Let us look at a specific example
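As a sketch under the same notation, the objective for the binary case could be evaluated as below; the joint feature vector is the two-block form from the earlier slides, the toy data, λ value and names are placeholders, and a real implementation would minimize this with a convex solver or subgradient descent rather than just evaluating it:

```python
import numpy as np

def joint_feature(phi, y):
    """Psi(x, y): Phi(x) stacked against a zero vector, as in the earlier slides."""
    zeros = np.zeros_like(phi)
    return np.concatenate([phi, zeros]) if y == -1 else np.concatenate([zeros, phi])

def objective(w, data, lam):
    """sum_i [ max_y { w^T Psi(xi,y) + Delta(yi,y) } - w^T Psi(xi,yi) ] + lam * ||w||^2."""
    total = 0.0
    for phi, y_true in data:
        # Convex upper bound on the 0-1 loss for sample (xi, yi).
        loss = max(w @ joint_feature(phi, y) + (0.0 if y == y_true else 1.0)
                   for y in (-1, +1))
        total += loss - w @ joint_feature(phi, y_true)
    return total + lam * (w @ w)

# Toy usage: two samples, a candidate w = [w_N; w_P], and lambda = 0.1 (all placeholder values).
data = [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
print(objective(np.array([0.5, -0.5, 1.0, 0.5]), data, lam=0.1))
```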
Outline • Classification • Binary Classification • Formulation • Support Vector Machine (SVM) • Max-Margin Interpretation • Optimization • Application • Multiclass Classification
Prediction – Joint Feature Vector Given input x and output y Joint feature vector Ψ(x,y) Ψ(x,+1) = [Φ(x); 1] The classifier doesn't always pass through the origin
Prediction – Joint Feature Vector Given input x and output y Joint feature vector Ψ(x,y) Ψ(x,-1) = [0; 0], where the upper 0 is a vector of zeros of the same size as Φ(x)
Prediction – Score Function Score: (wS)TΨ(x,y) Linear classifier Ψ(x,+1) = [Φ(x); 1], Ψ(x,-1) = [0; 0], wS = [w; b] w is the weight vector, b is the bias -ve score = 0 +ve score = wTΦ(x) + b Make the prediction by maximizing the score over {-1,+1}
Prediction – Summary Weight vector: w Bias: b Prediction y(w) = +1 if wTΦ(x) + b ≥ 0, and -1 otherwise In other words, y(w) = sign(wTΦ(x) + b)
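A one-line sketch of this rule; the names are illustrative, and a score of exactly 0 is mapped to +1, matching the ≥ 0 convention on the slide:

```python
import numpy as np

def predict(w, b, phi):
    """y(w) = sign(w^T Phi(x) + b), with a score of exactly 0 mapped to +1."""
    return +1 if w @ phi + b >= 0 else -1

print(predict(np.array([1.0, -2.0]), 0.5, np.array([3.0, 1.0])))   # +1, score = 1.5
```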
Learning Convex upper bound of the 0-1 loss function maxy { (wS)TΨ(xi, y) + Δ(yi, y) } – (wS)TΨ(xi, yi) Consider a positive sample, yi = +1 If y = +1, the term is 0 If y = -1, the term is 1 – ((wS)TΨ(xi,+1)) = 1 – (wTΦ(xi) + b) = 1 – yi(wTΦ(xi) + b)
Learning Convex upper bound of the 0-1 loss function maxy { (wS)TΨ(xi, y) + Δ(yi, y) } – (wS)TΨ(xi, yi) Consider a negative sample, yi = -1 If y = -1, the term is 0 If y = +1, the term is 1 + ((wS)TΨ(xi,+1)) = 1 + (wTΦ(xi) + b) = 1 – yi(wTΦ(xi) + b)
Learning Convex upper bound of the 0-1 loss function: the hinge loss max{0, 1 - yi(wTΦ(xi) + b)} (Plot on the slide: hinge loss against yi(wTΦ(xi) + b))
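A minimal sketch of the hinge loss as a function; plotting it against y·(wᵀΦ(x) + b) would reproduce the figure the slide refers to, and the names are mine:

```python
import numpy as np

def hinge_loss(w, b, phi, y):
    """max{0, 1 - y*(w^T Phi(x) + b)}: zero once the sample is on the correct side
    with a margin of at least 1, and grows linearly with the violation otherwise."""
    return max(0.0, 1.0 - y * (w @ phi + b))
```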
Example "I applied for Oxford engineering" What class degree will she get, if admitted?
Example Say, an interview score of 8 or above implies a 1st class Consider a positive sample Loss = max{0, 1 - 1*(w * interview_score + b)} If w = 1/8 and b = 0, loss = 0 for 8, 8.5, 9, 9.5, … More suitable for classification than the squared loss
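The same calculation as a small sketch, using the slide's values w = 1/8 and b = 0 with the single interview score as the feature; the snippet is only illustrative:

```python
# Hinge loss for the positive (1st class) samples in the running example.
w, b = 1 / 8, 0.0
for score in [8, 8.5, 9, 9.5]:
    loss = max(0.0, 1.0 - (+1) * (w * score + b))
    print(f"interview score {score}: hinge loss {loss}")
# All four scores give zero loss: unlike the squared loss, confidently correct
# predictions are not penalized.
```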