Learning Non-Linear Kernel Combinations Subject to General Regularization: Theory and Applications
Rakesh Babu 200402007
Advisors: Prof. C. V. Jawahar, IIIT Hyderabad; Dr. Manik Varma, Microsoft Research
Overview • Introduction to Support Vector Machines (SVM) • Multiple Kernel Learning (MKL) • Problem Statement • Literature Survey • Generalized Multiple Kernel Learning (GMKL) • Applications • Conclusion
SVM Notation
• Training points xi with labels yi, i = 1, …, M
• Margin = 2/‖w‖
• Slack ξi > 1: misclassified point; ξi < 1: margin violation; ξi = 0: correctly classified (support vectors lie on the margin)
• Hyperplanes: wᵗφ(x) + b = −1, wᵗφ(x) + b = 0, wᵗφ(x) + b = +1
(Figure: separating hyperplane with margin, normal w, offset b, and support vectors.)
SVM Formulation
• Primal: Minimise ½wᵗw + C Σi ξi
• Subject to
• yi[wᵗxi + b] ≥ 1 − ξi
• ξi ≥ 0
• Dual: Maxα Σi αi − ½ Σij αiαjyiyj⟨xi, xj⟩
• Subject to
• Σi αiyi = 0
• 0 ≤ αi ≤ C
• w = Σi αiyixi
• f(x) = wᵗx + b = Σi αiyi⟨xi, x⟩ + b
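As an illustration of the last line, the following sketch (synthetic data and scikit-learn, not part of the thesis) checks that f(x) = Σi αiyi⟨xi, x⟩ + b, assembled from the dual coefficients of a trained linear SVM, reproduces the library's own decision function:

```python
# A minimal check that f(x) = sum_i alpha_i y_i <x_i, x> + b reproduces
# sklearn's decision_function for a linear-kernel SVC; data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])          # linearly separable labels

svc = SVC(kernel='linear', C=1.0).fit(X, y)
x = rng.normal(size=2)                  # a test point
# dual_coef_ already stores alpha_i * y_i for the support vectors
f = svc.dual_coef_[0] @ (svc.support_vectors_ @ x) + svc.intercept_[0]
assert np.isclose(f, svc.decision_function(x[None])[0])
```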
Kernel Trick
• Use a function φ that maps the input space to a feature space.
• The classifier is then built in the feature space.
SVM after Kernelization
• Primal: Minimise ½wᵗw + C Σi ξi
• Subject to yi[wᵗφ(xi) + b] ≥ 1 − ξi, ξi ≥ 0
• Dual: Maxα Σi αi − ½ Σij αiαjyiyj⟨φ(xi), φ(xj)⟩ (dot product in feature space)
• Replacing the dot product by the kernel function: Maxα Σi αi − ½ Σij αiαjyiyj k(xi, xj)
• Subject to Σi αiyi = 0, 0 ≤ αi ≤ C
• f(x) = wᵗφ(x) + b = Σi αiyi k(xi, x) + b
Kernel Function & Kernel Matrix
• Dot products in the feature space are computed efficiently using the kernel function: φᵗ(xi)φ(xj) = k(xi, xj)
• e.g. RBF: k(xi, xj) = exp(−γ‖xi − xj‖²)
• The kernel matrix collects these values over the training set: Kij = k(xi, xj)
• Property of a valid kernel function: it is positive definite
Some Popular Kernels
• Linear: k(xi, xj) = xiᵗxj
• Polynomial: k(xi, xj) = (xiᵗxj + c)ᵈ
• Gaussian (RBF): k(xi, xj) = exp(−γ‖xi − xj‖²)
• Chi-squared: k(xi, xj) = exp(−γ χ²(xi, xj))
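For concreteness, here is a minimal numpy sketch of the four kernels above; the hyper-parameters gamma, c and degree are free choices, and the chi-squared form assumes non-negative features such as histograms:

```python
# Sketches of the listed kernel functions for a pair of input vectors.
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, c=1.0, degree=2):
    return (xi @ xj + c) ** degree

def rbf(xi, xj, gamma=1.0):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def chi_squared(xi, xj, gamma=1.0, eps=1e-12):
    # chi2(xi, xj) = sum_k (xi_k - xj_k)^2 / (xi_k + xj_k); eps avoids 0/0
    return np.exp(-gamma * np.sum((xi - xj) ** 2 / (xi + xj + eps)))
```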
Varying the Kernel Parameter γ
• Figure: decision boundaries of an RBF-kernel SVM for γ = 0.001, γ = 1 and γ = 1000.
Learning the Kernel
• Valid kernel combinations:
• k = α1k1 + α2k2 (with α1, α2 ≥ 0)
• k = k1 · k2
• Learning the kernel function: k(xi, xj) = Σl dl kl(xi, xj)
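A small sketch of the two closure properties above, operating on precomputed Gram matrices (the helper name and equal default weights are illustrative):

```python
# Non-negatively weighted sums and elementwise (Schur) products of valid
# Gram matrices are again valid Gram matrices.
import numpy as np

def combine_kernels(K1, K2, a1=0.5, a2=0.5):
    assert a1 >= 0 and a2 >= 0      # non-negative weights keep the sum valid
    K_sum = a1 * K1 + a2 * K2       # k = a1*k1 + a2*k2
    K_prod = K1 * K2                # k = k1 * k2 (elementwise product)
    return K_sum, K_prod
```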
Multiple Kernel Learning
• Learning both the SVM parameters (the αi) and the kernel parameters (the dl) is the multiple kernel learning problem.
• k(xi, xj) = Σl dl kl(xi, xj)
• Figure: weighted base kernel matrices combine into a single kernel, d1K1 + d2K2 + d3K3 = K.
Problem Statement
• Most multiple kernel learning formulations are restricted to linear combinations of kernels subject to either l1 or l2 regularization.
• In this thesis, we address the problem of learning the kernel using non-linear kernel combinations subject to general regularization.
• We investigate several applications of non-linear kernel combinations.
Literature Survey • Kernel Target Alignment • Semi-Definite Programming-MKL (SDP) • Block l1-MKL (M-Y regularization + SMO) • Semi-Infinite Linear Programming-MKL (SILP) • Simple MKL (gradient descent) • Hyper kernels (SDP/SOCP) • Multi-class MKL • Hierarchical MKL • Local MKL • Mixed norm MKL (mirror descent)
Multiple Kernel Learning
• MKL learns a linear combination of base kernels: k(xi, xj) = Σl dl kl(xi, xj)
• Figure: d1K1 + d2K2 + d3K3 = K.
Generalized MKL
• GMKL learns non-linear kernel combinations.
• Product: k(xi, xj) = Πl kl(xi, xj)
• Figure: K1 × K2 × … = K.
Toy Example: Non-linear Kernel Combination
• Figure: the individual 1D feature spaces φ1 and φ2, and the combined kernel feature spaces obtained from the sum and the product of the two kernels.
Generalized MKL Primal
• Formulation: Min over (w, b, d) of ½wᵗw + Σi L(f(xi), yi) + r(d)
• subject to the constraints on d
• where
• (xi, yi) is the ith training point.
• f(x) = wᵗφd(x) + b
• L is a general loss function.
• Kd is a kernel function parameterised by d.
• r is a regulariser on the kernel parameters.
• This formulation is not convex.
GMKL Primal for Classification
• Minimise over d: T(d) subject to d ≥ 0
• where T(d) = Min over (w, b, ξ) of ½wᵗw + C Σi ξi + r(d)
• Subject to
• yi[wᵗφd(xi) + b] ≥ 1 − ξi
• ξi ≥ 0
• To minimise T using gradient descent we need to
• Prove that ∇dT exists.
• Calculate ∇dT efficiently.
Dual - Differentiability
• W(d) = r(d) + Maxα 1ᵗα − ½ αᵗYKdYα, where Y = diag(y)
• Subject to
• 1ᵗYα = 0
• 0 ≤ α ≤ C
• T(d) = W(d) by the principle of strong duality.
• Differentiability with respect to d comes from Danskin's Theorem [Danskin 1947].
Dual - Derivative
• Let α*(d) be the optimal value of α, so that W(d) = r(d) + 1ᵗα* − ½ α*ᵗYKdYα*
• ∇dW = ∇dr(d) − ½ α*ᵗY(∇dKd)Yα*
• For fixed d, W(d) is the standard SVM dual, so α* can be obtained using any SVM solver.
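This derivative can be sanity-checked numerically. The sketch below (synthetic data, a single kernel weight d, and an l1 regulariser r(d) = σd; all settings are illustrative, not the thesis' own) compares the Danskin gradient against a finite difference of W(d):

```python
# Numeric check that grad W = sigma - 0.5 * ay' (dK/dd) ay  (Danskin)
# matches a finite difference of W(d); a single RBF bandwidth d is learnt.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.sign(X[:, 0])
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances

def W_and_grad(d, C=1.0, sigma=1.0):
    K = np.exp(-d * D)                               # K(d), one RBF kernel
    svc = SVC(C=C, kernel='precomputed').fit(K, y)
    ay = np.zeros(len(y))
    ay[svc.support_] = svc.dual_coef_[0]             # ay_i = alpha_i * y_i
    alpha = ay * y                                   # recover alpha_i >= 0
    value = sigma * d + alpha.sum() - 0.5 * ay @ K @ ay
    grad = sigma + 0.5 * ay @ (D * K) @ ay           # since dK/dd = -D * K
    return value, grad

eps = 1e-4
v0, g0 = W_and_grad(1.0)
v1, _ = W_and_grad(1.0 + eps)
print(g0, (v1 - v0) / eps)   # the two numbers should roughly agree
```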
Final Algorithm
• Initialise d0 randomly
• Repeat until the convergence criterion is met:
• Form Kd using the current estimate of d.
• Use any SVM solver to obtain α*.
• Update dn+1 = max(0, dn − sn∇dW), where sn is the step size.
(A sketch of this loop follows below.)
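Putting the pieces together, here is a minimal sketch of the projected-gradient loop for the product-of-RBF kernel used in the feature-selection experiments, with r(d) = σ Σl dl; the function name, step size, initialisation and convergence test are illustrative choices, not the thesis' exact settings:

```python
# GMKL by projected gradient descent for the product-of-RBF kernel
# k(xi, xj) = prod_l exp(-d_l (x_il - x_jl)^2) with r(d) = sigma * sum_l d_l.
import numpy as np
from sklearn.svm import SVC

def gmkl_product_rbf(X, y, C=10.0, sigma=1.0, step=1e-3, n_iters=100, tol=1e-4):
    n, m = X.shape
    # Per-feature squared differences: D[l, i, j] = (x_il - x_jl)^2
    D = np.stack([(X[:, l:l+1] - X[:, l:l+1].T) ** 2 for l in range(m)])
    d = np.ones(m)                                   # initial kernel weights
    for _ in range(n_iters):
        K = np.exp(-np.einsum('l,lij->ij', d, D))    # K(d) = exp(-sum_l d_l D_l)
        svc = SVC(C=C, kernel='precomputed').fit(K, y)
        ay = np.zeros(n)
        ay[svc.support_] = svc.dual_coef_[0]         # ay_i = alpha_i * y_i
        # grad_l W = sigma + 0.5 * ay' (D_l * K) ay, since dK/dd_l = -D_l * K
        grad = sigma + 0.5 * np.array([ay @ (D[l] * K) @ ay for l in range(m)])
        d_new = np.maximum(0.0, d - step * grad)     # projected gradient step
        if np.linalg.norm(d_new - d) < tol:
            d = d_new
            break
        d = d_new
    return d, svc
```

Weights dl that the l1 regulariser drives to zero switch off the corresponding kernels, which is what enables the feature-selection application below.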
Applications
• Feature selection
• Learning discriminative parts/pixels for object categorization
• Recognition of characters taken from natural scenes
MKL & its Applications
• In general, applications exploit one of the following views of MKL:
• Obtaining the optimal weights of the different features used for the task.
• Interpreting the sparsity of the learnt kernel weights.
• Combining multiple heterogeneous data sources.
Applications: Feature Selection
• UCI datasets
• One RBF kernel per feature, combined by a product: k(xi, xj) = Πl exp(−dl(xil − xjl)²)
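Reusing the hypothetical gmkl_product_rbf sketch from the Final Algorithm slide, a toy usage example on a synthetic stand-in for a UCI dataset (the pruning threshold is illustrative):

```python
# Features whose learnt weight d_l is driven to zero by the l1 regulariser
# are discarded; only the informative features should survive.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.sign(X[:, 0] - X[:, 1])          # only the first two features matter

d, svc = gmkl_product_rbf(X, y, C=10.0, sigma=1.0)
selected = np.flatnonzero(d > 1e-6)     # indices of surviving features
print(selected)
```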
UCI Dataset Results (classification accuracy %; N = number of training points, M = number of features)

Dataset     | N   | M   | MKL        | GMKL
------------|-----|-----|------------|-----------
Ionosphere  | 246 | 34  | 89.9 ± 2.5 | 93.6 ± 2.0
Parkinson's | 136 | 22  | 87.3 ± 3.9 | 91.0 ± 3.5
Musk        | 333 | 166 | 90.2 ± 3.2 | 93.8 ± 1.9
Sonar       | 145 | 60  | 82.9 ± 3.4 | 84.6 ± 4.1
Wpbc        | 135 | 34  | 72.1 ± 5.4 | 77.0 ± 6.4
Application: Learning Discriminative Pixels/Parts
• Problem: can image categorization be done efficiently?
• Idea: the information present in images is often redundant.
• Solution: yes, by focusing on only a subset of pixels or regions in an image.
Solution
• A kernel is associated with each part; the learnt kernel weights then select the discriminative parts.
Pixel Selection for Gender Identification
• Database of FERET faces [Moghaddam and Yang, PAMI 2002].
• Figure: example male and female faces.
Gender Identification – Features
• Figure: each face image is represented by 252 pixel features (Pixel 1 … Pixel 252).
Gender Identification - Results
• N = 1053, M = 252
• MKL = 92.6 ± 0.9
• GMKL = 94.3 ± 0.1
Caltech 101
• Task: object recognition
• Number of classes: 102
• Problem: the images are not perfectly aligned, but they are roughly aligned.
• Collected by Fei-Fei et al. [PAMI 2006]
Approach
• Feature extraction: GIST
• Figure: the extracted GIST features yield 64 kernels (Kernel 1 … Kernel 64).
Problem
• Objective: recognition of English characters taken from natural scenes.
A sample approach for sentence recognition from images (bottom up):
• Locate characters in images (character detection)
• Recognise characters (character recognition)
• Recognise words
• Recognise sentences
• Figure: pipeline from Image → Characters → Words → Sentence.
Challenges
• Perspective distortion
• Occlusion
• Variations in contrast, colour, style and size
• Motion blur
• Inter-class distance is small while intra-class distance is large.
• Large number of classes
• Existing OCR techniques do not work here.
Character Recognition using Bag of Features (discrete distribution)
• Pipeline: patch detection → feature extraction → class-based vector quantisation → histogram computation → classification
• Figure: an image is mapped to a feature vector x = [+0.1, −1.5, …, −0.5] and classified.
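A minimal sketch of this pipeline (random descriptors stand in for real patch features, a single global k-means vocabulary replaces class-based quantisation, and the vocabulary size of 32 is illustrative):

```python
# Bag of features: vector-quantise local descriptors with k-means and
# represent each image as a normalised histogram (a discrete distribution)
# of visual words, which is then fed to the classifier.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
images = [rng.normal(size=(100, 128)) for _ in range(10)]  # stand-in descriptors

vocab = KMeans(n_clusters=32, n_init=10, random_state=0).fit(np.vstack(images))

def bof_histogram(descriptors, vocab):
    words = vocab.predict(descriptors)               # nearest visual word per patch
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()                         # normalise to a distribution

hists = np.stack([bof_histogram(im, vocab) for im in images])
```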
Feature Extraction Methods • Geometric Blur [Berg] • Shape Contexts [Belongie et al] • SIFT [Lowe] • Patches [Varma & Zisserman 07] • SPIN [Lazebnik et al., Johnson] • MR8 (maximum response of 8 filters) [Varma & Zisserman 05]
Results - SVM and MKL
• Figure: MKL results, with example confusions between class i and class j.
Conclusions
• We presented a formulation which accepts non-linear kernel combinations.
• GMKL results can be significantly better than standard MKL.
• We showed several applications where the proposed formulation performs better than state-of-the-art methods.