This thesis explores learning non-linear kernel combinations subject to general regularization in Support Vector Machines (SVMs). It covers an introduction to SVMs, multiple kernel learning (MKL), the problem statement, a literature survey, generalized MKL (GMKL), applications, and conclusions. The SVM formulation, the kernel trick, and the properties of kernel functions are discussed, along with popular kernels such as the RBF kernel and their decision boundaries. The thesis emphasises learning the kernel function jointly with the SVM parameters, analyses various MKL formulations and algorithms such as Semi-Definite Programming MKL and SimpleMKL, and details the GMKL formulation for classification.
Learning Non-Linear Kernel Combinations Subject to General Regularization: Theory and Applications. Rakesh Babu (200402007). Advisors: Prof. C. V. Jawahar (IIIT Hyderabad) and Dr. Manik Varma (Microsoft Research).
Overview • Introduction to Support Vector Machines (SVM) • Multiple Kernel Learning (MKL) • Problem Statement • Literature Survey • Generalized Multiple Kernel Learning (GMKL) • Applications • Conclusion
SVM Notation • Training points $x_i$ with labels $y_i$, $i = 1, \dots, M$ • Separating hyperplane $w^t x + b = 0$, with margin hyperplanes $w^t x + b = -1$ and $w^t x + b = +1$ • Margin $= 2 / \lVert w \rVert$ • Slack $\xi_i = 0$ for points on or outside the margin (support vectors lie on it), $0 < \xi_i < 1$ for points inside the margin, and $\xi_i > 1$ for misclassified points (figure: two-class data with the margin and support vectors)
SVM Formulation • Primal: $\min_{w,b,\xi}\ \tfrac{1}{2}w^t w + C \sum_i \xi_i$ • Subject to • $y_i\,[w^t x_i + b] \ge 1 - \xi_i$ • $\xi_i \ge 0$ • Dual: $\max_\alpha\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$ • Subject to • $\sum_i \alpha_i y_i = 0$ • $0 \le \alpha_i \le C$ • $w = \sum_i \alpha_i y_i x_i$, so $f(x) = w^t x + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
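As an illustration (not part of the original slides), here is a minimal sketch of the dual solution using scikit-learn: `SVC` exposes the dual coefficients $\alpha_i y_i$ and the support vectors, from which $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$ can be evaluated directly. The toy data and hyperparameters are arbitrary choices.

```python
# Minimal sketch (assumes scikit-learn is available); illustrates the
# dual solution of the soft-margin SVM from the slide above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
beta = clf.dual_coef_[0]            # alpha_i * y_i
sv = X[clf.support_]                # the support vectors x_i

# f(x) = sum_i alpha_i y_i <x_i, x> + b, evaluated at one test point.
x_test = np.array([0.5, -0.2])
f = beta @ (sv @ x_test) + clf.intercept_[0]
print(f, clf.decision_function([x_test])[0])  # the two values agree
```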
Kernel Trick • Use a mapping $\phi$ from the input space to a (typically higher-dimensional) feature space. • Build the linear classifier in feature space.
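A small numeric check (my illustration, not from the slides): for the homogeneous quadratic kernel $k(x, z) = (x^t z)^2$ in 2D, the explicit map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ gives the same value, so the classifier can work in feature space without ever computing $\phi$.

```python
# Sketch: the kernel trick for k(x, z) = (x^T z)^2 in two dimensions.
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous quadratic kernel.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2)        # kernel value: 1.0
print(phi(x) @ phi(z))     # identical dot product in feature space: 1.0
```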
SVM after Kernelization • Primal: $\min_{w,b,\xi}\ \tfrac{1}{2}w^t w + C \sum_i \xi_i$ • Subject to • $y_i\,[w^t \phi(x_i) + b] \ge 1 - \xi_i$ • $\xi_i \ge 0$ • Dual: $\max_\alpha\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle$ • Subject to • $\sum_i \alpha_i y_i = 0$ • $0 \le \alpha_i \le C$ • $f(x) = w^t \phi(x) + b = \sum_i \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$ (a dot product in feature space)
SVM after Kernelization • Replacing the feature-space dot product with the kernel function: • Dual: $\max_\alpha\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$ • Subject to • $\sum_i \alpha_i y_i = 0$ • $0 \le \alpha_i \le C$ • $f(x) = \sum_i \alpha_i y_i\, k(x_i, x) + b$ (the kernel function replaces $\langle \phi(x_i), \phi(x_j) \rangle$)
Kernel Function & Kernel Matrix • Dot products in feature space, $\phi^t(x_i)\,\phi(x_j)$, are computed efficiently using a kernel function $k(x_i, x_j)$. • e.g. RBF: $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ • Property of a valid kernel function: it is positive definite. • The kernel (Gram) matrix collects all pairwise values: $K_{ij} = k(x_i, x_j)$ (figure: two-class data in input space and in feature space)
Some Popular Kernels • Linear: $k(x_i, x_j) = x_i^t x_j$ • Polynomial: $k(x_i, x_j) = (x_i^t x_j + c)^d$ • Gaussian (RBF): $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ • Chi-Squared: $k(x_i, x_j) = \exp(-\gamma\, \chi^2(x_i, x_j))$
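A hedged sketch (not from the slides) of the kernels listed above, together with an eigenvalue check that the resulting Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semi-definite; the $\gamma$, $c$, and $d$ values are arbitrary choices, and the chi-squared kernel assumes non-negative inputs such as histograms.

```python
# Sketch of the popular kernels above; gamma/c/d values are arbitrary choices.
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, c=1.0, d=3):
    return (xi @ xj + c) ** d

def rbf(xi, xj, gamma=1.0):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def chi_squared(xi, xj, gamma=1.0, eps=1e-10):
    # Assumes non-negative features (e.g. histograms).
    chi2 = np.sum((xi - xj) ** 2 / (xi + xj + eps))
    return np.exp(-gamma * chi2)

X = np.abs(np.random.default_rng(0).normal(size=(30, 5)))
K = np.array([[rbf(a, b) for b in X] for a in X])

# A valid kernel yields a positive semi-definite Gram matrix.
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```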
Varying the Kernel Parameter $\gamma$ (figure: decision boundaries of the RBF kernel for $\gamma = 0.001$, $\gamma = 1$, and $\gamma = 1000$)
Learning the Kernel • Valid kernel combinations (see the sketch below): • $k = \alpha_1 k_1 + \alpha_2 k_2$ (with $\alpha_1, \alpha_2 \ge 0$) • $k = k_1 \cdot k_2$ • Learning the kernel function: $k(x_i, x_j) = \sum_l d_l\, k_l(x_i, x_j)$
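A tiny illustrative check (my addition) of the product rule above: the element-wise (Hadamard) product of two PSD Gram matrices is again PSD by the Schur product theorem, so the product of valid kernels is a valid kernel.

```python
# Sketch: the element-wise product of two RBF Gram matrices stays PSD.
import numpy as np

X = np.random.default_rng(1).normal(size=(30, 4))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K1, K2 = np.exp(-0.1 * sq), np.exp(-1.0 * sq)

K = K1 * K2                                    # Hadamard product
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: still a valid kernel
```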
Multiple Kernel Learning • Learning the SVM parameters (the $\alpha$'s) and the kernel parameters (the $d$'s) jointly is the multiple kernel learning problem. • $k(x_i, x_j) = \sum_l d_l\, k_l(x_i, x_j)$ (figure: weighted base kernel matrices summing to the combined kernel)
Problem Statement • Most multiple kernel learning formulations are restricted to linear combinations of kernels subject to either $l_1$ or $l_2$ regularization. • In this thesis, we address how the kernel can be learnt using non-linear kernel combinations subject to general regularization. • We investigate several applications of non-linear kernel combinations.
Literature Survey • Kernel Target Alignment • Semi-Definite Programming-MKL (SDP) • Block l1-MKL (M-Y regularization + SMO) • Semi-Infinite Linear Programming-MKL (SILP) • Simple MKL (gradient descent) • Hyper kernels (SDP/SOCP) • Multi-class MKL • Hierarchical MKL • Local MKL • Mixed norm MKL (mirror descent)
Multiple Kernel Learning • MKL learns a linear combination of base kernels: $k(x_i, x_j) = \sum_l d_l\, k_l(x_i, x_j)$ (figure: $d_1 K_1 + d_2 K_2 + d_3 K_3$ giving the combined kernel matrix)
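A minimal illustration (assumed setup, not the thesis code) of a fixed linear combination $K = \sum_l d_l K_l$ trained with an SVM on a precomputed kernel; in MKL proper the weights $d_l$ would be learned rather than fixed, and the bandwidths and weights here are placeholders.

```python
# Sketch: SVM on a fixed linear combination of base RBF kernels.
import numpy as np
from sklearn.svm import SVC

def rbf_gram(X, gamma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (25, 4)), rng.normal(1, 1, (25, 4))])
y = np.array([-1] * 25 + [1] * 25)

gammas = [0.01, 0.1, 1.0]              # one base kernel per bandwidth
d = np.array([0.2, 0.5, 0.3])          # fixed weights; MKL would learn these
K = sum(dl * rbf_gram(X, g) for dl, g in zip(d, gammas))

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print(clf.score(K, y))                 # training accuracy
```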
Generalized MKL • GMKL learns non-linear kernel combinations • Product: $k(x_i, x_j) = \prod_l k_l(x_i, x_j)$ (figure: the element-wise product of base kernel matrices giving the combined kernel)
Toy Example: Non-Linear Kernel Combination (figure: the individual 1D feature spaces $\phi_1$ and $\phi_2$, and the combined kernel feature spaces under the sum and product combinations)
Generalized MKL Primal • Formulation: $\min_{w,b,d}\ \tfrac{1}{2}w^t w + \sum_i L(f(x_i), y_i) + r(d)$, subject to the constraints on $d$ • where • $(x_i, y_i)$ is the $i$th training point • $f(x) = w^t \phi_d(x) + b$ • $L$ is a general loss function • $K_d$ is a kernel function parameterised by $d$ • $r$ is a regulariser on the kernel parameters • This formulation is not convex.
GMKL Primal for Classification • $\min_d\ T(d)$ subject to $d \ge 0$ • where • $T(d) = \min_{w,b,\xi}\ \tfrac{1}{2}w^t w + C \sum_i \xi_i + r(d)$ • Subject to • $y_i\,[w^t \phi_d(x_i) + b] \ge 1 - \xi_i$ • $\xi_i \ge 0$ • To minimise $T$ using gradient descent we need to • prove that $\nabla_d T$ exists • calculate $\nabla_d T$ efficiently.
Dual - Differentiability • $W(d) = r(d) + \max_\alpha\ \mathbf{1}^t \alpha - \tfrac{1}{2}\alpha^t Y K_d Y \alpha$, where $Y = \mathrm{diag}(y)$ • Subject to • $\mathbf{1}^t Y \alpha = 0$ • $0 \le \alpha \le C$ • $T(d) = W(d)$ by the principle of strong duality. • Differentiability with respect to $d$ follows from Danskin's theorem [Danskin 1947].
Dual - Derivative • Let $\alpha^*(d)$ be the optimal value of $\alpha$, so that $W(d) = r(d) + \mathbf{1}^t \alpha^* - \tfrac{1}{2}\alpha^{*t} Y K_d Y \alpha^*$ • $\nabla_d W = \nabla_d r - \tfrac{1}{2}\alpha^{*t} Y (\nabla_d K) Y \alpha^*$ • Since $d$ is fixed, $W(d)$ is the standard SVM dual and $\alpha^*$ can be obtained using any SVM solver.
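Because $\alpha^*$ is obtained numerically, it is worth checking the analytic derivative against a finite difference. A hedged sketch (my addition) for a single RBF bandwidth $d$ with $r(d) = \lambda d$; all numerical choices are placeholders.

```python
# Sketch: check dW/dd from Danskin's theorem against a finite difference.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 3)), rng.normal(1, 1, (20, 3))])
y = np.array([-1.0] * 20 + [1.0] * 20)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
C, lam = 10.0, 1.0

def W_and_grad(d):
    K = np.exp(-d * sq)
    svm = SVC(kernel="precomputed", C=C).fit(K, y)
    beta = np.zeros(len(y))
    beta[svm.support_] = svm.dual_coef_[0]      # beta_i = alpha_i * y_i
    alpha = np.abs(beta)                        # since y_i is +1 or -1
    W = lam * d + alpha.sum() - 0.5 * beta @ K @ beta
    grad = lam + 0.5 * beta @ (sq * K) @ beta   # uses dK/dd = -sq * K
    return W, grad

d, eps = 0.5, 1e-4
W0, g = W_and_grad(d)
W1, _ = W_and_grad(d + eps)
print(g, (W1 - W0) / eps)   # analytic vs finite difference: close agreement
```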
Final Algorithm • Initialise $d^0$ randomly • Repeat until the convergence criterion is met: • Form $K$ using the current estimate of $d$ • Use any SVM solver to obtain $\alpha^*$ • Update $d^{n+1} = \max(0,\ d^n - s_n \nabla_d W)$
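Below is a self-contained sketch of this projected gradient loop, my own illustration rather than the thesis implementation. It uses the product of per-feature RBF kernels from the feature-selection application, $k_d(x_i, x_j) = \prod_l \exp(-d_l (x_{il} - x_{jl})^2)$, an $l_1$ regulariser $r(d) = \lambda \sum_l d_l$, and scikit-learn's SVC as the inner solver; the step size, $\lambda$, $C$, and the stopping rule are placeholder choices.

```python
# Sketch of the GMKL projected gradient loop (illustrative choices throughout).
import numpy as np
from sklearn.svm import SVC

def gmkl_fit(X, y, C=10.0, lam=1.0, step=1e-3, iters=100):
    M, L = X.shape
    # D[l, i, j] = (x_il - x_jl)^2, precomputed once per feature.
    D = (X[:, None, :] - X[None, :, :]) ** 2        # shape (M, M, L)
    D = np.transpose(D, (2, 0, 1))                  # shape (L, M, M)
    d = np.full(L, 1.0 / L)                         # initialisation

    for _ in range(iters):
        K = np.exp(-np.tensordot(d, D, axes=1))     # product-of-RBF kernel
        svm = SVC(kernel="precomputed", C=C).fit(K, y)
        beta = np.zeros(M)                          # beta_i = alpha_i * y_i
        beta[svm.support_] = svm.dual_coef_[0]
        # dW/dd_l = dr/dd_l - 0.5 * beta^T (dK/dd_l) beta,
        # with dK/dd_l = -D_l * K (element-wise), hence the + sign below.
        grad = lam + 0.5 * np.einsum("i,lij,j->l", beta, D * K, beta)
        d_new = np.maximum(0.0, d - step * grad)    # projected gradient step
        if np.linalg.norm(d_new - d) < 1e-5:        # placeholder stopping rule
            d = d_new
            break
        d = d_new
    return d, svm

# Toy usage: two informative features among five.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
d, svm = gmkl_fit(X, y)
print(np.round(d, 3))   # weights on irrelevant features are driven toward 0
```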
Applications • Feature selection • Learning discriminative parts/pixels for object categorization • Recognition of characters taken from natural scenes
MKL & its Applications • In general, applications exploit one of the following views of MKL. • To obtain the optimal weights of different features used for the task. • To interpret the sparsity after learning the weights of the kernels. • To Combine the multiple heterogeneous data sources.
Applications: Feature Selection • UCI datasets • $k(x_i, x_j) = \prod_l \exp(-d_l (x_{il} - x_{jl})^2)$
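With one RBF factor per feature, a weight $d_l$ driven to zero switches feature $l$ off entirely (its factor becomes the constant 1), so feature selection can be read straight off the learned weights. A tiny illustration with made-up weights:

```python
# Sketch: reading selected features off learned GMKL weights.
import numpy as np

d = np.array([0.84, 0.0, 0.61, 0.003, 0.0])  # hypothetical learned weights
selected = np.flatnonzero(d > 1e-2)          # threshold is an arbitrary choice
print(selected)                               # -> [0 2]
```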
UCI Datasets – Ionosphere • $N = 246$, $M = 34$ • MKL = 89.9 ± 2.5 • GMKL = 93.6 ± 2.0 (the slide's chart also includes a 'Uniform' baseline)
UCI Datasets – Parkinson's • $N = 136$, $M = 22$ • MKL = 87.3 ± 3.9 • GMKL = 91.0 ± 3.5 (the slide's chart also includes a 'Uniform' baseline)
UCI Datasets – Musk • $N = 333$, $M = 166$ • MKL = 90.2 ± 3.2 • GMKL = 93.8 ± 1.9 (the slide's chart also includes a 'Uniform' baseline)
UCI Datasets – Sonar • $N = 145$, $M = 60$ • MKL = 82.9 ± 3.4 • GMKL = 84.6 ± 4.1 (the slide's chart also includes a 'Uniform' baseline)
UCI Datasets – Wpbc • $N = 135$, $M = 34$ • MKL = 72.1 ± 5.4 • GMKL = 77.0 ± 6.4 (the slide's chart also includes a 'Uniform' baseline)
Application: Learning Discriminative Pixels/Parts • Problem: Can image categorization be done efficiently? • Idea: Often, the information present in images is redundant. • Solution: Yes, by focusing on only a subset of pixels or regions in an image.
Solution • A kernel is associated with each part.
Pixel Selection for Gender Identification • Database of FERET faces [Moghaddam and Yang, PAMI 2002] (figure: example male and female faces)
Gender Identification – Features (figure: the features are the individual pixels, Pixel 1 through Pixel 252)
Gender Identification - Results • $N = 1053$, $M = 252$ • MKL = 92.6 ± 0.9 • GMKL = 94.3 ± 0.1 (the slide's chart also includes a 'Uniform' baseline)
Caltech 101 • Task: object recognition • No. of classes: 102 • Problem: the images are not perfectly aligned... but they are roughly aligned • Collected by Fei-Fei et al. [PAMI 2006]
Approach (figure) • Feature extraction: GIST • 64 kernels, Kernel 1 through Kernel 64, are combined
Problem • Objective: recognition of English characters taken from natural scenes.
A sample bottom-up approach for sentence recognition from images • Locate characters in images (character detection) • Recognise characters • Recognise words • Recognise sentences (figure: Image → character detection → character recognition → characters → words → sentence)
Challenges • Perspective distortion • Occlusion • Variations in contrast, color, style, and size • Motion blur • Inter-class distance is small while intra-class distance is large • Large number of classes • None of the existing OCR techniques works well here.
Character Recognition using Bag of Features • Pipeline: patch detection → feature extraction → class-based vector quantisation → histogram computation (a discrete distribution) → classification
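A hedged sketch (my addition) of a generic bag-of-features pipeline of this shape: patch descriptors are vector-quantised against a learned codebook and the resulting histograms are classified. The `describe` stub, codebook size, and labels are all made up for the illustration, and a single global codebook is used here instead of the class-based codebooks named on the slide.

```python
# Generic bag-of-features sketch (stub descriptors; illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def describe(image_seed, n_patches=50, dim=16):
    # Stub for patch detection + feature extraction (e.g. SIFT, Geometric Blur).
    return np.random.default_rng(image_seed).normal(size=(n_patches, dim))

# Build a codebook by clustering descriptors pooled over training images.
train_desc = np.vstack([describe(s) for s in range(40)])
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_desc)

def histogram(desc):
    # Vector quantisation: assign each patch to its nearest codeword, then
    # represent the image as a discrete distribution over codewords.
    words = codebook.predict(desc)
    h = np.bincount(words, minlength=32).astype(float)
    return h / h.sum()

X = np.vstack([histogram(describe(s)) for s in range(40)])
y = np.array([0] * 20 + [1] * 20)   # made-up labels for the sketch
print(SVC(kernel="rbf", gamma=1.0).fit(X, y).score(X, y))
```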
Feature Extraction Methods • Geometric Blur [Berg] • Shape Contexts [Belongie et al.] • SIFT [Lowe] • Patches [Varma & Zisserman 07] • SPIN [Lazebnik et al., Johnson] • MR8 (maximum response of 8 filters) [Varma & Zisserman 05]
Results - SVM and MKL (figure: comparison of SVM and MKL results, shown for class $i$ and class $j$)
Conclusions • We presented a formulation which accepts non-linear kernel combinations. • GMKL results can be significantly better than those of standard MKL. • We showed several applications where the proposed formulation performs better than state-of-the-art methods.