Learning Non-Linear Kernel Combinations Subject to General Regularization: Theory and Applications
Rakesh Babu 200402007
Advisors: Prof. C. V. Jawahar, IIIT Hyderabad; Dr. Manik Varma, Microsoft Research
Overview • Introduction to Support Vector Machines (SVM) • Multiple Kernel Learning (MKL) • Problem Statement • Literature Survey • Generalized Multiple Kernel Learning (GMKL) • Applications • Conclusion
SVM Notation
• Training points xi with labels yi, i = 1, …, M
• Margin = 2/‖w‖
• Slack ξi > 1: misclassified point; ξi < 1: margin violation; ξi = 0: correctly classified (support vectors lie on the margin)
• Hyperplanes: wᵗφ(x) + b = −1, wᵗφ(x) + b = 0, wᵗφ(x) + b = +1
(Figure: separating hyperplane with margin, normal w, offset b, and support vectors.)
SVM Formulation
• Primal: Minimise ½wᵗw + C Σi ξi
• Subject to
• yi[wᵗxi + b] ≥ 1 − ξi
• ξi ≥ 0
• Dual: Maxα Σi αi − ½ Σij αiαjyiyj⟨xi, xj⟩
• Subject to
• Σi αiyi = 0
• 0 ≤ αi ≤ C
• w = Σi αiyixi
• f(x) = wᵗx + b = Σi αiyi⟨xi, x⟩ + b
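As an illustration of the last line, the following sketch (synthetic data and scikit-learn, not part of the thesis) checks that f(x) = Σi αiyi⟨xi, x⟩ + b, assembled from the dual coefficients of a trained linear SVM, reproduces the library's own decision function:

```python
# A minimal check that f(x) = sum_i alpha_i y_i <x_i, x> + b reproduces
# sklearn's decision_function for a linear-kernel SVC; data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])          # linearly separable labels

svc = SVC(kernel='linear', C=1.0).fit(X, y)
x = rng.normal(size=2)                  # a test point
# dual_coef_ already stores alpha_i * y_i for the support vectors
f = svc.dual_coef_[0] @ (svc.support_vectors_ @ x) + svc.intercept_[0]
assert np.isclose(f, svc.decision_function(x[None])[0])
```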
Kernel Trick
• Use a function φ that maps the input space to a feature space.
• The classifier is then built in the feature space.
SVM after Kernelization
• Primal: Minimise ½wᵗw + C Σi ξi
• Subject to yi[wᵗφ(xi) + b] ≥ 1 − ξi, ξi ≥ 0
• Dual: Maxα Σi αi − ½ Σij αiαjyiyj⟨φ(xi), φ(xj)⟩ (dot product in feature space)
• Replacing the dot product by the kernel function: Maxα Σi αi − ½ Σij αiαjyiyj k(xi, xj)
• Subject to Σi αiyi = 0, 0 ≤ αi ≤ C
• f(x) = wᵗφ(x) + b = Σi αiyi k(xi, x) + b
Kernel Function & Kernel Matrix
• Dot products in the feature space are computed efficiently using the kernel function: φᵗ(xi)φ(xj) = k(xi, xj)
• e.g. RBF: k(xi, xj) = exp(−γ‖xi − xj‖²)
• The kernel matrix collects these values over the training set: Kij = k(xi, xj)
• Property of a valid kernel function: it is positive definite
Some Popular Kernels
• Linear: k(xi, xj) = xiᵗxj
• Polynomial: k(xi, xj) = (xiᵗxj + c)ᵈ
• Gaussian (RBF): k(xi, xj) = exp(−γ‖xi − xj‖²)
• Chi-squared: k(xi, xj) = exp(−γ χ²(xi, xj))
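For concreteness, here is a minimal numpy sketch of the four kernels above; the hyper-parameters gamma, c and degree are free choices, and the chi-squared form assumes non-negative features such as histograms:

```python
# Sketches of the listed kernel functions for a pair of input vectors.
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, c=1.0, degree=2):
    return (xi @ xj + c) ** degree

def rbf(xi, xj, gamma=1.0):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def chi_squared(xi, xj, gamma=1.0, eps=1e-12):
    # chi2(xi, xj) = sum_k (xi_k - xj_k)^2 / (xi_k + xj_k); eps avoids 0/0
    return np.exp(-gamma * np.sum((xi - xj) ** 2 / (xi + xj + eps)))
```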
Varying the Kernel Parameter γ
• Figure: decision boundaries of an RBF-kernel SVM for γ = 0.001, γ = 1 and γ = 1000.
Learning the Kernel
• Valid kernel combinations:
• k = α1k1 + α2k2 (with α1, α2 ≥ 0)
• k = k1 · k2
• Learning the kernel function: k(xi, xj) = Σl dl kl(xi, xj)
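A small sketch of the two closure properties above, operating on precomputed Gram matrices (the helper name and equal default weights are illustrative):

```python
# Non-negatively weighted sums and elementwise (Schur) products of valid
# Gram matrices are again valid Gram matrices.
import numpy as np

def combine_kernels(K1, K2, a1=0.5, a2=0.5):
    assert a1 >= 0 and a2 >= 0      # non-negative weights keep the sum valid
    K_sum = a1 * K1 + a2 * K2       # k = a1*k1 + a2*k2
    K_prod = K1 * K2                # k = k1 * k2 (elementwise product)
    return K_sum, K_prod
```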
Multiple Kernel Learning
• Learning both the SVM parameters (the αi) and the kernel parameters (the dl) is the multiple kernel learning problem.
• k(xi, xj) = Σl dl kl(xi, xj)
• Figure: weighted base kernel matrices combine into a single kernel, d1K1 + d2K2 + d3K3 = K.
Problem Statement
• Most multiple kernel learning formulations are restricted to linear combinations of kernels subject to either l1 or l2 regularization.
• In this thesis, we address the problem of learning the kernel using non-linear kernel combinations subject to general regularization.
• We investigate several applications of non-linear kernel combinations.
Literature Survey • Kernel Target Alignment • Semi-Definite Programming-MKL (SDP) • Block l1-MKL (M-Y regularization + SMO) • Semi-Infinite Linear Programming-MKL (SILP) • Simple MKL (gradient descent) • Hyper kernels (SDP/SOCP) • Multi-class MKL • Hierarchical MKL • Local MKL • Mixed norm MKL (mirror descent)
Multiple Kernel Learning
• MKL learns a linear combination of base kernels: k(xi, xj) = Σl dl kl(xi, xj)
• Figure: d1K1 + d2K2 + d3K3 = K.
Generalized MKL
• GMKL learns non-linear kernel combinations.
• Product: k(xi, xj) = Πl kl(xi, xj)
• Figure: K1 × K2 × … = K.
Toy Example: Non-linear Kernel Combination
• Figure: the individual 1D feature spaces φ1 and φ2, and the combined kernel feature spaces obtained from the sum and the product of the two kernels.
Generalized MKL Primal
• Formulation: Min over (w, b, d) of ½wᵗw + Σi L(f(xi), yi) + r(d)
• subject to the constraints on d
• where
• (xi, yi) is the ith training point.
• f(x) = wᵗφd(x) + b
• L is a general loss function.
• Kd is a kernel function parameterised by d.
• r is a regulariser on the kernel parameters.
• This formulation is not convex.
GMKL Primal for Classification
• Minimise over d: T(d) subject to d ≥ 0
• where T(d) = Min over (w, b, ξ) of ½wᵗw + C Σi ξi + r(d)
• Subject to
• yi[wᵗφd(xi) + b] ≥ 1 − ξi
• ξi ≥ 0
• To minimise T using gradient descent we need to
• Prove that ∇dT exists.
• Calculate ∇dT efficiently.
Dual - Differentiability
• W(d) = r(d) + Maxα 1ᵗα − ½ αᵗYKdYα, where Y = diag(y)
• Subject to
• 1ᵗYα = 0
• 0 ≤ α ≤ C
• T(d) = W(d) by the principle of strong duality.
• Differentiability with respect to d comes from Danskin's Theorem [Danskin 1947].
Dual - Derivative
• Let α*(d) be the optimal value of α, so that W(d) = r(d) + 1ᵗα* − ½ α*ᵗYKdYα*
• ∇dW = ∇dr(d) − ½ α*ᵗY(∇dKd)Yα*
• For fixed d, W(d) is the standard SVM dual, so α* can be obtained using any SVM solver.
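This derivative can be sanity-checked numerically. The sketch below (synthetic data, a single kernel weight d, and an l1 regulariser r(d) = σd; all settings are illustrative, not the thesis' own) compares the Danskin gradient against a finite difference of W(d):

```python
# Numeric check that grad W = sigma - 0.5 * ay' (dK/dd) ay  (Danskin)
# matches a finite difference of W(d); a single RBF bandwidth d is learnt.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.sign(X[:, 0])
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances

def W_and_grad(d, C=1.0, sigma=1.0):
    K = np.exp(-d * D)                               # K(d), one RBF kernel
    svc = SVC(C=C, kernel='precomputed').fit(K, y)
    ay = np.zeros(len(y))
    ay[svc.support_] = svc.dual_coef_[0]             # ay_i = alpha_i * y_i
    alpha = ay * y                                   # recover alpha_i >= 0
    value = sigma * d + alpha.sum() - 0.5 * ay @ K @ ay
    grad = sigma + 0.5 * ay @ (D * K) @ ay           # since dK/dd = -D * K
    return value, grad

eps = 1e-4
v0, g0 = W_and_grad(1.0)
v1, _ = W_and_grad(1.0 + eps)
print(g0, (v1 - v0) / eps)   # the two numbers should roughly agree
```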
Final Algorithm
• Initialise d0 randomly
• Repeat until the convergence criterion is met:
• Form Kd using the current estimate of d.
• Use any SVM solver to obtain α*.
• Update dn+1 = max(0, dn − sn∇dW), where sn is the step size.
(A sketch of this loop follows below.)
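Putting the pieces together, here is a minimal sketch of the projected-gradient loop for the product-of-RBF kernel used in the feature-selection experiments, with r(d) = σ Σl dl; the function name, step size, initialisation and convergence test are illustrative choices, not the thesis' exact settings:

```python
# GMKL by projected gradient descent for the product-of-RBF kernel
# k(xi, xj) = prod_l exp(-d_l (x_il - x_jl)^2) with r(d) = sigma * sum_l d_l.
import numpy as np
from sklearn.svm import SVC

def gmkl_product_rbf(X, y, C=10.0, sigma=1.0, step=1e-3, n_iters=100, tol=1e-4):
    n, m = X.shape
    # Per-feature squared differences: D[l, i, j] = (x_il - x_jl)^2
    D = np.stack([(X[:, l:l+1] - X[:, l:l+1].T) ** 2 for l in range(m)])
    d = np.ones(m)                                   # initial kernel weights
    for _ in range(n_iters):
        K = np.exp(-np.einsum('l,lij->ij', d, D))    # K(d) = exp(-sum_l d_l D_l)
        svc = SVC(C=C, kernel='precomputed').fit(K, y)
        ay = np.zeros(n)
        ay[svc.support_] = svc.dual_coef_[0]         # ay_i = alpha_i * y_i
        # grad_l W = sigma + 0.5 * ay' (D_l * K) ay, since dK/dd_l = -D_l * K
        grad = sigma + 0.5 * np.array([ay @ (D[l] * K) @ ay for l in range(m)])
        d_new = np.maximum(0.0, d - step * grad)     # projected gradient step
        if np.linalg.norm(d_new - d) < tol:
            d = d_new
            break
        d = d_new
    return d, svc
```

Weights dl that the l1 regulariser drives to zero switch off the corresponding kernels, which is what enables the feature-selection application below.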
Applications
• Feature selection
• Learning discriminative parts/pixels for object categorization
• Recognition of characters taken from natural scenes
MKL & its Applications
• In general, applications exploit one of the following views of MKL:
• Obtaining the optimal weights of the different features used for the task.
• Interpreting the sparsity of the learnt kernel weights.
• Combining multiple heterogeneous data sources.
Applications: Feature Selection
• UCI datasets
• One RBF kernel per feature, combined by a product: k(xi, xj) = Πl exp(−dl(xil − xjl)²)
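Reusing the hypothetical gmkl_product_rbf sketch from the Final Algorithm slide, a toy usage example on a synthetic stand-in for a UCI dataset (the pruning threshold is illustrative):

```python
# Features whose learnt weight d_l is driven to zero by the l1 regulariser
# are discarded; only the informative features should survive.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.sign(X[:, 0] - X[:, 1])          # only the first two features matter

d, svc = gmkl_product_rbf(X, y, C=10.0, sigma=1.0)
selected = np.flatnonzero(d > 1e-6)     # indices of surviving features
print(selected)
```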
UCI Dataset Results (classification accuracy %; N = number of training points, M = number of features)

Dataset     | N   | M   | MKL        | GMKL
------------|-----|-----|------------|-----------
Ionosphere  | 246 | 34  | 89.9 ± 2.5 | 93.6 ± 2.0
Parkinson's | 136 | 22  | 87.3 ± 3.9 | 91.0 ± 3.5
Musk        | 333 | 166 | 90.2 ± 3.2 | 93.8 ± 1.9
Sonar       | 145 | 60  | 82.9 ± 3.4 | 84.6 ± 4.1
Wpbc        | 135 | 34  | 72.1 ± 5.4 | 77.0 ± 6.4
Application: Learning Discriminative Pixels/Parts
• Problem: can image categorization be done efficiently?
• Idea: the information present in images is often redundant.
• Solution: yes, by focusing on only a subset of pixels or regions in an image.
Solution
• A kernel is associated with each part; the learnt kernel weights then select the discriminative parts.
Pixel Selection for Gender Identification
• Database of FERET faces [Moghaddam and Yang, PAMI 2002].
• Figure: example male and female faces.
Gender Identification – Features
• Figure: each face image is represented by 252 pixel features (Pixel 1 … Pixel 252).
Gender Identification - Results
• N = 1053, M = 252
• MKL = 92.6 ± 0.9
• GMKL = 94.3 ± 0.1
Caltech 101
• Task: object recognition
• Number of classes: 102
• Problem: the images are not perfectly aligned, but they are roughly aligned.
• Collected by Fei-Fei et al. [PAMI 2006]
Approach
• Feature extraction: GIST
• Figure: the extracted GIST features yield 64 kernels (Kernel 1 … Kernel 64).
Problem
• Objective: recognition of English characters taken from natural scenes.
A sample approach for sentence recognition from images (bottom up):
• Locate characters in images (character detection)
• Recognise characters (character recognition)
• Recognise words
• Recognise sentences
• Figure: pipeline from Image → Characters → Words → Sentence.
Challenges
• Perspective distortion
• Occlusion
• Variations in contrast, colour, style and size
• Motion blur
• Inter-class distance is small while intra-class distance is large.
• Large number of classes
• Existing OCR techniques do not work here.
Character Recognition using Bag of Features (discrete distribution)
• Pipeline: patch detection → feature extraction → class-based vector quantisation → histogram computation → classification
• Figure: an image is mapped to a feature vector x = [+0.1, −1.5, …, −0.5] and classified.
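A minimal sketch of this pipeline (random descriptors stand in for real patch features, a single global k-means vocabulary replaces class-based quantisation, and the vocabulary size of 32 is illustrative):

```python
# Bag of features: vector-quantise local descriptors with k-means and
# represent each image as a normalised histogram (a discrete distribution)
# of visual words, which is then fed to the classifier.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
images = [rng.normal(size=(100, 128)) for _ in range(10)]  # stand-in descriptors

vocab = KMeans(n_clusters=32, n_init=10, random_state=0).fit(np.vstack(images))

def bof_histogram(descriptors, vocab):
    words = vocab.predict(descriptors)               # nearest visual word per patch
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()                         # normalise to a distribution

hists = np.stack([bof_histogram(im, vocab) for im in images])
```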
Feature Extraction Methods • Geometric Blur [Berg] • Shape Contexts [Belongie et al] • SIFT [Lowe] • Patches [Varma & Zisserman 07] • SPIN [Lazebnik et al., Johnson] • MR8 (maximum response of 8 filters) [Varma & Zisserman 05]
Results - SVM and MKL
• Figure: MKL results, with example confusions between class i and class j.
Conclusions
• We presented a formulation which accepts non-linear kernel combinations.
• GMKL results can be significantly better than standard MKL.
• We showed several applications where the proposed formulation performs better than state-of-the-art methods.