A Kernel Approach for Learning From Almost Orthogonal Patterns* CIS 525 Class Presentation Professor: Slobodan Vucetic Presenter: Yilian Qin * B. Schölkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.
Presentation Outline • Introduction • Motivation • A Brief review of SVM for linearly separable patterns • Kernel approach for SVM • Empirical kernel map • Problem: almost orthogonal patterns in feature space • An example • Situations leading to almost orthogonal patterns • Method to reduce large diagonals of Gram matrix • Gram matrix transformation • An approximate approach based on statistics • Experiments • Artificial data (String classification, Microarray data with noise, Hidden variable problem) • Real data (Thrombin binding, Lymphoma classification, Protein family classification) • Conclusions • Comments
Motivation • Support vector machine (SVM) • A powerful method for classification (or regression) with accuracy comparable to neural networks • Exploits kernel functions to separate patterns in a high-dimensional feature space • All the information the SVM needs about the training data is stored in the Gram matrix (kernel matrix) • The problem: • SVMs do not perform well when the Gram matrix has large diagonal values
A Brief Review of SVM • [Figure: two linearly separable classes (+ and −); the separating hyperplane and its margin depend only on the closest points] • For linearly separable patterns, to maximize the margin: Minimize: ||w||^2 / 2 Constraints: yi (w·xi + b) ≥ 1 for every training pair (xi, yi)
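As an illustration of the hard-margin formulation above, here is a minimal Python/scikit-learn sketch on made-up 2-D data; a very large C approximates the hard margin, and the reported margin is 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (invented for illustration).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # positive class
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])  # negative class
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)    # very large C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin =", 2.0 / np.linalg.norm(w))   # geometric margin 2 / ||w||
print("closest points (support vectors):")
print(clf.support_vectors_)
```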
Kernel Approach for SVM (1/3) • For linearly non-separable patterns • Nonlinear mapping function Φ(x) ∈ H: maps the patterns into a new feature space H of higher dimension • For example: the XOR problem becomes separable after such a mapping • SVM in the new feature space: Minimize: ||w||^2 / 2 Constraints: yi (w·Φ(xi) + b) ≥ 1 • The kernel trick: solving the above minimization problem requires 1) the explicit form of Φ and 2) inner products in the high-dimensional space H • Both are simplified away by a wise selection of kernel function with the property k(xi, xj) = Φ(xi) · Φ(xj)
Kernel Approach for SVM (2/3) • Transform the problem with the kernel method • Expand w in the new feature space: w = Σi ai Φ(xi) = [Φ(x)] a, where [Φ(x)] = [Φ(x1), Φ(x2), …, Φ(xm)] and a = [a1, a2, …, am]^T • Gram matrix: K = [Kij], where Kij = Φ(xi) · Φ(xj) = k(xi, xj) (symmetric!) • The (squared) objective function: ||w||^2 = a^T [Φ(x)]^T [Φ(x)] a = a^T K a (sufficient condition for existence of the optimal solution: K is positive definite) • The constraints: yi (w^T Φ(xi) + b) = yi (a^T [Φ(x)]^T Φ(xi) + b) = yi (a^T Ki + b) ≥ 1, where Ki is the ith column of K • The problem becomes: Minimize: a^T K a Constraints: yi (a^T Ki + b) ≥ 1
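The reformulated problem touches the training data only through K. A hedged sketch of this using scikit-learn's precomputed-kernel interface (random toy data, linear kernel; names are my own):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # toy data, invented for illustration
y = np.where(X[:, 0] > 0, 1, -1)

K = X @ X.T                           # Gram matrix K_ij = k(x_i, x_j) with a linear kernel
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K, y)                         # training sees the data only through K

X_new = rng.normal(size=(3, 5))       # prediction needs k(x_new, x_i) against the training points
K_new = X_new @ X.T
print(clf.predict(K_new))
```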
Kernel Approach for SVM (3/3) • To predict new data x with a trained SVM: f(x) = sign( Σi ai k(xi, x) + b ), where a and b are the optimal solution obtained from the training data and m is the number of training points • The explicit form of k(xi, xj) is therefore required for prediction on new data
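A small sketch of evaluating this decision function by hand from a fitted kernel SVM; scikit-learn stores the signed coefficients ai for the support vectors in dual_coef_. The RBF kernel and data are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = np.where(X[:, 0] > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

def k(a, b):                          # the explicit kernel is needed for new points
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_new = rng.normal(size=4)
# f(x) = sum_i a_i k(x_i, x) + b, summed over the support vectors only
f = sum(a_i * k(sv, x_new)
        for a_i, sv in zip(clf.dual_coef_[0], clf.support_vectors_)) + clf.intercept_[0]
print(np.sign(f), clf.predict(x_new.reshape(1, -1))[0])   # the two predictions agree
```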
Empirical Kernel Mapping • Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the m-dimensional space R^m • Empirical kernel map: Φm(xi) = [k(xi, x1), k(xi, x2), …, k(xi, xm)]^T = Ki • The SVM in R^m: Minimize: ||w||^2 / 2 Constraints: yi (w^T Φm(xi) + b) ≥ 1 • The new Gram matrix Km associated with Φm(x): Km = [Km,ij], where Km,ij = Φm(xi) · Φm(xj) = Ki^T Kj, i.e. Km = K^T K = K K^T • Advantage of the empirical kernel map: Km is positive definite • Km = K K^T = (U^T D U)(U^T D U)^T = U^T D^2 U (K is symmetric, U is a unitary matrix, D is diagonal) • This satisfies the sufficient condition of the above minimization problem
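A minimal NumPy sketch of the empirical kernel map as defined above: row i of K is Φm(xi), and the induced Gram matrix Km = K K^T is positive semidefinite by construction (toy data, linear base kernel; variable names are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))          # toy data, invented for illustration

def k(a, b):                          # base kernel (linear here)
    return a @ b

K = np.array([[k(xi, xj) for xj in X] for xi in X])   # K_ij = k(x_i, x_j)

Phi_m = K                             # Phi_m(x_i) = (k(x_i, x_1), ..., k(x_i, x_m)) = row i of K
K_m = Phi_m @ Phi_m.T                 # = K K^T (= K^T K since K is symmetric)

print("smallest eigenvalue of K_m:", np.linalg.eigvalsh(K_m).min())   # >= 0 up to round-off
```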
The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance
An Example of Almost Orthogonal Patterns • The Gram matrix with the linear kernel k(xi, xj) = xi · xj • A training dataset X of almost orthogonal patterns gives a Gram matrix with large diagonals • w is the solution found by the standard SVM • Observation: each large entry of w corresponds to a column of X with only one large entry, so w becomes a lookup table and the SVM will not generalize well • A better solution spreads the weight over features shared by the patterns instead of memorizing each one (see the toy sketch below)
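Below is an invented toy version of this effect (not the paper's matrix): sparse, almost orthogonal binary patterns give a Gram matrix whose diagonal dominates, and the hard-margin solution puts its weight on each pattern's "private" features while ignoring the column shared by all patterns.

```python
import numpy as np
from sklearn.svm import SVC

# Each pattern has two private features (columns 0-7) plus one feature shared by all (column 8).
X = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 1, 1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

print(X @ X.T)                        # diagonal entries (3) dominate the off-diagonals (1)

w = SVC(kernel="linear", C=1e6).fit(X, y).coef_[0]
print(np.round(w, 2))                 # weight sits on the private columns, ~0 on the shared one
```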
Situations Leading to Almost Orthogonal Patterns • Sparsity of the patterns in the new feature space, e.g. • x = [0, 0, 0, 1, 0, 0, 1, 0]^T • y = [0, 1, 1, 0, 0, 0, 0, 0]^T • x·x ≈ y·y >> x·y (large diagonals in the Gram matrix) • Some choices of kernel function can produce sparsity in the new feature space • String kernel (Watkins, 2000) • Polynomial kernel k(xi, xj) = (xi · xj)^d with large order d: if xi · xi > xi · xj for i ≠ j, then k(xi, xi) >> k(xi, xj) even for moderately large d, because the ratio grows exponentially with d (see the numeric check below)
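A quick numeric check of the last point (made-up vectors): if xi·xi is only modestly larger than xi·xj, raising the kernel to the power d makes the ratio grow exponentially in d.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 0.0])
y = np.array([1.0, 1.0, 0.0, 1.0])    # x.x = 3, x.y = 2

for d in (1, 2, 5, 10):
    k_xx = (x @ x) ** d               # diagonal entry of the polynomial-kernel Gram matrix
    k_xy = (x @ y) ** d               # off-diagonal entry
    print(f"d={d:2d}  k(x,x)={k_xx:10.1f}  k(x,y)={k_xy:10.1f}  ratio={(k_xx / k_xy):8.2f}")
```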
Gram Matrix Transformation (1/2) • For a symmetric, positive definite Gram matrix K (or Km): K = U^T D U, where U is a unitary matrix and D is a diagonal matrix • Define f(K) = U^T f(D) U, with f(D)ii = f(Dii), i.e. the function f operates on the eigenvalues λi of K • f(K) should preserve the positive definiteness of the Gram matrix • A sample procedure for the Gram matrix transformation (sketched below): • (Optional) Compute the positive definite matrix A = sqrt(K) • Suppress the large diagonal of A to obtain a symmetric A', i.e. transform the eigenvalues of A: [λmin, λmax] → [f(λmin), f(λmax)] • Compute the positive definite matrix K' = (A')^2
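A minimal NumPy sketch of an eigenvalue transformation in this spirit (my own illustration, not necessarily the paper's choice of f): subtracting a constant close to λmin from every eigenvalue leaves the off-diagonal entries of K unchanged, keeps the matrix positive definite, and shrinks the diagonal.

```python
import numpy as np

def transform_gram(K, f):
    """Return f(K) = U^T f(D) U for a symmetric positive definite K."""
    eigvals, U = np.linalg.eigh(K)    # K = U diag(eigvals) U^T
    return (U * f(eigvals)) @ U.T

rng = np.random.default_rng(3)
X = (rng.random((6, 200)) < 0.05).astype(float)   # sparse patterns -> large diagonals
K = X @ X.T

# One simple eigenvalue map: shift the spectrum down by 0.9 * lambda_min.
# Then f(K) = K - 0.9*lambda_min*I, so only the diagonal changes and positive definiteness is kept.
K_prime = transform_gram(K, lambda lam: lam - 0.9 * lam.min())

print(np.round(np.diag(K), 1))        # original (large) diagonal
print(np.round(np.diag(K_prime), 1))  # suppressed diagonal; off-diagonal entries are identical
```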
Gram Matrix Transformation (2/2) • Effect of the matrix transformation: the original kernel k(xi, xj) = Φ(xi) · Φ(xj) gives the Gram matrix K; the transformation K' = f(K) implicitly defines a new map Φ'(x) and a new kernel k'(xi, xj) = Φ'(xi) · Φ'(xj) • The explicit form of the new kernel k' is not available, yet k' is required when the trained SVM is used to predict the testing data • A solution: include all test data in K before the matrix transformation K → K'; then a' and b' are obtained from the portion of K' corresponding to the training data, and if xi was used in computing K', the prediction for xi can simply use the column K'i, i = 1, 2, …, m+n, where m is the number of training points and n is the number of testing points • Consequence: the testing data has to be known at training time (see the end-to-end sketch below)
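A hedged end-to-end sketch of this transductive recipe (invented data, labels, and kernel; only the workflow follows the slide): build K over training and test points together, transform it, train on the training block of K', and predict from the test rows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = (rng.random((30, 300)) < 0.03).astype(float)   # sparse, almost orthogonal patterns
y = np.where(X[:, :150].sum(axis=1) > X[:, 150:].sum(axis=1), 1, -1)
m = 20                                             # first m points play the role of training data

K = X @ X.T                                        # Gram matrix over all m+n points (test data known in advance)
eigvals, U = np.linalg.eigh(K)
K_prime = (U * (eigvals - 0.9 * eigvals.min())) @ U.T   # K' = f(K): suppress the diagonal

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_prime[:m, :m], y[:m])                    # a', b' come from the training block of K'
pred = clf.predict(K_prime[m:, :m])                # test rows of K' against the training columns
print("test accuracy:", (pred == y[m:]).mean())
```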
An Approximate Approach Based on Statistics • Strictly, the empirical kernel map Φm+n(x) over both training and testing points should be used to calculate the Gram matrix • Assuming the dataset size r is large, the empirical map computed on the training points alone yields approximately the same inner products (up to a scaling factor) as the map over all points • Therefore, the SVM can simply be trained with the empirical kernel map on the training set, Φm(x), instead of Φm+n(x)
Artificial Data (1/3) • String classification • String kernel function (Watkins, 2000) • Sub-polynomial kernel: k(x, y) = [Φ(x) · Φ(y)]^p, 0 < p < 1; for sufficiently small p, the large diagonals of K are suppressed (illustrated below) • 50 strings (25 for training, 25 for testing), 20 trials
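A small illustration of the sub-polynomial idea on invented sparse data (a linear base kernel stands in for the string kernel): raising kernel values to a power 0 < p < 1 shrinks the gap between diagonal and off-diagonal entries; because the transformed matrix need not stay positive definite, the slides pair this kernel with the empirical kernel map.

```python
import numpy as np

def subpoly(K, p=0.3):
    return np.sign(K) * np.abs(K) ** p    # elementwise |k|^p, keeping the sign

rng = np.random.default_rng(5)
X = (rng.random((8, 400)) < 0.05).astype(float)   # sparse patterns (invented)
K = X @ X.T
K_sub = subpoly(K, p=0.3)

off = ~np.eye(8, dtype=bool)
print("diag/off-diag ratio before:", K.diagonal().mean() / K[off].mean())
print("diag/off-diag ratio after: ", K_sub.diagonal().mean() / K_sub[off].mean())
```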
Artificial Data (2/3) • Microarray data with noise (Alon et al., 1999) • 62 instances (22 positive, 40 negative), 2000 features in the original data • 10,000 noise features were added (each non-zero with probability 1%) • The error rate for the SVM without added noise is 0.18 ± 0.15
Artificial Data (3/3) • Hidden variable problem • 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables • The original kernel is a polynomial kernel of order 4
Real Data (1/3) • Thrombin binding problem • 1909 instances, 139,351 binary features • 0.68% of the entries are non-zero • 8-fold cross-validation
Real Data (2/3) • Lymphoma classification (Alizadeh et al., 2000) • 96 samples, 4026 features • 10-fold cross-validation • Improved results compared with previous work (Weston, 2001)
Real Data (3/3) • Protein family classification (Murzin et al., 1995) • Small positive set, large negative set • [Figure: receiver operating characteristic (ROC) score versus the rate of false positives; 1 is the best score, 0 the worst]
Conclusions • The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed • The common situation in which sparse vectors lead to large diagonals was identified and discussed • A Gram matrix transformation that suppresses the large diagonals was proposed to improve performance in such cases • Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed
Comments • Strong points: • The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation for suppressing them • The experiments are extensive • Weak points: • The application of the Gram matrix transformation may be severely restricted in forecasting and other applications in which the testing data is not known at training time • The proposed Gram matrix transformation was not tested directly in the experiments; instead, transformed kernel functions were used • Almost orthogonal patterns imply that multiple pattern vectors pointing in the same direction rarely exist, so the condition needed by the statistics-based approximate approach to the pattern distribution may not be satisfied