Overview of Kernel Methods (Part 2) Steve Vincent November 17, 2004
Overview • Kernel methods offer a modular framework. • In a first step, the dataset is processed into a kernel matrix. Data can be of various, even heterogeneous, types. • In a second step, a variety of kernel algorithms can be used to analyze the data, using only the information contained in the kernel matrix.
What will be covered today • PCA vs. Kernel PCA • Algorithm • Example • Comparison • Text Related Kernel Methods • Bag of Words • Semantic Kernels • String Kernels • Tree Kernels
PCA algorithm • Subtract the mean from all the data points • Compute the covariance matrix S = (1/N) Σi xi xi' of the centered data • Diagonalize S to get its eigenvalues and eigenvectors • Retain the c eigenvectors corresponding to the c largest eigenvalues such that Σi=1..c λi / Σi=1..d λi equals the desired fraction of variance to be captured • Project the data points onto the eigenvectors
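As a concrete companion to the steps above, here is a minimal NumPy sketch of PCA (illustrative only, not the MATLAB code used for the experiments later in the deck; the 95% variance threshold and all names are assumptions):

```python
import numpy as np

def pca(X, variance_to_capture=0.95):
    """Minimal PCA sketch; X holds one data point per row (N x d)."""
    X_centered = X - X.mean(axis=0)              # subtract the mean
    S = np.cov(X_centered, rowvar=False)         # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)         # diagonalize S
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()   # cumulative fraction of variance
    c = int(np.searchsorted(ratio, variance_to_capture)) + 1
    return X_centered @ eigvecs[:, :c]           # project onto the c eigenvectors
```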
Kernel PCA algorithm (1) • Given N data points in d dimensions, let X = {x1 | x2 | … | xN}, where each column represents one data point • Subtract the mean from all the data points • Choose an appropriate kernel k • Form the NxN Gram matrix Kij = k(xi, xj) • Form the modified (centered) Gram matrix K~ = K − (1/N) 1NxN K − (1/N) K 1NxN + (1/N^2) 1NxN K 1NxN, where 1NxN is an NxN matrix with all entries equal to 1
Kernel PCA algorithm (2) • Diagonalize K~ to get its eigenvalues λn and its eigenvectors an • Normalize each an so that λn (an · an) = 1 • Retain the c eigenvectors corresponding to the c largest eigenvalues such that Σn=1..c λn / Σn λn equals the desired fraction of variance to be captured • Project the data points onto the eigenvectors
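For comparison, a minimal NumPy sketch of the kernel PCA steps, assuming the RBF kernel mentioned on the Relative Performance slide; gamma, c and all names are illustrative:

```python
import numpy as np

def kernel_pca(X, gamma=1.0, c=2):
    """Minimal kernel PCA sketch with an RBF kernel; X is N x d, one point per row."""
    # Gram matrix K_ij = k(x_i, x_j) with k(x, y) = exp(-gamma * ||x - y||^2)
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq_dists)

    # Centre in feature space: K~ = K - 1N K - K 1N + 1N K 1N, with 1N = ones(N,N)/N
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Diagonalize, keep the c leading eigenvectors, normalize so that lambda * (a.a) = 1
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    eigvals, eigvecs = eigvals[::-1][:c], eigvecs[:, ::-1][:, :c]
    alphas = eigvecs / np.sqrt(eigvals)

    # Projections of the training points onto the c kernel principal components
    return K_centered @ alphas
```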
Data Mining Problem • Data Source • Computational Intelligence and Learning Cluster Challenge 2000, http://www.wi.leidenuniv.nl/~putten/library/cc2000/index.html • Supplied by Sentient Machine Research, http://www.smr.nl • Problem Definition: • Given data that combines socio-economic attributes with insurance-policy ownership attributes, can we derive models that identify the factors or attributes that influence, or are indicative of, individuals who purchase a caravan insurance policy?
Data Selection • 5,822 records for training • 4,000 records for evaluation • 86 attributes • Attributes 1 through 43: • Socio-demographic data derived from zip code areas • Attributes 44 through 85: • Product ownership for customers • Attribute 86 • Purchased caravan insurance
Data Transformation and Reduction • Principal Components Analysis (PCA) [from MATLAB]
Relative Performance • PCA run time: 6.138 • Kernel PCA run time: 5.668 • Used a Radial Basis Function (RBF) kernel • MATLAB code for the PCA and Kernel PCA algorithms can be supplied if needed
Modeling, Test and Evaluation (Naïve Bayes; legend: a = no, b = yes; rows = actual class, columns = predicted class)

Manually Reduced Dataset – 82.79% correctly classified overall
            predicted a   predicted b
actual a        3155          544     (14.71% false positives)
actual b         132           98     (42.61% of class b correctly classified)

PCA Reduced Dataset – 88.45% correctly classified overall
            predicted a   predicted b
actual a        3507          255     (6.77% false positives)
actual b         207           31     (13.03% of class b correctly classified)

Kernel PCA Reduced Dataset – 82.22% correctly classified overall
            predicted a   predicted b
actual a        3238          541     (14.3% false positives)
actual b         175           74     (29.7% of class b correctly classified)
Overall Results • KPCA and PCA had similar run-time performance • KPCA gives results much closer to those of the manually reduced dataset • Future Work: • Examine other kernels • Vary the parameters of the kernels • Use other data mining algorithms
‘Bag of words’ kernels (1) • A document is seen as a vector d, indexed by all the elements of a (controlled) dictionary; each entry is the number of occurrences of the corresponding term. • A training corpus is therefore represented by a term–document matrix, noted D = [d1 d2 … dm-1 dm] • From this basic representation, we will apply a sequence of successive embeddings, resulting in a global (valid) kernel with all desired properties
BOW kernels (2) • Properties: • All order information is lost (syntactical relationships, local context, …) • Feature space has dimension N (size of the dictionary) • Similarity is basically defined by the dot product k(d1,d2) = d1 • d2 = d1' d2, or, normalized (cosine similarity), k^(d1,d2) = (d1 • d2) / (||d1|| ||d2||) • Efficiency provided by sparsity (and a sparse dot-product algorithm): O(|d1|+|d2|)
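To make the representation concrete, a small sketch of the (optionally cosine-normalized) bag-of-words kernel using a sparse dot product over term counts; tokenization by whitespace is a simplifying assumption:

```python
from collections import Counter
import math

def bow_kernel(doc1, doc2, normalize=True):
    """Bag-of-words kernel: dot product of term-count vectors, optionally cosine-normalized."""
    c1, c2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    # Sparse dot product: iterate only over terms present in both documents
    k = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    if normalize:
        n1 = math.sqrt(sum(v * v for v in c1.values()))
        n2 = math.sqrt(sum(v * v for v in c2.values()))
        k = k / (n1 * n2) if n1 and n2 else 0.0
    return k
```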
Latent concept kernels • Basic idea: terms (a dictionary of size t) and documents are mapped, via φ1 and φ2, into a shared "concept" space of much lower dimension k << t, and the kernel K(d1,d2) is evaluated in that space. [Original slide diagram not reproduced]
Semantic Kernels (1) • k(d1,d2) = φ(d1) S S' φ(d2)', where S is the semantic matrix • S can be defined as S = RP, where • R is a diagonal matrix giving the term weightings or relevances • P is the proximity matrix defining the semantic spreading between the different terms of the corpus • The inverse document frequency weight for a term t is given by w(t) = ln(ℓ / df(t)), where ℓ = # of documents and df(t) = # of documents containing term t • The matrix R is diagonal with entries Rtt = w(t)
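A small sketch of this term-weighting step, assuming the IDF weight w(t) = ln(ℓ/df(t)) above and a dense term-by-document count matrix D (dense only for illustration; names are not from the slides):

```python
import numpy as np

def idf_weighted_kernel_matrix(D):
    """Sketch of the R-weighted kernel k(d1,d2) = phi(d1) R R' phi(d2)'.

    D is a (terms x documents) count matrix; R is diagonal with
    R_tt = w(t) = ln(n_docs / df(t)).
    """
    n_terms, n_docs = D.shape
    df = np.count_nonzero(D, axis=1)          # df(t): documents containing term t
    w = np.log(n_docs / np.maximum(df, 1))    # IDF weight per term (guard df = 0)
    R = np.diag(w)                            # diagonal term-weighting matrix
    Dw = R @ D                                # re-weighted term-document matrix
    return Dw.T @ Dw                          # documents x documents kernel matrix
```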
Semantic Kernels (2) • With R alone, the associated kernel is k~(d1,d2) = Σt w(t)^2 φ(d1)t φ(d2)t • For the proximity matrix P, the associated kernel is k~(d1,d2) = φ(d1) Q φ(d2)' with Q = PP', where Qij encodes the amount of semantic relation between terms i and j.
Semantic Kernels (3) • The most natural method of incorporating semantic information is by inferring the relatedness of terms from an external source of domain knowledge • Example: the WordNet ontology • Semantic distance can be measured by • Path length in the hierarchical tree • Information content
Latent Semantic Kernels (LSK)/ Latent Semantic Indexing (LSI) • Singular Value Decomposition (SVD): D = U S V', where S is a diagonal matrix of the same dimensions as D, and U and V are unitary matrices whose columns are the eigenvectors of DD' and D'D respectively • LSI projects the documents into the space spanned by the first k columns of U, using the new k-dimensional vectors Uk' d for subsequent processing, where Uk is the matrix containing the first k columns of U
Latent Semantic Kernels (LSK)/ Latent Semantic Indexing (LSI) • The new kernel becomes that of Kernel PCA • LSK is implemented by projecting onto the features φi(d) = (1/√λi) Σj (vi)j k(dj, d), where k is the base kernel and λi, vi are eigenvalue–eigenvector pairs of the kernel matrix • The LSKs can equivalently be represented with a proximity matrix
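A short sketch of the LSI projection via a truncated SVD of the term–document matrix (NumPy's generic SVD is used purely for illustration; function names are not from the slides):

```python
import numpy as np

def lsi_projection(D, k):
    """LSI sketch: represent each document by U_k' d, where U_k holds the
    first k left singular vectors of the (terms x documents) matrix D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    U_k = U[:, :k]                 # first k columns of U ("concept" directions)
    return U_k.T @ D               # k x documents matrix of new representations

def latent_semantic_kernel_matrix(D, k):
    """Corresponding document-by-document kernel matrix in the latent space."""
    Z = lsi_projection(D, k)
    return Z.T @ Z
```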
String and Sequence • An alphabet is a finite set Σ of |Σ| symbols. • A string s = s1…s|s| is any finite sequence of symbols from Σ, including the empty sequence. • We denote by Σ^n the set of all strings of length n. • String matching implies contiguity • Sequence matching only implies order
p-spectrum kernel (1) • Features of s = p-spectrum of s = histogram of all (contiguous) substrings of length p • Feature space indexed by all elements of Σ^p • φu(s) = number of occurrences of u in s • The associated kernel is defined as kp(s,t) = Σu∈Σ^p φu(s) φu(t)
p-spectrum kernel example • Example: 3-spectrum kernel • s = “statistics” and t = “computation” • The two strings contain the following substrings of length 3: • “sta”, “tat”, “ati”, “tis”, “ist”, “sti”, “tic”, “ics” • “com”, “omp”, “mpu”, “put”, “uta”, “tat”, “ati”, “tio”, “ion” • The common substrings are “tat” and “ati”, so the inner product is k(s,t) = 2
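The example can be checked directly with a brute-force p-spectrum kernel (fine for short strings; the recursion on the following slide is the efficient route):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """p-spectrum kernel: count matching length-p contiguous substrings."""
    phi_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    phi_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(phi_s[u] * phi_t[u] for u in phi_s.keys() & phi_t.keys())

print(p_spectrum_kernel("statistics", "computation", 3))   # 2 ("tat" and "ati")
```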
p-spectrum Kernel Recursion • The p-suffix kernel is defined by kpS(s,t) = 1 if s = s'u and t = t'u for some u ∈ Σ^p, and 0 otherwise • The p-spectrum kernel can then be evaluated as kp(s,t) = Σi=1..|s|−p+1 Σj=1..|t|−p+1 kpS(s(i : i+p), t(j : j+p)) in O(p |s| |t|) operations • The evaluation of one row of the table for the p-suffix kernel corresponds to performing a search in the string t for the p-suffix of a prefix of s.
All-subsequences kernels • Feature mapping defined by all contiguous or non-contiguous subsequences of a string • Feature space indexed by all elements of Σ* = {ε} ∪ Σ ∪ Σ^2 ∪ Σ^3 ∪ … • φu(s) = number of occurrences of u as a (non-contiguous) subsequence of s • Explicit computation rapidly becomes infeasible (exponential in |s|, even with a sparse representation)
Recursive implementation • Consider the addition of one extra symbol a to s: common subsequences of (sa, t) either lie entirely in s, or end with the symbol a (in both sa and t). • Mathematically, k(sa, t) = k(s, t) + Σj : tj = a k(s, t(1 : j−1)), with k(ε, t) = 1 • This gives a complexity of O(|s||t|^2)
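A direct implementation of this recursion, memoized on prefix lengths (which is essentially the dynamic-programming table); the example strings are arbitrary:

```python
from functools import lru_cache

def all_subsequences_kernel(s, t):
    """Common (possibly non-contiguous) subsequences of s and t, including the
    empty one, via the recursion k(sa, t) = k(s, t) + sum over {j : t_j = a}
    of k(s, t(1 : j-1)), with k(empty, t) = 1."""
    @lru_cache(maxsize=None)
    def k(i, j):                       # kernel of the prefixes s[:i] and t[:j]
        if i == 0:
            return 1                   # only the empty subsequence remains
        a = s[i - 1]                   # the extra symbol appended to s[:i-1]
        total = k(i - 1, j)            # common subsequences not using this a
        for jj in range(j):            # every occurrence of a inside t[:j]
            if t[jj] == a:
                total += k(i - 1, jj)  # the rest must match within the prefixes
        return total
    return k(len(s), len(t))

print(all_subsequences_kernel("gat", "cata"))  # 5: empty, "a" (x2), "t", "at"
```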
Fixed-length subsequence kernels • Feature space indexed by all elements of Σ^p • φu(s) = number of occurrences of the p-gram u as a (non-contiguous) subsequence of s • Recursive implementation (creates a series of p tables) • Complexity: O(p|s||t|), but we get the k-length subsequence kernels (k ≤ p) for free, so it is easy to compute weighted combinations k(s,t) = Σl al kl(s,t)
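For short strings, the fixed-length kernel can also be checked by brute-force enumeration of length-p subsequences (exponentially many in general, so this is only a sanity check against the O(p|s||t|) recursion):

```python
from collections import Counter
from itertools import combinations

def fixed_length_subseq_kernel(s, t, p):
    """Brute-force fixed-length subsequence kernel for short strings: counts
    matching pairs of (possibly non-contiguous) length-p subsequences."""
    def phi(x):
        return Counter("".join(x[i] for i in idx)
                       for idx in combinations(range(len(x)), p))
    phi_s, phi_t = phi(s), phi(t)
    return sum(phi_s[u] * phi_t[u] for u in phi_s.keys() & phi_t.keys())

print(fixed_length_subseq_kernel("statistics", "computation", 3))
```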
Gap-weighted subsequence kernels (1) • Feature space indexed by all elements of Σ^p • φu(s) = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, the weight being length-penalizing: λ^length(u) [NB: length includes both matching symbols and gaps] • Example (1) • The string “gon” occurs as a subsequence of the strings “gone”, “going” and “galleon”, but we consider the first occurrence as more important since it is contiguous, while the final occurrence is the weakest of all three
Gap-weighted subsequence kernels (2) • Example (2) • D1: ATCGTAGACTGTC • D2: GACTATGC • φCAT(D1) = 2λ^8 + 2λ^10 and φCAT(D2) = λ^4 • The CAT coordinate thus contributes 2λ^12 + 2λ^14 to k(D1,D2) • Naturally built as a dot product, hence a valid kernel • For an alphabet of size 80, there are 512,000 trigrams • For an alphabet of size 26, there are about 12 x 10^6 5-grams
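The feature values in this example can be verified by brute force (enumerating occurrences explicitly; the efficient dynamic program on the next slide avoids this enumeration):

```python
from itertools import combinations

def gap_weighted_feature(s, u, lam):
    """phi_u(s) for the gap-weighted subsequence kernel: sum over all occurrences
    of u as a (possibly non-contiguous) subsequence of s of lam ** (span length)."""
    value = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            value += lam ** (idx[-1] - idx[0] + 1)   # span = matches + gaps
    return value

lam = 0.5
d1, d2 = "ATCGTAGACTGTC", "GACTATGC"
print(gap_weighted_feature(d1, "CAT", lam))   # 2*lam**8 + 2*lam**10
print(gap_weighted_feature(d2, "CAT", lam))   # lam**4
# Their product gives the CAT coordinate's contribution: 2*lam**12 + 2*lam**14
```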
Gap-weighted subsequence kernels (3) • Hard to perform the explicit expansion and dot product • There is an efficient recursive formulation (dynamic-programming type), whose complexity is O(p |D1| |D2|) for subsequences of length p
Word Sequence Kernels (1) • Here “words” are considered as symbols • Meaningful symbols → more relevant matching • Linguistic preprocessing can be applied to improve performance • Shorter sequence sizes → improved computation time • But increased sparsity (documents are more “orthogonal”) • Motivation: the noisy stemming hypothesis (important N-grams approximate stems), confirmed experimentally in a categorization task
Word Sequence Kernels (2) • Link between Word Sequence Kernels and other methods: • For k=1, WSK is equivalent to the basic “Bag of Words” approach • For λ=1, there is a close relation to the polynomial kernel of degree k, except that WSK takes order into account • Extensions of WSK: • Symbol-dependent decay factors (a way to introduce the IDF concept, dependence on the part of speech, stop words) • Different decay factors for gaps and matches (e.g. λnoun < λadj for gaps; λnoun > λadj for matches) • Soft matching of symbols (e.g. based on a thesaurus, or on a dictionary if we want cross-lingual kernels)
Tree Kernels • Applications: categorization [one doc = one tree], parsing disambiguation [one doc = multiple trees] • Tree kernels constitute a particular case of more general kernels defined on discrete structures (convolution kernels). Intuitively, the philosophy is • to split the structured objects into parts, • to define a kernel on the “atoms” and a way to recursively combine kernels over parts to get the kernel over the whole. • Feature space definition: one feature for each possible proper subtree in the training data; feature value = number of occurrences • A subtree is defined as any part of the tree which includes more than one node, with the restriction that no “partial” rule productions are allowed.
Trees in Text: example • A parse tree of the sentence “John loves Mary”, together with a few among the many subtrees of this tree (e.g. the VP subtree spanning “loves Mary”). [Original slide diagram not reproduced]
Tree Kernels: algorithm • Kernel = dot product in this high-dimensional feature space • Once again, there is an efficient recursive algorithm (in polynomial time, not exponential!) • Basically, it compares the productions of all possible pairs of nodes (n1, n2) with n1 ∈ T1, n2 ∈ T2; if the production is the same, the number of common subtrees rooted at both n1 and n2 is computed recursively, from the number of common subtrees rooted at the corresponding children • Formally, let kco-rooted(n1,n2) = number of common subtrees rooted at both n1 and n2; the kernel is then K(T1,T2) = Σn1∈T1 Σn2∈T2 kco-rooted(n1,n2)
All sub-tree kernel • kco-rooted(n1,n2) = 0 if n1 or n2 is a leaf • kco-rooted(n1,n2) = 0 if n1 and n2 have different productions or, if labeled, different labels • Otherwise kco-rooted(n1,n2) = Πi (1 + kco-rooted(chi(n1), chi(n2))), the product running over the children of the two nodes • “Production” is left intentionally ambiguous so as to cover both unlabelled and labeled trees • Complexity is O(|T1| · |T2|)
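A sketch of this recursion on a toy tree representation (the Node class, production helper and naive node-pair summation are illustrative, not from the slides; memoizing over node pairs would make the per-pair work constant):

```python
class Node:
    """A simple labeled, ordered tree node."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def production(n):
    # A node's "production": its label plus the ordered labels of its children
    return (n.label, tuple(c.label for c in n.children))

def k_co_rooted(n1, n2):
    """Number of common subtrees rooted at both n1 and n2 (slide recursion)."""
    if not n1.children or not n2.children:     # n1 or n2 is a leaf
        return 0
    if production(n1) != production(n2):       # different production or label
        return 0
    result = 1
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1 + k_co_rooted(c1, c2)      # combine over matching children
    return result

def all_subtree_kernel(t1, t2):
    """K(T1,T2) = sum over all node pairs of k_co_rooted (naive enumeration)."""
    def nodes(root):
        stack, out = [root], []
        while stack:
            n = stack.pop()
            out.append(n)
            stack.extend(n.children)
        return out
    return sum(k_co_rooted(a, b) for a in nodes(t1) for b in nodes(t2))
```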
References • J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004 (Chapters 10 and 11) • J. Tian, “PCA/Kernel PCA for Image Denoising”, September 16, 2004 • T. Gärtner, “A Survey of Kernels for Structured Data”, ACM SIGKDD Explorations Newsletter, July 2003 • N. Cristianini, “Latent Semantic Kernels”, Proceedings of ICML-01, 18th International Conference on Machine Learning, 2001