Sparse Solutions for Large Scale Kernel Machines Taher Dameh CMPT820-Multimedia Systems tdameh@cs.sfu.ca Dec 2nd, 2010
Outline • Introduction • Motivation: Kernel machine applications in multimedia content analysis and search • Challenges in large scale kernel machines • Previous Work • Sub-quadratic approach to compute the sparse Gram matrix • Results • Conclusion and Future Work
Introduction • Given a set of points, with a notion of distance between points, group the points into some number of clusters. • We use kernel functions to compute the similarity between each pair of points to produce a similarity (Gram) matrix (O(N²) space and computation) • Examples of kernel machines: • Support Vector Machines (SVM, formulated for 2 classes) • Relevance Vector Machines (produce much sparser models) • Gaussian Processes • Fisher's Linear Discriminant Analysis (LDA) • Kernel PCA
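A minimal sketch (not from the slides) of the brute-force Gram-matrix computation the introduction describes; the RBF kernel and the gamma parameter are assumptions chosen to illustrate why both memory and kernel evaluations grow as O(N²).

```python
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    """Dense Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).

    Memory and the number of kernel evaluations both grow as O(N^2),
    which is what makes the brute-force approach infeasible at large scale.
    """
    sq_norms = np.sum(X ** 2, axis=1)                              # ||x_i||^2
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy usage: 5 random points in 3 dimensions.
X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_gram_matrix(X, gamma=0.5)
print(K.shape)   # (5, 5)
```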
Kernel machine applications in multimedia content analysis and search • Broadcast video summarization using clustering • Document clustering • Audio content discovery • Searching one billion web images by content
Challenges and Sparse Solutions for Kernel Machines • One of the significant limitations of many kernel methods is that the kernel function k(x,y) must be evaluated for all possible pairs x and y of training points, which can be computationally infeasible. • Traditional algorithm analysis assumes that the data fits in main memory; it is unreasonable to make such an assumption when dealing with massive data sets such as multimedia data, web page repositories, and so on. • Observing that kernel machines use radial basis functions, the Gram matrices have many values that are close to zero. • We are developing algorithms to approximate the Gram matrix with a sparse one (filtering out the small similarities).
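A hedged illustration of the sparsification idea (assumed details, not the authors' code): near-zero RBF similarities can be thresholded away and the result stored sparsely. Note that this sketch still builds the dense matrix first; the LSH approach on the later slides avoids ever computing it.

```python
import numpy as np
from scipy import sparse

def sparsify_gram(K, threshold=1e-3):
    """Zero out near-zero RBF similarities and store the result sparsely.

    Because the RBF kernel decays exponentially with distance, most entries
    of K are close to zero for large, spread-out datasets; dropping them
    keeps only the 'local' similarities.
    """
    K_thresh = np.where(K >= threshold, K, 0.0)
    return sparse.csr_matrix(K_thresh)

# Example: reuse the dense K from the previous sketch.
# K_hat = sparsify_gram(K, threshold=1e-3)
# print(K_hat.nnz, "non-zeros out of", K.size)
```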
Previous Work • Approximation depending on the eigenspectrum of the Gram matrix • The eigenspectrum decays rapidly, especially when the kernel function is a radial basis function (most information is stored in the first few eigenvectors) • Sparse Bayesian learning • Methods that lead to much sparser models • Relevance Vector Machines (RVM) • Sparse kernel principal component analysis (sparse KPCA) • Efficient implementation of computing the kernel function • Space filling curves • Locality Sensitive Hashing (OUR method)
Locality Sensitive Hashing • Hash the data points so that the probability of collision is higher for close points. • A family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive if for any v, q ∈ S: • dist(v,q) < r1 → ProbH[h(v) = h(q)] ≥ p1 • dist(v,q) > r2 → ProbH[h(v) = h(q)] ≤ p2 • with p1 > p2 and r1 < r2 (r2 = c·r1, c > 1) • We need the gap between p1 and p2 to be quite large. • For a proper choice of k (shown later), we concatenate k hash functions: g(v) = {h1(v), …, hk(v)} • We compute the kernel function only between points that reside in the same bucket. • Using this approach, for a hash table of size m (assuming the buckets have the same number of points), computing the Gram matrix has complexity N²/m.
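A possible sketch of the bucketing step, assuming the standard p-stable (random-projection) LSH family for Euclidean distance; the slides do not specify which LSH family is used, so the hash construction and the parameters k, w, and gamma here are illustrative.

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, k=8, w=4.0, seed=0):
    """Group points into buckets using k concatenated p-stable LSH functions.

    Each hash is h(v) = floor((a . v + b) / w) with a ~ N(0, I) and
    b ~ U[0, w), a standard LSH family for Euclidean distance; nearby
    points are more likely to agree on all k hashes and share a bucket.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(size=(k, d))
    b = rng.uniform(0.0, w, size=k)
    codes = np.floor((X @ A.T + b) / w).astype(int)   # shape (n, k)
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[tuple(code)].append(i)
    return buckets

def bucketed_gram(X, buckets, gamma=1.0):
    """Evaluate the RBF kernel only inside each bucket (roughly N^2/m work)."""
    grams = {}
    for key, idx in buckets.items():
        Xb = X[idx]
        sq = np.sum(Xb ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Xb @ Xb.T
        grams[key] = np.exp(-gamma * np.maximum(d2, 0.0))
    return grams

X = np.random.default_rng(1).normal(size=(1000, 16))
buckets = lsh_buckets(X, k=8, w=4.0)
grams = bucketed_gram(X, buckets, gamma=0.1)
```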
Sub-quadratic approach using LSH • Claim 1: The number of concatenated hash values k is logarithmic in the size of the dataset n and independent of the dimension d. • Proof: Given a set P of n points in d-dimensional space and (r1, r2, p1, p2)-sensitive hash functions, and given a point q, the probability that a point at distance greater than r2 from q agrees with q on all k hash values is at most p2^k. Setting p2^k = B/n, where B is the average bucket size, we find that k = log(n/B) / log(1/p2), which is O(log n) and does not depend on d.
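A small numeric illustration of Claim 1 (the values n, B, and p2 below are made up): solving p2^k = B/n for k.

```python
import math

def choose_k(n, avg_bucket_size, p2):
    """Smallest k with p2**k <= B/n, i.e. k >= log(n/B) / log(1/p2) = O(log n)."""
    return math.ceil(math.log(n / avg_bucket_size) / math.log(1.0 / p2))

print(choose_k(n=1_000_000, avg_bucket_size=100, p2=0.5))   # -> 14
```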
• Claim 2: The complexity of computing the approximated Gram matrix using locality sensitive hashing is sub-quadratic. • Proof: With L buckets of average size B = N/L, computing the Gram matrix of every bucket costs L · (N/L)² = N²/L kernel evaluations, and hashing the N points costs O(N·k·d) = O(N·d·log N) by Claim 1; the total is therefore sub-quadratic in N whenever the number of buckets grows with N.
Pipeline: N×d input vectors → split into m segments, each of size (N/m)×d → hashing → L bucket files → compute the Gram matrix of each bucket (Gram matrix size (N/L)²) and run the clustering algorithm on each bucket's Gram matrix → clusters with weights → combine clusters with weights → run second phase of clustering → final clusters
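A rough end-to-end sketch of the pipeline above, reusing the lsh_buckets helper from the earlier sketch; the slides do not name the clustering algorithms, so spectral clustering per bucket and weighted k-means for the second phase are stand-ins, and the parameters k_local, k_final, and gamma are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering, KMeans

def two_phase_clustering(X, buckets, gamma=0.1, k_local=3, k_final=5):
    """Phase 1: cluster each LSH bucket on its own small Gram matrix.
       Phase 2: re-cluster the weighted local centroids into final clusters."""
    centroids, weights = [], []
    for idx in buckets.values():
        Xb = X[idx]
        if len(idx) <= k_local:             # tiny bucket: treat it as one cluster
            centroids.append(Xb.mean(axis=0))
            weights.append(len(idx))
            continue
        sq = np.sum(Xb ** 2, axis=1)
        Kb = np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2 * Xb @ Xb.T, 0))
        labels = SpectralClustering(n_clusters=k_local,
                                    affinity='precomputed').fit_predict(Kb)
        for c in range(k_local):
            members = Xb[labels == c]
            if len(members):
                centroids.append(members.mean(axis=0))   # cluster representative
                weights.append(len(members))             # cluster weight = size
    centroids = np.array(centroids)
    weights = np.array(weights, dtype=float)
    # Phase 2: weighted k-means over the per-bucket cluster representatives.
    final = KMeans(n_clusters=k_final, n_init=10).fit(centroids, sample_weight=weights)
    return final.labels_, centroids, weights

# Usage with X and buckets from the earlier LSH sketch:
# labels, centroids, weights = two_phase_clustering(X, buckets)
```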
Conclusion and Future Work • Brute-force kernel methods require O(N²) space and computation, where the assumption that the data fits in main memory no longer holds. • Approximating the full Gram matrix with a sparse one, exploiting the radial basis property of such methods, reduces this quadratic cost to sub-quadratic. • Using locality sensitive hashing we can find the close points and compute the kernel function only between them, and we can also distribute the processing, since the bucket becomes the base processing unit. • Future work: controlling the error as k increases, so that we can run on very large scale data while maintaining sufficient accuracy.