Dimension Reduction in the Hamming Cube (and its Applications)

Explore dimension reduction in the Hamming cube, applications in ANN and k-clustering, communication complexity game, clustering formulations, and addressing high-dimensional data challenges. Discover how data is represented and clustered using various methods.

Presentation Transcript


  1. Dimension Reduction in the Hamming Cube (and its Applications) Rafail Ostrovsky UCLA (joint works with Rabani; and Kushilevitz and Rabani)

  2. PLAN • Problem Formulations • Communication complexity game • What really happened? (dimension reduction) • Solutions to 2 problems • ANN • k-clustering • What’s next?

  3. Problem statements • Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion • Q: how do we do it for the Hamming cube? • (we show how to avoid the impossibility result of [Charikar-Sahai])

  4. Many different formulations of ANN • ANN – “approximate nearest neighbor search” • (many applications in computational geometry, biology/stringology, IR, other areas) • Here are different formulations:

  5. Approximate Searching • Motivation: given a DB of “names” and a user with a “target” name, find whether any of the DB names are “close” to the target name, without doing a linear scan.

  6. Geometric formulation • Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow

  7. Data structure question • given a new red point, find closest blue point.

  8. Can we do better? • Easy in small dimensions (Voronoi diagrams) • “Curse of dimensionality” in High Dimensions… • [KOR]: Can get a good “approximate” solution efficiently!

  9. Hamming Cube Formulation for ANN • Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube {0,1}^n • Naïve solution 2: pre-compute all (exponential #) of answers (want small data-structures!)

  10. Clustering problem that I’ll discuss in detail • K-clustering

  11. An example of Clustering – find “centers” • Given N points in R^d

  12. A clustering formulation • Find cluster “centers”

  13. Clustering formulation • The “cost” is the sum of distances

  14. Main technique • First, as a communication game • Second, interpreted as a dimension reduction

  15. COMMUNICATION COMPLEXITY GAME • Given two players, Alice and Bob: • Alice is secretly given string x • Bob is secretly given string y • they want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness • How can they do it? (say |x| = |y| = N) • Much easier: how do we check that x = y?
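
The "much easier" equality question has a standard shared-randomness answer, sketched here in Python. The slide does not spell out a protocol, so this is one common choice rather than necessarily the speaker's: both players take the inner product mod 2 of their string with the same random string r and compare the resulting bits. The names probably_equal and trials are illustrative, and strings are stored as Python integers.

```python
import random


def parity(v: int) -> int:
    """Parity (0 or 1) of the number of set bits in v."""
    return bin(v).count("1") & 1


def probably_equal(x: int, y: int, n: int, trials: int, rng: random.Random) -> bool:
    """Randomized equality test using one communicated bit per trial.

    rng stands in for the common random source shared by Alice and Bob.
    """
    for _ in range(trials):
        r = rng.getrandbits(n)      # shared uniformly random n-bit string
        alice_bit = parity(x & r)   # <x, r> mod 2, the one bit Alice sends
        bob_bit = parity(y & r)     # <y, r> mod 2, computed locally by Bob
        if alice_bit != bob_bit:
            return False            # a mismatch proves x != y
    # Equal strings always reach this point; unequal strings reach it
    # with probability 2**-trials.
    return True
```

If x and y differ, each trial catches the difference with probability exactly 1/2, so the number of communicated bits depends only on the target error and not on the string length.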

  16. Main lemma: an abstract game • How can Alice and Bob estimate the Hamming distance between X and Y with small CC? • We assume Alice and Bob share randomness

  17. A simpler question • To estimate the Hamming distance between X and Y (within a factor of (1+ε)) with small CC, it suffices that Alice and Bob can, for any L, distinguish between the two cases: • H(X,Y) <= L OR • H(X,Y) > (1+ε) L • Q: why does sampling not work?
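
The reduction on this slide can be sketched as a search over geometrically growing thresholds. In the Python below, is_close is a hypothetical gap oracle (not something the slides define): it must return True when H(X,Y) <= L, False when H(X,Y) > (1+ε)L, and may answer either way in between. The sketch assumes H(X,Y) >= 1, since the case X = Y is handled by an equality test.

```python
from typing import Callable


def estimate_distance(is_close: Callable[[float], bool], n: int, eps: float) -> float:
    """Return the first threshold L = (1+eps)**i that the oracle accepts.

    is_close(L) is an assumed gap oracle for "H <= L" versus "H > (1+eps)L".
    n is the string length, which caps the possible distance.
    """
    L = 1.0
    while L < n and not is_close(L):
        L *= 1 + eps   # move to the next scale
    return min(L, float(n))
```

With an exact oracle the returned value lies between H and (1+ε)H. With a true gap oracle the in-between answers can cost roughly one more factor of (1+ε). Only about log n / log(1+ε) scales are ever tried.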

  18. Alice and Bob pick the SAME n-bit blue R • each bit of R = 1 independently with probability 1/(2L) • [Figure: Alice and Bob each compute an XOR, producing one 0/1 bit each]

  19. What is the difference in probabilities? H(X,Y) <= L versus H(X,Y) > (1+ε) L
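
The slide's question can be answered numerically with a short Python sketch. It assumes the reading that each bit of R is 1 independently with probability 1/(2L), and that each player outputs the XOR of their bits selected by R. The two output bits differ exactly when R selects an odd number of the h = H(X,Y) positions where X and Y disagree, and for inclusion probability p that happens with probability (1 - (1 - 2p)^h) / 2. The function name disagree_prob is illustrative.

```python
def disagree_prob(h: float, L: float) -> float:
    """Probability that the two one-bit sketches differ.

    h is the Hamming distance, L >= 1 is the scale, and each bit of R
    is assumed to be 1 with probability 1/(2L), so 1 - 2p = 1 - 1/L.
    """
    return (1.0 - (1.0 - 1.0 / L) ** h) / 2.0


def gap(L: float, eps: float) -> float:
    """Difference between the h = (1+eps)L and h = L disagreement probabilities."""
    return disagree_prob((1.0 + eps) * L, L) - disagree_prob(L, L)
```

The probability grows with h, so the two cases are separated by a positive gap. For large L and small ε the gap is roughly proportional to ε, which is why a single bit is not enough and repetition is needed.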

  20. How do we amplify?

  21. How do we amplify? - Repeat, with many independent R’s but same distribution!
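
A minimal Python sketch of the amplification step, under stated assumptions: the players repeat the one-bit test with m independent R's and compare the fraction of disagreeing bits against a threshold placed midway between the two cases' disagreement probabilities. The midpoint threshold and the names looks_far and sample_mask are illustrative choices, not taken from the slides.

```python
import random


def parity(v: int) -> int:
    """Parity (0 or 1) of the number of set bits in v."""
    return bin(v).count("1") & 1


def sample_mask(n: int, L: float, rng: random.Random) -> int:
    """Random n-bit R with each bit set independently with probability 1/(2L)."""
    p = 1.0 / (2.0 * L)
    mask = 0
    for i in range(n):
        if rng.random() < p:
            mask |= 1 << i
    return mask


def looks_far(x: int, y: int, n: int, L: float, eps: float, m: int,
              rng: random.Random) -> bool:
    """Guess whether H(x,y) > (1+eps)L (True) or H(x,y) <= L (False)."""
    p_close = (1 - (1 - 1 / L) ** L) / 2              # disagreement prob. at h = L
    p_far = (1 - (1 - 1 / L) ** ((1 + eps) * L)) / 2  # at h = (1+eps)L
    threshold = (p_close + p_far) / 2
    disagreements = 0
    for _ in range(m):
        r = sample_mask(n, L, rng)                    # shared by both players
        disagreements += parity(x & r) ^ parity(y & r)
    return disagreements / m > threshold
```

Alice sends m bits in total. By a standard Chernoff bound, m on the order of log(1/δ)/ε² repetitions should bring the error below δ, although the constants are not tuned here.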

  22. A refined game with small communication • How can Alice and Bob distinguish X and Y: • H(X,Y) <= L OR • H(X,Y) > (1+ε) L

  23. Dimension Reduction in the Hamming Cube [OR]

  24. Dimension Reduction in the Hamming Cube [OR]
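
Only the titles of these two slides survive, so the following Python sketch shows how the game could be read as a map between cubes, not necessarily the exact [OR] statement. A point of {0,1}^n is sent to the m parities it forms with m shared random strings, giving a point of {0,1}^m. The function names, and the idea of choosing m on the order of log N / ε², are assumptions.

```python
import random


def make_projection(n: int, L: float, m: int, rng: random.Random) -> list:
    """m random n-bit masks, each bit set independently with probability 1/(2L)."""
    p = 1.0 / (2.0 * L)
    return [sum(1 << i for i in range(n) if rng.random() < p) for _ in range(m)]


def project(x: int, masks: list) -> int:
    """Map x in {0,1}^n to {0,1}^m: output bit j is <x, masks[j]> mod 2."""
    out = 0
    for j, r in enumerate(masks):
        out |= (bin(x & r).count("1") & 1) << j
    return out


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit strings stored as integers."""
    return bin(a ^ b).count("1")
```

The map is linear over GF(2), so project(x ^ y, masks) equals project(x, masks) ^ project(y, masks). The distance between two projected points therefore depends only on x XOR y, and its expected value is m times the one-bit disagreement probability.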

  25. Applications • Applications of the dimension reduction in the Hamming cube • For ANN in the Hamming cube and R^d • For K-Clustering

  26. Application to ANN in the Hamming Cube • For each possible L build a “small cube” and project original DB to a small cube • Pre-compute inverse table for each entry of the small cube. • Why is this efficient? • How do we answer any query? • How do we navigate between different L?

  27. Putting it all together: user’s private approximate search from DB • Each projection is O(log N) R’s. User picks many such projections for each L-range. That defines all the embeddings. • Now, DB builds inverse lookup tables for each projection as new DB’s for each L. • User can now “project” the query into the small cube and use binary search on L
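
The construction on the last two slides can be sketched in Python with several simplifications, all of them assumptions. It uses one projection per scale L = 2, 4, 8, ... instead of many per L-range. The inverse table stores only the occupied small-cube entries and is scanned at query time, instead of being pre-computed for every entry. The privacy aspect is ignored. The class name HammingANN and the defaults m = 256 and eps = 1.0 are illustrative, and the database is assumed to be non-empty.

```python
import random


def parity(v: int) -> int:
    return bin(v).count("1") & 1


def project(x: int, masks: list) -> int:
    """Pack the parities <x, R_j> mod 2 into one m-bit integer."""
    out = 0
    for j, r in enumerate(masks):
        out |= parity(x & r) << j
    return out


class HammingANN:
    """One small cube per scale L = 2, 4, 8, ...; binary search over scales."""

    def __init__(self, db, n, m=256, eps=1.0, seed=0):
        rng = random.Random(seed)
        self.db = list(db)
        self.levels = []  # entries: (masks, threshold, inverse table)
        L = 2
        while True:
            masks = [sum(1 << i for i in range(n) if rng.random() < 1 / (2 * L))
                     for _ in range(m)]
            p_close = (1 - (1 - 1 / L) ** L) / 2
            p_far = (1 - (1 - 1 / L) ** ((1 + eps) * L)) / 2
            table = {}
            for idx, x in enumerate(self.db):
                table.setdefault(project(x, masks), idx)  # cube entry -> DB index
            self.levels.append((masks, m * (p_close + p_far) / 2, table))
            if L >= n:
                break
            L *= 2

    def _closest(self, level, q):
        """Projected distance and DB index of the nearest occupied entry."""
        masks, _, table = self.levels[level]
        pq = project(q, masks)
        key = min(table, key=lambda k: bin(k ^ pq).count("1"))
        return bin(key ^ pq).count("1"), table[key]

    def query(self, q):
        """Binary search for the smallest scale whose test reports a close point."""
        lo, hi = 0, len(self.levels) - 1
        _, best = self._closest(hi, q)  # fallback: answer from the coarsest scale
        while lo <= hi:
            mid = (lo + hi) // 2
            dist, idx = self._closest(mid, q)
            if dist <= self.levels[mid][1]:
                best, hi = idx, mid - 1
            else:
                lo = mid + 1
        return self.db[best]
```

The binary search treats "some point passes at scale L" as monotone in L, which holds only with high probability and only for sufficiently large m. The sketch illustrates the data flow and carries no approximation guarantee.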

  28. MAIN THM [KOR] • Can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-log in N • For the Hamming cube • L_1 • L_2 • Square of the Euclidean distance • [IM] had similar results, with a slightly weaker guarantee.

  29. Dealing with R^d • Project to random lines, choose “cut” points… • Well, not exactly… we need “navigation”

  30. Clustering • Huge number of applications (IR, mining, analysis of stat data, biology, automatic taxonomy formation, web, topic-specific data-collections, etc.) • Two independent issues: • Representation of data • Forming “clusters” (many incomparable methods)

  31. Representation of data examples • Latent semantic indexing yields points in R^d with l_2 distance (distance indicating similarity) • Min-wise permutation (Broder et al.) approach yields points in the Hamming metric • Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings • Recent news: [OR-95] showed that we can embed the edit-distance metric into l_1 with small distortion: distortion = exp(sqrt(log n log log n))

  32. Geometric Clustering: examples • Min-sum clustering in R^d: form clusters s.t. the sum of intra-cluster distances is minimized • K-clustering: pick k “centers” in the ambient space. The cost is the sum of distances from each data-point to the closest center • Agglomerative clustering (form clusters below some distance-threshold) • Q: which is better?
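
The two cost functions named on this slide can be written down directly. In this Python sketch, points are tuples and dist is any distance function supplied by the caller. Reading "sum of intra-cluster distances" as a sum over unordered pairs within each cluster is an assumption.

```python
from typing import Callable, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(u != v for u, v in zip(a, b))


def k_clustering_cost(points, centers, dist: Callable) -> float:
    """Sum over data points of the distance to the closest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def min_sum_cost(clusters, dist: Callable) -> float:
    """Sum of pairwise distances inside each cluster (unordered pairs)."""
    total = 0
    for cluster in clusters:
        for i in range(len(cluster)):
            for j in range(i + 1, len(cluster)):
                total += dist(cluster[i], cluster[j])
    return total
```

For example, with points (0,0,0), (0,0,1), (1,1,1), (1,1,0) and centers (0,0,0) and (1,1,1), the k-clustering cost under Hamming distance is 0 + 1 + 0 + 1 = 2.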

  33. Methods are (in general) incomparable

  34. Min-SUM

  35. 2-Clustering

  36. A k-clustering problem: notation • N – number of points • d – dimension • k – number of centers

  37. About k-clustering • When k is fixed, this is easy for small d • [Kleinberg, Papadimitriou, Raghavan]: NP-complete for k=2 for the cube • [Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete for R^d for the square of the Euclidean distance • When k is not fixed, this is facility location (Euclidean k-median) • For fixed d but growing k, a PTAS was given by [Arora, Raghavan, Rao] (using dynamic prog.) • (this talk): [OR]: PTAS for fixed k, arbitrary d

  38. Common tools in geometric PTAS • Dynamic programming • Sampling [Schulman, AS, DLVK] • [DFKVV] use SVD • Embeddings/dimension reduction seem useless because • Too many candidate centers • May introduce new centers

  39. [OR] k-clustering result • A PTAS for fixed k • Hamming cube {0,1}^d • l_1^d • l_2^d (Euclidean distance) • Square of the Euclidean distance

  40. Main ideas • For 2-clustering, finding a good partition is as good as solving the problem • Switch to cube • Try partitions in the embedded low-dimensional data set • Given a partition, compute centers and cost in the original data set • Embedding/dim. reduction used to reduce the number of partitions

  41. Stronger property of [OR] dimension reduction • Our random linear transformation preserves ranges!

  42. THE ALGORITHM

  43. The algorithm yet again • Guess 2-center distance • Map to small cube • Partition in the small cube • Measure the partition in the big cube • THM: gets within (1+ε) of optimal. • Disclaimer: a PTAS is (almost) never practical; this shows “feasibility only”, and more ideas are needed for a practical solution.
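
The four steps on this slide can be sketched as a toy Python routine. It is not the [OR] algorithm and carries no (1+ε) guarantee. The assumptions are: the guessed distance guess_L (at least 1) sets the sampling rate 1/(2·guess_L) of the projection, every pair of small-cube points is tried as a pair of projected centers, and each part's center in the big cube is its coordinate-wise majority, which minimizes the sum of Hamming distances to the part. All names are illustrative.

```python
import itertools
import random


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def project(x: int, masks: list) -> int:
    """Output bit j is the parity of x restricted to masks[j]."""
    out = 0
    for j, r in enumerate(masks):
        out |= (bin(x & r).count("1") & 1) << j
    return out


def majority_center(cluster: list, d: int) -> int:
    """Coordinate-wise majority vote: the best single center for the cluster."""
    center = 0
    for i in range(d):
        ones = sum((x >> i) & 1 for x in cluster)
        if 2 * ones > len(cluster):
            center |= 1 << i
    return center


def two_clustering_sketch(points, d, guess_L, m=4, seed=0):
    """Return (cost, centers) of the best partition found via the small cube."""
    rng = random.Random(seed)
    masks = [sum(1 << i for i in range(d) if rng.random() < 1 / (2 * guess_L))
             for _ in range(m)]                      # map to the small cube
    proj = [project(x, masks) for x in points]
    best = None
    pairs = itertools.combinations_with_replacement(range(2 ** m), 2)
    for c1, c2 in pairs:                             # candidate projected centers
        left = [x for x, px in zip(points, proj)
                if hamming(px, c1) <= hamming(px, c2)]
        right = [x for x, px in zip(points, proj)
                 if hamming(px, c1) > hamming(px, c2)]
        parts = [part for part in (left, right) if part]
        centers = [majority_center(part, d) for part in parts]
        # Measure this partition in the original (big) cube.
        value = sum(hamming(x, c) for part, c in zip(parts, centers) for x in part)
        if best is None or value < best[0]:
            best = (value, centers)
    return best
```

A caller would loop over guesses of guess_L and keep the cheapest result. The enumeration covers 2^m · (2^m + 1) / 2 center pairs, so it is only feasible because m is small. That is the role the dimension reduction plays here.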

  44. Dealing with k>2 • Apex of a tournament is a node of max out-degree • Fact: apex has a path of length at most 2 to every node • Every point is assigned an apex of center “tournaments”: • Guess all (k choose 2) center distances • Embed into (k choose 2) small cubes • Guess center-projection in small cubes • For every point, for every pair of centers, define a “tournament”: which center is closer in the projection
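
The tournament fact on this slide is easy to check in code. In the Python sketch below, the adjacency-matrix encoding and function names are illustrative: beats[u][v] is True when the edge points from u to v. In the talk's setting u and v would be centers, with the edge pointing to the one that is closer to a given point in the projection for that pair.

```python
def apex(beats: list) -> int:
    """Index of a node with maximum out-degree in a tournament."""
    n = len(beats)
    return max(range(n), key=lambda u: sum(beats[u][v] for v in range(n) if v != u))


def reaches_within_two(beats: list, u: int) -> bool:
    """True if every other node is reachable from u by a path of length 1 or 2."""
    n = len(beats)
    for v in range(n):
        if v == u or beats[u][v]:
            continue  # v is u itself or a direct out-neighbor
        if not any(beats[u][w] and beats[w][v] for w in range(n)):
            return False  # no intermediate node w with u -> w -> v
    return True
```

The fact holds because if some node v beat the apex u and also beat every out-neighbor of u, then v would have a strictly larger out-degree than u.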

  45. Conclusions • Dimension reduction in the cube makes it possible to deal with a huge number of “incomparable” attributes. • Embeddings of other metrics into the cube allow fast ANN for other metrics • Real applications still require considerable additional ideas • Fun area to work in
