Explore dimension reduction in the Hamming cube and its applications to approximate nearest neighbor (ANN) search and k-clustering: the communication complexity game behind the reduction, clustering formulations, and how high-dimensional data can be represented and clustered.
Dimension Reduction in the Hamming Cube (and its Applications) Rafail Ostrovsky UCLA (joint works with Rabani; and Kushilevitz and Rabani)
PLAN • Problem Formulations • Communication complexity game • What really happened? (dimension reduction) • Solutions to 2 problems • ANN • k-clustering • What’s next?
Problem statements • Johnson–Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion • Q: how do we do it for the Hamming cube? • (we show how to avoid the impossibility result of [Charikar–Sahai])
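Not from the slides: as a point of reference, the Euclidean version of the lemma is usually realized by a random Gaussian projection. A minimal sketch (the constant 4 in the target dimension is illustrative, not a claim from the talk):

```python
import numpy as np

def jl_project(points, eps=0.5, seed=0):
    """Random-projection sketch of the Johnson-Lindenstrauss lemma.

    Maps n points in R^d to k = O(log n / eps^2) dimensions with a scaled
    Gaussian matrix; all pairwise Euclidean distances are preserved up to a
    (1 +/- eps) factor with high probability.
    """
    rng = np.random.default_rng(seed)
    n, d = points.shape
    k = int(np.ceil(4 * np.log(n) / eps ** 2))     # illustrative constant
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # random projection matrix
    return points @ R
```

The question of the talk is how to get an analogous guarantee for n-bit strings under Hamming distance, where Gaussian projections do not apply directly.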
Many different formulations of ANN • ANN – “approximate nearest neighbor search” • (many applications in computational geometry, biology/stringology, IR, other areas) • Here are different formulations:
Approximate Searching • Motivation: given a DB of “names” and a user with a “target” name, find whether any of the DB names are “close” to the target name, without doing a linear scan.
Geometric formulation • Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow
Data structure question • given a new red point, find closest blue point.
Can we do better? • Easy in small dimensions (Voronoi diagrams) • “Curse of dimensionality” in High Dimensions… • [KOR]: Can get a good “approximate” solution efficiently!
Hamming Cube Formulation for ANN • Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube {0,1}^n • Naïve solution 2: pre-compute all (exponentially many) answers (we want small data structures!)
Clustering problem that I’ll discuss in detail • K-clustering
An example of Clustering – find “centers” • Given N points in R^d
A clustering formulation • Find cluster “centers”
Clustering formulation • The “cost” is the sum of distances
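Written out (with points x_1, …, x_N, centers c_1, …, c_k, and a distance function d, matching the notation introduced later), the cost being minimized is

```latex
\mathrm{cost}(c_1,\dots,c_k)\;=\;\sum_{i=1}^{N}\;\min_{1\le j\le k} d(x_i,c_j).
```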
Main technique • First, as a communication game • Second, interpreted as a dimension reduction
COMMUNICATION COMPLEXITY GAME • Given two players, Alice and Bob: • Alice is secretly given string x • Bob is secretly given string y • They want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness • How can they do it? (say |x| = |y| = N) • Much easier: how do we check that x = y?
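For the warm-up question x = y, one standard answer with shared randomness (a sketch, not taken from the slides): compare random parities. If x ≠ y, the parities over a uniformly random subset of coordinates disagree with probability exactly 1/2, so a few rounds drive the error down exponentially.

```python
import numpy as np

def probably_equal(x, y, trials=40, seed=0):
    """Equality test using O(trials) bits of communication.

    Alice and Bob share the random subsets R (common randomness); each sends
    the parity of their own string restricted to R.  If x != y, each round
    catches the difference with probability 1/2, so the error probability of
    declaring "equal" is 2**(-trials).
    """
    rng = np.random.default_rng(seed)              # models the shared randomness
    x, y = np.asarray(x), np.asarray(y)
    for _ in range(trials):
        R = rng.integers(0, 2, size=len(x))        # random subset of coordinates
        if (R @ x) % 2 != (R @ y) % 2:             # parities disagree => x != y
            return False
    return True
```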
Main lemma: an abstract game • How can Alice and Bob estimate the Hamming distance between X and Y with small CC? • We assume Alice and Bob share randomness
A simpler question • To estimate the Hamming distance between X and Y (within a factor (1+ε)) with small CC, it suffices that Alice and Bob can, for any L, distinguish: • H(X,Y) <= L OR • H(X,Y) > (1+ε) L • Q: why doesn’t naive sampling work?
Alice and Bob pick the SAME random n-bit string R, with each bit of R set to 1 independently with probability 1/(2L) • Each XORs together the bits of their own string at the positions where R is 1, and announces the resulting 0/1 bit
What is the difference in probabilities between H(X,Y) <= L and H(X,Y) > (1+ε) L?
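To answer this (a one-line calculation, assuming the sampling probability above is 1/(2L)): the two announced bits disagree exactly when R selects an odd number of the H(X,Y) coordinates where X and Y differ, so

```latex
\Pr[\text{bits differ}] \;=\; \frac{1-\bigl(1-\tfrac{1}{L}\bigr)^{H(X,Y)}}{2},
```

which is roughly (1 − e^{-1})/2 when H(X,Y) = L and roughly (1 − e^{-(1+ε)})/2 when H(X,Y) = (1+ε)L; the gap between the two cases is a constant depending only on ε.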
How do we amplify? • Repeat with many independent R’s drawn from the same distribution!
A refined game with small communication • How can Alice and Bob distinguish X and Y: • H(X,Y) <= L OR • H(X,Y) > (1+ε) L
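A minimal sketch of the refined game (an illustrative reconstruction; the number of repetitions and the decision threshold below are assumptions, not parameters from the talk): both parties use the shared randomness to draw the same collection of sparse random vectors R, each sends the resulting parity bits, and the fraction of disagreements is compared against a threshold separating the two cases.

```python
import numpy as np

def sketch_bits(x, L, m, seed):
    """m parity bits: inner products mod 2 of x with m random vectors whose
    coordinates are 1 independently with probability 1/(2L)."""
    rng = np.random.default_rng(seed)                        # shared randomness
    R = (rng.random((m, len(x))) < 1 / (2 * L)).astype(int)  # m sparse random vectors
    return (R @ np.asarray(x)) % 2                           # one bit per vector

def looks_close(x, y, L, eps, m=2000, seed=0):
    """Decide 'H(x,y) <= L' vs 'H(x,y) > (1+eps)L' from the sketches only.

    The disagreement rate concentrates around (1 - (1 - 1/L)**H) / 2, so a
    threshold midway between the two cases separates them with high
    probability once m = O(1/eps^2 * log(1/delta)).
    """
    a = sketch_bits(x, L, m, seed)                      # Alice's m bits
    b = sketch_bits(y, L, m, seed)                      # Bob's m bits (same R's)
    rate = np.mean(a != b)
    p_low = (1 - (1 - 1 / L) ** L) / 2                  # expected rate if H = L
    p_high = (1 - (1 - 1 / L) ** ((1 + eps) * L)) / 2   # expected rate if H = (1+eps)L
    return rate <= (p_low + p_high) / 2                 # True => "H(x,y) <= L" (w.h.p.)
```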
Applications • Applications of the dimension reduction in the Hamming cube • For ANN in the Hamming cube and R^d • For k-clustering
Application to ANN in the Hamming Cube • For each possible L, build a “small cube” and project the original DB to the small cube • Pre-compute an inverse table for each entry of the small cube. • Why is this efficient? • How do we answer any query? • How do we navigate between different L?
Putting it all together: user’s private approx. search from a DB • Each projection is O(log N) R’s. The user picks many such projections for each L-range; that defines all the embeddings. • Now, the DB builds inverse lookup tables for each projection, as new DB’s, one per L. • The user can now “project” its query into the small cube and use binary search on L
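A toy version of the data-structure side for a single scale L (a sketch under assumptions; the real [KOR] construction uses a small cube of dimension O(ε⁻² log N), many independent projections per scale, and a binary search over the scales, none of which is shown here):

```python
import itertools
import numpy as np

class SmallCubeIndex:
    """Toy ANN index for one distance scale L (illustrative only).

    Projects n-bit strings into an m-bit "small cube" via m random parities
    (each coordinate kept with probability 1/(2L)), then precomputes, for
    every point of the small cube, which database string projects closest.
    """
    def __init__(self, db, L, m=12, seed=0):
        db = np.asarray(db)                                       # shape (N, n), 0/1 entries
        rng = np.random.default_rng(seed)                         # shared with queries
        self.R = (rng.random((m, db.shape[1])) < 1 / (2 * L)).astype(int)
        proj = (db @ self.R.T) % 2                                # DB inside the small cube
        self.table = {}
        for z in itertools.product((0, 1), repeat=m):             # all 2^m small-cube points
            dists = np.sum(proj != np.array(z), axis=1)           # Hamming dist. to DB images
            self.table[z] = int(np.argmin(dists))                 # index of the best DB string

    def query(self, q):
        z = tuple(int(b) for b in (self.R @ np.asarray(q)) % 2)   # project with the same R's
        return self.table[z]                                      # answer by table lookup
```

The point of the inverse table is that query time depends only on the small-cube dimension m, not on the original dimension n or the DB size N.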
MAIN THM [KOR] • Can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-log in N • For the Hamming cube • L_1 • L_2 • Square of the Euclidean distance • [IM] had a similar result, with a slightly weaker guarantee.
Dealing with R^d • Project to random lines, choose “cut” points… • Well, not exactly… we need “navigation”
Clustering • Huge number of applications (IR, mining, analysis of stat data, biology, automatic taxonomy formation, web, topic-specific data-collections, etc.) • Two independent issues: • Representation of data • Forming “clusters” (many incomparable methods)
Representation of data: examples • Latent semantic indexing yields points in R^d with l_2 distance (distance indicating similarity) • The min-wise permutation approach (Broder et al.) yields points in the Hamming metric • Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings • Recent news: [OR-05] showed that the edit-distance metric can be embedded into l_1 with distortion exp(√(log n · log log n))
Geometric Clustering: examples • Min-sum clustering in R^d: form clusters s.t. the sum of intra-cluster distances is minimized • k-clustering: pick k “centers” in the ambient space; the cost is the sum of distances from each data point to the closest center • Agglomerative clustering (form clusters below some distance threshold) • Q: which is better?
A k-clustering problem: notation • N – number of points • d – dimension • k – number of centers
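With this notation, the objective from the earlier slide takes a few lines to evaluate (a plain cost computation for a candidate set of centers, not the [OR] algorithm itself):

```python
import numpy as np

def k_clustering_cost(points, centers):
    """Sum over the N points of the distance to the nearest of the k centers.

    points: (N, d) array; centers: (k, d) array.  Euclidean distance is used
    here; swap in Hamming or squared-Euclidean distance for the other variants
    discussed in the talk.
    """
    diffs = points[:, None, :] - centers[None, :, :]   # (N, k, d) pairwise differences
    dists = np.linalg.norm(diffs, axis=2)               # (N, k) point-to-center distances
    return dists.min(axis=1).sum()                       # nearest center per point, summed
```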
About k-clustering • When k is fixed, this is easy for small d • [Kleinberg, Papadimitriou, Raghavan]: NP-complete for k=2 for the cube • [Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete for R^d for the square of the Euclidean distance • When k is not fixed, this is facility location (Euclidean k-median) • For fixed d but growing k, a PTAS was given by [Arora, Raghavan, Rao] (using dynamic prog.) • (this talk) [OR]: PTAS for fixed k, arbitrary d
Common tools in geometric PTAS • Dynamic programming • Sampling [Schulman, AS, DLVK] • [DFKVV] use SVD • Embeddings/dimension reduction seem useless because • Too many candidate centers • May introduce new centers
[OR] k-clustering result • A PTAS for fixed k • Hamming cube {0,1}^d • l_1^d • l_2^d (Euclidean distance) • Square of the Euclidean distance
Main ideas • For 2-clustering, finding a good partition is as good as solving the problem • Switch to the cube • Try partitions in the embedded low-dimensional data set • Given a partition, compute centers and cost in the original data set • Embedding/dim. reduction is used to reduce the number of partitions
Stronger property of [OR] dimension reduction • Our random linear transformation preserves ranges!
The algorithm yet again • Guess the 2-center distance • Map to the small cube • Partition in the small cube • Measure the partition in the big cube (see the sketch below) • THM: gets within (1+ε) of optimal. • Disclaimer: a PTAS is (almost never) practical; this shows feasibility only, and more ideas are needed for a practical solution.
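The “measure the partition in the big cube” step is easy to make concrete: in the Hamming cube, the best center of a fixed cluster is the coordinate-wise majority of its points, so any candidate partition can be scored directly in the original space. A sketch of just this scoring step (not of the whole PTAS, whose efficient enumeration of candidate partitions in the small cube is the actual content of [OR]):

```python
import numpy as np

def score_partition(points, labels):
    """Cost of a 2-partition of Hamming-cube points, measured in the big cube.

    points: (N, n) 0/1 array; labels: length-N array of 0/1 cluster labels.
    For a fixed cluster, the center minimizing the sum of Hamming distances
    is the coordinate-wise majority vote of the cluster's points.
    """
    total = 0
    for c in (0, 1):
        cluster = points[labels == c]
        if len(cluster) == 0:
            continue
        center = (cluster.sum(axis=0) * 2 >= len(cluster)).astype(int)  # majority bit per coord
        total += int(np.sum(cluster != center))     # sum of Hamming distances to the center
    return total
```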
Dealing with k>2 • The apex of a tournament is a node of maximum out-degree • Fact: an apex has a path of length at most 2 to every node • Every point is assigned to an apex of its center “tournament”: • Guess all (k choose 2) center distances • Embed into (k choose 2) small cubes • Guess the center-projections in the small cubes • For every point and every pair of centers, a “tournament” records which center is closer in the projection
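The assignment step can be sketched directly (illustrative; in the algorithm the pairwise comparisons come from the (k choose 2) small-cube projections, here they are simply passed in as a precomputed matrix):

```python
import numpy as np

def assign_by_tournament(closer):
    """Assign one point to a center via its tournament.

    closer[i][j] is True if, in the relevant small-cube projection, center i
    looks closer to the point than center j (and False on the diagonal).
    The apex, a vertex of maximum out-degree, reaches every other vertex by
    a path of length at most 2, which is what bounds the assignment error.
    """
    closer = np.asarray(closer)
    out_degree = closer.sum(axis=1)        # number of "wins" of each center
    return int(np.argmax(out_degree))      # an apex of the tournament
```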
Conclusions • Dimension reduction in the cube lets us deal with a huge number of “incomparable” attributes • Embeddings of other metrics into the cube allow fast ANN for those metrics • Real applications still require considerable additional ideas • Fun area to work in