NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms Liang Jin (UC Irvine) Nick Koudas (AT&T) Chen Li (UC Irvine)
Outline • Motivation: NN search • NNH: Proposed histogram structure • Main idea • Utilizing NNH in a search (KNN, join) • Constructing NNH • Incremental maintenance • Experiments
NN (nearest-neighbor) search • KNN: find the k nearest neighbors of a query object q • NN-join: for each object in the first dataset D1, find its k nearest neighbors in the second dataset D2
Example: image search • Images represented as features (color histogram, texture moments, etc.) • Similarity search using these features • “Find the 10 most similar images for the query image”
Other Applications • Web-page search • “Find the 100 most similar pages for a given page” • Page represented as a word-frequency vector • Similarity: vector distance • GIS: “find the 5 cities closest to Irvine” • CAD, information retrieval, molecular biology, data cleansing, … • Challenges: efficiency, scalability
NN Algorithms • Distance measurement: • For point objects, distance is well defined • Usually Euclidean • Other distances possible • For arbitrarily shaped objects, we assume a distance function between them is given • Most algorithms assume a high-dimensional tree structure exists for the datasets
Example: R-Trees Take 2-d space as an example.
Minimal Bounding Rectangle • An MBR is an n-dimensional rectangle that bounds its corresponding objects • MBR face property: every face of any MBR contains at least one point of some object
Search process (1-NN for example) • Most algorithms traverse the structure (e.g., an R-tree) top down, following a branch-and-bound approach • Keep a priority queue of nodes (MBRs) to be visited • Sorted by the “minimum distance” (MINDIST) between q and each node • Improvements: • Use MINDIST and MINMAXDIST • Reduce the queue size • Avoid unnecessary disk IOs to access MBRs
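To make the traversal concrete, here is a minimal C++ sketch of the best-first, branch-and-bound loop (1-NN, MINDIST-only pruning), assuming a simplified in-memory tree; Rect, Node, and nearestNeighborDist are illustrative stand-ins, not the paper's code:

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Simplified in-memory stand-ins for R-tree structures (hypothetical names).
struct Rect { std::vector<double> lo, hi; };            // an MBR
struct Node {
    Rect mbr;
    std::vector<Node*> children;                        // empty => leaf
    std::vector<std::vector<double>> points;            // objects in a leaf
};

// MINDIST: the smallest possible distance between point q and rectangle r.
double minDist(const std::vector<double>& q, const Rect& r) {
    double s = 0;
    for (std::size_t i = 0; i < q.size(); ++i) {
        double d = 0;
        if (q[i] < r.lo[i]) d = r.lo[i] - q[i];
        else if (q[i] > r.hi[i]) d = q[i] - r.hi[i];
        s += d * d;
    }
    return std::sqrt(s);
}

double dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Best-first 1-NN: always expand the queued node with the smallest MINDIST;
// stop once even the closest remaining node cannot beat the best object seen.
double nearestNeighborDist(Node* root, const std::vector<double>& q) {
    using Entry = std::pair<double, Node*>;             // (MINDIST, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    pq.push({minDist(q, root->mbr), root});
    double best = std::numeric_limits<double>::max();
    while (!pq.empty()) {
        auto [d, node] = pq.top();
        pq.pop();
        if (d >= best) break;                           // nothing can improve
        if (node->children.empty()) {
            for (const auto& p : node->points) best = std::min(best, dist(q, p));
        } else {
            for (Node* c : node->children) pq.push({minDist(q, c->mbr), c});
        }
    }
    return best;
}
```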
Pruning in NN search 1. Discard mbr1 if MINDIST(q, mbr1) > MINMAXDIST(q, mbr2) 2. Discard object o if dist(q, o) > MINMAXDIST(q, mbr2) 3. Discard mbr1 if MINDIST(q, mbr1) > dist(q, o)
Problem • Queue size may be large: • Example: 60,000 32-d (image) vectors, 50 NNs • Max queue size: 15K entries • Avg queue size: about half of that (7.5K entries) • If the queue can’t fit in memory, more disk IOs! • Problem is worse for k-NN joins • E.g., a 1500 x 1500 join: • Max queue size: 1.7M entries, i.e., >= 1 GB of memory! • 750 seconds to run • Couldn’t scale beyond 2,000 objects! • Disk thrashing
Our Solution: Nearest-Neighbor Histogram (NNH) • Main idea • Utilizing NNH in a search (KNN, join) • Constructing NNH • Incremental maintenance
NNH: Nearest-Neighbor Histograms • A set of m pivots p1, p2, …, pm • For each pivot: the distances of its nearest neighbors, r1, r2, …
Main idea • Keep a histogram of NN distances of a pre-selected collection of objects (pivots) • Pivots are not part of the database • They give a “big picture” of the objects’ locations • Use the histogram to estimate the NN distance of a query object • Use these estimated NN distances to do more pruning in an NN search
Structure • Nearest-neighbor vector of a pivot p: (r1, r2, …, rT), where each ri is the distance of p’s i-th NN and T is the length of each vector • Nearest-neighbor histogram: the collection of the m pivots with their NN vectors (sketch below)
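A minimal C++ sketch of this structure (names are illustrative; the validLen field anticipates the incremental-maintenance slides later in the talk):

```cpp
#include <cstddef>
#include <vector>

// One pivot with its nearest-neighbor vector: nnDist[i] is the distance
// from this pivot to its (i+1)-th nearest neighbor in the database.
struct Pivot {
    std::vector<double> point;    // the pivot's location
    std::vector<double> nnDist;   // r1 <= r2 <= ... <= rT
    std::size_t validLen;         // Ei: valid prefix, used for maintenance
};

// The nearest-neighbor histogram: m pivots, each with a length-T NN vector.
struct NNHistogram {
    std::vector<Pivot> pivots;
};
```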
Estimate NN distance for a query object • NNH does not give exact NN information for an object • But we can estimate an upper bound qest for the k-NN distance of q • Triangle inequality: for any pivot pi, q’s k-NN distance is at most dist(q, pi) + H(pi, k), since the k NNs of pi all lie within that radius of q
Estimate NN distance for a query object (cont’d) • Apply the triangle inequality to all pivots • Upper-bound estimate of the k-NN distance of q: qest = min over i = 1..m of ( dist(q, pi) + H(pi, k) ) • Complexity: O(m)
Utilizing estimates in NN search • More pruning: prune an mbr if MINDIST(q, mbr) > qest (see the sketch below)
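A sketch of the O(m) estimate and the pruning test it enables; estimateKnnDist and pruneMbr are hypothetical names, and the histogram is passed as plain vectors to keep the example self-contained:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

double dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// qest = min over pivots of ( dist(q, p_i) + H(p_i, k) ).
// pivots[i] is the i-th pivot's location; H[i][k-1] is its k-th NN distance.
double estimateKnnDist(const std::vector<std::vector<double>>& pivots,
                       const std::vector<std::vector<double>>& H,
                       const std::vector<double>& q, std::size_t k) {
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < pivots.size(); ++i)
        best = std::min(best, dist(q, pivots[i]) + H[i][k - 1]);
    return best;
}

// Extra pruning test for the best-first loop: an MBR whose MINDIST to q
// already exceeds the upper bound qest cannot contain any of q's k NNs.
bool pruneMbr(double minDistToMbr, double qest) {
    return minDistToMbr > qest;
}
```

The estimate is computed once per query, so the extra cost is O(m); every MBR it prunes saves queue entries and, potentially, disk IOs.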
Utilizing estimates in NN join • k-NN join: for each object o1 in D1, find its k nearest neighbors in D2 • Preliminary algorithm by Hjaltason and Samet [HS98] • Traverse the two trees top down; keep a priority queue of node/object pairs
Utilizing estimates in NN join (cont’d) • Construct an NNH for D2 • For each object o1 in D1, keep its estimated NN radius o1est using the NNH of D2 • Similar to the k-NN query case, ignore an mbr for o1 if MINDIST(o1, mbr) > o1est
Prune MBR pairs (cont’d) • Prune the pair (mbr1, mbr2) if MINDIST(mbr1, mbr2) exceeds the largest estimated radius o1est among the D1 objects in mbr1 (a sketch follows)
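A sketch of the pair-level test. Aggregating the per-object estimates as a maximum over the D1 objects under mbr1 is our reading of the slide (an assumption here); prunePair is an illustrative name:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Rect { std::vector<double> lo, hi; };

// MINDIST between two MBRs: 0 in a dimension where they overlap, else the gap.
double minDist(const Rect& a, const Rect& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.lo.size(); ++i) {
        double d = 0;
        if (a.hi[i] < b.lo[i]) d = b.lo[i] - a.hi[i];
        else if (b.hi[i] < a.lo[i]) d = a.lo[i] - b.hi[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Pair pruning in the k-NN join (sketch): if the closest the two MBRs can
// possibly get still exceeds the largest estimated NN radius of any D1
// object under mbr1, no object in mbr1 can find one of its k NNs in mbr2.
bool prunePair(const Rect& mbr1, const Rect& mbr2, double maxEstInMbr1) {
    return minDist(mbr1, mbr2) > maxEstInMbr1;
}
```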
How to construct an NNH? • If we have already selected the m pivots: • Just run k-NN queries for them to construct the NNH • Costs m k-NN queries, done offline • The important part is selecting the pivots: • Size-constraint NNH construction • Error-constraint NNH construction
Size-constraint NNH construction • # of pivots “m” determines • Storage size • Initial construction cost • Incremental-maintenance cost • Choose m “best” pivots
Size-constraint NNH construction • Given m: # of pivots • Assumptions: • query objects come from the database D • H(pi, k) doesn’t vary too much • Goal: find pivots p1, p2, …, pm minimizing the sum, over all objects o in D, of the distance from o to its closest pivot • This is a clustering problem: • Many algorithms available • We use k-means for its simplicity and efficiency (sketch below)
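A minimal Lloyd's k-means sketch for selecting the m pivots (naive seeding, fixed iteration count). Note k-means minimizes squared distances, a standard proxy for the stated objective:

```cpp
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

double dist2(const Point& a, const Point& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Plain Lloyd's k-means: alternate assigning objects to their nearest
// center and moving each center to the mean of its cluster.
std::vector<Point> kmeansPivots(const std::vector<Point>& data,
                                std::size_t m, int iters = 20) {
    std::vector<Point> centers(data.begin(), data.begin() + m);  // naive seed
    std::vector<std::size_t> assign(data.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: attach each object to its closest center.
        for (std::size_t i = 0; i < data.size(); ++i) {
            double best = dist2(data[i], centers[0]);
            assign[i] = 0;
            for (std::size_t c = 1; c < m; ++c) {
                double d = dist2(data[i], centers[c]);
                if (d < best) { best = d; assign[i] = c; }
            }
        }
        // Update step: move each center to the mean of its cluster.
        std::vector<Point> sum(m, Point(data[0].size(), 0.0));
        std::vector<std::size_t> cnt(m, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            for (std::size_t d = 0; d < data[i].size(); ++d)
                sum[assign[i]][d] += data[i][d];
            ++cnt[assign[i]];
        }
        for (std::size_t c = 0; c < m; ++c) {
            if (cnt[c] == 0) continue;                   // keep empty centers
            for (std::size_t d = 0; d < sum[c].size(); ++d) sum[c][d] /= cnt[c];
            centers[c] = sum[c];
        }
    }
    return centers;
}
```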
Error-constraint NNH construction • Assumptions: • A threshold r is set a priori • Any estimate within r of the true k-NN distance is considered “good” enough • I.e., a maximum error of r is tolerated for any distance estimate
Error-constraint NNH construction (cont’d) • Find a set of points S = {p1, p2, …, pm} from the dataset D such that for each point pi, its k NNs lie within distance r/2 • Then, for any point q within distance r/2 of some pi, the estimate qest = dist(q, pi) + H(pi, k) ≤ r/2 + r/2 = r, i.e., the k-NN distance of q is estimated within the tolerated error r
Error-constraint NNH construction (cont’d) • Problem: find points such that • they cover the entire dataset with spheres of radius r/2 • the sum of distances from the points in each sphere to its center is minimized • An instance of the “k-center problem” • An efficient 2-approximation algorithm exists that uses a single pass over the dataset (sketch below)
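A single-pass greedy sketch in the spirit of standard k-center approximations (coverPivots is an illustrative name): a point not within r/2 of any chosen pivot becomes a new pivot, so every point ends up covered by a radius-r/2 sphere:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Single pass over the data: open a new sphere (pivot) only when the
// current point is not covered by any existing pivot's r/2 sphere.
std::vector<Point> coverPivots(const std::vector<Point>& data, double r) {
    std::vector<Point> pivots;
    for (const Point& p : data) {
        bool covered = false;
        for (const Point& c : pivots)
            if (dist(p, c) <= r / 2) { covered = true; break; }
        if (!covered) pivots.push_back(p);   // p becomes a new pivot
    }
    return pivots;
}
```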
Incremental Maintenance • How do we update the NNH when inserting or deleting objects? • Need to “shift” each NN vector • Associate a valid length Ei with each NN vector
Insertion • When a new object o is inserted, locate the position j in each pivot pi’s NN vector where r_{j-1} ≤ dist(pi, o) ≤ rj
Insertion (con’t) • If j not found, we don’t need to update this pivot NN vector (why?) • If found: • insert the new radius • shift the vector to the right • increment Ei by 1.
Deletion • Similar to insertion • Locate the position j where rj = dist(pi, o) for the deleted object o • If not found, no update for this vector • If found: • remove rj • shift the rest to the left • decrement Ei by 1 • (a combined sketch of insertion and deletion follows)
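A combined sketch of both updates over a fixed-length NN vector. The cap of Ei at T on insertion and the exact distance match on deletion are simplifying assumptions of this sketch:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

struct Pivot {
    Point point;
    std::vector<double> nnDist;   // r1 <= ... <= rT (fixed length T)
    std::size_t validLen;         // Ei: trustworthy prefix of nnDist
};

// Insertion: if the new object o falls inside the valid prefix, splice its
// distance in, shift the rest right (the last slot drops), and grow Ei.
void onInsert(Pivot& pv, const Point& o) {
    double d = dist(pv.point, o);
    std::size_t j = 0;
    while (j < pv.validLen && pv.nnDist[j] < d) ++j;
    if (j >= pv.validLen) return;     // farther than all valid NNs: no change
    for (std::size_t t = pv.nnDist.size() - 1; t > j; --t)
        pv.nnDist[t] = pv.nnDist[t - 1];
    pv.nnDist[j] = d;
    if (pv.validLen < pv.nnDist.size()) ++pv.validLen;  // Ei grows, capped at T
}

// Deletion: if the removed object o was one of the valid NNs, drop its
// entry, shift left, and shrink Ei (the freed tail slot is now unknown).
void onDelete(Pivot& pv, const Point& o) {
    double d = dist(pv.point, o);
    std::size_t j = 0;
    while (j < pv.validLen && pv.nnDist[j] != d) ++j;   // sketch: exact match
    if (j >= pv.validLen) return;                       // not among valid NNs
    for (std::size_t t = j; t + 1 < pv.nnDist.size(); ++t)
        pv.nnDist[t] = pv.nnDist[t + 1];
    --pv.validLen;
}
```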
Experiments • Dataset: • Corel image database • Contains 60,000 images • Each image represented by a 32-dimensional float vector • Test bed: • PC: 1.5 GHz Athlon, 512 MB memory, 80 GB HD, Windows 2000 • GNU C++ under Cygwin
Questions to be answered • Is the pruning using NNH estimates powerful? • KNN queries • NN-join queries • Is it “cheap” to have such a structure? • Storage • Initial construction • Incremental maintenance
Improvement in k-NN search • Ran the k-means algorithm to generate 400 pivots and construct the NNH • Performed 10-NN queries on 100 randomly selected query objects • Queue size as the benchmark for memory usage: • Max queue size • Average queue size
Improvement in k-NN joins • Selected two subsets from the Corel dataset, each containing 1,500 objects • Unfortunately we couldn’t run on the PC due to the large memory requirement • Ran on a Sun Ultra 4 workstation with four 300 MHz CPUs and 3 GB of memory • Constructed an NNH (400 pivots) for D2
Cost/Benefit of NNH (table: for 60,000 32-d float vectors; “0” means almost zero)
Conclusion • NNH: efficient, effective approach to improving NN-search performance. • Can be easily embedded into current implementation of NN algorithms. • Can be efficiently constructed and maintained. • Offers substantial performance advantages.
Work conducted in the Flamingo Project on Data Cleansing at UC Irvine