NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
Liang Jin (UC Irvine), Nick Koudas (AT&T Labs Research), Chen Li (UC Irvine)
Supported by NSF CAREER No. IIS-0238586
EDBT 2004
NN (nearest-neighbor) search
• KNN: find the k nearest neighbors of a query object q
• NN-join: for each object in the first dataset D1, find its k nearest neighbors in the second dataset D2
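For concreteness, a brute-force sketch of both operations (Python; the helper names are ours, and a real system would use an index rather than a scan):

```python
# Brute-force reference definitions of k-NN search and k-NN join.
# Real systems use index structures (e.g., R-trees); this only fixes the semantics.
import heapq, math

def dist(a, b):
    """Euclidean distance between two equal-length tuples of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(q, data, k):
    """The k objects in `data` closest to q."""
    return heapq.nsmallest(k, data, key=lambda o: dist(q, o))

def nn_join(d1, d2, k):
    """For each object in d1 (hashable, e.g., a tuple), its k NNs in d2."""
    return {o1: knn(o1, d2, k) for o1 in d1}
```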
Example: image search
Query image
• Images represented as features (color histogram, texture moments, etc.)
• Similarity search using these features
• "Find the 10 most similar images for the query image"
• Other applications:
• Web-page search: "Find the 100 most similar pages for a given page"
• GIS: "Find the 5 cities closest to Irvine"
• Data cleaning
NN Algorithms
• Distance measure:
• When objects are points, distance is well defined
• Usually Euclidean; other distances are possible
• For arbitrarily shaped objects, assume we have a distance function between them
• Most algorithms assume a high-dimensional tree structure over the datasets (e.g., an R-tree)
Search process (1-NN for example)
• Most algorithms traverse the structure (e.g., an R-tree) top down, following a branch-and-bound approach
• Keep a priority queue of nodes (MBRs) to be visited
• Sorted by the minimum distance (MINDIST) between q and each node
• Improvements:
• Use MINDIST and MINMAXDIST
• Reduce the queue size
• Avoid unnecessary disk IOs to access MBRs
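A minimal sketch of this best-first traversal, assuming a simple Node type with is_leaf, entries, and rect attributes (illustrative stand-ins, not the paper's implementation):

```python
# Best-first 1-NN search over an R-tree-like structure (sketch).
# Node.is_leaf, Node.entries, and Node.rect are assumed attributes.
import heapq, itertools, math

counter = itertools.count()  # tie-breaker so the heap never compares Node objects

def mindist(q, rect):
    """Minimum Euclidean distance from point q to a rectangle (lo, hi)."""
    lo, hi = rect
    return math.sqrt(sum(max(l - x, 0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def best_first_1nn(root, q, dist):
    best, best_d = None, float("inf")
    heap = [(0.0, next(counter), root)]            # priority queue on MINDIST
    while heap:
        d, _, node = heapq.heappop(heap)
        if d >= best_d:                            # branch-and-bound: nothing closer remains
            break
        if node.is_leaf:
            for obj in node.entries:               # leaf entries are data points
                od = dist(q, obj)
                if od < best_d:
                    best, best_d = obj, od
        else:
            for child in node.entries:             # enqueue children by MINDIST
                heapq.heappush(heap, (mindist(q, child.rect), next(counter), child))
    return best, best_d
```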
Problem
• Queue size may be large:
• 60,000 objects, 32-d (image) vectors, 50 NNs
• Max queue size: 15K entries
• Avg queue size: about half (7.5K entries)
• If the queue can't fit in memory, more disk IOs!
• The problem is worse for k-NN joins
• E.g., a 1500 x 1500 join:
• Max queue size: 1.7M entries, i.e., >= 1 GB of memory!
• 750 seconds to run
• Couldn't scale up to 2000 objects: disk thrashing
Our Solution: Nearest-Neighbor Histogram (NNH) • Main idea • Utilizing NNH in a search (KNN, join) • Construction and incremental maintenance • Experiments • Related work
NNH: Nearest-Neighbor Histograms
• m pivot points p1, p2, …, pm; the pivots are not part of the database
• For each pivot: the distances of its nearest neighbors r1, r2, …
Structure
• Nearest-neighbor vector of a pivot p: each ri is the distance of p's i-th NN
• T: length of each vector
• Nearest-neighbor histogram: the collection of the m pivots with their NN vectors
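One possible in-memory representation (a sketch; the class and the accessor H(p, k) mirror the notation used here but are not the paper's code):

```python
# Sketch of the NNH structure: each pivot keeps the sorted distances
# r1 <= r2 <= ... <= rT to its T nearest database objects.
class NNH:
    def __init__(self):
        self.vectors = {}                 # pivot (tuple) -> sorted list of T distances

    def add_pivot(self, pivot, nn_distances):
        self.vectors[pivot] = sorted(nn_distances)

    def H(self, pivot, k):
        """r_k: the distance from `pivot` to its k-th nearest neighbor."""
        return self.vectors[pivot][k - 1]
```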
Outline • Main idea • Utilizing NNH in a search (KNN, join) • Construction and incremental maintenance • Experiments • Related work
Estimate the NN distance for a query object
• NNH does not give exact NN information for an object
• But we can estimate an upper bound q_est on the k-NN distance of q
• Key tool: the triangle inequality. Pivot p_i has at least k database objects within H(p_i, k) of it, and each of those is within dist(q, p_i) + H(p_i, k) of q
Estimate the NN distance for a query object (cont'd)
• Apply the triangle inequality to all pivots:
q_est = min_i { dist(q, p_i) + H(p_i, k) }
• This gives an upper bound estimate of the k-NN distance of q
• Complexity: O(m)
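A sketch of this O(m) estimate, assuming the NNH class and a dist() function as in the earlier sketches:

```python
# Upper-bound estimate of q's k-NN distance from the NNH (sketch).
def estimate_knn_radius(nnh, q, k, dist):
    """The k NNs of pivot p all lie within H(p, k) of p, so by the triangle
    inequality they lie within dist(q, p) + H(p, k) of q. Hence at least k
    objects are within that radius, and the minimum over pivots is a valid
    upper bound on q's k-NN distance."""
    return min(dist(q, p) + nnh.H(p, k) for p in nnh.vectors)
```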
Utilizing the estimates in k-NN search
• More pruning: prune an MBR if MINDIST(q, mbr) > q_est, since no object in that MBR can then be among q's k nearest neighbors (see the sketch below)
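A sketch of the full k-NN search with this pruning: the branch-and-bound radius starts at q_est instead of infinity, so MBRs (and points) beyond the current k-th-best distance are never expanded. Node attributes and helper signatures are assumptions, as before:

```python
# Best-first k-NN search seeded with the NNH estimate (sketch).
import heapq, itertools

def knn_with_nnh(root, q, k, q_est, dist, mindist):
    counter = itertools.count()
    results = []                                   # max-heap via negated distances
    def kth_best():
        # Current bound: the k-th best distance so far, or q_est until k results exist
        return -results[0][0] if len(results) == k else q_est
    heap = [(0.0, next(counter), root)]            # priority queue on MINDIST
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > kth_best():
            break                                  # everything remaining is pruned
        if node.is_leaf:
            for obj in node.entries:
                od = dist(q, obj)
                if od <= kth_best():
                    heapq.heappush(results, (-od, next(counter), obj))
                    if len(results) > k:
                        heapq.heappop(results)     # keep only the k closest
        else:
            for child in node.entries:
                md = mindist(q, child.rect)
                if md <= kth_best():               # never enqueue a pruned MBR
                    heapq.heappush(heap, (md, next(counter), child))
    return [obj for _, _, obj in sorted(results, reverse=True)]
```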
Utilizing estimates in NN join
• k-NN join: for each object o1 in D1, find its k nearest neighbors in D2
• Traverse the two trees top down; keep a priority queue of node pairs
Utilizing estimates in NN join (cont'd)
• Construct an NNH for D2
• For each object o1 in D1, keep its estimated NN radius o1_est computed from the NNH of D2
• As in a k-NN query, ignore an MBR of D2's tree for o1 if MINDIST(o1, mbr) > o1_est
Prune MBR pairs (cont'd)
• Prune the pair (mbr1, mbr2) if MINDIST(mbr1, mbr2) > the largest estimate o1_est over the objects o1 in mbr1
• Then for every o1 in mbr1, MINDIST(o1, mbr2) >= MINDIST(mbr1, mbr2) > o1_est, so mbr2 cannot contain any of o1's k nearest neighbors
Outline • Main idea • Utilizing NNH in a search (KNN, join) • Construction and incremental maintenance • Experiments • Related work
NNH Construction
• If we have already selected the m pivots:
• Just run a k-NN query for each pivot to construct the NNH
• Time: O(m) k-NN queries, done offline
• The important part: selecting the pivots
• Size-Constraint Construction
• Error-Constraint Construction (see paper)
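A construction sketch, with brute force standing in for the index-based k-NN queries and reusing the NNH class sketched earlier:

```python
# NNH construction once pivots are chosen: one T-NN query per pivot (sketch).
import heapq, math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_nnh(pivots, database, T):
    nnh = NNH()                                    # NNH class from the structure sketch
    for p in pivots:
        # Brute force here; in practice this is a T-NN query on the index
        nn_dists = heapq.nsmallest(T, (dist(p, o) for o in database))
        nnh.add_pivot(p, nn_dists)
    return nnh
```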
Size-constraint NNH construction • # of pivots “m” determines • Storage size • Initial construction cost • Incremental-maintenance cost • Choose m “best” pivots
Size-constraint NNH construction (cont'd)
• Given m (the # of pivots), assume:
• Query objects come from the database D
• H(p_i, k) doesn't vary too much within a small neighborhood of p_i
• Goal: find pivots p1, p2, …, pm that minimize the total distance from objects to their nearest pivots: Σ_{o ∈ D} min_i dist(o, p_i)
• This is a clustering problem:
• Many algorithms available
• We use k-means for its simplicity and efficiency
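A pivot-selection sketch; scikit-learn's KMeans is used purely for illustration (the paper reports using k-means, not this particular library):

```python
# Pivot selection via k-means clustering (sketch).
import numpy as np
from sklearn.cluster import KMeans

def select_pivots(database, m):
    """Cluster the database and use the m centroids as NNH pivots."""
    X = np.asarray(database, dtype=float)
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
    return [tuple(c) for c in km.cluster_centers_]
```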
Incremental Maintenance
• How do we update the NNH when objects are inserted or deleted?
• An insertion may "shift" entries of each NN vector; a deletion may invalidate a vector's tail
• Instead of recomputing, associate a valid length Ei with each NN vector
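A sketch of the insertion case (deletions, which shrink a vector's valid length Ei, are not shown); it assumes the NNH class from the earlier sketch:

```python
# Insertion update: if the new object lands inside a pivot's current NN
# vector, splice its distance in and drop the last entry (sketch).
import bisect, math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def on_insert(nnh, new_obj):
    for p, vec in nnh.vectors.items():
        d = dist(p, new_obj)
        if vec and d < vec[-1]:          # new object enters p's top-T NNs
            bisect.insort(vec, d)        # shift the vector right from position of d
            vec.pop()                    # keep the vector at length T
```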
Outline • Main idea • Utilizing NNH in a search (KNN, join) • Construction and incremental maintenance • Experiments • Related work
Experiments
• Datasets:
• Corel image database
• Contains 60,000 images
• Each image represented by a 32-dimensional float vector
• Time-series data from AT&T
• Similar trends; we report results for the Corel dataset
• Test bed:
• PC: 1.5 GHz Athlon, 512 MB memory, 80 GB disk, Windows 2000
• GNU C++ under CYGWIN
Goal • Is the pruning using NNH estimates powerful? • KNN queries • NN-join queries • Is it “cheap” to have such a structure? • Storage • Initial construction • Incremental maintenance
Improvement in k-NN search
• Ran the k-means algorithm to generate 400 pivots for the 60K objects and constructed the NNH
• Performed k-NN queries on 100 randomly selected query objects
• Used queue size to measure memory usage:
• Max queue size
• Average queue size
Cost/Benefit of NNH
[Table: cost/benefit figures for 60,000 32-d float vectors; "~0" means almost zero.]
Conclusion • NNH: efficient, effective approach to improving NN-search performance. • Can be easily embedded into current implementation of NN algorithms. • Can be efficiently constructed and maintained. • Offers substantial performance advantages.
Related work
• Summary histograms
• E.g., [Jagadish et al., VLDB 1998], [Matias et al., VLDB 2000]
• Objective: approximate frequency values
• NN-search algorithms
• Many algorithms developed; many of them can benefit from NNH
• Algorithms based on "pivots/foci/anchors"
• E.g., Omni [Filho et al., ICDE 2001], vantage objects [Vleugels et al., VIIS 1999], M-trees [Ciaccia et al., VLDB 1997]
• These choose pivots far from each other (to capture the "intrinsic dimensionality")
• NNH pivots instead depend on how clustered the objects are; experiments show the differences
Work conducted in the Flamingo Project on Data Cleansing at UC Irvine