iDistance -- Indexing the Distance: An Efficient Approach to KNN Indexing • C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing. VLDB 2001.
Query Requirement • Similarity queries: similarity range queries and KNN queries. • Similarity range query: given a query point, find all data points within a given distance r of the query point. • KNN query: given a query point, find the K nearest neighbours (in distance) to the point.
Other Methods • SS-tree: R-tree based index structure; uses bounding spheres in internal nodes. • Metric-tree (M-tree): a balanced, paged tree that uses metric distances and bounding spheres. • VA-file: uses compression via bit strings for sequential filtering of unwanted data points. • P-Sphere tree: two-level index structure; clusters and duplicates data based on sample queries; designed for approximate KNN. • A-tree: R-tree based, but uses relative bounding boxes. • Problem: these structures are hard to integrate into existing DBMSs.
Basic Definition • Euclidean distance: dist(p, q) = sqrt(Σ_i (p_i − q_i)^2). • Relationship between data points: • Theorem 1: Let q be the query object, Oi the reference point of partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q) holds, then dist(Oi, q) − querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q).
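The bound in Theorem 1 is the triangle inequality applied to q, p, and the reference point. A minimal Python sketch (the reference point, query point, and radius below are made up for illustration) that checks the bound empirically:

```python
import math
import random

def dist(a, b):
    """Euclidean distance: dist(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(1)
O_i = [0.5, 0.5]   # reference point of partition i (hypothetical)
q = [0.2, 0.8]     # query point (hypothetical)
r = 0.3            # querydist(q), the current search radius

# For every p with dist(p, q) <= r, Theorem 1 bounds dist(O_i, p)
# inside [dist(O_i, q) - r, dist(O_i, q) + r].
for _ in range(1000):
    p = [random.random(), random.random()]
    if dist(p, q) <= r:
        assert dist(O_i, q) - r <= dist(O_i, p) <= dist(O_i, q) + r
```

This is exactly what lets iDistance translate a query sphere into a one-dimensional key interval per partition.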
Basic Concept of iDistance • Index points based on similarity: y = i * c + dist(Si, p), where Si is the reference/anchor point of partition i and c is a constant large enough to keep the key ranges of different partitions disjoint.
iDistance • Data points are partitioned into clusters/partitions. • Each partition has a reference point that every data point in the partition refers to. • Data points are indexed by their similarity (metric distance) to this reference point using a classical B+-tree. • Iterative range queries are used in KNN searching.
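The key mapping and insertion can be sketched in a few lines of Python. In this sketch the reference points, data points, and constant C are hypothetical, and a sorted list stands in for the B+-tree:

```python
import bisect
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

C = 10.0  # constant c; must exceed the largest dist(S_i, p) within any partition

def idistance_key(i, p, refs):
    """One-dimensional iDistance key: y = i * c + dist(S_i, p)."""
    return i * C + dist(refs[i], p)

# Hypothetical reference points and data set.
refs = [[0.25, 0.25], [0.75, 0.75]]
points = [[0.2, 0.3], [0.3, 0.2], [0.7, 0.8], [0.9, 0.6]]

# Assign each point to its closest reference point and index it by key;
# the sorted list plays the role of the B+-tree leaf level.
index = []
for p in points:
    i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
    bisect.insort(index, (idistance_key(i, p, refs), p))
```

Because each partition occupies the disjoint key range [i*c, (i+1)*c), a range query on the one-dimensional keys touches only the relevant slice of each partition.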
KNN Searching • [Figure: a KNN query around q maps to one key range per partition in the B+-tree.] • The search region is enlarged until the K NNs are found.
KNN Searching • [Figure: for each partition Si, candidates within radius r of q lie in the key interval [dist(Si, q) − r, dist(Si, q) + r], clipped to [Dis_min(Si), Dis_max(Si)], the minimum and maximum distances of partition i's points from Si.] • The search radius r increases iteratively.
Over Search? • An inefficient situation: when K = 3, the query sphere with radius r retrieves the 3 NNs, but among them only o1 can be guaranteed correct. Hence the search continues with an enlarged r until r > dist(q, o3).
Stopping Criterion • Theorem 2: the KNN search algorithm terminates when the K NNs are found and the answers are correct. • Case 1: dist(furthest(KNN'), q) < r — all K candidates lie within the searched radius, so they are correct. • Case 2: dist(furthest(KNN'), q) > r — the Kth candidate may still be replaced by an unseen point, so the search continues with an enlarged r.
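The iterative range search and the stopping criterion can be sketched end to end in Python. The reference points, data, and radius step delta_r below are hypothetical, and for simplicity each round rescans from scratch; a real implementation would use an actual B+-tree and remember already-scanned key ranges:

```python
import bisect
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

C = 10.0  # partition-separating constant c

def build_index(points, refs):
    """Sorted (key, point) list standing in for the B+-tree."""
    index = []
    for p in points:
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        bisect.insort(index, (i * C + dist(refs[i], p), p))
    return index

def knn_search(q, K, index, refs, delta_r=0.1, r_max=2.0):
    """Enlarge the search radius until K answers fall within it."""
    r = 0.0
    while r < r_max:
        r += delta_r
        found = []
        for i, S in enumerate(refs):
            d_q = dist(S, q)
            # Theorem 1: candidates of partition i lie in this key interval.
            lo = i * C + max(0.0, d_q - r)
            hi = i * C + d_q + r
            a = bisect.bisect_left(index, (lo, []))
            b = bisect.bisect_right(index, (hi, [math.inf]))
            for _, p in index[a:b]:
                d = dist(p, q)
                if d <= r:
                    found.append((d, p))
        found.sort()
        # Stopping criterion: K answers inside radius r are guaranteed correct.
        if len(found) >= K:
            return [p for _, p in found[:K]]
    return []

refs = [[0.25, 0.25], [0.75, 0.75]]
points = [[0.2, 0.3], [0.3, 0.2], [0.7, 0.8], [0.9, 0.6]]
index = build_index(points, refs)
```

Only candidates with dist(p, q) <= r are kept, so once K of them exist the Kth distance is at most r and, by Theorem 2, no unseen point can be closer.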
Space-based Partitioning: Equal Partitioning • (external point, closest distance) • (centroid of hyperplane, closest distance)
Space-based Partitioning: Equal Partitioning from Furthest Points • (centroid of hyperplane, furthest distance) • (external point, furthest distance)
Effect of Reference Points on Query Space • Using an external point reduces the search area.
Effect on Query Space • The area bounded by these arcs is the affected search area. • Using (centroid, furthest distance) can greatly reduce the search area.
Data-based Partitioning I • Using cluster centroids as reference points.
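Cluster centroids can come from any clustering algorithm; the slides do not fix one. As an illustration only, a minimal Lloyd's k-means in pure Python (the iteration count and seed are arbitrary choices):

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_centroids(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns k centroids usable as reference points."""
    rng = random.Random(seed)
    cents = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign each point to its nearest current centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(cents[j], p))
            groups[i].append(p)
        # Move each centroid to the mean of its group.
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a group goes empty
                cents[i] = [sum(xs) / len(g) for xs in zip(*g)]
    return cents
```

The resulting centroids would then serve as the reference points Si when computing iDistance keys.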
Data-based Partitioning II • Using edge points as reference points.
Performance Study: Effect of Search Radius • 100K uniform data sets of dimension 8, 16 and 30 • Using (external point, furthest distance) • Effect of search radius on query accuracy
I/O Cost vs Search Radius • 10-NN queries on 100K uniform data sets • Using (external point, furthest distance) • Effect of search radius on query cost
Effect of Reference Points • 10-NN queries on 100K 30-d uniform data set • Different Reference Points
Effect of # of Partitions on Accuracy • KNN queries on 100K 30-d clustered data set • Effect of query radius on query accuracy for different numbers of partitions
Effect of # of Partitions on I/O and CPU Cost • 10-NN queries on 100K 30-d clustered data set • Effect of # of partitions on I/O and CPU Costs
Effect of Data Sizes • KNN queries on 100K and 500K 30-d clustered data sets • Effect of query radius on query accuracy for different data set sizes
Effect of Clustered Data Sets • 10-NN queries on 100K and 500K 30-d clustered data sets • Effect of query radius on query cost for different data set sizes
Effect of Reference Points on Clustered Data Sets • 10-NN queries on 100K 30-d clustered data set • Effect of reference points: cluster edge vs. cluster centroid
Is iDistance Ideal for Approximate KNN? • 10-NN queries on 100K and 500K 30-d clustered data sets • Query cost at varying query accuracy for different data set sizes
Performance Study -- Comparing iMinMax and iDistance • 10-NN queries on 100K 30-d clustered data sets • C. Yu, B. C. Ooi, K.-L. Tan. Progressive KNN Search Using B+-trees.
Summary of iDistance • iDistance is simple but efficient. • It is a metric-based index. • The index can be integrated into existing systems easily.