A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces • Presented By • Umang Shah • Koushik
Introduction • Sequential scan outperforms the known index structures whenever the dimensionality exceeds roughly 10. • Every clustering or data-space partitioning method fails to handle high-dimensional vector spaces (HDVSs) beyond a certain dimensionality. • The VA-File is proposed to perform the inevitable sequential scan more efficiently; its advantage grows with dimensionality.
Assumptions and Notation • Assumption 1: Data and Metric • Unit hypercube • Distances • Assumption 2: Uniformity and Independence • Data and query points are uniformly distributed • Dimensions are independent.
The Difficulties of High Dimensionality • Number of partitions. • Data space is sparsely populated • Spherical range queries • Exponentially growing DB size • Expected NN-Distance.
Number of partitions • Splitting each dimension once yields 2^d partitions. • Assume N = 10^6 points. • For d = 100, there are 2^100 ≈ 10^30 partitions. • Almost all partitions are therefore empty.
Data space is sparsely populated • 0.95^100 ≈ 0.0059 • At d = 100, even a hypercube with side length 0.95 covers only 0.59% of the unit data space.
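The two sparsity effects above can be checked with a few lines of arithmetic; this is an illustrative calculation, not code from the paper:

```python
# Sparsity of high-dimensional space: illustrative arithmetic only.
N = 10**6   # database size assumed on the slides
d = 100     # dimensionality

# Splitting each dimension once yields 2^d partitions.
partitions = 2**d
print(partitions > N)            # True: almost every partition is empty

# A hypercube of side 0.95 covers only 0.95^d of the unit data space.
coverage = 0.95**d
print(f"{coverage:.4f}")         # ~0.0059, i.e. about 0.59%
```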
Spherical range queries • The largest spherical query that fits inside the data space covers a vanishingly small fraction of its volume as d grows.
Exponentially growing DB size • For at least one point to fall into the largest possible sphere on average, the database size must grow exponentially with d.
Expected NN-Distance • The expected nearest-neighbor distance grows steadily with d.
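The growth of the nearest-neighbor distance can be observed with a small Monte Carlo experiment; this is a hypothetical illustration under the slides' uniformity assumption, not the paper's analysis:

```python
import math
import random

def mean_nn_distance(n_points, d, trials=5, seed=0):
    """Monte Carlo estimate of the expected nearest-neighbor distance
    for n_points uniform points in the unit hypercube of dimension d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pts = [[rng.random() for _ in range(d)] for _ in range(n_points)]
        query = [rng.random() for _ in range(d)]
        total += min(math.dist(query, p) for p in pts)
    return total / trials

# The estimate grows steadily with d, as the slide states.
for d in (2, 10, 50, 100):
    print(d, round(mean_nn_distance(200, d), 2))
```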
General Cost Model • The probability that the i-th block is visited • If we assume m objects per block, the expected number of blocks visited follows • Is Mvisit > 20%? If so, a plain sequential scan is cheaper than the random block reads of an index.
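The structure of the cost model can be sketched generically: if p_i denotes the visit probability of block i, the expected number of visited blocks is the sum of the p_i by linearity of expectation. The probabilities below are invented for illustration; the slide's actual formulas are not reproduced here:

```python
def expected_blocks_visited(visit_probs):
    """E[number of blocks visited] is the sum of the per-block
    visit probabilities p_i (linearity of expectation)."""
    return sum(visit_probs)

# Hypothetical example: 1000 blocks, each visited with probability 0.25.
probs = [0.25] * 1000
m_visit = expected_blocks_visited(probs) / len(probs)  # fraction visited
print(m_visit > 0.20)  # True: past ~20%, a sequential scan tends to win
```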
Space-Partitioning Methods • Space consumption grows as 2^d, so splits are performed in only d' dimensions. • d' is independent of d. • E[nndist] increases with d. • When E[nndist] exceeds lmax, the entire database is accessed.
Data-Partitioning Methods • Rectangular MBRs • R*-tree, X-tree, SR-tree • Spherical MBRs • TV-tree, M-tree, SR-tree • General partitioning and clustering schemes
General Partitioning and Clustering Schemes • Assumptions • A cluster is characterized by a geometrical form (MBR) that covers all cluster points. • Each cluster contains at least two points. • The MBR of a cluster is convex.
Vector Approximation File • Basic idea: a technique designed specifically for similarity search • Object approximation • Vector data compression
How it is done • The data space is divided into 2^b rectangular cells • The cells are arranged in the form of a grid • The entire approximation file is scanned at query time
Compression Vector • For each dimension i, a small number of bits b[i] is assigned. • The sum of the b[i] is b. • The data space is divided into 2^b hyper-rectangles. • Each data point is approximated by the bit string of the cell containing it. • Only the cell boundary points of each dimension need to be stored.
Compression Vector • The number of bits chosen per dimension typically varies from 4 to 8 • Typically bi = l and b = d * l, with l = 4..8
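A minimal sketch of the quantization described on the last two slides, assuming a uniform grid per dimension (the paper also allows non-uniform marks); `approximate` is a name chosen here, not from the paper:

```python
def approximate(point, bits_per_dim):
    """Map a point in the unit hypercube to its grid cell: dimension i
    is cut into 2^b[i] slices, and the slice indices, concatenated as
    bit strings, form the b-bit approximation of the point."""
    cell = []
    for x, b in zip(point, bits_per_dim):
        slices = 1 << b                                # 2^b[i] intervals
        cell.append(min(int(x * slices), slices - 1))  # clamp x == 1.0
    return cell

# Example: d = 3, l = 4 bits per dimension, so b = 12 bits per point.
print(approximate([0.1, 0.5, 0.99], [4, 4, 4]))  # [1, 8, 15]
```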
Filtering Step • Simple search algorithm • An array of k elements is maintained, kept in sorted order • The file is searched sequentially • If an element's lower bound is less than the k-th element's upper bound • The actual distance is calculated
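The simple search above might look as follows; this is a hedged sketch, using a heap of exact distances in place of the slide's sorted array, and assuming the cells' lower bounds have already been computed from the approximation file:

```python
import heapq
import math

def simple_search(query, data, cell_lower_bounds, k):
    """Sketch of the slide's simple VA-File search: scan the lower
    bounds; whenever one beats the current k-th best exact distance,
    fetch the real vector and compute its distance."""
    best = []  # max-heap via negation: (-distance, index), at most k long
    for i, lb in enumerate(cell_lower_bounds):
        if len(best) < k or lb < -best[0][0]:
            d = math.dist(query, data[i])      # visit the real vector
            heapq.heappush(best, (-d, i))
            if len(best) > k:
                heapq.heappop(best)            # drop the worst of k + 1
    return sorted((-nd, i) for nd, i in best)

print(simple_search([0.0], [[0.0], [0.5], [1.0]], [0.0, 0.4, 0.9], 2))
# [(0.0, 0), (0.5, 1)]
```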
Filtering Step • Near-optimal search algorithm • Done in two steps while scanning through the file • Step 1: calculate the k-th largest upper bound encountered so far; if a new element's lower bound is greater than it, discard that element
Filtering Step • Step 2: the elements remaining after Step 1 are collected • They are visited in increasing order of lower bound until the next lower bound is >= the k-th element's upper bound
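Both phases of the near-optimal algorithm can be sketched as below; again a non-authoritative illustration, assuming precomputed `lower`/`upper` distance bounds per cell rather than the paper's exact data structures:

```python
import heapq
import math

def near_optimal_search(query, data, lower, upper, k):
    """Sketch of the two-phase filtering described on the slides."""
    # Phase 1: track the k smallest upper bounds seen so far; discard
    # any element whose lower bound exceeds the k-th of them.
    ub_heap, candidates = [], []       # max-heap of the k smallest uppers
    for i in range(len(data)):
        if len(ub_heap) < k or lower[i] <= -ub_heap[0]:
            candidates.append(i)
            heapq.heappush(ub_heap, -upper[i])
            if len(ub_heap) > k:
                heapq.heappop(ub_heap)
    # Phase 2: visit survivors in increasing lower-bound order, stopping
    # once the next lower bound cannot beat the k-th best exact distance.
    candidates.sort(key=lambda i: lower[i])
    best = []                          # max-heap: (-distance, index)
    for i in candidates:
        if len(best) == k and lower[i] >= -best[0][0]:
            break
        heapq.heappush(best, (-math.dist(query, data[i]), i))
        if len(best) > k:
            heapq.heappop(best)
    return sorted((-nd, i) for nd, i in best)
```

Phase 1 needs only the approximation file, so very few real vectors are touched in Phase 2; this is what makes the unavoidable scan cheap.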
Performance • [Performance graphs omitted]
Conclusion • All approaches to nearest-neighbor search in HDVSs ultimately degenerate to a linear scan at high dimensionality. • The VA-File method can outperform every other method known to the authors.