SIGMOD 08, June 10th 2008, Vancouver, Canada

Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction • SIGMOD 08, June 10th 2008, Vancouver, Canada • Marc Wichterich, Ira Assent, Philipp Kranen, Thomas Seidl

Outline • Introduction • Similarity Search • The Earth Mover’s Distance • Dimensionality Reduction • Dimensionality Reduction for the EMD • Reduction Matrices • Data-independent Reduction • Data-dependent Reduction • Experimental Results • Conclusion & Outlook

Introduction – Similarity Search • Objective: Find similar objects in database • Applications: • Medical images, edutainment, engineering, etc. • Requires: • Object feature extraction (here: feature histograms) • Similarity measure (here: Earth Mover’s Distance) • Efficient retrieval technique for similar objects

Introduction – The Earth Mover’s Distance [1] • Transform the features of one object to match those of another object • EMD: minimum “cost × flow” for the transformation • [Figure: flows between histogram x and histogram y] [1] Rubner, Tomasi, Perceptual Metrics for Image Database Navigation, Kluwer, 2001.
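
For reference, the “minimum cost × flow” on this slide is commonly written as the optimum of a transportation problem. The formulation below is a sketch that assumes both histograms $x$ and $y$ have $d$ bins and the same total mass $m$ (an assumption not stated on the slide); $c_{ab}$ is the ground cost between bins $a$ and $b$, and $f_{ab}$ is the flow from bin $a$ of $x$ to bin $b$ of $y$.

```latex
\mathrm{EMD}_C(x, y) \;=\; \min_{F = [f_{ab}]} \left\{ \sum_{a=1}^{d} \sum_{b=1}^{d} \frac{c_{ab}\, f_{ab}}{m} \;\middle|\; f_{ab} \ge 0,\;\; \sum_{b=1}^{d} f_{ab} = x_a,\;\; \sum_{a=1}^{d} f_{ab} = y_b \right\}
```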

Introduction – Dimensionality Reduction • Challenge for Similarity Search: high computational complexity for high dimensionalities • Approach: • Reduce dimensionality of query & DB • Filter DB using lower dimensionality • Refine using original dimensionality • Filter quality criteria • Selectivity (few refinements) • No false dismissals (lower bound property)
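
The reduce / filter / refine approach above can be sketched as a range query. This is a minimal illustration, not the authors' implementation: `range_query`, `pair_sum` and `l1` are hypothetical names, and the L1 distance is used as a simple stand-in for the EMD because summing bins lower-bounds it in the same way the filter must lower-bound the exact distance.

```python
from typing import Callable, List, Sequence, Tuple

Histogram = Sequence[float]


def range_query(
    query: Histogram,
    database: List[Histogram],
    epsilon: float,
    reduce: Callable[[Histogram], Histogram],
    filter_dist: Callable[[Histogram, Histogram], float],
    exact_dist: Callable[[Histogram, Histogram], float],
) -> Tuple[List[int], int]:
    """Return the indices of all objects within epsilon of the query,
    together with the number of refinements (exact distance computations)."""
    reduced_query = reduce(query)
    results: List[int] = []
    refinements = 0
    for idx, obj in enumerate(database):
        # Filter step: if the lower bound already exceeds epsilon, the exact
        # distance does too, so the object is dismissed without false dismissal.
        # (A real system would precompute the reduced database once.)
        if filter_dist(reduced_query, reduce(obj)) > epsilon:
            continue
        # Refinement step: exact distance in the original dimensionality.
        refinements += 1
        if exact_dist(query, obj) <= epsilon:
            results.append(idx)
    return results, refinements


def l1(x: Histogram, y: Histogram) -> float:
    """Stand-in distance (NOT the EMD): sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pair_sum(x: Histogram) -> List[float]:
    """Stand-in reduction: aggregate adjacent pairs of bins."""
    return [x[i] + x[i + 1] for i in range(0, len(x), 2)]


database = [(2, 4, 3, 6), (6, 0, 9, 0), (0, 6, 0, 9), (9, 6, 0, 0)]
results, refinements = range_query((2, 4, 3, 6), database, 4.0, pair_sum, l1, l1)
```

In this toy run only the last object is pruned by the filter, so three of the four objects still need refinement. That is the selectivity criterion in action: a tighter lower bound would prune more candidates.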

Dimensionality Reduction for the EMD • Both the feature vectors and the cost matrix have to be reduced • General linear dimensionality reduction techniques (PCA, ICA, etc.) fail quality criteria for EMD • Discarding dimensions destroys LB property • Splitting dimensions causes poor selectivity • Aggregating dimensionality reductions can work well • Original dimensions are not split up • Each reduced dimension consists of a set of original dimensions

Reduction Matrices • Aggregating dimensionality reductions are characterized by a reduction matrix $R = [r_{aa'}] \in \{0,1\}^{d \times d'}$ in which each original dimension is assigned to exactly one reduced dimension • Example: $R = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}$, $x = (2, 4, 3, 6)$, $x' = x \cdot R = (6, 9)$ • Lower-bounding reduced cost matrix $C' = [c'_{a'b'}]$ given $R$, as given by [2] • There is no larger lower bound (see paper) • Main question: Which dimensions to aggregate? [2] Ljosa, Bhattacharya, Singh, Indexing Spatially Sensitive Distance Measures using Multi-Resolution Lower Bounds, EDBT 2006.
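
The two reductions on this slide can be sketched in a few lines. The histogram reduction follows the example directly. The reduced cost rule is an assumption here, since the slide only cites [2] for it: each reduced cost is taken as the minimum original cost between the two groups of aggregated dimensions, which reproduces the $C'$ shown on the Data-Independent Reduction slide. The function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def reduce_histogram(x: Sequence[float], R: Matrix) -> List[float]:
    """Compute x' = x * R: each reduced bin sums the original bins assigned to it."""
    d, d_red = len(R), len(R[0])
    return [sum(x[a] * R[a][a_red] for a in range(d)) for a_red in range(d_red)]


def reduce_cost_matrix(C: Matrix, R: Matrix) -> Matrix:
    """Assumed rule: c'[a'][b'] = min of c[a][b] over all a in group a', b in group b'.
    Requires every reduced dimension to contain at least one original dimension."""
    d, d_red = len(R), len(R[0])
    groups = [[a for a in range(d) if R[a][a_red] == 1] for a_red in range(d_red)]
    return [[min(C[a][b] for a in ga for b in gb) for gb in groups] for ga in groups]


# Reduction matrix and histogram from this slide.
R = [[1, 0], [1, 0], [0, 1], [0, 1]]
x = [2, 4, 3, 6]
x_reduced = reduce_histogram(x, R)       # (6, 9), as on the slide

# Cost matrix from the Data-Independent Reduction slide.
C = [[0, 1, 3, 4], [1, 0, 2, 3], [3, 2, 0, 1], [4, 3, 1, 0]]
C_reduced = reduce_cost_matrix(C, R)     # [[0, 2], [2, 0]]
```

Taking the minimum is the conservative choice: no flow between two groups can be cheaper in the original space than the reduced cost charged for it, which is what keeps the reduced EMD a lower bound.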

Data-Independent Reduction • Goal: Tight lower bound (large reduced EMD values) • Large cost between reduced dimensions • Small loss of cost for each reduced dimension • Matches clustering goal: low intra-cluster dissimilarity / high inter-cluster dissimilarity • kMedoid clustering based on the cost matrix • Example: $C = \begin{pmatrix} 0 & 1 & 3 & 4 \\ 1 & 0 & 2 & 3 \\ 3 & 2 & 0 & 1 \\ 4 & 3 & 1 & 0 \end{pmatrix}$, $R = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}$, $C' = \begin{pmatrix} 0 & 2 \\ 2 & 0 \end{pmatrix}$ [Figure annotation: lost cost information]
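
A k-medoid clustering over the cost matrix can be sketched as follows. This is a simplified alternating variant (assign, then update medoids), not necessarily the exact procedure used by the authors. It assumes a zero diagonal and positive off-diagonal costs, and it arbitrarily starts with the first k dimensions as medoids. `kmedoid_reduction` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]


def kmedoid_reduction(C: Matrix, k: int, max_iter: int = 100) -> List[List[int]]:
    """Cluster the d original dimensions into k groups using C as dissimilarity
    and return the corresponding d x k reduction matrix R."""
    d = len(C)
    medoids = list(range(k))  # assumption: first k dimensions as initial medoids
    assignment = [0] * d
    for _ in range(max_iter):
        # Assignment step: each dimension joins its cheapest medoid.
        assignment = [min(range(k), key=lambda j: C[a][medoids[j]]) for a in range(d)]
        # Update step: the new medoid minimizes the total cost within its cluster.
        new_medoids = []
        for j in range(k):
            members = [a for a in range(d) if assignment[a] == j]
            new_medoids.append(min(members, key=lambda m: sum(C[m][a] for a in members)))
        if new_medoids == medoids:
            break  # converged: medoids no longer change
        medoids = new_medoids
    return [[1 if assignment[a] == j else 0 for j in range(k)] for a in range(d)]


# Cost matrix from this slide.
C = [[0, 1, 3, 4], [1, 0, 2, 3], [3, 2, 0, 1], [4, 3, 1, 0]]
R = kmedoid_reduction(C, k=2)
```

On the slide's cost matrix this groups the first two and the last two dimensions, which matches the reduction matrix $R$ shown in the example.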

Data-Dependent Reduction based on flows • Idea: Incorporate knowledge on data for better reduction • In data-independent reduction, only C is used • Problem: Ensuring large $c'_{a'b'}$ is pointless if $f'_{a'b'}$ is small • Now: Also include information on F

Data-Dependent Reduction: Algorithm • Add preprocessing step analyzing the data • Collect information about flows in unreduced EMD • Use information to improve initial / intermediate reduction matrix • Iterate until no improvement made • [Flowchart: original data → sample data S → calculate EMD / collect flows → flows; initial R → intermediate R → improve R (using flows) → improved? yes: back to intermediate R; no: final R]

Data-Dependent Reduction: Preprocessing • Calculate average flow matrix $\bar{F} = [\bar{f}_{ab}]$ for sample S of DB • Approximate the flows $F'$ in reduced EMD with $\tilde{F}' = R^T \bar{F} R$ • Maximize approximate average reduced EMD • Example: average flows $\bar{F} = \begin{pmatrix} 2 & 1 & 2 & 3 \\ 0 & 1 & 2 & 1 \\ 3 & 2 & 3 & 1 \\ 1 & 3 & 0 & 1 \end{pmatrix}$, $R = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}$, approximate average reduced flows $\tilde{F}' = \begin{pmatrix} 4 & 8 \\ 9 & 5 \end{pmatrix}$
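
The flow approximation on this slide can be checked numerically. The sketch below computes $R^T \bar{F} R$ for the example matrices. The objective is an assumption: the slide names the "approximate average reduced EMD" without giving its formula, so it is taken here as the unnormalized sum of reduced cost times approximate reduced flow, using the $C'$ from the Data-Independent Reduction slide. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def approx_reduced_flows(F: Matrix, R: Matrix) -> Matrix:
    """Compute R^T F R: sum the average flows between every pair of groups."""
    d, d_red = len(R), len(R[0])
    return [
        [
            sum(R[a][ar] * F[a][b] * R[b][br] for a in range(d) for b in range(d))
            for br in range(d_red)
        ]
        for ar in range(d_red)
    ]


def approx_reduced_emd(C_red: Matrix, F_red: Matrix) -> float:
    """Assumed objective: sum over all reduced pairs of cost times approximate flow
    (no normalization by total mass)."""
    return sum(c * f for row_c, row_f in zip(C_red, F_red) for c, f in zip(row_c, row_f))


# Example matrices from this slide.
F_avg = [[2, 1, 2, 3], [0, 1, 2, 1], [3, 2, 3, 1], [1, 3, 0, 1]]
R = [[1, 0], [1, 0], [0, 1], [0, 1]]
F_red = approx_reduced_flows(F_avg, R)    # [[4, 8], [9, 5]], as on the slide

# Reduced cost matrix from the Data-Independent Reduction slide.
C_red = [[0, 2], [2, 0]]
objective_value = approx_reduced_emd(C_red, F_red)   # 2*8 + 2*9 = 34
```

Only the off-diagonal flows contribute, because flow that stays inside a reduced dimension is charged a cost of zero. This is why aggregating dimensions that exchange a lot of flow loses much of the EMD.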

Data-Dependent Reduction: Optimization • Global optimization of the approximate average reduced EMD requires assessment of all possible reduction matrices • Find local optimum via reassignment of dimensions • FB-All: Choose best reassignment in each iteration • FB-Mod: Choose first profitable reassignment in each iteration • Initial reduction matrices • Base: assign all original dimensions to first reduced dimension • KMed: reduction matrix from data-independent reduction
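
The local search described above can be sketched as a hill climber over single-dimension reassignments. Several details are assumptions rather than the authors' exact procedure: the objective is the unnormalized sum of reduced cost times aggregated average flow, reduced costs are group-wise minima, empty reduced dimensions contribute nothing, and the scan order is by dimension index. `objective` and `flow_based_reduction` are hypothetical names.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def objective(C: Matrix, F: Matrix, assign: List[int], d_red: int) -> float:
    """Approximate average reduced EMD for an assignment (assign[a] = reduced dim of a)."""
    groups = [[a for a, g in enumerate(assign) if g == j] for j in range(d_red)]
    total = 0.0
    for ga in groups:
        for gb in groups:
            if not ga or not gb:
                continue  # empty reduced dimensions carry no flow
            c_red = min(C[a][b] for a in ga for b in gb)
            f_red = sum(F[a][b] for a in ga for b in gb)
            total += c_red * f_red
    return total


def flow_based_reduction(
    C: Matrix,
    F: Matrix,
    d_red: int,
    first_profitable: bool = False,
    init: Optional[List[int]] = None,
) -> Tuple[List[int], float]:
    """first_profitable=False mimics FB-All (best reassignment per iteration),
    first_profitable=True mimics FB-Mod (first profitable reassignment)."""
    d = len(C)
    # "Base" start: all original dimensions in the first reduced dimension.
    assign = list(init) if init is not None else [0] * d
    best = objective(C, F, assign, d_red)
    improved = True
    while improved:
        improved = False
        best_move = None
        for a in range(d):
            for j in range(d_red):
                if j == assign[a]:
                    continue
                candidate = assign[:a] + [j] + assign[a + 1:]
                value = objective(C, F, candidate, d_red)
                if value > best:  # strictly better, so the search terminates
                    best, best_move = value, candidate
                    if first_profitable:
                        break
            if first_profitable and best_move is not None:
                break
        if best_move is not None:
            assign, improved = best_move, True
    return assign, best


# Example matrices from the earlier slides.
C = [[0, 1, 3, 4], [1, 0, 2, 3], [3, 2, 0, 1], [4, 3, 1, 0]]
F_avg = [[2, 1, 2, 3], [0, 1, 2, 1], [3, 2, 3, 1], [1, 3, 0, 1]]
fb_all_assign, fb_all_value = flow_based_reduction(C, F_avg, d_red=2)
fb_mod_assign, fb_mod_value = flow_based_reduction(C, F_avg, d_red=2, first_profitable=True)
```

On this small example both variants, started from Base, end at the grouping {1, 2} / {3, 4} with an objective of 34. The reduced dimensions may be labeled in a different order than in the slide's $R$, which does not change the reduction. To start from KMed instead, pass the k-medoid assignment as `init`.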

Experimental Results • Data-independent vs. data-dependent aggregation • [Figure: sample image [2] with costliest flows; panels: data-independent (kMedoid), data-dependent (FB-All-Mod)]

Experimental Results • Efficiency vs. reduced dimensionality (Retina DB)

Experimental Results • Efficiency vs. reduced dimensionality (IRMA DB)

Experimental Results • Filter & Refinement times and filter selectivity (IRMA DB)

Conclusion & Outlook • Conclusion • Earth Mover’s Distance as a similarity measure • High quality, but computationally expensive in high dimensions • Dimensionality reduction for the EMD • Data-independent reduction: Clustering in feature space • Data-dependent reduction: Analyze flow information • Outlook • Local reductions • Different reduction for query and DB • Index reduced histograms using [3] [3] Assent, Wichterich, Meisen, Seidl, Efficient Similarity Search Using the Earth Mover's Distance for Large Multimedia Databases, ICDE 2008.
