Clustered Pivot Tables for I/O-optimized Similarity Search

Clustered Pivot Tables for I/O-optimized Similarity Search Juraj Moško, Jakub Lokoč, Tomáš Skopal Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague SISAP 2011, Lipari

Presentation outline • Similarity search in metric spaces • Pivot tables • Clustered pivot tables • Static variant • Dynamic variant • Experiments

Similarity search • Suitable for unstructured data; the query is often not in the DB • Similarity is often modeled by a metric distance • Expensive distance functions: EMD, SQFD, DTW, … • Metric indexing • Based on lower-bounding • If abs(d(p, q) – d(p, o)) > r for a pivot p, query q and query radius r, filter out object o
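The lower-bounding rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `euclidean` stands in for an expensive metric such as EMD or SQFD, and the names `can_filter`, `pivot`, `query` and `radius` are hypothetical, not taken from the talk.

```python
import math

def euclidean(a, b):
    """Cheap stand-in (assumption) for an expensive metric such as EMD or SQFD."""
    return math.dist(a, b)

def can_filter(d_pq, d_po, r):
    """True when the triangle inequality proves d(q, o) > r.

    |d(p, q) - d(p, o)| is a lower bound on d(q, o), so if it already
    exceeds r the object cannot be in the query ball.
    """
    return abs(d_pq - d_po) > r

pivot, query, radius = (0.0, 0.0), (1.0, 0.0), 0.5
database = [(1.2, 0.1), (5.0, 0.0), (0.0, 3.0)]

d_pq = euclidean(pivot, query)  # computed once per query
# In a real index d(p, o) is precomputed; here it is evaluated on the fly.
survivors = [o for o in database
             if not can_filter(d_pq, euclidean(pivot, o), radius)]
print(survivors)
```

Only the surviving objects would need the original distance function; the other two are discarded using a subtraction and a comparison.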

Pivot tables • Simple yet efficient main-memory metric index • Given k static pivots P_i and a database S of n objects O_j, the pivot table stores all distances d(P_i, O_j) in a matrix of size k × n • Pivot table = two structures: distance matrix + data file • Cheap filtering of non-relevant objects (lower-bounding) • Non-filtered objects are refined by the original expensive distance function
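As a rough sketch of the two steps described here (filter with the distance matrix, then refine), the following assumes a plain in-memory list as the data file and Euclidean distance as a placeholder metric. The function names `build_pivot_table` and `range_query` are illustrative assumptions.

```python
import math

def build_pivot_table(pivots, objects, dist):
    """Return a k x n matrix with table[i][j] = dist(pivots[i], objects[j])."""
    return [[dist(p, o) for o in objects] for p in pivots]

def range_query(query, radius, pivots, objects, table, dist):
    """Filter with the distance matrix, then refine survivors with dist."""
    q_to_pivots = [dist(p, query) for p in pivots]  # k distance computations
    result = []
    for j, obj in enumerate(objects):
        # Tightest lower bound on dist(query, obj) over all pivots.
        lower_bound = max(abs(q_to_pivots[i] - table[i][j])
                          for i in range(len(pivots)))
        if lower_bound > radius:
            continue  # filtered cheaply, no call to dist
        if dist(query, obj) <= radius:  # refinement step
            result.append(obj)
    return result

objects = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (9.0, 1.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]
table = build_pivot_table(pivots, objects, math.dist)
hits = range_query((1.0, 0.5), 1.0, pivots, objects, table, math.dist)
print(hits)
```

Because the lower bound never overestimates the true distance, filtering cannot drop a true result; it only saves distance computations.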

Clustered pivot tables • What if the pivot table does not fit into main memory? • Solution 1 – just slice the data file • + simple to construct • - sequential scan => high I/O cost • Solution 2 – reorganize and slice the data file • + similar objects in one page (page = cluster) => higher probability that all objects are filtered => lower I/O cost • - metric clustering is expensive

Metric clustering? M-tree! • Dynamic, persistent, balanced structure • Leaf node represents a cluster of similar objects • Many construction strategies considering quality of M-tree hierarchy with complexity < O(n^2) • Single/Multi/Hybrid-way leaf selection • Slim-down algorithm • Reinsertions

Static CPT • Data file = objects serialized from M-tree leaves • Classic pivot table reorganizing input • Fixed page size in a paged data file • Preserve M-tree? • Future re-indexing • Query processing

Dynamic CPT • Data file = set of M-tree leaves • Distance matrix connected to the M-tree leaves • Internal fragmentation • M-tree leaves contain different numbers of data objects, so utilization is not 100% • Dynamic operations do not degenerate created clusters

CPT - Querying • Filtering based on lower-bounding • If all data objects from one page are filtered out, that page is not loaded from the data file into memory => I/O optimization
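The page-skipping idea can be illustrated with a small simulation. Everything here is an assumption made for the sketch: pages are in-memory lists, a "page load" is just a counter increment, Euclidean distance stands in for the real metric, and `range_query_paged` is a hypothetical name.

```python
import math

def range_query_paged(query, radius, pivots, pages, page_tables, dist):
    """Range query that touches a page only if some object survives filtering.

    pages[p] is a list of objects; page_tables[p][i][j] holds
    dist(pivots[i], pages[p][j]) and is assumed to reside in main memory.
    """
    q_to_pivots = [dist(p, query) for p in pivots]
    result, pages_loaded = [], 0
    for page, table in zip(pages, page_tables):
        candidates = [
            j for j in range(len(page))
            if max(abs(q_to_pivots[i] - table[i][j])
                   for i in range(len(pivots))) <= radius
        ]
        if not candidates:
            continue  # every object filtered: the page is never read
        pages_loaded += 1  # simulated read of one data-file page
        result += [page[j] for j in candidates
                   if dist(query, page[j]) <= radius]
    return result, pages_loaded

pivots = [(0.0, 0.0), (10.0, 0.0)]
pages = [  # each page holds one cluster of similar objects
    [(0.0, 0.0), (0.5, 0.5)],
    [(8.0, 8.0), (8.5, 8.0)],
    [(0.2, 9.0), (0.0, 8.5)],
]
page_tables = [[[math.dist(p, o) for o in page] for p in pivots]
               for page in pages]
hits, loaded = range_query_paged((0.3, 0.2), 1.0, pivots, pages,
                                 page_tables, math.dist)
print(hits, loaded)
```

With clustered pages, a query near one cluster loads only that cluster's page; the other two pages are rejected from the distance matrix alone.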

CPT - Querying problems • Problem 1 – the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost • Solution – CPT does not sort objects => objects are processed sequentially

CPT – Querying problems • Problem 2 – in CPT the dynamic radius decreases more slowly during kNN processing • Solution – the first bunch of objects is not clustered
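A minimal sketch of sequential kNN processing with a dynamic radius follows. It is not the authors' implementation: the function name `knn_sequential`, the in-memory pages, the simulated page counter and the Euclidean placeholder metric are all assumptions. The first page plays the role of the unclustered first bunch, so the radius shrinks early.

```python
import heapq
import math

def knn_sequential(query, k, pivots, pages, page_tables, dist):
    """Scan pages in file order; return (k nearest objects, pages loaded)."""
    q_to_pivots = [dist(p, query) for p in pivots]
    heap = []          # max-heap of the k best so far, stored as (-d, obj)
    radius = math.inf  # dynamic radius: distance to the current k-th neighbor
    pages_loaded = 0
    for page, table in zip(pages, page_tables):
        bounds = [
            max(abs(q_to_pivots[i] - table[i][j]) for i in range(len(pivots)))
            for j in range(len(page))
        ]
        if all(lb > radius for lb in bounds):
            continue  # whole page filtered, no simulated I/O
        pages_loaded += 1
        for obj, lb in zip(page, bounds):
            if lb > radius:
                continue  # radius may have shrunk within this page
            d = dist(query, obj)
            if len(heap) < k:
                heapq.heappush(heap, (-d, obj))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, obj))
            if len(heap) == k:
                radius = -heap[0][0]
    neighbors = [obj for _, obj in sorted(heap, reverse=True)]
    return neighbors, pages_loaded

pivots = [(0.0, 0.0), (10.0, 0.0)]
pages = [
    [(0.4, 0.4), (9.0, 9.0)],  # first bunch: deliberately not clustered
    [(0.0, 0.0), (0.5, 0.5)],
    [(8.0, 8.0), (8.5, 8.0)],
    [(0.2, 9.0), (0.0, 8.5)],
]
page_tables = [[[math.dist(p, o) for o in page] for p in pivots]
               for page in pages]
neighbors, loaded = knn_sequential((0.2, 0.1), 2, pivots, pages,
                                   page_tables, math.dist)
print(neighbors, loaded)
```

In this toy data the mixed first page supplies one nearby object, the second page tightens the radius further, and the remaining two pages are skipped.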

Experiments (1) • 2 real datasets • subset of CoPhIR, subset of Corel • 2 synthetic datasets • Cloud, PolygonSet • We considered several M-tree variants • Single/Multi-way leaf selection • Reinsertions • Measured I/O cost • CPT vs. PT vs. M-tree

Experiments (2)

Experiments (3)

Conclusion • We have designed an I/O-optimized method for persistent pivot tables • Future work • Thorough experiments on SSDs • Use other metric clustering techniques

Thank you
