Answering Metric Skyline Queries by PM-tree
Tomáš Skopal, Jakub Lokoč
Department of Software Engineering, FMP, Charles University in Prague
Similarity search
• content-based similarity search
• single-example queries
  • range query
  • kNN query
• multi-example queries
  • a combination of single-example queries is not sufficient
  • should support
    • partial matching
    • compromise
  • metric skyline
DATESO 2010, Štědronín - Plazy
Metric skyline query (MSQ)
• traditional skyline operator
  • linearly ordered attribute domains
  • dominance relation
  • + MDDRs (minimum dominating-dominated rectangles)
  • static schema
• metric skyline
  • multi-example query (not just an operator)
  • attributes specified at query time: i-th attribute = distance of a database object to the i-th query example Qi
  • result set interpretation: objects similar to all query examples, yet distinct (dissimilar to each other)
  • dynamic schema, cannot be reduced to the classic skyline operator for efficient skyline processing
  • i.e., the coordinate system is established at query time
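To make the definition concrete, here is a minimal brute-force sketch of a metric skyline, assuming the usual dominance relation (no worse in every dimension, strictly better in at least one). The function names `dominates` and `metric_skyline` are illustrative and do not come from the slides; the point is only that the attribute vectors are built at query time from the distances to the query examples.

```python
from math import dist  # Euclidean distance, used here only as an example metric


def dominates(a, b):
    """a dominates b: a is <= b in every dimension and < in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def metric_skyline(database, examples, d):
    """Brute-force metric skyline (illustrative, quadratic in the database size).

    The i-th attribute of an object is its distance to the i-th query example,
    so the coordinate system exists only once the examples are known.
    """
    mapped = [tuple(d(o, q) for q in examples) for o in database]
    return [o for o, v in zip(database, mapped)
            if not any(dominates(w, v) for w in mapped)]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (2, 3)]
queries = [(0, 0), (4, 0)]
print(metric_skyline(points, queries, dist))
```

In this toy example the point (2, 3) is farther from both query examples than (2, 0), so it is dominated and dropped; the other four points each trade closeness to one example for distance from the other.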
Generic algorithm for a hierarchical metric index
• branch-and-bound algorithm (originally developed for the R-tree and the classic/spatial skyline operator)
• dynamic mapping of the metric space into an L1 vector space (query examples)
• heuristic: processing data/regions in L1 order guarantees no false dismissals
• a priority heap is used, storing the index entries to be inspected, each equipped with its MDDR (higher priority = lower L1 order of the MDDR)

The algorithm:
0) The entry of the entire index is pushed onto the heap (e.g., the M-tree root node).
1) An entry with the lowest L1 distance of its MDDR is popped from the heap.
2) If the entry contains just one data object (e.g., an entry in an M-tree leaf), it is added to the skyline set, and all heap entries dominated by it are removed. Jump to 1.
3) If the entry is a region (e.g., an entry in an M-tree inner node), its child node is fetched. The MDDRs of the child node's entries are checked for dominance by the already determined skyline set, and the dominated ones are filtered from further processing.
4) The MDDRs of the non-filtered child entries are derived, and those not dominated by the current skyline set are pushed onto the heap. Jump to 1.
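The steps above can be sketched over a toy ball hierarchy as follows. This is an assumption-laden illustration, not the authors' implementation: the `Node` class, `lower_corner`, and `skyline_bnb` are hypothetical names, only the lower corner of each MDDR is used, distances are recomputed rather than cached, and dominated heap entries are discarded lazily when popped instead of being removed from the heap eagerly as in step 2.

```python
import heapq
from itertools import count
from math import dist


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class Node:
    """Hypothetical ball region: center, covering radius, children (Nodes or objects)."""
    def __init__(self, center, radius, children):
        self.center, self.radius, self.children = center, radius, children


def lower_corner(entry, examples, d):
    """Lower corner of the entry's MDDR in the space of distances to the examples."""
    if isinstance(entry, Node):
        return tuple(max(0.0, d(entry.center, q) - entry.radius) for q in examples)
    return tuple(d(entry, q) for q in examples)  # a data object maps to a point


def skyline_bnb(root, examples, d):
    tie = count()  # tie-breaker so heap entries themselves are never compared
    heap = [(sum(lower_corner(root, examples, d)), next(tie), root)]  # step 0
    skyline = []   # pairs (object, mapped vector)
    while heap:
        _, _, entry = heapq.heappop(heap)                 # step 1: lowest L1 first
        corner = lower_corner(entry, examples, d)
        if any(dominates(v, corner) for _, v in skyline):
            continue                                      # lazy removal (simplification)
        if not isinstance(entry, Node):
            skyline.append((entry, corner))               # step 2: object joins the skyline
            continue
        for child in entry.children:                      # steps 3 and 4
            c = lower_corner(child, examples, d)
            if not any(dominates(v, c) for _, v in skyline):
                heapq.heappush(heap, (sum(c), next(tie), child))
    return [o for o, _ in skyline]


leaf_a = Node((1, 0), 1.0, [(0, 0), (1, 0), (2, 0)])
leaf_b = Node((2.5, 1.5), 1.6, [(3, 0), (2, 3)])
root = Node((2, 1), 3.0, [leaf_a, leaf_b])
print(sorted(skyline_bnb(root, [(0, 0), (4, 0)], dist)))
```

Popping in L1 order is what makes it safe to add an object to the skyline immediately: anything that could dominate it has a strictly smaller L1 sum, and any region containing such an object has a lower corner with an L1 sum that is no larger, so it has already been expanded.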
M-tree
• metric index based on the B+-tree
• an inner node contains routing entries
  • ball region (object and radius) + distance to parent region + pointer to subtree
• a leaf node contains ground entries
  • object + distance to parent region
• two types of filtering during querying
  • parent filtering (cheap)
    • the stored distance to the parent is used
  • basic filtering (expensive)
    • a distance computation is needed
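As an illustration of the two filters, the sketch below applies them to a range query, using the standard triangle-inequality conditions. The function names and argument layout are assumptions made for this example and are not taken from the slides.

```python
def parent_filter(d_q_parent, d_entry_parent, r_query, r_entry=0.0):
    """Cheap test: True if the entry can be pruned using only the stored
    distance to the parent and the already known distance from the query
    to the parent (no new distance computation)."""
    return abs(d_q_parent - d_entry_parent) > r_query + r_entry


def basic_filter(d_q_entry, r_query, r_entry=0.0):
    """Expensive test: True if the entry can be pruned; it needs the
    distance from the query to the entry's object to be computed first."""
    return d_q_entry > r_query + r_entry


# An entry whose object lies 10 from the parent, with covering radius 1,
# against a query of radius 2 that lies 1 from the parent.
print(parent_filter(d_q_parent=1.0, d_entry_parent=10.0, r_query=2.0, r_entry=1.0))
```

A query engine would typically try `parent_filter` first and compute the real distance for `basic_filter` only when the cheap test fails to prune.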
MSQ implementation using M-tree
• uses the generic algorithm enhanced by specific M-tree MDDRs, mapping the M-tree regions from the metric space into the L1 vector space (the dimensions are distances of data/regions to the query examples Qi)
• two types of M-tree MDDR
  • Par-MDDR
    • the mapped oversized region ball (using the distance to parent)
  • B-MDDR
    • the mapped region ball
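One plausible reading of the two MDDR types, written as a sketch: both map a ball region to a per-example interval of possible distances, but Par-MDDR replaces the distance to the ball center with triangle-inequality bounds derived from the stored distance to the parent. The function names and the exact formulas are assumptions for illustration; the slides only describe the idea.

```python
def b_mddr(d_q_centers, radius):
    """MDDR from the ball itself: needs d(Qi, center) for every example Qi."""
    return [(max(0.0, dq - radius), dq + radius) for dq in d_q_centers]


def par_mddr(d_q_parents, d_center_parent, radius):
    """Looser ('oversized') MDDR that needs only d(Qi, parent) and the
    stored distance from the ball center to the parent."""
    return [(max(0.0, abs(dq - d_center_parent) - radius),
             dq + d_center_parent + radius)
            for dq in d_q_parents]


# One query example: d(Q, center) = 5, d(Q, parent) = 4, d(center, parent) = 3, radius 1.
print(b_mddr([5.0], 1.0))         # tight interval around the ball
print(par_mddr([4.0], 3.0, 1.0))  # wider interval, but no new distance computed
```

The Par-MDDR interval always contains the B-MDDR interval, so it can be used as a cheap first check before paying for the distance computations that B-MDDR requires.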
PM-tree
• combination of the M-tree and pivot tables (LAESA)
• M-tree balls reduced by rings centered in global pivots Pi
• routing and ground entries also store the ring radii
• enhanced filtering
  • cheaply in the pivot space (mapping of data/balls into the L∞ vector space)
  • mapping of the query object into the pivot space is the only extra computation cost
  • if not filtered out in the pivot space, regular M-tree filtering follows
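The pivot-space filter can be sketched for a range query as below, assuming each region stores one ring (minimum and maximum distance) per global pivot. `ring_filter` and its argument layout are hypothetical names chosen for this example.

```python
def ring_filter(d_q_pivots, rings, r_query):
    """True if the region can be discarded in the pivot space.

    d_q_pivots[i] is the distance from the query object to pivot Pi, computed
    once per query; rings[i] = (lo, hi) bounds the distances of all objects
    in the region to Pi. The region is pruned as soon as the query ball
    misses any one ring.
    """
    return any(dq - r_query > hi or dq + r_query < lo
               for dq, (lo, hi) in zip(d_q_pivots, rings))


# Two pivots; the query ball overlaps the first ring but misses the second.
print(ring_filter([2.5, 0.5], [(2.0, 3.0), (4.0, 5.0)], r_query=1.0))
```

Only regions that survive this check go on to the regular M-tree parent and basic filtering.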
Paper contribution: PM-tree MSQ implementation
• B-MDDR, Par-MDDR (inherited from the M-tree)
• Piv-MDDR
  • using the PM-tree rings, the MDDR can be tightened
  • for each dimension (example Qi), the maximal lower bound and the minimal upper bound of the distance to the region are found (to the intersection of the rings)
• pivot skyline
  • skyline initialized by the pivots mapped to the L1 space
  • heavy optimization (reduction of heap size)
• deferred heap processing
  • reinsertions into the heap to save distance computations
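A minimal sketch of how the rings could tighten an MDDR, assuming the bounds follow from the triangle inequality applied to each pivot and are then combined across pivots. The name `piv_mddr` and the data layout are illustrative assumptions, not the authors' code.

```python
def piv_mddr(d_q_pivots, rings):
    """Per-example distance interval implied by the region's rings.

    d_q_pivots[i][j] is the distance from query example Qi to pivot Pj;
    rings[j] = (lo, hi) bounds the distance of every object in the region to Pj.
    For each example, the largest lower bound and the smallest upper bound
    over all pivots are kept.
    """
    result = []
    for dists in d_q_pivots:  # one row per query example
        lower = max(max(dq - hi, lo - dq, 0.0) for dq, (lo, hi) in zip(dists, rings))
        upper = min(dq + hi for dq, (lo, hi) in zip(dists, rings))
        result.append((lower, upper))
    return result


# One query example at distances 10 and 1 from two pivots.
print(piv_mddr([[10.0, 1.0]], [(2.0, 3.0), (6.0, 8.0)]))
```

In this example the first pivot supplies the lower bound (10 - 3 = 7) and the second supplies the upper bound (1 + 8 = 9). The resulting interval could further be intersected with a B-MDDR or Par-MDDR interval for the same region.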
Experiments
• a subset of the CoPhIR database: one million 76-dimensional tuples representing 2 MPEG7 features on flickr images, Euclidean distance used
• Polygons database: 250k 30-dimensional tuples representing 5-15 vertex 2D polygons, Hausdorff distance used
• results averaged over 200 metric skyline queries
• each metric skyline query defined by 2-5 query examples
Conclusions
• PM-tree based metric skyline query implementation
  • up to 2x faster in terms of distance computations and I/O cost (w.r.t. the original M-tree implementation)
  • up to 20x faster in terms of heap operations
  • needs up to 20x less space for the heap

Thank you for your attention! Questions?