Minimal Probing: Supporting Expensive Predicates for Top-k Queries
Kevin C. Chang, Seung-won Hwang
Univ. of Illinois at Urbana-Champaign
Context: Top-k Queries
• Ranked queries return the top-k results, unlike Boolean queries
• Crucial for retrieving data by “soft” conditions
  • relevance: e.g., text search engines
  • similarity: e.g., multimedia databases
  • preference: e.g., e-commerce product search
• Example scenario: a preference query for finding a house:
  select h.id from house h where new(age), cheap(price, size), large(size) order by min(new, cheap, large) stop after 5
  • new, cheap, large: predicates
  • min(new, cheap, large): scoring function
  • 5: k, the retrieval size
• Observation: it is crucial to support expensive predicates
Problem: Expensive Predicates
• Expensive predicates
  • no pre-computed indexes for zero-time sorted access
  • need a probe to evaluate each object (similar to a sequential scan)
• Unified abstraction for:
  • user-defined functions: functional extensibility
    • query conditions can be arbitrary and user-specific
    • e.g., cheap(price, size)
  • external predicates: data extensibility
    • source interface may require one probe per object
    • e.g., safe(zip) accesses the crime rate from apbnews.com
  • fuzzy joins
    • associations of relations can be arbitrary
    • e.g., close(house.zip, park.zip)
Current Limitations: “Sort-Merge” Framework
• Requires sorted access to search predicates.
• To “simulate” sorted access, requires complete probing
  • are these probes necessary?
• Goal: minimize probe cost
[Figure: a sort step feeding a merge step, with F = min(new, cheap, large) and k = 1]
• new (search predicate): a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
• cheap (expensive predicate): d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
• large (expensive predicate): b:0.90, d:0.90, e:0.80, a:0.75, c:0.20
• Merge algorithm, top-k output: b:0.78
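As a rough illustration of the complete-probing baseline, the sketch below ranks the five objects from the figure by F = min(new, cheap, large) after probing both expensive predicates on every object. The scores are copied from the figure; the function name `complete_probing_top_k` and the probe counter are illustrative assumptions, not part of the original slides.

```python
# Scores copied from the figure on this slide.
new = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}    # search predicate
cheap = {"d": 0.90, "a": 0.85, "b": 0.78, "c": 0.75, "e": 0.70}  # expensive predicate
large = {"b": 0.90, "d": 0.90, "e": 0.80, "a": 0.75, "c": 0.20}  # expensive predicate


def complete_probing_top_k(k):
    """Probe every expensive predicate on every object, then rank by F = min."""
    probes = 0
    scored = []
    for oid in new:
        cheap_score = cheap[oid]   # one probe
        large_score = large[oid]   # one probe
        probes += 2
        scored.append((min(new[oid], cheap_score, large_score), oid))
    scored.sort(reverse=True)      # highest overall score first
    return scored[:k], probes


top, probes = complete_probing_top_k(1)
print(top, probes)  # [(0.78, 'b')] 10
```

All 10 probes are spent even though only one answer is requested, which is the cost the slide asks about.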
Motivation: Solution Space
• Assume sequential probing
Algorithm skeleton:
  do:
    schedule next obj o, pred p
    probe pr(o, p)
  until (top-k identified)
[Figure: grid of objects a, b, c against predicates p1, p2, p3]
Our Framework: Separate, Global Predicate Scheduling
Two important decisions on the framework:
• Separate predicate scheduling
  • scheduling as a separate “optimization” phase before probing
  • avoids run-time scheduling overhead
• Global predicate scheduling
  • scheduling based on global information (predicate selectivities)
  • there is no per-object information to justify per-object scheduling
  • avoids per-object scheduling overhead
• Simple framework and algorithm
  • and efficient!
  • allows an essentially A* framework for a given predicate schedule
  • enables formal analysis: optimality, scalability
Simple Framework
• Separate, global predicate scheduling
Algorithm skeleton:
  find global schedule H
  do:
    schedule next obj o
    probe pr(o, next(o, H))
  until (top-k identified)
[Figure: grid of objects a, b, c against predicates p1, p2, p3, with H = (p1, p2, p3)]
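One way to read next(o, H) in the skeleton is "the first predicate in the global schedule H that has not yet been probed for object o". The helper below is a minimal sketch under that reading; the name `next_pred` and the dictionary of probed scores are assumptions made for illustration.

```python
def next_pred(probed, schedule):
    """Return the first predicate in the global schedule not yet probed for an object.

    probed   -- dict mapping predicate name to the score already obtained
    schedule -- the global schedule H, e.g. ("p1", "p2", "p3")
    Returns None once every predicate in the schedule has been probed.
    """
    for pred in schedule:
        if pred not in probed:
            return pred
    return None


H = ("p1", "p2", "p3")
print(next_pred({}, H))            # p1
print(next_pred({"p1": 0.8}, H))   # p2
```

Because H is fixed for all objects, the only run-time decision left is which object to probe next.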
Challenges for Minimizing Probing
• Predicate scheduling before probing
  • how to identify the best H?
• Object scheduling during probing
  • how to find the next object to probe, to achieve “minimal probing” with respect to H?
Algorithm skeleton:
  find global schedule H
  do:
    schedule next obj o
    probe pr(o, next(o, H))
  until (top-k identified)
Challenge 1: Object Scheduling
• Goal: perform only necessary probes
• Necessary probes:
  • A probe is necessary if the top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.
  • Question 1: Given a probe pr(o, next(o, H)), how to determine if it is necessary?
• Probe-optimal algorithm
  • An algorithm is probe-optimal if it performs only the necessary probes.
  • Question 2: How to identify necessary probes in order to design such an algorithm?
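Question 1: Is this Probe Necessary?
• k = 1, F = min(x, p1, p2); suppose H = (p1, p2)
• Candidate probe on object b (unprobed predicates are shown with the maximal score 1):
  • a: x = 0.9, p1 = 1, p2 = 1, F = 0.9 (top 1)
  • b: x = 0.8, p1 = ?, F ≤ 0.8 (Maybe not!)
  • c: x = 0.7, p1 = 1, p2 = 1, F = 0.7
  • d: x = 0.6, p1 = 1, p2 = 1, F = 0.6
  • e: x = 0.5, p1 = 1, p2 = 1, F = 0.5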
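Question 1: Is this Probe Necessary?
• k = 1, F = min(x, p1, p2); suppose H = (p1, p2)
• Theorem: Probe pr(o, p) is absolutely necessary if o is among the current top-k in terms of ceiling scores.
• Candidate probe on object a (unprobed predicates are shown with the maximal score 1):
  • a: x = 0.9, p1 = ?, F ≤ 0.9 (top 1? Necessary!)
  • b: x = 0.8, p1 = 1, p2 = 1, F = 0.8
  • c: x = 0.7, p1 = 1, p2 = 1, F = 0.7
  • d: x = 0.6, p1 = 1, p2 = 1, F = 0.6
  • e: x = 0.5, p1 = 1, p2 = 1, F = 0.5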
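The theorem can be sketched in a few lines of code: a ceiling score treats every unprobed predicate as scoring 1, and a probe on an object is necessary when that object is among the current top-k by ceiling score. The helper names (`ceiling`, `current_top_k`, `is_necessary`) are hypothetical, and the sketch assumes F = min as on the slide. The values 0.85 and 0.75 for object a come from the later trace slide.

```python
def ceiling(x, probed, schedule):
    """Upper bound on F = min(x, p1, ...): unprobed predicates count as 1."""
    return min([x] + [probed.get(pred, 1.0) for pred in schedule])


def current_top_k(xs, probed, schedule, k):
    """Object ids of the k highest ceiling scores."""
    ceilings = {o: ceiling(xs[o], probed.get(o, {}), schedule) for o in xs}
    return sorted(ceilings, key=lambda o: -ceilings[o])[:k]


def is_necessary(oid, xs, probed, schedule, k):
    """A probe on oid is necessary if oid is in the current top-k by ceiling
    score and still has an unprobed predicate."""
    has_unprobed = len(probed.get(oid, {})) < len(schedule)
    return has_unprobed and oid in current_top_k(xs, probed, schedule, k)


xs = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
H = ("p1", "p2")
print(is_necessary("a", xs, {}, H, k=1))  # True
print(is_necessary("b", xs, {}, H, k=1))  # False
```

Note that a probe on b is only "maybe not" necessary at this point: once a's ceiling drops below 0.8, b enters the top-1 and its next probe becomes necessary.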
Question 2: Probe-optimal object scheduling
• Objects in the current top-k must be further probed
• Probe-optimal object scheduling: Algorithm MPro
  • use a priority queue with ceiling scores as priorities
• Priority queue trace (k = 1):
  • initially: a:0.9, b:0.8, c:0.7, d:0.6, e:0.5
  • after pr(a,p1) = 0.85: a:0.85, b:0.8, c:0.7, d:0.6, e:0.5
  • after pr(a,p2) = 0.75: b:0.8, a:0.75, c:0.7, d:0.6, e:0.5
  • after pr(b,p1) = 0.78: b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
  • after pr(b,p2) = 0.90: b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
• top 1: b:0.78
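The priority-queue idea on this slide can be sketched as follows. This is a minimal reading of MPro for F = min, not the authors' implementation: the function name `mpro`, the tuple layout in the heap, and the `probe` callback are assumptions. The data reuses x = new, p1 = cheap, and p2 = large from the Sort-Merge slide, whose values agree with the four probe outcomes listed here.

```python
import heapq


def mpro(x, probe, schedule, k):
    """Return the top-k (oid, score) pairs and the list of probes performed.

    x        -- dict oid -> score of the sorted-access predicate
    probe    -- function (oid, pred) -> score of an expensive predicate
    schedule -- global predicate schedule H
    """
    # Heap entries: (-ceiling score, oid, number of predicates already probed).
    heap = [(-score, oid, 0) for oid, score in x.items()]
    heapq.heapify(heap)
    results, trace = [], []
    while heap and len(results) < k:
        neg_ceiling, oid, done = heapq.heappop(heap)
        if done == len(schedule):
            # Fully probed and still on top: its ceiling is its final score.
            results.append((oid, -neg_ceiling))
            continue
        pred = schedule[done]
        score = probe(oid, pred)
        trace.append((oid, pred, score))
        # With F = min, the new ceiling is the minimum seen so far.
        heapq.heappush(heap, (-min(-neg_ceiling, score), oid, done + 1))
    return results, trace


x = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
scores = {
    "p1": {"d": 0.90, "a": 0.85, "b": 0.78, "c": 0.75, "e": 0.70},
    "p2": {"b": 0.90, "d": 0.90, "e": 0.80, "a": 0.75, "c": 0.20},
}
results, trace = mpro(x, lambda oid, pred: scores[pred][oid], ("p1", "p2"), k=1)
print(results)     # [('b', 0.78)]
print(len(trace))  # 4
```

On this data the sketch performs the same four probes as the trace above, compared with ten under complete probing.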
Challenge 2: Predicate Scheduling
• Scheduling problem
  • find the minimal-cost schedule among all permutations
• Challenges
  • selectivity estimation:
    • dynamic predicates
    • aggregate selectivities (context-dependent)
  • scheduling computation:
    • NP-hard
• Our approach:
  • on-line sampling to estimate selectivities
  • greedy selection to schedule predicates
• 0.1% sampling achieves almost the best schedule
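The slide does not spell out the greedy rule, so the following is only one plausible sketch and should not be read as the authors' procedure. It assumes a hypothetical score threshold `theta`, per-predicate probe costs, and the classic "cost per unit of filtering" ranking. Selectivities are estimated on a sample and re-estimated on the surviving sample objects after each pick, which is one way to account for context-dependent (aggregate) selectivities.

```python
import random


def greedy_schedule(objects, predicates, costs, theta, sample_rate=0.001, seed=0):
    """Order predicates greedily using selectivities estimated on a sample.

    objects    -- list of objects
    predicates -- dict name -> scoring function (object -> score in [0, 1])
    costs      -- dict name -> probe cost (assumed known)
    theta      -- hypothetical threshold: scores below it "filter out" an object
    """
    rng = random.Random(seed)
    n = max(1, int(len(objects) * sample_rate))
    sample = rng.sample(objects, n)
    # Probe every predicate, but only on the sample.
    scores = {name: [fn(o) for o in sample] for name, fn in predicates.items()}

    alive = set(range(n))          # sample objects not yet filtered out
    remaining = set(predicates)
    schedule = []
    while remaining:
        def rank(name):
            dropped = sum(1 for i in alive if scores[name][i] < theta)
            rate = dropped / len(alive) if alive else 0.0
            return costs[name] / rate if rate > 0 else float("inf")

        best = min(sorted(remaining), key=rank)   # sorted() makes ties deterministic
        schedule.append(best)
        remaining.remove(best)
        alive = {i for i in alive if scores[best][i] >= theta}
    return schedule
```

For example, with two equally expensive predicates where only the second ever scores below `theta`, the sketch schedules the second one first, since probing it early prunes more objects per unit of cost.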
Experiment Results
• Practical performance of MPro
  • cost proportional to the retrieval size k
  • significant speedup for small k
• Impact of performance factors
  • database size: sublinear cost scalability
  • score distribution and scoring function: see paper
[Chart labels: 6 hour, 2 min]
Demo: House Search
• Data: all houses on sale in Illinois (N = 20990)
  • from www.realtor.com
  • objects: house(id, price, size, bed, bath, zip, city)
• Query: F = Average(n, c, r)
  • n (nearcity): close to Chicago
  • c (cheap): “reasonable” price for its size
  • r (roomy): prefer 4-6 rooms
Summary of Contributions (more in the paper)
• Abstraction:
  • for user-defined, external, and fuzzy join predicates
• Framework and algorithm:
  • sampling-based global scheduling
  • probe-optimal algorithm MPro
  • extensions of MPro: fuzzy joins, parallel MPro, approximation
• Principles/Theorems:
  • necessary-probe principle
  • probe-optimality of MPro
  • analytical scalability of MPro
• Extensive experiments
Parallel MPro: Overview
• Probe-parallel MPro
  • probe k necessary probes concurrently
  • up to k-fold speedup
• Data-parallel MPro
  • partition the data into s chunks
  • up to s-fold speedup
[Figure: several MPro instances feeding a Merge step that outputs the top-k]
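The data-parallel variant can be sketched as "top-k per chunk, then merge". In the sketch below, `local_top_k` is a stand-in for running MPro on one chunk (the scores are assumed to be already known), and the thread pool, round-robin partitioning, and function names are illustrative assumptions rather than details from the slides.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def local_top_k(chunk, k):
    """Stand-in for MPro on one chunk: returns the chunk's k best (oid, score) pairs."""
    return heapq.nlargest(k, chunk, key=lambda item: item[1])


def data_parallel_top_k(items, k, s):
    """Partition items into s chunks, take each chunk's top-k, then merge."""
    chunks = [items[i::s] for i in range(s)]   # round-robin partitioning
    with ThreadPoolExecutor(max_workers=s) as pool:
        partial = pool.map(lambda chunk: local_top_k(chunk, k), chunks)
        candidates = [item for part in partial for item in part]
    # The global top-k is always contained in the union of the local top-k lists.
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


items = [("a", 0.75), ("b", 0.78), ("c", 0.20), ("d", 0.60), ("e", 0.50)]
print(data_parallel_top_k(items, k=2, s=2))  # [('b', 0.78), ('a', 0.75)]
```

The merge step only sees at most s × k candidates, so its cost is small compared with the per-chunk probing.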
Scalability
[Figure: panels for N = 1000, N = 10000, and N = 100000; and for k = 100 with N = 1000, k = 1000 with N = 10000, k = 10000 with N = 100000]
Comparison
[Figure: labels T and O; remaining content not recoverable]