Top-k Query Processing
Optimal aggregation algorithms for middleware
Ronald Fagin, Amnon Lotem, and Moni Naor
+ Sushruth P. + Arjun Dasgupta
Why top-k query processing
• Multimedia brings fuzzy data
  • Attribute values are graded, typically in [0,1]
  • No clear boundary between “answer” and “no answer”
• A query in a multimedia database means combining graded attributes
  • Combine attributes by an aggregation function
  • The aggregation function gives the overall grade of an object
  • Return the k objects with the highest overall grade
Top-k query processing
• Top-k query processing = finding the k objects that have the highest overall grades
• How? Which algorithms?
  • Fagin’s Algorithm (FA)
  • Threshold Algorithm (TA)
• Which is the best algorithm?
• Keep in mind: the database system serves as middleware
  • Multimedia objects may be kept in different subsystems
  • e.g. photoDB, videoDB, search engine
  • Take into account the limitations of these subsystems
Example
• Simple database model
• Simple query
• Explaining Fagin’s Algorithm (FA)
• Finding top-k with FA
• Explaining Threshold Algorithm (TA)
• Finding top-k with TA
Example – Simple Database model
N objects (rows), M attributes (columns):

Object ID   Attribute 1   Attribute 2
a           0.9           0.85
b           0.8           0.7
c           0.72          0.2
...         ...           ...
d           0.6           0.9

Sorted L1 (by Attribute 1): (a, 0.9), (b, 0.8), (c, 0.72), ..., (d, 0.6)
Sorted L2 (by Attribute 2): (d, 0.9), (a, 0.85), (b, 0.7), ..., (c, 0.2)
Example – Simple Query
• Find the top 2 (k = 2) objects for the following query executed on the middleware: A1 & A2 (e.g. color = red & shape = round)
• Given A1 & A2 as a query, the middleware combines the grades of A1 and A2 by min(A1, A2)
• Aggregation function:
  • function that gives objects an overall grade based on attribute grades
  • examples: min, max
  • Monotonicity!
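To make the query concrete, here is a minimal sketch that grades every object with the aggregation function and sorts. It assumes the database holds only the four objects shown in the example (the elided rows are ignored), and `top_k_naive` is a hypothetical helper name, not something from the slides. This is the brute-force baseline that FA and TA try to avoid.

```python
# Grades (Attribute 1, Attribute 2) of the four objects shown in the example.
grades = {"a": (0.9, 0.85), "b": (0.8, 0.7), "c": (0.72, 0.2), "d": (0.6, 0.9)}

def top_k_naive(grades, t, k):
    """Grade every object with aggregation function t and return the k best."""
    scored = [(t(g), obj) for obj, g in grades.items()]
    # Highest overall grade first; ties are broken by object ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(obj, score) for score, obj in scored[:k]]

print(top_k_naive(grades, min, 2))  # [('a', 0.85), ('b', 0.7)]
```

With min as the aggregation function the overall grades are a = 0.85, b = 0.7, d = 0.6 and c = 0.2, so the top 2 are a and b.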
Example – Fagin’s Algorithm
• STEP 1
  • Read attributes from every sorted list
  • Stop when k objects have been seen in common from all lists
L1: (a, 0.9), (b, 0.8), (c, 0.72), ..., (d, 0.6)
L2: (d, 0.9), (a, 0.85), (b, 0.7), ..., (c, 0.2)

ID   A1     A2     Min(A1,A2)     (- = not yet known)
a    0.9    0.85   -
b    0.8    0.7    -
c    0.72   -      -
d    -      0.9    -
Example – Fagin’s Algorithm
• STEP 2
  • Random access to find missing grades

ID   A1     A2     Min(A1,A2)     (- = not yet known)
a    0.9    0.85   -
b    0.8    0.7    -
c    0.72   0.2    -
d    0.6    0.9    -
Example – Fagin’s Algorithm
• STEP 3
  • Compute the grades of the seen objects
  • Return the k highest graded objects

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7
c    0.72   0.2    0.2
d    0.6    0.9    0.6

Top 2: a (0.85) and b (0.7)
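The three FA steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the lists contain only the four objects shown, all lists have the same length, and a dictionary per list stands in for random access. The function name `fagin` and the returned depth are hypothetical conveniences.

```python
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]   # sorted by Attribute 1
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]   # sorted by Attribute 2

def fagin(lists, t, k):
    """Sketch of FA; lists[i] is [(obj, grade), ...] in descending grade order."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen = [set() for _ in range(m)]        # objects seen under sorted access, per list
    depth = 0
    # Step 1: parallel sorted access until k objects were seen in every list.
    while depth < len(lists[0]):
        for i in range(m):
            seen[i].add(lists[i][depth][0])
        depth += 1
        if len(set.intersection(*seen)) >= k:
            break
    # Step 2: fetch the grades of every seen object. For brevity all grades are
    # looked up; a real system would only fetch the missing ones.
    candidates = set.union(*seen)
    full = {obj: [lookup[i][obj] for i in range(m)] for obj in candidates}
    # Step 3: compute overall grades and return the k best.
    ranked = sorted(((t(g), obj) for obj, g in full.items()),
                    key=lambda p: (-p[0], p[1]))
    return [(obj, score) for score, obj in ranked[:k]], depth

print(fagin([L1, L2], min, 2))  # ([('a', 0.85), ('b', 0.7)], 3)
```

On the example, sorted access stops at depth 3, once a and b have been seen in both lists.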
New Idea !!! Threshold Algorithm (TA)
• Read all grades of an object once it is seen under sorted access
  • No need to wait until the lists give k common objects
• Do sorted access (and the corresponding random accesses) until you have seen the top k answers
• How do we know that the grades of seen objects are higher than the grades of unseen objects?
  • Predict the maximum possible grade of unseen objects: the threshold value
Seen under sorted access:
L1: a: 0.9, b: 0.8, c: 0.72
L2: d: 0.9, a: 0.85, b: 0.7
Threshold value: T = min(0.72, 0.7) = 0.7
Possibly unseen (further down the lists): an object f with grades 0.6 and 0.65, then d: 0.6 in L1 and c: 0.2 in L2
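The threshold is simply the aggregation function applied to the last grade read from each list. A small sketch, assuming the four-object example lists and a hypothetical helper `threshold`; because the lists are sorted and the aggregation function is monotone, no unseen object can be graded higher than this value.

```python
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def threshold(lists, depth, t=min):
    """Upper bound on the overall grade of any object not yet seen at this depth."""
    last_seen = [lst[depth - 1][1] for lst in lists]  # depth is 1-based
    return t(last_seen)

print([threshold([L1, L2], d) for d in (1, 2, 3)])  # [0.9, 0.8, 0.7]
```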
Example – Threshold Algorithm
Step 1:
• Parallel sorted access to each list
• For each object seen:
  • get all grades by random access
  • determine Min(A1,A2)
  • amongst the 2 highest seen? Keep in buffer
L1: (a, 0.9), (b, 0.8), (c, 0.72), ..., (d, 0.6)
L2: (d, 0.9), (a, 0.85), (b, 0.7), ..., (c, 0.2)

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
Example – Threshold Algorithm
Step 2:
• Determine the threshold value from the grades last seen under sorted access: T = min(last grade in L1, last grade in L2)
• 2 objects with overall grade ≥ threshold value? Stop
• Else go to the next entry position in the sorted lists and repeat step 1

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6

T = min(0.9, 0.9) = 0.9. No buffered object reaches 0.9, so continue.
Example – Threshold Algorithm
Step 1 (Again):
• Parallel sorted access to each list
• For each object seen:
  • get all grades by random access
  • determine Min(A1,A2)
  • amongst the 2 highest seen? Keep in buffer

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
b    0.8    0.7    0.7
Example – Threshold Algorithm
Step 2 (Again):
• Determine the threshold value from the grades last seen under sorted access: T = min(last grade in L1, last grade in L2)
• 2 objects with overall grade ≥ threshold value? Stop
• Else go to the next entry position in the sorted lists and repeat step 1

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7

T = min(0.8, 0.85) = 0.8. Only a reaches 0.8, so continue.
Example – Threshold Algorithm
Situation at stopping condition (after reading (c, 0.72) from L1 and (b, 0.7) from L2):

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7

T = min(0.72, 0.7) = 0.7. Both buffered objects have overall grade ≥ 0.7, so TA stops and returns a and b.
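The TA walkthrough above can be sketched with a buffer of at most k objects. This is an illustrative sketch under assumptions: only the four shown objects exist, the lists are non-empty and of equal length, and a dictionary per list stands in for random access. The name `threshold_algorithm` and the tie-breaking on eviction are choices made here, not taken from the slides.

```python
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def threshold_algorithm(lists, t, k):
    """Sketch of TA; lists[i] is [(obj, grade), ...] in descending grade order."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]    # stands in for random access
    buffer = {}                              # at most k objects: obj -> overall grade
    depth = -1
    for depth in range(len(lists[0])):
        last = []
        for i in range(m):
            obj, grade = lists[i][depth]     # Step 1: sorted access
            last.append(grade)
            if obj not in buffer:
                # Random access for all grades, then aggregate.
                buffer[obj] = t([lookup[j][obj] for j in range(m)])
                if len(buffer) > k:          # keep only the k highest graded
                    worst = min(buffer, key=lambda o: (buffer[o], o))
                    del buffer[worst]
        tau = t(last)                        # Step 2: threshold value
        if len(buffer) == k and all(s >= tau for s in buffer.values()):
            break
    ranked = sorted(buffer.items(), key=lambda p: (-p[1], p[0]))
    return ranked, depth + 1

print(threshold_algorithm([L1, L2], min, 2))  # ([('a', 0.85), ('b', 0.7)], 3)
```

The thresholds along the way are 0.9, 0.8 and 0.7, matching the slides; d is evicted from the buffer when b arrives.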
Comparison of Fagin’s and Threshold Algorithm
• TA sees no more objects than FA
  • TA stops at least as early as FA
  • When we have seen k objects in common in FA, their grades are greater than or equal to the threshold in TA
• TA may perform more random accesses than FA
  • In TA, (m-1) random accesses for each object
  • In FA, random accesses are done at the end, only for missing grades
• TA requires only bounded buffer space (k)
  • At the expense of more random seeks
  • FA makes use of unbounded buffers
The best algorithm
• Which algorithm is the best: TA or FA?
• Define “best”
  • middleware cost
  • concept of instance optimality
• Consider:
  • wild guesses
  • aggregation function characteristics
    • monotone, strictly monotone, strict
  • database restrictions
    • distinctness property
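The best algorithm: concept of optimality
• $\mathbf{A}$ = class of algorithms; $A \in \mathbf{A}$ represents an algorithm
• $\mathbf{D}$ = class of legal inputs to the algorithms (databases); $D \in \mathbf{D}$ represents a database
• middleware cost = cost of accessing the data subsystems = $s\,c_S + r\,c_R$ ($s$ sorted accesses at cost $c_S$ each, $r$ random accesses at cost $c_R$ each)
• $\mathrm{Cost}(A, D)$ = middleware cost when running algorithm $A$ over database $D$
• Algorithm $B$ is instance optimal over $\mathbf{A}$ and $\mathbf{D}$ if $B \in \mathbf{A}$ and $\mathrm{Cost}(B, D) = O(\mathrm{Cost}(A, D))$ for all $A \in \mathbf{A}$, $D \in \mathbf{D}$
• Which means that: $\mathrm{Cost}(B, D) \le c \cdot \mathrm{Cost}(A, D) + c'$ for all $A \in \mathbf{A}$, $D \in \mathbf{D}$
  • $c$ = optimality ratio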
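As a rough illustration of these definitions, the sketch below computes the middleware cost from access counts and checks the instance-optimality inequality on a finite table of costs. The function names, the unit costs and the cost tables are all made up for illustration, and a finite check like this can only refute the inequality for a given c and c', never prove instance optimality.

```python
def middleware_cost(s, r, c_s=1.0, c_r=1.0):
    """Cost of s sorted accesses and r random accesses: s*c_S + r*c_R."""
    return s * c_s + r * c_r

def within_ratio(cost_b, cost_others, c, c_prime=0.0):
    """Check Cost(B, D) <= c * Cost(A, D) + c' for every listed A and D.

    cost_b maps database -> cost of B; cost_others maps algorithm -> {database: cost}.
    """
    return all(cost_b[d] <= c * costs[d] + c_prime
               for costs in cost_others.values() for d in cost_b)

# FA on the four-object example: 6 sorted accesses and 2 random accesses.
# The unit costs c_S = 1 and c_R = 5 are arbitrary.
print(middleware_cost(6, 2, c_s=1.0, c_r=5.0))  # 16.0

# Hypothetical cost table for an algorithm B and one competitor A1.
cost_b = {"D1": 16, "D2": 30}
cost_others = {"A1": {"D1": 10, "D2": 12}}
print(within_ratio(cost_b, cost_others, c=3))  # True
print(within_ratio(cost_b, cost_others, c=2))  # False
```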
The best algorithm: instance optimality & wild guesses
• Intuitively: B instance optimal = always the best algorithm in A
  • = always optimal
• In reality, “always” has an exception: we exclude algorithms that make wild guesses
• Wild guess = random access on an object not previously encountered by sorted access
  • In practice not possible
  • The database needs to know the ID to do random access
• If wild guesses are allowed in A, then no algorithm can be instance optimal
  • Wild guesses can find the top-k objects with k·m random accesses
  • (k = #objects, m = #lists)
The best algorithm: aggregation functions
• Aggregation function t combines an object’s grades into the object’s overall grade:
  • $(x_1, \dots, x_m) \mapsto t(x_1, \dots, x_m)$
• Monotone:
  • $t(x_1, \dots, x_m) \le t(x'_1, \dots, x'_m)$ if $x_i \le x'_i$ for every $i$
• Strictly monotone:
  • $t(x_1, \dots, x_m) < t(x'_1, \dots, x'_m)$ if $x_i < x'_i$ for every $i$
• Strict:
  • $t(x_1, \dots, x_m) = 1$ precisely when $x_i = 1$ for every $i$
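The three properties can be spot-checked by brute force on a small grid of grade vectors. This is only a sanity check under assumptions (a three-point grid, m = 2 by default, hypothetical function names): passing it does not prove the property, but failing it disproves it.

```python
from itertools import product

GRID = [0.0, 0.5, 1.0]  # coarse sample of the grade range [0, 1]

def is_monotone(t, m=2):
    pts = list(product(GRID, repeat=m))
    return all(t(x) <= t(y) for x in pts for y in pts
               if all(a <= b for a, b in zip(x, y)))

def is_strictly_monotone(t, m=2):
    pts = list(product(GRID, repeat=m))
    return all(t(x) < t(y) for x in pts for y in pts
               if all(a < b for a, b in zip(x, y)))

def is_strict(t, m=2):
    # t(x) = 1 exactly when every component of x is 1.
    return all((t(x) == 1.0) == all(a == 1.0 for a in x)
               for x in product(GRID, repeat=m))

for name, t in [("min", min), ("max", max), ("constant", lambda x: 0.5)]:
    print(name, is_monotone(t), is_strictly_monotone(t), is_strict(t))
# min True True True
# max True True False
# constant True False False
```

For instance, max is not strict because max(1, 0) = 1 although not every grade is 1.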
The best algorithm: database restrictions
• Distinctness property: a database has no (sorted) attribute list in which two objects have the same grade
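A quick way to test the distinctness property, assuming each sorted list is given as (object, grade) pairs; `has_distinctness` is a hypothetical helper name.

```python
def has_distinctness(lists):
    """True if no list contains two objects with the same grade."""
    return all(len({grade for _, grade in lst}) == len(lst) for lst in lists)

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
print(has_distinctness([L1, L2]))                      # True
print(has_distinctness([[("a", 0.9), ("b", 0.9)]]))    # False
```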
The best algorithm: Fagin’s Algorithm
• Database with N objects, each with m attributes
• Orderings of the lists are independent
• FA finds the top k with middleware cost $O(N^{(m-1)/m} k^{1/m})$
• FA = optimal with high probability in the worst case for strict monotone aggregation functions
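To get a feel for the bound, the sketch below evaluates $N^{(m-1)/m} k^{1/m}$ for made-up sizes. The O(...) notation hides constant factors and access costs, so the number is an order of magnitude only; `fa_cost_bound` is a hypothetical name.

```python
def fa_cost_bound(n, m, k):
    """N^((m-1)/m) * k^(1/m), ignoring the constants hidden by O(...)."""
    return n ** ((m - 1) / m) * k ** (1 / m)

# One million objects, 2 lists, top 10: about 3,162 accesses (up to constants),
# compared with the 2,000,000 grades a full scan of both lists would read.
print(round(fa_cost_bound(1_000_000, 2, 10)))  # 3162
```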
The best algorithm: Threshold Algorithm
• TA = instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses)
  • = optimal in a much stronger sense than Fagin’s Algorithm
  • If strict monotone aggregation function:
    • Optimality ratio = $m + m(m-1)\,c_R/c_S$ = best possible (m = # attributes)
    • If random access is free ($c_R = 0$): optimality ratio = m
    • If sorted access is free ($c_S = 0$): optimality ratio = infinite, so TA is not instance optimal
• TA = instance optimal (always optimal) for every strictly monotone aggregation function, over every database (including wild guesses) that satisfies the distinctness property
  • Optimality ratio = $c\,m^2$ with $c = \max\{c_R/c_S,\ c_S/c_R\}$
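The two optimality ratios are easy to evaluate for given access costs. A minimal sketch with hypothetical function names and made-up costs; the second function assumes both costs are nonzero.

```python
def ta_optimality_ratio(m, c_r, c_s):
    """m + m(m-1) * c_R / c_S, the ratio for TA without wild guesses."""
    if c_s == 0:
        return float("inf")  # the ratio is unbounded when sorted access is free
    return m + m * (m - 1) * c_r / c_s

def ta_distinct_ratio(m, c_r, c_s):
    """c * m^2 with c = max(c_R/c_S, c_S/c_R), under the distinctness property."""
    c = max(c_r / c_s, c_s / c_r)
    return c * m * m

print(ta_optimality_ratio(2, 0.0, 1.0))   # 2.0
print(ta_optimality_ratio(3, 10.0, 1.0))  # 63.0
print(ta_distinct_ratio(3, 10.0, 1.0))    # 90.0
```

The first ratio grows with $c_R/c_S$, which is the motivation for variants that use fewer random accesses when they are expensive.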
Extending TA
• What if sorted access is restricted? e.g. when using a distance database
  • TA_Z
• What if random access is not possible? e.g. web search engine
  • No Random Access algorithm (NRA)
• What if we want only the approximate top k objects?
  • TAθ
• What if we consider the relative costs of random and sorted access?
  • Combined Algorithm (CA, between TA and NRA)
NRA
• What if we also want the scores?
Combined Algorithm (CA)
• CA is instance optimal
Approximation
• A θ-approximation (θ > 1) to the top k answers for the aggregation function t is a collection of k objects (each along with its grade) such that for each y among these k objects and each z not among these k objects, $\theta \cdot t(y) \ge t(z)$
• TAθ: halt as soon as at least k objects have been seen whose grade is at least equal to threshold/θ
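The only change from TA to TAθ is the stopping test, which compares against threshold/θ instead of the threshold. The sketch below makes the same assumptions as a plain TA sketch would (four-object example lists of equal length, a dictionary per list standing in for random access); `ta_theta` is a hypothetical name and θ = 1.2 is an arbitrary choice.

```python
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def ta_theta(lists, t, k, theta):
    """Sketch of TA-theta; theta = 1 gives the exact Threshold Algorithm."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]    # stands in for random access
    buffer = {}                              # at most k objects: obj -> overall grade
    depth = -1
    for depth in range(len(lists[0])):
        last = []
        for i in range(m):
            obj, grade = lists[i][depth]     # sorted access
            last.append(grade)
            if obj not in buffer:
                buffer[obj] = t([lookup[j][obj] for j in range(m)])
                if len(buffer) > k:          # keep only the k highest graded
                    worst = min(buffer, key=lambda o: (buffer[o], o))
                    del buffer[worst]
        cutoff = t(last) / theta             # relaxed threshold
        if len(buffer) == k and all(s >= cutoff for s in buffer.values()):
            break
    ranked = sorted(buffer.items(), key=lambda p: (-p[1], p[0]))
    return ranked, depth + 1

print(ta_theta([L1, L2], min, 2, theta=1.0))  # ([('a', 0.85), ('b', 0.7)], 3)
print(ta_theta([L1, L2], min, 2, theta=1.2))  # ([('a', 0.85), ('b', 0.7)], 2)
```

With θ = 1.2 the run halts one step earlier: at depth 2 the threshold is 0.8, the cutoff is about 0.67, and both a (0.85) and b (0.7) already clear it.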