
Aggregation Algorithms and Instance Optimality

Explore aggregation algorithms and instance optimality for combining information from multiple sources in databases. Discuss problem definition, evaluation methods, new algorithms, and future research directions.


Presentation Transcript


  1. Aggregation Algorithms and Instance Optimality Moni Naor, Weizmann Institute. Joint work with Ron Fagin and Amnon Lotem

  2. Aggregating information from several lists/sources • Define the problem • Ways to evaluate algorithms • New algorithms • Further Research

  3. The problem • Database D of N objects • An object R has m fields (x1, x2, …, xm) • Each xi ∈ [0,1] • The objects are given in m lists L1, L2, …, Lm • List Li contains all objects, sorted by xi value • An aggregation function t(x1, x2, …, xm) • t is a monotone increasing function. Wanted: the top k objects according to t

  4. Goal • Touch as few objects as possible • Access to object? [Figure: two sorted lists. List L1: c1 = 0.9, b1 = 0.8, s1 = 0.65, r1 = 0.5, a1 = 0.4. List L2: s2 = 0.85, a2 = 0.84, r2 = 0.75, b2 = 0.3, c2 = 0.2]

  5. Where? The problem arises when combining information from several sources/criteria. Concentrate on middleware complexity, without changing the subsystems

  6. Example: Combining Fuzzy Information Lists are results of the query: "find objects with color 'red' and shape 'round'" • Subsystems for color and for shape • Each returns a score in [0,1] for each object • Aggregating function t is how the middleware system should combine the two criteria • Example: t(R = (x1, x2)) could be min(x1, x2)

  7. Example: scheduling pages Each object - page in a data broadcast system • 1st field - # of users requesting the page • 2nd field - longest time a user has been waiting. Combining function t - product of the two fields (ranks pages the same way as their geometric mean). Goal: find the page with the largest product

  8. Example: Information Retrieval [Figure: term-document matrix with terms T1, T2, …, Tn, documents D1, D2, …, Dk, and entries such as W12] Query T1, T2, T3: find documents with the largest sum of entries. Aggregation function t is Σ xi

  9. Modes of Access to the Lists • Sequential/sorted access: obtain the next object in list Li; cost cS • Random access: for object R and i ≤ m, obtain xi; cost cR. Cost of an execution: cS · (# of seq. accesses) + cR · (# of random accesses)

  10. Interesting Cases • cR/cS is small • cS ≈ cR or • cR >> cS. Number of lists m - small

  11. Fagin’s Algorithm - FA • For all lists L1, L2, …, Lm get next object in sorted order. • Stop when there is a set of k objects that appeared in all lists. • For every object R encountered • retrieve all fields x1, x2, …, xm. • Compute t(x1, x2, …, xm) • Return top k objects
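The steps on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fagin_algorithm`, the dict-based database `db` (object name mapped to a tuple of m field values), and the in-memory sorted lists are all made up for the example, and sorted access proceeds one row (depth) at a time across all lists.

```python
import heapq

def fagin_algorithm(db, t, k):
    """Sketch of FA. db: {name: (x1, ..., xm)}; t: monotone aggregation function."""
    m = len(next(iter(db.values())))
    # Hypothetical stand-in for the subsystems: list i holds object names
    # ordered by decreasing value of field i.
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]

    seen_in = {}   # object -> set of list indices where sorted access returned it
    complete = 0   # number of objects that have appeared in all m lists
    depth = 0
    # Phase 1: sorted access in parallel until k objects appeared in all lists.
    while complete < k and depth < len(db):
        for i in range(m):
            r = lists[i][depth]
            where = seen_in.setdefault(r, set())
            where.add(i)
            if len(where) == m:
                complete += 1
        depth += 1

    # Phase 2: random access for every object encountered, then rank by t.
    scored = [(t(db[r]), r) for r in seen_in]
    top = heapq.nlargest(k, scored)
    return [(r, score) for score, r in top], depth
```

On the m = 2, k = 1 data of the TA example (slide 20) with t = min, this sketch reads 5 rows of each list before some object has appeared in both, then returns r with score 1/8.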

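  12. Correctness of FA For any monotone t and any database D of objects, FA finds the top k objects. Proof: an object that was never seen is no better in any field than each of the k objects in the intersection, so by monotonicity it cannot score higher than them. Equivalently, any object in the real top k is better in at least one field than the objects in the intersection, and was therefore seen.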

  13. Performance of FA Performance: assuming that the fields are independent, Θ(N^((m−1)/m)). Better performance: correlation between fields. Worse performance: negative correlation. Bad aggregating function: max

  14. Goals of this work • Improve complexity and analysis: worst case is not meaningful; instead consider instance optimality • Expand the range of functions: want to handle all monotone aggregating functions • Simplify implementation

  15. Instance Optimality A = class of algorithms, D = class of legal inputs. For A ∈ A and D ∈ D, measure cost(A, D) ≥ 0. • An algorithm A ∈ A is instance optimal over A and D if there are constants c1 and c2 s.t. for every A' ∈ A and D ∈ D: cost(A, D) ≤ c1 · cost(A', D) + c2. c1 is called the optimality ratio

  16. …Instance Optimality • Common in competitive online analysis • Compare an online decision-making algorithm to the best offline one • Approximation algorithms • Compare the size that the best algorithm can find to the one the approx. algorithm finds. In our case • Offline corresponds to nondeterminism

  17. …Instance Optimality • We show algorithms that are instance optimal for a variety of • Classes of algorithms: deterministic, probabilistic, approximate • Databases • Access cost functions

  18. Guidelines for Design of Algorithms • Format: do sequential/sorted access (with random access on other fields) until you know that you have seen the top k. • In general: greedy gathering of information; If a query might allow you to know top k objects do it. Works in all considered scenarios

  19. The Threshold Algorithm - TA • For all lists L1, L2, …, Lm get next object in sorted order. • For each object R returned • Retrieve all fields x1, x2, …, xm. • Compute t(x1, x2, …, xm) • If one of top k answers so far - remember it. • For 1 ≤ i ≤ m, let xi be the bottom value seen in Li (so far) • Define the threshold value τ to be t(x1, x2, …, xm) • Stop when found k objects with t value ≥ τ. • Return top k objects

  20. Example: m=2, k=1, t is min
     List L1 (sorted by x1): c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1
     List L2 (sorted by x2): s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4
     Objects: s = (0.05, 3/4), c = (0.9, 1/12), w = (0.07, 2/3), b = (0.7, 1/11), z = (0.09, 1/2), r = (0.4, 1/8), q = (0.08, 1/4), a = (0.1, 1/13)
     Maintained information, step by step:
     • Top object (so far): c, t(c) = 1/12; then b, t(b) = 1/11; then r, t(r) = 1/8
     • Bottom values: x1 = 0.9, 0.7, 0.4, 0.1; x2 = 3/4, 2/3, 1/2, 1/4
     • Threshold τ = 3/4, 2/3, 0.4, 0.1
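The Threshold Algorithm and the m = 2, k = 1 example can be traced with the short Python sketch below. It is an illustration only, not the authors' code: the name `threshold_algorithm`, the dict-based `db`, and the choice to check the stopping condition once per row of sorted accesses are assumptions of this example. The only state kept across rounds is the top-k buffer; the bottom values are rebuilt each round.

```python
import heapq

def threshold_algorithm(db, t, k):
    """Sketch of TA. db: {name: (x1, ..., xm)}; t: monotone aggregation function."""
    m = len(next(iter(db.values())))
    # Hypothetical stand-in for the subsystems: names sorted by decreasing field i.
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]

    top = []          # min-heap of (score, name): the bounded top-k buffer
    thresholds = []   # threshold value after each round, kept only for tracing
    for depth in range(len(db)):
        bottoms = []
        for i in range(m):
            r = lists[i][depth]               # sorted access on list i
            bottoms.append(db[r][i])          # bottom value seen so far in list i
            if any(name == r for _, name in top):
                continue                      # already remembered
            score = t(db[r])                  # random access to the other fields
            if len(top) < k:
                heapq.heappush(top, (score, r))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, r))
        tau = t(bottoms)                      # threshold from the bottom values
        thresholds.append(tau)
        if len(top) == k and top[0][0] >= tau:
            break                             # k objects with t value >= tau
    return sorted(top, reverse=True), thresholds

EXAMPLE = {
    "s": (0.05, 3/4), "c": (0.9, 1/12), "w": (0.07, 2/3), "b": (0.7, 1/11),
    "z": (0.09, 1/2), "r": (0.4, 1/8), "q": (0.08, 1/4), "a": (0.1, 1/13),
}
result, trace = threshold_algorithm(EXAMPLE, min, 1)
```

Here `result` holds object r with score 1/8, and `trace` holds the thresholds 3/4, 2/3, 0.4, 0.1: the run stops in the fourth round, the first time the best score seen (1/8) reaches the threshold (0.1).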

  21. Correctness of TA For any monotone t and any database D of objects, TA finds the top k objects. Proof: If object z was not seen, then for all 1 ≤ i ≤ m, zi ≤ xi, so t(z1, z2, …, zm) ≤ t(x1, x2, …, xm) = τ.

  22. Implementation of TA Requires only bounded buffers: • Top k objects • Bottom m values x1,x2,…xm

  23. Robustness of TA Approximation: Suppose we want a (1+ε) approximation: for any R returned and R' not returned, t(R') ≤ (1+ε) · t(R). Modified stopping condition: stop when found k objects with t value at least τ/(1+ε). Early stopping: can modify TA so that at any point the user is • Given the current view of the top k list • Given a guarantee about the ε of the approximation
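The modified stopping condition is a one-line change to the exact test. The helper below is a hypothetical sketch (the name `can_stop` and its parameters are not from the slides); with `eps = 0` it reduces to the exact TA condition.

```python
def can_stop(top_scores, tau, k, eps=0.0):
    """Stop once k seen objects have t value at least tau / (1 + eps).

    top_scores: t values of the objects seen so far (any order).
    tau: current threshold value; eps >= 0 is the approximation slack.
    """
    best = sorted(top_scores, reverse=True)[:k]
    return len(best) == k and best[-1] >= tau / (1 + eps)
```

For instance, with made-up numbers: if the best score seen is 0.58 and the threshold is 0.6, the exact condition fails, but with ε = 0.05 the bound drops to 0.6/1.05 ≈ 0.571 and the run may stop.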

  24. Instance Optimality Intuition: Cannot stop any sooner, since the next object to be explored might have the threshold value. But, life is a bit more delicate...

  25. Wild Guesses Wild guesses: random access for a field i of object R that has not been sequentially accessed before • Neither FA nor TA use wild guesses • Subsystem might not allow wild guesses More exotic queries: jth position in ith list...

  26. Instance Optimality - No Wild Guesses Theorem: For any monotone t let • A be the class of algorithms that • correctly find the top k answers for every database with aggregation function t • do not make wild guesses • D be the class of all databases. Then TA is instance optimal over A and D. Optimality ratio is m + m² · cR/cS - best possible!

  27. Proof of Optimality Claim: If TA gets to iteration d, then any (correct) algorithm A' must get to depth d−1. Proof: let Rmax be the top object returned by TA: τ(d) ≤ t(Rmax) < τ(d−1). There exists D' with R' at level d−1, R' = (x1(d−1), x2(d−1), …, xm(d−1)), where A' fails

  28. Do wild guesses help? Aggregation function - min, k=1
     Database:
     object: 1  2  …  n  n+1  n+2  …  2n+1
     x1:     1  1  …  1   1    0   …   0
     x2:     0  0  …  0   1    1   …   1
     L1: 1, 2, …, n, n+1, …, 2n+1
     L2: 2n+1, …, n+1, n, …, 1
     Wild guess: access object n+1 and top elements
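The gap on this database can be checked numerically. The sketch below builds the lists exactly as ordered on the slide and counts how many rows a TA-style run (k = 1, t = min) must read before its best score reaches the threshold. The helper names `wild_guess_database` and `ta_rounds` are assumptions of this example, not part of the original work.

```python
def wild_guess_database(n):
    """Objects 1..2n+1: x1 = 1 for objects 1..n+1, x2 = 1 for objects n+1..2n+1."""
    db = {j: (1 if j <= n + 1 else 0, 1 if j >= n + 1 else 0)
          for j in range(1, 2 * n + 2)}
    l1 = list(range(1, 2 * n + 2))        # 1, 2, ..., 2n+1 (x1 non-increasing)
    l2 = list(range(2 * n + 1, 0, -1))    # 2n+1, ..., 1   (x2 non-increasing)
    return db, [l1, l2]

def ta_rounds(db, lists, t=min):
    """Rows of sorted access a TA-style run needs for k = 1 on the given lists."""
    best = None
    for depth in range(len(lists[0])):
        row = [lst[depth] for lst in lists]           # one sorted access per list
        for r in row:
            score = t(db[r])                          # random access for all fields
            if best is None or score > best:
                best = score
        tau = t([db[r][i] for i, r in enumerate(row)])  # threshold from bottom values
        if best >= tau:
            return depth + 1
    return len(lists[0])
```

With n = 100 the run needs 101 rows (202 sorted accesses), because object n+1 sits in the middle of both lists and the threshold stays at 1 until it is reached. An algorithm that guesses object n+1 directly needs only two random accesses to learn that its score is 1, the largest value min can take on fields in [0,1].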

  29. Strict Monotonicity • An aggregation function t is strictly monotone if, whenever xi < x'i for all 1 ≤ i ≤ m, then t(x1, x2, …, xm) < t(x'1, x'2, …, x'm). Examples: min, max, avg...

  30. Instance Optimality - Wild Guesses Theorem: For any strictly monotone t let • A be the class of algorithms that • correctly find the top k answers for every database • D be the class of all databases with distinct values in each field. Then TA is instance optimal over A and D. Optimality ratio is c · m, where c = max{cR/cS, cS/cR}

  31. Related Work An algorithm similar to TA was discovered independently by two other groups • Nepal and Ramakrishna • Güntzer, Balke and Kiessling. No instance optimality analysis; hence they proposed modifications that are not instance optimal. Power of abstraction?

  32. Dealing with the Cost of Random Access In some scenarios random access may be impossible: cannot ask a major search engine for its internal score on some document. In some scenarios random access may be expensive: cost corresponds to disk access (seq. vs. random). Need algorithms to deal with these scenarios • NRA - No Random Access • CA - Combined Algorithm

  33. No Random Access - NRA March down the lists, getting the next object. Maintain: • For any object R with discovered fields S ⊆ {1, …, m}: • W(R) = t(x1, x2, …, x|S|, 0, …, 0): worst (smallest) value t(R) can obtain • B(R) = t(x1, x2, …, x|S|, x|S|+1, …, xm): best (largest) value t(R) can obtain, with the current bottom values x|S|+1, …, xm filled in for the undiscovered fields

  34. …maintained information (NRA) • Top k list, based on the k largest W(R) seen so far • Ties broken according to B values. Define Mk to be the kth largest W(R) in the top k list • An object R is viable if B(R) > Mk. Stop when there are no viable elements left, i.e. B(R) ≤ Mk for all R ∉ top list. Return the top k list
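The bookkeeping for NRA can be sketched as follows. This is a naive illustration under stated assumptions, not the authors' implementation: the name `nra` and the dict-based `db` are made up, W and B are recomputed for every seen object in every round (no clever data structure), and an object that has not been seen at all is treated as having best value t(bottom values), so it also counts as viable while that value exceeds Mk.

```python
def nra(db, t, k):
    """Sketch of NRA: sorted access only. db: {name: (x1, ..., xm)}."""
    m = len(next(iter(db.values())))
    # Hypothetical stand-in for the subsystems: names sorted by decreasing field i.
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]

    known = {}   # object -> {field index: value} discovered by sorted access
    ranked = []
    for depth in range(1, len(db) + 1):
        bottoms = []
        for i in range(m):
            r = lists[i][depth - 1]                 # sorted access on list i
            known.setdefault(r, {})[i] = db[r][i]
            bottoms.append(db[r][i])

        def worst(r):   # W(R): undiscovered fields replaced by 0
            return t([known[r].get(i, 0.0) for i in range(m)])

        def best(r):    # B(R): undiscovered fields replaced by bottom values
            return t([known[r].get(i, bottoms[i]) for i in range(m)])

        # Top k by W, ties broken by B.
        ranked = sorted(known, key=lambda r: (worst(r), best(r)), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            mk = worst(top[-1])
            # No viable object left, seen or unseen, outside the top k list.
            if t(bottoms) <= mk and all(best(r) <= mk for r in rest):
                return top, depth
    return ranked[:k], len(db)
```

On the slide 20 data with t = min and k = 1, this sketch returns r after 5 rows. With min, W(R) stays 0 until both fields of R are known, which is why it cannot stop as early as TA does.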

  35. Correctness For any monotone t and any database D of objects, NRA finds the top k objects. Proof: At any point, for all objects, t(R) ≤ B(R). Once B(R) ≤ Mk for all but the top list, there are no other objects with t(R) > Mk

  36. Optimality Theorem: For any monotone t let • A be the class of algorithms that • correctly find the top k answers for every database • make only sequential access • D be the class of all databases. Then NRA is instance optimal over A and D. Optimality ratio is m.

  37. Implementation of NRA • Not so simple: need to update B(R) for all existing R when x1, x2, …, xm change • For specific aggregation functions (min) there are good data structures. Open problem: which aggregation functions have good data structures?

  38. Combined Algorithm CA Can combine TA and NRA. Let h = cR/cS. Maintain information as in NRA. For every h sequential accesses: • Do m random accesses on an object from each list. Choose the top viable object for which not all fields are known
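One way to read this slide is sketched below: run NRA-style sorted access, and after every h rows pick the viable object with the largest B(R) whose fields are not all known, then fetch its missing fields by random access. This is an interpretation, not the authors' code. The name `combined_algorithm`, the integer parameter `h` (at least 1), the per-row counting of "h sequential accesses", and the tie-breaking are all assumptions of this example.

```python
def combined_algorithm(db, t, k, h):
    """Sketch of CA: NRA bookkeeping plus one random-access step every h rows."""
    m = len(next(iter(db.values())))
    # Hypothetical stand-in for the subsystems: names sorted by decreasing field i.
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]
    known, random_accesses, top = {}, 0, []

    for depth in range(1, len(db) + 1):
        bottoms = []
        for i in range(m):
            r = lists[i][depth - 1]                  # sorted access on list i
            known.setdefault(r, {})[i] = db[r][i]
            bottoms.append(db[r][i])

        def worst(r):   # W(R): undiscovered fields replaced by 0
            return t([known[r].get(i, 0.0) for i in range(m)])

        def best(r):    # B(R): undiscovered fields replaced by bottom values
            return t([known[r].get(i, bottoms[i]) for i in range(m)])

        def split():    # current top k list (by W, ties by B) and the rest
            ranked = sorted(known, key=lambda r: (worst(r), best(r)), reverse=True)
            return ranked[:k], ranked[k:]

        if depth % h == 0:
            top, _ = split()
            mk = worst(top[-1]) if len(top) == k else float("-inf")
            viable = [r for r in known if len(known[r]) < m and best(r) > mk]
            if viable:
                target = max(viable, key=best)       # top viable, fields missing
                for i in range(m):
                    if i not in known[target]:
                        known[target][i] = db[target][i]   # random access
                        random_accesses += 1

        top, rest = split()
        if len(top) == k:
            mk = worst(top[-1])
            if t(bottoms) <= mk and all(best(r) <= mk for r in rest):
                break
    return top, depth, random_accesses
```

On the slide 20 data with t = min, k = 1 and h = 2, this sketch returns r after 5 rows and 2 random accesses. The example is too small to show a saving over NRA; it only illustrates where the random accesses are spent.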

  39. Instance Optimality Instance optimality statement a bit more complex Under certain assumptions (including t = min, sum) CA is instance optimal with optimality ratio ~ 2m

  40. Further Research • Middleware scenario: • Better implementations of NRA • Is large storage essential? • Additional useful information in each list? • How widely applicable is instance optimality? • String matching, stable marriage... • Aggregation functions and methods in other scenarios • Rank aggregation of search engines • P=NP?

  41. More Details See www.wisdom.weizmann.ac.il/~naor/PAPERS/middle_agg.html
