Explore aggregation algorithms and instance optimality for combining information from multiple sources in databases. Discuss problem definition, evaluation methods, new algorithms, and future research directions.
Aggregation Algorithms and Instance Optimality
Moni Naor, Weizmann Institute
Joint work with Ron Fagin and Amnon Lotem
Aggregating information from several lists/sources
• Define the problem
• Ways to evaluate algorithms
• New algorithms
• Further research
The problem
• Database D of N objects
• An object R has m fields: (x1, x2, …, xm)
• Each xi ∈ [0,1]
• The objects are given in m lists L1, L2, …, Lm
• List Li: all objects, sorted by xi value
• An aggregation function t(x1, x2, …, xm): a monotone increasing function
Wanted: the top k objects according to t
Goal
• Touch as few objects as possible
• Access to object?
List L1: c1 = 0.9, b1 = 0.8, s1 = 0.65, r1 = 0.5, a1 = 0.4
List L2: s2 = 0.85, a2 = 0.84, r2 = 0.75, b2 = 0.3, c2 = 0.2
Where?
The problem arises when combining information from several sources/criteria.
Concentrate on middleware complexity, without changing the subsystems.
Example: Combining Fuzzy Information
Lists are the results of the query: "find objects with color 'red' and shape 'round'"
• Subsystems for color and for shape
• Each returns a score in [0,1] for each object
• The aggregation function t is how the middleware system should combine the two criteria
• Example: t(R = (x1, x2)) could be min(x1, x2)
Example: scheduling pages
Each object is a page in a data broadcast system
• 1st field: number of users requesting the page
• 2nd field: longest time a user has been waiting
Combining function t: the product of the two fields (equivalently, the geometric mean, which induces the same order)
Goal: find the page with the largest product
Example: Information Retrieval
• Objects: documents D1, D2, …, Dk
• Fields: terms T1, T2, …, Tn, with entry Wij scoring term Tj in document Di
Query T1, T2, T3: find the documents with the largest sum of entries
Aggregation function t is Σ xi
Modes of Access to the Lists
• Sequential/sorted access: obtain the next object in list Li, at cost cS
• Random access: for object R and i ≤ m, obtain xi, at cost cR
Cost of an execution: cS · (# of sequential accesses) + cR · (# of random accesses)
Interesting Cases
• cR/cS is small (cS ≈ cR), or
• cR >> cS
Number of lists m: small
Fagin's Algorithm - FA
• For all lists L1, L2, …, Lm, get the next object in sorted order
• Stop when there is a set of k objects that appeared in all lists
• For every object R encountered:
  • retrieve all fields x1, x2, …, xm
  • compute t(x1, x2, …, xm)
• Return the top k objects
Correctness of FA
For any monotone t and any database D of objects, FA finds the top k objects.
Proof: let S be one of the k objects seen in all lists. Any object R never encountered lies below S in every list, so xi(R) ≤ xi(S) for all i, and by monotonicity t(R) ≤ t(S). Hence the true top k are among the encountered objects.
Performance of FA
Performance: assuming the fields are independent, the number of objects accessed is O(N^((m-1)/m)) (for constant k).
Better performance: positive correlation between fields
Worse performance: negative correlation
Bad aggregating function: max
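FA as described above can be sketched in a few lines of Python. This is a minimal sketch, not from the slides: it assumes each list is given as (object, score) pairs sorted by decreasing score, and simulates random access with a per-list dict; the function and variable names are mine.

```python
def fagin(lists, t, k):
    """Fagin's Algorithm (FA), sketched for lists given as
    (object, score) pairs sorted by decreasing score.
    t: monotone aggregation function on an m-tuple of scores."""
    m = len(lists)
    by_object = [dict(L) for L in lists]      # simulated random access
    seen = [set() for _ in range(m)]          # objects met by sorted access, per list
    encountered = set()
    for depth in range(len(lists[0])):
        for i in range(m):
            obj, _ = lists[i][depth]          # sorted access on list i
            seen[i].add(obj)
            encountered.add(obj)
        # stop once some k objects have appeared in all m lists
        if len(set.intersection(*seen)) >= k:
            break
    # random access: fetch every field of each encountered object, aggregate, rank
    scored = sorted(((t(tuple(by_object[i][o] for i in range(m))), o)
                     for o in encountered), reverse=True)
    return [o for _, o in scored[:k]]
```

With t = min, `fagin([L1, L2], min, 1)` descends both lists in lockstep until one object has shown up in both, then resolves everything it has touched by random access.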
Goals of this work
• Improve complexity and analysis: the worst case is not meaningful; instead, consider instance optimality
• Expand the range of functions: we want to handle all monotone aggregation functions
• Simplify implementation
Instance Optimality
A = class of algorithms, D = class of legal inputs.
For A ∈ A and D ∈ D, measure cost(A, D) ≥ 0.
An algorithm A ∈ A is instance optimal over A and D if there are constants c1 and c2 s.t. for every A' ∈ A and D ∈ D:
cost(A, D) ≤ c1 · cost(A', D) + c2
c1 is called the optimality ratio.
…Instance Optimality
• Common in competitive online analysis
  • compare an online decision-making algorithm to the best offline one
• Approximation algorithms
  • compare the size of the solution the best algorithm can find to the one the approximation algorithm finds
In our case, the role of the offline algorithm is played by nondeterminism.
…Instance Optimality
We show algorithms that are instance optimal for a variety of
• classes of algorithms: deterministic, probabilistic, approximate
• databases
• access cost functions
Guidelines for Design of Algorithms
• Format: do sequential/sorted access (with random access on the other fields) until you know that you have seen the top k
• In general: greedy gathering of information; if a query might allow you to know the top k objects, do it
Works in all considered scenarios
The Threshold Algorithm - TA
• For all lists L1, L2, …, Lm, get the next object in sorted order
• For each object R returned:
  • retrieve all fields x1, x2, …, xm
  • compute t(x1, x2, …, xm)
  • if it is one of the top k answers so far, remember it
• For 1 ≤ i ≤ m, let xi be the bottom value seen in Li (so far)
• Define the threshold value τ = t(x1, x2, …, xm) on the bottom values
• Stop when k objects have been found with t value ≥ τ
• Return the top k objects
Example: m = 2, k = 1, t is min
List L1 (sorted): c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1
List L2 (sorted): s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4
Objects: s = (0.05, 3/4), c = (0.9, 1/12), w = (0.07, 2/3), b = (0.7, 1/11), z = (0.09, 1/2), r = (0.4, 1/8), q = (0.08, 1/4), a = (0.1, 1/13)
Maintained information, step by step:
• Top object (so far): c, t(c) = 1/12 → b, t(b) = 1/11 → r, t(r) = 1/8
• Bottom values (x1, x2): (0.9, 3/4) → (0.7, 2/3) → (0.4, 1/2) → (0.1, 1/4)
• Threshold τ: 3/4 → 2/3 → 2/5 → 1/10; TA stops once t(r) = 1/8 ≥ 1/10
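TA can be sketched under the same assumed data layout as before (sorted (object, score) lists, random access simulated by a dict); the names are mine, not from the slides. Running it on the example above returns r.

```python
def threshold_algorithm(lists, t, k):
    """Threshold Algorithm (TA) sketch: sorted access in parallel,
    random access for the remaining fields of each new object,
    stop when k objects reach the threshold value."""
    m = len(lists)
    by_object = [dict(L) for L in lists]      # simulated random access
    top, seen = [], set()                     # top k (value, object) pairs
    for depth in range(len(lists[0])):
        bottom = []                           # x_i: lowest score seen in each list
        for i in range(m):
            obj, score = lists[i][depth]      # sorted access
            bottom.append(score)
            if obj not in seen:
                seen.add(obj)
                fields = tuple(by_object[j][obj] for j in range(m))  # random access
                top.append((t(fields), obj))
        top.sort(reverse=True)
        del top[k:]                           # keep only the k best so far
        threshold = t(tuple(bottom))          # tau = t(bottom values)
        if len(top) == k and top[-1][0] >= threshold:
            break                             # k objects at or above tau
    return [obj for _, obj in top]
```

Note the bounded state: only the current top k pairs and the m bottom values are kept, exactly as the implementation slide promises.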
Correctness of TA
For any monotone t and any database D of objects, TA finds the top k objects.
Proof: if object z was not seen, then zi ≤ xi (the bottom values) for all 1 ≤ i ≤ m, so t(z1, z2, …, zm) ≤ t(x1, x2, …, xm) = τ.
Implementation of TA
Requires only bounded buffers:
• the top k objects
• the bottom m values x1, x2, …, xm
Robustness of TA
Approximation: suppose we want a (1+θ) approximation, i.e. for any R returned and R' not returned, t(R') ≤ (1+θ) · t(R)
Modified stopping condition: stop when k objects have been found with t value at least τ/(1+θ)
Early stopping: TA can be modified so that at any point the user is
• given the current view of the top k list
• given a guarantee about the approximation
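The modified stopping condition is a one-line change to the TA sketch above. The following self-contained variant (names again mine) takes θ as a parameter; θ = 0 recovers exact TA, and a larger θ trades accuracy for earlier stopping.

```python
def ta_approx(lists, t, k, theta):
    """TA with the relaxed stopping condition for a (1+theta)-approximation:
    stop once k objects have aggregate value >= threshold / (1 + theta)."""
    m = len(lists)
    by_object = [dict(L) for L in lists]      # simulated random access
    top, seen = [], set()
    for depth in range(len(lists[0])):
        bottom = []
        for i in range(m):
            obj, score = lists[i][depth]      # sorted access
            bottom.append(score)
            if obj not in seen:
                seen.add(obj)
                top.append((t(tuple(by_object[j][obj] for j in range(m))), obj))
        top.sort(reverse=True)
        del top[k:]
        threshold = t(tuple(bottom))
        # relaxed stop: guarantees t(R') <= (1 + theta) * t(R) for any
        # returned R and unreturned R'
        if len(top) == k and top[-1][0] * (1 + theta) >= threshold:
            break
    return [obj for _, obj in top]
```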
Instance Optimality
Intuition: TA cannot stop any sooner, since the next object to be explored might have the threshold value. But life is a bit more delicate...
Wild Guesses
Wild guess: a random access to field i of an object R that has not been sequentially accessed before
• Neither FA nor TA uses wild guesses
• A subsystem might not allow wild guesses
More exotic queries: the jth position in the ith list...
Instance Optimality - No Wild Guesses
Theorem: for any monotone t, let
• A be the class of algorithms that
  • correctly find the top k answers for every database with aggregation function t
  • do not make wild guesses
• D be the class of all databases
Then TA is instance optimal over A and D.
Optimality ratio is m + m^2 · cR/cS - best possible!
Proof of Optimality
Claim: if TA reaches depth d, then any (correct) algorithm A' must reach depth d-1.
Proof: let Rmax be the top object returned by TA; then τ(d) ≤ t(Rmax) ≤ τ(d-1). If A' stops before depth d-1, there exists a database D', consistent with everything A' has seen, containing an unseen object R' = (x1(d-1), x2(d-1), …, xm(d-1)) at depth d-1 with t(R') = τ(d-1) ≥ t(Rmax), on which A' fails.
Do wild guesses help?
Aggregation function: min, k = 1
Database of 2n+1 objects 1, 2, …, n, n+1, …, 2n+1:
• field 1: objects 1, …, n+1 have value 1; objects n+2, …, 2n+1 have value 0
• field 2: objects 1, …, n have value 0; objects n+1, …, 2n+1 have value 1
L1: 1, 2, …, n, n+1, …, 2n+1
L2: 2n+1, …, n+1, n, …, 1
Only object n+1 scores 1 in both fields. A wild guess can access object n+1 (and the top elements) directly; without wild guesses, sorted access must descend to depth about n.
Strict Monotonicity
An aggregation function t is strictly monotone if whenever xi < x'i for all 1 ≤ i ≤ m, then t(x1, x2, …, xm) < t(x'1, x'2, …, x'm)
Examples: min, max, avg...
Instance Optimality - Wild Guesses
Theorem: for any strictly monotone t, let
• A be the class of algorithms that correctly find the top k answers for every database
• D be the class of all databases with distinct values in each field
Then TA is instance optimal over A and D.
Optimality ratio is c · m, where c = max{cR/cS, cS/cR}
Related Work
An algorithm similar to TA was discovered independently by two other groups:
• Nepal and Ramakrishna
• Güntzer, Balke and Kiessling
Neither gave an instance optimality analysis, and hence both proposed modifications that are not instance optimal.
The power of abstraction?
Dealing with the Cost of Random Access
In some scenarios random access may be impossible:
• cannot ask a major search engine for its internal score on some document
In some scenarios random access may be expensive:
• cost corresponds to disk access (sequential vs. random)
Need algorithms for these scenarios:
• NRA - No Random Access
• CA - Combined Algorithm
No Random Access - NRA
March down the lists, getting the next object in each.
Maintain, for any object R whose discovered set of fields S ⊆ {1, …, m} has values x1, …, x|S|:
• W(R) = t(x1, …, x|S|, 0, …, 0): the worst (smallest) value t(R) can obtain
• B(R) = t(x1, …, x|S|, x|S|+1, …, xm), with each undiscovered field filled by the bottom value seen so far in its list: the best (largest) value t(R) can obtain
…Maintained information (NRA)
• Top k list, based on the k largest W(R) seen so far; ties broken according to B values
• Define Mk to be the kth largest W(R) in the top k list
• An object R is viable if B(R) > Mk
• Stop when there are no viable elements left, i.e. B(R) ≤ Mk for all R not in the top k list
• Return the top k list
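The W/B bookkeeping and the viability test can be sketched as below. This is a minimal, unoptimized sketch under the same assumed data layout as the earlier snippets (sorted (object, score) lists); it recomputes W and B from scratch each round rather than maintaining them incrementally, and additionally checks that no completely unseen object, whose best value is t of the bottom values, is still viable.

```python
def nra(lists, t, k):
    """No Random Access (NRA) sketch: sorted access only.  For each
    partially seen object keep W(R) (unknown fields as 0) and B(R)
    (unknown fields as the bottom value of their list); stop when
    nothing outside the current top k is still viable."""
    m = len(lists)
    known = {}                 # object -> {list index: score}
    bottom = [None] * m        # x_i: lowest score seen in each list

    def W(o):                  # worst possible value of t(o)
        return t(tuple(known[o].get(i, 0.0) for i in range(m)))

    def B(o):                  # best possible value of t(o)
        return t(tuple(known[o].get(i, bottom[i]) for i in range(m)))

    for depth in range(len(lists[0])):
        for i in range(m):
            obj, score = lists[i][depth]    # sorted access
            bottom[i] = score
            known.setdefault(obj, {})[i] = score
        ranked = sorted(known, key=lambda o: (W(o), B(o)), reverse=True)
        top_k, rest = ranked[:k], ranked[k:]
        if len(top_k) < k:
            continue
        Mk = W(top_k[-1])      # k-th largest worst value
        # no seen object outside top_k is viable, and no unseen object
        # (whose best value is t(bottom)) can beat Mk either
        if t(tuple(bottom)) <= Mk and all(B(o) <= Mk for o in rest):
            return top_k
    return sorted(known, key=W, reverse=True)[:k]
```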
Correctness
For any monotone t and any database D of objects, NRA finds the top k objects.
Proof: at any point, t(R) ≤ B(R) for all objects. Once B(R) ≤ Mk for all R outside the top list, no such object can have t(R) > Mk, while every object in the top list has t(R) ≥ W(R) ≥ Mk.
Optimality
Theorem: for any monotone t, let
• A be the class of algorithms that
  • correctly find the top k answers for every database
  • make only sequential accesses
• D be the class of all databases
Then NRA is instance optimal over A and D.
Optimality ratio is m.
Implementation of NRA
• Not so simple: B(R) must be updated for all existing R whenever the bottom values x1, x2, …, xm change
• For specific aggregation functions (e.g. min) there are good data structures
Open problem: which aggregation functions have good data structures?
Combined Algorithm - CA
Can combine TA and NRA. Let h = cR/cS.
Maintain information as in NRA.
For every h sequential accesses:
• do random accesses to retrieve the missing fields of one object
• choose the top viable object for which not all fields are known
Instance Optimality of CA
The instance optimality statement is a bit more complex.
Under certain assumptions (including t = min or sum), CA is instance optimal, with optimality ratio ~ 2m.
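A sketch of CA, reusing the NRA bookkeeping: after every h sorted accesses down the lists, it spends random accesses completing the fields of one promising object. This is an assumption-laden simplification, not the slides' exact rule: it picks the partially known object with the highest best value B, and the data layout and names are the same invented ones used in the earlier snippets.

```python
def combined_algorithm(lists, t, k, h):
    """Combined Algorithm (CA) sketch: NRA-style W/B bookkeeping, plus a
    random-access round after every h sorted accesses (h ~ cR/cS).
    Simplified rule: complete the partially known object with highest B."""
    m = len(lists)
    by_object = [dict(L) for L in lists]   # simulated random access
    known = {}                             # object -> {list index: score}
    bottom = [None] * m

    def W(o):                              # worst possible value of t(o)
        return t(tuple(known[o].get(i, 0.0) for i in range(m)))

    def B(o):                              # best possible value of t(o)
        return t(tuple(known[o].get(i, bottom[i]) for i in range(m)))

    for depth in range(len(lists[0])):
        for i in range(m):
            obj, score = lists[i][depth]   # sorted access
            bottom[i] = score
            known.setdefault(obj, {})[i] = score
        if depth % h == h - 1:
            # random-access round: fill in all missing fields of the
            # most promising partially known object
            partial = [o for o in known if len(known[o]) < m]
            if partial:
                target = max(partial, key=B)
                for i in range(m):
                    known[target].setdefault(i, by_object[i][target])
        ranked = sorted(known, key=lambda o: (W(o), B(o)), reverse=True)
        top_k, rest = ranked[:k], ranked[k:]
        if len(top_k) == k:
            Mk = W(top_k[-1])
            if t(tuple(bottom)) <= Mk and all(B(o) <= Mk for o in rest):
                return top_k
    return sorted(known, key=W, reverse=True)[:k]
```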
Further Research
• Middleware scenario:
  • better implementations of NRA
  • is large storage essential?
  • additional useful information in each list?
• How widely applicable is instance optimality?
  • string matching, stable marriage...
• Aggregation functions and methods in other scenarios
  • rank aggregation of search engines
• P = NP?
More Details See www.wisdom.weizmann.ac.il/~naor/PAPERS/middle_agg.html