Explore aggregation algorithms and instance optimality for combining information from multiple sources in databases. Discuss problem definition, evaluation methods, new algorithms, and future research directions.
Aggregation Algorithms and Instance Optimality
Moni Naor, Weizmann Institute
Joint work with Ron Fagin and Amnon Lotem
Aggregating information from several lists/sources
• Define the problem
• Ways to evaluate algorithms
• New algorithms
• Further research
The problem
• Database D of N objects
• An object R has m fields: (x1, x2, …, xm)
• Each xi ∈ [0,1]
• The objects are given in m lists L1, L2, …, Lm
• List Li: all objects, sorted by xi value
• An aggregation function t(x1, x2, …, xm): a monotone increasing function
Wanted: the top k objects according to t
Goal
• Touch as few objects as possible
• Access to object?
List L1: c1 = 0.9, b1 = 0.8, s1 = 0.65, r1 = 0.5, a1 = 0.4
List L2: s2 = 0.85, a2 = 0.84, r2 = 0.75, b2 = 0.3, c2 = 0.2
Where?
The problem arises when combining information from several sources/criteria.
Concentrate on middleware complexity, without changing the subsystems.
Example: Combining Fuzzy Information
Lists are the results of the query: "find objects with color 'red' and shape 'round'"
• Subsystems for color and for shape
• Each returns a score in [0,1] for each object
• The aggregation function t is how the middleware system should combine the two criteria
• Example: t(R = (x1, x2)) could be min(x1, x2)
Example: scheduling pages
Each object is a page in a data broadcast system
• 1st field: number of users requesting the page
• 2nd field: longest time a user has been waiting
Combining function t: the product of the two fields (equivalently, the geometric mean, which induces the same order)
Goal: find the page with the largest product
Example: Information Retrieval
• Objects: documents D1, D2, …, Dk
• Fields: terms T1, T2, …, Tn, with entry Wij scoring term Tj in document Di
Query T1, T2, T3: find the documents with the largest sum of entries
Aggregation function t is Σ xi
Modes of Access to the Lists
• Sequential/sorted access: obtain the next object in list Li, at cost cS
• Random access: for object R and i ≤ m, obtain xi, at cost cR
Cost of an execution: cS · (# of sequential accesses) + cR · (# of random accesses)
Interesting Cases
• cR/cS is small (cS ≈ cR), or
• cR >> cS
Number of lists m: small
Fagin's Algorithm - FA
• For all lists L1, L2, …, Lm, get the next object in sorted order
• Stop when there is a set of k objects that appeared in all lists
• For every object R encountered:
  • retrieve all fields x1, x2, …, xm
  • compute t(x1, x2, …, xm)
• Return the top k objects
Correctness of FA
For any monotone t and any database D of objects, FA finds the top k objects.
Proof: let S be one of the k objects seen in all lists. Any object R never encountered lies below S in every list, so xi(R) ≤ xi(S) for all i, and by monotonicity t(R) ≤ t(S). Hence the true top k are among the encountered objects.
Performance of FA
Performance: assuming the fields are independent, the number of objects accessed is O(N^((m-1)/m)) (for constant k).
Better performance: positive correlation between fields
Worse performance: negative correlation
Bad aggregating function: max
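FA as described above can be sketched in a few lines of Python. This is a minimal sketch, not from the slides: it assumes each list is given as (object, score) pairs sorted by decreasing score, and simulates random access with a per-list dict; the function and variable names are mine.

```python
def fagin(lists, t, k):
    """Fagin's Algorithm (FA), sketched for lists given as
    (object, score) pairs sorted by decreasing score.
    t: monotone aggregation function on an m-tuple of scores."""
    m = len(lists)
    by_object = [dict(L) for L in lists]      # simulated random access
    seen = [set() for _ in range(m)]          # objects met by sorted access, per list
    encountered = set()
    for depth in range(len(lists[0])):
        for i in range(m):
            obj, _ = lists[i][depth]          # sorted access on list i
            seen[i].add(obj)
            encountered.add(obj)
        # stop once some k objects have appeared in all m lists
        if len(set.intersection(*seen)) >= k:
            break
    # random access: fetch every field of each encountered object, aggregate, rank
    scored = sorted(((t(tuple(by_object[i][o] for i in range(m))), o)
                     for o in encountered), reverse=True)
    return [o for _, o in scored[:k]]
```

With t = min, `fagin([L1, L2], min, 1)` descends both lists in lockstep until one object has shown up in both, then resolves everything it has touched by random access.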
Goals of this work
• Improve complexity and analysis: the worst case is not meaningful; instead, consider instance optimality
• Expand the range of functions: we want to handle all monotone aggregation functions
• Simplify implementation
Instance Optimality
A = class of algorithms, D = class of legal inputs.
For A ∈ A and D ∈ D, measure cost(A, D) ≥ 0.
An algorithm A ∈ A is instance optimal over A and D if there are constants c1 and c2 s.t. for every A' ∈ A and D ∈ D:
cost(A, D) ≤ c1 · cost(A', D) + c2
c1 is called the optimality ratio.
…Instance Optimality
• Common in competitive online analysis
  • compare an online decision-making algorithm to the best offline one
• Approximation algorithms
  • compare the size of the solution the best algorithm can find to the one the approximation algorithm finds
In our case, the role of the offline algorithm is played by nondeterminism.
…Instance Optimality
We show algorithms that are instance optimal for a variety of
• classes of algorithms: deterministic, probabilistic, approximate
• databases
• access cost functions
Guidelines for Design of Algorithms
• Format: do sequential/sorted access (with random access on the other fields) until you know that you have seen the top k
• In general: greedy gathering of information; if a query might allow you to know the top k objects, do it
Works in all considered scenarios
The Threshold Algorithm - TA
• For all lists L1, L2, …, Lm, get the next object in sorted order
• For each object R returned:
  • retrieve all fields x1, x2, …, xm
  • compute t(x1, x2, …, xm)
  • if it is one of the top k answers so far, remember it
• For 1 ≤ i ≤ m, let xi be the bottom value seen in Li (so far)
• Define the threshold value τ = t(x1, x2, …, xm) on the bottom values
• Stop when k objects have been found with t value ≥ τ
• Return the top k objects
Example: m = 2, k = 1, t is min
List L1 (sorted): c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1
List L2 (sorted): s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4
Objects: s = (0.05, 3/4), c = (0.9, 1/12), w = (0.07, 2/3), b = (0.7, 1/11), z = (0.09, 1/2), r = (0.4, 1/8), q = (0.08, 1/4), a = (0.1, 1/13)
Maintained information, step by step:
• Top object (so far): c, t(c) = 1/12 → b, t(b) = 1/11 → r, t(r) = 1/8
• Bottom values (x1, x2): (0.9, 3/4) → (0.7, 2/3) → (0.4, 1/2) → (0.1, 1/4)
• Threshold τ: 3/4 → 2/3 → 2/5 → 1/10; TA stops once t(r) = 1/8 ≥ 1/10
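TA can be sketched under the same assumed data layout as before (sorted (object, score) lists, random access simulated by a dict); the names are mine, not from the slides. Running it on the example above returns r.

```python
def threshold_algorithm(lists, t, k):
    """Threshold Algorithm (TA) sketch: sorted access in parallel,
    random access for the remaining fields of each new object,
    stop when k objects reach the threshold value."""
    m = len(lists)
    by_object = [dict(L) for L in lists]      # simulated random access
    top, seen = [], set()                     # top k (value, object) pairs
    for depth in range(len(lists[0])):
        bottom = []                           # x_i: lowest score seen in each list
        for i in range(m):
            obj, score = lists[i][depth]      # sorted access
            bottom.append(score)
            if obj not in seen:
                seen.add(obj)
                fields = tuple(by_object[j][obj] for j in range(m))  # random access
                top.append((t(fields), obj))
        top.sort(reverse=True)
        del top[k:]                           # keep only the k best so far
        threshold = t(tuple(bottom))          # tau = t(bottom values)
        if len(top) == k and top[-1][0] >= threshold:
            break                             # k objects at or above tau
    return [obj for _, obj in top]
```

Note the bounded state: only the current top k pairs and the m bottom values are kept, exactly as the implementation slide promises.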
Correctness of TA
For any monotone t and any database D of objects, TA finds the top k objects.
Proof: if object z was not seen, then zi ≤ xi (the bottom values) for all 1 ≤ i ≤ m, so t(z1, z2, …, zm) ≤ t(x1, x2, …, xm) = τ.
Implementation of TA
Requires only bounded buffers:
• the top k objects
• the bottom m values x1, x2, …, xm
Robustness of TA
Approximation: suppose we want a (1+θ) approximation, i.e. for any R returned and R' not returned, t(R') ≤ (1+θ) · t(R)
Modified stopping condition: stop when k objects have been found with t value at least τ/(1+θ)
Early stopping: TA can be modified so that at any point the user is
• given the current view of the top k list
• given a guarantee about the approximation
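The modified stopping condition is a one-line change to the TA sketch above. The following self-contained variant (names again mine) takes θ as a parameter; θ = 0 recovers exact TA, and a larger θ trades accuracy for earlier stopping.

```python
def ta_approx(lists, t, k, theta):
    """TA with the relaxed stopping condition for a (1+theta)-approximation:
    stop once k objects have aggregate value >= threshold / (1 + theta)."""
    m = len(lists)
    by_object = [dict(L) for L in lists]      # simulated random access
    top, seen = [], set()
    for depth in range(len(lists[0])):
        bottom = []
        for i in range(m):
            obj, score = lists[i][depth]      # sorted access
            bottom.append(score)
            if obj not in seen:
                seen.add(obj)
                top.append((t(tuple(by_object[j][obj] for j in range(m))), obj))
        top.sort(reverse=True)
        del top[k:]
        threshold = t(tuple(bottom))
        # relaxed stop: guarantees t(R') <= (1 + theta) * t(R) for any
        # returned R and unreturned R'
        if len(top) == k and top[-1][0] * (1 + theta) >= threshold:
            break
    return [obj for _, obj in top]
```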
Instance Optimality
Intuition: TA cannot stop any sooner, since the next object to be explored might have the threshold value. But life is a bit more delicate...
Wild Guesses
Wild guess: a random access to field i of an object R that has not been sequentially accessed before
• Neither FA nor TA uses wild guesses
• A subsystem might not allow wild guesses
More exotic queries: the jth position in the ith list...
Instance Optimality - No Wild Guesses
Theorem: for any monotone t, let
• A be the class of algorithms that
  • correctly find the top k answers for every database with aggregation function t
  • do not make wild guesses
• D be the class of all databases
Then TA is instance optimal over A and D.
Optimality ratio is m + m^2 · cR/cS - best possible!
Proof of Optimality
Claim: if TA reaches depth d, then any (correct) algorithm A' must reach depth d-1.
Proof: let Rmax be the top object returned by TA; then τ(d) ≤ t(Rmax) ≤ τ(d-1). If A' stops before depth d-1, there exists a database D', consistent with everything A' has seen, containing an unseen object R' = (x1(d-1), x2(d-1), …, xm(d-1)) at depth d-1 with t(R') = τ(d-1) ≥ t(Rmax), on which A' fails.
Do wild guesses help?
Aggregation function: min, k = 1
Database of 2n+1 objects 1, 2, …, n, n+1, …, 2n+1:
• field 1: objects 1, …, n+1 have value 1; objects n+2, …, 2n+1 have value 0
• field 2: objects 1, …, n have value 0; objects n+1, …, 2n+1 have value 1
L1: 1, 2, …, n, n+1, …, 2n+1
L2: 2n+1, …, n+1, n, …, 1
Only object n+1 scores 1 in both fields. A wild guess can access object n+1 (and the top elements) directly; without wild guesses, sorted access must descend to depth about n.
Strict Monotonicity
An aggregation function t is strictly monotone if whenever xi < x'i for all 1 ≤ i ≤ m, then t(x1, x2, …, xm) < t(x'1, x'2, …, x'm)
Examples: min, max, avg...
Instance Optimality - Wild Guesses
Theorem: for any strictly monotone t, let
• A be the class of algorithms that correctly find the top k answers for every database
• D be the class of all databases with distinct values in each field
Then TA is instance optimal over A and D.
Optimality ratio is c · m, where c = max{cR/cS, cS/cR}
Related Work
An algorithm similar to TA was discovered independently by two other groups:
• Nepal and Ramakrishna
• Güntzer, Balke and Kiessling
Neither gave an instance optimality analysis, and hence both proposed modifications that are not instance optimal.
The power of abstraction?
Dealing with the Cost of Random Access
In some scenarios random access may be impossible:
• cannot ask a major search engine for its internal score on some document
In some scenarios random access may be expensive:
• cost corresponds to disk access (sequential vs. random)
Need algorithms for these scenarios:
• NRA - No Random Access
• CA - Combined Algorithm
No Random Access - NRA
March down the lists, getting the next object in each.
Maintain, for any object R whose discovered set of fields S ⊆ {1, …, m} has values x1, …, x|S|:
• W(R) = t(x1, …, x|S|, 0, …, 0): the worst (smallest) value t(R) can obtain
• B(R) = t(x1, …, x|S|, x|S|+1, …, xm), with each undiscovered field filled by the bottom value seen so far in its list: the best (largest) value t(R) can obtain
…Maintained information (NRA)
• Top k list, based on the k largest W(R) seen so far; ties broken according to B values
• Define Mk to be the kth largest W(R) in the top k list
• An object R is viable if B(R) > Mk
• Stop when there are no viable elements left, i.e. B(R) ≤ Mk for all R not in the top k list
• Return the top k list
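The W/B bookkeeping and the viability test can be sketched as below. This is a minimal, unoptimized sketch under the same assumed data layout as the earlier snippets (sorted (object, score) lists); it recomputes W and B from scratch each round rather than maintaining them incrementally, and additionally checks that no completely unseen object, whose best value is t of the bottom values, is still viable.

```python
def nra(lists, t, k):
    """No Random Access (NRA) sketch: sorted access only.  For each
    partially seen object keep W(R) (unknown fields as 0) and B(R)
    (unknown fields as the bottom value of their list); stop when
    nothing outside the current top k is still viable."""
    m = len(lists)
    known = {}                 # object -> {list index: score}
    bottom = [None] * m        # x_i: lowest score seen in each list

    def W(o):                  # worst possible value of t(o)
        return t(tuple(known[o].get(i, 0.0) for i in range(m)))

    def B(o):                  # best possible value of t(o)
        return t(tuple(known[o].get(i, bottom[i]) for i in range(m)))

    for depth in range(len(lists[0])):
        for i in range(m):
            obj, score = lists[i][depth]    # sorted access
            bottom[i] = score
            known.setdefault(obj, {})[i] = score
        ranked = sorted(known, key=lambda o: (W(o), B(o)), reverse=True)
        top_k, rest = ranked[:k], ranked[k:]
        if len(top_k) < k:
            continue
        Mk = W(top_k[-1])      # k-th largest worst value
        # no seen object outside top_k is viable, and no unseen object
        # (whose best value is t(bottom)) can beat Mk either
        if t(tuple(bottom)) <= Mk and all(B(o) <= Mk for o in rest):
            return top_k
    return sorted(known, key=W, reverse=True)[:k]
```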
Correctness
For any monotone t and any database D of objects, NRA finds the top k objects.
Proof: at any point, t(R) ≤ B(R) for all objects. Once B(R) ≤ Mk for all R outside the top list, no such object can have t(R) > Mk, while every object in the top list has t(R) ≥ W(R) ≥ Mk.
Optimality
Theorem: for any monotone t, let
• A be the class of algorithms that
  • correctly find the top k answers for every database
  • make only sequential accesses
• D be the class of all databases
Then NRA is instance optimal over A and D.
Optimality ratio is m.
Implementation of NRA
• Not so simple: B(R) must be updated for all existing R whenever the bottom values x1, x2, …, xm change
• For specific aggregation functions (e.g. min) there are good data structures
Open problem: which aggregation functions have good data structures?
Combined Algorithm - CA
Can combine TA and NRA. Let h = cR/cS.
Maintain information as in NRA.
For every h sequential accesses:
• do random accesses to retrieve the missing fields of one object
• choose the top viable object for which not all fields are known
Instance Optimality of CA
The instance optimality statement is a bit more complex.
Under certain assumptions (including t = min or sum), CA is instance optimal, with optimality ratio ~ 2m.
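A sketch of CA, reusing the NRA bookkeeping: after every h sorted accesses down the lists, it spends random accesses completing the fields of one promising object. This is an assumption-laden simplification, not the slides' exact rule: it picks the partially known object with the highest best value B, and the data layout and names are the same invented ones used in the earlier snippets.

```python
def combined_algorithm(lists, t, k, h):
    """Combined Algorithm (CA) sketch: NRA-style W/B bookkeeping, plus a
    random-access round after every h sorted accesses (h ~ cR/cS).
    Simplified rule: complete the partially known object with highest B."""
    m = len(lists)
    by_object = [dict(L) for L in lists]   # simulated random access
    known = {}                             # object -> {list index: score}
    bottom = [None] * m

    def W(o):                              # worst possible value of t(o)
        return t(tuple(known[o].get(i, 0.0) for i in range(m)))

    def B(o):                              # best possible value of t(o)
        return t(tuple(known[o].get(i, bottom[i]) for i in range(m)))

    for depth in range(len(lists[0])):
        for i in range(m):
            obj, score = lists[i][depth]   # sorted access
            bottom[i] = score
            known.setdefault(obj, {})[i] = score
        if depth % h == h - 1:
            # random-access round: fill in all missing fields of the
            # most promising partially known object
            partial = [o for o in known if len(known[o]) < m]
            if partial:
                target = max(partial, key=B)
                for i in range(m):
                    known[target].setdefault(i, by_object[i][target])
        ranked = sorted(known, key=lambda o: (W(o), B(o)), reverse=True)
        top_k, rest = ranked[:k], ranked[k:]
        if len(top_k) == k:
            Mk = W(top_k[-1])
            if t(tuple(bottom)) <= Mk and all(B(o) <= Mk for o in rest):
                return top_k
    return sorted(known, key=W, reverse=True)[:k]
```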
Further Research
• Middleware scenario:
  • better implementations of NRA
  • is large storage essential?
  • additional useful information in each list?
• How widely applicable is instance optimality?
  • string matching, stable marriage...
• Aggregation functions and methods in other scenarios
  • rank aggregation of search engines
• P = NP?
More Details See www.wisdom.weizmann.ac.il/~naor/PAPERS/middle_agg.html