Size-estimation framework with applications to transitive closure and reachability

Edith Cohen, AT&T Bell Labs, 1996. Presented by Maxim Kalaev.

Agenda
• Intro & Motivation
• Algorithm sketch
• The estimation framework
• Estimating reachability
• Estimating neighborhood sizes

Introduction
• Descendant counting problem: “Given a directed graph G, compute for each node the number of nodes reachable from it, and compute the total size of the transitive closure.”

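Introduction
[Figure: example directed graph on nodes A, B, C, D, E]
• $S(v)$: the set of nodes reachable from node $v$
• Transitive closure size: $T = \sum_{v \in V} |S(v)|$
• Example: |S(‘A’)| = 5, |S(‘B’)| = 3, T = |S(‘A’)| + |S(‘B’)| + … = 15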
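For reference, the exact quantities can be computed by one graph search per node. The sketch below does this with BFS; the graph is a hypothetical example (not the one drawn on the slide), and the function names are illustrative. Its worst-case cost of O(|V|·|E|) is what makes an estimator attractive.

```python
from collections import deque

def reachable_set(graph, source):
    """Return S(source): every node reachable from source, including source itself."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

def transitive_closure_size(graph):
    """Return T, the sum of |S(v)| over all nodes v."""
    return sum(len(reachable_set(graph, v)) for v in graph)

# Hypothetical adjacency lists: node -> list of direct successors.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [], "E": ["A"]}
sizes = {v: len(reachable_set(graph, v)) for v in graph}
total = transitive_closure_size(graph)
```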

Motivation
• Query-size estimation in databases
• Data mining
• Optimizing matrix multiplication
• Optimizing parallel DFS algorithms

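Framework algorithm sketch
[Figure: example graph with node ranks B=1, D=2, E=3, A=4, C=5]
• Least-descendant mapping: given a graph G(V,E) with ranks on its nodes, compute a mapping from each node v in V to the least-ranked node in S(v)
• Example, giving the rank of the least-ranked descendant: LE(‘A’) = 1 (node B), LE(‘C’) = 2 (node D)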

Framework algorithm sketch
• The rank of the least element LE(v) is highly correlated with the size of S(v): the larger S(v) is, the smaller its minimum rank tends to be
• Precision can be improved by running several iterations, each with a fresh random rank assignment and a recomputation of LE

The estimation framework
• Let X be a set of elements x with non-negative weights w(x)
• Let Y be a set of labels y, and let $S: Y \to 2^X$ map each label y to a subset of X
• Our objective is to compute an estimate of $w(S(y)) = \sum_{x \in S(y)} w(x)$ for every y, assuming X, Y and the weights are given but computing w(S(y)) exactly for all y is costly

The estimation framework
• Assume we have the following LE (least element) oracle: given ranks R(x) on the elements of X, LE(y) returns the element of S(y) with minimal rank in O(1) time, so that $R(LE(y)) = \min_{x \in S(y)} R(x)$
• The estimation algorithm performs k iterations, where k is determined by the required precision
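To make the oracle's contract concrete, here is a brute-force reference version. It scans every subset explicitly, so it does not meet the O(1) assumption; it only pins down what LE(y) is supposed to return. The names `subsets` and `ranks` and the sample data are illustrative assumptions.

```python
def least_element(subsets, ranks):
    """Brute-force LE: for every label y, the element of S(y) with minimal rank."""
    return {y: min(members, key=ranks.__getitem__) for y, members in subsets.items()}

# Hypothetical instance: two labels over three elements.
subsets = {"y1": {"a", "b"}, "y2": {"b", "c"}, "y3": {"c"}}
ranks = {"a": 0.7, "b": 0.2, "c": 0.9}
le = least_element(subsets, ranks)
```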

The estimation framework
• Iteration:
• Independently for each x in X, select a random rank R(x) from the exponential distribution with parameter w(x), whose distribution function is $F(t) = 1 - e^{-w(x)t}$ for $t \ge 0$
• Apply LE to the selected ranking and store the minimum rank obtained for each y in Y
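The rank-drawing step of one iteration might look as follows. This is a minimal sketch using Python's standard `random.expovariate`; treating a zero weight as an infinite rank (the element can never be the minimum) is an assumption made here, and the weights are made up.

```python
import math
import random

def draw_ranks(weights, rng):
    """Draw one independent rank per element: R(x) ~ Exponential(rate = w(x))."""
    return {
        # Assumption: an element of weight 0 gets rank infinity, so it never wins a minimum.
        x: rng.expovariate(w) if w > 0 else math.inf
        for x, w in weights.items()
    }

rng = random.Random(0)                      # fixed seed for reproducibility
weights = {"a": 1.0, "b": 2.0, "c": 0.5}    # hypothetical weights
ranks = draw_ranks(weights, rng)
```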

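The estimation framework
• Proposition: the distribution of the minimum rank R(LE(y)) depends only on w(S(y))
• Proof: the minimum of m independent exponentially distributed r.v.'s with parameters $\lambda_1, \dots, \lambda_m$ is exponentially distributed with parameter $\lambda_1 + \dots + \lambda_m$; hence R(LE(y)) is exponential with parameter w(S(y))
• Our objective now is to estimate the distribution parameter from the given samples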
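The proposition can be checked in one line from the tail probabilities, using only the independence of the ranks and the exponential tail $\Pr[R(x) > t] = e^{-w(x)t}$:

```latex
\Pr\Bigl[\min_{x \in S(y)} R(x) > t\Bigr]
  = \prod_{x \in S(y)} \Pr[R(x) > t]
  = \prod_{x \in S(y)} e^{-w(x)\,t}
  = e^{-w(S(y))\,t}, \qquad t \ge 0 .
```

The right-hand side is the tail of an exponential distribution with parameter $w(S(y))$, so the minimum rank carries no information about S(y) beyond its total weight.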

The estimation framework
• The mean of an exponentially distributed r.v. with parameter λ is 1/λ
• We can use this fact to estimate λ from samples as 1/(sample mean)
• Applying this to the minimum ranks $R_1(LE_1(y)), \dots, R_k(LE_k(y))$ obtained in the k iterations gives the estimate $\hat{w}(S(y)) = k / \sum_{i=1}^{k} R_i(LE_i(y))$
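A small simulation illustrates the 1/(sample mean) estimator. Rather than running an oracle, the sketch draws the k minimum ranks directly from an exponential distribution with the true parameter, which the proposition on the previous slide permits; the true weight, k and the seed are arbitrary choices.

```python
import random

def estimate_weight(min_ranks):
    """Estimate w(S(y)) as 1 / (sample mean) of the k observed minimum ranks."""
    k = len(min_ranks)
    return k / sum(min_ranks)

# Simulated oracle output: each minimum rank is Exponential(w(S(y))).
rng = random.Random(1)
true_weight = 50.0
k = 2000
samples = [rng.expovariate(true_weight) for _ in range(k)]
estimate = estimate_weight(samples)
```

With k samples the relative standard error is roughly 1/sqrt(k), so k = 2000 should typically land within a few percent of the true weight.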

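The estimation framework
• More estimators:
• Selection: take the k(1-1/e)-th smallest of the k samples and use its reciprocal; 1-1/e is the quantile at which an exponential distribution reaches its mean 1/λ (the analogue of the median for a uniform distribution)
• A less intuitive averaging estimator: $\hat{w}(S(y)) = (k-1) / \sum_{i=1}^{k} R_i(LE_i(y))$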
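The selection-based estimator can be sketched as follows. It relies on the fact that the (1-1/e) quantile of an exponential distribution with parameter λ equals 1/λ, so the reciprocal of that sample quantile estimates λ. Rounding k(1-1/e) up to an integer position is an assumption of this sketch.

```python
import math
import random

def quantile_estimate(min_ranks):
    """Estimate the rate as the reciprocal of the k(1 - 1/e)-th smallest sample."""
    k = len(min_ranks)
    # 1-based position of the selected sample; rounding up is an assumption.
    position = max(1, math.ceil(k * (1 - 1 / math.e)))
    return 1.0 / sorted(min_ranks)[position - 1]

rng = random.Random(2)
true_weight = 50.0
samples = [rng.expovariate(true_weight) for _ in range(2000)]
estimate = quantile_estimate(samples)
```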

The estimation framework
• Complexity so far:
• For a tolerated relative error ε, the ranks R need to be stored only to a limited number of significant bits
• The k rank-assignment iterations take O(k|X|) time
• Plus k × O(oracle setup time)
• Asymptotic accuracy bounds (the proof comes later)

Estimating reachability
• Objective: given a graph G(V,E), estimate for each node v the number of its descendants |S(v)|, and estimate the size T of the transitive closure
• All we need is an oracle that computes the LE mapping. The following algorithm takes an arbitrary ranking of the nodes, given in sorted order, and computes the mapping in O(|E|) time:

Estimating reachability
• LE subroutine()
• Reverse the direction of every edge of the graph
• Iterate until V = {}:
  • Pop the node v with minimal rank from V
  • Run DFS from v to find all nodes reachable from v (call this set U)
  • For each node u in U set LE(u) = v
  • V = V \ U
  • E = E \ {edges incident to nodes in U}
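The subroutine above can be sketched in Python as follows. Instead of physically deleting U and its edges, the sketch skips nodes that already have an LE value, which has the same effect. It sorts the nodes itself (O(|V| log |V|)), whereas the slide assumes the ranking arrives sorted. The graph is a hypothetical example; the ranks are the ones shown in the slides.

```python
def least_element_map(graph, ranks):
    """Map every node v to the minimum-rank node in S(v) (v's descendants, v included)."""
    # Reverse the edges: a search from u in the reversed graph visits the nodes that reach u.
    reverse = {v: [] for v in graph}
    for v, succs in graph.items():
        for s in succs:
            reverse[s].append(v)
    le = {}
    for root in sorted(graph, key=ranks.__getitem__):   # increasing rank
        if root in le:
            continue              # already claimed by a lower-ranked node
        le[root] = root
        stack = [root]
        while stack:              # iterative DFS restricted to unclaimed nodes
            node = stack.pop()
            for pred in reverse[node]:
                if pred not in le:
                    le[pred] = root
                    stack.append(pred)
    return le

# Hypothetical graph (every node must appear as a key).
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [], "E": ["A"]}
ranks = {"B": 1, "D": 2, "E": 3, "A": 4, "C": 5}
le = least_element_map(graph, ranks)
```

Each node is claimed exactly once and each reversed edge is inspected at most once, which is where the linear bound after sorting comes from.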

Estimating reachability
• Each estimation iteration takes O(|V|) + O(|E|) time, assuming the node ranks can be sorted in expected linear time
• Accuracy bounds follow from the estimator bounds
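Putting the pieces together, a full reachability estimator might look like the sketch below. It assumes unit node weights (so ranks are Exponential(1)), uses the 1/(sample mean) estimator, and sorts with Python's built-in sort rather than a linear-time method. The test graph, a directed path, is hypothetical and chosen because its exact answers are obvious.

```python
import random

def least_ranks(graph, ranks):
    """For each node v, return the minimum rank over S(v) via the reversed-graph sweep."""
    reverse = {v: [] for v in graph}
    for v, succs in graph.items():
        for s in succs:
            reverse[s].append(v)
    best = {}
    for root in sorted(graph, key=ranks.__getitem__):
        if root in best:
            continue
        best[root] = ranks[root]
        stack = [root]
        while stack:
            node = stack.pop()
            for pred in reverse[node]:
                if pred not in best:
                    best[pred] = ranks[root]
                    stack.append(pred)
    return best

def estimate_descendants(graph, k, seed=0):
    """Return (estimated |S(v)| per node, estimated transitive closure size)."""
    rng = random.Random(seed)
    totals = {v: 0.0 for v in graph}
    for _ in range(k):
        ranks = {v: rng.expovariate(1.0) for v in graph}   # unit weights
        for v, r in least_ranks(graph, ranks).items():
            totals[v] += r
    sizes = {v: k / totals[v] for v in graph}
    return sizes, sum(sizes.values())

# Hypothetical test graph: a path 0 -> 1 -> ... -> 19, so |S(i)| = 20 - i and T = 210.
path = {i: ([i + 1] if i < 19 else []) for i in range(20)}
sizes, closure = estimate_descendants(path, k=2000)
```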

Estimating neighborhood sizes
• Problem: given a graph G(V,E) with nonnegative edge lengths, estimate n(v,d), the number of nodes within distance at most d from node v
• Our algorithm preprocesses G once, after which each (v,d) query is answered from the precomputed data

Estimating neighborhood sizes
[Figure: example weighted directed graph with node ranks B=1, D=2, E=3, A=4, C=5]
• N(A,7)={A,B,C,D,E}, n(A,7)=5
• N(A,3)={A,C,E}, n(A,3)=3
• N(D,0)={D}, n(D,0)=1
• N(C,∞)={C}, n(C,∞)=1

Estimating neighborhood sizes
• After preprocessing G we have, for each node v, a list of pairs ({d1,s1}, {d2,s2},…,{dη,sη}), where the d's stand for distances and the s's stand for estimated neighborhood sizes. The lists are sorted by d.
• To obtain n(v,d) we look for the pair i such that $d_i \le d < d_{i+1}$ and return $s_i$
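A query against such a list is a binary search for the last breakpoint not exceeding d. The sketch below stores distances and sizes in two parallel sorted lists so that `bisect_right` can be applied directly; that layout and the sample numbers are assumptions for illustration.

```python
from bisect import bisect_right

def neighborhood_estimate(distances, sizes, d):
    """Return s_i for the pair with d_i <= d < d_(i+1); distances must be sorted."""
    i = bisect_right(distances, d) - 1
    if i < 0:
        raise ValueError("d is smaller than the first breakpoint")
    return sizes[i]

# Hypothetical list for one node: breakpoints d_i and estimated sizes s_i.
distances = [0, 1, 3, 7]
sizes = [1.0, 2.1, 3.9, 5.2]
```

For example, `neighborhood_estimate(distances, sizes, 2)` falls between the breakpoints 1 and 3 and returns 2.1.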

Estimating neighborhood sizes
• The algorithm runs k iterations. In each iteration it creates, for each node v of G, a least-element list ({d1,v1}, {d2,v2},…,{dη,vη}) from which the min-rank node of any neighborhood N(v,d) can be found: for $d_i \le d < d_{i+1}$ the min-rank node is $v_i$

Estimating neighborhood sizes
[Figure: example weighted directed graph with node ranks B=1, D=2, E=3, A=4, C=5]
Neighborhoods:
• N(A,7)={A,B,C,D,E}
• N(A,3)={A,C,E}
• N(D,1)={C,D}
• N(C,∞)={C}
LE-lists, written here as {node, distance} pairs:
• A: ({A,0}, {E,1}, {D,2}, {B,4})
• B: ({B,0})
• C: ({C,0})
• D: ({D,0})
• E: ({E,0}, {D,3})

Estimating neighborhood sizes - alg
• sub Make_le_lists()
• Assume the nodes v1,…,vn are sorted by rank in increasing order
• Reverse the edge directions of G
• For i=1..n: $d_i \leftarrow \infty$, LE-list of $v_i$ $\leftarrow$ empty
• For i=1..n: run a modified Dijkstra's algorithm from $v_i$ (next slide)

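Estimating neighborhood sizes - alg
• Start with an empty heap; place $v_i$ on the heap with label 0
• Iterate until the heap is empty:
  • Pop the node $v_k$ with minimal label d from the heap
  • Add the pair $\{d, v_i\}$ to $v_k$'s LE-list and set $d_k \leftarrow d$
  • For each out-edge $(v_k, v_j)$ of $v_k$, with length $\ell(v_k, v_j)$:
    • If $v_j$ is in the heap, update its label to the minimum of its current label and $d + \ell(v_k, v_j)$
    • Else, if $d + \ell(v_k, v_j) < d_j$, place $v_j$ on the heap with label $d + \ell(v_k, v_j)$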
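The modified Dijkstra procedure can be sketched in Python as follows. Two substitutions are made and should be read as assumptions: Python's `heapq` binary heap replaces the Fibonacci heap, and stale heap entries are skipped on pop instead of updating labels in place. The edge list is a hypothetical graph, chosen so that the output matches the LE-lists quoted in the slides; the real figure is not recoverable from the text.

```python
import heapq
import math

def make_le_lists(graph, ranks):
    """graph: {u: [(v, length), ...]}, lengths >= 0. Returns {v: [(distance, node), ...]}."""
    reverse = {v: [] for v in graph}
    for u, edges in graph.items():
        for v, length in edges:
            reverse[v].append((u, length))
    best = {v: math.inf for v in graph}   # d_v: distance to the nearest source seen so far
    le_lists = {v: [] for v in graph}
    for source in sorted(graph, key=ranks.__getitem__):   # increasing rank
        heap = [(0, source)]
        settled = set()
        while heap:
            d, node = heapq.heappop(heap)
            if node in settled:
                continue                  # stale entry left by an earlier, longer path
            settled.add(node)
            le_lists[node].append((d, source))
            best[node] = d
            for pred, length in reverse[node]:
                nd = d + length
                # Prune: only continue if this source is strictly nearer than all earlier ones.
                if pred not in settled and nd < best[pred]:
                    heapq.heappush(heap, (nd, pred))
    for pairs in le_lists.values():
        pairs.reverse()                   # increasing distance, decreasing rank
    return le_lists

# Hypothetical weighted graph and the slide's ranks.
graph = {
    "A": [("E", 1), ("D", 2), ("B", 4)],
    "B": [("C", 1)],
    "C": [],
    "D": [("C", 1)],
    "E": [("D", 3)],
}
ranks = {"B": 1, "D": 2, "E": 3, "A": 4, "C": 5}
lists = make_le_lists(graph, ranks)
```

The pruning test against `best` is what keeps the lists short: a source is recorded at a node only when it is strictly nearer than every lower-ranked source. With a binary heap each push and pop costs O(log |V|), so the constants differ from the Fibonacci-heap analysis.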

Estimating neighborhood sizes - demo
[Figure: step-by-step run of Make_le_lists on the example graph, showing heap labels and d values]
Resulting LE-lists (node:distance):
• A: A:0, E:1, D:2, B:4
• B: B:0
• C: C:0
• D: D:0
• E: E:0, D:3

Estimating neighborhood sizes - analysis
• Correctness, Proposition 1:
• A node v is placed on the heap in iteration i if and only if $dist(v, v_i) < dist(v, v_j)$ for all $j < i$
• If v is placed on the heap in iteration i, then the pair $\{dist(v, v_i), v_i\}$ is placed on v's list and the value d of v is updated to $dist(v, v_i)$

Estimating neighborhood sizes - analysis
• Complexity, Proposition 2:
• If the ranking is a random permutation, the expected size of each LE-list is O(log|V|)
• The proof is based on Proposition 1 and a divide-and-conquer style analysis

Estimating neighborhood sizes - analysis (proof cont.)
Assume the LE-list of node u contains x pairs. Consider the nodes sorted by their distance from u: v1, v2, …. According to Proposition 1, a node v appears on u's list if and only if every node with a lower rank is farther from u than v is. With random ranks, the lowest-ranked node is expected to fall near the middle of the sequence v1, v2, …, so only the nodes nearer to u than it, about half of them, can still appear on the list; the same argument then applies to that nearer half. Each list entry therefore halves the set of remaining candidates in expectation, and it follows that x is O(log|V|) in expectation.
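The argument can be checked empirically. By Proposition 1, the length of u's LE-list equals the number of nodes in the distance-sorted sequence whose rank is lower than that of every nearer node, i.e. the number of prefix minima of the rank sequence. For a uniformly random permutation of n ranks this count has expectation equal to the harmonic number H_n, which is about ln n. The sketch below assumes distinct distances and uses an arbitrary n, trial count and seed.

```python
import random

def le_list_length(ranks_by_distance):
    """Count nodes whose rank is lower than the rank of every nearer node (prefix minima)."""
    count = 0
    lowest = float("inf")
    for rank in ranks_by_distance:
        if rank < lowest:
            lowest = rank
            count += 1
    return count

def average_length(n, trials, seed=0):
    """Average LE-list length over random rank permutations of n nodes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        ranks = list(range(n))
        rng.shuffle(ranks)
        total += le_list_length(ranks)
    return total / trials

n = 1024
harmonic = sum(1 / i for i in range(1, n + 1))   # H_n, roughly ln(n) + 0.577
avg = average_length(n, trials=2000)
```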

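Estimating neighborhood sizes - analysis
• Complexity (cont.), running time:
• Using Fibonacci heaps, pop() takes O(log|V|) time and insert() or update() takes O(1) time
• Let $c_i$ be the number of iterations in which $v_i$ was placed on the heap (0<i≤|V|)
• Each placement costs one pop plus a scan of $v_i$'s out-edges in the reversed graph, so the running time is $O\left(\sum_i c_i \,(\log|V| + \mathrm{outdeg}(v_i))\right)$
• As $c_i$ is also the size of $v_i$'s LE-list, with expected value O(log|V|), we get an expected running time of $O(|E| \log|V| + |V| \log^2|V|)$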

Estimating neighborhood sizes: issues with k iterations
• What do we do with the k LE-lists obtained per node? The naïve way, searching each list separately, gives O(k·loglog|V|) query time
• This can be improved to O(log k + loglog|V|) by merging the lists and storing the sum of ranks at each breakpoint
• Total algorithm setup time is:

Summary
• General size-estimation framework
• Two applications: transitive closure size estimation and neighborhood size estimation
