
Size-estimation framework with applications to transitive closure and reachability

Size-estimation framework with applications to transitive closure and reachability. Edith Cohen AT&T Bell Labs 1996. Presented by Maxim Kalaev. Agenda. Intro & Motivation Algorithm sketch The estimation framework Estimating reachability Estimating neighborhood sizes. Introduction.



  1. Size-estimation framework with applications to transitive closure and reachability • Edith Cohen, AT&T Bell Labs, 1996 • Presented by Maxim Kalaev

  2. Agenda • Intro & Motivation • Algorithm sketch • The estimation framework • Estimating reachability • Estimating neighborhood sizes

  3. Introduction • Descendant counting problem: "Given a directed graph G, compute for each node the number of nodes reachable from it, and the total size of the transitive closure."

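  4. Introduction [Figure: example directed graph on nodes A, B, C, D, E] • $S(v)$: the set of nodes reachable from node $v$ • Transitive closure size: $T = \sum_{v \in V} |S(v)|$ • Example: |S('A')| = 5, |S('B')| = 3, T = |S('A')| + |S('B')| + … = 15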

  5. Motivation • Applicable to DB query size estimation • Data mining • Optimizing matrix multiplications • Optimizing parallel DFS algorithms

  6. Framework algorithm sketch [Figure: example graph with node ranks B=1, D=2, E=3, A=4, C=5] • Least-descendant mapping: given a graph G(V,E) with ranks on its nodes, compute a mapping from each node v in V to the least-ranked node in S(v) • Example: • LE('A') = 1 • LE('C') = 2

  7. Framework algorithm sketch • The rank of the LE (least element) is highly correlated with the size of S(v): the larger S(v) is, the smaller its minimum rank tends to be • Precision can be improved by running several iterations, each with a fresh random rank assignment and a recomputation of LE

  8. The estimation framework • Let X be a set of elements x with non-negative weights w(x) • Let Y be a set of labels y, with a mapping $S: Y \to 2^X$ from labels y to subsets of X • Our objective is to compute an estimate of $w(S(y)) = \sum_{x \in S(y)} w(x)$, assuming X, Y and the weights are given but it is costly to calculate w(S(y)) for all y

  9. The estimation framework • Assume we have the following LE (LeastElement) oracle: given ranks R(x) on the elements of X, LE(y) returns the element with minimal rank in S(y) in O(1) time: $LE(y) = \arg\min_{x \in S(y)} R(x)$ • The estimation algorithm performs k iterations, where k is determined by the required precision

  10. The estimation framework • Iteration: • Independently for each x in X, select a random rank R(x) from the exponential distribution with parameter w(x). Its distribution function is $F(t) = 1 - e^{-w(x)\,t}$ for $t \ge 0$ • Apply LE to the selected ranking and store the min-rank obtained for each y in Y

  11. The estimation framework • Proposition: The distribution of the minimum rank R(le(y)) depends only on w(S(y)) • Proof: The minimum of k independent r.v.'s with exponential distributions with parameters $\lambda_1, \dots, \lambda_k$ has an exponential distribution with parameter $\sum_{i=1}^{k} \lambda_i$ • Our objective now is to estimate the distribution parameter from the given samples
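
The proof step on this slide can be checked directly from tail probabilities. A short derivation, assuming the ranks $R_1, \dots, R_k$ are independent with $R_i \sim \mathrm{Exp}(\lambda_i)$:

```latex
\Pr\Big[\min_{1 \le i \le k} R_i > t\Big]
  = \prod_{i=1}^{k} \Pr[R_i > t]
  = \prod_{i=1}^{k} e^{-\lambda_i t}
  = e^{-\left(\sum_{i=1}^{k} \lambda_i\right) t}, \qquad t \ge 0 .
```

Taking $\lambda_i = w(x)$ for the elements $x \in S(y)$, the minimum rank over S(y) is exponential with parameter w(S(y)), which is the claim of the proposition.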

  12. The estimation framework • The mean of exponentially distributed r.v.'s with parameter λ is 1/λ • We can use this fact to estimate λ from samples as 1/(sample mean) • Use this to estimate w(S(y)) from the minimal ranks obtained in k iterations: $\hat{w}(S(y)) = k \,/\, \sum_{i=1}^{k} R_i(le_i(y))$
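
The 1/(sample mean) idea can be tried on a single set whose true weight is known. This is a minimal sketch, not the presenter's code: the function name `estimate_weight`, the `seed` parameter, and the use of Python's `random.expovariate` are assumptions made for illustration, and the set S(y) is scanned directly instead of going through an LE oracle.

```python
import random

def estimate_weight(weights, k, seed=0):
    """Estimate sum(weights) from k independent min-rank samples.

    weights: positive weights w(x) of the elements of one set S(y).
    k: number of iterations (more iterations give a tighter estimate).
    """
    rng = random.Random(seed)
    total_min_rank = 0.0
    for _ in range(k):
        # One iteration: draw R(x) ~ Exp(w(x)) for every element and keep
        # only the minimum rank, as the LE oracle would report it.
        total_min_rank += min(rng.expovariate(w) for w in weights)
    # The sample mean of the minima estimates 1 / w(S(y)); invert it.
    return k / total_min_rank
```

For example, `estimate_weight([1.0] * 50, k=2000)` should land close to the true weight 50, with the relative error shrinking roughly like 1/sqrt(k).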

  13. The estimation framework • More estimators: • Selecting the k(1−1/e)-th smallest of the k samples; its value estimates 1/w(S(y)) (this plays the role the median plays for a uniform distribution) • Using this non-intuitive average estimator:
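
The quantile-based estimator relies on the fact that the (1−1/e)-quantile of an exponential distribution with parameter λ is exactly 1/λ. A small sketch under that reading of the slide; the function name `quantile_estimate` and the rounding of the index are assumptions:

```python
import math

def quantile_estimate(samples):
    """Estimate the exponential parameter from min-rank samples.

    Takes the k(1 - 1/e)-th smallest sample, which approximates 1/lambda,
    and returns its reciprocal.
    """
    k = len(samples)
    # 1-based position k(1 - 1/e), rounded up, converted to a 0-based index.
    index = max(0, math.ceil(k * (1.0 - 1.0 / math.e)) - 1)
    return 1.0 / sorted(samples)[index]
```

Unlike the mean-based estimator, this one depends on a single order statistic, so a few unusually large samples do not affect it.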

  14. The estimation framework • Complexity so far: • With a tolerated relative error ε, only a bounded number of significant bits of each rank R needs to be stored • k assignment iterations take O(k|X|) time • plus k × (oracle setup time) • Asymptotic accuracy bounds (proof given later)

  15. Estimating reachability • Objective: given a graph G(V,E), estimate for each v the number of its descendants, |S(v)|, and the size of the transitive closure, $T = \sum_{v \in V} |S(v)|$ • All we need is to implement an oracle for calculating the LE mapping. The following algorithm takes an arbitrary ranking of the nodes, given in sorted order, and does this in O(|E|) time:

  16. Estimating reachability • LE subroutine() • Reverse the direction of all edges of the graph • Iterate until V = {} • Pop the node v with minimal rank from V • Run DFS from v to find all nodes reachable from v (call this set of nodes U) • For each node u in U set LE(u) = v • V = V \ U • E = E \ {edges incident to nodes in U}
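
The LE subroutine above can be sketched in a few lines. This is an illustrative version under stated assumptions: the function name `least_descendants` and the input format (nodes listed in increasing rank order, edges as `(u, v)` pairs meaning u → v) are hypothetical, and "removing" visited nodes is implemented by skipping nodes that already have an LE value.

```python
def least_descendants(nodes_by_rank, edges):
    """Map every node v to the least-ranked node in S(v).

    nodes_by_rank: all nodes, sorted by increasing rank.
    edges: iterable of (u, v) pairs for directed edges u -> v.
    A node is considered reachable from itself.
    """
    # Reverse the edges: rev[v] lists the nodes with an edge into v.
    rev = {v: [] for v in nodes_by_rank}
    for u, v in edges:
        rev[v].append(u)

    le = {}
    for v in nodes_by_rank:          # pop nodes in increasing rank order
        if v in le:
            continue                 # already removed by an earlier DFS
        le[v] = v
        stack = [v]
        while stack:                 # iterative DFS on the reversed graph
            x = stack.pop()
            for p in rev[x]:
                if p not in le:      # assigned nodes count as deleted
                    le[p] = v
                    stack.append(p)
    return le
```

Each node and each edge is handled a constant number of times, which matches the O(|V|) + O(|E|) cost per iteration once the nodes are sorted by rank.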

  17. Estimating reachability • Each estimation iteration takes O(|V|) + O(|E|) time, assuming we can sort the node ranks in expected linear time • Accuracy bounds (from the estimator bounds)

  18. Estimating neighborhood sizes • Problem: given a graph G(V,E) with nonnegative edge lengths, estimate the number of nodes within distance at most d from node v, denoted n(v,d) • Our algorithm will preprocess G in time and after that will be able to answer (v,d) queries in time

  19. Estimating neighborhood sizes [Figure: example graph with node ranks B=1, D=2, E=3, A=4, C=5 and edge lengths] • N(A,7)={A,B,C,D,E} • N(A,3)={A,C,E} • N(D,0)={D} • N(C,∞)={C} • n(A,7)=5 • n(A,3)=3 • n(D,0)=1 • n(C,∞)=1

  20. Estimating neighborhood sizes • After preprocessing G, we generate for each node v a list of pairs ({d1,s1}, {d2,s2},…,{dη,sη}), where the d's stand for distances and the s's stand for estimated neighborhood sizes. The lists are sorted by the d's. • To obtain n(v,d) we look for the pair i such that $d_i \le d < d_{i+1}$ and return $s_i$
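
A query against such a list is a binary search. The sketch below assumes the intended rule is to return the size attached to the largest listed distance not exceeding d; the function name `neighborhood_estimate` and the sample values in the usage note are hypothetical.

```python
from bisect import bisect_right

def neighborhood_estimate(pairs, d):
    """Look up the estimated size of the neighborhood N(v, d).

    pairs: list of (distance, estimated_size) tuples for one node v,
           sorted by increasing distance.
    Returns the size s_i of the pair with d_i <= d < d_{i+1}.
    """
    distances = [dist for dist, _ in pairs]
    i = bisect_right(distances, d) - 1   # last index with d_i <= d
    if i < 0:
        return 0                         # d is below every listed distance
    return pairs[i][1]
```

With the made-up list `[(0, 1.0), (1, 2.1), (4, 4.8)]`, a query for d = 3 returns 2.1, the estimate stored at distance 1. In practice the distance array would be precomputed once per node, not rebuilt on every query.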

  21. Estimating neighborhood sizes • The algorithm runs k iterations; in each iteration it creates for each node in G a least-element list ({d1,v1}, {d2,v2},…,{dη,vη}) such that for any neighborhood (v,d) we can find the min-rank node from the list: for $d_i \le d < d_{i+1}$ the min-rank node is $v_i$

  22. Estimating neighborhood sizes [Figure: example graph with node ranks B=1, D=2, E=3, A=4, C=5 and edge lengths] Neighborhoods: • N(A,7)={A,B,C,D,E} • N(A,3)={A,C,E} • N(D,1)={C,D} • N(C,∞)={C} LE-lists: • A: ({A,0}{E,1}{D,2}{B,4}) • B: ({B,0}) • C: ({C,0}) • D: ({D,0}) • E: ({E,0}{D,3})

  23. Estimating neighborhood sizes - alg • sub Make_le_lists() • Assume the nodes v1,…,vn are sorted by rank in increasing order • Reverse the edge directions of G • For i=1..n: $d_i \leftarrow \infty$, LE-list of $v_i \leftarrow \emptyset$ • For i=1..n (modified Dijkstra's alg.) DO: (next slide)

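  24. Estimating neighborhood sizes - alg • Start with an empty heap, place $v_i$ on the heap with label 0 • Iterate until the heap is empty: • Pop the node $v_k$ with minimal label d from the heap • Add the pair $(d, v_i)$ to $v_k$'s LE-list, set $d_k \leftarrow d$ • For each out-edge $(v_k, v_j)$ of $v_k$ with length $\ell(v_k, v_j)$: • If $v_j$ is in the heap, update its label to $\min\{\mathrm{label}(v_j),\ d + \ell(v_k, v_j)\}$ • Else: if $d + \ell(v_k, v_j) < d_j$, place $v_j$ on the heap with label $d + \ell(v_k, v_j)$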
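
The two algorithm slides can be combined into one runnable sketch. Several things here are assumptions and not taken from the slides: the function name `make_le_lists`, the `(u, v, length)` edge format, the pruning rule "only visit a node if the new distance beats its current d value", and the use of Python's binary heap with lazy deletion in place of a heap with a decrease-key operation. The example graph in the usage note is also hypothetical.

```python
import heapq
import math

def make_le_lists(nodes_by_rank, edges):
    """Build least-element lists with pruned Dijkstra runs.

    nodes_by_rank: all nodes, sorted by increasing rank.
    edges: iterable of (u, v, length) for directed edges u -> v.
    Returns {node: [(distance, least_ranked_node), ...]} sorted by distance.
    """
    rev = {v: [] for v in nodes_by_rank}
    for u, v, length in edges:
        rev[v].append((u, length))            # reversed edge v -> u

    best = {v: math.inf for v in nodes_by_rank}   # current d value per node
    lists = {v: [] for v in nodes_by_rank}

    for source in nodes_by_rank:              # iteration i handles v_i
        label = {source: 0}
        done = set()
        heap = [(0, source)]
        while heap:
            d, x = heapq.heappop(heap)
            if x in done or d > label[x]:
                continue                      # stale heap entry, skip it
            done.add(x)
            lists[x].append((d, source))      # source is nearest so far
            best[x] = d
            for u, length in rev[x]:
                nd = d + length
                # Prune: go on only if source is closer to u than every
                # lower-ranked node seen in earlier iterations.
                if u not in done and nd < best[u] and nd < label.get(u, math.inf):
                    label[u] = nd
                    heapq.heappush(heap, (nd, u))

    # Entries were appended with decreasing distance; sort by distance.
    return {v: pairs[::-1] for v, pairs in lists.items()}
```

On a hypothetical graph with ranks B < D < E < A < C and edges A→E (1), A→D (2), E→D (3), A→B (4), the list built for A is `[(0, 'A'), (1, 'E'), (2, 'D'), (4, 'B')]`: moving outward from A, each entry names a node of lower rank than everything closer.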

  25. Estimating neighborhood sizes - demo [Figure: step-by-step run of the algorithm on the example graph, showing heap labels and d values in each iteration] Resulting LE-lists: • A: (A:0, E:1, D:2, B:4) • B: (B:0) • C: (C:0) • D: (D:0) • E: (E:0, D:3)

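  26. Estimating neighborhood sizes - analysis • Correctness. Proposition 1: • A node v is placed on the heap in iteration i if and only if $\mathrm{dist}(v, v_i) < \mathrm{dist}(v, v_j)$ for all $j < i$ (distances taken in the original graph G) • If v is placed on the heap in iteration i, then the pair $(\mathrm{dist}(v, v_i), v_i)$ is placed on v's list and the value $d_v$ is updated to $\mathrm{dist}(v, v_i)$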

  27. Estimating neighborhood sizes - analysis • Complexity. Proposition 2: • If the ranking is a random permutation, the expected size of each LE-list is O(log|V|). The proof is based on Proposition 1 and a divide-and-conquer style analysis:

  28. Estimating neighborhood sizes - analysis (proof cont.) Assume the LE-list of node u contains x pairs. Consider the nodes sorted by their distance to node u: v1, v2, …. According to Proposition 1, u is placed on the heap in iteration i iff all nodes with ranks lower than i are farther from u than the node of rank i is. With random ranks, the node of rank i is expected to be nearer to u than about half of the nodes with ranks > i, so each successive list entry roughly halves the set of remaining candidates. It follows that x is O(log|V|) in expectation.
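
One standard way to make the halving argument precise is sketched below. It is not necessarily the argument used in the talk, and it assumes all distances to u are distinct. Let $v_1, \dots, v_n$ be the nodes at finite distance from u, in order of increasing distance, with $n \le |V|$. Under a random permutation, $v_i$ has the lowest rank among $v_1, \dots, v_i$ with probability $1/i$, and by Proposition 1 these are exactly the nodes that appear on u's list.

```latex
\mathbb{E}[x]
  = \sum_{i=1}^{n} \Pr\big[\, v_i \text{ has the lowest rank among } v_1, \dots, v_i \,\big]
  = \sum_{i=1}^{n} \frac{1}{i}
  = H_n \le 1 + \ln n = O(\log |V|) .
```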

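  29. Estimating neighborhood sizes - analysis • Complexity (cont.) Running time: using Fibonacci heaps we have an O(log|V|) pop() operation and O(1) insert() and update(). Let $l_i$ be the number of iterations in which $v_i$ was placed on the heap (0<i≤|V|). It follows that the running time is $O\big(\sum_{i} l_i \,(\log|V| + \mathrm{outdeg}(v_i))\big)$. As $l_i$ is also the size of $v_i$'s LE-list, which is O(log|V|) in expectation, we get $O(|E| \log|V| + |V| \log^2|V|)$ expected time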

  30. Estimating neighborhood sizes - k iterations issues • What do we do with the k LE-lists obtained per node? The naive approach gives O(k·loglog|V|) query time. It can be improved to O(log k + loglog|V|) by merging the lists and storing sums of ranks per breakpoint. • Total algorithm setup time is:

  31. This page has intentionally left blank

  32. Summary • General size-estimation framework • Two applications – transitive closure size estimation and neighborhoods size estimation

  33. THE END!
