Top-Down Analysis of Path Compression in Union-Find Structures

Top-Down Analysis of Path Compression • Raimund Seidel and Micha Sharir • Presented by Or Ozery

Introduction • What is path compression? • An operation on rooted trees. • Used mainly in union-find algorithms. • This paper analyzes the amortized time-complexity of path compression. • Previous analyses all used low-level accounting. • The analysis in this paper is based on a more “natural” top-down approach.

Lecture Outline • Recall the union-find data structure. • Implementation using rooted trees. • Doing better: path compression. • The top-down analysis by Seidel and Sharir: • The Main Lemma. • Repetitive use of the Main Lemma yields increasingly better bounds. • Finally we get the “famous” α(n) bound.

The Union-Find DS • X is a set of n elements. • A partition of X is given. • Each subset has a unique representative. • Support for the following operations: • Find(x) returns the representative of the subset containing x. • Union(x, y) merges the subsets represented by x and y. Representative of the new set picked arbitrarily.

A Simple Example • X = {1, 2, 3, 4, 5, 6, 7, 8}. • The partition is given by: {1, 2, 3}, {4, 5}, {6}, {7, 8}. • After Union(6,7): {1, 2, 3}, {4, 5}, {6, 7, 8}. • Note: in the original slide, representatives are colored in yellow.

Implementation • Abstract: • Represent X by a rooted forest. • Nodes are the elements of X. • Each tree represents a subset. • Roots are representatives. • Memory representation: • Each node, except a root, has a parent. • The nodes will keep a pointer to their parent.

Implementation – cont’d
• Find(x):
    if parent[x] = null: return x
    else: return Find(parent[x])
• Union(x, y):
    parent[x] = y
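The two operations above can be sketched in Python as follows. This is only an illustration: the list-based `parent` array, the helper name `make_sets`, and the choice of 1, 4, 6 and 7 as the initial representatives of the example partition are assumptions, not part of the slides.

```python
def make_sets(n):
    # parent[x] is None when x is a root (a representative).
    return [None] * n

def find(parent, x):
    # Climb parent pointers until a root is reached
    # (iterative, to avoid deep recursion on long paths).
    while parent[x] is not None:
        x = parent[x]
    return x

def union(parent, x, y):
    # x and y are assumed to be two distinct roots; x is linked below y.
    parent[x] = y

# Partition {1,2,3}, {4,5}, {6}, {7,8}; index 0 is unused.
# The representatives 1, 4, 6, 7 are an assumption made for this example.
parent = make_sets(9)
for child, root in [(2, 1), (3, 1), (5, 4), (8, 7)]:
    union(parent, child, root)

union(parent, 6, 7)
print(find(parent, 6), find(parent, 8))
```

After `union(parent, 6, 7)`, both 6 and 8 report the same representative.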

Example • After Union(6,7): [Figure: the forest before and after the operation; Union(6,7) sets parent[6] = 7.]

Complexity Analysis • Union(x, y) – O(1). • Find(x) – O(length of path from x to root) = O(n) in the worst case. • We need to keep trees balanced. How? • Trees are built in Union. • Make the root of the “bigger” tree the parent of the root of the “smaller” one.

Linking Strategies • Our implementation used naïve arbitrary linking. • Consider 2 other heuristics: • Linking by weight: Take the root of the tree with more nodes as parent. • Linking by rank: Take the root of the tree with the greater height as parent. • Both strategies can be easily implemented in Union in O(1) time, and both yield trees of height O(log n).
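A minimal sketch of linking by rank, assuming a separate `rank` array that stores the height of each root's tree (the names `make_sets` and `union_by_rank` are illustrative). Linking by weight would look the same with node counts in place of heights.

```python
def make_sets(n):
    # parent[x] is None for roots; rank[x] is the height of the tree rooted at x.
    return [None] * n, [0] * n

def union_by_rank(parent, rank, x, y):
    # x and y are assumed to be two distinct roots.
    if rank[x] > rank[y]:
        x, y = y, x               # make y the root with the greater rank
    parent[x] = y
    if rank[x] == rank[y]:
        rank[y] += 1              # the height grows only when the ranks tie
    return y                      # the root of the merged tree

# Merge 8 singletons pairwise, round by round.
n = 8
parent, rank = make_sets(n)
roots = list(range(n))
while len(roots) > 1:
    roots = [union_by_rank(parent, rank, roots[i], roots[i + 1])
             for i in range(0, len(roots), 2)]
print(rank[roots[0]])             # height of the final tree
```

With 8 elements the final height is 3 = log 8, which matches the O(log n) height claim in this small case.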

Path Compression • Ideally, we’d have all nodes children of the root. • Changing a tree to look that way will cost O(n). • But we can use the time spent on Find(x) to do some work that will bring us closer to our vision: While climbing the path from x to the root, make all the nodes in the way children of the root.

Implementation
• Find(x):
    if parent[x] = null: return x
    else:
        root = Find(parent[x])
        parent[x] = root
        return root
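The same idea can be written without recursion as a two-pass loop. This is a sketch under the same assumptions as before (a `parent` list with `None` marking roots); the name `find_compress` is illustrative.

```python
def find_compress(parent, x):
    # First pass: locate the root.
    root = x
    while parent[root] is not None:
        root = parent[root]
    # Second pass: make every node on the path a child of the root.
    while x != root:
        nxt = parent[x]
        parent[x] = root
        x = nxt
    return root

# A chain 0 -> 1 -> 2 -> 3 -> 4, where 4 is the root.
parent = [1, 2, 3, 4, None]
print(find_compress(parent, 0))
print(parent)
```

After the call, nodes 0 through 3 are all direct children of the root 4.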

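Example • After Find(x): [Figure: a tree before and after Find(x); every node on the path from x to the root becomes a child of the root.]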

Complexity? • Union(x, y) hasn’t changed – O(1). • A single Find(x) still costs, in the worst case: • O(n) for arbitrary linking. • O(log n) for linking by rank/weight. • But what about a series of operations? • Each Find operation may reduce the running time of subsequent Finds. • What is the amortized cost per operation?

A Little Bit of History • What is the amortized cost per operation employing linking by rank/weight and path compression? • 1972: Fischer showed O(log log n). • 1973: Hopcroft & Ullman showed O(log* n). • 1975: Tarjan showed O(α(m,n)). • 1979: Tarjan showed also Ω(α(m,n)).

Definitions • We consider only paths p from a node x to an ancestor y of x. • p is called a rootpath if y is a root. • p is called a nonrootpath if y is not a root. In this case we define a(p) as the parent of y. • We also consider empty paths involving no nodes, and we classify them as rootpaths.

Definitions – cont’d • Compressing a rootpath p means making all the nodes of p roots. • Compressing a nonrootpath p means making all the nodes of p children of a(p). • We define Cost(p) to be the number of nodes that get a new parent in the operation of compressing p. • If p is a nonrootpath with d nodes then Cost(p) = d-1 (the top node y already has a(p) as its parent). • If p is a rootpath then Cost(p) = 0.

Example • Result of compressing the path from x to y: [Figure: a tree before and after compressing the path from x to y, for a nonrootpath and for a rootpath.]

More Definitions • Let C = (p1, p2, …, pk) be a sequence of path compressions on an initial forest F. • We define Cost(C) as the total number of times that a node of F gets a new parent. • Cost(C) = Cost(p1) + Cost(p2) + … + Cost(pk) • Also define |C| as the number of nonrootpaths in C.
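The definitions of compression and cost can be made concrete with a short sketch. The function name `compress` and the list-based forest are assumptions for illustration; empty paths are not modeled, and y is assumed to be an ancestor of x.

```python
def compress(parent, x, y):
    """Compress the path from x up to its ancestor y and return Cost(p)."""
    path = [x]
    while path[-1] != y:
        path.append(parent[path[-1]])
    a = parent[y]                 # a(p) for a nonrootpath, None for a rootpath
    for v in path:
        parent[v] = a             # a rootpath turns all its nodes into roots
    # A nonrootpath with d nodes costs d - 1; a rootpath costs 0 by definition.
    return len(path) - 1 if a is not None else 0

# A chain 0 -> 1 -> 2 -> 3 -> 4, where 4 is the root.
parent = [1, 2, 3, 4, None]
C = [(0, 2), (0, 4)]              # a nonrootpath, then a rootpath
costs = [compress(parent, x, y) for x, y in C]
print(costs, sum(costs))
```

Here Cost(C) is the sum of the per-path costs, and |C| = 1 because only the first path is a nonrootpath.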

The Reduction Lemma • Let S be a sequence of Union operations and m Find operations on an initial partitioning of an n-element set X into singletons. • Let T be the time necessary to execute S. • There is a forest F on X and a sequence C of m path compressions, all nonrootpaths or empty paths, such that T = O(m + n + Cost(C)).

The Reduction Lemma • Proof: • For F take the forest generated by executing just the Union operations. • The sequence of Find operations then defines a sequence C of nonrootpaths (or empty paths) in F.

The Reduction Lemma • Proof – cont’d: • A Find operation costs O(1 + length(p)). • For nonrootpaths and empty paths we have Cost(p) = length(p). • Thus each Find operation costs O(1 + Cost(p)). • Summing over all m Finds we get O(m + Cost(C)). • Since the size of the input is n, and at most n-1 Unions can occur, we deduce that T = O(m + n + Cost(C)).

Now What? • We proved T = O(m + n + Cost(C)). • Thus if we prove an upper bound on Cost(C), then we immediately get an upper bound on T. • So now we have to try and bound Cost(C). • We’ll use a “divide and conquer” technique: • Divide the problem into 2 independent sub-problems. • Solve each sub-problem (can be done recursively). • Combine both results into a result for the main problem.

Dissections • As before, F is a rooted forest on a node set X. • A dissection of F is a partition (Xb,Xt) of X such that Xt is upwards closed in F, i.e. If x is in Xt, then every ancestor of x is also in Xt. • Intuitively, you can think of a dissection as a horizontal line that cuts the forest into a top and a bottom part.
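Checking whether a candidate top set is upwards closed takes one pass over its nodes: it suffices that the parent of every node in Xt is again in Xt. A small sketch, assuming the list-based forest used above and a Python set `top` for Xt (both illustrative choices):

```python
def is_dissection(parent, top):
    # (X_b, X_t) is a dissection iff X_t is upwards closed:
    # every non-root node of X_t must have its parent in X_t.
    return all(parent[x] is None or parent[x] in top for x in top)

# Tree: 0 is the root, 1 and 2 are children of 0, and 3 is a child of 1.
parent = [None, 0, 0, 1]
print(is_dissection(parent, {0, 1}))   # top part {0, 1}, bottom part {2, 3}
print(is_dissection(parent, {1}))      # 1 is in the top part but its parent 0 is not
```

Checking only direct parents is enough, because closure under "parent of" implies closure under "ancestor of".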

Examples • Is this a dissection? • Note: Xb nodes are red, Xt nodes are green. • [Figure: three colorings of a forest; the answers are Yes, No, Yes.]

Simple Facts • Some simple facts about dissections: • A dissection (Xb,Xt) cuts every path p into 2 contiguous paths: pb in F(Xb) and pt in F(Xt). • Dissections are preserved under path compression. • Let C = (p(1), p(2), …, p(k)) be a compression sequence on F, and let F(1), F(2), …, F(k) be the sequence of resulting forests, then: • Cb = (pb(1), pb(2), …, pb(k)) is a compression sequence on F(Xb) with resulting forests F(1)(Xb), F(2)(Xb), …, F(k)(Xb). • The same holds for Ct = (pt(1), pt(2), …, pt(k)).

The Main Lemma • Let C = (p(1), p(2), …, p(k)) be a sequence of path compressions on an initial forest F with node set X. • Let (Xb,Xt) be a dissection of F. • Then the following holds: • |Cb| + |Ct| ≤ |C| • Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct|

The Main Lemma • Proof: • Recall that |C| = the number of nonrootpaths in C. • If p(i) is a nonrootpath, then either pb(i) or pt(i) (or both) must be a rootpath, so at most one of them is a nonrootpath. • If p(i) is a rootpath, then both pb(i) and pt(i) are rootpaths. • Thus we get |Cb| + |Ct| ≤ |C|.

The Main Lemma • Proof – cont’d: • Recall that Cost(C) is the number of times a node of X gets a new parent while executing C. • We have 4 cases. The number of times that… • A node from Xt gets a new parent from Xt = Cost(Ct) • A node from Xt gets a new parent from Xb = 0 • A node from Xb gets a new parent from Xb = Cost(Cb) • A node from Xb gets a new parent from Xt = ?

The Main Lemma • Proof – cont’d: • What is the number of times that a node from Xb gets a new parent from Xt? • The number of times that a node from Xb gets a parent from Xt for the first time is ≤ |Xb|. • A node from Xb can get a new parent from Xt again only when pt(i) is a nonrootpath, so the number of times this case can occur is ≤ |Ct|. • Thus Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct|
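Both inequalities of the main lemma can be checked numerically on a small instance. The sketch below is an illustration only: the dictionary-based forest, the function names, and the example sequence are assumptions. It runs each compression on F and, in parallel, runs the bottom and top parts of the same path on F(Xb) and F(Xt).

```python
def compress(parent, path):
    """Compress a path given bottom-to-top; return (cost, is_nonrootpath)."""
    if not path:
        return 0, False                      # empty paths count as rootpaths
    a = parent[path[-1]]
    for v in path:
        parent[v] = a
    return (len(path) - 1, True) if a is not None else (0, False)

def main_lemma_quantities(parent, top, endpoints):
    full = dict(parent)
    # F(X_b) and F(X_t) stored together: an edge survives only if it stays on one side.
    sub = {v: (p if (p in top) == (v in top) else None) for v, p in parent.items()}
    q = dict(cost=0, cost_b=0, cost_t=0, n_c=0, n_b=0, n_t=0)
    for x, y in endpoints:
        path = [x]
        while path[-1] != y:                 # y is assumed to be an ancestor of x
            path.append(full[path[-1]])
        pb = [v for v in path if v not in top]
        pt = [v for v in path if v in top]
        for forest, p, ck, nk in ((full, path, "cost", "n_c"),
                                  (sub, pb, "cost_b", "n_b"),
                                  (sub, pt, "cost_t", "n_t")):
            c, nonroot = compress(forest, p)
            q[ck] += c
            q[nk] += nonroot
    q["size_b"] = sum(1 for v in parent if v not in top)
    return q

# Chain 0 -> 1 -> 2 -> 3 -> 4 -> 5 with X_t = {3, 4, 5}.
parent = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: None}
q = main_lemma_quantities(parent, {3, 4, 5}, [(0, 1), (0, 3), (0, 4), (1, 4)])
print(q["n_b"] + q["n_t"] <= q["n_c"])
print(q["cost"] <= q["cost_b"] + q["cost_t"] + q["size_b"] + q["n_t"])
```

On this instance Cost(C) = 6 while the right-hand side is 1 + 0 + 3 + 3 = 7, so the bound holds with little slack.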

Arbitrary Linking • Let f(m,n) be the maximum cost of a compression sequence C with |C| = m in any forest of n nodes (rootpaths cost nothing, so only the nonrootpaths are counted). • Let C be such a sequence on a forest F of n nodes. • Pick a dissection of F such that |Xb| = |Xt| = n/2, and write mb = |Cb| and mt = |Ct|; by the main lemma, mb + mt ≤ m. • Using the main lemma we get: Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct| ≤ f(mb, n/2) + f(mt, n/2) + n/2 + mt

Arbitrary Linking - cont’d • We got: Cost(C) ≤ f(mb, n/2) + f(mt, n/2) + n/2 + mt, where mb + mt ≤ m. • Since C was picked arbitrarily we get: f(m,n) ≤ f(mb, n/2) + f(mt, n/2) + n/2 + mt for some mb + mt ≤ m. • Keeping mb and mt separate matters: bounding both by m would give f(m,n) ≤ 2·f(m, n/2) + n/2 + m, which only solves to O(m·n). • This recursion solves to f(m,n) = O((m+n)·log n). • Thus the running time using path compression and arbitrary linking is O((m+n)·log n). • For k > 1, picking |Xt| = n/k instead of n/2 we get a bound of O((m+k·n)·log_k n). • For k = 1+⌈m/n⌉ we get a better O((m+n)·log_k n) with k = 1+⌈m/n⌉.
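The claimed solution can be checked by induction on n. The following is a sketch, assuming n is a power of two, logarithms to base 2, the base case f(m,1) = 0, and writing m_b = |C_b| and m_t = |C_t|, for which the main lemma guarantees m_b + m_t ≤ m:

```latex
\begin{align*}
f(m,n) &\le f(m_b, n/2) + f(m_t, n/2) + n/2 + m_t \\
       &\le (m_b + n/2)\log(n/2) + (m_t + n/2)\log(n/2) + n/2 + m_t \\
       &\le (m + n)(\log n - 1) + n/2 + m \\
       &=   (m + n)\log n - n/2 \\
       &\le (m + n)\log n .
\end{align*}
```

The second line applies the induction hypothesis to both halves; the third uses m_b + m_t ≤ m and m_t ≤ m.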

Linking by Rank • We now wish to use our main lemma to bound the running-time of path compression with linking by rank. • The key for getting a good bound is choosing a “good” dissection on which to invoke the lemma. • Whether a dissection is “good” or “bad” depends on the forest structure. • So first we need to characterize the forests that arise while using linking by rank, and investigate their properties.

Rank Forests • Define rank(x) = the height of the sub-tree rooted at x. • F is called a rank forest if for every node x in F the following property holds: • For each 0 ≤ i < rank(x), x has a child with rank i. • One can easily confirm (using induction) that: • Linking by rank yields rank forests. • In rank forests, each node of rank k is a root of a sub-tree of size at least 2^k.

Rank Forest Properties • Let F be a rank forest with node set X and maximum rank r, and let 0 ≤ s < r be some integer. • Denote by X≤s the set of nodes with rank ≤ s. • Denote by X>s the set of nodes with rank > s. • Then: • (X≤s,X>s) is a dissection of F. • F(X≤s) is a rank forest with maximum rank s. • F(X>s) is a rank forest with maximum rank r-s-1. • |X>s| ≤ |X|/2^(s+1).
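These properties can be observed on a small rank forest. The sketch below builds one by linking 16 singletons by rank, then checks for every s that X>s is upwards closed and compares |X>s| with |X|/2^(s+1). The helper name `build_rank_forest` and the pairwise merge order are assumptions chosen for illustration.

```python
def build_rank_forest(n):
    # Link n singletons pairwise by rank, without path compression,
    # so rank[x] equals the height of the sub-tree rooted at x.
    parent, rank = [None] * n, [0] * n
    roots = list(range(n))
    while len(roots) > 1:
        nxt = []
        for i in range(0, len(roots) - 1, 2):
            x, y = roots[i], roots[i + 1]
            if rank[x] > rank[y]:
                x, y = y, x
            parent[x] = y
            if rank[x] == rank[y]:
                rank[y] += 1
            nxt.append(y)
        if len(roots) % 2:
            nxt.append(roots[-1])        # an unpaired root moves to the next round
        roots = nxt
    return parent, rank

n = 16
parent, rank = build_rank_forest(n)
r = max(rank)
for s in range(r):
    top = [x for x in range(n) if rank[x] > s]
    closed = all(parent[x] is None or rank[parent[x]] > s for x in top)
    print(s, closed, len(top), n // 2 ** (s + 1))
```

For this particular forest the size bound is met with equality at every s.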

Linking by Rank • Let f(m,n,r) be the maximum cost for a sequence of path compressions of length m on a rank forest with n nodes and maximum rank r. • We start with the trivial bound f(m,n,r) ≤ (r-1)·n (each node can get a new parent at most r-1 times). • Repeated use of the main lemma will give us better and better bounds on f. • Note that what we’re really looking for is f(m,n,⌊log n⌋).

Iteration I • Let C be any sequence of path compressions of length m in a rank forest with n nodes and maximum rank r. • Consider the dissection (X≤s,X>s) with s = log r. • Cost(Ct) ≤ f(|Ct|,|X>s|,r-s-1) ≤ (r-s-1)·|X>s| ≤ r·|X>s| ≤ r·|X|/2^(s+1) ≤ r·n/2^s = r·n/2^(log r) = r·n/r = n • Now apply the main lemma on (X≤s,X>s): • Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct| ≤ Cost(Cb) + n + n + (|C| - |Cb|) = Cost(Cb) - |Cb| + 2n + |C|

Iteration I – cont’d • We showed: Cost(C) ≤ Cost(Cb) - |Cb| + 2n + |C| • We can bound Cost(Cb) the same way we did for Cost(Ct), using the trivial bound: • Cost(Cb) ≤ f(m,n,s) ≤ n·s = n·log r • This way we get Cost(C) ≤ n·log r + 2n + m. • Hence also f(m,n,r) ≤ n·log r + 2n + m. • Using the reduction lemma, we get a bound of O(m + n + f(m,n,log n)) = O(m + n·log log n). • But a better idea is to apply the top inequality to itself.

Iteration II • Cost(C) ≤ Cost(Cb) - |Cb| + 2n + |C| ≤ (Cost(Cbb) - |Cbb| + 2n + |Cb|) - |Cb| + 2n + |C| ≤ Cost(Cbb) - |Cbb| + 4n + |C| • Cbb is a compression sequence in a rank forest with maximum rank log s = log log r. • Bounding Cost(Cbb) using our trivial bound will now yield a bound of O(m + n·log log log n). • But why stop here?

Iteration #j • Now we have Cost(C) ≤ Cost(C’) - |C’| + 2j·n + |C| • C’ is a compression sequence in a rank forest with maximum rank log^(j) r (the logarithm applied j times). • For j = log* r we have log^(j) r ≤ 1, so C’ is a compression sequence in a rank forest with maximum rank 1. • In such a forest any compression sequence has cost 0, so it looks like our party is over: • Cost(C) ≤ 2n·log* r + |C|, yielding O(m + n·log* n).
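To get a feel for how small log* is, here is a minimal sketch of the function, assuming base-2, real-valued logarithms (the slides do not fix the rounding convention, so this choice is an assumption):

```python
import math

def log_star(r):
    # Number of times log2 must be applied to r before the value drops to <= 1.
    count = 0
    while r > 1:
        r = math.log2(r)
        count += 1
    return count

for r in (2, 16, 65536, 2 ** 65536):
    print(log_star(r))
```

The last argument has almost 20,000 decimal digits, yet its log* is only 5.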

There’s More… • We now know the bound f(m,n,r) ≤ 2n·log* r + m. • Using it, we can start all over again: • Apply the main lemma on (X≤s,X>s) with s = log log* r. • Bound Cost(Ct) ≤ n + |Ct|. • Bound Cost(C) ≤ Cost(Cb) - 2|Cb| + 2n + 2|C|. • Apply the last inequality to itself (log log*)*(r) times, yielding Cost(C) ≤ 2n·(log log*)*(r) + 2|C|. • Now we have a new improved bound on f, and we can repeat the process over and over again.

The Shifting Lemma • Let g(r) be a non-decreasing function with g(r) < r. • For a function h, let h*(r) denote the number of times h must be applied to r to reach a value ≤ 1 (as in log*), and define g⟡ = (⌈log⌉∘g)*. • Suppose that f(m,n,r) ≤ k·m + 2n·g(r) for all m, n, r. • Then also f(m,n,r) ≤ (k+1)·m + 2n·g⟡(r) for all m, n, r. • Proof outline: • Use induction on r. • As we’ve done before, apply the main lemma on the dissection (X≤s,X>s) with s = ⌈log g(r)⌉.

The “Ackermann Bound” • Define: • J_0(r) = ⌈(r-1)/2⌉ • J_k(r) = (J_{k-1})⟡(r) for k > 0 • For all m, n, r: • f(m,n,r) ≤ n·(r-1) ≤ 2n·J_0(r) = 0·m + 2n·J_0(r) • Thus, by the shifting lemma, for all m, n, r, k: • f(m,n,r) ≤ k·m + 2n·J_k(r) • Define α(m,n) = min { k | J_k(⌊log n⌋) ≤ 1 + m/n }. • Then f(m,n,⌊log n⌋) ≤ α(m,n)·m + 2n + 2m
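The functions J_k and α(m,n) can be evaluated directly from these definitions. The sketch below assumes base-2 logarithms and reads the star operator as "the number of applications needed to reach a value ≤ 1", as with log*; that reading and all helper names are assumptions for illustration.

```python
def ceil_log2(x):
    # ceil(log2(x)) for an integer x >= 1.
    return (x - 1).bit_length()

def star(h):
    # h*(r): how many times h must be applied to r to reach a value <= 1.
    def h_star(r):
        count = 0
        while r > 1:
            r = h(r)
            count += 1
        return count
    return h_star

def J(k):
    if k == 0:
        return lambda r: r // 2          # equals ceil((r - 1) / 2) for integers r >= 1
    prev = J(k - 1)
    return star(lambda r: ceil_log2(prev(r)))

def alpha(m, n):
    r = n.bit_length() - 1               # floor(log2(n)) for n >= 1
    k = 0
    while J(k)(r) > 1 + m / n:
        k += 1
    return k

print(J(0)(20), J(1)(20))
print(alpha(2 ** 20, 2 ** 20))
print(alpha(2 ** 1000, 2 ** 1000))
```

Even for n = 2^1000 with m = n, the value of α(m,n) under these conventions is only 2.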

The “Ackermann Bound” • So by the reduction lemma we finally get the desired: • Theorem: Performing a sequence of Union operations and m Find operations on a set of size n using linking by rank and path compression requires O(n + m·α(m,n)) time.

Concluding Remarks • We reached α(m,n) in a somewhat “natural” way. • We could have just proven the shifting lemma. • But going through the iterations improved our intuition for the “top-down method”. • It also helped us better understand the inverse Ackermann function, and how slowly it really grows. • An analogous top-down proof for the α(m,n) bound also exists for path compression with linking by weight.
