Top-Down Analysis of Path Compression. Raimund Seidel and Micha Sharir. Introduction. What is path compression? An operation on rooted trees. Used mainly in union-find algorithms. This paper analyzes the amortized time complexity of path compression.
Top-Down Analysis of Path Compression Raimund Seidel and Micha Sharir Presented by Or Ozery
Introduction • What is path compression? • An operation on rooted trees. • Used mainly in union-find algorithms. • This paper analyzes the amortized time-complexity of path compression. • Previous analyses all used low-level accounting. • The analysis in this paper is based on a more “natural” top-down approach.
Lecture Outline • Recall the union-find data structure. • Implementation using rooted trees. • Doing better: path compression. • The top-down analysis by Seidel and Sharir: • The Main Lemma. • Repetitive use of the Main Lemma yields increasingly better bounds. • Finally we get the “famous” α(n) bound.
The Union-Find DS • X is a set of n elements. • A partition of X is given. • Each subset has a unique representative. • Support for the following operations: • Find(x) returns the representative of the subset containing x. • Union(x, y) merges the subsets represented by x and y. Representative of the new set picked arbitrarily.
A Simple Example • X = {1, 2, 3, 4, 5, 6, 7, 8}. • The partition is given by: {1, 2, 3}, {4, 5}, {6}, {7, 8}. • After Union(6,7): {1, 2, 3}, {4, 5}, {6, 7, 8}. • Note: in the original slide, representatives are colored in yellow.
Implementation • Abstract: • Represent X by a rooted forest. • Nodes are the elements of X. • Each tree represents a subset. • Roots are representatives. • Memory representation: • Each node, except a root, has a parent. • The nodes will keep a pointer to their parent.
Implementation – cont’d
• Find(x)
    if (parent[x] = null) return x
    else return Find(parent[x])
• Union(x, y)
    parent[x] = y
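The two operations above translate almost directly into code. The following is a minimal Python sketch, assuming a dictionary-based parent map; the class name `NaiveUnionFind` and its method names are illustrative and do not come from the slides. It rebuilds the partition of the simple example and then performs Union(6,7).

```python
class NaiveUnionFind:
    """Rooted-forest union-find with arbitrary linking and no path compression."""

    def __init__(self, elements):
        # Every element starts as the root (representative) of its own singleton.
        self.parent = {x: None for x in elements}

    def find(self, x):
        # Climb parent pointers until a root is reached.
        while self.parent[x] is not None:
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Assumption: x and y are distinct representatives (roots).
        self.parent[x] = y


uf = NaiveUnionFind(range(1, 9))
for child, root in [(2, 1), (3, 1), (5, 4), (8, 7)]:
    uf.union(child, root)   # builds {1,2,3}, {4,5}, {6}, {7,8}
uf.union(6, 7)              # 6 becomes a child of 7
```

Here `find` is written as a loop rather than recursively, which is equivalent and avoids Python's recursion limit on long paths.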
Example • [Figure: the forest before and after Union(6,7); node 6 becomes a child of node 7.]
Complexity Analysis • Union(x, y) – O(1). • Find(x) – O(length of path from x to root) = O(n) in the worst case. • We need to keep trees balanced. How? • Trees are built in Union. • Make the “bigger” tree the parent of the “smaller” one.
Linking Strategies • Our implementation used naïve arbitrary linking. • Consider 2 other heuristics: • Linking by weight: Take the tree with more nodes as parent. • Linking by rank: Take the tree with the greater height as parent. • Both strategies can be easily implemented in Union in O(1) time, and both yield trees of height O(log n).
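Linking by rank needs only one extra counter per node. The sketch below is one possible Python rendering, assuming that ranks are stored in a dictionary and that Union is called on roots; `RankUnionFind` and `depth` are hypothetical names introduced for illustration.

```python
class RankUnionFind:
    """Union-find with linking by rank (no path compression)."""

    def __init__(self, elements):
        self.parent = {x: None for x in elements}
        self.rank = {x: 0 for x in elements}   # equals tree height while no compression is used

    def find(self, x):
        while self.parent[x] is not None:
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Assumption: x and y are distinct roots.
        if self.rank[x] > self.rank[y]:
            x, y = y, x                  # y is now the root of greater (or equal) rank
        self.parent[x] = y
        if self.rank[x] == self.rank[y]:
            self.rank[y] += 1            # equal heights: the merged tree is one level taller
        return y                         # the new representative

    def depth(self, x):
        # Number of edges from x to its root.
        d = 0
        while self.parent[x] is not None:
            x, d = self.parent[x], d + 1
        return d


uf = RankUnionFind(range(16))
roots = list(range(16))
while len(roots) > 1:
    # Merge roots pairwise, the pattern that makes trees grow as tall as possible.
    roots = [uf.union(roots[i], roots[i + 1]) for i in range(0, len(roots), 2)]
```

Even under this pairwise merging pattern, the depth of every node stays within log n.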
Path Compression • Ideally, we’d have all nodes children of the root. • Changing a tree to look that way will cost O(n). • But we can use the time spent on Find(x) to do some work that will bring us closer to our vision: While climbing the path from x to the root, make all the nodes on the way children of the root.
Implementation
• Find(x)
    if (parent[x] = null) return x
    else
        root = Find(parent[x])
        parent[x] = root
        return root
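The recursive Find above can also be written as two loops: one to locate the root, one to re-point the nodes on the path. The following Python sketch uses that iterative variant (a substitution for the slide's recursion, chosen to avoid deep recursion), assuming the forest is a dictionary mapping each node to its parent or `None`.

```python
def find(parent, x):
    """Return the root of x and make every node on the path a child of the root."""
    root = x
    while parent[root] is not None:      # first pass: locate the root
        root = parent[root]
    while x != root:                     # second pass: compress the path
        parent[x], x = root, parent[x]   # right-hand side is read before either assignment
    return root


# A chain 1 -> 2 -> 3 -> 4 -> 5, where 5 is the root.
parent = {1: 2, 2: 3, 3: 4, 4: 5, 5: None}
root = find(parent, 1)
```

After the call, nodes 1 through 4 all point directly at 5.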
Example • [Figure: a tree before and after Find(x); every node on the path from x to the root becomes a child of the root.]
Complexity? • Union(x, y) hasn’t changed – O(1). • A single Find(x) still costs, in the worst case: • O(n) for arbitrary linking. • O(log n) for linking by rank/weight. • But what about a series of operations? • Each Find operation may reduce the running time of subsequent Finds. • What is the amortized cost per operation?
A Little Bit of History • What is the amortized cost per operation employing linking by rank/weight and path compression? • 1972: Fischer showed O(log log n). • 1973: Hopcroft & Ullman showed O(log*n). • 1975: Tarjan showed O(α(m,n)). • 1979: Tarjan showed also Ω(α(m,n)).
Definitions • We consider only paths p from a node x to an ancestor y of x. • p is called a rootpath if y is a root. • p is called a nonrootpath if y is not a root. In this case we define a(p) as the parent of y. • We also consider empty paths involving no nodes, and we classify them as rootpaths.
Definitions – cont’d • Compressing a rootpath p makes all the nodes of p roots. • Compressing a nonrootpath p makes all the nodes of p children of a(p). • We define Cost(p) to be the number of nodes that get a new parent in the operation of compressing p. • If p is a nonrootpath with d nodes then Cost(p) = d-1 (the top node already has a(p) as its parent). • If p is a rootpath then Cost(p) = 0 (its nodes lose their parents rather than gain new ones).
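The generalized compression and its cost can be made concrete with a small example. This is a sketch under assumptions: the forest is a dictionary from node to parent (`None` for roots), a path is given as a list of nodes from bottom to top, and the function name `compress` is illustrative.

```python
def compress(parent, path):
    """Compress `path` (nodes listed from bottom to top) and return Cost(p)."""
    if not path or parent[path[-1]] is None:
        # Rootpath (empty paths included): every node on it becomes a root.
        for v in path:
            parent[v] = None
        return 0
    a = parent[path[-1]]          # a(p), the parent of the top node y
    for v in path[:-1]:           # the top node already has parent a(p)
        parent[v] = a
    return len(path) - 1


parent = {0: 1, 1: 2, 2: 3, 3: 4, 4: None}   # a chain 0 -> 1 -> 2 -> 3 -> 4
cost_nonroot = compress(parent, [0, 1, 2])   # nonrootpath with d = 3 nodes, a(p) = 3
cost_root = compress(parent, [0, 3, 4])      # rootpath, since 4 is a root
```

The first call re-parents nodes 0 and 1 to node 3 at cost 2; the second turns 0, 3 and 4 into roots at cost 0.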
Example • [Figure: result of compressing the path from x to y, shown for a rootpath and for a nonrootpath.]
More Definitions • Let C = (p1, p2, …, pk) be a sequence of path compressions on an initial forest F. • We define Cost(C) as the total number of times that a node of F gets a new parent. • Cost(C) = Cost(p1) + Cost(p2) + … + Cost(pk) • Also define |C| as the number of nonrootpaths in C.
The Reduction Lemma • Let S be a sequence of Union and m Find operations on an initial partitioning of an n-element set X into singletons. • Let T be the time necessary to execute S. • There is a forest F on X and a sequence C of m path compressions, all nonrootpaths or empty paths, such that T = O(m + n + Cost(C)).
The Reduction Lemma • Proof: • For F take the forest generated by executing just the Union operations. • The sequence of Find operations then defines a sequence C of nonrootpaths (or empty paths) in F.
The Reduction Lemma • Proof – cont’d: • A Find operation costs O(1 + length(p)). • For nonrootpaths and empty paths we have Cost(p) = length(p). • Thus each Find operation costs O(1 + Cost(p)). • Summing over all m Finds we get O(m + Cost(C)). • Since the size of the input is n, and at most n-1 Unions can occur, we deduce that T = O(m + n + Cost(C)).
Now What? • We proved T = O(m + n + Cost(C)). • Thus if we prove an upper bound on Cost(C), then we immediately get an upper bound on T. • So now we have to try and bound Cost(C). • We’ll use a “divide and conquer” technique: • Divide the problem into 2 independent sub-problems. • Solve each sub-problem (can be done recursively). • Combine both results into a result for the main problem.
Dissections • As before, F is a rooted forest on a node set X. • A dissection of F is a partition (Xb,Xt) of X such that Xt is upwards closed in F, i.e., if x is in Xt, then every ancestor of x is also in Xt. • Intuitively, you can think of a dissection as a horizontal line that cuts the forest into a top and a bottom part.
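Upward closure is easy to test mechanically: it is enough to check that the parent of every node in Xt is again in Xt, since the statement for all ancestors then follows by induction. A short Python sketch, assuming the forest is a node-to-parent dictionary and using the hypothetical helper name `is_dissection`:

```python
def is_dissection(parent, top):
    """Check that `top` (the candidate X_t) is upward closed in the forest `parent`."""
    return all(parent[x] is None or parent[x] in top for x in top)


# A small tree: 1 is the root, 2 and 3 are its children, 4 is a child of 2.
parent = {1: None, 2: 1, 3: 1, 4: 2}
good = is_dissection(parent, {1, 2})   # the root and one child: upward closed
bad = is_dissection(parent, {2, 4})    # 2 is in the set but its parent 1 is not
```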
Examples • Is this a dissection? • [Figure: three candidate partitions of a forest, with Xb nodes in red and Xt nodes in green. The answers are: Yes, No, Yes.]
Simple Facts • Some simple facts about dissections: • A dissection (Xb,Xt) cuts every path p into 2 contiguous paths: pb in F(Xb) and pt in F(Xt). • Dissections are preserved under path compression. • Let C = (p(1), p(2), …, p(k)) be a compression sequence on F, and let F(1), F(2), …, F(k) be the sequence of resulting forests, then: • Cb = (pb(1), pb(2), …, pb(k)) is a compression sequence on F(Xb) with resulting forests F(1)(Xb), F(2)(Xb), …, F(k)(Xb). • The same holds for Ct = (pt(1), pt(2), …, pt(k)).
The Main Lemma • Let C = (p(1), p(2), …, p(k)) be a sequence of path compressions on an initial forest F with node set X. • Let (Xb,Xt) be a dissection of F. • Then the following holds: • |Cb| + |Ct| ≤ |C| • Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct|
The Main Lemma • Proof: • Recall that |C| = the number of nonrootpaths in C. • If p(i) is a nonrootpath, then either pb(i) or pt(i) (or both) must be a rootpath. • If p(i) is a rootpath, then both pb(i) and pt(i) are rootpaths. • Thus we get |Cb| + |Ct| ≤ |C|.
The Main Lemma • Proof – cont’d: • Recall that Cost(C) is the number of times a node of X gets a new parent while executing C. • We have 4 cases. The number of times that… • A node from Xt gets a new parent from Xt = Cost(Ct) • A node from Xt gets a new parent from Xb = 0 • A node from Xb gets a new parent from Xb = Cost(Cb) • A node from Xb gets a new parent from Xt = ?
The Main Lemma • Proof – cont’d: • What is the number of times that a node from Xb gets a new parent from Xt? • The number of times that a node from Xb gets a parent from Xt for the first time is ≤ |Xb|. • A node from Xb can get a new parent from Xt again only when pt(i) is a nonrootpath, and then only the top node of pb(i) is affected, so the number of times this case can occur is ≤ |Ct|. • Thus Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct|
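Both inequalities of the Main Lemma can be checked experimentally on random inputs. The sketch below is an illustration under assumptions, not part of the original analysis: nodes are labeled so that every parent has a larger label than its child (which makes Xt = {v ≥ cut} upward closed), and the names `compress` and `run_experiment` are hypothetical.

```python
import random


def compress(parent, path):
    """Compress `path` (bottom to top); return (Cost(p), True if p is a nonrootpath)."""
    if not path or parent[path[-1]] is None:
        for v in path:                 # rootpath: all of its nodes become roots
            parent[v] = None
        return 0, False
    a = parent[path[-1]]               # a(p)
    for v in path[:-1]:
        parent[v] = a
    return len(path) - 1, True


def run_experiment(n=60, steps=300, seed=0):
    rng = random.Random(seed)
    # Random forest in which each parent has a larger label than its child.
    parent = {v: (rng.randrange(v + 1, n) if v < n - 1 and rng.random() < 0.9 else None)
              for v in range(n)}
    cut = n // 2                       # X_b = {v < cut}, X_t = {v >= cut}
    bottom = {v: (p if p is not None and p < cut else None)
              for v, p in parent.items() if v < cut}        # F(X_b)
    top = {v: p for v, p in parent.items() if v >= cut}     # F(X_t)
    total = {"cost": 0, "cost_b": 0, "cost_t": 0,
             "nonroot": 0, "nonroot_b": 0, "nonroot_t": 0, "size_b": cut}
    for _ in range(steps):
        path = [rng.randrange(n)]      # random path from a node to some ancestor
        while parent[path[-1]] is not None and rng.random() < 0.7:
            path.append(parent[path[-1]])
        parts = (("", parent, path),
                 ("_b", bottom, [v for v in path if v < cut]),    # p_b
                 ("_t", top, [v for v in path if v >= cut]))      # p_t
        for key, forest, part in parts:
            cost, nonroot = compress(forest, part)
            total["cost" + key] += cost
            total["nonroot" + key] += nonroot
    return total


t = run_experiment()
lemma_counts = t["nonroot_b"] + t["nonroot_t"] <= t["nonroot"]
lemma_cost = t["cost"] <= t["cost_b"] + t["cost_t"] + t["size_b"] + t["nonroot_t"]
```

Here `nonroot`, `nonroot_b` and `nonroot_t` play the roles of |C|, |Cb| and |Ct|, and `size_b` is |Xb|.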
Arbitrary Linking • Let f(m,n) be the maximum cost for a sequence of m path compressions in any forest of n nodes. • Let C be a sequence of m path compressions on a forest F of n nodes. • Pick a dissection of F such that |Xb| = |Xt| = n/2. • Write mb = |Cb| and mt = |Ct|; by the main lemma, mb + mt ≤ m. • Using the main lemma we get: Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct| ≤ f(mb, n/2) + f(mt, n/2) + n/2 + mt
Arbitrary Linking - cont’d • We got: Cost(C) ≤ f(mb, n/2) + f(mt, n/2) + n/2 + mt, where mb + mt ≤ m. • Since C was picked arbitrarily we get: f(m,n) ≤ f(mb, n/2) + f(mt, n/2) + n/2 + mt for some mb + mt ≤ m. • Splitting m between the two halves matters: the cruder bound 2·f(m, n/2) + n/2 + m is also valid, but it is too weak to give the result below. • This recursion solves to f(m,n) = O((m+n)·log n). • Thus the running time using path compression and arbitrary linking is O((m+n)·log n). • For k > 1, picking |Xt| = n/k instead of n/2 we get a bound of O((m+k·n)·log_k n). • For k = 1+⌈m/n⌉ we get a better O((m+n)·log_{1+⌈m/n⌉} n).
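The O((m+n) log n) bound can be verified by induction on n. The derivation below is a sketch, assuming n is a power of 2, logarithms to base 2, and the notation m_b = |C_b|, m_t = |C_t| with m_b + m_t ≤ m as guaranteed by the Main Lemma. The claim is f(m,n) ≤ (m+n) log n, which holds trivially for n = 1.

```latex
\begin{align*}
f(m,n) &\le f(m_b, n/2) + f(m_t, n/2) + n/2 + m_t \\
       &\le (m_b + n/2)\log(n/2) + (m_t + n/2)\log(n/2) + n/2 + m_t \\
       &\le (m + n)(\log n - 1) + n/2 + m \\
       &= (m + n)\log n - n/2 \\
       &\le (m + n)\log n .
\end{align*}
```

The third line uses m_b + m_t ≤ m and m_t ≤ m; the induction only works because the m compressions are shared between the two halves rather than charged to both.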
Linking by Rank • We now wish to use our main lemma to bound the running-time of path compression with linking by rank. • The key for getting a good bound is choosing a “good” dissection on which to invoke the lemma. • Whether a dissection is “good” or “bad” depends on the forest structure. • So first we need to characterize the forests that arise while using linking by rank, and investigate their properties.
Rank Forests • Define rank(x) = the height of the sub-tree rooted at x. • F is called a rank forest if for every node x in F the following property holds: • For each 0 ≤ i < rank(x), x has a child with rank i. • One can easily confirm (using induction) that: • Linking by rank yields rank forests. • In rank forests, each node of rank k is a root of a sub-tree of size at least 2^k.
Rank Forest Properties • Let F be a rank forest with node set X and maximum rank r, and let 0 ≤ s < r be some integer. • Denote by X≤s the set of nodes with rank ≤ s. • Denote by X>s the set of nodes with rank > s. • Then: • (X≤s,X>s) is a dissection of F. • F(X≤s) is a rank forest with maximum rank s. • F(X>s) is a rank forest with maximum rank r-s-1. • |X>s| ≤ |X|/2^{s+1}.
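The counting property |X>s| ≤ |X|/2^{s+1} can be observed on forests produced by linking by rank. The following Python sketch assumes random unions of roots and no path compression, so that the stored rank equals the height; `build_rank_forest` and `count_above` are illustrative names.

```python
import random


def build_rank_forest(n, seed=0):
    """Merge n singletons into one tree by linking by rank, in random order."""
    rng = random.Random(seed)
    parent = [None] * n
    rank = [0] * n
    roots = list(range(n))
    while len(roots) > 1:
        x, y = rng.sample(roots, 2)
        if rank[x] > rank[y]:
            x, y = y, x              # y has the greater (or equal) rank
        parent[x] = y
        if rank[x] == rank[y]:
            rank[y] += 1
        roots.remove(x)              # x is no longer a root
    return parent, rank


def count_above(rank, s):
    """|X_{>s}|: the number of nodes whose rank exceeds s."""
    return sum(1 for r in rank if r > s)


n = 64
parent, rank = build_rank_forest(n)
holds = all(count_above(rank, s) * 2 ** (s + 1) <= n for s in range(max(rank)))
```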
Linking by Rank • Let f(m,n,r) be the maximum cost for a sequence of path compressions of length m on a rank forest with n nodes and maximum rank r. • We start with the trivial bound f(m,n,r) ≤ (r-1)·n < r·n (each node can get a new parent at most r-1 times). • Repeated use of the main lemma will give us better and better bounds on f. • Note that what we’re really looking for is f(m,n,⌊log n⌋).
Iteration I • Let C be any sequence of path compressions of length m in a rank forest with n nodes and maximum rank r. • Consider the dissection (X≤s,X>s) with s = log r. • Cost(Ct) ≤ f(|Ct|,|X>s|,r-s-1) ≤ (r-s-1)·|X>s| ≤ r·|X>s| ≤ r·|X|/2^{s+1} ≤ r·n/2^s = r·n/2^{log r} = r·n/r = n • Now apply the main lemma on (X≤s,X>s): • Cost(C) ≤ Cost(Cb) + Cost(Ct) + |Xb| + |Ct| ≤ Cost(Cb) + n + n + (|C| - |Cb|) = Cost(Cb) - |Cb| + 2n + |C|
Iteration I – cont’d • We showed: Cost(C) ≤ Cost(Cb) - |Cb| + 2n + |C| • We can bound Cost(Cb) the same way we did for Cost(Ct), using the trivial bound: • Cost(Cb) ≤ f(m,n,s) ≤ n·s = n·log r • This way we get Cost(C) ≤ n·log r + 2n + m. • Hence also f(m,n,r) ≤ n·log r + 2n + m. • Using the reduction lemma, we get a bound of O(m + n + f(m,n,log n)) = O(m + n·log log n). • But a better idea is to apply the first inequality to itself.
Iteration II • Cost(C) ≤ Cost(Cb) - |Cb| + 2n + |C| ≤ (Cost(Cbb) - |Cbb| + 2n + |Cb|) - |Cb| + 2n + |C| ≤ Cost(Cbb) - |Cbb| + 4n + |C| • Cbb is a compression sequence in a rank forest with maximum rank log s = log log r. • Bounding Cost(Cbb) using our trivial bound will now yield a bound of O(m + n·log log log n). • But why stop here?
Iteration #j • Now we have Cost(C) ≤ Cost(C’) - |C’| + 2·j·n + |C| • C’ is a compression sequence in a rank forest with maximum rank log^{(j)} r (the logarithm iterated j times). • For j = log* r we have log^{(j)} r ≤ 1, so C’ is a compression sequence in a rank forest with maximum rank at most 1. • In such a forest any compression sequence has cost 0, so it looks like our party is over: • Cost(C) ≤ 2n·log* r + |C|, yielding O(m + n·log* n).
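The iterated logarithm used here counts how many times the logarithm must be applied before the value drops to 1 or below. A minimal Python sketch, assuming base-2 logarithms and the hypothetical function name `log_star`:

```python
import math


def log_star(r):
    """Number of times log2 must be applied to r before the value is at most 1."""
    count = 0
    while r > 1:
        r = math.log2(r)
        count += 1
    return count


small = log_star(16)      # 16 -> 4 -> 2 -> 1
large = log_star(65536)   # 65536 -> 16 -> 4 -> 2 -> 1
```

The jump from 16 to 65536 raises the value by only one, which gives a feel for how slowly log* grows.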
There’s More… • We now know the bound f(m,n,r) ≤ 2n·log*r + m. • Using it, we can start all over again: • Apply main lemma on (X≤s,X>s) with s = log log*r. • Bound Cost(Ct) ≤ n + |Ct|. • Bound Cost(C) ≤ Cost(Cb) – 2|Cb| + 2n + 2|C|. • Apply the last inequality on itself (log log*)*(r) times, yielding Cost(C) ≤ 2n·(log log*)*(r) + 2|C|. • Now we have a new improved bound of f, and we can repeat the process over and over again.
The Shifting Lemma • Define g⟡ = (⌈log⌉∘g)*. • Let g(r) be a non-decreasing function with g(r) < r. • Suppose that f(m,n,r) ≤ k·m + 2n·g(r) for all m, n, r. • Then also f(m,n,r) ≤ (k+1)·m + 2n·g⟡(r) for all m, n, r. • Proof outline: • Use induction on r. • As we’ve done before, apply the main lemma on the dissection (X≤s,X>s) with s =⌈log g(r)⌉.
The “Ackermann Bound” • Define: • J_0(r) = ⌈(r – 1)/2⌉ • J_k(r) = J_{k-1}⟡(r) for k > 0 • For all m, n, r: • f(m,n,r) ≤ n·(r-1) ≤ 2n·J_0(r) = 0·m + 2n·J_0(r) • Thus, by the shifting lemma, for all m, n, r, k: • f(m,n,r) ≤ k·m + 2n·J_k(r) • Define α(m,n) = min { k | J_k(⌊log n⌋) ≤ 1 + m/n }. • Then f(m,n,⌊log n⌋) ≤ α(m,n)·m + 2n + 2m
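The functions J_k and α(m,n) can be evaluated directly for small inputs. The sketch below rests on an assumption the slides do not spell out: the star operator is read as an iteration count, h*(r) = 0 for r ≤ 1 and h*(r) = 1 + h*(h(r)) otherwise, in line with the use of log* earlier. The names `ceil_log2`, `J` and `alpha` are illustrative.

```python
from functools import lru_cache


def ceil_log2(v):
    """Ceiling of log2(v) for a positive integer v (0 for v <= 1)."""
    return 0 if v <= 1 else (v - 1).bit_length()


@lru_cache(maxsize=None)
def J(k, r):
    """J_0(r) = ceil((r-1)/2); J_k = (ceil_log2 o J_{k-1})*, with * read as an iteration count."""
    if k == 0:
        return max(r, 0) // 2        # equals ceil((r-1)/2) for integers r >= 1
    count = 0
    while r > 1:                     # assumed meaning of the star operator
        r = ceil_log2(J(k - 1, r))
        count += 1
    return count


def alpha(m, n):
    """min { k : J_k(floor(log2 n)) <= 1 + m/n }, compared in exact integer arithmetic."""
    r = n.bit_length() - 1           # floor(log2 n) for n >= 1
    k = 0
    while J(k, r) * n > n + m:
        k += 1
    return k


a_small = alpha(2 ** 16, 2 ** 16)    # n = m = 2^16, so r = 16
a_large = alpha(2 ** 66, 2 ** 66)    # n = m = 2^66, so r = 66
```

Under this reading, going from n = 2^16 to n = 2^66 (with m = n) increases α by one.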
The “Ackermann Bound” • So by the reduction lemma we finally get the desired: • Theorem: Performing a sequence of Union operations and m Find operations on a set of size n using linking by rank and path compression requires O(n + m·α(m,n)) time.
Concluding Remarks • We reached α(m,n) in a somewhat “natural” way. • We could have just proven the shifting lemma. • But going through the iterations improved our intuition of the “top-down method”. • It also made us better understand the inverse Ackermann function, and how slowly it really grows. • An analogous top-down proof for the α(m,n) bound also exists for path compression with linking by weight.