Recursion and Divide-and-Conquer Algorithms

CS 331, Fall 2013, Tandy Warnow

Algorithm Design: Recursion and Divide-and-Conquer • Recursion and Divide-and-Conquer are very similar (mainly differing in the first step) • Sorting algorithms: • Bubble-sort • Merge-sort

Bubble-Sort • Input: list of n integers, A1, A2, …, An • Output: sorted version, so that smallest entry is in the first position, and largest entry is in the last position. • Algorithm: • Make a left-to-right sweep, swapping adjacent elements if they are out of order. If nothing is out of order, return the list. • After the first sweep, the largest element is in the last position – so subsequent sweeps are only applied to the first n-1 elements. • In other words, after the first sweep, you recurse on a smaller list.
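The sweep-then-recurse description above can be sketched in Python as follows. This is a minimal illustration, not code from the course; the function name `bubble_sort` and the optional parameter `n` (the length of the prefix still to be sorted) are assumptions made for the example.

```python
def bubble_sort(items, n=None):
    """Sort `items` in place with recursive bubble sort and return it.

    `n` is the length of the unsorted prefix (hypothetical helper parameter).
    """
    if n is None:
        n = len(items)
    if n <= 1:
        return items
    swapped = False
    # One left-to-right sweep over the first n entries.
    for i in range(n - 1):
        if items[i] > items[i + 1]:
            items[i], items[i + 1] = items[i + 1], items[i]
            swapped = True
    if not swapped:
        # Nothing was out of order, so the list is sorted.
        return items
    # The largest of the first n entries is now at index n-1: recurse on the rest.
    return bubble_sort(items, n - 1)
```

Each call does one sweep of at most n-1 comparisons and then recurses on n-1 elements, which is where the recurrence T(n) = T(n-1) + C’n comes from.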

Analysis of running time • Let T(n) denote the time to run bubble-sort on a list of n elements. Count comparisons and input/output operations as all being unit cost. • T(1) = C (for some constant C) • T(n) = T(n-1) + C’n (for some constant C’) • Solving: T(n) = C + C’(2 + 3 + … + n) <= C + C’n^2, so T(n) is O(n^2)

Merge Sort • Same input, same output as Bubble-Sort • Step 1: Split list into two roughly equal parts • Step 2: Recursively sort each part • Step 3: Merge the two sorted lists together (repeatedly taking the smaller of the two entries at the top of the lists, until both lists are empty)
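A minimal Python sketch of the three steps (split, sort each half recursively, merge) is given below. The names `merge` and `merge_sort` are chosen for this illustration, and the version shown returns a new list rather than sorting in place.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller of the two front entries.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these two slices is non-empty.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Return a sorted copy of `items`."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2  # Step 1: split into two roughly equal parts.
    # Step 2: sort each part recursively; Step 3: merge the results.
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

The merge step touches each element once, which accounts for the linear term in the recurrence for the running time.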

Analysis of running time • Let T(n) denote the time to run MergeSort • Assume n = 2^k • T(1) = C • T(n) = 2T(n/2) + C’n (for some constant C’), n>1 • Solving for T(n) we obtain • T(n) = Cn + C’nk, where k = log2 n • i.e. T(n) is O(n log n)
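As a sanity check on the solution of this recurrence, the sketch below evaluates T(n) = 2T(n/2) + C’n directly and compares it with the closed form Cn + C’nk for n = 2^k. The function names and the default constants C = C’ = 1 are assumptions made for the illustration.

```python
def merge_sort_cost(n, c=1, c_prime=1):
    """Evaluate T(n) = 2 T(n/2) + C' n with T(1) = C, for n a power of two."""
    if n == 1:
        return c
    return 2 * merge_sort_cost(n // 2, c, c_prime) + c_prime * n


def closed_form(k, c=1, c_prime=1):
    """Closed form C n + C' n k evaluated at n = 2^k."""
    n = 2 ** k
    return c * n + c_prime * n * k


# The two agree for every power of two tried here.
for k in range(11):
    assert merge_sort_cost(2 ** k) == closed_form(k)
```

Unrolling the recurrence shows why: each of the k levels of recursion contributes C’n in total, and the n leaves of the recursion contribute C each.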

Observations • The main difference between these two algorithms is how free you are to pick the division into subproblems– both are recursive! • Note that MergeSort could have been used on a division into 3 sets instead of 2, or into log n sets instead of 2, etc. – and the algorithm would still work (but with a different running time). • However, BubbleSort, by design, recurses after you successfully move the largest element to the end of the array – not before.

Homework • Analyze the running time of MergeSort when each division is into three sets of roughly equal size • Analyze the running time of MergeSort when the division is into sqrt(n) sets of size sqrt(n)

Computing rooted trees from rooted subtrees • Input: set X of k rooted three-leaf trees (on set S of n leaves) • Output: tree T on the entire set S that agrees with the set X – if it exists. Example: {((a,b),c), ((b,c),d), ((d,e),f), ((e,f),g)} One solution: (g,(f,(e,(d,(c,(a,b))))))

Computing rooted trees from rooted subtrees • Input: set X of rooted three-leaf trees (on leafset S) • Output: tree T on the entire set S that agrees with the set X – if it exists. Example: {((a,b),c), ((b,c),d), ((d,e),f), ((e,f),g), (a,(e,d))} Solution: ((((a,b),c),((d,e),f)),g) agrees with all five trees. If X also contained ((a,c),b), which conflicts with ((a,b),c), no tree would exist.

Suppose X is compatible? • If X is compatible, then there is a tree T that agrees with X. Suppose there are j children of the root of T, and they have leaf sets A1, A2, …, Aj below them. • Consider the 3-leaf trees in X. They fall into three possible types: • Leaves from three different sets: (a,(b,c)) • Leaves from two different sets: (a,(b,b’)) or (a,(a’,b)) • Leaves from same set: (a,(a’,a’’))

Suppose X is compatible? • There are j children of the root of T, and they have leaf sets A1, A2, …, Aj below them. • The 3-leaf trees in X: • Leaves from three different sets, (a,(b,c)): impossible, because T induces the unresolved star on {a,b,c} • Leaves from two different sets • (a,(b,b’)): possible • (a,(a’,b)): impossible, because T induces ((a,a’),b) • Leaves from same set, (a,(a’,a’’)): possible

Bipartitions that are compatible with X • Definition: The bipartition of S into A and B is contradicted by X if some tree in X has the form (a,(a’,b)), with a and a’ on one side of the bipartition and b on the other. • Lemma 1: If set X is compatible, then there is a bipartition of S that is not contradicted by X.

Proof of Lemma 1 • Lemma 1: If the set X is compatible, then there exists a bipartition of S into non-empty sets A and B, so that no triplet (u,(v,w)) in X has v and w on opposite sides of the bipartition. • Proof: by contradiction. • Suppose the set X is compatible and T is a rooted tree that agrees with every tree in X. Let c1, c2, …, cj (j >= 2) be the children of the root of T, and let A1, A2, …, Aj be the leafsets below c1, c2, …, cj. • Let A = A1 and let B be the union of the remaining sets. • Now suppose there is a triplet (u,(a,b)) in X with a in A and b in B. The least common ancestor of a and b in T is the root, so T cannot induce (u,(a,b)) on {u,a,b}. Hence this triplet does not agree with the tree T, contradicting our hypothesis.

Bipartitions that are compatible with X Definitions: • The bipartition of S into A and B is contradicted by X if some tree in X has the form (a,(a’,b)), with a and a’ on one side and b on the other • X_A is the set of trees in X with all leaves in A • X_B is the set of trees in X with all leaves in B Lemma 2: If set X is compatible and bipartition A|B is not contradicted by X, then the sets X_A and X_B are compatible.

Bipartitions that are compatible with X Definitions: • The bipartition of S into A and B is contradicted by X if some tree in X has the form (a,(a’,b)), with a and a’ on one side and b on the other • X_A is the set of trees in X with all leaves in A • X_B is the set of trees in X with all leaves in B Lemma 3: If bipartition A|B is not contradicted by X and the sets X_A and X_B are both compatible, then set X is compatible.

Recursive algorithm • Given set X of k rooted three-leaf trees on leafset S: if S has one or two leaves, return the tree on S. Otherwise, find a bipartition (if one exists) that is not contradicted by X. • If no such bipartition exists, then return “Fail”. • Else let A|B be the bipartition that is not contradicted by X. • Recurse on A with the trees X_A (those with all leaves in A), constructing rooted binary tree T_A • Recurse on B with the trees X_B, constructing rooted binary tree T_B • If either fails, then return “Fail”. Else, return rooted binary tree T with trees T_A and T_B as children.

How to find the bipartition? • Given set X of k 3-leaf rooted trees, how can we find a bipartition A|B that is not contradicted by any tree in X? (Or determine that no such bipartition exists?) • Consider a 3-leaf tree, (u,(v,w)). Note that v and w must be in the same subtree off the root of any tree T that is compatible with X. Hence v and w must be in the same piece of the bipartition (both in A or both in B).

How to find the bipartition? • Algorithm: • Make a graph G=(V,E) with one vertex for every element of S, and an edge (s,s’) for every 3-leaf tree (u,(s,s’)) in X. • If the graph is connected, then there is no tree T that is compatible with X (because there is no bipartition that is not contradicted by X). • If the graph is not connected, then let A be one component of the graph, and let B = S-A be the union of the remaining components.
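The graph construction and the recursion can be put together in a short Python sketch. It assumes a representation chosen for this example only: a triplet (u,(v,w)) is stored as the tuple `(u, v, w)`, the output tree is a nested tuple, and `None` plays the role of “Fail”. The name `build_tree` is hypothetical.

```python
def build_tree(leaves, triplets):
    """Return a nested-tuple rooted tree agreeing with every triplet, or None.

    Each triplet (u, v, w) stands for the rooted tree (u,(v,w)); all of its
    leaves are assumed to lie in `leaves`, which is assumed non-empty.
    """
    leaves = list(leaves)
    if len(leaves) == 1:
        return leaves[0]
    if len(leaves) == 2:
        return (leaves[0], leaves[1])
    # Graph G: one vertex per leaf, an edge {v, w} for each triplet (u,(v,w)).
    adjacent = {s: set() for s in leaves}
    for _, v, w in triplets:
        adjacent[v].add(w)
        adjacent[w].add(v)
    # Side A is the connected component containing the first leaf.
    side_a = {leaves[0]}
    stack = [leaves[0]]
    while stack:
        s = stack.pop()
        for t in adjacent[s]:
            if t not in side_a:
                side_a.add(t)
                stack.append(t)
    if len(side_a) == len(leaves):
        return None  # G is connected: every bipartition is contradicted.
    side_b = {s for s in leaves if s not in side_a}
    # Keep only the triplets that lie entirely on one side.
    triplets_a = [t for t in triplets if all(x in side_a for x in t)]
    triplets_b = [t for t in triplets if all(x in side_b for x in t)]
    tree_a = build_tree([s for s in leaves if s in side_a], triplets_a)
    tree_b = build_tree([s for s in leaves if s in side_b], triplets_b)
    if tree_a is None or tree_b is None:
        return None
    return (tree_a, tree_b)
```

For example, `build_tree("abc", [("c", "a", "b")])` returns a tree in which a and b are siblings, while adding the conflicting triplet `("b", "a", "c")` makes the graph connected and the call returns `None`.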

Running time • Divide step: • Constructing graph: O(k+n) • Determining if the graph is connected: O(k+n) • After division, the subsets partition the taxa, and each triplet is passed to at most one side. The worst case is where one subset has a single leaf, and the other subset has n-1 leaves. • Combining the trees together takes O(1) time. • Hence the recurrence relation for the running time is: T(n) <= O(k+n) + T(n-1). Solving for T(n) we obtain that T(n) is O(n(k+n)), which is O(kn) when k >= n.

Triplet tree compatibility • Developed by Aho, Sagiv, Szymanski and Ullman in 1981 for a problem related to relational databases. • Has also been used for other problems • Most commonly used now in phylogenetic inference!

Quartet Tree Compatibility • Input: Set Y of unrooted quartet trees, each with leaves from set S. • Output: Tree T (if it exists) on leafset S that agrees with every tree in Y. • NP-complete!!!

Quartet Tree Compatibility • Special cases that can be solved in polynomial time: • All trees have leaf x: • Root all the trees at x, and apply algorithm for testing compatibility of rooted three-leaf trees. Check if result is compatible with remaining three-leaf trees. • O(kn) • Set Y has a tree on every four leaf subset of S

Quartet Tree Compatibility • Input: set Y containing a tree on every four leaves in S • Output: tree T that agrees with all the trees in Y, if it exists; otherwise “Fail” • This can be solved in O(n^5) time using the Aho, Sagiv, Szymanski and Ullman algorithm (here k = O(n^4), so O(kn) is O(n^5)). However, there is another way, too.

Naïve Quartet Method • Find a sibling pair a,b (that is a pair that is not contradicted by any quartet tree in Y) • Remove a, and recurse on quartet trees that do not contain leaf a. • If no tree is found, return “Fail”. • Otherwise, let T be the tree that is returned, and add leaf a to T, by making it a sibling of b.

Naïve Quartet Method • How to find siblings? • What is the running time?

Finding siblings • Let a and b be elements of S. • Then a and b cannot be siblings if any quartet tree splits them, i.e. places them on opposite sides: ac|bd. • If no quartet tree splits them, then you can consider them siblings!
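One way to carry out this test in a single pass over Y is sketched below. The representation is an assumption for the example: a quartet tree ab|cd is stored as the pair of pairs `((a, b), (c, d))`, and `find_sibling_pair` is a hypothetical name. The sketch first records every pair of leaves that some quartet splits, then returns any pair that was never recorded.

```python
from itertools import combinations


def find_sibling_pair(leaves, quartets):
    """Return a pair of leaves that no quartet tree splits, or None."""
    split_pairs = set()
    for (a, b), (c, d) in quartets:
        # The quartet ab|cd splits every pair with one leaf on each side.
        for x in (a, b):
            for y in (c, d):
                split_pairs.add(frozenset((x, y)))
    for x, y in combinations(leaves, 2):
        if frozenset((x, y)) not in split_pairs:
            return (x, y)
    return None
```

With k quartets on n leaves this takes O(k + n^2) time, which is O(n^4) when Y contains a tree on every four leaves.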

Running time • Finding a sibling pair takes O(k) = O(n^4) time • There are O(n) recursive calls, one per removed leaf • Hence the algorithm takes O(n^5) time, too.

Homework • Implement one of the two methods for Quartet Tree Compatibility, assuming input has a tree on every quartet. • Make sure that the code returns a tree that agrees with all the input quartet trees, if they are compatible.

Checking compatibility with a tree T • Input: set Y of quartet trees and a tree T • Output: YES if all trees in Y are compatible with T, and NO otherwise. • Obviously polynomial – O(kn) time will suffice (k=|Y| and n is the number of leaves in T) • Can we do this faster?

Approach • Preprocessing step so that each subsequent query (“Is quartet tree t compatible with T?”) can be answered in constant time. • If preprocessing takes p(n) time, then total time is O(p(n) + k) • How do we do the preprocessing?

LCA queries • Least Common Ancestor (LCA) • Input: Leaves x and y, tree T • LCA(x,y) is the node v that is a common ancestor of both x and y, and furthest from the root of T • Algorithm to compute LCA • Start at x and write down sequence of nodes on path from x to root • Do the same for y • Compare the paths – first node in common on both paths is the LCA of x and y • O(n) time

LCA queries • Input: tree T • Output: LCA matrix -- LCA[x,y] is the LCA of leaves x and y • Trivial to do in O(n^3) time • Also easy to do this in O(n^2) time.

Computing LCA matrix using DP • Start at leaves of tree T, and visit nodes of T only after visiting the children • Let L(v)={v} if v is a leaf in T • For each node v in T with children x and y • For all a in L(x) and b in L(y), set LCA(a,b):=v • Let L(v) be L(x) union L(y)

Amortized analysis • The LCA of each pair a, b of leaves is set only once, and so overall contributes only O(n^2) time. • The other expense of visiting a node v is constant time per node. • Hence, the total cost is O(n^2).
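A sketch of the bottom-up computation for leaves is given below. It assumes the tree is given as a dictionary `children` mapping each internal node to a list of its children (leaves are simply absent from the dictionary); the name `lca_matrix` and the use of a dictionary keyed by ordered pairs in place of a matrix are choices made for the example.

```python
def lca_matrix(children, root):
    """Return a dict with lca[(a, b)] = LCA of leaves a and b in the tree."""
    # Preorder listing; its reverse visits every node after its children.
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    lca = {}
    below = {}  # below[v] plays the role of L(v): the leaves under v
    for v in reversed(order):
        kids = children.get(v, [])
        if not kids:
            below[v] = [v]
            lca[(v, v)] = v
            continue
        merged = []
        for kid in kids:
            # Leaves under different children of v have v as their LCA.
            for a in merged:
                for b in below[kid]:
                    lca[(a, b)] = v
                    lca[(b, a)] = v
            merged.extend(below[kid])
        below[v] = merged
    return lca
```

Every unordered pair of distinct leaves is assigned exactly once, at the node where the two leaves first appear under different children, which is the fact the amortized analysis relies on.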

How to compute induced quartet tree? • Back to the problem – determining if a quartet tree t on leafset {a,b,c,d} is compatible with an unrooted tree T. • Solution: root the tree T, construct the tree on {a,b,c,d} induced by T, and compare it to t. • Can we use the LCA matrix to make this efficient? • Idea: include all nodes in T (not just leaves) for the LCA matrix.

Running time • Visiting a node v (after visiting its children x and y) costs • O(1) to compute L(v) • O(|L(x)| |L(y)|) time to set LCA(a,b) = v for all a in L(x) and b in L(y) • So, naively, O(n^2) to visit node v • This would suggest an O(n^3) time. • However, an amortized analysis gives a better bound.

Computing LCA matrix using DP • Start at leaves of tree T, and visit nodes of T only after visiting the children • Let L(v)={v} and LCA(v,v):=v if v is a leaf in T • For each node v in T with children x and y • For all a in L(x) and b in L(y), set LCA(a,b):=v • For all a in L(x) union L(y), set LCA(a,v):=v, and set LCA(v,v):=v • Set L(v) = L(x) union L(y) union {v} • Note that now the LCA matrix has 2n-1 rows and columns (a rooted binary tree with n leaves has 2n-1 nodes), and we can find LCA of any two vertices – not just the leaves.

How to compute induced 4-leaf tree? • Do the O(n^2) time preprocessing to compute the LCA matrix (for all pairs of vertices, not just leaves) • Then given {a,b,c,d} (leaves in the tree), compute the LCAs of all six pairs. When T is binary, the result produces 3 distinct LCAs, and at least one pair (a,b, without loss of generality) whose LCA is not obtained by any other pair. The unrooted tree induced on this set of 4 leaves is then ab|cd. • To find the rooted version of the induced tree takes only a little bit more analysis.
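The pair-with-a-unique-LCA rule can be sketched as below. The function takes any constant-time LCA lookup as a callable `lca(x, y)`; that interface, the name `induced_quartet`, and the encoding of ab|cd as a frozenset of two frozensets are assumptions made for this example. The sketch returns `None` when no pair has a unique LCA, which can happen only if T is not binary.

```python
from collections import Counter
from itertools import combinations


def induced_quartet(lca, a, b, c, d):
    """Return the unrooted quartet tree induced on {a,b,c,d}, or None.

    The quartet ab|cd is encoded as frozenset({frozenset({a,b}), frozenset({c,d})}).
    """
    # Count how many of the six leaf pairs share each LCA.
    counts = Counter(lca(x, y) for x, y in combinations((a, b, c, d), 2))
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    for left, right in pairings:
        # A pair whose LCA is obtained by no other pair is separated
        # from the other two leaves by an edge of the tree.
        if counts[lca(*left)] == 1 or counts[lca(*right)] == 1:
            return frozenset((frozenset(left), frozenset(right)))
    return None  # the induced tree is unresolved
```

Each call performs a constant number of LCA lookups, so after the preprocessing every quartet query takes constant time.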

Putting this together • Given unrooted tree T and set Y of weighted quartet trees, to compute the total weight of the compatible quartet trees of Y, DO: • Root T arbitrarily (on some edge), and compute the LCA matrix (for all nodes in T’, the rooted version of T). • Initialize compat-weight = 0. • For every quartet tree t in Y, compute the induced quartet tree in T, and check if it agrees with t (as unrooted trees). If so, add weight(t) to compat-weight. • Return compat-weight. • Running time: O(n^2 + k), where k is the number of quartet trees.

Improvements • There are better LCA query algorithms! Less preprocessing time (linear instead of quadratic) and still constant time for the queries. These use more sophisticated techniques – and we’ll discuss them later. • Hence the running time of this algorithm could be reduced to O(n+k). • Phylogenetic estimation from quartet trees is pretty popular. Unfortunately, it’s just about never the case that the quartet trees are compatible! And finding the maximum weight subset of compatible quartets is NP-hard (even if all quartets have unit weight).

Homework • Programming: Given tree T and set Y of weighted quartet trees, compute the total weight of the quartet trees in Y that are compatible with T. • Theory: Show how to calculate the rooted subtree induced by T on {a,b,c} (for any three nodes a,b,c – not just leaves – in the tree).
