Modified Mincut Supertrees Roderic Page University of Glasgow
Tree of Life About 1.7 million species described. What we have so far: • TreeBASE database (15,000 taxa) • Ribosomal Database Project (RDP II) (20,000 sequences) • The Tree of Life Project (11,000 taxa)
Recent interest in the Tree of Life • NSF sponsored “Tree of Life” workshops (2000-2001) • $US 10 million “to construct a phylogeny for the 1.7 million described species of Life” announced February 15th 2002 • Assembling the Tree of Life: Science, Relevance, and Challenges, AMNH, New York, May 2002 • European initiative (ATOL) under FP6
Problem: how to build the tree of life Solutions: • Find one or more “magic markers” that will allow us to recover the whole tree in one go (problems: combinability and complexity) • Assemble big tree from many smaller trees derived from many kinds of data (supertrees)
Tree terminology [Figure: rooted tree on leaves a, b, c, d, with labels marking a leaf, an edge, an internal node, the clusters {a,b} and {a,b,c}, and the root cluster {a,b,c,d}]
Nestings and triplets [Figure: rooted tree on leaves a, b, c, d] • Nestings: {a,b} <T {a,b,c,d} and {b,c} <T {a,b,c,d} • Triplets: (bc)d, also written bc|d
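A minimal sketch of how the triplets displayed by a rooted tree can be listed. It assumes the tree on the slide is (((a,b),c),d), which is inferred from the clusters {a,b}, {a,b,c}, {a,b,c,d}; the nested-tuple encoding and the function names `leaf_set` and `triplets` are illustrative choices, not part of the talk.

```python
from itertools import combinations

def leaf_set(tree):
    """Leaves of a tree written as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def triplets(tree):
    """All rooted triplets xy|z displayed by the tree, as (frozenset({x, y}), z)."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        child_sets = [leaf_set(child) for child in node]
        below = frozenset().union(*child_sets)
        # x and y in different children of this node have it as their MRCA;
        # any leaf z outside this node's cluster then gives the triplet xy|z.
        for left, right in combinations(child_sets, 2):
            for x in left:
                for y in right:
                    for z in all_leaves - below:
                        found.add((frozenset((x, y)), z))
        for child in node:
            visit(child)

    visit(tree)
    return found

# Assumed topology of the slide's example tree.
example = ((("a", "b"), "c"), "d")
for pair, outgroup in sorted(triplets(example), key=lambda t: (sorted(t[0]), t[1])):
    print("".join(sorted(pair)) + "|" + outgroup)
```

Under that assumed topology the output includes bc|d, the triplet shown on the slide.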
Supertree [Figure: T1 (leaves a, b, c) + T2 (leaves b, c, d) = supertree (leaves a, b, c, d)]
Some desirable properties of a supertree method (Steel et al., 2000) • The supertree can be computed in polynomial time • A grouping in one or more trees that is not contradicted by any other tree occurs in the supertree
MRP (Matrix Representation Parsimony)
Taxon              1  2  3
Homo sapiens       1  1  1
Pan paniscus       1  1  1
Gorilla gorilla    1  1  0
Pongo pygmaeus     1  0  0
Hylobates          0  0  0
[Figure: tree of the five taxa with internal nodes numbered 1, 2, 3, one matrix column per node]
• NP-hard • Can generate many solutions
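The matrix above can be derived mechanically from a tree. The sketch below uses the usual MRP coding (one column per non-root internal node, 1 for taxa in that node's cluster, 0 for other taxa in the same tree, ? for taxa missing from the tree). The tree topology is inferred from the matrix on the slide, and the nested-tuple encoding and function names are assumptions made for illustration.

```python
def leaf_set(tree):
    """Leaves of a tree written as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def proper_clusters(tree):
    """Clusters of all internal nodes except the root, in preorder."""
    out = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            out.append(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return out

def mrp_matrix(trees):
    """Map each taxon to its row of '1', '0', '?' characters."""
    taxa = sorted(set().union(*(leaf_set(t) for t in trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in trees:
        present = leaf_set(tree)
        for cluster in proper_clusters(tree):
            for taxon in taxa:
                if taxon in cluster:
                    rows[taxon].append("1")
                elif taxon in present:
                    rows[taxon].append("0")
                else:
                    rows[taxon].append("?")  # taxon absent from this tree
    return rows

# Topology inferred from the slide's matrix.
apes = (((("Homo sapiens", "Pan paniscus"), "Gorilla gorilla"),
         "Pongo pygmaeus"), "Hylobates")
for taxon, row in mrp_matrix([apes]).items():
    print(f"{taxon:<18}" + "  ".join(row))
```

With several input trees the columns are simply concatenated, and the combined matrix is then analysed with parsimony, which is the NP-hard step.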
Aho et al.’s algorithm (OneTree) Aho, A. V., Sagiv, Y., Szymanski, T. G., and Ullman, J. D. 1981. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10: 405-421. Input: set of rooted trees 1. If set is compatible (i.e., will agree on a tree), output that tree. 2. If set is not compatible, stop!
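The two steps above can be sketched as a short recursive routine. To keep it small, this version takes rooted triplets (x, y, z), meaning xy|z, rather than whole trees; each input tree can be replaced by the triplets it displays. The example triplets ab|c and bc|d assume T1 = ((a,b),c) and T2 = ((b,c),d), which the slides do not state explicitly, and the names `components` and `one_tree` are illustrative.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, as a list of sets."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for x, y in edges:
        parent[find(x)] = find(y)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def one_tree(leaves, triplets):
    """Return a Newick-like string for a tree displaying all triplets, or None."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only triplets whose three leaves are all still in play.
    live = [(x, y, z) for x, y, z in triplets if {x, y, z} <= leaves]
    # Join x and y for every live triplet xy|z.
    parts = components(leaves, [(x, y) for x, y, _ in live])
    if len(parts) == 1:
        return None  # graph is connected: the input is not compatible, stop
    subtrees = []
    for part in parts:
        sub = one_tree(part, live)
        if sub is None:
            return None
        subtrees.append(sub)
    return "(" + ",".join(sorted(subtrees)) + ")"

print(one_tree("abcd", [("a", "b", "c"), ("b", "c", "d")]))
print(one_tree("abc", [("a", "b", "c"), ("b", "c", "a")]))
```

The second call returns None because ab|c and bc|a cannot both hold in one tree, which is the "stop!" case that mincut supertrees are designed to get past.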
Aho et al.’s OneTree algorithm [Figure: worked example with input trees T1 (leaves a, b, c) and T2 (leaves b, c, d), the intermediate graphs, and the resulting supertree on leaves a, b, c, d]
Mincut supertrees Semple, C., and Steel, M. 2000. A supertree method for rooted trees. Discrete Appl. Math. 105: 147-158. • Modifies OneTree by cutting graph • Requires rooted trees (no analogue of OneTree for unrooted trees) • Recursive • Polynomial time
Semple and Steel (2000) [Figure: input trees T1 (leaves a, b, c, d, e) and T2 (leaves a, b, c, d), and the graph S{T1,T2} on nodes a, b, c, d, e]
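Collapsing the graph (Semple and Steel mincut algorithm) [Figure: S{T1,T2} with edge weights; the edge between a and b has the maximum weight (2) and every other edge has weight 1. Collapsing the maximum-weight edges merges a and b into a single node {a,b}, giving S{T1,T2}/Emax{T1,T2}]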
Cut the graph to get supertree [Figure: S{T1,T2}/Emax{T1,T2} with nodes {a,b}, c, d, e and edges of weight 1, next to the resulting supertree on leaves a, b, c, d, e]
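One level of the procedure (build the graph, collapse the maximum-weight edges, delete every edge that lies in some minimum cut, read off the components) might look like the sketch below. The two input trees are hypothetical, since the slide's topologies cannot be recovered from the text, and the minimum cuts are found by brute force over all bipartitions, which is only sensible for a handful of leaves and is not how the actual implementation works. All function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def leaf_set(tree):
    """Leaves of a tree written as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def components(nodes, edges):
    """Connected components of an undirected graph, as a list of sets."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for x, y in edges:
        parent[find(x)] = find(y)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def proximity_graph(trees):
    """Weight of {x, y} = number of trees in which x and y share a proper cluster."""
    weights = Counter()
    for tree in trees:
        for child in tree:  # each child of the root is a maximal proper cluster
            for x, y in combinations(sorted(leaf_set(child)), 2):
                weights[frozenset((x, y))] += 1
    return weights

def collapse_max(leaves, weights, k):
    """Contract edges of weight k; parallel edges are merged by summing weights."""
    merged = components(leaves, [tuple(e) for e, w in weights.items() if w == k])
    label = {leaf: frozenset(group) for group in merged for leaf in group}
    collapsed = Counter()
    for edge, w in weights.items():
        x, y = edge
        if label[x] != label[y]:
            collapsed[frozenset((label[x], label[y]))] += w
    return set(label.values()), collapsed

def root_clusters(trees):
    """Leaf sets below the root of the mincut supertree (one recursion level only)."""
    leaves = set().union(*(leaf_set(t) for t in trees))
    nodes, weights = collapse_max(leaves, proximity_graph(trees), len(trees))
    ordered = sorted(nodes, key=sorted)
    first, rest = ordered[0], ordered[1:]
    best, in_min_cut = None, set()
    for size in range(len(rest)):  # every bipartition, with `first` on the left
        for side in combinations(rest, size):
            left = {first, *side}
            crossing = {e for e in weights if len(e & left) == 1}
            total = sum(weights[e] for e in crossing)
            if best is None or total < best:
                best, in_min_cut = total, set(crossing)
            elif total == best:
                in_min_cut |= crossing
    kept = [tuple(e) for e in weights if e not in in_min_cut]
    return {frozenset().union(*part) for part in components(nodes, kept)}

# Hypothetical input trees, not taken from the slides.
t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = (("a", "b"), ("c", "d"))
print(sorted(sorted(cluster) for cluster in root_clusters([t1, t2])))
```

The full algorithm would then restrict the input trees to each component and recurse; that step is omitted here.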
My mincut supertree implementation darwin.zoology.gla.ac.uk/~rpage/supertree • Written in C++ • Uses GTL (Graph Template Library) to handle graphs (formerly a free alternative to LEDA) • Finds all mincuts of a graph faster than Semple and Steel’s algorithm
A counter example: two input trees... [Figure: two input trees that disagree on the relationships among a, b, and c; one also contains x1, x2, x3, the other y1, y2, y3, y4]
Mincut gives this (strange) result • Disputed relationships among a, b, and c are resolved • x1, x2, and x3 collapsed into polytomy [Figure: mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]
Problem: Cuts depend on connectivity (in this example it is a function of tree size) [Figure: graph S{T1,T2} on nodes a, b, c, x1, x2, x3, y1, y2, y3, y4]
So, mincut doesn’t work • But, Semple and Steel said it did • My program seems to work • Argh!!! What is happening….?
What mincut does… …and does not do • Mincut supertree is guaranteed to include any nesting which occurs in all input trees • Makes no claims about nestings which occur in only some of the trees • “Does exactly what it says on the tin™”
Modifying mincut supertree • Can we incorporate more of the information in the input trees? • Three categories of information • Unanimous (all trees have that grouping) • Contradicted (trees explicitly disagree) • Uncontradicted (some trees have information that no other tree disagrees with)
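Uncontradicted information (assume we have k input trees) • c = number of trees in which a and b co-occur • n = number of trees in which a and b are nested • c - n = 0: uncontradicted (if c = k then unanimous) • c - n > 0: contradicted [Figure: example trees for co-occurrence and nesting of a and b]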
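Uncontradicted information (assume we have k input trees) • c = number of trees in which a and b co-occur • n = number of trees in which a and b are nested • f = number of trees in which a and b are in a fan • c - n - f = 0: uncontradicted (if c = k then unanimous) • c - n - f > 0: contradicted [Figure: example trees for a fan, co-occurrence, and nesting of a and b]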
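The counting rule can be tried out on a few small trees. In this sketch "nested" means a and b share a proper cluster (they sit below the same child of the root), and "in a fan" is taken to mean that they sit below different children of an unresolved root (more than two children); that reading of "fan" is an assumption, as are the example trees and the function names. Only pairs nested in at least one tree are reported, since only those form edges of the graph.

```python
from itertools import combinations

def leaf_set(tree):
    """Leaves of a tree written as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def classify_pairs(trees):
    """Label each edge {a, b} as unanimous, uncontradicted, or contradicted."""
    k = len(trees)
    counts = {}  # (a, b) -> [c, n, f]
    for tree in trees:
        child_sets = [leaf_set(child) for child in tree]
        root_is_fan = len(child_sets) > 2  # assumed meaning of "fan"
        owner = {leaf: i for i, s in enumerate(child_sets) for leaf in s}
        for a, b in combinations(sorted(owner), 2):
            tally = counts.setdefault((a, b), [0, 0, 0])
            tally[0] += 1                   # c: a and b co-occur
            if owner[a] == owner[b]:
                tally[1] += 1               # n: a and b are nested
            elif root_is_fan:
                tally[2] += 1               # f: a and b are in a fan
    labels = {}
    for pair, (c, n, f) in counts.items():
        if n == 0:
            continue  # never nested, so no edge to classify
        if c - n - f > 0:
            labels[pair] = "contradicted"
        elif c == k:
            labels[pair] = "unanimous"
        else:
            labels[pair] = "uncontradicted"
    return labels

# Hypothetical input trees, not taken from the slides.
trees = [
    (("a", "b"), ("x", "y")),
    (("a", "b"), "x", "c"),        # unresolved root: a fan
    (("a", "b", "y"), ("x", "c")),
]
for pair, label in sorted(classify_pairs(trees).items()):
    print(pair, label)
```

Here the pair (c, x) is nested in the third tree and only in a fan in the second, so it counts as uncontradicted, while (x, y) is nested in the first tree and separated by a resolved root in the third, so it is contradicted.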
Classifying edges [Figure: graph S{T1,T2} on nodes a, b, c, x1, x2, x3, y1, y2, y3, y4, with each edge marked as uncontradicted, uncontradicted but adjacent to contradicted, or contradicted]
Modified mincut • Species a, b, and c form a polytomy • x1, x2, and x3 resolved as per the input tree [Figure: modified mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]
If no tree contradicts an item of information, is that information always in the supertree? [Figure: four input trees on leaves 1, 2, 3, 4, 5, displaying the triplets (23)5, (12)5, (45)1, and (34)1]
No! (Steel, Dress, & Böcker 2000) • The four trees display (12)5, (23)5, (34)1, and (45)1 • No tree displays (IK)J or (JK)I for any (IJ)K above • Triplets are uncontradicted, but cannot form a tree [Figure: leaves 1, 2, 3, 4, 5]
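Both claims can be checked on the four triplets alone. The sketch below works only with the triplets (the four trees themselves are not recoverable from the slide text), writes (IJ)K as the tuple (I, J, K), and uses the connectivity test from Aho et al.'s algorithm: if joining I and J for every triplet gives a connected graph on all the leaves, no tree displays them all. The function names are illustrative.

```python
def connected(nodes, edges):
    """True if the undirected graph (nodes, edges) has a single component."""
    nodes = set(nodes)
    adjacent = {v: set() for v in nodes}
    for x, y in edges:
        adjacent[x].add(y)
        adjacent[y].add(x)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for w in adjacent[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return seen == nodes

def directly_contradicted(triplets):
    """Triplets (x, y, z) for which the list also holds (xz)y or (yz)x."""
    resolved = {(frozenset((x, y)), z) for x, y, z in triplets}
    return [
        (x, y, z)
        for x, y, z in triplets
        if (frozenset((x, z)), y) in resolved or (frozenset((y, z)), x) in resolved
    ]

triplets = [(1, 2, 5), (2, 3, 5), (3, 4, 1), (4, 5, 1)]
leaves = {1, 2, 3, 4, 5}

# No triplet is opposed by another resolution of the same three leaves ...
print(directly_contradicted(triplets))
# ... yet the graph 1-2-3-4-5 is connected, so no single tree displays them all.
print(connected(leaves, [(x, y) for x, y, _ in triplets]))
```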
Future directions • Improve handling of uncontradicted information • Add support for constraints • Visualising very big trees • Better integration into phylogeny databases (www.treebase.org) darwin.zoology.gla.ac.uk/~rpage/supertree