Species Trees & Constraint Programming
Ongoing work with Ian Gent, Barbara Smith, Wu Wei (Christine)
The Tree of Life • A central goal of systematics • construct the tree of life • a tree that represents the relationships between all living things • including constraint programmers • The leaf nodes of the tree are species • The interior nodes are hypothesized species • extinct, where species diverged
Properties of a Species Tree • We have a set of leaf nodes, each labelled with a species • the interior nodes have no labels • each interior node has 2 children and one parent • except the root (it has no parent) • if we have n leaf nodes we then have n - 1 interior nodes • it is a bifurcating tree
Super Trees • We are given two trees, T1 and T2 • T1 has leaf set S1 and T2 has leaf set S2 • remember, leaves are species! • But S1 and S2 have a non-empty intersection • why? How can that happen? • We want to combine T1 and T2 • so, why is that a problem?
Most Recent Common Ancestors (mrca) [Figure: tree with leaves a, b, c] We have 3 species, a, b, and c • mrca(a,b) > mrca(a,c) • mrca(a,b) > mrca(b,c) • mrca(a,c) = mrca(b,c) • here x > y means x is further from the root than y • Species a and b are more closely related to each other than they are to c • The most recent common ancestor of a and b is further from the root than the most recent common ancestor of a and c (and of b and c)
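The "further from the root" comparison can be made concrete with a small sketch. It assumes a representation the slides do not specify: a bifurcating tree as nested Python tuples with species names as strings. The helper names `leaves` and `mrca_depth` are illustrative only.

```python
def leaves(tree):
    """Species names under a node; a leaf is a plain string (assumed representation)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def mrca_depth(tree, x, y, depth=0):
    """Distance from the root to the most recent common ancestor of two distinct leaves."""
    for child in tree:
        if not isinstance(child, str):
            inside = leaves(child)
            if x in inside and y in inside:
                # Both species sit below this child, so the mrca is deeper.
                return mrca_depth(child, x, y, depth + 1)
    return depth

tree = (("a", "b"), "c")  # a and b are more closely related than either is to c
print(mrca_depth(tree, "a", "b"), mrca_depth(tree, "a", "c"), mrca_depth(tree, "b", "c"))
```

For this tree, mrca(a,b) lies one step below the root, while mrca(a,c) and mrca(b,c) are both the root itself.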
Triples (and Fans) • Species trees are frequently presented as a set of triples (and fans) [Figure: example trees over the species a, b, c, d]
BreakUp & OneTree (circa 1996) • Algorithm BreakUp takes a species tree and produces a set of rooted triples R that define that tree • Algorithm OneTree takes a set of species and a set of rooted triples and, in polynomial time, either builds a tree that respects those triples or reports that no such tree exists • OneTree is a specialisation of Build, an algorithm proposed by Aho, Sagiv, Szymanski, and Ullman in 1981
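To show what "a set of rooted triples that define a tree" looks like, here is a brute-force stand-in. It is not the BreakUp algorithm itself: it simply lists every triple the tree induces, which is generally more than BreakUp would need to emit. The nested-tuple tree representation and the name `induced_triples` are assumptions made for this sketch; a triple xy|z is written as the tuple (x, y, z).

```python
from itertools import combinations

def leaves(tree):
    """Species names under a node; a leaf is a plain string (assumed representation)."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def induced_triples(tree):
    """All rooted triples (x, y, z), meaning xy|z, displayed by a bifurcating tree."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    found = induced_triples(left) | induced_triples(right)
    # Two species on one side of this node are closer to each other
    # than either is to any species on the other side.
    for near, far in ((left, right), (right, left)):
        for x, y in combinations(sorted(leaves(near)), 2):
            for z in leaves(far):
                found.add((x, y, z))
    return found

print(sorted(induced_triples((("a", "b"), ("c", "d")))))
```

A bifurcating tree on n leaves induces one triple for every 3-subset of its leaves, so the output grows cubically; that is one reason a real BreakUp would aim for a smaller defining set.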
The Flavour of OneTree • Given a set of species S and rooted triples R • produce a node N • construct a graph G • with vertices in S • and edge (x,y) if triple xy|z is in R • if G is a single component fail • else recursively build • on the left with one component • with S’ and R’ (the set of species and triples in that component) • on the right, with the other components
The Flavour of OneTree [Figure: worked example on the species a, b, c, d]
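The recursive idea on the previous slide can be sketched in a few lines. This is a minimal illustration in the spirit of Build/OneTree, not the authors' implementation: the nested-tuple output, the union-find helper, and the name `one_tree` are all assumptions. A triple xy|z is the tuple (x, y, z), and `None` signals that no tree exists.

```python
def one_tree(species, triples):
    """Build a bifurcating tree respecting the triples, or return None."""
    species = sorted(species)
    if len(species) == 1:
        return species[0]
    if len(species) == 2:
        return (species[0], species[1])
    # Keep only the triples whose three species are all still present.
    live = [t for t in triples if all(s in species for s in t)]
    # Union-find over the species: join x and y for every live triple xy|z.
    parent = {s: s for s in species}
    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s
    for x, y, _ in live:
        parent[find(x)] = find(y)
    groups = {}
    for s in species:
        groups.setdefault(find(s), []).append(s)
    components = sorted(groups.values())
    if len(components) == 1:
        return None  # G is a single component: fail
    # Left subtree from one component, right subtree from all the others.
    left = one_tree(components[0], live)
    right = one_tree([s for c in components[1:] for s in c], live)
    if left is None or right is None:
        return None
    return (left, right)

print(one_tree("abcd", [("a", "b", "c"), ("a", "b", "d"), ("c", "d", "a")]))
print(one_tree("abc", [("a", "b", "c"), ("b", "c", "a")]))
```

Grouping all remaining components on the right keeps the result bifurcating; the original Build algorithm instead gives each component its own child, which can produce nodes with more than two children.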
Min-cut Super Trees • What happens if OneTree fails? • Min-cut gives us the best tree it can • by breaking some triples (resulting in fans) • by excluding some species • There are polytime algorithms for this • but they are greedy and biased
Constraint Programming solutions to building a species tree from a set of rooted triples
A naïve constraint encoding • n-1 variables as interior nodes • v[i] = j means parent(v[i]) = v[j] • no loops/cycles • Barbara used set variables (ILOG) • Patrick used a specialised constraint (Choco) • Francois then encoded set variables! • n variables as leaf nodes • each takes a value respecting triples • I am sparing you (and me) the details
Why was this a naïve constraint encoding? • It produced the right number of trees when no triples • the Catalan number • symmetry breaking • It would produce a tree if one existed • A 2 stage process • (1) build a tree from the interior nodes • there are Catalan many of these • (2) given an “interior tree” place the leaf nodes • there are n! ways to do this • if step (2) fails generate the next interior tree in (1) Yikes! That’s expensive. Imagine {ab|c,bc|d,cd|a}
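A rough feel for "Yikes!" comes from multiplying the two stages. The sketch below assumes that "Catalan many" interior trees means the Catalan number C(n-1) for n-1 interior nodes; the slides do not state the index, so treat that as an assumption. The function names are illustrative.

```python
from math import comb, factorial

def catalan(k):
    """The k-th Catalan number: 1, 1, 2, 5, 14, ..."""
    return comb(2 * k, k) // (k + 1)

def naive_candidates(n):
    """Worst-case (interior tree, leaf placement) pairs for n species,
    assuming C(n-1) interior trees and n! leaf placements per tree."""
    return catalan(n - 1) * factorial(n)

for n in (4, 6, 8, 10):
    print(n, naive_candidates(n))
```

For n = 4 this is 5 × 24 = 120 candidates, all of which may be visited before concluding that an inconsistent set such as {ab|c, bc|d, cd|a} has no tree.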
Ultrametric Trees & Species Trees • What is an ultrametric tree? • We are given a 2d symmetric matrix D • D[i,j] is the time of divergence of species i and j • D[i,j] labels mrca(i,j) with the time of divergence • D[i,j] is the value of mrca(i,j) • Build a bifurcating tree • n leaves and n - 1 interior nodes • interior nodes labeled with entries from D • the labels along any path from the root form a strictly decreasing sequence
Ultrametric Trees: here’s one I (well, Dan Gusfield actually) prepared earlier [Figure: ultrametric tree with interior node labels 8, 5, 3, 3 and leaves A, B, C, D, E] Note: if the sequence increases, we have a min-ultrametric tree
Ultrametric Matrix: necessary & sufficient conditions • cannot have more than n - 1 distinct values • because there are n - 1 interior nodes • For every 3 indices i,j,k • there is a tie for the maximum between D[i,j], D[i,k], D[j,k] • Given an ultrametric matrix, an ultrametric tree can be constructed in O(n^2) time … see Dan Gusfield’s book “Algorithms on Strings, Trees, and Sequences”
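The three-index condition is easy to check directly. The sketch below is a plain cubic-time test, not the O(n^2) tree construction from Gusfield's book; it assumes D is a list of lists, and the name `is_ultrametric` and the two small matrices are made up for illustration.

```python
from itertools import combinations

def is_ultrametric(D):
    """True if D is symmetric and every triple of indices ties for the maximum."""
    n = len(D)
    if any(D[i][j] != D[j][i] for i, j in combinations(range(n), 2)):
        return False
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if middle != largest:
            return False  # the maximum is not tied
    return True

# Species 0 and 1 diverged at time 3; both diverged from species 2 at time 5.
good = [[0, 3, 5],
        [3, 0, 5],
        [5, 5, 0]]
bad = [[0, 3, 5],
       [3, 0, 4],
       [5, 4, 0]]
print(is_ultrametric(good), is_ultrametric(bad))
```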
A CP encoding of D • We have a 2 dimensional matrix of constrained integer variables D • We must ensure that for any i,j,k the following holds: two of D[i,j], D[i,k], D[j,k] are equal and the third is no smaller (a tie for the minimum) • Think isosceles triangles, allowing equilateral • An ultrametric space, composed of isosceles triangles
A CP encoding of D Any instantiation of the variables in D is now guaranteed to be min-ultrametric We get Catalan number of min-ultrametric solutions
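One way to write the min-ultrametric requirement as a disjunction over every three indices is sketched below. This is a reading consistent with "isosceles triangles, allowing equilateral" and with a tie for the minimum, not necessarily the exact form posted in the authors' model; the subscript notation D_{ij} for D[i,j] is an assumption of this sketch.

```latex
\[
\forall\, i < j < k:\quad
\bigl(D_{ij} > D_{ik} = D_{jk}\bigr)
\;\lor\;
\bigl(D_{ik} > D_{ij} = D_{jk}\bigr)
\;\lor\;
\bigl(D_{jk} > D_{ij} = D_{ik}\bigr)
\;\lor\;
\bigl(D_{ij} = D_{ik} = D_{jk}\bigr)
\]
```

The first three disjuncts are the three possible triples on i, j, k (the isosceles cases); the last is the fan (the equilateral case).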
How can we exploit this? [Figure: the triple ij|k] • We are given triples and fans, but not distances! • But we can consider a triple ij|k as a constraint: D[i,j] > D[i,k] = D[j,k] • This overrides the disjunctions posted across the matrix • Note: our tree is min-ultrametric!
The CP encoding (contd) • we have the “blanket” disjunctive constraint to ensure min-ultrametric • triples are constraints that break the disjunctions • a solution (if one exists) is min-ultrametric respecting triples • we can then produce a tree from the matrix, as a post-process • NOTE: we need a pre-process to break up trees into triples
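The whole pipeline can be imitated on tiny instances with brute force standing in for a constraint solver. Everything here is an assumption made for illustration: species are numbered 0..n-1, entries of D range over 1..n-1, a triple ij|k is the tuple (i, j, k) read as D[i][j] > D[i][k] = D[j][k], and `solve` and `tree_from_matrix` are hypothetical names, not part of any solver's API.

```python
from itertools import combinations, product

def tie_for_min(a, b, c):
    """The blanket condition: the two smallest of the three values are equal."""
    low, mid, _ = sorted((a, b, c))
    return low == mid

def solve(n, triples):
    """Exhaustive stand-in for a CP solver; returns a matrix D or None."""
    pairs = list(combinations(range(n), 2))
    for values in product(range(1, n), repeat=len(pairs)):
        D = [[0] * n for _ in range(n)]
        for (i, j), v in zip(pairs, values):
            D[i][j] = D[j][i] = v
        if not all(tie_for_min(D[i][j], D[i][k], D[j][k])
                   for i, j, k in combinations(range(n), 3)):
            continue
        # Chained comparison: D[i][j] > D[i][k] and D[i][k] == D[j][k].
        if all(D[i][j] > D[i][k] == D[j][k] for i, j, k in triples):
            return D
    return None

def tree_from_matrix(D, leaves=None):
    """Post-process: split on the smallest entry, then recurse on each group."""
    if leaves is None:
        leaves = list(range(len(D)))
    if len(leaves) == 1:
        return leaves[0]
    low = min(D[i][j] for i, j in combinations(leaves, 2))
    groups = []
    for x in leaves:
        for g in groups:
            if D[x][g[0]] > low:  # x diverged from this group below the current node
                g.append(x)
                break
        else:
            groups.append([x])
    return tuple(tree_from_matrix(D, g) for g in groups)

D = solve(3, [(0, 1, 2)])
print(D, tree_from_matrix(D))
```

Because the blanket condition allows all three entries to be equal, the post-process can return a node with more than two children, which corresponds to a fan. The search is exponential in the number of matrix entries, so this is only a way to see the encoding at work, not a practical method.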
So where are we? • Good question: • we have not yet tried real data • we have a number of different micro-encodings • Are we in P for decision? • Not sure yet • How about optimisation? • We can see a way, by introducing penalties • Wu Wei is coding up BreakUp and OneTree • so we have something real to compare with • We need real data to check this out • I need to get funding for this • write a grant proposal with DRG I think!