15-853:Algorithms in the Real World • Computational Biology IV • Phylogenetic Trees 15-853
Phylogenetics
• The study of genetic connections and relationships among species.
• Classically, it was based on physical or morphological features (e.g. size, eye color, hoof type, …).
• It is now based on DNA and protein sequencing.
• The goal is to find the “most likely” evolutionary connection among species or individuals, and possibly the time at which they diverged.
• E.g., mitochondrial DNA has been used to trace humans back to a single female ancestor from Africa (“African Eve”).
Phylogenetic Trees
• Phylogenetic relationships are typically represented as a tree.
[Figure: example tree with leaves dog, cat and lynx]
• Typically leaves are current species and internal nodes represent hypothetical evolutionary ancestors.
• Edge lengths can indicate evolutionary or genetic distance.
• Trees can be rooted or unrooted. In this lecture we will assume rooted binary trees.
Perfect Genetic Trees
• The “molecular clock theory” (Zuckerkandl and Pauling, 1962) assumes that there is an evolutionary “clock” that determines the rate of “accepted” mutations. The distance (weight) on edges then represents time on the clock.
• A perfect or ultrametric tree is one in which the time from the root (a common ancestor) to all leaves (current species) is equal.
[Figure: example ultrametric tree with edge weights]
Scoring/Costing a Tree
• Three main models:
  • Parsimony
  • Distance Matrix
  • Maximum likelihood
• They can give different results, and there are different opinions on which is best, or even whether a tree is adequate at all.
• For all three models, the general problem is NP-hard and in the worst case can require enumerating all trees of size n (this is super-exponential in n).
• Phylogeny Software
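To give a feel for "super-exponential", the sketch below counts rooted binary trees on n labeled leaves using the standard double-factorial formula (2n-3)!!. The function name is made up for illustration, and the slides do not say which family of trees they count, so treat the choice of rooted, leaf-labeled trees as an assumption.

```python
def num_rooted_binary_trees(n: int) -> int:
    """Number of rooted binary trees on n labeled leaves: (2n-3)!! for n >= 2."""
    if n < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count


for n in (3, 4, 10, 20):
    print(n, num_rooted_binary_trees(n))
```

Already at n = 10 there are 34,459,425 candidate trees, which is why exhaustive enumeration is only feasible for very small inputs.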
Parsimony
• Cost = number of changes along each edge, summed across all edges.
[Figure: example tree over the sequences CT, AG and AT with edge costs 1, 0, 1, 0; total cost = 2]
• Need to choose:
  • Topology of the tree
  • Alignment of the sequences
  • Assignment of the internal nodes
• Small parsimony: the topology and alignment are given.
• Large parsimony: the full problem.
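A minimal sketch of the cost definition, assuming the sequences are already aligned and every node (including internal ones) has been assigned a sequence. The edge list below is a hypothetical tree over the slide's sequences CT, AG and AT; the exact shape of the slide's figure is not recoverable, so the shape here is an assumption chosen to reproduce its total cost of 2.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def parsimony_cost(edges) -> int:
    """Sum of per-edge changes; edges is a list of (parent_seq, child_seq)."""
    return sum(hamming(parent, child) for parent, child in edges)


# Hypothetical labeling: root AT with children CT (leaf) and an internal
# node AT, whose children are the leaves AG and AT.
edges = [("AT", "CT"), ("AT", "AT"), ("AT", "AG"), ("AT", "AT")]
print(parsimony_cost(edges))  # 2
```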
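Small Parsimony
• Observation: each character position can be processed separately, since the costs are additive.
• Fitch-Hartigan Algorithm:
  • $S$ = the character set
  • $C(v,x)$ = best cost for the subtree rooted at $v$, assuming $v$ is assigned the character $x \in S$
  • $C(v) = \min_{x \in S} C(v,x)$
  • Leaves: $C(v,x) = 0$ if $v$ is labeled $x$, and $\infty$ otherwise
  • Internal nodes with children $c_1, c_2$: $C(v,x) = \sum_{i} \min\bigl(C(c_i,x),\; C(c_i) + 1\bigr)$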
Dynamic programming
• Go up the tree calculating C(v,x) and C(v).
• Trace back down the tree, assigning one of the x to each node.
• Time: with k = |S| and m = number of characters in each sequence, O(nk) per character and O(nmk) in total.
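The two passes can be sketched for a single character position as follows. This is an illustrative sketch, not the lecture's code: it assumes unit cost per change, represents a leaf as a one-character string and an internal node as a tuple of children, and uses the recurrence C(v,x) = sum over children c of min(C(c,x), C(c) + 1). All function names are hypothetical.

```python
import math


def cost_table(v, alphabet):
    """Upward pass: return (C, kids) where C[x] = C(v, x)."""
    if isinstance(v, str):  # leaf: cost 0 for its own character, else infinity
        return {x: 0 if x == v else math.inf for x in alphabet}, []
    kids = [cost_table(c, alphabet) for c in v]
    mins = [min(Ck.values()) for Ck, _ in kids]  # C(c) for each child
    C = {x: sum(min(Ck[x], m + 1) for (Ck, _), m in zip(kids, mins))
         for x in alphabet}
    return C, kids


def assign(result, parent_char=None):
    """Downward pass: pick a character for each node."""
    C, kids = result
    best = min(C, key=C.get)
    # Keep the parent's character unless switching (one change) is cheaper.
    keep = parent_char is not None and C[parent_char] <= C[best] + 1
    x = parent_char if keep else best
    if not kids:
        return x
    return (x, [assign(k, x) for k in kids])


def small_parsimony(tree, alphabet="ACGT"):
    """Return (minimum cost, labeled tree) for one character position."""
    result = cost_table(tree, alphabet)
    return min(result[0].values()), assign(result)


cost, labeled = small_parsimony((("C", "A"), "A"))
print(cost, labeled)
```

For full sequences, the same routine would be run once per character position and the costs added, which is where the O(nmk) total comes from.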
Large Parsimony
• Solution 1: Branch and Bound (exact solution)
• Each node of the search tree adds a new leaf in all possible positions.
[Figure: branch-and-bound search tree over leaves A, B, C and D]
• The algorithm works acceptably if the initial estimate is very good and pruning works well.
Large Parsimony
• Solution 2: Local Search
• Start with a good guess and use local search to find a local optimum.
[Figure: alternative tree topologies over leaves A, B, C and D]
• Can hill-climb, or use simulated annealing.
Trees based on Evolutionary Distances
• Assume a distance metric $D_{ij}$ between sequences that models the evolutionary distance between i and j (i.e., time on an evolutionary clock).
• Problem: find a phylogenetic tree with edge weights that “best” matches these distances.
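The Distance
• Edit distance does not properly model evolutionary change when the distance is large and the alphabet is small, because repeated mutations at the same location are counted at most once.
• Jukes-Cantor method: for a mutation rate $a$ (the rate of change to each of the other three bases) and a single DNA location, the probability that the location differs after time $t$ is, in the standard form of the model, $f = \frac{3}{4}\left(1 - e^{-4at}\right)$, which gives $D = -\frac{3}{4}\ln\left(1 - \frac{4}{3}f\right)$, where $f$ is the fraction of locations that have mutated.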
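As a sketch, the Jukes-Cantor correction can be computed from two aligned sequences as below. It uses the standard form D = -(3/4) ln(1 - 4f/3); the function name is hypothetical, and returning infinity when f reaches 3/4 (where the logarithm is undefined) is a choice made here, not something the slides specify.

```python
import math


def jukes_cantor_distance(s: str, t: str) -> float:
    """Jukes-Cantor corrected distance between two aligned DNA sequences."""
    if len(s) != len(t) or not s:
        raise ValueError("sequences must be non-empty and aligned")
    f = sum(a != b for a, b in zip(s, t)) / len(s)  # fraction of differing sites
    if f >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * f / 3)


print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAT"))
```

The corrected distance is always at least the observed fraction f, and the gap grows as f approaches 3/4, reflecting hidden repeated mutations.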
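The Cost
• $D(i,j)$ = sum of weights on the path from i to j in the phylogenetic tree.
[Figure: example weighted tree with leaves A, B and C]
• Cavalli/Edwards cost metric: $\sum_{i<j} \bigl(D(i,j) - D_{ij}\bigr)^2$
• Fitch/Margoliash cost metric: $\sum_{i<j} \bigl(D(i,j) - D_{ij}\bigr)^2 / D_{ij}^2$
• Need to determine both the tree and the edge weights.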
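The sketch below evaluates both cost metrics for a fixed weighted tree, using the usual least-squares forms (the sum over pairs of squared differences, unweighted for Cavalli/Edwards and divided by the squared observed distance for Fitch/Margoliash). The tree encoding as `(u, v, weight)` edges, the nested-dict distance tables, and the example numbers are all assumptions for illustration.

```python
from itertools import combinations


def tree_distances(edges, leaves):
    """Path-length distance from each leaf to every node of a weighted tree."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def from_node(src):
        dist, stack = {src: 0.0}, [src]
        while stack:  # depth-first walk; paths in a tree are unique
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    return {leaf: from_node(leaf) for leaf in leaves}


def cavalli_edwards(D, T, leaves):
    return sum((T[i][j] - D[i][j]) ** 2 for i, j in combinations(leaves, 2))


def fitch_margoliash(D, T, leaves):
    return sum(((T[i][j] - D[i][j]) / D[i][j]) ** 2
               for i, j in combinations(leaves, 2))


leaves = ["A", "B", "C"]
edges = [("r", "x", 1), ("x", "A", 1), ("x", "B", 1), ("r", "C", 2)]
D = {"A": {"B": 2, "C": 4}, "B": {"C": 5}}  # hypothetical observed distances
T = tree_distances(edges, leaves)
print(cavalli_edwards(D, T, leaves), fitch_margoliash(D, T, leaves))
```

Here the tree reproduces the A-B and A-C distances exactly and misses B-C by 1, so the first metric is 1 and the second is 1/25.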
Finding the optimal
• In general the problem is NP-hard.
• If there is a solution with zero cost, the matrix defines an “additive metric space”.
• In this case there is an $O(n^2)$ algorithm for the problem.
• Otherwise heuristics based on clustering are used, e.g. UPGMA (Unweighted Pair Group Method with Arithmetic mean).
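One way to test whether a distance matrix is additive is the four-point condition: for every four items, of the three pairwise sums D_ij + D_kl, D_ik + D_jl and D_il + D_jk, the two largest must be equal. The sketch below is a brute-force O(n^4) check of that condition, not the O(n^2) tree-building algorithm the slide refers to; the function name and tolerance parameter are assumptions.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix D."""
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:  # two largest sums must agree
            return False
    return True


# Distances read off a tree with pendant edges 1, 2, 3, 4 and inner edge 5.
additive = [[0, 3, 9, 10],
            [3, 0, 10, 11],
            [9, 10, 0, 7],
            [10, 11, 7, 0]]
print(is_additive(additive))  # True
```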
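UPGMA Clustering
• Initially each sequence is its own cluster.
• Repeat:
  • Find two clusters i and j with minimum $D_{ij}$.
  • Join them into a new cluster, and a new phylogenetic tree whose root is at height $D_{ij}/2$ above the leaves (so each root branch has weight $D_{ij}/2$ minus the height of the subtree it joins).
  • For each other cluster k, set $D_{(ij),k} = \dfrac{n_i D_{ik} + n_j D_{jk}}{n_i + n_j}$, where $n_i$ and $n_j$ are the numbers of sequences in clusters i and j.
[Figure: clusters A and B joined into AB, then AB joined with C by a root at height $D_{AB,C}/2$]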
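A compact sketch of the clustering loop, assuming a symmetric distance matrix given as a list of lists. It places each new root at height D_ij/2 and uses the size-weighted average for the updated distances, which is the usual UPGMA rule; the tree encoding as nested `(subtree, branch_length)` pairs is made up for this example.

```python
def upgma(names, D):
    """Return (tree, root_height); tree is a leaf name or a pair of
    (subtree, branch_length) tuples."""
    n = len(names)
    # cluster id -> (tree, number of leaves, height of its root)
    clusters = {i: (names[i], 1, 0.0) for i in range(n)}
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        (i, j), dij = min(dist.items(), key=lambda kv: kv[1])
        ti, ni, hi = clusters.pop(i)
        tj, nj, hj = clusters.pop(j)
        h = dij / 2  # height of the new root
        del dist[(i, j)]
        for k in clusters:  # size-weighted average distance to the new cluster
            dik = dist.pop((min(i, k), max(i, k)))
            djk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = (((ti, h - hi), (tj, h - hj)), ni + nj, h)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height


tree, height = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree, height)
```

In this example A and B are joined first at height 1, and the result is then joined with C at height 3, giving an ultrametric tree.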
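Maximum Likelihood
• Problem: find a tree, an internal labeling, and a set of edge weights (representing evolutionary time) such that $P_T = \prod_{(x,y) \in E} P_{x \to y}(t_{xy})$ is maximized, where $E$ is the set of tree edges.
• The probability $P_{x \to y}(t_{xy})$ is the probability that x will mutate to y in time $t_{xy}$.
• $P_T$ is the likelihood (probability) of the given tree.
[Figure: tree with root a, whose children are b and leaf C; b has leaf children A and B; edge times $t_{ab}$, $t_{aC}$, $t_{bA}$, $t_{bB}$]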
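To make the objective concrete, the sketch below evaluates the product of per-edge mutation probabilities for a tree whose nodes are all labeled. The slides do not say which mutation model defines the per-edge probability, so this assumes the Jukes-Cantor model with rate a per target base and independent sites, and it ignores any prior on the root label. The function names and the example tree are hypothetical.

```python
import math


def jc_prob(x: str, y: str, t: float, rate: float = 1.0) -> float:
    """Jukes-Cantor probability that base x becomes base y after time t."""
    e = math.exp(-4 * rate * t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def tree_likelihood(edges, rate: float = 1.0) -> float:
    """Product over edges (parent_seq, child_seq, t) and over sites."""
    p = 1.0
    for parent, child, t in edges:
        for x, y in zip(parent, child):
            p *= jc_prob(x, y, t, rate)
    return p


# Root a with children b and C; b with children A and B (one site each).
edges = [("A", "A", 0.1), ("A", "C", 0.3), ("A", "A", 0.05), ("A", "G", 0.2)]
print(tree_likelihood(edges))
```

For long sequences a practical implementation would sum logarithms instead of multiplying probabilities, to avoid floating-point underflow.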
Maximum Likelihood
• Finding the internal labeling given the times and the tree is easy using dynamic programming.
• In practice, finding the times is not hard.
• Finding the tree is hard.