Building Phylogenies Distance-Based Methods
Methods • Distance-based • Parsimony • Maximum likelihood
Distance Matrices
      a   b   c   d
  a   0
  b   6   0
  c   7   3   0
  d  14  10   9   0
[Figure: tree on leaves a, b, c, d drawn against a scale from 0 to 8]
• A distance matrix is additive if there is a tree that fits it exactly
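Additivity can be tested without building the tree. The sketch below uses the four-point condition, which the slide does not state: for every four taxa, the two largest of the three pairwise sums must be equal. The function name `is_additive` and the tolerance `tol` are illustrative choices, and the matrix is assumed to be symmetric with a zero diagonal.

```python
from itertools import combinations_with_replacement

def is_additive(D, tol=1e-9):
    """Check the four-point condition on a symmetric matrix with zero diagonal."""
    n = len(D)
    # Repeated indices are allowed so that the triangle inequality is checked too.
    for i, j, k, l in combinations_with_replacement(range(n), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        # The two largest sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Matrix from the slide (taxa a, b, c, d), written out symmetrically.
D = [[0, 6, 7, 14],
     [6, 0, 3, 10],
     [7, 3, 0, 9],
     [14, 10, 9, 0]]
print(is_additive(D))
```

For the slide's matrix the three sums for (a, b, c, d) are 15, 17 and 17, so the condition holds.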
Ultrametric Matrices
      a   b   c   d
  a   0
  b   2   0
  c   6   6   0
  d  10  10  10   0
[Figure: tree on leaves a, b, c, d drawn against a scale from 0 to 5]
• Ultrametric = additive + molecular clock assumption
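A matrix can be checked for ultrametricity with the three-point condition, which is not stated on the slide: in every triple of taxa, the two largest of the three distances must be equal. The sketch below assumes a symmetric matrix with a zero diagonal; `is_ultrametric` and `tol` are illustrative names.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """Check the three-point condition on a symmetric matrix with zero diagonal."""
    for i, j, k in combinations(range(len(D)), 3):
        d = sorted([D[i][j], D[i][k], D[j][k]])
        # The two largest distances in each triple must coincide.
        if abs(d[2] - d[1]) > tol:
            return False
    return True

# Ultrametric matrix from the slide (taxa a, b, c, d).
U = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
print(is_ultrametric(U))
```

A matrix can be additive without being ultrametric: any triple whose two largest distances differ, such as 3, 6 and 7, fails this test.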
Methods • Fitch-Margoliash • UPGMA • Neighbor-joining • Many others
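Least squares trees • Minimize $Q = \sum_{i<j} w_{ij}\,(D_{ij} - d_{ij})^2$ over all trees, where $D_{ij}$ is the observed distance and $d_{ij}$ is the path length between $i$ and $j$ in the tree • Choice of weights $w_{ij}$: • Uniform: $w_{ij} = 1$ • Fitch-Margoliash: $w_{ij} = 1/D_{ij}^2$ • Others . . .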
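The least squares score of one candidate tree is easy to evaluate once its path lengths are known. The sketch below assumes the usual weighted sum of squared differences between observed distances `D` and tree path lengths `d`. The function name, the `weight` parameter and the candidate tree are all hypothetical. It scores a single tree and does not perform the search over trees.

```python
def least_squares_score(D, d, weight=lambda Dij: 1.0):
    """Weighted sum of squared differences between observed and tree distances."""
    n = len(D)
    return sum(weight(D[i][j]) * (D[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Observed distances (taxa a, b, c, d).
D = [[0, 6, 7, 14],
     [6, 0, 3, 10],
     [7, 3, 0, 9],
     [14, 10, 9, 0]]

# Path lengths in a hypothetical candidate tree whose branch to d is one unit too short.
d = [[0, 6, 7, 13],
     [6, 0, 3, 9],
     [7, 3, 0, 8],
     [13, 9, 8, 0]]

uniform = least_squares_score(D, d)                            # w_ij = 1
fitch_margoliash = least_squares_score(D, d, lambda x: 1 / x ** 2)  # w_ij = 1 / D_ij^2
print(uniform, fitch_margoliash)
```

The Fitch-Margoliash weights shrink the penalty on large distances, which tend to be the least reliably estimated.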
Clustering Methods • E.g., UPGMA and Neighbor-Joining • A cluster is a set of taxa • Interspecies distances translate into intercluster distances • Clusters are repeatedly merged • “Closest” clusters merged first • Distances are recomputed after merging
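UPGMA • Unweighted pair group method using arithmetic averages • The distance between clusters $C_i$ and $C_j$ is $D_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d_{pq}$, where $d_{pq}$ is the distance between sequences $p$ and $q$ • After merging $C_i$ and $C_j$ to create cluster $C_k$, define the distance from $k$ to every other cluster $r$ as $D_{kr} = \frac{|C_i|\,D_{ir} + |C_j|\,D_{jr}}{|C_i| + |C_j|}$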
UPGMA: Initialization • Assign each sequence $i$ to its own cluster $C_i$ • Define one leaf (tip) of the tree for each sequence and place it at height 0
UPGMA: Iteration • Repeat until only two clusters remain: • Choose the two clusters $i$ and $j$ with smallest $D_{ij}$ • Create a new cluster $k$, where $C_k = C_i \cup C_j$ • Compute $D_{kr}$ for all $r$ • Define a new node $k$ with children $i$ and $j$, and place it at height $D_{ij}/2$ • Add $k$ to the current clusters and delete $i$ and $j$ • Let $i$ and $j$ be the remaining clusters. Place the root at height $D_{ij}/2$
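The initialization and iteration steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes the size-weighted average update $D_{kr} = (|C_i| D_{ir} + |C_j| D_{jr}) / (|C_i| + |C_j|)$, it treats the final root placement as one more merge, and it breaks ties by taking the first minimal pair. The nested-tuple tree format `(left, right, height)` is an illustrative choice.

```python
def upgma(D, names):
    """Build a UPGMA tree as nested tuples (left_subtree, right_subtree, height)."""
    n = len(names)
    # Each cluster id maps to (subtree, number of leaves); leaves sit at height 0.
    clusters = {i: (names[i], 1) for i in range(n)}
    dist = {frozenset((i, j)): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Choose the two clusters i and j with the smallest distance.
        pair = min((frozenset((a, b)) for a in clusters for b in clusters if a < b),
                   key=dist.get)
        i, j = sorted(pair)
        d_ij = dist[pair]
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        k, next_id = next_id, next_id + 1
        # Distance from the merged cluster k to every remaining cluster r.
        for r in clusters:
            dist[frozenset((k, r))] = (
                size_i * dist[frozenset((i, r))] + size_j * dist[frozenset((j, r))]
            ) / (size_i + size_j)
        # The new node k has children i and j and sits at height D_ij / 2.
        clusters[k] = ((tree_i, tree_j, d_ij / 2), size_i + size_j)
    (tree, _), = clusters.values()
    return tree

# Ultrametric example matrix (taxa a, b, c, d).
U = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
print(upgma(U, ["a", "b", "c", "d"]))
```

On this matrix the merges happen at heights 1, 3 and 5, which matches the 0 to 5 scale of the ultrametric example.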
A pitfall of UPGMA • The algorithm produces an ultrametric tree: the distance from the root to any leaf is the same • UPGMA assumes a constant molecular clock: all species accumulate mutations (evolve) at the same rate.
Neighbor Joining • Saitou and Nei, Molecular Biology and Evolution 4 (1987) • Idea: Find a pair of leaves that are close to each other but far from other leaves • Implicitly finds a pair of neighboring leaves • Advantages: • Works well for additive matrices and for other, nonadditive matrices • Does not have the molecular clock assumption
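Long branches must be handled carefully! [Figure: four-leaf tree with branch lengths 0.1, 0.1, 0.1, 0.4, 0.4] • Two leaves that are not neighbors are closer to each other than to either of the other two leaves • The obvious approach (merge the closest pair first) produces incorrect clusters!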
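Compensating for long edges • Introduce “correction terms” $u_i = \frac{1}{n-2} \sum_{k \ne i} D_{ik}$ (average distance to other taxa, where $n$ is the current number of leaves) • “Corrected” distances: $D_{ij} - u_i - u_j$ • Distances are reduced for pairs that are far away from all other species: they may be close to each other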
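Neighbor-joining • Repeat the following until only two leaves remain: • Choose $i$, $j$ such that $D_{ij} - u_i - u_j$ is minimum • Define a new leaf $k$ whose distances to $i$ and $j$ are $d_{ik} = \frac{1}{2}(D_{ij} + u_i - u_j)$ and $d_{jk} = \frac{1}{2}(D_{ij} + u_j - u_i)$ • Compute the distance from $k$ to every other leaf $r$: $D_{kr} = \frac{1}{2}(D_{ir} + D_{jr} - D_{ij})$ • Delete $i$ and $j$ • Connect the 2 remaining leaves by a branch of length $D_{ij}$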
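The loop above can be sketched as follows. The formulas are assumptions taken from the standard formulation of neighbor-joining: $u_i = \sum_{k \ne i} D_{ik} / (n - 2)$, branch lengths $(D_{ij} + u_i - u_j)/2$ and $(D_{ij} + u_j - u_i)/2$, and update $D_{kr} = (D_{ir} + D_{jr} - D_{ij})/2$. The internal node names `n0`, `n1`, ... and the edge-list output are illustrative, and taxon names are assumed not to clash with them.

```python
from itertools import combinations

def neighbor_joining(D, names):
    """Return an unrooted tree as a list of edges (node, node, branch_length)."""
    n0 = len(names)
    nodes = list(names)
    dist = {frozenset((names[i], names[j])): float(D[i][j])
            for i in range(n0) for j in range(i + 1, n0)}
    d = lambda a, b: dist[frozenset((a, b))]
    edges, counter = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # Correction term for every current leaf.
        u = {a: sum(d(a, b) for b in nodes if b != a) / (n - 2) for a in nodes}
        # Pick the pair with the smallest corrected distance (first one on ties).
        i, j = min(combinations(nodes, 2), key=lambda p: d(*p) - u[p[0]] - u[p[1]])
        k = f"n{counter}"
        counter += 1
        # Branch lengths from the new node k to i and to j.
        edges.append((i, k, (d(i, j) + u[i] - u[j]) / 2))
        edges.append((j, k, (d(i, j) + u[j] - u[i]) / 2))
        # Distance from k to every other leaf r.
        for r in nodes:
            if r not in (i, j):
                dist[frozenset((k, r))] = (d(i, r) + d(j, r) - d(i, j)) / 2
        nodes = [r for r in nodes if r not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d(a, b)))  # connect the two remaining leaves
    return edges

# Additive example matrix (taxa a, b, c, d).
D = [[0, 6, 7, 14],
     [6, 0, 3, 10],
     [7, 3, 0, 9],
     [14, 10, 9, 0]]
for edge in neighbor_joining(D, ["a", "b", "c", "d"]):
    print(edge)
```

Because this matrix is additive, the path lengths in the returned tree reproduce every entry of `D` exactly, even though it is not ultrametric.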
Computing distance matrices • Based on sequence alignment • Various possibilities: • Distance = average number of differences per aligned site • Try different PAM matrices; distance = index of matrix that gives highest score • Feng and Doolittle: Based on alignment scores, roughly the ratio to the maximum possible score (see text) • Read, e.g., PHYLIP documentation: http://evolution.genetics.washington.edu/phylip/general.html
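The first option, the average number of differences, can be sketched directly from a multiple alignment. The code below assumes equal-length aligned sequences and skips any column where either sequence has a gap character `-`. That gap rule is a choice made here, not something the slide specifies, and the function names and toy sequences are illustrative.

```python
def p_distance(s, t, gap="-"):
    """Fraction of aligned, gap-free sites at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(s, t) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances for a list of aligned sequences."""
    n = len(seqs)
    return [[0.0 if i == j else p_distance(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]

# Toy alignment, made up for illustration.
alignment = ["ACGTACGT", "ACGTACGA", "ACGAACTA"]
for row in distance_matrix(alignment):
    print(row)
```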
Distance correction • The amount of evolutionary change is not linearly related to time • Over a long period of time, a series of substitutions may bring us back to where we started • Percentage difference may underestimate evolutionary time
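The slide does not name a particular correction. As one common example, the Jukes-Cantor formula for DNA converts an observed fraction of differing sites $p$ into an estimated number of substitutions per site, $d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$. The sketch below assumes that model, which the source does not mention, and the function name is illustrative.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed fraction p of differing sites."""
    if p < 0:
        raise ValueError("p must be non-negative")
    if p >= 0.75:
        # At or beyond saturation the correction is undefined; report infinity.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

for p in (0.05, 0.25, 0.5):
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as $p$, and the gap widens as $p$ grows, which reflects the point that raw percentage difference underestimates the amount of change.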