160 likes | 278 Views
Phylogenetics. Motivation Concepts Algorithms. Motivation. Life arose just once - “Thou art my brethren but a few sequences removed”
E N D
Phylogenetics • Motivation • Concepts • Algorithms Lecture 13 CS566
Motivation • Life arose just once - “Thou art my brethren but a few sequences removed” • Phylogenetic trees = Topologies of evolutionary relationships between sequences, and, possibly, species - “Is it man or cow that is the true heir of the fabulous treasures of the woolly mammoth dynasty?” • Phylogenetic tree as guide tree for multiple sequence alignment (déjà vu) Lecture 13 CS566
Concepts • Mutation and Evolution • Mutations that persist over generations = Evolution • Tree, not a lattice • Each species arose just once • Speciesphylogeny (often) != Sequence phylogeny • Sequences evolve at different rates • Within a single species • Between different species • Within a single sequence • Especially in bacteria, horizontal transfer (“Napster’s been around for ages”) quite common Lecture 13 CS566
Concepts • Molecular clock assumption • Sequences drift apart at a constant rate • Aka edge length proportional to time • Aka satisfaction of ultrametricity • For any 3 sequences (all pair-wise distances are equal) xor (2 distances are equal, and the third one smaller) • If true, then • All path lengths from root to leaf nodes are equal • Additivity • Distance metric chosen is • True distance (fulfils triangular inequality) • Such that cumulative sum of edge lengths along path between 2 sequences equals the distance between 2 sequences Lecture 13 CS566
Concepts • Heuristic forays into intractable space • Start with pairwise “distances” • Path length = Distance (~Evolutionary time) • Work from leaves to node to generate tree • (opposite of binary tree generation) • “Its easier to be rootless than to be rooted” • Binary tree approximation of higher order trees • Edges do not imply direct links (Missing links/incomplete data), only a representation of sequence evolution Lecture 13 CS566
Algorithms • Parsimony (Character-based) • Distance based methods • Neighbor joining • UPGMA • Maximum Likelihood IIncreasing Sequence Similarity Lecture 13 CS566
Algorithms • UPGMA (Unweighted pair group method with arithmetic averages) • Caveat if molecular clock not applicable: “If my cousin looks more like me than my brother, he must be my lost brother, and perhaps my brother my cousin?” • Neighbor joining • “Give me additive distances, and I shall give thee a tree, even if some sequences morph faster than others” • Parsimony • “Its just a bruise, not Kaposi’s sarcoma!” • Maximum Likelihood • “Given the facts, Watson, the answer is elementary!” Lecture 13 CS566
UPGMA • Easiest to use if molecular clock and additivity are valid • No. of clusters = no. of sequences = no. of leaf nodes • Inter cluster distance = Average pairwise distance {While (no. of clusters > 1) • Connect pair of closest clusters (at distance d) with intermediate node at distance d/2 from each of them} • Caveat: Satisfies minimal distance requirement, but may result in spurious topologies – because of constant rate evolution assumption Lecture 13 CS566
Parsimony • Parsimony (“Miserliness in model space”): Pick the simplest explanation that fits the facts - “If I hear a blood-curdling scream, it’s just one of my sons trying to kill the other – not an invasion by aliens!” • Every possible tree evaluated in terms of total number of steps needed to convert each sequence to another • Practical for only a few sequences • High percentage of similarity a prerequisite • Neither identical or ‘completely different’ sequence positions useful • Each difference should represent a single step (WYSIWYG) and not a ‘full circle’ or ‘non-shortest route’ Lecture 13 CS566
Parsimony 123456789………… • ACCEFAHIKLKNPR • ACCEFGHILLLNPR • ACDEFGHIKLINPK • AADEFGHILLNNPK * * * 1 C 3 D D C 2 C 4 D Candidate tree for position 3 Lecture 13 CS566
Parsimony • 3 sets of 3 trees each compared • The one with lowest total number of substitutions selected • Refinements: • Branch and bound: • Abandon a tree if subtree has a higher score than current minimal score tree • Heuristic branch-pattern representatives • Non-boolean costs: Tranversion > transition OR use of amino-acid substitution matrices Lecture 13 CS566
Neighbor Joining • Generates unrooted tree, allowing for unequal branches • Given: Distance matrix for sequences • Steps: Repeat 1-3 till all branches generated • Take closest sequences i, j • Find branch lengths between i and j by treating remaining sequences as composite (c) • Calculate average i-C and j-C distances • Calculate branch lengths i and j • Treat ij as composite sequence now and generate new distance table. • Generate multiple trees by starting with different pairs • Compare resulting trees in terms of best fit to original distance matrix Lecture 13 CS566
Rooting trees • Based on a “proxy ancestor” • Include a distant relative (“outgroup”) as the proxy ancestor • Add the outgroup as the last node • Point of attachment of outgroup represents root • Diameter center • Place root at center of longest path through tree Lecture 13 CS566
Summary • Parsimony and ML based approaches computationally intensive – scalability poor • Neighbor joining adequate if additivity assumption is valid • UPGMA adequate if both molecular clock and additivity assumptions are valid for given set of sequences Lecture 13 CS566
Summary • Phylogenetics useful to understand sequence evolution • Phylogenetics makes sense for • sequences with a high percentage of sequence identity • sequences not subject to ‘selection’ • Sequence tree not the same as species tree Lecture 13 CS566