Chapter 10 Phylogenetic Basics
Molecular evolution and molecular phylogenetics • Similarities and divergence between biological sequences are often represented by phylogenetic trees • Phylogenetics is the study of the evolutionary history of organisms • Based on fossil data in the Victorian era, but more recently on molecular data • Sequences in biological polymers provide a history of changes • Advantages of molecular phylogenetics: • Molecular data are more numerous than fossils • No sampling bias involved • More robust phylogenetic trees can be constructed
Major assumptions • Sequences used must be homologous • Phylogenetic divergence is assumed to be bifurcating (=forking) • Each position in the sequence evolved independently • Variability is informative enough to construct unambiguous trees
Terminology: clade, monophyletic, taxon, node, branch, polytomy, dichotomy, lineage, root node
Unrooted tree • No knowledge of common ancestor • Relative relationships • No evolutionary direction • To root an unrooted tree: • Use an outgroup (a distant relative; e.g., a bird for a mammal tree) • Midpoint rooting (midpoint of the two most divergent groups) [Figure: an unrooted tree and a rooted tree of taxa A, B, C, D]
Gene phylogeny versus species phylogeny • The objective of constructing molecular phylogenetic trees is to reconstruct the evolutionary history and relationships between species or organisms • The rate at which a gene evolves may not mirror that of a species • Genes may arrive by horizontal transfer • An internal node in a molecular phylogenetic tree represents a gene duplication, whereas in a species phylogenetic tree it represents a speciation event • Getting accurate species phylogenies from molecular data requires phylogenetic analysis of several gene or protein families
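Forms of tree representation • Cladogram: non-scaled (branch lengths carry no meaning) • Phylogram: scaled (branch lengths are drawn to scale) [Figure: taxa A to E drawn as a non-scaled cladogram and as a scaled phylogram]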
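Newick format • Without branch lengths: (((B,C),A),(D,E)) • With branch lengths: (((B:1,C:2),A:2),(D:1.2,E:2.4)) [Figure: the corresponding trees of taxa A to E, drawn without and with branch lengths]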
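The nesting of parentheses in a Newick string mirrors the nesting of clades. The sketch below is a minimal illustration, not a full Newick implementation: it assumes a tree held as nested Python tuples, with a leaf as a string and a branch length attached by wrapping a subtree as `(subtree, length)`. The function name `to_newick` and this tuple encoding are illustrative choices, not part of any library.

```python
def to_newick(node):
    """Serialize a nested-tuple tree to a Newick string (no trailing ';').

    Assumed encoding (hypothetical, for illustration only):
      - a leaf is a string, e.g. "A"
      - an internal node is a tuple of child nodes
      - (subtree, number) attaches a branch length to that subtree
    """
    # A 2-tuple whose second item is a number means "subtree with branch length".
    if isinstance(node, tuple) and len(node) == 2 and isinstance(node[1], (int, float)):
        subtree, length = node
        return f"{to_newick(subtree)}:{length:g}"
    # Any other tuple is an internal node: children go inside parentheses.
    if isinstance(node, tuple):
        return "(" + ",".join(to_newick(child) for child in node) + ")"
    # Otherwise the node is a leaf label.
    return str(node)


topology = ((("B", "C"), "A"), ("D", "E"))
scaled = (((("B", 1), ("C", 2)), ("A", 2)), (("D", 1.2), ("E", 2.4)))

# Newick files conventionally end each tree with a semicolon.
print(to_newick(topology) + ";")  # (((B,C),A),(D,E));
print(to_newick(scaled) + ";")    # (((B:1,C:2),A:2),(D:1.2,E:2.4));
```

The two printed strings correspond to the two examples on the slide, with the terminating semicolon added.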
Finding a tree may be difficult • The number of possible tree topologies is a function of the number of taxa $n$ • Rooted trees: $N_R = \frac{(2n-3)!}{2^{n-2}(n-2)!}$ • Unrooted trees: $N_U = \frac{(2n-5)!}{2^{n-3}(n-3)!}$
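To see how quickly the number of topologies grows, the two counting formulas can be evaluated directly. This is a small sketch that assumes strictly bifurcating trees; the function names are illustrative.

```python
from math import factorial


def rooted_tree_count(n):
    """Number of rooted bifurcating topologies for n taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_tree_count(n):
    """Number of unrooted bifurcating topologies for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 10):
    print(n, rooted_tree_count(n), unrooted_tree_count(n))
# 3 3 1
# 4 15 3
# 5 105 15
# 10 34459425 2027025
```

With only 10 taxa there are already more than 34 million rooted topologies, which is why exhaustive tree searches quickly become impractical.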
Procedure to construct a tree • Choosing molecular markers • Performing multiple sequence alignment • Choosing a model of evolution • Determining a tree-building method • Assessing tree reliability
Choice of molecular markers • DNA retains smaller changes (only 4 nucleotides) • To study closely related organisms, use DNA • For human population studies, use non-coding mitochondrial sequences • For more widely divergent groups, use rRNA or protein sequences • To compare bacteria with eukaryotes, use conserved protein sequences • Proteins are more conserved due to the degeneracy of codons • Evolutionary rates differ between the nucleotide positions in a codon • DNA sequences are biased because of codon preferences • Two random DNA sequences will have 50% identity if gaps are allowed • Random protein sequences have only 10% identity • Gaps in alignments of protein-coding DNA can cause frameshifts, which makes the alignment biologically meaningless • Protein-based phylogeny is therefore often preferable to nucleotide-based phylogeny • However, DNA provides data on synonymous and non-synonymous substitutions, which give information on positive and negative selection
Alignment • A correct alignment is crucial; otherwise there will be errors in the trees • Use a modern package such as T-Coffee • Manual verification and editing are essential • Secondary structure can serve as a guide in alignment (PRALINE) • Non-homologous regions may have to be removed (subjective) • Indels may be removed, but gap regions may belong to signature indels and contain phylogenetic information
Multiple substitutions • The number of differences between two aligned sequences is an indication of their evolutionary distance ... or is it? • What about A->T->G->C? Or G->C->G? • Such multiple substitutions and convergences obscure true evolutionary distances • This effect is known as homoplasy • Statistical models are needed to correct for homoplasy
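Jukes-Cantor Model • Assumes all substitutions occur with the same probability • $d_{AB} = -\frac{3}{4}\ln\left[1 - \frac{4}{3}p_{AB}\right]$ • $d_{AB}$ is the evolutionary distance between sequences A and B • $p_{AB}$ is the observed sequence difference (fraction of differing positions) • Example: two 10-nucleotide sequences that differ at three nucleotides have $p_{AB} = 0.3$, so $d_{AB} = -\frac{3}{4}\ln\left[1 - \frac{4}{3}(0.3)\right] = 0.38$ • Mostly for closely related sequences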
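The worked example above can be reproduced in a few lines. This is a sketch under simple assumptions: the two sequences are already aligned, have equal length, and contain no gaps. The function names and the two example sequences are made up for illustration.

```python
from math import log


def p_distance(seq_a, seq_b):
    """Observed fraction of differing positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * log(1 - 4 * p / 3)


# Two hypothetical 10-nucleotide sequences that differ at three positions.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGAAG"

p = p_distance(seq_a, seq_b)
print(p)                          # 0.3
print(round(jukes_cantor(p), 2))  # 0.38
```

The corrected distance (0.38) is larger than the observed difference (0.3) because the model accounts for substitutions hidden by multiple hits at the same site.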
Kimura Model • $d_{AB} = -\frac{1}{2}\ln(1 - 2p_{ti} - p_{tv}) - \frac{1}{4}\ln(1 - 2p_{tv})$ • $d_{AB}$ is the evolutionary distance between two aligned sequences A and B • $p_{ti}$ is the observed frequency of transitions • $p_{tv}$ is the observed frequency of transversions • Example: if a 30% difference is due to 20% transitions and 10% transversions, $d_{AB} = -\frac{1}{2}\ln(1 - 2(0.2) - 0.1) - \frac{1}{4}\ln(1 - 2(0.1)) = 0.40$ • For protein sequences, a PAM substitution matrix that includes evolutionary information can be used • Kimura model for proteins: $d = -\ln(1 - p - 0.2p^2)$, where $p$ is the observed pairwise distance
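Both Kimura formulas are easy to check numerically. The sketch below takes the observed frequencies as inputs rather than counting them from sequences; the function names are illustrative.

```python
from math import log


def kimura_2p(p_ti, p_tv):
    """Kimura two-parameter distance from transition/transversion frequencies."""
    a = 1 - 2 * p_ti - p_tv
    b = 1 - 2 * p_tv
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -0.5 * log(a) - 0.25 * log(b)


def kimura_protein(p):
    """Kimura distance for proteins: d = -ln(1 - p - 0.2 * p**2)."""
    x = 1 - p - 0.2 * p * p
    if x <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -log(x)


# The example from the slide: 20% transitions and 10% transversions.
print(round(kimura_2p(0.2, 0.1), 2))  # 0.4

# A protein pair with 30% observed difference (value chosen for illustration).
print(kimura_protein(0.3))
```

For the same 30% observed difference, the Kimura estimate (0.40) is slightly higher than the Jukes-Cantor estimate (0.38) because transitions and transversions are treated separately.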
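Among-site variation • In DNA, the mutation rate differs by codon position • In proteins, there are functional constraints • Some positions have invariant rates and others variable rates • The rates at variable sites follow a γ distribution with shape parameter $\alpha$ • γ-corrected Jukes-Cantor: $d_{AB} = \frac{3}{4}\alpha\left[\left(1 - \frac{4}{3}p_{AB}\right)^{-1/\alpha} - 1\right]$ • γ-corrected Kimura: $d_{AB} = \frac{\alpha}{2}\left[(1 - 2p_{ti} - p_{tv})^{-1/\alpha} + \frac{1}{2}(1 - 2p_{tv})^{-1/\alpha} - \frac{3}{2}\right]$ • Here $p_{AB}$ is the observed sequence difference, and $p_{ti}$ and $p_{tv}$ are the observed transition and transversion frequencies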
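A γ correction can be obtained by replacing each $-\ln(x)$ term of the uncorrected distance with $\alpha(x^{-1/\alpha} - 1)$, which tends back to $-\ln(x)$ as $\alpha$ grows large. The sketch below implements both corrected distances in that form, with the Kimura version written as $\frac{\alpha}{2}[a^{-1/\alpha} + \frac{1}{2}b^{-1/\alpha} - \frac{3}{2}]$ for $a = 1 - 2p_{ti} - p_{tv}$ and $b = 1 - 2p_{tv}$. The function names and the values of `alpha` are illustrative assumptions; in practice the shape parameter is estimated from the data.

```python
from math import log


def jc_gamma(p, alpha):
    """Gamma-corrected Jukes-Cantor distance with shape parameter alpha > 0."""
    x = 1 - 4 * p / 3
    if x <= 0 or alpha <= 0:
        raise ValueError("need p < 0.75 and alpha > 0")
    return 0.75 * alpha * (x ** (-1 / alpha) - 1)


def k2p_gamma(p_ti, p_tv, alpha):
    """Gamma-corrected Kimura two-parameter distance with shape parameter alpha > 0."""
    a = 1 - 2 * p_ti - p_tv
    b = 1 - 2 * p_tv
    if a <= 0 or b <= 0 or alpha <= 0:
        raise ValueError("sequences too divergent or alpha not positive")
    return (alpha / 2) * (a ** (-1 / alpha) + 0.5 * b ** (-1 / alpha) - 1.5)


# Strong rate variation (alpha = 1, chosen for illustration) inflates the distance.
print(jc_gamma(0.3, 1.0))
print(k2p_gamma(0.2, 0.1, 1.0))

# With a very large alpha the values approach the uncorrected distances.
print(jc_gamma(0.3, 1e6), -0.75 * log(1 - 4 * 0.3 / 3))
print(k2p_gamma(0.2, 0.1, 1e6), -0.5 * log(0.5) - 0.25 * log(0.8))
```

A small $\alpha$ means strong rate variation among sites and therefore a larger correction; as $\alpha$ increases, the corrected distances converge on the plain Jukes-Cantor and Kimura values.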