Inferring phylogenetic trees: Distance methods

Prof. William Stafford Noble, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, thabangh@gmail.com

One-minute responses • Thank you for this lecture. It was very interesting. • I think I’m starting to program like a pro. • I wish to hear more on how we can understand better the evolutionary relationships among species, preferably among distinct human populations. • I think I enjoyed today’s lecture. More especially the class problems! • 70% of the course has been understood by me. • Tell us more about interpretations. • Python part was easy to follow today. • Python part was very easy to follow. I did not have any problem for the first time. • The lecture was well understood. • The Python part was not so easy for me, but OK. • I appreciate the revision every day, it is very helpful. • Can we learn how to have better output from Python (form / appearance)? • Can we work at this stage on real human genetic data?

Outline • Parsimony • Distance methods • Computing distances • Finding the tree • Maximum likelihood

Revision • What is the input to a phylogenetic inference problem? • A multiple alignment of DNA or protein sequences. • What is the output? • A binary tree showing the inferred evolutionary relationships. • For what types of phylogenetic inference problems is maximum parsimony the right approach? • Small numbers of input sequences. • Closely related sequences. • What are the two computational problems that must be solved in a maximum parsimony approach? • Enumerating all possible tree topologies. • Evaluating the parsimony score for a given topology.

Revision • Evaluate the parsimony score of the given tree with respect to the first column of the given alignment.
Scer RTGH
Skud RTGV
Sbay RVGV
Smik SVGH
Spom STIL
Svin RLGH
[Figure: tree over Scer, Skud, Sbay, Svin, Smik and Spom, with leaves and internal nodes labeled R or S] • Score = 1

Revision • Repeat, but use the second column of the alignment.
Scer RTGH
Skud RTGV
Sbay RVGV
Smik SVGH
Spom STIL
Svin RLGH
[Figure: tree over Scer, Skud, Sbay, Smik, Svin and Spom, with leaves labeled T, V or L and ambiguous internal nodes marked X] • Score = 2
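The per-column scoring asked for in these revision slides can be sketched with Fitch's small-parsimony algorithm. This is a minimal sketch: the function names and the tree topology below are hypothetical choices for illustration, so the scores it prints need not equal those on the slides, which depend on the trees drawn there.

```python
def fitch(tree, column):
    """Return (possible states, number of changes) for one alignment column.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    column -- dict mapping leaf name -> character in this column
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_changes = fitch(tree[0], column)
    right_states, right_changes = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # The children disagree: keep every candidate state and count one change.
    return left_states | right_states, left_changes + right_changes + 1


def column_score(tree, alignment, index):
    """Parsimony score of the tree for the alignment column at `index`."""
    column = dict((name, seq[index]) for name, seq in alignment.items())
    return fitch(tree, column)[1]


alignment = {
    "Scer": "RTGH", "Skud": "RTGV", "Sbay": "RVGV",
    "Smik": "SVGH", "Spom": "STIL", "Svin": "RLGH",
}

# Hypothetical topology, written as nested pairs (not taken from the slides).
tree = ((("Scer", "Skud"), ("Sbay", "Svin")), ("Smik", "Spom"))

for index in range(4):
    print("Column %d: score = %d" % (index + 1, column_score(tree, alignment, index)))
```

Summing `column_score` over all columns gives the parsimony score of the whole alignment for that topology.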

Selecting a method • Choose set of related sequences. • Obtain multiple sequence alignment. • Is there strong sequence similarity? If yes: maximum parsimony methods. • If no: is there clearly recognizable sequence similarity? If yes: distance methods. If no: maximum likelihood methods.

Distance methods • Multiple sequence alignment → Pairwise distance matrix → Phylogenetic tree

Calculating distance • [Figure: sequence ACTGAACGTAACGC joined to the two species by branches of length X and Y] • Species 1: ACTGTAGGAATCGC • Species 2: AATGAAAGAATCGC • The distance between species 1 and 2 is the sum of X and Y.

True evolutionary history • [Figure: ancestral sequence ACTGAACGTAACGC evolving into species 1 and species 2, with substitution events marked along each lineage] • Events shown: single substitution, multiple substitutions, coincidental substitutions, parallel substitutions, convergent substitution, back substitution.

Jukes-Cantor model • Assume the same probability of change at all positions and all times. • d_AB is the proportion of changed sites in the alignment. • K_AB is the expected number of changes per position. • Derivation at http://en.wikipedia.org/wiki/Models_of_DNA_evolution

Jukes-Cantor model • [Figure: aligned sequences of species 1 and species 2 with the changed sites marked] • 3 observed changes in 20 sites
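The standard Jukes-Cantor correction for DNA is K = -(3/4) ln(1 - (4/3) d). Applied to the slide's example of 3 observed changes in 20 sites, it can be sketched as follows. The function name `jukes_cantor` is a hypothetical choice, and returning infinity for d >= 0.75 (where the logarithm is undefined) is an assumption of this sketch.

```python
import math


def jukes_cantor(d):
    """Convert a proportion of changed sites d into expected changes per site."""
    if d >= 0.75:
        # Saturated: the correction is undefined, so report an infinite distance.
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


d_ab = 3 / 20.0            # 3 observed changes in 20 sites
k_ab = jukes_cantor(d_ab)
print("d = %.3f, K = %.3f" % (d_ab, k_ab))   # prints: d = 0.150, K = 0.167
```

The corrected distance is slightly larger than the observed proportion because some sites are expected to have changed more than once.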

Computing JK distances • Proportion of changed sites
Species 1: ACGTGATCGGTGA
Species 2: ACTTGATGCCTAG
Species 3: A-TTACGTAATGG
Species 4: A-TTGATGGCGTA
• Pairwise distances [Table: pairwise distance matrix]

Computing JK distances • Proportion of changed sites
Species 1: ACGTGATCGGTGA
Species 2: ACTTGATGCCTAG
Species 3: A-TTACGTAATGG
Species 4: A-TTGATGGCGTA
• Pairwise distances [Table: pairwise distance matrix] • From this matrix, we calculate the tree.
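Building the pairwise matrix from the four aligned sequences can be sketched as below. This is a sketch under stated assumptions: columns where either sequence has a gap are skipped (the slides do not say how gaps are handled), the correction is the standard Jukes-Cantor formula, and all function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of changed sites, ignoring columns with a gap in either sequence."""
    compared = 0
    changed = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            changed += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return changed / float(compared)


def jukes_cantor(d):
    """Expected changes per site; infinite once d reaches 0.75."""
    if d >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


def distance_matrix(sequences, corrected=False):
    """Return a symmetric list-of-lists matrix of pairwise distances."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = p_distance(sequences[i], sequences[j])
            if corrected:
                d = jukes_cantor(d)
            matrix[i][j] = matrix[j][i] = d
    return matrix


sequences = [
    "ACGTGATCGGTGA",   # Species 1
    "ACTTGATGCCTAG",   # Species 2
    "A-TTACGTAATGG",   # Species 3
    "A-TTGATGGCGTA",   # Species 4
]

for row in distance_matrix(sequences):
    print(" ".join("%6.3f" % value for value in row))
```

Passing `corrected=True` applies the Jukes-Cantor correction to every entry; a pair whose observed proportion reaches 0.75 is then reported as an infinite distance.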

Other models • Jukes-Cantor • The simplest possible model • Kimura • 2 parameters • Differentiates between transitions and transversions. • F84, HKY • 5 parameters • Allows arbitrary base frequencies. • Tamura-Nei • 6 parameters • Combination of F84 and HKY. • General time-reversible model • 12 parameters • Only assumes time reversibility: π_x Pr(x→y) = π_y Pr(y→x), where π_x is the equilibrium frequency of base x.

Distance methods • Fitch-Margoliash • Neighbor-joining • UPGMA • Multiple sequence alignment → Pairwise distance matrix → Phylogenetic tree

UPGMA • Unweighted pair group method with arithmetic mean. • A form of agglomerative hierarchical clustering (average linkage). • Basic idea: iteratively connect the two most closely related sequences or groups of sequences.

UPGMA

UPGMA • Find the smallest off-diagonal element in the matrix.

UPGMA • Merge the two closest entries: replace their two rows and two columns with a single row and column holding the averages of the corresponding distances.

UPGMA

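UPGMA • Each merger creates a subtree. [Figure: subtree joining Smik and Sbay]

Perform the next merger [Figure: subtree joining Smik and Sbay]

[Figure: subtree joining Smik and Sbay]

[Figure: tree diagram over Smik, Sbay, Skud and Scer]

What is next? [Figure: tree diagram over Skud, Scer, Smik and Sbay]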
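The merging procedure walked through on the UPGMA slides can be sketched as a loop. Assumptions of this sketch: the distance values are hypothetical (the matrix from the slides is not reproduced here), the function name `upgma` and the nested-tuple tree representation are illustrative choices, and branch lengths are not computed. The averaging step weights each cluster by its number of leaves, which is the usual definition of UPGMA; when the two merged clusters have the same size this equals the plain average of their rows.

```python
def upgma(names, matrix):
    """Cluster leaves by repeatedly merging the closest pair of clusters.

    names  -- list of leaf names
    matrix -- square list of lists of pairwise distances, in the same order
    Returns the tree as nested 2-tuples of leaf names.
    """
    # Active clusters: id -> (subtree, number of leaves in it)
    clusters = dict((i, (name, 1)) for i, name in enumerate(names))
    # Distances between active clusters, keyed by (smaller id, larger id)
    dist = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            dist[(i, j)] = matrix[i][j]
    next_id = len(names)

    while len(clusters) > 1:
        # Step 1: find the smallest off-diagonal element (ties: lowest ids).
        i, j = min(sorted(dist), key=dist.get)
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Step 2: distance from the merged cluster to each remaining cluster
        # is the size-weighted average of the two old distances.
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / float(size_i + size_j)
        del dist[(i, j)]
        # Step 3: the merger becomes a new subtree.
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1

    tree, _size = list(clusters.values())[0]
    return tree


# Hypothetical distances, chosen only so that Sbay and Smik merge first.
names = ["Scer", "Skud", "Sbay", "Smik"]
matrix = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.5, 0.6],
    [0.5, 0.5, 0.0, 0.2],
    [0.6, 0.6, 0.2, 0.0],
]

print(upgma(names, matrix))   # prints: (('Sbay', 'Smik'), ('Scer', 'Skud'))
```

With these values the first merger joins Sbay and Smik, the second joins Scer and Skud, and the last joins the two subtrees.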

Formatting with % • Insert % between a string and a tuple to get formatted output. • Use %s for strings, %d for integers, and %f or %g for floats. • Use %f for a fixed number of decimal places, %e for exponent, %g for either. • %g rounds to specified number of digits of precision • %g uses either fixed or exponential notation, depending on the value • Use leading numbers to specify width. • Replace with * to provide width as an input. Full details at http://docs.python.org/2/library/string.html
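The rules above can be tried out in a few lines. The values are made up for illustration; the expected output is shown in the comments.

```python
name = "ce1cg"
length = 77
freq = 0.168831

# A string, then %, then a tuple of values.
print("%s has %d letters" % (name, length))   # ce1cg has 77 letters

# Fixed number of decimal places, exponent notation, or whichever fits.
print("%.2f" % freq)         # 0.17
print("%e" % 1234.5)         # 1.234500e+03
print("%g" % 1234.5)         # 1234.5
print("%g" % 0.00001234)     # 1.234e-05
print("%.3g" % 3.14159)      # 3.14

# A leading number sets the field width; a minus sign left-justifies.
print("[%8s]" % name)        # [   ce1cg]
print("[%-8s]" % name)       # [ce1cg   ]

# A * takes the width from the tuple, so it can be computed at run time.
print("[%*d]" % (6, length))  # [    77]
```

The `*` form is what makes it possible to size a column to the longest value before printing.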

Problem #1 • Write a program that reads sequences from a given file and prints, in aligned columns, the sequence ID, length and frequency of each letter. You may assume that each sequence is no more than 100,000 characters. • Version 1: Use the alphabet ACGT and a fixed width for the sequence ID. • Version 2: Adjust the field width of the sequence ID based on the longest sequence ID. • Version 3: Use the alphabet of the given sequences. Print fields in alphabetical order. • Version 4: Add a header line to your output file.
> ./compute-seq-stats.py sample-dna.txt
Read 11 sequences from sample-dna.txt.
ce1cg   77 A=0.17 C=0.12 G=0.31 T=0.40
ara     87 A=0.34 C=0.23 G=0.18 T=0.24
bglr1   61 A=0.41 C=0.13 G=0.07 T=0.39
crp    105 A=0.35 C=0.20 G=0.22 T=0.23
cya     72 A=0.24 C=0.19 G=0.21 T=0.36
deop2  102 A=0.29 C=0.11 G=0.25 T=0.34
gale    73 A=0.30 C=0.23 G=0.12 T=0.34
ilv    105 A=0.22 C=0.26 G=0.17 T=0.35
lac     86 A=0.22 C=0.22 G=0.22 T=0.34
male    54 A=0.31 C=0.24 G=0.28 T=0.17
malk    65 A=0.26 C=0.15 G=0.37 T=0.22

> ./compute-seq-stats-4.py ribosomal.txt
Read 13 sequences from ribosomal.txt.
Longest sequence ID = 32.
20 letters in alphabet.
Alphabet=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'].
Sequence                         Len     A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
gi|457875803|ref|XP_004224433.1| 108 0.111 0.009 0.009 0.065 0.009 0.028 0.019 0.074 0.194 0.083 0.019 0.046 0.028 0.046 0.028 0.093 0.037 0.065 0.009 0.028
gi|351065825|emb|CCD61804.1      117 0.077 0.009 0.043 0.051 0.009 0.085 0.026 0.026 0.205 0.077 0.017 0.017 0.051 0.026 0.034 0.051 0.043 0.111 0.009 0.034
gi|459660330|gb|EMH75739.1       137 0.146 0.015 0.044 0.044 0.007 0.051 0.015 0.044 0.234 0.066 0.022 0.022 0.066 0.007 0.015 0.058 0.051 0.073 0.015 0.007
gi|449802221|pdb|3ZEY|U          113 0.097 0.018 0.035 0.035 0.018 0.071 0.009 0.044 0.186 0.080 0.044 0.027 0.044 0.035 0.062 0.062 0.053 0.053 0.009 0.018
gi|198419437|ref|XP_002130703.1  112 0.062 0.000 0.027 0.045 0.009 0.071 0.009 0.062 0.179 0.098 0.009 0.036 0.045 0.062 0.054 0.080 0.054 0.054 0.009 0.036
gi|17542024|ref|NP_500895.1      117 0.077 0.009 0.043 0.051 0.009 0.085 0.026 0.026 0.205 0.077 0.017 0.017 0.051 0.026 0.034 0.051 0.043 0.111 0.009 0.034
gi|187129228|ref|NP_001119663.1  116 0.034 0.009 0.043 0.052 0.009 0.078 0.017 0.034 0.216 0.095 0.009 0.017 0.043 0.069 0.043 0.078 0.043 0.078 0.009 0.026
gi|359807542|ref|NP_001241406.1  108 0.102 0.000 0.037 0.028 0.009 0.056 0.009 0.056 0.167 0.074 0.028 0.037 0.065 0.056 0.065 0.102 0.046 0.028 0.009 0.028
gi|351725913|ref|NP_001236341.1  108 0.093 0.000 0.037 0.028 0.009 0.065 0.009 0.056 0.167 0.074 0.037 0.037 0.065 0.046 0.065 0.102 0.046 0.028 0.009 0.028
gi|52346074|ref|NP_001005084.1   125 0.088 0.008 0.072 0.040 0.008 0.072 0.008 0.032 0.216 0.096 0.008 0.048 0.048 0.016 0.048 0.056 0.040 0.064 0.008 0.024
gi|41387126|ref|NP_957109.1      124 0.089 0.000 0.065 0.048 0.008 0.065 0.008 0.032 0.218 0.097 0.008 0.040 0.048 0.024 0.048 0.056 0.040 0.065 0.008 0.032
gi|6323365|ref|NP_013437.1       108 0.139 0.000 0.037 0.046 0.000 0.046 0.028 0.074 0.167 0.083 0.019 0.000 0.037 0.046 0.065 0.093 0.019 0.056 0.009 0.037
gi|6321464|ref|NP_011541.1       108 0.130 0.000 0.037 0.046 0.000 0.046 0.028 0.074 0.167 0.083 0.019 0.000 0.037 0.046 0.065 0.093 0.028 0.056 0.009 0.037
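One way to produce output of this shape is sketched below. It is not the course's reference solution. It works on an in-memory list of (ID, sequence) pairs because the input file format is not described on the slides; the function names and the two records are hypothetical, and sequences are assumed to be non-empty.

```python
def letter_frequencies(sequence, alphabet):
    """Frequency of each alphabet letter in a non-empty sequence."""
    length = float(len(sequence))
    return [sequence.count(letter) / length for letter in alphabet]


def format_stats(records):
    """Return aligned output lines for a list of (sequence_id, sequence) pairs."""
    # Alphabet of the given sequences, in alphabetical order.
    alphabet = sorted(set("".join(seq for _seq_id, seq in records)))
    # ID column is as wide as the longest ID (or the header word, if longer).
    id_width = max(len(seq_id) for seq_id, _seq in records)
    id_width = max(id_width, len("Sequence"))

    header = "%-*s %5s" % (id_width, "Sequence", "Len")
    header += "".join(" %5s" % letter for letter in alphabet)
    lines = [header]
    for seq_id, seq in records:
        line = "%-*s %5d" % (id_width, seq_id, len(seq))
        line += "".join(" %5.3f" % f for f in letter_frequencies(seq, alphabet))
        lines.append(line)
    return lines


# Made-up records, for illustration only.
records = [("seq1", "ACGTAC"), ("a_longer_id", "GGGTTA")]
for line in format_stats(records):
    print(line)
```

Using `%-*s` with a computed width keeps the numeric columns aligned however long the IDs are.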
