Phylogenetic Estimation using Maximum Likelihood

By: Jimin Zhu, Xin Gong, Sravanti Polsani, Rama Sharma, Shlomit Klopman

The Scope of the Presentation
• Introduction
• Maximum Likelihood and Coin Tossing
• The Phylogenetic Tree
• Maximum Likelihood and DNA Substitution
• Advantages and Disadvantages of Maximum Likelihood

Introduction
• Phylogeny: the study of evolutionary relationships between life forms
• Phylogenetics is part of the field of taxonomy and systematics
• Phylogenetics received a huge push forward thanks to modern computers
• Various phylogenetic methods are used to explain the evolutionary process, and they often give contradictory results!

Introduction (cont.)
• Scientists agree that a correct species lineage should be determined using statistics
• Maximum Likelihood is the method of choice for establishing the most realistic phylogenetic tree for a given data set
• The Maximum Likelihood method was introduced in 1922 by R. A. Fisher, an English statistician

Maximum Likelihood in a Nutshell
The method depends on:
• A complete data set
• A probabilistic model that describes the data
• An explicit expression for the likelihood function
The likelihood of a data set is the probability of obtaining it, given the chosen probability distribution model.
We seek the values of the parameters that maximize the sample likelihood.

Maximum Likelihood Approach using a Coin Tossing Experiment
• Find the parameter value(s) that make the observed data most likely, i.e. choose the parameter value that maximizes the probability of observing the data.
• Probability: knowing the parameters → prediction of the outcome
• Likelihood: observation of the data → estimation of the parameters
• Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.

Simple Coin Tossing Experiment
Binomial probability distribution: the probability of observing h heads out of n tosses can be described as
$$\Pr[h \mid p, n] = \frac{n!}{h!\,(n-h)!}\, p^{h} (1-p)^{n-h}$$
where p is the probability of heads and (1 - p) is the probability of tails.
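As a minimal sketch, the binomial probability on this slide can be evaluated directly in Python. The function name `binomial_probability` is illustrative and does not come from the slides.

```python
import math


def binomial_probability(h, n, p):
    """Probability of exactly h heads in n tosses when P(heads) = p."""
    return math.comb(n, h) * p**h * (1 - p) ** (n - h)


# Example: 4 heads in 10 tosses of a fair coin (p = 0.5).
fair_coin_probability = binomial_probability(4, 10, 0.5)
print(fair_coin_probability)
```

Here `math.comb(n, h)` supplies the factor n! / (h! (n-h)!), so no explicit factorials are needed.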

Simple Coin Tossing Experiment
Suppose we tossed a coin 10 times and got 4 heads and 6 tails. The probability of that outcome is
$$P(4\ \text{heads},\ 6\ \text{tails}) = \frac{10!}{4!\,6!}\, p^{4} (1-p)^{6}$$
The whole notion of maximum likelihood estimation is that we choose the p that makes the probability of getting our set of observations the largest possible, i.e. the p that maximizes p^4 (1 - p)^6. The factor 10! / (4! 6!) does not depend on p, so our likelihood function is
$$L(p) = p^{4} (1-p)^{6}$$

Two ways to find the MLE
1. Take the first derivative of the likelihood function with respect to each parameter, set the resulting equations equal to 0, and solve for the parameter estimates.
Taking the log of the likelihood for h heads in n tosses:
$$\log L(p) = h \log p + (n-h) \log(1-p)$$
Taking the first derivative with respect to p and setting it to 0:
$$\frac{h}{p} - \frac{n-h}{1-p} = 0$$
Solving for p, we get p = h / n. This value maximizes the likelihood function and is the MLE.
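A short check, for h heads in n tosses, that the stationary point is a maximum and not a minimum. It starts from the binomial log-likelihood with the constant term dropped.

```latex
\begin{aligned}
\log L(p) &= h \log p + (n-h)\log(1-p) \\
\frac{d}{dp}\log L(p) &= \frac{h}{p} - \frac{n-h}{1-p} = 0
  \;\Longrightarrow\; h(1-p) = (n-h)\,p
  \;\Longrightarrow\; \hat{p} = \frac{h}{n} \\
\frac{d^{2}}{dp^{2}}\log L(p) &= -\frac{h}{p^{2}} - \frac{n-h}{(1-p)^{2}} < 0
  \quad \text{for } 0 < p < 1
\end{aligned}
```

Since the second derivative is negative everywhere on (0, 1), the log-likelihood is concave there and p = h / n is its unique maximum.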

2. Find the maximum using numeric search procedures
Plug different values of p into the probability model and calculate the likelihood. Let's take a sample with n = 100 and h = 56. Imagine that p was 0.5. Plugging this value into our probability model:
$$L(p = 0.5 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.5^{56}\, 0.5^{44} = 0.0389$$
But what if p was 0.52 instead?
$$L(p = 0.52 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.52^{56}\, 0.48^{44} = 0.0581$$

So from this we can conclude that p is more likely to be 0.52 than 0.5. We can tabulate the likelihood for different parameter values to find the maximum likelihood estimate of p:
p       L
------  ------
0.48    0.0222
0.50    0.0389
0.52    0.0581
0.54    0.0739
0.56    0.0801
0.58    0.0738
0.60    0.0576
0.62    0.0378

The maximum likelihood estimate for p appears to be 0.56, which equals the observed proportion of heads, h / n = 56 / 100.
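The numeric search can be sketched in a few lines of Python. This is only an illustration of the idea: the grid and the helper name `likelihood` are choices made here, and a real search would use a finer grid or an optimizer.

```python
import math


def likelihood(p, h=56, n=100):
    """Binomial likelihood of h heads in n tosses for a candidate p."""
    return math.comb(n, h) * p**h * (1 - p) ** (n - h)


# Candidate values 0.48, 0.50, ..., 0.62, the same grid as the tabulated values.
grid = [round(0.48 + 0.02 * i, 2) for i in range(8)]
table = {p: likelihood(p) for p in grid}
best_p = max(table, key=table.get)

for p, value in table.items():
    print(f"{p:.2f}  {value:.4f}")
print("grid maximum at p =", best_p)
```

A grid search can only return one of the candidate values, so the step size limits how precisely the maximum is located.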

MLE: Sample Graphs (using Mathematica)

Simple Coin Tossing Experiment
• The best estimate for p from any one sample is clearly going to be the proportion of heads observed in that sample.
• For an example this simple, the MLE approach is more machinery than is needed to estimate p.
• But not all problems are this simple! The more complex the model and the greater the number of parameters, the harder it becomes to make even reasonable guesses at the MLEs.

Phylogenetic Tree
A phylogenetic tree is a data structure characterized by:
• its topology (form)
• its branch lengths
It stores information regarding the relationships among several species or sequences.

Types of Phylogenetic Trees
• Rooted tree: an ancestral state is assumed; "d" is the root species.
• Unrooted tree: no implicit "directionality"; it is only a measure of similarity between species.
[Figure: a rooted and an unrooted tree over species a, b, c, d, with leaves, branches and root labeled]

Molecular phylogenetic methods use a given set of aligned sequences to construct a phylogenetic tree:
(1) A G G C U C C A A
(2) A G G U U C G A A
(3) A G C C C A G A A
(4) A U U U C G G A A
[Figure: tree with leaves sequence 1, sequence 2, sequence 3, sequence 4]
There are several ways to construct phylogenetic trees. The Maximum Likelihood method picks the tree under which the observed sequences are most probable and takes it as the best estimate of the true tree.

The Maximum Likelihood Approach
1. Assumes that all sites (alignment columns 1, 2, …, j, …, N) are independent of one another.
(1) A G G C T C C A A….A
(2) A G G T T C G A A.…A
(3) A G C C C A G A A....A
(4) A T T T C G G A A....C

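The Maximum Likelihood Approach (cont.)
2. The log-likelihood is computed for a given topology by using a particular probability model (binomial, multinomial, Poisson, …).
a) Take site j on the given topology, e.g. the leaves G, C, A, C joined through the internal nodes x and y.
b) The likelihood of site j is a sum of probabilities over the possible states of the internal nodes:
L(j) = Prob[tree with one choice of x, y] + … + Prob[tree with another choice of x, y]
c) The log-likelihood of the tree is the sum over all N sites:
$$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(j) + \cdots + \ln L(N) = \sum_{i=1}^{N} \ln L(i)$$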
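The per-site sum and the sum of logs can be sketched for a four-leaf tree as follows. Everything specific here is an assumption made for illustration: the JC69 substitution probabilities, the branch lengths of 0.1 expected substitutions per site, the tree shape ((1,2)x,(3,4)y), and the use of only the first nine alignment columns. The internal states are enumerated by brute force, which is fine for four leaves but is not how production software does it.

```python
import itertools
import math

BASES = "ACGT"


def jc69_prob(a, b, d):
    """JC69 probability that base a is replaced by base b along a branch
    of length d (expected substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(tips, branch_lengths):
    """Likelihood of one alignment column on the tree ((1,2)x,(3,4)y).

    tips: the four observed bases (s1, s2, s3, s4).
    branch_lengths: (t1, t2, t3, t4, t_xy), hypothetical values.
    """
    s1, s2, s3, s4 = tips
    t1, t2, t3, t4, t_xy = branch_lengths
    total = 0.0
    # Sum over every assignment of bases to the internal nodes x and y.
    for x, y in itertools.product(BASES, repeat=2):
        total += (
            0.25  # equal base frequencies at node x
            * jc69_prob(x, s1, t1)
            * jc69_prob(x, s2, t2)
            * jc69_prob(x, y, t_xy)
            * jc69_prob(y, s3, t3)
            * jc69_prob(y, s4, t4)
        )
    return total


def log_likelihood(alignment, branch_lengths):
    """Sum of per-site log-likelihoods, using independence of sites."""
    columns = zip(*alignment)
    return sum(math.log(site_likelihood(col, branch_lengths)) for col in columns)


alignment = ["AGGCTCCAA", "AGGTTCGAA", "AGCCCAGAA", "ATTTCGGAA"]
branches = (0.1, 0.1, 0.1, 0.1, 0.1)
total_log_likelihood = log_likelihood(alignment, branches)
print(total_log_likelihood)
```

Repeating this for each candidate topology (and, in practice, optimizing the branch lengths for each) gives the values that are compared in the next step.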

The Maximum Likelihood Approach (cont.)
3. After the procedure is done for all possible topologies, the topology that shows the highest likelihood is chosen as the true (realistic) tree.
How many topologies do we have to go through for n sequences?
#Rooted trees = [formula]
#Unrooted trees = [formula]
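For reference, a minimal sketch using the standard counts for strictly bifurcating trees with n labeled tips: (2n - 3)!! rooted and (2n - 5)!! unrooted topologies. These textbook formulas are assumed here and are not taken from the slide; the function names are illustrative.

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for odd k (1 when k <= 1)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def count_unrooted_trees(n):
    """Number of unrooted bifurcating topologies for n >= 3 labeled tips."""
    if n < 3:
        raise ValueError("need at least 3 sequences for an unrooted tree")
    return odd_double_factorial(2 * n - 5)


def count_rooted_trees(n):
    """Number of rooted bifurcating topologies for n >= 2 labeled tips."""
    if n < 2:
        raise ValueError("need at least 2 sequences for a rooted tree")
    return odd_double_factorial(2 * n - 3)


for n in (4, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

The counts grow faster than exponentially, which is why an exhaustive search over topologies is only feasible for small n.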

The Maximum Likelihood Approach (cont.)

The Maximum Likelihood Approach (cont.)
• The result is consistent.
• But it is time consuming!

DNA – THE BASIS OF MOLECULAR PHYLOGENETICS
The DNA molecule (polymer) is made of monomer units called nucleotides. Each nucleotide consists of:
• a 5-carbon sugar
• a phosphate group
• a nitrogen base

There are two groups of nitrogen bases:
• Purines
• Pyrimidines

There are 4 different types of nucleotides in DNA, differing only in the nitrogen base. Each is given a one-letter abbreviation (the first letter of the base's name):
• A: Adenine
• G: Guanine
• C: Cytosine
• T: Thymine

Purines are the larger molecules of the two groups
• Adenine and Guanine belong to the purine group

Pyrimidines are the smaller molecules of the two groups
• Cytosine and Thymine belong to the pyrimidine group

The DNA backbone is a polymer with an alternating sugar-phosphate sequence

Adenine forms 2 hydrogen bonds with Thymine on the opposite strand
• This is a fixed pairing

Guanine forms 3 hydrogen bonds with Cytosine
• This is also a fixed pairing

Changes in DNA sequences occur through mutations
There are two kinds of mutations between nucleotides:
• Transition
• Transversion

Transition
• A mutation between two nucleotides from the same nitrogen base group
• Purine transition: G ↔ A
• Pyrimidine transition: C ↔ T
Transversion
• A mutation between any two nucleotides belonging to different groups: Purines ↔ Pyrimidines
• T ↔ A
• C ↔ G
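The two definitions translate into a small classifier. This is a sketch; the function name `classify_substitution` is made up for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(before, after):
    """Label a single-nucleotide change as a transition or a transversion."""
    pair = {before, after}
    if not pair <= PURINES | PYRIMIDINES:
        raise ValueError("bases must be one of A, C, G, T")
    if before == after:
        raise ValueError("no substitution: the two bases are identical")
    # Same group (both purines or both pyrimidines) means a transition.
    same_group = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_group else "transversion"


for change in [("G", "A"), ("C", "T"), ("T", "A"), ("C", "G")]:
    print(change, classify_substitution(*change))
```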

Two basic elements of DNA substitution
• π: the composition
• r: the process

π: Composition. The composition is just the proportion of the four nucleotides, e.g. π = [0.1, 0.4, 0.2, 0.3]; the entries of π sum to 1.
r: The process. It can be described by a matrix of numbers describing how the nucleotides change from one to another.
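As a small illustration, the composition of a sequence can be computed by counting bases. The helper name `base_composition` and the A, C, G, T ordering of the result are assumptions made here; the example sequence is Sequence A from the pairwise example later in the slides.

```python
from collections import Counter


def base_composition(sequence, alphabet="ACGT"):
    """Proportion of each base in the sequence, in alphabet order."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in alphabet)
    if total == 0:
        raise ValueError("sequence contains no bases from the alphabet")
    return [counts[base] / total for base in alphabet]


composition = base_composition("CCGGCCGCGCG")
print(composition)
```

The proportions always sum to 1, as required of a composition vector.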

DNA substitution can be described by a time-homogeneous Poisson process

DNA substitution model
     A        C        G        T
A    .        πC r2    πG r4    πT r6
C    πA r1    .        πG r8    πT r10
G    πA r3    πC r7    .        πT r12
T    πA r5    πC r9    πG r11   .
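A sketch of how such a matrix could be assembled in code. It assumes that rows are the original nucleotide and columns the replacement, that each off-diagonal entry is the replacement base's frequency times one of the rates r1 to r12, and that the diagonal (shown as "." on the slide) is set so that each row sums to zero, as is usual for a rate matrix. The names `RATE_INDEX` and `build_rate_matrix` are illustrative.

```python
BASES = "ACGT"

# Which rate r_k applies to each (original, replacement) pair,
# following the layout of the slide's matrix (assumed row = original).
RATE_INDEX = {
    ("A", "C"): 2, ("A", "G"): 4, ("A", "T"): 6,
    ("C", "A"): 1, ("C", "G"): 8, ("C", "T"): 10,
    ("G", "A"): 3, ("G", "C"): 7, ("G", "T"): 12,
    ("T", "A"): 5, ("T", "C"): 9, ("T", "G"): 11,
}


def build_rate_matrix(frequencies, rates):
    """Build the 4x4 rate matrix as nested dicts.

    frequencies: dict mapping base -> frequency (sums to 1).
    rates: dict mapping k -> r_k for k = 1..12.
    """
    matrix = {}
    for source in BASES:
        row = {
            target: frequencies[target] * rates[RATE_INDEX[(source, target)]]
            for target in BASES
            if target != source
        }
        # Diagonal entry: minus the total rate of leaving this base.
        row[source] = -sum(row.values())
        matrix[source] = row
    return matrix


# Simplest case: equal frequencies and equal rates.
equal_frequencies = {base: 0.25 for base in BASES}
equal_rates = {k: 1.0 for k in range(1, 13)}
q = build_rate_matrix(equal_frequencies, equal_rates)
print(q["A"])
```

Setting all twelve rates equal and all frequencies to 1/4 collapses the general matrix to the simplest model; other models correspond to other constraints on the same inputs.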

The Likelihood of two DNA sequences
The JC69 model assumes:
• π = [¼, ¼, ¼, ¼]
• a rate of change that is equal for all nucleotides
Notation:
• n1: the number of sites that remain the same
• n2: the number of sites that change
• t: the distance from node A to node B

Sequence A: CCGGCCGCGCG
Sequence B: CGGGCCGGCCG
Length = 11; n1 = 8; n2 = 3; rate of change = 0.007
Similarity between A and B is n1/(n1 + n2) = 73%
From the following plot we find that the maximum likelihood is 1.4E-14, at a distance of 17
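A sketch of the pairwise calculation under one common JC69 parametrization, which is an assumption here because the slides do not show the formula: the rate is taken as the rate of change to each specific other nucleotide, so a site stays the same with probability 1/4 + 3/4 exp(-4 · rate · t) and changes to one particular other base with probability 1/4 - 1/4 exp(-4 · rate · t). Function names are illustrative, and absolute likelihood values depend on constant factors, so they need not coincide with the value read off the plot.

```python
import math


def count_sites(seq_a, seq_b):
    """Return (n1, n2): the numbers of identical and differing sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    same = sum(a == b for a, b in zip(seq_a, seq_b))
    return same, len(seq_a) - same


def jc69_log_likelihood(n_same, n_diff, rate, t):
    """Log-likelihood of the pair at distance t (assumed parametrization)."""
    decay = math.exp(-4.0 * rate * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return (
        (n_same + n_diff) * math.log(0.25)  # base frequency at each site
        + n_same * math.log(p_same)
        + n_diff * math.log(p_diff)
    )


def jc69_ml_distance(n_same, n_diff, rate):
    """Closed-form distance that maximizes the likelihood above."""
    p = n_diff / (n_same + n_diff)
    if p >= 0.75:
        raise ValueError("sequences too divergent for a finite estimate")
    return -math.log(1.0 - 4.0 * p / 3.0) / (4.0 * rate)


n_same, n_diff = count_sites("CCGGCCGCGCG", "CGGGCCGGCCG")
t_hat = jc69_ml_distance(n_same, n_diff, rate=0.007)
print(n_same, n_diff, t_hat)
```

With these inputs the estimate comes out at roughly 16, in the neighbourhood of the distance read from the plot. Evaluating `jc69_log_likelihood` over a range of t values traces out the kind of curve the plot shows.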

High similarity vs. low similarity
Higher similarity, shorter distance

Long sequences vs. short sequences
Longer input sequences produce a sharper curve

Big vs. small rate of change
A slower rate of change gives a longer distance

Multiple DNA sequences as input
PAUP* is designed for the reconstruction of phylogenetic trees based on nucleic acid alignments. It is available at http://www.sinauser.com

Example output from PAUP*

DNA Substitution Models
• All models are special cases of the general model
• The unknown parameters are:
  • Nucleotide frequency
  • Rate of change (mutation)
• Simplest model: equal mutation rates and equal nucleotide frequencies
• Other models assume unequal nucleotide frequencies and/or different mutation rates

Likelihood & Phylogenetics
The Maximum Likelihood method helps us:
• Determine the most probable tree for a set of DNA sequences
• Determine the best DNA substitution model to describe our data

Advantages of the Maximum Likelihood Method
• The method can be used in a wide range of estimation problems, and produces consistent results
• When the data set is large, the parameter estimates have a very small variance and come very close to the true values
• This allows us to draw conclusions about the evolutionary process

Disadvantages of the Maximum Likelihood Method
• The likelihood equations need to be worked out for a given distribution, and they are usually very complicated
• Fortunately, Maximum Likelihood software is becoming common
• Maximum Likelihood estimates can be very biased for small samples
