Genome Research 2004 By Peter D. Keightly & Toby Johnson

MCALIGN: Stochastic Alignment of Non-coding DNA Sequences based on an Evolutionary Model of Sequence Evolution Genome Research 2004 By Peter D. Keightly & Toby Johnson Majid Kazemian

Motivations • The genomes of higher eukaryotes contain large amounts of non-coding DNA seq. (intergenic DNA & introns) • Seq. alignment is the major issue for the evolutionary analysis of noncoding DNA • Aligning non-coding DNA is more difficult than aligning protein-coding DNA Seqs • Where indels almost occur in multiples of three bp and rarely cross codon boundaries

Which alignment is correct? • Alignments with few gaps tend to have many differences • Alignments with too many fragmented gaps tend to have too few nucleotide differences Alignment 1: TTATA - - - - CAG TTAGCTAAGCCG Alignment 2: TTA - - TA - - CAG TTAGCTAAGCCG

Why a model based alignment method? • Heuristic methods produce alignments by optimizing a parametric scoring function, whose values are chosen in a more or less arbitrary fashion • How the penalties of substitution and indels events relate to DNA sequence evolution is unclear • So, Inferences based on such alignments, such as estimates of sequence divergence or conservation are almost biased. • Therefore Explicit model based approaches are desirable

Features of MCALIGN • It assumes a model that allows an arbitrary distribution of indel lengths. • The distribution of indel lengths is derived empirically from additional data • It uses stochastic hill-climbing algorithm to search for more probable alignment • It is intended for global alignment • Simulations vs. real data for examining the statistical properties • It uses Jukes-Cantor model, the simplest model of nucleotide substitution.

Statistical framework • Let a = variable describing the alignment • t = parameter of sequence evolution over time • S = observed sequence data • Unconditional = constant • In both equation we need to compute Inference about a alone when t is a “nuisance parameter”

Probability model of sequence evolution • For two sequences there is a single parameter of sequence evolution t=(t12) where t12 is time of evolution between two sequences • For three sequences a second parameter is added that t vector is t=(t12,t(12)3) is Probability of alignment (indel pattern) is Probability of observed sequence given this indel pattern

Assumptions • In the common ancestor the two sequences were identical to the ancestral sequence • There were no indels in the alignment • Insertions and deletions occurred independently at a rate ө • Probability of an indel өt12 is per interbase site • The proportion of an indel of length i is wi ,such that • An alignment α is characterized by gi gaps of length i and m sites at which indels could have occurred but did not (non-indels)

Probability of a given alignement α with gi gaps of length i and m non-indels is given by • The parameters ө and wi are treated as known (they must be estimated from external data) Total # indels

- l n = - 1 1 1 n u Pr( S | a , t ) ( k ) [ ( 1 k )] ( ) 1 12 12 2 4 4 • To derive Jukes-Cantor model of nucleotide substitution was used .n = # of nucleotides differences, l = # of nucleotides not aligned to a gap, u = # of nucleotides aligned to a gap where

Jukes and Cantor, 1969 • The simplest substitution model • Markov chain with four states: a,c,g,t • Transition matrix P given by:

Jukes and Cantor, 1969 (cont.) • As a function of time n, we get Pr(x -> y) = 0.25 + 0.75 (1-4)n if x = y and Pr(x -> y) = 0.25 - 0.25 (1-4)n otherwise

Alignment Algorithm • Start from an initial alignment a1 • A heuristic method similar to “divide-and-conquer” algorithm • Generate a2 from a1 by randomly selecting one of the following transformation: • Add gap pair in random sites • Remove random gap pair or parts thereof • Move gap within sequence • Split gap within sequence • Merge a pair of adjacent gaps

Alignment Algorithm (cont.) • Accept a2 with the following probability: • Store the alignment with max. probability • Reset alignment if Pr(ai|S) < 0.01 Pr(amax|S) for more than 100 iterations • Stop after preset iterations from amaxwithout increasing amax

Three-way alignment • Approximation for three-way alignment • where maximizes Pr(a,t|s) and C(S) is a constant. • Adjust transformation appropriately

Parameter estimations • Drosophila sps. data was used to estimate q and t • Seqs. Of length ~ 6300 bases found to have 193 substitutions and 44 indels and 6328 non-indels • This gives a nucleotide difference of t= 0.0306 and therefore q = 0.225 • Proportions of indels ( wx ) for 1-bp(0.455) and 2-bp(0.182) indels adopted from data. • Indels in range 3-40 assigned w from • where b is a constant and a is estimated to be 1.167.

Results • Fraction of correctly aligned bases across all methods

Conclusion & Discussion • aligns non coding regions better than heuristic alignments • returns estimates of sequence divergence and nucleotide substitution in addition to most probable alignment. • Based on Jukes-Cantor model of nucleotide substitution • Parameters q and t derived from training data limited to that genera • Search space can have unreachable states. • Execution time increases non-linearly with seq. length and as a function of t.

Thanks for your attention

Genome Research 2004 By Peter D. Keightly & Toby Johnson