Sequence Alignment

Sequence Alignment - III Chitta Baral

Scoring Model
• When comparing sequences, we look for evidence that they have diverged from a common ancestor by a process of mutation and selection.
• Basic mutational processes:
  • substitutions;
  • insertions and deletions (together referred to as gaps).
• Total score: the sum of a term for each aligned pair plus a term for each gap.
• The total score corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated.
• Identities and conservative substitutions are expected to be more likely in real alignments than by chance: they contribute positive score terms.
• Non-conservative changes are observed less frequently in real alignments than expected by chance: they contribute negative score terms.
• Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently.
  • Reasonable for DNA and protein sequences.
  • Inaccurate for structural RNAs.
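The additive scheme can be sketched in a few lines of Python. This is only an illustration: the names `score_alignment` and `toy_pair_score`, the +1/-1 pair scores, and the per-position gap cost of 2 are assumptions made for the example, not values from the slides.

```python
def score_alignment(a, b, pair_score, gap_penalty):
    """Add up one term per column of a gapped alignment ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            total -= gap_penalty          # term for a gap position
        else:
            total += pair_score(ca, cb)   # term for an aligned pair
    return total


def toy_pair_score(ca, cb):
    """Hypothetical scores: +1 for an identity, -1 for any substitution."""
    return 1 if ca == cb else -1


example_score = score_alignment("GATTACA", "GA-TTCA", toy_pair_score, gap_penalty=2)
```

Because each column contributes independently, the total is a plain sum; this is exactly the independence assumption stated above.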

Substitution Matrices
• Notation: a pair of sequences x[1..n] and y[1..m].
• Let $x_i$ be the ith symbol in x and $y_j$ the jth symbol in y.
• Let $p_{x_i y_i}$ be the probability of seeing $x_i$ aligned with $y_i$ when the sequences are related.
• Let $q_{x_i}$ be the probability that we have $x_i$ by chance, i.e. the frequency of occurrence of $x_i$.
• Score: $\log \dfrac{P(x, y \mid \text{related})}{P(x, y \mid \text{unrelated})}$
• For an ungapped alignment that pairs $x_i$ with $y_i$:
  • $P(x, y \mid \text{related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$
  • $P(x, y \mid \text{unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$
• Odds ratio: $\dfrac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \dfrac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$
• Log-odds ratio: $s(x_1, y_1) + s(x_2, y_2) + \cdots$, where $s(a, b) = \log \dfrac{p_{ab}}{q_a q_b}$.
• The table of $s(a, b)$ values is known as the score matrix or substitution matrix.
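As a rough illustration of the log-odds construction, the sketch below builds a score table from pair probabilities and background frequencies, then sums it over an ungapped alignment. The two-letter alphabet and all probability values are invented for the example; `log_odds_matrix` and `ungapped_score` are hypothetical helper names.

```python
import math


def log_odds_matrix(p, q):
    """s(a, b) = log(p_ab / (q_a * q_b)) for every pair (a, b) in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}


def ungapped_score(x, y, s):
    """Sum s(x_i, y_i) over an ungapped alignment of equal-length strings."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))


# Toy numbers (assumed): background frequencies q and aligned-pair probabilities p.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

s = log_odds_matrix(p, q)
score = ungapped_score("AB", "AA", s)  # s(A,A) + s(B,A)
```

With these numbers identities score positively (0.4 > 0.25) and substitutions negatively (0.1 < 0.25), and the summed score equals the logarithm of the product of the per-position odds ratios.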

Gap Penalties
• Also based on a probabilistic model of alignment, though this is less widely recognized than the probabilistic basis of substitution matrices.
• Consider a gap of length g due to the insertion of $a_1 \ldots a_g$:
  • $P(\text{gap because of mutation}) = f(g)\, q_{a_1} \cdots q_{a_g}$
  • $P(a_1 \ldots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$
  • Ratio $= f(g)$
  • Log of ratio $= \log f(g)$
• Geometric distribution: $f(g) = k \exp(-x g)$
• Linear score: suppose $f(g) = \exp(-g d)$; then the log of the ratio is $-g d$.
• Affine score: suppose $f(g) = k \exp(-g e)$, where e is a gap parameter (not the base of the exponential); then
  $\log f(g) = -g e + \log k = -g e + e + (\log k - e) = -(e - \log k) - (g - 1) e = -d - (g - 1) e$, where $d = e - \log k$.
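The two gap scores can be written as small functions, and the affine identity above checked numerically. This is a sketch under assumed values: k = 0.1 and e = 0.5 are arbitrary illustrative numbers, and the function names are hypothetical.

```python
import math


def linear_gap_score(g, d):
    """Linear score: log f(g) with f(g) = exp(-g*d)."""
    return -g * d


def affine_gap_score(g, d, e):
    """Affine score: -d for the first gap position, -e for each further one."""
    return -d - (g - 1) * e


# Assumed illustrative parameters for f(g) = k * exp(-g*e).
k, e = 0.1, 0.5
d = e - math.log(k)  # the definition of d used in the derivation

# Pairs (log f(g), affine score) for gap lengths 1..5; the two should agree.
comparison = [(math.log(k * math.exp(-g * e)), affine_gap_score(g, d, e))
              for g in range(1, 6)]
```

Since k < 1 here, d = e - log k is larger than e, so the first gap position costs more than each later one.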

Repeated matches
• A big string x[1..n] and a smaller string y[1..m].
• Asymmetric: we are looking for multiple matches of y in x.
• As we do the matching and fill the table, we need to decide when to stop going further in y and start over from the beginning of y.
• F(i,0): assuming $x_i$ is in an unmatched region, the best total score so far.
• F(i,j), j >= 1: assuming $x_i$ is in a matched region and the last match ends at $x_i$ and $y_j$, the best total score so far.
• F(0,0) = 0.
• $F(i,j) = \max \{\, F(i,0);\; F(i-1,j-1) + s(x_i, y_j);\; F(i-1,j) - d;\; F(i,j-1) - d \,\}$
• F(i,0) corresponds to the start-over option (but now we store the total score so far).
• $F(i,0) = \max \{\, F(i-1,0);\; F(i-1,j) - T,\ j = 1, \ldots, m \,\}$
• T is a threshold: we are only interested in matches scoring higher than T. (Important, because there are always short local alignments with small positive scores even between entirely unrelated sequences.)
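A minimal sketch of the table fill for repeated matches follows. It computes scores only, with no traceback. Three things are assumptions of this sketch rather than statements from the slides: row 0 is initialised to zeros, the final total is read off by applying the F(i,0) rule once more after the last row, and the scoring values (+2 match, -2 mismatch, d = 3, T = 5) are arbitrary. The function name is hypothetical.

```python
def repeated_match_score(x, y, s, d, T):
    """Fill the repeated-match table F and return (best total score, F).

    s(a, b) is the substitution score, d the linear gap penalty,
    and T the threshold a match must exceed to be worth keeping.
    """
    n, m = len(x), len(y)
    if m == 0:
        raise ValueError("y must be non-empty")
    # Assumption: row 0 is all zeros, so a match may start anywhere in y.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Unmatched state: stay unmatched, or close a match that ended at x[i-1].
        F[i][0] = max(F[i - 1][0], max(F[i - 1][1:]) - T)
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                    # start over
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),    # extend the match
                F[i - 1][j] - d,                            # gap in y
                F[i][j - 1] - d,                            # gap in x
            )
    # Assumption: close any match still open at the end of x.
    total = max(F[n][0], max(F[n][1:]) - T)
    return total, F


def toy_score(a, b):
    """Hypothetical scores: +2 for an identity, -2 otherwise."""
    return 2 if a == b else -2


total, table = repeated_match_score("ACGTTTACGT", "ACGT", toy_score, d=3, T=5)
```

In this toy run, x contains two exact copies of y, each scoring 8; after subtracting T = 5 per match, the total is 6. Recovering the matched segments themselves would need a traceback, which is omitted here.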

Illustration of repeated matches

Next • Alignment with affine gap scores. • Heuristic based approach.
