CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics

Review of last lecture • How to map gaps more accurately? GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG Score = 8 x m – 3 x d Score = 8 x m – 3 x d • Gaps usually occur in bunches • During evolution, chunks of DNA may be lost or inserted entirely • Aligning genomic sequences vs. cDNAs: cDNAs are spliced versions of the genomic seqs

Model gaps more accurately • Previous model: • Gap of length n incurs penalty nd • General: • Convex function • E.g. (n) = c * sqrt (n) F(i-1, j-1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k,j) – (i-k) maxk=0…j-1F(i,k) – (j-k) • Running Time: O((M+N)MN) (cubic) • Space: O(NM)  n  n

Compromise: affine gaps (n) (n) = d + (n – 1)e | | gap gap open extension e d Match: 2 Gap open: -5 Gap extension: -1 GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG 8x2-5-2 = 9 8x2-3x5 = 1 • We want to find the optimal alignment with affine gap penalty in • O(MN) time • O(MN) or better O(M+N) memory

Dynamic programming • Consider three sub-problems when aligning x1..xi and y1..yj • F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj • Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap • Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap xi xi xi yj yj yj Iy(i, j) Ix(i, j) F(i, j)

Input Output (-, yj) / e (xi,yj) /  Ix (xi,yj) /  (-, yj) / d F (xi,-) / d Iy (xi,-) / e Start state (xi,yj) / 

(-, yj) / e (xi,yj) /  Ix (xi,yj) /  (-, yj) / d F (xi,-) / d Iy (xi,-) / e start state (xi,yj) /  F-F-Iy-F-Ix F-F-F-F F-Iy-F-F-Ix AAC ||| ACT AAC- | | A-CT AAC- || -ACT AAC ACT Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM. Optimal alignment: a state path to read the two sequences such that the total output score is the highest

(-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i-1, j-1) + (xi, yj) F(i, j) = max Ix(i-1, j-1) + (xi, yj) Iy(i-1, j-1) + (xi, yj) xi yj

(-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i, j-1) + d Ix(i, j) = max Ix(i, j-1) + e xi yj Ix(i, j)

(-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i-1, j) + d Iy(i, j) = max Iy(i-1, j) + e xi yj Iy(i, j)

F(i – 1, j – 1) F(i, j) = (xi, yj) + max Ix(i – 1, j – 1) Iy(i – 1, j – 1) F(i, j – 1) + d Ix(i, j) = max Ix(i, j – 1) + e F(i – 1, j) + d Iy(i, j) = max Iy(i – 1, j) + e Continuing alignment Closing gaps in x Closing gaps in y Opening a gap in x Gap extension in x Opening a gap in y Gap extension in y

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F: aligned on both Iy: Insertion on y y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix: Insertion on x

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = 2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = -2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = G C A C F(i,j-1) d = -5 Ix(i,j) Ix(i,j-1) e = -1 Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = -2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = 2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = G C A C F(i,j-1) d = -5 Ix(i,j) Ix(i,j-1) e = -1 Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1,j) G C A C F(i-1,j) e=-1 d=-5 Iy(i,j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C x GCAC || | GC-C y F Iy y = G C C y = G C C x = x = G C A C G C A C Ix

Today: statistics of alignment Where does (xi, yj)come from? Are two aligned sequences actually related?

Probabilistic model of alignments • We’ll first focus on protein alignments without gaps • Given an alignment, we can consider two possible models • R: the sequences are related by evolution • U: the sequences are unrelated • How can we distinguish these two models? • How is this view related to amino-acid substitution matrix?

Model for unrelated sequences • Assume each position of the alignment is independently sampled from some distribution of amino acids • ps: probability of amino acid s in the sequences • Probability of seeing an amino acid s aligned to an amino acid t by chance is • Pr(s, t | U) = ps * pt • Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is i

Model for related sequences • Assume each pair of aligned amino acids evolved from a common ancestor • Let qst be the probability that amino acid s in one sequence is related to t in another sequence • The probability of an alignment of x and y is give by

Probabilistic model of Alignments • How can we decide which model (U or R) is more likely? • One principled way is to consider the relative likelihood of the two models (the odd ratios) • A higher ratio means that R is more likely than U

Log odds ratio • Taking logarithm, we get • Recall that the score of an alignment is given by

Therefore, if we define • We are actually defining the alignment score as the log odds ratio between the two models R and U

How to get the probabilities? • ps can be counted from the available protein sequences • But how do we get qst? (the probability that s and t have a common ancestor) • Counted from trusted alignments of related sequences

Protein Substitution Matrices • Two popular sets of matrices for protein sequences • PAM matrices [Dayhoff et al, 1978] • Better for aligning closely related sequences • BLOSUM matrices [Henikoff & Henikoff, 1992] • For both closely or remotely related sequences

BLOSUM-N matrices • Constructed from a database called BLOCKS • Contain many closely related sequences • Conserved amino acids may be over-counted • N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity • identity: % of matched columns • Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

: Scaling factor to convert score to integer. Important: when you are told that a scoring matrix is in half-bits =>  = ½ ln2 Positive for chemically similar substitution Common amino acids get low weights Rare amino acids get high weights

45 62 90 Weak homology Strong homology BLOSUM-N matrices • If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N. say BLOSUM75 • On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50 • BLOSUM-62: good for most purposes

For DNAs • No database of trusted alignments to start with • Specify the percentage identity you would like to detect • You can then get the substitution matrix by some calculation

For example • Suppose pA = pC = pT = pG = 0.25 • We want 88% identity • qAA = qCC = qTT = qGG = 0.22, the rest = 0.12/12 = 0.01 • (A, A) = (C, C) = (G, G) = (T, T) = log (0.22 / (0.25*0.25)) = 1.26 • (s, t) = log (0.01 / (0.25*0.25)) = -1.83 for s ≠ t.

Substitution matrix

Scale won’t change the alignment • Multiply by 4 and then round off to get integers

Arbitrary substitution matrix • Say you have a substitution matrix provided by someone • It’s important to know what you are actually looking for when you use the matrix

NCBI-BLAST WU-BLAST • What’s the difference? • Which one should I use for my sequences?

We had • Scale it, so that • Reorganize:

Since all probabilities must sum to 1, • We have • Suppose again ps = 0.25 for any s • We know (s, t) from the substitution matrix • We can solve the equation for λ • Plug λ into to get qst

NCBI-BLAST WU-BLAST  = 1.33 qst = 0.24 for s = t, and 0.004 for s ≠ t Translate: 95% identity  = 0.19 qst = 0.16 for s = t, and 0.03 for s ≠ t Translate: 65% identity

Details for solving  Known: (s,t) = 1 for s=t, and (s,t) = -2 for s t. Since and s,t qst = 1, we have 12 * ¼ * ¼ * e-2 + 4 * ¼ * ¼ * e = 1 Let e = x, we have ¾ x-2 + ¼ x = 1. Hence, x3 – 4x2 + 3 = 0; • X has three solutions: 3.8, 1, -0.8 • Only the first leads to a positive  •  = ln (3.8) = 1.33

Today: statistics of alignment Where does (xi, yj)come from? Are two aligned sequences actually related?

Statistics of Alignment Scores • Q: How do we assess whether an alignment provides good evidence for homology (i.e., the two sequences are evolutionarily related)? • Is a score 82 good? What about 180? • A: determine how likely it is that such an alignment score would result from chance

P-value of alignment • p-value • The probability that the alignment score can be obtained from aligning random sequences • Small p-value means the score is unlikely to happen by chance • The most common thresholds are 0.01 and 0.05 • Also depend on purpose of comparison and cost of misclaim

Statistics of global seq alignment • Theory only applies to local alignment • For global alignment, your best bet is to do Monte-Carlo simulation • What’s the chance you can get a score as high as the real alignment by aligning two random sequences? • Procedure • Given sequence X, Y • Compute a global alignment (score = S) • Randomly shuffle sequence X (or Y) N times, obtain X1, X2, …, XN • Align each Xi with Y, (score = Ri) • P-value: the fraction of Ri >= S

Human HEXA Fly HEXO1 Score = -74

-74 Distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences There are 88 random sequences with alignment score >= -74. So: p-value = 88 / 200 = 0.44 => alignment is not significant

Mouse HEXA Human HEXA Score = 732 ……………………………………………………

CS 5263 Bioinformatics