Life Science 20001396 IIS Lab 이청재

Peptide Sequence Pairwise Alignment using Hidden Markov Model - Requirement Analysis

Pairwise Alignment
Seq X : HEAGAWGHEEE
Seq Y : PAWHEAE
How similar are the two sequences? How can we know? → Peptide sequence alignment
• Identity scoring
• Chemical similarity scoring
• Observed substitution

Pairwise Alignment
• The peptide sequences are composed of 20 amino acids and have diverged from a common ancestor by a process of mutation and selection.
• Substitution : changing residues (symbols) in a sequence → Scoring schemes
• Insertion, Deletion : adding or removing residues → Gap penalty

Key Issues : Scoring Schemes (Weight Matrices) and Techniques of Alignment
• Substitution matrix
• Global alignment vs. local alignment
• Gap penalty : opening vs. extension
• What kinds of algorithms : dynamic programming, HMM, …

Scoring Scheme : Substitution Matrix (BLOSUM100)
• We need a score term for each aligned residue pair : the scores s(a,b) can be arranged in a 20 × 20 matrix.
• How to make this matrix : derive the scores for pairwise alignment algorithms from probabilities.
• How to estimate the probabilities : from a known dataset. The BLOSUM matrices were derived from a set of aligned, ungapped regions from protein families, called the BLOCKS database.
• The matrix is symmetric.
• S(xi,yj) : score of a substitution between xi and yj, where xi is the i-th amino acid in sequence x and yj is the j-th amino acid in sequence y.
e.g.)
HEAGAWGHE-E
--P-AW-HEAE
S(A,P) = -2
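A minimal sketch of "deriving scores from probabilities", assuming the usual log-odds form s(a,b) = log(p_ab / (q_a q_b)), where p_ab is the probability of seeing a and b aligned in related sequences and q_a, q_b are background frequencies. The function name and all numbers are hypothetical and chosen only for illustration; real BLOSUM entries are scaled and rounded values estimated from the BLOCKS database.

```python
import math

def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """Log-odds score: log of (aligned-pair probability / chance probability)."""
    return math.log(p_ab / (q_a * q_b))

# Hypothetical probabilities, not taken from any real matrix.
q_a, q_b = 0.1, 0.05

# Pair observed more often than chance (0.02 > 0.1 * 0.05): positive score.
print(log_odds_score(0.02, q_a, q_b))

# Pair observed less often than chance (0.001 < 0.1 * 0.05): negative score.
print(log_odds_score(0.001, q_a, q_b))
```

Because p_ab = p_ba and the product q_a q_b does not depend on order, a matrix built this way is symmetric.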

Techniques of Alignment
• Global alignment vs. local alignment
Seq X : HEAGAWGHEEE
Seq Y : PAWHEAE
Global alignment :
HEAGAWGHE-E
--P-AW-HEAE
Local alignment :
AWGHE
AW-HE
• Gap penalty : opening vs. extension
• Alignment algorithms : Needleman-Wunsch, Smith-Waterman, Hidden Markov Model, …

Techniques of Alignment : Gap Penalty
• Linear score : a gap of length g scores -gd
• Affine score : a gap of length g scores -d - (g - 1)e
• g : the length of a gap
• d : the gap-open penalty
• e : the gap-extension penalty
• Generally, d > e

Global Alignment : Needleman-Wunsch Algorithm → Dynamic Programming
• Initialize
• F(0,0) = 0
• At i = 0, F(0,j) = -dj
• At j = 0, F(i,0) = -di
• Recurrence : F(i,j) = max { F(i-1, j-1) + s(xi,yj), F(i-1, j) - d, F(i, j-1) - d }
• Each cell F(i,j) is reached from F(i-1, j-1) with score s(xi,yj), or from F(i-1, j) or F(i, j-1) with score -d.
• Traceback : store a pointer to the previous state, that is, the cell from which F(i,j) was derived
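The Needleman-Wunsch steps (initialize, fill with the max over three moves, trace back through stored pointers) can be sketched as follows. This is an illustrative sketch, not the presenter's implementation: `simple_score` (+1 match, -1 mismatch) is a hypothetical stand-in for a real substitution matrix such as BLOSUM, and the gap penalty d = 2 is chosen only to suit that toy scoring.

```python
def simple_score(a: str, b: str) -> int:
    # Hypothetical stand-in for s(a,b); a real run would look up a BLOSUM entry.
    return 1 if a == b else -1

def needleman_wunsch(x: str, y: str, score, d: int):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Initialization: F(i,0) = -di and F(0,j) = -dj.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -d * i, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -d * j, "left"
    # Fill: F(i,j) = max of the diagonal, vertical and horizontal moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback: follow the stored pointers from F(n,m) back to F(0,0).
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

best, ax, ay = needleman_wunsch("AWGHE", "AWHE", simple_score, d=2)
print(best)  # 2
print(ax)    # AWGHE
print(ay)    # AW-HE
```

When several moves tie, this sketch keeps the first one listed (diagonal), so only one of possibly several optimal alignments is returned.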

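Dynamic Programming
Seq A : HEAGAWGHEEE
Seq B : PAWHEAE
Gap penalty d = 8. Partially filled matrix (Seq A across the top, Seq B down the side) :
         H    E    A    G    A    W    G    H    E    E    E
    0   -8  -16  -24
P  -8   -2
A -16
W -24
H
E
A
E
F(i, j) = MAX of
F(i - 1, j - 1) + S(P,H) = 0 - 2 = -2
F(i - 1, j) - d = -8 - 8 = -16
F(i, j - 1) - d = -8 - 8 = -16
so F(1,1) = F(0,0) + S(P,H) = -2.
Pointer for traceback : F(1,1) points back to F(0,0).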
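The first cell can be checked with a few lines of arithmetic. The sketch below assumes d = 8 and S(P,H) = -2 as in the example; the variable names are hypothetical. With d = 8 the initialization rule F(0,j) = -dj gives boundary values 0, -8, -16, -24.

```python
d = 8        # gap penalty used in the example
s_PH = -2    # S(P,H), the value used in the example

# Boundary cells from the initialization rule F(0,j) = -dj and F(i,0) = -di.
top_row = [-d * j for j in range(4)]     # F(0,0) .. F(0,3)
left_col = [-d * i for i in range(4)]    # F(0,0) .. F(3,0)

from_diagonal = top_row[0] + s_PH        # F(0,0) + S(P,H)
from_above = top_row[1] - d              # F(0,1) - d
from_left = left_col[1] - d              # F(1,0) - d
F11 = max(from_diagonal, from_above, from_left)

print(top_row)  # [0, -8, -16, -24]
print(F11)      # -2
```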

Dynamic Programming with More Complex Model
FSA Alignment with Affine Gap Penalty
• States : M (+1,+1), Ix (+1,+0), Iy (+0,+1)
• M → Ix and M → Iy : gap-open penalty -d
• Ix → Ix and Iy → Iy : gap-extension penalty -e
• Every transition into M : score S(xi,yj)
• Ix : a symbol was inserted into sequence x, so there is a gap at that position in sequence y. Iy is the mirror case.
Recurrences :
M(i,j) = max { M(i-1,j-1) + S(xi,yj), Ix(i-1,j-1) + S(xi,yj), Iy(i-1,j-1) + S(xi,yj) }
Ix(i,j) = max { M(i-1,j) - d, Ix(i-1,j) - e }
Iy(i,j) = max { M(i,j-1) - d, Iy(i,j-1) - e }
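A minimal sketch of the three-state affine-gap recursion, assuming the standard form in which S(xi,yj) is added on every move into M, Ix advances only i, and Iy advances only j. It returns the optimal global score only (no traceback). `simple_score` and the penalties d = 3, e = 1 are hypothetical values for illustration.

```python
NEG_INF = float("-inf")

def simple_score(a: str, b: str) -> int:
    # Hypothetical stand-in for a substitution matrix entry S(a,b).
    return 1 if a == b else -1

def affine_global_score(x: str, y: str, score, d: int, e: int):
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    # A leading gap of length g costs d + (g - 1) * e.
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            # Enter M from any state, consuming one symbol of each sequence.
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Open a gap from M (-d) or extend an existing one (-e).
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])

# Four matches (+4) and one gap of length 2 (-3 - 1) give a total of 0.
print(affine_global_score("AWGGHE", "AWHE", simple_score, d=3, e=1))  # 0
```

With d = e the result reduces to the linear gap score, which is one way to sanity-check the sketch against the simpler recursion.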

Markov Chain
"A probabilistic model of sequences of events where the probability of an event occurring depends upon the fact that a preceding event occurred"

Hidden Markov Model
• "The real sequence of states is invisible" → We can see only the visible symbols, that is, sequences or measurements
• More complicated than a simple Markov chain
• (Hidden) states
• Transition probability
• (Visible) symbols
• Emission probability : the probability that a certain symbol is emitted from a state

Hidden Markov Model : Decoding
• Observed sequence → decode the sequence of the underlying states
• Viterbi algorithm (similar to dynamic programming)
• The most probable state path π* : the sequence of states inferred from a symbol sequence whose states are unknown
• Vk(i) : the probability of the most probable path ending in state k with observation i
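Decoding with the Viterbi algorithm can be sketched as below. The two-state model (F = fair die, L = loaded die) and every probability in it are hypothetical, chosen only to make the example easy to follow. The table V holds log probabilities, so V[i][k] corresponds to the log of Vk(i); all probabilities are assumed to be non-zero.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[i][k]: log probability of the most probable path ending in state k
    # at observation i; back[i][k]: the state that path came from.
    V = [{k: math.log(start_p[k]) + math.log(emit_p[k][obs[0]]) for k in states}]
    back = [{}]
    for i in range(1, len(obs)):
        V.append({})
        back.append({})
        for k in states:
            prev = max(states, key=lambda l: V[i - 1][l] + math.log(trans_p[l][k]))
            V[i][k] = (V[i - 1][prev] + math.log(trans_p[prev][k])
                       + math.log(emit_p[k][obs[i]]))
            back[i][k] = prev
    # Traceback from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, V[-1][last]

# Hypothetical toy model: a fair die (F) and a die loaded towards 6 (L).
states = ["F", "L"]
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {
    "F": {s: 1 / 6 for s in "123456"},
    "L": {**{s: 0.1 for s in "12345"}, "6": 0.5},
}

path, log_prob = viterbi("666666", states, start_p, trans_p, emit_p)
print("".join(path))  # LLLLLL
```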

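Pairwise Alignment using HMM : Pair HMMs (Probabilistic Model)
[Figure : pair HMM state diagram]
• States : Begin, M (emits the aligned pair xi, yj with probability Pxiyj), Ix (emits xi with probability Qxi), Iy (emits yj with probability Qyj), End
• Begin → M and M → M : 1 - 2δ - γ
• Begin → Ix, Begin → Iy, M → Ix, M → Iy : δ
• Ix → Ix and Iy → Iy : ε
• Ix → M and Iy → M : 1 - ε - γ
• Begin, M, Ix, Iy → End : γ

Further Study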
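One quick check on the pair HMM parameters is that the transitions leaving each state sum to 1. The sketch below assumes the usual arrangement of the labels (δ to open a gap, ε to extend it, γ to reach End); the function name and the numeric values are hypothetical.

```python
import math

def pair_hmm_transitions(delta: float, eps: float, gamma: float):
    """Transition table of the pair HMM, keyed by source state."""
    return {
        "Begin": {"M": 1 - 2 * delta - gamma, "Ix": delta, "Iy": delta, "End": gamma},
        "M": {"M": 1 - 2 * delta - gamma, "Ix": delta, "Iy": delta, "End": gamma},
        "Ix": {"M": 1 - eps - gamma, "Ix": eps, "End": gamma},
        "Iy": {"M": 1 - eps - gamma, "Iy": eps, "End": gamma},
    }

# Hypothetical parameter values, for illustration only.
t = pair_hmm_transitions(delta=0.2, eps=0.1, gamma=0.1)
for state, row in t.items():
    # Every row must be a probability distribution over next states.
    assert math.isclose(sum(row.values()), 1.0)
    print(state, row)
```

Unlike the score-based FSA, this constraint is what makes the model probabilistic: δ, ε and γ are probabilities, not penalties.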

References
1. Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison. Biological Sequence Analysis : Probabilistic models of proteins and nucleic acids (2000). Cambridge University Press.
• Available lecture note : http://bi.snu.ac.kr/Courses/bio02/bio02_2.html
2. Website for bioinformatics : http://bioinfo.sarang.net/wiki/BioinfoWiki
3. Programs for biosequence analysis : http://www.dina.dk/~sestoft/bsa.html
4. Dynamic programming : http://no-smok.net/nsmk/DynamicProgramming
