800 likes | 1.07k Views
Inexact Matching. Charles Yan 2008. Longest Common Subsequence. Given two strings, find a longest subsequence that they share substring vs. subsequence of a string Substring: the characters in a substring of S must occur contiguously in S
E N D
Inexact Matching Charles Yan 2008
Longest Common Subsequence • Given two strings, find a longest subsequence that they share • substring vs. subsequence of a string • Substring: the characters in a substring of S must occur contiguously in S • Subsequence: the characters can be interspersed with gaps. • Consider ababc and abdcb alignment 1 ababc. abd.cb the longest common subsequence is ab..c with length 3 alignment 2 aba.bc abdcb. the longest common subsequence is ab..b with length 3
Longest Common Subsequence Let’s give a score M an alignment in this way, M=sum s(xi,yi), where xi is the i character in the first aligned sequence yi is the i character in the second aligned sequence s(xi,yi)= 1 if xi= yi s(xi,yi)= 0 if xi≠yi or any of them is a gap The score for alignment: ababc. abd.cb M=s(a,a)+s(b,b)+s(a,d)+s(b,.)+s(c,c)+s(.,b)=3 To find the longest common subsequence between sequences S1 and S2 is to find the alignment that maximizes score M.
Longest Common Subsequence S1: a1a2a3…ai S2: b1b2b3…bj • Subproblem optimality Consider two sequences Let the optimal alignment be x1x2x3…xn-1xn y1y2y3…yn-1yn There are three possible cases for the last pair (xn,yn):
Longest Common Subsequence There are three cases for (xn,yn) pair: Mi,j = MAX {Mi-1, j-1 + S (ai,bj) (match/mismatch) Mi,j-1 + 0 (gap in sequence #1) Mi-1,j + 0 (gap in sequence #2) } Mi,j is the score for optimal alignment between strings a[1…i] (substring of a from index 1 to i) and b[1…j] S1: a1a2a3…ai S2: b1b2b3…bj x1x2x3…xn-1xn y1y2y3…yn-1yn
Longest Common Subsequence Examples: G A A T T C A G T T A (sequence #1) G G A T C G A (sequence #2) Mi,j = MAX { Mi-1, j-1 + S(ai,bj) Mi,j-1 + 0 Mi-1,j + 0 } s(ai,bj)= 1 if ai=bj s(ai,bj)= 0 if ai≠bjor any of them is a gap
Longest Common Subsequence Fill the score matrix M and trace back table B M1,1 = MAX[M0,0 + 1, M1, 0 + 0, M0,1 + 0] = MAX [1, 0, 0] = 1 Score matrix M Trace back table B
Longest Common Subsequence Score matrix M Trace back table B M7,11=6 (lower right corner of Score matrix) This tells us that the best alignment has a score of 6 What is the best alignment?
Longest Common Subsequence We need to use trace back table to find out the best alignment, which has a score of 6 • Find the path from lower right • corner to upper left corner
Longest Common Subsequence (2) At the same time, write down the alignment backward S1 :Take one character from each sequence :Take one character from sequence S1 (columns) S2 :Take one character from sequence S2 (rows)
Longest Common Subsequence :Take one character from each sequence :Take one character from sequence S1 (columns) :Take one character from sequence S2 (rows)
Longest Common Subsequence Thus, the optimal alignment is The longest common subsequence is G.A.T.C.G..A There might be multiple longest common subsequences (LCSs) between two given sequences. These LCSs have the same number of characters (not include gaps)
Longest Common Subsequence Algorithm LCS(string A, string B) { Input strings A and B Output the longest common subsequence of A and B M: Score Matrix B: trace back table (use letter a, b, c for ) n=A.length() m=B.length() // fill in M and B for (i=0;i<m+1;i++) for (j=0;j<n+1;j++) if (i==0) || (j==0) then M(i,j)=0; else if (A[i]==B[j]) M(i,j)=max {M[i-1,j-1]+1, M[i-1,j], M[i,j-1]} {update the entry in trace table B} else M(i,j)=max {M[i-1,j-1], M[i-1,j], M[i,j-1]} {update the entry in trace table B} then use trace back table B to print out the optimal alignment …
Global Alignment Global Alignment: Find the overall similarity between two sequences Mi,j = MAX { Mi-1, j-1 + S(ai,bj); Mi,j-1 + 0; Mi-1,j + 0 } s(ai,bj)= 1 if ai=bj s(ai,bj)= 0 if ai≠bjor any of them is a gap s(ai,bj) can be replaced by the similarity score between ai , bj
Local Alignment Local Alignment: Find all pairs of substrings that have similarity scores higher than a. Global Local
Global Alignment >seq1 ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWLDVDAITRVLLISSVMLVMIVEILNSAIEAVVDRIGSEYHELSGRAKDMGSAAVLIAIIVAVITWCILLWSHFG >seq2 MINPNPKRSDEPVFWGLFGAGGMWSAIIAPVMILLVGILLPLGLFPGDALSYERVLAFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGVVTIIKAGYSAWKG Global alignment score: -2 10 20 30 40 50 Seq1 ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWL---DVDAITRVL : . : . . .. : . : .:. . .:..... : :. . ::: Seq2 MINPNP-KRSDEPVFWGLFGAGGMW---SAIIAPVMILLVGILLPLGLFPGDALSYERVL 10 20 30 40 50 60 70 80 90 100 Seq1 LISSVML--VMIVEILNSAIEAVVDRIGSEYHEL-----SGRAKDMGSAAVLIAI-IVAV ... .. :.. .. . . :. .:.: .:. .: ::.: .. ...: Seq2 AFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGV 60 70 80 90 100 110 110 120 Seq1 ITWCILLWSHF-G .: .: . : Se2 VTIIKAGYSAWKG 120
Local Alignment >seq1 ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWLDVDAITRVLLISSVMLVMIVEILNSAIEAVVDRIGSEYHELSGRAKDMGSAAVLIAIIVAVITWCILLWSHFG >seq2 MINPNPKRSDEPVFWGLFGAGGMWSAIIAPVMILLVGILLPLGLFPGDALSYERVLAFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGVVTIIKAGYSAWKG Score = 19.6 bits Seq 1: 6 GFTRIIKAAGYSWKG 20 G IIKA +WKG Seq 2: 115 GVVTIIKAGYSAWKG 129 Score = 13.1 bits Seq 1: 103 IAIIVAVIT 111 +A I+ V+T Seq 2: 104 LAAILTVVT 112 Seq 1 Seq2
Local Alignment Global Local Mi,j = MAX { Mi-1, j-1 + S(ai,bj); Mi,j-1 + 0; Mi-1,j + 0 } Mi,j = MAX { 0; Mi-1, j-1 + S(ai,bj); Mi,j-1 + 0; Mi-1,j + 0 }
Gap Penalty There are three cases for (xn,yn) pair: Mi,j = MAX {Mi-1, j-1 + S (ai,bj) (match/mismatch) Mi,j-1 + 0 (gap in sequence #1) Mi-1,j + 0 (gap in sequence #2) } Mi,j is the score for optimal alignment between strings a[1…i] (substring of a from index 1 to i) and b[1…j] S1: a1a2a3…ai S2: b1b2b3…bj x1x2x3…xn-1xn y1y2y3…yn-1yn
Gap Penalty There are three cases for (xn,yn) pair: Mi,j = MAX {Mi-1, j-1 + S (ai,bj) (match/mismatch) Mi,j-1 + G (-, bj)(gap penalty) Mi-1,j + G (ai,-) (gap penalty) } Mi,j is the score for optimal alignment between strings a[1…i] (substring of a from index 1 to i) and b[1…j] S1: a1a2a3…ai S2: b1b2b3…bj x1x2x3…xn-1xn y1y2y3…yn-1yn
Gap Penalty 1. Constant gap weight: Equal penalty for each gap, regardless of the gap length. A gap of length q has a penalty of W, so is a gap of length 1 2. Affine gap weight: a constant weight to each additional space gap opening penalty: Wg (-10) gap extension penalty Ws (-1) A gap of length q will have a penalty of Wg+(q-1)*Ws How to modify the recursion function? 3.Convex gap weight: Each additional space contributes less than the proceeding space Wg+log (q) 4. Alphabet-weighted gap penalty: The penalty also depend on letter. Seq1 NSAIEAVVDRIGSEYHEL-----SGRWVFYGLAA Seq2 VLPLWCGLHRMHHAMHDLKLHSPAGKWVFYGLAA Seq1 NSAIEAVVDRIGSEYHE--L-S--GRWVFYGLAA Seq2 VLPLWCGLHRMHHAMHDLKLHSPAGKWVFYGLAA
BLAST http://www.ncbi.nlm.nih.gov/BLAST/ • Local alignment • Database search • Efficiency is vital
Local Alignment >seq1 ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWLDVDAITRVLLISSVMLVMIVEILNSAIEAVVDRIGSEYHELSGRAKDMGSAAVLIAIIVAVITWCILLWSHFG >seq2 MINPNPKRSDEPVFWGLFGAGGMWSAIIAPVMILLVGILLPLGLFPGDALSYERVLAFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGVVTIIKAGYSAWKG Score = 19.6 bits Seq 1: 6 GFTRIIKAAGYSWKG 20 G IIKA +WKG Seq 2: 115 GVVTIIKAGYSAWKG 129 Score = 13.1 bits Seq 1: 103 IAIIVAVIT 111 +A I+ V+T Seq 2: 104 LAAILTVVT 112 Seq 1 Seq2
BLAST http://www.ncbi.nlm.nih.gov/BLAST/ • Local alignment • Database search • Efficiency is vital
Raw Score S The raw score S for an alignment is calculated by summing the scores for each aligned position and the scores for gaps
Bit Score S' Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years. By normalizing a raw score using the formulaone attains a "bit score" S', which has a standard set of units.
Bit Score S' The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
Significance The significance of each alignment is computed as a P value or an E value • E value: Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. • P value:The probability of an alignment occurring with the score in question or better. The p value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.
E-value • In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda. Most simply, the expected number of HSPs with score at least S is given by the formulaWe call this the E-value for the score S. This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score. The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.
P-value • The number of random HSPs with score >= S is described by a Poisson distribution. This means that the probability of finding exactly a HSPs with score >=S is given bywhere E is the E-value of S given by equation (1) above. Specifically the chance of finding zero HSPs with score >=S is e-E, so the probability of finding at least one such HSP isThis is the P-value associated with the score S. For example, if one expects to find three HSPs with score >= S, the probability of finding at least one is 0.95. The BLAST programs report E-value rather than P-values because it is easier to understand the difference between, for example, E-value of 5 and 10 than P-values of 0.993 and 0.99995.
BLAST • The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query. • Break the query and database sequences into fragments ("words"), and initially seek matches between fragments. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a given substitution matrix. • Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.
BLAST Use keyword trees for find HSPs in a subject sequence. For each HSP, extend the local alignment at both ends as long as the alignment score is higher than threshold T. If a local alignment has score higher than C, than the there is “significant” similarity between query and subject sequences. Report a “hit”.
Inexact matching • Alignment • Motif/profile searching
PROSITE A profile or weight matrix is a table of position-specific alphabet weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.
Motifs and Matching • Motif Finding: Given a set of protein sequences, to find the motif(s) that are shared by these proteins. • Motif Scanning Given a motif and a protein sequence, to find the occurrences (not necessary identical) of the motif on the protein sequences.
Motifs How significant is a motif? How different is it from a uniform distribution? Does it represent a biologically significant region? Information Content (IC) Probability of observing character j at position i
Motifs Usually, the background is not a uniform distribution. Thus, it is more useful to take into account the background distribution KL divergence score Probability of observing character j at position i Probability that character j occurs at position i based on background distribution. The higher the score, the more unlikely to obtain the motif by chance , i.e., the more significant is the motif.
Motifs How to find potential occurrences of a motif in a give string? Motif/profile • Counts: the entry in (j,i)th cell is the times character j occurs at ith column: Ni,j • Frequency (likelihood) : • Log-likelihood: • Log-odds:
Motifs For a given motif of length K, to scan an input sequence to find the occurrences of the motif. Assume that each entry in the motif is Pi(j) ATCGCTAGCTAAGTAGTGGGCTAAGCTAAGCTAAGTGTGTAGCGTA X=x1x2x3x…xk is the substring within the sliding window
Motifs Independent identical distribution (iid) Markov chain model (1st order or higher order)
MotifScanner http://homes.esat.kuleuven.be/~thijs/Work/MotifScanner.html
JASPAR Motifs JASPAR MA0112 1 1 7 2 0 0 0 6 1 2 3 1 1 5 0 0 2 3 5 5 1 0 0 0 7 0 7 5 2 1 0 1 8 9 4 4 1 1 1 7 9 0 2 2 1 1 4 1 8 3 0 0 0 1 2 2 0 0 0 9 0 1 0 1 0 6 0 0 1 0 3 1
Suffix Trees for Inexact Matching In a rooted tree T, a node u is anancestorof a node v is u is on the unique path from the root to v. A proper ancestor of v is an ancestor that is not v. In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y is the deepest node in T that is an ancestor of both x and y. lca retrieval problem: Given two nodes, x and y, of a rooted tree, to find their lca. Let n be the number of nodes in a rooted tree T, then after a O(n)-time preprocessing, lca retrieval problem can be solved in constant time!!! (Independent of n)