1 / 60

CSE182-L4: Keyword matching

Learn about sequence alignment basics, optimal score computation, Blast database search, expected matches, and the efficiency of Blast compared to Smith-Waterman.

rodl
Download Presentation

CSE182-L4: Keyword matching

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE182-L4: Keyword matching CSE 182

  2. Backward scoring • Defin Sb[i,j] : Best scoring alignment of the suffixes s[i+1..n] and t[j+1..m] • Q: What is the score of the best alignment of the two strings s and t? • HW: Write the recurrences for Sb CSE 182

  3. Forward/Backward computations j m 1 • F[j] = Score of the best scoring alignment of s[1..n/2] and t[1..j] • F[j] = S[n/2,j] • B[j] = Score of the best scoring alignment of s[n/2+1..n] and t[j+1..m] • B[j] = Sb[n/2,j] n/2 CSE 182

  4. Forward/Backward computations j m 1 • At the optimal coordinate, j • F[j]+B[j]=S[n,m] • In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path. n/2 CSE 182

  5. Forward, Backward computation • There exists a coordinate, j • F[j]+B[j]=S[n,m] • In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path. CSE 182

  6. Linear Space Alignment • Align(1..n,1..m) • For all 1<=j <= m • Compute F[j]=S(n/2,j) • For all 1<=j <= m • Compute B[j]=Sb(n/2,j) • j* = maxj {F[j]+B[j] } • X = Align(1..n/2,1..j*) • Y = Align(n/2..n,j*..m) • Return X,j*,Y CSE 182

  7. Linear Space complexity • T(nm) = c.nm + T(nm/2) = O(nm) • Space = O(m) CSE 182

  8. Summary • We considered the basics of sequence alignment • Opt score computation • Reconstructing alignments • Local alignments • Affine gap costs • Space saving measures • Can we recreate Blast? CSE 182

  9. Blast and local alignment • Concatenate all of the database sequences to form one giant sequence. • Do local alignment computation with the query. CSE 182

  10. Database (n) Query (m) Large database search Database size n=100M, Querysize m=1000. O(nm) = 1011 computations CSE 182

  11. Why is Blast Fast? CSE 182

  12. Silly Question! • True or False: No two people in new york city have the same number of hair CSE 182

  13. Observations • Much of the database is random from the query’s perspective • Consider a random DNA string of length n. • Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25 • Assume for the moment that the query is all A’s (length m). • What is the probability that an exact match to the query can be found? CSE 182

  14. Basic probability • Probability that there is a match starting at a fixed position i = 0.25m • What is the probability that some position i has a match. • Dependencies confound probability estimates. CSE 182

  15. Total money you expect to earn Basic Probability:Expectation • Q: Toss a coin: each time it comes up heads, you get a dollar • What is the money you expect to get after n tosses? • Let Xi be the amount earned in the i-th toss CSE 182

  16. Expected number of matches • Expected number of matches can still be computed. i • Let Xi=1 if there is a match starting at position i, Xi=0 otherwise • Expected number of matches = CSE 182

  17. Expected number of exact Matches is small! • Expected number of matches = n*0.25m • If n=107, m=10, • Then, expected number of matches = 9.537 • If n=107, m=11 • expected number of hits = 2.38 • n=107,m=12, • Expected number of hits = 0.5 < 1 • Bottom Line: An exact match to a substring of the query is unlikely just by chance. CSE 182

  18. Suppose we are looking for a database string with greater than 90% identity to the query (length 100) • Partition the query into size 10 substrings. At least one much match the database string exactly Observation 2 • What is the pigeonhole principle? CSE 182

  19. Why is this important? • Suppose we are looking for sequences that are 80% identical to the query sequence of length 100. • Assume that the mismatches are randomly distributed. • What is the probability that there is no stretch of 10 bp, where the query and the subject match exactly? • Rough calculations show that it is very low. Exact match of a short query substring to a truly similar subject is very high. • The above equation does not take dependencies into account • Reality is better because the matches are not randomly distributed CSE 182

  20. Just the Facts • Consider the set of all substrings of the query string of fixed length W. • Prob. of exact match to a random database string is very low. • Prob. of exact match to a true homolog is very high. • Keyword Search (exact matches) is MUCH faster than sequence alignment CSE 182

  21. BLAST Database (n) • Consider all (m-W) query words of size W (Default = 11) • Scan the database for exact match to all such words • For all regions that hit, extend using a dynamic programming alignment. • Can be many orders of magnitude faster than SW over the entire string CSE 182

  22. Why is BLAST fast? • Assume that keyword searching does not consume any time and that alignment computation the expensive step. • Query m=1000, random Db n=107,no TP • SW = O(nm) = 1000*107 = 1010 computations • BLAST, W=11 • E(#11-mer hits)= 1000* (1/4)11 * 107=2384 • Number of computations = 2384*100*100=2.384*107 • Ratio=1010/(2.384*107)=420 • Further speed improvements are possible CSE 182

  23. Keyword Matching • How fast can we match keywords? • Hash table/Db index? What is the size of the hash table, for m=11 • Suffix trees? What is the size of the suffix trees? • Trie based search. We will do this in class. AATCA 567 CSE 182

  24. Related notes • How to choose the alignment region? • Extend greedily until the score falls below a certain threshold • What about protein sequences? • Default word size = 3, and mismatches are allowed. • Like sequences, BLAST has been evolving continuously • Banded alignment • Seed selection • Scanning for exact matches, keyword search versus database indexing CSE 182

  25. P-value computation • How significant is a score? What happens to significance when you change the score function • A simple empirical method: • Compute a distribution of scores against a random database. • Use an estimate of the area under the curve to get the probability. • OR, fit the distribution to one of the standard distributions. CSE 182

  26. Z-scores for alignment • Initial assumption was that the scores followed a normal distribution. • Z-score computation: • For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation • Look up a table to get a P-value CSE 182

  27. Blast E-value • Initial (and natural) assumption was that scores followed a Normal distribution • 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution • Practical consequence: • Longer tail. • Previously significant hits now not so significant CSE 182

  28. Exponential distribution • Random Database, Pr(1) = p • What is the expected number of hits to a sequence of k 1’s • Instead, consider a random binary Matrix. Expected # of diagonals of k 1s CSE 182

  29. As you increase k, the number decreases exponentially. • The number of diagonals of k runs can be approximated by a Poisson process • In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul). • As the score increases, the number of alignments that achieve the score decreases exponentially CSE 182

  30. Blast E-value • Choose a score such that the expected score between a pair of residues < 0 • Expected number of alignments with a particular score • For small values, E-value and P-value are the same CSE 182

  31. Blast Variants Longer seeds. Seeds with don’t care values Later Database pre-processing Seeds with don’t care values • What is mega-blast? • What is discontiguous mega-blast? • Phi-Blast/Psi-Blast? • BLAT? • PatternHunter? CSE 182

  32. Silly Quiz • Name a famous Bioinformatics Researcher • Name a famous Bioinformatics Researcher who is a woman CSE 182

  33. Scoring DNA • DNA has structure. CSE 182

  34. DNA scoring matrices • So far, we considered a simple match/mismatch criterion. • The nucleotides can be grouped into Purines (A,G) and Pyrimidines. • Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions) CSE 182

  35. Scoring proteins • Scoring protein sequence alignments is a much more complex task than scoring DNA • Not all substitutions are equal • Problem was first worked on by Pauling and collaborators • In the 1970s, Margaret Dayhoff created the first similarity matrices. • “One size does not fit all” • Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant • Different proteins might evolve at different rates and we need to normalize for that CSE 182

  36. PAM 1 distance • Two sequences are 1 PAM apart if they differ in 1 % of the residues. 1% mismatch • PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart] CSE 182

  37. PAM1 matrix • Align many proteins that are very similar • Is this a problem? • PAM1 distance is the probability of a substitution when 1% of the residues have changed • Estimate the frequency Pb|a of residue a being substituted by residue b. • S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb) CSE 182

  38. PAM 1 CSE 182

  39. 1 PAM 1 PAM PAM distance • Two sequences are 1 PAM apart when they differ in 1% of the residues. • When are 2 sequences 2 PAMs apart? 2 PAM CSE 182

  40. Higher PAMs • PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b) • PAM2 = PAM1 * PAM1 (Matrix multiplication) • PAM250 • = PAM1*PAM249 • = PAM1250 CSE 182

  41. Note: This is not the score matrix: What happens as you keep increasing the power? CSE 182

  42. Scoring using PAM matrices • Suppose we know that two sequences are 250 PAMs apart. • S(a,b) = log10(Pab/PaPb)= log10(Pb|a/Pb) = log10(PAM250(a,b)/Pb) CSE 182

  43. BLOSUM series of Matrices • Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions • A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database. • BLOSUM60 Merge all proteins that have greater than 60%. Then, compute the substitution probability. • In practice BLOSUM62 seems to work very well. CSE 182

  44. PAM vs. BLOSUM • What is the correspondence? • PAM1 Blosum1 • PAM2 Blosum2 • Blosum62 • PAM250 Blosum100 CSE 182

  45. Dictionary Matching, R.E. matching, and position specific scoring CSE 182

  46. Keyword search • Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword. • Question: Given a collection of strings (keywords), find all occrrences in a database string where they keyword might match. CSE 182

  47. Dictionary Matching 1:POTATO 2:POTASSIUM 3:TASTE P O T A S T P O T A T O • Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string. • How fast can this be done? database dictionary CSE 182

  48. Dict. Matching & string matching • How fast can you do it, if you only had one word of length m? • Trivial algorithm O(nm) time • Pre-processing O(m), Search O(n) time. • Dictionary matching • Trivial algorithm (l1+l2+l3…)n • Using a keyword tree, lpn (lp is the length of the longest pattern) • Aho-Corasick: O(n) after preprocessing O(l1+l2..) • We will consider the most general case CSE 182

  49. Direct Algorithm P O P O P O T A S T P O T A T O P O T A T O P O T A T O P O T A T O P O T A T O P O T A T O Observations: • When we mismatch, we (should) know something about where the next match will be. • When there is a mismatch, we (should) know something about other patterns in the dictionary as well. CSE 182

  50. O A P M S T T T O T S I U A E The Trie Automaton • Construct an automaton A from the dictionary • A[v,x] describes the transition from node v to a node w upon reading x. • A[u,’T’] = v, and A[u,’S’] = w • Special root node r • Some nodes are terminal, and labeled with the index of the dictionary word. 1:POTATO 2:POTASSIUM 3:TASTE u v 1 r S 2 w 3 CSE 182

More Related