
CSE182-L5: Scoring matrices Dictionary Matching


rivka

Presentation Transcript


  1. CSE182-L5: Scoring matrices Dictionary Matching CSE 182

  2. Scoring DNA • DNA has structure.

  3. DNA scoring matrices • So far, we considered a simple match/mismatch criterion. • The nucleotides can be grouped into purines (A, G) and pyrimidines (C, T). • Nucleotide substitutions within a group (transitions) are more likely than those across groups (transversions).
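
The transition/transversion distinction above can be turned into a scoring function. The following is a minimal sketch; the score values (+1, -1, -2) and the names `dna_score` and `MATRIX` are illustrative assumptions, not values from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair (score values are illustrative)."""
    if a == b:
        return match
    # A substitution within one group is a transition, otherwise a transversion.
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_group else transversion


# The full 4x4 scoring matrix, keyed by nucleotide pairs.
MATRIX = {(a, b): dna_score(a, b) for a in "ACGT" for b in "ACGT"}
```

With these assumed values, A/G is penalized less than A/C, which is the only property the slide asks for.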

  4. Scoring proteins • Scoring protein sequence alignments is a much more complex task than scoring DNA. • Not all substitutions are equal. • The problem was first worked on by Pauling and collaborators. • In the 1970s, Margaret Dayhoff created the first similarity matrices. • “One size does not fit all.” • Homologous proteins that are evolutionarily close should be scored differently from proteins that are evolutionarily distant. • Different proteins might evolve at different rates, and we need to normalize for that.

  5. PAM 1 distance • Two sequences are 1 PAM apart if they differ in 1% of the residues (1% mismatch). • PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart]

  6. PAM1 matrix • Align many proteins that are very similar. • Is this a problem? • The PAM1 matrix gives the probability of each substitution when 1% of the residues have changed. • Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b. • $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$
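
The log-odds score S(a,b) can be estimated from counts of aligned residue pairs. The sketch below assumes the input is a list of (a, b) pairs taken from alignment columns; the function name `log_odds_scores` and the toy counts in the usage note are hypothetical.

```python
import math
from collections import Counter


def log_odds_scores(pairs):
    """Return S(a,b) = log10(P_ab / (P_a * P_b)) for every observed pair."""
    pair_counts = Counter(pairs)
    total = sum(pair_counts.values())
    p_a = Counter()  # marginal frequency of a (first sequence)
    p_b = Counter()  # marginal frequency of b (second sequence)
    for (a, b), count in pair_counts.items():
        p_a[a] += count / total
        p_b[b] += count / total
    return {
        (a, b): math.log10((count / total) / (p_a[a] * p_b[b]))
        for (a, b), count in pair_counts.items()
    }
```

For example, with 45 A/A columns, 45 V/V columns, and 5 each of A/V and V/A, the A/A score is positive and the A/V score is negative: identities occur more often than chance predicts, substitutions less often.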

  7. PAM 1

  8. PAM distance • Two sequences are 1 PAM apart when they differ in 1% of the residues. • When are 2 sequences 2 PAMs apart? • [Figure: two successive 1 PAM steps, labeled 2 PAM in total]

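  9. Higher PAMs • $\mathrm{PAM2}(a,b) = \sum_c \mathrm{PAM1}(a,c)\,\mathrm{PAM1}(c,b)$ • $\mathrm{PAM2} = \mathrm{PAM1} \cdot \mathrm{PAM1}$ (matrix multiplication) • $\mathrm{PAM250} = \mathrm{PAM1} \cdot \mathrm{PAM249} = (\mathrm{PAM1})^{250}$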
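
The matrix-power definition of higher PAMs can be sketched in a few lines. The example uses a toy two-letter alphabet with a 1% substitution probability, which is an assumption for illustration; the real PAM1 matrix is 20x20 and its entries are not reproduced here.

```python
def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy PAM1: each residue changes to the other with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
TOY_PAM2 = mat_pow(TOY_PAM1, 2)
TOY_PAM250 = mat_pow(TOY_PAM1, 250)
```

In this toy model the off-diagonal entry of `TOY_PAM250` approaches 0.5: after many PAM units a residue is almost as likely to have changed as not, even though each single step changes only 1% of residues.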

  10. PAM250 based scoring matrix • $S_{250}(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_b}$

  11. Scoring using PAM matrices • Suppose we know that two sequences are 250 PAMs apart. • $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_b}$ • How does it help? • $S_{250}(A,V) \gg S_1(A,V)$ • Scoring human vs. Drosophila should use a higher PAM matrix than scoring human vs. mouse. • An alignment with a smaller % identity could still have a higher score and be more significant. • [Figure labels: hum, mus, dros]

  12. BLOSUM series of Matrices • Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions. • A more direct method, based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database. • BLOSUM60: merge all proteins that have greater than 60% identity. Then, compute the substitution probability. • In practice BLOSUM62 seems to work very well.

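  13. PAM vs. BLOSUM • What is the correspondence? • PAM series: PAM1, PAM2, PAM250 • BLOSUM series: Blosum1, Blosum2, Blosum62, Blosum100 • Note that the two series run in opposite directions: a higher PAM number is meant for more divergent sequences, while a higher BLOSUM number is meant for more similar sequences.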

  14. The last step in Blast • We have discussed • Alignments • Db filtering using keywords • E-values and P-values • Scoring matrices • The last step: Database filtering requires us to scan a large sequence fast for matching keywords

  15. Dictionary Matching, R.E. matching, and position specific scoring

  16. Keyword search • Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to a keyword. • Question: Given a collection of strings (keywords), find all occurrences in a database string where a keyword matches.

  17. Dictionary Matching • Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE • Database: POTASTPOTATO • Q: Given k words (s_i has length l_i), and a database of size n, find all matches to these words in the database string. • How fast can this be done?

  18. Dict. Matching & string matching • How fast can you do it, if you only had one word of length m? • Trivial algorithm: O(nm) time • Better: pre-processing O(m), search O(n) time. • Dictionary matching • Trivial algorithm: O((l1+l2+l3…)n) • Using a keyword tree: O(lp n) (lp is the length of the longest pattern) • Aho-Corasick: O(n) after preprocessing O(l1+l2..) • We will consider the most general case
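
As a baseline, the trivial dictionary-matching algorithm tries every word at every database position, which gives the O((l1+l2+…)n) bound above. This is a minimal sketch; the function name `naive_find_all` and the (start, word index) output format are assumptions.

```python
def naive_find_all(database, dictionary):
    """Return (start, word_index) for every exact match; word_index is 1-based."""
    hits = []
    for start in range(len(database)):
        for index, word in enumerate(dictionary, start=1):
            # Compare the word against the database letter by letter (O(l_i) work).
            if database[start:start + len(word)] == word:
                hits.append((start, index))
    return hits


WORDS = ["POTATO", "POTASSIUM", "TASTE"]
HITS = naive_find_all("POTASTPOTATO", WORDS)
```

On the lecture's example database only POTATO matches, starting at 0-based position 6.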

  19. Direct Algorithm • [Figure: the pattern POTATO slid along the database POTASTPOTATO, one start position at a time] • Observations: • When we mismatch, we (should) know something about where the next match will be. • When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

  20. The Trie Automaton • Construct an automaton A from the dictionary • A[v,x] describes the transition from node v to a node w upon reading x. • A[u,’T’] = v, and A[u,’S’] = w • Special root node r • Some nodes are terminal, and labeled with the index of the dictionary word. • Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE • [Figure: trie of the three words with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]

  21. An O(lp n) algorithm for keyword matching • Start with the first position in the db, and the root node. • If successful transition • Increment ‘current’ pointer • Move to a new node • If terminal node, “success” • Else • Retract ‘current’ pointer • Increment ‘start’ pointer • Move to root & repeat
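
The trie automaton and the O(lp n) scan can be sketched together as follows. The representation (a list of per-node transition dictionaries, with node 0 as the root r) and the names `build_trie` and `trie_search` are assumptions made for this illustration.

```python
def build_trie(words):
    """Return (transitions, terminal) for the keyword trie; node 0 is the root."""
    transitions = [{}]  # transitions[v][x] plays the role of A[v, x]
    terminal = {}       # terminal node -> 1-based index of the dictionary word
    for index, word in enumerate(words, start=1):
        node = 0
        for letter in word:
            if letter not in transitions[node]:
                transitions.append({})
                transitions[node][letter] = len(transitions) - 1
            node = transitions[node][letter]
        terminal[node] = index
    return transitions, terminal


def trie_search(database, words):
    """Restart from the root at every start position: O(lp * n) time."""
    transitions, terminal = build_trie(words)
    hits = []
    for start in range(len(database)):
        node, current = 0, start
        # Follow successful transitions, advancing the 'current' pointer.
        while current < len(database) and database[current] in transitions[node]:
            node = transitions[node][database[current]]
            current += 1
            if node in terminal:
                hits.append((start, terminal[node]))
        # On failure, 'current' is retracted by the next loop iteration.
    return hits


WORDS = ["POTATO", "POTASSIUM", "TASTE"]
```

Each start position costs at most lp transitions, because no path in the trie is longer than the longest pattern.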

  22. Illustration • [Figure: the keyword trie scanned against the database POTASTPOTATO, with pointers l and c and current node v]

  23. Idea for improving the time • Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match, then prefix(pattern j) = suffix(first c-l characters of pattern i). • Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE • [Figure: database POTASTPOTATO with pointers l and c; pattern i = POTASSIUM is partially matched, and pattern j = TASTE begins at a suffix of the matched part]

  24. Improving speed of dictionary matching • Every node v corresponds to a string s_v that is a prefix of some pattern. • Define F[v] to be the node u such that s_u is the longest proper suffix of s_v. • If we fail to match at v, we should jump to F[v], and commence matching from there. • Let lp[v] = |s_u| • [Figure: keyword trie with numbered nodes and failure links]

  25. An O(n) algorithm for keyword matching • Start with the first position in the db, and the root node. • If successful transition • Increment ‘current’ pointer • Move to a new node • If terminal node, “success” • Else (if at root) • Increment ‘current’ pointer • Move ‘start’ pointer • Move to root • Else • Move ‘start’ pointer forward • Move to failure node
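
The failure-link idea can be sketched as a compact Aho-Corasick matcher. This version does not track the ‘start’ pointer explicitly; it recovers each match start from the pattern length. The names `build_automaton`, `goto`, `fail`, and `out` are assumptions of this sketch rather than the lecture's notation (`fail` plays the role of F).

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links F[v], and per-node match lists."""
    goto, fail, out = [{}], [0], [[]]
    for index, word in enumerate(words, start=1):
        node = 0
        for letter in word:
            if letter not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][letter] = len(goto) - 1
            node = goto[node][letter]
        out[node].append(index)
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:  # breadth-first, so shallower failure targets are ready
        v = queue.popleft()
        for letter, child in goto[v].items():
            f = fail[v]
            while f and letter not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(letter, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def aho_corasick_search(database, words):
    """Return (start, word_index) for every match in one left-to-right pass."""
    goto, fail, out = build_automaton(words)
    hits, node = [], 0
    for current, letter in enumerate(database):
        while node and letter not in goto[node]:
            node = fail[node]  # jump to the failure node instead of restarting
        node = goto[node].get(letter, 0)
        for index in out[node]:
            hits.append((current - len(words[index - 1]) + 1, index))
    return hits


WORDS = ["POTATO", "POTASSIUM", "TASTE"]
```

On POTASTE the scan follows POTAS, fails on T, jumps to the node for TAS, and completes TASTE without ever moving backwards in the database.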

  26. Illustration • [Figure: database POTASTPOTATO with pointers l and c, scanned against the keyword trie with failure links; current node v]

  27. Time analysis • In each step, either c is incremented, or l is incremented. • Neither pointer is ever decremented (lp[v] < c-l). • l and c do not exceed n. • Total time <= 2n • [Figure: database POTASTPOTATO with pointers l and c]

  28. Blast: Putting it all together • Input: Query of length m, database of size n • Select word-size, scoring matrix, gap penalties, E-value cutoff

  29. Blast Steps • Generate an automaton of all query keywords. • Scan the database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits. • Extend each hit using a variant of the “local alignment” algorithm. Use the scoring matrix and gap penalties. • For each alignment with score S, compute the bit-score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached. • Output results.
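
The first two steps (keyword generation and hit finding) can be sketched as below. This is a simplification under stated assumptions: keywords are just the exact words of length `word_size` in the query, and the scan uses a hash lookup on fixed-size windows rather than an automaton. The names `query_keywords` and `find_hits` are hypothetical, and no extension or scoring is performed.

```python
def query_keywords(query, word_size):
    """Map each word of length word_size in the query to its query offsets."""
    keywords = {}
    for offset in range(len(query) - word_size + 1):
        word = query[offset:offset + word_size]
        keywords.setdefault(word, []).append(offset)
    return keywords


def find_hits(query, database, word_size=3):
    """Return (query_offset, db_offset) for every exact keyword match."""
    keywords = query_keywords(query, word_size)
    hits = []
    for db_pos in range(len(database) - word_size + 1):
        word = database[db_pos:db_pos + word_size]
        for q_pos in keywords.get(word, []):
            hits.append((q_pos, db_pos))
    return hits


HITS = find_hits("POTATO", "POTASTPOTATO", word_size=3)
```

Hits that share the same diagonal (db offset minus query offset), such as the four hits at database positions 6 to 9 here, are the natural seeds for the extension step.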

  30. Protein Sequence Analysis • What can you do if BLAST does not return a hit? • Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity. • A: Accept hits at a higher P-value. • This increases the probability that the sequence similarity is a chance event. • How can we get around this paradox? • Reformulated Q: suppose two sequences B and C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are? If not, how can we distinguish?

  31. Silly Quiz

  32. Silly Quiz

  33. Protein sequence motifs • Premise: • The sequence of a protein gives clues about its structure and function. • Not all residues are equally important in determining function. • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. • How can we identify these key residues?

  34. Prosite • “In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.” (Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, “The PROSITE database, its status in 1999”)

  35. Basic idea • It is a heuristic approach. Start with the following: • A collection of sequences with the same function. • Region/residues known to be significant for maintaining structure and function. • Develop a pattern of conserved residues around the residues of interest • Iterate for appropriate sensitivity and specificity

  36. EX: Zinc Finger domain

  37. Proteins containing zf domains • How can we find a motif corresponding to a zf domain?

  38. From alignment to regular expressions • Aligned sequences: ALRDFATHDDF, SMTAEATHDSI, ECDQAATHEAS • Conserved pattern: ATH-[DE] • Search Swissprot with the resulting pattern • Refine pattern to eliminate false positives • Iterate
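
The pattern ATH-[DE] derived from the alignment can be checked against the three sequences with Python's `re` module. This is a small sketch; the helper name `motif_positions` is an assumption, and the pattern is written in Python regex syntax as `ATH[DE]`.

```python
import re

ALIGNED = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
PATTERN = re.compile(r"ATH[DE]")  # A, T, H, then either D or E


def motif_positions(sequence):
    """Return the 0-based start of every occurrence of the motif."""
    return [match.start() for match in PATTERN.finditer(sequence)]


POSITIONS = [motif_positions(sequence) for sequence in ALIGNED]
```

All three aligned sequences contain the motif at the same column, which is what makes it a candidate signature; searching a larger database with the same pattern is where false positives would show up.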

  39. The sequence analysis perspective • Zinc Finger motif • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • 2 conserved C, and 2 conserved H • How can we search a database using these motifs? • The motif is described using a regular expression. What is a regular expression?
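
One way to search with such a motif is to translate it into a regular expression. The sketch below handles only the elements used in the zinc finger pattern (single residues, `x`, bracketed residue sets, and repeat counts); other PROSITE syntax is deliberately rejected. The function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        regex = "." if residue == "x" else residue  # x means any residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
ZINC_FINGER_REGEX = prosite_to_regex(ZINC_FINGER)

# A synthetic sequence built to satisfy the motif (not a real protein).
SYNTHETIC = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
FOUND = re.search(ZINC_FINGER_REGEX, SYNTHETIC) is not None
```

The resulting regex is `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`, which any standard regex engine can scan across a sequence database.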

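  40. Regular Expressions • Concise representation of a set of strings over an alphabet Σ. • Described by a string over Σ together with the operator symbols +, ·, * and parentheses. • R is a r.e. if and only if it has one of the following forms, where R1 and R2 are themselves r.e.s: R = {ε}; R = {σ}, σ ∈ Σ; R = R1 + R2; R = R1 · R2; R = R1*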

  41. Regular Expression • Q: Let Σ = {A,C,E} • Is (A+C)*EEC* a regular expression? • *(A+C)? • AC*..E? • Q: When is a string s in a regular expression? • R = (A+C)*EEC* • Is CEEC in R? • AEC? • ACEE?
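
The membership questions above can be checked mechanically. In Python's `re` syntax the union written `+` in the lecture is written `|`, so R = (A+C)*EEC* becomes `(A|C)*EEC*`; the helper name `in_language` is an assumption of this sketch.

```python
import re

R = re.compile(r"(A|C)*EEC*")  # lecture notation: (A+C)*EEC*


def in_language(s):
    """True if the whole string s belongs to the set described by R."""
    return R.fullmatch(s) is not None


ANSWERS = {s: in_language(s) for s in ["CEEC", "AEC", "ACEE"]}
```

`fullmatch` is used rather than `search` because the question is whether the entire string is in R, not whether some substring is.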

  42. Regular Expression & Automata • Every R.E. can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and end node • Each edge is labeled with a symbol from Σ, or ε • Suppose R is described by automaton A • s ∈ R if and only if there is a path from start to end in A, labeled with s.

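  43. Examples: Regular Expression & Automata • (A+C)*EEC* • [Figure: automaton from start to end with loops labeled A and C at the start, two consecutive edges labeled E, and a loop labeled C at the end]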

  44. Constructing automata from R.E. • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 • R = R1 · R2 • R = R1*
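
The five cases above correspond to five small construction steps, each producing an automaton with one start and one end node. The sketch below is one standard way to do this (a Thompson-style construction), not necessarily the one drawn on the slide; the tuple representation `(start, end, edges)` and all function names are assumptions.

```python
import itertools

EPSILON = ""  # edge label standing for the empty string
_new_node = itertools.count()


def symbol(sigma):
    """R = {sigma}; pass EPSILON to get R = {epsilon}."""
    s, e = next(_new_node), next(_new_node)
    return s, e, [(s, sigma, e)]


def union(r1, r2):
    """R = R1 + R2: a new start and end, joined to both parts by epsilon edges."""
    (s1, e1, edges1), (s2, e2, edges2) = r1, r2
    s, e = next(_new_node), next(_new_node)
    extra = [(s, EPSILON, s1), (s, EPSILON, s2), (e1, EPSILON, e), (e2, EPSILON, e)]
    return s, e, edges1 + edges2 + extra


def concat(r1, r2):
    """R = R1 . R2: an epsilon edge from the end of R1 to the start of R2."""
    (s1, e1, edges1), (s2, e2, edges2) = r1, r2
    return s1, e2, edges1 + edges2 + [(e1, EPSILON, s2)]


def star(r1):
    """R = R1*: allow skipping R1 entirely or repeating it."""
    s1, e1, edges1 = r1
    s, e = next(_new_node), next(_new_node)
    extra = [(s, EPSILON, s1), (e1, EPSILON, s1), (e1, EPSILON, e), (s, EPSILON, e)]
    return s, e, edges1 + extra


def accepts(nfa, text):
    """True if some start-to-end path is labeled with text."""
    start, end, edges = nfa

    def closure(nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for a, label, b in edges:
                if a == u and label == EPSILON and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    current = closure({start})
    for ch in text:
        current = closure({b for a, label, b in edges if a in current and label == ch})
    return end in current


# (A+C)*EEC* assembled from the building blocks.
NFA = concat(concat(concat(star(union(symbol("A"), symbol("C"))),
                           symbol("E")), symbol("E")), star(symbol("C")))
```

Each sub-automaton is used exactly once, so node identifiers never collide; reusing a fragment in two places would require copying it first.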

  45. Regular Expression Matching • Given a database D, and a regular expression R, is a substring of D in R? • Is there a string D[l..c] that is accepted by the automaton of R? • Simpler Q: Is D[1..c] accepted by the automaton of R?

  46. Alg. for matching R.E. • If D[1..c] is accepted by the automaton R_A • There is a path labeled D[1]…D[c] that goes from START to END in R_A • [Figure: path from START to END with edges labeled D[1], D[2], …, D[c]]

  47. Alg. for matching R.E. • If D[1..c] is accepted by the automaton R_A • There is a path labeled D[1]…D[c] that goes from START to END in R_A • There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END • [Figure: path labeled D[1] .. D[c-1] from START to u, followed by an edge labeled D[c]]

  48. D.P. to match regular expression • Define: • A[u,σ] = automaton node reached from u after reading σ • Eps(u) = set of all nodes reachable from node u using epsilon transitions • N[c] = subset of nodes reachable from the START node after reading D[1..c] • Q: when is v ∈ N[c]? • [Figure: edge labeled σ from node u to node v, and the set Eps(u)]

  49. D.P. to match regular expression • Q: when is v ∈ N[c]? • A: If for some u ∈ N[c-1], w = A[u,D[c]], • v ∈ {w} ∪ Eps(w)
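
The recurrence for N[c] can be sketched directly in code. The automaton below is a hand-written one for (A+C)*EEC* with a single epsilon edge added so that Eps has something to do; the node numbering, the tables `A` and `EPS_EDGES`, and the function names are assumptions of this sketch.

```python
START, END = 0, 3

# A[(u, sigma)] = node reached from u after reading sigma.
A = {(0, "A"): 0, (0, "C"): 0, (0, "E"): 1, (1, "E"): 2, (3, "C"): 3}
EPS_EDGES = {2: [3]}  # epsilon edges: node -> list of successors


def eps(u):
    """Eps(u): all nodes reachable from u using only epsilon transitions."""
    reached, stack = set(), [u]
    while stack:
        x = stack.pop()
        for y in EPS_EDGES.get(x, []):
            if y not in reached:
                reached.add(y)
                stack.append(y)
    return reached


def reachable_sets(d):
    """Return [N[0], N[1], ..., N[len(d)]] for the database string d."""
    n = [{START} | eps(START)]
    for sigma in d:
        nxt = set()
        for u in n[-1]:
            w = A.get((u, sigma))
            if w is not None:
                nxt |= {w} | eps(w)  # v is in N[c] iff v is in {w} or Eps(w)
        n.append(nxt)
    return n


def accepted_prefixes(d):
    """All c such that D[1..c] is accepted, i.e. END is in N[c]."""
    return [c for c, nodes in enumerate(reachable_sets(d)) if END in nodes]
```

For the database ACEECC the accepted prefix lengths are 4, 5, and 6. One way to extend this from prefixes to arbitrary substrings D[l..c] is to add START and Eps(START) to every N[c] before processing the next symbol.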

  50. Algorithm
