
CSE182-L6



Presentation Transcript


  1. CSE182-L6 Dictionary matching, Pattern matching CSE182

  2. Today, we might look at regular expressions (R. Expr.) • In Assignment 1, you were asked to look for all mouse sequences. • One way is to build a Perl regular expression out of all the possibilities: • MATCH [Mm]us OR [Mm]ouse, but DO NOT match [Ll][Ii][Kk][Ee] • How can we do these searches? Are they relevant to bioinformatics? Stay tuned.
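A minimal sketch of the mouse filter from slide 2, written in Python rather than Perl. The function name `is_mouse_entry` and the sample description lines are made up for illustration. Note that `[Mm]us` as written on the slide also matches inside longer words such as "muscle"; a stricter version might add word boundaries.

```python
import re

# Patterns taken from the slide: match "mus"/"Mus" or "mouse"/"Mouse",
# but reject any line that contains "like" in any mix of upper/lower case.
MOUSE = re.compile(r"[Mm]us|[Mm]ouse")
LIKE = re.compile(r"[Ll][Ii][Kk][Ee]")

def is_mouse_entry(description):
    """Return True if the description looks like a mouse sequence entry."""
    return MOUSE.search(description) is not None and LIKE.search(description) is None

# Hypothetical description lines, for illustration only.
for line in ["Mus musculus hemoglobin alpha",
             "mouse-like protein, human",
             "Rattus norvegicus actin"]:
    print(line, "->", is_mouse_entry(line))
```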

  3. An O(n) algorithm for keyword matching • Start at the first position in the database, at the root node. • If the transition succeeds: increment the 'current' pointer and move to the new node; if it is a terminal node, report "success". • Else, if at the root: increment the 'current' pointer, move the 'start' pointer along with it, and stay at the root. • Else: move the 'start' pointer forward and move to the failure node.

  4. Failure function • Every node v corresponds to a string s_v that is a prefix of some pattern. • Define F[v] to be the node u such that s_u is the longest proper suffix of s_v that is also a prefix of some pattern. • If we fail to match at v, we jump to F[v] and resume matching from there. • Let lp[v] = |s_u|. [Figure: keyword trie with nodes n1 to n10; v marks the current node]
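The failure function can be computed breadth-first over the trie. The sketch below is one common way to do it, not necessarily the construction used in the lecture. The keyword set (POTATO, POTASSIUM, TASTE) is an assumption read off the edge labels in the figure, and nodes are numbered by insertion order rather than with the slide's n1 to n10 labels.

```python
from collections import deque

def build_failure(keywords):
    """Build the keyword trie and return (children, label, fail, lp).

    Node 0 is the root; label[v] is the string s_v spelled by node v;
    fail[v] is F[v]; lp[v] = len(label[fail[v]]).
    """
    children = [{}]   # children[v][ch] -> child node
    label = [""]
    for word in keywords:
        v = 0
        for ch in word:
            if ch not in children[v]:
                children.append({})
                label.append(label[v] + ch)
                children[v][ch] = len(children) - 1
            v = children[v][ch]

    fail = [0] * len(children)
    queue = deque(children[0].values())   # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for ch, child in children[v].items():
            # Follow failure links from the parent until ch can be matched.
            u = fail[v]
            while u and ch not in children[u]:
                u = fail[u]
            fail[child] = children[u].get(ch, 0)
            queue.append(child)

    lp = [len(label[u]) for u in fail]
    return children, label, fail, lp

# Assumed keyword set, for illustration only.
children, label, fail, lp = build_failure(["POTATO", "POTASSIUM", "TASTE"])
node_of = {s: v for v, s in enumerate(label)}
v = node_of["POTAS"]
print(label[fail[v]], lp[v])   # TAS 3
```

Because nodes are processed in order of depth, F[v] is always known before it is needed for the children of v.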

  5. Illustration • What is F(n10)? • What is F(n5)? • F(n3)? • lp(n10)? [Figure: keyword trie with nodes n1 to n10; v marks the current node]

  6. Illustration • l = 1, c = 1 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  7. Illustration • l = 1, c = 2 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  8. Illustration • l = 1, c = 6 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  9. Illustration • l = 3, c = 6 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  10. Illustration • l = 3, c = 7 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  11. Illustration • l = 7, c = 7 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  12. Illustration • l = 7, c = 8 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  13. Illustration • l = 7, c = 7 • Text: POTASTPOTATO [Figure: keyword trie; v marks the current node]

  14. Time analysis • In each step, either c is incremented or l is incremented. • Neither pointer is ever decremented: since lp[v] < c-l, a failure move strictly advances l. • l and c never exceed n. • Total time <= 2n steps. [Figure: pointers l and c over the text POTASTPOTATO]
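The two-pointer scan and its 2n bound can be checked with a small sketch. To keep it short, a node is represented directly by its string, so the current node is always `text[l:c]`, and the failure link is found by brute force instead of being precomputed, so this version is not itself linear-time. The keyword set is an assumption read off the figure, and indices are 0-based (the slides count from 1). Like the simplified algorithm on the slides, it reports a keyword only when the current node is exactly that keyword.

```python
def scan(text, keywords):
    """Two-pointer keyword scan; returns (hits, steps)."""
    keywords = set(keywords)
    # Every prefix of every keyword is a trie node ("" is the root).
    nodes = {w[:i] for w in keywords for i in range(len(w) + 1)}

    def lp(v):
        # Length of the longest proper suffix of v that is a trie node.
        return next(len(v) - k for k in range(1, len(v) + 1) if v[k:] in nodes)

    l = c = steps = 0
    hits = []
    while c < len(text):
        steps += 1
        v = text[l:c]                     # current node
        if v + text[c] in nodes:          # successful transition
            c += 1
            if text[l:c] in keywords:     # terminal node
                hits.append((l, text[l:c]))
        elif l == c:                      # failed at the root
            c += 1
            l = c
        else:                             # failed elsewhere: follow F[v]
            l = c - lp(v)                 # l strictly increases
    return hits, steps

text = "POTASTPOTATO"
hits, steps = scan(text, ["POTATO", "POTASSIUM", "TASTE"])
print(hits, steps, "<=", 2 * len(text))
```

Every loop iteration advances either c or l, and neither exceeds n = len(text), so `steps` can never exceed 2n.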

  15. Blast: Putting it all together • Input: Query of length m, database of size n • Select word-size, scoring matrix, gap penalties, E-value cutoff • Blast

  16. Blast Steps • Generate an automaton of all query keywords. • Scan the database using a "Dictionary Matching" algorithm (O(n) time). Identify all hits. • Extend each hit using a variant of the "local alignment" algorithm. Use the scoring matrix and gap penalties. • For each alignment with score S, compute the E-value and the P-value. Sort by increasing E-value until the cut-off is reached. • Output results.
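A toy sketch of the seed-and-extend steps, under heavy simplifying assumptions: exact word hits only, a hash table of query words standing in for the keyword automaton, ungapped extension with match/mismatch scores instead of a scoring matrix and gap penalties, and the standard E = K·m·n·e^(−λS) formula with placeholder values for K and λ. All function names and parameter defaults are made up for illustration; this is not the real BLAST implementation.

```python
import math

def query_words(query, w):
    """Map every length-w word of the query to its start positions."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words

def find_hits(query, db, w):
    """One pass over the database; returns (query_pos, db_pos) word hits."""
    words = query_words(query, w)
    return [(qi, di)
            for di in range(len(db) - w + 1)
            for qi in words.get(db[di:di + w], [])]

def extend(query, db, qi, di, w, match=1, mismatch=-1, xdrop=2):
    """Ungapped extension of a word hit in both directions; returns the score."""
    def best_gain(pairs):
        gain = best = 0
        for a, b in pairs:
            gain += match if a == b else mismatch
            best = max(best, gain)
            if best - gain > xdrop:       # give up once the score drops too far
                break
        return best
    right = best_gain(zip(query[qi + w:], db[di + w:]))
    left = best_gain(zip(reversed(query[:qi]), reversed(db[:di])))
    return w * match + left + right

def e_value(score, m, n, K=0.1, lam=0.3):
    """Expected number of chance hits; K and lam are placeholder values."""
    return K * m * n * math.exp(-lam * score)

query, db = "ACGTAC", "TTACGTACTT"
for qi, di in find_hits(query, db, w=3):
    s = extend(query, db, qi, di, w=3)
    print(qi, di, s, e_value(s, len(query), len(db)))
```

In a fuller version, overlapping hits on the same diagonal would be merged before extension and the results sorted by E-value.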

  17. BLAST output • Look up Blast Results with RID • HA5YXH5C012

  18. Distant hits

  19. Protein Sequence Analysis • What can you do if BLAST does not return a hit? • Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity. • A: Accept hits at a higher E-value. • But this increases the probability that the sequence similarity is a chance event. • How can we get around this paradox? • Reformulated Q: suppose two sequences B and C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are? If not, how can we distinguish? [Figure: sequences A, B, and C]

  20. Silly Quiz • Skin patterns • Facial Features

  21. Not all features (residues) are important • Skin patterns • Facial Features

  22. Diverged family members provide key features

  23. Protein sequence motifs • Premise: • The sequence of a protein gives clues about its structure and function. • Not all residues are equally important in determining function. • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. • How can we identify these key residues? [Figure: sequences A and C relative to the family Fam(B)]

  24. Prosite • "In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function." • Source: Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, "The PROSITE database, its status in 1999"

  25. Basic idea • It is a heuristic approach. Start with the following: • A collection of sequences with the same function. • Region/residues known to be significant for maintaining structure and function. • Develop a pattern of conserved residues around the residues of interest • Iterate for appropriate sensitivity and specificity

  26. EX: Zinc Finger domain

  27. Proteins containing zf domains • How can we find a motif corresponding to a zf domain?

  28. From alignment to regular expressions • Alignment: ALRDFATHDDF, SMTAEATHDSI, ECDQAATHEAS • Resulting pattern: ATH-[DE] • Search Swissprot with the resulting pattern • Refine pattern to eliminate false positives • Iterate
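A minimal sketch of how a pattern like the one on slide 28 could be read off a block of aligned columns. The helper name `columns_to_pattern` and the choice of columns 5 to 8 (0-based) are assumptions for illustration; the output uses PROSITE-style dashes between every position, so the slide's ATH-[DE] comes out as A-T-H-[DE].

```python
def columns_to_pattern(seqs, start, end):
    """Summarize aligned columns start..end-1 as a PROSITE-style pattern."""
    parts = []
    for column in zip(*(s[start:end] for s in seqs)):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])                   # fully conserved position
        else:
            parts.append("[" + "".join(residues) + "]")  # alternatives seen here
    return "-".join(parts)

alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
print(columns_to_pattern(alignment, 5, 9))   # A-T-H-[DE]
```

A pattern built this way only reflects the sequences in hand, which is why the slide's refine-and-iterate loop against a database matters.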

  29. The sequence analysis perspective • Zinc Finger motif • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • 2 conserved C, and 2 conserved H • How can we search a database using these motifs? • The motif is described using a regular expression. What is a regular expression?
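One way to search with such a motif is to translate the PROSITE-style pattern into an ordinary regular expression. The sketch below handles only the elements that appear on the slide plus `{...}` exclusions (single residues, `x`, bracketed alternatives, and `(n)` or `(n,m)` repeat counts); anchors and other PROSITE syntax are left out. The function name and the test sequence are made up for illustration.

```python
import re

# One pattern element: x, a residue, [alternatives] or {exclusions},
# optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    pieces = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        unit, lo, hi = m.groups()
        if unit == "x":
            unit = "."                         # any residue
        elif unit.startswith("{"):
            unit = "[^" + unit[1:-1] + "]"     # any residue except these
        if lo:
            unit += "{" + lo + ("," + hi if hi else "") + "}"
        pieces.append(unit)
    return "".join(pieces)

zf = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zf)

# Synthetic sequence constructed to contain the motif (not a real protein).
seq = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(re.search(zf, seq) is not None)
```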

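  30. Regular Expressions • Concise representation of a set of strings over an alphabet Σ. • Described by a string over Σ together with the operator symbols +, · and *. • R is a r.e. if and only if it is one of: R = {ε}; R = {σ} for some σ ∈ Σ; R = R1 + R2 (union); R = R1 · R2 (concatenation); R = R1* (zero or more repetitions), where R1 and R2 are r.e.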

  31. Regular Expression • Q: Let Σ = {A,C,E} • Is (A+C)*EEC* a regular expression? • *(A+C)? • AC*..E? • Q: When is a string s in a regular expression? • R = (A+C)*EEC* • Is CEEC in R? • AEC? • ACEE?
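The membership questions on slide 31 can be checked mechanically. In the sketch below, the slide's + (union) is written as | in Python's `re` syntax, and `fullmatch` is used because the whole string must be in R, not just a substring. The helper name `in_language` is made up for illustration.

```python
import re

# (A+C)*EEC* in the slide's notation; '+' means union, written '|' in Python.
R = re.compile(r"(A|C)*EEC*")

def in_language(s):
    """True if the whole string s belongs to the set described by R."""
    return R.fullmatch(s) is not None

for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_language(s))
```

AEC fails because R requires two consecutive E's after the (A+C)* prefix.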

  32. Regular Expression & Automata • Every R.E. can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and an end node • Each edge is labeled with a symbol from Σ, or with ε • Suppose R is described by automaton A • s ∈ R if and only if there is a path from start to end in A, labeled with s.

  33. Examples: Regular Expression & Automata • (A+C)*EEC* [Figure: automaton from start to end with edges labeled A, C, E, E, C]

  34. Constructing automata from R.E. • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 • R = R1 · R2 • R = R1* [Figure: automaton fragments for each case, joined by ε-edges]
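The five cases translate directly into a recursive construction (often called Thompson's construction). The sketch below is one possible encoding, not necessarily the one drawn on the slide: an automaton is a (start, end, edges) triple, ε-edges carry the label `None`, and all function names are made up for illustration. Acceptance follows the path definition: s is in R if some start-to-end path spells s once ε-edges are skipped.

```python
import itertools

EPS = None                       # label used for ε-edges
_new_node = itertools.count()    # source of fresh node ids

def atom(label):
    """Base cases: R = {ε} (label=EPS) or R = {σ} (label=σ)."""
    s, e = next(_new_node), next(_new_node)
    return s, e, [(s, label, e)]

def union(r1, r2):
    """R = R1 + R2: new start/end nodes, ε-edges into and out of both parts."""
    (s1, e1, d1), (s2, e2, d2) = r1, r2
    s, e = next(_new_node), next(_new_node)
    return s, e, d1 + d2 + [(s, EPS, s1), (s, EPS, s2), (e1, EPS, e), (e2, EPS, e)]

def concat(r1, r2):
    """R = R1 · R2: ε-edge from the end of R1 to the start of R2."""
    (s1, e1, d1), (s2, e2, d2) = r1, r2
    return s1, e2, d1 + d2 + [(e1, EPS, s2)]

def star(r):
    """R = R1*: allow skipping R1 entirely or looping back through it."""
    s1, e1, d1 = r
    s, e = next(_new_node), next(_new_node)
    return s, e, d1 + [(s, EPS, s1), (e1, EPS, s1), (e1, EPS, e), (s, EPS, e)]

def accepts(automaton, text):
    """True if some start-to-end path is labeled with text."""
    start, end, edges = automaton

    def closure(states):
        # Add everything reachable through ε-edges alone.
        states, stack = set(states), list(states)
        while stack:
            u = stack.pop()
            for a, label, b in edges:
                if a == u and label is EPS and b not in states:
                    states.add(b)
                    stack.append(b)
        return states

    current = closure({start})
    for ch in text:
        current = closure({b for a, label, b in edges if a in current and label == ch})
    return end in current

# (A+C)*EEC*
R = concat(concat(concat(star(union(atom("A"), atom("C"))), atom("E")), atom("E")),
           star(atom("C")))
for s in ["CEEC", "AEC", "ACEE"]:
    print(s, accepts(R, s))
```

Tracking the whole set of reachable nodes at once keeps the simulation at one pass over the text, at the cost of scanning the edge list at every step.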

  35. End of L6

  36. Protein structure basics

  37. Side chains determine amino-acid type • The residues may have different properties. • Aspartic acid (D) and glutamic acid (E) are acidic residues.

  38. Bond angles form structural constraints

  39. Various constraints determine 3D structure • Constraints • Structural constraints due to physicochemical properties • Constraints due to bond angles • H-bond formation • Surprisingly, a few conformations are seen over and over again.

  40. Alpha-helix • 3.6 residues per turn • H-bonds between residue i and residue i+4 stabilize the structure. • First discovered by Linus Pauling

  41. Beta-sheet • Each strand by itself has 2 residues per turn, and is not stable. • Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel. • Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

  42. Domains • The basic structures (helix, strand, loop) combine to form complex 3D structures. • Certain combinations are popular. Many sequences, but only a few folds

  43. 3D structure • Predicting tertiary structure is an important problem in bioinformatics. • Premise: Clues to structure can be found in the sequence. • While de novo tertiary structure prediction is hard, there are many intermediate and tractable goals. • The PDB (Protein Data Bank) is a compendium of structures.

  44. Searching structure databases • Threading and other 3D alignments can be used to align structures. • Database filtering is possible through geometric hashing.

  45. Trivia Quiz • What research won the Nobel prize in Chemistry in 2004? • In 2002?

  46. How are Proteins Sequenced? Mass Spec 101:

  47. Nobel Citation 2002

  48. Nobel Citation, 2002

  49. Mass Spectrometry

  50. Enzymatic Digestion (Trypsin) + Fractionation • Sample Preparation
