CSE182-L6. Dictionary matching, Pattern matching. Today, we might look at R. Expr. In Assignment 1, you were asked to look for all mouse sequences. One way is to make a perl regular expression out of all possibilities: MATCH [Mm]us OR [Mm]ouse but DO NOT match [Ll][Ii][Kk][Ee]
CSE182-L6 Dictionary matching Pattern matching CSE182
Today, we might look at R. Expr.
• In Assignment 1, you were asked to look for all mouse sequences.
• One way is to make a perl regular expression out of all possibilities
• MATCH [Mm]us OR [Mm]ouse but DO NOT match [Ll][Ii][Kk][Ee]
• How can we do these searches? Are they relevant to bioinformatics? Stay tuned
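A minimal sketch of the mouse search above, written in Python rather than perl (the character classes read the same in both). It assumes one possible reading of the slide: keep a header if it contains Mus or mouse, and drop it if it contains "like" in any letter case. The function name select_mouse and the sample headers are hypothetical.

```python
import re

# Patterns copied from the slide.
MOUSE = re.compile(r"[Mm]us|[Mm]ouse")
LIKE = re.compile(r"[Ll][Ii][Kk][Ee]")

def select_mouse(headers):
    """Keep headers that mention mouse and never contain 'like' (assumed reading)."""
    return [h for h in headers if MOUSE.search(h) and not LIKE.search(h)]

# Hypothetical sequence headers, for illustration only.
headers = [
    "Mus musculus hemoglobin beta",
    "mouse-like hamster insulin",
    "Homo sapiens insulin",
]
print(select_mouse(headers))
```

Two separate patterns keep the "match" and "do not match" parts readable; folding both into a single expression is possible but harder to check by eye.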
An O(n) algorithm for keyword matching
• Start with the first position in the db, and the root node.
• If successful transition
  • Increment 'current' pointer
  • Move to a new node
  • If terminal node, "success"
• Else (if at root)
  • Increment 'current' pointer
  • Move 'start' pointer
  • Move to root
• Else
  • Move 'start' pointer forward
  • Move to failure node
Failure function
• Every node v corresponds to a string s_v that is a prefix of some pattern.
• Define F[v] to be the node u such that s_u is the longest proper suffix of s_v among the strings of all nodes.
• If we fail to match at v, we should jump to F[v], and commence matching from there
• Let lp[v] = |s_u|
[Figure: keyword trie with root n1 and nodes n2 to n10; the current node is marked v]
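The definition of F[v] and lp[v] can be sketched as a breadth-first pass over the keyword trie. This is a sketch under assumptions: the pattern set POTATO, POTASSIUM, TASTE is a guess reconstructed from the edge labels of the slide figure, nodes are plain integers with the root as node 0, and the helper names build_trie, failure_links and node_of are hypothetical.

```python
from collections import deque

def build_trie(patterns):
    """Return (children, depth): children[v][ch] is a child node, depth[v] = |s_v|."""
    children, depth = [{}], [0]
    for pattern in patterns:
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                depth.append(depth[v] + 1)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
    return children, depth

def failure_links(children):
    """Compute F[v] for every node, shallow nodes first (root is node 0)."""
    F = [0] * len(children)
    queue = deque(children[0].values())   # nodes at depth 1 fail to the root
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            u = F[v]
            # Walk up the failure chain until some node can be extended by ch.
            while u and ch not in children[u]:
                u = F[u]
            F[w] = children[u].get(ch, 0)
            queue.append(w)
    return F

def node_of(children, s):
    """Node reached from the root by spelling out s (s must be a pattern prefix)."""
    v = 0
    for ch in s:
        v = children[v][ch]
    return v

patterns = ["POTATO", "POTASSIUM", "TASTE"]   # assumed pattern set
children, depth = build_trie(patterns)
F = failure_links(children)
lp = [depth[u] for u in F]                    # lp[v] = |s_u| with u = F[v]

v = node_of(children, "POTAS")
print(F[v] == node_of(children, "TAS"), lp[v])
```

Processing nodes in breadth-first order guarantees that F is already known for every shorter string when a node is handled, which is what lets each link be derived from the parent's link.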
Illustration
• What is F(n10)?
• What is F(n5)?
• F(n3)?
• lp(n10)?
[Figure: keyword trie with root n1 and nodes n2 to n10; the current node is marked v]
Illustration
• Text: P O T A S T P O T A T O
• Each slide in this sequence shows the keyword trie with the current node v marked, together with the pointers l and c:
  • l = 1, c = 1
  • l = 1, c = 2
  • l = 1, c = 6
  • l = 3, c = 6
  • l = 3, c = 7
  • l = 7, c = 7
  • l = 7, c = 8
Time analysis
• In each step, either c is incremented, or l is incremented.
• Neither pointer is ever decremented: on a failure, l moves to c - lp[v], and lp[v] < c - l.
• l and c do not exceed n.
• Total time <= 2n
[Figure: pointers l and c over the text P O T A S T P O T A T O]
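The scan and its step bound can be checked with a short sketch. Assumptions: the pattern set POTATO, POTASSIUM, TASTE is reconstructed from the slide figure, pointers are 0-based here (the slides count from 1), and the function names are hypothetical. Like the slide version, it reports a hit only when the current node is terminal, so it assumes no pattern is a proper substring of another.

```python
from collections import deque

def build_automaton(patterns):
    """Keyword trie plus failure links; the root is node 0."""
    children, depth, terminal, F = [{}], [0], [False], [0]
    for pattern in patterns:
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                depth.append(depth[v] + 1)
                terminal.append(False)
                F.append(0)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        terminal[v] = True
    queue = deque(children[0].values())
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            u = F[v]
            while u and ch not in children[u]:
                u = F[u]
            F[w] = children[u].get(ch, 0)
            queue.append(w)
    return children, depth, terminal, F

def scan(text, patterns):
    """Return (hits, steps); each hit is (start index, matched keyword)."""
    children, depth, terminal, F = build_automaton(patterns)
    l = c = v = steps = 0
    hits = []
    while c < len(text):
        steps += 1
        if text[c] in children[v]:      # successful transition: c advances
            v = children[v][text[c]]
            c += 1
            if terminal[v]:
                hits.append((l, text[l:c]))
        elif v == 0:                    # failure at the root: both pointers advance
            c += 1
            l = c
        else:                           # failure elsewhere: only l advances
            v = F[v]
            l = c - depth[v]
    return hits, steps

text = "POTASTPOTATO"
hits, steps = scan(text, ["POTATO", "POTASSIUM", "TASTE"])
print(hits, steps, 2 * len(text))
```

Every loop iteration advances c or l and neither ever moves back, so the step counter cannot exceed 2n for a text of length n.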
Blast: Putting it all together
• Input: Query of length m, database of size n
• Select word-size, scoring matrix, gap penalties, E-value cutoff
• Blast
Blast Steps
• Generate an automaton of all query keywords.
• Scan database using a "Dictionary Matching" algorithm (O(n) time). Identify all hits.
• Extend each hit using a variant of "local alignment" algorithm. Use the scoring matrix and gap penalties.
• For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
• Output results.
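The last scoring step can be sketched as follows. The slide does not give formulas; this sketch assumes the standard Karlin-Altschul form E = K·m·n·exp(-λS) and P = 1 - exp(-E). The default values of K and lam are illustrative placeholders rather than parameters of any particular BLAST run, and the function names and hit list are hypothetical.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)

def report(hits, m, n, cutoff, K=0.134, lam=0.318):
    """Keep hits with E-value at most `cutoff`, sorted by increasing E-value."""
    rows = []
    for name, score in hits:
        e = e_value(score, m, n, K, lam)
        if e <= cutoff:
            rows.append((name, e, p_value(e)))
    return sorted(rows, key=lambda row: row[1])

# Hypothetical (name, score) pairs for a query of length 100 and a database of size 10**6.
hits = [("a", 50), ("b", 80), ("c", 20)]
for name, e, p in report(hits, m=100, n=10**6, cutoff=10.0):
    print(name, e, p)
```

For small E the two numbers nearly coincide, which is why reports often show only the E-value.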
BLAST output • Look up Blast Results with RID • HA5YXH5C012
Distant hits
Protein Sequence Analysis
• What can you do if BLAST does not return a hit?
• Sometimes, homology (evolutionary relatedness) exists at very low levels of sequence similarity.
• A: Accept hits at higher E-value.
• This increases the probability that the sequence similarity is a chance event.
• How can we get around this paradox?
• Reformulated Q: suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?
[Figure: sequences A, B and C]
Silly Quiz • Skin patterns • Facial Features
Not all features (residues) are important • Skin patterns • Facial Features
Protein sequence motifs
• Premise:
  • The sequence of a protein gives clues about its structure and function.
  • Not all residues are equally important in determining function.
• Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
• How can we identify these key residues?
[Figure: sequences A and C shown against the family Fam(B)]
Prosite
• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, "The PROSITE database, its status in 1999"
Basic idea
• It is a heuristic approach. Start with the following:
  • A collection of sequences with the same function.
  • Region/residues known to be significant for maintaining structure and function.
• Develop a pattern of conserved residues around the residues of interest
• Iterate for appropriate sensitivity and specificity
EX: Zinc Finger domain
Proteins containing zf domains • How can we find a motif corresponding to a zf domain?
From alignment to regular expressions
*
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
     ATH-[DE]
• Search Swissprot with the resulting pattern
• Refine pattern to eliminate false positives
• Iterate
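The step from alignment to pattern can be sketched column by column. This is a simplified illustration, not the PROSITE curation procedure: it assumes an ungapped alignment, writes a fully conserved column as its residue, a column with at most two residues as a bracketed class, and anything else as x, then trims the unconstrained ends. The function names and the max_alternatives parameter are hypothetical.

```python
def column_symbol(column, max_alternatives=2):
    """Pattern element for one alignment column."""
    residues = sorted(set(column))
    if len(residues) == 1:
        return residues[0]                       # fully conserved
    if len(residues) <= max_alternatives:
        return "[" + "".join(residues) + "]"     # small set of alternatives
    return "x"                                   # unconstrained position

def alignment_to_pattern(rows, max_alternatives=2):
    """Derive a dash-separated pattern from equal-length aligned sequences."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    symbols = [column_symbol(col, max_alternatives) for col in zip(*rows)]
    # Drop unconstrained positions at both ends of the pattern.
    while symbols and symbols[0] == "x":
        symbols.pop(0)
    while symbols and symbols[-1] == "x":
        symbols.pop()
    return "-".join(symbols)

rows = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
print(alignment_to_pattern(rows))
```

On the three rows from the slide this gives A-T-H-[DE], the slide's ATH-[DE] with a dash between every position. Real pattern refinement also weighs false positives against a database, which this sketch leaves out.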
The sequence analysis perspective
• Zinc Finger motif
  • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  • 2 conserved C, and 2 conserved H
• How can we search a database using these motifs?
• The motif is described using a regular expression. What is a regular expression?
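One way to search with such a motif is to translate it into a regular expression for an existing regex engine. The sketch below handles only the elements that appear in the zinc finger pattern plus {..} exclusions; anchors and other parts of the PROSITE syntax are left out. The function name prosite_to_regex and the test sequence are hypothetical.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, with optional (n) or (n,m).
_ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")

def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.match(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = m.groups()
        if core == "x":
            core = "."                            # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"        # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zinc_finger)

# Hypothetical sequence containing one occurrence of the motif.
sequence = "MKCPECGKAFSRSDELTRHIRTHTG"
match = re.search(zinc_finger, sequence)
print(match.group(0) if match else None)
```

The translation is purely syntactic, so the quality of the hits depends entirely on the pattern itself.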
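Regular Expressions
• Concise representation of a set of strings over alphabet Σ.
• Described by a string over Σ and the symbols ε, +, ·, *, (, )
• R is a r.e. if and only if R = {ε}, or R = {σ} for some σ ∈ Σ, or R = R1 + R2, R = R1 · R2, or R = R1* for r.e. R1, R2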
Regular Expression
• Q: Let Σ = {A,C,E}
  • Is (A+C)*EEC* a regular expression?
  • *(A+C)?
  • AC*..E?
• Q: When is a string s in a regular expression?
  • R = (A+C)*EEC*
  • Is CEEC in R?
  • AEC?
  • ACEE?
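The membership questions can be checked mechanically. In this sketch the slide's + (union) is written as | because Python's re module uses + for "one or more"; fullmatch is used so that the whole string has to be in R. The name in_language is hypothetical.

```python
import re

# (A+C)*EEC* in the slide's notation.
R = re.compile(r"(A|C)*EEC*")

def in_language(s):
    """True if the whole string s belongs to R."""
    return R.fullmatch(s) is not None

for s in ("CEEC", "AEC", "ACEE"):
    print(s, in_language(s))
```

CEEC and ACEE are in R, while AEC is not because it lacks the second E.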
Regular Expression & Automata
• Every R.E can be expressed by an automaton (a directed graph) with the following properties:
  • The automaton has a start and end node
  • Each edge is labeled with a symbol from Σ, or ε
• Suppose R is described by automaton A
  • s ∈ R if and only if there is a path from start to end in A, labeled with s.
Examples: Regular Expression & Automata
• (A+C)*EEC*
[Figure: automaton for (A+C)*EEC* with start and end nodes and edges labeled A, C, E, E, C]
Constructing automata from R.E
• R = {ε}
• R = {σ}, σ ∈ Σ
• R = R1 + R2
• R = R1 · R2
• R = R1*
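The five cases can be turned into a small automaton builder. The slides do not name a construction; this sketch assumes the usual Thompson-style one, with one helper per case, and decides membership by tracking the set of nodes reachable on the input. EPS stands for the ε label, and all class and function names are hypothetical.

```python
EPS = None  # label used for epsilon edges

class NFA:
    def __init__(self):
        self.edges = []   # (source, label, target) triples
        self.count = 0

    def new_node(self):
        self.count += 1
        return self.count - 1

    def add(self, src, label, dst):
        self.edges.append((src, label, dst))

def symbol(nfa, a):
    """R = {a}; pass EPS for R = {epsilon}. Returns (start, end)."""
    s, e = nfa.new_node(), nfa.new_node()
    nfa.add(s, a, e)
    return s, e

def union(nfa, f1, f2):
    """R = R1 + R2."""
    s, e = nfa.new_node(), nfa.new_node()
    for start, end in (f1, f2):
        nfa.add(s, EPS, start)
        nfa.add(end, EPS, e)
    return s, e

def concat(nfa, f1, f2):
    """R = R1 . R2."""
    nfa.add(f1[1], EPS, f2[0])
    return f1[0], f2[1]

def star(nfa, f):
    """R = R1*."""
    s, e = nfa.new_node(), nfa.new_node()
    nfa.add(s, EPS, f[0])
    nfa.add(f[1], EPS, e)
    nfa.add(s, EPS, e)        # zero repetitions
    nfa.add(f[1], EPS, f[0])  # repeat
    return s, e

def closure(nfa, nodes):
    """All nodes reachable from `nodes` using epsilon edges only."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        v = stack.pop()
        for src, label, dst in nfa.edges:
            if src == v and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, fragment, s):
    """True if some start-to-end path is labeled with s."""
    start, end = fragment
    current = closure(nfa, {start})
    for ch in s:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = closure(nfa, moved)
    return end in current

# (A+C)*EEC*
nfa = NFA()
ac_star = star(nfa, union(nfa, symbol(nfa, "A"), symbol(nfa, "C")))
ee = concat(nfa, symbol(nfa, "E"), symbol(nfa, "E"))
c_star = star(nfa, symbol(nfa, "C"))
machine = concat(nfa, concat(nfa, ac_star, ee), c_star)
print([accepts(nfa, machine, s) for s in ("CEEC", "AEC", "ACEE")])
```

Scanning the edge list on every step is slow but keeps the sketch short; an adjacency map per node would be the natural next refinement.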
End of L6
Protein structure basics
Side chains determine amino-acid type • The residues may have different properties. • Aspartic acid (D), and Glutamic Acid (E) are acidic residues
Various constraints determine 3d structure
• Constraints
  • Structural constraints due to physicochemical properties
  • Constraints due to bond angles
  • H-bond formation
• Surprisingly, a few conformations are seen over and over again.
Alpha-helix
• 3.6 residues per turn
• H-bonds between residue i and residue i+4 stabilize the structure.
• First discovered by Linus Pauling
Beta-sheet
• Each strand by itself has 2 residues per turn, and is not stable.
• Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel.
• Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.
Domains
• The basic structures (helix, strand, loop) combine to form complex 3D structures.
• Certain combinations are popular. Many sequences, but only a few folds
3D structure
• Predicting tertiary structure is an important problem in Bioinformatics.
• Premise: Clues to structure can be found in the sequence.
• While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals.
• The PDB database is a compendium of structures
Searching structure databases
• Threading, and other 3d alignments can be used to align structures.
• Database filtering is possible through geometric hashing.
Trivia Quiz • What research won the Nobel prize in Chemistry in 2004? • In 2002?
Nobel Citation 2002
Nobel Citation, 2002
Mass Spectrometry
Enzymatic Digestion (Trypsin) + Fractionation Sample Preparation