CSE182-L6

CSE182-L6 Dictionary matching Pattern matching CSE182

Today, we might look at regular expressions
• In Assignment 1, you were asked to look for all mouse sequences.
• One way is to make a Perl regular expression out of all possibilities.
• MATCH [Mm]us OR [Mm]ouse but DO NOT match [Ll][Ii][Kk][Ee]
• How can we do these searches? Are they relevant to bioinformatics? Stay tuned.
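A minimal sketch of the search described on this slide, written with Python's `re` module rather than Perl. The word-boundary anchors (`\b`), the helper name `is_mouse_header`, and the sample header lines are assumptions made for illustration; they are not part of the assignment.

```python
import re

# Accept "mus"/"Mus" or "mouse"/"Mouse" as a whole word (the \b anchors are an assumption).
SPECIES = re.compile(r"\b[Mm]us\b|\b[Mm]ouse\b")
# Reject any line containing "like" in any mix of upper and lower case.
EXCLUDE = re.compile(r"[Ll][Ii][Kk][Ee]")

def is_mouse_header(line: str) -> bool:
    """True if the line names mouse and does not contain 'like'."""
    return bool(SPECIES.search(line)) and not EXCLUDE.search(line)

# Hypothetical header lines, invented for this example.
headers = [
    ">seq1 Mus musculus hemoglobin",
    ">seq2 mouse-LIKE protein",
    ">seq3 house mouse actin",
    ">seq4 Homo sapiens insulin",
]
mouse_headers = [h for h in headers if is_mouse_header(h)]
```

Using two expressions, one to accept and one to reject, keeps each pattern readable; a single expression with a negative lookahead would also work.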

An O(n) algorithm for keyword matching
• Start with the first position in the db, and the root node.
• If successful transition
  • Increment ‘current’ pointer
  • Move to a new node
  • If terminal node, “success”
• Else (if at root)
  • Increment ‘current’ pointer
  • Move ‘start’ pointer
  • Move to root
• Else
  • Move ‘start’ pointer forward
  • Move to failure node

Failure function
• Every node v corresponds to a string s_v that is a prefix of some pattern.
• Define F[v] to be the node u such that s_u is the longest proper suffix of s_v.
• If we fail to match at v, we should jump to F[v], and commence matching from there.
• Let lp[v] = |s_u|
[Figure: keyword trie with nodes n1 to n10; v marks the current node]
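One way to compute the failure function is a breadth-first pass over the keyword trie. The sketch below is an illustration under stated assumptions: the pattern set (POTATO, POTASSIUM, TASTE) is a guess chosen to resemble the figure, nodes are numbered in insertion order rather than as n1 to n10, and the function names are hypothetical.

```python
from collections import deque

def build_trie(patterns):
    """Return (children, depth); children[v] maps a letter to a child node id."""
    children, depth = [{}], [0]          # node 0 is the root (empty string)
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in children[v]:
                children.append({})
                depth.append(depth[v] + 1)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
    return children, depth

def failure_links(children):
    """F[v] = node whose string is the longest proper suffix of s_v in the trie."""
    F = [0] * len(children)
    queue = deque(children[0].values())  # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            u = F[v]
            # Walk up failure links until some node can be extended by ch.
            while u and ch not in children[u]:
                u = F[u]
            F[w] = children[u].get(ch, 0)
            queue.append(w)
    return F

def node_of(children, s):
    """Follow the letters of s from the root and return the node reached."""
    v = 0
    for ch in s:
        v = children[v][ch]
    return v

children, depth = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F = failure_links(children)
lp = [depth[F[v]] for v in range(len(children))]   # lp[v] = |s_u| with u = F[v]
```

With these assumed patterns, the node for POTAS fails to the node for TAS, so its lp value is 3.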

Illustration
• What is F(n10)?
• What is F(n5)?
• F(n3)?
• lp(n10)?
[Figure: keyword trie with nodes n1 to n10; v marks the current node]

Illustration (l = 1, c = 1)
• Text: P O T A S T P O T A T O
[Figure: keyword trie with the current node v marked]

Illustration (l = 1, c = 6)
• Text: P O T A S T P O T A T O
[Figure: keyword trie with the current node v marked]

Illustration (l = 3, c = 7)
• Text: P O T A S T P O T A T O
[Figure: keyword trie with the current node v marked]

Illustration (l = 7, c = 7)
• Text: P O T A S T P O T A T O
[Figure: keyword trie with the current node v marked]

Time analysis
• In each step, either c is incremented, or l is incremented.
• Neither pointer is ever decremented (lp[v] < c-l).
• l and c do not exceed n.
• Total time <= 2n
[Figure: text P O T A S T P O T A T O with pointers l and c]
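The scan and its step count can be sketched as follows. This is an illustration, not the lecture's own code: the pattern set is assumed, pointers are 0-based, and the per-node `out` lists (which also report patterns ending as suffixes of the current node) are an addition that the slides do not spell out.

```python
from collections import deque

def build_automaton(patterns):
    """Keyword trie with failure links F, node depths, and per-node match lists."""
    children, depth, out = [{}], [0], [[]]
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in children[v]:
                children.append({})
                depth.append(depth[v] + 1)
                out.append([])
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        out[v].append(p)                       # terminal node for pattern p
    F = [0] * len(children)
    queue = deque(children[0].values())
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            u = F[v]
            while u and ch not in children[u]:
                u = F[u]
            F[w] = children[u].get(ch, 0)
            out[w] = out[w] + out[F[w]]        # patterns that are suffixes of s_w
            queue.append(w)
    return children, depth, F, out

def scan(text, patterns):
    """Return (hits, steps); hits are (start_index, pattern) with 0-based starts."""
    children, depth, F, out = build_automaton(patterns)
    hits, steps = [], 0
    l = c = 0          # 'start' and 'current' pointers
    v = 0              # current node, starting at the root
    while c < len(text):
        steps += 1
        if text[c] in children[v]:             # successful transition: c moves
            v = children[v][text[c]]
            c += 1
            hits += [(c - len(p), p) for p in out[v]]
        elif v == 0:                           # failed at the root: c and l move
            c += 1
            l = c
        else:                                  # failed elsewhere: l moves
            v = F[v]
            l = c - depth[v]                   # depth of F[v] is lp of the old node
    return hits, steps

text = "POTASTPOTATO"
hits, steps = scan(text, ["POTATO", "POTASSIUM", "TASTE"])
```

Every loop iteration advances c or l, which is why `steps` cannot exceed twice the text length.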

Blast: Putting it all together
• Input: Query of length m, database of size n
• Select word-size, scoring matrix, gap penalties, E-value cutoff
• Blast

Blast Steps
• Generate an automaton of all query keywords.
• Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.
• Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties.
• For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
• Output results.
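The first two steps can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it uses a hash-table lookup per database window instead of an automaton, matches query words exactly (no neighborhood words scored with the matrix), and stops before extension and E-value computation. The function names and the toy sequences are hypothetical.

```python
def query_words(query: str, w: int) -> dict:
    """Map each length-w word of the query to the offsets where it occurs."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words

def find_hits(query: str, database: str, w: int) -> list:
    """Return (query_offset, db_offset) pairs where a query word occurs exactly."""
    words = query_words(query, w)
    hits = []
    for j in range(len(database) - w + 1):
        for i in words.get(database[j:j + w], []):
            hits.append((i, j))
    return hits

# Toy sequences invented for this example.
seed_hits = find_hits("ACGTAC", "TTACGTTAC", 3)
```

Each hit would then be handed to the extension step; hits that share the same diagonal (db_offset minus query_offset) are natural candidates to extend together.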

BLAST output
• Look up Blast Results with RID
• HA5YXH5C012

Distant hits

Protein Sequence Analysis
• What can you do if BLAST does not return a hit?
• Sometimes, homology (evolutionary relatedness) exists at very low levels of sequence similarity.
• A: Accept hits at higher E-value.
• This increases the probability that the sequence similarity is a chance event.
• How can we get around this paradox?
• Reformulated Q: Suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?
[Figure: sequences A, B, and C]

Silly Quiz
[Figure: skin patterns, facial features]

Not all features (residues) are important
[Figure: skin patterns, facial features]

Diverged family members provide key features

Protein sequence motifs
• Premise:
  • The sequence of a protein gives clues about its structure and function.
  • Not all residues are equally important in determining function.
• Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
• How can we identify these key residues?
[Figure: Fam(B), A, and C]

Prosite
• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, The PROSITE database, its status in 1999

Basic idea
• It is a heuristic approach. Start with the following:
  • A collection of sequences with the same function.
  • Region/residues known to be significant for maintaining structure and function.
• Develop a pattern of conserved residues around the residues of interest.
• Iterate for appropriate sensitivity and specificity.

EX: Zinc Finger domain

Proteins containing zf domains
• How can we find a motif corresponding to a zf domain?

From alignment to regular expressions
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Pattern: ATH-[DE]
• Search Swissprot with the resulting pattern
• Refine pattern to eliminate false positives
• Iterate
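Deriving a pattern from alignment columns can be sketched as below. The function name and the hand-picked column window (0-based columns 5 to 8) are assumptions for illustration, and the output separates every position with a hyphen, PROSITE-style, so it reads A-T-H-[DE] rather than ATH-[DE].

```python
def column_pattern(alignment, start, end):
    """Build a PROSITE-style pattern from alignment columns start..end-1."""
    elements = []
    for col in range(start, end):
        residues = sorted({seq[col] for seq in alignment})
        if len(residues) == 1:
            elements.append(residues[0])                   # fully conserved column
        else:
            elements.append("[" + "".join(residues) + "]")  # set of observed residues
    return "-".join(elements)

alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
pattern = column_pattern(alignment, 5, 9)
```

A pattern that lists every observed residue tends to be too permissive outside the conserved region, which is why the slide's refine-and-iterate loop matters.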

The sequence analysis perspective
• Zinc Finger motif
  • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  • 2 conserved C, and 2 conserved H
• How can we search a database using these motifs?
  • The motif is described using a regular expression. What is a regular expression?
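One way to search with such a motif is to translate it into the syntax of a general regular expression engine. The converter below is a sketch that handles only the elements seen in this motif plus `{...}` exclusions; it does not handle anchors or other PROSITE features. The function name and the synthetic test sequence are hypothetical.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, then optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regular expression syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, lo, hi = m.groups()
        if residue == "x":
            parts.append(".")                          # any residue
        elif residue.startswith("{"):
            parts.append("[^" + residue[1:-1] + "]")   # forbidden residues
        else:
            parts.append(residue)                      # literal or [allowed set]
        if lo is not None:
            parts.append("{" + lo + ("," + hi if hi else "") + "}")
    return "".join(parts)

zf = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
# Synthetic sequence built to contain one occurrence starting at index 2.
seq = "AAC" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "AA"
match = re.search(zf, seq)
```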

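Regular Expressions
• Concise representation of a set of strings over alphabet Σ.
• Described by a string over Σ ∪ {ε, +, ·, *, (, )}
• R is a r.e. if and only if
  • R = {ε}, or
  • R = {σ}, σ ∈ Σ, or
  • R = R1 + R2, R = R1 · R2, or R = R1*, where R1 and R2 are r.e.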

Regular Expression
• Q: Let Σ = {A,C,E}
  • Is (A+C)*EEC* a regular expression?
  • *(A+C)?
  • AC*..E?
• Q: When is a string s in a regular expression?
  • R = (A+C)*EEC*
  • Is CEEC in R?
  • AEC?
  • ACEE?
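The membership questions can be checked mechanically. A small sketch with Python's `re` module follows; note that the slide's union operator `+` is written `|` in Python, and `fullmatch` is used because the whole string must be in R, not just a substring.

```python
import re

# (A+C)*EEC* in the slide's notation; "+" (union) becomes "|" in Python.
R = re.compile(r"(A|C)*EEC*")

answers = {s: bool(R.fullmatch(s)) for s in ["CEEC", "AEC", "ACEE"]}
```

Here CEEC and ACEE are in R, while AEC is not, because every string in R contains two consecutive E's.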

Regular Expression & Automata
• Every R.E can be expressed by an automaton (a directed graph) with the following properties:
  • The automaton has a start and end node
  • Each edge is labeled with a symbol from Σ, or ε
• Suppose R is described by automaton A
• s ∈ R if and only if there is a path from start to end in A, labeled with s.

Examples: Regular Expression & Automata
• (A+C)*EEC*
[Figure: automaton for (A+C)*EEC* with start and end nodes; edges labeled A, C, E, E, C]
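The path condition can be tested by tracking the set of nodes reachable after each symbol. The automaton below is a hand-written guess at the one in the figure: loops labeled A and C at the start node, two E edges, and a C loop at the end node. The intermediate node name `mid` is hypothetical.

```python
# Edge table: (node, symbol) -> set of nodes reachable by one edge with that label.
EDGES = {
    ("start", "A"): {"start"},
    ("start", "C"): {"start"},
    ("start", "E"): {"mid"},
    ("mid", "E"): {"end"},
    ("end", "C"): {"end"},
}

def accepts(s: str) -> bool:
    """True if some path from start to end is labeled with s."""
    current = {"start"}
    for ch in s:
        # Follow every edge labeled ch out of every node we could be in.
        current = set().union(*(EDGES.get((q, ch), set()) for q in current))
    return "end" in current

results = [accepts(s) for s in ["CEEC", "AEC", "ACEE"]]
```

This particular automaton needs no ε-edges, so a plain edge table is enough.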

Constructing automata from R.E
• R = {ε}
• R = {σ}, σ ∈ Σ
• R = R1 + R2
• R = R1 · R2
• R = R1*
[Figure: automaton for each case, joined with ε-labeled edges]
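The five cases above can be turned into code by gluing smaller automata together with ε-edges. The sketch below follows the standard textbook construction and may differ in detail from the figure; the class and function names are hypothetical, and ε is represented by `None`.

```python
from itertools import count

EPS = None                      # label used for ε-edges
_ids = count()                  # source of fresh node numbers

class Fragment:
    """Automaton piece with one start node, one end node, and labeled edges."""
    def __init__(self, start, end, edges):
        self.start, self.end, self.edges = start, end, edges

def symbol(label):
    """R = {σ}; pass EPS for R = {ε}."""
    s, e = next(_ids), next(_ids)
    return Fragment(s, e, [(s, label, e)])

def union(a, b):
    """R = R1 + R2: branch into either piece with ε-edges."""
    s, e = next(_ids), next(_ids)
    extra = [(s, EPS, a.start), (s, EPS, b.start), (a.end, EPS, e), (b.end, EPS, e)]
    return Fragment(s, e, a.edges + b.edges + extra)

def concat(a, b):
    """R = R1 · R2: link the end of R1 to the start of R2."""
    return Fragment(a.start, b.end, a.edges + b.edges + [(a.end, EPS, b.start)])

def star(a):
    """R = R1*: allow skipping R1 or repeating it."""
    s, e = next(_ids), next(_ids)
    extra = [(s, EPS, a.start), (a.end, EPS, e), (s, EPS, e), (a.end, EPS, a.start)]
    return Fragment(s, e, a.edges + extra)

def closure(nodes, edges):
    """All nodes reachable from `nodes` using ε-edges only."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        u = stack.pop()
        for p, label, q in edges:
            if p == u and label is EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def accepts(frag, s):
    """True if a start-to-end path in the automaton is labeled with s."""
    current = closure({frag.start}, frag.edges)
    for ch in s:
        step = {q for p, label, q in frag.edges if p in current and label == ch}
        current = closure(step, frag.edges)
    return frag.end in current

# (A+C)*EEC* assembled from the five building blocks.
R = concat(concat(concat(star(union(symbol("A"), symbol("C"))), symbol("E")),
                  symbol("E")), star(symbol("C")))
```

Each building block adds at most two nodes and four edges, so the automaton's size grows linearly with the length of the expression.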

End of L6

Protein structure basics

Side chains determine amino-acid type
• The residues may have different properties.
• Aspartic acid (D) and glutamic acid (E) are acidic residues.

Bond angles form structural constraints

Various constraints determine 3D structure
• Constraints
  • Structural constraints due to physicochemical properties
  • Constraints due to bond angles
  • H-bond formation
• Surprisingly, a few conformations are seen over and over again.

Alpha-helix
• 3.6 residues per turn
• H-bonds between residue i and residue i+4 stabilize the structure.
• First discovered by Linus Pauling

Beta-sheet
• Each strand by itself has 2 residues per turn, and is not stable.
• Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel.
• Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

Domains
• The basic structures (helix, strand, loop) combine to form complex 3D structures.
• Certain combinations are popular. Many sequences, but only a few folds.

3D structure
• Predicting tertiary structure is an important problem in Bioinformatics.
• Premise: Clues to structure can be found in the sequence.
• While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals.
• The PDB database is a compendium of structures.

Searching structure databases
• Threading and other 3D alignments can be used to align structures.
• Database filtering is possible through geometric hashing.

Trivia Quiz
• What research won the Nobel prize in Chemistry in 2004?
• In 2002?
