What is a motif?

CZ5226: Advanced Bioinformatics
Lecture 4: Motifs and methods for generating motifs
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore

What is a motif? • A motif is a sequence pattern that occurs repeatedly in a group of related DNA or RNA or protein or peptide sequences.

Types of motifs and what they mean • Motifs in protein sequences: structure, function, evolution • Motifs in DNA and RNA sequences: promoters, transcription factor binding sites, splicing signals • Motifs in MHC-binding peptides: anchor residue positions, TCR recognition residues

Motifs in Protein Sequences • The leucine zipper may explain how some eukaryotic gene regulatory proteins work. • L-x(6)-L-x(6)-L-x(6)-L • The leucine side chains extending from one alpha-helix interact with those from a similar alpha helix of a second polypeptide, facilitating dimerization
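As a rough illustration, the leucine zipper pattern L-x(6)-L-x(6)-L-x(6)-L can be written as a Python regular expression. Only the pattern comes from the slide; the function name and the test sequence are made up for this sketch.

```python
import re

# L-x(6)-L-x(6)-L-x(6)-L: a leucine at every seventh position, four times.
# The lookahead (?=...) lets overlapping candidates be reported as well.
LEUCINE_ZIPPER = re.compile(r"(?=L.{6}L.{6}L.{6}L)")

def find_leucine_zippers(protein):
    """Return 0-based start positions of candidate leucine zippers."""
    return [m.start() for m in LEUCINE_ZIPPER.finditer(protein)]

# Hypothetical sequence, built only so that it contains one candidate.
example = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(find_leucine_zippers(example))
```

A match only says the spacing of leucines fits the pattern; it does not show that the region forms an alpha helix or dimerizes.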

Motifs in DNA Sequences • Promoter regions, e.g. TATA box • Transcription factor binding sites, e.g. Eve in Drosophila: G-G-T-C-C-T-G-G • Cis-Regulatory regions

Motifs in RNA sequences

Motifs in Protein Structures • Protein structure patterns can encode information about protein function. • Structure motifs can be used to improve multiple alignments of protein sequences.

Active site recognition • Example: cathepsin A, peptidase family S10, EC 3.4.16.5 • 3-D representation: 3D profile (PROCAT)
1ac5  438 LTFVSVYNASHMVPFDKS 455
1ivy  419 IAFLTIKGAGHMVPTDKP 436

Motifs in MHC Binding Peptides

What is the goal and method of motif detection? • Perform local multiple sequence alignment to find consensus sequences and common sequence patterns (motifs)

Macromolecular motif recognition • 1-D representation: primary amino acid sequence, e.g.
MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPENSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTEVAQSNFEALQDFFRLFPEYKNNKL...
• Computational sequence analysis: query secondary databases over the Internet, e.g. http://www.ebi.ac.uk/interpro/

The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language: this defines what kind of patterns you can find. • An objective function: this defines what makes a pattern “interesting”. • An algorithm: this defines how to search among the possible patterns to find the “interesting” ones.

Pattern Description Languages • Regular expressions • Profiles • Hidden Markov Models (HMMs) • Motif HMMs • Motif-based HMMs

Macromolecular motif recognition • Single motif: exact regular expression (PROSITE) • Multiple motifs: residue frequency matrices (PRINTS) • Full domain alignment: profile (PROSITE), Hidden Markov Model (Pfam, PROSITE)

Regular Expressions • Regular expressions can be used to describe sequence motifs. • They use a simple syntax to describe patterns. • An example protein pattern: [DENG]-x-[DEN]-x(0,2)-[DENQK]-[LIVFY]

Regular expressions contd.
• Basic rules for regular expressions
  • Each position is separated by a hyphen “-”
  • A symbol X is a regular expression matching itself
  • x means ‘any residue’
  • [ ] surround ambiguities: a string [XYZ] matches any of the enclosed symbols
  • A string [R]* matches any number of consecutive strings that match R
  • { } surround forbidden residues
  • ( ) surround repeat counts, e.g. x(2) or x(0,2)
• Model formation
  • Restricted to key conserved features in order to reduce the “noise” level
  • Built by hand in a stepwise fashion from multiple alignments

Motif modelling methods • PROSITE: regular expressions • Example, CARBOXYPEPT_SER_HIS: [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA] • Regular expressions represent features by logical combinations of characters. • A regular expression defines a sequence pattern to be matched.
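The PROSITE-style syntax can be translated mechanically into a Python regular expression. The sketch below is a minimal converter written for these notes (the function name is an assumption, and it handles only the elements listed in the rules: literals, x, [ ], { } and repeat counts). It is applied to the CARBOXYPEPT_SER_HIS pattern and to the 1ac5 and 1ivy fragments shown on the active site slide.

```python
import re

# One PROSITE element: a literal residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (2) or (0,2).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

CARBOXYPEPT_SER_HIS = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-"
                       "[SG]-H-x-[IVAQ]-P-x(3)-[PSA]")
regex = prosite_to_regex(CARBOXYPEPT_SER_HIS)

for name, fragment in [("1ac5", "LTFVSVYNASHMVPFDKS"),
                       ("1ivy", "IAFLTIKGAGHMVPTDKP")]:
    print(name, bool(re.fullmatch(regex, fragment)))
```

Real PROSITE patterns can contain further notation (for example anchors for the sequence ends) that this sketch deliberately rejects instead of guessing at.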

Regular expressions contd. • Regular expressions, such as PROSITE patterns, are matched to primary amino acid sequences using finite state automata. • Matching is “all-or-none”: a sequence either matches the pattern or it does not.

Profiles • Profiles give weights for each letter. • Example from TRANSFAC: NF-kappab1
     G   G   G   G   A   T   Y   C   C   C
A    0   0   0   2  16   0   0   0   0   0
C    0   0   0   0   1   0   7  16  18  17
G   18  18  18  16   0   3   1   0   0   1
T    0   0   0   0   1  15  10   2   0   0
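To make "weights for each letter" concrete, the sketch below turns the NF-kappab1 count matrix from the slide into per-position frequencies and reads off the most frequent base in each column. The function names are assumptions for illustration. Unlike the slide's consensus, this simple argmax does not use ambiguity codes such as Y.

```python
BASES = "ACGT"

# Count matrix copied from the slide (one list per base, one entry per position).
COUNTS = {
    "A": [0, 0, 0, 2, 16, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 0, 7, 16, 18, 17],
    "G": [18, 18, 18, 16, 0, 3, 1, 0, 0, 1],
    "T": [0, 0, 0, 0, 1, 15, 10, 2, 0, 0],
}

def to_frequencies(counts):
    """Convert column counts into column frequencies (each column sums to 1)."""
    width = len(counts["A"])
    freqs = []
    for j in range(width):
        total = sum(counts[b][j] for b in BASES)
        freqs.append({b: counts[b][j] / total for b in BASES})
    return freqs

def majority_consensus(counts):
    """Most frequent base at each position (no IUPAC ambiguity codes)."""
    width = len(counts["A"])
    return "".join(max(BASES, key=lambda b: counts[b][j]) for j in range(width))

freqs = to_frequencies(COUNTS)
print(majority_consensus(COUNTS))
print(freqs[6])  # the position shown as Y (C or T) in the slide's consensus
```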

Profiles • Profiles are usually created by aligning multiple instances of the motif. • Example: nuclear hormone receptor transcription factor binding site.

Motif modelling methods • PRINTS: residue frequency matrices • a collection of protein “fingerprints” that exploit groups of motifs to build characteristic family signatures • motifs are encoded in ungapped “raw” sequence format • different scoring methods may be superimposed onto the data, e.g. BLAST • improved diagnostic reliability • mutual context provided by motif neighbours
Motif 1
NPESWTNFANMLW
NPYSWVNLTNVLW
REYSWHQNHHMIY
NEGSWISKGDLLF
NPYSWTNLTNVVY
NEYSWNKMASVVY
NDFGWDQESNLIY
NENSWNNYANMIY
NEYGWDQVSNLLY
NPYAWSKVSTMIY
NPYSWNGNASIIY
NEYAWNKFANVLF
NPYSWNRVSNILY
NPYSWNLIANVLY
NEYRWNKVANVLF
Motif 2
LDQPFGTGYSQ
VDNPVGAGFSY
VDQPVGTGFSL
VDQPGGTGFSS
IDNPVGTGFSF
IDQPTGTGFSV
VDQPLGTGYSY
IDQPAGTGFSP
LESPIGVGFSY
LDQPVGSGFSY
LDQPVGSGFSY
LDQPINTGFSN
LDQPIGAGFSY
LDAPAGVGFSY
LDQPVGAGFSY
Motif 3
FFQHFPEYQTNDFHIAGESYAGHYIP
FFNKFPEYQNRPFYITGESYGGIYVP
WVERFPEYKGRDFYIVGESYAGNGLM
FLSKFPEYKGRDFWITGESYAGVYIP
WFQLYPEFLSNPFYIAGESYAGVYVP
FFEAFPHLRSNDFHIAGESYAGHYIP
FFRLFPEYKDNKLFLTGESYAGIYIP
FLTRFPQFIGRETYLAGESYGGVYVP
FFNEFPQYKGNDFYVTGESYGGIYVP
WMSRFPQYQYRDFYIVGESYAGHYVP
FFRLFPEYKNNKLFLTGESYAGIYIP
FFRLFPEYKNNKLFLTGESYAGIYIP
WLERFPEYKGREFYITGESYAGHYVP
WMSRFPQYRYRDFYIVGESYAGHYVP
WFEKFPEHKGNEFYIAGESYAGIYVP
Motif 4
LAFTLSNSVGHMAP
LQFWWILRAGHMVA
LMWAETFQSGHMQP
LTYVRVYNSSHMVP
LQEVLIRNAGHMVP
LTFVSVYNASHMVP
LTFARIVEASHMVP
LTFSSVYLSGHEIP
IDVVTVKGSGHFVP
MTFATIKGSGHTAE
MTFATIKGGGHTAE
FGYLRLYEAGHMVP
MTFATVKGSGHTAE
ITLISIKGGGHFPA
MTFATVKGSGHTAE

Motif modelling methods • PROSITE: profiles • A feature is represented as a matrix with a score for every possible character. • The matrix is derived from a sequence alignment, e.g.:
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
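The first step in deriving such a matrix is counting which residues occur in each alignment column. The sketch below does only that for the six-sequence alignment on the slide. It is not the PROSITE procedure: real profile construction typically also weights the sequences and uses a substitution matrix, which this sketch does not attempt. The function name is an assumption.

```python
from collections import Counter

# The alignment from the slide, one row per sequence.
ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
]

def column_counts(alignment):
    """Count the residues seen in each column of an ungapped alignment."""
    return [Counter(column) for column in zip(*alignment)]

counts = column_counts(ALIGNMENT)
for position, column in enumerate(counts, start=1):
    print(position, dict(column))
```

Column 1 is dominated by F, which is consistent with F receiving the highest score at the first position of the derived matrix.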

Profiles contd. Derived matrix (rows: amino acids; columns: alignment positions):
       1    2    3    4    5    6    7    8    9   10
A    -18  -10   -1   -8    8   -3    3  -10   -2   -8
C    -22  -33  -18  -18  -22  -26   22  -24  -19   -7
D    -35    0  -32  -33   -7    6  -17  -34  -31    0
E    -27   15  -25  -26   -9   23   -9  -24  -23   -1
F     60  -30   12   14  -26  -29  -15    4   12  -29
G    -30  -20  -28  -32   28  -14  -23  -33  -27   -5
H    -13  -12  -25  -25  -16   14  -22  -22  -23  -10
I      3  -27   21   25  -29  -23   -8   33   19  -23
K    -26   25  -25  -27   -6    4  -15  -27  -26    0
L     14  -28   19   27  -27  -20   -9   33   26  -21
M      3  -15   10   14  -17  -10   -9   25   12  -11
N    -22   -6  -24  -27    1    8  -15  -24  -24   -4
P    -30   24  -26  -28  -14  -10  -22  -24  -26  -18
Q    -32    5  -25  -26   -9   24  -16  -17  -23    7
R    -18    9  -22  -22  -10    0  -18  -23  -22   -4
S    -22   -8  -16  -21   11    2   -1  -24  -19   -4
T    -10  -10   -6   -7   -5   -8    2  -10   -7  -11
V      0  -25   22   25  -19  -26    6   19   16  -16
W      9  -25  -18  -19  -25  -27  -34  -20  -17  -28
Y     34  -18   -1    1  -23  -12  -19    0    0  -18

Profiles contd. • Inclusion of all possible information to maximise overall signal of protein/domain • i.e., a full representation of features in the aligned sequences • Able to detect distant relationships with only a few well conserved residues • Position-dependent weights/penalties for all 20 amino acids, gaps, insertions • Dynamic programming algorithms for scoring hits

Hidden Markov Models (HMM) • HMMs generalize the idea of a profile. • They can model insertions and deletions in the sequence as well as the letters at conserved positions. • Profiles can be seen as simple HMMs.

Macromolecular motif recognition • Pfam and PROSITE: Hidden Markov Models (HMMs) • Feature is represented by a probabilistic model of interconnecting match, delete or insert states • Contains statistical information on observed and expected positional variation: the “platonic ideal of protein family” • State diagram labels: B, Mi, Ii, Di, E (begin, match, insert, delete and end states)

HMM example • A possible HMM for the sequence “ACCY”, represented as a sequence of probabilities. • The probability of ACCY is shown as a highlighted path through the model. • Figure labels: probability that an amino acid occurs in a particular state (emission probability); probability of a transition between states (transition probability).

Motif-Based HMMs • Motif-based HMMs are sequence models made by combining one or more motif models. • Motif HMM: motifs are modeled as profile HMMs without delete or insert states. • Motif HMM diagram: M1 → M2 → M3 → M4 → M5

A Simple Motif-Based HMM • Adding emitting states with self-loops, plus start and end states, turns a motif HMM into a sequence model. • The HMM below models sequences with one occurrence of the motif. • Sequence HMM: Start → Left Flank → M1 → M2 → M3 → M4 → M5 → Right Flank → End
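A sequence model of this shape is small enough to evaluate directly with the forward recursion. The sketch below is one possible reading of the diagram, not the lecture's implementation. It assumes that both flank states emit at least one symbol from a background distribution, that each flank loops on itself with probability `stay`, and that motif states follow one another with probability 1. All names and the example probabilities are hypothetical.

```python
def forward_probability(seq, motif, background, stay=0.9):
    """Pr(seq) for Start -> Left Flank -> M1..Mw -> Right Flank -> End.

    motif is a list of per-position emission distributions (dicts);
    background is the emission distribution of both flank states.
    """
    w = len(motif)
    if w == 0:
        raise ValueError("motif must have at least one position")
    if not seq:
        return 0.0
    left, right = 0, w + 1          # state indices; 1..w are motif states

    def emit(state, symbol):
        if state in (left, right):
            return background[symbol]
        return motif[state - 1][symbol]

    # f[s] = probability of emitting the prefix so far and being in state s
    f = [0.0] * (w + 2)
    f[left] = emit(left, seq[0])    # Start -> Left Flank with probability 1
    for symbol in seq[1:]:
        g = [0.0] * (w + 2)
        g[left] = f[left] * stay * emit(left, symbol)
        g[1] = f[left] * (1 - stay) * emit(1, symbol)
        for k in range(2, w + 1):
            g[k] = f[k - 1] * emit(k, symbol)
        g[right] = (f[w] + f[right] * stay) * emit(right, symbol)
        f = g
    return f[right] * (1 - stay)    # Right Flank -> End

# Hypothetical two-position motif that prefers "CG".
background = {b: 0.25 for b in "ACGT"}
motif = [
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
print(forward_probability("AACGTT", motif, background))
print(forward_probability("AATTTT", motif, background))
```

Because the recursion sums over every possible motif position, a sequence containing a good match to the motif receives a higher probability than one of the same length without it.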

Motif-Based HMM for Modeling Cis-regulatory Regions • With two or more motif models we can make more complicated motif-based HMMs. • This sequence model captures motifs on the + and – strand of DNA. • It does not capture the order of the motifs.

Objective functions for Regular Expression Patterns • Possible objective functions are: • Perfect matches only (no mismatches) • Allow a given number of mismatches • Allow a given density of mismatches (or wildcards). • To be interesting, the motif must occur a certain minimum number of times in the data.

Objective functions for profiles and HMMs • Profile- and HMM-based motifs are usually ranked by statistical or information-theoretical measures: • Likelihood ratio (e.g., forward-backward) • Information content (relative entropy) • Maximum a posteriori probability
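Information content can be computed column by column as the relative entropy between the observed residue frequencies and a background distribution. The sketch below assumes a uniform background unless one is supplied and uses no pseudocounts; the function name is made up for illustration.

```python
import math

def column_information(counts, background=None):
    """Relative entropy (in bits) of one profile column against a background.

    counts maps each residue to its count in the column; background maps
    each residue to its background probability (uniform if omitted).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty column")
    bits = 0.0
    for residue, count in counts.items():
        if count == 0:
            continue                      # 0 * log(0) is taken as 0
        p = count / total
        q = background[residue] if background else 1.0 / len(counts)
        bits += p * math.log2(p / q)
    return bits

# A fully conserved DNA column carries 2 bits; a uniform column carries 0.
print(column_information({"A": 0, "C": 0, "G": 18, "T": 0}))
print(column_information({"A": 5, "C": 5, "G": 5, "T": 5}))
```

Summing this quantity over all columns gives one common way to rank a whole motif.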

Example for profiles: the likelihood ratio • Use the profile to compute the likelihood of the data: Pr(data | profile) • Use the background model to compute the likelihood of the data under the background model: Pr(data | bkgrnd) • The likelihood ratio is: Pr(data | profile) / Pr(data | bkgrnd)
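For a single candidate site and a profile of independent columns, both likelihoods are products over positions, so the ratio can be accumulated position by position. The sketch below assumes the profile is given as per-position probability dictionaries; the names and the two-column example profile are hypothetical.

```python
import math

def likelihood_ratio(site, profile, background):
    """Pr(site | profile) / Pr(site | background) for one ungapped site."""
    if len(site) != len(profile):
        raise ValueError("site and profile must have the same width")
    ratio = 1.0
    for column, residue in zip(profile, site):
        ratio *= column[residue] / background[residue]
    return ratio

# Hypothetical two-column profile that prefers "AG".
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {b: 0.25 for b in "ACGT"}

for site in ("AG", "CT"):
    ratio = likelihood_ratio(site, profile, background)
    print(site, ratio, math.log2(ratio))
```

A ratio above 1 (a positive log ratio) means the site is better explained by the profile than by the background.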

Objective functions for protein structure patterns • Structure motifs are usually evaluated based on the RMS distance • between the pattern and each instance, or, • among all the instances of the pattern.
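The RMS distance between a pattern and one instance can be sketched as follows. This assumes the two coordinate sets list corresponding atoms in the same order and have already been superimposed; finding the optimal superposition is a separate step that is not shown. The function name is an assumption.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square distance between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))

# Two toy structures that differ by a shift of 1 unit along z.
pattern = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
instance = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(pattern, instance))
```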

Algorithms for discovering sequence motifs • Regular expression discovery algorithms either enumerate candidate patterns or extend seeds. • Profile/HMM algorithms use Gibbs sampling or Expectation Maximization (EM). Baum-Welch training, which relies on the forward-backward algorithm, is a form of EM.

Regular Expression Discovery: a simple algorithm • Look for DNA 16-mers where (up to) one wild card is allowed in the pattern: • E.g., “T-A-C-X-G-T-A-G-G-C-C-T-A-G-T-T” • There are 4^16 + 16·4^15 ≈ 2.1 × 10^10 possible patterns, a big number. • Idea: Instead of enumerating the possible patterns and counting, just update the counts of appropriate patterns for each 16-mer that actually occurs in the data.

Regular Expression Discovery: a simple algorithm (cont’d) • Run a window of width 16 along the data and, for each 16-mer in the data, e.g. “AGGGTAAAAGCCCCCT”, update the counts of the exact match pattern and each pattern with one wildcard: A-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, X-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, A-X-G-G-T-A-A-A-A-G-C-C-C-C-C-T, etc.
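The counting idea can be sketched in a few lines. The function name and the toy input are assumptions; the width defaults to 16 as on the slide, and X is used as the wildcard symbol.

```python
from collections import Counter

def count_patterns(data, width=16, wildcard="X"):
    """Count exact and one-wildcard patterns for every window in data."""
    counts = Counter()
    for i in range(len(data) - width + 1):
        kmer = data[i:i + width]
        counts[kmer] += 1                              # exact-match pattern
        for j in range(width):                         # one wildcard at j
            counts[kmer[:j] + wildcard + kmer[j + 1:]] += 1
    return counts

# Size of the full pattern space for width 16: exact patterns plus
# patterns with a wildcard at one of the 16 positions.
print(4 ** 16 + 16 * 4 ** 15)

# Toy run with width 3 so the result is easy to inspect.
toy = count_patterns("ACGACG", width=3)
print(toy.most_common(4))
```

Each window updates only width + 1 counters, so the work grows with the amount of data and not with the size of the pattern space.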

Profile discovery algorithms • Profile discovery algorithms for finding sequence motifs mostly use either EM (Expectation Maximization) or Gibbs sampling.

What is Gibbs sampling? • Stochastic optimization method • Works well with local multiple alignment without gaps (motif searching) • Searches for the statistically most probable motifs by sampling random positions instead of going through entire search space

What is the program going to do? 1. Ask the user for: • a file containing multiple DNA or protein sequences • the motif width • how many motifs are wanted 2. Calculate the background frequencies of A, C, G, T from all the sequences. Example: [0.34951456310679613, 0.17799352750809061, 0.21035598705501618, 0.23300970873786409]
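The background frequency step might look like the sketch below. It is not the lecture's program: the function name is an assumption, and here the counts are divided by the number of A, C, G and T characters only, so the four values sum to 1.

```python
from collections import Counter

def background_frequencies(seqs, alphabet="ACGT"):
    """Frequencies of each alphabet letter over all sequences, in alphabet order."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq.upper())
    total = sum(counts[letter] for letter in alphabet)
    if total == 0:
        raise ValueError("no alphabet letters found in the input")
    return [counts[letter] / total for letter in alphabet]

# Hypothetical input sequences.
print(background_frequencies(["AACG", "TT"]))
```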

What is the program going to do? 3. Generate random start positions for the motif in each sequence. Example: 10 sequences, 30 bp in length, motif width of 7: start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22] >> random.randint(0, ceiling) where ceiling = len(sequence) - width (randint returns an integer and includes both endpoints)
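A minimal sketch of the initialisation step, assuming 0-based start positions and that every sequence is at least as long as the motif. The function name is made up; `random.randint` is used because it returns an integer with both endpoints included.

```python
import random

def random_starts(seqs, width, rng=random):
    """Pick one random motif start per sequence (0-based, motif fits inside)."""
    starts = []
    for seq in seqs:
        ceiling = len(seq) - width       # last start at which the motif fits
        if ceiling < 0:
            raise ValueError("sequence shorter than motif width")
        starts.append(rng.randint(0, ceiling))
    return starts

# Ten hypothetical 30 bp sequences and a motif width of 7, as in the example.
sequences = ["A" * 30] * 10
print(random_starts(sequences, 7, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing results between runs.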

What is the program going to do? 4. Construct position specific score matrix from all sequences except one.

What is the program going to do? 5. Score the left-out sequence according to the position specific score matrix:
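Steps 4 and 5 can be sketched as one leave-one-out update. This is a generic Gibbs sampling step under stated assumptions, not the lecture's program: the matrix holds pseudocount-smoothed frequencies, each window of the left-out sequence is scored by its likelihood ratio against the background, and a new start is drawn with probability proportional to that score. All function names and the toy data are hypothetical.

```python
import random

ALPHABET = "ACGT"

def build_pssm(seqs, starts, width, skip, pseudocount=1.0):
    """Column frequencies from every current motif instance except seqs[skip]."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for k in range(width):
            counts[k][seq[start + k]] += 1
    pssm = []
    for column in counts:
        total = sum(column.values())
        pssm.append({b: column[b] / total for b in ALPHABET})
    return pssm

def score_windows(seq, pssm, background):
    """Likelihood ratio of every window of seq under the matrix vs background."""
    width = len(pssm)
    scores = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for k in range(width):
            base = seq[start + k]
            ratio *= pssm[k][base] / background[base]
        scores.append(ratio)
    return scores

def resample_start(seq, pssm, background, rng=random):
    """Draw a new start for the left-out sequence, weighted by window score."""
    scores = score_windows(seq, pssm, background)
    return rng.choices(range(len(scores)), weights=scores, k=1)[0]

# Toy data: the motif GACG is planted in each sequence.
seqs = ["TTGACGTT", "GACGTTTT", "TTTTGACG"]
starts = [0, 0, 4]                      # start of the first sequence is wrong
background = {b: 0.25 for b in ALPHABET}

pssm = build_pssm(seqs, starts, width=4, skip=0)
scores = score_windows(seqs[0], pssm, background)
print(scores)
print(resample_start(seqs[0], pssm, background, random.Random(1)))
```

Repeating this update for each sequence in turn, over many rounds, is what lets the sampler drift toward a shared motif without scanning the whole search space.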
