CZ5226: Advanced Bioinformatics Lecture 4: Motifs and methods for generating motifs Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore
What is a motif? • A motif is a sequence pattern that occurs repeatedly in a group of related DNA, RNA, protein, or peptide sequences.
Types of motifs and what they mean • Motifs in protein sequences • Structure, function, evolution • Motifs in DNA and RNA sequences • Promoters, transcription factor binding sites, splicing signals • Motifs in MHC-binding peptides • Anchor residue positions, TCR recognition residues
Motifs in Protein Sequences • The leucine zipper may explain how some eukaryotic gene regulatory proteins work. • L-x(6)-L-x(6)-L-x(6)-L • The leucine side chains extending from one alpha-helix interact with those from a similar alpha helix of a second polypeptide, facilitating dimerization
Motifs in DNA Sequences • Promoter regions, e.g. TATA box • Transcription factor binding sites, e.g. Eve in Drosophila: G-G-T-C-C-T-G-G • Cis-Regulatory regions
Motifs in Protein Structures • Protein structure patterns can encode information about protein function. • Structure motifs can be used to improve multiple alignments of protein sequences.
Active site recognition • Example: cathepsin A, peptidase family S10, EC 3.4.16.5 • 3-D representation: 3D profile (PROCAT)
1ac5 438 LTFVSVYNASHMVPFDKS 455
1ivy 419 IAFLTIKGAGHMVPTDKP 436
What is the goal and method of motif detection? • Perform local multiple sequence alignment to find consensus sequences and common sequence patterns (motifs)
Macromolecular motif recognition • 1-D representation: primary amino acid sequence
MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPENSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTEVAQSNFEALQDFFRLFPEYKNNKL...
• Computational sequence analysis: query secondary databases over the Internet, e.g. http://www.ebi.ac.uk/interpro/
The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language • This defines what kind of patterns you can find. • An objective function • This defines what makes a pattern “interesting”. • An algorithm • This defines how to search among the possible patterns to find the “interesting” ones.
Pattern Description Languages • Regular expressions • Profiles • Hidden Markov Models (HMMs) • Motif HMMs • Motif-based HMMs
Macromolecular motif recognition • Single motif: exact regular expression (PROSITE) • Full domain alignment: profile (PROSITE), Hidden Markov Model (Pfam, PROSITE) • Multiple motifs: residue frequency matrices (PRINTS)
Regular Expressions • Regular expressions can be used to describe sequence motifs. • They use a simple syntax to describe patterns. • An example protein pattern: [DENG]-x-[DEN]-x(0,2)-[DENQK]-[LIVFY]
Regular expressions contd.
• Basic rules for regular expressions
  • Each position is separated by a hyphen “-”
  • A symbol X is a regular expression matching itself
  • x means ‘any residue’
  • [ ] surround ambiguities: a string [XYZ] matches any of the enclosed symbols
  • A string [R]* matches any number of consecutive strings that match R
  • { } surround forbidden residues
  • ( ) surround repeat counts
• Model formation
  • Restricted to key conserved features in order to reduce the “noise” level
  • Built by hand in a stepwise fashion from multiple alignments
Motif modelling methods Prosite: Regular expressions CARBOXYPEPT_SER_HIS [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA] Regular expressions represent features by logical combinations of characters. A regular expression defines a sequence pattern to be matched.
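The pattern syntax above maps almost directly onto ordinary regular expressions. The sketch below is a minimal translator, assuming only the elements listed in the rules (residues, x, [ ], { } and repeat counts); the function name `prosite_to_regex` is hypothetical and anchors such as `<` and `>` are not handled. It checks the CARBOXYPEPT_SER_HIS pattern against the two cathepsin A fragments (1ac5 and 1ivy) quoted on the active-site slide.

```python
import re

# One PROSITE element: x, a residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python `re` syntax."""
    pieces = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        pieces.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(pieces)

CARBOXYPEPT_SER_HIS = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-"
                       "H-x-[IVAQ]-P-x(3)-[PSA]")
site_regex = re.compile(prosite_to_regex(CARBOXYPEPT_SER_HIS))

# Fragments of 1ac5 (438-455) and 1ivy (419-436) from the active-site slide.
fragments = {"1ac5": "LTFVSVYNASHMVPFDKS", "1ivy": "IAFLTIKGAGHMVPTDKP"}
hits = {name: bool(site_regex.search(seq)) for name, seq in fragments.items()}
print(hits)
```

Because the match is exact, changing a single constrained residue in either fragment (for example the histidine) would make the pattern miss it entirely.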
Regular expressions contd. • Regular expressions, such as PROSITE patterns, are matched to primary amino acid sequences using finite state automata. • Matching is “all-or-none”: a sequence either matches the pattern or it does not.
Profiles • Profiles give weights for each letter. • Example from TRANSFAC: NF-kappab1
Consensus   G   G   G   G   A   T   Y   C   C   C
A           0   0   0   2  16   0   0   0   0   0
C           0   0   0   0   1   0   7  16  18  17
G          18  18  18  16   0   3   1   0   0   1
T           0   0   0   0   1  15  10   2   0   0
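A count matrix like the NF-kappaB example is usually turned into log-odds weights before it is used to score sequences. The following is a minimal sketch of that idea, not the TRANSFAC procedure itself: the uniform background, the pseudocount of 1, the test sequence and the function names are all assumptions made for illustration.

```python
import math

# Count matrix from the NF-kappaB example: one list per base, one entry per position.
COUNTS = {
    "A": [0, 0, 0, 2, 16, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 0, 7, 16, 18, 17],
    "G": [18, 18, 18, 16, 0, 3, 1, 0, 0, 1],
    "T": [0, 0, 0, 0, 1, 15, 10, 2, 0, 0],
}
BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumption: uniform background
PSEUDOCOUNT = 1.0                              # assumption: avoids log(0)

def log_odds_matrix(counts, background, pseudocount):
    """Return one {base: log2(p / q)} dictionary per motif position."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2((counts[b][i] + pseudocount) / total / background[b])
            for b in "ACGT"
        })
    return matrix

def best_site(sequence, matrix):
    """Score every window of the sequence and return (best score, start)."""
    width = len(matrix)
    scored = [
        (sum(matrix[i][sequence[s + i]] for i in range(width)), s)
        for s in range(len(sequence) - width + 1)
    ]
    return max(scored)

pwm = log_odds_matrix(COUNTS, BACKGROUND, PSEUDOCOUNT)
score, start = best_site("TTACGGGGATTCCCAGT", pwm)  # hypothetical test sequence
print(start, round(score, 2))
```

Unlike a regular expression, the profile gives every window a graded score, so near-matches are ranked rather than rejected.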
Profiles • Profiles are usually created by aligning multiple instances of the motif. • Example: nuclear hormone receptor transcription factor binding site.
Motif modelling methods Prints: Residue frequency matrices
Motif 1
NPESWTNFANMLW
NPYSWVNLTNVLW
REYSWHQNHHMIY
NEGSWISKGDLLF
NPYSWTNLTNVVY
NEYSWNKMASVVY
NDFGWDQESNLIY
NENSWNNYANMIY
NEYGWDQVSNLLY
NPYAWSKVSTMIY
NPYSWNGNASIIY
NEYAWNKFANVLF
NPYSWNRVSNILY
NPYSWNLIANVLY
NEYRWNKVANVLF
Motif 2
LDQPFGTGYSQ
VDNPVGAGFSY
VDQPVGTGFSL
VDQPGGTGFSS
IDNPVGTGFSF
IDQPTGTGFSV
VDQPLGTGYSY
IDQPAGTGFSP
LESPIGVGFSY
LDQPVGSGFSY
LDQPVGSGFSY
LDQPINTGFSN
LDQPIGAGFSY
LDAPAGVGFSY
LDQPVGAGFSY
Motif 3
FFQHFPEYQTNDFHIAGESYAGHYIP
FFNKFPEYQNRPFYITGESYGGIYVP
WVERFPEYKGRDFYIVGESYAGNGLM
FLSKFPEYKGRDFWITGESYAGVYIP
WFQLYPEFLSNPFYIAGESYAGVYVP
FFEAFPHLRSNDFHIAGESYAGHYIP
FFRLFPEYKDNKLFLTGESYAGIYIP
FLTRFPQFIGRETYLAGESYGGVYVP
FFNEFPQYKGNDFYVTGESYGGIYVP
WMSRFPQYQYRDFYIVGESYAGHYVP
FFRLFPEYKNNKLFLTGESYAGIYIP
FFRLFPEYKNNKLFLTGESYAGIYIP
WLERFPEYKGREFYITGESYAGHYVP
WMSRFPQYRYRDFYIVGESYAGHYVP
WFEKFPEHKGNEFYIAGESYAGIYVP
Motif 4
LAFTLSNSVGHMAP
LQFWWILRAGHMVA
LMWAETFQSGHMQP
LTYVRVYNSSHMVP
LQEVLIRNAGHMVP
LTFVSVYNASHMVP
LTFARIVEASHMVP
LTFSSVYLSGHEIP
IDVVTVKGSGHFVP
MTFATIKGSGHTAE
MTFATIKGGGHTAE
FGYLRLYEAGHMVP
MTFATVKGSGHTAE
ITLISIKGGGHFPA
MTFATVKGSGHTAE
• A collection of protein “fingerprints” that exploit groups of motifs to build characteristic family signatures
• Motifs are encoded in ungapped “raw” sequence format
• Different scoring methods may be superimposed onto the data, e.g. BLAST
• Improved diagnostic reliability
• Mutual context provided by motif neighbours
Motif modelling methods Prosite: Profiles • Feature is represented as a matrix with a score for every possible character. • Matrix is derived from a sequence alignment, e.g.:
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
Profiles contd. Derived matrix (rows: amino acids; columns: alignment positions 1 to 10):
A  -18 -10  -1  -8   8  -3   3 -10  -2  -8
C  -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D  -35   0 -32 -33  -7   6 -17 -34 -31   0
E  -27  15 -25 -26  -9  23  -9 -24 -23  -1
F   60 -30  12  14 -26 -29 -15   4  12 -29
G  -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H  -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I    3 -27  21  25 -29 -23  -8  33  19 -23
K  -26  25 -25 -27  -6   4 -15 -27 -26   0
L   14 -28  19  27 -27 -20  -9  33  26 -21
M    3 -15  10  14 -17 -10  -9  25  12 -11
N  -22  -6 -24 -27   1   8 -15 -24 -24  -4
P  -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q  -32   5 -25 -26  -9  24 -16 -17 -23   7
R  -18   9 -22 -22 -10   0 -18 -23 -22  -4
S  -22  -8 -16 -21  11   2  -1 -24 -19  -4
T  -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V    0 -25  22  25 -19 -26   6  19  16 -16
W    9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y   34 -18  -1   1 -23 -12 -19   0   0 -18
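A first step toward a matrix of this kind is to tabulate how often each residue occurs in each alignment column. The sketch below does only that for the six-sequence example alignment; it is not the PROSITE procedure, which also brings in substitution scores so that residues never observed in a column (such as W in column 1) still receive a weight. The function name is hypothetical.

```python
from collections import Counter

# The six aligned sequences from the PROSITE profile example.
ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
]

def column_frequencies(alignment):
    """Return one {residue: relative frequency} dictionary per column."""
    n_sequences = len(alignment)
    return [
        {residue: count / n_sequences for residue, count in Counter(column).items()}
        for column in zip(*alignment)   # zip(*rows) iterates over columns
    ]

freqs = column_frequencies(ALIGNMENT)
print(freqs[0])   # residues seen in alignment position 1
```

The most frequent residues per column (F in position 1, K in position 2) line up with the highest scores in the derived matrix.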
Profiles contd. • Inclusion of all possible information to maximise overall signal of protein/domain • i.e., a full representation of features in the aligned sequences • Able to detect distant relationships with only a few well conserved residues • Position-dependent weights/penalties for all 20 amino acids, gaps, insertions • Dynamic programming algorithms for scoring hits
Hidden Markov Models (HMM) • HMMs generalize the idea of a profile. • They can model insertions and deletions in the sequence as well as the letters at conserved positions. • Profiles can be seen as simple HMMs.
Macromolecular motif recognition • Pfam and Prosite: Hidden Markov Models (HMMs) • Feature is represented by a probabilistic model of interconnecting match, delete or insert states • Contains statistical information on observed and expected positional variation: the “platonic ideal of protein family” [Figure: HMM states B (begin), Mi (match), Ii (insert), Di (delete), E (end)]
HMM example A possible HMM for the sequence “ACCY”, represented as a sequence of probabilities. The probability of ACCY is shown as a highlighted path through the model. Figure legend: emission probabilities (the probability that an amino acid occurs in a particular state) and transition probabilities (the probability of moving from one state to the next).
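The probability of a sequence along one path through the model is the product of the emission and transition probabilities on that path. The sketch below illustrates this for “ACCY” passing through four match states; every number in it is hypothetical, since the values on the slide's figure are not available in the text, and begin/end transitions are taken to have probability 1.

```python
# Hypothetical emission probabilities for match states M1..M4,
# restricted to three residues to keep the example short.
EMISSION = [
    {"A": 0.8, "C": 0.1, "Y": 0.1},
    {"A": 0.1, "C": 0.8, "Y": 0.1},
    {"A": 0.1, "C": 0.8, "Y": 0.1},
    {"A": 0.1, "C": 0.1, "Y": 0.8},
]
# Hypothetical transition probabilities M1->M2, M2->M3, M3->M4.
TRANSITION = [0.9, 0.9, 0.9]

def path_probability(sequence, emission, transition):
    """Probability of emitting `sequence` along the match-state path."""
    prob = emission[0][sequence[0]]
    for i in range(1, len(sequence)):
        prob *= transition[i - 1] * emission[i][sequence[i]]
    return prob

p_accy = path_probability("ACCY", EMISSION, TRANSITION)
p_aacy = path_probability("AACY", EMISSION, TRANSITION)
print(p_accy, p_aacy)
```

In a full profile HMM the same sequence can also be produced by paths through insert and delete states, and the total probability sums over all such paths.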
Motif-Based HMMs • Motif-based HMMs are sequence models made by combining one or more motif models. • Motif HMM: motifs are modeled as profile HMMs without delete or insert states. [Figure: motif HMM with match states M1 → M2 → M3 → M4 → M5]
A Simple Motif-Based HMM • Adding emitting states with self-loops, plus start and end states, turns a motif HMM into a sequence model. • The HMM below models sequences with one occurrence of the motif. [Figure: sequence HMM: Start → Left Flank → M1 → M2 → M3 → M4 → M5 → Right Flank → End]
Motif-Based HMM for Modeling Cis-regulatory Regions With two or more motif models we can make more complicated motif-based HMMs. This sequence model captures motifs on the + and – strand of DNA. It does not capture the order of the motifs.
Objective functions for Regular Expression Patterns • Possible objective functions are: • Perfect matches only (no mismatches) • Allow a given number of mismatches • Allow a given density of mismatches (or wildcards). • To be interesting, the motif must occur a certain minimum number of times in the data.
Objective functions for profiles and HMMs • Profile- and HMM-based motifs are usually ranked by statistical or information-theoretical measures: • Likelihood ratio (e.g., computed with forward-backward) • Information content (relative entropy) • Maximum a posteriori probability
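Information content can be computed column by column as the relative entropy between the observed letter frequencies and the background. The sketch below does this for the NF-kappaB count matrix shown earlier, assuming a uniform background of 0.25 per base and no pseudocounts; the function name is hypothetical.

```python
import math

# NF-kappaB count matrix from the TRANSFAC example.
COUNTS = {
    "A": [0, 0, 0, 2, 16, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 0, 7, 16, 18, 17],
    "G": [18, 18, 18, 16, 0, 3, 1, 0, 0, 1],
    "T": [0, 0, 0, 0, 1, 15, 10, 2, 0, 0],
}

def information_content(counts, background=0.25):
    """Relative entropy (in bits) of each column against the background."""
    width = len(counts["A"])
    result = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT")
        bits = 0.0
        for b in "ACGT":
            p = counts[b][i] / total
            if p > 0:                      # 0 * log(0) is taken as 0
                bits += p * math.log2(p / background)
        result.append(bits)
    return result

bits_per_column = information_content(COUNTS)
print([round(b, 2) for b in bits_per_column])
```

A fully conserved column reaches the maximum of 2 bits for DNA, while the degenerate Y position (C or T) scores well below that.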
Example for profiles: the likelihood ratio • Use the profile to compute the likelihood of the data: Pr(data | profile) • Use the background model to compute the likelihood of the data under the background model: Pr(data | bkgrnd) • The likelihood ratio is: Pr(data | profile) / Pr(data | bkgrnd)
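For a single candidate site the ratio factorizes over positions, under the usual assumption that profile columns are independent. The symbols here are introduced for illustration and do not appear on the slide: x = x_1 … x_W is a site of width W, p_i(a) is the profile probability of letter a at position i, and q(a) is the background probability of a.

```latex
\mathrm{LR}(x)
  = \frac{\Pr(x \mid \text{profile})}{\Pr(x \mid \text{bkgrnd})}
  = \prod_{i=1}^{W} \frac{p_i(x_i)}{q(x_i)},
\qquad
\log \mathrm{LR}(x) = \sum_{i=1}^{W} \log \frac{p_i(x_i)}{q(x_i)}
```

Taking logarithms turns the product into a sum of per-position terms, which is why profiles are often stored directly as log-odds weights.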
Objective functions for protein structure patterns • Structure motifs are usually evaluated based on the RMS distance • between the pattern and each instance, or, • among all the instances of the pattern.
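The RMS distance between a pattern and one instance can be computed from matched atom coordinates. The sketch below assumes the two coordinate sets are already superimposed and listed in the same atom order; optimal superposition (for example by the Kabsch algorithm) is a separate step that is not shown. The function name and coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square distance between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical coordinates: the instance is the pattern shifted by (3, 4, 0).
pattern = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
instance = [(x + 3.0, y + 4.0, z) for x, y, z in pattern]
print(rmsd(pattern, instance))
```

To score a motif against all of its instances, the same function can be applied pairwise and the results averaged.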
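Algorithms for discovering sequence motifs • Regular expression searches enumerate candidate patterns or grow them from seeds. • Profile/HMM algorithms use Gibbs sampling or Expectation Maximization (EM). The Baum-Welch algorithm for training HMMs, which relies on the forward-backward procedure, is a form of EM.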
Regular Expression Discovery: a simple algorithm • Look for DNA 16-mers where (up to) one wild card is allowed in the pattern: • E.g., “T-A-C-X-G-T-A-G-G-C-C-T-A-G-T-T” • There are 4^16 + 16·4^15 ≈ 2.1 × 10^10 possible patterns, a big number. • Idea: Instead of enumerating the possible patterns and counting, just update the counts of appropriate patterns for each 16-mer that actually occurs in the data.
Regular Expression Discovery: a simple algorithm (cont’d) • Run a window of width 16 along the data and, for each 16-mer in the data, e.g. “AGGGTAAAAGCCCCCT”, update the counts of the exact match pattern and each pattern with one wildcard: A-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, X-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, A-X-G-G-T-A-A-A-A-G-C-C-C-C-C-T, etc.
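The counting scheme can be sketched in a few lines with a dictionary keyed by pattern. The example below uses a width of 4 on a toy sequence so the result is easy to inspect; the function name and the toy data are assumptions. It also computes the size of the full pattern space for width 16 directly from the definition (exact 16-mers plus 16-mers with one wildcard).

```python
from collections import Counter

def count_patterns(sequence, width):
    """Count exact k-mers and their one-wildcard variants seen in the data."""
    counts = Counter()
    for start in range(len(sequence) - width + 1):
        kmer = sequence[start:start + width]
        counts[kmer] += 1                                  # exact-match pattern
        for i in range(width):
            counts[kmer[:i] + "X" + kmer[i + 1:]] += 1     # wildcard at position i
    return counts

# Size of the search space for width 16: 4^16 exact patterns
# plus 16 wildcard positions times 4^15 choices for the other letters.
n_patterns = 4 ** 16 + 16 * 4 ** 15

counts = count_patterns("ACGTACCT", width=4)   # toy data, width 4 for readability
print(n_patterns, counts.most_common(1))
```

Only patterns that actually occur are ever stored, so memory grows with the data (at most width + 1 entries per window) rather than with the size of the pattern space.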
Profile discovery algorithms • Profile discovery algorithms for finding sequence motifs mostly use either EM (Expectation Maximization) or Gibbs sampling.
What is Gibbs sampling? • Stochastic optimization method • Works well with local multiple alignment without gaps (motif searching) • Searches for the statistically most probable motifs by sampling random positions instead of going through entire search space
What is the program going to do? 1. Ask the user for: • a file containing multiple DNA or protein sequences • the motif width • how many motifs are wanted 2. Calculate the background frequencies of A, C, G, T from all the sequences, e.g. [0.34951456310679613, 0.17799352750809061, 0.21035598705501618, 0.23300970873786409]
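The background frequencies can be obtained by pooling all sequences and counting letters. This is a minimal sketch with a hypothetical function name and toy input, not the program's actual code. It counts only A, C, G and T, so line breaks or other characters do not enter the denominator and the four frequencies sum to 1. The values printed on the slide sum to about 0.97, which might happen if other characters were included in the total, though that is only a guess.

```python
from collections import Counter

def background_frequencies(sequences, alphabet="ACGT"):
    """Relative frequency of each alphabet letter over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore anything outside the alphabet (newlines, N, gaps, ...).
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    return [counts[letter] / total for letter in alphabet]

# Toy input; the trailing newlines mimic lines read from a file.
freqs = background_frequencies(["AACG\n", "ttac\n"])
print(freqs)
```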
What is the program going to do? 3. Generate random start positions for the motif in each sequence. Example: 10 sequences, 30 bp in length, motif width of 7: start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22] >> random.randint(0, ceiling) where ceiling = len(sequence) - width (random.uniform would return a float, which cannot be used as an index)
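Drawing the start positions can be sketched as follows. The sequences here are placeholders of the right length rather than real data, and the helper name is hypothetical. `randint` is inclusive at both ends, so the largest possible start still leaves room for a full-width motif.

```python
import random

def random_starts(sequences, width, rng=random):
    """Pick one random motif start per sequence, leaving room for the motif."""
    return [rng.randint(0, len(seq) - width) for seq in sequences]

rng = random.Random(0)                      # fixed seed so runs are repeatable
sequences = ["ACGT" * 7 + "AC"] * 10        # placeholder: 10 sequences of 30 bp
starts = random_starts(sequences, width=7, rng=rng)
print(starts)
```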
What is the program going to do? 4. Construct position specific score matrix from all sequences except one.
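One way to build the position specific score matrix is to count letters at the current motif positions of every sequence except the one left out, then convert the counts to log-odds against the background. The sketch below assumes a pseudocount of 1 and log base 2; the function name, parameters and toy data are hypothetical and not taken from the program described here.

```python
import math

def build_pssm(sequences, starts, width, left_out, background, pseudocount=1.0):
    """Log2-odds PSSM from the motif windows of all sequences but `left_out`."""
    alphabet = sorted(background)
    counts = [{a: pseudocount for a in alphabet} for _ in range(width)]
    for index, (seq, start) in enumerate(zip(sequences, starts)):
        if index == left_out:
            continue                          # this sequence is scored later
        for i, letter in enumerate(seq[start:start + width]):
            counts[i][letter] += 1
    total = (len(sequences) - 1) + pseudocount * len(alphabet)
    return [
        {a: math.log2(column[a] / total / background[a]) for a in alphabet}
        for column in counts
    ]

# Toy data: three sequences carry "ACG" at the given starts, the fourth is left out.
sequences = ["TTACGTT", "ACGTTTT", "GGGACGT", "CCCCCCC"]
starts = [2, 0, 3, 0]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(sequences, starts, width=3, left_out=3, background=background)
print(pssm[0])
```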
What is the program going to do? 5. Score the left-out sequence according to the position specific score matrix:
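Scoring the left-out sequence amounts to sliding the matrix along it and summing the per-position weights for each window; a Gibbs sampler then typically draws the new start position with probability proportional to the window's likelihood ratio rather than always taking the best one. The sketch below assumes a log2-odds matrix stored as a list of dictionaries; the toy matrix, function names and sequence are hypothetical.

```python
import random

def window_scores(sequence, pssm):
    """Log-odds score of every motif-width window in the sequence."""
    width = len(pssm)
    return [
        sum(pssm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]

def sample_start(sequence, pssm, rng=random):
    """Draw a start position with probability proportional to 2**score."""
    scores = window_scores(sequence, pssm)
    weights = [2.0 ** s for s in scores]      # log2-odds back to likelihood ratios
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

# Toy log2-odds matrix that favours the motif "ACG".
toy_pssm = [
    {"A": 2.0, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": 2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": -2.0, "G": 2.0, "T": -2.0},
]
scores = window_scores("TTACGTT", toy_pssm)
new_start = sample_start("TTACGTT", toy_pssm, rng=random.Random(0))
print(scores, new_start)
```

Sampling instead of maximizing is what lets the procedure escape poor initial alignments; the sampled start replaces the old one and the process repeats with the next sequence left out.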