Sequence Motifs

Motifs • Motifs represent a short common sequence • Regulatory motifs (TF binding sites) • Functional site in proteins (DNA binding motif)

Regulatory Motifs • Transcription factors bind to regulatory motifs • Motifs are 6 – 20 nucleotides long • Activators and repressors • Usually located near the target gene, mostly upstream [Figure: SBF and MCM1 motifs upstream of the transcription start site of Gene X]

E. coli promoter sequences

DNA binding Motif Zn finger C2H2

Challenges • How to recognize a regulatory motif? • Can we identify new occurrences of known motifs in genome sequences? • Can we discover new motifs within upstream sequences of genes?

1. Motif Representation • Exact motif: CGGATATA • Consensus: represent only deterministic nucleotides. • Example: HAP1 binding sites in 5 sequences:

CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG

• Consensus motif: CGGNNNTANCGG • N stands for any nucleotide. • Representing only the consensus loses information. How can this be avoided?
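The consensus line above can be derived mechanically from the aligned sites. A minimal sketch, assuming the simple rule the slide uses (keep a nucleotide only where all sites agree, otherwise write N); the function name `consensus` is illustrative:

```python
def consensus(sites):
    """Consensus of equal-length aligned sites; N marks any column that varies."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    letters = []
    for column in zip(*sites):  # iterate over alignment columns
        letters.append(column[0] if len(set(column)) == 1 else "N")
    return "".join(letters)


# The five HAP1 binding sites from the slide.
hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Every N discards the nucleotide frequencies seen in that column, which is the information loss the last bullet refers to.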

Representing the motif as a profile • Position-specific nucleotide frequencies of the -35 and -10 promoter elements, located upstream of the transcription start site • Based on ~450 known promoters

-35 element (consensus TTGACA)
     1    2    3    4    5    6
A   0.1  0.1  0.1  0.5  0.2  0.5
T   0.7  0.7  0.2  0.2  0.2  0.2
G   0.1  0.1  0.5  0.1  0.1  0.2
C   0.1  0.1  0.2  0.2  0.5  0.1

-10 element (consensus TATAAT)
     1    2    3    4    5    6
A   0.1  0.7  0.2  0.6  0.5  0.1
T   0.7  0.1  0.5  0.2  0.2  0.8
G   0.1  0.1  0.1  0.1  0.1  0.0
C   0.1  0.1  0.2  0.1  0.1  0.1

PSPM – Position Specific Probability Matrix • Represents a motif of length k (here k = 5) • Count the number of occurrences of each nucleotide in each position

PSPM – Position Specific Probability Matrix • Defines Pi(n), where n ∈ {A,C,G,T}, for i = 1,…,k. • Pi(A) is the frequency of nucleotide A in position i.
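A minimal sketch of how a PSPM could be computed from aligned sites: count each nucleotide per column and divide by the number of sites. The HAP1 sites from the motif representation slide serve as input; the function name `build_pspm` and the list-of-dicts layout are illustrative choices, and positions are 0-based in the code.

```python
from collections import Counter

ALPHABET = "ACGT"


def build_pspm(sites):
    """Return pspm, where pspm[i][n] is the frequency of nucleotide n at position i."""
    n_sites = len(sites)
    pspm = []
    for column in zip(*sites):  # one alignment column per motif position
        counts = Counter(column)
        pspm.append({n: counts[n] / n_sites for n in ALPHABET})
    return pspm


hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

pspm = build_pspm(hap1_sites)
print(pspm[0])  # first column: C in all five sites
print(pspm[3])  # fourth column: A once, T twice, C twice
```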

Graphical Representation – Sequence Logo • Horizontal axis: position of the base in the sequence. • Vertical axis: amount of information. • Letter stack: order indicates importance. • Letter height: indicates frequency. • Consensus can be read across the top of the letter columns.
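The "amount of information" on the vertical axis is usually measured in bits. A minimal sketch, assuming the common definition for DNA (2 bits minus the Shannon entropy of the column, without any small-sample correction) and letter heights proportional to frequency; the slides do not state which variant was used, and the function names are illustrative.

```python
import math


def column_information(freqs):
    """Information content of one motif position in bits (maximum 2 for DNA)."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy  # log2(4) = 2 bits for a four-letter alphabet


def letter_heights(freqs):
    """Height of each letter in the stack: its frequency times the column information."""
    info = column_information(freqs)
    return {n: p * info for n, p in freqs.items()}


conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
two_way = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}

print(column_information(conserved))  # 2.0 bits: fully conserved position
print(column_information(uniform))    # 0.0 bits: no preference at all
print(letter_heights(two_way))        # A and T each 0.5 bits tall
```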

Identification of Known Motifs within Genomic Sequences • Motivation: • identification of new genes controlled by the same TF. • Infer the function of these genes. • enable better understanding of the regulation mechanism.

PSPM – Position Specific Probability Matrix • Each k-mer is assigned a probability. • Example: P(TCCAG)=0.5*0.25*0.8*0.7*0.2

Detecting a Known Motif within a Sequence using PSPM • The PSPM is moved along the query sequence. • At each position the sub-sequence is scored for a match to the PSPM. • Example: sequence = ATGCAAGTCT… • Position 1: ATGCA: 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4 • Position 2: TGCAA: 0.5*0.25*0.8*0.7*0.6 = 0.042
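The sliding-window scoring can be sketched as follows. The full PSPM is not shown in this transcript, so the matrix below is an assumed reconstruction: it reproduces the worked examples on these slides, but entries that no example uses (such as G and T at position 4) are placeholders. The names `kmer_probability` and `scan` are illustrative.

```python
# Assumed reconstruction of the length-5 PSPM (one dict per motif position).
# Entries not exercised by the slide examples are placeholder values.
PSPM = [
    {"A": 0.10, "C": 0.30, "G": 0.10, "T": 0.50},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.05, "C": 0.80, "G": 0.10, "T": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
]


def kmer_probability(kmer, pspm):
    """Probability of a k-mer: product of the per-position nucleotide frequencies."""
    p = 1.0
    for column, nucleotide in zip(pspm, kmer):
        p *= column[nucleotide]
    return p


def scan(sequence, pspm):
    """Score every window of length k; returns (1-based position, k-mer, probability)."""
    k = len(pspm)
    return [
        (i + 1, sequence[i:i + k], kmer_probability(sequence[i:i + k], pspm))
        for i in range(len(sequence) - k + 1)
    ]


for position, kmer, p in scan("ATGCAAGTCT", PSPM):
    print(position, kmer, f"{p:.2e}")
```

Position 1 (ATGCA) gives 1.5*10^-4 and position 2 (TGCAA) gives 0.042, as on the slide.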

Detecting a Known Motif within a Sequence using PSSM • Is it a random match, or is it indeed an occurrence of the motif? • PSPM -> PSSM (Position Specific Scoring Matrix) • Odds score matrix: Oi(n), where n ∈ {A,C,G,T}, for i = 1,…,k • Defined as Oi(n) = Pi(n)/P(n), where P(n) is the background frequency of nucleotide n. • As Oi(n) increases, so do the odds that n at position i is part of a real motif.

PSSM as Odds Score Matrix • Assumption: the background frequency of each nucleotide is 0.25. • Original PSPM (Pi): • Odds Matrix (Oi): • Going to log scale we get an additive score, the Log Odds Matrix (log2 Oi):

Calculating using Log Odds Matrix • Log odds ≤ 0 implies random match; log odds > 0 implies real match (?). • Example: sequence = ATGCAAGTCT… • Position 1: ATGCA: -1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7; odds = 2^-2.7 = 0.15 • Position 2: TGCAA: 1 + 0 + 1.68 + 1.48 + 1.26 = 5.42; odds = 2^5.42 = 42.8
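A minimal sketch of the conversion from PSPM to log-odds matrix and of the additive scoring, assuming a uniform background of 0.25. The PSPM is an assumed reconstruction that agrees with the worked examples; entries the examples never use are placeholders. A matrix containing zero frequencies would need pseudocounts before taking logarithms, which this sketch does not handle.

```python
import math

# Assumed reconstruction of the length-5 PSPM; unused entries are placeholders.
PSPM = [
    {"A": 0.10, "C": 0.30, "G": 0.10, "T": 0.50},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.05, "C": 0.80, "G": 0.10, "T": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
]
BACKGROUND = 0.25  # uniform background frequency, as assumed on the slide


def log_odds_matrix(pspm, background=BACKGROUND):
    """log2 of Pi(n) / P(n) for every position i and nucleotide n."""
    return [{n: math.log2(p / background) for n, p in col.items()} for col in pspm]


def log_odds_score(kmer, pssm):
    """Additive score of a k-mer: sum of the per-position log-odds entries."""
    return sum(col[n] for col, n in zip(pssm, kmer))


pssm = log_odds_matrix(PSPM)
for kmer in ("ATGCA", "TGCAA"):
    score = log_odds_score(kmer, pssm)
    print(kmer, round(score, 2), round(2 ** score, 2))
```

Without intermediate rounding the odds for TGCAA come out near 43.0; the slide's 42.8 results from rounding the log-odds sum to 5.42 before exponentiating.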

Calculating the Probability of a Match • Sequence: ATGCAAG • Odds score S at each position: • Position 1: ATGCA = 0.15 • Position 2: TGCAA = 42.8 • Position 3: GCAAG = 0.18 • Normalize over all positions: P(i) = S(i) / ∑ S(j) • Example: P(1) = 0.15 / (0.15 + 42.8 + 0.18) = 0.003 • P(1) = 0.003, P(2) = 0.992, P(3) = 0.004
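The normalization step is a one-liner. A minimal sketch using the three odds scores from the slide; the function name `match_probabilities` is illustrative:

```python
def match_probabilities(odds_scores):
    """Divide each odds score by the sum over all positions."""
    total = sum(odds_scores)
    return [s / total for s in odds_scores]


odds = [0.15, 42.8, 0.18]  # positions 1-3 of ATGCAAG
probabilities = match_probabilities(odds)
for position, p in enumerate(probabilities, start=1):
    print(position, round(p, 3))
```

This treats the sequence as containing exactly one occurrence of the motif, so the probabilities sum to 1 across positions.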

Building a PSSM • Collect all known sequences that bind a certain TF. • Align all sequences (using multiple sequence alignment). • Compute the frequency of each nucleotide in each position (PSPM). • Incorporate background frequency for each nucleotide (PSSM).

Finding new Motifs • We are given a group of genes, which presumably contain a common regulatory motif. • We know nothing of the TF that binds to the putative motif. • The problem: discover the motif.

Example Predicting the cAMP Receptor Protein (CRP) binding site motif

Extract experimentally defined CRP Binding Sites GGATAACAATTTCACA AGTGTGTGAGCGGATAACAA AAGGTGTGAGTTAGCTCACTCCCC TGTGATCTCTGTTACATAG ACGTGCGAGGATGAGAACACA ATGTGTGTGCTCGGTTTAGTTCACC TGTGACACAGTGCAAACGCG CCTGACGGAGTTCACA AATTGTGAGTGTCTATAATCACG ATCGATTTGGAATATCCATCACA TGCAAAGGACGTCACGATTTGGG AGCTGGCGACCTGGGTCATG TGTGATGTGTATCGAACCGTGT ATTTATTTGAACCACATCGCA GGTGAGAGCCATCACAG GAGTGTGTAAGCTGTGCCACG TTTATTCCATGTCACGAGTGT TGTTATACACATCACTAGTG AAACGTGCTCCCACTCGCA TGTGATTCGATTCACA

Create a Multiple Sequence Alignment GGATAACAATTTCACA TGTGAGCGGATAACAA TGTGAGTTAGCTCACT TGTGATCTCTGTTACA CGAGGATGAGAACACA CTCGGTTTAGTTCACC TGTGACACAGTGCAAA CCTGACGGAGTTCACA AGTGTCTATAATCACG TGGAATATCCATCACA TGCAAAGGACGTCACG GGCGACCTGGGTCATG TGTGATGTGTATCGAA TTTGAACCACATCGCA GGTGAGAGCCATCACA TGTAAGCTGTGCCACG TTTATTCCATGTCACG TGTTATACACATCACT CGTGCTCCCACTCGCA TGTGATTCGATTCACA

Generate a PSSM
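A minimal sketch of this step for the 20 aligned CRP sites, following the recipe from the "Building a PSSM" slide: column frequencies first, then log-odds against the background. The pseudocount of 1 per nucleotide and the uniform background of 0.25 are assumptions added here so that nucleotides absent from a column do not produce log2(0); the slides do not say how the original matrix handled this.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# The 20 aligned CRP binding sites (16 nt each) from the alignment slide.
CRP_SITES = [
    "GGATAACAATTTCACA", "TGTGAGCGGATAACAA", "TGTGAGTTAGCTCACT", "TGTGATCTCTGTTACA",
    "CGAGGATGAGAACACA", "CTCGGTTTAGTTCACC", "TGTGACACAGTGCAAA", "CCTGACGGAGTTCACA",
    "AGTGTCTATAATCACG", "TGGAATATCCATCACA", "TGCAAAGGACGTCACG", "GGCGACCTGGGTCATG",
    "TGTGATGTGTATCGAA", "TTTGAACCACATCGCA", "GGTGAGAGCCATCACA", "TGTAAGCTGTGCCACG",
    "TTTATTCCATGTCACG", "TGTTATACACATCACT", "CGTGCTCCCACTCGCA", "TGTGATTCGATTCACA",
]


def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Log-odds PSSM: pssm[i][n] = log2(smoothed frequency of n at i / background)."""
    n_sites = len(sites)
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        denominator = n_sites + pseudocount * len(ALPHABET)
        pssm.append({
            n: math.log2(((counts[n] + pseudocount) / denominator) / background)
            for n in ALPHABET
        })
    return pssm


pssm = build_pssm(CRP_SITES)
# The best-scoring nucleotide per column gives a consensus-like summary.
print("".join(max(col, key=col.get) for col in pssm))
```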

XXXXXTGTGAXXXXAXTCACAXXXXXXX XXXXXACACTXXXXTXAGTGTXXXXXXX
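The two conserved half-sites, TGTGA and TCACA, are reverse complements of each other, so the site reads the same on both strands. A minimal sketch that checks this; the helper name `reverse_complement` is illustrative:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement every nucleotide, then reverse to read the other strand 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("TGTGA"))  # TCACA
print(reverse_complement("TCACA"))  # TGTGA
```

For motifs without this symmetry, a genome scan would normally score each window and its reverse complement.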

PROBLEMS… • When searching for a motif in a genome using a PSSM or other methods, the motif is usually found all over the place -> the motif is considered real if found in the vicinity of a gene. • When checking experimentally for the binding sites of a specific TF (location analysis), the sites bound by the TF are in some cases similar to the PSSM and sometimes not!

Computational Methods • This problem has received a lot of attention from computer scientists. • Methods include: • Probabilistic methods – hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc. • Enumeration methods – problematic for inexact motifs of length k>10. • Current status: the problem is still open.
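To illustrate the enumeration approach and its limits, here is a minimal sketch that counts, for every exact k-mer, how many input sequences contain it. The three toy sequences are hypothetical, and the function name `kmer_support` is illustrative. Exact matching only visits k-mers that occur in the data; allowing mismatches means considering candidates from the full space of 4^k k-mers, which is what makes long inexact motifs expensive.

```python
from collections import Counter


def kmer_support(sequences, k):
    """For each exact k-mer, count the number of sequences that contain it."""
    support = Counter()
    for seq in sequences:
        # A set, so that a k-mer repeated within one sequence counts once.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


# Hypothetical upstream fragments sharing one 5-mer.
toy_sequences = ["AATGTGAC", "CCTGTGAT", "TGTGAGGG"]

support = kmer_support(toy_sequences, k=5)
print(support.most_common(1))  # [('TGTGA', 3)]

# Size of the candidate space for inexact motifs of length 10.
print(4 ** 10)  # 1048576
```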

Tools on the Web • MEME - Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/ • metaMEME - uses the HMM method. http://meme.sdsc.edu/meme • MAST - Motif Alignment and Search Tool. http://meme.sdsc.edu/meme • TRANSFAC - database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/ • eMotif - scan, make, and search for motifs at the protein level. http://motif.stanford.edu/emotif/
