
Sequence motifs

What are sequence motifs? Sequences are translated into electron densities with different affinities for interacting with other molecules. Motifs represent a short common sequence, such as regulatory motifs (TF binding sites) or functional sites in proteins (DNA binding motifs).

deirdre

Presentation Transcript


  1. Sequence motifs

  2. What are sequence motifs?
     • Sequences are translated into electron densities with different affinities for interacting with other molecules.
     • Motifs represent a short common sequence:
       • Regulatory motifs (TF binding sites)
       • Functional sites in proteins (DNA binding motif)

  3. DNA Regulatory Motifs
     • Transcription factors (TFs) bind to regulatory motifs with high affinity
     • TF binding motifs are usually 6 – 20 nucleotides long
     • Usually located near the target gene, mostly upstream of the transcription start site
     [Figure labels: Transcription Start Site, SBF, MCM1, Gene X, MCM1 motif, SBF motif]

  4. Identification of Known Motifs within Genomic Sequences
     • Main motivation: identifying the targets of regulatory proteins (e.g. transcription factors) in the cell
     • In many cancers specific TFs are known to be mutated. How do we identify the genes that are affected downstream?

  5. p53: the guardian of the cell

  6. How can we start looking for p53 (or any other transcription factor) targets using bioinformatics?
     • Scenario 1: Binding motif is known (easier case)
     • Scenario 2: Binding motif is unknown (hard case)

  7. Challenges
     • How to recognize a regulatory motif?
     • Can we identify new occurrences of known motifs in genome sequences?
     • Can we discover new motifs within upstream sequences of genes?

  8. Scenario 1: Binding targets are known

  9. 1. Motif Representation
     GTTCTTCGTGTTTATTTTTAGGAAATTGATGA
     TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT
     TTTACTAACGACACATTGAAGAAATCACTTTG
     GATACGCTTACCGTTATCCAGAGCTACAGCGC
     TACTAATATGTAATACTTCAGCTCCCCTTAAT
     ATTGAGATCTTTTTTAACTAGTTAGGTCTACC
     TTCTCCCCTTCTTCATTTTAGCCTGTTTGGAC
     TAACATAACTTATTTACATAGTGCCATTGAAC
     GATATTTCCCGTTGTGTTAAGGCTGAGAAGAA
     TTTTCCCGACCATCAAGACAGGTGATTTATCA
     TGCAAAAACTTTTTTTCACAGGGCTAACTTGC
     GTTTATTGTGTTTCCACTCAGTTAAAAAACGA
     AACGTACTTTAATATTTATAGTACTTCATTCG
     AACATGCTATTTTTCATACAGCAACCTCACAT
     CTGCACTCATCATTAGATTAGAGGAACATGGA
     TACTTTTCTTTATCTAAGCAGCTAACTCAACT
     ATCAACATGCTATTGAACTAGAGATCCACCTA
     TAACTAACATGACTTTAACAGGGCTAATTTAC
     AGTACTAACTAATTAACTTAGAACATTAACAT
     GATCACCGTCACATTTATTAGAATTTCAAACG
     CAGTGGAATTTTTTTTTCTAGAAATGGTATCG
     CTCTATGACCAATAAAAACAGACTGTACTTTC
     AAATGGTATTATTTATAACAGTTGAACATTTC
     ATAAATATGCGATCAATATAGACCGTTGATAT
     ATTTTACTTTTTTTTTTTTAGGAGCTCCAAGA
     ATTTATTTCCTTATAATACAGACACGGTTACA
     TCGCAATTAATTTTCTAATAGTTTTTCATTTT
     GACCATCTTTCTTTTCCCCAGTGCTAAACACG
     AACCTTCTTTCTCATTCGTAGATTACTGTTGC
     AATTACTAACAGCTGTAATAGCCGACAAATTT
     CTCTCTGCGCGTCCAATTTAGCTATACTGTTG
     TTGTTTTGTTTTGTCGTACAGTGTTTGGAGAA
     AAACTTCCATTTCTTACATAGATCATCGCCAT
     TCCTTTCCATAATTTATTCAGCGCTTTGGTAT
     CGATTTACTATTTCCATTTAGACGTTGTTCAA
     AATTTACTAACAATACTTCAGTTTATAATGGA
     TCCTATACTAACAATTTGTAGTTCATAAATAA
     • Consensus: represent only ‘common’ nucleotides
     • NANCATNNCCTTTTTATACAGNNNTTNNNTNN
     • N stands for any nucleotide.
     • Representing only the consensus loses information. How can this be avoided?
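One way a consensus string could be derived from an alignment is sketched below. The function name `consensus` and the 0.6 frequency threshold are assumptions for illustration; the slide does not say which cutoff produced its consensus.

```python
from collections import Counter

def consensus(alignment, threshold=0.6):
    """Return a consensus string; columns whose most common base
    falls below the threshold are written as 'N'."""
    result = []
    for column in zip(*alignment):  # iterate over alignment columns
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(column) >= threshold else "N")
    return "".join(result)

# The first three sequences of the alignment on slide 9
sites = [
    "GTTCTTCGTGTTTATTTTTAGGAAATTGATGA",
    "TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT",
    "TTTACTAACGACACATTGAAGAAATCACTTTG",
]
print(consensus(sites))
```

Every column that becomes 'N' discards the actual base frequencies, which is the information loss the slide points out.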

  10. Entropy - Definition Claude E. Shannon 1948, “A mathematical theory of communication”.

  11. Entropy - Definition

  12. Entropy - Example
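The slide's own example is not preserved in this transcript. As a stand-in, here is a minimal sketch of Shannon entropy in bits for one alignment column, using the standard definition H = -Σ p log2 p. The function name and the two example distributions are illustrative assumptions.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Four equally likely nucleotides: maximal uncertainty
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
# One nucleotide always present: no uncertainty
conserved = shannon_entropy([1.0, 0.0, 0.0, 0.0])
```

Under this definition a column with four equally likely nucleotides has 2 bits of entropy and a fully conserved column has 0 bits.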

  13. Relative Entropy: The Kullback-Leibler distance D
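The formula on this slide is not preserved in the transcript. A minimal sketch of the standard Kullback-Leibler definition, D(p||q) = Σ p log2(p/q), might look like this; the function name and the uniform background are assumptions.

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

background = [0.25, 0.25, 0.25, 0.25]          # assumed uniform background
conserved_column = [1.0, 0.0, 0.0, 0.0]        # a fully conserved position
distance = kl_divergence(conserved_column, background)
```

D is not symmetric: swapping p and q generally gives a different value, so it is not a distance in the strict metric sense.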

  14. Information content

  15. Information content

  16. Information content

  17. Logo plots - HowTo
     Frequency-logo, built from the alignment shown on slide 9 (repeated on this slide):
     • Count nucleotides at each position
     • Convert to frequencies
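The two steps named on this slide, counting and converting to frequencies, can be sketched as follows. The function names and the four-sequence toy alignment are assumptions for illustration.

```python
def count_matrix(alignment, alphabet="ACGT"):
    """Count each nucleotide at each alignment position."""
    length = len(alignment[0])
    counts = {base: [0] * length for base in alphabet}
    for seq in alignment:
        for pos, base in enumerate(seq):
            counts[base][pos] += 1
    return counts

def frequency_matrix(counts):
    """Divide every count by the number of aligned sequences."""
    n = sum(row[0] for row in counts.values())  # sequences in the alignment
    return {base: [c / n for c in row] for base, row in counts.items()}

toy_alignment = ["ACG", "ACT", "AGG", "TCG"]  # hypothetical data
counts = count_matrix(toy_alignment)
freqs = frequency_matrix(counts)
```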

  18. Sequence-logo: biological information
     [Figure labels: Exon, Intron, Exon; the slide repeats the alignment shown on slide 9]
     • Multiple alignment of acceptor sites from 268 yeast DNA sequences
     • What is the biological signal around the site?
     • What are the important positions?
     • How can it be visualized?
     • Logo plot with Information Content

  19. Logo plots - Information Content
     Sequence-logo: calculate the Information Content, $I = \sum_a p_a \log_2 p_a + \log_2 4$. The maximal value is 2 bits.
     [Figure annotations: "Completely conserved", "~0.5 each"]
     • X axis: relative position. Y axis: information content in bits.
     • Total height at a position is the Information Content measured in bits.
     • Height of each letter is proportional to the frequency of that letter.
     • Stack order indicates importance; the consensus is read at the top.
     • A Logo plot is a visualization of a multiple alignment.
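The information content formula on this slide, I = Σ p_a log2 p_a + log2(4), and the letter heights of a logo can be sketched for a single column as below. The function names and example columns are assumptions, and no small-sample correction is applied.

```python
import math

def information_content(column_freqs):
    """I = sum_a p_a * log2(p_a) + log2(4), in bits, for one column."""
    return math.log2(4) + sum(
        p * math.log2(p) for p in column_freqs.values() if p > 0
    )

def letter_heights(column_freqs):
    """Each letter's height is its frequency times the column's information."""
    total = information_content(column_freqs)
    return {base: p * total for base, p in column_freqs.items()}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
two_way = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
heights = letter_heights(two_way)
```

A fully conserved column reaches the 2-bit maximum, while a column split evenly between two bases carries 1 bit, shared as 0.5 bits per letter.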

  20. Pseudocounts
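The body of this slide is not preserved. A common way to apply pseudocounts, shown here as an assumption rather than as the presenter's method, is to add a small constant to every count before normalizing so that unseen bases never get probability zero.

```python
def frequencies_with_pseudocounts(counts, pseudocount=1.0):
    """counts: dict mapping base -> observed count in one column.
    Adds the same pseudocount to every base (add-one smoothing by default)."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {base: (c + pseudocount) / total for base, c in counts.items()}

# A column where only 'A' was observed in four sequences (hypothetical data)
smoothed = frequencies_with_pseudocounts({"A": 4, "C": 0, "G": 0, "T": 0})
```

With four observed sequences and a pseudocount of 1, 'A' gets 5/8 and each unseen base gets 1/8, which keeps later logarithms finite.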

  21. PSSM – Position Specific Scoring Matrix • Besides Entropy and Information content there are other ways to express a motif
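A PSSM is often built as a matrix of log-odds scores and slid along a sequence. The sketch below assumes a uniform background of 0.25 per base and a pseudocount of 1; the function names and the toy sites are hypothetical.

```python
import math

def build_pssm(alignment, background=0.25, pseudocount=1.0, alphabet="ACGT"):
    """One dict of log2(frequency / background) scores per alignment column."""
    n = len(alignment)
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for base in alphabet:
            freq = (column.count(base) + pseudocount) / (n + pseudocount * len(alphabet))
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, window):
    """Sum the per-position scores of one window."""
    return sum(col[base] for col, base in zip(pssm, window))

def scan(pssm, sequence):
    """Score every window of the motif width; returns (start, score) pairs."""
    width = len(pssm)
    return [(i, score(pssm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

toy_sites = ["TGTGA", "TGTGA", "TGTGC", "TGAGA"]  # hypothetical binding sites
pssm = build_pssm(toy_sites)
hits = scan(pssm, "CCTGTGACC")
best_start, best_score = max(hits, key=lambda h: h[1])
```

In practice a score threshold decides which windows count as hits; choosing it is a separate problem that this sketch leaves out.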

  22. Example: Predicting the cAMP Receptor Protein (CRP) binding site motif by using a logo plot

  23. Extract experimentally defined CRP Binding Sites
     GGATAACAATTTCACA
     AGTGTGTGAGCGGATAACAA
     AAGGTGTGAGTTAGCTCACTCCCC
     TGTGATCTCTGTTACATAG
     ACGTGCGAGGATGAGAACACA
     ATGTGTGTGCTCGGTTTAGTTCACC
     TGTGACACAGTGCAAACGCG
     CCTGACGGAGTTCACA
     AATTGTGAGTGTCTATAATCACG
     ATCGATTTGGAATATCCATCACA
     TGCAAAGGACGTCACGATTTGGG
     AGCTGGCGACCTGGGTCATG
     TGTGATGTGTATCGAACCGTGT
     ATTTATTTGAACCACATCGCA
     GGTGAGAGCCATCACAG
     GAGTGTGTAAGCTGTGCCACG
     TTTATTCCATGTCACGAGTGT
     TGTTATACACATCACTAGTG
     AAACGTGCTCCCACTCGCA
     TGTGATTCGATTCACA

  24. Create a Multiple Sequence Alignment
     GGATAACAATTTCACA
     TGTGAGCGGATAACAA
     TGTGAGTTAGCTCACT
     TGTGATCTCTGTTACA
     CGAGGATGAGAACACA
     CTCGGTTTAGTTCACC
     TGTGACACAGTGCAAA
     CCTGACGGAGTTCACA
     AGTGTCTATAATCACG
     TGGAATATCCATCACA
     TGCAAAGGACGTCACG
     GGCGACCTGGGTCATG
     TGTGATGTGTATCGAA
     TTTGAACCACATCGCA
     GGTGAGAGCCATCACA
     TGTAAGCTGTGCCACG
     TTTATTCCATGTCACG
     TGTTATACACATCACT
     CGTGCTCCCACTCGCA
     TGTGATTCGATTCACA

  25. Generate a Logo plot
     XXXXXTGTGAXXXXAXTCACAXXXXXXX
     XXXXXACACTXXXXTXAGTGTXXXXXXX
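The two strings on this slide are base-by-base complements of each other, and the two conserved halves, TGTGA and TCACA, are reverse complements. A short sketch makes this checkable; the helper names are assumptions, and 'X' is left untouched as a placeholder character.

```python
# Translation table for the four DNA bases; other characters pass through
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, same orientation."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return complement(seq)[::-1]

top_strand = "XXXXXTGTGAXXXXAXTCACAXXXXXXX"
bottom_strand = complement(top_strand)
half_site = reverse_complement("TGTGA")
```

Because the second half site is the reverse complement of the first, the conserved core reads the same on both strands.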

  26. WebLogo - Input • http://weblogo.berkeley.edu

  27. WebLogo - Outputs
     [Figure panels: Genes, Proteins]

  28. PROBLEMS…
     • When searching for a motif in a genome using a PSSM or other methods, the motif is usually found all over the place.
     • The motif is considered real if found in the vicinity of a gene.
     • Checking experimentally for the binding sites of a specific TF (location analysis): the sites bound by the TF are in some cases similar to the PSSM and sometimes not!

  29. Scenario 2: Binding targets are unknown

  30. Finding new Motifs
     • We are given a group of genes, which presumably contain a common regulatory motif.
     • We know nothing of the TF that binds to the putative motif.
     • The problem: discover the motif.

  31. Motif Discovery

  32. Computational Methods
     • This problem has received a lot of attention from CS people.
     • Methods include:
       • Probabilistic methods – hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
       • Enumeration methods – problematic for inexact motifs of length k>10.
       • …
     • Current status: Problem is still open.
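To illustrate why enumeration becomes problematic for longer motifs, here is a minimal sketch that counts in how many sequences each exact k-mer occurs. The function name and toy sequences are assumptions; handling inexact matches would require a much larger search.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Number of sequences in which each exact k-mer occurs at least once."""
    counts = Counter()
    for seq in sequences:
        # A set counts each k-mer only once per sequence
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

toy_sequences = ["ACGTAC", "TTACGA"]  # hypothetical data
counts = kmer_counts(toy_sequences, 3)

# Size of the dictionary of all possible DNA words of length 10
possible_10mers = 4 ** 10
```

The number of possible words grows as 4^k, already over a million for k = 10, before any mismatches are allowed.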

  33. MEME
     "We need a name for the new replicator, a noun that conveys the idea of a unit of cultural transmission, or a unit of imitation. 'Mimeme' comes from a suitable Greek root, but I want a monosyllable that sounds a bit like 'gene'. I hope my classicist friends will forgive me, if I abbreviate mimeme to meme..."
     Richard Dawkins

  34. An (unsupervised) machine learning approach to motif discovery.
     • Input:
       • Set of unaligned sequences.
       • Possible width of motifs.
     • Output:
       • A set of gapless motifs.
       • Classifier for each motif.
       • Alignment of the occurrences of the motif to the input set.
     Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

  35. MEME: Expectation Maximization
     • Goal: find the motif profile and positions that have maximum likelihood
     • Iteratively re-estimates a probabilistic model of a motif that is statistically overrepresented in the dataset
     • Converges to a local optimum
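The expectation-maximization idea can be sketched on a toy scale. This is not MEME's implementation: it assumes exactly one motif occurrence per sequence, a uniform background of 0.25 per base, and a user-supplied seed word for initialization. The function `em_motif`, its parameters and the toy sequences are all hypothetical.

```python
import math

BASES = "ACGT"

def em_motif(sequences, width, seed, iterations=50, pseudocount=0.1):
    """Toy EM for one gapless motif occurrence per sequence."""
    # Initialise the profile from the seed: 0.7 for the seed base, 0.1 otherwise
    profile = [{b: (0.7 if b == s else 0.1) for b in BASES} for s in seed]
    for _ in range(iterations):
        # E-step: posterior probability of every start position in every sequence
        posteriors = []
        for seq in sequences:
            weights = [
                math.prod(profile[j][seq[i + j]] / 0.25 for j in range(width))
                for i in range(len(seq) - width + 1)
            ]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: re-estimate the profile from expected base counts
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq, post in zip(sequences, posteriors):
            for i, z in enumerate(post):
                for j in range(width):
                    counts[j][seq[i + j]] += z
        profile = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    # Most probable start position in each sequence
    starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return profile, starts

toy_sequences = ["CCATGTGACC", "TGTGAGCCGC", "GCCGCTGTGA", "ACTGTGAGGC"]
profile, starts = em_motif(toy_sequences, width=5, seed="TGTGA")
learned_consensus = "".join(max(col, key=col.get) for col in profile)
```

Because EM only reaches a local optimum, the result depends on the starting profile, so in practice several starting points are usually tried.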

  36. MEME result example

  37. MEME Pros and cons
     • The number of motifs or their occurrences are not required in the input.
     • Only allows exact matches.
     • High time complexity.
     • Very pessimistic, can miss signals.

  38. DRIM is a tool for discovering short motifs in a ranked list of nucleic acid sequences.
     • From a mathematical point of view, DRIM identifies subsequences that tend to appear at the top of the list more often than in the rest of the list.
     • The definition of TOP in this context is flexible and driven by the data.
     E. Eden, D. Lipson, S. Yogev & Z. Yakhini. Discovering Motifs in Ranked Lists of DNA Sequences, PLoS Computational Biology, 2007.

  39. The HyperGeometric (HG) score
     • The HG score estimates the significance of the intersection (of size b)
     • N: all genes, ranked according to some criterion
     • B of them contain the motif
     • n of them are located at the top of the list
     • b contain the motif and are located at the top of the list
     [Figure labels: N genes, B, n, b]
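With the four quantities N, B, n and b defined on this slide, the significance of the intersection can be computed as a hypergeometric tail probability: the chance of seeing b or more motif-containing genes among the top n. The sketch below uses exact integer binomials; the function name and the example numbers are assumptions.

```python
from math import comb

def hg_pvalue(N, B, n, b):
    """P(X >= b) when n genes are drawn without replacement
    from N genes, B of which contain the motif."""
    upper = min(n, B)
    favourable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: 10 genes, 5 with the motif, all 5 in the top 5
p_all_on_top = hg_pvalue(N=10, B=5, n=5, b=5)
```

Exact binomials are fine for small lists; for genome-scale N a log-space computation would be more practical.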

  40. The mHG score
     • DRIM checks all the possibilities for n, in order to optimize the significance of the intersection.
     • It chooses the n_i which has the minimal HG score, denoted as the mHG score.
     • The mHG score reflects the surprise of seeing the observed density of motif occurrences at the top of the list compared with the rest of the list. (It still needs to be corrected for multiple hypothesis testing.)
     [Figure labels: N genes, B, n_i, b_i]
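The minimization over all cutoffs n can be sketched as a single pass down the ranked list. This is an illustrative sketch, not DRIM's code: the function names and the toy ranking are hypothetical, and the returned value is the uncorrected minimum, so the multiple hypothesis correction mentioned on the slide is still missing.

```python
from math import comb

def hg_tail(N, B, n, b):
    """Hypergeometric tail P(X >= b) for a top-n cutoff."""
    favourable = sum(comb(B, k) * comb(N - B, n - k)
                     for k in range(b, min(n, B) + 1))
    return favourable / comb(N, n)

def mhg_score(has_motif):
    """has_motif: ranked list of booleans, best-ranked gene first.
    Returns (minimal HG score, cutoff n at which it is reached)."""
    N, B = len(has_motif), sum(has_motif)
    best_p, best_n, b = 1.0, 0, 0
    for n, hit in enumerate(has_motif, start=1):
        b += hit  # motif-containing genes among the top n
        p = hg_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n

# Hypothetical ranking: the three motif-containing genes are ranked first
ranked = [True, True, True] + [False] * 7
score, cutoff = mhg_score(ranked)
```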

  41. Puf2 – an RNA binding protein
     Yeast 3’UTR sequences were ranked according to Puf2 binding affinity.
     >YDR222W, affinity = 5.962
     ACAAAAGCGUGAACACUUCCACAUGAAAUUCGUUUUUUGUCCUUUUUUUUCUCUUC
     UUUUUCUCUCCUGUUUCU
     >YLR297W, affinity = 5.937
     AAUAAAAAUAGAUAUAAUAGAUGGCACCGCUCUUCACGCCCGAAAGUUGGACAUUUU
     AAAUUUUAAUUCUCAUGA
     >YOL109W, affinity = 5.763
     UCACACUUGAAUGUGCUGCACUUUACUAGAAGUUUCUUUUUCUUUUUUUAAAAAUA
     AAAAAAGAGGAGAAAAAUGC
     >YGR138C, affinity = 5.498
     GCUGGUGCAAGUUUCCGGUAAAAAUAAUGAUGUUCUAGUCAUUCAUAUAUACGAUA
     CAAAAAUAACA
     >YGL035C, affinity = 5.091
     UACGCUGACAAGUUUUUGGCGGUGCAGAUAAAUCAAAAGACAAUAGACAAGAAUUAA
     UAAUAUUAACAAUUAA
     ...
     DRIM (mHG p-value = 9.9∙10^-49)

  42. DRIM pros and cons
     • Finds relations between ranking variable and motifs (enrichment).
     • Returns best possible match without the need of a significance threshold.
     • Impossible to build a dictionary for motifs longer than ~10-mers.

  43. Tools on the Web
     • MEME – Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme
       • metaMEME – uses HMM method
       • MAST – Motif Alignment and Search Tool
       • Etc…
     • TRANSFAC – database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/
     • eMotif – allows scanning, making and searching for motifs at the protein level. http://motif.stanford.edu/emotif/
     • DRIM – finds short motifs enriched in ranked lists. http://bioinfo.cs.technion.ac.il/drim/
