
Recognition of regulatory signals


Presentation Transcript


  1. Recognition of regulatory signals Mikhail S. Gelfand IntegratedGenomics-Moscow NATO ASI School, October 2001

  2. Why? • Additional annotation tool (e.g. specificity of transporters and enzymes from large families) • Important for practice (in addition to metabolic reconstruction) • Interesting from the evolutionary point of view

  3. Overview 0. Biological introduction 1. Algorithms • Representation of signals • Deriving the signal • Site recognition 2. Comparative genomics • Phylogenetic footprinting • Consistency filtering

  4. Some biology • Transcription (DNA → RNA) • Splicing (pre-mRNA → mRNA) • Translation (mRNA → protein) • Regulation of transcription in prokaryotes • … and eukaryotes • Initiation of translation

  5. Transcription and translation in prokaryotes

  6. Initiation of transcription (bacteria)

  7. Translation in prokaryotes

  8. Translation (details)

  9. Splicing (eukaryotes)

  10. Regulation of transcription in prokaryotes

  11. Structure of DNA-binding domain. Example 1

  12. Structure of DNA-binding domain. Example 2

  13. Protein-DNA interactions

  14. Regulation of transcription in eukaryotes

  15. Representation of signals • Consensus • Pattern (consensus with degenerate positions) • Positional weight matrix (PWM, or profile) • Logical rules • RNA signals

  16. Consensus codB CCCACGAAAACGATTGCTTTTT purE GCCACGCAACCGTTTTCCTTGC pyrD GTTCGGAAAACGTTTGCGTTTT purT CACACGCAAACGTTTTCGTTTA cvpA CCTACGCAAACGTTTTCTTTTT purC GATACGCAAACGTGTGCGTCTG purM GTCTCGCAAACGTTTGCTTTCC purH GTTGCGCAAACGTTTTCGTTAC purL TCTACGCAAACGGTTTCGTCGG consensus ACGCAAACGTTTTCGT

  17. Pattern codB CCCACGAAAACGATTGCTTTTT purE GCCACGCAACCGTTTTCCTTGC pyrD GTTCGGAAAACGTTTGCGTTTT purT CACACGCAAACGTTTTCGTTTA cvpA CCTACGCAAACGTTTTCTTTTT purC GATACGCAAACGTGTGCGTCTG purM GTCTCGCAAACGTTTGCTTTCC purH GTTGCGCAAACGTTTTCGTTAC purL TCTACGCAAACGGTTTCGTCGG consensus ACGCAAACGTTTTCGT pattern aCGmAAACGtTTkCkT
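
A minimal Python sketch of how a consensus and a degenerate pattern like the one above can be derived from aligned sites; the 90% conservation threshold and the uppercase/lowercase convention are illustrative assumptions, not the lecture's exact procedure:

```python
from collections import Counter

# Two-base IUPAC degeneracy codes (m = A/C, k = G/T, ...)
IUPAC = {frozenset('AC'): 'm', frozenset('GT'): 'k', frozenset('AG'): 'r',
         frozenset('CT'): 'y', frozenset('AT'): 'w', frozenset('CG'): 's'}

def consensus_and_pattern(sites, strong=0.9):
    cons, patt = [], []
    for column in zip(*sites):              # one column per alignment position
        ranked = Counter(column).most_common()
        top, top_n = ranked[0]
        cons.append(top)
        if top_n / len(column) >= strong:   # well conserved: capital letter
            patt.append(top.upper())
        elif len(ranked) > 1 and (top_n + ranked[1][1]) / len(column) >= strong:
            # two bases together dominate the column: degenerate code
            patt.append(IUPAC.get(frozenset((top, ranked[1][0])), top.lower()))
        else:
            patt.append(top.lower())        # weakly conserved: small letter
    return ''.join(cons), ''.join(patt)

sites = ["ACGCAAACGTTTTCGT", "ACGCAACCGTTTTCCT", "ACGAAAACGATTGCTT"]  # toy
print(consensus_and_pattern(sites))
```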

  18. Frequency matrix Information content I = Σj Σb f(b,j) log [ f(b,j) / p(b) ]
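
A minimal sketch of this computation in Python; log base 2 (bits), the uniform background, and the toy two-column frequency matrix are assumptions:

```python
import math

def information_content(freq, background):
    """I = sum over positions j and bases b of f(b,j) * log2(f(b,j)/p(b))."""
    total = 0.0
    for column in freq:                      # one dict per position j
        for base, f in column.items():
            if f > 0:                        # 0 * log 0 treated as 0
                total += f * math.log(f / background[base], 2)
    return total                             # in bits

p = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}    # uniform background
freq = [{'A': 1.0, 'C': 0.0, 'G': 0.0, 'T': 0.0},   # invariant column: 2 bits
        {'A': 0.5, 'C': 0.0, 'G': 0.5, 'T': 0.0}]   # two-way column: 1 bit
print(information_content(freq, p))                  # -> 3.0
```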

  19. Sequence logo

  20. Positional weight matrix (PWM)

  21. Probabilistic motivation: log-likelihood (up to a linear transformation) • More probabilistic motivation: z-score (with the suitable base of the logarithm) • Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation) • Pseudocounts
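
The pseudocount idea can be sketched as follows: add a small constant to each count before taking log-odds against the background, so that unobserved bases do not get weight minus infinity. The pseudocount of 0.5 per base and the uniform background p = 0.25 are illustrative choices:

```python
import math

def pwm_from_counts(counts, pseudo=0.5, p=0.25):
    """Log-odds weights: log(smoothed frequency / background) per position."""
    pwm = []
    for col in counts:                       # col: dict base -> observed count
        total = sum(col.values()) + 4 * pseudo
        pwm.append({b: math.log((n + pseudo) / total / p)
                    for b, n in col.items()})
    return pwm

# Toy count matrix for a two-position signal (hypothetical numbers):
print(pwm_from_counts([{'A': 9, 'C': 0, 'G': 0, 'T': 0},
                       {'A': 4, 'C': 0, 'G': 5, 'T': 0}]))
```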

  22. Logical rules, trees etc.

  23. Compilation of samples • Initial sample: • GenBank • specialized databases • literature (reviews) • literature (original papers) • Correction of GenBank errors • Checking the literature • removal of predicted sites • Removal of duplicates

  24. Re-alignment approaches • Initial alignment by a biological landmark • start of transcription for promoters • start codon for ribosome binding sites • exon-intron boundary for splicing sites • Deriving the signal within a sliding window • Re-alignment • etc. etc. until convergence
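
One pass of such a re-alignment might look like the following sketch: starting from offsets at the biological landmark, each site is moved to its best-scoring position within a window. The window size and the score() interface (any per-site scoring function, e.g. a profile score) are assumptions:

```python
def realign_once(seqs, offsets, k, window, score):
    """Move each site to the best-scoring position near its current offset."""
    new = []
    for s, o in zip(seqs, offsets):
        lo = max(0, o - window)
        hi = min(len(s) - k, o + window)
        new.append(max(range(lo, hi + 1), key=lambda j: score(s[j:j + k])))
    return new   # iterate, rebuilding the profile, until the offsets converge
```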

  25. Gene starts of Bacillus subtilis dnaN ACATTATCCGTTAGGAGGATAAAAATG gyrA GTGATACTTCAGGGAGGTTTTTTAATG serS TCAATAAAAAAAGGAGTGTTTCGCATG bofA CAAGCGAAGGAGATGAGAAGATTCATG csfB GCTAACTGTACGGAGGTGGAGAAGATG xpaC ATAGACACAGGAGTCGATTATCTCATG metS ACATTCTGATTAGGAGGTTTCAAGATG gcaD AAAAGGGATATTGGAGGCCAATAAATG spoVC TATGTGACTAAGGGAGGATTCGCCATG ftsH GCTTACTGTGGGAGGAGGTAAGGAATG pabB AAAGAAAATAGAGGAATGATACAAATG rplJ CAAGAATCTACAGGAGGTGTAACCATG tufA AAAGCTCTTAAGGAGGATTTTAGAATG rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG rplM AGATCATTTAGGAGGGGAAATTCAATG

  26. dnaN ACATTATCCGTTAGGAGGATAAAAATG gyrA GTGATACTTCAGGGAGGTTTTTTAATG serS TCAATAAAAAAAGGAGTGTTTCGCATG bofA CAAGCGAAGGAGATGAGAAGATTCATG csfB GCTAACTGTACGGAGGTGGAGAAGATG xpaC ATAGACACAGGAGTCGATTATCTCATG metS ACATTCTGATTAGGAGGTTTCAAGATG gcaD AAAAGGGATATTGGAGGCCAATAAATG spoVC TATGTGACTAAGGGAGGATTCGCCATG ftsH GCTTACTGTGGGAGGAGGTAAGGAATG pabB AAAGAAAATAGAGGAATGATACAAATG rplJ CAAGAATCTACAGGAGGTGTAACCATG tufA AAAGCTCTTAAGGAGGATTTTAGAATG rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG rplM AGATCATTTAGGAGGGGAAATTCAATG cons. aaagtatataagggagggttaataATG num. 001000000000110110000000111 760666658967228106888659666 (num.: count of the consensus base in each column, out of 16 sequences; two-digit counts written vertically, tens digit in the first row, units in the second)

  27. dnaN ACATTATCCGTTAGGAGGATAAAAATG gyrA GTGATACTTCAGGGAGGTTTTTTAATG serS TCAATAAAAAAAGGAGTGTTTCGCATG bofA CAAGCGAAGGAGATGAGAAGATTCATG csfB GCTAACTGTACGGAGGTGGAGAAGATG xpaC ATAGACACAGGAGTCGATTATCTCATG metS ACATTCTGATTAGGAGGTTTCAAGATG gcaD AAAAGGGATATTGGAGGCCAATAAATG spoVC TATGTGACTAAGGGAGGATTCGCCATG ftsH GCTTACTGTGGGAGGAGGTAAGGAATG pabB AAAGAAAATAGAGGAATGATACAAATG rplJ CAAGAATCTACAGGAGGTGTAACCATG tufA AAAGCTCTTAAGGAGGATTTTAGAATG rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG rplM AGATCATTTAGGAGGGGAAATTCAATG cons. tacataaaggaggtttaaaaat num. 0000000111111000000001 5755779156663678679890 (num. read as on the previous slide: vertical two-digit counts of the consensus base, after re-alignment)

  28. Positional information content before and after re-alignment

  29. Positional nucleotide frequencies after re-alignment (aGGAGG pattern)

  30. Enhancement of a weak signal

  31. Deriving the signal ab initio • “Discrete” (pattern-driven) approaches: word counting • “Continuous” (profile-driven) approaches: optimization

  32. Word counting. Short words • Consider all k-mers • For each k-mer compute the number of sequences containing this k-mer • (maybe with some mismatches) • Select the most frequent k-mer
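
A direct sketch of this procedure in Python; it enumerates all 4^k words (exact matches only), so it is feasible only for small k, which is exactly the limitation the next slide addresses. The toy sequences are assumptions:

```python
from itertools import product

def most_frequent_kmer(seqs, k):
    """Count, for every k-mer, how many sequences contain it (exact match)."""
    best, best_count = None, -1
    for kmer in (''.join(t) for t in product('ACGT', repeat=k)):
        count = sum(kmer in s for s in seqs)
        if count > best_count:
            best, best_count = kmer, count
    return best, best_count

seqs = ["TTACGCAAACGT", "ACGCAAACGTTT", "GGACGCAAACGA"]   # toy purine-box-like
print(most_frequent_kmer(seqs, 5))   # a 5-mer shared by all three sequences
```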

  33. Problem: complete search is possible only for short words. Assumption: if a long word is over-represented, its subwords are also over-represented. Solution: select a set of over-represented words and combine them into longer words

  34. Word counting. Long words • Consider some k-mers • For each k-mer compute the number of sequences containing this k-mer • (maybe with some mismatches) • Select the most frequent k-mer

  35. Problem: what k-tuples to start with? 1st attempt: those actually occurring in the sample. But: the correct signal (the consensus word) may not be among them.

  36. 2nd attempt: those actually occurring in the sample and some neighborhood. But: • again, the correct signal (the consensus word) may not be among them; • the size of the neighborhood grows exponentially

  37. Graph approach Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k). Thus we obtain an n-partite graph (n is the number of sequences). A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.

  38. A simple algorithm • Remove vertices that cannot be extended to complete subgraphs • that is, do not have arcs to all parts of the graph • Remove pairs that cannot be extended … • that is, do not form triangles with the third vertex in all parts of the graph • Etc. (will not work “as is” for dense subgraphs)
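
The first pruning step translates into the following sketch, under the assumptions that vertices are exact k-mers and that two k-mers are linked when their Hamming distance is at most h; the pair/triangle steps extend the same scheme and are not shown:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def prune(seqs, k, h):
    """Drop k-mers lacking a neighbour (<= h mismatches) in some other part."""
    parts = [{s[j:j + k] for j in range(len(s) - k + 1)} for s in seqs]
    changed = True
    while changed:
        changed = False
        for i, part in enumerate(parts):
            keep = {v for v in part
                    if all(any(hamming(v, w) <= h for w in other)
                           for j, other in enumerate(parts) if j != i)}
            if keep != part:
                parts[i], changed = keep, True
    return parts   # surviving vertices; cliques are sought among these
```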

  39. Optimization. EM algorithms • Generate an initial set of profiles (e.g. seed with all k-mers) • For each profile • find the best (highest scoring) representative in each sequence • update the profile • Iterate until convergence
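
A minimal sketch of this loop, using a frequency-matrix profile with pseudocount 1; seeding from the sequence prefixes rather than from all k-mers is a simplifying assumption:

```python
import math

def best_site(seq, pwm, k):
    """Highest-scoring k-mer in seq under the current profile."""
    def score(w):
        return sum(math.log(pwm[j][b]) for j, b in enumerate(w))
    return max((seq[i:i + k] for i in range(len(seq) - k + 1)), key=score)

def em_motif(seqs, k, iters=50):
    sites = [s[:k] for s in seqs]            # crude initial seed
    for _ in range(iters):
        # update the profile from the current sites (pseudocount 1)
        pwm = [{b: (1 + sum(site[j] == b for site in sites)) / (len(sites) + 4)
                for b in 'ACGT'} for j in range(k)]
        new = [best_site(s, pwm, k) for s in seqs]   # best representatives
        if new == sites:                     # converged
            break
        sites = new
    return sites
```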

  40. This algorithm converges. However, it cannot leave the basin of attraction. Thus, if the initial approximation is bad, it will converge to nonsense. Solution: stochastic optimization.

  41. Simulated annealing • Goal: maximize the information content I = Σj Σb f(b,j) log [ f(b,j) / p(b) ] • or any other measure of homogeneity of the sites

  42. Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content. Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content. • if I(B) ≥ I(A), B is accepted • if I(B) < I(A), B is accepted with probability P = exp [ (I(B) − I(A)) / T ] The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
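
Putting the acceptance rule into code, a hedged sketch: info() stands for any site-set scoring function, such as the information content above, and the cooling-schedule constants are assumptions:

```python
import math, random

def anneal(seqs, k, info, T=5.0, cooling=0.999, steps=20000):
    """info() scores a list of k-mers, e.g. by information content."""
    sites = [random.randrange(len(s) - k + 1) for s in seqs]  # start positions
    for _ in range(steps):
        i = random.randrange(len(seqs))      # change the site in one sequence
        trial = list(sites)
        trial[i] = random.randrange(len(seqs[i]) - k + 1)
        words = lambda pos: [s[p:p + k] for s, p in zip(seqs, pos)]
        dI = info(words(trial)) - info(words(sites))
        # accept improvements always, deteriorations with prob exp(dI / T)
        if dI >= 0 or random.random() < math.exp(dI / T):
            sites = trial
        T *= cooling                         # slow exponential cooling
    return sites
```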

  43. Gibbs sampler Again, A is a signal (set of sites), and I(A) is its information content. At each step a new site is selected in one sequence with probability P ~ exp [ I(A_new) ]. For each candidate site the total time of occupation is computed. (Note that the signal changes all the time.)
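
A sketch of one sampling step under this rule; info() again scores a site set, the max-subtraction is only for numerical stability, and the interface details are assumptions:

```python
import math, random

def gibbs_step(seqs, sites, k, info, i):
    """Resample the site position in sequence i in proportion to exp(I)."""
    def words(pos):
        return [s[p:p + k] for s, p in zip(seqs, pos)]
    options = list(range(len(seqs[i]) - k + 1))
    scores = []
    for j in options:
        trial = list(sites)
        trial[i] = j
        scores.append(info(words(trial)))
    m = max(scores)                          # subtract max to avoid overflow
    weights = [math.exp(s - m) for s in scores]
    sites[i] = random.choices(options, weights=weights)[0]
    return sites
```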

  44. Use of symmetry • DNA-binding factors and their signals • Co-operative homogeneous • Palindromes • Repeats • Co-operative non-homogeneous • Cassettes • Others • RNA signals

  45. Recognition: PWM/profiles The simplest technique: positional nucleotide weights are W(b,j) = ln(N(b,j)+0.5) − 0.25 Σi ln(N(i,j)+0.5). The score of a candidate site b1…bk is the sum of the corresponding positional nucleotide weights: S(b1…bk) = Σj=1,…,k W(bj, j)
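
This scheme transcribes directly into code; the weight and score functions follow the formulas on the slide, while the toy count matrix is an assumption:

```python
import math

def weights(counts):
    """W(b,j) = ln(N(b,j)+0.5) - 0.25 * sum over i of ln(N(i,j)+0.5)."""
    W = []
    for col in counts:                       # col: dict base -> count N(b,j)
        mean = 0.25 * sum(math.log(n + 0.5) for n in col.values())
        W.append({b: math.log(n + 0.5) - mean for b, n in col.items()})
    return W

def score(site, W):
    """S(b1...bk) = sum over j of W(bj, j)."""
    return sum(W[j][b] for j, b in enumerate(site))

# Toy count matrix (hypothetical numbers, 9 sites):
W = weights([{'A': 8, 'C': 1, 'G': 0, 'T': 0},
             {'A': 0, 'C': 9, 'G': 0, 'T': 0}])
print(score("AC", W))
```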

  46. Distribution of RBS profile scores on sites (green) and non-sites (red)

  47. Pattern recognition • Linear discriminant analysis • Logical rules • Syntactic analysis • Context-sensitive grammars • Perceptron • Neural networks

  48. Neural networks: architecture • 4k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position, OR 2k neurons (one discriminates between purines and pyrimidines, the other between AT and GC) • One or more layers of hidden neurons • One output neuron

  49. Each neuron is connected to all neurons of the next layer • Each connection is assigned a numerical weight A neuron: • Sums the signals at incoming connections • Compares the total with the threshold (or transforms it according to a fixed function) • If the threshold is passed, excites the outgoing connections (resp. sends the modified value)

  50. Training: • Sites and non-sites from the training sample are presented one by one. • The output neuron produces the prediction. • The connection weights and thresholds are modified if the prediction is incorrect. Networks differ in architecture, in the particulars of signal processing, and in the training schedule.
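
A toy single-layer version of this training loop (a perceptron, the simplest case from slide 47), with the one-hot 4k encoding of slide 48; the learning rate and epoch count are assumptions:

```python
import random

def encode(site):
    """One-hot: 4k sensors, one per (position, nucleotide) pair."""
    return [1.0 if b == c else 0.0 for b in site for c in 'ACGT']

def train(pos, neg, epochs=100, rate=0.1):
    """pos/neg: lists of site / non-site strings of equal length k."""
    examples = [(encode(s), 1) for s in pos] + [(encode(s), 0) for s in neg]
    w, theta = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        random.shuffle(examples)
        for x, label in examples:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
            err = label - out                # nonzero only when wrong
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
            theta -= rate * err              # threshold moves the other way
    return w, theta
```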
