Recognition of regulatory signals

Recognition of regulatory signals Mikhail S. Gelfand IntegratedGenomics-Moscow NATO ASI School, October 2001

Why? • Additional annotation tool (e.g. specificity of transporters and enzymes from large families) • Important for practice (in addition to metabolic reconstruction) • Interesting from the evolutionary point of view

Overview 0. Biological introduction 1. Algorithms • Representation of signals • Deriving the signal • Site recognition 2. Comparative genomics • Phylogenetic footprinting • Consistency filtering

Some biology • Transcription (DNA  RNA) • Splicing (pre-mRNA  mRNA) • Translation (mRNA  protein) • Regulation of transcription in prokaryotes • … and eukaryotes • Initiation of translation

Transcription and translation in prokaryotes

Initiation of transcription (bacteria)

Translation in prokaryotes

Translation (details)

Splicing (eukaryotes)

Regulation of transcriptionin prokaryotes

Structure of DNA-binding domain. Example 1

Structure of DNA-binding domain. Example 2

Protein-DNA interactions

Regulation of transcriptionin eukaryotes

Representation of signals • Consensus • Pattern (consensus with degenerate positions) • Positional weight matrix (PWM, or profile) • Logical rules • RNA signals

Consensus codB CCCACGAAAACGATTGCTTTTT purE GCCACGCAACCGTTTTCCTTGC pyrD GTTCGGAAAACGTTTGCGTTTT purT CACACGCAAACGTTTTCGTTTA cvpA CCTACGCAAACGTTTTCTTTTT purC GATACGCAAACGTGTGCGTCTG purM GTCTCGCAAACGTTTGCTTTCC purH GTTGCGCAAACGTTTTCGTTAC purL TCTACGCAAACGGTTTCGTCGG consensus ACGCAAACGTTTTCGT

Pattern codB CCCACGAAAACGATTGCTTTTT purE GCCACGCAACCGTTTTCCTTGC pyrD GTTCGGAAAACGTTTGCGTTTT purT CACACGCAAACGTTTTCGTTTA cvpA CCTACGCAAACGTTTTCTTTTT purC GATACGCAAACGTGTGCGTCTG purM GTCTCGCAAACGTTTGCTTTCC purH GTTGCGCAAACGTTTTCGTTAC purL TCTACGCAAACGGTTTCGTCGG consensus ACGCAAACGTTTTCGT pattern aCGmAAACGtTTkCkT

Frequency matrix Information content I = j bf(b,j)[log f(b,j) / p(b)]

Sequence logo

Positional weight matrix (PWM)

Probabilistic motivation: log-likelihood (up to a linear transformation) • More probabilistic motivation: z-score (with the suitable base of the logarithm) • Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation) • Pseudocounts

Logical rules, trees etc.

Compilation of samples • Initial sample: • GenBank • specialized databases • literature (reviews) • literature (original papers) • Correction of GenBank errors • Checking the literature • removal of predicted sites • Removal of duplicates

Re-alignment approaches • Initial alignment by a biological landmark • start of transcription for promoters • start codon for ribosome binding sites • exon-intron boundary for splicing sites • Deriving the signal within a sliding window • Re-alignment • etc. etc. until convergence

Gene starts of Bacillus subtilis dnaN ACATTATCCGTTAGGAGGATAAAAATG gyrA GTGATACTTCAGGGAGGTTTTTTAATG serS TCAATAAAAAAAGGAGTGTTTCGCATG bofA CAAGCGAAGGAGATGAGAAGATTCATG csfB GCTAACTGTACGGAGGTGGAGAAGATG xpaC ATAGACACAGGAGTCGATTATCTCATG metS ACATTCTGATTAGGAGGTTTCAAGATG gcaD AAAAGGGATATTGGAGGCCAATAAATG spoVC TATGTGACTAAGGGAGGATTCGCCATG ftsH GCTTACTGTGGGAGGAGGTAAGGAATG pabB AAAGAAAATAGAGGAATGATACAAATG rplJ CAAGAATCTACAGGAGGTGTAACCATG tufA AAAGCTCTTAAGGAGGATTTTAGAATG rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG rplM AGATCATTTAGGAGGGGAAATTCAATG

dnaN ACATTATCCGTTAGGAGGATAAAAATG gyrA GTGATACTTCAGGGAGGTTTTTTAATG serS TCAATAAAAAAAGGAGTGTTTCGCATG bofA CAAGCGAAGGAGATGAGAAGATTCATG csfB GCTAACTGTACGGAGGTGGAGAAGATG xpaC ATAGACACAGGAGTCGATTATCTCATG metS ACATTCTGATTAGGAGGTTTCAAGATG gcaD AAAAGGGATATTGGAGGCCAATAAATG spoVC TATGTGACTAAGGGAGGATTCGCCATG ftsH GCTTACTGTGGGAGGAGGTAAGGAATG pabB AAAGAAAATAGAGGAATGATACAAATG rplJ CAAGAATCTACAGGAGGTGTAACCATG tufA AAAGCTCTTAAGGAGGATTTTAGAATG rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG rplM AGATCATTTAGGAGGGGAAATTCAATG cons. aaagtatataagggagggttaataATG num. 001000000000110110000000111 760666658967228106888659666

dnaN ACATTATCCGTTAGGAGGATAAAAATG gyrA GTGATACTTCAGGGAGGTTTTTTAATG serS TCAATAAAAAAAGGAGTGTTTCGCATG bofA CAAGCGAAGGAGATGAGAAGATTCATG csfB GCTAACTGTACGGAGGTGGAGAAGATG xpaC ATAGACACAGGAGTCGATTATCTCATG metS ACATTCTGATTAGGAGGTTTCAAGATG gcaD AAAAGGGATATTGGAGGCCAATAAATG spoVC TATGTGACTAAGGGAGGATTCGCCATG ftsH GCTTACTGTGGGAGGAGGTAAGGAATG pabB AAAGAAAATAGAGGAATGATACAAATG rplJ CAAGAATCTACAGGAGGTGTAACCATG tufA AAAGCTCTTAAGGAGGATTTTAGAATG rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG rplM AGATCATTTAGGAGGGGAAATTCAATG cons. tacataaaggaggtttaaaaat num. 0000000111111000000001 5755779156663678679890

Positional information content before and after re-alignment

Positional nucleotide frequencies after re-alignment (aGGAGG pattern)

Enhancement of a weak signal

Deriving the signal ab initio • “Discrete” (pattern-driven) approaches: word counting • “Continuous” (profile-driven) approaches: optimization

Word counting. Short words • Consider all k-mers • For each k-mer compute the number of sequences containing this k-mer • (maybe with some mismatches) • Select the most frequent k-mer

Problem:Complete search is possible only for short words Assumption: if a long word is over-represented, its subwords also are overrepresented Solution: select a set of over-represented words and combine them into longer words

Word counting. Long words • Consider somek-mers • For each k-mer compute the number of sequences containing this k-mer • (maybe with some mismatches) • Select the most frequent k-mer

Problem:what k-tuples to start with? 1st attempt: those actually occurring in the sample. But: the correct signal (the consensus word) may not be among them.

2nd attempt: those actually occurring in the sample and some neighborhood. But: • again, the correct signal (the consensus word) may not be among them; • the size of the neighborhood grows exponentially

Graph approach Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k). Thus we obtain an n-partite graph (n is the number of sequences). A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.

A simple algorithm • Remove vertices that cannot be extended to complete subgraphs • that is, do not have arcs to all parts of the graph • Remove pairs that cannot be extended … • that is, do not form triangles with the third vertex in all parts of the graph • Etc. (will not work “as is” for dense subgraphs)

Optimization. EM algorithms • Generate an initial set of profiles (e.g. seed with all k-mers) • For each profile • find the best (highest scoring) representative in each sequence • update the profile • Iterate until convergence

This algorithm converges. However, it cannot leave the basin of attraction. Thus, if the initial approximation is bad, it will converge to nonsense. Solution: stochastic optimization.

Simulated annealing • Goal: maximize the information content I I = j bf(b,j)[log f(b,j) / p(b)] • or any other measure of homogeneity of the sites

Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content. Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B)be its information content. • if I(B)  I(A), B is accepted • if I(B) < I(A), B is accepted with probability P =exp [(I(B) – I(A)) / T] The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.

Gibbs sampler Again, A is a signal (set of sites), and I(A) is its information content. At each step a new site is selected in one sequence with probability P ~exp [(I(Anew)] For each candidate site the total time of occupation is computed. (Note that the signal changes all the time)

Use of symmetry • DNA-binding factors and their signals • Co-operative homogeneous • Palindromes • Repeats • Co-operative non-homogeneous • Cassetes • Others • RNA signals

Recognition: PWM/profiles The simplest technique: positional nucleotide weights are W(b,j)=ln(N(b,j)+0.5) – 0.25iln(N(i,j)+0.5) Score of a candidate site b1…bk is the sum of the corresponding positional nucleotide weights: S(b1…bk ) = j=1,…,kW(bj,j)

Distribution of RBS profile scores on sites (green) and non-sites (red)

Pattern recognition • Linear discriminant analysis • Logical rules • Syntactic analysis • Context-sensitive grammars • Perceptron • Neural networks

Neural networks: architecture • 4kinput neurons (sensors), each responsible for observing a particular nucleotide at particular position OR 2k neurons (one discriminates between purines and pyrimidines, the other, between AT and GC) • One or more layers of hidden neurons • One output neuron

Each neuron is connected to all neurons of the next layer • Each connection is ascribed a numerical weight A neuron • Sums the signals at incoming connections • Compares the total with the threshold (or transforms it according to a fixed function) • If the threshold is passed, excites the outcoming connections (resp. sends the modified value)

Training: • Sites and non-sites from the training sample are presented one by one. • The output neuron produces the prediction. • The connection weights and thresholds are modified if the prediction is incorrect. Networks differ by architecture, particulars of the signal processing, the training schedule

Recognition of regulatory signals

Recognition of regulatory signals

Presentation Transcript

Signals

Finding Regulatory Signals in Genomes 24.11.5 60 min.

Characteristics of Audio Signals Sampling of Audio Signals

Signals

Finding Regulatory Signals in Genomes

RECOGNITION OF NONSTATIONARY SIGNALS

Signals

Signals

Finding Regulatory Signals in Genomes

Signals

Elements of Pattern Recognition Theory in the Analysis of Biosensor Signals

Adapted representations of audio signals for music instrument recognition

signals

Signals

Types of signals

Signals

Types of Signals

Crude oil signals, comex signals, wti signals

signals

Signals

Signals