An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem
Martin Tompa, University of Washington
Slides courtesy Yoonkyong Lee

Ribosome Binding Site Problem
• Identifying the short mRNA motif in the 5’ untranslated region, called the ribosome binding site, of a typical prokaryote
• Why Ribosome?
  • The Shine-Dalgarno (SD) site is complementary to a short sequence near the 3’ end of the ribosome’s 16S RNA
• Why Prokaryote?
  • The SD sites of several prokaryotes are highly similar
• Shine-Dalgarno sequence: AAGGAGG

Problem Definition
• Search for instances of a motif of length 5 within the 20-mer just 5’ to the translation start site of each of the N ≈ 4000 open reading frames
• Instances of the motif may match inexactly
• Given 4000 sequences, each of length 20, find a sequence s of length 5 whose approximate matches occur in them

Contributions
• The given solution is applicable to other sequence analysis problems involving the identification of short motifs
• This problem is important as a step in the validation of true genes and in the identification of the correct translation start sites

Statistical Significance of Motif Occurrences
• Observation
  • A good measure for comparing sequences should take into account both the absolute number of occurrences and the background distribution
• Solution
  • For each k-mer s, record the number Ns of sequences containing s, allowing up to c substitutions of residues
  • Estimate how unlikely it is to have Ns occurrences if the sequences were generated according to the background distribution
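
The counting step in the Solution bullets can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `hamming_within`, `contains_motif`, and `count_sequences_with_motif` are hypothetical, and the scan is a plain brute-force comparison of every window against s.

```python
def hamming_within(a, b, c):
    """Return True if equal-length strings a and b differ in at most c positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > c:
                return False
    return True


def contains_motif(seq, s, c):
    """Return True if some window of seq matches s with at most c substitutions."""
    k = len(s)
    return any(hamming_within(seq[i:i + k], s, c)
               for i in range(len(seq) - k + 1))


def count_sequences_with_motif(seqs, s, c):
    """Ns: the number of sequences containing s, allowing up to c substitutions.

    Each sequence is counted once, however many times s occurs in it.
    """
    return sum(1 for seq in seqs if contains_motif(seq, s, c))
```

With the toy input `["TTAGGAGGTT", "CCCCCCCCCC", "TTAGCAGGTT"]` and s = AGGAGG, only the first sequence is counted when c = 0, and the first and third when c = 1.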

How to estimate “how unlikely”
• X: a single random sequence of the specified length L, generated according to the background distribution
• ps: the probability that X contains at least one occurrence of the k-mer s, allowing for c substitutions
• Assumption: the N sequences are independent
• Expected number of sequences containing at least one occurrence of s: N·ps
• Standard deviation: √(N·ps·(1 − ps))
• The associated z-score:
$$M_s = \frac{N_s - N p_s}{\sqrt{N p_s (1 - p_s)}}$$
• Ms measures how unlikely it is to have Ns occurrences of s, given the background distribution
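
Under the independence assumption, the number of sequences containing s is binomially distributed, so the z-score is the observed count minus its expected value, divided by the binomial standard deviation. A minimal sketch, with the hypothetical function name `z_score` and made-up input values:

```python
import math


def z_score(n_s, n, p_s):
    """z-score of observing n_s sequences containing s out of n sequences.

    Assumes each sequence independently contains s with probability p_s,
    so the count is binomial with mean n*p_s and variance n*p_s*(1 - p_s).
    """
    expected = n * p_s
    std_dev = math.sqrt(n * p_s * (1.0 - p_s))
    return (n_s - expected) / std_dev


# Illustrative numbers only, not results from the slides.
print(z_score(60, 100, 0.5))
```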

How to estimate ps – Step 1
• Construct a deterministic finite automaton (FA), M, accepting strings that contain a substring matching s with at most one substitution
• States: one for every string u matching a prefix of s with at most one substitution → 1.5|s|² + O(|s|) states
• Transition function: given the state for string u and the input character σ, move to the state corresponding to the longest suffix of uσ that matches a prefix of s with at most one substitution
• Construction time: O(|s|²)
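
The construction can be sketched in a few lines. This is a naive illustration under stated assumptions, not the O(|s|²) construction from the slide: the names `build_dfa` and `accepts` are hypothetical, the longest valid suffix is found by repeated trimming, and all full-length matches are merged into one absorbing accept state (so the sketch has fewer states than the slide's count).

```python
ALPHABET = "ACGT"
ACCEPT = "<accept>"  # single absorbing state standing for every full-length match


def matches_prefix(u, s, c):
    """True if u matches the length-len(u) prefix of s with at most c substitutions."""
    return len(u) <= len(s) and sum(x != y for x, y in zip(u, s)) <= c


def build_dfa(s, c=1):
    """Return (transitions, start_state, accept_state).

    Non-accepting states are the strings u, shorter than s, that match a
    prefix of s with at most c substitutions.
    """
    trans = {}
    todo = [""]
    while todo:
        u = todo.pop()
        if u in trans:
            continue
        trans[u] = {}
        for ch in ALPHABET:
            w = u + ch
            # Trim from the left until w is a valid (approximate) prefix of s.
            # The empty string is always valid, so this loop terminates.
            while not matches_prefix(w, s, c):
                w = w[1:]
            if len(w) == len(s):
                trans[u][ch] = ACCEPT
            else:
                trans[u][ch] = w
                todo.append(w)
    trans[ACCEPT] = {ch: ACCEPT for ch in ALPHABET}
    return trans, "", ACCEPT


def accepts(dfa, seq):
    """Run the automaton over seq and report whether it ends in the accept state."""
    trans, state, accept = dfa
    for ch in seq:
        state = trans[state][ch]
    return state == accept
```

For example, `accepts(build_dfa("AGGAG"), "TTAGCAGTT")` is true because the window AGCAG differs from AGGAG in one position.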

How to estimate ps – Step 2
• Given the transition probabilities aij of the Markov chain G that generates X, transform M into a Markov chain M’ by assigning probability aij to each transition of M labeled j that leaves a state whose corresponding string u ends with i
• ps is the probability of going from the start state to the accepting state of M’ in |X| steps
• Compute ps by |X| successive products of a probability vector with the transition matrix of M’, which has Θ(|s|⁴) entries
• Since the matrix is sparse, this can be done in O(|X|·|s|²) time
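
A minimal, self-contained sketch of this step follows. It is an illustration under assumptions rather than the author's code: `build_dfa` and `prob_contains` are hypothetical names, `init` (distribution of the first character) and `a` (first-order transition probabilities) are assumed inputs, and the sparse matrix-vector product is written as a loop over dictionary entries. The sketch tracks the last character explicitly alongside the automaton state, which keeps it correct even when no substitutions are allowed.

```python
ALPHABET = "ACGT"


def build_dfa(s, c=1):
    """Transitions of the motif automaton; a target of None means 'accept'."""
    def ok(u):
        return sum(x != y for x, y in zip(u, s)) <= c

    trans, todo = {}, [""]
    while todo:
        u = todo.pop()
        if u in trans:
            continue
        trans[u] = {}
        for ch in ALPHABET:
            w = u + ch
            while not ok(w):  # keep the longest suffix that is still a valid prefix
                w = w[1:]
            if len(w) == len(s):
                trans[u][ch] = None
            else:
                trans[u][ch] = w
                todo.append(w)
    return trans


def prob_contains(s, length, init, a, c=1):
    """Probability that a random sequence of the given length contains s
    with at most c substitutions.

    init[ch] is the probability of the first character; a[i][j] is the
    probability that character j follows character i.
    """
    trans = build_dfa(s, c)
    vec = {("", None): 1.0}  # (automaton state, last character) -> probability
    accepted = 0.0           # mass already absorbed by the accept state
    for _ in range(length):
        nxt = {}
        for (u, last), p in vec.items():
            for ch in ALPHABET:
                q = p * (init[ch] if last is None else a[last][ch])
                if q == 0.0:
                    continue
                v = trans[u][ch]
                if v is None:
                    accepted += q
                else:
                    nxt[(v, ch)] = nxt.get((v, ch), 0.0) + q
        vec = nxt
    return accepted
```

Each step touches at most four outgoing transitions per reachable state, which is the sparsity the slide relies on.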
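
Why O(|X|·|s|²)?
• Transition probability matrix: (1.5|s|² + O(|s|)) rows by (1.5|s|² + O(|s|)) columns, so Θ(|s|⁴) entries
• Each row has only 4 non-zero entries, one per character of Σ = {A, T, G, C}
• Computation of one matrix-vector product: O(|s|²)
• |X| products required → O(|X|·|s|²) in total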
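
As a rough worked example of these counts, assume |s| = 5 and |X| = 20 as in the problem definition, with one state per string as in Step 1. A prefix of length j has one exact match and 3j single-substitution variants over a four-letter alphabet. The figures below are illustrative arithmetic, not numbers taken from the slides.

```latex
\begin{align*}
\text{states} &= \sum_{j=0}^{5} (1 + 3j) = 6 + 45 = 51 \\
\text{matrix entries} &= 51^2 = 2601 \\
\text{non-zero entries} &\le 4 \cdot 51 = 204 \\
\text{total work} &\approx |X| \cdot 204 = 20 \cdot 204 = 4080 \text{ multiply-adds}
\end{align*}
```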

Experimental Results
• 14 prokaryotic genomes
  • 10 bacteria
    • 9 of 10: strong dominance of the SD sequence, AAGGAGG
    • One exceptional case: M. genitalium
  • 4 archaea
    • Predominance of GGTGA or GGTG

Bacterial Genomes - 1
• H. influenzae
• TAAGGAGGTGATCCAA
• The highest simulated statistical significance score: 4

Bacterial Genomes - 2
• M. genitalium
• GAGGTGATCCAC
• The simulated statistical significance score: 5 - 7
• No significance
• Lechel [1991]: describes a possible alternative ribosome recognition site specifically in M. genitalium

Interesting Motifs
• Synechocystis sp.
  • 2nd highest scoring 7-mer: CATCGCC (Ms = 16)
  • Result from the highest scoring 7-mers of sequences (L = 40), allowing no substitutions: GGCGATCGCC (HIP1)
• H. influenzae
  • Result from the highest scoring 7-mers of sequences (L = 40), allowing no substitutions: AAGTGCCGGT

Archaea
• M. jannaschii
• GGAGGTGATCCAG

Conclusion
• Enumerates short motifs together with their exact z-scores
• Exhaustive and exact
• Not efficient for longer and more complex motifs allowing multiple insertions, deletions, and substitutions