Martin Tompa University of Washington

An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem
Martin Tompa
University of Washington
Slides courtesy Yoonkyong Lee

Ribosome Binding Site Problem
• Identifying the short mRNA motif, called the ribosome binding site, in the 5’ untranslated region of a typical prokaryote
• Why ribosome?
  • The SD site is complementary to a short sequence near the 3’ end of the ribosome’s 16S RNA
• Why prokaryote?
  • The SD sites of several prokaryotes are very similar
• Shine-Dalgarno (SD) sequence: AAGGAGG


Problem Definition
• Search for instances of a motif of length 5 within the 20-mer just 5’ of the translation start site of each of the N≈4000 open reading frames
• Instances of the motif may match inexactly
• Given 4000 sequences, each of length 20, search for a sequence s of length 5 that they contain approximately

Contributions
• The given solution is applicable to other sequence analysis problems involving the identification of short motifs
• This problem is important as a step in the validation of true genes and in the identification of the correct translation start sites

Statistical Significance of Motif Occurrences
• Observation
  • A good measure for comparing sequences should take into account both the absolute number of occurrences and the background distribution
• Solution
  • For each k-mer s, record the number Ns of sequences containing an occurrence of s, allowing up to c residue substitutions
  • Estimate how unlikely it would be to see Ns occurrences if the sequences were generated from the background distribution
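The counting step above can be sketched in a few lines of Python. This is a brute-force illustration, not the method from the slides: the function names (`hamming_within`, `contains_motif`, `count_sequences_with_motif`) are hypothetical, and an occurrence is taken to be any length-k window that differs from s in at most c positions.

```python
def hamming_within(a, b, c):
    """Return True if equal-length strings a and b differ in at most c positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > c:
                return False
    return True


def contains_motif(seq, s, c):
    """Return True if some window of seq matches s with at most c substitutions."""
    k = len(s)
    return any(hamming_within(seq[i:i + k], s, c)
               for i in range(len(seq) - k + 1))


def count_sequences_with_motif(seqs, s, c):
    """Ns: the number of sequences (not windows) that contain s at least once."""
    return sum(1 for seq in seqs if contains_motif(seq, s, c))


if __name__ == "__main__":
    # Toy input, invented for illustration only.
    seqs = ["TTAGGAGTT", "CCCCCCCC", "AGGTGG"]
    print(count_sequences_with_motif(seqs, "AGGAG", 1))
```

Note that each sequence is counted once no matter how many windows match, which is what makes Ns comparable with the per-sequence probability ps.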

How to estimate “how unlikely”
• X: a single random sequence of the specified length L, generated according to the background distribution
• ps: the probability that X contains at least one occurrence of the k-mer s, allowing for c substitutions
• Assumption: the N sequences are independent
• The associated z-score:
  $$M_s = \frac{N_s - N p_s}{\sqrt{N p_s (1 - p_s)}}$$
  • $N p_s$: the expected number of sequences containing at least one occurrence of s
  • $\sqrt{N p_s (1 - p_s)}$: the standard deviation of that number
• The z-score measures how unlikely it is to have Ns occurrences of s, given the background distribution
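As a small worked example of the z-score: under the independence assumption, the number of sequences containing s is binomial with parameters N and ps, so its mean is N·ps and its standard deviation is sqrt(N·ps·(1 − ps)). The function name `z_score` and the numbers in the usage line are illustrative assumptions, not values from the slides.

```python
import math


def z_score(n_s, n, p_s):
    """(observed - expected) / standard deviation for a Binomial(n, p_s) count."""
    expected = n * p_s                        # expected number of sequences containing s
    std = math.sqrt(n * p_s * (1.0 - p_s))    # binomial standard deviation
    if std == 0.0:
        raise ValueError("p_s must lie strictly between 0 and 1")
    return (n_s - expected) / std


if __name__ == "__main__":
    # Hypothetical numbers: 4000 sequences, p_s = 0.25, motif seen in 1200 of them.
    print(z_score(1200, 4000, 0.25))
```

A large positive value means s occurs in many more sequences than the background distribution alone would explain.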

How to estimate ps – Step 1
• Construct a deterministic finite automaton (FA) M that accepts the strings containing a substring matching s with at most one substitution
• States: one for every string u matching a prefix of s with at most one substitution → 1.5|s|^2 + O(|s|) states
• Transition function: given the state for string u and the input character σ, move to the state corresponding to the longest suffix of uσ that matches a prefix of s with at most one substitution
• Construction time: O(|s|^2)
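The automaton described above can be sketched as follows. This is a naive construction for illustration, under stated assumptions: it recomputes the longest matching suffix from scratch for every transition, so it does not attain the O(|s|^2) construction time; all strings of full length |s| are merged into one absorbing state labeled `ACCEPT`; and the names `mismatches`, `next_state`, `build_dfa` and `accepts` are hypothetical.

```python
ACCEPT = "ACCEPT"  # hypothetical label for the single merged accepting state


def mismatches(u, s):
    """Count positions where u differs from the prefix of s of the same length."""
    return sum(1 for a, b in zip(u, s) if a != b)


def next_state(u, ch, s):
    """Longest suffix of u + ch that matches a prefix of s with <= 1 substitution."""
    if u == ACCEPT:
        return ACCEPT  # the accepting state is absorbing
    w = u + ch
    for start in range(len(w) + 1):  # try suffixes from longest to shortest
        cand = w[start:]
        if mismatches(cand, s) <= 1:
            return ACCEPT if len(cand) == len(s) else cand
    raise AssertionError("unreachable: the empty suffix always matches")


def build_dfa(s, alphabet="ACGT"):
    """Return {state: {char: next state}} for every state reachable from ''."""
    delta = {}
    todo = [""]  # the start state corresponds to the empty string
    while todo:
        u = todo.pop()
        if u in delta:
            continue
        delta[u] = {ch: next_state(u, ch, s) for ch in alphabet}
        todo.extend(v for v in delta[u].values() if v not in delta)
    return delta


def accepts(delta, text):
    """Run the automaton on text and report whether it ends in the accepting state."""
    state = ""
    for ch in text:
        state = delta[state][ch]
    return state == ACCEPT


if __name__ == "__main__":
    dfa = build_dfa("AGGAG")
    print(len(dfa), accepts(dfa, "TTAGGTGTT"))
```

With the accepting strings merged, the state count for |s| = 5 over a four-letter alphabet is 36, in line with the quadratic growth stated on the slide.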

How to estimate ps – Step 2
• Given the transition probabilities aij of the Markov chain G that generates X, transform M into a Markov chain M’
  • by assigning probability aij to each transition of M labeled j out of a state whose corresponding string u ends with i
• Estimate ps, the probability of going from the start state to the accepting state in |X| steps in M’, by repeatedly multiplying a vector by the transition matrix, which has Θ(|s|^4) entries. Since the matrix is sparse, this can be done in O(|X|·|s|^2) time

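Why O(|X|·|s|^2)?
• The transition probability matrix has 1.5|s|^2 + O(|s|) rows and 1.5|s|^2 + O(|s|) columns, hence Θ(|s|^4) entries
• Each row has only 4 non-zero entries (Σ={A,T,G,C})
• Computing one matrix-vector product therefore takes O(|s|^2)
• |X| products are required, for a total of O(|X|·|s|^2)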
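The sparse vector-matrix iteration can be sketched without ever storing the matrix: keep a dictionary from automaton state to probability and push each entry along its four outgoing transitions, once per position of X. The sketch below makes several assumptions that are not in the slides: the background is a first-order Markov chain given as nested dictionaries `trans[i][j]`, the first character is drawn from a separate distribution `init`, transitions are computed on the fly by a naive `next_state`, and all names are hypothetical.

```python
ACCEPT = "ACCEPT"  # hypothetical label for the absorbing accepting state


def next_state(u, ch, s):
    """Longest suffix of u + ch that matches a prefix of s with <= 1 substitution."""
    w = u + ch
    for start in range(len(w) + 1):
        cand = w[start:]
        if sum(1 for a, b in zip(cand, s) if a != b) <= 1:
            return ACCEPT if len(cand) == len(s) else cand
    raise AssertionError("unreachable: the empty suffix always matches")


def prob_contains(s, length, init, trans, alphabet="ACGT"):
    """Probability that a random sequence of the given length contains s
    with at most one substitution, under a first-order Markov background."""
    dist = {"": 1.0}   # sparse probability vector over non-accepting states
    accepted = 0.0     # probability mass already absorbed in ACCEPT
    for _ in range(length):
        new_dist = {}
        for u, p in dist.items():
            for ch in alphabet:  # only 4 non-zero entries per matrix row
                # The empty state occurs only before the first character,
                # so it is the only state that uses the initial distribution.
                q = init[ch] if u == "" else trans[u[-1]][ch]
                v = next_state(u, ch, s)
                if v == ACCEPT:
                    accepted += p * q
                else:
                    new_dist[v] = new_dist.get(v, 0.0) + p * q
        dist = new_dist
    return accepted


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}
    background = {c: dict(uniform) for c in "ACGT"}
    print(prob_contains("AGGAG", 20, uniform, background))
```

Each pass over `dist` touches every state once and does constant work per outgoing transition, which is the O(|s|^2) cost per product that the slide counts; a production version would precompute the transitions instead of calling `next_state` inside the loop.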

Experimental Results
• 14 prokaryotic genomes
• 10 bacteria
  • 9 of 10: strong dominance of the SD sequence, AAGGAGG
  • One exceptional case: M. genitalium
• 4 archaea
  • Predominance of GGTGA or GGTG

Bacterial Genomes - 1
• H. influenzae
• TAAGGAGGTGATCCAA
• The highest simulated statistical significance score: 4

Bacterial Genomes - 2
• M. genitalium
• GAGGTGATCCAC
• The simulated statistical significance score: 5 - 7
• No significance
• Lechel [1991] describes a possible alternative ribosome recognition site specifically in M. genitalium

Interesting Motifs
• Synechocystis sp.
  • 2nd highest scoring 7-mer: CATCGCC (Ms=16)
  • Result from the highest scoring 7-mers of sequences (L=40), allowing no substitutions: GGCGATCGCC (HIP1)
• H. influenzae
  • Result from the highest scoring 7-mers of sequences (L=40), allowing no substitutions: AAGTGCCGGT

Archaea
• M. jannaschii
• GGAGGTGATCCAG

Conclusion
• Enumerates short motifs together with their exact z-scores
• Exhaustive and exact
• Not efficient for longer and more complex motifs that allow multiple insertions, deletions, and substitutions
