MicroRNA. The Computational Challenge. Bioinformatics Seminar, March 9, 2005 By Yaron Levy. Tree of RNA Types . miRNA Biological Process. Micro RNA – Computational Approach. Problem 1: Finding putative microRNA from a sequence Horesh et al, using suffix trees data structure
MicroRNA The Computational Challenge Bioinformatics Seminar, March 9, 2005 By Yaron Levy
Micro RNA – Computational Approach • Problem 1: Finding putative microRNA from a sequence • Horesh et al, using the suffix tree data structure • Problem 2: Computing secondary structure of a given sequence • Zuker & Stiegler, minimum free energy, using dynamic programming • Problem 3: miRNA prediction algorithms • Lim et al, MiRscan • Problem 4: Predicting miRNA target genes • Lewis et al, TargetScan
Problem 1 Find these
Problem 1: Finding putative microRNA from a sequence • A naïve idea: slide a “window” of size L over the sequence of size N, looking for stems of size S. • Computationally O(NL+NS) – too much • A better approach – using a suffix tree.
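What is a suffix tree? • Example: S = MALAYALAM$, with positions 1 to 10 (M=1, A=2, L=3, A=4, Y=5, A=6, L=7, A=8, M=9, $=10) • [Figure: suffix tree of S; edges carry substrings of S, and each of the 10 leaves is labeled with the start position of its suffix]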
Suffix tree properties • For a string S of length n, there are n leaves and at most n internal nodes. • therefore requires only linear space • Each leaf represents a unique suffix. • Concatenation of edge labels from root to a leaf spells out the suffix. • Each internal node represents a distinct common prefix to at least two suffixes.
Finding a (short) Pattern in a (long) String • Build a suffix tree of the string. • Starting from the root, traverse a path matching the characters of the pattern. • If stuck, the pattern is not present in the string. Otherwise, each leaf below the end of the path gives a position of the pattern in the string.
Finding a Pattern in a String • Find “ALA” in MALAYALAM$ • [Figure: the suffix tree of MALAYALAM$, with the path spelling “ALA” followed from the root] • Two matches, at positions 6 and 2
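The search procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the construction used in the talk: it builds an uncompressed suffix trie in quadratic time and space (a real suffix tree compresses single-child paths and can be built in linear time), and the names `build_suffix_trie` and `find_pattern` are made up for this sketch.

```python
def build_suffix_trie(text):
    """Naive suffix trie: nested dicts, one level per character.

    Assumes `text` ends with a unique terminator such as '$'.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # The multi-character key "leaf" cannot collide with a single character.
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root


def find_pattern(trie, pattern):
    """Return the sorted 1-based positions where `pattern` occurs."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # stuck: the pattern is not present
        node = node[ch]
    # Every leaf below this node is a suffix that starts with the pattern.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("MALAYALAM$")
print(find_pattern(trie, "ALA"))  # [2, 6]
```

Walking down the trie costs time proportional to the pattern length, regardless of the length of the string; collecting the leaves adds time proportional to the size of the subtree below the match.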
Generalized Suffix Tree • Example: string 1 = WINDOW$ and string 2 = INDIGO$, each with positions 1 to 7 • [Figure: one suffix tree holding the suffixes of both strings; each leaf is labeled (string number, start position), for example (1, 1) for WINDOW$ and (2, 1) for INDIGO$]
Horesh et al – using a generalized suffix tree for finding putative microRNAs • Assumptions: • At least a triple repeat is necessary: • 2 for the stems of the hairpin – close to each other in the sequence, and inverted repeats of each other • The rest are target genes – can be anywhere • The repeats must be fully matched – no mismatches are allowed • This is more a constraint of the method than a biological assumption
Horesh et al – the algorithm • Construct a generalized suffix tree of the original sequence and the inverted repeat sequence. • Preprocess the suffix tree for calculating: • Length of suffixes • Number of repeats • Index of suffix in sequence • With these attributes for each node, along with the indices of the suffixes in the sequence, it is possible to find the requested triple (or more) repeats. • Computationally efficient O(N)
1. Build a suffix tree. [Figure: suffix tree of “banana”, with edges labeled a, na and banana] 2. Scan the tree in a PreOrder traversal (all parents are visited before their sons). The length of the prefix a node represents is: node.len = father.len + length of the sequence fragment the node carries (root is 0)
3. Scan the tree in a PostOrder traversal (all sons are visited before their parents). The number of repeats of the prefix a node represents is: node.repeats = sum of its sons' repeats (a leaf is 1). [Figure: the “banana” suffix tree annotated with the repeat count of every node]
Now every node carries the length of the prefix it represents and the number of leaves below it (the number of suffixes that share this prefix). 4. Scan the tree again. For every node that represents a prefix longer than SIZE (22, for example) and has two or more repeats, print its length, its number of repeats and the indexes of its leaves. All steps run in linear time! [Figure: the annotated “banana” suffix tree with the reported nodes marked]
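The four steps can be illustrated with a short Python sketch. Several simplifications are assumptions of this sketch rather than part of Horesh et al's method: it uses an uncompressed suffix trie instead of a suffix tree, it combines the pre-order and post-order passes into one recursive visit, and it collects leaf lists at every node, so it does not run in linear time. The names `find_repeats` and `min_len` are hypothetical, `min_len` plays the role of SIZE with "at least" rather than "longer than", and positions are 0-based.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie; each node records a leaf index if a suffix ends there."""
    root = {"children": {}, "leaf": None}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "leaf": None})
        node["leaf"] = start  # 0-based start of the suffix ending here
    return root


def find_repeats(text, min_len, min_repeats=2):
    """Map every repeated prefix of length >= min_len to its start positions.

    Assumes `text` does not already contain the terminator '$'.
    """
    root = build_suffix_trie(text + "$")
    results = {}

    def visit(node, prefix):
        # On the way down (pre-order) the prefix length is known: len(prefix).
        leaves = [] if node["leaf"] is None else [node["leaf"]]
        for ch, child in node["children"].items():
            leaves.extend(visit(child, prefix + ch))
        # On the way back up (post-order) the repeat count is the number of leaves below.
        if len(prefix) >= min_len and len(leaves) >= min_repeats:
            results[prefix] = sorted(leaves)
        return leaves

    visit(root, "")
    return results


print(find_repeats("banana", 2))
```

For "banana" with `min_len=2` this reports "an" and "ana" at positions 1 and 3, and "na" at positions 2 and 4. The recursion depth equals the length of the text, so a genome-scale input would need an iterative traversal and a compressed tree.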
Problem 1 conclusions • The problem is not trivial! • Suffix trees are an elegant solution, provided that: • No mismatches are allowed (not really biologically realistic) • There is enough memory to store the large data structure
Problem 2 How do these fold?
Problem 2: Computing secondary structure of a given sequence • Approaches to RNA secondary structure prediction: • comparative sequence analysis • prediction from base sequence • find minimum free energy (MFE) structure
Free energy model • free energy of structure (at fixed temperature, ionic concentration) = sum of loop energies • standard model uses experimentally determined thermodynamic parameters where available; extrapolations for long loops
On the MFE approach • MFE approach ignores folding pathway, metal ions, nonstandard bonds • “some species can remain kinetically trapped in nonequilibrium states… we expect that most RNA’s exist naturally in their thermodynamically most stable configurations” –Tinoco and Bustamante, J. Mol. Biol. 1999.
Why is MFE secondary structure prediction hard? • MFE structure can be found by calculating free energy of all possible structures • but, number of potential structures grows exponentially with the number, n, of bases • structures can be arbitrarily complex • success for restricted classes of structures
Predicting MFE pseudoknot free structures • Dynamic programming avoids explicit enumeration of all pseudoknot free structures (Zuker & Stiegler 1981) • Suboptimal folds, probabilities of base pairings can also be calculated • software: mfold, Vienna package
Dynamic programming (Zuker & Stiegler) • Based on the “more is less” principle: by calculating more than you need, less work is needed overall • Construct the MFE structure for the whole strand from MFE structures for substrands • Running time is O(n^3)
RNA folding with dynamic programming • Assume a function W(i,j), the MFE for the subsequence starting at i and ending at j (i<j) • Define sigma as the energy function for the simple cases, where, for example, a base pair scores lower than a non-pair • Consider 4 recursion possibilities: • i,j are a base pair, added to the structure for i+1..j-1 (define this as V(i,j)) • i is unpaired, added to the structure for i+1..j • j is unpaired, added to the structure for i..j-1 • i,j are paired, but not to each other; the structure for i..j adds together the sub-structures for 2 sub-sequences, i..k and k+1..j: a bifurcation (i<k<j)
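The four cases translate directly into a memoized recursion. The sketch below is deliberately simplified and is not the Zuker & Stiegler energy model: it assumes a toy sigma of -1 for any Watson-Crick or G-U pair and 0 otherwise, ignores loop and stacking energies, and enforces a minimum hairpin loop of 3 unpaired bases. The names `fold_energy`, `PAIRS` and `min_loop` are made up for this illustration.

```python
from functools import lru_cache

# Toy pairing rule (an assumption of this sketch): Watson-Crick pairs plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def sigma(a, b):
    """Toy energy: a base pair scores lower (-1) than a non-pair (0)."""
    return -1 if (a, b) in PAIRS else 0


def fold_energy(seq, min_loop=3):
    """Minimum toy energy of a pseudoknot-free structure for `seq`."""

    @lru_cache(maxsize=None)
    def W(i, j):
        if j - i <= min_loop:  # too short to close a hairpin
            return 0
        # Cases 2 and 3: i is unpaired, or j is unpaired.
        best = min(W(i + 1, j), W(i, j - 1))
        # Case 1: i and j pair with each other (the V(i,j) case).
        if (seq[i], seq[j]) in PAIRS:
            best = min(best, sigma(seq[i], seq[j]) + W(i + 1, j - 1))
        # Case 4: bifurcation into i..k and k+1..j, with i < k < j.
        for k in range(i + 1, j):
            best = min(best, W(i, k) + W(k + 1, j))
        return best

    return W(0, len(seq) - 1) if seq else 0


print(fold_energy("GGGAAAUCC"))  # -3: three nested pairs around the AAA loop
```

There are O(n^2) subproblems (i,j) and each scans O(n) split points k, which gives the O(n^3) running time. Recovering the structure itself, rather than only its energy, needs a traceback over the same table.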
Dynamic programming (Zuker and Stiegler) • W(i,j): MFE structure of the substrand from i to j • V(i,j): MFE structure of the substrand from i to j, in which the i-th and j-th bases are paired • [Figure: schematic arcs over positions i..j illustrating W(i,j) and V(i,j)]
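Recurrences • W(i,j) = min { V(i,j), W(i,k) + W(k+1,j) }, minimized over the split point k • [Figure: W(i,j) drawn either as a structure closed by the pair i,j or as two adjacent substructures i..k and k+1..j]
Recurrences • V(i,j) = min over the structures closed by the pair i,j • [Figure: three diagrams for V(i,j): a loop closed by i,j alone; an inner pair k,l enclosed by i,j; a bifurcation of the interior into i+1..k and k+1..j-1]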
What is actually being done? • Simple base pair maximization is a poor scoring scheme for RNA structure prediction. • It is more plausible that an RNA adopts a globally minimum energy structure, not the structure with the maximum number of base pairs. • The thermodynamic model was developed in conjunction with the DP algorithms • The independence assumptions in the thermodynamic model's terms have been made compatible with the independence assumptions needed for recursive dynamic programming algorithms to work. • Energy minimization algorithms become somewhat complex, with more detailed recursions that distinguish different lengths and types of loops, and which score base pairs according to nearest-neighbor stacking interactions with adjacent base pairs. • Nonetheless, the mechanics of the algorithm are pretty much the same
Problem 2 conclusions • RNA secondary structure finding is a hard problem – exponential number of possibilities • Several heuristics claim to achieve relatively good success rates • Specifically, MFE based algorithms are believed to be ~70% accurate on structures without pseudoknots.
Problem 3 How to predict these?
Problem 3: miRNA predicting algorithms • Lim et al. developed a machine learning tool called MiRscan to help identify new miRNA genes • This program looks at hairpin sequences conserved between species (C. elegans and C. briggsae) • The program is given a training set of known miRNAs in C. elegans • This data is then used to identify which conserved hairpin sequences are most similar to the training data.
MiRscan Algorithm • The MiRscan algorithm examines several features of the hairpin • The total score is computed by summing the score of each feature • The score for each feature is computed by dividing the frequency of the given value in the training set by its overall frequency
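The scoring scheme can be sketched as follows. This is an illustration under stated assumptions, not MiRscan's actual code: the logarithm of the frequency ratio is taken so that summing feature scores is meaningful, a pseudocount avoids division by zero, and the feature name `first_base`, the toy data and the function names are all hypothetical.

```python
import math
from collections import Counter


def feature_scores(training_values, background_values, pseudocount=1.0):
    """Score each value of one feature by the log2 ratio of its two frequencies.

    training_values: the feature's values in the known miRNA hairpins.
    background_values: the feature's values across all candidate hairpins.
    """
    train, back = Counter(training_values), Counter(background_values)
    values = set(train) | set(back)
    n_train = len(training_values) + pseudocount * len(values)
    n_back = len(background_values) + pseudocount * len(values)
    return {
        v: math.log2(((train[v] + pseudocount) / n_train)
                     / ((back[v] + pseudocount) / n_back))
        for v in values
    }


def total_score(candidate, tables):
    """Sum the per-feature scores; values never seen in either set score 0."""
    return sum(tables[name].get(value, 0.0) for name, value in candidate.items())


# Toy data: 'U' is enriched among the known hairpins, 'A' is depleted.
tables = {"first_base": feature_scores(["U", "U", "U", "A"], ["U", "A", "A", "A"])}
print(total_score({"first_base": "U"}, tables))
print(total_score({"first_base": "A"}, tables))
```

A value that is more common in the training set than overall gets a positive score, and a value that is less common gets a negative one, so a high total marks a hairpin that resembles the known miRNAs.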
MiRscan – Relative importance of hairpin features • Certain features were found to be more useful than others in distinguishing miRNAs
MiRscan – Testing the algorithm • In order to test their algorithm, Lim et al. ran MiRscan on the ~36,000 conserved hairpins in the C. elegans and C. briggsae genomes • The 50 known miRNA genes conserved between C. elegans and C. briggsae were used as a training set • 35 sequences received a MiRscan score greater than the mean score of the known genes • These sequences were given special attention in the experimental portion of this research
MiRscan – Results example Flanking sequence of control and real matches in the UTRs.
Problem 3 conclusions • Predicting miRNA genes is a hot subject! • Algorithms use machine learning techniques to predict genes • Candidate genes can be biologically verified to be miRNA genes. Although this process may be slow, it gives feedback and allows refinement of techniques and better predictions • Hundreds (thousands?) of new miRNA genes are suspected to be found in the (near?) future! • Commercial companies are performing these kinds of processes for money…
Problem 4 What are the targets these bind to?
Problem 4: Predicting target genes • Mammals/vertebrates • Lots of known miRNAs • Mostly unknown target genes • Initial method outline • Look at conserved miRNAs • Look for conserved target sites
miRNAs in animals • 0.5%-1.0% of predicted genes encode miRNA (!!) • One of the more abundant regulatory classes • Tissue-specific or developmental stage-specific expression • High evolutionary conservation
TargetScan Algorithm by Lewis et al 2003 The Goal – a ranked list of candidate target genes • Stage 1: Search UTRs in one organism • Bases 2-8 from miRNA = “miRNA seed” • Perfect Watson-Crick complementarity • No wobble pairs (G-U) • 7nt matches = “seed matches”
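Stage 1 amounts to an exact string search for the reverse complement of the seed. The sketch below is a minimal illustration and not TargetScan's implementation: it assumes both sequences are written 5' to 3' in the RNA alphabet (U, not T), the function names are made up, and the UTR is a toy string invented for the example. The miRNA shown is let-7a.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna):
    """Return the UTR site that pairs perfectly with bases 2-8 of the miRNA."""
    seed = mirna[1:8]  # bases 2-8 in 1-based numbering, 7 nt
    # Pairing is antiparallel, so the site is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, utr):
    """Return the 0-based start positions of all perfect 7nt seed matches in the UTR."""
    site = seed_site(mirna)
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)
    return hits


let7a = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAACUACCUCAAAGGGCUACCUCUU"  # invented sequence containing two sites
print(seed_site(let7a))                   # CUACCUC
print(find_seed_matches(let7a, toy_utr))  # [3, 16]
```

Because G-U wobble pairs are excluded at this stage, a plain exact match is enough; the later stages extend each hit and score the resulting duplex.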
TargetScan Algorithm • Stage 2: Extend seed matches • Allow G-U (wobble) pairs • Both directions • Stop at mismatches
TargetScan Algorithm • Stage 3: Optimize basepairing • Remaining 3’ region of miRNA • 35 bases of UTR 5’ to each seed match • RNAfold program (Hofacker et al 1994)
TargetScan Algorithm • Stage 4: Folding free energy (G) assigned to each putative miRNA:target interaction • Assign rank to each UTR • Repeat this process for each of the other organisms with UTR datasets
TargetScan - Results for mammals • Database of 79 miRNA’s searched against human, mouse, and rat orthologous 3’ UTRs • 451 miRNA:target interactions predicted for 400 unique genes • Average 5.7 targets per miRNA • Signal:noise ratio of 3.2:1
TargetScan - Biological relevance • Hypothesis: 5’ conservation of miRNAs is important for mRNA target recognition • Highest signal:noise ratio observed when seed positioned close to 5’ end • Hypothesis: highly conserved miRNAs are more involved in regulation • High degree of conservation -> more predicted targets • Membership in large miRNA family -> more predicted targets