RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar. PMSB2006, June 18, Tuusula, Finland. Yuki Kato, Hiroyuki Seki and Tadao Kasami, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST).
Table of Contents • Background: Grammatical approach to RNA structure modeling • Model: Stochastic multiple context-free grammar • Algorithms: Parsing and parameter estimation • Experimental results: RNA pseudoknot prediction • Summary
RNA Secondary Structure: Stem-Loop • Complementary base pairs: A•U, G•C • Connecting base pairs with arcs gives nested arcs. [Figure: stem-loop structure with the stem and loop labeled, and the same sequence drawn 5' to 3' with nested arcs]
Modeling RNA Secondary Structure by Context-Free Grammar (CFG) • RNA secondary structure can be modeled by the parse structure of a CFG, so structure prediction corresponds to parsing. • Example of CFG rules: [Figure: CFG rules, the secondary structure of u u c a u c a g a a, and its derivation tree]
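The link between parsing and structure prediction can be illustrated with a minimal sketch. It assumes a hypothetical grammar, S → x S | x S y S | ε with x•y a complementary pair, which is not the rule set shown on the slide, and it scores a parse by its number of base pairs rather than by a probability. The name `max_nested_pairs` is illustrative only.

```python
from functools import lru_cache

# Complementary pairs A•U and G•C, written in lowercase.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}

def max_nested_pairs(seq: str) -> int:
    """Maximum number of nested base pairs over all parses of seq.

    Assumed grammar (hypothetical, for illustration):
        S -> x S | x S y S | empty,  where (x, y) is a complementary pair
    """
    seq = seq.lower()

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> int:
        # best(i, j): best score of any parse of the substring seq[i:j]
        if j - i < 2:
            return 0
        # S -> x S : seq[i] stays unpaired
        score = best(i + 1, j)
        # S -> x S y S : seq[i] pairs with seq[k-1]; parse inside and after
        for k in range(i + 2, j + 1):
            if (seq[i], seq[k - 1]) in PAIRS:
                score = max(score, 1 + best(i + 1, k - 1) + best(k, j))
        return score

    return best(0, len(seq))
```

Because every pair produced by this grammar encloses or precedes the others, the parses it scores contain only nested arcs, which is exactly the limitation of CFGs that motivates the rest of the talk.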
RNA Secondary Structure: Pseudoknot • CFGs cannot represent pseudoknots. • Connecting base pairs with arcs gives crossed arcs. [Figure: pseudoknot structure and the same sequence drawn 5' to 3' with crossed arcs]
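The nested versus crossed distinction can be checked mechanically. The sketch below assumes base pairs are given as pairs of 0-based positions. The function name `has_crossing` and the two example pair lists are hypothetical and are not taken from the slides.

```python
from itertools import combinations

def has_crossing(pairs):
    """Return True if two arcs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    # Normalize each arc so that its left end comes first, then sort by left end.
    arcs = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for (i, j), (k, l) in combinations(arcs, 2):
        if i < k < j < l:
            return True
    return False

# Hypothetical examples: a stem-loop (nested arcs) and a pseudoknot (crossed arcs).
stem_loop = [(0, 9), (1, 8), (2, 7)]
pseudoknot = [(0, 6), (1, 5), (3, 9), (4, 8)]
```

Here `has_crossing(stem_loop)` is False and `has_crossing(pseudoknot)` is True, since the arc (0, 6) crosses (3, 9).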
Early Studies [Table: comparison of early studies, where n is the sequence length]
Early Studies (cont.) • Grammars for fully describing RNA pseudoknots: • SL-TAG and ESL-TAG [Uemura et al., 1999] • RPG [Rivas and Eddy, 2000] • These grammars have been identified as subclasses of multiple context-free grammars [Kato et al., 2005].
Motivation • Multiple context-free grammar (MCFG): • Natural extension of CFG • Easy to compare generative power and to design algorithms • Enough generative power to represent pseudoknots • Polynomial-time parsing algorithm • We have shown a candidate for the minimal subclass of MCFGs that can represent pseudoknots [Kato et al., 2005].
What’s New in the Present Work • Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG) • Design of polynomial-time parsing and parameter estimation algorithms for the subclass of SMCFGs • Experiments on RNA pseudoknot prediction
Relation between SMCFG and Major Probabilistic Models [Figure: generative power from weak to strong: FA, CFG, MCFG; their probabilistic extensions: HMM, SCFG, SMCFG; typical applications: gene finding (HMM), stem-loop (SCFG), pseudoknot (SMCFG)]
Stochastic Multiple Context-Free Grammar (SMCFG) • G = (N, T, F, P, S) • N: finite set of nonterminals • T: finite set of terminals • F: finite set of functions • P: finite set of rules with probabilities • S ∈ N: start symbol
Functions of SMCFG • Example:
Rules of SMCFG • Rule: A → f[A1, …, Ak] with probability p • p: probability that the rule is applied • The probabilities of the rules with the same left-hand side must sum to one. • Example:
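The normalization condition on rule probabilities can be sketched as a simple check. The rule list below is hypothetical: the function names `J` and `BASE`, the nonterminals, and all probabilities are assumptions made for illustration, and the tuple layout (left-hand side, function name, right-hand side, probability) is only one possible encoding.

```python
from collections import defaultdict
import math

# Hypothetical rules: (left-hand side, function name, right-hand side, probability).
RULES = [
    ("S", "J", ("A",), 1.0),
    ("A", "UP2Ru", ("B",), 0.8),
    ("A", "BASE", (), 0.2),
    ("B", "UP2La", ("A",), 0.5),
    ("B", "BASE", (), 0.5),
]

def unnormalized_lhs(rules, tol=1e-9):
    """Return {lhs: total} for every left-hand side whose probabilities do not sum to one."""
    totals = defaultdict(float)
    for lhs, _func, _rhs, prob in rules:
        totals[lhs] += prob
    return {lhs: total for lhs, total in totals.items()
            if not math.isclose(total, 1.0, abs_tol=tol)}
```

An empty result means every nonterminal's rules form a proper probability distribution; any entry in the result points at a left-hand side that needs renormalizing.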
Derivation Trees in SMCFG [Figure: derivation tree with root labeled A: f and subtrees for A1, …, Ak with probabilities p1, …, pk]
Modeling Pseudoknot by SMCFG • UP2La[(x1, x2)] = (x1, a x2) • UP2Ru[(x1, x2)] = (x1, x2 u) • Derivation example, bottom-up: A derives (ag, cu) with prob. 0.7; B derives (ag, acu) with prob. 0.35; A derives (ag, acuu) with prob. 0.28
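The derivation on this slide can be replayed as a short sketch. The two functions follow the definitions UP2La and UP2Ru given above. The rule probabilities 0.5 and 0.8 do not appear on the slide: they are assumptions inferred from the ratios 0.35/0.7 and 0.28/0.35. The helper `derive` is hypothetical.

```python
def up2_la(x):
    """UP2La[(x1, x2)] = (x1, a x2): prepend 'a' to the second component."""
    x1, x2 = x
    return (x1, "a" + x2)

def up2_ru(x):
    """UP2Ru[(x1, x2)] = (x1, x2 u): append 'u' to the second component."""
    x1, x2 = x
    return (x1, x2 + "u")

def derive(start, start_prob, steps):
    """Apply (function, rule probability) steps bottom-up, multiplying probabilities."""
    value, prob = start, start_prob
    trace = [(value, prob)]
    for func, rule_prob in steps:
        value, prob = func(value), prob * rule_prob
        trace.append((value, prob))
    return trace

# Assumed rule probabilities: 0.5 for the UP2La step and 0.8 for the UP2Ru step.
trace = derive(("ag", "cu"), 0.7, [(up2_la, 0.5), (up2_ru, 0.8)])
```

The trace visits (ag, cu), (ag, acu) and (ag, acuu), with probabilities matching 0.7, 0.35 and 0.28 up to floating-point rounding. Each nonterminal derives a pair of strings rather than a single string, which is what lets the two components be interleaved with other material to form crossed base pairs.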
SMCFG for RNA Pseudoknot Modeling • W1, …, Wm: nonterminals • Note: W1 is the start symbol. • For each rule, two real values are specified: a transition probability p1 (0 < p1 ≤ 1) and an emission probability p2 (0 < p2 ≤ 1). • The probability of each rule is defined as p = p1 · p2.
Algorithms for SMCFG • CYK algorithm calculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree). • Inside algorithm calculates the probability of a sequence given an SMCFG. • Inside-outside algorithm estimates optimal probability parameters for an SMCFG given a set of example sequences.
CYK Algorithm • Input: … • The following are calculated by dynamic programming: • …: log maximum probability that Wv generates … • …: log maximum probability that Wy generates …
CYK Algorithm (cont.) • Output: the log maximum probability that W1 generates the input sequence, i.e., the log probability of the most likely derivation tree given the entire set of probability parameters.
Algorithm [CYK] • Initialization: for i ← 1 to n+1, j ← i to n+1, v ← 1 to m do if … (ε: empty sequence) then … else … • Iteration: for i ← n downto 1, j ← i−1 to n, k ← n+1 downto j+1, l ← k−1 to n, v ← 1 to m do … (some examples are shown)
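Algorithm [CYK] (cont.) [Figure: case in which Wv is derived from Wy and Wz, with components x1, x21 and x22 and a split point h; positions 1, i, h, h+1, j, k, l, n are marked on the sequence]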
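Algorithm [CYK] (cont.) [Figure: case in which Wv is derived from Wy by adding the bases ai and al around the components x1 and x2; positions 1, i, i+1, j, k, l−1, l, n are marked on the sequence]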
Complexity of CYK Algorithm • m: # of nonterminals (m = a + b) • n: sequence length • Time complexity: O(amn^4 + bn^5) • Space complexity: O(mn^4)
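The n^4 factor in the space bound can be made concrete by enumerating the index tuples that the iteration step visits. The sketch below assumes the loop bounds read as i from n down to 1, j from i−1 to n, k from n+1 down to j+1, and l from k−1 to n; that reading of the slide is an assumption. The function name `count_index_tuples` is hypothetical.

```python
def count_index_tuples(n: int) -> int:
    """Count the (i, j, k, l) tuples visited by the assumed iteration loops."""
    count = 0
    for i in range(n, 0, -1):                  # i <- n downto 1
        for j in range(i - 1, n + 1):          # j <- i-1 to n
            for k in range(n + 1, j, -1):      # k <- n+1 downto j+1
                for l in range(k - 1, n + 1):  # l <- k-1 to n
                    count += 1
    return count
```

Under these bounds the count equals C(n+4, 4) − 1, so a sequence of length 10 gives 1000 tuples per nonterminal. The count grows as n^4/24 for large n, consistent with O(mn^4) table entries over m nonterminals.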
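Experimental Method • Construction of a model: an SMCFG is built from sample sequences with structure annotation taken from an RNA family database. • Secondary structure prediction: a test sequence is parsed with the CYK algorithm. [Figure: workflow from the RNA family database to the SMCFG and from a test sequence to the predicted structure]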
Data Sets for Experiments • Three viral RNA families including pseudoknots from Rfam ver. 7.0
Corona_pk_3 in Rfam ver. 7.0 • Coronavirus 3' UTR pseudoknot • Sequence length: 62-64 [Figure: consensus structure]
HDV_ribozyme in Rfam ver. 7.0 • Hepatitis delta virus ribozyme • Sequence length: 87-91 [Figure: consensus structure]
Tombus_3_IV in Rfam ver. 7.0 • Tombusvirus 3' UTR region IV • Sequence length: 89-92 [Figure: consensus structure]
Evaluation for Prediction Results • precision = (# of correct base pairs predicted by the algorithm) / (# of predicted base pairs) • recall = (# of correct base pairs predicted by the algorithm) / (# of base pairs specified by the annotation)
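The two measures can be computed directly from sets of base pairs. The sketch below assumes each base pair is a tuple of positions and that a predicted pair counts as correct when it also appears in the annotation. Returning 0.0 for an empty denominator is a convention chosen here, not something stated on the slide, and the name `precision_recall` is hypothetical.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for predicted vs. annotated base pairs."""
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)  # correct base pairs predicted
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    return precision, recall
```

For example, predicting the pairs (0, 9), (1, 8), (2, 7), (3, 6) against an annotation of (0, 9), (1, 8), (2, 7) gives a precision of 0.75 and a recall of 1.0: every annotated pair is found, but one predicted pair is spurious.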
Experimental Results • Prediction accuracy
Experimental Results (cont.) • Running time • *: Implemented in ANSI C on a machine with an Intel Pentium D CPU at 2.80 GHz and 2.00 GB RAM
Pair Stochastic Tree Adjoining Grammar (PSTAG) [MSS05] • A derivation tree representing a known structure is built from sample sequences with structure annotation taken from an RNA family database. • Secondary structure prediction: the PSTAG algorithm aligns a test sequence with that derivation tree. [Figure: PSTAG workflow] [MSS05] Matsui et al., “Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures,” Bioinformatics, 2005.
Summary • A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling. • Polynomial time parsing and parameter estimation algorithms have been designed. • Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.