Michal Ziv-Ukelson New Tools for Comparative Structural RNAomics. RNA Structure, Dimensions 1- 3 : Folding. Bioinformatic Structural witnesses for RNA functionality. Witness 1: Structure Stability. Witness 2: Sequence/Structure Conservation.
Bioinformatic Structural Witnesses for RNA Functionality
• Witness 1: Structure stability.
• Witness 2: Sequence/structure conservation (within the structural context).
• Witness 3: Structure conservation.
Structural Cis-Elements: Purine Riboswitch
[Mandal et al., 2003] predicted a potential pseudoknot between the two arms of the purine riboswitch aptamer.
[Figure: purine riboswitch aptamer with the motifs “GGUAU” and “CCGUA” labeled]
Witness 1: Stability of Structure (2D, predicted)
• RNA secondary structure prediction: O(N^3) [Nussinov-Jacobson 1980, Zuker-Stiegler 1981]
• MFOLD: http://www.rpi.edu/~zukerm
• Vienna RNA Package: http://www.tbi.univie.ac.at/~ivo/RNA
[Figure: predicted secondary structure of the sequence AUCCCCGUAUCGAUC AAAAUCCAUGGGUACCCUAGUGAAAGUGUA UAUACGUGCUCUGAU UCUUUACUGAGGAGU CAGUGAACGAACUGA]
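To make the O(N^3) bound concrete, here is a minimal sketch of Nussinov-style base-pair maximization. It is an illustration only: it counts base pairs rather than minimizing free energy, so it is not the thermodynamic model used by MFOLD or the Vienna RNA Package. The function name `nussinov_max_pairs` and the `min_loop` parameter are hypothetical choices made for this example.

```python
# Sketch only: maximizes the number of nested base pairs (Nussinov-style),
# not the free-energy model of MFOLD / Vienna RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is an assumed minimum number of unpaired bases enclosed
    by any pair (hairpin loop size).
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # O(N) interval lengths
        for i in range(n - span):                # O(N) start positions
            j = i + span
            best = dp[i][j - 1]                  # case 1: j stays unpaired
            for k in range(i, j - min_loop):     # O(N) partners k for j
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAAUCC"))
```

The three nested loops (interval length, start position, pairing partner) are where the cubic running time comes from.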
Witness 2: Sequence Conservation (e.g., in binding sites)
• Lactobacillus acidophilus: GGUAU, CCGUA
• Lactobacillus delbrueckii: GGUAU, CCGUA
Witness 3: Compensatory Mutations (in stems)
• Lactobacillus acidophilus: G-U
• Lactobacillus delbrueckii: U-A
Witness 3: Compensatory Mutations (in stems)
• Lactobacillus acidophilus: G-C
• Lactobacillus delbrueckii: C-G
Three Approaches to Structural RNA Comparative Analysis
Input: homologous RNA sequences. Output: aligned structures.
• Approach A: sequence alignment (T-Coffee, ClustalW, Prm) → aligned sequences → fold alignment (RNAalifold, Pfold, ILM) → aligned structures.
• Approach B: simultaneous fold and alignment (Sankoff; LocARNA, Foldalign, Dynalign, Carnac, PMcomp) → aligned structures.
• Approach C: fold sequences (crystallography/NMR, MFE prediction) → homologous RNA secondary structures → structure alignment (RNAforester, MARNA) → aligned structures.
Approach A to Structural RNA Comparative Analysis [Giegerich-2004]
• Step 1: homologous RNA sequences → sequence alignment (T-Coffee, ClustalW, Prm) → aligned sequences. Witness 3: sequence conservation, but without the structural context!
• Step 2: aligned sequences → fold alignment (RNAalifold, Pfold, ILM) → aligned structures. Witness 2: structural conservation. Witness 1: structure stability.
[Figure: example multiple sequence alignment]
Approach A to Structural RNA Comparative Analysis [Giegerich-2004]
• Homologous RNA sequences → sequence alignment (T-Coffee, ClustalW, Prm) → aligned sequences → fold alignment (RNAalifold, Pfold, ILM) → aligned structures.
• Sequences need to be similar enough that they can be aligned initially.
• Yet sequences should be dissimilar enough for co-varying substitutions to be detected!
[Figure: example multiple sequence alignment]
Approach C to Structural RNA Comparative Analysis [Giegerich-2004]
• Homologous RNA sequences → fold sequences (crystallography/NMR, MFE prediction, machine learning) → homologous RNA secondary structures → structure alignment (RNAforester, MARNA) → aligned structures.
Approach C to Structural RNA Comparative Analysis [Giegerich-2004]
• Stage 1: homologous RNA sequences → fold sequences (crystallography/NMR, MFE prediction, machine learning) → homologous RNA secondary structures. Witness 1: structure stability.
• Stage 2: homologous RNA secondary structures → structure alignment (RNAforester, MARNA) → aligned structures. Witness 2: structural conservation. Witness 3: sequence conservation (within the structural context).
• The witnesses are separated into two stages and cannot consult each other!
[Figure: the example sequence from the Witness 1 slide, its predicted structures, and their tree representations with node labels R, M, H, B, I]
The Problem
• Query RNA: sequence/structure known.
• Target RNA sequence: structure not known. Consider top-ranking suboptimal folding predictions.
Outline
• Previously: RNA folding. Now: RNA search
• RNA structure representations
• Approaches to Tree Comparisons
• Algorithm for Approximate Labeled Subtree Isomorphism/Homeomorphism
• Results
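Unordered Unrooted Subtree Homeomorphism (example)
• Whole subtrees from T2 are deleted for free.
• -1 is the cost of a homeomorphic deletion from T2.
[Figure: trees T1 and T2 and the matrix M with entry ∆[i,j]]
Unordered Unrooted Subtree Homeomorphism
[Figure: trees T1 and T2 and the matrix M with entry ∆[i,j]; two mappings, with LSH score = 12 and LSH score = 5]
Unordered Unrooted Subtree Homeomorphism
[Figure: trees T1 and T2 and the matrix M with entry ∆[i,j]; two mappings, with LSH score = 10 and LSH score = 5]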
Our Goal
• Discover ncRNA templates in a sequence database.
• Input: a QUERY structure and a genome sequence of millions of nucleotides (ACGCUGACGUAGUCAGUAGACGAC AGACAGAUACGUCACCGCAGAUAC GCAUAGUAGCAGUAGCAGAUGACG …).
• Are there any occurrences of this structure in the genome?
Comparison of ordered rooted trees • Trees are among the most common and well-studied combinatorial structures in computer science. In particular, the problem of comparing trees occurs in several diverse areas such as: • computational biology • structured text databases • image analysis • automatic theorem proving • compiler optimization.
Ordered rooted tree Shapiro, 1988: • The nodes correspond to elements of secondary structure (hairpin loop, bulge, internal loop or multi-loop). • The edges correspond to base-paired (stem) regions. Zhang, 1998: • The nodes of the tree represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with a base or a pair of bases, respectively. • Two kinds of edges, alternatively connecting either consecutive stem base-pairs or a leaf base with the last base-pair in the corresponding stem.
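As an illustration of the second (Zhang-style) representation, the sketch below builds an ordered rooted tree from a sequence and a dot-bracket structure string: paired bases become internal nodes labeled with the pair, and unpaired bases become leaves labeled with the base. The dot-bracket input format, the function name `dot_bracket_to_tree`, the artificial `root` node, and the dictionary layout are all assumptions made for this example, not part of the presented tool.

```python
def dot_bracket_to_tree(seq, structure):
    """Build an ordered rooted tree from a dot-bracket string.

    Each node is a dict {"label": str, "children": list}. A base pair
    (i, j) becomes an internal node labeled "X-Y"; an unpaired base
    becomes a leaf labeled with the base. An artificial root (an
    assumption of this sketch) holds the top-level elements.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    root = {"label": "root", "children": []}
    stack = [root]      # path from the root to the currently open pair
    open_pos = []       # sequence positions of the unmatched '('
    for i, sym in enumerate(structure):
        if sym == "(":
            node = {"label": None, "children": []}
            stack[-1]["children"].append(node)   # keeps 5'->3' order
            stack.append(node)
            open_pos.append(i)
        elif sym == ")":
            if not open_pos:
                raise ValueError("unbalanced ')' at position %d" % i)
            node = stack.pop()
            node["label"] = seq[open_pos.pop()] + "-" + seq[i]
        elif sym == ".":
            stack[-1]["children"].append({"label": seq[i], "children": []})
        else:
            raise ValueError("unexpected symbol %r" % sym)
    if open_pos:
        raise ValueError("unbalanced '(' in structure")
    return root


if __name__ == "__main__":
    tree = dot_bracket_to_tree("GGAAACC", "((...))")
    print(tree["children"][0]["label"])
```

Because children are appended in sequence order, the left-to-right order of siblings reflects the 5'-to-3' order of the structural elements, which is what makes the tree ordered.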
Comparison of ordered rooted trees
• Ordered tree comparison is generally computed by tree edit distance, which allows various forms of deletions and insertions in both query and target.
• The search for small non-coding RNAs naturally yields a more specific tree search formulation, since we do not allow deletions in the query.
• In our method we apply a weighted pattern matching algorithm for finding the best homeomorphic mapping between two rooted ordered trees.
• Specific constraints on the searched structure can be defined in the input to the search: structural constraints (lengths), allowing or forbidding element deletion in the target, and sequence constraints (local conserved sequence segments, etc.).
The Algorithm
• The subtree isomorphism problem [Matula, 1968, 1978]: Given a pattern tree P and a text tree T, find a subtree of T which is isomorphic to P, i.e. decide whether some subtree of T that is identical in structure to P can be obtained by removing entire subtrees of T, or report that there is no such subtree.
• The subtree homeomorphism problem [Chung, 1987]: a variant of the former problem, where degree-2 nodes can also be deleted from the text tree.
[Figure: homeomorphism example]
Subtree Homeomorphism - Motivation
• Point-mutation events could easily result in an extra bulge in an RNA structure.
• However, in some cases the functional homology to the original, non-mutated structure is still preserved.
• The suggested alignment should be flexible enough to allow the deletion of degree-2 nodes from the target tree.
[Figure: a riboswitch and its functional homologue with an extra bulge]
The Algorithm - Motivation
• In some cases subtrees may be deleted from the target tree but not from the query tree, as in the tRNA case.
• Subtree homeomorphism on ordered rooted trees is more efficient (quadratic in input size) than tree edit distance (cubic in input size).
Problem Definition (Rooted Ordered Approximate Labeled Subtree Homeomorphism):
• Decision version: Given two undirected trees T and P, find whether T has a subtree t that can be transformed into P by:
• Removing entire subtrees
• Removing a degree-2 node and adding the edge joining its two neighbors
• Node relabeling
• Optimization version: Find the best subtree of T that matches P
Subtree Homeomorphism Score
• Let T1 and T2 be two ordered, rooted, homeomorphic trees.
• A mapping µ : T1 → T2 is a one-to-one function from the nodes of T1 to the nodes of T2 that preserves the ancestor relations of the nodes and their relative order.
• The subtree homeomorphism score of the mapping, denoted S(µ), combines the following terms:
• S(u,v), a user-defined node-to-node similarity score function.
• An edge-to-edge similarity score function, where e_u ∈ T1 and e_v ∈ T2 are corresponding edges.
• The penalty for deleting a degree-2 node from T2.
• The penalty for deleting any other node.
Subtree Homeomorphism Score • The cost function varies from one application to another, depending upon the amount of information supplied with the query. • The simplest one just compares the topology of the structures. • More complex functions include length differences of the structural elements and sequence conservation. • The node deletion score (i.e., gap penalty) reflects the tradeoff between a gap and a mismatch. As the gap penalty increases, the algorithm tends to match distant nodes to avoid gaps. As different values may suit different needs, the tool enables users to set this parameter for each run.
Subtree Homeomorphism Score
• Given two rooted ordered trees, P and T, the approximate labeled subtree homeomorphism problem is to find a homeomorphism-preserving mapping µ : P → t from P to some subtree t of T, such that S(µ) is maximal.
The Tree Alignment Algorithm
• A bottom-up, two-level dynamic programming (DP) algorithm.
• It computes an optimal alignment between P and the subtree t of T that maximizes the similarity score between P and t (where P is the query tree and T is the text tree).
• An O(mn) algorithm, where m and n are the number of vertices in P and T respectively.
A Subtree Homeomorphism Recursion
• We define score(u,v) to be S(u,v), the score between the subtree of P rooted in node u ∈ P and the subtree of T rooted in node v ∈ T.
[Figure: trees P and T with the entry S(u,v)]
A Subtree Homeomorphism Recursion
[Figure: the isomorphism and homeomorphism cases of S(u,v) for trees P and T]
Rooted Ordered Subtree Homeomorphism
• In order to compute the matching score between two ordered rooted trees, we start at their leaves.
• We calculate the matching score between the subtrees of two internal nodes using previous values obtained from our dynamic programming table (in post-order).
[Figure: trees P and T with nodes u and v and their children y1, y2, y3]
The two-stage DP approach
• Large DP (LDP): an m*n table. The score of the compared trees = score(a,1).
• Small DP (second-level dynamic programming): activated during computation of each non-leaf entry (u,v) in the LDP in order to compute the optimal mapping between the children of u and the children of v. The example compares the subtrees of f and 9.
Rooted Ordered Subtree Homeomorphism Algorithm
• The algorithm returns a vertex v* ∈ T that maximizes the score S(µ : P → t_v*), where t_v* is the subtree of T rooted at v* (found in the last row of the LDP).
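The two-level scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: every pattern node must be mapped, sibling subtrees of T may be skipped for free, a text node may be deleted (at cost `delete_penalty`) only when it has exactly one child, and edge scores are omitted. The names `Node`, `best_homeomorphic_match`, `sim`, and `delete_penalty` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

NEG_INF = float("-inf")


@dataclass(eq=False)
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def postorder(root: Node) -> List[Node]:
    out: List[Node] = []

    def visit(n: Node) -> None:
        for c in n.children:
            visit(c)
        out.append(n)

    visit(root)
    return out


def best_homeomorphic_match(P: Node, T: Node,
                            sim: Callable[[str, str], float],
                            delete_penalty: float = -1.0) -> Tuple[float, Node]:
    """Return (best score, v*) over all text nodes v for mapping P into T_v."""
    p_nodes, t_nodes = postorder(P), postorder(T)
    score = {}  # large DP table, keyed by (id(u), id(v))
    for u in p_nodes:                # post-order: children before parents
        for v in t_nodes:
            a, b = len(u.children), len(v.children)
            # Small DP: ordered matching of u's children into v's children.
            # prev[j] = best score using the first j children of v.
            prev = [0.0] * (b + 1)
            for i in range(1, a + 1):
                cur = [NEG_INF] * (b + 1)
                for j in range(1, b + 1):
                    matched = prev[j - 1] + score[id(u.children[i - 1]),
                                                  id(v.children[j - 1])]
                    cur[j] = max(cur[j - 1], matched)  # skip v's child for free
                prev = cur
            best = sim(u.label, v.label) + prev[b]     # map u onto v
            if b == 1:                                  # delete single-child v
                best = max(best, score[id(u), id(v.children[0])] + delete_penalty)
            score[id(u), id(v)] = best
    v_star = max(t_nodes, key=lambda v: score[id(P), id(v)])
    return score[id(P), id(v_star)], v_star
```

For example, with `sim = lambda x, y: 2 if x == y else -1`, matching the pattern a(b) against the text a(z(b)) maps a to a and b to b while deleting the intermediate single-child node z, which corresponds to tolerating an extra bulge in the target.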
Rooted Ordered Subtree Homeomorphism: Time Complexity
• Suppose |T1| = m, |T2| = n
• Naïve calculation: we need to fill mn cells, and each cell takes O(mn) to fill, for a total of O(m^2 n^2)
• BUT: we notice that each pair of cells participates only once in the sequence alignment stage, when their parents’ matching score is calculated
• Hence the algorithm takes O(mn) time
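The O(mn) bound can be checked with a short calculation. As an assumption of this sketch, let c(x) denote the number of children of node x, and suppose the small DP for an entry (u,v) costs O(c(u) c(v)), since it aligns the children of u against the children of v. Summing over all entries, and using the fact that the child counts in a rooted tree sum to the number of nodes minus one:

```latex
\sum_{u \in T_1} \sum_{v \in T_2} c(u)\, c(v)
  = \Bigl( \sum_{u \in T_1} c(u) \Bigr) \Bigl( \sum_{v \in T_2} c(v) \Bigr)
  = (m - 1)(n - 1)
  = O(mn)
```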
Taking into account sequence considerations
• A variety of sequence considerations:
• Sequence alignment criterion on the single-strand regions such as bulges and loops (tRNA and riboswitches)
• Sequence alignment scoring on the compared stems (miRNA)
• Sequence comparisons are performed only on the small number of filtered candidates, so their effect on the overall search runtime is negligible.
• Pipeline: target database → filtering by structure constraints → relatively small number of structures → applying sequence constraints → final set of candidates
Experimental Results • Riboswitches • Purine Riboswitch • tRNA
Purine Riboswitch
• Riboswitches:
• Part of an RNA molecule.
• Directly bind a small target molecule with high affinity and, as a consequence, respond with conformational switching that affects the gene’s activity.
• Purine riboswitch: binds guanine/adenine to regulate purine metabolism and transport.
Purine Riboswitch The secondary structure: A three-stem junction with a multiloop connecting two hairpins and the 5’-3’ end. Significant sequence conservation occurs within P1 and in the unpaired regions. Some base-pairing potential exists between the two stem-loop sequences, which might permit the formation of a pseudoknot.
Results – First dataset
• FN = 0
• Sensitivity = TP/(TP+FN) = 1
• PPV = TP/(TP+FP) = 1, except for Clostridium perfringens
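The two measures above can be computed as in the following sketch. The function names and the counts in the usage example are hypothetical and are not taken from the reported datasets.

```python
def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN); None when there are no positives."""
    total = tp + fn
    return tp / total if total else None


def ppv(tp, fp):
    """Positive predictive value = TP / (TP + FP); None when nothing was predicted."""
    total = tp + fp
    return tp / total if total else None


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(sensitivity(tp=5, fn=0))
    print(ppv(tp=3, fp=1))
```

With FN = 0 the sensitivity is 1 regardless of the number of true positives, while a single false positive is enough to pull the PPV below 1.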
Results – Second dataset
• The search was conducted in three stages:
• S1: based only on topological similarity, as computed via subtree homeomorphism.
• S2: enhancing the structural comparison with edge and loop length criteria.
• S3: combining the sequence considerations into the search. This reduced the number of false positives to zero or one.
• This shows the importance of the additional constraints supported by our tool for controlling false positives.
Searching for Riboswitches in Newly Sequenced Data
Lactobacillus family:
• Lactobacillus acidophilus at c(237640..237705)
• Lactobacillus delbrueckii at c(251482..251547)
• Lactobacillus salivarius at c(1357553..1357618)
Sequence conservation of nucleotides in the functionally critical positions [Mandal et al., 2003].
Searching for riboswitches in newly sequenced data
• Structural functionality was further asserted by running the RNAalifold multiple structural alignment program with the three candidate sequences as input.
• Consistent mutations: mutations that conserve the stem structure.
• Compensatory mutations: joint events where a mutation in one nucleotide was compensated by a corresponding mutation in the paired nucleotide in order to conserve the stem structure.
[Figure: RNAalifold output annotated with regions of high sequence conservation, consistent mutations, and compensatory mutations]
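The distinction between the mutation types can be illustrated with a small classifier that compares the base pair found at the same stem position in two species. This is a sketch under assumptions: the function name `classify_pair_change`, the category strings, and the inclusion of G-U wobble pairs among the allowed pairs are choices made for this example, and "consistent" is taken here to mean that one side changed while pairing was preserved.

```python
# Assumed set of allowed pairs: Watson-Crick plus G-U wobble.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def classify_pair_change(pair_a, pair_b):
    """Classify how a stem base pair differs between two species.

    pair_a and pair_b are 2-character strings or 2-tuples, e.g. "GC".
    Returns "conserved" (no change), "consistent" (one side mutated,
    pairing preserved), "compensatory" (both sides mutated, pairing
    preserved), or "not paired" (either pair cannot form).
    """
    a, b = tuple(pair_a), tuple(pair_b)
    if a not in CANONICAL or b not in CANONICAL:
        return "not paired"
    changed = sum(x != y for x, y in zip(a, b))
    return ("conserved", "consistent", "compensatory")[changed]


if __name__ == "__main__":
    print(classify_pair_change("GC", "CG"))
    print(classify_pair_change("GC", "GU"))
```

Under these assumptions, G-C versus C-G and G-U versus U-A are both compensatory, since both nucleotides changed and the positions can still pair.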