Local Exact Pattern Matching for Non-fixed RNA Structures
Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Mohl, Christina Schmiedl, Sebastian Will
RNA
An RNA R is an ordered pair (S, B), where:
• S is a sequence over Σ = {A, C, G, U}; it represents the primary structure of R
• B is a set of base pairs, each of type C-G, G-C, A-U, or U-A; it represents the secondary structure of R
[Figure: an RNA secondary structure; legend: base pair, single base, backbone connection]
CPM 2012, Helsinki
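The pair (S, B) could be held in a small container such as the one below. This is only an illustrative sketch: the class name `RNA`, the field names, and the choice of 0-based index pairs are assumptions, not notation from the talk.

```python
from dataclasses import dataclass

# Base-pair types listed on the slide: C-G, G-C, A-U, U-A.
ALLOWED_PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


@dataclass(frozen=True)
class RNA:
    """Hypothetical container for R = (S, B); the names are not from the talk."""
    seq: str          # S, a string over {A, C, G, U}
    pairs: frozenset  # B, as 0-based index pairs (i, j) with i < j

    def __post_init__(self):
        if set(self.seq) - set("ACGU"):
            raise ValueError("sequence must use only A, C, G, U")
        for i, j in self.pairs:
            if not (0 <= i < j < len(self.seq)):
                raise ValueError(f"bad index pair {(i, j)}")
            if (self.seq[i], self.seq[j]) not in ALLOWED_PAIRS:
                raise ValueError(f"bases at {(i, j)} cannot pair")


# A small hairpin: three G-C pairs closing an AAA loop.
r = RNA("GGGAAACCC", frozenset({(0, 8), (1, 7), (2, 6)}))
```

Validating the pair types at construction keeps later code free of repeated checks; whether the original implementation does this is not stated in the slides.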
RNA Representations
• Tree
• Arc-annotated string
[Figure: the same RNA drawn as a secondary structure, as a tree, and as an arc-annotated string]
RNA Secondary Structure
• Determines the activity and function of the RNA
• Is usually better preserved during evolution than the primary sequence
The secondary structure of RNA is therefore heavily researched.
RNA Structure
• Predicting the secondary structure of an RNA molecule is a difficult task
• The structure is sometimes given in a non-fixed form, where each base pair exists in the RNA with some probability ≤ 1
Nested Structure
In all of these examples, the structure of R is Nested: each base is bonded to at most one other base, and no two arcs cross.
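The two conditions of the Nested definition can be checked with a single stack pass over the arc endpoints. The function below is a minimal sketch; `is_nested` is a hypothetical helper name, and it assumes arcs are given as 0-based index pairs (i, j) with i < j.

```python
def is_nested(pairs):
    """Return True if no base lies in two arcs and no two arcs cross."""
    endpoints = [p for pair in pairs for p in pair]
    if len(endpoints) != len(set(endpoints)):
        return False  # some base takes part in more than one base pair

    # Map each left endpoint to its right endpoint.
    partner = {i: j for i, j in pairs}

    stack = []  # right endpoints of the arcs that are currently open
    for pos in sorted(endpoints):
        if pos in partner:
            stack.append(partner[pos])  # an arc opens here
        elif stack.pop() != pos:
            return False                # an arc closes out of order: crossing
    return True
```

For instance, the arcs (0, 8), (1, 7), (2, 6) are nested, while (0, 5) and (3, 8) cross.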
Unlimited Structure
Arc-annotated strings can represent Unlimited structures as well.
Bounded-Unlimited Structure
Arc-annotated strings can represent Bounded-Unlimited structures: each base can be connected to at most a constant number of other bases, and crossing arcs are allowed.
RNA Similarity Algorithms
Many algorithms for finding similarity between RNA molecules use tree similarity algorithms.
• Tree Edit Distance:
  • Tai (’79) O(n⁶)
  • Zhang & Shasha (’89) O(n⁴)
  • Klein (’98) O(n³ log n)
  • Ma et al. (’99) O(n³ log n)
  • Demaine et al. (’07) O(n³)
• Tree Alignment:
  • Jiang et al. (’95)
  • Schirmer & Giegerich (’11)
  • Backofen et al. (’07)
  • Mohl et al. (’09)
• Longest Arc-Preserving Common Subsequence:
  • Evans (’99)
  • Lin et al. (’02)
  • Alber et al. (’04)
  • Jiang et al. (’04)
• Similar Subforests:
  • Jansson & Peng (’11)
Exact Pattern Matching Problem
In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules.
[Figure: two RNAs with a common pattern highlighted]
Exact Pattern Matching Problem
Find all maximal common sequence-structure regions between two RNAs.
Solved by Backofen & Siebert in O(n²) for fixed Nested × Nested structures.
[Figure: two arc-annotated strings; legend: single base match, left endpoint match, type mismatch]
Exact Pattern Matching Problem
In this work, we solve the problem for non-fixed Nested × Nested structures.
[Figure: the same two arc-annotated strings, with one arc marked "arc breaking"]
Arc Breaking Operation
• We support the operation of arc breaking, in which a base pair can be deleted at no penalty; its two bases remain in the sequence as single bases.
[Figures: a base pair before and after arc breaking, shown on the secondary structure and on the tree]
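As a small illustration of arc breaking, the sketch below removes one arc from the arc set and leaves the sequence untouched. The function name `break_arc` and the representation (a string plus a set of 0-based index pairs) are assumptions made for the example.

```python
def break_arc(seq, pairs, arc):
    """Delete one base pair; both of its bases stay in seq as single bases."""
    if arc not in pairs:
        raise KeyError(f"{arc} is not a base pair of this structure")
    return seq, frozenset(pairs) - {arc}


seq, pairs = break_arc("GGGAAACCC", {(0, 8), (1, 7), (2, 6)}, (1, 7))
# seq is unchanged; pairs now holds only (0, 8) and (2, 6)
```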
Arc Breaking
Patterns are now less restrictive.
[Figure: a pattern found with arc breaking]
Exact Pattern Matching Algorithms
We describe three algorithms for finding the local exact pattern matching between two RNAs:
• A simple O(n⁴) algorithm (using ideas from Zhang & Shasha (’89))
• An improved O(n³ log n) algorithm (using ideas from Klein (’98))
• An O(n³) algorithm (using ideas from Demaine, Weimann et al. (’07))
Exact Pattern Matching Algorithm
Input: R1 = (S1, B1) and R2 = (S2, B2), |R1| = n, |R2| = m, n > m
Output: Local exact pattern matching between R1 and R2
We compare each base pair from R1 with each base pair from R2, in increasing order of their sizes.
For each two base pairs, we compute the matching inside the base pairs and the extensions to their outsides.
Matching Inside the Base Pairs
• Dynamic programming algorithm
• Similar to the LCS / edit distance algorithms for strings
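For reference, the classic string LCS dynamic program that the slide alludes to looks as follows. This is the plain-string analogue only, not the talk's algorithm: it ignores base pairs entirely, and `lcs_length` is a name chosen for the example.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # match the last characters
            else:
                dp[i][j] = max(dp[i - 1][j],      # drop the last char of a
                               dp[i][j - 1])      # drop the last char of b
    return dp[n][m]
```

Each cell depends on a constant number of earlier cells, which is the same property that lets each prefix comparison in the talk be computed in O(1) time.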
Matching Inside the Base Pairs
On each comparison we compute only prefixes of the substrings, and select the maximal score over 4 expressions:
• Match base pairs
• Match single bases
• Delete from R1
• Delete from R2
[Figures: prefix 1..i inside bp1 compared with prefix 1..j inside bp2; the match cases test S1(i) == S2(j)]
Matching Inside the Base Pairs
On each comparison we compute the maximal match from left to right, and likewise from right to left.
There are two tricky parts here:
• What happens when a mismatch occurs?
• What happens when the matchings overlap?
The solution: on each comparison we compute the best score going both right-to-left and left-to-right.
Time Complexity
• We only compare prefixes of the base pairs
• There are O(n²) prefixes for each RNA, hence O(n⁴) pairs of prefixes to compare
• Each comparison is computed in O(1) time
• The total time is O(n⁴)
Extending the Match
We compute the maximal pattern extension for all bases in R1 and all bases in R2 in one run.
The time complexity: O(n²)
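One way to see why a single O(n²) pass suffices is the sequence-only simplification below, which tabulates, for every pair of positions, how far an exact match extends to the right. It is a sketch under a strong assumption: base pairs are ignored, whereas the extension in the talk works on sequence and structure together. The name `right_extensions` is hypothetical.

```python
def right_extensions(s1, s2):
    """ext[i][j] = length of the longest exact match starting at s1[i], s2[j]."""
    n, m = len(s1), len(s2)
    ext = [[0] * (m + 1) for _ in range(n + 1)]
    # Fill from the ends so that ext[i + 1][j + 1] is ready when needed.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                ext[i][j] = ext[i + 1][j + 1] + 1
    return ext


ext = right_extensions("GACU", "ACUG")
# ext[1][0] is the extension of the match that starts at "ACU" in both strings
```

The left extensions can be computed symmetrically by filling the table from the start instead of the end.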
Total Time Complexity
Computing the pattern match inside all base pairs is done in O(n⁴).
Computing the pattern match extensions to the right and to the left is done in O(n²).
The total time complexity is O(n⁴) + O(n²) = O(n⁴).
An O(n³ log n) Algorithm
We use ideas from Klein’s Tree Edit Distance algorithm (’98): we decompose the larger RNA into heavy paths.
The root base pair is marked light; we then continue recursively: select the maximal child base pair and mark it heavy, and mark the rest of the children light.
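The marking step can be sketched as below for a nested arc set. Two assumptions are made here and are not taken from the slides: "maximal" is read as the largest span j - i, and every top-level base pair is treated as a root. The function name `heavy_light` is hypothetical.

```python
def heavy_light(pairs):
    """Label each base pair of a nested structure 'heavy' or 'light'."""
    children = {}  # base pair -> list of base pairs directly enclosed by it
    roots = []     # top-level base pairs
    stack = []     # arcs that are still open, outermost first
    for i, j in sorted(pairs):             # by left endpoint
        while stack and stack[-1][1] < i:
            stack.pop()                    # that arc closed before position i
        (children[stack[-1]] if stack else roots).append((i, j))
        children[(i, j)] = []
        stack.append((i, j))

    label = {r: "light" for r in roots}    # roots are light
    for kids in children.values():
        if kids:
            # Assumption: the "maximal" child is the one with the largest span.
            heavy = max(kids, key=lambda p: p[1] - p[0])
            for k in kids:
                label[k] = "heavy" if k == heavy else "light"
    return label


labels = heavy_light({(0, 20), (1, 6), (8, 19), (9, 12), (14, 18)})
```

Following the heavy labels downward from each light base pair traces out one heavy path of the decomposition.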
Special Substrings
For each base pair bp we define its special substrings.
The number of special substrings of a base pair bp with heavy child hp is |bp| - |hp| + 1.
Lemma (Sleator & Tarjan ’83): there are O(n log n) special substrings in an RNA R of size n.
[Figure: a base pair bp = (a, b), its heavy child hp = (x, y), and the special substrings obtained by growing hp outward to bp]
An O(n³ log n) Algorithm
• We compare all O(n²) substrings of R2 with the O(n log n) special substrings of R1.
• The comparisons are made between the rightmost or the leftmost bases, according to the special substring.
• The total number of compared substring pairs is O(n²) · O(n log n) = O(n³ log n), each computed in O(1) time, which gives a total running time of O(n³ log n).
• This algorithm also works for Nested × Bounded-Unlimited structures.
An O(n³) Algorithm
• Based on the algorithm of Demaine et al. (’07), we decompose both RNAs into heavy paths.
• The special substrings are decided on each base pair comparison: the base pair that has the largest root light base pair is the dominant one.
• The number of compared substrings is O(n³).
• This algorithm works for Nested × Nested structures only.
[Figure: heavy path decompositions of R1 (base pairs 1 to 9) and R2 (base pairs A to F)]
More Algorithms
• Find the local approximate pattern matching between Nested × Nested structures in O(n³k²), for k allowed mismatches
• Find the local approximate pattern matching between Nested × Bounded-Unlimited structures in O(n³k² log n), for k allowed mismatches
• Find the most similar sibling substructures between Nested × Nested structures in O(n³)
Thank you!