240 likes | 410 Views
Pattern Discovery in RNA Secondary Structure Using Affix Trees. (when computer scientists meet real molecules) Giulio Pavesi & Giancarlo Mauri Dept. of Computer Science, Systems and Communication University of Milano – Bicocca Milan, Italy. Why Is RNA So Interesting?.
E N D
Pattern Discovery in RNA Secondary Structure Using Affix Trees (when computer scientists meet real molecules) Giulio Pavesi & Giancarlo Mauri Dept. of Computer Science, Systems and Communication University of Milano – Bicocca Milan, Italy
Why Is RNA So Interesting? • After the completion of various genome projects, the attention of many researchers has shifted from coding to non – coding parts • More than 95% of our genome is not coding: what about the rest? • Non – coding RNA: RNA that is transcribed from DNA, but does not encode directly for a protein(tRNA, microRNA, etc.) Giulio Pavesi
A Motivating Example Post-transcriptional regulation of gene expression Giulio Pavesi
The Problem • Functionally related RNA sequences present structural similarity, at least in some parts • Given two or more RNA molecules, find similar (supposedly functional) structural elements in them • Sequence similarity implies structure similarity, but this is not always that true for RNA..... • Given two or more RNA sequences of unknown structure, find similar structural elements in them (motifs) • Low sequence similarity can anyway correspond to high structure similarity Giulio Pavesi
“Know Thine Enemy” • RNA secondary structure: list of the base pairs among nucleotides in the sequences, such that: • No nucleotide takes part in more than a single base pair (usually, Watson – Crick pairs and wobble pairs G – T, i.e. canonical base pairs) • Base pairs never cross: if nucleotide i is bound to nucleotide j and k with l, then either i < j < k < lor i < k < l < j Giulio Pavesi
RNA Secondary Structure .((..(((.((....)))))...(((.(((...)))...))))) Giulio Pavesi
Motifs in RNA Secondary Structure • Many functional motifs can be described by secondary structure alone • Two types of similarity: • sequence similarity (in unpaired nucleotides, mainly) • structure similarity Giulio Pavesi
Data Structures? • When dealing with DNA or protein sequences, some significant advantages have been obtained by using suitable text—indexing structures (e.g. suffix trees) • RNA secondary structure can be described by a string • Is there a “good” structure that will do for RNA sequences, allowing us to consider sequence and structure at the same time? Giulio Pavesi
Affix Trees Affix tree for string S = ATATC Suffix and prefix edges Suffix edges spell the substrings of string S Prefix edges (dotted) spell substrings of S-1 (the reverse Built in linear time Takes linear space Giulio Pavesi
Affix Trees • The affix tree of a string S indexes all the substrings of bothS and S-1 • Once a substring of S has been located in the tree, we can extend it to the right (by following suffix edges) and to the left (by following prefix edges) • Good if we search for patterns in the sequences with some kind of symmetry Giulio Pavesi
The Hairpin • The basic element of RNA secondary structure is the hairpin (or stem—loop) structure • The hairpin is symmetric!!!! ((((( ...... ))))) AGGTC CAGTCA GATCT Giulio Pavesi
First Try • Predict the secondary structure of each input sequence • Build the affix tree for the folded sequences (in bracket notation) • Search exhaustively for patterns describing hairpin structures (possibly with differences) • Report those occurring in at least q sequences Giulio Pavesi
Searching for Hairpins in Affix Trees • For each loop size l: • Find l dots in the tree, on suffix edges (hairpin loop) • Add a base pair: • Find a )on suffix edges • Find a ( on prefix edges • If the result appears in at least q sequences, jump to 2, else return from jump • Add internal loops: • Find a dot on prefix edges: jump to 2; • Find a dot on suffix edges: jump to 2; Giulio Pavesi
Recursive Algorithm 1 .... (ok) 2 (....) (ok) 2 ((....)) (ok) 2 (((....))) (no) 3a .((....)) (ok) 2 (.((....))) (ok) • On each path, we keep a pointer for the prefix edge, and another for the suffix edge • Speed—up: represent the unpaired elements with a single symbol describing type and size, so to compare two symbols instead of two regions Giulio Pavesi
Approximate Search • We can allow some approximation: • Hairpin loops of different size (range value at step 1) • Internal loops of different size at the same position along the stem • Internal loops or bulges at different positions along the stem • Stems of different size (base pairs) • Any combination of the previous Giulio Pavesi
Complexity • Given a set of k folded sequences of overall length N : • Construction of the tree: O(N) • Annotation of the tree: O(kN) • Search: O(V(m)kN), where m is the length of the longest pattern found • V(m) depends on the degree of approximation • In practice, the most time consuming part is predicting the structure of the sequences Giulio Pavesi
Does It Work? • Test: Iron Responsive Element, located in the UTRs of mRNA coding for proteins involved in iron metabolism (e.g. ferritin, transferrin) • Does it appear in all the predicted structures? • Alas, it does not!!!!!! Giulio Pavesi
Why? The “real structure”often does not correspond to the optimal one!!!! The motif “disappears” from the (supposedly) optimal structure Giulio Pavesi
One Possible Solution • Idea: for each sequence, consider also a number of alternative sub-optimal structures • All the possible structures can be enumerated • Check whether a motif appears in at least one alternative structure per sequence • The affix tree can handle efficiently even hundreds of alternative structures per input sequence • Downside: the number of potential secondary structures for a sequence of length n is O(2n) • If similarity is not stringent, we have too many candidates Giulio Pavesi
But..... • If the same structure has to appear in a set of sequences, then the same pattern of complementary base pairs has to appear in the sequences ((((( ...... ))))) AGGTC CAGTCA GATCT GCGAG CAGTCT CTTGC CCCAG CAGTCA CTGGG Giulio Pavesi
Idea! • Instead of working on folded sequences, build the affix tree for the sequences alone, and find complementary base pairs on the fly • The search can be implemented with the same parameters of the folded case Giulio Pavesi
Building Hairpins on the Fly • By working on unfolded sequences, the theoretical time complexity is higher, since different paths correspond to the same structure • In practice it is much faster, since we do not have to run the prediction algorithm on the input sequences • We need to “validate” the candidate structures, e.g. according to their energy Giulio Pavesi
Post - Processing • So far we have considered structure alone • More than a single motif occurrence per sequence is often reported, especially if structural constraints are loose • Post processing: compare the candidate occurrences by evaluating sequence similarity in unpaired elements • Find the group of instances that are more similar at the sequence level Giulio Pavesi
Results and Work in Progress • The second approach gave better results, in terms of reliability and efficiency • Candidate hairpins can be validated according to their energy value (more reliable, in this case!) • Good results on “harder” tests • Too many input parameters yet • Extend to more complex structures Giulio Pavesi