Explore methods to enhance the speed of non-coding RNA annotation without compromising accuracy. Discover the significance of secondary structures and the use of Covariance Models (CMs) for efficient identification. Learn about techniques like ERPIN and advanced improvements for increased sensitivity.
Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs without Loss of Accuracy • Zasha Weinberg and Walter L. Ruzzo • Presented by: Jeff Bonis • CISC841 - Bioinformatics
What Are Non-Coding RNAs (ncRNA)? • “functional molecules that do not code for proteins” • Examples: transfer RNA (tRNA), spliceosomal RNA, microRNA, regulatory RNA elements • Over 100 known ncRNA families
Secondary Structure of ncRNAs • Conserved, therefore useful for identifying homologs • Secondary structure is functionally important to RNAs • Base pairing is important in pattern searching • e.g. 16S rRNA - part of the small subunit of the prokaryotic ribosome
What Techniques Exist? • Two models that predict homologs in ncRNA families • Covariance Models (CMs) • Easy RNA Profile IdentificatioN (ERPIN) - http://tagc.univ-mrs.fr/erpin/ • Both use multiple alignment of family members with secondary structure annotation • Statistical model is built from this multiple alignment • Display high sensitivity and low specificity
What about ERPIN? • DP algorithm matches the statistical profile onto a target database and returns the solutions and their scores • Cannot take into account non-consensus bulges in helices (caused by indels) • Needs user-specified score thresholds, which compromises accuracy
CMs • “specify a tree-like SCFG architecture suited for modelling consensus RNA secondary structures.” • Can’t accommodate pseudoknots • Very slow algorithm
Which model should be improved? • Covariance Model (CM) is chosen because its limitation, pseudoknots, contains little information anyway • Address slow speed without sacrificing accuracy • CMs used in Rfam - http://rfam.wustl.edu • 8 gigabase genome DB called RFAMSEQ • Takes over a year to search RFAMSEQ for tRNA on a P4 • Over 100 ncRNA families
Previous improvements on speed • BLAST based heuristic • Known members are BLASTed against RFAMSEQ • CM is run on resulting set • BLAST misses family members, especially where there is low sequence conservation • tRNAscan-SE - http://www.genetics.wustl.edu/eddy/tRNAscan-SE/ • Uses 2 heuristic based programs for tRNA searches • CM is used on resulting set • May miss tRNAs that CMs would find
How to improve sensitivity? • Authors previously developed rigorous filters that keep 100% of the hits the CM would find • Filters based on profile HMMs • Profile HMM is built from the CM, then run on the DB • Much of the DB is filtered out; CM runs on the remaining set • HMM filter is based on sequence conservation • Worked for 126 of 139 ncRNA families in Rfam • Other 13 display low sequence conservation but strong conservation of secondary structure, which an HMM can’t take into account • Heuristic methods also miss these ncRNAs
How can these special biological situations be accounted for? • Authors propose 3 innovations to overcome these setbacks • 2 techniques include secondary structure information in filtering at the expense of CPU time • Sub-CMs • Hybrid filtering composed of CMs and profile HMMs • Store-Pair • Uses additional HMM states for modeling key base pairs • Third technique helps reduce scan time • Runs filters in series, quickest first and most selective last • Solved as a shortest path problem
Results • Techniques worked for 11 of the 13 Rfam families previously missed • Also found new hits missed by BLAST • For tRNAscan-SE, provided a rigorous scan for 3 of 4 CMs, finding missed hits • 100 times faster than raw CM on average • Uncovers members missed by heuristics
What are CMs anyway? • “statistical models that can detect when a positional sequence and secondary structure resemble a given multiple RNA alignment” • Described in terms of stochastic context-free grammars (SCFGs) • Transformational Grammars • Rules: describe the grammar, of the form Si -> xL Si+1 xR, where xL and xR are the left and right nucleotides • Terminals: symbols in the actual string (nucleotides) • Non-Terminals: abstract symbols (states) • Parse: series of steps to obtain the final output • Example: • RNA molecules CAG or GAC • S1 -> c S2 g | g S2 c; S2 -> a • Parse: S1 -> c S2 g -> cag
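The toy grammar on this slide can be expanded mechanically. The following is a minimal sketch, not part of the original presentation; the dictionary layout and the `expand` helper are illustrative assumptions. It rewrites the leftmost non-terminal until only nucleotides remain, which is exactly what a parse does.

```python
# Toy grammar from the slide: S1 -> c S2 g | g S2 c ; S2 -> a
# Non-terminals are the dictionary keys; every other symbol is a terminal.
GRAMMAR = {
    "S1": [["c", "S2", "g"], ["g", "S2", "c"]],
    "S2": [["a"]],
}

def expand(symbols):
    """Yield every terminal string derivable from a list of symbols."""
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:
            # Rewrite the leftmost non-terminal with each of its rules.
            for rhs in GRAMMAR[sym]:
                yield from expand(symbols[:i] + rhs + symbols[i + 1:])
            return
    # No non-terminals left: the derivation is complete.
    yield "".join(symbols)

language = sorted(expand(["S1"]))
print(language)
```

The grammar generates only the two molecules named on the slide, because S1 always emits its left and right nucleotides together as a pair.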
How are CMs used? • Each rule is assigned a probability • Rules more consistent with the family have higher probability • The probability of a parse is the product of the probabilities of the rules it uses • CMs use log-odds ratios and sum the scores instead of multiplying • CM Viterbi requires a window length as input, which upper bounds a family member’s length and affects scan time
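A small numeric sketch of the product-versus-sum point. The rule probabilities and the uniform 0.25 background model below are made-up values for illustration, not numbers from the paper.

```python
import math

# Hypothetical rule table: rule -> (probability, nucleotides emitted by the rule).
RULES = {
    "S1 -> c S2 g": (0.7, 2),
    "S1 -> g S2 c": (0.3, 2),
    "S2 -> a": (1.0, 1),
}
# Assumed background (null) model: every nucleotide has probability 0.25.
BACKGROUND = 0.25

def parse_probability(parse):
    """Probability of a parse: product of the probabilities of its rules."""
    return math.prod(RULES[rule][0] for rule in parse)

def rule_score(rule):
    """Log-odds score of one rule against the background model, in bits."""
    prob, emitted = RULES[rule]
    return math.log2(prob / BACKGROUND ** emitted)

def parse_score(parse):
    """Log-odds score of a parse: sum of the per-rule scores."""
    return sum(rule_score(rule) for rule in parse)

parse = ["S1 -> c S2 g", "S2 -> a"]  # the parse that yields "cag"
prob = parse_probability(parse)
score = parse_score(parse)
print(prob, score)
```

Summing per-rule log-odds gives the same ranking as multiplying probabilities, while avoiding numerical underflow on long parses.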
How are profile HMMs and CMs combined? • Given a CM, a profile HMM is created whose Viterbi score upper bounds the CM’s Viterbi score • Guarantees 100% sensitivity relative to the CM • Filtering: • At each nucleotide position in the database, an HMM is used to compute an upper bound on the CM score • A CM scan is applied to all subsequences whose upper bound exceeds some threshold • Subsequences below the threshold are filtered out • Profile HMMs are represented by regular grammars, which cannot emit paired nucleotides, e.g. • CM: S1 -> a S2 u | c S2 g; S2 -> ε • HMM: S1L -> a S2L | c S2L; S2L -> S1R; S1R -> g | u • Each CM state is expanded into a left and a right HMM state
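The filtering loop can be sketched on the toy CM above. Everything numeric here is a hypothetical assumption: the pair scores, the way the left and right HMM scores are chosen (left score equal to the best pair score, right score zero), and the two-nucleotide window. The authors' actual construction of the HMM scores is not shown on the slide; the only property this sketch relies on is that the HMM score never falls below the CM score.

```python
NEG_INF = float("-inf")

# Hypothetical log-odds scores for the toy CM: S1 -> a S2 u | c S2 g ; S2 -> ε
PAIR_SCORE = {("a", "u"): 2.0, ("c", "g"): 3.0}

# HMM scores chosen so that LEFT[x] + RIGHT[y] >= PAIR_SCORE[(x, y)] for every pair.
# The HMM scores the left and right nucleotides independently, so it cannot
# tell that "a" must pair with "u" rather than "g".
LEFT = {"a": 2.0, "c": 3.0}
RIGHT = {"u": 0.0, "g": 0.0}

def cm_score(window):
    """Toy 'CM': score a 2-nucleotide window as a single base pair."""
    return PAIR_SCORE.get((window[0], window[1]), NEG_INF)

def hmm_bound(window):
    """Toy profile HMM: an upper bound on cm_score for the same window."""
    return LEFT.get(window[0], NEG_INF) + RIGHT.get(window[1], NEG_INF)

def filtered_scan(seq, threshold):
    """Run the CM only on windows whose HMM bound reaches the threshold."""
    hits, cm_calls = [], 0
    for i in range(len(seq) - 1):
        window = seq[i:i + 2]
        if hmm_bound(window) < threshold:
            continue  # bound too low, so the CM cannot reach the threshold
        cm_calls += 1
        if cm_score(window) >= threshold:
            hits.append(i)
    return hits, cm_calls

def full_scan(seq, threshold):
    """Reference scan: run the CM on every window."""
    return [i for i in range(len(seq) - 1) if cm_score(seq[i:i + 2]) >= threshold]

hits, cm_calls = filtered_scan("acgauag", 1.5)
print(hits, cm_calls)
```

In this run the window "ag" passes the HMM filter but is rejected by the CM, which illustrates why the filter is rigorous (no CM hit is lost) but not perfectly selective.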
How can these be supplemented? • Selecting an optimal series of filters • Filtering fraction (fraction of DB left over) and run time are estimated by running a filter on a training sequence • Minimize expected total CPU time • Assumptions: • Estimated fractions and CPU times are constant for all training sequences • A filter’s fraction is not affected by the previously run filters • Optimal sequence of filters is solved as a shortest path problem on a graph • Nodes are the filters and the CM • Edge weights are CPU times
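A minimal sketch of the shortest-path formulation under the two assumptions listed on the slide. The filter names, fractions, and costs are invented for illustration, and the edge weight used here (fraction left by the previous stage times the per-nucleotide cost of the next stage) is one plausible reading of "edge weights are CPU times", not necessarily the authors' exact definition.

```python
# (name, fraction of DB left after the filter, CPU cost per nucleotide scanned)
# All numbers are hypothetical.
FILTERS = [
    ("hmm", 0.10, 1.0),
    ("store-pair", 0.02, 30.0),
    ("sub-cm", 0.001, 20.0),
]
CM_COST = 1000.0  # hypothetical cost per nucleotide of the full CM

def best_filter_chain(filters, cm_cost):
    """Shortest path from 'start' to 'cm' through filters ordered by selectivity."""
    ordered = sorted(filters, key=lambda f: -f[1])  # least selective first
    names = ["start"] + [f[0] for f in ordered] + ["cm"]
    frac = [1.0] + [f[1] for f in ordered]          # DB fraction left after each node
    cost = [0.0] + [f[2] for f in ordered] + [cm_cost]
    n = len(names)
    best = [float("inf")] * n
    prev = [None] * n
    best[0] = 0.0
    # The graph is a DAG in index order, so one forward pass finds shortest paths.
    for v in range(1, n):
        for u in range(v):
            weight = frac[u] * cost[v]  # run stage v on what stage u left over
            if best[u] + weight < best[v]:
                best[v], prev[v] = best[u] + weight, u
    path, v = [], n - 1
    while v is not None:
        path.append(names[v])
        v = prev[v]
    return path[::-1], best[-1]

path, total = best_filter_chain(FILTERS, CM_COST)
print(path, total)
```

With these made-up numbers the expensive "store-pair" stage is skipped, showing that the cheapest chain need not use every available filter.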
Sub-CM technique • Exploit info in hairpins (bulges and internal loops) • Much info is stored in short hairpins that need only part of the CM’s states • Grammar contains both HMM and CM states • Window length of the sub-CM is crucial • HMMs are created manually after sub-CMs are found • Automating this is a future project
Store-pair technique • An HMM with extra states can reflect base pairs • S1L[C] -> g S1L[C] has score negative infinity • 5 states are added per HMM state, but this can be reduced