Presented by: Jeff Bonis CISC841 - Bioinformatics

Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs without Loss of Accuracy • Zasha Weinberg and Walter L. Ruzzo

What Are Non-Coding RNAs (ncRNA)? • “functional molecules that do not code for proteins” • Examples: transfer RNA (tRNA), spliceosomal RNA, microRNA, regulatory RNA elements • Over 100 known ncRNA families

Secondary Structure of ncRNAs • Conserved, therefore useful for identifying homologs • Secondary structure is functionally important to RNAs • Base pairing is important in pattern searching • e.g. 16S rRNA - part of the small subunit of the prokaryotic ribosome

What Techniques Exist? • Two models predict homologs in ncRNA families • Covariance Models (CMs) • Easy RNA Profile IdentificatioN (ERPIN) - http://tagc.univ-mrs.fr/erpin/ • Both use a multiple alignment of family members with secondary structure annotation • A statistical model is built from this multiple alignment • Both display high sensitivity and high specificity

What about ERPIN? • DP algorithm matches the statistical profile against a target database and returns the solutions and their scores • Cannot take into account non-consensus bulges in helices (caused by indels) • Needs user-specified score thresholds, which compromise accuracy

CMs • “specify a tree-like SCFG architecture suited for modelling consensus RNA secondary structures.” • Can’t accommodate pseudoknots • Very slow algorithm

Which model should be improved? • The Covariance Model (CM) is chosen because its limitation, pseudoknots, costs little: pseudoknots contain little information anyway • Goal: address slow speed without sacrificing accuracy • CMs are used in Rfam - http://rfam.wustl.edu • 8 gigabase genome DB called RFAMSEQ • A tRNA search takes over a year on a Pentium 4 (P4) • Over 100 ncRNA families

Previous improvements on speed • BLAST-based heuristic • Known members are BLASTed against RFAMSEQ • CM is run on the resulting set • BLAST misses family members, especially where sequence conservation is low • tRNAscan-SE - http://www.genetics.wustl.edu/eddy/tRNAscan-SE/ • Uses 2 heuristic-based programs for tRNA searches • CM is run on the resulting set • May miss tRNAs that CMs would find

How to improve sensitivity? • Authors previously developed rigorous filters that keep 100% of the hits the CM would find • Filters are based on profile HMMs • A profile HMM is built from the CM, then run on the DB • Much of the DB is filtered out; the CM runs on the remaining set • The HMM filter is based on sequence conservation • Worked for 126 of 139 ncRNA families in Rfam • The other 13 display low sequence conservation but strong conservation of secondary structure, which an HMM can’t take into account • Heuristic methods also miss these ncRNAs

How can these special biological situations be accounted for? • Authors propose 3 innovations to overcome these setbacks • 2 techniques include secondary structure information in filtering at the expense of CPU time • Sub-CMs: hybrid filters composed of CMs and profile HMMs • Store-pair: uses additional HMM states to model key base pairs • The third technique reduces scan time • Runs filters in series, quickest first and most selective last • The series is chosen by solving a shortest path problem

Results • Techniques worked for 11 of the 13 previously missed Rfam families • Also found new hits missed by BLAST • For tRNAscan-SE, provided a rigorous scan for 3 of 4 CMs, finding missed hits • 100 times faster than the raw CM on average • Uncovers members missed by heuristics

What are CMs anyway? • “statistical models that can detect when a positional sequence and secondary structure resemble a given multiple RNA alignment” • Described in terms of stochastic context-free grammars (SCFGs) • Transformational grammars • Rules: describe the grammar, of the form Si -> xL Si+1 xR, where xL and xR are the left and right nucleotides • Terminals: symbols in the actual string (nucleotides) • Non-terminals: abstract symbols (states) • Parse: series of steps to obtain the final output • Example: • RNA molecules CAG or GAC • S1 -> c S2 g | g S2 c; S2 -> a • Parse: S1 -> c S2 g -> cag
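The toy grammar on this slide can be expanded mechanically. The following is a minimal Python sketch (not the authors' software) that enumerates every string the grammar S1 -> c S2 g | g S2 c; S2 -> a can produce; the names RULES and derive are made up for illustration.

```python
# Toy grammar from the slide: S1 -> c S2 g | g S2 c ; S2 -> a
# RULES and derive are illustrative names, not part of any CM package.
RULES = {
    "S1": [("c", "S2", "g"), ("g", "S2", "c")],
    "S2": [("a",)],
}


def derive(symbol):
    """Yield every terminal string derivable from `symbol`."""
    if symbol not in RULES:  # a terminal (nucleotide) stands for itself
        yield symbol
        return
    for rhs in RULES[symbol]:
        # Expand the right-hand side left to right, combining all choices.
        partials = [""]
        for sym in rhs:
            partials = [p + s for p in partials for s in derive(sym)]
        yield from partials


language = sorted(derive("S1"))
print(language)
```

The paired rule emits the left and right nucleotide in one step, which is what lets a CM tie the two ends of a base pair together.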

How are CMs used? • Each rule is assigned a probability • Rules more consistent with the family have higher probability • The probability of a parse is the product of the probabilities of all the rules it uses • CMs use log-odds ratios and sum the scores instead of multiplying • CM Viterbi requires a window length as input, which upper-bounds a family member’s length and affects scan time
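The product-versus-sum point can be checked numerically. This is a small Python sketch with made-up probabilities (the values in rule_prob and the uniform background in null_prob are assumptions, not numbers from the paper): summing per-rule log-odds gives the same answer as taking the log of the ratio of the two products.

```python
import math

# Hypothetical rule probabilities for the toy grammar (illustrative values only).
rule_prob = {"S1->cS2g": 0.6, "S1->gS2c": 0.4, "S2->a": 1.0}
# Assumed background model: uniform nucleotides, so a pair costs 1/16, a single 1/4.
null_prob = {"S1->cS2g": 1 / 16, "S1->gS2c": 1 / 16, "S2->a": 1 / 4}


def parse_probability(parse, probs):
    """Probability of a parse: product of the probabilities of its rules."""
    return math.prod(probs[rule] for rule in parse)


def log_odds_score(parse):
    """CM-style score: sum of per-rule log-odds (in bits)."""
    return sum(math.log2(rule_prob[r] / null_prob[r]) for r in parse)


parse_cag = ["S1->cS2g", "S2->a"]  # S1 -> c S2 g -> cag
score = log_odds_score(parse_cag)
ratio = parse_probability(parse_cag, rule_prob) / parse_probability(parse_cag, null_prob)
print(score, math.log2(ratio))
```

Working in log space avoids underflow on long parses and turns the Viterbi recurrence into additions.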

How are profile HMMs and CMs combined? • Given a CM, a profile HMM is created whose Viterbi score upper-bounds the CM’s Viterbi score • Guarantees 100% sensitivity relative to the CM • Filtering: • At each nucleotide position in the database, the HMM is used to compute an upper bound on the CM score • A CM scan is applied to all subsequences whose upper bound exceeds some threshold • Subsequences below the threshold are filtered out • Profile HMMs are represented by regular grammars, which cannot emit paired nucleotides, e.g. • CM: S1 -> a S2 u | c S2 g; S2 -> ε • HMM: S1L -> a S2L | c S2L; S2L -> S1R; S1R -> g | u • A CM state is expanded into a left and a right HMM state
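The filtering idea can be sketched in a few lines of Python. Everything here is a toy stand-in: cm_score, hmm_upper_bound, and the score values are assumptions chosen so that the bound holds, not the authors' models. The "CM" scores the first and last nucleotide of a 3-mer jointly; the "HMM" scores them independently, so it cannot see pairing but never scores below the CM.

```python
def cm_score(w):
    """Toy CM: rewards a c-g or g-c pair around a central 'a'."""
    pair = 2 if (w[0], w[2]) in {("c", "g"), ("g", "c")} else -2
    mid = 1 if w[1] == "a" else -1
    return pair + mid


def hmm_upper_bound(w):
    """Toy HMM: scores left and right ends separately.

    left + right >= pair score for every nucleotide combination,
    so this value is always >= cm_score(w).
    """
    left = 1 if w[0] in "cg" else -1
    right = 1 if w[2] in "cg" else -1
    mid = 1 if w[1] == "a" else -1
    return left + mid + right


def filtered_scan(seq, threshold, width=3):
    """Run the expensive CM only on windows the HMM cannot rule out."""
    hits, cm_calls = [], 0
    for i in range(len(seq) - width + 1):
        w = seq[i:i + width]
        if hmm_upper_bound(w) < threshold:
            continue  # the CM score cannot reach the threshold; skip safely
        cm_calls += 1
        s = cm_score(w)
        if s >= threshold:
            hits.append((i, w, s))
    return hits, cm_calls


hits, cm_calls = filtered_scan("ttcagtt", threshold=3)
print(hits, cm_calls)
```

Because the bound never underestimates the CM score, skipping a window can never discard a true hit; the filter only saves CM invocations.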

How can these be supplemented? • Selecting an optimal series of filters • Filtering fraction (fraction of the DB left over) and run time are estimated by running a filter on a training sequence • Minimize expected total CPU time • Assumptions: • Estimated fractions and CPU times are constant for all training sequences • A filter’s fraction is not affected by the previously run filters • The optimal sequence of filters is found by solving a shortest path problem on a graph • Nodes are the filters and the CM • Edge weights are CPU times
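One way to read the shortest path formulation is sketched below in Python. The cost model is an assumption made for illustration: the edge from filter u to filter v is weighted by (fraction of the DB that u lets through) x (CPU cost of v per unit of sequence). The filter names, fractions, and costs are invented, and Dijkstra's algorithm is used here as a generic solver, not necessarily the one the authors used.

```python
import heapq

# Hypothetical filters: name -> (fraction of DB passed, CPU cost per unit scanned).
FILTERS = {"hmm": (0.10, 1.0), "store_pair": (0.02, 5.0), "sub_cm": (0.005, 40.0)}
CM_COST = 1000.0  # assumed CPU cost per unit for the full CM


def best_filter_series(filters, cm_cost):
    """Return (path, expected cost) of the cheapest filter series ending in the CM."""
    frac = {"START": 1.0, **{n: f for n, (f, _) in filters.items()}}
    cost = {"CM": cm_cost, **{n: c for n, (_, c) in filters.items()}}

    def neighbors(u):
        for v in cost:
            # Move to the CM, or to a strictly more selective filter.
            if v == "CM" or frac[v] < frac[u]:
                yield v, frac[u] * cost[v]

    inf = float("inf")
    dist, prev = {"START": 0.0}, {}
    heap = [(0.0, "START")]
    while heap:
        d, u = heapq.heappop(heap)
        if u == "CM":
            break
        if d > dist.get(u, inf):
            continue  # stale heap entry
        for v, w in neighbors(u):
            if d + w < dist.get(v, inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))

    path = ["CM"]
    while path[-1] != "START":
        path.append(prev[path[-1]])
    return path[::-1], dist["CM"]


path, total = best_filter_series(FILTERS, CM_COST)
print(path, total)
```

With these invented numbers, chaining all three filters before the CM is far cheaper than running the CM on the whole database, which mirrors the slide's point about running quick filters first and selective ones last.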

Sub-CM technique • Exploit info in hairpins (bulges and internal loops) • Much info is stored in short hairpins that need only part of the CM’s states • Grammar contains both HMMs and CMs • Window length of the sub-CM is crucial • HMMs are created manually after sub-CMs are found • Automating this is a future project

Store-pair technique • An HMM with extra states can reflect base pairs • S1L[C] -> g S1L[C] has a score of negative infinity • 5 states are added per HMM state, but this can be reduced
