Noncoding RNA Genes Pt. 2 SCFGs. CS374 Vincent Dorie. Motivation. Noncoding RNA genes can be anywhere Noncoding RNA genes can do anything. Location. rRNA, snRNA Exons? Introns Viral vectors. Function. Function, pt. 2. Overview.
Noncoding RNA Genes Pt. 2: SCFGs CS374 Vincent Dorie
Motivation • Noncoding RNA genes can be anywhere • Noncoding RNA genes can do anything
Location • rRNA, snRNA • Exons? • Introns • Viral vectors
Overview • “RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003) • “Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)
Comparison - Methodology • RSEARCH • DART (Stemloc) • [Figure: methodology diagram; only the label "Sequence" survives]
Comparison, Pt. 2 - Uses • RSEARCH: Find parts of a genome that may be homologous to the query sequence; more practical in comparative genomics • DART (Stemloc): Investigate a specific sequence suspected of being homologous to the query sequence
Comparison, Pt. 3 - Complexity • RSEARCH: $O((M - B)LD + BLD^2)$ to scan, $O(M^4)$ to calculate statistics • DART (Stemloc): between $O(LM)$ and $O(L^3 M^3)$
Background: Context Free Grammars • Four-tuple {N, T, S, P} • N is a set of nonterminals • T is a set of terminals • S is the start symbol, S ∈ N • P is a set of productions
Context Free Grammars, pt. 2: Sample Grammar • N = {S, A, B} • T = {a, u, c, g, } • P = { S -> A | B, A -> aAc | aBc | g, B -> g }
Context Free Grammars, pt. 3: Parse Trees • Parse: aagcc • Tree 1: S -> A -> aAc -> aaAcc -> aagcc (inner g from A -> g) • Tree 2: S -> A -> aAc -> aaBcc -> aagcc (inner g from B -> g)
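The sample grammar is small enough to check by brute force. The sketch below enumerates every leftmost derivation of a target string; the function name `derivations` and the single-character encoding of symbols (uppercase for nonterminals, lowercase for terminals) are assumptions made for this example, not part of the slides.

```python
# Sample grammar from the slide: uppercase = nonterminal, lowercase = terminal.
GRAMMAR = {
    "S": ["A", "B"],
    "A": ["aAc", "aBc", "g"],
    "B": ["g"],
}

def derivations(form, target, path=()):
    """Yield every leftmost derivation of `target` starting from `form`."""
    path = path + (form,)
    # Locate the leftmost nonterminal, if there is one.
    idx = next((i for i, ch in enumerate(form) if ch in GRAMMAR), None)
    if idx is None:
        if form == target:
            yield path
        return
    # Prune: no production shrinks the string, and the terminal prefix is fixed.
    if len(form) > len(target) or not target.startswith(form[:idx]):
        return
    for rhs in GRAMMAR[form[idx]]:
        yield from derivations(form[:idx] + rhs + form[idx + 1:], target, path)

parses = list(derivations("S", "aagcc"))
for p in parses:
    print(" -> ".join(p))
```

Because more than one derivation reaches aagcc, the grammar is ambiguous for this string, which is the situation a stochastic grammar resolves by scoring each parse.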
Stochastic CFG • Each production associated with a probability • Probabilities for all productions starting from a given nonterminal sum to one • Superset of HMM • Assigns a probability to a parse • E.g. S -> A, 0.3 | B, 0.7
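As a rough illustration of how an SCFG assigns a probability to a parse, the sketch below multiplies the probabilities of the productions used. Only S -> A (0.3) and S -> B (0.7) come from the slide; the probabilities for the A and B productions, and the names `PROBS` and `parse_probability`, are invented for this example.

```python
from math import prod

# Production probabilities keyed by (left-hand side, right-hand side).
# Only the two S productions are from the slide; the rest are assumed values.
PROBS = {
    ("S", "A"): 0.3, ("S", "B"): 0.7,
    ("A", "aAc"): 0.5, ("A", "aBc"): 0.3, ("A", "g"): 0.2,  # assumed
    ("B", "g"): 1.0,                                          # assumed
}

def parse_probability(productions):
    """Probability of a parse = product of its production probabilities."""
    return prod(PROBS[p] for p in productions)

# Two parses of "aagcc" under the sample grammar.
tree1 = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]
tree2 = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]

print(parse_probability(tree1), parse_probability(tree2))
```

Summing the two values gives the total probability of the string under these assumed parameters; taking the larger one corresponds to choosing the most likely parse.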
Pairwise (profile) SCFG • Terminals in each production can exist in each of two strings • E.g. $W \to x_i\, y_k\, V\, x_j\, y_l$
RSEARCH: pSCFG Simplified • Each secondary structure specifies (most of) a grammar, creating a “Model Architecture” • Eschews probabilistic interpretation • Problem becomes fitting target to model architecture
Node Types vs. Node States • Node types are what we want to do given the model (e.g. MATP is match pair) • Node state represents what happens when scanning a target sequence • E.g. node type is MATP, target sequence doesn’t have a pair in that location -> insert a gap
Node States • Set of node states possible for node type
Gap Classes • Gap class per node type/state pair
Transition Scores • Gap class determines transition scores • Gap penalties are affine
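An affine gap penalty charges one cost to open a gap and a smaller cost for each further position. The sketch below uses one common convention (the opening cost covers the first gapped position); the function name and the numeric values in the usage line are assumptions, not RSEARCH's actual parameters.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of a gap of `length` positions under an affine model.

    Convention assumed here: the first position costs `gap_open`,
    every additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend

# Hypothetical penalties: one long gap is cheaper than several short ones.
one_long = affine_gap_score(3, gap_open=-8.0, gap_extend=-1.0)
three_short = 3 * affine_gap_score(1, gap_open=-8.0, gap_extend=-1.0)
print(one_long, three_short)
```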
Emission Scores • Emission scores determined empirically
Parameterizing the Model: Emission Scores • Substitution matrices • Scores are log-odds ratios of observed to random (background) frequencies
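A minimal sketch of an observed/random score, assuming a base-2 logarithm and a background model in which the two aligned residues occur independently. The function name `log_odds` and the frequencies in the usage line are made up for illustration.

```python
from math import log2

def log_odds(observed, background_a, background_b):
    """Log-odds score for aligning residue a with residue b.

    observed     : frequency of the aligned pair (a, b) in trusted alignments
    background_* : frequency of each residue on its own (the "random" model)
    """
    return log2(observed / (background_a * background_b))

# Hypothetical frequencies: a pair seen more often than chance scores > 0.
score = log_odds(observed=0.10, background_a=0.25, background_b=0.25)
print(score)
```

A pair observed exactly as often as chance predicts scores zero, and a pair seen less often than chance scores negative.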
RIBOSUM Matrices • Start with MSA • Whose MSA? • RIBOSUM[X, Y] • Sequences more than X% identical are clustered and reweighted so that each cluster's weights sum to 1 • Only pairs of sequences at least Y% identical are counted in making matrices
Model Parameters • Gap open penalty (single and pair) • Gap extension penalty (single and pair) • Internal start penalty • Internal end penalty
Solution • Guess and check • “We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”
Digression: Biostatistics • Confidence intervals • Expectation values
Gumbel Distribution • Parameterized by $\lambda$ and $K$ • $E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
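The two formulas can be evaluated directly. This is a sketch assuming the standard extreme-value form $E = K N e^{-\lambda x}$ with score $x$ and search-space size $N$; the function names and the parameter values in the usage lines are hypothetical.

```python
from math import exp, expm1

def e_value(score, lam, K, N):
    """Expected number of chance hits scoring at least `score`."""
    return K * N * exp(-lam * score)

def p_value(E):
    """Probability of at least one chance hit: 1 - e^(-E)."""
    return -expm1(-E)  # expm1 keeps precision when E is tiny

# Hypothetical parameters, for illustration only.
E = e_value(score=30.0, lam=0.7, K=0.05, N=2.1e7)
print(E, p_value(E))
```

For small E the two numbers are nearly equal, which is why E-values and P-values are often reported interchangeably for strong hits.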
Gumbel Distribution, pt. 2 • $K$ and $\lambda$ depend on G+C content of target database • For database with heterogeneous G+C content, compute $K$ and $\lambda$ for G+C bins
Putting it All Together • Run against database substrings of length two times the query • Greedily take K best, non-overlapping hits • Recover alignments • Report: score, position in database, alignment, E-value, P-value • Statistics need to be calculated for every query and target database
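The greedy hit-selection step can be sketched as follows. The tuple layout `(score, start, end)` with an exclusive end coordinate and the function name `greedy_hits` are assumptions for this example; RSEARCH's own data structures are not described in the slides.

```python
def greedy_hits(hits, k):
    """Pick up to k best-scoring hits that do not overlap one another.

    hits: iterable of (score, start, end) with `end` exclusive.
    """
    if k <= 0:
        return []
    chosen = []
    # Visit hits from best to worst score.
    for score, start, end in sorted(hits, reverse=True):
        # Keep the hit only if it is disjoint from every hit already kept.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((score, start, end))
            if len(chosen) == k:
                break
    return chosen

# Hypothetical hits: the second-best overlaps the best and is skipped.
hits = [(10, 0, 5), (9, 3, 8), (8, 5, 9), (7, 20, 25)]
top = greedy_hits(hits, k=2)
print(top)
```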
Time • For a 113 nt sequence against a $2.1 \times 10^7$ nt database, 2.9 CPU days, 2% of it computing statistics • For a 330 nt sequence against a $2.1 \times 10^7$ nt database, 38 CPU days, 7% of it computing statistics • Parallelized to 33 minutes and 7.4 hours respectively
Shifting Gears: Fold Envelopes • Pre-enumerates the pSCFG search space • Presents conditional versions of dynamic programming algorithms • User-defined complexity
Fold Envelopes, pt. 2 • Conceptualize search over grammars and parse trees • Each node $W_u$ in the tree accounts for a subsequence • Inside sequence: accounts for $X_{i..j}$ • Outside sequence: accounts for $X_{0..i}$ and $X_{j..L}$
Analogy: Message Passing • Inside algorithm: likelihood of sequence over all possible parses • Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence • Inside-Outside algorithm: expected number of times each grammar production is used • Use fold envelopes to limit messages by restricting subsequences considered
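The Inside Algorithm • To compute $a(i, j, V) = P(x_i \dots x_j \text{ produced by } V)$ • $a(i, j, V) = \sum_{X} \sum_{Y} \sum_{k=i}^{j-1} a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$ • [Figure: V spans $i..j$, with X covering $i..k$ and Y covering $k+1..j$; credit: Batzoglou]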
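The recurrence can be sketched for a grammar in Chomsky normal form (binary productions V -> X Y plus single-terminal emissions). The rule encoding, the function name `inside`, and the toy grammar in the usage lines are assumptions for this example; the slides do not specify a concrete representation.

```python
from collections import defaultdict

def inside(seq, binary_rules, emit_rules, start="S"):
    """Total probability of `seq` over all parses of a CNF SCFG.

    binary_rules: {(V, X, Y): P(V -> X Y)}
    emit_rules:   {(V, terminal): P(V -> terminal)}
    """
    n = len(seq)
    a = defaultdict(float)  # a[(i, j, V)] for the inclusive span i..j
    # Base case: single-symbol spans come from emission rules.
    for i, ch in enumerate(seq):
        for (V, t), p in emit_rules.items():
            if t == ch:
                a[(i, i, V)] += p
    # Fill longer spans from shorter ones, summing over split points k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (V, X, Y), p in binary_rules.items():
                for k in range(i, j):
                    a[(i, j, V)] += a[(i, k, X)] * a[(k + 1, j, Y)] * p
    return a[(0, n - 1, start)]

# Toy grammar (hypothetical): S -> S S with 0.4, S -> 'a' with 0.6.
binary = {("S", "S", "S"): 0.4}
emit = {("S", "a"): 0.6}
print(inside("aaa", binary, emit))
```

Replacing the sum over rules and split points with a maximum turns the same table-filling loop into the CYK computation of the single most likely parse.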
Constructing Fold Envelopes • Constrain to possible secondary structures • Constrain to primary sequence alignment
Summary • RSEARCH to find a set of possible homologs, sorted by score and statistics • Fold Envelopes permit greater search depth in case of unfolded comparisons • RSEARCH employs simplified pSCFGs • Fold Envelopes are useful over full spectrum of comparisons but represent more computationally complex situations