Noncoding RNA Genes Pt. 2 SCFGs

Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie

Motivation • Noncoding RNA genes can be anywhere • Noncoding RNA genes can do anything

Location • rRNA, snRNA • Exons? • Introns • Viral vectors

Function

Function, pt. 2

Overview • “RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003) • “Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)

Comparison - Methodology • RSEARCH • DART (Stemloc) • [Figure label: Sequence]

Comparison, Pt. 2 - Uses • RSEARCH: Find parts of a genome which may be homologous to the query sequence; more practical in comparative genomics • DART (Stemloc): Investigate a specific sequence suspected of being homologous to the query sequence

Comparison, Pt. 3 - Complexity • RSEARCH: $O((M - B)LD + BLD^2)$ to scan, $O(M^4)$ to calculate statistics • DART (Stemloc): Between $O(LM)$ and $O(L^3 M^3)$

Background: Context Free Grammars • Four-tuple {N, T, S, P} • N is a set of nonterminals • T is a set of terminals • S is the start symbol, S ∈ N • P is a set of productions

Context Free Grammars, pt. 2: Sample Grammar • N = {S, A, B} • T = {a, u, c, g, ε} • P = { S -> A | B, A -> aAc | aBc | g, B -> g }

Context Free Grammars, pt. 3: Parse Trees • Parse: aagcc • [Figure: two parse trees for aagcc. Tree 1: S -> A -> aAc -> aaAcc -> aagcc. Tree 2: S -> A -> aAc -> aaBcc -> aagcc.]
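The sample grammar is ambiguous: aagcc has more than one parse tree. A minimal Python sketch that enumerates the trees by brute force follows. The function names `parses` and `expand` are illustrative, not from the slides, and the approach assumes (as holds for this grammar) that every nonterminal derives at least one terminal.

```python
# Productions from the sample grammar slide; keys are nonterminals.
PRODUCTIONS = {
    "S": ["A", "B"],
    "A": ["aAc", "aBc", "g"],
    "B": ["g"],
}


def parses(symbol, text):
    """Return every parse tree (nested tuples) deriving `text` from `symbol`."""
    trees = []
    for rhs in PRODUCTIONS[symbol]:
        for children in expand(rhs, text):
            trees.append((symbol, children))
    return trees


def expand(rhs, text):
    """Yield tuples of children that match right-hand side `rhs` to `text`."""
    if not rhs:
        if not text:
            yield ()
        return
    head, rest = rhs[0], rhs[1:]
    if head in PRODUCTIONS:
        # Nonterminal: try every prefix, leaving one character or more
        # for each remaining symbol (no production here derives the empty string).
        for k in range(1, len(text) - len(rest) + 1):
            for subtree in parses(head, text[:k]):
                for tail in expand(rest, text[k:]):
                    yield (subtree,) + tail
    elif text and text[0] == head:
        # Terminal: must match the next character exactly.
        for tail in expand(rest, text[1:]):
            yield (head,) + tail


trees = parses("S", "aagcc")
for tree in trees:
    print(tree)
```

Both trees start with S -> A -> aAc; they differ only in whether the inner g comes from A -> aAc with A -> g, or from A -> aBc with B -> g.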

Stochastic CFG • Each production associated with a probability • Probabilities for all productions starting from a given nonterminal sum to one • Superset of HMM • Assigns a probability to a parse • E.g. S -> A, 0.3 | B, 0.7
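As a worked illustration, the probability of a parse is the product of the probabilities of the productions it uses. The sketch below reuses the sample grammar. Only the S probabilities (0.3 and 0.7) come from the slide; the probabilities for the A productions are made-up values chosen so that each nonterminal sums to one.

```python
import math

# (nonterminal, right-hand side) -> probability.
RULE_PROB = {
    ("S", "A"): 0.3,    # from the slide
    ("S", "B"): 0.7,    # from the slide
    ("A", "aAc"): 0.5,  # assumed for illustration
    ("A", "aBc"): 0.2,  # assumed for illustration
    ("A", "g"): 0.3,    # assumed for illustration
    ("B", "g"): 1.0,    # B has a single production
}


def is_normalized(rule_prob):
    """Check that the productions of every nonterminal sum to one."""
    totals = {}
    for (lhs, _rhs), p in rule_prob.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return all(math.isclose(t, 1.0) for t in totals.values())


def parse_probability(derivation):
    """Multiply the probabilities of the productions used in one parse."""
    return math.prod(RULE_PROB[rule] for rule in derivation)


# The two parses of aagcc under the sample grammar.
PARSE_1 = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]
PARSE_2 = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]

p1 = parse_probability(PARSE_1)
p2 = parse_probability(PARSE_2)
print(p1, p2, p1 + p2)
```

Summing over all parses gives the total probability of the sequence, which is what the Inside algorithm computes without enumerating the parses.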

Pairwise (profile) SCFG • Terminals in each production can exist in each of two strings • E.g. $W \to x_i\, y_k\, V\, x_j\, y_l$

RSEARCH: pSCFG Simplified • Each secondary structure specifies (most of) a grammar, creating a “Model Architecture” • Eschews probabilistic interpretation • Problem becomes fitting target to model architecture • [Figure label: Sequence]

Node Types vs. Node States • Node types are what we want to do given the model (e.g. MATP is match pair) • Node state represents what happens when scanning a target sequence • E.g. node type is MATP, target sequence doesn’t have a pair in that location -> insert a gap

Node States • Set of node states possible for node type

Gap Classes • Gap class per node type/state pair

Transition Scores • Gap class determines transition scores • Gap penalties are affine
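An affine gap penalty charges one amount to open a gap and a smaller amount for each further position. A minimal sketch, assuming the convention that the open penalty already covers the first gapped position. The numeric values are hypothetical and are not RSEARCH's parameters.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of one gap of `length` positions under an affine model.

    Convention assumed here: the first gapped position costs `gap_open`,
    every additional position costs `gap_extend`.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# Hypothetical penalties, for illustration only.
GAP_OPEN = -8
GAP_EXTEND = -2

for n in (1, 2, 4):
    print(n, affine_gap_score(n, GAP_OPEN, GAP_EXTEND))
```

Under this model one gap of length 4 costs less than four separate gaps of length 1, which favors a few long indels over many scattered ones.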

Emission Scores • Emission scores determined empirically

Parameterizing the Model: Emission Scores • Substitution matrices • Scores are observed / random

RIBOSUM Matrices • Start with MSA • Whose MSA? • RIBOSUM[X, Y] • Sequences X% identical are reweighted to sum to 1 • Only sequences Y% identical are counted in making matrices
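A score of the form observed / random is commonly stored as a log-odds value. The sketch below computes such scores from a toy list of aligned residue pairs. It is not the RIBOSUM procedure: the X% reweighting and Y% identity filter are omitted, the base-2 logarithm is an assumption, and the input data are invented.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Return {(x, y): log2(observed / expected)} for ordered aligned pairs."""
    pair_counts = Counter(aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Background frequencies pooled from both sides of every pair.
    base_counts = Counter()
    for (x, y), n in pair_counts.items():
        base_counts[x] += n
        base_counts[y] += n
    total_bases = sum(base_counts.values())

    scores = {}
    for (x, y), n in pair_counts.items():
        observed = n / total_pairs
        expected = (base_counts[x] / total_bases) * (base_counts[y] / total_bases)
        scores[(x, y)] = math.log2(observed / expected)
    return scores


# Invented toy alignment columns: mostly conserved, a few substitutions.
toy_pairs = [("a", "a")] * 6 + [("g", "g")] * 2 + [("a", "g")] * 2
scores = log_odds_scores(toy_pairs)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative scores.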

Model Parameters • Gap open penalty (single and pair) • Gap extension penalty (single and pair) • Internal start penalty • Internal end penalty

Solution • Guess and check • “We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”

Digression: Biostatistics • Confidence intervals • Expectation values

Gumbel Distribution • Parameterized by λ and K • $E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
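The two formulas on this slide can be evaluated directly. In the sketch below `lam` stands for the scale parameter (conventionally λ), `x` is the hit score, and `n` is taken to be the size of the search space; the slide does not define N, so that reading is an assumption. The parameter values are hypothetical.

```python
import math


def e_value(x, lam, k, n):
    """Expected number of hits scoring x or more: E = K * N * exp(-lam * x)."""
    return k * n * math.exp(-lam * x)


def p_value(e):
    """Probability of one or more such hits: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e)


# Hypothetical parameters, for illustration only.
LAM, K, N = 0.7, 0.05, 2.1e7

for score in (20.0, 30.0, 40.0):
    e = e_value(score, LAM, K, N)
    print(score, e, p_value(e))
```

For small E the two quantities are nearly equal, so E-values and P-values only diverge for weak hits.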

Gumbel Distribution, pt. 2 • K and λ depend on G+C content of target database • For database with heterogeneous G+C content, compute K and λ for G+C bins

Putting it All Together • Run against database substrings of length two times the query • Greedily take K best, non-overlapping hits • Recover alignments • Report: score, position in database, alignment, E-value, P-value • Statistics need to be calculated for every query and target database
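The greedy step can be sketched as follows: sort hits by score and keep a hit only if it overlaps none already kept. The hit representation (score, start, end with an exclusive end) and the function name are assumptions made for illustration, and `k` here is the number of hits to report, not the Gumbel parameter K.

```python
def greedy_top_hits(hits, k):
    """Pick up to k highest-scoring, mutually non-overlapping hits.

    hits: iterable of (score, start, end) with `end` exclusive.
    """
    chosen = []
    for score, start, end in sorted(hits, key=lambda h: h[0], reverse=True):
        if len(chosen) == k:
            break
        # Keep the hit only if it is disjoint from every hit kept so far.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((score, start, end))
    return chosen


# Invented example hits: (score, start, end).
example_hits = [(50, 0, 100), (45, 50, 150), (40, 100, 200), (10, 300, 400)]
print(greedy_top_hits(example_hits, 2))
```

The hit scoring 45 is dropped because it overlaps the best hit, even though it outscores the hit at positions 100 to 200.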

Time • For a 113 nt sequence against a 2.1 × 10^7 nt database, 2.9 CPU days; 2% computing statistics • For a 330 nt sequence against a 2.1 × 10^7 nt database, 38 CPU days; 7% computing statistics • Parallelized to 33 minutes and 7.4 hours respectively

Shifting Gears: Fold Envelopes • Pre-enumerates pSCFG search space • Presents conditional versions of dynamic programming algorithms • User-defined complexity

Fold Envelopes, pt. 2 • Conceptualize search over grammars and parse trees • Each node in tree accounts for a subsequence • [Figure: parse-tree node Wu. Inside sequence: accounts for X_i..j. Outside sequence: accounts for X_0..i and X_j..L.]

Analogy: Message Passing • Inside algorithm: likelihood of sequence over all possible parses • Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence • Inside-Outside algorithm: expected number of times each grammar production is used • Use fold envelopes to limit messages by restricting subsequences considered

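The Inside Algorithm • To compute $a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$ • $a(i, j, V) = \sum_{X, Y} \sum_{k=i}^{j-1} a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$ • [Figure: V spans i..j and splits into X over i..k and Y over k+1..j. Credit: Batzoglou]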
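The recursion on this slide can be sketched for a grammar in Chomsky normal form (productions V -> XY or V -> terminal). The base case a(i, i, V) = P(V -> x_i) is not on the slide and is the standard one; the toy grammar and its probabilities are invented for illustration.

```python
import math


def inside(seq, emit, binary):
    """Fill the inside table a[(i, j, V)] = P(seq[i..j] produced by V).

    emit:   {(V, terminal): probability} for productions V -> terminal
    binary: {(V, X, Y): probability}     for productions V -> X Y
    Indices are 0-based and inclusive.
    """
    length = len(seq)
    a = {}
    # Base case: single characters come from emitting productions.
    for i, ch in enumerate(seq):
        for (v, terminal), p in emit.items():
            if terminal == ch:
                a[(i, i, v)] = a.get((i, i, v), 0.0) + p
    # Longer spans are built from two shorter, adjacent spans.
    for span in range(2, length + 1):
        for i in range(length - span + 1):
            j = i + span - 1
            for (v, x, y), p in binary.items():
                total = 0.0
                for k in range(i, j):
                    total += a.get((i, k, x), 0.0) * a.get((k + 1, j, y), 0.0)
                if total:
                    a[(i, j, v)] = a.get((i, j, v), 0.0) + p * total
    return a


# Invented toy grammar: S -> S S (0.4) | a (0.6).
EMIT = {("S", "a"): 0.6}
BINARY = {("S", "S", "S"): 0.4}

table = inside("aaa", EMIT, BINARY)
print(table[(0, 2, "S")])
```

The string aaa has two parses under the toy grammar, each with probability 0.4^2 × 0.6^3, and the table entry for the full span is their sum.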

Constructing Fold Envelopes • Constrain to possible secondary structures • Constrain to primary sequence alignment
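To give a feel for how restricting subsequences shrinks the dynamic programming table, the sketch below uses a maximum-span limit as a simple stand-in for a fold envelope. This is not the construction of Holmes and Rubin; the function names and the span limit are assumptions for illustration only.

```python
def full_envelope(length):
    """All subsequences (i, j), 0-based inclusive, of a sequence."""
    return {(i, j) for i in range(length) for j in range(i, length)}


def max_span_envelope(length, max_span):
    """Only subsequences covering `max_span` positions or fewer."""
    return {(i, j) for (i, j) in full_envelope(length) if j - i + 1 <= max_span}


# Hypothetical sizes, for illustration only.
LENGTH, MAX_SPAN = 10, 3

full = full_envelope(LENGTH)
restricted = max_span_envelope(LENGTH, MAX_SPAN)
print(len(full), len(restricted))
```

A dynamic programming routine would then skip any (i, j) cell that is not in the envelope, trading search depth for speed.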

Summary • RSEARCH to find a set of possible homologs, sorted by score and statistics • Fold Envelopes permit greater search depth in case of unfolded comparisons • RSEARCH employs simplified pSCFGs • Fold Envelopes are useful over full spectrum of comparisons but represent more computationally complex situations
