Speeding-up Parsing of Biological Context-Free Grammars

D.F. Fredouille, C.H. Bryant
School of Computing, The Robert Gordon University, Aberdeen, UK

Definitions: Grammars
• Alphabet: {a, c, g, t}
• Sequence: actttgtcgtaaatgg
• Language: {actttgtcgtaaatgg, agtaactttgtcg, ctttgtatgccaag, ...}
• Context-Free Grammar (CFG):
  • Rewriting rules that represent a language
  • { Start → Gap c t t t g t Gap, Gap → ε | X Gap, X → a | c | g | t }

Definitions: Biological sequences
• DNA: alphabet {a, c, t, g}. Example: agtaactttgtcg
• RNA: alphabet {a, c, u, g}. Example: aguaacuuugucg
• Protein: alphabet of 20 letters (amino acids). Example: pvypgdnaadssiekqvallk

Motivation for Fast Parsing
• Grammars are widely used as models of biological sequences:
  • Prosite motifs, SVG, BGG, …
• Molecular biology needs fast parsers:
  • Many sequences must be parsed when searching for novel members of a biological family
  • Each sequence must be parsed with many grammars when annotating newly discovered sequences with their family

Parsing in biological CFGs
• Essentially two parsing algorithms apply to general CFGs:
  • Depth-first, top-down parsing (DFTDP)
  • Chart parsing (CP)
• Many other algorithms exist, but they are restricted to subsets of CFGs
• Should we use DFTDP or CP?
• Can we improve efficiency when dealing with biological grammars?
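
To make the two algorithm families concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the dictionary encoding of grammars, the names RIGHT, LEFT, dftd_accepts and chart_accepts are all hypothetical, and the chart parser is assumed to be an Earley-style recogniser, which is one common form of chart parsing. Both recognisers are run on the motif grammar from the Grammars slide.

```python
from typing import Dict, Iterator, List, Tuple

Rule = Tuple[str, ...]
Grammar = Dict[str, List[Rule]]

X_RULES: List[Rule] = [("a",), ("c",), ("g",), ("t",)]
MOTIF: Rule = ("c", "t", "t", "t", "g", "t")

# Same motif grammar with a right-recursive and a left-recursive gap.
RIGHT: Grammar = {"Start": [("Gap",) + MOTIF + ("Gap",)],
                  "Gap": [(), ("X", "Gap")], "X": X_RULES}
LEFT: Grammar = {"Start": [("Gap",) + MOTIF + ("Gap",)],
                 "Gap": [(), ("Gap", "X")], "X": X_RULES}


def dftd_ends(g: Grammar, symbols: Rule, seq: str, pos: int) -> Iterator[int]:
    """Depth-first, top-down: yield every end position after matching symbols."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in g:  # non-terminal: try each alternative in turn, with backtracking
        for alt in g[head]:
            for mid in dftd_ends(g, alt, seq, pos):
                yield from dftd_ends(g, rest, seq, mid)
    elif pos < len(seq) and seq[pos] == head:  # terminal: consume one letter
        yield from dftd_ends(g, rest, seq, pos + 1)


def dftd_accepts(g: Grammar, seq: str) -> bool:
    # Do not call this with LEFT: the left-recursive gap recurses forever.
    return any(end == len(seq) for end in dftd_ends(g, ("Start",), seq, 0))


def chart_accepts(g: Grammar, seq: str) -> bool:
    """Earley-style chart recogniser; items are (lhs, rhs, dot, origin)."""
    n = len(seq)
    chart = [set() for _ in range(n + 1)]
    chart[0] = {("Start", rhs, 0, 0) for rhs in g["Start"]}
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            new = []
            if dot == len(rhs):  # complete: advance items waiting for lhs
                new = [(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                       if d < len(r) and r[d] == lhs]
            elif rhs[dot] in g:  # predict: expand the expected non-terminal
                sym = rhs[dot]
                new = [(sym, alt, 0, i) for alt in g[sym]]
                # If sym was already recognised as empty at i, step over it.
                if any(l == sym and d == len(r) and o == i
                       for (l, r, d, o) in chart[i]):
                    new.append((lhs, rhs, dot + 1, origin))
            elif i < n and seq[i] == rhs[dot]:  # scan: consume one letter
                chart[i + 1].add((lhs, rhs, dot + 1, origin))
            for item in new:
                if item not in chart[i]:
                    chart[i].add(item)
                    agenda.append(item)
    return any(l == "Start" and d == len(r) and o == 0
               for (l, r, d, o) in chart[n])
```

For example, dftd_accepts(RIGHT, "actttgtcgtaaatgg") and chart_accepts(LEFT, "actttgtcgtaaatgg") both accept the sequence. The chart recogniser stores each item once per position, which is why it tolerates the left-recursive gap that the top-down recogniser cannot handle.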

Outline of Our Work
• Preliminary experiments showed that parsing speed in biological CFGs depends strongly on gap rules.
• Theoretical complexity study of the algorithms with respect to gap rules
• Improved the algorithms’ management of gaps
• Empirical comparison of the algorithms on biological sequences and grammars (which naturally contain gaps)

Definitions: Gap rules
• An unlimited gap is a non-terminal which can match any sequence.
  • Right-recursive: { GapR → ε, GapR → X GapR }
  • Left-recursive: { GapL → ε, GapL → GapL X }
• A limited gap is a non-terminal which can match any sequence s with lo ≤ |s| ≤ up.
  • Form1: { Gap1 → X^lo Xe^(up-lo), Xe → X, Xe → ε }
  • Form2: { Gap2 → X^i : lo ≤ i ≤ up }
• Here X^n denotes n consecutive occurrences of X.
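
The four rule sets above can be written out mechanically. The following Python sketch is hypothetical (the function names and the dictionary-of-tuples encoding are assumptions made for illustration, not part of the slides); it builds each gap definition for given bounds lo and up.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, ...]
Grammar = Dict[str, List[Rule]]


def unlimited_gap_right() -> Grammar:
    """GapR -> epsilon | X GapR"""
    return {"GapR": [(), ("X", "GapR")]}


def unlimited_gap_left() -> Grammar:
    """GapL -> epsilon | GapL X"""
    return {"GapL": [(), ("GapL", "X")]}


def limited_gap_form1(lo: int, up: int) -> Grammar:
    """Gap1 -> X^lo Xe^(up-lo), with Xe -> X | epsilon."""
    if not 0 <= lo <= up:
        raise ValueError("expected 0 <= lo <= up")
    return {"Gap1": [("X",) * lo + ("Xe",) * (up - lo)],
            "Xe": [("X",), ()]}


def limited_gap_form2(lo: int, up: int) -> Grammar:
    """Gap2 -> X^i, one alternative for each i with lo <= i <= up."""
    if not 0 <= lo <= up:
        raise ValueError("expected 0 <= lo <= up")
    return {"Gap2": [("X",) * i for i in range(lo, up + 1)]}
```

With lo = 2 and up = 4, Form1 yields a single rule of length 4 plus the two Xe rules, while Form2 yields three alternatives of lengths 2, 3 and 4.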

Theoretical comparison
DFTDP
• Unlimited gaps:
  • GapL: cannot be parsed with DFTDP
  • GapR: O(|s|)
• Limited gaps:
  • Form1: O(2^(up-lo))
  • Form2: O((up-lo)^2)
CP
• Unlimited gaps:
  • GapL: O(|s|^2), but O(|s|) under some reasonable hypotheses
  • GapR: O(|s|^2)
• Limited gaps: Form1 > Form2 + O(up-lo)
Optimisations: O(|s|) and O(up-lo)
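
One way to see where an exponential term for Form1 can come from is to count derivations. This is an interpretation offered as a sketch, not the authors' proof: each of the up-lo optional symbols Xe independently rewrites to X or to ε, so Form1 has 2^(up-lo) derivations in total, and a gap of length n is produced by C(up-lo, n-lo) of them. Form2, by contrast, has exactly one alternative per length. The function name below is hypothetical.

```python
from itertools import product
from math import comb
from typing import Dict


def form1_derivations(lo: int, up: int) -> Dict[int, int]:
    """Map each gap length to the number of Form1 derivations producing it."""
    counts: Dict[int, int] = {}
    # Each optional symbol Xe rewrites to X (1) or to epsilon (0).
    for choice in product((0, 1), repeat=up - lo):
        length = lo + sum(choice)
        counts[length] = counts.get(length, 0) + 1
    return dict(sorted(counts.items()))


counts = form1_derivations(2, 6)
print(counts)  # {2: 1, 3: 4, 4: 6, 5: 4, 6: 1}

# The counts are binomial coefficients and sum to 2^(up-lo).
assert all(counts[n] == comb(6 - 2, n - 2) for n in counts)
assert sum(counts.values()) == 2 ** (6 - 2)
```

A parser that backtracks through every derivation may therefore revisit the same gap length many times with Form1, which is consistent with the advice to design gap rules carefully.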

Empirical Comparison
• Protein grammars and sequences
  • Sequences from the Uniprot database
  • Grammars from the Prosite database (“simple” grammars)
  • Motivation: the largest databases of protein sequences and of hand-validated protein grammars
• DNA grammars and sequences
  • Grammars from UTRsite (untranslated regions of RNA)
  • Sequences from the UTRdb
  • Motivation: one of the few sources where many DNA grammars are available

Empirical Comparison - Proteins
[Figure: parsing time in seconds against length of the parsed string. Curves: CP L+F1, CP L+F2, DFTD R+F1, DFTD R+F2, CP Opt., DFTD Opt.]
• Legend: unlimited gap: L = left recursive, R = right recursive. Limited gap: F1 = form1, F2 = form2. Opt. = optimised.
• Conclusion 1: If you program a parsing algorithm, creating special treatments for gaps can speed up parsing.
• Fact: Some curves are not plotted because of very large running times: CP R+F1 and CP R+F2; DFTDP R+F1 for one grammar.
• Conclusion 2: When using classical CFG parsing algorithms, design the gap rules carefully.

Empirical Comparison - DNA
• Legend: L = unlimited gap, left recursive. R = unlimited gap, right recursive. F1 = limited gap, form1. F2 = limited gap, form2.
• Note: the “0.” is missing in Table 2.
• Conclusion 3: CP is significantly faster than DFTDP when grammars start to be “complex”.

Conclusions
Theoretical and empirical studies show that:
• If you program a parsing algorithm, creating special treatments for gaps can speed up parsing.
• When using classical CFG parsing algorithms, design the gap rules carefully.
• DFTDP is faster for “simple” grammars, but CP is significantly faster when grammars start to be “complex”.

Acknowledgements
• Funding: EPSRC
• Industrial collaborator: GlaxoSmithKline
  • Simon Topp
  • Stephen Jupe
Software and experimental material: http://www.comp.rgu.ac.uk/staff/chb/research/data_sets/cpm05
