150 likes | 256 Views
Logic-statistic modeling and analysis of biological sequence data: A research agenda. Henning Christiansen Roskilde University, Denmark henning @ ruc.dk, http:// www . ruc . dk / ~henning International Workshop on Abduction and Induction in AI and Bioinformatics Aix-en-Provence, 15 sep 2007.
E N D
Logic-statistic modeling and analysis of biological sequence data:A research agenda Henning Christiansen Roskilde University, Denmark henning@ruc.dk, http://www.ruc.dk/~henning International Workshop on Abduction and Induction in AI and Bioinformatics Aix-en-Provence, 15 sep 2007
Motivation and overall goal Computational analysis of biological sequence data traditionally based on • HMM, SCFG, ad hoc techniques • Each system has its particular type of models A bottle-neck which cannot be remedied (only) by faster and parallel computers We want to promote the application of more expressive and flexible models: logic-statistic methods a la PRISM (Sato, Kameya) I.e. stepping from regular and context-free languages to Turing-complete language
To reach this goal, we will... • Approach inherent computational problems • optimizations by program analysis & transformations • interface to existing and efficient software • Develop biologically relevant test cases • for biologist to learn how to use such models • to have relevant test cases for the first part
Project setup Funded by the NABIIT program under the Danish Strategic Research Council, 2007–2011 Main academic partners: • Roskilde U. Computer Science: H. Christiansen, J. Gallagher; 1 PhD student (still open!), postdocs • Roskilde U., Biology: O. Skovgaard; 1 PhD student • Aalborg U. Computer Science: M. Jaeger Academic associates: • Taisuke Sato, Tokyo Inst. of Techn. • A. Krogh, Copenhagen Univ. Industrial partners: • Chr. Hansen, Denmark. • Wordwide supplier of probiotic products for the dietary supplement industry • CLC bio • Leading supplier of bioinformatics software
PRISM (Sato, Kameya) for sequence analysis, introduced by a toy example PRISM extends Prolog with discrete random variables Includes machine learning and prediction methods: • learn best probabilities to explain training data • with learned prob’s, determine best answer to a query Example: Loop structures - a non-context-free phenomenon gggctgg gggctgg Assume a collection of sequences where loop structures have been identified in the lab Task: Build and train model so it can be used for prediction
Example model in PRISM • Assume (arbitrarily): • ‘noise’ ≈ a 1. order Markov model • ‘contact zone’ ≈ a 2. order sequence(...):- noise(...), contact(K,....), noise(....), contactCopy(K, ...), noise(...). values(moreNoise,[stop, continue]). values(moreContact,[stop,continue]). values(which(_),[a,c,g,t]). values(which(_,_),[a,c,g,t]). noise(F,S1,S2):- msw(moreNoise,YN), noise2(F,S1,S2,YN). noise2(_,S,S,stop). noise2(F,[B|S1],S2,continue):- msw(which(F),B), noise(B,S1,S2). contact(K,F1,F2,S1,S2):- msw(moreContact,YN), contact2(K,F1,F2,S1,S2,YN). contact2([],_,_,S,S,stop). contact2([B|K],F1,F2,[B|S1],S2,continue):- msw(which(F1,F2),B), contact(K,F2,B,S1,S2). contactCopy([],S,S). contactCopy([B|K],[B|S1],S2):- contactCopy(K,S1,S2). sequence(K,S):- noise(-,S,S1), contact(K,-,-, S1,S2), noise(-,S2,S3), contactCopy(K, S3,S4), noise(-,S4,[]). This is the entire model! Training data: sequence([c,c,g,g,g,t,c,g,c],[a,c,c,g,g,g,t,c,g,c,a,a,t,c,a,a,a,t,c,t,t,t,a,a,c,c,c,g,g,g,t,c,g,c,a,g,a,c,t,a,t,g,t,t,t,a,g,a,a,a,a,c,a,t]). sequence(......, ......). sequence(...., .....). .......
Using a the trained model for prediction ?- viterbig(sequence(K,[t,a,t,a,g,c,g,c,t,a,t,a,g,c,g,c,t,a,t,a])) K = [g,c,g,c] The answer to the query with highest probability. ... plus a lot of other facilities
Our first serious application of PRISM:Testing gene finders (MLDM 2007; with C.M.Dahmcke) Problems: Test data expensive; available test data already used for training gene finders; disagreement about what is a gene, ... Approach: • Develop and train PRISM model with known, annotated data • Use this to create artificial test data, • i.e., sequences with annotations about where-are-the-genes • Check if gene finder programs find the same genes Results: • Three different gene finders found too many and different genes ;-(
Overview of the model (intergenic only) GC-island GC-sparse GC-sparse Target predicate: sequence(sequence-of-ACGT, GC-islands, repeats) GC-islands: list of from-no–to-no repeats: list of from-no–to-no with indication of: type: simple, low-complexity, named,... for named: selected from catalogue; which part; forward, backward, transposed, backward+transposed plus one detailed description of »mutation«: [c,c,c,c,i,i,c,c,d,d,c,c,...] (to suppress complexity in the model; for training data generated by a best-match algorithm) ...
Implemented as a two-layer model Top-level: GC-islands/GC-sparse, length 200 + exponential decay Underlying layer: Mix of repeaters and coloured noise Two-level structure implemented by our own abstract datatype: • uses hidden msw’s to control GC-island/sparse • each RV maintained in two versions (hidden) • position, counter to produce annot. GC-islands msw(RV:random-var, value, GC-islands, position)
Lesson learned from gene finder experiment • A nontrivial model can be organized in a reasonable way by an experienced logic programmer • Preprocessing to freeze one mutation set reduced complexity of learning phase - general technique? • Model could be trained in minutes from marked up sequences of total 106 letters. • With Prolog’s list repr. for sequences we needed 64-bit architechture (sic!) and lo-o-o-o-ot of RAM • PRISM is a very flexible tool for combining and varying different models, inventing a little data structure etc., but keeping a model with well-defined semantics • We lacked, and would suggest to add to PRISM • Distributions over integers (normal d. or “generic smooth”) • Over-layered, and especially negative criteria • (ouf: random variables become dependent)
Anticipated problems for prediction with PRISM and possible solutions • Storage consumption, the sequence as array + PRISM’s explanation graphs (???) • Execution time • Systematic approach to pruning: Generalize known methods for semantics-preserving program transformations to semantics-approximating transformations • Integrate with existing and efficient software • Automatically and hidden??? If, e.g., analysis of PRISM program says “this-looks-like-a-HMM” • Clean interfaces ???? • Reduce complexity by splitting the sequence (-- how to integrate this with a nice semantics?)
Biological problems considered • Gene finding in health promoting bacteria • Phylogenetic gene prediction • Prediction of gene function and acquisition by orthology • (Gene finding for eukaryotic species)
Project hypotheses summarized • Logic-statistic models, a la PRISM or similar, have much higher expressibility and flexibility than traditional models used for sequence analysis (in formal as well as practical sense) • If we can solve some of the computational problems involved and learn how to use such powerful modeling tools, there is a potential for new discoveries in biology.
Thanks for your attention! PS. We are seeking (desperately) a good PhD student for the computational issues. Good salary and conditions offered!