Logic-statistic modeling and analysis of biological sequence data: A research agenda

Henning Christiansen
Roskilde University, Denmark
henning@ruc.dk, http://www.ruc.dk/~henning
International Workshop on Abduction and Induction in AI and Bioinformatics
Aix-en-Provence, 15 Sep 2007

Motivation and overall goal
Computational analysis of biological sequence data is traditionally based on
• HMM, SCFG, ad hoc techniques
• Each system has its particular type of models
This is a bottleneck which cannot be remedied (only) by faster and parallel computers.
We want to promote the application of more expressive and flexible models: logic-statistic methods à la PRISM (Sato, Kameya), i.e. stepping from regular and context-free languages to a Turing-complete language.

To reach this goal, we will...
• Approach inherent computational problems
  • optimizations by program analysis & transformations
  • interface to existing and efficient software
• Develop biologically relevant test cases
  • for biologists to learn how to use such models
  • to have relevant test cases for the first part

Project setup
Funded by the NABIIT program under the Danish Strategic Research Council, 2007–2011
Main academic partners:
• Roskilde U., Computer Science: H. Christiansen, J. Gallagher; 1 PhD student (still open!), postdocs
• Roskilde U., Biology: O. Skovgaard; 1 PhD student
• Aalborg U., Computer Science: M. Jaeger
Academic associates:
• Taisuke Sato, Tokyo Inst. of Techn.
• A. Krogh, Copenhagen Univ.
Industrial partners:
• Chr. Hansen, Denmark
  • Worldwide supplier of probiotic products for the dietary supplement industry
• CLC bio
  • Leading supplier of bioinformatics software

PRISM (Sato, Kameya) for sequence analysis, introduced by a toy example
PRISM extends Prolog with discrete random variables
Includes machine learning and prediction methods:
• learn best probabilities to explain training data
• with learned probabilities, determine best answer to a query
Example: Loop structures - a non-context-free phenomenon
[Figure: a sequence in which the segment gggctgg appears twice]
Assume a collection of sequences where loop structures have been identified in the lab
Task: Build and train a model so it can be used for prediction

Example model in PRISM
• Assume (arbitrarily):
  • ‘noise’ ≈ a first-order Markov model
  • ‘contact zone’ ≈ a second-order Markov model

Overall shape:

    sequence(...):- noise(...), contact(K,....), noise(....), contactCopy(K, ...), noise(...).

The model:

    values(moreNoise,[stop,continue]).
    values(moreContact,[stop,continue]).
    values(which(_),[a,c,g,t]).
    values(which(_,_),[a,c,g,t]).

    noise(F,S1,S2):-
        msw(moreNoise,YN),
        noise2(F,S1,S2,YN).

    noise2(_,S,S,stop).
    noise2(F,[B|S1],S2,continue):-
        msw(which(F),B),
        noise(B,S1,S2).

    contact(K,F1,F2,S1,S2):-
        msw(moreContact,YN),
        contact2(K,F1,F2,S1,S2,YN).

    contact2([],_,_,S,S,stop).
    contact2([B|K],F1,F2,[B|S1],S2,continue):-
        msw(which(F1,F2),B),
        contact(K,F2,B,S1,S2).

    contactCopy([],S,S).
    contactCopy([B|K],[B|S1],S2):-
        contactCopy(K,S1,S2).

    sequence(K,S):-
        noise(-,S,S1),
        contact(K,-,-,S1,S2),
        noise(-,S2,S3),
        contactCopy(K,S3,S4),
        noise(-,S4,[]).

This is the entire model!

Training data:

    sequence([c,c,g,g,g,t,c,g,c],[a,c,c,g,g,g,t,c,g,c,a,a,t,c,a,a,a,t,c,t,t,t,a,a,c,c,c,g,g,g,t,c,g,c,a,g,a,c,t,a,t,g,t,t,t,a,g,a,a,a,a,c,a,t]).
    sequence(......, ......).
    sequence(...., .....).
    .......
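To make the generative reading of the model concrete, here is a minimal Python sketch (not PRISM) that samples a sequence in the same order as the clauses: noise, contact zone K, noise, a copy of K, noise. The function names, the continue-probabilities and the uniform default weights are assumptions made for illustration; in PRISM these parameters would be learned from the training data.

```python
import random

BASES = "acgt"
UNIFORM = [1, 1, 1, 1]  # assumed default weights; PRISM would learn these

def sample_noise(rng, p_continue, which1):
    """First-order noise: each base depends on the previous one ('-' at the start)."""
    out, prev = [], "-"
    while rng.random() < p_continue:                   # plays the role of msw(moreNoise, continue)
        weights = which1.get(prev, UNIFORM)
        base = rng.choices(BASES, weights=weights)[0]  # plays the role of msw(which(F), B)
        out.append(base)
        prev = base
    return out

def sample_contact(rng, p_continue, which2):
    """Second-order contact zone: each base depends on the two previous ones."""
    out, f1, f2 = [], "-", "-"
    while rng.random() < p_continue:                   # msw(moreContact, continue)
        weights = which2.get((f1, f2), UNIFORM)
        base = rng.choices(BASES, weights=weights)[0]  # msw(which(F1,F2), B)
        out.append(base)
        f1, f2 = f2, base
    return out

def sample_sequence(rng, p_noise=0.9, p_contact=0.7, which1=None, which2=None):
    """Return (K, S) where S = noise + K + noise + K + noise."""
    which1, which2 = which1 or {}, which2 or {}
    n1 = sample_noise(rng, p_noise, which1)
    k = sample_contact(rng, p_contact, which2)
    n2 = sample_noise(rng, p_noise, which1)
    n3 = sample_noise(rng, p_noise, which1)
    return k, n1 + k + n2 + k + n3                     # the second k is the contactCopy

k, s = sample_sequence(random.Random(2007))
print("K =", "".join(k))
print("S =", "".join(s))
```

The copy of K costs no random choices, which is what makes the repeated zone a non-context-free dependency rather than two independent stretches of letters.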

Using the trained model for prediction

    ?- viterbig(sequence(K,[t,a,t,a,g,c,g,c,t,a,t,a,g,c,g,c,t,a,t,a])).
    K = [g,c,g,c]

This is the answer to the query with the highest probability.
... plus a lot of other facilities
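What "the answer with highest probability" means can be sketched in Python by brute force: every way of splitting the sequence into noise + K + noise + K + noise is one explanation, and the best K is the one whose explanation has the highest probability. This is not how PRISM computes it (PRISM works on explanation graphs), and the sketch simplifies the model to context-free (zero-order) emission tables. The function name and all probabilities below are made up for illustration, not learned values.

```python
import math

def best_contact_zone(seq, noise_emit, contact_emit, p_noise=0.8, p_contact=0.8):
    """Return (log_prob, K) for the most probable split noise+K+noise+K+noise."""
    def run(segment, emit, p_continue):
        # log-probability of emitting `segment` letter by letter and then stopping
        return (sum(math.log(p_continue * emit[b]) for b in segment)
                + math.log(1 - p_continue))

    n = len(seq)
    best = (-math.inf, "")
    for i in range(n + 1):                              # start of K
        for length in range(n - i + 1):                 # length of K
            k = seq[i:i + length]
            for j in range(i + length, n - length + 1): # start of the copy
                if seq[j:j + length] != k:
                    continue                            # the copy must match K exactly
                logp = (run(seq[:i], noise_emit, p_noise)
                        + run(k, contact_emit, p_contact)
                        + run(seq[i + length:j], noise_emit, p_noise)
                        + run(seq[j + length:], noise_emit, p_noise))
                if logp > best[0]:
                    best = (logp, k)
    return best

# Made-up parameters: noise prefers a/t, contact zones prefer c/g.
noise_emit = {"a": 0.45, "t": 0.45, "c": 0.05, "g": 0.05}
contact_emit = {"a": 0.05, "t": 0.05, "c": 0.45, "g": 0.45}
logp, k = best_contact_zone("tatagcgctatagcgctata", noise_emit, contact_emit)
print(k)  # gcgc
```

With uniform emission tables the longest repeat would win instead, because copied letters are free; the answer K = gcgc only emerges once the parameters distinguish noise from contact zones, which is the role of training.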

Our first serious application of PRISM: Testing gene finders
(MLDM 2007; with C.M. Dahmcke)
Problems: Test data expensive; available test data already used for training gene finders; disagreement about what is a gene, ...
Approach:
• Develop and train PRISM model with known, annotated data
• Use this to create artificial test data,
  • i.e., sequences with annotations about where-are-the-genes
• Check if gene finder programs find the same genes
Results:
• Three different gene finders found too many and different genes ;-(
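The last step of the approach, checking whether a gene finder finds the same genes as the annotations, could be scored as in the following Python sketch. The slides do not say which measure was used; nucleotide-level sensitivity and precision over inclusive (from, to) intervals is an assumption made here, and the function names are hypothetical.

```python
def covered_positions(intervals):
    """Expand inclusive (start, end) intervals into the set of positions they cover."""
    return {p for start, end in intervals for p in range(start, end + 1)}

def nucleotide_level_scores(annotated, predicted):
    """Return (sensitivity, precision) of predicted gene positions against annotations."""
    truth = covered_positions(annotated)
    pred = covered_positions(predicted)
    true_positives = len(truth & pred)
    sensitivity = true_positives / len(truth) if truth else 1.0
    precision = true_positives / len(pred) if pred else 1.0
    return sensitivity, precision

# One annotated gene; the finder overlaps half of it and reports a spurious extra gene.
print(nucleotide_level_scores([(10, 19)], [(15, 24), (40, 44)]))
```

A finder that reports "too many" genes shows up here as low precision even when sensitivity is high.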

Overview of the model (intergenic only)
[Figure labels: GC-island, GC-sparse, GC-sparse]
Target predicate: sequence(sequence-of-ACGT, GC-islands, repeats)
• GC-islands: list of from-no–to-no
• repeats: list of from-no–to-no with indication of:
  • type: simple, low-complexity, named, ...
  • for named: selected from catalogue; which part; forward, backward, transposed, backward+transposed
  • plus one detailed description of »mutation«: [c,c,c,c,i,i,c,c,d,d,c,c,...] (to suppress complexity in the model; for training data generated by a best-match algorithm)
...
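The »mutation« description can be read as an edit script applied to a catalogue repeat. The Python sketch below assumes that c means "copy the next template letter", i means "insert a new letter" and d means "delete (skip) the next template letter"; the slides do not spell out these meanings, so both this reading and the function name are assumptions.

```python
def apply_mutation(template, script, inserted):
    """Apply a c/i/d edit script to `template`.

    `inserted` supplies the letters used by the 'i' steps, in order.
    Assumed meanings: c = copy, i = insert, d = delete.
    """
    out, pos, new_letters = [], 0, iter(inserted)
    for op in script:
        if op == "c":                 # copy the next template letter
            out.append(template[pos])
            pos += 1
        elif op == "d":               # skip the next template letter
            pos += 1
        elif op == "i":               # emit a letter that is not in the template
            out.append(next(new_letters))
        else:
            raise ValueError(f"unknown edit operation: {op!r}")
    return "".join(out)

print(apply_mutation("acgtac", ["c", "c", "i", "i", "c", "d", "c"], "tt"))  # acttga
```

Fixing one such script per repeat in the training data means the model never has to sum over all possible alignments between a repeat and its catalogue entry.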

Implemented as a two-layer model
Top-level: GC-islands/GC-sparse, length 200 + exponential decay
Underlying layer: Mix of repeaters and coloured noise
Two-level structure implemented by our own abstract datatype:
• uses hidden msw’s to control GC-island/sparse
• each RV maintained in two versions (hidden)
• position, counter to produce annotated GC-islands

    msw(RV:random-var, value, GC-islands, position)
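A rough Python sketch of the top layer, under stated assumptions: regions alternate between GC-sparse and GC-island, each region has length 200 plus a geometrically distributed tail (one way to read "exponential decay"), and a position counter records the from/to annotation of each island. The emission weights, the stop probability and the function name are invented for illustration, and the underlying layer of repeats and coloured noise is replaced by independent letters.

```python
import random

def sample_two_layer(rng, total_length, p_stop=0.01, min_len=200):
    """Sample a sequence of alternating GC-sparse / GC-island regions.

    Returns (sequence, islands), where islands is a list of 1-based,
    inclusive (from, to) positions of the GC-island regions.
    """
    # Made-up weights for a, c, g, t in each region type.
    emit = {"island": [1, 4, 4, 1], "sparse": [4, 1, 1, 4]}
    seq, islands, state = [], [], "sparse"
    while len(seq) < total_length:
        length = min_len
        while rng.random() >= p_stop:        # geometric tail after the first 200 letters
            length += 1
        length = min(length, total_length - len(seq))
        start = len(seq) + 1                 # position counter, 1-based
        seq.extend(rng.choices("acgt", weights=emit[state], k=length))
        if state == "island":
            islands.append((start, len(seq)))
        state = "island" if state == "sparse" else "sparse"
    return "".join(seq), islands

sequence, islands = sample_two_layer(random.Random(1), 2000)
print(len(sequence), islands)
```

In the PRISM model both the region switch and the letter choices go through msw, so the same parameters that generate annotated test data can also be trained from annotated real data.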

Lessons learned from the gene finder experiment
• A nontrivial model can be organized in a reasonable way by an experienced logic programmer
• Preprocessing to freeze one mutation set reduced complexity of the learning phase - general technique?
• Model could be trained in minutes from marked-up sequences totalling 10^6 letters.
• With Prolog’s list representation for sequences we needed 64-bit architecture (sic!) and a lo-o-o-o-ot of RAM
• PRISM is a very flexible tool for combining and varying different models, inventing a little data structure etc., while keeping a model with well-defined semantics
• We lacked, and would suggest adding to PRISM:
  • Distributions over integers (normal distribution or “generic smooth”)
  • Over-layered, and especially negative, criteria
  • (ouf: random variables become dependent)

Anticipated problems for prediction with PRISM and possible solutions
• Storage consumption, the sequence as array + PRISM’s explanation graphs (???)
• Execution time
• Systematic approach to pruning: Generalize known methods for semantics-preserving program transformations to semantics-approximating transformations
• Integrate with existing and efficient software
  • Automatically and hidden??? If, e.g., analysis of PRISM program says “this-looks-like-a-HMM”
  • Clean interfaces ????
• Reduce complexity by splitting the sequence (how to integrate this with a nice semantics?)

Biological problems considered
• Gene finding in health-promoting bacteria
• Phylogenetic gene prediction
• Prediction of gene function and acquisition by orthology
• (Gene finding for eukaryotic species)

Project hypotheses summarized
• Logic-statistic models, à la PRISM or similar, have much higher expressiveness and flexibility than traditional models used for sequence analysis (in a formal as well as a practical sense)
• If we can solve some of the computational problems involved and learn how to use such powerful modeling tools, there is a potential for new discoveries in biology.

Thanks for your attention! PS. We are seeking (desperately) a good PhD student for the computational issues. Good salary and conditions offered!
