Overview IST-2001-38344. Cells are a collection of protein nanomachines. A biological challenge. To build models of protein complexes & understand the function of each component, based upon available evidence.
A biological challenge
• To build models of protein complexes & understand the function of each component, based upon available evidence.
• However, to build evidence for each protein interaction, a biologist must find, integrate, compare & then validate the results from a number of separate resources.
[Figure: Genomics & Proteomics. Labels: DNA, DNA ‘chips’, HTP sequencing, SNP, gene prediction, expression, proteomics, domain analysis, folding, modelling, synchrotron, protein structures.]
[Figure: Genomics & Proteomics. Panels: Interaction Space, Expression Space, Literature Space.]
The need for computerised information systems
• New HTP methods produce orders of magnitude more data than before:
  • More than is interpretable manually.
  • Data are stored in a (semi-)structured format.
• Much knowledge is in literature & patents:
  • 13,000,000 abstracts in MEDLINE.
  • Knowledge is stored in an unstructured format.
• Solution: computerised information systems:
  • Enable data mining & visualisation of integrated resources, with text analysis.
Components of bioGrid
• Gene expression:
  • ExpressionSpace:
    • Clustering of microarray data.
    • May require large memory.
• Protein interaction:
  • PSIMAP:
    • Predicts interactions between protein domains.
    • Can be pre-computed, as the data are relatively unchanging.
• Literature:
  • GoPubMed-D:
    • Organises a corpus of documents into the GO ontology.
    • Lexical analysis requires lengthy computation.
bioGrid: An integrated platform for gene expression data, protein interaction data, and literature
• Expression Space: Space Explorer
• Interaction Space: PSIMAP
• Literature Space: Classification Server
Workflow for use case - Part I
• Search literature for papers about the experimental system studied:
  • Microarray & mitochondria.
• Upload the gene expression data set.
• Cluster the gene expression data set.
• Identify a cluster that contains genes of interest, e.g. energy production.
• Examine the expression profiles of the genes in the cluster.
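The clustering step above can be illustrated with a small sketch. The slides do not say which algorithm ExpressionSpace uses, so this swaps in a simple stand-in: genes whose expression profiles have a Pearson correlation at or above a threshold are linked, and clusters are the connected components. The gene names, the `threshold` value, and the function names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_profiles(profiles, threshold=0.9):
    """Group genes into connected components of the 'correlation >= threshold' graph.

    This is an illustrative stand-in, not the ExpressionSpace algorithm.
    """
    genes = sorted(profiles)
    parent = {g: g for g in genes}  # union-find forest

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            if pearson(profiles[g1], profiles[g2]) >= threshold:
                parent[find(g2)] = find(g1)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())


# Hypothetical expression profiles (one value per experimental condition).
example_profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8.5],
    "geneC": [4, 3, 2, 1],
    "geneD": [5, 5, 5, 5],
}
example_clusters = cluster_profiles(example_profiles)
```

With these made-up profiles, geneA and geneB rise together and end up in one cluster, while geneC and geneD each stay alone. A real microarray data set would need a more robust method and far more memory, as the component slide notes.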
Workflow for use case - Part II
• Calculate an induced PSIMAP graph for the genes in the expression cluster.
• Explore PSIMAP graph & nodes.
• For pairs of genes predicted to interact:
  • Search literature for papers citing both genes.
  • Classify literature to assess possible function or metabolic processes of genes.
• Assimilate evidence for components of a protein complex.
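One plausible reading of the induced-graph and co-citation steps is sketched below. It assumes each gene maps to a set of protein domains and that a PSIMAP-style map lists interacting domain pairs. Two genes are then linked if any of their domains interact, and the literature step becomes a set intersection of the papers citing each gene. All identifiers and data here are hypothetical, not taken from bioGrid.

```python
from itertools import combinations

# Hypothetical domain-domain interaction map (PSIMAP-style); illustrative only.
DOMAIN_INTERACTIONS = {
    frozenset(("d1", "d2")),
    frozenset(("d2", "d3")),
}

# Hypothetical gene -> domain annotation.
GENE_DOMAINS = {
    "geneA": {"d1"},
    "geneB": {"d2"},
    "geneC": {"d3"},
    "geneD": {"d5"},
}

# Hypothetical gene -> identifiers of papers that cite it.
GENE_PAPERS = {
    "geneA": {"paper1", "paper2"},
    "geneB": {"paper2", "paper3"},
    "geneC": {"paper4"},
}


def induced_graph(genes, gene_domains, domain_interactions):
    """Return gene pairs (sorted tuples) whose domains are predicted to interact."""
    edges = set()
    for g1, g2 in combinations(sorted(genes), 2):
        if any(
            frozenset((a, b)) in domain_interactions
            for a in gene_domains.get(g1, ())
            for b in gene_domains.get(g2, ())
        ):
            edges.add((g1, g2))
    return edges


def co_cited(g1, g2, gene_papers):
    """Return the papers that cite both genes."""
    return gene_papers.get(g1, set()) & gene_papers.get(g2, set())


cluster = {"geneA", "geneB", "geneC", "geneD"}
edges = induced_graph(cluster, GENE_DOMAINS, DOMAIN_INTERACTIONS)
evidence = {pair: co_cited(*pair, GENE_PAPERS) for pair in edges}
```

Each predicted edge then carries the set of co-citing papers, which is the raw material for the classification and evidence-assimilation steps.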
Distributed technology implementation
• Globus, Unicore, Legion, …
  • Are geared towards computational complexity, not semantic complexity.
• BioGrid’s approach:
  • Agent-based approach.
  • Integration of rules, reasoning, and messaging in a Java environment.
  • Using a meta-model.
• Advantages:
  • Easy to maintain, easy to use, includes code distribution, architecture independent, geared towards farms of local and remote machines.
Prova-AA
• Extensions to Prova for rule-based agent scripting.
• Prova-AA introduces:
  • Messaging (local, JMS, and JADE).
  • Reaction rules.
  • Context-dependent inline reactions for asynchronous messaging.
  • Embedding of Prova agents in Java and Web apps.
• Advantages:
  • Cooperating agents vs. GRID RPC.
  • Ease of development and maintenance.
  • Platform independence and portability.
  • High-level specification of communication protocols.
  • Native syntax integration with Java.
  • Low-cost creation of distributed workflows and ad-hoc networks of computation nodes.
Distributed GoPubMed-D (2/3)
The BioGrid prototype integrates with GoPubMed-D via an embedded Prova-AA JADE agent.
Distributed computation with Prova-AA agents
A flexible solution for a self-managing, self-balancing distributed computation:
• Manager and Workers architecture based on Prova-AA agents with Java computation modules.
• Loosely synchronous interaction.
• Minimal compact coding (30 lines for Manager and 20 lines for Worker).
• Manager does not need to keep a registry of the Workers, which can join in at any time.
• Computation is divided into small atomic subtasks (4 or 5 proteins).
• Manager dispatches a new subtask asynchronously upon receiving a ready message from a Worker.
• Worker computes a subtask and responds with the results in a reply message and a new ready message.
• Workers compute subtasks at their own pace, so load balancing is automatic.
• Workers extended with routing capabilities are available.
• Can be easily extended with failover capabilities.
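The ready/dispatch protocol described above can be sketched in plain Python. This is not Prova-AA syntax and not the bioGrid code: it uses threads and in-process queues as stand-ins for agents and messaging, and every name (`manager`, `worker`, `run`, the message tags) is hypothetical. What it keeps is the shape of the protocol. The Manager holds no Worker registry and simply replies to whoever sent a ready message, so faster Workers ask for work more often.

```python
import queue
import threading
from collections import deque


def manager(proteins, inbox, chunk=4):
    """Dispatch subtasks of `chunk` proteins in reply to 'ready' messages."""
    subtasks = deque(proteins[i:i + chunk] for i in range(0, len(proteins), chunk))
    expected = len(subtasks)  # number of result messages to wait for
    received = 0
    results = {}
    while received < expected:
        kind, payload = inbox.get()
        if kind == "ready":
            reply_to = payload  # the sender's own inbox; no registry is kept
            if subtasks:
                reply_to.put(("task", subtasks.popleft()))
            else:
                reply_to.put(("stop", None))
        elif kind == "result":
            results.update(payload)
            received += 1
    return results


def worker(manager_inbox, compute):
    """Announce readiness, compute each subtask received, then report and repeat."""
    my_inbox = queue.Queue()
    manager_inbox.put(("ready", my_inbox))
    while True:
        kind, payload = my_inbox.get()
        if kind == "stop":
            break
        manager_inbox.put(("result", {p: compute(p) for p in payload}))
        manager_inbox.put(("ready", my_inbox))


def run(proteins, n_workers=3, compute=len):
    """Start workers and run the manager in the calling thread.

    Workers are daemon threads: a 'ready' sent after the manager has
    finished is never answered, and must not keep the process alive.
    """
    inbox = queue.Queue()
    for _ in range(n_workers):
        threading.Thread(target=worker, args=(inbox, compute), daemon=True).start()
    return manager(proteins, inbox)


# Hypothetical protein identifiers; `len` stands in for the real computation.
demo_proteins = [f"protein{i}" for i in range(10)]
demo_results = run(demo_proteins)
```

Because a Worker only receives a new subtask after it asks for one, no scheduling logic is needed for load balancing. Failover, which the slide mentions as an easy extension, would need the Manager to re-queue subtasks whose results never arrive. This sketch leaves that out.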
First results & infrastructure needs IST-2001-38344
Building an information system for biology is non-trivial
• Molecular biology resources:
  • Are heterogeneous in content:
    • Genomics, proteomics, literature.
  • Exist in large numbers:
    • Public, commercial, organisational, personal.
  • Vary in quality: curated vs. automatic.
  • Have different interfaces: Web, SQL, SOAP, etc.
  • Are geographically distributed without yellow pages.
  • Store data in different formats, with few standards.
  • Change rapidly.
  • Are subject to confidentiality & IPR protection.
  • Are too large to transport conveniently.
Technology challenges in building bioGrid
• Semantic complexity:
  • Computer does not “understand” data.
  • DBs and systems cannot inter-operate.
• Computational complexity:
  • Generating a protein interaction map takes ca. 1 day.
  • Analysing large sets of gene expression data can take up to an hour.
  • Analysis of large text bodies is complex.
Social challenges in building Grid
• Over-hype & scepticism.
• Technology stability & reliability.
• Security.
• Usability.
• Peer-reviewed results in major biomedical journals:
  • Science, Nature, Cell, BMJ, Lancet, etc.