
Overview IST-2001-38344


joshua

Presentation Transcript


  1. Overview IST-2001-38344

  2. Cells are a collection of protein nanomachines

  3. A biological challenge
     • To build models of protein complexes & understand the function of each component, based upon available evidence.
     • However, to build evidence for each protein interaction, a biologist must find, integrate, compare & then validate the results from a number of separate resources.

  4. Genomics & Proteomics [figure; labels: DNA ‘chips’, modelling, HTP sequencing, SNP, gene prediction, proteomics, domain analysis, synchrotron, expression, folding, protein structures, DNA]

  5. Genomics & Proteomics [figure; labels: Interaction Space, Expression Space, Literature Space]

  6. The need for computerised information systems
     • New HTP methods produce orders of magnitude more data than before:
       • More than is interpretable manually.
       • Data are stored in a (semi-)structured format.
     • Much knowledge is in literature & patents:
       • 13,000,000 abstracts in MEDLINE.
       • Knowledge is stored in an unstructured format.
     • Solution: computerised information systems:
       • Enable data mining & visualisation of integrated resources, with text analysis.

  7. Components of bioGrid
     • Gene expression:
       • ExpressionSpace:
         • Clustering of microarray data.
         • May require large memory.
     • Protein interaction:
       • PSIMAP:
         • Predicts interactions between protein domains.
         • May be pre-computed, as the data are relatively unchanging.
     • Literature:
       • GoPubMed-D:
         • Organises a corpus of documents into the GO ontology.
         • Lexical analysis requires lengthy compute.
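To make the literature component concrete, the idea of organising documents under GO terms can be sketched as below. This is only an illustration by exact label matching, not GoPubMed-D's actual lexical analysis, which the slides do not describe. The function name `classify_by_go` and the three abstracts are hypothetical; the two GO identifiers are real terms chosen for the example.

```python
import re

# Two real GO terms picked for illustration; a real system would load
# the whole ontology rather than a hand-written dictionary.
GO_TERMS = {
    "GO:0005739": "mitochondrion",
    "GO:0006119": "oxidative phosphorylation",
}

def classify_by_go(abstracts, go_terms=GO_TERMS):
    """Map each GO id to the documents whose text mentions its label."""
    index = {go_id: [] for go_id in go_terms}
    for doc_id, text in abstracts.items():
        lowered = text.lower()
        for go_id, label in go_terms.items():
            # Whole-phrase match only; no stemming or synonym handling.
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                index[go_id].append(doc_id)
    return index

# Hypothetical abstracts, invented for this example.
abstracts = {
    "doc1": "Oxidative phosphorylation in the yeast mitochondrion.",
    "doc2": "Microarray analysis of mitochondrion biogenesis.",
    "doc3": "A study of cell wall synthesis.",
}
go_index = classify_by_go(abstracts)
```

Even this naive matcher scans every document for every term, which hints at why lexical analysis over millions of abstracts is compute-intensive.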

  8. bioGrid: An integrated platform for gene expression data, protein interaction data, and literature [figure; panels: Expression Space: Space Explorer, Interaction Space: PSIMAP, Literature Space: Classification Server]

  9. Workflow for use case - Part I
     • Search literature for papers about the experimental system studied:
       • Microarray & mitochondria.
     • Upload the gene expression data set.
     • Cluster the gene expression data set.
     • Identify a cluster that contains genes of interest, e.g. energy production.
     • Examine the expression profiles of the genes in the cluster.
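The clustering step of this workflow could look roughly like the sketch below. The slides do not say which algorithm ExpressionSpace uses, so this swaps in a simple greedy grouping by Pearson correlation purely for illustration. The gene names, expression values, and the 0.9 threshold are all made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile correlates with nothing
    return cov / (sx * sy)

def cluster_profiles(profiles, threshold=0.9):
    """Greedy clustering: a gene joins the first cluster whose seed
    profile it correlates with at >= threshold, else it seeds a new one."""
    clusters = []  # list of (seed_profile, [gene names])
    for gene, profile in profiles.items():
        for seed, members in clusters:
            if pearson(seed, profile) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append((profile, [gene]))
    return [members for _, members in clusters]

# Hypothetical expression profiles over four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 3.9, 2.0],
}
clusters = cluster_profiles(profiles)
```

Here the rising profiles (geneA, geneB) and the falling ones (geneC, geneD) end up in separate clusters; a biologist would then pick the cluster containing the genes of interest.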

  10. Workflow for use case - Part II
     • Calculate an induced PSIMAP graph for the genes in the expression cluster.
     • Explore PSIMAP graph & nodes.
     • For pairs of genes predicted to interact:
       • Search literature for papers citing both genes.
       • Classify literature to assess possible function or metabolic processes of genes.
     • Assimilate evidence for components of a protein complex.
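The induced-graph step can be sketched as follows, assuming that each gene maps to a set of protein domains and that PSIMAP supplies a list of interacting domain pairs. The data layout, the function name `induced_interactions`, and every identifier in the example are hypothetical; the "geneX AND geneY" query strings are likewise only a stand-in for whatever the literature search actually accepts.

```python
from itertools import combinations

def induced_interactions(cluster_genes, gene_domains, domain_pairs):
    """Return gene pairs within the cluster that are predicted to interact
    because at least one pair of their domains is a known interaction."""
    known = {frozenset(pair) for pair in domain_pairs}
    edges = []
    for g1, g2 in combinations(sorted(cluster_genes), 2):
        if any(frozenset((d1, d2)) in known
               for d1 in gene_domains.get(g1, ())
               for d2 in gene_domains.get(g2, ())):
            edges.append((g1, g2))
    return edges

# Hypothetical inputs: domain assignments and domain-domain interactions.
gene_domains = {
    "geneA": {"dom1"},
    "geneB": {"dom2", "dom3"},
    "geneC": {"dom4"},
}
domain_pairs = [("dom1", "dom2"), ("dom3", "dom4"), ("dom5", "dom6")]

edges = induced_interactions({"geneA", "geneB", "geneC"},
                             gene_domains, domain_pairs)

# One literature query per predicted pair (query syntax is an assumption).
queries = [f"{a} AND {b}" for a, b in edges]
```

Restricting the graph to the cluster's genes keeps the exploration small: only pairs inside the cluster are considered, and only those pairs generate literature searches.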

  11. Distributed technology implementation
     • Globus, Unicore, Legion, …:
       • Are geared towards computational complexity, not semantic complexity.
     • BioGrid’s approach:
       • Agent-based approach.
       • Integration of rules, reasoning, and messaging in a Java environment.
       • Using a meta-model.
     • Advantages:
       • Easy to maintain, easy to use, includes code distribution, architecture independent, geared towards farms of local and remote machines.

  12. Prova-AA
     • Extensions to Prova for rule-based agent scripting.
     • Prova-AA introduces:
       • Messaging (local, JMS, and JADE).
       • Reaction rules.
       • Context-dependent inline reactions for asynchronous messaging.
       • Embedding of Prova agents in Java and Web apps.
     • Advantages:
       • Cooperating agents vs. GRID RPC.
       • Ease of development and maintenance.
       • Platform independence and portability.
       • High-level specification of communication protocols.
       • Native syntax integration with Java.
       • Low-cost creation of distributed workflows and ad-hoc networks of computation nodes.

  13. Distributed GoPubMed-D (2/3)
     • The BioGrid prototype integrates with GoPubMed-D via an embedded Prova-AA JADE agent.

  14. Distributed computation with Prova-AA agents
     A flexible solution for a self-managing, self-balancing distributed computation:
     • Manager and Workers architecture based on Prova-AA agents with Java computation modules.
     • Loosely synchronous interaction.
     • Minimal, compact coding (30 lines for the Manager and 20 lines for the Worker).
     • The Manager does not need to keep a registry of the Workers, which can join in at any time.
     • Computation is divided into small atomic subtasks (4 or 5 proteins).
     • The Manager dispatches a new subtask asynchronously upon receiving a ready message from a Worker.
     • A Worker computes a subtask and responds with the results in a reply message, followed by a new ready message.
     • Workers compute subtasks at their own pace, so load balancing is automatic.
     • Workers extended with routing capabilities are available.
     • Can be easily extended with failover capabilities.
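The ready/reply protocol described on this slide can be sketched with plain Python threads and queues. This is not Prova-AA code and not the project's implementation: the message tuples, the shared `done` event used for shutdown, the function names, and the placeholder computation (taking the length of each protein identifier) are all assumptions made for illustration.

```python
import queue
import threading

def chunk(items, size):
    """Split the workload into small atomic subtasks."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def worker(name, manager_inbox, done, compute):
    inbox = queue.Queue()
    # Announce availability; the reply address travels with the message,
    # so the manager needs no registry of workers.
    manager_inbox.put(("ready", name, inbox))
    while not done.is_set():
        try:
            task_id, subtask = inbox.get(timeout=0.05)
        except queue.Empty:
            continue  # nothing yet; re-check the shutdown flag
        manager_inbox.put(("result", name, (task_id, compute(subtask))))
        manager_inbox.put(("ready", name, inbox))

def manager(subtasks, manager_inbox, done):
    pending = list(enumerate(subtasks))
    results = {}
    while len(results) < len(subtasks):
        kind, sender, payload = manager_inbox.get()
        if kind == "ready" and pending:
            payload.put(pending.pop(0))  # dispatch to whoever asked
        elif kind == "result":
            task_id, value = payload
            results[task_id] = value
    done.set()  # broadcast shutdown to idle workers
    return [results[i] for i in range(len(subtasks))]

def run(proteins, n_workers=3, size=4):
    inbox, done = queue.Queue(), threading.Event()
    compute = lambda batch: [len(p) for p in batch]  # placeholder work
    threads = [threading.Thread(target=worker,
                                args=(f"w{i}", inbox, done, compute))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    out = manager(chunk(proteins, size), inbox, done)
    for t in threads:
        t.join()
    return out

# Hypothetical protein identifiers.
batches = run(["P1", "P22", "P333", "P4444", "P5", "P66", "P777", "P8888", "P9"])
```

Because a subtask is only handed out in response to a ready message, a slow worker simply asks less often; no scheduling logic is needed in the manager. Failover is not covered here: a subtask handed to a worker that dies would never complete.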

  15. Proposed Architecture of integrated platform

  16. First results & infrastructure needs IST-2001-38344

  17. Building an information system for biology is non-trivial
     • Molecular biology resources:
       • Are heterogeneous in content:
         • Genomics, proteomics, literature.
       • Exist in large numbers:
         • Public, commercial, organisational, personal.
       • Are of variable quality: curated vs. automatic.
       • Have different interfaces: Web, SQL, SOAP, etc.
       • Are geographically distributed, without yellow pages.
       • Store data in different formats, with few standards.
       • Change rapidly.
       • Are subject to confidentiality & IPR protection.
       • Are too large to transport conveniently.

  18. Technology challenges in building bioGrid
     • Semantic complexity:
       • The computer does not “understand” data.
       • DBs and systems cannot inter-operate.
     • Computational complexity:
       • Generating a protein interaction map takes ca. 1 day.
       • Analysing large sets of gene expression data can take up to an hour.
       • Analysis of large text bodies is complex.

  19. Social challenges in building Grid
     • Over-hype & scepticism.
     • Technology stability & reliability.
     • Security.
     • Usability.
     • Peer-reviewed results in major biomedical journals:
       • Science, Nature, Cell, BMJ, Lancet, etc.
