
Integrated Microbial Genomes (IMG) System


Presentation Transcript


  1. Integrated Microbial Genomes (IMG) System: A Case Study in Biological Data Management
     • Different views on biological data management (VLDB 2004 Panel on Biological Data Management)
       • Computer Scientists
         • Source of problems for database research
         • Publication in database papers
         • Prototypes
       • Biologists
         • Vehicle for rapid data analysis
         • Publication in biology papers
         • Immediate solutions
     • Victor M. Markowitz, Frank Korzeniewski, Krishna Palaniappan, Ernest Szeto: Biological Data Management & Technology Center, Lawrence Berkeley National Lab
     • Nikos C. Kyrpides, Natalia N. Ivanova: Microbial Genome Analysis Program, Joint Genome Institute

  2. Biological Data Management Problem
     Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data
     • Single data type: data generation & collection
     • Multiple data types: data association

  3. Background: Microbial Genomes
     • Jan 04: 532 microbial genome projects
     • Mar 05: 847 microbial genome projects
     • Applications: healthcare, environmental cleanup, agriculture, industrial processes, alternative energy production

  4. Microbial Genome Data Analysis Context

  5. Data Analysis Example: Occurrence Profiles
     [Figure: genes x1, x2, x3, x4 of Genome X and y1, y2, y3, y4 of Genome Y mapped onto a pathway whose steps are labeled R1 (e1), R4 (e2), R3 (e3), R4 (e4); "?" marks appear in the figure]
     • Proteins from the same cellular pathway are expected to co-occur in the majority of organisms from a phylogenetic branch
     • Functionally related genes tend to cluster on the chromosome
     • Key Challenges
       • Representing abstract concepts with experimental data
       • Specifying individual and composite operations
       • Data coherence, completeness, integration
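The co-occurrence expectation on this slide can be illustrated by comparing presence/absence profiles. The following is a minimal sketch, not IMG's method: it assumes each pathway step's profile is simply the set of genomes in which a matching gene was found, and it uses Jaccard similarity as one possible measure. The profile data, the genome names, and the function names `jaccard` and `rank_candidates` are all hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two occurrence profiles (sets of genome names)."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


# Hypothetical occurrence profiles: step name -> genomes containing a matching gene.
PROFILES = {
    "e1": {"G1", "G2", "G3", "G4"},
    "e2": {"G1", "G2", "G3", "G4"},
    "e3": {"G1", "G2", "G4"},
    "unrelated": {"G5"},
}


def rank_candidates(query, profiles):
    """Rank every other profile by similarity to the query profile, best first."""
    scored = [
        (jaccard(profiles[query], profile), name)
        for name, profile in profiles.items()
        if name != query
    ]
    return sorted(scored, reverse=True)
```

Under this assumption, steps whose profiles score close to 1.0 against a known pathway member would be candidates for belonging to the same pathway; a real analysis would also have to cope with incomplete and imprecise annotations.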

  6. Microbial Genomes: Data Generation & Collection
     • Process (data processing & refinement)
       • Raw data
         • Small DNA sequence fragments
         • Assembled sequence fragments (contigs)
         • Complete (one contiguous) sequence
       • Interpreted data
         • Gene prediction (models)
         • Functional prediction (annotations)
       • Expert data validation (cleaning)
       • Expert annotations
     • Key Challenges
       • Diversity of data sources
         • Differences in models, depth/breadth of annotations
       • Consistency of the data transformation process
       • Evolution & diversity of
         • Technology platforms
         • Algorithms & parameters
         • Experimental, data collection conditions

  7. Data Transformation Process Example
     [Flowchart; the order of the steps is not recoverable from the transcript. Stage labels: Microbial Genome Annotation Pipeline (ORNL); Microbial Genome Annotation Review & Correction (JGI); IMG Load. Step labels: Fetch Sequence Data Files; ORF Calling; Preliminary Functional Annotation; Annotation Data Files; Post; Download Data For Review; Data Review; Replace Reference Genes; Download Annotation Data Files; Data Cleansing; Revised Annotation Data Files; IMG NR Report; IMG Loading; Final Review & Lock]

  8. Microbial Genomes: Data Association
     [Figure labels: Organisms; Predicted Genes; Functions]
     • Key Challenges
       • Data quality/precision for different types of data, sources
       • Transience of identifiers, relationships

  9. Biological Data Management Problem Revisited
     Effective data analysis involves combining data from multiple sources in the context of inherently imprecise data while addressing
     • Data quality
       • Data semantics, precision, integrity, provenance
     • System quality
       • Comprehensibility, performance, reliability, scalability
     • Development strategy
       • Choice of technologies
       • Devising (cost, time) effective solutions
     Callout: challenging in academic settings

  10. Needed: System Development Framework
     [Diagram; the layout is not recoverable from the transcript. Stages: Requirements Analysis; Design & Planning; Data Model Abstraction; Develop System*; Deploy System. Other labels: Requirements Specification; Time/Cost Constraints; Prototype Database, Tools; System Tools; Development Documents; Requirement Examples; Use Scenarios; Case Studies; Definitions; Plans & Schedules; Docs. * System Development: Program; Test; Document; Preliminary Release; Revise & Refine; Final Release]

  11. Requirement Analysis Example: IMG Data Analysis
     • Goal: find “unique” genes in a genome of interest Ψ0 with respect to related genomes Ψ1, …, Ψk
     • Workflow steps shown in the figure
       • Query construction
       • Query results
       • Chromosomal neighborhood analysis
       • “Similar” gene analysis
       • Collect genes of interest
       • Iterate
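The "unique genes" requirement on this slide can be sketched as a set difference. This is a simplified illustration under a strong assumption: that "similar" gene analysis has already assigned every gene to a family identifier, so a gene of Ψ0 counts as unique when its family occurs in none of Ψ1, …, Ψk. The slide does not say how IMG defines similarity; the function name `unique_genes` and all identifiers in the example are hypothetical.

```python
def unique_genes(target_genes, related_genomes):
    """Return genes of the target genome whose family occurs in no related genome.

    target_genes:    dict mapping gene id -> family id for the genome of interest.
    related_genomes: iterable of sets, each holding the family ids of one related genome.
    """
    # Pool the families seen in any related genome (empty if none are given).
    families_elsewhere = set().union(*related_genomes)
    return sorted(
        gene for gene, family in target_genes.items()
        if family not in families_elsewhere
    )


# Hypothetical example: genome of interest with three genes, two related genomes.
psi_0 = {"a1": "famA", "a2": "famB", "a3": "famC"}
psi_related = [{"famA"}, {"famB", "famD"}]
print(unique_genes(psi_0, psi_related))
```

The iterative workflow on the slide would wrap such a query in repeated rounds: inspect the results, refine the choice of related genomes or the similarity criteria, and collect the genes of interest.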

  12. Data Model Abstraction
     • Motivation
       • Adds precision
       • Allows reasoning in an established framework
     • Analogies to traditional data domain
       • Biological data modeling
       • Data warehouse concepts
       • Proven technology for large scale biological data management applications
     • Data Structure
       • Multidimensional data space
       • Gene, genome, function/pathway
     • Operations
       • Multidimensional space selections, projections, aggregations
       • Slice & dice, roll up, drill down… analogies
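The data warehouse analogy can be made concrete with a toy fact table over the three dimensions named on this slide. The sketch below is illustrative only and assumes nothing about IMG's actual schema: each fact is a hypothetical (gene, genome, pathway) tuple, "slice" fixes one dimension, and "roll up" aggregates the gene dimension away. All names and data are invented for the example.

```python
from collections import Counter

# Hypothetical fact table: one (gene, genome, pathway) tuple per annotated gene.
FACTS = [
    ("g1", "G1", "P1"),
    ("g2", "G1", "P1"),
    ("g3", "G1", "P2"),
    ("g4", "G2", "P1"),
]


def slice_by_genome(facts, genome):
    """Slice: keep only the facts for one value of the genome dimension."""
    return [fact for fact in facts if fact[1] == genome]


def roll_up_genes(facts):
    """Roll up: aggregate away the gene dimension, counting genes per (genome, pathway)."""
    return Counter((genome, pathway) for _gene, genome, pathway in facts)


def drill_down(facts, genome, pathway):
    """Drill down: list the individual genes behind one (genome, pathway) cell."""
    return sorted(g for g, gn, pw in facts if gn == genome and pw == pathway)
```

Here `roll_up_genes(FACTS)` yields one count per (genome, pathway) cell, and `drill_down` returns to the gene level for a chosen cell, mirroring the roll up and drill down analogies in the list above.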

  13. Data Model Abstraction Example: IMG Operations
     • Dimensions: Genes; Genomes; Functions/Pathways
     • Operations shown
       • Gene occurrence profile across genomes
       • Gene occurrence profiles across pathways
       • Pathways shared by genomes
     • Example selection: genes “in” G1, “in” G2, “not in” G3, “in” G4, “in” G5
     • Gene occurrence profiles across genomes

            G1  G2  G3  G4  G5
        g1   +   +   +   +   +
        g2   +   +   -   +   +
        g3   +   -   -   -   -
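The occurrence profiles and the "in"/"not in" selection on this slide can be sketched in a few lines. The data below follows the slide's + and - rows for g1, g2, and g3, read as rows over the columns G1 to G5 (an interpretation of the flattened figure). The representation as sets and the function names `profile` and `select_genes` are assumptions made for illustration, not IMG's implementation.

```python
GENOMES = ["G1", "G2", "G3", "G4", "G5"]

# Gene -> genomes in which it occurs, following the slide's profile rows.
OCCURRENCE = {
    "g1": {"G1", "G2", "G3", "G4", "G5"},
    "g2": {"G1", "G2", "G4", "G5"},
    "g3": {"G1"},
}


def profile(gene):
    """Render a gene's occurrence profile as a string of '+' and '-' over GENOMES."""
    return "".join("+" if genome in OCCURRENCE[gene] else "-" for genome in GENOMES)


def select_genes(present_in, absent_from):
    """Select genes found in every genome of present_in and in none of absent_from."""
    return sorted(
        gene for gene, genomes in OCCURRENCE.items()
        if present_in <= genomes and not (absent_from & genomes)
    )
```

With this data, the slide's example selection corresponds to `select_genes({"G1", "G2", "G4", "G5"}, {"G3"})`, which matches only g2, the one gene whose profile is `++-++`.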

  14. Data Analysis Example: Searching for Unique Genes
     [Screenshot annotations: "Parasite in horses"; "Causes human disease in tropical areas (melioidosis)"]

  15. Identifying Unique Genes of Interest
     [Screenshot annotation: "Genes involved in adherence and invasion"]

  16. Exploring Unique Gene Details

  17. Summary
     • Needed: effective solutions for academic biological data management
       • Employing appropriate technologies and methods
       • Developed within (time, cost) constraints
     • IMG Case Study
       • System development process framework essential for
         • Continuously evolving content, aiming at coherence, completeness
         • Developing meaningful data analysis tools
         • Clarity of methods, parameters, results
       • Metric for success
         • Community adoption and support
         • Increase in analysis productivity and value
