
The Integrated Microbial Genome (IMG) systems



Presentation Transcript


  1. The Integrated Microbial Genome (IMG) systems. Nikos Kyrpides

  2. Reddy • Bahador • Iain • Denis • Amrita • Billis • Peter • Marcel • OMICS GROUP • STANDARDS GROUP • ANNOTATION GROUP • Natalia • Dino • Kostas • Ioanna. Biological Data Management: Victor Markowitz, Yuri Grechkin, Ken Chu, Ernest Szeto, Krishna Palaniappan, Amy Chen, Biju Jacob

  3. Science-driven data generation and analysis: ANALYSIS • User • Facility

  4. Science-driven data generation and analysis: ANALYSIS • User • Facility

  5. Data analysis • Comparative Analysis • Data Integration

  6. What is the Matrix? Data management system for comparative analysis of biological data. [Figure: IMG at the center, linked to Genomes • Functions • Genes • Clusters • Metadata • SNPs • Proteomics • Regulons • Transcriptomes]

  7. IMG’s Mission: Become the HOME of Microbial Genomes and Metagenomes • support comparative genome analysis • support community functional annotation • provide a user-friendly interface

  8. Integrated Microbial Genomes (IMG) [It’s easier to analyze 1000 genomes than a single one] http://img.jgi.doe.gov/ • What is IMG: IMG is a data management system for comparative analysis and annotation of all publicly available genomes from three domains of life in a uniquely integrated context. • Mission: To become the Home of Microbial Genome and Metagenome Analysis • Background: Launched in March 2005; 3 releases/year, 20 releases so far; >5,000 unique visitors per month; >350 citations • Current Status: 6891 Genomes (Bacteria: 2780, Archaea: 107, Eukarya: 121, Plasmids: 1186, Viruses: 2697); 11.6 Million Genes • USERS CAN: Search data • Browse data • Compare data • Export data

  9. Why more data are needed: faster and more accurate function prediction • Fructokinase family • Ribokinase family • 2-dehydro-3-deoxyglucokinase family

  10. Metagenomic Analysis. [Figure: Soil, Sargasso Sea, Termite Hindgut, Human Gut and Acid Mine Drainage placed along a "Species complexity" axis running from 1 to 10000, annotated with "Binning" and "Reference Genomes"; Soil is marked "?"] The road to success in Metagenomics is through Microbial Genomics. Source: Susannah Tringe, JGI

  11. Availability of Reference Genomes. [Figure: Soil, Human gut, Termite Gut, Marine and Acid Mine Drainage against a "Reference Genomes" scale with marks at 100%, 60%, 50%, 40%, 20% and 1%; Soil is marked "?"]

  12. Data Model Abstraction. Example: IMG Operations over Genes, Genomes, Functions/Pathways • Gene occurrence profile across genomes • Gene occurrence profiles across pathways • Pathways shared by genomes • Genes present in G1 and absent from G2, G3, G4 and G5. Occurrence table (columns G1 G2 G3 G4 G5): g1 + + + + + ; g2 + + - + + ; g3 + - - - -
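The occurrence table on slide 12 can be read as a small presence/absence query. The following is only a minimal sketch of that idea, not IMG's implementation: the `OCCURRENCE` dictionary and the function names `profile` and `present_absent` are hypothetical, and the data are just the three example genes from the slide.

```python
# Hypothetical occurrence table copied from the slide:
# gene -> set of genomes in which the gene occurs.
OCCURRENCE = {
    "g1": {"G1", "G2", "G3", "G4", "G5"},
    "g2": {"G1", "G2", "G4", "G5"},
    "g3": {"G1"},
}
GENOMES = ["G1", "G2", "G3", "G4", "G5"]


def profile(gene, genomes):
    """Return the +/- occurrence profile of one gene across the given genomes."""
    return "".join("+" if g in OCCURRENCE[gene] else "-" for g in genomes)


def present_absent(present_in, absent_from):
    """Genes found in every genome of present_in and in none of absent_from."""
    present_in, absent_from = set(present_in), set(absent_from)
    return sorted(
        gene
        for gene, genomes in OCCURRENCE.items()
        if present_in <= genomes and not (absent_from & genomes)
    )


if __name__ == "__main__":
    for gene in OCCURRENCE:
        print(gene, profile(gene, GENOMES))
    # Genes present in G1 and absent from G2, G3, G4 and G5
    print(present_absent(["G1"], ["G2", "G3", "G4", "G5"]))
```

Storing each gene's genomes as a set keeps the query to a subset test plus an intersection test, which is one plausible way to express "present in / absent from" filters.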

  13. IMG Data Integration. Genes (11.6M): RNAs, Proteins • Sequence Clusters • Positional clusters • Regulatory clusters • Fusions • Operons • Expression. Functions: COG • GO • Pfam • TIGRfam • InterPro • KEGG • BioCyc • SEED • Protein product • MyIMG • IMG Terms • IMG Pathways • IMG Networks. Genomes (6891): Groupings • Phylogenetic • Phenotypic • Ecotypic • Disease • Geographical • Isolation. A third count, 1.1M, also appears on the slide.

  14. IMG Toolkit: Gene Synteny • Functional Categories • Projects Map • Function Profile • Abundance Profiles • Chromosome Map • Genome Clustering • IMG Pathway Profile • Metadata Search • Compare Annotations • Phylogenetic Profile • VISTA • KEGG Maps • Phylogenetic Distribution • Chromosomal Map • Recruitment Plot • Fragment Recruitment • Artemis • WRITE PAPER

  15. USERS CAN • Search data • Browse data • Compare data • Export data UNIQUE VISITS ~ 5,000 / month • USERS CAN • Submit data • Annotate data

  16. Informatics Steps & Services: support of a new user community • INTEGRATION & COMPARATIVE ANALYSIS • 2012 • ASSEMBLY • 2005 IMG • 2008 IMG-ER

  17. Data Challenges & Opportunities • Metadata • Gene calling • Annotation • Quantity • Quality Data Analysis Integration • Number of Genes • All vs all Blast • Number of Datasets • How do we navigate through a sea of data

  18. Challenges we face • DATA SIZE • DATA QUALITY • DATA STANDARDS

  19. Challenges we face • 1. DATA SIZE • Number of Genes • Number of Datasets • How do we compare data • How do we find data • How do we navigate through data

  20. ii. Method development for data reduction & comparison: Computation of Similarities. Use clusters. [Figure: Reference genomes and three Metagenomes compared through shared Clusters] • Common/unique genes • Rapid identification of best hit(s) • ….
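One way to picture "use clusters" on slide 20: if every gene already carries a cluster assignment, common and unique content between a reference genome and a metagenome reduces to set operations on cluster IDs, and best-hit search only needs to look inside the query's own cluster. This is a sketch under that assumption; the gene names, cluster IDs, and the functions `compare` and `candidates` are all hypothetical, and nothing here is taken from IMG's actual pipeline.

```python
# Hypothetical gene -> cluster assignments, assumed to come from a
# precomputed sequence clustering.
GENE_TO_CLUSTER = {
    "ref_a": "c1", "ref_b": "c2", "ref_c": "c3",
    "meta_x": "c1", "meta_y": "c3", "meta_z": "c4",
}
REFERENCE = ["ref_a", "ref_b", "ref_c"]
METAGENOME = ["meta_x", "meta_y", "meta_z"]


def clusters_of(genes):
    """Set of cluster IDs covered by a list of genes."""
    return {GENE_TO_CLUSTER[g] for g in genes}


def compare(dataset_a, dataset_b):
    """Common and unique clusters between two datasets, without pairwise alignment."""
    a, b = clusters_of(dataset_a), clusters_of(dataset_b)
    return {
        "common": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


def candidates(query_gene, target_genes):
    """Target genes in the query's cluster: a shortlist for best-hit search."""
    cluster = GENE_TO_CLUSTER[query_gene]
    return sorted(g for g in target_genes if GENE_TO_CLUSTER[g] == cluster)


if __name__ == "__main__":
    print(compare(REFERENCE, METAGENOME))
    print(candidates("meta_x", REFERENCE))
```

The point of the sketch is the cost model: comparison scales with the number of clusters rather than with the number of gene pairs.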

  21. SCALING: Computation of Similarities

  22. Strain / species diversity

  23. Pangenomes • Prochlorococcus marinus Pangenome (10) • Listeria monocytogenes Pangenome (17) • Staphylococcus aureus Pangenome (15) • We need better ways to: represent and browse through thousands of genomes; represent an organism

  24. Metagenome Analysis with Pangenomes • Best Blast Hit • Reference Genome • Pangenome

  25. Challenges we face • 2. DATA QUALITY • Did we generate enough data to support biological conclusions? • Did we introduce any biases during sequencing? • Is the quality of assembly comparable between different datasets? • Is the quality of predicted genes comparable between different datasets? • Is the quality of functional annotation comparable between different datasets?

  26. Microbial Genomes Gene Prediction Quality Assurance: GenePRIMP (Gene Prediction Improvement Pipeline), http://geneprimp.jgi-psf.org. GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features. APPLICATIONS • Identify gene prediction anomalies • Benchmark the quality of gene prediction algorithms • Benchmark the quality of combination / coverage of sequencing platforms • Improve the sequence quality. Pati A. et al. (2010) Nature Methods. Amrita, Natalia

  27. Challenges we face • 3. DATA STANDARDS • Assembly • Gene Finding • Functional Annotation • Metadata

  28. Project Catalog & Metadata: Genomes OnLine Database. I. Pagani, D. Liolios

  29. COMPUTATIONS. M5: Pilot Project with ANL, innovation through collaboration. Building a roadmap for a scalable and sustainable computing MetaInfrastructure for the metagenomics community • Develop standards to share and process data more effectively • Run data-intensive workflows once (reduce wasted cycles) • Develop a single QC data processing pipeline • Develop a single data submission entry • Develop a single data processing pipeline • Develop a common project catalog

  30. Standards in Genomic Sciences http://standardsingenomics.org

  31. Ongoing Developments. New Data & Tools for Visualization & Analysis of: • Integration of Expression data • Integration of Regulatory Data • Resequencing data (strain variation) • Pangenomes. Data Processing: • Short Read annotation • Bypass the all vs all Blast bottleneck
