The Integrated Microbial Genome (IMG) systems Nikos Kyrpides
Reddy • Bahador • Iain • Denis • Amrita • Billis • Peter • Marcel • OMICS GROUP • STANDARDS GROUP • ANNOTATION GROUP • Natalia • Dino • Kostas • Ioanna
Biological Data Management: Victor Markowitz • Yuri Grechkin • Ken Chu • Ernest Szeto • Krishna Palaniappan • Amy Chen • Biju Jacob
Science driven data generation and analysis ANALYSIS • User • Facility
Data analysis Comparative Analysis Data Integration
What is the Matrix? Data management system for comparative analysis of biological data • IMG integrates: Genomes • Functions • Genes • Clusters • Metadata • SNPs • Proteomics • Regulons • Transcriptomes
IMG’s Mission: Become the HOME of Microbial Genomes and Metagenomes • support comparative genome analysis • support community functional annotation • provide a user-friendly interface
Integrated Microbial Genomes (IMG) [It’s easier to analyze 1000 genomes than a single one] http://img.jgi.doe.gov/
Bacteria: 2780 • Archaea: 107 • Eukarya: 121 • Plasmids: 1186 • Viruses: 2697
What is IMG: IMG is a data management system for comparative analysis and annotation of all publicly available genomes from the three domains of life in a uniquely integrated context.
Mission: To become the Home of Microbial Genome and Metagenome Analysis
Background: • Launched in March 2005 • 3 releases/year, 20 releases so far • >5,000 unique visitors per month • >350 citations
Current Status: • 6891 Genomes • 11.6 Million Genes
USERS CAN • Search data • Browse data • Compare data • Export data
Why more data are needed: faster and more accurate function prediction • Fructokinase family • Ribokinase family • 2-dehydro-3-deoxygluconokinase family
Metagenomic Analysis: Binning? [Figure: Soil, Sargasso Sea, Termite Hindgut, Human Gut and Acid Mine Drainage placed along a species-complexity axis (1, 10, 100, 1000, 1000s, 10000), with Reference Genomes] The road to success in Metagenomics is through Microbial Genomics. Source: Susannah Tringe, JGI
Availability of Reference Genomes? [Figure: Soil, Human gut, Termite Gut, Marine and Acid Mine Drainage against a Reference Genomes scale of 100%, 60%, 50%, 40%, 20%, 1%]
Data Model Abstraction Example: IMG Operations (Genes • Genomes • Functions/Pathways)
• Gene occurrence profile across genomes • Gene occurrence profiles across pathways • Pathways shared by genomes • Genes present in G1 and absent from G2, G3, G4 and G5
      G1  G2  G3  G4  G5
g1    +   +   +   +   +
g2    +   +   -   +   +
g3    +   -   -   -   -
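The "present in G1, absent from G2 to G5" query can be sketched over the occurrence profile shown on this slide. This is only a minimal illustration: the `+`/`-` strings are copied from the slide, while the function names and the dictionary layout are assumptions and not IMG's actual implementation.

```python
# Presence (+) / absence (-) of genes g1..g3 across genomes G1..G5,
# copied from the occurrence profile on the slide.
GENOMES = ["G1", "G2", "G3", "G4", "G5"]
PROFILES = {
    "g1": "+++++",
    "g2": "++-++",
    "g3": "+----",
}

def occurrence(profile):
    """Map a '+/-' string to {genome: present?} (hypothetical helper)."""
    return {genome: mark == "+" for genome, mark in zip(GENOMES, profile)}

def genes_present_absent(profiles, present_in, absent_from):
    """Genes found in every genome of present_in and in none of absent_from."""
    hits = []
    for gene, profile in profiles.items():
        occ = occurrence(profile)
        if all(occ[g] for g in present_in) and not any(occ[g] for g in absent_from):
            hits.append(gene)
    return hits

result = genes_present_absent(PROFILES, ["G1"], ["G2", "G3", "G4", "G5"])
print(result)  # ['g3']
```

Only g3 satisfies the query, since g1 and g2 also occur in other genomes.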
IMG Data Integration
Genes: RNAs, Proteins • Sequence Clusters • Positional clusters • Regulatory clusters • Fusions • Operons • Expression
Functions: COG • GO • Pfam • TIGRfam • InterPro • KEGG • BioCyc • SEED • Protein product • MyIMG • IMG Terms • IMG Pathways • IMG Networks
Genomes: Groupings • Phylogenetic • Phenotypic • Ecotypic • Disease • Geographical • Isolation
Counts: 11.6M genes • 6891 genomes • 1.1M
IMG Toolkit • Gene Synteny • Functional Categories • Projects Map • Function Profile • Abundance Profiles • Chromosome Map • Genome Clustering • IMG Pathway Profile • Metadata Search • Compare Annotations • Phylogenetic Profile • VISTA • KEGG Maps • Phylogenetic Distribution • Chromosomal Map • Recruitment Plot • Fragment Recruitment • Artemis • WRITE PAPER
USERS CAN • Search data • Browse data • Compare data • Export data UNIQUE VISITS ~ 5,000 / month • USERS CAN • Submit data • Annotate data
Informatics Steps & Services: support of a new user community • INTEGRATION & COMPARATIVE ANALYSIS • 2012 ASSEMBLY • 2005 IMG • 2008 IMG-ER
Data Challenges & Opportunities • Metadata • Gene calling • Annotation • Quantity • Quality • Data Analysis • Integration • Number of Genes • All-vs-all BLAST • Number of Datasets • How do we navigate through a sea of data?
Challenges we face • DATA SIZE • DATA QUALITY • DATA STANDARDS
Challenges we face • 1. DATA SIZE • Number of Genes • Number of Datasets • How do we compare data? • How do we find data? • How do we navigate through data?
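To give a sense of scale for the "Number of Genes" challenge, the arithmetic below counts the pairwise comparisons an all-vs-all search would need at the gene count quoted in this deck (11.6 million). It assumes unordered pairs with self-comparisons excluded; the function name is illustrative only.

```python
def all_vs_all_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs in an all-vs-all comparison (no self-pairs)."""
    return n_genes * (n_genes - 1) // 2

n = 11_600_000  # gene count quoted for IMG in this deck
pairs = all_vs_all_pairs(n)
print(f"{pairs:.2e} pairwise comparisons")

# Quadratic growth: doubling the gene count roughly quadruples the work.
print(all_vs_all_pairs(2 * n) / pairs)
```

At this size the count is on the order of 10^13 pairs, and because it grows quadratically each new release makes the full comparison disproportionately more expensive.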
ii. Method development for data reduction & comparison: Computation of Similarities • Use clusters • 2. Computation of similarities [Figure: Reference genomes and several Metagenomes linked through Clusters] • Common/unique genes • Rapid identification of best hit(s) • ….
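One way to read "use clusters" is that genes are first assigned to precomputed clusters, so datasets can be compared by cluster membership instead of gene by gene. The sketch below illustrates that idea for the "common/unique genes" bullet. All gene names, cluster IDs and dataset names are hypothetical, and the clustering step itself is assumed to have been done elsewhere.

```python
# Hypothetical gene -> cluster assignments; in practice these would come from a
# precomputed sequence clustering, which this sketch does not perform.
GENE_TO_CLUSTER = {
    "refA_1": "c1", "refA_2": "c2", "refA_3": "c3",
    "meta1_1": "c1", "meta1_2": "c4",
    "meta2_1": "c2", "meta2_2": "c1", "meta2_3": "c5",
}
# Hypothetical datasets: one reference genome and two metagenomes.
DATASETS = {
    "reference": ["refA_1", "refA_2", "refA_3"],
    "metagenome1": ["meta1_1", "meta1_2"],
    "metagenome2": ["meta2_1", "meta2_2", "meta2_3"],
}

def cluster_set(dataset):
    """Clusters represented by at least one gene of the dataset."""
    return {GENE_TO_CLUSTER[gene] for gene in DATASETS[dataset]}

def common_and_unique(a, b):
    """Return (clusters in both, clusters only in a, clusters only in b)."""
    ca, cb = cluster_set(a), cluster_set(b)
    return ca & cb, ca - cb, cb - ca

common, only_ref, only_meta = common_and_unique("reference", "metagenome2")
print(sorted(common), sorted(only_ref), sorted(only_meta))
```

Set operations on cluster IDs cost time proportional to the number of clusters, which is the kind of data reduction the slide points to; real similarity scoring would still be needed within shared clusters.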
Pangenomes • Prochlorococcus marinus Pangenome (10) • Listeria monocytogenes Pangenome (17) • Staphylococcus aureus Pangenome (15) • We need better ways to • represent and browse through thousands of genomes • represent an organism
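As a minimal sketch of how a pangenome can stand in for many strains, the example below assumes the common definition: the pangenome is the union of gene families across strains and the core genome is their intersection. The slides do not specify how IMG builds its pangenomes, and the strain names and family IDs here are made up.

```python
# Hypothetical gene-family content of three strains of one species.
STRAINS = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam3", "fam5"},
}

def pangenome(strains):
    """All gene families seen in at least one strain (union)."""
    return set().union(*strains.values())

def core_genome(strains):
    """Gene families shared by every strain (intersection)."""
    return set.intersection(*strains.values())

pan = pangenome(STRAINS)
core = core_genome(STRAINS)
accessory = pan - core  # families present in some strains but not all

print(len(pan), sorted(core), sorted(accessory))
```

Collapsing several strains into one such set gives a single entry per organism to browse or to use as a search target, at the cost of hiding which strain carries which accessory family.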
Metagenome Analysis with Pangenomes • Best BLAST Hit • Reference Genome • Pangenome
Challenges we face • 2. DATA QUALITY • Did we generate enough data to support biological conclusions? • Did we introduce any biases during sequencing? • Is the quality of assembly comparable between different datasets? • Is the quality of predicted genes comparable between different datasets? • Is the quality of functional annotation comparable between different datasets?
Microbial Genomes Gene Prediction Quality Assurance • GenePRIMP (Gene Prediction Improvement Pipeline) http://geneprimp.jgi-psf.org • GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features. APPLICATIONS • Identify gene prediction anomalies • Benchmark the quality of gene prediction algorithms • Benchmark the quality of combination / coverage of sequencing platforms • Improve the sequence quality • Pati A. et al. (2010) Nature Methods • Amrita • Natalia
Challenges we face • 3. DATA STANDARDS • Assembly • Gene Finding • Functional Annotation • Metadata
Project Catalog & Metadata: Genomes OnLine Database • I. Pagani • D. Liolios
COMPUTATIONS • M5: Pilot Project with ANL • innovation through collaboration • Building a roadmap for a scalable and sustainable computing MetaInfrastructure for the metagenomics community • Develop standards to share and process data more effectively • Run data-intensive workflows once (reduce wasted cycles) • Develop a single QC data processing pipeline • Develop a single data submission entry • Develop a single data processing pipeline • Develop a common project catalog
Ongoing Developments
New Data & Tools for Visualization & Analysis: • Integration of Expression data • Integration of Regulatory data • Resequencing data (strain variation) • Pangenomes
Data Processing: • Short Read annotation • Bypass the all-vs-all BLAST bottleneck