Computational metagenome analysis
Peer Bork, EMBL-Heidelberg
1. Context: Personal views on how the computational infrastructure might evolve
1.1. Soon, most data might NOT be stored in central databases (see physics)
1.2. Soon, PI’s computing resources might become insufficient
1.3. Soon, analysis tools might be too complex for local use
2. Current situation of the landscape of basic metagenome analyses
3. Potential and challenges
Metagenome analysis team, since Mar 04: aiming at discovery, though mostly working on method development
Data analysis: the signs before the flood
[Figure: timeline, 1998 to 2007. Panels: microbial genomes published per year; animal genomes (>100 Mb, published, >95% coverage); metagenomics projects, 2004 to 2007 (>50 Mb, not focused, non-16S, published, deposited). Metagenomics projects shown: Mammoth bones (454), Global Ocean Survey, Acid mine drainage, Soudan mine (454), Farm soil, Deep-sea whale bones, Human gut, Sargasso Sea, Mouse gut, Sludge, North Pacific subtropical gyre.]
Accelerated exponential increase of ORF numbers (the 100-fold decrease in sequencing costs is not visible yet). Note the log scale!
Standard analysis: ca. 14 million ORFs, all-against-all BLAST
- 12,000 days on one CPU (ca. 5 days real time at BSC, on the MareNostrum supercomputer, Barcelona)
- 9 TB output; data transfer to EMBL took 2 weeks
- Post-analysis ongoing…
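The figures on this slide imply a few back-of-envelope numbers that are worth spelling out. The sketch below only does arithmetic on the quoted values; the assumptions that 1 TB = 10^12 bytes and that the two-week transfer ran continuously are not from the slide.

```python
# Back-of-envelope arithmetic on the figures quoted on the slide.
# Assumptions (not from the slide): 1 TB = 1e12 bytes, and the
# 2-week transfer to EMBL ran continuously.

ORFS = 14_000_000        # ca. 14 million ORFs
CPU_DAYS = 12_000        # single-CPU time for the all-against-all BLAST
WALL_DAYS = 5            # real (wall-clock) time at BSC
OUTPUT_BYTES = 9e12      # 9 TB of output
TRANSFER_DAYS = 14       # data transfer to EMBL

SECONDS_PER_DAY = 86_400

# Number of ordered query/subject pairs in an all-against-all comparison.
ordered_pairs = ORFS * ORFS

# Average number of CPUs that must have been busy in parallel.
effective_cpus = CPU_DAYS / WALL_DAYS

# Average CPU time spent per ordered pair.
cpu_seconds_per_pair = CPU_DAYS * SECONDS_PER_DAY / ordered_pairs

# Sustained transfer rate implied by moving the output in two weeks.
transfer_mb_per_s = OUTPUT_BYTES / (TRANSFER_DAYS * SECONDS_PER_DAY) / 1e6

# The same job expressed in years on a single CPU.
single_cpu_years = CPU_DAYS / 365

print(f"ordered pairs:        {ordered_pairs:.2e}")
print(f"effective CPUs:       {effective_cpus:.0f}")
print(f"CPU seconds per pair: {cpu_seconds_per_pair:.2e}")
print(f"transfer rate:        {transfer_mb_per_s:.1f} MB/s")
print(f"single-CPU years:     {single_cpu_years:.1f}")
```

Under these assumptions the run corresponds to roughly 2,400 CPUs working in parallel and about 33 years on one CPU. The implied transfer rate of about 7.4 MB/s means that moving the results took almost three times as long as computing them.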
Taxonomic census of diverse environments
Great potential…
> more quantitative than 16S rRNA profiling
> reveals novelty, e.g.
  - stable habitat preferences
  - water faster than soil
but…
> analysis of 1 marker can take hours on a single CPU
[Figure labels: Archaea, Eukarya, Bacteria]
von Mering et al., Science 315(07) 1126
Computational metagenome analysis
1. Context: Personal views on how the computational infrastructure might evolve
2. Current situation of the landscape of basic metagenome analyses
2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
2.2. Few basic analysis options (e.g. web, downloads)
2.3. Analyses driven by individual PI’s research focus and expertise
2.4. Missing standards even for basic steps of annotation
2.5. Difficulties for comparative meta-analyses
3. Potential and challenges
8 studies:
- 4 assembly pipelines
- 6 ORF calling procedures
- 8 function prediction protocols
- 8 parameter settings
Increase of functional assignments (via orthologous groups) with coverage (out of a total of 20,334 OGs) … [OGs in STRING now at 40,000]
[Figure label: Orthologous groups (COGs + NOGs)]
Reasons for differences
- Biological issues: GC content, genome sizes, phylogeny, evolutionary speed, evenness/richness, functionality
- Technical issues: sampling + preparation, sequencing method, assembly + annotation, coverage, …
Tringe*, von Mering* … Bork, Hugenholtz, Rubin, Science 308(05) 554
Computational metagenome analysis
1. Context: Personal views on how the computational infrastructure might evolve
2. Current situation of the landscape of basic metagenome analyses
3. Potential and challenges
3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
3.3. High demands for integration (e.g. temporal, spatial, and context information)
CHARACTERIZED / UNCHARACTERIZED / UNCHARACTERIZABLE
The Unknown – by Donald Rumsfeld
As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.
Feb. 12, 2002, Department of Defense news briefing
Protein function prediction in metagenomics samples (taken from the STRING resource)
[Figure categories: CHARACTERIZED, next to CHARACTERIZED, UNCHARACTERIZED, next to UNCHARACTERIZED, UNCHARACTERIZABLE. Figure labels: Neighborhood, BLAST.]
Overall function predictions for >70% of environmental data!
More on the unknown … Mining for novelty in environmental data
- Homology-based: novel antibiotics biosynthesis enzyme families (PKS1)
- Neighbourhood-based: coupling of fatty acid biosynthesis and degradation via a new transcription regulator
Our functional knowledge: glass half full or half empty?
Function prediction in gene families of 1.5 million proteins from 4 environments
- Our knowledge is concentrated in large, well-established families, which contribute 65% of the ORFs
- However, many specialized functions in small gene families remain to be discovered
Method: all-against-all comparison, MCL clustering (60 bits, inflation factor 1.1)
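To make the clustering step concrete, here is a minimal pure-Python sketch of Markov clustering (MCL) on a toy set of all-against-all bit scores, using the 60-bit cutoff and inflation factor 1.1 quoted above. It is an illustration under stated assumptions, not the pipeline behind the slide: the function names, the toy scores, the self-loop weighting and the fixed iteration count are all hypothetical, and an analysis of 1.5 million proteins would need a dedicated sparse implementation.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(scores, n, min_bits=60.0, inflation=1.1, iterations=50, eps=1e-6):
    """Toy Markov clustering of n proteins.

    scores maps (i, j) index pairs to a BLAST-style bit score.
    Returns a sorted list of clusters (each a sorted list of indices).
    """
    # Symmetric weight matrix; hits below the bit-score cutoff are dropped.
    m = [[0.0] * n for _ in range(n)]
    for (i, j), bits in scores.items():
        if i != j and bits >= min_bits:
            m[i][j] = m[j][i] = max(m[i][j], bits)
    # Assumption: self-loops weighted like the strongest edge of each node
    # (1.0 for isolated nodes), so every column has a positive sum.
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize the columns.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Each row that still carries weight defines a cluster: the columns
    # (proteins) that flow into it. Identical sets are merged.
    clusters = {frozenset(j for j in range(n) if row[j] > eps) for row in m}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy input: five proteins, one hit below the 60-bit cutoff.
toy_scores = {(0, 1): 120, (0, 2): 95, (1, 2): 110, (3, 4): 80, (2, 3): 45}
print(mcl(toy_scores, n=5))
```

In this toy input the 45-bit hit between proteins 2 and 3 is discarded by the cutoff, so the two families are already separated before clustering starts. With an inflation factor as low as 1.1 the flow is sharpened only gently, which tends to produce coarse, inclusive families.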
From Phenotypic to Genomic Ecology
Applying ecology concepts to metagenomic data: Functional diversity
- Substrate assays: substrate usage rank-abundance curve (turnover vs. substrates); limited to assayed substrates
- Metagenome sequencing and analysis: genomic repertoire rank-abundance curve (abundance vs. functions); wide range of functions considered
Functional indicators (richness, evenness ► functional diversity)
Functional diversity indicators should reveal properties of community networks
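The indicators named on this slide can be sketched from a table of function abundances. The example below computes richness, a rank-abundance curve and an evenness value; the choice of the Shannon index and Pielou's evenness is an assumption made for illustration (the slide does not say which indices were used), and the function name and toy counts are hypothetical.

```python
import math


def functional_diversity(counts):
    """Summarize a function-abundance table.

    counts maps a function label (e.g. an orthologous group) to its abundance.
    Returns (richness, shannon, evenness, rank_abundance), where
    rank_abundance lists relative abundances from most to least abundant.
    """
    abundances = sorted((c for c in counts.values() if c > 0), reverse=True)
    total = sum(abundances)
    richness = len(abundances)                      # number of functions seen
    rank_abundance = [c / total for c in abundances]
    # Shannon index H = -sum(p * ln p); an assumed choice of diversity index.
    shannon = -sum(p * math.log(p) for p in rank_abundance)
    # Pielou's evenness J = H / ln(richness); defined as 0 for <= 1 function.
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness, rank_abundance


# Hypothetical toy communities with the same richness but different evenness.
even_community = {"OG_a": 25, "OG_b": 25, "OG_c": 25, "OG_d": 25}
skewed_community = {"OG_a": 85, "OG_b": 5, "OG_c": 5, "OG_d": 5}

for name, table in [("even", even_community), ("skewed", skewed_community)]:
    richness, shannon, evenness, curve = functional_diversity(table)
    print(name, richness, round(shannon, 3), round(evenness, 3), curve)
```

Both toy communities contain four functions, so richness alone cannot tell them apart. The evenness value and the shape of the rank-abundance curve do, which is why the slide lists richness and evenness together as inputs to functional diversity.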
Data integration: Reconciling species diversity with functional diversity
[Figure labels: oxic zone, low H2S, high H2S]
Computational metagenome analysis
Conclusions/Some thoughts…
- Encourage development of widely usable analysis tools to increase the impact of smaller-scale studies
- Establish a working group on how to adapt important tools to the data flood
- Encourage meta-analysis to make multiple use of the data
- Install centres not only for resources (data, CPU...) but also for analysis, requiring interactions with individual projects
- At the least, large projects would benefit from international collaborations and the combination of different skill sets
- Analysis infrastructure should go beyond the human microbiome; there might not be clear-cut borders with the environment
- Stimulate interactions with other research communities (Chemical Biology, Ecology etc.) to incorporate novel concepts into analyses
Computational metagenomics analysis: Need for widespread tools and collaborations
1. Context: Personal views on how the computational infrastructure might evolve
1.1. Soon, most data might NOT be stored in central databases (see physics)
1.2. Soon, PI’s computing resources might become insufficient
1.3. Soon, analysis tools might be too complex for local use
2. Current situation of the landscape of basic metagenome analyses
2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
2.2. Few basic analysis options (e.g. web, downloads)
2.3. Analyses driven by individual PI’s research focus and expertise
2.4. Missing standards even for basic steps of annotation
2.5. Difficulties for comparative meta-analyses
3. Potential, needs, and challenges
3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
3.3. High demands for integration (e.g. temporal, spatial, and context information)
3.4. Spread of tools (e.g. via standards, compatibility etc.) and/or merging of skill sets
Summary: Bioinformatics almost ready to go
- For comparative analysis, the impact and interdependence of several factors still need to be determined, e.g. genome size, GC content, phylogenetic spread, functional richness, evenness and diversity.
- But this seems doable, so it should be possible to adapt ecological concepts to (molecular) metagenomics data, using computational tools to describe functional diversity with unprecedented resolution.
- It will require many more parameters to be recorded, though, e.g. to cover temporal aspects (steady states are unlikely everywhere).
- Computational analysis will be THE key to integrating with chemical, medical, ecological, geological and other data.