Computational metagenome analysis

Computational metagenome analysis. Peer Bork, EMBL-Heidelberg. 1. Context: Personal views on how the computational infrastructure might evolve. 1.1. Soon, most data might NOT be stored in central databases (see physics). 1.2. Soon, PI's computing resources might become insufficient.


Presentation Transcript


  1. Computational metagenome analysis
     Peer Bork, EMBL-Heidelberg
     1. Context: Personal views on how the computational infrastructure might evolve
        1.1. Soon, most data might NOT be stored in central databases (see physics)
        1.2. Soon, PI's computing resources might become insufficient
        1.3. Soon, analysis tools might be too complex for local use
     2. Current landscape of basic metagenome analyses
     3. Potential and challenges
     Metagenome analysis team, since Mar 04: aiming at discovery, though mostly working on method development

  2. Data analysis: the signs before the flood
     [Figure: timeline, 1998 to 2007, of microbial genomes published per year, animal genomes (>100 Mb, published, >95% coverage), and metagenomics projects (>50 Mb, not focused, non-16S, published, deposited). Metagenomics projects shown for 2004 to 2007: Mammoth bones (454), Global Ocean Survey, Acid mine drainage, Soudan mine (454), Farm soil, Deep-sea whale bones, Human gut, Sargasso Sea, Mouse gut, Sludge, North Pacific subtropical gyre]

  3. Accelerated exponential increase of ORF numbers (100-fold decrease in sequencing costs not visible yet). Log scale!

  4. Computational metagenome analysis
     1. Context: Personal views on how the computational infrastructure might evolve
        1.1. Soon, most data might NOT be stored in central databases (see physics)
        1.2. Soon, PI's computing resources might become insufficient
        1.3. Soon, analysis tools might be too complex for local use
     2. Current landscape of basic metagenome analyses
     3. Potential, challenges, examples

  5. Standard analysis: ca. 14 million ORFs, all-against-all BLAST
     - 12,000 days on one CPU (ca. 5 days real time at BSC, on the MareNostrum supercomputer, Barcelona)
     - 9 TB output; data transfer to EMBL: 2 weeks
     - Post-analysis ongoing…
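The figures on this slide imply a degree of parallelism and a transfer rate that are easy to check by hand. The following is a back-of-envelope sketch only: it assumes perfect parallel scaling, decimal units (1 TB = 10^6 MB), and a transfer sustained evenly over the full two weeks. None of these assumptions is stated in the talk, and the function names are made up for illustration.

```python
# Back-of-envelope arithmetic from the numbers quoted on the slide.
N_ORFS = 14_000_000      # ca. 14 million ORFs
CPU_DAYS = 12_000        # all-against-all BLAST time on one CPU
WALL_DAYS = 5            # real (wall-clock) time at BSC
OUTPUT_TB = 9            # size of the BLAST output
TRANSFER_DAYS = 14       # 2 weeks to move the output to EMBL


def implied_cpus(cpu_days, wall_days):
    """CPUs needed if the work scaled perfectly (an assumption)."""
    return cpu_days / wall_days


def pairwise_comparisons(n_sequences):
    """Ordered query/subject pairs in an all-against-all search."""
    return n_sequences * n_sequences


def transfer_rate_mb_per_s(terabytes, days):
    """Average rate in MB/s, assuming decimal units and a steady transfer."""
    return terabytes * 1e6 / (days * 86_400)


if __name__ == "__main__":
    print(f"implied CPUs:        {implied_cpus(CPU_DAYS, WALL_DAYS):,.0f}")
    print(f"pairwise comparisons: {pairwise_comparisons(N_ORFS):.2e}")
    print(f"average transfer:     {transfer_rate_mb_per_s(OUTPUT_TB, TRANSFER_DAYS):.1f} MB/s")
```

Under these assumptions the run corresponds to roughly 2,400 CPUs in parallel, and the data transfer averages only a few MB/s, which illustrates why moving results can take longer than computing them.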

  6. Computational metagenome analysis
     1. Context: Personal views on how the computational infrastructure might evolve
        1.1. Soon, most data might NOT be stored in central databases (see physics)
        1.2. Soon, PI's computing resources might become insufficient
        1.3. Soon, analysis tools might be too complex for local use
     2. Current landscape of basic metagenome analyses
     3. Potential and challenges

  7. Taxonomic census of diverse environments
     Great potential…
     > more quantitative than 16S rRNA profiling
     > reveals novelty, e.g.
       - stable habitat preferences
       - water faster than soil
     but…
     > analysis of 1 marker can take hours on a single CPU
     [Figure labels: Archaea, Eukarya, Bacteria]
     Von Mering et al., Science 315 (2007) 1126

  8. Computational metagenome analysis
     1. Context: Personal views on how the computational infrastructure might evolve
     2. Current landscape of basic metagenome analyses
        2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
        2.2. Few basic analysis options (e.g. web, downloads)
        2.3. Analyses driven by individual PI's research focus and expertise
        2.4. Missing standards even for basic steps of annotation
        2.5. Difficulties for comparative meta-analyses
     3. Potential and challenges

  9. 8 studies: 4 assembly pipelines, 6 ORF-calling procedures, 8 function prediction protocols, 8 parameter settings

  10. Computational metagenome analysis
     1. Context: Personal views on how the computational infrastructure might evolve
     2. Current landscape of basic metagenome analyses
        2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
        2.2. Few basic analysis options (e.g. web, downloads)
        2.3. Analyses driven by individual PI's research focus and expertise
        2.4. Missing standards even for basic steps of annotation
        2.5. Difficulties for comparative meta-analyses
     3. Potential and challenges

  11. Increase of functional assignments (via orthologous groups) with coverage (out of a total of 20,334 OGs) … [OGs in STRING now at 40,000]
     [Figure axis: Orthologous groups (COGs + NOGs)]
     Reasons for differences:
     - Biological issues: GC content, genome sizes, phylogeny, evolutionary speed, evenness/richness, functionality
     - Technical issues: sampling + preparation, sequencing method, assembly + annotation, coverage, …
     Tringe*, von Mering* … Bork, Hugenholtz, Rubin, Science 308 (2005) 554

  12. Computational metagenome analysis
     1. Context: Personal views on how the computational infrastructure might evolve
     2. Current landscape of basic metagenome analyses
     3. Potential and challenges
        3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
        3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
        3.3. High demands for integration (e.g. temporal, spatial, and context information)

  13. CHARACTERIZED / UNCHARACTERIZED / UNCHARACTERIZABLE
     The Unknown – by Donald Rumsfeld
       As we know,
       There are known knowns.
       There are things we know we know.
       We also know
       There are known unknowns.
       That is to say
       We know there are some things
       We do not know.
       But there are also unknown unknowns,
       The ones we don't know
       We don't know.
     Feb. 12, 2002, Department of Defense news briefing

  14. Protein function prediction in metagenomics samples (taken from the STRING resource)
     [Figure: proteins classed as CHARACTERIZED, next to CHARACTERIZED, UNCHARACTERIZED, next to UNCHARACTERIZED, or UNCHARACTERIZABLE, with predictions from Blast and from Neighborhood]
     Overall function predictions for >70% of environmental data!

  15. More on the unknown … Mining for novelty in environmental data
     - Homology-based: novel antibiotics biosynthesis enzyme families (PKS1)
     - Neighbourhood-based: coupling of fatty acid biosynthesis and degradation via a new transcription regulator

  16. Our functional knowledge: glass half full or half empty?
     Function prediction in gene families of 1.5 million proteins from 4 environments (all-against-all, MCL clustering; 60 bits, inflation factor 1.1)
     Our knowledge concentrates in large, well-established families contributing 65% of the ORFs; however, many specialized functions in small gene families remain to be discovered
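The clustering step named on this slide (all-against-all hits, a 60-bit cutoff, then MCL with inflation 1.1) can be illustrated with a toy version of Markov clustering. This is a minimal sketch, not the pipeline used in the talk. The protein names and bit scores are invented, raw bit scores are used as edge weights, self-loops are set to the largest weight in each column, and the iteration count and pruning threshold are arbitrary. All of these are assumptions for illustration; real analyses would use the `mcl` program on sparse matrices.

```python
def normalize(m):
    """Rescale every column of a square matrix so that it sums to 1."""
    n = len(m)
    col = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two steps of the random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(hits, min_bits=60.0, inflation=1.1, iterations=100, eps=1e-6):
    """Toy Markov clustering of (query, subject, bit_score) hits."""
    nodes = sorted({p for q, s, _ in hits for p in (q, s)})
    idx = {p: i for i, p in enumerate(nodes)}
    n = len(nodes)
    w = [[0.0] * n for _ in range(n)]
    for q, s, bits in hits:
        if q != s and bits >= min_bits:      # drop hits below the cutoff
            i, j = idx[q], idx[s]
            w[i][j] = w[j][i] = max(w[i][j], bits)
    for j in range(n):                       # self-loops (illustrative choice)
        w[j][j] = max(max(w[i][j] for i in range(n)), 1.0)
    m = normalize(w)
    for _ in range(iterations):
        # Inflation step: raise entries to a power, then renormalize columns.
        m = normalize([[x ** inflation for x in row] for row in expand(m)])
    # Read out clusters: join i and j wherever flow from j to i survives.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    clusters = {}
    for i, name in enumerate(nodes):
        clusters.setdefault(find(i), []).append(name)
    return sorted(clusters.values())


# Hypothetical hits: (query, subject, bit score).
EXAMPLE_HITS = [("a", "b", 200), ("b", "c", 150), ("a", "c", 90),
                ("d", "e", 120), ("c", "d", 45), ("f", "a", 30)]

if __name__ == "__main__":
    print(mcl(EXAMPLE_HITS))
```

In this toy input the two hits below 60 bits are discarded, so `f` ends up as a singleton family and the `c`/`d` link does not merge the two larger families. In MCL, higher inflation values generally give finer clusters, so a value as low as 1.1 favors large, inclusive families.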

  17. Computational metagenome analysis
     1. Context: Personal views on how the computational infrastructure might evolve
     2. Current landscape of basic metagenome analyses
     3. Potential and challenges
        3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
        3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
        3.3. High demands for integration (e.g. temporal, spatial, and context information)

  18. From Phenotypic to Genomic Ecology. Applying ecology concepts to metagenomic data: functional diversity
     - Substrate assays: substrate usage rank-abundance curve (turnover vs. substrates); limited to assayed substrates
     - Metagenome sequencing and analysis: genomic repertoire rank-abundance curve (abundance vs. functions); wide range of functions considered
     - Functional indicators (richness, evenness → functional diversity)
     Functional diversity indicators should reveal properties of community networks
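The indicators named on this slide (richness, evenness, rank-abundance curve) can be computed from a table of function counts. The slide does not say which indices were used. The sketch below assumes the Shannon index and Pielou's evenness, which are common choices in ecology, and the orthologous-group identifiers and counts are hypothetical.

```python
import math


def rank_abundance(counts):
    """Relative abundances sorted from most to least abundant function."""
    total = sum(counts.values())
    return sorted((c / total for c in counts.values() if c > 0), reverse=True)


def richness(counts):
    """Number of distinct functions observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon diversity H = -sum(p * ln p) over observed functions."""
    return -sum(p * math.log(p) for p in rank_abundance(counts))


def pielou_evenness(counts):
    """Pielou's J = H / ln(richness); defined as 0.0 for fewer than 2 functions."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


# Hypothetical read counts per orthologous group in two samples.
EVEN_SAMPLE = {"OG1": 25, "OG2": 25, "OG3": 25, "OG4": 25}
SKEWED_SAMPLE = {"OG1": 97, "OG2": 1, "OG3": 1, "OG4": 1}

if __name__ == "__main__":
    for name, sample in [("even", EVEN_SAMPLE), ("skewed", SKEWED_SAMPLE)]:
        print(name, richness(sample), round(pielou_evenness(sample), 3),
              [round(p, 2) for p in rank_abundance(sample)])
```

Both toy samples have the same richness, but the skewed one has a much lower evenness, which is the kind of difference a rank-abundance curve makes visible.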

  19. Computational metagenome analysis
     1. Context: Personal views on how the computational infrastructure might evolve
     2. Current landscape of basic metagenome analyses
     3. Potential and challenges
        3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
        3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
        3.3. High demands for integration (e.g. temporal, spatial, and context information)

  20. Data integration: Reconciling species diversity with functional diversity
     [Figure labels: oxic zone, low H2S, high H2S]

  21. Computational metagenome analysis: Conclusions / some thoughts…
     - Encourage development of widely usable analysis tools to increase the impact of smaller-scale studies
     - Establish a working group on how to adapt important tools to the data flood
     - Encourage meta-analysis to make multiple use of the data
     - Install centres not only for resources (data, CPU...) but also for analysis, requiring interactions with individual projects
     - Large projects, at least, would benefit from international collaborations and a combination of different skill sets
     - Analysis infrastructure should go beyond the human microbiome; there might not be clear-cut borders with the environment
     - Stimulate interactions with other research communities (Chemical Biology, Ecology etc.) to incorporate novel concepts into analyses

  22. METAHIT consortium

  23. Computational metagenomics analysis: Need for widespread tools and collaborations
     1. Context: Personal views on how the computational infrastructure might evolve
        1.1. Soon, most data might NOT be stored in central databases (see physics)
        1.2. Soon, PI's computing resources might become insufficient
        1.3. Soon, analysis tools might be too complex for local use
     2. Current landscape of basic metagenome analyses
        2.1. Few emerging public resources (e.g. JGI, CALIT, EBI)
        2.2. Few basic analysis options (e.g. web, downloads)
        2.3. Analyses driven by individual PI's research focus and expertise
        2.4. Missing standards even for basic steps of annotation
        2.5. Difficulties for comparative meta-analyses
     3. Potential, needs, and challenges
        3.1. Better tools dramatically improve efficiency (e.g. in function annotation up to 100%)
        3.2. Meta-analysis essential to bridge to other fields (e.g. ecology, geochemistry)
        3.3. High demands for integration (e.g. temporal, spatial, and context information)
        3.4. Spread of tools (e.g. via standards, compatibility etc.) and/or merging of skill sets

  24. Summary: Bioinformatics almost ready to go
     For comparative analysis, the impact and interdependence of several factors still need to be determined, e.g. genome size, GC content, phylogenetic spread, functional richness, evenness and diversity.
     But this seems doable, and thus it should be possible to adapt ecological concepts to (molecular) metagenomics data, using computational tools to describe functional diversity with unprecedented resolution.
     It will require many more parameters to be recorded, though, e.g. to cover temporal aspects (steady states are unlikely everywhere).
     Computational analysis will be THE key to integrating with chemical, medical, ecological, geological etc. data.
