
Scalable data mining for functional genomics and metagenomics

Scalable data mining for functional genomics and metagenomics. Curtis Huttenhower, 12-02-10. Harvard School of Public Health, Department of Biostatistics. What tools enable biological discoveries? Our job is to create computational microscopes:


Presentation Transcript


  1. Scalable data mining for functional genomics and metagenomics. Curtis Huttenhower, 12-02-10. Harvard School of Public Health, Department of Biostatistics.

  2. What tools enable biological discoveries? Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.

  3. Outline: 1. Metagenomics: Network models of microbial communities; 2. Microbial biomarkers: Metagenomics in public health; 3. Data mining: Integrating very large genomic data compendia.

  4. What’s metagenomics? Total collection of microorganisms within a community (also "microbial community" or "microbiota"). Total genomic potential of a microbial community. Study of uncultured microorganisms from the environment, which can include humans or other living hosts. Total biomolecular repertoire of a microbial community.

  5. What to do with your metagenome? Reservoir of gene and protein functional information. Comprehensive snapshot of microbial ecology and evolution. Who’s there? What are they doing? What do functional genomic data tell us about microbiomes? What can our microbiomes tell us about us?* (×10^10) Public health tool monitoring population health and interactions. Diagnostic or prognostic biomarker for host disease. *Using terabases of sequence and thousands of experimental results.

  6. The Human Microbiome Project. All healthy subjects; follow-up projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic-resistant infection… • 300 “normal” adults, 18-40 • 16S rDNA + WGS • 5 sites/18 samples + blood • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth • Skin: ears, inner elbows • Nasal cavity • Gut: stool • Vagina: introitus, mid, fornix • Reference genomes (~200+800). Kolenbrander, 2010; Hamady, 2009. 2007 - ongoing.

  7. Information provided by metagenomic assays. Microbiome data; genomic data (reference genomes); functional data (experimental models). Diagram labels: 16S reads; taxa; binning; WGS reads; orthologous clusters; functional roles; clustering; pathways/modules; pathway activity.

  8. HMP: Data → features. Diagram labels: 16S reads; taxa; orthologous clusters; genes (KOs); pathways/modules; pathways (KEGGs).

  9. HMP Organisms: Everyone and everywhere is different. Figure axes: body sites + individuals (gut, nose, mouth, arm, vagina, ear, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue) by organisms (taxa). Aerobicity, interaction with the immune system, and extracellular medium appear to be major determinants. Every microbiome is surprisingly different. Even common organisms vary tremendously in abundance among individuals. There are few, if any, organismal biotypes in health. Most organisms are rare in most places.

  10. HMP: Metabolic reconstruction. Functional seq.: KEGG + MetaCyc; CAZy, TCDB, VFDB, MEROPS… 300 subjects; 1-3 visits/subject; ~6 body sites/visit; 10-200M reads/sample; 100bp reads. BLAST. Smoothing: Witten-Bell. BLAST → Genes. Genes → Pathways: MinPath (Ye 2009). WGS reads; Genes (KOs). Taxonomic limitation: rem. paths in taxa < ave. Pathways (KEGGs); Pathways/modules. Xipe: distinguish zero/low (Rodriguez-Mueller, in review). Gap filling: c(g) = max(c(g), median).
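The gap-filling rule on this slide, c(g) = max(c(g), median), can be illustrated with a minimal sketch. It assumes the median is taken over the genes of a single pathway, which the slide does not spell out; the function name `gap_fill` and the gene labels are hypothetical and not part of any released tool.

```python
from statistics import median

def gap_fill(pathway_genes, abundance):
    """Apply c(g) = max(c(g), median) to the genes of one pathway.

    Assumption: the median is computed over this pathway's genes only.
    Genes missing from `abundance` are treated as having abundance 0.
    """
    counts = [abundance.get(g, 0.0) for g in pathway_genes]
    med = median(counts)
    # Raise any gene below the pathway median up to the median.
    return {g: max(c, med) for g, c in zip(pathway_genes, counts)}

# Hypothetical gene labels and abundances, for illustration only.
genes = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
observed = {"gene_a": 10.0, "gene_b": 12.0, "gene_d": 11.0, "gene_e": 9.0}
filled = gap_fill(genes, observed)
```

Here `gene_c` is absent from the observations, so it and any other gene below the pathway median are lifted to that median, while genes above it are left unchanged.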

  11. HMP: Metabolic reconstruction. Pathway coverage; pathway abundance.

  12. HUMAnN: Evaluation on synthetic metagenomes. High complexity, staggered, ≤90% identity. LC, stg.

  13. HMP: Metabolic reconstruction. Pathway abundance (figure axes: samples by pathways).

  14. HMP: Metabolic reconstruction. Pathway coverage (figure axes: samples by pathways). Groups: all body sites (“core”); aerobic body sites; gastrointestinal body sites.

  15. HMP: MetaCyc Coverage + Abundance

  16. HMP: Metabolism, host-microbiome interactions, and microbial taxa. >3200 gene families differential in the mucosa. >1500 upregulated outside the mucosa and not in any Actinobacterial genome. WGS; 16S.

  17. Outline: 1. Metagenomics: Network models of microbial communities; 2. Microbial biomarkers: Metagenomics in public health; 3. Data mining: Integrating very large genomic data compendia.

  18. ~2000: AML/ALL; survival; mutation; batch effects; gene expression; functional modules.

  19. ~2005: Healthy/diabetes; BMI; M/F; population structure; SNP genotypes; LD.

  20. 2010: Intervention/perturbation; healthy/IBD; temperature; location. Biological story? Independent sample ??? Cross-validate. Taxa & orthologs; niches & phylogeny. Test for correlates. Confounds/stratification/environment. Feature selection (p >> n). Multiple hypothesis correction.
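The slide lists multiple hypothesis correction as a required step without naming a method. As one common choice, and purely as an assumption for illustration, the Benjamini-Hochberg false discovery rate adjustment can be sketched as follows; the function name and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here only as a familiar example; the talk
    does not say which correction was applied.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-feature p-values (e.g. one per taxon or pathway).
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

With many more features than samples (p >> n), an adjustment of this kind keeps the expected fraction of false discoveries under control among the features reported as significant.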

  21. LEfSe: Metagenomic class comparison and explanation. LEfSe: LDA + Effect Size. Nicola Segata. Coming soon to a URL near you!

  22. LEfSe: Evaluation on synthetic data

  23. LEfSe: The TRUC murine colitis microbiota. With Wendy Garrett.

  24. MetaHIT: The gut microbiome and IBD. With Ramnik Xavier, Joshua Korzenik. 124 subjects: 99 healthy, 21 UC + 4 CD. Taxa. Qin 2010. Phymm (Brady 2009). WGS reads ReBLASTed against KEGG since published data obfuscates read counts. Genes (KOs); Pathways/modules; Pathways (KEGGs).

  25. MetaHIT: Taxonomic CD biomarkers. Figure labels: up in CD; down in CD; Firmicutes; UC; Enterobacteriaceae.

  26. MetaHIT: Functional CD biomarkers. Subset of enriched pathways in CD patients. Subset of enriched modules in CD patients. Figure labels: up in CD; down in CD; growth/replication; motility; transporters; sugar metabolism.

  27. MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome. Figure labels: up in CD; down in CD; enzyme families; inferred metabolites; growth/replication; motility; transporters; sugar metabolism.

  28. Outline: 1. Metagenomics: Network models of microbial communities; 2. Microbial biomarkers: Metagenomics in public health; 3. Data mining: Integrating very large genomic data compendia.

  29. A computational definition of functional genomics. Prior knowledge; genomic data. Gene → Function; Gene → Gene; Data → Function; Function → Function.

  30. A framework for functional genomics. 100Ms gene pairs by 1Ks datasets. P(G2-G5|Data) = 0.85. Histogram panels (y-axis: frequency): low correlation vs. high correlation; not let. vs. let.; dissim. vs. similar; low similarity vs. high similarity.
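A posterior such as P(G2-G5|Data) = 0.85 could arise from combining per-dataset evidence for a gene pair. The sketch below assumes a naive Bayes combination, where datasets are treated as conditionally independent given whether the pair is functionally related; the slide does not state this explicitly, and the function name, prior, and likelihood values are all hypothetical.

```python
def posterior_related(prior, likelihoods):
    """Posterior probability that a gene pair is functionally related.

    Assumption: naive Bayes, so each dataset contributes independently.
    `likelihoods` holds one tuple per dataset:
        (P(observation | related), P(observation | unrelated))
    """
    p_rel = prior
    p_unrel = 1.0 - prior
    for l_rel, l_unrel in likelihoods:
        p_rel *= l_rel
        p_unrel *= l_unrel
    # Normalize over the two hypotheses.
    return p_rel / (p_rel + p_unrel)

# Hypothetical numbers: two datasets, both favoring "related".
post = posterior_related(0.1, [(0.6, 0.2), (0.5, 0.25)])
```

In practice the per-dataset likelihoods would be read off histograms like those on the slide, for example the frequency of high versus low correlation among known related and unrelated pairs.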

  31. Functional network prediction and analysis. Global interaction network. HEFalMp: currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases. Carbon metabolism network; extracellular signaling network; gut community network.

  32. Meta-analysis for unsupervised functional data integration (Huttenhower 2006; Hibbs 2007; Evangelou 2007). Simple regression: all datasets are equally accurate. Random effects: variation within and among datasets and interactions.
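The contrast on this slide between assuming equally accurate datasets and modeling variation among datasets can be sketched with a random-effects pooled estimate. The code below uses the standard DerSimonian-Laird estimator as an assumption; the slide does not name the estimator used, and the function name and inputs are hypothetical.

```python
def random_effects(effects, variances):
    """Pool per-dataset effect estimates with a random-effects model.

    Assumption: DerSimonian-Laird estimate of the between-dataset
    variance tau^2. Returns (pooled_effect, tau2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]          # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity among datasets.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Re-weight with within- plus between-dataset variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2

# Hypothetical per-dataset effects (e.g. transformed correlations
# for one gene pair) and their within-dataset variances.
pooled, tau2 = random_effects([0.0, 1.0], [0.1, 0.1])
```

When the datasets agree, tau2 shrinks to zero and the result matches the fixed-effect, inverse-variance average; when they disagree, the extra variance term pulls the weights toward equality.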

  33. Meta-analysis for unsupervised functional data integration (Huttenhower 2006; Hibbs 2007; Evangelou 2007).

  34. Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. Graphle: http://huttenhower.sph.harvard.edu/graphle/

  35. Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. Graphle: http://huttenhower.sph.harvard.edu/graphle/

  36. Sleipnir: Software for scalable functional genomics. Massive datasets require efficient algorithms and implementations. • Sleipnir C++ library for computational functional genomics • Data types for biological entities • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. • Network communication, parallelization • Efficient machine learning algorithms • Generative (Bayesian) and discriminative (SVM) • And it’s fully documented! It’s also speedy: microbial data integration computation takes <3hrs.

  37. Outline: 1. Metagenomics: Network models of microbial communities • Metagenomics: structure and function of microbial communities • HMP: microbiome in health, 18 body sites in 300 subjects • HUMAnN: metagenomic metabolic and functional pathway reconstruction; 2. Microbial biomarkers: Metagenomics in public health • LEfSe: biologically relevant community differences • Iron and sugar transport as key players in the IBD microbiota; 3. Data mining: Integrating very large genomic data compendia • Network framework for scalable data integration • HEFalMp: human data integration • Meta-analysis for unsupervised functional network integration • Sleipnir: software for scalable genomic data mining.

  38. Thanks! Human Microbiome Project. George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor, Sahar Abubucker, Yuzhen Ye, Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng, Mathangi Thiagarajan, Brandi Cantarel, Maria Rivera, Barbara Methe, Bill Klimke, Daniel Haft, Dirk Gevers, Jacques Izard, Nicola Segata, Pinaki Sarder, Ramnik Xavier. HMP Metabolic Reconstruction. Wendy Garrett, Sarah Fortune, Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi, Levi Waldron, Larisa Miropolsky. Interested? We’re recruiting students and postdocs! http://huttenhower.sph.harvard.edu http://huttenhower.sph.harvard.edu/sleipnir
