
Scalable data mining for functional genomics and metagenomics

Curtis Huttenhower, 09-16-10. Harvard School of Public Health, Department of Biostatistics.


Presentation Transcript


  1. Scalable data mining for functional genomics and metagenomics. Curtis Huttenhower, 09-16-10. Harvard School of Public Health, Department of Biostatistics.

  2. Greatest discoveries in biology? Our job is to create computational microscopes: To ask and answer specific biological questions using millions of experimental results

  3. Outline 1. Data mining: Integrating very large genomic data compendia 2. Metagenomics: Network models of microbial communities

  4. A computational definition of functional genomics. Prior knowledge; Genomic data. Gene ↓ Function; Gene ↓ Gene; Data ↓ Function; Function ↓ Function

  5. A framework for functional genomics. 100Ms gene pairs → ← 1Ks datasets. P(G2-G5|Data) = 0.85. [Figure: per-dataset frequency histograms over Low Correlation / High Correlation, Not let. / Let., and Dissim. / Similar, combined ("= +") on Low Similarity / High Similarity and Low Correlation / High Correlation scales]
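
The posterior on slide 5, P(G2-G5|Data), combines evidence from many datasets for one gene pair. A minimal sketch of that kind of integration as a naive Bayes classifier is given below. The function name, the prior, and the likelihood tables are hypothetical values chosen for illustration; none of them come from the talk, and the talk does not confirm that this is the exact model used.

```python
import math

def posterior_functional(prior, likelihoods, observations):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior        -- assumed P(related) before seeing any data
    likelihoods  -- one dict per dataset: bin -> (P(bin | related), P(bin | unrelated))
    observations -- one observed bin per dataset, or None if the pair is missing
    """
    log_odds = math.log(prior / (1.0 - prior))
    for table, obs in zip(likelihoods, observations):
        if obs is None:
            continue  # a dataset without this pair contributes no evidence
        p_related, p_unrelated = table[obs]
        log_odds += math.log(p_related / p_unrelated)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical likelihood tables for two discretized datasets.
likelihoods = [
    {"low": (0.3, 0.7), "high": (0.7, 0.3)},            # e.g. a coexpression dataset
    {"dissimilar": (0.4, 0.6), "similar": (0.6, 0.4)},  # e.g. a similarity dataset
]

p = posterior_functional(0.1, likelihoods, ["high", "similar"])
print(round(p, 4))
```

Working in log-odds keeps the product of many small likelihood ratios numerically stable, which matters when thousands of datasets are combined.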

  6. Functional network prediction and analysis. Global interaction network. HEFalMp: Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases. Carbon metabolism network; Extracellular signaling network; Gut community network

  7. Functional network prediction from diverse microbial data. 486 bacterial expression experiments; 310 postprocessed datasets; 304 normalized coexpression networks in 27 species; 876 raw datasets; 307 bacterial interaction experiments; 114,786 postprocessed interactions; Integrated functional interaction networks in 15 species; 154,796 raw interactions. E. coli Integration ← Precision ↑, Recall ↓

  8. Meta-analysis for unsupervised functional data integration (Huttenhower 2006, Hibbs 2007, Evangelou 2007). Simple regression: All datasets are equally accurate. Random effects: Variation within and among datasets and interactions

  9. Meta-analysis for unsupervised functional data integration (Huttenhower 2006, Hibbs 2007, Evangelou 2007)
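
The contrast on slide 8 between treating all datasets as equally accurate and modelling variation within and among datasets can be illustrated with a small random-effects pooling of per-dataset correlations for one gene pair. This is a sketch under stated assumptions: it uses Fisher's z-transform and the DerSimonian-Laird estimator of between-dataset variance, which are standard meta-analysis tools swapped in for illustration and not confirmed to be the method used in the talk. The function name and the example numbers are hypothetical.

```python
import math

def random_effects_correlation(correlations, sample_sizes):
    """Pool one gene pair's correlations across datasets (each needs n > 3).

    Returns (pooled correlation, tau2), where tau2 is the estimated
    between-dataset variance on the Fisher z scale.
    """
    z = [math.atanh(r) for r in correlations]   # Fisher z-transform
    v = [1.0 / (n - 3) for n in sample_sizes]   # within-dataset variance of z
    w = [1.0 / vi for vi in v]                  # fixed-effect weights

    fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - fixed) ** 2 for wi, zi in zip(w, z))  # heterogeneity
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add the between-dataset variance to each dataset.
    w_star = [1.0 / (vi + tau2) for vi in v]
    pooled_z = sum(wi * zi for wi, zi in zip(w_star, z)) / sum(w_star)
    return math.tanh(pooled_z), tau2

# Hypothetical correlations of one gene pair in three datasets of 50 conditions.
pooled, tau2 = random_effects_correlation([0.9, 0.1, -0.2], [50, 50, 50])
print(pooled, tau2)
```

When the datasets agree, tau2 collapses to zero and the estimate reduces to the fixed-effect (inverse-variance) average.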

  10. Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. Graphle: http://huttenhower.sph.harvard.edu/graphle/

  11. Unsupervised data integration: TB virulence and ESX-1 secretion. With Sarah Fortune. X ? Graphle: http://huttenhower.sph.harvard.edu/graphle/

  12. Predicting gene function Predicted relationships between genes Low Confidence High Confidence Cell cycle genes

  13. Predicting gene function Predicted relationships between genes Low Confidence High Confidence Cell cycle genes

  14. Predicting gene function Predicted relationships between genes Low Confidence High Confidence These edges provide a measure of how likely a gene is to specifically participate in the process of interest. Cell cycle genes

  15. Comprehensive validation of computational predictions With David Hess, Amy Caudy Genomic data Prior knowledge Computational Predictions of Gene Function SPELL Hibbs et al 2007 bioPIXIE Myers et al 2005 MEFIT Retraining Genes predicted to function in mitochondrion organization and biogenesis New known functions for correctly predicted genes Laboratory Experiments Growth curves Petite frequency Confocal microscopy

  16. Evaluating the performance of computational predictions Genes involved in mitochondrion organization and biogenesis 106 Original GO Annotations 135 Under-annotations 82 Novel Confirmations, First Iteration 17 Novel Confirmations, Second Iteration 340 total: >3x previously known genes in ~5 person-months

  17. Evaluating the performance of computational predictions Genes involved in mitochondrion organization and biogenesis Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated. 106 Original GO Annotations 95 Under-annotations 40 Confirmed Under-annotations 80 Novel Confirmations First Iteration 17 Novel Confirmations Second Iteration 340 total: >3x previously known genes in ~5 person-months

  18. Functional mapping: mining integrated networks Predicted relationships between genes The strength of these relationships indicates how cohesive a process is. Low Confidence High Confidence Chemotaxis

  19. Functional mapping: mining integrated networks Predicted relationships between genes Low Confidence High Confidence Chemotaxis

  20. Functional mapping: mining integrated networks Predicted relationships between genes The strength of these relationships indicates how associated two processes are. Low Confidence High Confidence Chemotaxis Flagellar assembly
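
Slides 18 to 20 describe two quantities read off an integrated network: how cohesive a process is (edge strength among its own genes) and how associated two processes are (edge strength between their genes). A minimal sketch follows, assuming both are scored as a mean edge confidence divided by the network-wide mean. That ratio is an illustrative choice, not a formula given in the talk, and the edge weights below are invented.

```python
from itertools import product

def mean_weight(weights, pairs):
    """Average confidence over the requested gene pairs present in the network."""
    vals = [weights[p] for p in pairs if p in weights]
    return sum(vals) / len(vals) if vals else float("nan")

def pairs_between(genes_a, genes_b):
    """Unordered gene pairs with one member from each set (no self-pairs)."""
    return {frozenset((a, b)) for a, b in product(genes_a, genes_b) if a != b}

# Invented confidences for a toy network; keys are unordered gene pairs.
edges = [
    ("cheA", "cheY", 0.9), ("fliC", "fliG", 0.8),
    ("cheA", "fliC", 0.6), ("cheA", "fliG", 0.5),
    ("cheY", "fliC", 0.7), ("cheY", "fliG", 0.6),
    ("lacZ", "cheA", 0.1), ("lacZ", "cheY", 0.1),
    ("lacZ", "fliC", 0.2), ("lacZ", "fliG", 0.1),
]
weights = {frozenset((a, b)): w for a, b, w in edges}

chemotaxis = {"cheA", "cheY"}
flagellar_assembly = {"fliC", "fliG"}

# Baseline: mean confidence over every edge in the network.
background = sum(weights.values()) / len(weights)

# Cohesiveness: edges within one process, relative to the baseline.
cohesiveness = mean_weight(weights, pairs_between(chemotaxis, chemotaxis)) / background

# Association: edges between two processes, relative to the baseline.
association = mean_weight(weights, pairs_between(chemotaxis, flagellar_assembly)) / background

print(cohesiveness, association)
```

Values above 1 indicate stronger-than-background relationships, which matches the "Below Baseline / Baseline / Very Cohesive" scale used on the later functional mapping slides.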

  21. Functional mapping: Associations among processes. Edges: Associations between processes (Moderately Strong to Very Strong). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance

  22. Functional mapping: Associations among processes. Edges: Associations between processes (Moderately Strong to Very Strong). Borders: Data coverage of processes (Sparsely Covered to Well Covered). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance

  23. Functional mapping: Associations among processes. Edges: Associations between processes (Moderately Strong to Very Strong). Nodes: Cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive). Borders: Data coverage of processes (Sparsely Covered to Well Covered). Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance

  24. Functional mapping: Associations among processes. Edges: Associations between processes (Moderately Strong to Very Strong). Nodes: Cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive). Borders: Data coverage of processes (Sparsely Covered to Well Covered)

  25. Cross-species knowledge transfer using functional data. Pinaki Sarder. TaFTan

  26. TaFTan: Cross-species knowledge transfer using functional data. [Figure panels: E. coli, P. aeruginosa, B. subtilis, M. tuberculosis; axes: log(precision/random) vs. log(recall); series: Species-specific data, Species’ data excluded, All species’ data] • Important to take advantage of all available data for any one organism • Important to take advantage of all available data for every organism • Scalable to dozens of organisms with hundreds of functional datasets • Currently working on making this more context-specific

  27. Outline 1. Data mining: Integrating very large genomic data compendia 2. Metagenomics: Network models of microbial communities

  28. ~2000. So what does all of this have to do with microbial communities? AML/ALL; Survival; Mutation; Batch effects; Gene expression; Functional modules

  29. ~2005. Healthy/Diabetes; BMI; M/F; Population structure; SNP genotypes; LD

  30. 2010. Intervention/perturbation; Healthy/IBD; Temperature; Location. Biological story? Independent sample ??? Cross-validate. Taxa & Orthologs; Niches & Phylogeny. Test for correlates. Confounds/stratification/environment. Feature selection, p >> n. Multiple hypothesis correction

  31. What’s metagenomics? Total collection of microorganisms within a community (also microbial community or microbiota). Total genomic potential of a microbial community. Study of uncultured microorganisms from the environment, which can include humans or other living hosts. Total biomolecular repertoire of a microbial community

  32. The Human Microbiome Project
      • 300 “normal” adults, 18-40
      • 16S rDNA + WGS
      • 5 sites/18 samples + blood
      • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
      • Skin: ears, inner elbows
      • Nasal cavity
      • Gut: stool
      • Vagina: introitus, mid, fornix
      • Reference genomes (~200-800)
      Hamady, 2009. All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, resistant infection… 2006 - ongoing

  33. What features to test? Microbiome data; Genomic data (Reference genomes); Functional data (Experimental models). 16S reads; Taxa; Binning; WGS reads; Orthologous clusters; Functional roles; Clustering; Pathways/modules; Pathway activity

  34. HMP: Data → features. 16S reads; Taxa; Orthologous clusters; Genes (KOs); Pathways/modules; Pathways (KEGGs)

  35. HMP: Body sites Vanilla linear SVM Taxa KOs KEGGs

  36. HMP: Subjects We can tell who you are by the bugs in your mouth! Taxa KEGGs

  37. HMP: Metabolic reconstruction. Functional seq.: KEGG + MetaCYC, CAZy, TCDB, VFDB, MEROPS… 300 subjects; 1-3 visits/subject; 15-18 body sites/visit; 10-20M reads/sample; 100bp reads. BLAST. Smoothing: Witten-Bell. BLAST → Genes: WGS reads, Genes (KOs). Genes → Pathways: MinPath (Ye 2009) ? Pathways/modules, Pathways (KEGGs). Gap filling

  38. HMP: Metabolic reconstruction Pathway coverage Pathway abundance

  39. HMP: Metabolic reconstruction. Pathway abundance; Pathway coverage. ← Samples →; ← Pathways →. All body sites (“core”); Aerobic body sites; Gastrointestinal body sites

  40. MetaHIT: Data → features. ReBLASTed against KEGG since published data obfuscates read counts. 85 healthy, 15 IBD + 12 healthy, 12 IBD. 10x bootstrap within training cohort, test on 12+12 as validation. Taxa: Phymm (Brady 2009). WGS reads; Genes (KOs); Pathways/modules; Pathways (KEGGs)

  41. MetaHIT: Taxonomic CD biomarkers. Bacteroidetes, Methanomicrobia, Enterobacteriaceae, Firmicutes, Chromatiales, Desulfobacterales, Bradyrhizobiaceae, Rhodobacteraceae, Oxalobacteraceae. iTOL (Letunic 2007)

  42. MetaHIT: Taxonomic CD biomarkers Down in CD Up in CD

  43. MetaHIT: Functional CD biomarkers Down in CD Up in CD Growth/replication Motility Transporters Sugar metabolism

  44. MetaHIT: KO IBD biomarkers. Down in IBD; Up in IBD. Growth/replication; Motility; Transporters; Sugar metabolism. LEfSe: Nicola Segata

  45. Metagenomic differential analysis: LEfSe 1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis… 2. Is the difference biologically significant? expert supervision, specific post-hoc tests… 3. How large is the difference? PCA, LDA, mean difference, class or cluster distance… LEfSe: p(ANOVA) < 0.05 pairwise post-hoc Wilcoxon OK Log(Score(LDA)) = 3.68
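
The questions on slide 45 (is the difference statistically significant, and how large is it) can be sketched for a single feature and two classes as below. This is not LEfSe itself: it uses a Kruskal-Wallis test, one of the tests the slide lists, restricted to two groups so the p-value has a closed form, and it reports a plain mean difference where LEfSe reports an LDA score. The pairwise post-hoc Wilcoxon step is omitted, no tie correction is applied, and the function names and abundances are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def kruskal_wallis_two_class(a, b):
    """H statistic and p-value for two groups (chi-square with 1 d.o.f.)."""
    ranks = average_ranks(list(a) + list(b))
    n = len(ranks)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)) - 3.0 * (n + 1)
    h = max(h, 0.0)  # guard against tiny negative values from rounding
    p = math.erfc(math.sqrt(h / 2.0))  # chi-square survival function, 1 d.o.f.
    return h, p

def screen_feature(a, b, alpha=0.05):
    """Step 1: significance test. Step 3: effect size as a mean difference."""
    h, p = kruskal_wallis_two_class(a, b)
    effect = sum(a) / len(a) - sum(b) / len(b)
    return {"H": h, "p": p, "significant": p < alpha, "mean_difference": effect}

# Hypothetical relative abundances of one feature in two sample classes.
class_a = [0.30, 0.35, 0.40, 0.45, 0.50]
class_b = [0.05, 0.10, 0.12, 0.15, 0.20]
result = screen_feature(class_a, class_b)
print(result)
```

Separating the significance test from the effect size mirrors the slide's point that a small p-value alone does not say whether a difference is large enough to matter biologically.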

  46. LEfSe: A non-human example. Viromes vs. bacterial metagenomes (Dinsdale 2008). Metastats (White 2009): p < 0.001. LEfSe: NO DIFF! LEfSe: DIFF! ANOVA: p < 0.05. High-level functional category: Nucleosides and Nucleotides. High-level functional category: Carbohydrates. High-level functional category: Transporters. Microbial; Viral

  47. Sleipnir: Software for scalable functional genomics. Massive datasets require efficient algorithms and implementations.
      • Sleipnir C++ library for computational functional genomics
      • Data types for biological entities
      • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
      • Network communication, parallelization
      • Efficient machine learning algorithms
      • Generative (Bayesian) and discriminative (SVM)
      • And it’s fully documented!
      It’s also speedy: microbial data integration computation takes <3hrs.

  48. Outline • Network framework for scalable data integration • HEFalMp: human data integration • TaFTan: cross-species knowledge transfer from functional data • 16S and WGS community metabolic reconstruction • LEfSe: biologically relevant community differences • Sleipnir: software for scalable genomic data mining 1. Data mining: Integrating very large genomic data compendia 2. Metagenomics: Network models of microbial communities

  49. Thanks! Jacques Izard, Pinaki Sarder, Nicola Segata, Hilary Coller, Erin Haley, Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong, Wendy Garrett, Sarah Fortune, Levi Waldron, Larisa Miropolsky, Willythssa Pierre-Louis. Interested? We’re looking for postdocs! http://huttenhower.sph.harvard.edu http://huttenhower.sph.harvard.edu/sleipnir
