Large scale genomic data integration for functional metagenomics
Curtis Huttenhower, 03-29-10
Harvard School of Public Health, Department of Biostatistics
Are We There Yet?
• How much biology is out there?
• How much have we found?
• How fast are we finding it?
[Figure panels: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (axis: # Distinct Roles); Age-Adjusted Citation Rates for Major Sequencing Projects. Credits: Fierer 2008; Matt Hibbs]
Are We There Yet?
• How much biology is out there? Lots!
• How much have we found? Not nearly all
• How fast are we finding it? Not fast enough
Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.
[Figure panels: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (axis: # Distinct Roles); Age-Adjusted Cost per Citation for Major Sequencing Projects. Credits: Fierer 2008; Matt Hibbs]
Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities
A framework for functional genomics
Input: 100Ms of gene pairs × 1Ks of datasets
[Figure: per-dataset frequency histograms (low vs. high correlation; not colocalized vs. colocalized; dissimilar vs. similar) are combined into one probability per gene pair, e.g. P(G2-G5|Data) = 0.85]
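One way to read the figure on this slide is as a naive Bayes combination: each dataset contributes the likelihood of the observed score bin under "related" and "unrelated", and these are combined with a prior into a posterior such as P(G2-G5|Data). The sketch below is only an illustration under that assumption. The function name, the bin probabilities, and the prior are all hypothetical, and none of it is taken from the talk's software.

```python
from math import prod

def posterior_related(prior, evidence):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior    -- assumed P(related) before seeing any data
    evidence -- one (P(bin | related), P(bin | unrelated)) tuple per dataset,
                for the bin that the pair's score falls into in that dataset
    """
    like_related = prod(p_rel for p_rel, _ in evidence)
    like_unrelated = prod(p_unrel for _, p_unrel in evidence)
    numerator = prior * like_related
    return numerator / (numerator + (1.0 - prior) * like_unrelated)

# Hypothetical pair: highly coexpressed, colocalized, similar sequence.
evidence = [(0.6, 0.2), (0.7, 0.4), (0.5, 0.3)]
p = posterior_related(prior=0.05, evidence=evidence)
```

With no evidence the function returns the prior unchanged, and each dataset whose bin is more likely under "related" pushes the posterior upward.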
Functional network prediction and analysis
HEFalMp currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases.
[Figure panels: Global interaction network; Metabolism network; Signaling network; Gut community network]
Validating Human Predictions: Autophagy
With Erin Haley, Hilary Coller
5½ of 7 predictions currently confirmed.
[Figure labels: Predicted novel autophagy proteins; Luciferase (negative control); ATG5 (positive control); LAMP2; RAB11A; Not Starved vs. Starved (Autophagic)]
Meta-analysis for unsupervised functional data integration
(Huttenhower 2006; Hibbs 2007; Evangelou 2007)
• Simple regression: All datasets are equally accurate
• Random effects: Variation within and among datasets and interactions
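The contrast between the two models can be sketched with a standard random-effects estimator. The example below assumes each dataset reports a correlation for one gene pair, applies the Fisher z-transform, and pools with the DerSimonian-Laird method. That choice of estimator, the function name, and the inputs are assumptions for illustration and are not confirmed as the method used in the talk.

```python
import math

def random_effects(correlations, sizes):
    """Pool per-dataset correlations for one gene pair (DerSimonian-Laird).

    correlations -- one correlation per dataset
    sizes        -- number of conditions in each dataset (each must exceed 3)
    Returns (pooled correlation, among-dataset variance tau^2).
    """
    z = [math.atanh(r) for r in correlations]   # Fisher z-transform
    v = [1.0 / (n - 3) for n in sizes]          # within-dataset variance of z
    w = [1.0 / vi for vi in v]                  # fixed-effect weights
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    tau2 = 0.0
    if len(z) > 1:
        q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (len(z) - 1)) / c)  # among-dataset variance
    w_re = [1.0 / (vi + tau2) for vi in v]       # random-effects weights
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    return math.tanh(z_re), tau2

pooled, tau2 = random_effects([0.1, 0.9], [50, 50])
```

When the datasets agree, tau^2 is zero and the estimate reduces to a fixed-effect average weighted by dataset size. When they disagree, tau^2 grows and the weights move toward equality.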
Meta-analysis for unsupervised functional data integration
(Huttenhower 2006; Hibbs 2007; Evangelou 2007)
Following up with a semi-supervised approach.
Functional mapping: mining integrated networks
Predicted relationships between genes, scored from low confidence to high confidence:
• Within one process (e.g. Chemotaxis), the strength of these relationships indicates how cohesive the process is.
• Between two processes (e.g. Chemotaxis and Flagellar assembly), the strength of these relationships indicates how associated the two processes are.
Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
Stronger connections between the sets increase association. Stronger within-set connections or nonspecific background connections decrease association.
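The four measures can be combined in several ways. The sketch below assumes one plausible form, (between / background) × (baseline / within), using mean edge weights. The slide does not give the exact formula, so this ratio, the function names, and the toy network are assumptions chosen only to match the stated behavior (between-set edges raise the score; within-set and background edges lower it).

```python
from itertools import product

def pairs(xs, ys):
    """All unordered gene pairs with one member in xs and one in ys."""
    return {frozenset((a, b)) for a, b in product(xs, ys) if a != b}

def mean_weight(edge_set, weights):
    """Mean weight over a set of edges (the set must be non-empty)."""
    return sum(weights[e] for e in edge_set) / len(edge_set)

def fa_score(g1, g2, genes, weights):
    """ASSUMED association score: (between / background) * (baseline / within).

    g1, g2  -- sets of genes, each with at least two members
    genes   -- set of all genes in the network
    weights -- dict mapping frozenset({a, b}) to an edge weight
    """
    between = mean_weight(pairs(g1, g2), weights)
    within = mean_weight(pairs(g1, g1) | pairs(g2, g2), weights)
    background = mean_weight(pairs(g1 | g2, genes), weights)
    baseline = mean_weight(pairs(genes, genes), weights)
    return (between / background) * (baseline / within)

# Toy network: weak edges everywhere except between {a, b} and {c, d}.
genes = set("abcdef")
weights = {e: 0.1 for e in pairs(genes, genes)}
for e in pairs({"a", "b"}, {"c", "d"}):
    weights[e] = 0.9
score = fa_score({"a", "b"}, {"c", "d"}, genes, weights)
```

In a network where every edge has the same weight, this ratio equals 1, which makes 1 a natural "no association" reference point for the score.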
Functional Mapping: Bootstrap p-values
Scoring functional associations is great, but how do you interpret an association score?
• For gene sets of arbitrary sizes?
• In arbitrary graphs?
• Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
• Empirically, the null distribution is approximately normal with mean 1.
• The standard deviation is asymptotic in the sizes of both gene sets.
• This maps FA scores to p-values for any gene sets and underlying graph.
[Figure panels: Histograms of FAs for random sets; Null distribution σs for one graph]
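The procedure on this slide can be sketched as follows: score many random gene-set pairs of a given size, fit a normal distribution to those scores, and read a p-value off its upper tail. The scoring function here is a simplified stand-in (mean between-set weight over the network baseline), and the network, set sizes, trial count, and all names are hypothetical.

```python
import random
from itertools import combinations
from statistics import NormalDist, mean, stdev

def toy_score(g1, g2, weights, baseline):
    """Stand-in for an FA score: mean between-set weight over the baseline."""
    between = [weights[frozenset((a, b))] for a in g1 for b in g2]
    return mean(between) / baseline

def null_distribution(genes, weights, size1, size2, trials=200, seed=0):
    """Fit a normal null to scores of randomly chosen, disjoint gene sets."""
    rng = random.Random(seed)
    baseline = mean(weights.values())
    scores = []
    for _ in range(trials):
        sample = rng.sample(genes, size1 + size2)
        scores.append(toy_score(sample[:size1], sample[size1:], weights, baseline))
    return NormalDist(mean(scores), stdev(scores))

def p_value(observed, null):
    """One-sided p-value: chance of a null score at least this large."""
    return 1.0 - null.cdf(observed)

# Hypothetical network of 30 genes with random edge weights.
rng = random.Random(1)
genes = [f"g{i}" for i in range(30)]
weights = {frozenset(p): rng.random() for p in combinations(genes, 2)}
null = null_distribution(genes, weights, size1=5, size2=5)
```

Repeating the fit for each pair of set sizes gives the size-dependent standard deviations mentioned on the slide. Here a single pair of sizes is fitted for brevity.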
Functional Mapping: Functional Associations Between Processes
[Figure: network of processes. Nodes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance]
Legend:
• Edges: associations between processes (moderately strong to very strong)
• Nodes: cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)
• Borders: data coverage of processes (sparsely covered to well covered)
Functional Maps: Focused Data Summarization
[Figure: raw sequence ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA]
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
• Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities
Microbial Communities and Functional Metagenomics
With Jacques Izard, Wendy Garrett
• Metagenomics: data analysis from environmental samples
• Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
• Must include datasets from multiple organisms
• What questions can we answer?
• What pathways/processes are present, over-enriched, or under-enriched in a newly sequenced microbe/community?
• What’s shared within community X? What’s different? What’s unique?
• How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
• Current functional methods annotate ~50% of synthetic data, <5% of environmental data
[Figure: gene labels DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC 1.3, pdk-1, T21 F4.1, W04B5.5, R04 B3.2]
Data Integration for Microbial Communities
~300 available expression datasets, ~30 species
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
Cited: Weskamp et al 2004; Kanehisa et al 2008; Flannick et al 2006; Tatusov et al 1997
[Figure: gene labels DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC 1.3, pdk-1, T21 F4.1, W04B5.5, R04 B3.2]
Functional network prediction from diverse microbial data
• 486 bacterial expression experiments: 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species
• 307 bacterial interaction experiments: 154796 raw interactions → 114786 postprocessed interactions
• Integrated functional interaction networks in 15 species
[Figure: E. coli integration, precision vs. recall]
Functional maps for cross-species knowledge transfer
[Figure: genes G1 to G17 in a network are assigned to groups O1 to O9 (e.g. O1: G1, G2, G3; O2: G4; O3: G6; …), and groups collect genes across species (e.g. ECG1, ECG2; BSG1; ECG3, BSG2; …)]
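The figure suggests transferring network edges between species through shared gene groups. The sketch below assumes the groups O1, O2, … are orthologous groups, averages the source-species edge weights between each pair of groups, and copies that average onto the target-species genes in the same groups. The averaging rule, the function names, and the gene identifiers are hypothetical illustrations.

```python
from collections import defaultdict

def project_to_groups(edges, gene_to_group):
    """Average gene-level edge weights into group-level edge weights."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (a, b), weight in edges.items():
        ga, gb = gene_to_group.get(a), gene_to_group.get(b)
        if ga is None or gb is None or ga == gb:
            continue  # skip unmapped genes and edges inside one group
        key = frozenset((ga, gb))
        sums[key] += weight
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def transfer(group_edges, target_gene_to_group):
    """Copy each group-level weight onto target-species gene pairs."""
    members = defaultdict(list)
    for gene, group in target_gene_to_group.items():
        members[group].append(gene)
    predicted = {}
    for key, weight in group_edges.items():
        ga, gb = tuple(key)
        for a in members[ga]:
            for b in members[gb]:
                predicted[frozenset((a, b))] = weight
    return predicted

# Hypothetical source species (EC*) and target species (BS*).
source_groups = {"ECG1": "O1", "ECG2": "O1", "ECG3": "O2"}
target_groups = {"BSG1": "O1", "BSG2": "O2"}
source_edges = {("ECG1", "ECG3"): 0.8, ("ECG2", "ECG3"): 0.6, ("ECG1", "ECG2"): 0.9}
predicted = transfer(project_to_groups(source_edges, source_groups), target_groups)
```

Here the two source edges between groups O1 and O2 average to 0.7, which becomes the predicted weight for the BSG1-BSG2 pair in the target species.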
Functional maps for cross-species knowledge transfer
Following up with unsupervised and partially anchored network alignment.
[Figure: precision vs. recall]
Functional maps for functional metagenomics
GOS 4441599.3 (Hypersaline Lagoon, Ecuador) + KEGG Pathways
Integrated functional interaction networks in 27 species
• Mapping organisms into phyla (env. organisms, pathogens)
• Mapping genes into pathways
• Mapping pathways into organisms
Functional maps for functional metagenomics
Legend:
• Edges: process association in obesity (less coregulated; baseline (no change); more coregulated)
• Nodes: process cohesiveness in obesity (very downregulated; baseline (no change); very upregulated)
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
• Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
• Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3 hrs.
Outline
1. Data mining: Algorithms for integrating very large data compendia
• Bayesian and unsupervised methods for data integration
• HEFalMp system for human data analysis and integration
• Functional mapping to statistically summarize large data collections
2. Metagenomics: Network models of microbial communities
• Integration for microbial communities and metagenomics
• Accurate cross-species interactome transfer
• Sleipnir software for efficient large scale data mining
Thanks!
Jacques Izard, Hilary Coller, Erin Haley, Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong, Wendy Garrett, Sarah Fortune, Tracy Rosebrock
http://huttenhower.sph.harvard.edu/sleipnir
http://function.princeton.edu/hefalmp
NIGMS
Current Work: Molecular Mechanisms in a Colorectal Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Cohorts: Health Professionals Follow-Up Study; Nurses’ Health Study
• ~3,100 gastrointestinal subjects
• ~3,800 tissue samples
• ~1,450 colon cancer samples
• ~2,100 cancer mutation tests
• ~1,200 LINE-1 methylation
• ~1,150 CpG island methylation
• ~775 gene expression
• ~700 TMA immunohistochemistry
LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level
DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida
Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation
Nonnegative matrix factorization of the tumors × genes matrix, with clusters C1 to C4.
[Figure annotations: Cell cycle regulation; Chr. 19 rearrangement, membrane receptors/channels; Angiogenesis, proliferation; HSC signature; Neural/ESC signature; BRCA interactors, chrom. stability factors]
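For readers unfamiliar with the technique named on this slide, the sketch below factors a small genes × tumors matrix V into nonnegative W (genes × k) and H (k × tumors) using the standard Lee-Seung multiplicative updates, then assigns each tumor to its largest component. The toy matrix, k, the iteration count, and the function names are assumptions; the talk does not say which NMF variant or settings were used.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Approximate v (n x m) by w (n x k) times h (k x m), all nonnegative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def cluster_assignments(h):
    """Assign each column (tumor) to the component with the largest loading."""
    return [max(range(len(h)), key=lambda c: h[c][j]) for j in range(len(h[0]))]

# Toy expression matrix: rows are genes, columns are tumors.
v = [[5, 5, 0, 0], [5, 5, 0, 0], [0, 0, 5, 5], [0, 0, 5, 5]]
w, h = nmf(v, k=2)
labels = cluster_assignments(h)
```

The rows of H play the role of expression programs, and the per-tumor argmax gives a cluster label comparable to C1 to C4 on the slide. Real analyses typically run several random restarts and choose k by a stability criterion.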
Molecular Subtypes of Colorectal Cancer: Stem Cell Programs and Proliferation
[Figure labels: Hematopoietic Stem Cell Signature; Neural Stem Cell Signature; Embryonic Stem Cell Signature; CD133 + Bcl-X(L); CD44 + CD166; Chr. 19q; BAX; counts 166, 799, 945, 195, 678, 18, 146, 7, 8, 325; Subramanian et al, 2005]
Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers… Testing now for correlation with outcome.
Hypotheses?
• Two main pathways to proliferation:
• HSC program + BAX
• ESC/NSC program
• Two main pathways to deregulation:
• Angiogenesis + chrom. instability
• Cell cycle disruption (MSI?)
Epigenetics of Colorectal Cancer: LINE-1 methylation levels
• Lower LINE-1 methylation associates with poor colon cancer prognosis.
• LINE-1 methylation varies remarkably between individuals…
• …but it is highly correlated within individuals.
(Ogino et al, 2008)
[Figure: ρ = 0.718, p < 0.01]
What does it all mean? What is the biological mechanism linking LINE-1 methylation to colon cancer?
Epigenetics of Colorectal Cancer: LINE-1 methylation levels
• Lower LINE-1 methylation associates with poor colon cancer prognosis.
• LINE-1 methylation varies remarkably between individuals…
• …but it is highly correlated within individuals.
(Ogino et al, 2008)
[Figure: ρ = 0.718, p < 0.01]
Is anything different about these outliers?
Callouts:
• This suggests linkage to a cancer-related pathway.
• This suggests a copy number variation.
• This suggests a genetic effect.
What is the biological mechanism linking LINE-1 methylation to colon cancer?
Epigenetics of Colorectal Cancer: LINE-1 methylation levels
What is the biological mechanism linking LINE-1 methylation to colon cancer?
Preliminary Data
• 10 genes differentially expressed even using simple methods
• 1/3 are from the same family with known GI tumor prognostic value
• 1/3 are X-chromosome testis/cancer-specific antigens
• 1/2 fall in the same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a cascade of antigens/membrane receptors/TFs
• Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays
• GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis correlates
Check back in a couple of months!