Answering biological questions using large genomic data collections

Curtis Huttenhower, 10-05-09
Harvard School of Public Health, Department of Biostatistics

A Definition of Computational Functional Genomics
Inputs: prior knowledge, genomic data
Gene → Function; Gene → Gene; Data → Function; Function → Function

MEFIT: A Framework for Functional Genomics
Related gene pairs: BRCA1 BRCA2 0.9; BRCA1 RAD51 0.8; RAD51 TP53 0.85; …
Unrelated gene pairs: BRCA2 SOX2 0.1; RAD51 FOXP2 0.2; ACTR1 H6PD 0.15; …
[Figure: frequency of related and unrelated gene pairs along an axis from low correlation to high correlation]
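The comparison on this slide, how often related and unrelated gene pairs fall at each correlation level, can be illustrated with a small sketch. This is only an illustration under stated assumptions, not MEFIT's actual implementation: the bin count, the pseudocount, and the function names are hypothetical, and scores are assumed to lie in [0, 1] as in the slide's examples.

```python
import math

def bin_index(corr, n_bins=4):
    """Map a score in [0, 1] to one of n_bins equal-width bins (assumed range)."""
    return max(0, min(int(corr * n_bins), n_bins - 1))

def bin_frequencies(correlations, n_bins=4, pseudocount=1.0):
    """Smoothed fraction of gene pairs falling into each correlation bin."""
    counts = [pseudocount] * n_bins
    for c in correlations:
        counts[bin_index(c, n_bins)] += 1
    total = sum(counts)
    return [x / total for x in counts]

def log_likelihood_ratio(corr, related_freq, unrelated_freq):
    """Positive when this correlation is more typical of related pairs."""
    b = bin_index(corr, len(related_freq))
    return math.log(related_freq[b] / unrelated_freq[b])

# Scores copied from the slide's example pairs.
related = [0.9, 0.8, 0.85]     # BRCA1 BRCA2, BRCA1 RAD51, RAD51 TP53
unrelated = [0.1, 0.2, 0.15]   # BRCA2 SOX2, RAD51 FOXP2, ACTR1 H6PD

related_freq = bin_frequencies(related)
unrelated_freq = bin_frequencies(unrelated)

for corr in (0.9, 0.5, 0.1):
    print(corr, log_likelihood_ratio(corr, related_freq, unrelated_freq))
```

A dataset in which the two frequency profiles differ sharply is informative about functional relationships; one in which they overlap carries little evidence either way.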

MEFIT: A Framework for Functional Genomics
Biological context: functional area, tissue, disease, …
[Figure: diagram linking a functional relationship, its biological context, and the datasets Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998]

Functional Interaction Networks
Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.
[Figure: MEFIT; global interaction network; vacuolar transport network; autophagy network; translation network]

Predicting Gene Function
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
[Figure: predicted relationships between genes, ranging from low confidence to high confidence, around cell cycle genes]
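One simple way to turn such edges into a per-gene score is to average a candidate gene's edge confidences to the genes already known to be in the process. The sketch below assumes exactly that and nothing more: the gene names, weights, and function names are hypothetical placeholders, missing edges are treated as confidence 0, and no correction for process specificity against the genomic background is attempted.

```python
def edge_weight(network, a, b):
    """Confidence of the predicted relationship between a and b (0 if absent)."""
    return network.get(frozenset((a, b)), 0.0)

def process_score(network, gene, process_genes):
    """Mean edge confidence between a gene and the known genes of a process."""
    others = [g for g in process_genes if g != gene]
    if not others:
        return 0.0
    return sum(edge_weight(network, gene, g) for g in others) / len(others)

# Hypothetical toy network: A, B, C stand in for known process genes.
network = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.7,
    frozenset(("X", "A")): 0.6,
    frozenset(("X", "B")): 0.9,
    frozenset(("Y", "A")): 0.1,
}
process = {"A", "B", "C"}
candidates = ["X", "Y"]

# Rank candidate genes by how strongly they connect to the process.
ranked = sorted(candidates,
                key=lambda g: process_score(network, g, process),
                reverse=True)
print(ranked)
```

Candidates with many high-confidence edges into the process rise to the top of the ranking and become the natural targets for experimental follow-up.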

Functional Associations Between Contexts
The average strength of these relationships indicates how cohesive a process is.
[Figure: predicted relationships between genes, ranging from low confidence to high confidence, among cell cycle genes]

Functional Associations Between Contexts
The average strength of these relationships indicates how associated two processes are.
[Figure: predicted relationships between genes, ranging from low confidence to high confidence, between cell cycle genes and DNA replication genes]
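Both averages described on these slides, within one process (cohesiveness) and between two processes (association), can be sketched in a few lines. The gene names, weights, and function names below are hypothetical, missing edges are assumed to have confidence 0, and any normalization against a genomic baseline is left out.

```python
from itertools import combinations, product

def mean_weight(network, pairs):
    """Average edge confidence over gene pairs; absent edges count as 0."""
    weights = [network.get(frozenset(p), 0.0) for p in pairs]
    return sum(weights) / len(weights) if weights else 0.0

def cohesiveness(network, genes):
    """Mean confidence over all pairs of genes within one process."""
    return mean_weight(network, combinations(sorted(genes), 2))

def association(network, genes_a, genes_b):
    """Mean confidence over all pairs spanning two processes."""
    pairs = [(a, b)
             for a, b in product(sorted(genes_a), sorted(genes_b))
             if a != b]  # skip self-pairs when a gene belongs to both sets
    return mean_weight(network, pairs)

# Hypothetical toy data standing in for two processes.
cell_cycle = {"C1", "C2", "C3"}
dna_replication = {"R1", "R2"}
network = {
    frozenset(("C1", "C2")): 0.9,
    frozenset(("C1", "C3")): 0.8,
    frozenset(("C2", "C3")): 0.7,
    frozenset(("R1", "R2")): 0.9,
    frozenset(("C1", "R1")): 0.6,
    frozenset(("C2", "R2")): 0.4,
}

print(cohesiveness(network, cell_cycle))
print(association(network, cell_cycle, dna_replication))
```

Comparing either number to the same average over random gene sets would give the kind of baseline the process-level map refers to as genomic background.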

Functional Associations Between Processes
[Figure: network of biological processes]
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance
Example genes: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline; baseline, i.e. genomic background; very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)

HEFalMp: Predicting human gene function HEFalMp

HEFalMp: Predicting human genetic interactions HEFalMp

HEFalMp: Analyzing human genomic data HEFalMp

HEFalMp: Understanding human disease HEFalMp

Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy: 5½ of 7 predictions currently confirmed
[Figure: predicted novel autophagy proteins (LAMP2, RAB11A) with Luciferase (negative control) and ATG5 (positive control); not starved vs. starved (autophagic)]

Comprehensive Validation of Computational Predictions
With David Hess, Amy Caudy
Inputs: genomic data, prior knowledge
Computational predictions of gene function: SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005), MEFIT
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory experiments: growth curves, petite frequency, confocal microscopy
New known functions for correctly predicted genes
Retraining

Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
106 original GO annotations
135 under-annotations
82 novel confirmations, first iteration
17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months

Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
106 original GO annotations
95 under-annotations
40 confirmed under-annotations
80 novel confirmations, first iteration
17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a researcher take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
• Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Thanks!
Hilary Coller, Erin Haley, Tsheko Mutungu, Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz, Shuji Ogino, Charlie Fuchs
Interested? I’m accepting students and postdocs!
http://www.huttenhower.org
http://function.princeton.edu/hefalmp
NIGMS

Next Steps: Microbial Communities
• Data integration is off to a great start in humans
• Complex communities of distinct cell types
• Very sparse prior knowledge
• Concentrated in a few specific areas
• Variation across populations
• Critical to understand mechanisms of disease

Next Steps: Microbial Communities
• What about microbial communities?
• Complex communities of distinct species/strains
• Very sparse prior knowledge
• Concentrated in a few specific species/strains
• Variation across populations
• Critical to understand mechanisms of disease

Next Steps: Microbial Communities
~120 available expression datasets, ~70 species
• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!
[Figure: gene networks compared across species; node labels DLD, ARG1, ARG2, LPD1, PDPK1, PKH1, PKH2, PKH3, CAR1, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2]
Weskamp et al 2004; Kanehisa et al 2008; Flannick et al 2006; Tatusov et al 1997

Next Steps: Functional Metagenomics
• Metagenomics: data analysis from environmental samples
• Microflora: environment includes us!
• Another data integration problem
• Must include datasets from multiple organisms
• Another context-specificity problem
• Now “context” can also mean “species”
• What questions can we answer?
• How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
• What’s shared within community X? What’s different? What’s unique?
• What’s perturbed in disease state Y? One organism, or many? Host interactions?
• Current methods annotate ~50% of synthetic data, <5% of environmental data
[Figure: gene network; node labels DLD, ARG1, ARG2, LPD1, PDPK1, PKH1, PKH2, PKH3, CAR1, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2]
