Large scale genomic data mining

Large scale genomic data mining
Curtis Huttenhower, 10-23-09
Harvard School of Public Health, Department of Biostatistics

Mining Biological Data
How can we ask and answer specific biomedical questions using thousands of genome-scale datasets?
[Figure labels: ~100 GB; more than 100 GB]

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics

A Definition of Functional Genomics
Inputs: prior knowledge, genomic data
Mappings: Gene → Function; Gene → Gene; Data → Function; Function → Function

MEFIT: A Framework for Functional Genomics
• Related gene pairs: BRCA1 BRCA2 0.9; BRCA1 RAD51 0.8; RAD51 TP53 0.85; …
• Unrelated gene pairs: BRCA2 SOX2 0.1; RAD51 FOXP2 0.2; ACTR1 H6PD 0.15; …
[Figure: frequency of related and unrelated gene pairs, from low correlation to high correlation]
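
A minimal sketch of the idea behind the related/unrelated correlation distributions on this slide, assuming a naive Bayes treatment of a single dataset: correlations are binned, per-bin frequencies are estimated separately for related and unrelated gold-standard pairs, and Bayes' rule turns a new pair's bin into a posterior probability of a functional relationship. The bin edges, the prior, the pseudocount, and the function names are illustrative assumptions, not MEFIT's actual parameters; the toy values reuse the numbers shown on the slide.

```python
import numpy as np

def bin_likelihoods(values, edges, pseudocount=1.0):
    """Smoothed frequency of each correlation bin for one class of gene pairs."""
    counts, _ = np.histogram(values, bins=edges)
    smoothed = counts + pseudocount
    return smoothed / smoothed.sum()

def posterior_related(corr, edges, lik_rel, lik_unrel, prior=0.5):
    """P(related | correlation bin) via Bayes' rule for a single dataset."""
    b = int(np.clip(np.digitize(corr, edges) - 1, 0, len(lik_rel) - 1))
    num = prior * lik_rel[b]
    return num / (num + (1.0 - prior) * lik_unrel[b])

# Two bins, "low" and "high" correlation (an assumed discretization).
edges = np.array([0.0, 0.5, 1.0])
related = [0.9, 0.8, 0.85]     # toy values taken from the slide
unrelated = [0.1, 0.2, 0.15]   # toy values taken from the slide

lik_rel = bin_likelihoods(related, edges)
lik_unrel = bin_likelihoods(unrelated, edges)

p_high = posterior_related(0.7, edges, lik_rel, lik_unrel)
p_low = posterior_related(0.3, edges, lik_rel, lik_unrel)
print(p_high, p_low)
```

With many datasets, one such likelihood table per dataset could be multiplied together under a conditional-independence assumption; that extension is not shown here.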

MEFIT: A Framework for Functional Genomics
• Functional relationship
• Biological context: functional area, tissue, disease, …
• Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998

Functional Interaction Networks
Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.
[Figure: global interaction network and, via MEFIT, the vacuolar transport, autophagy, and translation networks]

Predicting Gene Function
Predicted relationships between genes (low confidence to high confidence) and the cell cycle genes.
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
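
One simple way to turn those edges into a per-gene score is sketched below, assuming the network is a symmetric matrix of edge confidences and that a gene's score for a process is its mean edge weight to the known process members. The averaging rule, the function name `process_score`, and the toy matrix are illustrative assumptions rather than the method used in the talk.

```python
import numpy as np

def process_score(weights, gene, process_genes):
    """Mean edge weight from `gene` to the process members (excluding itself)."""
    others = [g for g in process_genes if g != gene]
    return float(np.mean(weights[gene, others]))

# Toy symmetric network over 4 genes; genes 0 and 1 form the known "process".
W = np.array([
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
])
process = [0, 1]

# Score the unannotated genes and rank them as candidates for the process.
scores = {g: process_score(W, g, process) for g in (2, 3)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```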

Comprehensive Validation of Computational Predictions
With David Hess, Amy Caudy
• Inputs: genomic data, prior knowledge
• Computational predictions of gene function: SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005), MEFIT
• Genes predicted to function in mitochondrion organization and biogenesis
• Laboratory experiments: growth curves, petite frequency, confocal microscopy
• Retraining: new known functions for correctly predicted genes

Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
• 106 original GO annotations
• 135 under-annotations
• 82 novel confirmations, first iteration
• 17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months

Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
• 106 original GO annotations
• 95 under-annotations
• 40 confirmed under-annotations
• 80 novel confirmations, first iteration
• 17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional Associations Between Contexts
Predicted relationships between genes (low confidence to high confidence) among the cell cycle genes.
The average strength of these relationships indicates how cohesive a process is.

Functional Associations Between Contexts
Predicted relationships between genes (low confidence to high confidence) linking the cell cycle genes and the DNA replication genes.
The average strength of these relationships indicates how associated two processes are.

Functional mapping: Scoring functional associations
How can we formalize these relationships?
• Any sets of genes G1 and G2 in a network can be compared using four measures:
  • Edges between their genes
  • Edges within each set
  • The background edges incident to each set
  • The baseline of all edges in the network
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
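
The slide names the four ingredients but not how they are combined. The sketch below assumes one ratio form that matches the stated behavior: association = (between / background) × (baseline / within), so stronger between-set edges raise the score and stronger within-set or background edges lower it. That formula, the averaging of the two within-set means, and the function names are assumptions made for illustration; G1 and G2 are assumed disjoint with at least two genes each.

```python
import numpy as np

def mean_edges(W, rows, cols):
    """Mean weight of edges between two gene index lists, skipping self-edges."""
    sub = W[np.ix_(rows, cols)]
    mask = np.array(rows)[:, None] != np.array(cols)[None, :]
    return float(sub[mask].mean())

def functional_association(W, g1, g2):
    """Assumed ratio form: (between / background) * (baseline / within)."""
    everyone = list(range(W.shape[0]))
    between = mean_edges(W, g1, g2)                   # edges between the sets
    within = 0.5 * (mean_edges(W, g1, g1) + mean_edges(W, g2, g2))
    background = mean_edges(W, g1 + g2, everyone)     # edges incident to the sets
    baseline = mean_edges(W, everyone, everyone)      # all edges in the network
    return (between / background) * (baseline / within)

g1, g2 = [0, 1], [2, 3]

# In a network where every edge has the same weight, nothing stands out.
uniform = np.full((6, 6), 0.5)
fa_uniform = functional_association(uniform, g1, g2)

# Strengthening only the edges between the two sets raises the score.
boosted = uniform.copy()
boosted[np.ix_(g1, g2)] = 0.9
boosted[np.ix_(g2, g1)] = 0.9
fa_boosted = functional_association(boosted, g1, g2)
print(fa_uniform, fa_boosted)
```

Under this assumed form an uninformative network scores 1, which is consistent with a null distribution centered on 1.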

Functional mapping: Bootstrap p-values
• Scoring functional associations is great… how do you interpret an association score?
  • For gene sets of arbitrary sizes?
  • In arbitrary graphs?
  • Each with its own bizarre distribution of edges?
• For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
• Null distribution is approximately normal with mean 1. Empirically!
• Standard deviation is asymptotic in the sizes of both gene sets.
• Maps FA scores to p-values for any gene sets and underlying graph.
[Figures: histograms of FAs for random sets; null distribution σs for one graph]
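
A rough sketch of that procedure, under stated assumptions: random disjoint gene sets of the same sizes as the query sets are scored to build an empirical null, a normal distribution is fitted to it, and the observed score is converted to a one-sided p-value. The stand-in `ratio_score`, the number of samples, and the one-sided test are illustrative choices; any association score with the same signature could be passed as `score_fn`.

```python
import math
import numpy as np

def ratio_score(W, g1, g2):
    """Stand-in association score: mean between-set edge over mean of all edges."""
    off_diag = ~np.eye(W.shape[0], dtype=bool)
    return float(W[np.ix_(g1, g2)].mean() / W[off_diag].mean())

def bootstrap_p_value(W, g1, g2, score_fn, n_samples=500, seed=0):
    """Fit a normal null from random gene sets of matching sizes."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    null = []
    for _ in range(n_samples):
        picks = rng.choice(n, size=len(g1) + len(g2), replace=False)
        null.append(score_fn(W, list(picks[:len(g1)]), list(picks[len(g1):])))
    mu, sigma = np.mean(null), np.std(null, ddof=1)
    z = (score_fn(W, g1, g2) - mu) / sigma
    # One-sided upper-tail p-value under the fitted normal.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Toy network: weak random edges plus a planted strong block between two sets.
rng = np.random.default_rng(1)
W = rng.uniform(0.0, 0.4, size=(30, 30))
W = (W + W.T) / 2.0
g1, g2 = [0, 1, 2], [3, 4, 5]
W[np.ix_(g1, g2)] = 0.9
W[np.ix_(g2, g1)] = 0.9

p = bootstrap_p_value(W, g1, g2, ratio_score)
print(p)
```

Because the slide notes that the null standard deviation depends on the sizes of both gene sets, the sketch resamples at the query sizes rather than reusing one null for all comparisons.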

Functional Associations Between Processes
• Edges: associations between processes (moderately strong to very strong)
• Nodes: cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)
• Borders: data coverage of processes (sparsely covered to well covered)
• Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Cell Redox Homeostasis, Aldehyde Metabolism, Protein Processing, Peptide Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Energy Reserve Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance
• Example genes shown: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
• Functional mapping
  • Very large collections of genomic data
  • Specific predicted molecular interactions
  • Pathway, process, or disease associations
  • Underlying experimental results and functional activities in data

HEFalMp: Predicting human gene function

HEFalMp: Predicting human genetic interactions

HEFalMp: Analyzing human genomic data

HEFalMp: Understanding human disease

Validating Human Predictions
With Erin Haley, Hilary Coller
• Autophagy: 5½ of 7 predictions currently confirmed
• Predicted novel autophagy proteins: LAMP2, RAB11A
• Controls: Luciferase (negative control), ATG5 (positive control)
• Conditions: not starved, starved (autophagic)

Current Work: Molecular Mechanisms in a Colon Cancer Cohort
With Shuji Ogino, Charlie Fuchs
• Cohorts: Health Professionals Follow-Up Study, Nurses' Health Study
  • ~3,100 gastrointestinal subjects
  • ~3,800 tissue samples
  • ~1,450 colon cancer samples
  • ~2,100 cancer mutation tests
  • ~1,200 LINE-1 methylation
  • ~1,150 CpG island methylation
  • ~775 gene expression
  • ~700 TMA immunohistochemistry
• LINE-1 Methylation
  • Repetitive element making up ~20% of mammalian genomes
  • Very easy to assay methylation level (%)
  • Good proxy for whole-genome methylation level
• DASL Gene Expression
  • Gene expression analysis from paraffin blocks
  • Thanks to Todd Golub, Yujin Hoshida

Colon Cancer: LINE-1 methylation levels
With Shuji Ogino, Charlie Fuchs
• Lower LINE-1 methylation associates with poor colon cancer prognosis (Ogino et al, 2008).
• LINE-1 methylation varies remarkably between individuals…
• …but it is highly correlated within individuals (ρ = 0.718, p < 0.01).
What does it all mean?? What is the biological mechanism linking LINE-1 methylation to colon cancer?

Colon Cancer: LINE-1 methylation levels
With Shuji Ogino, Charlie Fuchs
• Lower LINE-1 methylation associates with poor colon cancer prognosis (Ogino et al, 2008).
• LINE-1 methylation varies remarkably between individuals…
• …but it is highly correlated within individuals (ρ = 0.718, p < 0.01).
• Is anything different about these outliers?
• This suggests linkage to a cancer-related pathway.
• This suggests a copy number variation.
• This suggests a genetic effect.
What is the biological mechanism linking LINE-1 methylation to colon cancer?

Colon Cancer: LINE-1 methylation levels
What is the biological mechanism linking LINE-1 methylation to colon cancer?
• Preliminary Data
  • Six genes differentially expressed even using naïve methods
    • One uncharacterized, one oncogene, three malignancy, one histone
    • 1/3 are from a family with known variable GI expression, prognostic value
    • 2/3 fall in the same cytogenetic band, which is also a known CNV hotspot
  • HEFalMp links to a set of transmembrane receptors/channels
  • Better analysis pulls out mostly one-carbon metabolism and a few more signaling pathways (neurotransmitters??)
Check back in a couple of months!

Next Steps: Microbial Communities
• Data integration is off to a great start in humans
  • Complex communities of distinct cell types
  • Very sparse prior knowledge
  • Concentrated in a few specific areas
  • Variation across populations
  • Critical to understand mechanisms of disease

Next Steps: Microbial Communities
• What about microbial communities?
  • Complex communities of distinct species/strains
  • Very sparse prior knowledge
  • Concentrated in a few specific species/strains
  • Variation across populations
  • Critical to understand mechanisms of disease

Next Steps: Functional Metagenomics
• Metagenomics: data analysis from environmental samples
  • Microflora: environment includes us!
• Another data integration problem
  • Must include datasets from multiple organisms
• Another context-specificity problem
  • Now “context” can also mean “species”
• What questions can we answer?
  • How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
  • What’s shared within community X? What’s different? What’s unique?
  • What’s perturbed in disease state Y? One organism, or many? Host interactions?
• Current methods annotate ~50% of synthetic data, <5% of environmental data
[Figure: gene labels across species: DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2]

Next Steps: Microbial Communities
~120 available expression datasets, ~70 species
• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!
[Figure: two panels of gene labels across species (DLD, ARG1, LPD1, PDPK1, PKH2, PKH1, ARG2, CAR1, PKH3, AGA, LLC1.3, pdk-1, T21F4.1, W04B5.5, R04B3.2); citations: Weskamp et al 2004, Kanehisa et al 2008, Flannick et al 2006, Tatusov et al 1997]

Functional Maps for Functional Metagenomics
• KO1: YG1, YG2, YG3; ECG1, ECG2
• KO2: YG4; PAG1
• KO3: YG6; ECG3, PAG2
• …
[Figure: network of genes YG1 to YG17 grouped into orthologous clusters KO1 to KO9]
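
The mapping on this slide groups genes into orthologous clusters (KO1, KO2, …). A minimal sketch of projecting a gene-level network into that orthologous space follows, assuming that a cluster-to-cluster edge is the average of the gene-to-gene edges between the two clusters' members. The averaging rule, the function name, and the edge weights are illustrative assumptions; only the cluster memberships come from the slide.

```python
from itertools import combinations

def project_to_orthologs(gene_edges, ko_members):
    """Collapse gene-gene edge weights into KO-KO weights by averaging."""
    ko_edges = {}
    for ko_a, ko_b in combinations(sorted(ko_members), 2):
        weights = [
            gene_edges[frozenset((a, b))]
            for a in ko_members[ko_a]
            for b in ko_members[ko_b]
            if frozenset((a, b)) in gene_edges
        ]
        if weights:  # leave KO pairs with no supporting gene edges unconnected
            ko_edges[(ko_a, ko_b)] = sum(weights) / len(weights)
    return ko_edges

# Cluster memberships as listed on the slide (yeast genes only).
ko_members = {"KO1": ["YG1", "YG2", "YG3"], "KO2": ["YG4"], "KO3": ["YG6"]}

# Hypothetical gene-level edge confidences, keyed by unordered gene pair.
gene_edges = {
    frozenset(("YG1", "YG4")): 0.9,
    frozenset(("YG2", "YG4")): 0.7,
    frozenset(("YG3", "YG4")): 0.8,
    frozenset(("YG4", "YG6")): 0.2,
}

ko_edges = project_to_orthologs(gene_edges, ko_members)
print(ko_edges)
```

The resulting KO-level network could then be mapped back onto the genes of another species that belong to the same clusters.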

Validating Orthology-Based Functional Mapping
• Does unweighted data integration predict functional relationships?
• What is the effect of “projecting” through an orthologous space?
[Figure: four panels evaluated against GO and KEGG, each plotting log(Precision/Random) against Recall for individual datasets and for the unsupervised integration]
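
The evaluation axes in these panels can be computed as sketched below, assuming predictions are ranked by score, precision is divided by the precision expected from random guessing (the fraction of positives in the gold standard), and the logarithm is base 2. The base of the logarithm, the function name, and the toy data are assumptions for illustration.

```python
import math

def precision_over_random_curve(scores, labels):
    """Return (recall, log2(precision / random)) at each rank cutoff."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    random_precision = total_pos / len(labels)
    curve, true_pos = [], 0
    for rank, i in enumerate(order, start=1):
        true_pos += labels[i]
        if true_pos == 0:
            continue  # log of zero precision is undefined
        precision = true_pos / rank
        curve.append((true_pos / total_pos,
                      math.log2(precision / random_precision)))
    return curve

# Toy predictions: higher score means more confident; label 1 marks a
# gene pair that the gold standard (e.g. GO or KEGG) considers related.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
curve = precision_over_random_curve(scores, labels)
print(curve)
```

A value of 0 on the vertical axis means the predictions are no better than random at that recall; positive values indicate enrichment for true relationships.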

Validating Orthology-Based Functional Mapping
• Holdout set: uncharacterized “genome”
• Random subsets: characterized “genomes”
[Figure: network of genes YG1 to YG17 partitioned into the holdout set and the random subsets]

Validating Orthology-Based Functional Mapping
• Can subsets of the yeast genome predict a heldout subset’s functional maps?
• Can subsets of the yeast genome predict a heldout subset’s interactome?
• What have we learned?
  • Yeast is incredibly well-curated
  • KEGG tends to be more specific than GO
  • Predicting interactomes by projecting through functional maps works decently in the absolute best case
[Figure: GO and KEGG panels; values shown: 0.30, 0.37, 0.68, 0.48, 0.40, 0.43, 0.39, 0.25, 0.27, 0.39]

Functional Maps for Functional Metagenomics
• Now, what happens if you do this for characterized microbes?
  • ~20 (somewhat) well-characterized species
  • 1-35 datasets each
  • Integrate within species
  • Evaluate using KEGG
  • Then cross-validate by holding out species
[Figure: KEGG evaluation of unsupervised integrations, log(Precision/Random) against Recall]

Next Steps: Missing Methodology, Mining
• Most machine learning algorithms are optimized for one of two cases:
  • Small, dense data
  • Large, sparse data
• HEFalMp integrates ~300M records using ~1K features, relatively few of which are missing, in ~200 contexts
[Slide annotations: Regularization; Dimension reduction; Feature selection; Slightly less; Simple models, efficient algorithms]
