Machine Learning for Functional Genomics II Matt Hibbs http://cbfg.jax.org
Functional Genomics Identify the roles played by genes/proteins Sealfon et al., 2006.
Promise of Computational Functional Genomics Data & Existing Knowledge Laboratory Experiments Computational Approaches Predictions
Computational Solutions • Machine learning & data mining • Use existing data to make new predictions • Similarity search algorithms • Bayesian networks • Support vector machines • etc. • Validate predictions with follow-up lab work • Visualization & exploratory analysis • Seeing and interacting with data important • Show data so that questions can be answered • Scalability, incorporate statistics, etc.
Bayesian Networks [Figure: example network with nodes "Raining?", "Jim brought umbrella", "Cloudy this morning", "Rain in forecast"] Encodes dependence relationships between observed and unobserved events
Bayesian Network Overview • Graphical representation of relationships • Probabilistic information from data to concepts
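Bayesian Network Overview
Query: P(FR | CE, AP, Y2H)
P(FR | CE=yes, AP=yes, Y2H=yes) = α P(FR) P(CE=yes | FR) Σ_PI P(PI | FR) P(AP=yes | PI) P(Y2H=yes | PI)
Bayes’ Rule: P(A | B) ∝ P(A) P(B | A)
Normalization: P(FR=yes | evidence) + P(FR=no | evidence) = 0.0105α + 0.0216α = 1, so α = 1/0.0321
P(FR=yes | evidence) = 0.0105 / 0.0321 ≈ 0.327 (up from a prior of 0.10)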
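The normalization step can be checked numerically. This is a minimal sketch that uses only the two unnormalized values shown on the slide; the variable names are illustrative and not from the original.

```python
# Unnormalized posterior terms from the slide (each is multiplied by alpha).
unnorm_yes = 0.0105  # term for FR=yes
unnorm_no = 0.0216   # term for FR=no

# alpha is the constant that makes the two posteriors sum to 1.
alpha = 1.0 / (unnorm_yes + unnorm_no)
posterior_yes = alpha * unnorm_yes
posterior_no = alpha * unnorm_no

print(f"P(FR=yes | evidence) = {posterior_yes:.3f}")
```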
Naïve Bayes No internal hidden nodes Greatly simplifies problem, reduces computational complexity and time Imposes independence assumption
Naïve Bayes P(FR | D1, D2, D3, D4) = α P(FR) P(D1 | FR) P(D2 | FR) P(D3 | FR) P(D4 | FR) Bayes’ Rule: P(A | B) ∝ P(A) P(B | A) Assumes that all measures are conditionally independent given FR
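The naïve Bayes product can be sketched in a few lines. The function name and the example probabilities below are hypothetical, chosen only to illustrate the formula; they are not values from the talk.

```python
from math import prod

def naive_bayes_posterior(prior_yes, likelihoods_yes, likelihoods_no):
    """Return P(FR=yes | D1..Dn) under the naive Bayes independence assumption.

    likelihoods_yes[i] = P(Di = observed value | FR=yes)
    likelihoods_no[i]  = P(Di = observed value | FR=no)
    """
    unnorm_yes = prior_yes * prod(likelihoods_yes)
    unnorm_no = (1.0 - prior_yes) * prod(likelihoods_no)
    # Dividing by the sum plays the role of alpha.
    return unnorm_yes / (unnorm_yes + unnorm_no)

# Hypothetical numbers: prior of 0.10 and two datasets.
p = naive_bayes_posterior(0.10, [0.8, 0.6], [0.2, 0.3])
print(round(p, 3))
```

With no datasets at all the function returns the prior, which is a quick sanity check on the normalization.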
Steps for Bayesian network integration Construct a gold standard Convert data to pair-wise format Count positive/negative pairs in each dataset Create CPTs to define Bayes net Inference to calculate all pair-wise probabilities Evaluate performance Predict functions given network
Gold Standard Construction • Gene Ontology annotations used to define known functional relationships Threshold for positive relationships Threshold for negative relationships Myers et al., 2006
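One way to turn annotations into labeled pairs is sketched below. It assumes, without confirmation from the slide, that both thresholds are on GO term size (number of annotated genes): pairs sharing a sufficiently small term are positives, pairs whose smallest shared term is large (or that share no term) are negatives, and everything in between is left unlabeled. The function name, parameter names, and toy term IDs are hypothetical.

```python
from itertools import combinations

def build_gold_standard(term_to_genes, pos_max_size, neg_min_size):
    """Label gene pairs as positive (1) or negative (0); others stay unlabeled.

    ASSUMPTION: thresholds are on term size, which the slide does not spell out.
    """
    smallest_shared = {}  # pair -> size of the smallest term containing both genes
    annotated = set()
    for genes in term_to_genes.values():
        annotated |= set(genes)
        size = len(genes)
        for pair in combinations(sorted(genes), 2):
            if size < smallest_shared.get(pair, float("inf")):
                smallest_shared[pair] = size

    labels = {}
    for pair in combinations(sorted(annotated), 2):
        size = smallest_shared.get(pair, float("inf"))
        if size <= pos_max_size:
            labels[pair] = 1
        elif size >= neg_min_size:
            labels[pair] = 0
    return labels

# Toy annotations (made-up term names and genes).
toy_terms = {"GO:small": {"a", "b"}, "GO:big": {"a", "b", "c", "d"}}
toy_labels = build_gold_standard(toy_terms, pos_max_size=2, neg_min_size=4)
```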
Gold Standard Used For Training positive relationships negative relationships Global Gold Standard
Gene-Gene Scores • Binary data • PPI, co-localization, synthetic lethality • Can use binary scores • Can use profiles to generate scores (dot product) • Continuous data • Profile distance metrics • Binning results • Converts everything to discrete case
Distance Metrics • Euclidean Distance • Pearson Correlation • Spearman Correlation • Choice of distance measure is important for quantifying relationships in datasets • Pair-wise metrics compare vectors of numbers • e.g. genes x and y, each with n measurements
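The three metrics named on this slide can be written out for two profiles x and y of equal length n. This is a plain-Python sketch using the standard textbook definitions; the function names are illustrative, and constant profiles (zero variance) are not handled.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]  # monotone but non-linear in x
print(pearson(x, y), spearman(x, y))
```

In the example, y increases monotonically but not linearly with x, so the Spearman correlation is 1 while the Pearson correlation is below 1.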
Sensible Binning • The commonly used Pearson correlation yields very different distributions of correlation values in different datasets • These differences complicate comparisons between datasets [Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 1997; Primig et al., 2000]
Sensible Binning • Applying the Fisher Z-transform and then converting to Z-scores equalizes the distributions • Increases comparability between datasets [Figure: histograms of Z-scores between all pairs of genes]
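A sketch of this normalization: apply the Fisher transform z = atanh(r) to each correlation, then standardize the transformed values within the dataset. The clipping constant and the use of the population standard deviation are assumptions made here for illustration, not details from the talk.

```python
from math import atanh, sqrt

def fisher_z_scores(correlations, clip=0.999999):
    """Fisher-transform correlations, then convert to z-scores within the dataset."""
    # Clip so that atanh stays finite when r is exactly +1 or -1.
    clipped = [max(-clip, min(clip, r)) for r in correlations]
    z = [atanh(r) for r in clipped]
    mean = sum(z) / len(z)
    std = sqrt(sum((v - mean) ** 2 for v in z) / len(z))
    return [(v - mean) / std for v in z]

scores = fisher_z_scores([-0.5, 0.0, 0.5, 0.9])
print([round(s, 2) for s in scores])
```

After this step every dataset has mean 0 and unit variance, so one set of bin boundaries can be reused across datasets.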
Pre-calculation and Storage • Pair-wise distances only need to be calculated once, even if using different binnings • Typical mouse microarray: ~5-20k genes • 16M pair-wise distances • ~50-700 MB of storage for one dataset • ~800 datasets in GEO • ~200 GB for all datasets
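The storage figures follow from the number of unordered gene pairs, G(G-1)/2. The sketch below assumes 4 bytes per stored distance; the slide does not state the value size, so treat that as a guess that happens to land near the quoted lower bound.

```python
def pairwise_storage(num_genes, bytes_per_value=4):
    """Return (number of unordered gene pairs, bytes needed to store one value each)."""
    pairs = num_genes * (num_genes - 1) // 2
    return pairs, pairs * bytes_per_value

for genes in (5_000, 20_000):
    pairs, size = pairwise_storage(genes)
    print(f"{genes} genes: {pairs / 1e6:.1f}M pairs, {size / 1e6:.0f} MB")
```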
Counting & Learning • Conceptually straightforward • Counting • Just look at all of the pairs in each dataset, see which bin it falls into, increment a counter • But… you need to do this 16M times/dataset • “Dumb” parallelization – each dataset is independent • Learning CPTs • Fractions based on counts
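Counting and CPT learning for a single dataset might look like the following sketch. The data layout (dictionaries keyed by gene pair), the function names, and the pseudocount are all assumptions introduced for illustration; the slide only says that CPT entries are fractions based on counts.

```python
from bisect import bisect_right
from collections import Counter

def bin_index(score, bin_edges):
    """Index of the bin a score falls into; bin_edges are sorted interior cut points."""
    return bisect_right(bin_edges, score)

def learn_cpt(pair_scores, gold_standard, bin_edges, pseudocount=1.0):
    """Return cpt[label][bin] = P(bin | FR=label) for one dataset.

    The pseudocount (an assumption, not from the slide) avoids zero probabilities.
    """
    n_bins = len(bin_edges) + 1
    counts = {0: Counter(), 1: Counter()}
    for pair, score in pair_scores.items():
        label = gold_standard.get(pair)
        if label is None:  # pair not in the gold standard: skip it
            continue
        counts[label][bin_index(score, bin_edges)] += 1
    cpt = {}
    for label in (0, 1):
        total = sum(counts[label].values()) + pseudocount * n_bins
        cpt[label] = [(counts[label][b] + pseudocount) / total for b in range(n_bins)]
    return cpt

# Toy data: three scored pairs, two of which are in the gold standard.
toy_scores = {("a", "b"): 2.0, ("a", "c"): -1.0, ("b", "c"): 0.5}
toy_gold = {("a", "b"): 1, ("a", "c"): 0}
toy_cpt = learn_cpt(toy_scores, toy_gold, bin_edges=[0.0, 1.0])
```

Because each dataset is processed independently, a call like this can be run once per dataset in parallel, which is the "dumb" parallelization the slide refers to.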
Inference • Also pretty straightforward • For all pairs of genes… • For each dataset • Look-up value from pre-calculated distances • Determine bin and value from CPT • Multiply probability into product • Do this for FR=yes and FR=no • Normalize out α • Store Result • 1.5GB result file
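The inference loop for one gene pair can be sketched as below. The tuple layout for a dataset and the choice to skip datasets that lack the pair are assumptions made here; the toy CPT values are invented.

```python
from bisect import bisect_right

def infer_pair(pair, datasets, prior_yes):
    """Posterior P(FR=yes | all datasets) for one gene pair.

    datasets: list of (pair_scores, bin_edges, cpt) with cpt[label][bin] = P(bin | FR=label).
    Datasets that lack the pair are skipped (an assumption: treated as missing data).
    """
    prod_yes, prod_no = prior_yes, 1.0 - prior_yes
    for pair_scores, bin_edges, cpt in datasets:
        score = pair_scores.get(pair)
        if score is None:
            continue
        b = bisect_right(bin_edges, score)  # determine the bin
        prod_yes *= cpt[1][b]               # multiply in P(bin | FR=yes)
        prod_no *= cpt[0][b]                # multiply in P(bin | FR=no)
    return prod_yes / (prod_yes + prod_no)  # dividing normalizes out alpha

# Toy example with one dataset and invented CPT values.
toy_dataset = ({("a", "b"): 2.0}, [0.0, 1.0],
               {1: [0.25, 0.25, 0.5], 0: [0.5, 0.25, 0.25]})
posterior = infer_pair(("a", "b"), [toy_dataset], prior_yes=0.10)
print(round(posterior, 3))
```

With hundreds of datasets the raw products can underflow; summing log-probabilities instead of multiplying is a common remedy.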
Evaluation Metrics • TPs, FPs, TNs, FNs • Agnostic to pairs not appearing in the gold standard • ROC curves: Sensitivity-Specificity • PR curves: Precision-Recall
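Precision Recall Curves • Predictions are ordered by confidence • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) [Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1)]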
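A precision-recall curve can be traced by walking down the ordered predictions and recomputing both quantities at each cutoff. This sketch assumes at least one positive in the input and does not treat tied scores specially; the function name and toy scores are illustrative.

```python
def precision_recall_points(scored_pairs):
    """scored_pairs: list of (score, label) with label 1 = positive, 0 = negative.

    Returns one (recall, precision) point per rank cutoff.
    """
    ordered = sorted(scored_pairs, key=lambda t: t[0], reverse=True)
    total_pos = sum(label for _, label in ordered)
    points, tp = [], 0
    for rank, (_, label) in enumerate(ordered, start=1):
        tp += label
        # Everything up to this rank is predicted positive, so TP + FP = rank.
        points.append((tp / total_pos, tp / rank))
    return points

toy_predictions = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 0)]
pr_points = precision_recall_points(toy_predictions)
```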
Summary Statistics • AUC – area under the (ROC) curve • equivalent to Mann-Whitney U • Average Precision – average of the precisions calculated at each true positive • quantized version of area under precision recall curve (AUPRC) • Precision @ n% recall
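Both summary statistics can be computed directly from scored, labeled pairs. The sketch below computes AUC through its Mann-Whitney interpretation (the fraction of positive-negative pairs ranked correctly, with ties counting one half) and average precision as defined on the slide. The quadratic AUC loop is for clarity only, and the function names are illustrative.

```python
def average_precision(scored_pairs):
    """Mean of the precision values measured at the rank of each true positive."""
    ordered = sorted(scored_pairs, key=lambda t: t[0], reverse=True)
    tp, precisions = 0, []
    for rank, (_, label) in enumerate(ordered, start=1):
        if label == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)

def auc_mann_whitney(scored_pairs):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, label in scored_pairs if label == 1]
    neg = [s for s, label in scored_pairs if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_scored = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 0)]
print(average_precision(toy_scored), auc_mann_whitney(toy_scored))
```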
Graph Analysis for Predictions [Figure: network around gene g_i] • c_i = confidence of function • S = set of genes in function • G = set of all genes • w_{i,j} = weight of edge between genes i and j
Steps for Our Evaluation Construct a gold standard Convert data to pair-wise format Count positive/negative pairs in each dataset Create CPTs to define Bayes net Inference to calculate all pair-wise probabilities Evaluate performance Predict functions given network
Bayesian Network Integration
• Inputs: gene expression (datasets 1 to N); physical interactions (yeast two-hybrid, co-precipitation); genetic interactions (synthetic lethality, synthetic rescue); other (transcription factor binding sites, localization, curated literature)
• Data integration via a Bayesian network produces probabilistic, weighted networks of gene function
• User-selected query focuses search
• Results displayed, e.g. new genes predicted to interact with known mitochondrial genes
Myers et al., 2005; Huttenhower et al., 2006; Guan et al., 2008
Basic Approach Applied Several Times • Huttenhower et al., 2009 • Myers et al., 2005; 2007 • Guan et al., 2008 • Huttenhower et al., 2007
Limitations and Improvements • Original work was designed for yeast and for a general notion of functional relatedness • Ignores the reality that some genes are related only under certain conditions • Treats multi-cellular organisms as big single-celled organisms • Increased specificity can be used to improve results • 2nd iteration of bioPIXIE incorporated biological processes into the gold standards • Currently working on a 2nd generation mouseNET to account for tissues and developmental stages
Global Gold Standard positive relationships negative relationships Global Gold Standard
Specific Gold Standards • Not all datasets capture all functional relationships • Process/Pathway specific • Functionally related genes aren’t functionally related in every context • Tissue specific • Developmental stage specific
Specific Gold Standard Construction positive relationships negative relationships Global Gold Standard Specific Gold Standard
Tissue/Stage Gold Standards • Based on data from GXD • Cross reference Theiler stages with mammalian anatomy hierarchy • 729 total intersections • ranging from 50 to ~3500 genes • not including post-natal stages
Preliminary Results training evaluation test evaluation Running 4-fold cross validation using tissue/stage specific GO-based gold standards
Preliminary Results training evaluation test evaluation Accounting for developmental stage helps
Preliminary Results training evaluation test evaluation Many specific tissue/stage combinations are overfitting
Preliminary Results • Folds were randomly generated and are biased • Need to balance positives and negatives across folds
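One common way to balance folds is stratification: shuffle positives and negatives separately and deal each group round-robin into the folds. The slide does not say how the balancing was to be done, so this is a sketch of one option rather than the approach actually used; the function name and toy labels are hypothetical.

```python
import random

def stratified_folds(labels, k=4, seed=0):
    """Split labeled pairs into k folds with positives and negatives spread evenly.

    labels: dict mapping gene pair -> 1 (positive) or 0 (negative).
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (1, 0):
        items = sorted(p for p, l in labels.items() if l == label)
        rng.shuffle(items)
        for i, item in enumerate(items):
            folds[i % k].append(item)  # deal round-robin into the folds
    return folds

# Toy gold standard: 8 positive and 16 negative pairs.
toy_labels = {(f"g{i}", f"h{i}"): (1 if i < 8 else 0) for i in range(24)}
toy_folds = stratified_folds(toy_labels, k=4)
```

Each fold then serves once as the test set while the remaining three are used for training.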
New Visualization Interface Graphle
Simple Things, Long Times • No single step is too complicated • Mostly O(G²D) for G genes and D datasets • 16M pairs * 800 datasets * 4 folds • Evaluating one fold takes ~7 hours • So far have results for ~200 tissue/stages • Should have taken ~3 days on the cluster • Actually took ~15 days