Machine Learning for Functional Genomics II

Matt Hibbs, http://cbfg.jax.org

Functional Genomics
Identify the roles played by genes/proteins (Sealfon et al., 2006)

Promise of Computational Functional Genomics
[Diagram: Data & Existing Knowledge, Laboratory Experiments, Computational Approaches, Predictions]

Computational Solutions
• Machine learning & data mining
  • Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  • Validate predictions with follow-up lab work
• Visualization & exploratory analysis
  • Seeing and interacting with data important
  • Show data so that questions can be answered
  • Scalability, incorporate statistics, etc.

Bayesian Networks
[Diagram: nodes "Raining?", "Jim brought umbrella", "Cloudy this morning", "Rain in forecast"]
Encodes dependence relationships between observed and unobserved events

Bayesian Network Overview
• Graphical representation of relationships
• Probabilistic information from data to concepts

Naïve Bayes
• No internal hidden nodes
• Greatly simplifies problem, reduces computational complexity and time
• Imposes independence assumption
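The independence assumption can be written out as a worked equation. The notation is an assumption made for illustration: FR is the unobserved "functionally related" variable (the later Inference slide uses FR=yes and FR=no), D_i is the binned value a gene pair takes in dataset i, and α is the normalizing constant mentioned on that same slide.

```latex
P(FR \mid D_1, \dots, D_n) \;=\; \alpha \, P(FR) \prod_{i=1}^{n} P(D_i \mid FR),
\qquad
\alpha \;=\; \left[ \sum_{fr \in \{\text{yes},\,\text{no}\}} P(fr) \prod_{i=1}^{n} P(D_i \mid fr) \right]^{-1}
```

Each factor P(D_i | FR) is one conditional probability table (CPT), so every dataset contributes independently to the product.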

Learning Naïve Bayes Nets …

Steps for Bayesian network integration
1. Construct a gold standard
2. Convert data to pair-wise format
3. Count positive/negative pairs in each dataset
4. Create CPTs to define Bayes net
5. Inference to calculate all pair-wise probabilities
6. Evaluate performance
7. Predict functions given network

Gold Standard Construction
• Gene Ontology annotations used to define known functional relationships
[Figure labels: threshold for positive relationships; threshold for negative relationships]
Myers et al., 2006

Gold Standard Used For Training
[Figure: positive relationships and negative relationships within the Global Gold Standard]

Gene-Gene Scores
• Binary data
  • PPI, co-localization, synthetic lethality
  • Can use binary scores
  • Can use profiles to generate scores (dot product)
• Continuous data
  • Profile distance metrics
• Binning results
  • Converts everything to discrete case
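A minimal sketch of the two scoring routes on this slide: a dot product over binary profiles, and discretizing a continuous score into bins. The cut points in `EDGES` and the example profiles are hypothetical values chosen only for illustration.

```python
from bisect import bisect_right

def dot(x, y):
    """Dot product of two equal-length profiles (e.g. binary interaction profiles)."""
    return sum(a * b for a, b in zip(x, y))

def bin_index(score, edges):
    """Return the bin a continuous score falls into.

    `edges` holds the sorted interior cut points, so there are len(edges) + 1 bins.
    A score equal to a cut point goes into the bin above it.
    """
    return bisect_right(edges, score)

# Hypothetical binary profiles for two genes (1 = interaction observed).
profile_x = [1, 0, 1, 1]
profile_y = [1, 1, 0, 1]
shared = dot(profile_x, profile_y)

# Hypothetical cut points for a continuous pair-wise score (assumption, not from the slides).
EDGES = [-1.0, 0.0, 1.0, 2.0]
scores = [-2.3, -0.4, 0.0, 1.7, 3.2]
bins = [bin_index(s, EDGES) for s in scores]
```

After binning, binary and continuous datasets can be handled by the same discrete counting code.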

Distance Metrics
Euclidean Distance, Pearson Correlation, Spearman Correlation
• Choice of distance measure is important for quantifying relationships in datasets
• Pair-wise metrics – compare vectors of numbers
  • e.g. genes x & y, each with n measurements

Distance Metrics
Euclidean Distance, Pearson Correlation, Spearman Correlation
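The three metrics named on these slides can be sketched in plain Python using their standard textbook definitions. This is not the code used in the talk; function names and the example vectors are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Genes x and y, each with n = 4 measurements (made-up numbers).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
```

Here y rises monotonically but not linearly with x, so Spearman reports a perfect relationship while Pearson is slightly below 1, and the Euclidean distance is large because the scales differ.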

Sensible Binning
• The commonly used Pearson correlation yields greatly different distributions of correlation in different datasets
• These differences complicate comparisons
[Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 97 and Primig et al., 00]

Sensible Binning
• Fisher Z-transform, Z-score equalizes distributions
• Increases comparability between datasets
[Figure: histograms of Z-scores between all pairs of genes]
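A rough sketch of the normalization, assuming the Z-score step means standardizing the Fisher-transformed correlations within each dataset. The clipping constant `eps` is an added assumption to keep correlations of exactly ±1 finite.

```python
import math

def fisher_z(r, eps=1e-6):
    """Fisher Z-transform: atanh(r) = 0.5 * ln((1 + r) / (1 - r)).

    r is clipped to (-1 + eps, 1 - eps); the clipping is an assumption of this sketch.
    """
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)

def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def normalize_correlations(corrs):
    """Fisher-transform all pair-wise correlations of one dataset, then Z-score them."""
    return zscore([fisher_z(r) for r in corrs])

# Made-up correlations for four gene pairs in one dataset.
normalized = normalize_correlations([0.9, 0.5, 0.1, -0.3])
```

Because every dataset ends up on the same scale, one set of bin boundaries can then be shared across datasets.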

Pre-calculation and Storage
• Pair-wise distances only need to be calculated once, even if using different binnings
• Typical mouse microarray: ~5-20k genes
• 16M pair-wise distances
• ~50-700 MB of storage for one dataset
• ~800 datasets in GEO
• ~200 GB for all datasets
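The storage figures can be checked with a little arithmetic. The sketch assumes each distance is stored once per unordered gene pair as a 4-byte float; the slide does not state the storage format, so that is an assumption.

```python
def n_pairs(n_genes):
    """Number of unordered gene pairs: G * (G - 1) / 2."""
    return n_genes * (n_genes - 1) // 2

def storage_mb(n_genes, bytes_per_value=4):
    """Storage for one dataset in MB, assuming one fixed-width value per pair."""
    return n_pairs(n_genes) * bytes_per_value / 1e6

small = storage_mb(5_000)    # roughly 50 MB
large = storage_mb(19_000)   # roughly 720 MB
pairs_16m = n_pairs(5_700)   # roughly 16 million pairs
```

Under these assumptions the slide's ~50-700 MB range corresponds to roughly 5k to 19k genes, and ~200 GB over ~800 datasets implies an average of about 250 MB per dataset.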

Counting & Learning
• Conceptually straightforward
• Counting
  • Just look at all of the pairs in each dataset, see which bin it falls into, increment a counter
  • But… you need to do this 16M times/dataset
  • “Dumb” parallelization – each dataset is independent
• Learning CPTs
  • Fractions based on counts
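Counting and CPT learning for a single dataset might look like the sketch below. The data layout (dictionaries keyed by gene pair) and the `pseudocount` smoothing are assumptions for illustration; the slide only says the CPT entries are fractions based on counts.

```python
def learn_cpt(pair_bins, gold, n_bins, pseudocount=1.0):
    """Learn P(bin | FR) for one dataset.

    pair_bins: {pair: bin index} for every gene pair measured in the dataset.
    gold:      {pair: True/False} gold-standard labels; pairs absent from it are skipped.
    pseudocount: smoothing added to every bin (an assumption of this sketch);
                 it keeps unseen bins from getting probability zero.
    """
    counts = {True: [0] * n_bins, False: [0] * n_bins}
    for pair, b in pair_bins.items():
        label = gold.get(pair)
        if label is None:
            continue  # pair not in the gold standard
        counts[label][b] += 1

    cpt = {}
    for label, row in counts.items():
        total = sum(row) + pseudocount * n_bins
        cpt[label] = [(c + pseudocount) / total for c in row]
    return cpt

# Toy dataset with 3 bins; ("a", "d") is not in the gold standard.
pair_bins = {("a", "b"): 2, ("a", "c"): 0, ("b", "c"): 2, ("a", "d"): 1, ("b", "d"): 0}
gold = {("a", "b"): True, ("b", "c"): True, ("a", "c"): False, ("b", "d"): False}
cpt = learn_cpt(pair_bins, gold, n_bins=3)
```

Since each dataset's counts depend only on that dataset and the gold standard, one such call per dataset can run as an independent job.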

Inference
• Also pretty straightforward
• For all pairs of genes…
  • For each dataset
    • Look-up value from pre-calculated distances
    • Determine bin and value from CPT
    • Multiply probability into product
  • Do this for FR=yes and FR=no
  • Normalize out α
  • Store Result
• 1.5GB result file
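For one gene pair, the inference loop can be sketched as follows. The products are accumulated as sums of logs, which is a swapped-in numerical detail (multiplying hundreds of small probabilities can underflow) rather than something the slide states. The CPT layout, dataset names, and prior are hypothetical, and every CPT entry is assumed to be non-zero.

```python
import math

def posterior_fr(observed_bins, cpts, prior_yes):
    """Posterior P(FR=yes | data) for one gene pair under naive Bayes.

    observed_bins: {dataset name: bin index} for datasets that measured this pair.
    cpts:          {dataset name: {True: [P(bin | FR=yes)], False: [P(bin | FR=no)]}}.
    prior_yes:     prior probability of a functional relationship.
    """
    log_yes = math.log(prior_yes)
    log_no = math.log(1.0 - prior_yes)
    for dataset, b in observed_bins.items():
        log_yes += math.log(cpts[dataset][True][b])
        log_no += math.log(cpts[dataset][False][b])
    # Normalizing the two unnormalized terms is the "normalize out alpha" step.
    m = max(log_yes, log_no)
    yes, no = math.exp(log_yes - m), math.exp(log_no - m)
    return yes / (yes + no)

# Hypothetical CPTs for two datasets.
cpts = {
    "expr1": {True: [0.2, 0.2, 0.6], False: [0.6, 0.2, 0.2]},
    "y2h": {True: [0.7, 0.3], False: [0.95, 0.05]},
}
p = posterior_fr({"expr1": 2, "y2h": 1}, cpts, prior_yes=0.1)
```

A pair that appears in no dataset simply keeps the prior, since no factors are multiplied in.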

Evaluation Metrics
• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in standard
• ROC curves: Sensitivity-Specificity
• PR curves: Precision-Recall

Precision Recall Curves
Ordered predictions
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
[Plot: precision (0 to 1) against recall (0 to 1)]
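A short sketch of how the curve is traced from ordered predictions: walk down the ranked list and recompute precision and recall after each prediction. The input format (a list of gold-standard labels, best prediction first) is an assumption for illustration.

```python
def pr_curve(ranked_labels):
    """Return (recall, precision) after each prediction in a ranked list.

    ranked_labels: gold-standard labels (True = positive) ordered from the most
    confident prediction to the least confident one.
    """
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        raise ValueError("need at least one positive to compute recall")
    tp = fp = 0
    points = []
    for is_pos in ranked_labels:
        if is_pos:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos          # TP / (TP + FN)
        precision = tp / (tp + fp)       # TP / (TP + FP)
        points.append((recall, precision))
    return points

points = pr_curve([True, True, False, True, False])
```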

Summary Statistics
• AUC – area under the (ROC) curve
  • equivalent to Mann-Whitney U
• Average Precision – average of the precisions calculated at each true positive
  • quantized version of area under precision recall curve (AUPRC)
• Precision @ n% recall
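Both summary statistics can be sketched directly from their definitions. The AUC version uses the Mann-Whitney formulation: the fraction of (positive, negative) pairs in which the positive scores higher, with ties counted as half. It is quadratic in the number of examples, so it is meant as an illustration rather than something to run on millions of pairs.

```python
def average_precision(ranked_labels):
    """Mean of the precision values measured at each true positive in a ranked list."""
    tp = 0
    precisions = []
    for rank, is_pos in enumerate(ranked_labels, start=1):
        if is_pos:
            tp += 1
            precisions.append(tp / rank)
    if not precisions:
        raise ValueError("need at least one positive")
    return sum(precisions) / len(precisions)

def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as U / (n_pos * n_neg): probability a positive outranks a negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up scores whose ranking is: positive, positive, negative, positive, negative.
ap = average_precision([True, True, False, True, False])
auc = auc_mann_whitney([0.9, 0.8, 0.6], [0.7, 0.2])
```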

Cross Validation

Graph Analysis for Predictions
[Figure: network node g_i]
• c_i = confidence of function
• S = set of genes in function
• G = set of all genes
• w_{i,j} = weight of edge
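The slide's scoring formula is not preserved in this transcript, so the sketch below is only one hypothetical way to combine the listed symbols: it compares a gene's average edge weight to the genes in S against its average edge weight to all genes in G. The ratio form, the function name, and the toy weights are assumptions, not the method from the talk.

```python
def function_confidence(gene, weights, in_function, all_genes):
    """Hypothetical confidence c_i that `gene` belongs to a function.

    weights:     {frozenset({a, b}): w} edge weights of the predicted network.
    in_function: genes already known to be in the function (S).
    all_genes:   every gene in the network (G).
    Assumed score: mean weight to S divided by mean weight to G.
    """
    def mean_weight(targets):
        others = [t for t in targets if t != gene]
        total = sum(weights.get(frozenset((gene, t)), 0.0) for t in others)
        return total / len(others)

    return mean_weight(in_function) / mean_weight(all_genes)

# Toy network: S = {a, b}; gene c is strongly tied to S, gene d is not.
G = ["a", "b", "c", "d"]
S = ["a", "b"]
w = {
    frozenset(("a", "b")): 0.9, frozenset(("a", "c")): 0.8, frozenset(("b", "c")): 0.7,
    frozenset(("a", "d")): 0.1, frozenset(("b", "d")): 0.2, frozenset(("c", "d")): 0.1,
}
conf_c = function_confidence("c", w, S, G)
conf_d = function_confidence("d", w, S, G)
```

Under this assumed score, a value well above 1 means the gene is more strongly connected to the function's genes than to the network as a whole.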

Steps for Our Evaluation
1. Construct a gold standard
2. Convert data to pair-wise format
3. Count positive/negative pairs in each dataset
4. Create CPTs to define Bayes net
5. Inference to calculate all pair-wise probabilities
6. Evaluate performance
7. Predict functions given network

Bayesian Network Integration
[Figure: input data grouped by type]
• Gene expression: gene expression datasets 1 to N
• Physical interactions: yeast two-hybrid dataset 1, co-precipitation dataset 1
• Genetic interactions: synthetic lethality dataset, synthetic rescue dataset
• Other: transcription factor binding sites, localization, curated literature
Data integration via a Bayesian network produces probabilistic, weighted networks of gene function. A user-selected query focuses the search, and results are displayed (example: new genes predicted to interact with known mitochondrial genes).
Myers et al., 2005; Huttenhower et al., 2006; Guan et al., 2008

Basic Approach Applied Several Times
Huttenhower et al., 2009; Myers et al., 2005, 2007; Guan et al., 2008; Huttenhower et al., 2007

Limitations and Improvements
• Original work designed for yeast, with a general notion of "functionally related"
  • Ignores reality that some genes are related only under certain conditions
  • Treats multi-cellular organisms as big single-celled organisms
• Increased specificity can be used to improve results
  • 2nd iteration of bioPIXIE incorporated biological processes into gold standards
  • Currently working on 2nd generation mouseNET to account for tissue and developmental stages

General mouseNET Approach

Global Gold Standard
[Figure: positive relationships and negative relationships within the Global Gold Standard]

Specific Gold Standards
• Not all datasets capture all functional relationships
• Process/Pathway specific
• Functionally related genes aren’t always functionally related
• Tissue specific
• Developmental stage specific

Specific Gold Standard Construction
[Figure: positive relationships and negative relationships, shown for the Global Gold Standard and a Specific Gold Standard]

Tissue/Stage Gold Standards
• Based on data from GXD
• Cross reference Theiler stages with mammalian anatomy hierarchy
• 729 total intersections
  • ranging from 50 to ~3500 genes
  • not including post-natal stages

Initial Computational Evaluations

Preliminary Results
[Plots: training evaluation, test evaluation]
Running 4-fold cross validation using tissue/stage specific GO-based gold standards

Preliminary Results
[Plots: training evaluation, test evaluation]
Accounting for developmental stage helps

Preliminary Results
[Plots: training evaluation, test evaluation]
Many specific tissue/stage combinations are overfitting

Preliminary Results
Folds were randomly generated and are biased; need to balance positives and negatives across folds
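One common way to balance positives and negatives is stratified fold assignment: shuffle each class separately and deal its members round-robin into the folds. This is a generic sketch of that idea, not the fix adopted in the talk; the function name, seed, and toy labels are illustrative.

```python
import random

def stratified_folds(labels, k=4, seed=0):
    """Split example indices into k folds with near-equal class balance.

    labels: list of True/False gold-standard labels, one per example.
    Returns a list of k lists of indices.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (True, False):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal this class's examples round-robin so each fold gets its share.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Toy gold standard: 8 positive and 20 negative pairs.
labels = [True] * 8 + [False] * 20
folds = stratified_folds(labels, k=4)
```

With 4 folds, each fold here receives 2 positives and 5 negatives, so training and test sets see the same class ratio.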

New Visualization Interface: Graphle

Simple Things → Long Times
• No single step is too complicated
• Mostly O(G²D)
  • 16M * 800 * 4
• Evaluating one fold ~7 hours
• So far have results for ~200 tissue/stages
  • Should take ~3 days on the cluster
  • Actually took ~15 days
