Machine Learning for Functional Genomics I Matt Hibbs http://cbfg.jax.org
Central Dogma DNA → Gene Expression → Proteins → Phenotypes
Functional Genomics Identify the roles played by genes/proteins Sealfon et al., 2006.
Gene Expression Microarrays Simultaneous measurements of mRNA abundance levels for every gene in a genome Conditions Genes
Gene Expression Microarrays Simultaneous measurements of mRNA abundance levels for every gene in a genome, across thousands of conditions Rich functional information in these data, but how can we make use of the entire compendium?
Biological Data Explosion Huge repositories of biological data… [Plots: publicly available microarrays in GEO (# of measurements vs. year); mouse genes with known process association (# of genes vs. year)] …are not directly translating into knowledge
Why is there a Data-Knowledge Gap? • Many datasets are analyzed only once • Initial publication looks for hypothesis • Need standards for naming, formats, collection • Data should be aggregated and integrated • Modestly significant clues seen repeatedly can become convincing • “a preponderance of circumstantial evidence” • Scale of this problem overwhelms traditional biology
Scalable Artificial Intelligence Computer science is really a study in scalability Use machine learning and data mining techniques to quickly identify important patterns
Amazon Recommendations Purchase History Item Rankings • Compare your purchase history to all other customers • Find commonalities between profiles • Predict potential purchases Machine Learning (Bayesian networks) Observe Browsing Patterns and Account Activity Recommendations
Gene Function Prediction Purchase History Item Rankings Genome Scale Data MGI Annotations ≈ Observe Browsing Patterns and Account Activity Machine Learning (Bayesian networks) Laboratory Experiments Machine Learning (Bayesian networks) Recommendations Predictions
Challenges for AI from Biology Input data is noisy, heterogeneous, constantly evolving Current knowledge is incomplete and biased Can be difficult to determine accuracy
Promise of Computational Functional Genomics Data & Existing Knowledge Laboratory Experiments Computational Approaches Predictions
Reality of Computational Functional Genomics Data & Existing Knowledge Laboratory Experiments Computational Approaches Predictions
Computational Solutions • Machine learning & data mining • Use existing data to make new predictions • Similarity search algorithms • Bayesian networks • Support vector machines • etc. • Validate predictions with follow-up lab work • Visualization & exploratory analysis • Seeing and interacting with data important • Show data so that questions can be answered • Scalability, incorporate statistics, etc.
Similarity Search Approach Relevant Datasets Search Algorithm (SPELL) Data Collection Query Genes Related Genes • Re-frame analysis as exploratory search
Key Insights X = UΣVᵀ • Context-Sensitive Search Process • Signal Balancing • Correlation Comparability
Dataset relevance weighting Query Genes: Q = {YQG1, YQG2, YQG3} Example dataset weights: 0.15, 0.82, 0.05, 0.55 Calculate a correlation measure among the query genes within each dataset; this is each dataset's weight
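The weighting step above can be sketched roughly as follows. This is a minimal illustration, not the exact SPELL formula: it assumes the weight is the mean pairwise Pearson correlation among the query genes, with negative correlations clipped to zero. The function names `pearson` and `dataset_weight` and the toy expression values are made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def dataset_weight(dataset, query):
    """Assumed weighting: mean pairwise correlation among the query genes,
    with negative correlations clipped to zero."""
    pairs = list(combinations(query, 2))
    total = sum(max(pearson(dataset[a], dataset[b]), 0.0) for a, b in pairs)
    return total / len(pairs)


# Toy expression profiles (invented values): one dataset where the query
# genes move together and one where they do not.
query = ["YQG1", "YQG2", "YQG3"]
coherent = {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 8], "YQG3": [1, 3, 5, 7]}
incoherent = {"YQG1": [1, 2, 3, 4], "YQG2": [4, 3, 2, 1], "YQG3": [1, -1, 1, -1]}

weights = {
    "coherent": dataset_weight(coherent, query),
    "incoherent": dataset_weight(incoherent, query),
}
```

A dataset in which the query genes are co-expressed gets a weight near 1, so it contributes more to the later search; a dataset in which they are unrelated gets a weight near 0.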
Identify Novel Partners Query Genes: Q = {YQG1, YQG2, YQG3} Dataset weights: 0.15, 0.82, 0.05, 0.55 Other genes: geneA, geneB, geneC Calculate a weighted distance score from every other gene to the query set
Identify Novel Partners Query Genes: Q = {YQG1, YQG2, YQG3} Dataset weights: 0.15, 0.82, 0.05, 0.55 Calculate a weighted distance score from every other gene to the query set Resulting order: geneB (best score), geneC, geneA (worst score) + Takes advantage of functional diversity + Addresses statistical concerns + Fast running times [O(GDQ²)] (ms per query) + Top results are candidates for investigation + Search process is iterative to refine results
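A rough sketch of the scoring and ranking step, under stated assumptions: each candidate gene's score is taken to be the weight-averaged mean correlation to the query genes across datasets. The names `pearson` and `weighted_scores`, the two toy datasets, and their expression values are hypothetical; only the two weights are borrowed from the slide's example numbers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def weighted_scores(datasets, query, candidates):
    """Assumed score: sum over datasets of weight * mean correlation to the
    query genes, divided by the total weight."""
    total_weight = sum(weight for weight, _ in datasets)
    scores = {}
    for gene in candidates:
        acc = 0.0
        for weight, expr in datasets:
            mean_corr = sum(pearson(expr[gene], expr[q]) for q in query) / len(query)
            acc += weight * mean_corr
        scores[gene] = acc / total_weight
    return scores


# Two toy datasets (invented values) paired with precomputed weights.
query = ["YQG1", "YQG2"]
datasets = [
    (0.82, {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 8],
            "geneA": [4, 3, 2, 1], "geneB": [1, 2, 3, 5], "geneC": [1, 3, 2, 4]}),
    (0.05, {"YQG1": [1, 2, 3, 4], "YQG2": [1, 2, 3, 4],
            "geneA": [1, 2, 3, 4], "geneB": [4, 3, 2, 1], "geneC": [2, 1, 4, 3]}),
]

scores = weighted_scores(datasets, query, ["geneA", "geneB", "geneC"])
ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
```

Because the second dataset carries little weight, geneA's perfect correlation there does not rescue it: the heavily weighted first dataset dominates the ranking.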
Key Insights X = UΣVᵀ • Context-Sensitive Search Process • Signal Balancing • Correlation Comparability
Signal Balancing Data - SVD • Singular Value Decomposition (SVD) • Projects data into another orthonormal basis • Correlations in U (rather than X) often contain better biological signals
Signal Balancing SVD Signal Balancing
Signal Balancing • Use correlations among left singular vectors • Downweights dominant patterns, amplifies subtle patterns • Top eigengenes dominate data • Sometimes correspond to systematic bias • Often correspond to common biological processes • e.g., ribosome biogenesis • Accuracy of signal balancing improved over re-projection
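One way to see why correlations in U downweight dominant patterns is to compare gene-gene inner products before and after the decomposition. The notation below is an assumption for illustration: x_i and u_i are the rows of X and U for gene i, σ_k are the singular values, and r is the rank of X. Pearson correlation additionally centers and scales each row, which this sketch leaves out.

```latex
\begin{align*}
X &= U \Sigma V^{T}, \qquad X X^{T} = U \Sigma^{2} U^{T} \\
\langle x_i, x_j \rangle &= \sum_{k=1}^{r} \sigma_k^{2}\, u_{ik}\, u_{jk} \\
\langle u_i, u_j \rangle &= \sum_{k=1}^{r} u_{ik}\, u_{jk}
\end{align*}
```

In X, each component k contributes in proportion to σ_k², so the top eigengenes dominate gene-gene similarity. In U, every component contributes with equal weight, which is the balancing effect described above.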
Key Insights X = UΣVᵀ • Context-Sensitive Search Process • Signal Balancing • Correlation Comparability
Between-dataset normalization • The commonly used Pearson correlation yields very different distributions of correlation values from one dataset to another • These differences complicate comparisons between datasets [Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 1997 and Primig et al., 2000]
Between-dataset normalization • Applying the Fisher z-transform and then z-scoring equalizes the distributions • Increases comparability between datasets [Figure: histograms of z-scores between all pairs of genes]
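The normalization above can be sketched as follows. This is a minimal version that assumes the z-scoring uses the mean and population standard deviation of all transformed correlations within one dataset. The function name `normalize_correlations`, the clipping constant, and the sample values are hypothetical.

```python
from math import atanh
from statistics import mean, pstdev


def normalize_correlations(correlations, clip=0.999999):
    """Fisher z-transform each Pearson correlation, then z-score the results
    within the dataset so they have mean 0 and standard deviation 1."""
    # atanh(r) = 0.5 * ln((1 + r) / (1 - r)); clip to avoid infinities at r = ±1.
    fisher = [atanh(max(-clip, min(clip, r))) for r in correlations]
    mu, sigma = mean(fisher), pstdev(fisher)
    return [(z - mu) / sigma for z in fisher]


# Two toy datasets (invented values) with very different correlation spreads.
narrow = normalize_correlations([-0.2, -0.1, 0.0, 0.1, 0.2])
wide = normalize_correlations([-0.9, -0.5, 0.0, 0.5, 0.9])
```

After normalization both toy datasets have the same mean and spread, so a given z-score means roughly the same thing in either one.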
SPELL Algorithm Overview Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.
Web Interface http://spell.princeton.edu
Evaluation of Performance • Leave-k-in cross validation / bootstrapping • Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006) • Many predictions also verified through experimental validations in other studies • Hibbs et al., Bioinf, 2007 • Hess et al., PLoS Gen, 2009 • Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009
Search Accuracy • Perform “leave-k-in” cross-validation Order Genome Master List … Rank Average Genes with common function For all pairs
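A hedged sketch of how leave-k-in trials might be set up: k genes from a set with a common function are used as the query, and the remaining genes of the set are held out to see where a search ranks them. The helper names `leave_k_in_trials` and `mean_rank`, the gene labels, and the use of mean rank as a summary are illustrative assumptions, not the evaluation code behind these slides.

```python
import random


def leave_k_in_trials(gene_set, k, n_samples, seed=0):
    """Sample n_samples queries of k genes; the rest of the set is held out."""
    rng = random.Random(seed)
    genes = sorted(gene_set)
    trials = []
    for _ in range(n_samples):
        query = rng.sample(genes, k)
        held_out = [g for g in genes if g not in query]
        trials.append((query, held_out))
    return trials


def mean_rank(ranked_genes, held_out):
    """Average 1-based position of the held-out genes in a ranked result list."""
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    return sum(position[g] for g in held_out) / len(held_out)


# Toy gene set with a shared function (labels are invented).
known = {"g1", "g2", "g3", "g4", "g5", "g6"}
trials = leave_k_in_trials(known, k=2, n_samples=3)

# Held-out genes at positions 1, 3 and 5 of a toy search result.
example_rank = mean_rank(["g3", "x1", "g4", "x2", "g5"], ["g3", "g4", "g5"])
```

In a full evaluation, each sampled query would be passed to the search and the held-out genes treated as the positives to recover.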
Search Accuracy • Precision-Recall Curve, computed from the master list Precision = TP / (TP + FP) Recall = TP / (TP + FN) [Plot: precision (0 to 1) vs. recall (0 to 1)]
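The precision and recall definitions above can be turned into a curve by walking down a ranked list. The sketch below records one point at each true positive. The function name `precision_recall_curve`, the gene labels, and the assumption that every positive appears somewhere in the ranked list are all illustrative.

```python
def precision_recall_curve(ranked_genes, positives):
    """Return (recall, precision) points, one per true positive encountered
    while walking the ranked list from best to worst."""
    positives = set(positives)
    true_positives = 0
    curve = []
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            true_positives += 1
            recall = true_positives / len(positives)  # TP / (TP + FN)
            precision = true_positives / k            # TP / (TP + FP)
            curve.append((recall, precision))
    return curve


# Toy ranked list (invented labels): positives sit at ranks 1, 3 and 5.
curve = precision_recall_curve(["g1", "x1", "g3", "x2", "g5"], {"g1", "g3", "g5"})
```

Precision drops as false positives accumulate above each newly recovered gene, which is what bends the curve downward as recall approaches 1.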
Sample & Query Size Effects Even relatively small sample sizes produce similar results (1000 samples used for all other tests) Significant performance gain between 2 and 3 query genes, little change beyond (5 query genes used for all other tests)
Effect of Signal Balancing Improvement is robust to missing value imputation method Signal balancing further improves context-specific search performance
Effects of Signal Balancing [Figure panels: signal balanced; n% re-projection; n% balanced]
Computational Solutions • Machine learning & data mining • Use existing data to make new predictions • Similarity search algorithms • Bayesian networks • Support vector machines • etc. • Validate predictions with follow-up lab work • Visualization & exploratory analysis • Seeing and interacting with data important • Show data so that questions can be answered • Scalability, incorporate statistics, etc.
Function Prediction Evaluation Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009. • Cross-validation based on known biology • Most often used method in literature • Results are useful, but can be biased • Laboratory evaluation • More accurate, more difficult • Ultimate goal of functional genomics • Identify novel biology • Publish biological corpus
Promise of Computational Functional Genomics Data & Existing Knowledge Laboratory Experiments Computational Approaches Predictions