SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges Topic: Data Integration. Katerina Kechris, PhD Associate Professor Biostatistics and Informatics Colorado School of Public Health University of Colorado Denver.
Omics • Large-scale analyses for studying a population of molecules or molecular mechanisms • High-throughput data • Examples • Genomics (entire genome – DNA) • Proteomics (study of protein repertoire) • Epigenomics (study of DNA and histone modifications)
Omics [Figure labels: Epigenome, Phenome] Adapted from http://www.sciencebasedmedicine.org, http://www.scientificpsychic.com/fitness/transcription.gif, http://themedicalbiochemistrypage.org/images/hemoglobin.jpg, http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png, http://creatia2013.files.wordpress.com/2013/03/dna.gif
Large-scale Projects & Databases NCI 60 Database
Integration of Omics Data • Each type of data gives a different snapshot of the biological or disease system • Why integrate data? • Reduce false positives/negatives • Identify interactions between different molecules • Explore functional mechanisms
Challenges • When to integrate? • Dimensionality • Resolution • Heterogeneity • Interactions and Pathways
Challenge 1: When to integrate? • Early • Merging data to increase sample size • Intermediate • Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis • Late • Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results
Genomic Meta-analysis: Combining Multiple Transcriptomic Studies Tseng Lab, U. of Pittsburgh
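As a small illustration of the "combine p-value" route to late integration, the sketch below applies Fisher's method to p-values for the same gene from several independent studies. It is a generic textbook technique, not the specific approach of the lab cited on this slide. The function name `fisher_combined_pvalue` is made up for this example, and the studies are assumed to be independent.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). Assumes the p-values come from
    independent tests of the same null hypothesis.
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    # Under the null, -2 * sum(log p) follows a chi-square with 2k df.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)

    # For even degrees of freedom the chi-square upper tail has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total
```

Calling `fisher_combined_pvalue([0.01, 0.02])` returns the statistic together with a combined p-value smaller than either input, which is the intended pooling of evidence. Combining effect sizes instead would need per-study standard errors.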
Assessing Genomic Overlap: Permutation-based Strategies Bickel Lab, Berkeley & ENCODE Ann. Appl. Stat. (2010) 4:4 1660-1697.
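The idea of a permutation-based overlap test can be sketched as follows. This is a deliberately simplified stand-in, not the procedure of the cited paper: it treats one chromosome as circular, shifts one interval set by a random offset, and asks how often the shifted set overlaps the other at least as much as observed. All function names and the half-open `(start, end)` interval convention are assumptions made for this example.

```python
import random


def count_overlaps(a, b):
    """Count intervals in `a` that overlap at least one interval in `b`.

    Intervals are half-open (start, end) tuples on the same chromosome.
    """
    return sum(any(s < e2 and s2 < e for s2, e2 in b) for s, e in a)


def circular_shift(intervals, offset, length):
    """Shift intervals by `offset` on a circular chromosome of size `length`."""
    shifted = []
    for s, e in intervals:
        s2 = (s + offset) % length
        e2 = s2 + (e - s)
        if e2 <= length:
            shifted.append((s2, e2))
        else:
            # The interval wraps around the end: split it into two pieces.
            shifted.append((s2, length))
            shifted.append((0, e2 - length))
    return shifted


def overlap_permutation_test(a, b, length, n_perm=1000, seed=0):
    """Return (observed overlap count, one-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = count_overlaps(a, b)
    hits = 0
    for _ in range(n_perm):
        offset = rng.randrange(1, length)
        if count_overlaps(a, circular_shift(b, offset, length)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Shifting the whole set keeps the spacing between features intact, which is the main reason to prefer it over placing each interval independently at random. It still ignores genome heterogeneity, which more careful strategies are designed to handle.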
Challenge 2: Dimensionality • Most technologies produce tens to hundreds of thousands of measurements per sample • Exponential increase with 2+ data types • Dimension reduction • Process data type separately (filtering) • Combine with model fitting • Multivariate analysis
Sparse Multivariate Methods • Variable Selection, Discriminant Analysis, Visualization • Penalties (or regularization) to reduce parameter space, only a few entries are non-zero (sparsity) • Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS) Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
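To make the role of the penalty concrete, here is a minimal sketch of a rank-1 sparse CCA by alternating soft-thresholding, in the spirit of penalized matrix decomposition. It is a simplification and not the published algorithms cited above: within-data-type covariances are treated as identity, and the threshold is set as a fixed fraction of the largest entry rather than tuned to an L1 constraint. The function names and the `frac_u` / `frac_v` parameters are hypothetical.

```python
import math


def soft_threshold(vec, lam):
    """Shrink each entry toward zero by lam; entries below lam become 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vec]


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sparsify(vec, frac):
    # Assumption: threshold is a fraction (0 <= frac < 1) of the largest
    # absolute entry, so at least one entry always survives.
    lam = frac * max(abs(x) for x in vec)
    return normalize(soft_threshold(vec, lam))


def center(matrix):
    n, cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(cols)]
    return [[row[j] - means[j] for j in range(cols)] for row in matrix]


def sparse_cca_rank1(X, Y, frac_u=0.5, frac_v=0.5, n_iter=50):
    """Return sparse weight vectors (u, v) for the columns of X and Y.

    X and Y are lists of rows measured on the same samples.
    """
    Xc, Yc = center(X), center(Y)
    n, p, q = len(Xc), len(Xc[0]), len(Yc[0])
    # Cross-product matrix C = Xc^T Yc (p x q).
    C = [[sum(Xc[i][a] * Yc[i][b] for i in range(n)) for b in range(q)]
         for a in range(p)]
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        u = sparsify([sum(C[a][b] * v[b] for b in range(q)) for a in range(p)],
                     frac_u)
        v = sparsify([sum(C[a][b] * u[a] for a in range(p)) for b in range(q)],
                     frac_v)
    return u, v
```

Non-zero entries of `u` and `v` mark the features in each data type that carry the shared signal; raising `frac_u` or `frac_v` zeroes out more of them. For real omics matrices a vectorized numerical library would be the practical choice, since this pure-Python version only suits small examples.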
Challenge 3: Genomic Resolution • Base level (conservation, motif scores) • Regular intervals (expression/binding from tiling arrays) • Irregular intervals • Gene/ncRNA level data (expression) • Individual positions (SNP, methylation sites)
Challenge 4: Heterogeneity • Technology-specific sources of error • Different pre-processing, normalization • Different amounts of missing values • Data matching • Different identifiers • Not always one-to-one (microarrays) • Imputation
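The data-matching step can be sketched as below for the case where several microarray probes map to one gene. The identifiers, the probe-to-gene map, and the choice to average probes are all illustrative assumptions; other summaries (for example, keeping the most variable probe) are equally plausible.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Summarize probe-level values to gene level (many-to-one mapping).

    Probes without an annotated gene are dropped; multiple probes for the
    same gene are averaged (an assumption, not the only option).
    """
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


def match_features(first_by_gene, second_by_gene):
    """Keep only genes present in both data types, paired by identifier."""
    shared = sorted(set(first_by_gene) & set(second_by_gene))
    return {gene: (first_by_gene[gene], second_by_gene[gene]) for gene in shared}
```

Genes missing from one data type are silently dropped here; whether to drop or impute them is a separate modeling decision, as the Imputation bullet suggests.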
Challenge 4: Heterogeneity • Continuous • expression and binding data from microarrays, motif scores, protein/metabolite abundance • Counts • expression data from sequencing • Values in [0, 1] • conservation (UCSC), DNA methylation • Binary/Categorical • Thresholding (e.g., motif scores), genotype
Case Study: Development • Ci • important for differentiation of appendages during development • transcription factor – binds to DNA near target genes http://www.biology.ualberta.ca/locke.hp/research.htm http://howardhughes.trinity.duke.edu Kechris Lab, CU Denver
Hierarchical Mixture Model • Data • Transcriptome: Ci pathway mutants (expr) – irregular interval • Genome: DNA binding data of Ci (bind) – regular interval, DNA conservation across 14 insect species (cons) – base level • Goal: Predict gene targets of Ci • Hidden variable is gene target – hierarchical mixture model Dvorkin et al., 2013 (under review)
Challenge 5: Interactions and Pathways • Known Pathways • Incorporate information in databases (curated but sparse) • e.g., KEGG pathways have metabolite – protein interactions (directed graphs) • De novo Pathways • Discover novel interactions
Known Pathways [Figure labels: gene, metabolite] Joint modeling of metabolite and transcript data to identify active pathways Jornsten, Chalmers & Michailidis, U. Michigan Biostatistics (2012) 13:4 748-761
de novo Interactions • Single data • Pair-wise • Correlations (e.g., eQTL) • Bayesian networks • Multiple • Kernel-based methods • Probabilistic graphical models • Network analysis [Figure labels: Phenotype; Integration; SNP, methylation site, gene, protein, metabolite]
de novo Interactions Shojaie Lab U. Washington Biometrika (2010) 97 (3): 519-538.
Summary Methodology • Meta-analysis • Permutation-based Methods • Sparse Multivariate Methods • Graphical Models • Network Analysis