GenXHC: A Probabilistic Generative Model for Cross-hybridization Compensation in High-density Genome-wide Microarray Data
Jim Huang (1)
Joint work with Quaid Morris (1),(2), Tim Hughes (2) and Brendan Frey (1),(2)
(1) Probabilistic and Statistical Inference Group, University of Toronto
(2) Banting & Best Department of Medical Research, University of Toronto
ISMB 2005
Genome-wide profiling using high-density microarrays
• The move towards high-density arrays for genome-wide profiling presents challenges…
[Figure labels: Genome, Coding regions, Probes, Expression, Conditions]
Cross-hybridization in high-density microarrays
• As we move to higher-density arrays, cross-hybridization noise becomes significant and unavoidable
[Figure labels: mRNA transcript, Oligonucleotide probes, Hybridization, Cross-hybridization]
Cross-hybridization in high-density microarrays (cont’d)
• Large cross-hybridization noise component in high-density data!
Cross-hybridization compensation
• State-of-the-art methods for cross-hybridization compensation are designed for Affymetrix GeneChips
• Affymetrix MAS 5.0
• Robust Multi-array Analysis (RMA/GC-RMA) (1),(2)
(1) Wu, Z. and Irizarry, R.A. (2004) Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proc. Ninth International Conference on Research in Computational Molecular Biology (RECOMB), March 2004, pp. 98-106.
(2) Irizarry, R.A. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, pp. 249-264.
Bilinear model for cross-hybridization
• Each probe is assigned a set of cross-hybridizing transcript expression profiles
• Each transcript has a hybridization weight λ that determines its contribution
[Figure labels: X, Λ, Z]
The probabilistic generative model for cross-hybridization
• Model the data probabilistically as X = ΛZ + V, where X = [x_1 x_2 … x_T] is N × T, Z = [z_1 z_2 … z_T] is M × T, Λ is the N × M hybridization matrix, and V is additive noise
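The model X = ΛZ + V can be made concrete with a small simulation. The sketch below is only illustrative: the function name, the exponential draws for the weights and expression levels, the Gaussian noise, and the boolean mask S marking which weights may be non-zero are all assumptions made for the example, not the distributions used in GenXHC.

```python
import numpy as np

def simulate_bilinear_data(n_probes=6, n_transcripts=3, n_conditions=4,
                           noise_std=0.1, seed=0):
    """Draw toy data from X = Lam @ Z + V with a sparse N x M weight matrix."""
    rng = np.random.default_rng(seed)
    # S: boolean N x M mask of allowed probe-transcript interactions (assumed layout).
    S = rng.random((n_probes, n_transcripts)) < 0.4
    # Make sure every probe binds at least one transcript.
    S[np.arange(n_probes), rng.integers(0, n_transcripts, n_probes)] = True
    # Hybridization weights: non-zero only where S allows (assumed exponential draws).
    Lam = np.where(S, rng.exponential(1.0, S.shape), 0.0)
    # Latent transcript expression profiles, M x T (assumed exponential draws).
    Z = rng.exponential(1.0, (n_transcripts, n_conditions))
    # Additive noise, N x T (assumed Gaussian).
    V = rng.normal(0.0, noise_std, (n_probes, n_conditions))
    X = Lam @ Z + V
    return X, Lam, Z, S

X, Lam, Z, S = simulate_bilinear_data()
print(X.shape)  # N x T probe intensities
```

Each row of X is a weighted sum of the transcript profiles that the corresponding probe binds, plus noise, which is the cross-hybridization effect the model is meant to undo.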
Sparsity of the Λ matrix
• Force many of the weights λ_ij to 0
• Denote by S the set of weights which are non-zero: the prior becomes [equation not preserved] where [definition not preserved]
The probabilistic generative model for cross-hybridization (cont’d)
• The probabilistic model p(X,Z,Λ|S) for cross-hybridization is therefore [equation not preserved]
Variational inference
• To perform inference, minimize the KL-divergence with respect to a distribution q for the given probabilistic model p
• The optimum is the posterior distribution q(Z,Λ) = p(Z,Λ|X,S)
• Difficult to compute exactly!
• Use a surrogate which approximates the true posterior
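For reference, the KL-divergence being minimized can be written out in its standard form. This is the textbook definition applied to the symbols on this slide; the slides themselves do not show the exact objective used, so treat it as a sketch rather than the authors' expression.

```latex
\mathrm{KL}\big(q \,\|\, p\big)
  = \int q(Z,\Lambda)\,
    \log \frac{q(Z,\Lambda)}{p(Z,\Lambda \mid X, S)}
    \, dZ \, d\Lambda \;\ge\; 0,
\qquad
\mathrm{KL}\big(q \,\|\, p\big) = 0 \iff q(Z,\Lambda) = p(Z,\Lambda \mid X, S).

\int q(Z,\Lambda)\,
    \log \frac{q(Z,\Lambda)}{p(X, Z,\Lambda \mid S)}
    \, dZ \, d\Lambda
  = \mathrm{KL}\big(q \,\|\, p\big) - \log p(X \mid S).
```

The second line is why the joint model p(X,Z,Λ|S) is enough: log p(X|S) does not depend on q, so minimizing the left-hand side over a restricted family of q minimizes the KL-divergence to the posterior without ever computing the posterior.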
Variational EM for approximate inference and parameter estimation
• Use exponential distributions parameterized by variational parameters for q
• Minimize KL-divergence via variational EM (2),(3) to get the estimate β_jt of the transcript expression profiles
[Equations: variational E-step and variational M-step updates, not preserved]
(2) Neal, R. M. and Hinton, G. E. (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, Kluwer Academic Publishers, pp. 355-368.
(3) Jaakkola, T. and Jordan, M.I. (2000) Bayesian parameter estimation via variational methods. Statistics and Computing, 10:1, January 2000, pp. 25-37.
Variational Expectation-Maximization algorithm
[Diagram labels: Variational E-step, Variational M-step]
Results
• Agilent exon-tiling microarray data with 26,486 60-mer probes across 12 tissue pools
• Matched each probe to full-length RefSeq cDNAs via BLAST search to determine the sparsity structure S
• Resulting data set contains 9,904 probes matched to 2,905 mouse transcripts
Results (cont’d)
Significance testing of inferred expression profiles
• Randomly permute the rows of the S matrix and perform inference
• Mean SNR is significantly lower for permuted data than for unpermuted data
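The permutation step can be sketched as follows. The slides do not say how SNR is computed or how significance was assessed, so `permute_rows` and `empirical_p_value` are hypothetical helpers: the first shuffles which probe gets which row of S, and the second is a generic one-sided permutation p-value for any statistic produced by rerunning inference on the permuted structure.

```python
import numpy as np

def permute_rows(S, rng):
    """Return a copy of S with its rows shuffled, so probes get mismatched transcript sets."""
    return S[rng.permutation(S.shape[0])]

def empirical_p_value(observed, null_values):
    """Share of null statistics at least as large as the observed one (+1 correction)."""
    null_values = np.asarray(null_values, dtype=float)
    return (1 + np.sum(null_values >= observed)) / (1 + null_values.size)

rng = np.random.default_rng(0)
# Toy N x M sparsity mask (assumed boolean layout: rows are probes, columns are transcripts).
S = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]], dtype=bool)
S_perm = permute_rows(S, rng)
print(S_perm.astype(int))
```

In use, inference would be rerun once per permuted S, the statistic of interest (for example mean SNR) collected into `null_values`, and the value from the unpermuted S passed as `observed`.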
Gene Ontology-Biological Process (GO-BP) enrichment using denoised data
• Perform agglomerative hierarchical clustering and compute a hypergeometric p-value for each cluster to evaluate statistical significance of the clustering
• The majority of clusters have increased significance in denoised data compared to clustering using noisy data
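A hypergeometric enrichment p-value for one cluster and one GO-BP category can be computed as in the sketch below. The function name and argument names are assumptions for illustration, and the slides do not state how the test was parameterized; this is the usual upper-tail form.

```python
from math import comb

def hypergeom_enrichment_p(n_genes, n_annotated, cluster_size, n_hits):
    """P(at least n_hits annotated genes) when cluster_size genes are drawn
    without replacement from n_genes, of which n_annotated carry the annotation."""
    upper = min(n_annotated, cluster_size)
    total = comb(n_genes, cluster_size)
    tail = sum(
        comb(n_annotated, i) * comb(n_genes - n_annotated, cluster_size - i)
        for i in range(n_hits, upper + 1)
    )
    return tail / total

# Toy numbers: 10 genes, 5 annotated, a cluster of 5 that contains all 5 annotated genes.
p = hypergeom_enrichment_p(10, 5, 5, 5)
print(p)
```

A smaller p-value means the cluster contains more genes from the category than random sampling would explain, which is the sense in which clusters are compared between denoised and noisy data.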
Comparison to Robust Multi-array Analysis
• Unlike RMA, GenXHC models the explicit sparse structure of the set of probe-transcript interactions
• This increases statistical power when doing functional prediction
Summary
• Cross-hybridization compensation using prior knowledge about the transcript population doubles the number of probes on the array
• The problem of inferring latent transcript profiles is one of variational inference
• Functional annotation using denoised data yields functional categories with higher statistical significance than noisy expression data
• Taking into account the set of probe-transcript binding interactions generally yields greater statistical power than ignoring them