Clustering Short Gene Expression Profiles

Ling Wang, Marco Ramoni, Paolo Sebastiani
Abdullah Mueen

The Problem: Input
• Gene expression profiles for J genes from microarray experiments [1]

The Problem: Output
• A clustering of the genes that places functionally related genes in the same cluster.

Previous Works
• Hierarchical clustering (Eisen et al., 1998)
• K-means and self-organizing maps (Tamayo et al., 1999)
• Standard measures: Euclidean distance, correlation coefficient
• Problems:
  • They ignore the sequential nature of the profiles.
  • Different pairs of time series can have the same measure. [3]
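
The first problem can be made concrete with a small sketch (made-up numbers, NumPy assumed): applying the same reordering of time points to two profiles leaves both Euclidean distance and Pearson correlation unchanged, so neither measure uses the order of the measurements.

```python
import numpy as np

# Two made-up expression profiles over 5 time points (illustrative values only).
x = np.array([0.1, 0.5, 1.2, 0.9, 0.3])
y = np.array([0.0, 0.7, 1.0, 1.1, 0.2])

def euclidean(a, b):
    """Euclidean distance between two profiles."""
    return float(np.linalg.norm(a - b))

def pearson(a, b):
    """Pearson correlation coefficient between two profiles."""
    return float(np.corrcoef(a, b)[0, 1])

# Apply the same arbitrary reordering of time points to both profiles.
perm = np.array([3, 0, 4, 1, 2])
xs, ys = x[perm], y[perm]

# Both measures are blind to the reordering.
same_distance = bool(np.isclose(euclidean(x, y), euclidean(xs, ys)))
same_correlation = bool(np.isclose(pearson(x, y), pearson(xs, ys)))
print(same_distance, same_correlation)
```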

Previous Works
• Continuous representation of the profile using:
  • Autoregressive models
  • Hidden Markov models
• Advantages:
  • They account for temporal information.
  • They work well for long profiles (10 points or more).
  • They fit easily into Bayesian clustering. [3]

Autoregressive Model: Definition
• Each time point is correlated with the p previous time points.
• Combining the models of all the time points for a gene gives one regression model per gene, in which Xj is the regression matrix of size (n-p) x (p+1) and βj is the coefficient matrix. [2]
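
A minimal sketch of how such a regression matrix could be built for one profile. The function name `ar_design` and the data are hypothetical, and the extra column is assumed to be an intercept, which is one natural reading of the (n-p) x (p+1) size.

```python
import numpy as np

def ar_design(x, p):
    """Return (y, X) for an AR(p) regression on one profile x of length n.

    y holds x[p], ..., x[n-1]; each row of X is [1, x[t-1], ..., x[t-p]].
    The leading 1 is an assumed intercept column.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 0 < p < n:
        raise ValueError("need 0 < p < n")
    rows = [[1.0] + [x[t - k] for k in range(1, p + 1)] for t in range(p, n)]
    return x[p:], np.array(rows)

profile = [0.2, 0.8, 1.1, 0.7, 0.4]      # made-up 5-point profile
y, X = ar_design(profile, p=2)
print(X.shape)                           # (n - p, p + 1)

# Ordinary least-squares estimate of the coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With n = 5 and p = 2 the matrix is 3 x 3, so least squares has as many unknowns as equations, which hints at why short profiles leave little room for a large regression order.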

Autoregressive Model: Problems
• The AR model is for stationary time series. Intervals between time points are ignored.
• For short gene expression profiles (5 time points), the regression order cannot be large.
• For a large number of genes with short expression profiles, random patterns may occur, and the AR model overfits them.

The Algorithm
The algorithm has three components:
• A model describing the dynamics of gene expression temporal profiles.
• A probabilistic metric to score different clustering models based on the posterior probability of each clustering model.
• A heuristic to make the search for the best clustering model feasible.

Polynomial Model: Definition
• Each time point is approximated by a polynomial of degree p in time.
• The combined model for a gene stacks these per-time-point equations into a single linear model.
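
As an illustration, the stacked polynomial model can be written with a design matrix whose row i is [1, t_i, ..., t_i^p]. The sketch below is not taken from the slides: the name `poly_design`, the sampling times and the profile are made up, and ordinary least squares is used only to show the shape of the model.

```python
import numpy as np

def poly_design(times, p):
    """n x (p + 1) matrix whose row i is [1, t_i, t_i**2, ..., t_i**p]."""
    t = np.asarray(times, dtype=float)
    return np.vander(t, N=p + 1, increasing=True)

times = [0.0, 0.5, 1.5, 3.0, 6.0]          # made-up, unevenly spaced sampling times
x = np.array([0.1, 0.6, 1.0, 0.7, 0.2])    # made-up profile of one gene

F = poly_design(times, p=2)

# Least-squares coefficients and the fitted polynomial values at each time point.
beta_hat, *_ = np.linalg.lstsq(F, x, rcond=None)
fitted = F @ beta_hat
```

Because the actual sampling times enter the design matrix, unequal intervals between time points are part of the model rather than ignored.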

Polynomial Model: Assumptions
• The uncorrelated errors are normally distributed with mean 0 and variance 1/τj, where τj is the precision of gene j.
• The coefficients are normally distributed.
• β0, α1 and α2 are hyper-parameters of the prior distributions of the parameters.

Hyper-parameters
• Around 25-50% of the total number of genes/probes in the microarrays are disregarded because of their low confidence level.
• To avoid overfitting random patterns, the hyper-parameters are estimated from these disregarded genes, which serve as random data.
• If σ²a is the sample variance of the disregarded genes, the hyper-parameters are tied together through a relation that depends on σ²a.

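Scoring Method
• The scoring function is calculated from the marginal likelihood of each gene, that is, the likelihood of its profile with the model parameters integrated out against their prior.
• For the current model, the parameters integrated out are the polynomial coefficients βj and the precision τj.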
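
In symbols, the two statements above correspond to the generic definition of a marginal likelihood and its instance for this model. The notation (f for densities, θj for the parameters of gene j, xj for its profile) is assumed here for illustration and may differ from the original slides.

```latex
f(x_j) = \int f(x_j \mid \theta_j)\, f(\theta_j)\, d\theta_j,
\qquad
f(x_j) = \int\!\!\int f(x_j \mid \beta_j, \tau_j)\, f(\beta_j, \tau_j)\, d\beta_j\, d\tau_j .
```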

Marginal Likelihood
• Given the polynomial model, the assumed prior distributions and the hyper-parameters, the marginal likelihood function is computed.
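
One way such a computation can look is sketched below, under an assumption that is not stated in the slides: a conjugate normal-gamma prior, with the precision following a Gamma(shape a0, rate b0) distribution and the coefficients, given the precision, following a normal distribution with mean beta0 and precision matrix tau * R0. Under that prior the marginal likelihood has the standard closed form used here. The names `a0`, `b0` and `R0` are illustrative and need not match the parameterization behind α1 and α2.

```python
import math
import numpy as np

def log_marginal_likelihood(x, F, beta0, R0, a0, b0):
    """Log marginal likelihood of profile x under x = F @ beta + error.

    Assumed prior (not taken from the slides):
      tau ~ Gamma(shape=a0, rate=b0), beta | tau ~ N(beta0, inv(tau * R0)).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    Rn = R0 + F.T @ F                                   # updated precision matrix
    beta_n = np.linalg.solve(Rn, R0 @ beta0 + F.T @ x)  # updated coefficient mean
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (x @ x + beta0 @ R0 @ beta0 - beta_n @ Rn @ beta_n)
    _, logdet_R0 = np.linalg.slogdet(R0)
    _, logdet_Rn = np.linalg.slogdet(Rn)
    return (-0.5 * n * math.log(2 * math.pi)
            + 0.5 * (logdet_R0 - logdet_Rn)
            + a0 * math.log(b0) - a_n * math.log(b_n)
            + math.lgamma(a_n) - math.lgamma(a0))

# Made-up profile scored under a degree-2 polynomial model.
times = np.array([0.0, 0.5, 1.5, 3.0, 6.0])
F = np.vander(times, N=3, increasing=True)
x = np.array([0.1, 0.6, 1.0, 0.7, 0.2])
score = log_marginal_likelihood(x, F, beta0=np.zeros(3), R0=np.eye(3), a0=2.0, b0=1.0)
```

Working on the log scale avoids underflow when the likelihoods of many genes are later combined.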

Scoring the Model
• The weighted average of the marginal likelihood of each gene is the scoring function for a clustering model.
• The weight of each cluster varies with the size of the cluster.

Agglomerative Clustering
• The clustering phase starts with singleton clusters.
• It computes and
• It iteratively merges time series into clusters until the scoring function does not increase.
• When merging, it takes the average of the cluster representatives.

Heuristic Search
• Computing the scoring function for all the models is expensive, so a heuristic is adopted.
• Instead of scoring all possible merge pairs, it looks for a merge pair that increases the scoring function. Candidate pairs are examined in order of similarity, most similar first, as measured by Euclidean distance, dynamic time warping, etc.
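
The search can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: candidate pairs are tried closest-first by Euclidean distance between cluster averages, each cluster is represented by the mean of its member profiles, and `toy_score` is a hypothetical stand-in for the posterior-based scoring function.

```python
import itertools
import numpy as np

def heuristic_agglomerate(profiles, score):
    """Greedy merging guided by distance between cluster representatives.

    profiles: array of shape (J, n). score: function mapping a list of
    clusters (each a list of row indices) to a number; higher is better.
    """
    profiles = np.asarray(profiles, dtype=float)
    clusters = [[j] for j in range(len(profiles))]        # start with singletons
    best = score(clusters)
    while len(clusters) > 1:
        reps = [profiles[c].mean(axis=0) for c in clusters]   # cluster averages
        pairs = sorted(itertools.combinations(range(len(clusters)), 2),
                       key=lambda ab: np.linalg.norm(reps[ab[0]] - reps[ab[1]]))
        for a, b in pairs:                                # closest pair first
            merged = [c for i, c in enumerate(clusters) if i not in (a, b)]
            merged.append(clusters[a] + clusters[b])
            s = score(merged)
            if s > best:                                  # accept first improving merge
                clusters, best = merged, s
                break
        else:
            break                                         # no merge improves the score
    return clusters, best

def toy_score(clusters, data, penalty=1.0):
    # Hypothetical stand-in for the real score: negative within-cluster
    # sum of squares minus a fixed cost per cluster.
    sse = sum(((data[c] - data[c].mean(axis=0)) ** 2).sum() for c in clusters)
    return -sse - penalty * len(clusters)

# Four made-up profiles: two rising, two falling.
data = np.array([[0.0, 1.0, 2.0], [0.1, 1.1, 2.1],
                 [2.0, 1.0, 0.0], [2.1, 0.9, 0.1]])
clusters, best = heuristic_agglomerate(data, lambda cs: toy_score(cs, data))
```

In the best case only one candidate model is scored per step, instead of one per possible pair.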

Evaluation: Simulation

Evaluation: Real Data
• The gene expression profiles from [1] are used.
• Clusters are tested using a Gene Ontology enrichment test with EASE (Hosack et al. 2003).
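
To give a sense of what an enrichment test measures, the sketch below computes a plain one-sided hypergeometric tail probability for one annotation in one cluster. It is not the EASE score itself, which applies its own adjustment; the function name and all counts are made up for illustration.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(at least k annotated genes in a cluster of size n),
    when K of the N genes on the array carry the annotation."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

# Made-up counts: 1000 genes on the array, 40 carry the annotation,
# and a cluster of 50 genes contains 8 of them.
p = hypergeom_enrichment_p(N=1000, K=40, n=50, k=8)
```

A small p indicates that the annotation appears in the cluster more often than random sampling from the array would suggest.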

Conclusion
• Short gene expression profiles are modeled using polynomials.
• A clustering model is evaluated using the marginal likelihood of the genes with respect to the polynomial model.
• An agglomerative clustering is done with a heuristic search strategy.
• Output clusters are Gene Ontology enriched.

References
[1] K. Guillemin, N. R. Salama, L. S. Tompkins, and S. Falkow. Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection. Proc. Natl. Acad. Sci. USA, 99:15136–15141, 2002.
[2] M. Ramoni, P. Sebastiani, and I. S. Kohane. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA, 99(14):9121–9126, 2002.
[3] J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl. 1:i159–i168, 2005.
