
Clustering of Gene Expression Time Series with Conditional Random Fields

Clustering of Gene Expression Time Series with Conditional Random Fields. Yinyin Yuan and Chang-Tsun Li Computer Science Department. Microarray and Gene Expression. Microarray is a high throughput technique that can assay gene expression levels of a large number of genes in a tissue


Presentation Transcript


  1. Clustering of Gene Expression Time Series with Conditional Random Fields Yinyin Yuan and Chang-Tsun Li Computer Science Department

  2. Microarray and Gene Expression • A microarray is a high-throughput technique that can assay the expression levels of a large number of genes in a tissue. • The gene expression level is the relative amount of mRNA produced at a specific time point under certain experimental conditions. • Microarrays thus provide a means to decipher the logic of gene regulation by monitoring the expression of all genes in a tissue.

  3. Gene Expression • Gene expression data are obtained from microarrays and organized into a gene expression matrix, which is then analyzed with various methodologies for medical and biological purposes.

  4. Gene Expression Time Series • A sequence of gene expression levels measured at successive time points, at either uniform or uneven intervals. • Reveals more information than static data, because time series data have strong correlations between successive points. Time Series Clustering • Assumption: co-expression indicates co-regulation, so clustering identifies genes that share similar functions.

  5. Probabilistic models • A key challenge of gene expression time series research is the development of efficient and reliable probabilistic models that: • Allow measurement of uncertainty • Give an analytical measure of confidence in the clustering result • Indicate the significance of a data point • Reflect temporal dependencies between data points

  6. Goal • Identify highly informative genes • Cluster genes in the dataset • GO (Gene Ontology) analysis of biological function for each cluster.

  7. HMMs and CRFs • HMMs are trained to maximize the joint probability of a set of observed data and their corresponding labels. • Independence assumptions are needed for them to remain computationally tractable. • Representing long-range dependencies between genes, and gene interactions, is computationally infeasible.

  8. Conditional Random Fields • CRFs are undirected graphical models that define a probability distribution over label sequences, globally conditioned on a set of observed features. • X = {x_1, x_2, …, x_n}: variable over the observations. • Y = {y_1, y_2, …, y_n}: variable over the corresponding labels. • Observed data x_j and class labels y_j for all j in a voting pool N_i for sample x_i.

  9. CRFs Model • The CRFs model can be formulated as follows • The CRFs model can be expressed in a Gibbs form in terms of cost functions

  10. Cost function • The conditional random field model can also be expressed in a Gibbs form in terms of cost functions • Cost function
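
The formulas on slides 9 and 10 were not captured in this transcript. As a hedged reminder of what a Gibbs form generally looks like, using the notation of slide 8 (sample x_i, label y_i, voting pool N_i), a local conditional distribution can be written as below. The symbols U_i (the cost of assigning a label to sample i given its voting pool) and Z_i (the normalizing constant) are assumptions introduced here for illustration; the authors' exact cost function is not recoverable from the transcript.

```latex
p\bigl(y_i \mid x_i, \{x_j, y_j\}_{j \in N_i}\bigr)
  = \frac{1}{Z_i} \exp\bigl(-U_i(y_i)\bigr),
\qquad
Z_i = \sum_{y'} \exp\bigl(-U_i(y')\bigr)
```

In this generic form a lower cost U_i(y_i) means a higher conditional probability for the label y_i.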

  11. Potential function • Real-valued potential functions are obtained and used to form the cost function. • D: the estimated threshold dividing the set of Euclidean distances into intra-class and inter-class distances.

  12. Finding the optimal labels • We adopt deterministic label selection; the optimal label is determined by
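
The selection formula itself is missing from the transcript. If the model is written in a Gibbs form with a cost U_i(y) for giving label y to sample x_i (U_i is an assumed symbol, not taken from the slides), deterministic selection would amount to picking the most probable label, which is the same as picking the lowest-cost one. This is a sketch of that general idea, not the authors' exact expression:

```latex
\hat{y}_i
  = \arg\max_{y} \; p\bigl(y \mid x_i, \{x_j, y_j\}_{j \in N_i}\bigr)
  = \arg\min_{y} \; U_i(y)
```

The two forms agree because the exponential of the negated cost decreases as the cost grows, and the normalizing constant does not depend on the candidate label.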

  13. Pre-processing • Linear warping for data alignment. • τ-time-point data are transformed into a (τ-1)-dimensional feature space: the differences between consecutive time points, scaled inversely to the time intervals, are used as features because they reflect the temporal structure of the time series. • Voting pool: keeps the most similar sample, the most different sample, and k-2 randomly selected samples.
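
The feature transformation and the voting pool described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `slope_features` and `build_voting_pool` are hypothetical, Euclidean distance is assumed as the similarity measure (slide 11 mentions Euclidean distances), and the most similar and most different samples are found by searching all samples, which may differ from how the pool is maintained in the original method. Linear warping is not shown.

```python
import math
import random


def slope_features(values, times):
    """Map tau measurements to tau-1 features.

    Each feature is the difference between consecutive time points
    divided by the length of the time interval between them.
    """
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    return [
        (values[t + 1] - values[t]) / (times[t + 1] - times[t])
        for t in range(len(values) - 1)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def build_voting_pool(i, features, k, rng):
    """Return k sample indices forming the voting pool of sample i.

    Assumption: the pool holds the most similar sample, the most
    different sample, and k-2 samples drawn at random from the rest.
    """
    others = [j for j in range(len(features)) if j != i]
    if k < 2 or k > len(others):
        raise ValueError("k must be between 2 and the number of other samples")
    nearest = min(others, key=lambda j: euclidean(features[i], features[j]))
    farthest = max(
        (j for j in others if j != nearest),
        key=lambda j: euclidean(features[i], features[j]),
    )
    rest = [j for j in others if j not in (nearest, farthest)]
    return [nearest, farthest] + rng.sample(rest, k - 2)
```

For example, `slope_features([1.0, 3.0, 6.0], [0.0, 1.0, 3.0])` divides the rises 2 and 3 by the intervals 1 and 2, so unevenly spaced time points are put on a common scale.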

  14. Process • Initialization • Each sample is assigned a random label • Voting pools are formed randomly • Samples interact with each other progressively via their voting pools • Update labels • Update voting pools • Repeat until steady
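
The loop above might look roughly like the following sketch. Several things here are assumptions made only for illustration: the cost in `label_cost` is a simple stand-in (pool members sharing a label contribute their distance minus a threshold, in the spirit of the threshold D on slide 11), the voting pool is redrawn entirely at random on each pass rather than keeping the most similar and most different samples, and "steady" is taken to mean a full pass with no label change. None of the function or parameter names come from the slides.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def label_cost(i, label, pool, labels, features, threshold):
    """Stand-in cost of giving `label` to sample i.

    Pool members that carry `label` add (distance - threshold), so close
    members lower the cost and distant members raise it.
    """
    return sum(
        euclidean(features[i], features[j]) - threshold
        for j in pool
        if labels[j] == label
    )


def best_label(i, pool, labels, features, threshold):
    """Deterministically pick the lowest-cost label among the candidates."""
    candidates = sorted({labels[j] for j in pool} | {labels[i]})
    return min(
        candidates,
        key=lambda c: label_cost(i, c, pool, labels, features, threshold),
    )


def cluster(features, n_labels, k, threshold, max_iters=50, seed=0):
    """Random initialization followed by label updates until steady."""
    rng = random.Random(seed)
    n = len(features)
    labels = [rng.randrange(n_labels) for _ in range(n)]
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            pool = rng.sample(others, k)  # simplified: fully random pool
            new_label = best_label(i, pool, labels, features, threshold)
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
        if not changed:
            break
    return labels
```

Because a sample only ever adopts a label already carried by itself or a pool member, the number of distinct labels can shrink over the passes but never grow.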

  15. Experimental Validation • Both a biological dataset and a simulated dataset • Adjusted Rand index: similarity measure between two partitions • Yeast galactose dataset • Gene expression measurements of galactose utilization in Saccharomyces cerevisiae • Subset of measurements of 205 genes whose expression patterns reflect four functional categories in the Gene Ontology (GO) listings • 4 repeated measurements across 20 time points
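
For readers unfamiliar with the adjusted Rand index, the standard pair-counting definition can be computed as in the sketch below. The function name is chosen here for illustration, and returning 1.0 in the degenerate cases is a convention assumed in this sketch; this is not the evaluation code used for the presentation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of samples placed together in both partitions.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together in each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (together_both - expected) / (maximum - expected)
```

The index depends only on which samples are grouped together, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` treats the two labelings as identical partitions even though the label names differ.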

  16. Results for Yeast galactose dataset • Experimental results on the Yeast galactose dataset • The four functional categories of the Yeast galactose dataset • We obtained an average Rand index value of 0.943 over 10 experiments, higher than the 0.7 reported in Tjaden et al. 2006.

  17. Simulated Dataset • Data are generated for 400 genes across 20 time points from six artificial patterns that model periodic, up-regulated and down-regulated gene expression profiles. • High Gaussian noise is added. • Perfect partitions are obtained with 10 iterations.
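
A dataset of this shape could be generated along the lines of the sketch below. The six pattern functions, the noise level `noise_sd`, and the round-robin assignment of genes to patterns are all hypothetical choices made for illustration; the slides do not specify the actual patterns or the noise variance.

```python
import math
import random

# Six illustrative patterns over a normalized time axis in [0, 1]:
# three periodic, two up-regulated, one down-regulated (assumed shapes).
PATTERNS = [
    lambda p: math.sin(2 * math.pi * p),
    lambda p: math.cos(2 * math.pi * p),
    lambda p: math.sin(4 * math.pi * p),
    lambda p: p,
    lambda p: 1.0 / (1.0 + math.exp(-10.0 * (p - 0.5))),
    lambda p: 1.0 - p,
]


def simulate(n_genes=400, n_times=20, noise_sd=0.5, seed=0):
    """Return (data, truth): noisy profiles and the pattern index per gene."""
    rng = random.Random(seed)
    data, truth = [], []
    for g in range(n_genes):
        k = g % len(PATTERNS)  # assign genes to patterns in round-robin order
        profile = [
            PATTERNS[k](t / (n_times - 1)) + rng.gauss(0.0, noise_sd)
            for t in range(n_times)
        ]
        data.append(profile)
        truth.append(k)
    return data, truth
```

The returned `truth` list gives a reference partition against which a clustering result can be scored, for instance with the adjusted Rand index.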

  18. Conclusions • A novel unsupervised Conditional Random Fields model for efficient and accurate gene expression time series clustering • All data points are randomly initialized • The randomness of the voting pool facilitates global interactions

  19. Future work • Various similarity measures • Taking advantage of information from repeated measurements • Training and testing procedures
