Introduction to Time-Course Gene Expression Data STAT 675 R Guerra April 21, 2008
Outline • The Data • Clustering – nonparametric, model based • A case study • A new model
The Data • DNA Microarrays: collections of microscopic DNA spots, often representing single genes, attached to a solid surface
The Data • Gene expression changes over time due to environmental stimuli or changing needs of the cell • Measuring gene expression against time leads to time-course data sets
Time-Course Gene Expression • Each row represents a single gene • Each column represents a single time point • These data sets can be massive, with thousands of genes measured simultaneously
Time-Course Gene Expression • K-means clustering: "in the budding yeast Saccharomyces cerevisiae clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…" Eisen et al. (1998)
Clustering Expression Data • When these data sets first became available, it was common to cluster using non-parametric clustering techniques like K-Means and hierarchical clustering
Yeast Data Set • Spellman et al. (1998) measured mRNA levels in yeast (Saccharomyces cerevisiae) • 18 equally spaced time points • Of 6300 genes, nearly 800 were categorized as cell-cycle regulated • A subset of 433 genes with no missing values is a commonly used data set in papers detailing new time-course methods • Original and follow-up papers clustered genes using K-means and hierarchical clustering
Spellman et al. (1998) yeast cell-cycle data: rows = genes (row labels = cell-cycle phase), columns = time points (column labels = experiments)
Yeast Data Set (Spellman et al.): K-means and hierarchical clustering give different groupings. Which method gives the "right" result?
Non-Parametric Clustering • Data curves • Apply distance metric to get distance matrix • Cluster
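A minimal R sketch of this pipeline, assuming the expression data sit in a matrix with genes in rows and time points in columns (the object names here are hypothetical):

```r
# Toy expression matrix: 20 genes (rows) x 10 time points (columns)
set.seed(1)
expr <- matrix(rnorm(20 * 10), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("t", 1:10)))

# Apply a distance metric to the curves to get a gene-by-gene distance matrix
d <- dist(expr, method = "euclidean")

# Any standard clustering routine can then be run on d (see the following slides)
```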
Issues with Non-Parametric Clustering • Technical • Require the number of clusters to be chosen a priori • Do not take into account the time-ordering of the data • Hard to incorporate covariate data, e.g., gene ontology • Yeast analysis had the number of clusters chosen based on the number of cell-cycle groups … no statistical validation showing that these were the best clustering assignments
Model-Based Clustering • In response to limitations of nonparametric methods, model based methods proposed • Time series • Spline Methods • Hidden Markov Model • Bayesian Clustering Models • Little consensus over which method is “best” to cluster time course data
K-Means Clustering Relocation method: Number of clusters pre-determined and curves can change clusters at each iteration • Initially, data assigned at random to k clusters • Centroid is computed for each cluster • Data reassigned to cluster whose centroid is closest to it • Algorithm repeats until no further change in assignment of data to clusters • Hartigan rule used to select “optimal” #clusters
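As a rough illustration of this relocation scheme (not the code used in the original analyses), base R's kmeans() implements these steps:

```r
set.seed(2)
expr <- matrix(rnorm(100 * 18), nrow = 100)   # 100 genes, 18 time points

k <- 4                           # number of clusters fixed in advance
fit <- kmeans(expr, centers = k) # iterates assignment/centroid steps to convergence

fit$cluster                      # cluster assignment for each gene
fit$centers                      # centroid curve of each cluster
fit$tot.withinss                 # total within-cluster sum of squares
```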
K-means: Hartigan Rule • n curves, let k1 = k groups and k2 = k + 1 groups. • If E1 and E2 are the sums of the within-cluster sums of squares for k1 and k2 respectively, then add the extra group if: (E1 / E2 - 1)(n - k1 - 1) > 10
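A small helper written only as a sketch of the rule (the threshold of 10 is Hartigan's rule of thumb; here E1 and E2 are estimated with kmeans()):

```r
# Hartigan statistic for deciding whether to move from k to k+1 clusters
hartigan_stat <- function(expr, k, nstart = 25) {
  E1 <- kmeans(expr, centers = k,     nstart = nstart)$tot.withinss
  E2 <- kmeans(expr, centers = k + 1, nstart = nstart)$tot.withinss
  (E1 / E2 - 1) * (nrow(expr) - k - 1)
}

set.seed(3)
expr <- matrix(rnorm(100 * 18), nrow = 100)
hartigan_stat(expr, k = 2)   # add the extra group if this exceeds 10
```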
K-means: Distance Metric • Euclidean Distance • Pearson Correlation
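Both metrics are easy to compute in R; a sketch, treating each row of a hypothetical matrix expr as one gene's curve:

```r
set.seed(4)
expr <- matrix(rnorm(50 * 18), nrow = 50)   # 50 genes, 18 time points

# Euclidean distance between gene profiles
d_euc <- dist(expr, method = "euclidean")

# Pearson correlation converted to a distance: 1 - r, so perfectly
# correlated curves are at distance 0 and anti-correlated curves at 2
d_cor <- as.dist(1 - cor(t(expr)))
```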
K-means: Starting Chains • Initially, data are randomly assigned to k clusters, and this choice of initial cluster centers can affect the final clustering • The R implementation of K-means allows the number of initial starting chains to be specified; the run with the smallest total within-cluster sum of squares is returned as output
K-Means: Starting Chains • For j = 1 to B • Random assignment j into k clusters • wj = within-cluster sum of squares • End j • Pick the clustering with min(wj)
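In R this loop corresponds to the nstart argument of kmeans(); a manual version of the same idea (object names hypothetical):

```r
set.seed(5)
expr <- matrix(rnorm(100 * 18), nrow = 100)
k <- 4
B <- 25   # number of initial starting chains

# Run B random starts and keep the run with the smallest
# total within-cluster sum of squares
runs <- lapply(seq_len(B), function(j) kmeans(expr, centers = k))
best <- runs[[which.min(sapply(runs, `[[`, "tot.withinss"))]]

# Equivalent in practice to: kmeans(expr, centers = k, nstart = B)
best$tot.withinss
```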
Hierarchical Clustering • Hierarchical clustering builds clusters by successively joining (agglomerative) or splitting (divisive) them; the agglomerative version is described here • Initially each curve is assigned its own cluster • The two closest clusters are joined into one branch to create a clustering tree • Tree building stops when the algorithm terminates via a stopping rule
Hierarchical Clustering • Nearest neighbor: Distance between two clusters is the minimum of all distances between all pairs of curves, one from each cluster • Furthest neighbor: Distance between two clusters is the maximum of all distances between all pairs of curves, one from each cluster • Average linkage: Distance between two clusters is the average of all distances between all pairs of elements, one from each cluster
Hierarchical Clustering • Normally the algorithm stops at a pre-determined number of clusters or when the distance between two clusters reaches some pre-determined threshold • There is no universal rule of thumb for choosing an optimal number of clusters with this algorithm
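A short R sketch of the three linkage rules and the two stopping conventions (hclust() calls nearest neighbor "single" and furthest neighbor "complete"):

```r
set.seed(6)
expr <- matrix(rnorm(50 * 18), nrow = 50)
d <- dist(expr)

hc_single   <- hclust(d, method = "single")    # nearest neighbor
hc_complete <- hclust(d, method = "complete")  # furthest neighbor
hc_average  <- hclust(d, method = "average")   # average linkage

# Stop at a pre-determined number of clusters ...
cutree(hc_average, k = 4)
# ... or at a pre-determined distance threshold
cutree(hc_average, h = 5)
```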
Model-Based Clustering • Many use mixture models, with splines or piecewise polynomial functions to approximate the curves • Can better incorporate covariate information
Models using Splines • Time-course profiles assumed to be observations from some underlying smooth expression curve • Each data curve is represented as the sum of: • Smooth population mean spline (dependent on time and cluster assignment) • Spline function representing individual (gene) effects • Gaussian measurement noise
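To make the decomposition concrete, a hedged simulation sketch (the curves and standard deviations here are invented for illustration, not taken from any of the cited models):

```r
set.seed(7)
t <- seq(0, 1, length.out = 18)            # 18 equally spaced time points

mu <- function(t) sin(2 * pi * t)          # smooth population mean curve for one cluster
gene_effect <- 0.3 * cos(2 * pi * t)       # individual (gene) departure from the mean
noise <- rnorm(length(t), sd = 0.2)        # Gaussian measurement noise

y <- mu(t) + gene_effect + noise           # one observed gene profile

# A smoothing spline recovers an estimate of the underlying smooth curve
fit <- smooth.spline(t, y)
```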
Model-based clustering and data transformations for gene expression data (2001) Yeung et al., Bioinformatics, 17:977-987. MCLUST software
Validation Methods • Models with different numbers of clusters can be compared with the Bayesian Information Criterion (BIC): BIC(C) = 2 log L(C) - m log(n), where L(C) is the maximized log-likelihood for the model with C clusters, m is the number of independent parameters to be estimated, and n is the number of genes • Strikes a balance between goodness-of-fit and model complexity • The non-model-based methods have no such validation method
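A minimal sketch with the mclust R package (the MCLUST software cited above), which fits Gaussian mixtures over a range of cluster numbers and selects the one with the best BIC:

```r
library(mclust)

set.seed(8)
expr <- matrix(rnorm(100 * 18), nrow = 100)

fit <- Mclust(expr, G = 1:9)   # try 1 to 9 clusters

fit$G               # number of clusters selected by BIC
fit$BIC             # BIC values across models and cluster numbers
fit$classification  # cluster assignment for each gene
```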
Comparison of Methods • Ma et al. (2006) • Smoothing Spline Clustering (SSClust) • Simulation study • SSClust better than MCLUST & nonparametric methods • Comparison: misclassification rates
Functional Form of the Ma et al. (2006) Simulation Cluster Centers
MR and OSR • Misclassification Rate (MR) • Overall Success Rate (OSR) • For the OSR, the MR is computed only over the cases in which the correct number of clusters is found
Comparison of Methods • Results from the Ma et al. (2006) paper.
SSClust Methods Paper • Concluded that SSClust was the superior clustering method • Looking at the data, the differences in scale between the four true curves are large • Typical time-course clusters differ in location and spread but not in scale to this extreme • Their conclusions are based on a data set that is not representative of the type of data this clustering method would be used for
Alternative Simulation: Functional Forms for Five Cluster Centers
Example of SSClust Breaking Down: linear curves joined while sine curves arbitrarily split into 2 clusters
Simulation Configuration • Distance Metric • Euclidean or Pearson • # of Curves • Small (100), Large (3000) • # Resolution of Time Points • 13 or 25 time points • evenly spaced or unevenly spaced • Types of underlying Curves • Small (4) – Large (8)
Simulation Configuration • Distribution of curves across clusters • Equally distributed versus unequally distributed • Noise Level • Small (< 0.5*SD of the data set) • Large (> 0.5*SD of the data set) • For these cases, found the misclassification rates and the percent of times that the correct number of clusters was found
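A hedged sketch of one such configuration (four invented cluster-centre curves, 13 evenly spaced time points, equal cluster sizes, "small" noise); the functional forms are placeholders, not those of the actual simulation study:

```r
set.seed(9)
tp <- seq(0, 1, length.out = 13)                      # 13 evenly spaced time points

centres <- list(function(x) sin(2 * pi * x),          # four cluster-centre curves
                function(x) cos(2 * pi * x),
                function(x) 2 * x - 1,
                function(x) rep(0, length(x)))

labels <- rep(1:4, each = 25)                         # 100 curves, equally distributed
clean  <- t(sapply(labels, function(c) centres[[c]](tp)))

sigma <- 0.4 * sd(clean)                              # "small" noise: < 0.5 * SD of the data
expr  <- clean + matrix(rnorm(length(clean), sd = sigma), nrow = nrow(clean))
```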
Conclusions from Simulations • MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters • Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points or the distribution of curves across clusters
Comparison based on Real Data • Applied these same clustering techniques to real data • Different numbers of clusters found for different methods for each real data set
Simulations Based on Real Data • Start with real data, like the yeast data set • Cluster the data using a given clustering method • Perturb the original data (add noise at each point) • Evaluate how different the new clustering is from the original clustering • Use MR and OSR
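A hedged sketch of this procedure using K-means as the clustering method and the adjusted Rand index as a simple measure of agreement (the slides use MR and OSR; the noise level here is arbitrary):

```r
library(mclust)   # for adjustedRandIndex()

set.seed(10)
expr <- matrix(rnorm(100 * 18), nrow = 100)   # stand-in for a real data set
k <- 4

orig <- kmeans(expr, centers = k, nstart = 25)$cluster

# Perturb the original data by adding noise at each point, then recluster
perturbed <- expr + matrix(rnorm(length(expr), sd = 0.3 * sd(expr)),
                           nrow = nrow(expr))
new_cl <- kmeans(perturbed, centers = k, nstart = 25)$cluster

# Cluster labels are arbitrary, so compare label-invariant agreement
table(orig, new_cl)
adjustedRandIndex(orig, new_cl)
```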