Introduction to Time-Course Gene Expression Data STAT 675 R Guerra April 21, 2008
Outline • The Data • Clustering – nonparametric, model based • A case study • A new model
The Data • DNA Microarrays: collections of microscopic DNA spots, often representing single genes, attached to a solid surface
The Data • Gene expression changes over time due to environmental stimuli or changing needs of the cell • Measuring gene expression against time leads to time-course data sets
Time-Course Gene Expression • Each row represents a single gene • Each column represents a single time point • These data sets can be massive, with thousands of genes measured simultaneously
Time-Course Gene Expression • K-means clustering: "in the budding yeast Saccharomyces cerevisiae clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…" Eisen et al. (1998)
Clustering Expression Data • When these data sets first became available, it was common to cluster using non-parametric clustering techniques like K-Means and hierarchical clustering
Yeast Data Set • Spellman et al. (1998) measured mRNA levels in yeast (Saccharomyces cerevisiae) • 18 equally spaced time points • Of 6300 genes, nearly 800 were categorized as cell-cycle regulated • A subset of 433 genes with no missing values is a commonly used data set in papers detailing new time-course methods • Original and follow-up papers clustered genes using K-means and hierarchical clustering
Spellman et al. (1998) yeast cell-cycle data: rows = genes (row labels = cell-cycle phase), columns = time points (column labels = experiments)
Yeast Data Set (Spellman et al.): K-means and hierarchical clustering give different groupings. Which method gives the "right" result?
Non-Parametric Clustering • Data curves • Apply distance metric to get distance matrix • Cluster
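A minimal R sketch of this pipeline, assuming the expression data sit in a matrix with genes in rows and time points in columns (the object names here are hypothetical):

```r
# Toy expression matrix: 20 genes (rows) x 10 time points (columns)
set.seed(1)
expr <- matrix(rnorm(20 * 10), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("t", 1:10)))

# Apply a distance metric to the curves to get a gene-by-gene distance matrix
d <- dist(expr, method = "euclidean")

# Any standard clustering routine can then be run on d (see the following slides)
```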
Issues with Non-Parametric Clustering • Technical • Require the number of clusters to be chosen a priori • Do not take into account the time-ordering of the data • Hard to incorporate covariate data, e.g., gene ontology • Yeast analysis had the number of clusters chosen based on the number of cell-cycle groups … no statistical validation showing that these were the best clustering assignments
Model-Based Clustering • In response to limitations of nonparametric methods, model based methods proposed • Time series • Spline Methods • Hidden Markov Model • Bayesian Clustering Models • Little consensus over which method is “best” to cluster time course data
K-Means Clustering Relocation method: Number of clusters pre-determined and curves can change clusters at each iteration • Initially, data assigned at random to k clusters • Centroid is computed for each cluster • Data reassigned to cluster whose centroid is closest to it • Algorithm repeats until no further change in assignment of data to clusters • Hartigan rule used to select “optimal” #clusters
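As a rough illustration of this relocation scheme (not the code used in the original analyses), base R's kmeans() implements these steps:

```r
set.seed(2)
expr <- matrix(rnorm(100 * 18), nrow = 100)   # 100 genes, 18 time points

k <- 4                           # number of clusters fixed in advance
fit <- kmeans(expr, centers = k) # iterates assignment/centroid steps to convergence

fit$cluster                      # cluster assignment for each gene
fit$centers                      # centroid curve of each cluster
fit$tot.withinss                 # total within-cluster sum of squares
```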
K-means: Hartigan Rule • n curves, let k1 = k groups and k2 = k + 1 groups. • If E1 and E2 are the sums of the within-cluster sums of squares for k1 and k2 respectively, then add the extra group if: (E1 / E2 - 1)(n - k1 - 1) > 10
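A small helper written only as a sketch of the rule (the threshold of 10 is Hartigan's rule of thumb; here E1 and E2 are estimated with kmeans()):

```r
# Hartigan statistic for deciding whether to move from k to k+1 clusters
hartigan_stat <- function(expr, k, nstart = 25) {
  E1 <- kmeans(expr, centers = k,     nstart = nstart)$tot.withinss
  E2 <- kmeans(expr, centers = k + 1, nstart = nstart)$tot.withinss
  (E1 / E2 - 1) * (nrow(expr) - k - 1)
}

set.seed(3)
expr <- matrix(rnorm(100 * 18), nrow = 100)
hartigan_stat(expr, k = 2)   # add the extra group if this exceeds 10
```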
K-means: Distance Metric • Euclidean Distance • Pearson Correlation
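Both metrics are easy to compute in R; a sketch, treating each row of a hypothetical matrix expr as one gene's curve:

```r
set.seed(4)
expr <- matrix(rnorm(50 * 18), nrow = 50)   # 50 genes, 18 time points

# Euclidean distance between gene profiles
d_euc <- dist(expr, method = "euclidean")

# Pearson correlation converted to a distance: 1 - r, so perfectly
# correlated curves are at distance 0 and anti-correlated curves at 2
d_cor <- as.dist(1 - cor(t(expr)))
```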
K-means: Starting Chains • Initially, data are randomly assigned to k clusters, and this choice of initial cluster centers can affect the final clustering • The R implementation of K-means allows the number of initial starting chains to be specified; the run with the smallest total within-cluster sum of squares is returned as output
K-Means: Starting Chains • For j = 1 to B • Random assignment j into k clusters • wj = within-cluster sum of squares • End j • Pick the clustering with min(wj)
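In R this loop corresponds to the nstart argument of kmeans(); a manual version of the same idea (object names hypothetical):

```r
set.seed(5)
expr <- matrix(rnorm(100 * 18), nrow = 100)
k <- 4
B <- 25   # number of initial starting chains

# Run B random starts and keep the run with the smallest
# total within-cluster sum of squares
runs <- lapply(seq_len(B), function(j) kmeans(expr, centers = k))
best <- runs[[which.min(sapply(runs, `[[`, "tot.withinss"))]]

# Equivalent in practice to: kmeans(expr, centers = k, nstart = B)
best$tot.withinss
```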
Hierarchical Clustering • Hierarchical clustering builds clusters by successively joining (agglomerative) or splitting (divisive) them; the agglomerative version is described here • Initially each curve is assigned its own cluster • The two closest clusters are joined into one branch to create a clustering tree • Tree building stops when the algorithm terminates via a stopping rule
Hierarchical Clustering • Nearest neighbor: Distance between two clusters is the minimum of all distances between all pairs of curves, one from each cluster • Furthest neighbor: Distance between two clusters is the maximum of all distances between all pairs of curves, one from each cluster • Average linkage: Distance between two clusters is the average of all distances between all pairs of elements, one from each cluster
Hierarchical Clustering • Normally the algorithm stops at a pre-determined number of clusters or when the distance between two clusters reaches some pre-determined threshold • There is no universal rule of thumb for choosing an optimal number of clusters with this algorithm
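A short R sketch of the three linkage rules and the two stopping conventions (hclust() calls nearest neighbor "single" and furthest neighbor "complete"):

```r
set.seed(6)
expr <- matrix(rnorm(50 * 18), nrow = 50)
d <- dist(expr)

hc_single   <- hclust(d, method = "single")    # nearest neighbor
hc_complete <- hclust(d, method = "complete")  # furthest neighbor
hc_average  <- hclust(d, method = "average")   # average linkage

# Stop at a pre-determined number of clusters ...
cutree(hc_average, k = 4)
# ... or at a pre-determined distance threshold
cutree(hc_average, h = 5)
```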
Model-Based Clustering • Many use mixture models, with splines or piecewise polynomial functions to approximate the curves • Can better incorporate covariate information
Models using Splines • Time-course profiles assumed to be observations from some underlying smooth expression curve • Each data curve is represented as the sum of: • Smooth population mean spline (dependent on time and cluster assignment) • Spline function representing individual (gene) effects • Gaussian measurement noise
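To make the decomposition concrete, a hedged simulation sketch (the curves and standard deviations here are invented for illustration, not taken from any of the cited models):

```r
set.seed(7)
t <- seq(0, 1, length.out = 18)            # 18 equally spaced time points

mu <- function(t) sin(2 * pi * t)          # smooth population mean curve for one cluster
gene_effect <- 0.3 * cos(2 * pi * t)       # individual (gene) departure from the mean
noise <- rnorm(length(t), sd = 0.2)        # Gaussian measurement noise

y <- mu(t) + gene_effect + noise           # one observed gene profile

# A smoothing spline recovers an estimate of the underlying smooth curve
fit <- smooth.spline(t, y)
```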
Model-based clustering and data transformations for gene expression data (2001) Yeung et al., Bioinformatics, 17:977-987. MCLUST software
Validation Methods • Models with different numbers of clusters can be compared with the Bayesian Information Criterion (BIC): BIC(C) = 2 log L(C) - m log(n), where L(C) is the maximized log-likelihood for the model with C clusters, m is the number of independent parameters to be estimated, and n is the number of genes • Strikes a balance between goodness-of-fit and model complexity • The non-model-based methods have no such validation method
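A minimal sketch with the mclust R package (the MCLUST software cited above), which fits Gaussian mixtures over a range of cluster numbers and selects the one with the best BIC:

```r
library(mclust)

set.seed(8)
expr <- matrix(rnorm(100 * 18), nrow = 100)

fit <- Mclust(expr, G = 1:9)   # try 1 to 9 clusters

fit$G               # number of clusters selected by BIC
fit$BIC             # BIC values across models and cluster numbers
fit$classification  # cluster assignment for each gene
```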
Comparison of Methods • Ma et al. (2006) • Smoothing Spline Clustering (SSClust) • Simulation study • SSClust better than MCLUST & nonparametric methods • Comparison: misclassification rates
Functional Form of the Ma et al. (2006) Simulation Cluster Centers
MR and OSR • Misclassification Rate (MR) • Overall Success Rate (OSR) • For the OSR, the MR is computed only over the cases in which the correct number of clusters is found
Comparison of Methods • Results from the Ma et al. (2006) paper.
SSClust Methods Paper • Concluded that SSClust was the superior clustering method • Looking at the data, the differences in scale between the four true curves are large • Typical time-course clusters differ in location and spread but not in scale to this extreme • Their conclusions are based on a data set that is not representative of the type of data this clustering method would be used for
Alternative Simulation: Functional Forms for Five Cluster Centers
Example of SSClust Breaking Down: linear curves joined while sine curves arbitrarily split into 2 clusters
Simulation Configuration • Distance Metric • Euclidean or Pearson • # of Curves • Small (100), Large (3000) • # Resolution of Time Points • 13 or 25 time points • evenly spaced or unevenly spaced • Types of underlying Curves • Small (4) – Large (8)
Simulation Configuration • Distribution of curves across clusters • Equally distributed versus unequally distributed • Noise Level • Small (< 0.5*SD of the data set) • Large (> 0.5*SD of the data set) • For these cases, found the misclassification rates and the percent of times that the correct number of clusters was found
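A hedged sketch of one such configuration (four invented cluster-centre curves, 13 evenly spaced time points, equal cluster sizes, "small" noise); the functional forms are placeholders, not those of the actual simulation study:

```r
set.seed(9)
tp <- seq(0, 1, length.out = 13)                      # 13 evenly spaced time points

centres <- list(function(x) sin(2 * pi * x),          # four cluster-centre curves
                function(x) cos(2 * pi * x),
                function(x) 2 * x - 1,
                function(x) rep(0, length(x)))

labels <- rep(1:4, each = 25)                         # 100 curves, equally distributed
clean  <- t(sapply(labels, function(c) centres[[c]](tp)))

sigma <- 0.4 * sd(clean)                              # "small" noise: < 0.5 * SD of the data
expr  <- clean + matrix(rnorm(length(clean), sd = sigma), nrow = nrow(clean))
```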
Conclusions from Simulations • MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters • Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points or the distribution of curves across clusters
Comparison based on Real Data • Applied these same clustering techniques to real data • Different numbers of clusters found for different methods for each real data set
Simulations Based on Real Data • Start with real data, like the yeast data set • Cluster the data using a given clustering method • Perturb the original data (add noise at each point) • Evaluate how different the new clustering is from the original clustering • Use MR and OSR
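A hedged sketch of this procedure using K-means as the clustering method and the adjusted Rand index as a simple measure of agreement (the slides use MR and OSR; the noise level here is arbitrary):

```r
library(mclust)   # for adjustedRandIndex()

set.seed(10)
expr <- matrix(rnorm(100 * 18), nrow = 100)   # stand-in for a real data set
k <- 4

orig <- kmeans(expr, centers = k, nstart = 25)$cluster

# Perturb the original data by adding noise at each point, then recluster
perturbed <- expr + matrix(rnorm(length(expr), sd = 0.3 * sd(expr)),
                           nrow = nrow(expr))
new_cl <- kmeans(perturbed, centers = k, nstart = 25)$cluster

# Cluster labels are arbitrary, so compare label-invariant agreement
table(orig, new_cl)
adjustedRandIndex(orig, new_cl)
```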