Statistical methods for analyzing DNA microarray data Steve Horvath Human Genetics and Biostatistics UCLA
Contents • Data normalization • dChip • Gene filtering • SAM • Gene clustering • k-means clustering • hierarchical clustering • Prediction methods • k-nearest neighbor predictors • linear model predictors
dChip: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Cheng Li and Wing Hung Wong, Proc. Natl. Acad. Sci. USA, Vol. 98, Issue 1, 31-36, January 2, 2001
Abstract • Insight: Probe-specific biases are highly reproducible. • Their adverse effect can be reduced by proper modeling and analysis methods. • Li and Wong propose a statistical model for the probe-level data, and develop model-based estimates for gene expression indexes. • They also present model-based methods for identifying and handling cross-hybridizing probes and contaminated array regions.
Oligonucleotide expression arrays • 14 to 20 probe pairs are used to interrogate each gene; each probe pair has a Perfect Match (PM) and Mismatch (MM) signal, and the average of the PM-MM differences for all probe pairs in a probe set (called the "average difference") is used as an expression index for the target gene. • Researchers rely on the average differences as the starting point for "high-level analysis". • Li and Wong focus on "low-level" analysis issues such as feature extraction, normalization, and computation of expression indexes.
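As a concrete illustration of the average-difference index described above, here is a minimal Python sketch (not the Affymetrix/dChip implementation); the arrays `pm` and `mm` and the example values are hypothetical.

```python
import numpy as np

def average_difference(pm, mm):
    """Average of the PM-MM differences over all probe pairs in one probe set."""
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    return np.mean(pm - mm)

# Example: a hypothetical probe set with 5 probe pairs on one array
pm = [210.0, 340.0, 180.0, 400.0, 290.0]
mm = [120.0, 150.0, 100.0, 220.0, 160.0]
print(average_difference(pm, mm))  # expression index for this gene on this array
```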
Results of real data analyses • GOOD: There are considerable differences in the expression levels of many genes: the between-array variation in PM-MM differences is substantial. • BAD: For the majority of probe sets, the mean square variation due to probes is several times that due to arrays (ANOVA). • Message: The proper treatment of probe effects is an essential component of any approach to the analysis of expression array data.
Graphics from dChip • Black curves are the PM-MM difference data of gene A in the first six arrays. Light curves are the fitted values to model 2.
Linear model fit to probe differences • For each gene, a simple multiplicative model is fit to the PM-MM differences: $PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$, where $i$ indexes samples (arrays) and $j$ indexes probe pairs; $\theta_i$ is the expression index of the gene in sample $i$ and $\phi_j$ is the sensitivity of probe pair $j$. • Given the probe parameters $\phi_j$, one can compute a standard error for the expression index $\theta_i$. • Given $\theta_i$, standard errors for $\phi_j$ can be computed -> detection of outlying probe pairs.
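For illustration, a minimal sketch of fitting this multiplicative model by alternating least squares is shown below. It is not the dChip implementation (which adds outlier handling and standard errors); the function name `fit_li_wong`, the simulated data, the iteration count, and the scale constraint used here are assumptions for the example.

```python
import numpy as np

def fit_li_wong(y, n_iter=50):
    """Fit y_ij = theta_i * phi_j by alternating least squares.
    y: (n_arrays x n_probe_pairs) matrix of PM-MM differences for one probe set."""
    n_arrays, n_probes = y.shape
    phi = np.ones(n_probes)                          # probe sensitivities
    for _ in range(n_iter):
        theta = y @ phi / np.sum(phi ** 2)           # expression indexes given phi
        phi = theta @ y / np.sum(theta ** 2)         # probe effects given theta
        phi *= np.sqrt(n_probes / np.sum(phi ** 2))  # fix the scale (it is otherwise arbitrary)
    theta = y @ phi / np.sum(phi ** 2)               # final theta for the scaled phi
    return theta, phi

# Simulated example: 6 arrays, 14 probe pairs
rng = np.random.default_rng(0)
true_theta = rng.uniform(1, 10, size=6)
true_phi = rng.uniform(0.5, 2.0, size=14)
y = np.outer(true_theta, true_phi) + rng.normal(0, 0.2, size=(6, 14))
theta_hat, phi_hat = fit_li_wong(y)
```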
(A) A long scratch contamination (indicated by arrow) is alleviated by automatic outlier exclusion
Conclusions for dChip • Li and Wong propose a statistical model for oligonucleotide expression array data at the probe level. • The model allows one to account for individual probe-specific effects, and automatic detection of outliers and image artifacts. • In a follow-up paper, they discuss the computation of standard errors (SE) for expression indexes and confidence intervals (CI) for fold changes.
SAM: Significance analysis of microarrays applied to the ionizing radiation response. Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu, Proc. Natl. Acad. Sci. USA, Vol. 98, Issue 9, 5116-5121, April 24, 2001
Abstract • Method for gene filtering: find genes that change significantly across samples. • Significance Analysis of Microarrays (SAM) assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. • For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR).
Introduction • Suitable for oligo, cDNA, protein arrays • Does not normalize the data! • Challenge: methods based on conventional t tests provide the probability (P) that a difference in gene expression occurred by chance. For an array with 10000 genes, a significance level of alpha = 0.01 would identify 100 genes by chance. • Solution based on SAM: • assimilate a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. • Genes with scores greater than a threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements. • The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. To demonstrate its utility, SAM was used to analyze a biologically important problem: the transcriptional response of lymphoblastoid cells to ionizing radiation (IR).
Data description • RNA was harvested from wild-type human lymphoblastoid cell lines, designated 1 and 2, • growing in an unirradiated state (U) or in an irradiated state (I). • RNA samples were labeled and divided into two identical aliquots for independent hybridizations, A and B. Thus, data for 6,800 genes on the microarray were generated from eight hybridizations (U1A, U1B, U2A, U2B, I1A, I1B, I2A, and I2B).
Method For the i-th gene, define the "relative difference" d(i) in gene expression as a statistic based on the ratio of the change in gene expression to the standard deviation:
$d(i) = \dfrac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$,
where $\bar{x}_I(i)$ and $\bar{x}_U(i)$ are the average levels of expression for gene i in states I and U, respectively. The "gene-specific scatter" s(i) is the standard deviation of repeated expression measurements:
$s(i) = \sqrt{a \left[ \sum_m \left(x_m(i) - \bar{x}_I(i)\right)^2 + \sum_n \left(x_n(i) - \bar{x}_U(i)\right)^2 \right]}$,
where $\sum_m$ and $\sum_n$ are summations over the expression measurements in states I and U, respectively, $a = (1/n_1 + 1/n_2)/(n_1 + n_2 - 2)$, and $n_1$ and $n_2$ are the numbers of measurements in states I and U (four in this experiment).
Why does one need s0 in d(i)? • To compare values of d(i) across all genes, the distribution of d(i) should be independent of the level of gene expression. • At low expression levels, variance in d(i) can be high because of small values of s(i). • To ensure that the variance of d(i) is independent of gene expression, a small positive constant s0 is added to the denominator. • The coefficient of variation of d(i) is computed as a function of s(i) in moving windows across the data. The value for s0 was chosen to minimize the coefficient of variation.
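Putting the two previous slides together, here is a minimal Python sketch of d(i) and s(i). The matrix layout (genes x samples), the column indices, and the percentile-based fallback for s0 are assumptions made for the example; the paper chooses s0 to minimize the coefficient of variation of d(i).

```python
import numpy as np

def sam_d(x, idx_I, idx_U, s0=None):
    """Relative difference d(i) and gene-specific scatter s(i).
    x: (n_genes x n_samples) expression matrix; idx_I/idx_U: columns in states I and U."""
    xI, xU = x[:, idx_I], x[:, idx_U]
    n1, n2 = xI.shape[1], xU.shape[1]
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    ss = np.sum((xI - xI.mean(axis=1, keepdims=True)) ** 2, axis=1) + \
         np.sum((xU - xU.mean(axis=1, keepdims=True)) ** 2, axis=1)
    s = np.sqrt(a * ss)                      # gene-specific scatter s(i)
    if s0 is None:
        s0 = np.percentile(s, 5)             # simple surrogate for the paper's choice of s0
    d = (xI.mean(axis=1) - xU.mean(axis=1)) / (s + s0)
    return d, s

# Example with 8 hybridizations: columns 4-7 irradiated (I), columns 0-3 unirradiated (U)
rng = np.random.default_rng(1)
x = rng.normal(size=(6800, 8))               # simulated data, for illustration only
d, s = sam_d(x, idx_I=[4, 5, 6, 7], idx_U=[0, 1, 2, 3])
```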
Scatter plots of relative difference in gene expression d(i) vs. gene-specific scatter s(i). (A) Relative difference between irradiated and unirradiated states. (B) Relative difference between cell lines 1 and 2: often bigger than A! (C) Relative difference between hybridizations A and B. (D) Relative difference for a permutation of the data that was balanced between cell lines 1 and 2.
Computing the distribution • Genes were ranked by magnitude of their d(i) values, so that d(1) was the largest relative difference. • For each of the 36 balanced permutations, relative differences $d_p(i)$ were also calculated, and the genes were again ranked such that $d_p(i)$ was the i-th largest relative difference for permutation p. The expected relative difference $d_E(i)$ was defined as the average over the 36 balanced permutations: $d_E(i) = \sum_p d_p(i)/36$. • A scatter plot of the observed relative difference d(i) vs. the expected relative difference $d_E(i)$ was then used.
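A minimal sketch of the expected relative difference, reusing the `sam_d` helper from the earlier sketch. For brevity it averages over all 4-vs-4 splits of the eight hybridizations rather than only the 36 balanced permutations used in the paper, so it is an approximation of the procedure described above.

```python
import numpy as np
from itertools import combinations

def expected_relative_difference(x, s0):
    """Average of the ranked permutation statistics d_p(i) (uses sam_d from the earlier sketch)."""
    n_cols = x.shape[1]
    ranked = []
    for idx_I in combinations(range(n_cols), n_cols // 2):
        idx_U = [j for j in range(n_cols) if j not in idx_I]
        dp, _ = sam_d(x, list(idx_I), idx_U, s0=s0)   # see earlier sketch
        ranked.append(np.sort(dp))                    # rank the d_p(i) values
    return np.mean(ranked, axis=0)                    # average over permutations

# The observed d(i), ranked the same way, is then plotted against d_E(i); genes whose
# observed d(i) deviates from d_E(i) by more than a chosen threshold are called significant.
```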
Identification of genes with significant change • (A) Scatter plot of the observed relative difference d(i) versus the expected relative difference dE(i). The dotted lines are drawn at distance = 1.2 from the solid line. • (B) Scatter plot of d(i) vs. s(i). (C) Cube root scatter plot of average gene expression in induced and uninduced cells.
Conclusion SAM • SAM is a method for identifying genes on a microarray with statistically significant changes in expression. • SAM provides an estimate of the FDR for each value of the tuning parameter. The estimated FDR is computed from permutations of the data. • SAM can be generalized to other types of experiments and outcomes by redefining d(i) • http://www-stat-class.stanford.edu/SAM/SAMServlet.
2 references for clustering • T. Hastie, R. Tibshirani, J. Friedman (2002). The Elements of Statistical Learning. Springer Series in Statistics. • L. Kaufman, P. Rousseeuw (1990). Finding Groups in Data. Wiley Series in Probability.
Introduction to clustering Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.
Proximity matrices are the input to most clustering algorithms Proximity between pairs of objects: similarity or dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality: $d_{ii'} \le d_{ik} + d_{ki'}$ for all objects $i, i', k$.
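For example, when clustering genes from an expression matrix, one common choice (illustrative here, not prescribed by the slides) is correlation as the similarity and 1 - correlation as the dissimilarity; a minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical expression matrix: 50 genes x 8 samples
x = np.random.default_rng(2).normal(size=(50, 8))
similarity = np.corrcoef(x)           # gene-by-gene correlation matrix (a similarity)
dissimilarity = 1.0 - similarity      # monotone-decreasing conversion; symmetric, zero diagonal
# Note: 1 - correlation need not satisfy the triangle inequality, which is allowed.
condensed = squareform(dissimilarity, checks=False)   # condensed form used by scipy's linkage
```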
Agglomerative clustering, hierarchical clustering and dendrograms
Agglomerative clustering • Agglomerative clustering algorithms begin with every observation representing a singleton cluster. • At each of the N-1 steps, the closest two (least dissimilar) clusters are merged into a single cluster. • Therefore a measure of dissimilarity between two clusters must be defined.
Different intergroup dissimilarities Let G and H represent two groups. • Single linkage: $d_{SL}(G,H) = \min_{i \in G,\, i' \in H} d_{ii'}$ • Complete linkage: $d_{CL}(G,H) = \max_{i \in G,\, i' \in H} d_{ii'}$ • Group average: $d_{GA}(G,H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$
Comparing different linkage methods If there is a strong clustering tendency, all three methods produce similar results. Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"); good for elongated clusters. Complete linkage may lead to clusters in which observations assigned to a cluster are much closer to members of other clusters than to some members of their own cluster (bad); use it for very compact clusters (like pearls on a string). Group average clustering represents a compromise between the extremes of single and complete linkage; use it for ball-shaped clusters.
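A minimal sketch comparing the three linkage methods with SciPy; the simulated data, the Euclidean metric, and the choice of three clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: 30 observations (e.g. genes) x 8 features (e.g. samples)
x = np.random.default_rng(3).normal(size=(30, 8))
for method in ("single", "complete", "average"):
    Z = linkage(x, method=method, metric="euclidean")   # agglomerative clustering
    labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])              # cluster sizes per method
```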
Dendrogram Recursive binary splitting/agglomeration can be represented by a rooted binary tree. The root node represents the entire data set, and the N terminal nodes of the tree represent individual observations. Each nonterminal node ("parent") has two daughter nodes. The binary tree can be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its two daughters. A dendrogram provides a complete description of the hierarchical clustering in graphical format.

Comment: Dendrograms are often viewed as a graphical summary of the data rather than a description of the results of the algorithm. Caution: different hierarchical methods, as well as small changes in the data, can lead to quite different dendrograms. Such a summary is valid only to the extent that the pairwise *observation* dissimilarities possess the hierarchical structure produced by the algorithm; hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.

The extent to which the hierarchical structure produced by a dendrogram actually represents the data can be judged by the cophenetic correlation coefficient: the correlation between the N(N-1)/2 pairwise observation dissimilarities $d_{ii'}$ input to the algorithm and their corresponding cophenetic dissimilarities $C_{ii'}$ derived from the dendrogram. The cophenetic dissimilarity $C_{ii'}$ between two observations i and i' is the intergroup dissimilarity at which observations i and i' are first joined together in the same cluster.

The cophenetic dissimilarity is a very restrictive dissimilarity measure. First, the $C_{ii'}$ over the observations must contain many ties, since only N-1 of the total N(N-1)/2 values can be distinct. Also, these dissimilarities obey the ultrametric inequality $C_{ii'} \le \max(C_{ik}, C_{i'k})$ for any three observations (i, i', k). Comment: since most dissimilarities do not satisfy the ultrametric inequality, the dendrogram should be viewed mainly as a description of the clustering structure imposed by a particular algorithm.
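A minimal sketch of computing the cophenetic correlation coefficient with SciPy; the simulated data and the average-linkage choice are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

x = np.random.default_rng(4).normal(size=(30, 8))   # hypothetical observations
d = pdist(x)                              # the N(N-1)/2 pairwise dissimilarities d_ii'
Z = linkage(d, method="average")          # dendrogram (linkage matrix)
c, coph_d = cophenet(Z, d)                # c = cophenetic correlation coefficient
print(round(c, 3))                        # closer to 1 => dendrogram represents d well
```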
Comments on dendrograms Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms. Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data. In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data. They are a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality $d_{ii'} \le \max(d_{ik}, d_{i'k})$ for all i, i', k.
Combinatorial clustering algorithms. Example: K-means clustering
Combinatorial clustering algorithms Most popular clustering algorithms directly assign each observation to a group or cluster without regard to a probability model describing the data. Notation: label observations by an integer i in {1,...,N} and clusters by an integer k in {1,...,K}. The cluster assignments can be characterized by a many-to-one mapping C(i) that assigns the i-th observation to the k-th cluster: C(i)=k (also known as an encoder). One seeks a particular encoder C* that minimizes a particular *loss* function (also known as an energy function).
Loss functions for judging clusterings One seeks a particular encoder C* that minimizes a particular *loss* function (also known as an energy function). Example: the within-cluster point scatter $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$.
Cluster analysis by combinatorial optimization Straightforward in principle: simply minimize W(C) over all possible assignments of the N data points to K clusters. Unfortunately, such optimization by complete enumeration is feasible only for very small data sets. For this reason, practical clustering algorithms examine only a fraction of all possible encoders C. The goal is to identify a small subset that is likely to contain the optimal one, or at least a good suboptimal partition. Feasible strategies are based on iterative greedy descent.

K-means clustering The K-means clustering algorithm is one of the most popular iterative descent clustering methods. Setting: all variables are of the quantitative type and one uses a squared Euclidean distance. In this case the within-cluster point scatter can be written as
$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|x_i - x_{i'}\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2$ (homework!),
where $\bar{x}_k$ is the mean vector associated with the k-th cluster and $N_k$ is the number of observations assigned to it. (BTW: a weighted Euclidean distance can be used by redefining the $x_{ij}$ values.)

An iterative descent algorithm for solving $C^* = \arg\min_C \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2$ can be obtained by noting that, for any set of observations S, $\bar{x}_S = \arg\min_m \sum_{i \in S} \|x_i - m\|^2$. Hence we can obtain C* by solving the enlarged optimization problem
$\min_{C,\, \{m_k\}_{k=1}^{K}} \sum_{k=1}^{K} N_k \sum_{i: C(i)=k} \|x_i - m_k\|^2$.
A local minimum for this can be found by the following alternating optimization procedure.

K-means clustering algorithm
1. For a given cluster assignment C, the total cluster variance $\sum_{k=1}^{K} N_k \sum_{i: C(i)=k} \|x_i - m_k\|^2$ is minimized with respect to $\{m_1,...,m_K\}$, yielding the means of the currently assigned clusters.
2. Given the current set of means $\{m_1,...,m_K\}$, the total cluster variance is minimized by assigning each observation to the closest (current) cluster mean, i.e. $C(i) = \arg\min_k \|x_i - m_k\|^2$.
3. Steps 1 and 2 are iterated until the assignments do not change.
Reminder: the solution may represent a (suboptimal) local minimum. One should start with many different random choices of starting means, and choose the solution having the smallest value of the objective function.
K-means clustering is a very popular iterative descent clustering method. Setting: all variables are of the quantitative type and one uses a squared Euclidean distance. In this case $W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2$. One can obtain the optimal C* by solving the enlarged optimization problem $\min_{C,\, \{m_k\}} \sum_{k=1}^{K} N_k \sum_{i: C(i)=k} \|x_i - m_k\|^2$. A local minimum for this can be found by the following alternating optimization procedure: K-means clustering algorithm 1. For a given cluster assignment C, the total cluster variance $\sum_{k=1}^{K} N_k \sum_{i: C(i)=k} \|x_i - m_k\|^2$ is minimized with respect to $\{m_1,...,m_K\}$, yielding the means of the currently assigned clusters. 2. Given the current set of means $\{m_1,...,m_K\}$, the total cluster variance is minimized by assigning each observation to the closest (current) cluster mean, i.e. $C(i) = \arg\min_k \|x_i - m_k\|^2$. 3. Steps 1 and 2 are iterated until the assignments do not change. Reminder: the solution may represent a (suboptimal) local minimum. One should start with many different random choices of starting means, and choose the solution having the smallest value of the objective function.
K-means clustering algorithm leads to a local minimum 1. For a given cluster assignment C, the total cluster variance is minimized with respect to $\{m_1,...,m_K\}$, yielding the means of the currently assigned clusters. 2. Given the current set of means, the total cluster variance is minimized by assigning each observation to the closest (current) cluster mean, i.e. $C(i) = \arg\min_k \|x_i - m_k\|^2$. 3. Steps 1 and 2 are iterated until the assignments do not change. Recommendation: Start with many different random choices of starting means, and choose the solution having the smallest value of the objective function.
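A minimal sketch of the alternating-optimization algorithm above in Python (illustrative, not a production implementation); the simulated data, the number of clusters, and the restart and iteration counts are assumptions for the example.

```python
import numpy as np

def kmeans(x, K, n_starts=10, n_iter=100, seed=0):
    """K-means by alternating assignment and mean updates, with multiple random restarts."""
    rng = np.random.default_rng(seed)
    best_C, best_obj = None, np.inf
    for _ in range(n_starts):                                   # many random starting means
        m = x[rng.choice(len(x), size=K, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each observation to the closest current cluster mean
            d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
            C = d2.argmin(axis=1)
            # Step 1: recompute the means of the currently assigned clusters
            new_m = np.array([x[C == k].mean(axis=0) if np.any(C == k) else m[k]
                              for k in range(K)])
            if np.allclose(new_m, m):                           # Step 3: stop when stable
                break
            m = new_m
        obj = ((x - m[C]) ** 2).sum()                           # within-cluster sum of squares
        if obj < best_obj:                                      # keep the best restart
            best_obj, best_C = obj, C
    return best_C, best_obj

x = np.random.default_rng(5).normal(size=(100, 2))              # hypothetical data
labels, obj = kmeans(x, K=3)
```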
Applications of prediction methods • Prognosis of cancer outcomes based on molecular (gene expression) information • Diagnosis of cancer (subtypes) • More esoteric: Cluster validation....
Overview Goal: use the input (features) to predict the output (response). Denote quantitative outcomes by $y_i$ and qualitative outcomes by $g_i$. We assume that we have available a set of measurements $(x_i, y_i)$ or $(x_i, g_i)$, i = 1,...,N, known as the training data, with which to construct our prediction rule.
Least squares prediction Given a vector of inputs $X = (X_1,...,X_p)$, we predict the output via the linear model $\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$. How do we fit the model to a set of training data? The most popular method is least squares: minimize the residual sum of squares $RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2$. If $X^T X$ is nonsingular, the unique solution of the normal equations is $\hat\beta = (X^T X)^{-1} X^T y$ (by the Gauss-Markov theorem, this estimator has the smallest variance among all linear unbiased estimators).
Using a linear model for predicting a binary outcome: For binary responses, code the response Y as 1 and 0. The fitted values $\hat{Y}$ are converted to a fitted class variable according to the rule $\hat{G} = 1$ iff $\hat{Y} > 0.5$ (and $\hat{G} = 0$ otherwise).
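A minimal sketch of least-squares fitting and the 0.5 threshold rule for a 0/1-coded response; the simulated data and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 100, 5
X = rng.normal(size=(N, p))                          # hypothetical inputs
g = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=N) > 0).astype(float)  # 0/1 response

X1 = np.column_stack([np.ones(N), X])                # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, g, rcond=None)    # least-squares solution
y_hat = X1 @ beta_hat                                # fitted values Y_hat
g_hat = (y_hat > 0.5).astype(int)                    # fitted class: G_hat = 1 iff Y_hat > 0.5
print((g_hat == g).mean())                           # training accuracy
```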
Popular techniques are variants of linear prediction and nearest neighbor predictors • 1-nearest neighbor, the simplest of all, captures a large percentage of the market. • Linear model: stable but biased. • Nearest neighbor: unstable but less biased. Scenario 1: The training data in each class were generated according to 2 bivariate normal distributions with uncorrelated components and different means -> use a linear model. Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian -> use k-nearest neighbors.
Nearest neighbor methods Nearest-neighbor methods use those observations in the training set T closest in the feature space to x to form the prediction $\hat{Y}(x)$. The k-nearest-neighbor prediction at x is $\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample. • Requires a metric, e.g. the Euclidean distance. • When dealing with a qualitative response, use majority voting to predict the class.
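A minimal sketch of a k-nearest-neighbor predictor with Euclidean distance, using averaging for a quantitative response and majority voting for a class label; the data and the choice k = 5 are assumptions for the example.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=5, classify=False):
    """k-nearest-neighbor prediction at x_new using Euclidean distance."""
    d2 = ((x_train - x_new) ** 2).sum(axis=1)        # squared Euclidean distances to x_new
    nn = np.argsort(d2)[:k]                          # indices of the k closest training points
    if classify:
        labels, counts = np.unique(y_train[nn], return_counts=True)
        return labels[counts.argmax()]               # majority vote over the neighborhood
    return y_train[nn].mean()                        # average of the neighbors' y values

rng = np.random.default_rng(7)
x_train = rng.normal(size=(50, 3))                   # hypothetical training features
g_train = (x_train[:, 0] > 0).astype(int)            # hypothetical class labels
print(knn_predict(x_train, g_train, x_new=np.array([0.2, -1.0, 0.5]), k=5, classify=True))
```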
Conclusion Many ways of normalizing gene expression data: we discussed dChip. Many ways of identifying differentially expressed genes: we discussed SAM. Many ways of clustering genes: we discussed k-means and hierarchical clustering. Many ways of forming predictors: we discussed linear models and k-nearest neighbor predictors.