Analysis of Multiple Experiments: TIGR Multiple Experiment Viewer (MeV)
Advanced Course Coverage
• Introduction
-fundamental concepts, expression vectors and distance metrics
-fundamental statistical concepts encountered in MeV analysis modules
• Algorithm Coverage
-Lecture / Hands-on Exercises (refer to algorithm handout for order…)
[Figure: Microarray Data Flow at TIGR (The Institute for Genomic Research). Components shown: microarray printers and microarray scanners (IAS-1, IAS-2, Lucidea, Axon-1, Axon-2, MD, MD3, ScanArray, others); Scheduler (Machine Scheduling); SliTrack (Machine Control); Exp Designer; MABCOS (Barcode System); PCR Score; the MAD database with its data entry pages (Study, Experiment, Probe Source, Slidetype, Slide, Probe, Hybridization, Scan, Expression Analysis) and reports; MAGE-ML; the MUSAGE database; other databases. Analysis components: .tiff image file, Spotfinder (Image Analysis), MADAM (Data Manager), Miner (.tav File Creator), GenePix Converter, raw .tav file, MIDAS (Normalization), normalized .tav file, Query Window, MeV (Data Analysis), interpretation…]
The Expression Matrix is a representation of data from multiple microarray experiments (rows: Gene 1 to Gene 6; columns: Exp 1 to Exp 6).
Each element is a log ratio (usually log2(Cy5 / Cy3)).
• Red indicates a positive log ratio, i.e., Cy5 > Cy3
• Green indicates a negative log ratio, i.e., Cy5 < Cy3
• Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value
• Gray indicates missing data
Expression Vectors
-Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
Example vector of log2(Cy5/Cy3) values: (1.5, -0.8, 1.8, 0.5, -0.4, -1.3, 1.5, 0.8)
Expression Vectors As Points in ‘Expression Space’
      Exp 1   Exp 2   Exp 3
G1    -0.8    -0.3    -0.7
G2    -0.7    -0.8    -0.4
G3    -0.4    -0.6    -0.8
G4     0.9     1.2     1.3
G5     1.3     0.9    -0.6
[Figure: the genes plotted as points on the axes Experiment 1, Experiment 2 and Experiment 3, with a group of nearby points labeled “Similar Expression”.]
Distance and Similarity
-the ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms
-distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression
-selection of a distance metric defines the concept of distance
Distance: a measure of similarity between genes.
Gene A = $(x_{1A}, x_{2A}, \dots, x_{6A})$ and Gene B = $(x_{1B}, x_{2B}, \dots, x_{6B})$, measured over Exp 1 to Exp 6.
• Some distances: (MeV provides 11 metrics)
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation
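Distance is Defined by a Metric
[Figure: two example pairs of expression profiles, each with its distance D under two metrics.]
Distance Metric:      Euclidean    Pearson (r × -1)
D, first example      1.4          -0.90
D, second example     4.2          -1.00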
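The three metrics named on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not MeV's implementation: the function names and the two example vectors are made up for the sketch, and the Pearson-based distance follows the "r × -1" convention shown on the slide.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_r(a, b):
    """Pearson correlation coefficient (assumes neither vector is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_distance(a, b):
    """Correlation multiplied by -1, so that smaller values mean more similar."""
    return -pearson_r(a, b)

# Hypothetical expression vectors: gene_b has the same shape as gene_a, shifted up by 2.
gene_a = [1.0, 2.0, 3.0]
gene_b = [3.0, 4.0, 5.0]

d_euclid = euclidean(gene_a, gene_b)
d_manhattan = manhattan(gene_a, gene_b)
d_pearson = pearson_distance(gene_a, gene_b)
print(d_euclid, d_manhattan, d_pearson)
```

The two vectors are clearly separated under the Euclidean and Manhattan metrics, yet as close as possible under the Pearson-based distance, because that metric compares the shape of the profiles and ignores the offset.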
Probability distributions
The probability of an event is the likelihood of its occurring. It is sometimes computed as a relative frequency (rf), where
$rf = \dfrac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$
The probability of an event can sometimes be inferred from a theoretical probability distribution, such as a normal distribution.
Normal distribution
[Figure: bell curve centered at X = μ (mean of the distribution); σ = std. deviation of the distribution.]
[Figure: two normal curves, Population 1 (Mean 1) and Population 2 (Mean 2), with the sample mean “s” marked.]
If the sample had come from population 1, there would be less than a 5% chance of obtaining a sample mean as extreme as s, i.e., s is significantly different from “mean 1” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population 2.
Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution. But expression measurements? Probably not. Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests. Randomization / resampling based tests can be used to get around the violation of the normality assumption. Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.
Outline of a randomization test - 1
1. Compute the value of interest (i.e., the test statistic s) from your data set.
2. Make “fake” data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion.
3. Re-compute s from the “fake” data set.
[Figure: the original data set yields s; each randomized data set yields a “fake” s.]
Outline of a randomization test - 2
4. Repeat steps 2 and 3 many times (often several hundred to several thousand times). Keep a record of the “fake” s values from step 3.
5. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (“fake”) s values.
[Figure: the range of randomized s values, with the original s value beyond most of them. The original s value could be significant as it exceeds most of the randomized s values.]
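Steps 1 to 5 can be sketched as a small permutation test. Everything specific here is an assumption made for illustration: the test statistic s is taken to be the difference between two group means, the "fake" data sets are made by reshuffling group labels, and the two groups of log ratios are invented.

```python
import random

def mean_difference(group_a, group_b):
    """Test statistic s: difference between the two group means."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

def randomization_test(group_a, group_b, n_iter=2000, seed=0):
    rng = random.Random(seed)
    observed = mean_difference(group_a, group_b)        # step 1
    pooled = list(group_a) + list(group_b)
    fake_values = []
    for _ in range(n_iter):                             # step 4: repeat many times
        rng.shuffle(pooled)                             # step 2: re-arrange the data at random
        fake_a = pooled[:len(group_a)]
        fake_b = pooled[len(group_a):]
        fake_values.append(mean_difference(fake_a, fake_b))  # step 3: re-compute s
    # Step 5: how often is a "fake" s at least as extreme as the original s?
    # The small tolerance guards against floating-point rounding.
    n_extreme = sum(1 for s in fake_values if abs(s) >= abs(observed) - 1e-12)
    p_value = (n_extreme + 1) / (n_iter + 1)
    return observed, p_value

# Hypothetical log ratios for one gene in two groups of samples.
group_a = [2.1, 1.8, 2.4, 2.0, 2.2]
group_b = [-0.1, 0.3, 0.0, -0.2, 0.1]

observed, p_value = randomization_test(group_a, group_b)
print(observed, p_value)
```

Adding 1 to both the count and the number of iterations is a common convention that treats the original arrangement as one of the possible shuffles, so the estimated p-value is never exactly zero.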
Outline of a randomization test - 3
Rationale
Ideally, we want to know the “behavior” of the larger population from which the sample is drawn, in order to make statistical inferences. Here, we don’t know that the larger population “behaves” like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand. Our “fake” data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample).
The problem of multiple testing
(adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany, http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)
• Let’s imagine there are 10,000 genes on a chip, AND
• None of them is differentially expressed.
• Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.
The problem of multiple testing – 2
• Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01
• Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, there would be only a 1% chance of obtaining a result at least as extreme as the one observed, i.e.,
• When we conclude that the gene is differentially expressed (because p < 0.05), we accept a small risk that this conclusion is a false positive.
• We might be willing to live with such a low probability of being wrong
• BUT .....
The problem of multiple testing – 3 • We are testing 10,000 genes, not just one!!! • Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05 • If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.
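The figure of roughly 500 false positives can be checked with a short simulation. The sketch below relies on one stated assumption: when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1 (which holds for tests with a continuous test statistic). The function name and seed are arbitrary.

```python
import random

def count_false_positives(n_genes=10_000, alpha=0.05, seed=1):
    """Count null genes that are called 'differentially expressed' at p < alpha."""
    rng = random.Random(seed)
    # Assumption: under a true null hypothesis, p-values are uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(1 for p in p_values if p < alpha)

n_false = count_false_positives()
expected = 0.05 * 10_000   # 500 genes
print(n_false, expected)
```

The simulated count varies from seed to seed but stays close to 500, even though every one of those calls is wrong by construction.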
The problem of multiple testing - 4
• There are “tricks” we can use to reduce the severity of this problem.
• They all involve “slashing” the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.
• We’ll go into some of these techniques later.
Don’t get too hung up on p-values.
• Ultimately, what matters is biological relevance.
• P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. Statistical significance is not necessarily the same as biological significance.
i.e., you don’t want to belong to “that group of people whose aim in life is to be wrong 5% of the time”!!!*
*Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental randomization. Biometrika 56:231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology: pg. 1. Chapman and Hall / CRC
Pearson correlation coefficient – r
[Figure: two scatter plots of Y against X.]
• Indicates the degree to which a linear relationship can be approximated between two variables.
• Can range from (–1.0) to (+1.0).
• Positive r between two variables X and Y: as X increases, so does Y on the whole.
• Negative r: as X increases, Y generally decreases.
• The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.
Pearson correlation - 2
• Sometimes, a p-value is associated with the correlation coefficient r.
• This p-value is computed from a theoretical distribution of the correlation coefficient, similar to the normal distribution.
[Figure: distribution of the sample correlation coefficient r, centered on a population correlation coefficient of 0. The tails mark the p < 0.05 range: a sample r that falls there lies in the rejection range of the distribution that has a mean = 0, so we reject the null hypothesis that the variables are not correlated.]
• This is the p-value for the null hypothesis that the X and Y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between X and Y.
• If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between X and Y.
Pearson correlation - 3
The square of the Pearson correlation, r², also known as the coefficient of determination, is a measure of the “strength” of the linear relationship between X and Y. It is the proportion of the total variation in one variable that is explained by a linear relationship with the other.
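A small worked example may help tie r and r² together. The data values and variable names below are invented for illustration, and the correlation is computed directly from its definition rather than through any particular statistics package.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient (assumes neither variable is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]    # rises with x, with a little noise
y_down = [5.0, 4.0, 3.0, 2.0, 1.0]   # falls exactly linearly as x rises

r_up = pearson_r(x, y_up)            # close to +1
r_down = pearson_r(x, y_down)        # -1: a perfect negative linear relationship
r_squared_up = r_up ** 2             # coefficient of determination
r_squared_down = r_down ** 2
print(r_up, r_squared_up, r_down, r_squared_down)
```

Note that r² is the same for a perfect positive and a perfect negative relationship: it measures how linear the relationship is, not its direction.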
Hierarchical Clustering (HCL) HCL is an agglomerative clustering method which joins similar genes into groups. The iterative process continues with the joining of resulting groups based on their similarity until all groups are connected in a hierarchical tree. (HCL-1)
[Figure: Hierarchical Clustering of genes g1 to g8. g1 is most like g8; g4 is most like {g1, g8}.] (HCL-2)
[Figure: Hierarchical Clustering, continued. g5 is most like g7; {g5, g7} is most like {g1, g4, g8}.] (HCL-3)
[Figure: Hierarchical Tree with leaf order g1, g8, g4, g5, g7, g2, g3, g6.] (HCL-4)
Hierarchical Clustering During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods. (HCL-5)
Agglomerative Linkage Methods • Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked. • Three linkage methods that are commonly used are: • Single Linkage • Average Linkage • Complete Linkage (HCL-6)
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$D_{AB} = \min\, d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$ (HCL-7)
Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$ (HCL-8)
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max\, d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$ (HCL-9)
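The three linkage rules, and the agglomerative joining process they drive, can be sketched as follows. This is a naive illustration rather than MeV's HCL code: the function names, the use of Euclidean distance, the tie-breaking (the first closest pair found is merged) and the four example points are all assumptions of the sketch.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(cluster_a, cluster_b):
    """All member-to-member distances d(u_i, v_j) between two clusters."""
    return [euclidean(u, v) for u, v in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))

def agglomerate(vectors, linkage=average_linkage):
    """Repeatedly join the two closest clusters; return the merges in order."""
    clusters = {i: [v] for i, v in enumerate(vectors)}   # cluster id -> member vectors
    members = {i: (i,) for i in clusters}                # cluster id -> original indices
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Closest pair of clusters under the chosen linkage rule.
        dist, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                         for i, p in enumerate(ids) for q in ids[i + 1:])
        clusters[next_id] = clusters.pop(p) + clusters.pop(q)
        members[next_id] = members.pop(p) + members.pop(q)
        merges.append((members[next_id], dist))
        next_id += 1
    return merges

# Hypothetical two-experiment expression vectors.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]
d_single = single_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)

merges = agglomerate(cluster_a + cluster_b)
for joined, dist in merges:
    print(joined, round(dist, 3))
```

For any pair of clusters the single-linkage distance is never larger than the average, which is never larger than the complete-linkage distance, which is one way to see why the three rules can produce different trees from the same data.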
[Figure: Comparison of Linkage Methods. Trees built with single, average and complete linkage.] (HCL-10)
Bootstrapping (ST)
Bootstrapping – resampling with replacement
Original expression matrix: rows Gene 1 to Gene 6; columns Exp 1, Exp 2, Exp 3, Exp 4, Exp 5, Exp 6
Various bootstrapped matrices (by experiments), each with the same rows Gene 1 to Gene 6:
-columns Exp 2, Exp 2, Exp 3, Exp 4, Exp 4, Exp 4
-columns Exp 1, Exp 1, Exp 3, Exp 5, Exp 5, Exp 6
Jackknifing (ST)
Jackknifing – resampling without replacement
Original expression matrix: rows Gene 1 to Gene 6; columns Exp 1, Exp 2, Exp 3, Exp 4, Exp 5, Exp 6
Various jackknifed matrices (by experiments), each with the same rows Gene 1 to Gene 6:
-columns Exp 1, Exp 3, Exp 4, Exp 5, Exp 6
-columns Exp 1, Exp 2, Exp 3, Exp 4, Exp 6
Analysis of Bootstrapped and Jackknifed Support Trees • Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure. • Each time, hierarchical trees are created from the resampled matrices. • The trees are compared to the tree obtained from the original data set. • The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster. • As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
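The resampling step in the first bullet can be sketched as follows, resampling by experiments (columns) as in the slide examples. The function names and the tiny matrix are hypothetical, and the jackknife here leaves out exactly one experiment, which matches the examples shown but is an assumption about the general procedure. Building and comparing the trees is not shown.

```python
import random

def bootstrap_columns(matrix, rng):
    """Resample experiments (columns) with replacement, keeping the column count."""
    n_cols = len(matrix[0])
    picks = sorted(rng.randrange(n_cols) for _ in range(n_cols))
    return picks, [[row[j] for j in picks] for row in matrix]

def jackknife_columns(matrix, rng):
    """Leave out one randomly chosen experiment (column)."""
    n_cols = len(matrix[0])
    left_out = rng.randrange(n_cols)
    keep = [j for j in range(n_cols) if j != left_out]
    return keep, [[row[j] for j in keep] for row in matrix]

# Hypothetical expression matrix: 2 genes (rows) by 6 experiments (columns).
matrix = [
    [0.5, -0.2, 1.1, 0.9, -1.3, 0.4],
    [1.2, 0.8, -0.6, 0.1, 0.7, -0.9],
]

rng = random.Random(0)
boot_cols, boot_matrix = bootstrap_columns(matrix, rng)
jack_cols, jack_matrix = jackknife_columns(matrix, rng)
print("bootstrapped columns:", boot_cols)
print("jackknifed columns:", jack_cols)
```

A bootstrapped matrix usually repeats some experiments and omits others, while a jackknifed matrix contains each retained experiment exactly once.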
K-Means / K-Medians Clustering (KMC) – 1
[Figure: genes G1 to G13, not yet clustered.]
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
K-Means Clustering – 2
[Figure: genes G1 to G13 grouped into clusters.]
3. Calculate mean / median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene’s expression profile.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
K-Means / K-Medians is most useful when the user has an a priori hypothesis about the number of clusters the genes should group into.
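Steps 1 to 5 can be sketched in plain Python. This is the K-Means variant only, with squared Euclidean distance; the function names, the balanced random starting assignment, the handling of emptied clusters and the six example genes are all assumptions of the sketch rather than details taken from MeV.

```python
import random

def mean_profile(profiles):
    """Column-wise mean of a list of expression profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(genes, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: random assignment (balanced, so every cluster starts non-empty).
    labels = [i % k for i in range(len(genes))]
    rng.shuffle(labels)
    for _ in range(max_iter):                       # step 5: iteration limit
        # Step 3: mean expression profile of each cluster.
        centers = []
        for c in range(k):
            cluster_genes = [g for g, lab in zip(genes, labels) if lab == c]
            # Assumption of this sketch: an emptied cluster is re-seeded with a random gene.
            centers.append(mean_profile(cluster_genes) if cluster_genes
                           else list(rng.choice(genes)))
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [min(range(k), key=lambda c: squared_distance(g, centers[c]))
                      for g in genes]
        if new_labels == labels:                    # step 5: nothing moved, so stop
            break
        labels = new_labels
    return labels

# Hypothetical log-ratio profiles: three up-regulated and three down-regulated genes.
genes = [
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9],
]
labels = k_means(genes, k=2)
print(labels)
```

A K-Medians version would replace `mean_profile` with a column-wise median; the cluster numbers themselves are arbitrary, only the grouping is meaningful.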
Principal Components (PCAG and PCAE) – 1
1. PCA simplifies the “views” of the data.
2. Suppose we have measurements for each gene on multiple experiments.
3. Suppose some of the experiments are correlated.
4. PCA will effectively ignore the redundancy among the correlated experiments by taking weighted averages of experiments, thus possibly making the trends in the data more interpretable.
5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
PCAG and PCAE - 2
[Figure: “cloud” of data points (e.g., genes) in 3-dimensional space (axes x, y, z), and the same data points resolved along 3 principal component axes.]
In this example:
• x-axis could mean a continuum from over- to under-expression (“blue” and “green” genes over-expressed, yellow genes under-expressed)
• y-axis could mean that “gray” genes are over-expressed in the first five expts and under-expressed in the remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts.
• z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes.
Interpretation of components is somewhat subjective.
Cluster Affinity Search Technique (CAST) -uses an iterative approach to segregate elements with ‘high affinity’ into a cluster -the process iterates through two phases -addition of high affinity elements to the cluster being created -removal or clean-up of low affinity elements from the cluster being created
Cluster Affinity Search Technique (CAST) - 1
[Figure: an empty cluster C1 and the unassigned genes G1 to G15.]
Affinity = a measure of similarity between a gene, and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.
1. Create a new empty cluster C1.
2. Set initial affinity of all genes to zero.
3. Move the two most similar genes into the new cluster.
4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).
ADD GENES:
5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.
CAST – 2
[Figure: genes G1 to G15 divided between the current cluster C1 and the unassigned genes.]
REMOVE GENES:
6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene’s affinity its similarity to the removed gene.
7. Repeat step 6 while C1 contains a low-affinity gene.
8. Repeat steps 5-7 as long as changes occur to the cluster C1.
9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.
10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.
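The add and remove phases can be sketched as below. Several points are assumptions of this sketch and not taken from the slides or from MeV: similarities are assumed to lie between 0 and 1 with each gene's self-similarity equal to 1, the "maximum affinity at that point" is taken to be the current cluster size, affinities are recomputed from scratch instead of being updated incrementally, and a cap on the number of add/remove rounds guards against oscillation. All names and the example matrix are hypothetical.

```python
def cast(sim, threshold, max_rounds=100):
    """sim: symmetric similarity matrix with values in [0, 1] and sim[i][i] == 1.
    threshold: fraction (0 to 1) of the maximum affinity, assumed here to be
    the current cluster size."""
    unassigned = set(range(len(sim)))
    clusters = []

    def affinity(gene, cluster):
        # Sum of the gene's similarities to every current member of the cluster.
        return sum(sim[gene][member] for member in cluster)

    while unassigned:
        if len(unassigned) == 1:                     # a lone leftover gene
            clusters.append(sorted(unassigned))
            break
        # Step 3: seed the new cluster with the two most similar unassigned genes.
        genes = sorted(unassigned)
        a, b = max(((p, q) for i, p in enumerate(genes) for q in genes[i + 1:]),
                   key=lambda pair: sim[pair[0]][pair[1]])
        cluster = {a, b}
        unassigned -= cluster
        for _ in range(max_rounds):                  # step 8, with a safety cap
            changed = False
            # Step 5 (ADD): take the highest-affinity gene while it exceeds the threshold.
            while unassigned:
                best = max(sorted(unassigned), key=lambda g: affinity(g, cluster))
                if affinity(best, cluster) <= threshold * len(cluster):
                    break
                cluster.add(best)
                unassigned.discard(best)
                changed = True
            # Steps 6-7 (REMOVE): drop the lowest-affinity gene while it is below the threshold.
            while len(cluster) > 1:
                worst = min(sorted(cluster), key=lambda g: affinity(g, cluster))
                if affinity(worst, cluster) >= threshold * len(cluster):
                    break
                cluster.discard(worst)
                unassigned.add(worst)
                changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))             # steps 9-10: start the next cluster
    return clusters

# Hypothetical similarities: genes 0-2 resemble each other, as do genes 3-4.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
clusters = cast(sim, threshold=0.5)
print(clusters)
```

Unlike K-Means, the number of clusters is not specified in advance; it falls out of the threshold.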
QT-Clust (from Heyer et al. 1999) (HJC) - 1
[Figure: a “seed” gene with the current cluster forming around it, and the currently unassigned genes.]
1. Compute a jackknifed distance between all pairs of genes. (Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outlier experiments.)
2. Choose a gene as the seed for a new cluster. Add the gene which increases cluster diameter the least. Continue adding genes until additional genes will exceed the specified cluster diameter limit.
3. Repeat step 2 for every gene, so that each gene has the chance to be the seed of a new cluster. All clusters are provisional at this point.
QT-Clust – 2
[Figure: provisional clusters grown from different “seed” genes; the largest one is marked “Pick this cluster”.]
4. Choose the largest cluster obtained from steps 2 and 3. In case of a tie, pick one of the largest clusters at random.
5. All genes that are not in the cluster selected above are treated as currently unassigned. Repeat steps 2-4 on these unassigned genes.
6. Stop when the last cluster thus formed has fewer genes than a user-specified number. All genes that are not in a cluster at this point are treated as unassigned.
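The jackknifed distance of step 1 and the cluster-growing loop of steps 2 to 6 can be sketched as follows. The choices here are assumptions for illustration: the underlying distance is 1 minus the Pearson correlation, ties between equally large clusters go to the first one found instead of a random one, and all names and example values are hypothetical.

```python
import math

def pearson_distance(a, b):
    """1 - Pearson correlation (assumes neither profile is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def jackknife_distance(a, b, dist=pearson_distance):
    """Step 1: exclude each experiment in turn and keep the maximum distance."""
    return max(dist(a[:i] + a[i + 1:], b[:i] + b[i + 1:]) for i in range(len(a)))

def qt_clust(d, max_diameter, min_size=2):
    """d: symmetric matrix of (jackknifed) distances between genes."""
    unassigned = set(range(len(d)))
    clusters = []
    while unassigned:
        best = []
        for seed in sorted(unassigned):              # step 3: every gene gets to be the seed
            cluster, diameter = [seed], 0.0
            candidates = unassigned - {seed}
            while candidates:                        # step 2: grow the provisional cluster
                gene = min(sorted(candidates),
                           key=lambda c: max(d[c][m] for m in cluster))
                new_diameter = max(diameter, max(d[gene][m] for m in cluster))
                if new_diameter > max_diameter:
                    break
                cluster.append(gene)
                candidates.discard(gene)
                diameter = new_diameter
            if len(cluster) > len(best):             # step 4: keep the largest (first on ties)
                best = cluster
        if len(best) < min_size:                     # step 6: stopping rule
            break
        clusters.append(sorted(best))
        unassigned -= set(best)                      # step 5: repeat on the rest
    return clusters, sorted(unassigned)

# One shared outlier experiment (the last) makes these two genes look correlated.
gene_a = [1.0, 2.0, 3.0, 4.0, 10.0]
gene_b = [4.0, 3.0, 2.0, 1.0, 10.0]
plain = pearson_distance(gene_a, gene_b)
jackknifed = jackknife_distance(gene_a, gene_b)

# Hypothetical distance matrix: genes 0-2 are close together, as are genes 3-4.
d = [
    [0.0, 0.1, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.9, 0.2, 0.0],
]
clusters, leftover = qt_clust(d, max_diameter=0.3, min_size=2)
strict_clusters, strict_leftover = qt_clust(d, max_diameter=0.3, min_size=3)
print(plain, jackknifed, clusters, leftover)
```

In the two-gene example the plain distance is small because of the shared outlier, while the jackknifed distance exposes the opposite trends in the remaining experiments. Raising `min_size` leaves the smaller group unassigned, as in step 6.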
SOTA - 1
Self Organizing Tree Algorithm
• Dopazo, J., and J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997.
• Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.