Microarray Data Normalization and Analysis. John Quackenbush. 15 February 2003. MGED/AAAS Denver.
The Experimental Design
• The experimental design dictates a good deal of what you can do with the data
• Good normalization and processing reflect the experimental design
• The design also facilitates certain comparisons between samples and provides the statistical power you need for assigning confidence limits to individual measurements
• The design must reflect experimental reality
• The most straightforward designs compare expression in two classes of samples to look for patterns that distinguish them
Loops and Reference Designs
[Figure: design diagrams for samples A, B, C, D, E and reference R. Panels: standard flip-dye expt; proposed loop expt; proposed loop expt with reference, to provide direct comparison with reference; proposed loop expt with reference, to show invariance w.r.t. order. Hybridization counts shown in the figure: 10 hybs, 10 hybs, 0 new hybs, 3 new hybs, 23 hybs.]
S. Wang, K. Kerr, J. Quackenbush, G. Churchill
Loops and Reference Designs
• Both approaches can give equivalent results
S. Wang, K. Kerr, J. Quackenbush, G. Churchill
Loop vs. Reference Designs
• Loop design
  • Can provide direct measurements
  • Gives more data on each experimental sample with the same number of hybs
  • Requires more RNA per sample
  • Can “unwind” with a bad sample or for a gene with bad data
• Reference design
  • Easily extensible
  • Simple interpretation of all results
  • Requires less RNA per sample
  • Less sensitive to bad RNA samples and bad array elements
Basic Design Principles
• Biological replicas are more informative than correlated replicas (independent RNA, independent slides)
• More replicas are better: higher statistical power
• For loops, hybridizations of individual samples should be “balanced” (as many Cy3 as Cy5 labelings)
• Self-self hybs add data on reproducibility and can be used to produce error models
• At a minimum, one should use dye-swap replicates to compensate for any dye biases in labeling or detection
One Possible Experimental Paradigm: Examining Genotype, Phenotype, and Environment
[Figure: design diagram with four samples (Parental, stressed; Derived, stressed; Parental, unstressed; Derived, unstressed). Labels: Genotype, Environment, Reference Sample, Assay Variation.]
Microarray Overview
[Figure: workflow diagram. Steps: collect hybridization data; normalization and filtering (normalize data and reduce complexity); explore patterns of expression. Boxes labeled MAD.]
Replicates: Filtering Questionable Data
• In every array, there are questionable (or bad) data for some elements
• Replicates can help identify those elements
• We can use an unbiased filter to eliminate those from future consideration
Replicates: Applied to Filtering Data
• Consider two replicates with dyes swapped, with ratios $A_1/B_1$ and $B_2/A_2$
• We expect to see $\frac{A_1 B_2}{B_1 A_2} = 1$, so $R = \log_2\left(\frac{A_1 B_2}{B_1 A_2}\right) = 0$
• We can calculate R and eliminate spots with the greatest uncertainty: R > 2
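A minimal sketch of the flip-dye consistency check, assuming background-corrected intensities for each spot in both replicates. The function names, the sample intensities, and the choice to apply the cutoff symmetrically to |R| are illustrative assumptions; the slide states the cutoff only as R > 2.

```python
import math

def flip_dye_log_ratio(a1, b1, a2, b2):
    """R = log2((A1 * B2) / (B1 * A2)); 0 when the two replicates agree."""
    return math.log2((a1 * b2) / (b1 * a2))

def flip_dye_filter(spots, cutoff=2.0):
    """Keep spots whose |R| does not exceed the cutoff (symmetric cutoff is an assumption)."""
    kept = []
    for name, a1, b1, a2, b2 in spots:
        r = flip_dye_log_ratio(a1, b1, a2, b2)
        if abs(r) <= cutoff:
            kept.append((name, r))
    return kept

# Hypothetical intensities: (name, A1, B1, A2, B2)
spots = [
    ("spot1", 1000.0, 500.0, 520.0, 260.0),  # A/B is 2 in both replicates
    ("spot2", 800.0, 100.0, 100.0, 800.0),   # A/B flips between replicates
]
kept = flip_dye_filter(spots)
```

Here `spot1` is retained because its two ratios agree, while `spot2` is dropped as inconsistent.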
The Effects of Flip-Dye Replicate Trim
[Figure: red data are eliminated as inconsistent.]
Significance: Z-scores
• The uncertainty in measurements increases as intensity decreases
• Measurements close to the detection limit are the most uncertain
• Fold-change measurements ignore these effects
• We can calculate an intensity-dependent Z-score that measures the ratio relative to the standard deviation in the data:
$Z_i = \frac{\log_2(R_i/G_i)}{\sigma^{\mathrm{local}}_{\log_2(R/G)}}$
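A rough sketch of an intensity-dependent Z-score, assuming already-normalized two-channel intensities. The sliding window over spots ranked by average log intensity, the window size, and the function name `local_z_scores` are assumptions made for illustration; the slide does not specify how the local standard deviation is estimated.

```python
import math
import statistics

def local_z_scores(red, green, window=5):
    """Divide each log2 ratio by the standard deviation of its intensity neighbours."""
    m = [math.log2(r / g) for r, g in zip(red, green)]        # log2 ratios
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]  # average log intensity
    order = sorted(range(len(m)), key=lambda i: a[i])         # spots ranked by intensity
    half = window // 2
    z = [0.0] * len(m)
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        neighbours = [m[order[k]] for k in range(lo, hi)]
        sd = statistics.pstdev(neighbours)                    # local spread of log ratios
        z[i] = m[i] / sd if sd > 0 else 0.0
    return z

# Hypothetical intensities for six spots
red = [100, 200, 400, 800, 1600, 3200]
green = [100, 100, 400, 400, 1600, 800]
z = local_z_scores(red, green)
```

Spots with |Z| above a chosen threshold (the next slide uses 2) would then be flagged as significant.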
“Slice Analysis” (Intensity-dependent Z-score) Z > 2 is at the 95.5% confidence level
Error Models
• The problem is to estimate the variability in the data based on empirical measurement
• This requires a number of self-self hybridizations to create an estimate of the inherent variability in the assay
• This can be done as a function of intensity or as an estimate of the variability for individual genes
[Figure label: genes failing to meet the significance criteria.]
Self-self Hybridizations Estimate Variability This is then used to construct an error model
Variance stabilization/regularization
• Measurements of expression vary between any two assays
• This can be affected by changes in the mean expression level, but normalization can help reduce those differences
• However, the variance, or spread in the data, can be quite different between replicates (or pen groups)
• Variance stabilization can rescale the data for each experiment to make these more comparable
A box plot can show the difference in variance between replicates
Variance Regularization
Assumptions (one approach):
• All groups (pen groups, hybs) should have the same spread
• The true ratio is $m_{ij}$, where $i$ represents different groups and $j$ represents different spots
• The observed ratio is $M_{ij}$, where $M_{ij} = a_i m_{ij}$
• A robust estimate of $a_i$ is $\mathrm{MAD}_i = \mathrm{median}_j \{ |y_{ij} - \mathrm{median}_j(y_{ij})| \}$
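A small sketch of MAD-based variance regularization, assuming each group is a list of log ratios. Dividing each group by its MAD relative to the geometric mean of all MADs is one common convention and is an assumption here, as are the function names and sample values; the slide only states that MAD is a robust estimate of the scale factor.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def regularize(groups):
    """Rescale each group so that all groups end up with the same MAD.

    Requires every group to have a nonzero MAD.
    """
    mads = [mad(g) for g in groups]
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    scales = [x / geo_mean for x in mads]   # relative scale factor per group
    return [[v / s for v in g] for g, s in zip(groups, scales)]

# Hypothetical log ratios for two pen groups with different spread
groups = [[-1.0, 0.0, 1.0, 2.0, -2.0], [-2.0, 0.0, 2.0, 4.0, -4.0]]
rescaled = regularize(groups)
```

After rescaling, both groups share the same MAD while their medians are unchanged at zero.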
Finding Significant Genes
• Assume we will compare two conditions with multiple replicates for each class
• Our goal is to find genes that are significantly different between these classes
• These are the genes that we will use for later data mining
Finding Significant Genes
• Average fold-change difference for each gene
  • Suffers from being arbitrary and from not taking into account systematic variation in the data
Finding Significant Genes
• t-test for each gene
  • Tests whether the means of the query and reference groups are the same
  • Essentially measures signal-to-noise
  • Calculate p-value (permutations or distributions)
  • May suffer from intensity-dependent effects
$t = \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between means}}{\text{variability of groups}} = \frac{\langle X_q \rangle - \langle X_c \rangle}{SE(X_q - X_c)}$
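A minimal sketch of a per-gene t statistic with a permutation p-value, assuming a handful of replicate log ratios per class. The unequal-variance (Welch) form of the standard error, the exhaustive enumeration of relabelings, and the sample values are assumptions for illustration; the slide does not say which standard error or permutation scheme is used.

```python
import itertools
import math
import statistics

def t_statistic(query, control):
    """(mean_q - mean_c) / SE, with a Welch-style standard error (an assumption)."""
    diff = statistics.mean(query) - statistics.mean(control)
    se = math.sqrt(statistics.variance(query) / len(query)
                   + statistics.variance(control) / len(control))
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(query, control):
    """Two-sided p-value from every relabeling of the pooled replicates."""
    observed = abs(t_statistic(query, control))
    pooled = list(query) + list(control)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in itertools.combinations(indices, len(query)):
        q = [pooled[i] for i in chosen]
        c = [pooled[i] for i in indices if i not in chosen]
        total += 1
        if abs(t_statistic(q, c)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical log ratios for one gene
query = [2.1, 1.9, 2.3, 2.0]
control = [0.1, -0.2, 0.0, 0.2]
t = t_statistic(query, control)
p = permutation_p_value(query, control)
```

With four replicates per class there are only 70 relabelings, so the smallest attainable two-sided p-value is 2/70; more replicates are needed to resolve smaller p-values.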
Finding Significant Genes • Significance Analysis of Microarrays (SAM) • Uses a modified t-test by estimating and adding a small positive constant to the denominator • Significant genes are those which exceed the expected values from permutation analysis.
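A short sketch of the modified t statistic, assuming the pooled standard error used in the published SAM method. The fixed value of the constant `s0` is purely illustrative (SAM estimates it from the data), and the permutation step that sets the significance cutoff is not shown.

```python
import math
import statistics

def sam_statistic(query, control, s0=0.1):
    """d = (mean_q - mean_c) / (s + s0), where s is the pooled standard error."""
    n_q, n_c = len(query), len(control)
    mean_q = statistics.mean(query)
    mean_c = statistics.mean(control)
    ss = (sum((x - mean_q) ** 2 for x in query)
          + sum((x - mean_c) ** 2 for x in control))
    s = math.sqrt((1.0 / n_q + 1.0 / n_c) * ss / (n_q + n_c - 2))
    return (mean_q - mean_c) / (s + s0)

# Hypothetical gene with very small variance
query = [1.01, 1.00, 0.99]
control = [0.00, 0.01, -0.01]
d_plain = sam_statistic(query, control, s0=0.0)   # ordinary pooled t statistic
d_sam = sam_statistic(query, control, s0=0.1)     # damped by the added constant
```

The added constant keeps genes with tiny measured variance from producing very large statistics on the strength of a small denominator alone.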
Finding Significant Genes
• Volcano plots
  • Combine t-tests and fold-change measures
  • Significant genes appear in upper corners
[Figure: volcano plot. Axes: log10(p-value) vs. mean log(ratio).]
Finding Significant Genes
• Analysis of Variance (ANOVA)
  • Which genes are most significant for separating classes of samples?
  • Calculate p-value (permutations or distributions)
  • Reduces to a t-test for two classes of samples
  • May suffer from intensity-dependent effects
Microarray Overview
• MIDAS performs data normalization and filtering, including, soon, ANOVA
[Figure: workflow diagram with boxes labeled MAD, MIDAS, MAD.]
MIDAS: Data Analysis Wei Liang Available with source
MeV: Data Mining Tools
Alexander Saeed, Alexander Sturn, Nirmal Bhagabati, John Braisted
Syntek Inc.; Datanaut, Inc.
Available with source
Multiple Experiments?
• The goal is to identify genes (or experiments) which have “similar” patterns of expression
• This is a problem in data mining
• “Clustering algorithms” are most widely used, although many others exist
• Types
  • Agglomerative clustering: Hierarchical
  • Divisive clustering: k-means, SOMs
  • Others: Principal Component Analysis (PCA)
• All depend on how one measures distance
Expression Vectors
• Crucial concept for understanding clustering
• Each gene is represented by a vector whose coordinates are its log(ratio) values in each experiment
  • x = log(ratio) in expt 1
  • y = log(ratio) in expt 2
  • z = log(ratio) in expt 3
  • etc.
• For example, if we do six experiments,
  • Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
  • Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
  • Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4)
  • etc.
Expression Matrix
• These gene expression vectors of log(ratio) values can be used to construct an expression matrix

         Expt 1   Expt 2   Expt 3   Expt 4   Expt 5   Expt 6
Gene1     -1.2     -0.5      0       0.25     0.75     1.4
Gene2      0.2     -0.5      1.2    -0.25    -1.0      1.5
Gene3      1.2      0.5      0      -0.25    -0.75    -1.4
etc.

• This is often represented as a red/green colored matrix
Expression Matrix
The expression matrix is a representation of data from multiple microarray experiments (rows: Gene 1 to Gene 6; columns: Exp 1 to Exp 6).
• Each element is a log ratio, usually log2(Cy5/Cy3)
• Red indicates a positive log ratio, i.e., Cy5 > Cy3
• Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value
• Green indicates a negative log ratio, i.e., Cy5 < Cy3
• Gray indicates missing data
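Expression Vectors As Points in ‘Expression Space’
[Figure: genes G1 to G5 from a matrix with columns Exp 1, Exp 2, Exp 3, plotted as points on axes x (Experiment 1), y (Experiment 2), z (Experiment 3); genes with similar expression lie close together.]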
Distance metrics
• Distances are measured “between” expression vectors
• Distance metrics define the way we measure distances
• Many different ways to measure distance:
  • Euclidean distance
  • Pearson correlation coefficient(s)
  • Manhattan distance
  • Mutual information
  • Kendall’s Tau
  • etc.
• Each has different properties and can reveal different features of the data
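A brief sketch of three of the listed metrics, applied to the Gene1 and Gene3 vectors from the earlier example. Turning the Pearson correlation r into a distance as 1 - r is a common convention and is an assumption here; the function names are illustrative.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson_distance(u, v):
    """1 - r, where r is the Pearson correlation (assumed convention)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return 1.0 - cov / (norm_u * norm_v)

gene1 = [-1.2, -0.5, 0.0, 0.25, 0.75, 1.4]
gene3 = [1.2, 0.5, 0.0, -0.25, -0.75, -1.4]

d_euclid = euclidean(gene1, gene3)
d_manhattan = manhattan(gene1, gene3)
d_pearson = pearson_distance(gene1, gene3)
```

Gene3 is the exact mirror image of Gene1, so the two are perfectly anti-correlated and the Pearson-based distance takes its maximum value of 2, which illustrates how the choice of metric determines what counts as "similar".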
Distance Matrix
Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”

         Gene1   Gene2   Gene3   Gene4   Gene5   Gene6
Gene1     0       1.5     1.2     0.25    0.75    1.4
Gene2     1.5     0       1.3     0.55    2.0     1.5
Gene3     1.2     1.3     0       1.3     0.75    0.3
Gene4     0.25    0.55    1.3     0       0.25    0.4
Gene5     0.75    2.0     0.75    0.25    0       1.2
Gene6     1.4     1.5     0.3     0.4     1.2     0

The elements of this matrix are the pair-wise distances. Note that the matrix is symmetric about the diagonal.
Hierarchical Clustering
1. Calculate the distance between all genes. Find the smallest distance. If several pairs share the same similarity, use a predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.
[Figure: genes G1 to G6 shown before and after clustering.]
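The steps of the procedure can be sketched as follows, using the six-gene distance matrix shown earlier. The use of single linkage for the cluster-to-cluster distance, the tie-breaking rule (the first pair found wins), and the function names are assumptions for illustration; drawing the tree (step 4) is omitted.

```python
def single_linkage(dist, a, b):
    """Smallest distance between a member of cluster a and a member of cluster b."""
    return min(dist[i][j] for i in a for j in b)

def agglomerate(dist, linkage=single_linkage):
    """Return the merges as (cluster_a, cluster_b, distance) until one cluster remains."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(dist, clusters[x], clusters[y])
                if best is None or d < best[0]:   # strict '<': first pair wins ties
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

# Distance matrix for Gene1..Gene6 (indices 0..5)
dist = [
    [0.0, 1.5, 1.2, 0.25, 0.75, 1.4],
    [1.5, 0.0, 1.3, 0.55, 2.0, 1.5],
    [1.2, 1.3, 0.0, 1.3, 0.75, 0.3],
    [0.25, 0.55, 1.3, 0.0, 0.25, 0.4],
    [0.75, 2.0, 0.75, 0.25, 0.0, 1.2],
    [1.4, 1.5, 0.3, 0.4, 1.2, 0.0],
]
merges = agglomerate(dist)
heights = [d for _, _, d in merges]
```

The recorded merge distances are the heights at which branches join in the resulting tree. Note that Gene1/Gene4 and Gene4/Gene5 tie at 0.25, which is where the predetermined rule comes into play.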
Hierarchical Clustering [Figure: genes g1 to g8 being joined step by step. g1 is most like g8; g4 is most like {g1, g8}.] (HCL-2)
Hierarchical Clustering [Figure: continuation. g5 is most like g7; {g5, g7} is most like {g1, g4, g8}.] (HCL-3)
Hierarchical Tree [Figure: final tree with leaf order g1, g8, g4, g5, g7, g2, g3, g6.] (HCL-4)
Agglomerative Linkage Methods
• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
• Three linkage methods that are commonly used are:
  • Single Linkage
  • Average Linkage
  • Complete Linkage
(HCL-6)
Single Linkage
• Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster.
• Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$D_{AB} = \min_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$ for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-7)
Average Linkage
• Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster.
• Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i} \sum_{j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$ for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-8)
Complete Linkage
• Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster.
• Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$ for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-9)
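A compact sketch of the single, average, and complete linkage rules, evaluated on the six-gene distance matrix shown earlier. The two clusters chosen for the example and the function names are illustrative assumptions.

```python
def single_linkage(dist, a, b):
    """Minimum pairwise distance between members of clusters a and b."""
    return min(dist[i][j] for i in a for j in b)

def average_linkage(dist, a, b):
    """Mean of all pairwise distances between members of clusters a and b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

def complete_linkage(dist, a, b):
    """Maximum pairwise distance between members of clusters a and b."""
    return max(dist[i][j] for i in a for j in b)

# Distance matrix for Gene1..Gene6 (indices 0..5)
dist = [
    [0.0, 1.5, 1.2, 0.25, 0.75, 1.4],
    [1.5, 0.0, 1.3, 0.55, 2.0, 1.5],
    [1.2, 1.3, 0.0, 1.3, 0.75, 0.3],
    [0.25, 0.55, 1.3, 0.0, 0.25, 0.4],
    [0.75, 2.0, 0.75, 0.25, 0.0, 1.2],
    [1.4, 1.5, 0.3, 0.4, 1.2, 0.0],
]
cluster_a = [0, 3]   # hypothetical cluster {Gene1, Gene4}
cluster_b = [2, 5]   # hypothetical cluster {Gene3, Gene6}

d_single = single_linkage(dist, cluster_a, cluster_b)
d_average = average_linkage(dist, cluster_a, cluster_b)
d_complete = complete_linkage(dist, cluster_a, cluster_b)
```

For any pair of clusters the single-linkage distance is at most the average, which is at most the complete-linkage distance, so the same data can merge in a different order under each rule.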
Comparison of Linkage Methods
[Figure: trees produced with average, single, and complete linkage.] (HCL-10)
Bootstrapping: resampling with replacement
[Figure: the original expression matrix (Gene 1 to Gene 6 by Exp 1 to Exp 6) and various bootstrapped matrices (by experiments), for example with columns Exp 2, Exp 2, Exp 3, Exp 4, Exp 4, Exp 4 and with columns Exp 1, Exp 1, Exp 3, Exp 5, Exp 5, Exp 6.]
Jackknifing: resampling without replacement
[Figure: the original expression matrix (Gene 1 to Gene 6 by Exp 1 to Exp 6) and various jackknifed matrices (by experiments), for example with columns Exp 1, Exp 3, Exp 4, Exp 5, Exp 6 and with columns Exp 1, Exp 2, Exp 3, Exp 4, Exp 6.]
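A minimal sketch of resampling an expression matrix by experiments (columns), assuming the matrix is a list of gene rows. Leaving out exactly one experiment per jackknifed matrix follows the figure; the function names, the fixed random seed, and the use of the three-gene example matrix are illustrative assumptions.

```python
import random

def bootstrap_columns(matrix, rng):
    """Draw as many columns as the original has, with replacement."""
    n = len(matrix[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return [[row[c] for c in cols] for row in matrix], cols

def jackknife_columns(matrix, rng):
    """Drop one randomly chosen column; the rest appear once, in order."""
    n = len(matrix[0])
    left_out = rng.randrange(n)
    cols = [c for c in range(n) if c != left_out]
    return [[row[c] for c in cols] for row in matrix], cols

# Rows are Gene1..Gene3, columns are Expt 1..Expt 6
matrix = [
    [-1.2, -0.5, 0.0, 0.25, 0.75, 1.4],
    [0.2, -0.5, 1.2, -0.25, -1.0, 1.5],
    [1.2, 0.5, 0.0, -0.25, -0.75, -1.4],
]
rng = random.Random(0)   # fixed seed only so the sketch is repeatable
boot, boot_cols = bootstrap_columns(matrix, rng)
jack, jack_cols = jackknife_columns(matrix, rng)
```

Each resampled matrix would then be clustered in the same way as the original, and the resulting trees compared with the original tree.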
Analysis of bootstrapped and jackknifed support trees
• Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure.
• Each time, hierarchical trees are created from the resampled matrices.
• The trees are compared to the tree obtained from the original data set.
• The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster.
• As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
K-Means/Medians Clustering – 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
[Figure: genes G1 to G13 assigned at random to the clusters.]
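The two initialization steps can be sketched as follows, using the thirteen gene labels from the figure. The function name, the fixed random seed, and the uniform random choice of cluster are assumptions for illustration; the later iterations of the algorithm are not shown.

```python
import random

def random_assignment(genes, k, rng):
    """Step 2: give every gene a random cluster label in 0..k-1."""
    return {gene: rng.randrange(k) for gene in genes}

k = 5                                        # step 1: number of clusters
genes = [f"G{i}" for i in range(1, 14)]      # G1..G13
rng = random.Random(0)                       # fixed seed only for repeatability
assignment = random_assignment(genes, k, rng)

# Group the genes by their assigned cluster label
clusters = {label: [] for label in range(k)}
for gene, label in assignment.items():
    clusters[label].append(gene)
```

With a purely random assignment some clusters can start out empty, so a practical implementation would need a rule for that case.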