Use of the Half-Normal Probability Plot to Identify Significant Effects for Microarray Data C. F. Jeff Wu University of Michigan (joint work with G. Dyson)
Outline • Current Methods • Proposed Methodology • Analysis Plan • Example • Conclusions
What are microarrays? • Two major types • Oligonucleotide gene chips • Spotted glass arrays • Perfect match (PM) and mismatch (MM) probes are spotted onto a gene chip • ~20 probes make up a probe set (or gene) • MM probe for each gene has the middle base set to the complement of its PM probe • Hybridize labeled RNA corresponding to PM probes • Glass arrays involve the competitive hybridization of two RNA pools to cDNA spotted onto a glass slide • Typically thousands of genes on a slide
Multiplicity Problem • When we make more than one comparison in a hypothesis testing situation, the usual interpretation of each p-value breaks down • Control of the familywise error rate is necessary in order to preserve the nominal type I error rate • Various approaches correct the chance of making a type I error for multiplicity, including Tukey, Bonferroni and Holm
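As a small illustration of the corrections named on this slide, the sketch below computes Bonferroni and Holm adjusted p-values. The function names and the example p-values are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Holm step-down adjustment, returned in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The (rank+1)-th smallest p-value is multiplied by (m - rank);
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


# Illustrative p-values only.
example_pvals = [0.01, 0.04, 0.03, 0.005]
bonf_adj = bonferroni(example_pvals)
holm_adj = holm(example_pvals)
```

Holm never gives a larger adjusted p-value than Bonferroni for the same input, which is why it is usually preferred when both apply.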
Microarray Analysis Techniques • Westfall Young step down (WY) • Significance Analysis of Microarrays (SAM) • Empirical Bayes (EB) • Bayesian (MCMC) • Mixture Modeling • Dimension reduction techniques • Machine learning
Westfall Young (WY) • Compute ranks r_j of the original test statistics such that • Construct b balanced permutations of the samples, computing the same test statistic as above for each b • Compute • Repeat B times and calculate the adjusted p-value as • Less conservative than Bonferroni
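The formulas on this slide did not survive, so the following is only a sketch of the step-down maxT procedure as it is commonly described in the multiple-testing literature; it is an assumption that the slide followed the same form. The function name is hypothetical, and the permutation statistics are passed in precomputed so that any permutation scheme (balanced or not) can be used.

```python
def wy_stepdown_maxt(observed, permuted):
    """Step-down maxT adjusted p-values (sketch).

    observed: list of m test statistics from the real labels.
    permuted: list of B lists, each holding the m statistics
              recomputed under one permutation of the labels.
    """
    m = len(observed)
    B = len(permuted)
    # r[0] is the gene with the largest |t|, r[m-1] the smallest.
    r = sorted(range(m), key=lambda j: abs(observed[j]), reverse=True)
    counts = [0] * m
    for perm in permuted:
        u = 0.0
        # Successive maxima, built from the least significant gene upward.
        for pos in range(m - 1, -1, -1):
            u = max(u, abs(perm[r[pos]]))
            if u >= abs(observed[r[pos]]):
                counts[pos] += 1
    adjusted = [0.0] * m
    running = 0.0
    for pos in range(m):
        # Enforce monotonicity of the adjusted p-values along the ordering.
        running = max(running, counts[pos] / B)
        adjusted[r[pos]] = running
    return adjusted


# Tiny made-up example: 3 genes, 5 permutations.
obs = [5.0, 1.0, 3.0]
perms = [
    [1.0, 0.5, 2.0],
    [0.2, 4.0, 0.1],
    [6.0, 0.3, 0.4],
    [0.1, 0.2, 0.3],
    [0.1, 1.5, 0.2],
]
wy_adj = wy_stepdown_maxt(obs, perms)
```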
Significance Analysis of Microarrays (SAM) • Use a t-like statistic • Use balanced permutation method from previous slide to estimate null distribution, assuming all effects are null • Call genes that fall outside the Δ bars significant
Analysis Plan • Robust measures of location and scale • Summary statistic • Two half-normal plots (for upward-regulated and downward-regulated genes) • Segment determination • Find • insignificant, borderline, significant • Repeat the procedure, using as base
Robust Measures of Location and Scale • Perform transformation and suitable normalization • Compute median and Median Absolute Deviation (MAD) for each gene • Reasonable estimates • Less affected by outliers than mean and SD • Interested in robustness rather than efficiency
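A minimal sketch of the two robust summaries for one gene in one condition follows. The replicate values are invented for illustration, and the MAD is left unscaled; multiplying by about 1.4826 to match the standard deviation under normality is a common option that the slides do not mention.

```python
from statistics import mean, median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|, unscaled."""
    center = median(values)
    return median(abs(x - center) for x in values)


# Invented log-expression replicates for one gene; the last value is an outlier.
replicates = [2.0, 2.1, 1.9, 2.0, 9.0]

robust_location = median(replicates)   # barely moved by the outlier
robust_scale = mad(replicates)
classical_location = mean(replicates)  # pulled toward the outlier
```

Here the median stays at the bulk of the data while the mean is dragged upward by the single outlying replicate.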
Summary Statistic • Compute quasi two-sample t-statistic using robust values from above: • c is chosen to minimize for the middle 100*(1-2e)% of the summary statistics. • Tusher et al. (2001) chose c to minimize the coefficient of variation • Efron et al. (2001) used the 90th percentile of the gene standard error estimates for c
Two Half-Normal Plots • Construct two half-normal plots: one for the p positive and one for the r negative summary statistics • Run the procedure separately on each set • Denote the ordered p positive effects by • Plot abss_i against half-normal distribution quantiles, i.e. the points • Goal: obtain set of noise effects • Yield a baseline against which to test the rest of the effects
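The coordinates of a half-normal plot can be sketched as follows. The slide's own formula for the plotted points is missing, so the plotting position Φ⁻¹(0.5 + 0.5(i − 0.5)/n) used here is an assumption; it is a common convention for half-normal plots. The function name and the example effects are hypothetical.

```python
from statistics import NormalDist


def half_normal_points(effects):
    """Return (quantile, |effect|) pairs for a half-normal plot.

    Assumes the plotting position Phi^{-1}(0.5 + 0.5 * (i - 0.5) / n)
    for the i-th smallest absolute effect, i = 1..n.
    """
    abs_sorted = sorted(abs(e) for e in effects)
    n = len(abs_sorted)
    std_normal = NormalDist()
    quantiles = [
        std_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / n) for i in range(1, n + 1)
    ]
    return list(zip(quantiles, abs_sorted))


# Invented effects; in the talk's setting these would be the positive
# (or, separately, the negative) summary statistics.
points = half_normal_points([0.3, -1.2, 0.7])
```

If the effects are pure noise, the points fall roughly on a straight line through the origin; real effects bend away from that line at the upper end.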
Segment Determination: • Given b, initialize null set as points abss_1, ..., abss_k • Regress null set on the first k half-normal quantiles (Q_1, ..., Q_k) • Produce predicted values at the remaining quantile values (Q_h, h > k) • Compute predicted statistics with • Find
Segment Determination: (cont) • The initial null set of k genes becomes k + m (= ) null genes • Now re-do the segment determination procedure, using the k + m genes as base null set • Continue until no new genes are added • Do for each k less than p-1 • Store the end point • Set the most frequent to
Sample • Let k = 200, total effects = 500 • First 200 ordered positive effects regressed on first 200 half-normal quantiles • Test ordered effects 201 to 500 using absolute value of predicted statistics • For example, effect 239 is the largest h less than the t-critical value • So the end point would initially be 239 • Redo the above, with k = 239 effects; so we test effects 240 to 500 • Say statistic 242 is the largest h less than t-critical value based on new regression line • So the new end point would be 242 • Redo the above again with k = 242, test effects 243 to 500 • No statistics are less than t critical value • So the end point is 242
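The iteration in this example can be sketched in code. Several pieces are assumptions, because the slide formulas are missing: the predicted statistic is taken to be the residual from the fitted line divided by the usual standard error of a new prediction, the plotting positions are Φ⁻¹(0.5 + 0.5(i − 0.5)/n), and the critical value `crit` is supplied by the caller rather than computed from a t distribution. All function names and the synthetic data are hypothetical.

```python
from statistics import NormalDist, mean


def half_normal_quantiles(n):
    """Assumed plotting positions for n ordered absolute effects."""
    nd = NormalDist()
    return [nd.inv_cdf(0.5 + 0.5 * (i - 0.5) / n) for i in range(1, n + 1)]


def fit_line(x, y):
    """Least-squares line; returns intercept, slope, residual SD, mean(x), Sxx."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = (rss / (len(x) - 2)) ** 0.5
    return intercept, slope, s, xbar, sxx


def null_set_end_point(abss, k, crit):
    """Grow the null set from the first k ordered effects until it stops changing.

    abss must be sorted in increasing order and k must be at least 3.
    """
    n = len(abss)
    q = half_normal_quantiles(n)
    while True:
        a, b, s, xbar, sxx = fit_line(q[:k], abss[:k])
        new_k = k
        for h in range(k, n):  # 0-based index h is ordered effect h + 1
            se = s * (1 + 1 / k + (q[h] - xbar) ** 2 / sxx) ** 0.5
            stat = (abss[h] - (a + b * q[h])) / se  # assumed predicted statistic
            if abs(stat) < crit:
                new_k = h + 1  # largest effect still consistent with the line
        if new_k == k:
            return k
        k = new_k


# Synthetic data: 40 effects close to the half-normal line, then 10 shifted upward.
q50 = half_normal_quantiles(50)
abss = [q50[i] + 0.005 * (-1) ** i for i in range(40)] + [q50[i] + 5.0 for i in range(40, 50)]
end_point = null_set_end_point(abss, k=20, crit=3.0)
```

On the synthetic data the null set grows from 20 to 40 effects in the first pass and then stops, mirroring the 200 → 239 → 242 progression in the slide's example.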
Find • Will test all effects after using same statistics • To adjust for multiple testing, define NC as the number of consecutive significant effects necessary to call all subsequent effects significant • Use the Bonferroni adjustment (does not require independence): • Instead of doing thousands of comparisons, only need to do NC to determine significance • Define • Now we have identified the change points in the graph for segment detection
Error Rate Estimation: FDR • False Discovery Rate (FDR) is the expected proportion of falsely rejected hypotheses among all rejected hypotheses • Permute the condition labels, maintaining balance • Example: 8 replicates in conditions A and B • Each A’ and B’ will have 4 replicates from A and 4 from B • Compute the robust statistics, keeping the same c from the actual data • Determine the average number of effects that fall above the positive or below the negative boundary of the significant sets • Divide that number by the total number of effects called significant
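The balanced relabeling and the final ratio can be sketched as below. The sample names, the function names, and the permutation counts passed to `estimate_fdr` are hypothetical; computing the robust statistics on each relabeling is left out, since it depends on formulas not shown in the slides.

```python
from itertools import combinations


def balanced_relabelings(group_a, group_b):
    """Yield (A', B') where A' takes half of its members from A and half from B."""
    half_a, half_b = len(group_a) // 2, len(group_b) // 2
    everyone = list(group_a) + list(group_b)
    for keep_a in combinations(group_a, half_a):
        for take_b in combinations(group_b, half_b):
            new_a = list(keep_a) + list(take_b)
            new_b = [s for s in everyone if s not in new_a]
            yield new_a, new_b


def estimate_fdr(false_counts, n_called):
    """Average number of permuted effects beyond the boundaries,
    divided by the number of effects called significant in the real data."""
    if n_called == 0:
        return 0.0
    return (sum(false_counts) / len(false_counts)) / n_called


# 8 replicates per condition, as in the slide's example.
A = [f"A{i}" for i in range(1, 9)]
B = [f"B{i}" for i in range(1, 9)]
relabelings = list(balanced_relabelings(A, B))
first_a, first_b = relabelings[0]

# Invented counts from three permutations, with 40 effects called significant.
fdr_hat = estimate_fdr([3, 5, 4], 40)
```

With 8 replicates per condition there are C(8,4) × C(8,4) = 4900 balanced relabelings, so in practice a random subset of them would usually be drawn.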
Speed Data: Analysis and Comparison • WY found 8 genes significant, with Type I error = 0.05
Lemon Data: Analysis and Comparison • WY found 253 genes significant, with Type I error = 0.05
Conclusions • Proposed a new method for determining differential expression in genes • Dealt with the multiplicity problem by using only a small subset of genes • Can extend to other large data sets • Allow scientists to play a role in sequential decision making • Incorporate a priori knowledge of experiment with selection of c