Test of significance for small samples Javier Cabrera

Presentation Transcript


  1. Test of significance for small samples Javier Cabrera

  2. Outline

  3. Differential Expression for small samples

              C1     C2     C3     T1     T2     T3
     G1     4.67   4.44   4.42   4.73   4.85   4.69
     G2     3.13   2.54   1.96   0.97   2.38   3.36
     G3     6.22   6.77   5.32   6.40   6.94   6.87
     G4    10.74  10.81  10.69  10.75  10.68  10.68
     G5     3.76   4.16   5.27   3.05   3.20   2.85
     G6     6.95   6.78   6.33   6.81   6.95   7.01
     G7     4.98   4.61   4.56   4.57   4.90   4.44
     G8     2.72   3.30   3.24   3.22   3.42   3.22
     G9     5.29   4.79   5.13   3.31   4.67   5.27
     G10    5.12   4.85   3.79   4.13   3.12   4.79
     G11    4.67   3.50   4.77   4.09   3.86   2.88
     G12    6.22   6.42   5.02   6.38   6.54   6.80
     G13    2.88   3.76   2.78   2.98   4.81   4.15
     .......

     • Preprocessed data.
     • Perform a t-test for each gene.
     • Select the most significant subset.

  4. The pooled variances T-test
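
The formula on this slide did not survive in the transcript. As a minimal sketch, assuming the standard two-sample pooled-variance t statistic and a "treated minus control" sign convention (both assumptions, not taken from the slide), the per-gene computation could look like this. The function name `pooled_t` is hypothetical; the data are the first three genes listed on slide 3.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n1, n2 = len(control), len(treated)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
    sp = sqrt(sp2)
    # Assumed sign convention: treated mean minus control mean.
    t = (mean(treated) - mean(control)) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, sp

# First three genes from slide 3 (C1-C3 as control, T1-T3 as treated).
genes = {
    "G1": ([4.67, 4.44, 4.42], [4.73, 4.85, 4.69]),
    "G2": ([3.13, 2.54, 1.96], [0.97, 2.38, 3.36]),
    "G3": ([6.22, 6.77, 5.32], [6.40, 6.94, 6.87]),
}
for name, (ctrl, trt) in genes.items():
    t_stat, sp = pooled_t(ctrl, trt)
    print(f"{name}: t = {t_stat:.2f}, s_p = {sp:.2f}")
```

Returning s_p alongside t is a convenience for the later slides, which plot one against the other.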

  5. Plot t vs $s_p$
     • Only genes that have small $s_p$ are called differentially expressed.
     • Moderately and highly expressed genes are unlikely to have small $s_p$, so they will not be picked up.
     • Most genes that are picked up are low expressers.
     [Figure annotation: 300]

  6. Is this effect statistical or biological? This graph was generated using IID normal samples
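
The point of this slide can be checked with a small simulation. The sketch below is not the author's code: it draws hypothetical null data (5000 genes, 3 vs 3 arrays, IID N(0, 1), all numbers chosen for illustration), computes the pooled-variance t and $s_p$ for each gene, and compares the median $s_p$ of the 100 genes with the largest |t| against the median over all genes.

```python
import random
from math import sqrt
from statistics import mean, median, variance

def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
    sp = sqrt(sp2)
    return (mean(treated) - mean(control)) / (sp * sqrt(1 / n1 + 1 / n2)), sp

random.seed(0)
N_GENES, N_PER_GROUP, N_TOP = 5000, 3, 100  # hypothetical sizes

results = []
for _ in range(N_GENES):
    # No gene is truly differentially expressed: all six values are IID N(0, 1).
    x = [random.gauss(0, 1) for _ in range(2 * N_PER_GROUP)]
    results.append(pooled_t(x[:N_PER_GROUP], x[N_PER_GROUP:]))

# Rank genes by |t| and compare s_p among the "selected" genes with s_p overall.
results.sort(key=lambda ts: abs(ts[0]), reverse=True)
sp_top = median(sp for _, sp in results[:N_TOP])
sp_all = median(sp for _, sp in results)
print(f"median s_p, top {N_TOP} by |t|: {sp_top:.3f}")
print(f"median s_p, all genes:        {sp_all:.3f}")
```

Because the data contain no biology at all, any tendency of the selected genes to have small $s_p$ here is purely statistical.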

  7. Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes
     Differentially expressed genes have small $s_p$.
     [Figure panel labels: 300, 21983]

  8. The effect of small sample size
     Often the sample size per group is small.
     • unreliable variances (inferences)
     • dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
     • borrow strength across genes (LPE/EB)
     • regularize the test statistics (SAM)
     • work with $t_g \mid s_g$ (Conditional t)

  9. SAM: Significance Analysis of Microarrays, Tibshirani (2001)
     1. Determine c.
     2. Obtain significant genes by doing a simulation and use the False Discovery Rate (FDR) to find $\Delta$.
     3. Significant genes.

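  10. Determining c
      Start with the pairs $\{r_g, s_g\}$.
      Let $s^{\alpha}$ be the $\alpha$th percentile of the $\{s_g\}$ values and let $T_g(s^{\alpha}) = r_g / (s_g + s^{\alpha})$.
      Compute the percentiles, $q_1, q_2, \dots, q_{100}$, of the $s_g$ values.
      For $\alpha \in \{0, 5, 10, \dots, 100\}$, compute $v_j(\alpha) = \mathrm{mad}\{T_g(s^{\alpha}) : s_g \in [q_j, q_{j+1})\}$, $j = 1, 2, \dots, n$.
      Compute $cv(\alpha)$, the coefficient of variation of the $\{v_j(\alpha)\}$ values.
      Choose $\hat{\alpha}$ as the value of $\alpha$ that minimizes $cv(\alpha)$.
      Fix $c$ as the value $s^{\hat{\alpha}}$.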
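
The steps on this slide can be sketched in code. The version below is a rough illustration under stated assumptions, not the SAM implementation: the regularized statistic is taken to be r_g / (s_g + s^alpha), the percentile intervals are approximated by equal-count bins on the rank of s_g, and the demo data are hypothetical null genes. The names `choose_c`, `mad` and `percentile` are made up for this example.

```python
import random
from math import sqrt
from statistics import mean, median, stdev, variance

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)

def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of a pre-sorted list."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def choose_c(r, s, n_bins=100, alphas=range(0, 101, 5)):
    """Return (alpha_hat, c): the alpha minimising cv(alpha), and c = s^alpha_hat."""
    s_sorted = sorted(s)
    # Equal-count bins by rank of s_g (a stand-in for the intervals [q_j, q_j+1)).
    order = sorted(range(len(s)), key=lambda g: s[g])
    bins = [order[len(s) * j // n_bins: len(s) * (j + 1) // n_bins]
            for j in range(n_bins)]
    best = None
    for alpha in alphas:
        s_alpha = percentile(s_sorted, alpha)
        # v_j(alpha): spread of the regularized statistic within each s_g bin.
        v = [mad([r[g] / (s[g] + s_alpha) for g in b]) for b in bins]
        cv = stdev(v) / mean(v)
        if best is None or cv < best[0]:
            best = (cv, alpha, s_alpha)
    return best[1], best[2]

# Hypothetical demo data: 2000 null genes, 3 vs 3 arrays, IID N(0, 1).
random.seed(1)
r, s = [], []
for _ in range(2000):
    a = [random.gauss(0, 1) for _ in range(3)]
    b = [random.gauss(0, 1) for _ in range(3)]
    r.append(mean(b) - mean(a))                               # numerator r_g
    s.append(sqrt((variance(a) + variance(b)) / 2 * (2 / 3)))  # standard error s_g

alpha_hat, c = choose_c(r, s, n_bins=20)
print(f"alpha_hat = {alpha_hat}, c = {c:.3f}")
```

The idea behind minimising the coefficient of variation is that a good c makes the spread of the statistic roughly the same across the whole range of s_g.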

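  11. Determining c
      [Figure: $T_g$ plotted against $s_g$ for each $\alpha$; $v_1(\alpha) = \mathrm{mad}\{T_g\}$, $v_2(\alpha)$, …, $v_7(\alpha)$ are computed within successive $s_g$ intervals, and the $\alpha$ with minimum $cv(\alpha)$ is selected.]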

  12. Simulation and use of the False Discovery Rate (FDR) to find $\Delta$
      • For each gene, B permutations are generated. For each permutation:
      • Expected order statistic:
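
The formulas on this slide are missing from the transcript. As a rough sketch of the general idea, assuming the usual SAM construction (recompute the regularized statistic under each relabeling of the arrays, sort the statistics, and average each rank over relabelings), the expected order statistics could be computed as below. Instead of B random permutations, this example enumerates all distinct 3 vs 3 labelings. The value `C = 0.1` and the function names are hypothetical; the data are genes G1 to G5 from slide 3.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def sam_stat(x, idx1, c):
    """Regularized statistic r / (s + c); idx1 holds the array indices used as group 1."""
    g1 = [x[i] for i in idx1]
    g2 = [x[i] for i in range(len(x)) if i not in idx1]
    n1, n2 = len(g1), len(g2)
    sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    s = sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(g2) - mean(g1)) / (s + c)

def expected_order_stats(data, n1, c):
    """Average the k-th smallest statistic over all distinct group labelings."""
    labelings = list(combinations(range(len(data[0])), n1))
    totals = [0.0] * len(data)
    for idx1 in labelings:
        perm_stats = sorted(sam_stat(x, idx1, c) for x in data)
        for k, d in enumerate(perm_stats):
            totals[k] += d
    return [t / len(labelings) for t in totals]

data = [
    [4.67, 4.44, 4.42, 4.73, 4.85, 4.69],        # G1
    [3.13, 2.54, 1.96, 0.97, 2.38, 3.36],        # G2
    [6.22, 6.77, 5.32, 6.40, 6.94, 6.87],        # G3
    [10.74, 10.81, 10.69, 10.75, 10.68, 10.68],  # G4
    [3.76, 4.16, 5.27, 3.05, 3.20, 2.85],        # G5
]
C = 0.1  # hypothetical regularization constant

observed = sorted(sam_stat(x, (0, 1, 2), C) for x in data)
expected = expected_order_stats(data, 3, C)
for obs, exp in zip(observed, expected):
    print(f"observed {obs:6.2f}   expected {exp:6.2f}   difference {obs - exp:6.2f}")
```

In SAM, genes whose observed ordered statistic departs from the expected one by more than the threshold are the candidates for being called significant; the exact calling rule and the FDR estimate are not reproduced here.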

  13. SAM: The t statistics and $\Delta$

  14. SAM output table

  15. Interpreting the SAM table
      (1) Choose a value of the FDR (say 5% or 1%) and use the corresponding value of $\Delta$. In our example, suppose we choose FDR (90%) = 1%; this corresponds to $\Delta$ = 1.5.
      (2) Some scientists find the choice of FDR a hard one to make and are more comfortable with a more 'classical' strategy of choosing the $\Delta$ that corresponds to a fixed proportion of false positives, say 0.01. This method would produce $\Delta$ = 1.1.
      (3) A third strategy would be to start with strategy (2) and then check the FDR. If the FDR is too high, we may increase $\Delta$ as long as (i) there is an important reduction of the FDR and (ii) the number of called genes does not decrease substantially. In our example we may argue that $\Delta$ = 1.1 corresponds to an FDR of 4.5%, which may be good enough.

  16. Concerns about SAM
      • Permutations of 6?
      • c is just a first-order correction.
      [Figure panel labels: $\Delta$ = 0.70, $\Delta$ = 1.05, $\Delta$ = 1.33]
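
One plausible reading of the "Permutations of 6?" concern (an interpretation, not stated on the slide) is that six arrays split 3 vs 3 allow very few distinct relabelings, so a per-gene permutation distribution is extremely coarse. A quick count, assuming the 3 vs 3 design of slide 3:

```python
from math import comb

n_control, n_treated = 3, 3

# Number of distinct ways to choose which arrays are labelled "control".
labelings = comb(n_control + n_treated, n_control)

# Swapping the two group labels only flips the sign of t, so for a two-sided
# test the observed labeling and its mirror image always tie. The smallest
# attainable per-gene permutation p-value is therefore 2 / labelings.
min_two_sided_p = 2 / labelings

print(labelings)        # 20
print(min_two_sided_p)  # 0.1
```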

  17. Conditional t: Basic Model
      Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.
      Model: $X_{gij} = \mu_{gj} + \sigma_g \varepsilon_{gij}$
      Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
      Error model: $\varepsilon_{gij} \sim F(\text{location} = 0, \text{scale} = 1)$
      Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$ with marginals $\mu_{g1} \sim F_{\mu}$ and $\sigma_g^2 \sim F_{\sigma}$

  18. Possible approaches
      Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.
      Nonparametric:

  19. Procedure

  20. Procedure (cont.)

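  21. Roadblock
      Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_{\sigma}$, and let $s^2$ be the variance obtained from the $\{X_{ij}\}$. Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
      For example, if we assume that $F_{\sigma} = \chi^2_3$, $n = 4$ and $\varepsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.
      Fix by target estimation.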
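
The inequality on this slide follows from the law of total variance. A short derivation, assuming only that the sample variance is conditionally unbiased for the true variance, i.e. $\mathrm{E}[s^2 \mid \sigma^2] = \sigma^2$ (an assumption not spelled out on the slide):

```latex
\begin{aligned}
\operatorname{Var}(s^2)
  &= \operatorname{Var}\!\big(\mathrm{E}[s^2 \mid \sigma^2]\big)
   + \mathrm{E}\!\big[\operatorname{Var}(s^2 \mid \sigma^2)\big] \\
  &= \operatorname{Var}(\sigma^2)
   + \mathrm{E}\!\big[\operatorname{Var}(s^2 \mid \sigma^2)\big] \\
  &> \operatorname{Var}(\sigma^2).
\end{aligned}
```

The last step is strict whenever the sample variance still has sampling noise given the true variance, which is always the case for a finite sample. The observed variances are therefore more spread out than the true ones, so their empirical distribution cannot be used directly as an estimate of the distribution of the true variances.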

  22. Example: Checking the distribution of $\sigma_g$
      Mice data. Compare the distribution of $s_g$ vs. simulation with:
      1. Df = 0.5
      2. Df = 2
      3. Df = 6

  23. Another Example
      Compare the distribution of $s_g$ vs. simulation with: Df = 0.5, Df = 3, Df = 6.

  24. Fixing the variance distribution

  25. Fixing the variance distribution (contd) Proceed as before …

  26. Plot t vs $s_p$
      Differentially expressed genes may have large $s_p$.
      [Figure annotation: 130]

  27. Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes selected by Conditional t (CT)
      Differentially expressed genes may have large $s_p$.

  28. Generating p-values

  29. Extensions
      • F test:
        - Condition on the sqrt(MSE).
      • Multiple comparisons:
        - Tukey, Dunnett, Bump.
        - Condition on the sqrt(MSE).
      • Gene Ontology:
        - Test for the significance of groups.
        - Use Hypergeometric Statistic, mean t, mean p-value, or other.
        - Condition on log of the number of genes per group.

  30. Conditional F

  31. Target Estimation
      • Target Estimation: Cabrera, Fernholz (1999)
        - Bias reduction.
        - MSE reduction.
      • Recent applications:
        - Ellipse estimation (multivariate target).
        - Logistic regression: Cabrera, Fernholz, Devas (2003); Patel (2003), Target Conditional MLE (TCMLE).
      • Implementation in StatXact (CYTEL) and LogXact procedures in SAS (by CYTEL).

  32. g( ) E (T) T(x1,x2,…,xn) E (T) =  Target Estimation

  33. Target Estimation: Algorithms
      - Stochastic approximation.
      - Simulation and iteration.
      - Exact algorithm for TCMLE.

  34. Gene Ontology (GO): Conditioning on log(n)
      [Figure: Abs(T) plotted against Log(n)]
