
Test of significance for small samples. Javier Cabrera, Director, Biostatistics Institute, Rutgers University; Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development



Presentation Transcript


  1. Test of significance for small samples. Javier Cabrera, Director, Biostatistics Institute, Rutgers University; Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development

  2. Outline
     • Microarray experiments and differential expression
     • Small sample size issues
     • Conditional t approach
     • Comparison with other methods
     • Extensions
     • Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, 2004.
     • Software: DNAMR and DNAMRweb, http://www.rci.rutgers.edu/~cabrera/DNAMR

  3. Genes
     A gene is a segment of DNA whose sequence of bases (nucleotides) codes for a specific protein.
     AKAP6: CATCATGCAGCAGGTCAAACAAGG CATCTCCTAGTATTGCATCCTACA……
     The central dogma of molecular biology: a gene is expressed via the process DNA → mRNA → protein (transcription, then translation); DNA itself is copied by replication.

  4. Microarray experiment
     • Microarray: cDNA or oligonucleotide preparation → print or synthesize on a glass slide. 5k-50k genes are arrayed in a rectangular grid, one spot per gene.
     • Sample: biological sample → mRNA → reverse transcribe and label.
     • Microarray + sample → hybridize, wash and scan → image → quantify spot intensities → gene expression data.

  5. Differential gene expression
     • An organism's genome is the complete set of genes in each of its cells. Every cell of a given organism has a copy of the exact same genome, but
       - different cells express different genes
       - different genes are expressed under different conditions
       - differential gene expression leads to altered cell states

  6. Differential expression for small samples
     Gene      C1     C2     C3     T1     T2     T3
     G1      4.67   4.44   4.42   4.73   4.85   4.69
     G2      3.13   2.54   1.96   0.97   2.38   3.36
     G3      6.22   6.77   5.32   6.40   6.94   6.87
     G4     10.74  10.81  10.69  10.75  10.68  10.68
     G5      3.76   4.16   5.27   3.05   3.20   2.85
     G6      6.95   6.78   6.33   6.81   6.95   7.01
     G7      4.98   4.61   4.56   4.57   4.90   4.44
     G8      2.72   3.30   3.24   3.22   3.42   3.22
     G9      5.29   4.79   5.13   3.31   4.67   5.27
     G10     5.12   4.85   3.79   4.13   3.12   4.79
     G11     4.67   3.50   4.77   4.09   3.86   2.88
     G12     6.22   6.42   5.02   6.38   6.54   6.80
     G13     2.88   3.76   2.78   2.98   4.81   4.15
     .......
     (C1-C3: controls; T1-T3: treatments)
     • Preprocessed data.
     • Perform a t-test for each gene.
     • Select the most significant subset.
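The per-gene workflow on this slide can be sketched as follows. This is a minimal illustration, assuming the first three columns are controls and the last three are treatments, and using only the first five genes listed on the slide; the helper name `pooled_t` is hypothetical, and genes are ranked by |t| rather than by p-value to keep the example free of external libraries.

```python
import math
from statistics import mean, variance

# First five genes from the slide: three controls followed by three treatments.
data = {
    "G1": [4.67, 4.44, 4.42, 4.73, 4.85, 4.69],
    "G2": [3.13, 2.54, 1.96, 0.97, 2.38, 3.36],
    "G3": [6.22, 6.77, 5.32, 6.40, 6.94, 6.87],
    "G4": [10.74, 10.81, 10.69, 10.75, 10.68, 10.68],
    "G5": [3.76, 4.16, 5.27, 3.05, 3.20, 2.85],
}


def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    t = (mean(treated) - mean(control)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, sp


# One test per gene, then rank by the absolute t statistic.
results = {gene: pooled_t(x[:3], x[3:]) for gene, x in data.items()}
ranked = sorted(results, key=lambda gene: abs(results[gene][0]), reverse=True)

for gene in ranked:
    t, sp = results[gene]
    print(f"{gene}: t = {t:+.2f}, s_p = {sp:.3f}")
```

With only three arrays per group, each test has 4 degrees of freedom, so the ranking depends heavily on how small each gene's pooled standard deviation happens to be.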

  7. The pooled variances T-test
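For reference, the pooled-variance t statistic in its standard textbook form is written out below. The notation is an assumption chosen to match the later slides: $\bar{X}_{g\cdot j}$ is the mean of gene $g$ in group $j$, $s_{gj}^2$ the sample variance of gene $g$ in group $j$, and $n_1$, $n_2$ the group sizes.

```latex
t_g = \frac{\bar{X}_{g\cdot 2} - \bar{X}_{g\cdot 1}}
           {s_{p,g}\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_{p,g}^2 = \frac{(n_1 - 1)\,s_{g1}^2 + (n_2 - 1)\,s_{g2}^2}{n_1 + n_2 - 2}
```

Under normal errors and no differential expression, $t_g$ follows a Student t distribution with $n_1 + n_2 - 2$ degrees of freedom.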

  8. Random data
     [Figure: plot of $t$ vs $s_p$ and distribution of $s_p$; the counts 300 and 21983 are marked on the plot]
     Differentially expressed genes have smaller $s_p$. Is this effect statistical or biological?
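One way to see that the effect can be purely statistical is to simulate genes with no differential expression at all and look at the pooled standard deviation of those with the largest |t|. The sketch below is an illustration under assumed settings (5000 null genes, 3 arrays per group, unit variance), not a reproduction of the slide's plot.

```python
import math
import random
from statistics import mean, variance


def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    t = (mean(treated) - mean(control)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, sp


def simulate_null(n_genes=5000, n_per_group=3, seed=1):
    """Simulate (t, s_p) pairs for genes with no true difference."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_genes):
        x = [rng.gauss(0.0, 1.0) for _ in range(2 * n_per_group)]
        pairs.append(pooled_t(x[:n_per_group], x[n_per_group:]))
    return pairs


pairs = simulate_null()
pairs.sort(key=lambda p: abs(p[0]), reverse=True)
top = pairs[: len(pairs) // 20]  # the 5% of genes with the largest |t|

mean_sp_top = mean(sp for _, sp in top)
mean_sp_all = mean(sp for _, sp in pairs)
print(f"mean s_p, top 5% by |t|: {mean_sp_top:.3f}")
print(f"mean s_p, all genes:     {mean_sp_all:.3f}")
```

Every simulated gene has the same true variance, so any gap between the two means comes only from sampling noise in $s_p$.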

  9. 500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$). 100 genes are differentially expressed with mean difference = +1 or -1.
     $\sigma^2 = 1$ constant, $\alpha = 0.05$:
               False discoveries   True discoveries
     T-test           44                  22
     z-test           43                  29
     $\sigma^2$ from Chi-square(df=3), $\alpha = 0.05$:
               False discoveries   True discoveries
     T-test           43                  28
     z-test           53                  13
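A single replicate of this kind of simulation can be sketched as below. Several details are assumptions not stated on the slide: the z-test is taken to use one common variance for all genes (the average pooled variance), the differentially expressed genes are the first 100, and only one replicate is run instead of 500, so the counts will fluctuate around the slide's averages rather than match them.

```python
import math
import random
from statistics import mean, variance

T_CRIT = 2.447  # two-sided 5% critical value of Student's t with 6 df
Z_CRIT = 1.960  # two-sided 5% critical value of the standard normal


def simulate_once(chi_square_variance, n_genes=1000, n_de=100, n=4, seed=0):
    """Return {"t": [false, true], "z": [false, true]} discovery counts."""
    rng = random.Random(seed)
    diffs, sp2s, is_de = [], [], []
    for g in range(n_genes):
        # Chi-square(3) is a Gamma distribution with shape 1.5 and scale 2.
        sigma2 = rng.gammavariate(1.5, 2.0) if chi_square_variance else 1.0
        sigma = math.sqrt(sigma2)
        shift = rng.choice([-1.0, 1.0]) if g < n_de else 0.0
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treated = [rng.gauss(shift, sigma) for _ in range(n)]
        diffs.append(mean(treated) - mean(control))
        sp2s.append((variance(control) + variance(treated)) / 2)
        is_de.append(g < n_de)

    se_factor = math.sqrt(2 / n)
    common_sd = math.sqrt(mean(sp2s))  # assumed z-test scale, shared by all genes
    counts = {"t": [0, 0], "z": [0, 0]}
    for d, sp2, de in zip(diffs, sp2s, is_de):
        if abs(d) / (math.sqrt(sp2) * se_factor) > T_CRIT:
            counts["t"][int(de)] += 1
        if abs(d) / (common_sd * se_factor) > Z_CRIT:
            counts["z"][int(de)] += 1
    return counts


for label, flag in [("constant variance", False), ("chi-square(3) variance", True)]:
    c = simulate_once(flag)
    print(label)
    for test in ("t", "z"):
        print(f"  {test}-test: false = {c[test][0]}, true = {c[test][1]}")
```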

  10. The effect of small sample size
      Often the sample size per group is small.
      → unreliable variances (inferences)
      → dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
      Possible responses:
      • borrow strength across genes (LPE/EB)
      • regularize the test statistics (SAM)
      • work with $t_g \mid s_g$ (Conditional t)

  11. Analysis results
      Top 10 genes (sorted by t-test p-value)
      Gene      Fold   Dir   p          p(Bonf)
      G6546     2.36   D     0.000004   0.0964
      G19945    3.25   U     0.000005   0.1102
      G21586    1.64   U     0.000008   0.1765
      G18970    2.52   U     0.000019   0.4220
      G7432     3.70   D     0.000033   0.7248
      G19057    1.85   U     0.000046   1.0000
      G17361    4.34   D     0.000067   1.0000
      G8525     5.57   D     0.000067   1.0000
      G425     18.11   D     0.000078   1.0000
      G8524     4.74   D     0.000109   1.0000
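The p(Bonf) column is consistent with a Bonferroni adjustment, that is, the raw p-value multiplied by the number of genes tested and capped at 1. The sketch below assumes 22283 genes, a figure inferred from the two counts on slide 8 (300 + 21983) and not stated on this slide. Because the printed p-values are rounded to six decimals, the recomputed values only approximate the table.

```python
N_GENES = 22283  # assumption: 300 + 21983, the two counts shown on slide 8


def bonferroni(p, m=N_GENES):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p * m)


# Rounded raw p-values copied from the slide.
for gene, p in [("G6546", 0.000004), ("G7432", 0.000033), ("G19057", 0.000046)]:
    print(f"{gene}: p = {p:.6f}, Bonferroni = {bonferroni(p):.4f}")
```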

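  12. SAM: Determining c
      For each $a$: compute $v_1(a), \ldots, v_7(a)$, where $v_k(a) = \mathrm{mad}\{T_g\}$ over the genes in the $k$-th bin of $s_g$, and evaluate the $\mathrm{cv}$ of the $v_k(a)$.
      [Figure: $T_g$ plotted against $s_g$, starting at min $s_g$, divided into bins $v_1(a)$ to $v_7(a)$]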
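The selection of c can be sketched as follows, based on the general description of SAM in the literature and not on the slide's lost formulas. The regularized statistic is assumed to be $T_g(a) = d_g / (s_g + a)$, with $d_g$ the mean difference and $s_g$ its standard error; genes are split into 7 bins by $s_g$, the MAD of $T_g(a)$ is computed per bin, and c is the candidate $a$ that minimizes the coefficient of variation of those MADs. The function name `choose_c`, the candidate grid, and the simulated input are all hypothetical.

```python
import math
import random
from statistics import mean, median, stdev, variance


def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)


def choose_c(diffs, sds, n_bins=7, n_candidates=20):
    """Return (c, cv): the candidate a minimizing the CV of per-bin MADs."""
    order = sorted(range(len(sds)), key=lambda i: sds[i])
    size = len(order) // n_bins
    bins = [order[k * size:(k + 1) * size] for k in range(n_bins - 1)]
    bins.append(order[(n_bins - 1) * size:])

    # Candidates: zero plus evenly spaced quantiles of the s_g values.
    sorted_sds = [sds[i] for i in order]
    last = len(sorted_sds) - 1
    candidates = [0.0] + [sorted_sds[(j * last) // n_candidates]
                          for j in range(n_candidates + 1)]

    best_c, best_cv = 0.0, math.inf
    for a in candidates:
        v = [mad(diffs[i] / (sds[i] + a) for i in idx) for idx in bins]
        cv = stdev(v) / mean(v)
        if cv < best_cv:
            best_c, best_cv = a, cv
    return best_c, best_cv


# Hypothetical input: 700 null genes, 4 + 4 arrays, chi-square(3) variances.
rng = random.Random(0)
diffs, sds = [], []
for _ in range(700):
    sigma = math.sqrt(rng.gammavariate(1.5, 2.0))
    control = [rng.gauss(0.0, sigma) for _ in range(4)]
    treated = [rng.gauss(0.0, sigma) for _ in range(4)]
    sp2 = (variance(control) + variance(treated)) / 2
    diffs.append(mean(treated) - mean(control))
    sds.append(math.sqrt(sp2 * 2 / 4))

c_hat, cv_hat = choose_c(diffs, sds)
print(f"chosen c = {c_hat:.3f}, cv of bin MADs = {cv_hat:.3f}")
```

A larger c flattens the dependence of the statistic on $s_g$, which is the purpose of the regularization.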

  13. SAM: Gene selection D = Expected value of under permutations

  14. Conditional t: Basic model
      Let $X_{gij}$ denote the preprocessed intensity measurement for gene $g$ in array $i$ of group $j$.
      Model: $X_{gij} = \mu_{gj} + \sigma_g \epsilon_{gij}$
      Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
      Error model: $\epsilon_{gij} \sim F(\text{location} = 0, \text{scale} = 1)$
      Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$, with marginals $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$

  15. Possible approaches
      Parametric: assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or an Empirical Bayes procedure.
      Nonparametric:
      1. Estimate the distributions directly. For small samples the empirical estimate is not a good estimator of $F$; use the method of moments (= target estimation).
      2. Proceed via resampling and estimate the distribution of $t \mid s_p$ (Conditional t).
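The resampling idea in approach 2 can be sketched as follows. This is a simplified illustration under stated assumptions and is not the authors' procedure: null data are simulated with normal errors, each simulated gene's $\sigma$ is drawn from the observed $s_p$ values without any correction for their extra variability, and the conditional distribution of $t$ given $s_p$ is approximated by binning on $s_p$. All function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    t = (mean(treated) - mean(control)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, sp


def null_pairs(observed_sp, n, n_sim, rng):
    """Simulate (t, s_p) under no effect, resampling sigma from observed s_p."""
    pairs = []
    for _ in range(n_sim):
        sigma = rng.choice(observed_sp)
        x = [rng.gauss(0.0, sigma) for _ in range(2 * n)]
        pairs.append(pooled_t(x[:n], x[n:]))
    return pairs


def critical_envelope(pairs, alpha=0.05, n_bins=10):
    """Return [(upper s_p edge, critical |t|)] for each s_p bin."""
    pairs = sorted(pairs, key=lambda p: p[1])
    size = len(pairs) // n_bins
    envelope = []
    for k in range(n_bins):
        chunk = pairs[k * size:(k + 1) * size] if k < n_bins - 1 else pairs[k * size:]
        abs_t = sorted(abs(t) for t, _ in chunk)
        idx = min(len(abs_t) - 1, math.ceil((1 - alpha) * len(abs_t)) - 1)
        envelope.append((chunk[-1][1], abs_t[idx]))
    return envelope


def is_significant(t, sp, envelope):
    """Compare |t| with the critical value of the bin that contains s_p."""
    for edge, crit in envelope:
        if sp <= edge:
            return abs(t) > crit
    return abs(t) > envelope[-1][1]


rng = random.Random(0)
# Hypothetical stand-in for the observed per-gene pooled standard deviations.
observed_sp = [math.sqrt(rng.gammavariate(1.5, 2.0)) for _ in range(1000)]
envelope = critical_envelope(null_pairs(observed_sp, n=4, n_sim=5000, rng=rng))
for edge, crit in envelope:
    print(f"s_p <= {edge:.2f}: reject if |t| > {crit:.2f}")
```

The critical value is allowed to change with $s_p$, which is what distinguishes this from a single t cutoff applied to every gene.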

  16. Procedure

  17. Procedure (cont.)

  18. Roadblock
      Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_\sigma$, and let $s^2$ be the variance obtained from the $\{X_{ij}\}$. Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
      For example, if we assume that $F_\sigma = \chi^2_3$, $n = 4$ and $\epsilon \sim N(0, 1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.
      Fix by target estimation (method of moments): shrink towards the center.
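The inequality follows from the law of total variance: $\mathrm{Var}(s^2) = \mathrm{Var}(\sigma^2) + E[\mathrm{Var}(s^2 \mid \sigma^2)]$, and the second term is positive. The Monte Carlo check below assumes a single sample of $n = 4$ normal observations per gene and the unbiased sample variance. Under those assumptions the formula gives $6 + \tfrac{2}{3} \cdot 15 = 16$, slightly above the 15 quoted on the slide, which may rest on a different convention for $s^2$; the inequality itself holds either way.

```python
import random


def sample_variance(x):
    """Unbiased sample variance (divisor n - 1)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)


def simulate(n_draws=50_000, n=4, df=3, seed=0):
    """Return Monte Carlo estimates of (Var(sigma^2), Var(s^2))."""
    rng = random.Random(seed)
    sigma2s, s2s = [], []
    for _ in range(n_draws):
        # Chi-square(df) is a Gamma distribution with shape df/2 and scale 2.
        sigma2 = rng.gammavariate(df / 2, 2.0)
        sd = math_sqrt(sigma2)
        x = [rng.gauss(0.0, sd) for _ in range(n)]
        sigma2s.append(sigma2)
        s2s.append(sample_variance(x))
    return sample_variance(sigma2s), sample_variance(s2s)


def math_sqrt(value):
    """Square root helper kept local so the block has no extra imports."""
    return value ** 0.5


var_sigma2, var_s2 = simulate()
print(f"Var(sigma^2) ~ {var_sigma2:.2f}")
print(f"Var(s^2)     ~ {var_s2:.2f}")
```

Resampling from the observed $s^2$ values therefore overstates the spread of the true variances, which is the motivation for shrinking them towards the center.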

  19. Example: Checking the distribution of $\sigma_g$ (mice data)
      Compare the distribution of $s_g$ vs simulation with: 1. df = 0.5; 2. df = 2; 3. df = 6.
      [Figure: three panels, one per simulation setting (df = 0.5, df = 2, df = 6)]

  20. Another example
      Compare the distribution of $s_g$ vs simulation with: df = 0.5, df = 3, df = 6.
      [Figure: panels labelled df = 0.5, df = 3 and df = 6]

  21. Fixing the variance distribution

  22. Fixing the variance distribution (contd) Proceed as before …

  23. Plot $t$ vs $s_p$
      [Figure: plot of $t$ vs $s_p$; the counts 191 and 22092 are marked on the plot]
      Differentially expressed genes may have large $s_p$.

  24. 500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$). 100 genes are differentially expressed with mean difference = +1 or -1.
      $\sigma^2 = 1$ constant:
                False discoveries   True discoveries
      T-test           44                  22
      z-test           43                  29
      C-t              45                  30
      $\sigma^2$ from Chi-square(df=3):
                False discoveries   True discoveries
      T-test           43                  28
      z-test           53                  13
      C-t              42                  38

  25. Using 8 iid samples from the Khan data, we modify 50 genes to make them differentially expressed at a high level.
      [Figure: panels for T-test, SAM and Ct]

  26. Generating p-values

  27. Extensions
      • F test:
        - Condition on the sqrt(MSE)
      • Multiple comparisons:
        - Tukey, Dunnett, Bump.
        - Condition on the sqrt(MSE)
      • Gene Ontology:
        - Test for the significance of groups.
        - Use Hypergeometric Statistic, mean t, mean p-value, or other.
        - Condition on log of the number of genes per group
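As an illustration of the hypergeometric statistic mentioned for Gene Ontology groups, the sketch below computes the upper-tail probability of seeing at least k selected genes in a group. It is the standard over-representation test and does not include the conditioning on the log of the group size proposed on the slide; the function name and the numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n_group, n_selected, n_total):
    """P(X >= k), where X counts selected genes that fall in the group.

    n_total genes in all, n_group of them in the group, n_selected picked
    as significant without regard to group membership.
    """
    upper = min(n_group, n_selected)
    denom = comb(n_total, n_selected)
    numer = sum(
        comb(n_group, i) * comb(n_total - n_group, n_selected - i)
        for i in range(k, upper + 1)
    )
    return numer / denom


# Hypothetical example: 1000 genes, 50 selected, a group of 40 genes
# of which 6 were selected (2 would be expected by chance).
p_value = hypergeom_tail(6, n_group=40, n_selected=50, n_total=1000)
print(f"P(X >= 6) = {p_value:.4g}")
```

Small groups can reach extreme statistics with very few genes, which is one reason to take group size into account when comparing groups.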

  28. Conditional F

  29. GO Ontology: Conditioning on log(n)
      [Figure: Abs(T) plotted against Log(n)]

  30. Reference
      • Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Jan 2004.
      • Email: cabrera@stat.rutgers.edu, damaratu@prdus.jnj.com
      • Webpage for DNAMR and DNAMRweb: http://www.rci.rutgers.edu/~cabrera/DNAMR
      The Details:

  31. Target Estimation
      • Target estimation: Cabrera, Fernholz (1999)
        - Bias reduction.
        - MSE reduction.
      • Recent applications:
        - Ellipse estimation (multivariate target).
        - Logistic regression: Cabrera, Fernholz, Devas (2003); Patel (2003), Target Conditional MLE (TCMLE).
      • Implementation in StatXact (CYTEL) and logXact procs in SAS (by CYTEL).

  32. g( ) E (T) T(x1,x2,…,xn) E (T) =  Target Estimation

  33. Target Estimation: Algorithms
      - Stochastic approximation.
      - Simulation and iteration.
      - Exact algorithm for TCMLE
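The stochastic approximation route can be illustrated on a toy problem that is not taken from the slides. Target estimation is taken here to mean finding the parameter value $\theta$ at which the expected value of an estimator $T$ equals its observed value. The toy estimator is the variance with divisor $n$, for which $E_\theta(T) = \theta (n - 1)/n$, so the exact answer $T \cdot n/(n - 1)$ is known and the Robbins-Monro iteration can be checked against it. Function names and step sizes are hypothetical choices.

```python
import random


def biased_variance(x):
    """Variance estimator with divisor n, which is biased downward."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def target_estimate(t_obs, n, n_iter=20_000, seed=0):
    """Robbins-Monro search for theta such that E_theta[T] = t_obs.

    Each step simulates one sample of size n from N(0, theta), evaluates
    T on it, and moves theta by a shrinking fraction of (t_obs - T).
    """
    rng = random.Random(seed)
    theta = t_obs  # start at the observed (biased) estimate
    for k in range(1, n_iter + 1):
        sample = [rng.gauss(0.0, theta ** 0.5) for _ in range(n)]
        theta += (t_obs - biased_variance(sample)) / (k + 10)
        theta = max(theta, 1e-8)  # keep the variance parameter positive
    return theta


t_obs, n = 3.0, 5
estimate = target_estimate(t_obs, n)
exact = t_obs * n / (n - 1)
print(f"stochastic approximation: {estimate:.3f}")
print(f"exact target estimate:    {exact:.3f}")
```

In problems where $E_\theta(T)$ has no closed form, the simulation step is all that is needed, which is what makes this kind of algorithm attractive.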
