Test of significance for small samples
Javier Cabrera, Director, Biostatistics Institute, Rutgers University
Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development
Outline
• Microarray experiments and differential expression
• Small sample size issues
• Conditional t approach
• Comparison with other methods
• Extensions
• Reference: Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, 2004.
• Software: DNAMR and DNAMRweb, http://www.rci.rutgers.edu/~cabrera/DNAMR
Genes
A gene is a segment of DNA whose sequence of bases (nucleotides) codes for a specific protein.
AKAP6: CATCATGCAGCAGGTCAAACAAGG CATCTCCTAGTATTGCATCCTACA……
The central dogma of molecular biology: a gene is expressed via the process DNA → mRNA (transcription) → protein (translation). DNA itself is copied by replication.
Microarray experiment
• Microarray: cDNA or oligonucleotide preparation → print or synthesize on a glass slide. 5k-50k genes arrayed in a rectangular grid; one spot per gene.
• Sample: biological sample → mRNA → reverse transcribe and label.
• Microarray + sample → hybridize, wash and scan → image → quantify spot intensities → gene expression data.
Differential gene expression
An organism's genome is the complete set of genes in each of its cells. Given an organism, every one of its cells has a copy of the exact same genome, but
• different cells express different genes
• different genes express under different conditions
• differential gene expression leads to altered cell states
Differential Expression for small samples

Gene     C1      C2      C3      T1      T2      T3
G1      4.67    4.44    4.42    4.73    4.85    4.69
G2      3.13    2.54    1.96    0.97    2.38    3.36
G3      6.22    6.77    5.32    6.40    6.94    6.87
G4     10.74   10.81   10.69   10.75   10.68   10.68
G5      3.76    4.16    5.27    3.05    3.20    2.85
G6      6.95    6.78    6.33    6.81    6.95    7.01
G7      4.98    4.61    4.56    4.57    4.90    4.44
G8      2.72    3.30    3.24    3.22    3.42    3.22
G9      5.29    4.79    5.13    3.31    4.67    5.27
G10     5.12    4.85    3.79    4.13    3.12    4.79
G11     4.67    3.50    4.77    4.09    3.86    2.88
G12     6.22    6.42    5.02    6.38    6.54    6.80
G13     2.88    3.76    2.78    2.98    4.81    4.15
.......

• Preprocessed data (C = control, T = treatment).
• Perform a t-test for each gene.
• Select the most significant subset.
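The per-gene t-test can be sketched as follows for the first five genes of the table. This is a minimal illustration that assumes the equal-variance (pooled) two-sample t statistic with treatment minus control as the sign convention; the function names are hypothetical and not taken from the DNAMR software.

```python
import math

# First five genes from the table: (controls C1-C3, treatments T1-T3).
DATA = {
    "G1": ([4.67, 4.44, 4.42], [4.73, 4.85, 4.69]),
    "G2": ([3.13, 2.54, 1.96], [0.97, 2.38, 3.36]),
    "G3": ([6.22, 6.77, 5.32], [6.40, 6.94, 6.87]),
    "G4": ([10.74, 10.81, 10.69], [10.75, 10.68, 10.68]),
    "G5": ([3.76, 4.16, 5.27], [3.05, 3.20, 2.85]),
}


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    """Unbiased sample variance (divisor n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def pooled_t(control, treated):
    """Return (t, s_p): pooled two-sample t statistic and pooled standard deviation."""
    n1, n2 = len(control), len(treated)
    pooled_var = ((n1 - 1) * sample_var(control)
                  + (n2 - 1) * sample_var(treated)) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_var)
    t = (mean(treated) - mean(control)) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, s_p


results = {gene: pooled_t(c, t) for gene, (c, t) in DATA.items()}
for gene, (t, s_p) in results.items():
    print(f"{gene}: t = {t:6.2f}, s_p = {s_p:.3f}")
```

With only 3 + 3 arrays each statistic has 4 degrees of freedom, so both t and s_p are noisy; that is the small-sample issue the following slides address.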
Random Data
[Figure: plot of t vs $s_p$, with the distribution of $s_p$. Plot annotations: 300 and 21983.]
Differentially expressed genes have smaller $s_p$. Is this effect statistical or biological?
500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$). 100 genes are differentially expressed with mean difference = +1 or -1.

$\sigma^2 = 1$ (constant), $\alpha = 0.05$
Test      False discoveries   True discoveries
t-test           44                  22
z-test           43                  29

$\sigma^2$ from Chi-square(df = 3), $\alpha = 0.05$
Test      False discoveries   True discoveries
t-test           43                  28
z-test           53                  13
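A simulation of this kind can be sketched in a few lines. The version below is one possible reading of the setup, not the authors' code: it assumes the z-test divides by a single variance averaged over all genes, uses the tabulated two-sided 5% critical values (2.447 for t with 6 degrees of freedom, 1.960 for the normal), and averages over only 20 replicates to keep the run short. All function and variable names are hypothetical.

```python
import math
import random

T_CRIT_DF6 = 2.447  # two-sided 5% critical value, Student t with 6 df
Z_CRIT = 1.960      # two-sided 5% critical value, standard normal


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulate_once(rng, chi_square_variance, n_genes=1000, n_de=100, n=4):
    """One replicate: counts of false and true discoveries for the t- and z-tests."""
    diffs, pooled_vars = [], []
    for g in range(n_genes):
        # Gamma(shape=1.5, scale=2) is the chi-square distribution with 3 df.
        sigma2 = rng.gammavariate(1.5, 2.0) if chi_square_variance else 1.0
        sigma = math.sqrt(sigma2)
        # The first n_de genes are differentially expressed, alternating +1 / -1.
        shift = 0.0 if g >= n_de else (1.0 if g % 2 == 0 else -1.0)
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treated = [rng.gauss(shift, sigma) for _ in range(n)]
        diffs.append(mean(treated) - mean(control))
        pooled_vars.append((sample_var(control) + sample_var(treated)) / 2)

    common_var = mean(pooled_vars)  # assumption: z-test uses one variance for all genes
    counts = {"t": {"false": 0, "true": 0}, "z": {"false": 0, "true": 0}}
    for g, (d, v) in enumerate(zip(diffs, pooled_vars)):
        kind = "true" if g < n_de else "false"
        if abs(d / math.sqrt(v * 2 / n)) > T_CRIT_DF6:
            counts["t"][kind] += 1
        if abs(d / math.sqrt(common_var * 2 / n)) > Z_CRIT:
            counts["z"][kind] += 1
    return counts


rng = random.Random(2004)
for label, chi2 in [("constant variance", False), ("chi-square(3) variance", True)]:
    reps = [simulate_once(rng, chi2) for _ in range(20)]
    for test in ("t", "z"):
        false_avg = mean([r[test]["false"] for r in reps])
        true_avg = mean([r[test]["true"] for r in reps])
        print(f"{label}, {test}-test: false = {false_avg:.1f}, true = {true_avg:.1f}")
```

The averages will not reproduce the table exactly, since the seed, the number of replicates and possibly the z-test definition differ from the original study.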
The effect of small sample size
Often the sample size per group is small. This leads to:
• unreliable variances (and hence unreliable inferences)
• dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
Possible remedies:
• borrow strength across genes (LPE/EB)
• regularize the test statistics (SAM)
• work with $t_g \mid s_g$ (Conditional t)
Analysis results
Top 10 genes (sorted by t-test p-value)

Gene      Fold    Dir   p          p(Bonf)
G6546      2.36   D     0.000004   0.0964
G19945     3.25   U     0.000005   0.1102
G21586     1.64   U     0.000008   0.1765
G18970     2.52   U     0.000019   0.4220
G7432      3.70   D     0.000033   0.7248
G19057     1.85   U     0.000046   1.0000
G17361     4.34   D     0.000067   1.0000
G8525      5.57   D     0.000067   1.0000
G425      18.11   D     0.000078   1.0000
G8524      4.74   D     0.000109   1.0000
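The p(Bonf) column is consistent with the usual Bonferroni adjustment: multiply each p-value by the number of tests and cap the result at 1. A minimal sketch follows; the number of tests used here (22283) is an assumption inferred from the gene counts annotated on the plots, and the printed p-values are rounded, so the output only approximates the table.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni-adjusted p-values: min(1, p * number of tests)."""
    return [min(1.0, p * n_tests) for p in p_values]


N_TESTS = 22283  # assumed number of genes tested (not stated explicitly in the slides)
raw = [0.000004, 0.000033, 0.000046]
for p, adj in zip(raw, bonferroni(raw, N_TESTS)):
    print(f"p = {p:.6f}  ->  p(Bonf) = {adj:.4f}")
```

With this many genes and so few arrays, even the smallest raw p-values stay above 0.05 after adjustment, which is why none of the ten genes is significant at the Bonferroni level.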
SAM: Determining c
For each a: compute $v_1(a) = \mathrm{mad}\{T_g\}$, $v_2(a)$, ..., $v_7(a)$, and then cv(a).
[Figure: $T_g$ vs $s_g$; annotation: Min]
SAM: Gene selection D = Expected value of under permutations
Conditional t: Basic Model
Let $X_{gij}$ denote the preprocessed intensity measurement for gene $g$ in array $i$ of group $j$.
Model: $X_{gij} = \mu_{gj} + \sigma_g \epsilon_{gij}$
Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
Error model: $\epsilon_{gij} \sim F$ (location = 0, scale = 1)
Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$ with marginals $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$
Possible approaches
Parametric: assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or an Empirical Bayes procedure.
Nonparametric:
1. Estimate the distributions directly from the data. For small samples the direct estimate is not a good estimator of $F$, so use the method of moments (= target estimation).
2. Proceed via resampling and estimate the distribution of $t \mid s_p$ (Conditional t).
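The resampling idea in step 2 can be illustrated with a rough sketch. This is not the authors' algorithm: it assumes normal errors, 4 arrays per group, gene variances drawn from a chi-square distribution with 3 degrees of freedom, and a crude estimate of the conditional distribution obtained by binning simulated null pairs $(t, s_p)$ on $s_p$. All names are hypothetical.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def pooled_t(control, treated):
    """Return (t, s_p) for equal group sizes."""
    n = len(control)
    s_p = math.sqrt((sample_var(control) + sample_var(treated)) / 2)
    t = (mean(treated) - mean(control)) / (s_p * math.sqrt(2 / n))
    return t, s_p


def null_pairs(n_sim, n_per_group, draw_sigma2, rng):
    """Simulate (t, s_p) pairs for genes with no differential expression."""
    pairs = []
    for _ in range(n_sim):
        sigma = math.sqrt(draw_sigma2(rng))
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        pairs.append(pooled_t(control, treated))
    return pairs


def conditional_critical_values(pairs, n_bins=5, alpha=0.05):
    """Per-bin (upper s_p edge, critical |t|): the (1 - alpha) quantile of |t| given the s_p bin."""
    ordered = sorted(pairs, key=lambda p: p[1])
    size = len(ordered) // n_bins
    result = []
    for b in range(n_bins):
        chunk = ordered[b * size:] if b == n_bins - 1 else ordered[b * size:(b + 1) * size]
        abs_t = sorted(abs(t) for t, _ in chunk)
        index = min(len(abs_t) - 1, math.ceil((1 - alpha) * len(abs_t)) - 1)
        result.append((chunk[-1][1], abs_t[index]))
    return result


rng = random.Random(1)
# Assumed variance distribution: chi-square with 3 df = Gamma(shape 1.5, scale 2).
pairs = null_pairs(20000, 4, lambda r: r.gammavariate(1.5, 2.0), rng)
critical = conditional_critical_values(pairs)
for upper_edge, crit in critical:
    print(f"s_p <= {upper_edge:6.3f}: reject if |t| > {crit:.2f}")
```

A gene would then be called significant when its $|t_g|$ exceeds the critical value for the bin containing its $s_g$. Under these assumptions the cutoff is larger for small $s_p$, which counteracts the tendency of the ordinary t-test to favor genes with small pooled standard deviations.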
Roadblock
Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_\sigma$, and let $s^2$ be the variance obtained from the $\{X_{ij}\}$. Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
For example, if we assume that $F_\sigma = \chi^2_3$, $n = 4$ and $\epsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.
Fix by target estimation: method of moments. Shrink towards the center.
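The inequality follows from the law of total variance. The derivation below is a sketch under one specific reading of the example, namely a single group of $n = 4$ normal observations, so that $s^2 \mid \sigma^2 \sim \sigma^2 \chi^2_{n-1}/(n-1)$. Under that assumption the calculation gives 16 rather than the quoted 15, so the authors' exact setup may differ; the inequality holds either way because the first term is positive.

```latex
\begin{aligned}
\mathrm{Var}(s^2)
  &= E\left[\mathrm{Var}(s^2 \mid \sigma^2)\right] + \mathrm{Var}\left(E[s^2 \mid \sigma^2]\right) \\
  &= \frac{2}{n-1}\, E[\sigma^4] + \mathrm{Var}(\sigma^2) \\
  &= \frac{2}{3}\left(6 + 3^2\right) + 6 = 16 \;>\; 6 = \mathrm{Var}(\sigma^2)
\end{aligned}
```

Here $E[\sigma^4] = \mathrm{Var}(\sigma^2) + (E[\sigma^2])^2 = 6 + 9 = 15$ for $\sigma^2 \sim \chi^2_3$. The observed variances are therefore more spread out than the true ones, which is why the estimated variance distribution has to be shrunk towards the center.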
Example: Checking the distribution of $\sigma_g$
Compare the distribution of $s_g$ in the mice data with simulations using: 1. Df = 0.5, 2. Df = 2, 3. Df = 6.
[Figure: mice data vs simulations with Df = 0.5, Df = 2 and Df = 6]
Another Example
Compare the distribution of $s_g$ with simulations using Df = 0.5, Df = 3 and Df = 6.
[Figure: data vs simulations with Df = 0.5, Df = 3 and Df = 6]
Fixing the variance distribution (contd) Proceed as before …
[Figure: plot of t vs $s_p$. Plot annotations: 191 and 22092.]
Differentially expressed genes may have large $s_p$.
500 simulations: 1000 genes, 4 controls + 4 treatments, iid Normal(0, $\sigma^2$). 100 genes are differentially expressed with mean difference = +1 or -1.

$\sigma^2 = 1$ (constant)
Test      False discoveries   True discoveries
t-test           44                  22
z-test           43                  29
C-t              45                  30

$\sigma^2$ from Chi-square(df = 3)
Test      False discoveries   True discoveries
t-test           43                  28
z-test           53                  13
C-t              42                  38
Using 8 iid samples from the Khan data, we altered 50 genes to make them differentially expressed at a high level.
[Figure panels: t-test, SAM, Ct]
Extensions
F test:
- Condition on sqrt(MSE).
Multiple comparisons:
- Tukey, Dunnett, Bump.
- Condition on sqrt(MSE).
Gene Ontology:
- Test for the significance of groups.
- Use the hypergeometric statistic, mean t, mean p-value, or other.
- Condition on the log of the number of genes per group.
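The hypergeometric statistic mentioned for Gene Ontology groups can be computed as an upper-tail probability: the chance of seeing at least k significant genes in a group of a given size when the significant genes are spread at random. The sketch below shows only this unconditional statistic, not the conditioning on log(n); the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_genes, n_significant, group_size, k):
    """P(X >= k) when group_size genes are drawn without replacement
    from n_genes genes, of which n_significant are significant."""
    total = comb(n_genes, group_size)
    upper = min(group_size, n_significant)
    favorable = sum(
        comb(n_significant, i) * comb(n_genes - n_significant, group_size - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


# Hypothetical example: 300 significant genes out of 22283,
# and a gene group of 40 genes containing 5 of them.
p_group = hypergeom_upper_tail(22283, 300, 40, 5)
print(f"P(at least 5 significant genes in the group) = {p_group:.3g}")
```

Small groups can reach extreme values of such statistics by chance more easily than large ones, which is one motivation for conditioning on the group size.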
Gene Ontology: Conditioning on log(n)
[Figure: Abs(T) vs Log(n)]
Reference
• Amaratunga, Cabrera. Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Jan 2004.
• Email: cabrera@stat.rutgers.edu, damaratu@prdus.jnj.com
• Webpage for DNAMR and DNAMRweb: http://www.rci.rutgers.edu/~cabrera/DNAMR
The Details:
Target Estimation
• Target Estimation: Cabrera, Fernholz (1999)
  - Bias reduction.
  - MSE reduction.
• Recent Applications:
  - Ellipse Estimation (Multivariate Target).
  - Logistic Regression: Cabrera, Fernholz, Devas (2003); Patel (2003), Target Conditional MLE (TCMLE).
• Implementation in StatXact (CYTEL) and logXact Proc's in SAS (by CYTEL).
Target Estimation
$g(\theta) = E_\theta(T)$
Target estimate: the value of $\theta$ that solves $E_\theta(T) = T(x_1, x_2, \ldots, x_n)$.
Target Estimation: Algorithms
- Stochastic approximation.
- Simulation and iteration.
- Exact algorithm for TCMLE.
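As a toy illustration of the idea of solving $E_\theta(T) = T(x_1, \ldots, x_n)$, take $T$ to be the variance estimator with divisor $n$, for which $E_\theta(T) = \frac{n-1}{n}\theta$ is known in closed form. The sketch below solves the equation by bisection; it is a hypothetical example chosen because the answer can be checked (the target estimate is the usual unbiased variance), not one of the algorithms listed above. In realistic problems the expectation would have to be approximated, for example by simulation.

```python
import statistics


def expected_T(theta, n):
    """E_theta(T) for T = variance estimator with divisor n, true variance theta."""
    return (n - 1) / n * theta


def target_estimate(t_obs, n, lo=0.0, hi=1e6, tol=1e-10):
    """Solve expected_T(theta, n) = t_obs for theta by bisection
    (valid because expected_T is increasing in theta)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_T(mid, n) < t_obs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


x = [4.67, 4.44, 4.42, 4.73]       # illustrative data
t_obs = statistics.pvariance(x)    # biased estimator T (divisor n)
theta_hat = target_estimate(t_obs, len(x))
print(f"T = {t_obs:.5f}, target estimate = {theta_hat:.5f}, "
      f"unbiased variance = {statistics.variance(x):.5f}")
```

In this case the target estimate removes the bias of $T$ exactly, which matches the bias-reduction motivation given for target estimation.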