Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans

Qunyuan Zhang, Division of Statistical Genomics, Washington University School of Medicine

Multiple Comparison (strategy 1)
• P value adjustment/correction (Bonferroni, FDR)
• Empirical p value (permutation, bootstrap)
[Diagram: trade-off between Type I error (false positive) and Type II error (false negative); labels: low power, high power]
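
The two adjustments named on this slide can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the function names and the five example p values are made up for the sketch and are not taken from the talk.

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni-adjusted p values: p * m, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p value
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, working from the largest p value downwards
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Hypothetical p values for five SNPs
pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print(bonf)
print(bh)
```

With these inputs, Bonferroni keeps two SNPs below 0.05 and the FDR adjustment also keeps two, but the FDR-adjusted values for the third and fourth SNPs sit much closer to the cutoff, which is the power difference the slide alludes to.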

Multiple Comparison (strategy 2)
• Larger sample size
• Meta analysis
• Biological info or evidence
• ……
• More powerful statistical approach
SMDP: Sequential Multiple Decision Procedure
[Diagram: Type I error (false positive) and Type II error (false negative)]

What is SMDP? • A generalized framework for ranking and selection, using optimum sample sizes • A combination of sequential analysis and multiple hypothesis test

Feature 1 of SMDP: Sequential Analysis
• Start from a small sample size
• Increase sample size, with a sequential test at each stage
• Stop when the stopping rule is satisfied
Sample size by stage: n0, n0+1, n0+2, …, n0+i, …

Feature 2 of SMDP: Multiple Decision
• Independent test (binary hypothesis test): SNP1, SNP2, …, SNPn are each tested separately (test 1, test 2, …, test n)
• Simultaneous test (multiple hypothesis test): SNP1, SNP2, …, SNPn are tested together and divided into a signal group and a noise group

Binary Hypothesis Test (used by traditional methods)
test 1: H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0
test 2: H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0
test 3 to test 6: ……
test n: H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0
Issues: test-wise error vs. genome-wise error; multiple testing

Multiple Hypothesis Test (used by SMDP)
H: any t SNPs are truly different from the other (n-t) SNPs
u = number of all possible combinations of t out of n
H1: SNP1,2,3 are truly different from the others
H2: SNP1,2,4 are truly different from the others
H3, H4: ……
H5: SNP4,5,6 are truly different from the others
H6, …: ……
Hu: SNPn,n-1,n-2 are truly different from the others
Goal: search for the best one
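
To get a feel for how quickly u grows, the count of candidate hypotheses can be computed directly. A small sketch using only the Python standard library; the function name is illustrative.

```python
from math import comb

def n_hypotheses(n_snps, t):
    """Number of ways to choose t 'signal' SNPs out of n_snps."""
    return comb(n_snps, t)

# The slide's toy case: 3 SNPs out of 6
print(n_hypotheses(6, 3))      # 20
# A still-small scan: 3 SNPs out of 1000
print(n_hypotheses(1000, 3))
```

Even at 1000 SNPs the number of hypotheses is already in the hundreds of millions, which is why the computing-time question comes up later in the talk.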

General Rule of SMDP (Bechhofer et al., 1968)
Selecting the t best of M K-D populations
• Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t and Pop. t+1, Pop. t+2, …, Pop. M, with the two groups separated by a distance D
• Sequential sampling: stages 1, 2, …, h, h+1, …
• Sequential statistic at stage h: Y1,h, Y2,h, …, Yt,h, …, YM,h
• U possible combinations of t out of M; a statistic is formed for each combination u
• Stopping rule: Prob. of correct selection (PCS) > P*, whenever D>D*
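
A sketch of the stopping rule in formula form, assuming the form usually given for the Bechhofer et al. procedure. The bracket notation is introduced here for illustration: Y_{[1],h} ≤ … ≤ Y_{[U],h} are the U combination sums at stage h in increasing order, so Y_{[U],h} belongs to the currently best combination. Sampling stops at the first stage h at which

```latex
W_h \;=\; \sum_{u=1}^{U-1} \exp\!\left\{ -D^{*}\left( Y_{[U],h} - Y_{[u],h} \right) \right\}
\;\le\; \frac{1-P^{*}}{P^{*}}
```

For P* = 0.95 the right-hand side is 0.05/0.95 ≈ 0.0526, so the rule stops once the best combination has pulled far enough ahead of all the others.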

Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968)
The freq/density function of a K-D population can be written in the form: f(x)=exp{P(x)Q(θ)+R(x)+S(θ)}
Examples:
• The normal density function with unknown mean and known variance
• The normal density function with unknown variance and known mean
• The exponential density function with unknown scale parameter and known location parameter
• The Poisson distribution with unknown mean
• ……
The distance of two K-D populations
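
As a worked check, three of the listed families can be put into the K-D form explicitly. The derivations below are standard algebra; the choice of θ for each family is stated in the comment lines.

```latex
% Normal, unknown mean \theta, known variance \sigma^2
f(x) = \exp\Big\{ \underbrace{x}_{P(x)} \cdot \underbrace{\tfrac{\theta}{\sigma^2}}_{Q(\theta)}
       \underbrace{-\tfrac{x^2}{2\sigma^2} - \tfrac12\log(2\pi\sigma^2)}_{R(x)}
       \underbrace{-\tfrac{\theta^2}{2\sigma^2}}_{S(\theta)} \Big\}

% Normal, known mean \mu, unknown variance \theta = \sigma^2
f(x) = \exp\Big\{ \underbrace{(x-\mu)^2}_{P(x)} \cdot \underbrace{\big(-\tfrac{1}{2\theta}\big)}_{Q(\theta)}
       + \underbrace{0}_{R(x)}
       \underbrace{-\tfrac12\log(2\pi\theta)}_{S(\theta)} \Big\}

% Poisson, unknown mean \theta
f(x) = \exp\Big\{ \underbrace{x}_{P(x)} \cdot \underbrace{\log\theta}_{Q(\theta)}
       \underbrace{-\log x!}_{R(x)}
       \underbrace{-\,\theta}_{S(\theta)} \Big\}
```

In the second case P(x) is a squared deviation from the known mean, so sums of P(x) over observations are sums of squares.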

Combine SMDP With Regression Model (M.A. Province, 2000, page 319)
Case B: the normal density function with unknown variance and known mean

SMDP - Regression (M.A. Province, 2000)
Data pairs for a marker: (Z1, X1), (Z2, X2), (Z3, X3), …, (Zh, Xh), (Zh+1, Xh+1), …, (ZN, XN)
Sequential sum of squares of regression residuals
Yi,h denotes Y for marker i at stage h (see slide 7)
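
One way to picture a sequential sum of squares of regression residuals is to refit a simple regression on the first h data pairs at every stage and record the residual sum of squares. The sketch below does exactly that with NumPy. It is an illustration under stated assumptions, not the formula from Province (2000): it assumes Z is the response and X the predictor, refits at each stage, and starts at h = 3; the function name is made up.

```python
import numpy as np

def sequential_rss(z, x, start=3):
    """Residual sum of squares of the regression of z on x,
    refit on the first h pairs for h = start, ..., N.
    Assumption: z is the response, x the predictor."""
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    rss_by_stage = {}
    for h in range(start, len(z) + 1):
        # design matrix with intercept and slope, first h pairs only
        design = np.column_stack([np.ones(h), x[:h]])
        coef, *_ = np.linalg.lstsq(design, z[:h], rcond=None)
        resid = z[:h] - design @ coef
        rss_by_stage[h] = float(resid @ resid)
    return rss_by_stage

# Hypothetical genotype codes (x) and trait values (z) for one marker
x = [0, 1, 2, 1, 0, 2]
z = [1.1, 2.7, 5.4, 3.2, 0.6, 4.9]
for h, rss in sequential_rss(z, x).items():
    print(h, round(rss, 4))
```

Running this once per marker gives one value per marker per stage, which is the shape of the Yi,h table described on the slide.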

A Real Data Example (M.A. Province, 2000, page 308)

Simulation Results M.A. Province, 2000, page 312

SMDP: Computational Problem
Sequential stages: 1, 2, 3, …, h, h+1, …, N
Statistics at stage h: Y1,h, Y2,h, …, Yk,h, Yk+1,h, Yk+2,h, …, YM,h
U sums for the U possible combinations of t out of M; each sum contains t members of Yi,h
Computer time?

Simplified Stopping Rule (Zhang & Province, 2005)
U-S+1 = Top Combination Number (TCN)
• When TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule
• When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule
How to choose TCN? Balance between computational accuracy and computational time
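
The effect of TCN can be tried on a toy example. The sketch below assumes the stopping statistic is a sum of exp{-D*(best sum - other sum)} over the combinations ranked below the best one, compared against (1-P*)/P*, and that a larger Y means a stronger signal; both are assumptions made for illustration. It enumerates every combination by brute force, so it is only usable for a handful of markers, and the function names are hypothetical.

```python
import math
from itertools import combinations

def stopping_statistic(y, t, d_star, tcn=None):
    """Sum of exp(-D* x gap) over combinations ranked below the best.
    y: per-marker statistics at the current stage (assumed: larger = stronger).
    tcn: keep only the top `tcn` combinations in total (None = all U)."""
    if tcn is not None and tcn < 2:
        raise ValueError("tcn must be at least 2")
    sums = sorted((sum(c) for c in combinations(y, t)), reverse=True)
    best, rest = sums[0], sums[1:]
    if tcn is not None:
        rest = rest[:tcn - 1]  # the best one plus (tcn - 1) runners-up
    return sum(math.exp(-d_star * (best - s)) for s in rest)

def should_stop(y, t, d_star, p_star, tcn=None):
    """True when the (possibly truncated) statistic is below (1-P*)/P*."""
    return stopping_statistic(y, t, d_star, tcn) <= (1 - p_star) / p_star

# Toy stage: 4 markers, pick the best t = 2
y = [5.0, 4.0, 1.0, 0.5]
print(should_stop(y, t=2, d_star=1.0, p_star=0.95, tcn=2))  # simplest rule
print(should_stop(y, t=2, d_star=1.0, p_star=0.95))         # original rule
```

Because every term in the sum is positive, truncating it can only make the statistic smaller, so a small TCN stops no later than the original rule. In this toy case TCN=2 already stops while the full sum does not, which illustrates the accuracy side of the trade-off.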

Application to Pharmacogenetics Data
P*=0.95, D*=10, TCN=10000
72 SNPs, P<0.01

SMDP for GWAS
Some technical/programming problems:
1. Computer time (approximation & parallelization)
2. Missing data
3. Stability at early stage
4. Rare SNPs
SMDP can now be run on GWAS data (500K chip, 1000 subjects) within 10 hours on a cluster

Simulation: 15000 SNPs, 1 true signal, 500 replications

Simulation 2: Multiple signals
Genotype data: GAW16 problem 3, 500K SNP data
Phenotype data: Simulated LDL (measured at the first visit), ~6500 subjects, 200 replications
Analyses:
• For each replication, randomly draw 1000 SNPs without true effects and 10 SNPs with minor polygenic effects, and keep all 6 SNPs with relatively major effects, to create a subset of genotypes
• Recode the genotypes to 0, 1 and 2 according to the copy number of minor alleles
• Apply SMDP to the selected data and repeat the analysis over 200 replications
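
The subset draw and the 0/1/2 recoding described above can be sketched with the standard library. Everything here is hypothetical scaffolding: the function names, the SNP identifiers and the two-letter genotype strings are assumptions for illustration, and the recoding assumes biallelic SNPs with no missing calls.

```python
import random
from collections import Counter

def draw_subset(null_snps, minor_snps, major_snps,
                n_null=1000, n_minor=10, seed=None):
    """Random null and minor-effect SNPs plus all major-effect SNPs."""
    rng = random.Random(seed)
    return (rng.sample(null_snps, n_null)
            + rng.sample(minor_snps, n_minor)
            + list(major_snps))

def recode_minor_allele_count(genotypes):
    """Recode genotypes such as 'AG' to the number of minor alleles (0, 1, 2).
    Assumes a biallelic SNP and no missing calls."""
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) < 2:
        return [0] * len(genotypes)  # monomorphic: no minor allele observed
    minor = min(counts, key=counts.get)  # less frequent allele
    return [g.count(minor) for g in genotypes]

# Hypothetical SNP identifiers
null_ids = [f"null{i}" for i in range(5000)]
minor_ids = [f"minor{i}" for i in range(40)]
major_ids = [f"major{i}" for i in range(6)]
subset = draw_subset(null_ids, minor_ids, major_ids, seed=1)
print(len(subset))

print(recode_minor_allele_count(["AA", "AG", "GG", "AA", "AG"]))  # [0, 1, 2, 0, 1]
```

In the example genotypes allele A appears six times and G four times, so G is the minor allele and the codes count copies of G.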

Modified SMDP (analysis procedure)
(1) Start analysis (or experiment) from a small sample size;
(2) Perform multiple decision analysis to simultaneously test if a group of markers is significant;
(3) Eliminate significant markers from the list (if identified);
(4) Add one or multiple new samples to the data;
(5) Repeat (2), (3), (4) …
(6) Stop the procedure when all samples have been used and no more markers are identified.
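
The six steps can be laid out as a loop skeleton. This is a sketch of the control flow only: `identify` is a hypothetical placeholder for the multiple decision analysis in step (2), and `n0` and `step` are illustrative parameter names, none of which come from the talk.

```python
def modified_smdp(samples, markers, identify, n0=10, step=1):
    """Control-flow sketch of the modified procedure.
    identify(data, markers) -> iterable of markers judged significant
    (hypothetical stand-in for the multiple decision analysis)."""
    remaining = list(markers)
    found = []                      # (marker, sample size when identified)
    n = min(n0, len(samples))       # step 1: start small
    while remaining:
        hits = set(identify(samples[:n], remaining)) & set(remaining)  # step 2
        found.extend((m, n) for m in remaining if m in hits)
        remaining = [m for m in remaining if m not in hits]            # step 3
        if n >= len(samples):
            if not hits:            # step 6: all samples used, nothing new
                break
        else:
            n = min(n + step, len(samples))                            # step 4
    return found, remaining

# Toy stand-in: a marker is "identified" once enough samples are in
needed = {"snpA": 12, "snpB": 15, "snpC": 100}
def toy_identify(data, markers):
    return {m for m in markers if len(data) >= needed[m]}

found, left = modified_smdp(list(range(20)), ["snpA", "snpB", "snpC"], toy_identify)
print(found)  # [('snpA', 12), ('snpB', 15)]
print(left)   # ['snpC']
```

Recording the sample size at which each marker drops out is what makes an average sample number per marker available afterwards.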

ROC Curves of SMDP and Regular Regression Analyses
• Ar, Br: Regular regression using all samples
• As, Bs: SMDP analyses
• Ars, Brs: Regular regression using SMDP's average sample sizes (ASN)
• Ar, As and Ars: Analysis of SNPs with major effects
• Br, Bs and Brs: Analysis of SNPs with minor effects
ASN: the average sample size used in SMDP, presented as a proportion of the entire sample size.

Power comparison of SMDP and regular regression (type I error rate = 0.0025)
* Proportion of significant tests (P<0.05), based on regression using the rest of the samples after SMDP stops.
* ASN: Average sample number used in SMDP
Conclusion: given the same sample size, SMDP-regression is more powerful than regular regression.

Application to Real Data
The NHLBI Family Heart Study
• Illumina HumanMap550 array data, 983 subjects
• Phenotype: Coronary Artery Calcification (CAC)
SMDP identifies 69 SNPs using fewer than 811 samples
Traditional regression analysis of all 983 samples identifies:
• 46122 SNPs (p<0.05)
• 15 SNPs (FDR<0.05), 11 of them identified by SMDP
• 1 SNP (p<0.05/500K), also identified by SMDP

Summary of SMDP (advantages)
• Efficient use of sample size; extra samples left after stopping can be used for validation
• Simultaneously tests a group of signals, avoiding one-by-one tests and p-value adjustment
• Increases power (or decreases false positives) given the same average sample size
• Flexible experimental design

Summary of SMDP (limitations)
• Compute time (needs approximation & parallelization)
• Requirement of Koopman-Darmois distribution family

SMDP stopping rule
SMDP parameters: P*, t, D*
• P*: arbitrary, 0.95
• t: fixed or varied
• D*: indifference zone
Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M
Stopping rule: Prob. of correct selection (PCS) > P* whenever D>D*
Correct selection: populations with Q(θ)>Q(θt)+D* are selected
[Diagram: axis marked at Q(θt) and Q(θt)+D*, a distance D* apart]

References
R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential identification and ranking procedures. The University of Chicago Press, Chicago.
M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19:301-332.
Q. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. 2005 Proceedings of American Statistical Association, Biometrics section: 463-468.

Application to GWAS slide 9 slide 10

Zhang & Province, 2005, page 467
P*=0.95, D*=10, TCN=10000
72 SNPs, P<0.01

Simplified Stopping Rule (M.A. Province, 2000, pages 321-322)

A Real Data Example (M.A. Province, 2000, page 310)

Simulation Results (2) M.A. Province, 2000, page 313

Simplified SMDP (Bechhofer et al., 1968)
U-S+1 = Top Combination Number (TCN)
How to choose TCN? Balance between computational accuracy and computational time

Zhang & Province, 2005, page 465
Relation of W and t (h=50, D*=10)
Effective Top Combination Number (ETCN)

Zhang & Province,2005,page 466 ETCN Curve

Zhang & Province,2005,page 466 t =?

SMDP Summary
Advantages:
• Test and identify all signals simultaneously, no multiple comparisons
• Use "minimal" N to find significant signals: efficient
• Tight control of statistical errors (Type I, II): powerful
• Save the rest of N for validation: reliable
Further studies:
• Computer time
• Extension to more methods/models
• Extension to non-K-D distributions
