Sequential Multiple Decision Procedures (SMDP) for Genome Scans

Sequential Multiple Decision Procedures (SMDP) for Genome Scans. Q.Y. Zhang and M.A. Province, Division of Statistical Genomics, Washington University School of Medicine. Statistical Genetics Forum, April 2006.

References
R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential identification and ranking procedures. The University of Chicago Press, Chicago.
M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19:301-332.
Q.Y. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. 2005 Proceedings of American Statistical Association, Biometrics section: 463-468.

SMDP: Sequential Multiple Decision Procedures. Sequential: sequential test. Multiple decision: multiple hypothesis test.

Idea 1: Sequential
• Start from a small sample size n0
• Increase the sample size (n0, n0+1, n0+2, …, n0+i, …) and run a sequential test at each stage (SPRT)
• Stop when the stopping rule is satisfied
• Data not yet used: experiment in the next stage; extra data for validation
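To make the "test at each stage, stop when the rule is satisfied" idea concrete, here is a minimal sketch of Wald's SPRT for a single normal mean with known variance. It illustrates the sequential idea only and is not the SMDP stopping rule; the function name, the hypotheses mu0 and mu1, and the error rates alpha and beta are illustrative assumptions.

```python
import math

def sprt_normal_mean(xs, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Wald SPRT for H0: mean = mu0 vs. H1: mean = mu1, known sigma.

    Returns (decision, number of observations used).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this bound
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this bound
    llr = 0.0                              # cumulative log-likelihood ratio
    n = 0
    for n, x in enumerate(xs, start=1):
        # log f1(x)/f0(x) for two normal densities with common sigma
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2) / sigma**2
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", n

# Usage: every observation equals mu1, so evidence for H1 accumulates steadily.
decision, n_used = sprt_normal_mean([1.0] * 100, mu0=0.0, mu1=1.0, sigma=1.0)
print(decision, n_used)
```

Observations that are never reached by the loop remain available, which is the point of the last bullet above.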

Idea 2: Multiple Decision
• Independent test (binary hypothesis test): SNP1, SNP2, …, SNPn are tested one at a time (test 1, test 2, …, test n); test-wise error and experiment-wise error; p value correction
• Simultaneous test (multiple hypothesis test): SNP1, SNP2, …, SNPn are tested together and split into a signal group and a noise group

Binary Hypothesis Test
test 1 (SNP1): H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0
test 2 (SNP2): H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0
test 3 (SNP3) through test 6 (SNP6): ……
test n (SNPn): H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0

Multiple Hypothesis Test (SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, …, SNPn)
H1: SNP1,2,3 are truly different from the others
H2: SNP1,2,4 are truly different from the others
H3, H4: ……
H5: SNP4,5,6 are truly different from the others
H6, …: ……
Hu: SNPn,n-1,n-2 are truly different from the others
H: any t SNPs are truly different from the other n-t
u = number of all possible combinations of t out of n
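The count u is the binomial coefficient "n choose t". A small sketch of how quickly it grows, and of how the hypotheses can be enumerated for a toy case, is given here; the SNP labels and the helper names are illustrative only.

```python
import math
from itertools import combinations

def count_hypotheses(n, t):
    """Number of ways to pick the t 'signal' SNPs out of n."""
    return math.comb(n, t)

def enumerate_hypotheses(snps, t):
    """Every candidate signal group of size t, as tuples of SNP labels."""
    return list(combinations(snps, t))

snps = ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP6"]
hypotheses = enumerate_hypotheses(snps, 3)
print(len(hypotheses))            # 20
print(hypotheses[0])              # ('SNP1', 'SNP2', 'SNP3')
print(count_hypotheses(1000, 3))  # 166167000
```

Even with t as small as 3, a scan over 1000 markers already implies more than 10^8 candidate hypotheses, which is why full enumeration becomes a computing-time concern.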

SMDP: Sequential test + Multiple hypothesis test = Sequential Multiple Decision Procedure

Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968)
The freq/density function of a K-D population can be written in the form: f(x) = exp{P(x)Q(θ) + R(x) + S(θ)}
Examples:
• The normal density function with unknown mean and known variance
• The normal density function with unknown variance and known mean
• The exponential density function with unknown scale parameter and known location parameter
• The Bernoulli distribution with unknown probability of "success" on a single trial
• The Poisson distribution with unknown mean
• ……
The distance of two K-D populations is defined as: [equation not preserved]
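As a check on the K-D form, the sketch below writes two of the listed families as exp{P(x)Q(θ) + R(x) + S(θ)} and compares them numerically with their usual formulas. The particular choices of P, Q, R and S are one standard factorization worked out for this illustration, not copied from the slides; SIGMA is an arbitrary known standard deviation.

```python
import math

SIGMA = 2.0  # known standard deviation (illustrative value)

def kd_density(x, theta, P, Q, R, S):
    """Evaluate f(x) = exp{P(x)Q(theta) + R(x) + S(theta)}."""
    return math.exp(P(x) * Q(theta) + R(x) + S(theta))

# Normal with unknown mean mu and known variance SIGMA**2.
normal_parts = dict(
    P=lambda x: x,
    Q=lambda mu: mu / SIGMA**2,
    R=lambda x: -x**2 / (2 * SIGMA**2) - 0.5 * math.log(2 * math.pi * SIGMA**2),
    S=lambda mu: -mu**2 / (2 * SIGMA**2),
)

# Poisson with unknown mean lam.
poisson_parts = dict(
    P=lambda k: k,
    Q=lambda lam: math.log(lam),
    R=lambda k: -math.lgamma(k + 1),   # -log(k!)
    S=lambda lam: -lam,
)

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

print(math.isclose(kd_density(1.3, 0.7, **normal_parts), normal_pdf(1.3, 0.7, SIGMA)))  # True
print(math.isclose(kd_density(4, 2.5, **poisson_parts), poisson_pmf(4, 2.5)))           # True
```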

SMDP (Bechhofer et al., 1968): Selecting the t best of M K-D populations
• Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M
• Sequential sampling: stages 1, 2, …, h, h+1, …, giving Y1,h, Y2,h, …, Yi,h, …, YM,h at stage h
• U possible combinations of t out of M; the stopping rule is evaluated for each combination u
• Stopping rule: Prob. of correct selection (PCS) > P*, whenever D > D*

SMDP stopping rule
SMDP parameters: P*, t, D*
• P*: arbitrary, e.g. 0.95
• t: fixed or varied
• D*: indifference zone
Prob. of correct selection (PCS) > P* whenever D > D*
Correct selection: populations with Q(θ) > Q(θt) + D* are selected
Diagram: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t, Pop. t+1, Pop. t+2, …, Pop. M placed on the Q(θ) axis, with Q(θt), Q(θt)+D and Q(θt)+D* marked
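The slide states the guarantee but not the statistic that is monitored. A minimal sketch follows, assuming the stopping statistic has the Bechhofer-Kiefer-Sobel form W_h = sum over all non-top combinations u of exp{-D* (Y_top,h - Y_u,h)}, with sampling stopped as soon as W_h <= (1 - P*)/P*. That form, the convention that the largest combination sum is "top", and the function names are assumptions made for illustration, not taken from the slides.

```python
import math
from itertools import combinations

def stopping_statistic(y, t, d_star):
    """W for one stage: y holds the current Y_i,h of all M populations.

    Assumed form: sum over every non-top t-combination of
    exp(-d_star * (top_sum - combination_sum)).
    """
    sums = sorted(sum(c) for c in combinations(y, t))
    top = sums[-1]
    return sum(math.exp(-d_star * (top - s)) for s in sums[:-1])

def should_stop(y, t, d_star, p_star):
    """True when the assumed rule W <= (1 - P*) / P* is met."""
    return stopping_statistic(y, t, d_star) <= (1 - p_star) / p_star

# Two populations clearly ahead of the rest: stop.
print(should_stop([10.0, 10.0, 0.0, 0.0], t=2, d_star=1.0, p_star=0.95))  # True
# Separation still small: keep sampling.
print(should_stop([1.0, 1.0, 0.0, 0.0], t=2, d_star=1.0, p_star=0.95))    # False
```

The sketch enumerates every combination, so its cost grows with the number of combinations of t out of M.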

SMDP: Computational Problem
• Sequential stages: 1, 2, 3, …, h, h+1, …, N
• At stage h: Y1,h, Y2,h, …, Yt,h, Yt+1,h, Yt+2,h, …, YM,h
• U sums, one for each of the U possible combinations of t out of M
• Each sum contains t members of Yi,h
• Computer time?

Simplified Stopping Rule (Bechhofer et al., 1968)
U-S+1 = Top Combination Number (TCN)
• When TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule
• When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule
How to choose TCN? Balance between computational accuracy and computational time
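One way to read the TCN idea: only the TCN largest combination sums enter the stopping statistic, so the remaining combinations never need to be formed. The sketch below finds those top sums with a best-first search over index tuples instead of enumerating all U combinations. The exponential form of the statistic is an assumed Bechhofer-Kiefer-Sobel style expression, and the function names and the search strategy are illustrative choices, not the authors' implementation.

```python
import heapq
import math

def top_combination_sums(y, t, tcn):
    """Return the tcn largest sums over t-subsets of y, largest first."""
    if not 1 <= t <= len(y):
        raise ValueError("t must be between 1 and len(y)")
    vals = sorted(y, reverse=True)
    m = len(vals)
    start = tuple(range(t))                      # indices of the t largest values
    heap = [(-sum(vals[i] for i in start), start)]
    seen = {start}
    out = []
    while heap and len(out) < tcn:
        neg_sum, idx = heapq.heappop(heap)
        out.append(-neg_sum)
        # Children: move one chosen index to the next smaller value.
        for pos in range(t):
            nxt = idx[pos] + 1
            limit = idx[pos + 1] if pos + 1 < t else m
            if nxt < limit:
                child = idx[:pos] + (nxt,) + idx[pos + 1:]
                if child not in seen:
                    seen.add(child)
                    child_sum = -neg_sum - vals[idx[pos]] + vals[nxt]
                    heapq.heappush(heap, (-child_sum, child))
    return out

def simplified_statistic(y, t, d_star, tcn):
    """Assumed statistic restricted to the top tcn combinations."""
    sums = top_combination_sums(y, t, tcn)
    top = sums[0]
    return sum(math.exp(-d_star * (top - s)) for s in sums[1:])

y = [5.0, 3.0, 2.0, 1.0]
print(top_combination_sums(y, t=2, tcn=3))            # [8.0, 7.0, 6.0]
print(simplified_statistic(y, t=2, d_star=1.0, tcn=2))
```

With tcn equal to the total number of combinations the function reproduces the full sum; smaller values drop the terms with the largest gaps to the top combination, which are also the smallest terms.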

SMDP Combined With Regression Model (M.A. Province, 2000, pages 320-321)
Data pairs for a marker: (Z1, X1), (Z2, X2), (Z3, X3), …, (Zh, Xh), (Zh+1, Xh+1), …, (ZN, XN)
Y: sequential sum of squares of regression residuals
Yi,h denotes Y for marker i at stage h
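A sketch of how a per-stage residual sum of squares could be tracked as data pairs arrive. It assumes a simple linear regression of Z on X refitted on the first h pairs at each stage; the choice of Z as the response, the refit-per-stage reading of "sequential", and the function name are assumptions for illustration and may differ from the definition in Province (2000).

```python
def sequential_rss(z, x):
    """Return [Y_1, ..., Y_N]: Y_h is the residual sum of squares of the
    least-squares line z = a + b*x fitted to the first h data pairs."""
    if len(z) != len(x):
        raise ValueError("z and x must have the same length")
    n = sx = sz = sxx = szz = sxz = 0.0
    out = []
    for zi, xi in zip(z, x):
        # Update running sums so that each stage costs O(1).
        n += 1
        sx += xi
        sz += zi
        sxx += xi * xi
        szz += zi * zi
        sxz += xi * zi
        sxx_c = sxx - sx * sx / n      # centered sum of squares of x
        szz_c = szz - sz * sz / n      # centered sum of squares of z
        sxz_c = sxz - sx * sz / n      # centered cross-product
        rss = szz_c - sxz_c * sxz_c / sxx_c if sxx_c > 0 else szz_c
        out.append(max(0.0, rss))      # guard against tiny negative round-off
    return out

# Usage with two hypothetical markers sharing the same response z.
z = [0.0, 1.0, 0.0, 1.0]
markers = {"marker1": [0.0, 1.0, 2.0, 3.0], "marker2": [0.0, 1.0, 0.0, 1.0]}
y = {name: sequential_rss(z, x) for name, x in markers.items()}
print(y["marker2"][-1])  # 0.0, because z equals x exactly for marker2
```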

Combine SMDP With Regression Model (M.A. Province, 2000, page 319)
Case B: the normal density function with unknown variance and known mean

Simplified Stopping Rule (M.A. Province, 2000, pages 321-322)

A Real Data Example (M.A. Province, 2000, page 310)

A Real Data Example (M.A. Province, 2000, page 308)

Simulation Results (1) M.A. Province, 2000, page 312

Simulation Results (2) M.A. Province, 2000, page 313

Simplified SMDP (Bechhofer et al., 1968)
U-S+1 = Top Combination Number (TCN)
How to choose TCN? Balance between computational accuracy and computational time

Data

Relation of W and t (h=50, D*=10); Effective Top Combination Number (ETCN) (Zhang & Province, 2005, page 465)

ETCN Curve (Zhang & Province, 2005, page 466)

t = ? (Zhang & Province, 2005, page 466)

P*=0.95, D*=10, TCN=10000; 72 SNPs P<0.01 (Zhang & Province, 2005, page 467)

SMDP Summary
Advantages:
• Tests and identifies all signals simultaneously: no multiple comparisons
• Uses the "minimal" N needed to find significant signals: efficient
• Tight control of statistical errors (Type I, II): powerful
• Saves the rest of N for validation: reliable
Further studies:
• Computer time
• Extension to more methods/models
• Extension to non-K-D distributions

Thanks !
