Outline • Basic Concepts • Sample Size Calculation • Precision analysis • Power analysis • Power analysis for three most-frequently-used regressions • Logistic Regression • Cox Regression • Linear Regression
Basic Concepts • Two kinds of errors can occur when testing hypotheses. • Type I error: the null hypothesis is rejected when it is true. • Type II error: the null hypothesis is not rejected when it is false.
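Basic Concepts • Probabilities of making type I and II errors: $\alpha = P(\text{type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$ and $\beta = P(\text{type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$. • Significance level: an upper bound for $\alpha$. • Power: the probability of correctly rejecting the null hypothesis when the null hypothesis is false, i.e. $\text{Power} = 1 - \beta$.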
Sample Size Calculation • In practice, sample size may be determined based on either precision analysis or power analysis. The next few slides describe each analysis in detail. Since power analysis is more practical, the discussion focuses on power analysis.
Precision Analysis • For a confidence interval, the precision of the interval depends on its width. The narrower the interval is, the more precise the inference is. Therefore, the precision analysis for sample size determination is to consider the maximum half width of the confidence interval of the unknown parameter that one is willing to accept. The maximum half width of the confidence interval is usually referred to as the maximum error of an estimate of the unknown parameter.
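Precision Analysis • For example, let $x_1, \dots, x_n$ be independent and identically distributed normal random variables with mean $\mu$ and variance $\sigma^2$. When $\sigma^2$ is known, a $(1-\alpha)100\%$ confidence interval for $\mu$ can be obtained as $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, where $z_{\alpha/2}$ is the upper $(\alpha/2)$th percentile of the standard normal distribution. The maximum error, denoted by $E$, is then defined as $E = z_{\alpha/2}\,\sigma/\sqrt{n}$.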
Precision Analysis • Thus, the sample size required to achieve the desired maximum error $E$ can be chosen as $n = \frac{z_{\alpha/2}^2\,\sigma^2}{E^2}$. • An Example: suppose that we wish to have a 95% assurance that the error in the estimated mean is less than 10% of the standard deviation (i.e., $0.1\sigma$). The required sample size is $n = \frac{(1.96)^2\sigma^2}{(0.1\sigma)^2} = 384.2 \approx 385$.
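The precision calculation can be checked numerically. The following is a minimal Python sketch, assuming a known standard deviation and a two-sided normal interval; the function name `precision_sample_size` and its parameter names are illustrative and do not come from the slides.

```python
import math
from statistics import NormalDist


def precision_sample_size(alpha, sigma, max_error):
    """Smallest n for which z_{alpha/2} * sigma / sqrt(n) <= max_error.

    Assumes sigma is known and the interval is a two-sided normal interval.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 percentile
    return math.ceil((z * sigma / max_error) ** 2)


# Example from the slide: 95% assurance, maximum error 0.1 * sigma.
# Only the ratio max_error / sigma matters, so sigma is set to 1 here.
n = precision_sample_size(alpha=0.05, sigma=1.0, max_error=0.1)
print(n)  # 385
```

Rounding up rather than to the nearest integer guarantees that the achieved half width does not exceed the target.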
Power Analysis • Since a type I error is usually considered the more important and serious error to avoid, a typical approach in hypothesis testing is to control $\alpha$ at an acceptable level and try to minimize $\beta$ by choosing an appropriate sample size. In other words, the null hypothesis can be tested at a pre-determined level of significance $\alpha$ with a desired power ($1-\beta$). This concept for determination of sample size is usually referred to as power analysis for sample size determination.
Power Analysis • To determine the sample size by power analysis, the investigator must specify the following information. First, select a significance level, that is, the acceptable chance of wrongly concluding that a difference exists when in fact there is no real difference (type I error). Typically, 0.05 is chosen. Second, select a desired power, that is, the chance of correctly detecting a difference when the difference truly exists. A conventional choice of power is either 90% or 80%.
Power Analysis • Third, specify a clinically meaningful difference, denoted by $\delta$. The smaller $\delta$ is, the larger the required sample size. Finally, the standard deviation $\sigma$ of the primary endpoint considered in the study is also required for sample size determination. A very precise method of measurement (small $\sigma$) permits detection of any given difference with a much smaller sample size than would be required with a less precise measurement.
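Power Analysis • Suppose there are two groups of observations, namely $x_i$, $i = 1, \dots, n_1$ (treatment) and $y_i$, $i = 1, \dots, n_2$ (control). Assume that $x_i$ and $y_i$ are independent and normally distributed with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Suppose the hypotheses of interest are $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$. For illustration purposes, we assume (i) $\sigma_1^2$ and $\sigma_2^2$ are known, and (ii) $n_1 = n_2 = n$. Under these assumptions, a Z-statistic can be used to test the mean difference.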
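Power Analysis • The Z-test is given by $Z = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma_1^2/n + \sigma_2^2/n}}$. Under the null hypothesis of no treatment difference, $Z$ is distributed as $N(0,1)$. Hence, we reject the null hypothesis when $|Z| > z_{\alpha/2}$. Under the alternative hypothesis that $\mu_1 = \mu_2 + \delta$ with $\delta > 0$, $Z$ is distributed as $N(\mu^*, 1)$, where $\mu^* = \frac{\delta}{\sqrt{\sigma_1^2/n + \sigma_2^2/n}}$.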
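Power Analysis • The corresponding power is then given by $P\left(|N(\mu^*,1)| > z_{\alpha/2}\right) \approx P\left(N(\mu^*,1) > z_{\alpha/2}\right) = P\left(N(0,1) > z_{\alpha/2} - \mu^*\right)$. • To achieve the desired power of $(1-\beta)100\%$, we set $z_{\alpha/2} - \mu^* = -z_{\beta}$. • This leads to the required sample size per group $n = \frac{(\sigma_1^2 + \sigma_2^2)(z_{\alpha/2} + z_{\beta})^2}{\delta^2}$.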
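Power Analysis • An example: suppose the objective of the study is to compare a test drug with a control, the standard deviation for the treatment group is 1 ($\sigma_1^2 = 1$), and the standard deviation of the control group is 2 ($\sigma_2^2 = 4$). Then, by choosing $\alpha = 5\%$ and $\beta = 10\%$, we have $n = \frac{(\sigma_1^2 + \sigma_2^2)(z_{\alpha/2} + z_{\beta})^2}{\delta^2} = \frac{(1 + 4)(1.96 + 1.28)^2}{1^2} \approx 53$ per group. • Thus, a total of 106 subjects is required for achieving a 90% power for detection of a clinically meaningful difference of $\delta = 1$ at the 5% level of significance.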
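The two-group example can be reproduced with a short Python sketch. It assumes equal group sizes, known variances, a two-sided test, and a clinically meaningful difference of 1, which is the value that reproduces the stated total of 106 subjects. The function name `two_sample_z_n` is illustrative.

```python
import math
from statistics import NormalDist


def two_sample_z_n(alpha, beta, sigma1, sigma2, delta):
    """Per-group n for a two-sided two-sample Z-test with known variances.

    n = (sigma1^2 + sigma2^2) * (z_{alpha/2} + z_beta)^2 / delta^2
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(1 - beta)
    return (sigma1 ** 2 + sigma2 ** 2) * (z_a + z_b) ** 2 / delta ** 2


# alpha = 5%, power = 90%, standard deviations 1 and 2, delta = 1 (assumed).
n_exact = two_sample_z_n(alpha=0.05, beta=0.10, sigma1=1.0, sigma2=2.0, delta=1.0)
n_per_group = math.ceil(n_exact)
print(n_per_group, 2 * n_per_group)  # 53 106
```

Because the sample size scales with $1/\delta^2$, halving the difference to be detected roughly quadruples the required number of subjects.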
Overview of Programs for Sample Size Calculation • A variety of programs are available for sample size calculation. They differ in cost and in the range of sample size scenarios they cover. For a comprehensive review, see http://www.biostat.ucsf.edu/sampsize.html
Recommendation • Although any of these programs can be used for sample size calculation, PASS is widely regarded as reliable, comprehensive, and well accepted in academic settings, especially for NIH grant submissions. However, PASS is not free: a license costs about $650.
Simple Logistic Regression • Simple logistic regression expresses the relationship between a binary response variable ($Y$) and a covariate ($X$). The simple logistic regression model relates the probability of $Y = 1$ to $X$ by the formula $\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X$, where $P$ is the probability of $Y = 1$ given the value of the covariate $X$.
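Power Analysis for Simple Logistic Regression: Continuous Covariate • Suppose one wants to test the null hypothesis that $B = 1$, where $B$ is the odds ratio comparing the odds at one standard deviation of $X$ above the mean with the odds at the mean of $X$. • Hsieh, Block, and Larsen (1998) gave the following sample size formula when $X$ is normally distributed: $n = \frac{(z_{\alpha/2} + z_{\beta})^2}{p_0(1-p_0)(\log B)^2}$, where $z_{\alpha/2}$ and $z_{\beta}$ are upper percentiles of the standard normal distribution.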
Power Analysis for Simple Logistic Regression: Continuous Covariate • The sample size formula indicates that to determine the required sample size, one needs to know the following factors: • $\alpha$: significance level • $1-\beta$: desired power • $B$: odds ratio to be detected • $p_0$: probability of $Y = 1$ at the mean of the covariate
Power Analysis for Simple Logistic Regression: Continuous Covariate • Example 1 : A study is to be undertaken to study the relationship between post-traumatic stress disorder and heart rate after viewing video tapes containing violent sequences. Heart rate is assumed to be normally distributed. The post-traumatic stress disorder rate is thought to be 7% among the soldiers with mean heart rate. The researchers want a sample size large enough to detect an odds ratio of 1.5 with 90% power at the 0.05 significance level with a two-sided test.
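Power Analysis for Simple Logistic Regression: Continuous Covariate • The example described on the previous slide indicates that $\alpha = 0.05$, $\beta = 0.1$, $B = 1.5$, and $p_0 = 0.07$. • Plugging these values into the sample size formula, we have $n = \frac{(1.96 + 1.282)^2}{0.07 \times 0.93 \times (\log 1.5)^2} \approx 982$.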
Power Analysis for Simple Logistic Regression: Continuous Covariate • The following S-Plus code can be used to carry out the hand calculation on the previous slide:
simple.logistic.regression.continuous <- function(alpha, beta, p0, B) {
  # alpha---significance level
  # beta---one minus power
  # p0---probability at the mean of the covariate
  # B---odds ratio comparing the odds at one standard deviation of the covariate
  #     above the mean with the odds at the mean
  N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 / (p0 * (1 - p0) * (log(B))^2)
  N
}
simple.logistic.regression.continuous(0.05, 0.1, 0.07, 1.5)
Power Analysis for Simple Logistic Regression: Continuous Covariate • Summary statement: A logistic regression of post-traumatic stress disorder on heart rate (assuming a normal distribution) with a sample size of 982 observations achieves 90% power at a 0.05 significance level to detect an odds ratio of 1.5: the ratio of the odds at one standard deviation above the mean of heart rate to the odds at the mean.
Power Analysis for Simple Logistic Regression: Binary Covariate • When $X$ is a binary covariate, one also wants to test the null hypothesis that $B = 1$, where $B$ is the odds ratio comparing the odds at $X = 1$ with the odds at $X = 0$. Notice that the interpretation of $B$ is different from that in the continuous covariate case.
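Power Analysis for Simple Logistic Regression: Binary Covariate • Hsieh, Block, and Larsen (1998) also gave a sample size formula when $X$ is binomially distributed. The sample size formula is $n = \frac{\left[z_{\alpha/2}\sqrt{\bar{p}(1-\bar{p})/R} + z_{\beta}\sqrt{p_0(1-p_0) + p_1(1-p_1)(1-R)/R}\right]^2}{(p_0 - p_1)^2(1-R)}$, where $p_0$ and $p_1$ are the probabilities of $Y = 1$ at $X = 0$ and $X = 1$, $R$ is the proportion of the sample with $X = 1$, and $\bar{p} = (1-R)p_0 + Rp_1$.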
Power Analysis for Simple Logistic Regression: Binary Covariate • The sample size formula indicates that to determine the required sample size, one needs to know the following factors: • $\alpha$: significance level • $1-\beta$: desired power • $B$: odds ratio to be detected • $p_0$: probability of $Y = 1$ at $X = 0$ • $R$: the proportion of the sample with $X = 1$
Power Analysis for Simple Logistic Regression: Binary Covariate • Example 2: A study is to be undertaken to study the relationship between post-traumatic stress disorder and gender. The post-traumatic stress disorder rate is thought to be 10% among males, and the proportion of females is 50%. The researchers want a sample size large enough to detect an odds ratio of 2 with 80% power at the 0.05 significance level with a two-sided test.
Power Analysis for Simple Logistic Regression: Binary Covariate • The example described on the previous slide indicates that $\alpha = 0.05$, $\beta = 0.2$, $B = 2$, $p_0 = 0.1$, and $R = 0.5$. • To apply the sample size formula, we still need to calculate $p_1$ and $\bar{p}$. They can be obtained by $p_1 = \frac{Bp_0}{1 - p_0 + Bp_0} = \frac{0.2}{1.1} \approx 0.1818$ and $\bar{p} = (1-R)p_0 + Rp_1 \approx 0.1409$.
Power Analysis for Simple Logistic Regression: Binary Covariate • Plugging these values into the sample size formula, we have $n = \frac{\left[1.96\sqrt{0.1409 \times 0.8591/0.5} + 0.8416\sqrt{0.1 \times 0.9 + 0.1818 \times 0.8182 \times 0.5/0.5}\right]^2}{(0.1 - 0.1818)^2 \times 0.5} \approx 565$.
Power Analysis for Simple Logistic Regression: Binary Covariate • The following S-Plus code can be used to carry out the hand calculation on the previous slide:
simple.logistic.regression.binary <- function(alpha, beta, p0, B, R) {
  # alpha---significance level
  # beta---one minus power
  # p0---probability of Y=1 at x1=0
  # B---odds ratio to be detected
  # R---the proportion of the sample with x1=1
  p1 <- B * p0 / (1 - p0 + B * p0)   # probability of Y=1 at x1=1
  pbar <- (1 - R) * p0 + R * p1      # overall probability of Y=1
  temp1 <- pbar * (1 - pbar) / R
  temp2 <- p0 * (1 - p0) + p1 * (1 - p1) * (1 - R) / R
  temp3 <- (p1 - p0)^2 * (1 - R)
  N <- (qnorm(1 - alpha/2) * sqrt(temp1) + qnorm(1 - beta) * sqrt(temp2))^2 / temp3
  N
}
simple.logistic.regression.binary(0.05, 0.2, 0.1, 2, 0.5)
Power Analysis for Simple Logistic Regression: Binary Covariate • Summary statement: A logistic regression of post-traumatic stress disorder on gender with a sample size of 565 observations achieves 80% power at a 0.05 significance level to detect an odds ratio of 2 (the ratio of the odds for females to the odds for males), assuming a 10% rate among males and a sample that is 50% female.
Multiple Logistic Regression • Multiple logistic regression expresses the relationship between a binary response variable, $Y$, and two or more covariates, $X_1, \dots, X_p$. The multiple logistic regression model relates the probability of $Y = 1$ to $X_1, \dots, X_p$ by the formula $\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$, where $P$ is the probability of $Y = 1$ given the values of the covariates.
Power Analysis for Multiple Logistic Regression • When there are multiple covariates, the following adjustment was given by Hsieh, Block, and Larsen (1998) to give the adjusted sample size: $n_p = \frac{n_1}{1 - \rho^2}$, • where $n_1$ is the sample size resulting from the simple logistic regression with $X_1$ (the variable of interest) being the covariate, $\rho$ is the multiple correlation coefficient between $X_1$ and the remaining covariates, and $\rho^2$ is equal to the proportion of the variance of $X_1$ explained by the remaining covariates.
Power Analysis for Multiple Logistic Regression • Example 3: A study is to be undertaken to study the relationship between post-traumatic stress disorder and heart rate after viewing video tapes containing violent sequences. Heart rate is assumed to be normally distributed. The post-traumatic stress disorder rate is thought to be 7% among the soldiers with mean heart rate. In addition to heart rate, two more covariates, gender and age, are to be included in the model. The multiple correlation of heart rate with gender and age is 0.2. The researchers want a sample size large enough to detect an odds ratio of 1.5 with 90% power at the 0.05 significance level with a two-sided test.
Power Analysis for Multiple Logistic Regression • From Example 1, when heart rate is the only covariate in the model, the required sample size is $n_1 = 982$. Thus the adjusted sample size, with two more covariates added to the model whose multiple correlation with heart rate is 0.2, becomes $n_p = \frac{982}{1 - 0.2^2} = \frac{982}{0.96} \approx 1023$.
Power Analysis for Multiple Logistic Regression • Summary statement: A multiple logistic regression of post-traumatic stress disorder on heart rate, gender and age with a sample size of 1023 observations achieves 90% power at a 0.05 significance level to detect an odds ratio of 1.5: the ratio of the odds at one standard deviation above the mean of heart rate to the odds at the mean of heart rate, controlling for gender and age.
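The covariate adjustment in Example 3 is a one-line inflation of the simple-regression sample size. A minimal Python sketch follows; the function name `adjust_for_covariates` is illustrative, and the inputs are the 982 observations from Example 1 and the multiple correlation of 0.2.

```python
import math


def adjust_for_covariates(n_simple, rho):
    """Inflate a simple-regression sample size by 1 / (1 - rho^2).

    rho is the multiple correlation between the covariate of interest
    and the remaining covariates in the model.
    """
    if not 0 <= rho < 1:
        raise ValueError("rho must lie in [0, 1)")
    return n_simple / (1 - rho ** 2)


n_adjusted = math.ceil(adjust_for_covariates(982, 0.2))
print(n_adjusted)  # 1023
```

With no correlation ($\rho = 0$) the adjustment leaves the sample size unchanged, and the inflation grows quickly as $\rho$ approaches 1.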
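Cox Regression • Cox proportional hazards regression models the relationship between the hazard function of survival time and $k$ covariates $X_1, \dots, X_k$ using the following formula: $h(t \mid X_1, \dots, X_k) = h_0(t)\exp(\beta_1 X_1 + \dots + \beta_k X_k)$, where $h_0(t)$ is the baseline hazard.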
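Power Analysis for Simple Cox Regression: Continuous Covariate • Suppose one wants to test the null hypothesis $\Delta = 1$, where $\Delta$ is the hazard ratio: the ratio of the hazard rate after a one-unit increase in $X_1$ to the hazard rate before the increase. • Hsieh and Lavori (2000) gave the following sample size formula when $X_1$ is normally distributed: $n = \frac{(z_{\alpha/2} + z_{\beta})^2}{d\,\sigma^2(\log\Delta)^2}$, where $d$ is the proportion of subjects that become incident cases and $\sigma^2$ is the variance of $X_1$.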
Power Analysis for Simple Cox Regression: Continuous Covariate • The sample size formula indicates that to determine the required sample size, one needs to know the following factors: • $\alpha$: significance level • $1-\beta$: desired power • $\Delta$: hazard ratio to be detected • $d$: the proportion of subjects that become incident cases • $\sigma^2$: the variance of $X_1$
Power Analysis for Simple Cox Regression: Continuous Covariate • Example 4: Compute the sample size required to achieve 80% power to detect a hazard ratio of 1.5 for a continuous covariate of interest with standard deviation 0.3, assuming that only 85% of subjects survive until the end of the study (that is, 15% of subjects become incident cases).
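Example 4 can be worked through with a short Python sketch. It assumes the Hsieh and Lavori (2000) form $n = (z_{\alpha/2} + z_{\beta})^2 / (d\,\sigma^2(\log\Delta)^2)$ with the hazard ratio expressed per unit of the covariate, and a two-sided significance level of 0.05, which the example does not state. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist


def cox_continuous_n(alpha, power, hazard_ratio, sd, event_prob):
    """Total sample size for a simple Cox model with a continuous covariate.

    Required events = (z_{alpha/2} + z_beta)^2 / (sd^2 * log(HR)^2);
    dividing by the event probability converts events to subjects.
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    events = (z_a + z_b) ** 2 / (sd ** 2 * math.log(hazard_ratio) ** 2)
    return events / event_prob


# 85% of subjects survive, so 15% become incident cases.
# alpha = 0.05 (two-sided) is an assumption, not part of the example.
n_cont = cox_continuous_n(alpha=0.05, power=0.80, hazard_ratio=1.5,
                          sd=0.3, event_prob=0.15)
print(math.ceil(n_cont))
```

The formula first gives the number of events needed. The low event proportion is what drives the total sample size up.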
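Power Analysis for Simple Cox Regression: Binary Covariate • When $X_1$ is a binary covariate, one also wants to test the null hypothesis $\Delta = 1$, where $\Delta$ is the hazard ratio: the ratio of the hazard at $X_1 = 1$ to the hazard at $X_1 = 0$. • Schoenfeld (1983) gave the following sample size formula when $X_1$ is binomially distributed: $n = \frac{(z_{\alpha/2} + z_{\beta})^2}{d\,p(1-p)(\log\Delta)^2}$, where $d$ is the proportion of subjects that become incident cases and $p$ is the proportion of the sample with $X_1 = 1$.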
Power Analysis for Simple Cox Regression: Binary Covariate • The sample size formula indicates that to determine the required sample size, one needs to know the following factors: • $\alpha$: significance level • $1-\beta$: desired power • $\Delta$: hazard ratio to be detected • $d$: the proportion of subjects that become incident cases • $p$: the proportion of the sample with $X_1 = 1$
Power Analysis for Simple Cox Regression: Binary Covariate • Example 5: Compute the sample size required to achieve 80% power to detect a hazard ratio of 1.5 for a binary covariate of interest with an exposure rate of 0.2, assuming that only 85% of subjects survive until the end of the study (that is, 15% of subjects become incident cases).
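Example 5 can be sketched the same way. The code below assumes the Schoenfeld (1983) form $n = (z_{\alpha/2} + z_{\beta})^2 / (d\,p(1-p)(\log\Delta)^2)$ and a two-sided significance level of 0.05, which the example does not state. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist


def cox_binary_n(alpha, power, hazard_ratio, exposed_prop, event_prob):
    """Total sample size for a simple Cox model with a binary covariate.

    Required events = (z_{alpha/2} + z_beta)^2 / (p * (1 - p) * log(HR)^2),
    where p is the proportion of subjects with the covariate equal to 1.
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p = exposed_prop
    events = (z_a + z_b) ** 2 / (p * (1 - p) * math.log(hazard_ratio) ** 2)
    return events / event_prob


# Exposure rate 0.2; 85% survive, so 15% become incident cases.
# alpha = 0.05 (two-sided) is an assumption, not part of the example.
n_bin = cox_binary_n(alpha=0.05, power=0.80, hazard_ratio=1.5,
                     exposed_prop=0.2, event_prob=0.15)
print(math.ceil(n_bin))
```

The term $p(1-p)$ is largest at $p = 0.5$, so a balanced exposure split needs the fewest subjects for a given hazard ratio.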
Power Analysis for Multiple Cox Regression • When there are multiple covariates, the following adjustment was given by Hsieh and Lavori (2000) to give the adjusted sample size: $n_p = \frac{n_1}{1 - \rho^2}$, • where $n_1$ is the sample size resulting from the simple Cox regression with $X_1$ (the variable of interest) being the covariate, $\rho$ is the multiple correlation coefficient between $X_1$ and the remaining covariates, and $\rho^2$ is equal to the proportion of the variance of $X_1$ explained by the remaining covariates.
Power Analysis for Multiple Cox Regression • From Example 4, let $n_1$ denote the required sample size when only the continuous covariate is in the model. With two more covariates added to the model whose multiple correlation with the covariate of interest is 0.2, the adjusted sample size becomes $n_p = \frac{n_1}{1 - 0.2^2} = \frac{n_1}{0.96}$.
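The adjusted calculation can be sketched by chaining the simple Cox formula with the inflation factor. As before, the two-sided significance level of 0.05 is an assumption that Example 4 does not state, and the function name `cox_continuous_adjusted_n` is illustrative.

```python
import math
from statistics import NormalDist


def cox_continuous_adjusted_n(alpha, power, hazard_ratio, sd, event_prob, rho):
    """Cox sample size for a continuous covariate, adjusted for other covariates.

    Simple-model n = (z_{alpha/2} + z_beta)^2 / (event_prob * sd^2 * log(HR)^2),
    then inflated by 1 / (1 - rho^2) for correlation with other covariates.
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    n_simple = (z_a + z_b) ** 2 / (
        event_prob * sd ** 2 * math.log(hazard_ratio) ** 2
    )
    return n_simple / (1 - rho ** 2)


# Example 4 inputs plus a multiple correlation of 0.2; alpha = 0.05 is assumed.
n_adj = cox_continuous_adjusted_n(alpha=0.05, power=0.80, hazard_ratio=1.5,
                                  sd=0.3, event_prob=0.15, rho=0.2)
print(math.ceil(n_adj))
```

A multiple correlation of 0.2 inflates the sample size by only about 4%, so the adjustment matters mainly when the covariate of interest is strongly related to the others.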
Linear Regression • Linear regression expresses the relationship between a continuous response variable, $Y$, and one or more covariates, $X_1, \dots, X_k$. The multiple linear regression model relates $Y$ to $X_1, \dots, X_k$ by the formula $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$, where $\varepsilon$ is a normally distributed random variable with mean 0 and variance $\sigma^2$.
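Power Analysis for Linear Regression • Suppose one wants to test the null hypothesis $H_0: R^2_{T|C} = 0$ (no increase in $R^2$ from the variables in T after adjusting for those in C), where C refers to the variables controlled for and T refers to the variables tested. • Let $R^2_C$ be the $R^2$ achieved when $Y$ is regressed on the variables in set C, and $R^2_{T,C}$ the $R^2$ achieved when $Y$ is regressed on the variables in both sets T and C.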
Power Analysis for Linear Regression • The formula for computing the power is $\text{Power} = P(F > F_{1-\alpha,u,v})$, where (1) $F_{1-\alpha,u,v}$ is the $100(1-\alpha)$th percentile of the central F distribution with $u$ and $v$ degrees of freedom. The value of $u$ is the number of variables in T, $v = n - u - k - 1$, and $k$ is the number of variables in C, and (2) $F$ is distributed as a non-central F with $u$ and $v$ degrees of freedom and non-centrality parameter $\lambda$. The value of