IE241: Introduction to Hypothesis Testing
Topics
Hypothesis testing
  Light bulb example
  Null and alternative hypotheses
  Two types of error
  Decision rule
    test statistic
    critical region
    Power of the test
Simple hypothesis testing
  Neyman-Pearson lemma
  example
Composite hypothesis testing
  example
  Likelihood ratio test
    relationship to mean
Examples of 1-sided composite hypotheses
  drug to help sleep
  civil service exam
    difference between two proportions
    effect of size of n
  railroad ties
  fertilizer to improve yield of corn
    test of two variances
    F distribution
Tests of correlated means
Bayes’ likelihood ratio test
  example
Chi-square tests
  goodness of fit
  independence in contingency tables
  testing sample vs hypothesized variance
Significance testing
We said before that estimation of parameters was one of the two major areas of statistics. Now let’s turn to the second major area of statistics, hypothesis testing. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the hypothesis. What is a statistical hypothesis? A statistical hypothesis is an assumption about f(X) if X is continuous or p(X) if X is discrete.
Let’s look at an example. A buyer of light bulbs bought 50 bulbs of each of two brands. When he tested them, Brand A had an average life of 1208 hours with a standard deviation of 94 hours. Brand B had a mean life of 1282 hours with a standard deviation of 80 hours. Are brands A and B really different in quality?
We set up two hypotheses. The first, called the null hypothesis Ho, is the hypothesis of no difference.
Ho: μA = μB
The second, called the alternative hypothesis Ha, is the hypothesis that there is a difference.
Ha: μA ≠ μB
On the basis of the sample of 50 from each of the two populations of light bulbs, we shall either reject or not reject the hypothesis of no difference. In statistics, we always test the null hypothesis. The alternative hypothesis is the default winner if the null hypothesis is rejected.
We never really accept the null hypothesis; we simply fail to reject it on the basis of the evidence in hand. Now we need a procedure to test the null hypothesis. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the null hypothesis. There are two possible decisions, reject or not reject. This means there are also two kinds of error we could make.
If we reject Ho when Ho is in fact true, then we make a type 1 error. The probability of a type 1 error is α. If we do not reject Ho when Ho is really false, then we make a type 2 error. The probability of a type 2 error is β.
Now we need a decision rule that will make the probability of the two types of error very small. The problem is that the rule cannot make both of them small simultaneously. The one type of error the experimenter has under his control is α error. He can choose the size of α. Because in science we have to take the conservative route and never claim that we have found a new result unless we are really convinced that it is true, we choose a very small α, the probability of type 1 error.
Then among all possible decision rules given α, we choose the one that makes β as small as possible. The decision rule consists of a test statistic and a critical region where the test statistic may fall. For means from a normal population, the test statistic is $t = \dfrac{\bar{x}_B - \bar{x}_A}{\sqrt{s_A^2/n_A + s_B^2/n_B}}$, where the denominator is the estimated standard deviation of the difference between two independent means.
The critical region is a tail of the distribution of the test statistic. If the test statistic falls in the critical region, Ho is rejected. Now, how much of the tail should be in the critical region? That depends on just how small you want α to be. The usual choice is α = .05, but in some very critical cases, α is set at .01. Here we have just a non-critical choice of light bulbs, so we’ll choose α = .05. This means that the critical region has probability = .025 in each tail of the t distribution.
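For a t distribution with .025 in each tail, the critical value of t is approximately 1.96, the same as z because the sample size is greater than 30. The critical region then is |t| > 1.96. In our light bulb example, the test statistic is $t = \dfrac{1282 - 1208}{\sqrt{94^2/50 + 80^2/50}} = \dfrac{74}{\sqrt{304.72}} \approx \dfrac{74}{17.5} \approx 4.23$.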
Now 4.23 is much greater than 1.96 so we reject the null hypothesis of no difference and declare that the average life of the B bulbs is longer than that of the A bulbs. Because α = .05, we have 95% confidence in the decision we made.
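The light bulb calculation can be reproduced from the summary figures alone. The following is a minimal sketch, assuming the test statistic is the difference of the two sample means divided by the unpooled standard error; the function name `two_sample_stat` is only illustrative.

```python
import math

def two_sample_stat(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference of two independent sample means divided by its standard error."""
    # Standard error of the difference, using each sample's own variance (assumption).
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / se

# Summary statistics from the light bulb example (Brand A, then Brand B).
t_bulbs = two_sample_stat(1208, 94, 50, 1282, 80, 50)

# Two-tailed critical region at alpha = .05 for large samples: |t| > 1.96.
reject = abs(t_bulbs) > 1.96
print(round(t_bulbs, 2), reject)
```

Carried to full precision the statistic is about 4.24; the 4.23 quoted above appears to come from rounding the standard error before dividing. The decision is the same either way.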
We cannot say that there is a 95% probability that we are right because we are either right or wrong and we don’t know which. But there is such a small probability that t will land in the critical region if Ho is true that if it does get there, we choose to believe that Ho is not true. If we had chosen α = .01, the critical value of t would be 2.58 and because 4.23 is greater than 2.58, we would still reject Ho. This time it would be with 99% confidence.
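The critical values 1.96 and 2.58 can be checked against the standard normal distribution, which is what the t distribution approaches for large samples. This is a small sketch using Python's standard library; the helper name `two_tailed_critical` is an assumption for illustration.

```python
from statistics import NormalDist

def two_tailed_critical(alpha):
    """Value z such that P(|Z| > z) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha / 2)

z_05 = two_tailed_critical(0.05)  # cutoff with .025 in each tail
z_01 = two_tailed_critical(0.01)  # cutoff with .005 in each tail
print(round(z_05, 2), round(z_01, 2))
```

With 50 bulbs per brand the exact t cutoffs would be slightly larger than these normal values, but not by enough to change the decision here.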
How do we know that the test we used is the best test possible? We have controlled the probability of Type 1 error. But what is the probability of Type 2 error in this test? Does this test minimize it subject to the value of α?
To answer this question, we need to consider the concept of test power. The power of a statistical test is the probability of rejecting Ho when Ho is really false. Thus power = 1-β. Clearly if the test maximizes power, it minimizes the probability of Type 2 error β. If a test maximizes power for given α, it is called an admissible testing strategy.
Before going further, we need to distinguish between two types of hypotheses. A simple hypothesis is one where the value of the parameter under Ho is a specified constant and the value of the parameter under Ha is a different specified constant. For example, if you test Ho: μ = 0 vs Ha: μ = 10 then you have a simple hypothesis test. Here you have a particular value for Ho and a different particular value for Ha.
For testing one simple hypothesis Ha against the simple hypothesis Ho, a ground-breaking result called the Neyman-Pearson lemma provides the most powerful test. It is based on $\lambda = \dfrac{L(x_1, \dots, x_n; \theta_a)}{L(x_1, \dots, x_n; \theta_o)}$. λ is a likelihood ratio with the likelihood under the Ha parameter value in the numerator and the likelihood under the Ho parameter value in the denominator. Clearly, any value of λ > 1 would favor the alternative hypothesis, while values less than 1 would favor the null hypothesis.
Basically, the lemma says that if there exists a critical region A of size α and a constant k such that the likelihood ratio $\lambda = L_a/L_o \ge k$ inside A and $\lambda \le k$ outside A, then A is a best (most powerful) critical region of size α.
Consider the following example of a test of two simple hypotheses. A coin is either fair or has p(H) = 2/3. Under Ho, P(H) = ½ and under Ha, P(H) = 2/3. The coin will be tossed 3 times and a decision will be made between the two hypotheses. Thus X = number of heads = 0, 1, 2, or 3. Now let’s look at how the decision will be made.
First, let’s look at the probability of Type 1 error α. In the table below, Ho ⇒ P(H) = 1/2 and Ha ⇒ P(H) = 2/3.

X            0      1      2      3
P(X | Ho)    1/8    3/8    3/8    1/8
P(X | Ha)    1/27   6/27   12/27  8/27

Now what should the critical region be?
Under Ho, if X = 0, α = 1/8. Under Ho, if X = 3, α = 1/8. So if either of these two values is chosen as the critical region, the probability of Type 1 error would be the same. Now what if Ha is true? If X = 0 is chosen as the critical region, the value of β = 26/27 because that is the probability that X ≠ 0. On the other hand, if X = 3 is chosen as the critical region, the value of β = 19/27 because that is the probability that X ≠ 3. Clearly, the better choice for the critical region is X=3 because that is the region that minimizes β for fixed α. So this critical region provides the more powerful test.
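The α and β values for the coin example follow directly from the binomial distribution with n = 3. The sketch below uses exact fractions and also computes the likelihood ratio P(X | Ha) / P(X | Ho) for each outcome; the function names are illustrative, not part of the original slides.

```python
from fractions import Fraction
from math import comb

N_TOSSES = 3
P_NULL = Fraction(1, 2)  # P(H) under Ho
P_ALT = Fraction(2, 3)   # P(H) under Ha

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(H) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def error_rates(region):
    """Return (alpha, beta) when Ho is rejected for X in `region`."""
    alpha = sum(binom_pmf(k, N_TOSSES, P_NULL) for k in region)
    beta = 1 - sum(binom_pmf(k, N_TOSSES, P_ALT) for k in region)
    return alpha, beta

def likelihood_ratio(k):
    """Likelihood under Ha divided by likelihood under Ho for X = k."""
    return binom_pmf(k, N_TOSSES, P_ALT) / binom_pmf(k, N_TOSSES, P_NULL)

for region in ([0], [3]):
    print(region, error_rates(region))
print([likelihood_ratio(k) for k in range(N_TOSSES + 1)])
```

The likelihood ratio increases with the number of heads, so among single-outcome regions of size 1/8 the one with the largest ratio is X = 3, which is the region chosen above.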
In discrete variable problems like this, it may not be possible to choose a critical region of the desired α. In this illustration, you simply cannot find a critical region where α = .05 or .01. This is seldom a problem in real-life experimentation because n is usually sufficiently large so that there is a wide variety of choices for critical regions.
This problem, which illustrates the general method for selecting the best test, was easy to discuss because there was only a single alternative to Ho. Most problems involve more than a single alternative. Such hypotheses are called composite hypotheses.
Examples of composite hypotheses: Ho: μ = 0 vs Ha: μ ≠ 0 which is a two-sided Ha. A one-sided Ha can be written as Ho: μ = 0 vs Ha: μ > 0 or Ho: μ = 0 vs Ha: μ < 0 All of these hypotheses are composite because they include more than one value for Ha. And unfortunately, the size of β here depends on the particular alternative value of μ being considered.
In the composite case, it is necessary to compare Type 2 errors for all possible alternative values under Ha. So now the size of Type 2 error is a function of the alternative parameter value θ. So β(θ) is the probability that the sample point will fall in the noncritical region when θ is the true value of the parameter.
Because it is more convenient to work with the critical region, the power function 1-β(θ) is usually used. The power function is the probability that the sample point will fall in the critical region when θ is the true value of the parameter. As an illustration of these points, consider the following continuous example.
Let X = the time that elapses between two successive trippings of a Geiger counter in studying cosmic radiation. The density function is $f(x;\theta) = \theta e^{-\theta x}$, where θ is a parameter which depends on experimental conditions. Under Ho, θ = 2. Now a physicist believes that θ < 2. So under Ha, θ < 2.
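Now one choice for the critical region is the right tail of the distribution, X ≥ 1. Another choice is the left tail, X ≤ .07. Both have α of about .135 (the left-tail cutoff .07 is rounded). That is, $\alpha = \int_1^\infty 2e^{-2x}\,dx = e^{-2} \approx .135$ for the right tail and $\alpha = \int_0^{.07} 2e^{-2x}\,dx = 1 - e^{-.14} \approx .13$ for the left tail. Now let’s examine the power for the two competing critical regions.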
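For the right-tail critical region X > 1, $1 - \beta(\theta) = \int_1^\infty \theta e^{-\theta x}\,dx = e^{-\theta}$, and for the left-tail critical region X < .07, $1 - \beta(\theta) = \int_0^{.07} \theta e^{-\theta x}\,dx = 1 - e^{-.07\theta}$. The graphs of these two functions are called the power curves for the two critical regions.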
[Figure: power curves of the two critical regions, plotted against θ.] Note that the power function for the X > 1 region is always higher than the power function for the X < .07 region before they cross near θ = 2. Since the alternative θ values in the problem are all θ < 2, clearly the right-tail critical region X > 1 is more powerful than the left-tail region.
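The comparison of the two power curves can be checked numerically. This is a sketch assuming the exponential density θe^(−θx), for which the probability of X > 1 is e^(−θ) and the probability of X < .07 is 1 − e^(−.07θ); the function names are illustrative.

```python
import math

def power_right(theta, cutoff=1.0):
    """P(X > cutoff) when X is exponential with rate theta."""
    return math.exp(-theta * cutoff)

def power_left(theta, cutoff=0.07):
    """P(X < cutoff) when X is exponential with rate theta."""
    return 1.0 - math.exp(-theta * cutoff)

# Tabulate both power functions over alternatives theta < 2 and at theta = 2.
for theta in (0.25, 0.5, 1.0, 1.5, 2.0):
    print(theta, round(power_right(theta), 3), round(power_left(theta), 3))
```

At θ = 2 both values are the size of the test (about .135 and .13, the small gap coming from the rounded cutoff .07). For smaller θ the right-tail region has much the higher power.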
What we just saw was a 1-sided composite alternative hypothesis test. Unfortunately, with two-sided composite alternative hypotheses, there is no best test that covers all alternative values. Clearly, if the alternative were θa < θo , the left tail would be best, and if the alternative were θa > θo , the right tail would be best. This shows that best critical regions exist only if the alternative hypothesis is suitably restricted.
So for composite hypotheses, a new principle needs to be introduced to find a good test. This principle is called a likelihood ratio test. It is based on $\lambda = \dfrac{L(\hat{\theta}_o)}{L(\hat{\theta})}$, where the denominator is the maximum of the likelihood function with respect to all the parameters, and the numerator is the maximum of the likelihood function after some or all of the parameters have been restricted by Ho.
Consequently, the numerator can never exceed the denominator, so λ can assume values only between 0 and 1. A value of λ close to 1 lends support to Ho because then it is clear that allowing the parameters to assume values other than those possible under Ho would not increase the likelihood of the sample values very much, if at all. If, however, λ is close to 0, then the probability of the sample values of X is very low under Ho, and Ho is therefore not supported by the data.
Because increasing values of λ correspond to increasing degrees of belief in Ho, λ may serve as a statistic for testing Ho, with small values leading to rejection of Ho. Now the MLEs are functions of the values of the random variable X, so λ is also a function of these values of X and is therefore an observable random variable. λ is often related to a statistic, such as the sample mean $\bar{x}$, whose distribution is known, so it is not necessary to find the distribution of λ.
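Suppose we have a normal population with σ = 1 and we are interested in testing whether the mean = μo. That is, $H_o: \mu = \mu_o$ vs $H_a: \mu \ne \mu_o$. Let’s see how we would construct a likelihood ratio test.
In this case, $L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-(x_i - \mu)^2/2} = (2\pi)^{-n/2} e^{-\frac{1}{2}\sum (x_i - \mu)^2}$. Since maximizing L(μ) is equivalent to maximizing log L(μ), we set $\frac{d \log L(\mu)}{d\mu} = \sum (x_i - \mu) = 0$, so $\hat{\mu} = \bar{x}$ and therefore $L(\hat{\mu}) = (2\pi)^{-n/2} e^{-\frac{1}{2}\sum (x_i - \bar{x})^2}$.
Under Ho, there are no parameters to be estimated, so $L(\mu_o) = (2\pi)^{-n/2} e^{-\frac{1}{2}\sum (x_i - \mu_o)^2}$ and λ then is $\lambda = \dfrac{L(\mu_o)}{L(\hat{\mu})} = e^{-\frac{1}{2}\left[\sum (x_i - \mu_o)^2 - \sum (x_i - \bar{x})^2\right]} = e^{-\frac{n}{2}(\bar{x} - \mu_o)^2}$.
This expression shows a relationship between λ and $\bar{x}$, such that for each value of λ, there are two critical values of $\bar{x}$, which are symmetrical with respect to $\bar{x} = \mu_o$. So the 5% critical region for λ corresponds to the two 2.5% tails of the normal distribution given by $|\bar{x} - \mu_o| \ge 1.96/\sqrt{n}$. Thus the likelihood ratio test is identical to the t test and serves as a compromise test when no best test is available.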
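The link between λ and the sample mean can be verified numerically. The sketch below assumes a normal population with σ = 1, computes λ directly from the two log-likelihoods, and compares it with exp(−z²/2), where z = √n(x̄ − μo). The sample values are hypothetical and serve only to exercise the code.

```python
import math

def likelihood_ratio(sample, mu0):
    """Return (lambda, z) for testing mean = mu0 in a normal population with sigma = 1."""
    n = len(sample)
    xbar = sum(sample) / n

    def log_lik(mu):
        # Log-likelihood of the sample for a normal density with unit variance.
        return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in sample)

    # Numerator: likelihood under Ho. Denominator: likelihood at the MLE, xbar.
    lam = math.exp(log_lik(mu0) - log_lik(xbar))
    z = math.sqrt(n) * (xbar - mu0)
    return lam, z

sample = [0.3, 1.9, 0.8, 2.4, 1.1, 1.6]  # hypothetical data, not from the slides
lam, z = likelihood_ratio(sample, 0.0)

# Rejecting for small lambda is the same as rejecting for large |z|.
lam_cutoff = math.exp(-1.96 ** 2 / 2)
print(lam, z, lam <= lam_cutoff, abs(z) >= 1.96)
```

Because λ is a decreasing function of |z|, the two rejection rules printed at the end always agree.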
It is because of the concept of power that we simply fail to reject the null hypothesis and do not accept it when the test value does not fall into the rejection region. The reason is that if we had a more powerful test, we might have been able to reject Ho. Now let’s look at some examples.
As an example of a one-sided composite hypothesis test, suppose a new drug is available which claims to produce additional sleep. The drug is tested on 10 patients with the results shown. We are testing the hypothesis Ho: μ = 0 vs Ha: μ > 0
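The mean hours gained = 1.24 and s = 1.45. So the t statistic is $t = \dfrac{\bar{x} - 0}{s/\sqrt{n}} = \dfrac{1.24}{1.45/\sqrt{10}} \approx 2.70$, which has 9 df. For df = 9 and α = .05 in this one-tailed test, the required t = 1.833 (the value 2.262 is the two-tailed critical value). Since our obtained t is greater than the required t under either cutoff, we can, with 95% confidence, reject Ho. So in this case, even with only 10 patients, we can endorse the drug for obtaining longer sleep.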
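The sleep-drug statistic can be recomputed from the quoted mean and standard deviation. This is a minimal sketch; the critical values 1.833 (one-tailed) and 2.262 (two-tailed) for 9 df at α = .05 are taken from standard t tables, and the function name is illustrative.

```python
import math

def one_sample_t(mean, sd, n, mu0=0.0):
    """t statistic for testing a population mean of mu0, with n - 1 df."""
    return (mean - mu0) / (sd / math.sqrt(n))

t_sleep = one_sample_t(1.24, 1.45, 10)

# Table values for 9 df at alpha = .05 (assumed from standard t tables).
T_ONE_TAILED = 1.833
T_TWO_TAILED = 2.262
print(round(t_sleep, 2), t_sleep > T_ONE_TAILED, t_sleep > T_TWO_TAILED)
```

The statistic exceeds both cutoffs, so Ho is rejected whichever table value is used.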
Now let’s take a second example. A civil service exam is given to a group of 200 candidates. Based on their total scores, the 200 candidates are divided into two groups, the top 30% and the bottom 70%. Now consider the first question in the examination. In the upper 30% group, 40 had the right answer. In the lower 70% group, 80 had the right answer. Is the question a good discriminator between the top scorers and the lower scorers?
To answer this question, we first set up the two hypotheses. In this case, the null hypothesis is Ho: pu = pl and the alternative is Ha: pu > pl because we would expect the upper group to do better than the lower group on all questions.
In binomial situations, we must deal with proportions instead of counts unless the two sample sizes are the same. The proportion of successes p = x/n may be assumed to be normally distributed with mean p and variance pq/n if n is large.
Then the difference between two sample proportions may also be approximately normally distributed if n is large. In this situation, $\mu_{p_1 - p_2} = p_1 - p_2$ and $\sigma_{p_1 - p_2} = \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$. Just as for the binomial distribution, the normal approximation will be satisfactory if each $n_i p_i$ exceeds 5 when $p_i \le ½$ and $n_i q_i$ exceeds 5 when $p_i > ½$.
The test statistic is $t = \dfrac{\hat{p}_u - \hat{p}_l}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_u} + \dfrac{1}{n_l}\right)}}$. We need the common estimate of p under Ho to use in the denominator, so we use the estimate for the entire group. So p = 120/200 = 3/5 = .6 and q = .4. The p for the upper group = 40/60 ≈ .67. The p for the lower group = 80/140 ≈ .57.
So inserting our values into the test statistic, we get $t = \dfrac{.67 - .57}{\sqrt{(.6)(.4)\left(\dfrac{1}{60} + \dfrac{1}{140}\right)}} = \dfrac{.10}{.0756} \approx 1.32$. Our critical region is t > 1.65 because we have set α = .05 in this 1-tailed test. Because of the large sample size, t.95 = z.95.
Because the obtained t = 1.32 is lower than the required t = 1.65, we cannot reject the null hypothesis. So, given the data, we cannot conclude that the first question is a good one for distinguishing between the upper scorers and the lower scorers on the entire test.
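The exam-question test can be recomputed from the raw counts. The sketch below assumes the pooled two-proportion statistic, with the common estimate of p taken from the whole group of 200; the function name is illustrative.

```python
import math

def pooled_two_proportion_stat(x1, n1, x2, n2):
    """Difference of two sample proportions over its standard error under Ho: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # common estimate of p under Ho
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Upper group: 40 correct of 60. Lower group: 80 correct of 140.
z_exam = pooled_two_proportion_stat(40, 60, 80, 140)

# One-tailed critical value at alpha = .05 for large samples.
reject = z_exam > 1.645
print(round(z_exam, 2), reject)
```

With unrounded proportions the statistic is about 1.26; the 1.32 quoted above results from rounding the two proportions to .67 and .57 before subtracting. Either value falls short of the critical value, so the decision not to reject Ho is unchanged.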