Applied Quantitative Methods, MBA course Montenegro. Peter Balogh PhD, baloghp@agr.unideb.hu. Statistical inference.
Introduction • So far, most of this book has been about describing situations. • This is useful and helps in communication, but would not justify doing a whole book or course. • In this part we are going to the next step and looking at ways of extending or generalizing our results so that they not only apply to the group of people or set of objects which we have measured, but also to the whole population. • As we saw in Part 1, most of the time we are only actually examining a sample, and not the whole population. • Although we will take as much care as possible to ensure that this sample is representative of the population, there may be times when it cannot represent everything about the whole group. • The exact results which we get from a sample will depend on chance, since the actual individuals chosen to take part in a survey may well be chosen by random sampling (where every person has a particular probability of being selected).
Introduction • We need to distinguish between values obtained from a sample, and thus subject to chance, and those calculated from the whole population, which will not be subject to this effect. • We will need to also distinguish between those true population values that we have calculated, and those that we can estimate from our sample results. • Some samples may be 'better' than others and we need some method of determining this. • Some problems may need results that we can be very sure about, others may just want a general idea of which direction things are moving. • We need to begin to say how good our results are.
Introduction • Sample values are no more than estimates of the true population values (or parameters or population parameters). • To know these values with certainty, your sample would have to be 100%, or a census. • In practice, we use samples that are only a tiny fraction of the population for reasons of cost, time and because they are adequate for the purpose. • How close the estimates are to the population parameters will depend upon the size of the sample, the sample design (e.g. stratification can improve the representativeness of the sample), and the variability in the population. • It is also necessary to decide how certain we want to be about the results; if, for example, we want a very small margin of sampling error, then we will need to incur the cost of a larger sample design. • The relationship between sample size, variability of the population and the degree of confidence required in the results is the key to understanding the chapters in this part of the book.
Introduction • The approach in Chapter 13 is different, as it is concerned with data that cannot easily or effectively be described by parameters (e.g. the mean and standard deviation). • If we are interested in characteristics (e.g. smoking/non-smoking), ranking (e.g. ranking chocolate products in terms of appearance) or scoring (e.g. giving a score between 1 and 5 to describe whether you agree or disagree with a certain statement), a number of tests have been developed that do not require description by the use of parameters. • After working through these chapters you should be able to say how good your data is, and test propositions in a variety of ways.
Inference quick start • Inference is about generalizing your sample results to the whole population. • The basic elements of inference are: • confidence intervals • parametric significance tests • non-parametric significance tests. • The aim is to reduce the time and cost of data collection while enabling us to generalize the results to the whole population. • It allows us to place a level of confidence on our results which indicates how sure we are of the assertions we are making. Results follow from the central limit theorem and the characteristics of the Normal distribution for parametric tests.
Inference quick start Key relationships are: Ninety-five percent confidence interval for a mean: $\bar{x} \pm 1.96\,\dfrac{s}{\sqrt{n}}$ Ninety-five percent confidence interval for a percentage: $p \pm 1.96\sqrt{\dfrac{p(100 - p)}{n}}$
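As a rough illustration of the percentage relationship, the sketch below applies the conventional large-sample formula p ± 1.96√(p(100 − p)/n), with p expressed as a percentage. The function name `percentage_ci_95` and the survey figures (50% of 400 respondents) are hypothetical and chosen only for illustration; they do not come from the course material.

```python
import math

def percentage_ci_95(p, n):
    """Approximate 95% confidence interval for a population percentage.

    p is the sample percentage (0 to 100) and n is the sample size.
    Assumes a simple random sample large enough for the Normal approximation.
    """
    standard_error = math.sqrt(p * (100 - p) / n)
    margin = 1.96 * standard_error
    return p - margin, p + margin

# Hypothetical survey: 50% of a sample of 400 respondents answer "yes".
lower, upper = percentage_ci_95(50, 400)
print(f"95% confidence interval: {lower:.1f}% to {upper:.1f}%")
```

With these illustrative figures the standard error is 2.5 percentage points, so the interval extends 4.9 points either side of the sample percentage.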
Inference quick start Where there is no cardinal data, then we can use non-parametric tests such as chi-squared.
11. Confidence intervals • This chapter allows us to begin to answer the question: • 'What can we do with the sample results we obtain, and how do we relate them to the original population?' • Sampling, as we have seen in Chapter 3, is concerned with the collection of data from a (usually small) group selected from a defined, relevant population. • Various methods are used to select the sample from this population, the main distinction being between those methods based on random sampling and those which are not. • In the development of statistical sampling theory it is assumed that the samples used are selected by simple random sampling, although the methods developed in this and subsequent chapters are often applied to other sampling designs. • Sampling theory applies whether the data is collected by interview, postal questionnaire or observation. • However, as you will be aware, there are ample opportunities for bias to arise in the methods of extracting data from a sample, including the percentage of non-respondents. • These aspects must be considered in interpreting the results together with the statistics derived from sampling theory.
11. Confidence intervals • The only circumstance in which we could be absolutely certain about our results is in the unlikely case of having a census with a 100% response rate, where everyone gave the correct information. • Even then, we could only be certain at that particular point in time. • Mostly, we have to work with the sample information available. • It is important that the sample is adequate for the intended purpose and provides neither too little nor too much detail. • It is important for the user to define their requirements; the user could require just a broad 'picture' or a more detailed analysis. • A sample that was inadequate could provide results that were too vague or misleading, whereas a sample that was overspecified could prove too time-consuming and costly.
11.1 Statistical inference The central limit theorem (see Section 10.4) provides a basis for understanding how the results from a sample may be interpreted in relation to the parent population; in other words, what conclusions can be drawn about the population on the basis of the sample results obtained. This result is crucial, and if you cannot accept the relationship between samples and the population, then you can draw no conclusions about a population from your sample. All you can say is that you know something about the people involved in the survey.
11.1 Statistical inference • For example, if a company conducted a market research survey in Buxton and found that 50% of their customers would like to try a new flavour of their sweets, what useful conclusions could be drawn about all existing customers in Buxton? • What conclusions could be drawn about existing customers elsewhere? • What conclusions could be drawn about potential customers? • It is important to clarify the link being made between the selected sample and a larger group of interest. • It is this link that is referred to as inference. • To make an inference the sample has got to be sufficiently representative of the larger group, the population. • It is for the researcher to justify that the inference is valid on the basis of problem definition, population definition and sample design.
11.1 Statistical inference • Often results are required quickly: predicting election results, or predicting the number of defectives in a production process, may not allow sufficient time to conduct a census. • Fortunately a census is rarely needed, since a body of theory has grown up which allows us to draw conclusions about a population from the results of a sample survey. • This is statistical inference or sampling theory. • Taking the sample results back to the problem is often referred to as business significance. • It is possible, as we shall see, to have results that are of statistical significance but not of business significance, e.g. a clear increase in sales of 0.001%.
11.1 Statistical inference • Statistical inference draws upon the probability results as developed in Part 3, especially from the Normal distribution. • It can be shown that, given a few basic conditions, the statistics derived from a sample will follow a Normal distribution. • To understand statistical inference it is necessary to recognize that three basic factors will affect our results; these are: • the size of the sample • the variability in the relevant population • the level of confidence we wish to have in the results.
11.1 Statistical inference • As illustrated in Figure 11.2, these three factors tend to pull in opposite directions and the final sample may well be a compromise between the factors. • Increases in sample size will generally make the results more accurate (i.e. closer to the results which would be obtained from a census), but this is not a simple linear relationship, so doubling the sample size does not double the level of accuracy. • Very small samples, for example under 30, tend to behave in a slightly different way from larger samples and we will look at this when we consider the use of the t-distribution. • In practice, sample sizes can range from about 30 to 3000. • Many national samples for market research or political opinion polling require a sample size of about 1000. • Increasing sample size also increases cost.
11.1 Statistical inference • If there was no variation in the original population, then it would only be necessary to take a sample of one; for example, if everyone in the country had the same opinion about a certain government policy, then knowing the opinion of one individual would be enough. • However, we do not live in such a homogeneous (boring) world, and there are likely to be a wide range of opinions on such issues as government policy. • The design of the sample will need to ensure that the full range of opinions is represented. • Even items which are supposed to be exactly alike turn out not to be so; for example, items coming off the end of a production line should be identical but there will be slight variations due to machine wear, temperature variation, quality of raw materials, skill of the operators, etc.
11.1 Statistical inference Since we cannot be 100% certain of our results, there will always be a risk that we will be wrong; we therefore need to specify how big this risk will be. Do you want to be 99% certain you have the right answer, or would 95% certain be sufficient? How about 90% certain? As we will see in this chapter, the higher the risk you are willing to accept of being wrong, the lower the sample size needs to be; equally, for a fixed sample size, demanding more certainty makes the interval estimate wider.
11.2 Inference about a population • Calculations based on a sample are referred to as sample statistics. • The mean and standard deviation, for example, calculated from sample information, will often be referred to as the sample mean and the sample standard deviation, but if not, should be understood from their context. • The values calculated from population or census information are often referred to as population parameters. • If all persons or items are included, there should be no doubt about these values (no sampling variation) and these values (population parameters) can be regarded as fixed within the particular problem context. • (This may not mean that they are 'correct' since asking everyone is no guarantee that they will all tell the truth!)
11.2 Inference about a population If you have access to the web, try looking at the spreadsheet sampling.xls which takes a very small population (of size 10) and shows every possible sample of size 2, 3 or 4. The basic population data is as follows: A quick calculation would tell you that the population parameters are as follows: Mean = 13; Standard deviation = 2.160247
11.2 Inference about a population By clicking on the Answer tab, you can find that, for a sample of 2, the overall mean is 13, with an overall standard deviation of 1.36626. You may wish to compare these answers with those shown, theoretically, later in the chapter.
The overall variation for samples of 2 is shown by a histogram in Figure 11.3.
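The spreadsheet exercise can be imitated in a few lines of code. The population table is not reproduced in these slides, so the ten values below are HYPOTHETICAL: they were chosen only so that they match the quoted mean of 13 and standard deviation of 2.160247 (which corresponds to the n − 1 divisor). The sketch lists every possible sample of size 2, drawn without replacement, and summarizes the resulting sample means.

```python
from itertools import combinations
from statistics import mean, pstdev, stdev

# HYPOTHETICAL population of 10 values (not the spreadsheet's actual data),
# constructed to have mean 13 and standard deviation 2.160247.
population = [10, 10, 11, 12, 13, 14, 14, 15, 15, 16]

# Every possible sample of size 2, drawn without replacement (45 in total).
sample_means = [mean(sample) for sample in combinations(population, 2)]

population_mean = mean(population)
population_sd = stdev(population)    # n - 1 divisor
mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)   # spread of the full set of sample means

print(f"population mean = {population_mean}, sd = {population_sd:.6f}")
print(f"number of samples = {len(sample_means)}")
print(f"mean of sample means = {mean_of_means}, sd = {sd_of_means:.5f}")
```

The sample means average out to the population mean, and their spread is smaller than the spread of the population itself. For samples drawn without replacement, the spread of the sample means depends only on the population standard deviation, the population size and the sample size, which is why a stand-in population with the same summary figures should give the same 1.36626 quoted for the spreadsheet.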
11.2 Inference about a population Look through the spreadsheet for the other answers. Can you find a pattern in the results?
11.2 Inference about a population As we are now dealing with statistics from samples and making inferences to populations we need a notational system to distinguish between the two. Greek letters will be used to refer to population parameters, µ (mu) for the mean and σ (sigma) for the standard deviation, with N for the population size, while ordinary (roman) letters will be used for sample statistics, x̄ for the mean, s for the standard deviation, and n for the sample size. In the case of percentages, Π is used for the population and p for the sample.
11.3 Confidence interval for the population mean • When a sample is selected from a population, the arithmetic mean may be calculated in the usual way, dividing the sum of the values by the size of the sample. • If a second sample is selected, and the mean calculated, it is very likely that a different value for the sample mean will be obtained. • Further samples will yield more (different) values for the sample mean. • Note that the population mean is always the same throughout this process; it is only the different samples which give different answers. • This is illustrated in Figure 11.5.
11.3 Confidence interval for the population mean Since we are obtaining different answers from each of the samples, it would not be reasonable to just assume that the population mean was equal to any of the sample means. In fact each sample mean is said to provide a point estimate for the population mean, but it has virtually no probability of being exactly right; if it were, this would be purely by chance. We may estimate that the population mean lies within a small interval around the sample mean; this interval represents the sampling error.
11.3 Confidence interval for the population mean Thus the population mean is estimated to lie in the region: x̄ ± sampling error. Thus, we are attempting to create an interval estimate for the population mean.
11.3 Confidence interval for the population mean You should recall from Chapter 10 that the area under a distribution curve can be used to represent the probability of a value being within an interval. We are therefore in a position to talk about the population mean being within the interval with a calculated probability. As we have seen in Section 10.4, the distribution of all sample means will follow a Normal distribution, at least for large samples, with a mean equal to the population mean and a standard deviation equal to σ/√n.
11.3 Confidence interval for the population mean • The central limit theorem (for means) states that if a simple random sample of size n (n > 30) is taken from a population with mean µ and a standard deviation σ, the sampling distribution of the sample mean is approximately Normal with mean µ and standard deviation σ/√n. • This standard deviation is usually referred to as the standard error when we are talking about the sampling distribution of the mean. • This is a more general result than that shown in Chapter 10, since it does not assume anything about the shape of the population distribution; it could be any shape.
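A small simulation can make the theorem more concrete. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with rate 1 (a strongly skewed distribution with mean 1 and standard deviation 1), and the sample size of 36, the 5000 repetitions and the random seed are arbitrary choices, none of which come from the course material.

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(42)   # fixed seed so that the run is repeatable

n = 36                    # sample size, above the n > 30 guideline
num_samples = 5000        # number of repeated samples to draw

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0

# Draw many samples of size n and record each sample mean.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

observed_mean = mean(sample_means)
observed_se = pstdev(sample_means)
theoretical_se = sigma / math.sqrt(n)

print(f"mean of sample means:   {observed_mean:.3f} (population mean {mu})")
print(f"sd of sample means:     {observed_se:.3f}")
print(f"sigma / sqrt(n):        {theoretical_se:.3f}")
```

Even though the individual values are far from Normal, the sample means cluster around µ with a spread close to σ/√n, which is the standard error described above.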
11.3 Confidence interval for the population mean • Compare this to the result of the sampling.xls spreadsheet. • There the population standard deviation was 2.160247 and the standard deviation obtained from all samples was 1.36626, but remember that here the sample size was only 2. • The spreadsheet result is intended only to illustrate that the standard deviation for the distribution of sample means is lower than the population standard deviation.
11.3 Confidence interval for the population mean From our knowledge of the Normal distribution (see Chapter 10 or Appendix C) we know that 95% of the distribution lies within 1.96 standard deviations of the mean. Thus, for the distribution of sample means, 95% of these will lie in the interval $\mu \pm 1.96\,\dfrac{\sigma}{\sqrt{n}}$, as shown in Figure 11.6.
11.3 Confidence interval for the population mean • This may also be written as a probability statement: $P\left(\mu - 1.96\,\dfrac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\,\dfrac{\sigma}{\sqrt{n}}\right) = 0.95$ • This is a fairly obvious and uncontentious statement which follows directly from the central limit theorem. • As you can see, a larger sample size would narrow the width of the interval (since we are dividing by √n). • If we were to increase the percentage of the distribution included, by increasing the 0.95, we would need to increase the 1.96 values, and the interval would get wider.
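The effect of sample size on the width of the interval can be checked numerically. The sketch below assumes a hypothetical population standard deviation of 10 and a handful of illustrative sample sizes; the helper name `half_width_95` is also an invention for this example.

```python
import math

def half_width_95(sigma, n):
    """Distance from the centre of a 95% interval to either end: 1.96 * sigma / sqrt(n)."""
    return 1.96 * sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation

for n in (100, 200, 400, 1600):
    print(f"n = {n:5d}  half-width = {half_width_95(sigma, n):.3f}")
```

Because the sample size enters through its square root, doubling n shrinks the interval by a factor of about 0.71 rather than 0.5; the sample has to be quadrupled to halve the width.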
11.3 Confidence interval for the population mean By rearranging the probability statement we can produce a 95% confidence interval for the population mean: $\bar{x} - 1.96\,\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\dfrac{\sigma}{\sqrt{n}}$, or more compactly $\mu = \bar{x} \pm 1.96\,\dfrac{\sigma}{\sqrt{n}}$. This is the form of the confidence interval which we will use, but it is worth stating what it says in words: the true population mean (which we do not know) will lie within 1.96 standard errors of the sample mean with a 95% level of confidence.
11.3 Confidence interval for the population mean In practice you would only take a single sample, but this result utilizes the central limit theorem to allow you to make the statement about the population mean. There is also a 5% chance that the true population mean lies outside this confidence interval, for example, the data from sample 3 in Figure 11.7.
11.3 Confidence interval for the population mean Case study 4 In the Arbour Housing Survey (see Case 4) 100 respondents had mortgages, paying on average £253 per month. If it can be assumed that the standard deviation for mortgages in the area of Tonnelle is £70, calculate a 95% confidence interval for the mean. The sample size is n = 100, the sample mean is x̄ = 253 and the population standard deviation is σ = 70.
11.3 Confidence interval for the population mean Case study 4 • By substituting into the formula given above, we have $\mu = 253 \pm 1.96 \times \dfrac{70}{\sqrt{100}} = 253 \pm 13.72$ • We are fairly sure (95% confident) that the average mortgage for the Tonnelle area is between £239.28 and £266.72. • There is a 5% chance that the true population mean lies outside of this interval.
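The same calculation can be written as a short function. This is a minimal sketch, not part of the original case study: the function name `mean_confidence_interval` is hypothetical, while the figures (mean £253, standard deviation £70, n = 100) are those given for the Arbour Housing Survey.

```python
import math

def mean_confidence_interval(sample_mean, sd, n, z=1.96):
    """Confidence interval for a population mean: sample_mean +/- z * sd / sqrt(n).

    The default z = 1.96 gives a 95% interval.
    """
    margin = z * sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Arbour Housing Survey mortgage figures from the case study.
lower, upper = mean_confidence_interval(sample_mean=253, sd=70, n=100)
print(f"95% confidence interval: £{lower:.2f} to £{upper:.2f}")
```

Passing a different value of `z` gives intervals at other confidence levels.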
11.3 Confidence interval for the population mean • So far our calculations have attempted to estimate the unknown population mean from the known sample mean using a result found directly from the central limit theorem. • However, looking again at our formula, we see that it uses the value of the population standard deviation, σ, and if the population mean is unknown it is highly unlikely that we would know this value. • To overcome this problem we may substitute the sample estimate of the standard deviation, s, but unlike the examples in Chapter 5, here we need to divide by (n - 1) rather than n in the formula. • This follows from a separate result of sampling theory which states that the sample standard deviation calculated in this way is a better estimator of the population standard deviation than that using a divisor of n. • (Note that we do not intend to prove this result, which is well documented in a number of mathematical statistics books.)
11.3 Confidence interval for the population mean The structure of the confidence interval is still valid provided that the sample size is fairly large. Thus the 95% confidence interval which we shall use will be: $\mu = \bar{x} \pm 1.96\,\dfrac{s}{\sqrt{n}}$ For a 99% confidence interval, the formula would be: $\mu = \bar{x} \pm 2.576\,\dfrac{s}{\sqrt{n}}$
11.3 Confidence interval for the population mean • As this last example illustrates, the more certain we are of the result (i.e. the higher the level of confidence), the wider the interval becomes. • That is, the sampling error becomes larger. • Sampling error depends on the probability excluded in the extreme tail areas of the Normal distribution, and so, as the confidence level increases, the amount excluded in the tail areas becomes smaller. • This example also illustrates a further justification for sampling, since the measurement itself is destructive (length of life), and thus if all items were tested, there would be none left to sell.
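The link between the confidence level and the width of the interval can be sketched as follows. The multipliers are taken from the standard Normal distribution using Python's `statistics.NormalDist`; the helper name `z_value` is hypothetical, and the mortgage figures from the Arbour Housing Survey (mean £253, standard deviation £70, n = 100) are reused purely as an illustration.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided standard Normal multiplier for a confidence level such as 0.95."""
    tail = (1 - confidence) / 2          # probability excluded in each tail
    return NormalDist().inv_cdf(1 - tail)

# Arbour Housing Survey mortgage figures, reused for illustration.
sample_mean, sd, n = 253, 70, 100

for confidence in (0.90, 0.95, 0.99):
    z = z_value(confidence)
    margin = z * sd / math.sqrt(n)
    print(f"{confidence:.0%}: z = {z:.3f}, "
          f"interval = £{sample_mean - margin:.2f} to £{sample_mean + margin:.2f}")
```

As less probability is left in the tails, the multiplier grows from roughly 1.645 to 1.96 to 2.576, and the interval widens accordingly.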
11.3.1 Confidence intervals using survey data It may well be the case that you need to produce a confidence interval on the basis of tabulated data. Case study The following example uses the table produced in the Arbour Housing Survey (and reproduced as Table 11.1) showing monthly rent.
11.3.1 Confidence intervals using survey data Table 11.1 Monthly rent
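11.3.1 Confidence intervals using survey data We can calculate the mean and sample standard deviation using $\bar{x} = \dfrac{\sum fx}{n}$ and $s = \sqrt{\dfrac{\sum f(x - \bar{x})^2}{n - 1}}$, where x is the midpoint of each class, f is its frequency and n = Σf is the sample size.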
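The calculation for tabulated data can be sketched as below. Table 11.1 is not reproduced in these slides, so the class midpoints and frequencies here are HYPOTHETICAL stand-ins and the results are not those of the Arbour Housing Survey. The sketch uses the usual grouped-data formulas, with each class represented by its midpoint and an n − 1 divisor for the standard deviation.

```python
import math

# HYPOTHETICAL grouped rent data (not Table 11.1): class midpoints in £ and frequencies.
midpoints = [100, 150, 200]
frequencies = [20, 40, 20]

n = sum(frequencies)

# Grouped mean: sum of f * x divided by n.
sample_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Grouped sample standard deviation with the n - 1 divisor.
sum_squares = sum(f * (x - sample_mean) ** 2 for f, x in zip(frequencies, midpoints))
s = math.sqrt(sum_squares / (n - 1))

# 95% confidence interval for the population mean, using s in place of sigma.
margin = 1.96 * s / math.sqrt(n)
print(f"n = {n}, mean = £{sample_mean:.2f}, s = £{s:.2f}")
print(f"95% confidence interval: £{sample_mean - margin:.2f} to £{sample_mean + margin:.2f}")
```

Using midpoints treats every observation in a class as if it sat at the centre of that class, so the results are approximations to what the raw data would give.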
11.3.1 Confidence intervals using survey data • The sample standard deviation, s, sometimes denoted by σ̂, is being used as an estimator of the population standard deviation σ. • The sample standard deviation will vary from sample to sample in the same way that the sample mean, x̄, varies from sample to sample. • The sample mean will sometimes be too high or too low, but on average will equal the population mean µ. • You will notice that the distribution of sample means in Figure 11.6 is symmetrical about the population mean µ.