
Topic-8


cyndi

Presentation Transcript


  1. Topic-8 Sampling Methods and Distributions

  2. Sampling Methods and Distributions
     Why Sample from a Population?
     * Physical impossibility of checking every item in the population.
     * The cost of studying every item is prohibitive (too expensive).
     * The time required to study every item may be prohibitive (too lengthy).
     * Destructive nature of certain tests: would you test every match in a book of matches?
     * Results from sampling are adequate and satisfactory: with an appropriate sampling technique, studying a sample can provide adequate and satisfactory information about the population.
     Two Types of Samples: Probability vs. Non-probability
     * Probability Sample: a sample selected at random in such a way that each item in the population has a known (nonzero) chance of being included. Such samples are viewed as unbiased. There are several probability sampling methods.
     * Non-Probability Sample: a sample selected in such a way that not all items have a chance of being included. Such samples are often viewed as biased (e.g., Panel or Convenience Sampling).

  3. Probability Sampling Methods
     There is no single “best” probability sampling method. Several methods have been developed, and they differ in how the chance of each item being included in the sample is determined.
     1. Simple Random Sampling: the sample is chosen so that each item in the population has an “equal” chance of being selected (e.g., Door Prize Selection).
        * Random numbers (tables or computer printouts) should be used to avoid any personal “bias”.
     2. Systematic Random Sampling: the population is arranged in a selected order, the first sample item (the starting point) is chosen at random, and then every Kth member after it is picked into the sample (e.g., Student Project Group Selection).
        * When there is a predetermined pattern within the population, such a systematic selection may generate a biased result.
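The two methods on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the `seed` parameter, and the roster of 40 student IDs are assumptions made for the example, not part of the original slides.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every item has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, k, seed=None):
    """Pick a random starting point among the first k items, then take every k-th item."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return population[start::k]


# Hypothetical roster of 40 student IDs (illustrative data only).
students = list(range(1, 41))

srs = simple_random_sample(students, 5, seed=1)
sys_sample = systematic_sample(students, 8, seed=1)
print(srs)
print(sys_sample)
```

If the roster were ordered by some repeating pattern with the same period as `k`, the systematic sample would inherit that pattern, which is the bias the slide warns about.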

  4. 3. Stratified Random Sampling: the population is first divided into a number of subgroups, called “strata”, based on a selected set of criteria; then a proportional sample (a subsample) is chosen from each stratum.
        * If a large and more heterogeneous population is sampled, it may be desirable to divide some subgroups further for better representation.
        * A key guideline for “stratification” is that the number of items sampled from each stratum should be in the same proportion as that stratum has in the population.
        * If a non-proportional stratified sample is used, the final results from the sample must be weighted accordingly to correct the potential bias.
        * Stratified Random Sampling often gives better results than either Simple Random Sampling or Systematic Random Sampling, especially for a large and heterogeneous population. (Examples)
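A minimal sketch of proportional stratified sampling follows. The function name, the `stratum_of` callback, and the population of 100 items split 10/30/60 into "large", "medium" and "small" strata are all hypothetical choices made for the illustration.

```python
import random
from collections import Counter, defaultdict


def proportional_stratified_sample(items, stratum_of, n, seed=None):
    """Sample from each stratum in proportion to its share of the population.

    Subsample sizes are rounded, so the total can differ slightly from n
    when the proportions do not divide evenly.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for members in strata.values():
        size = round(n * len(members) / len(items))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical population: (stratum label, item id) pairs.
population = (
    [("large", i) for i in range(10)]
    + [("medium", i) for i in range(30)]
    + [("small", i) for i in range(60)]
)

sample = proportional_stratified_sample(population, lambda item: item[0], n=10, seed=7)
counts = Counter(label for label, _ in sample)
print(counts)
```

With these proportions the sample of 10 holds 1 large, 3 medium and 6 small items, mirroring the 10% / 30% / 60% split of the population.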

  5. 4. Cluster Random Sampling: to reduce the cost of sampling a large population scattered over a huge geographic area, the population is first divided into several geographic areas, called “primary units” (each unit acting as a “cluster”); then a sample is chosen from some selected units (or from each unit), and the selected items are combined into the final sample.
     Note: The four sampling methods discussed above can be used jointly (i.e., combined) when necessary. In addition to these four methods, many other sampling techniques have been developed and applied in statistical practice.
     Sampling Error: the difference (i.e., deviation) between a sample statistic (e.g., a sample mean or standard deviation) and its corresponding population parameter (e.g., the population mean or standard deviation), used as a measure of sampling effectiveness, or the “representativeness” of samples. These measures can be determined by studying the Sampling Distribution of Sample Means (or Sample Variances).

  6. Sampling Distribution of Sample Means
     Sampling Distribution of the Sample Means: to measure the representativeness of sampling, all possible sample means of a given sample size selected from a population, along with the probability of occurrence of each sample mean, can be arranged into a probability distribution.
     * The mean of the “sample means” is equal to the population mean.
     * The “Sampling Distribution of Sample Means” tends toward a “Normal Distribution” as the sample size grows (and is exactly normal when the population itself is normal).
     * The dispersion (i.e., spread) of the “Sampling Distribution of Sample Means” differs from the dispersion of the population: for samples of more than one item it is smaller, and it shrinks as the sample size increases. (Examples)
     Note: If necessary, a “sampling distribution of sample variances” or a “sampling distribution of sample standard deviations” can be constructed in a similar fashion.

  7. Central Limit Theorem
     * As shown for the “Sampling Distribution of Sample Means”, the “mean” of the “sample means” is equal to the population mean, and the distribution of “sample means” approximates a Normal distribution.
     * Furthermore, regardless of the “shape” of the population distribution (Normal or not), the distribution of “sample means” approaches a Normal distribution as the sample size grows, and its “mean” equals the population mean. This is the basis for the “Central Limit Theorem”.
     Central Limit Theorem: for a population (with any type of distribution) with mean µ and variance σ², the sampling distribution of “sample means” (for samples of size n drawn from the population) will be approximately normally distributed, with mean µ and variance σ²/n, when the sample size n is large enough.
     The “Central Limit Theorem” is the foundation for statistical estimation and hypothesis testing.
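The theorem can be checked empirically with a short simulation. The sketch below is an illustration under stated assumptions: the skewed (exponential) population, the sample size of 50, the number of draws, and the seeds are arbitrary choices, not values from the slides.

```python
import random
import statistics


def sample_means(population, n, draws, seed=0):
    """Return the means of `draws` random samples of size n (drawn with replacement)."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]


# A deliberately skewed, non-normal population (hypothetical data).
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1.0) for _ in range(20_000)]

mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)

n = 50
means = sample_means(population, n=n, draws=4_000)

# The mean of the sample means should be close to mu,
# and their variance should be close to sigma^2 / n.
print("population mean:", mu, " mean of sample means:", statistics.fmean(means))
print("sigma^2 / n:", sigma2 / n, " variance of sample means:", statistics.pvariance(means))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is strongly right-skewed.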

  8. Statistical Estimation Based on the Central Limit Theorem
     * The “large enough” sample size requirement above is a relative term: if the population is normally distributed, the “sample means” will be normally distributed even with a small sample (n < 5), but if the population is not normally distributed (e.g., highly skewed), a larger sample (n > 25) may be needed for the sample means to be approximately normally distributed.
     * Inferential Statistics, based on the “Central Limit Theorem”, include:
       Estimation (Point vs. Interval): from incomplete information (e.g., a sample).
       Hypothesis Testing: about an unknown characteristic of the population.
     Inferential Statistical Estimation:
     * Point Estimate: a single-value (point) estimate of a population parameter (e.g., a sample mean for a population mean).
     * Interval Estimate: a range (interval) within which a population parameter is likely to fall, often called a “Confidence Interval” (e.g., at the 95% or 99% level). (Examples)

  9. Confidence Interval: Interpretations
     * An estimate of a population parameter from a sample can never be 100% accurate. A better alternative is to offer an estimated range (or “interval”) and state with a certain confidence (as a percentage) that the population parameter of interest falls within this range (e.g., a1 < µ < a2).
     * The two most used confidence levels are 95% (z = 1.96) and 99% (z = 2.58). Depending on the specific interest, a 95% confidence interval can be interpreted as, for example:
       -- about 95% of the similarly constructed intervals from the same population will contain the parameter being estimated.
       -- 95% of the sample statistics (e.g., sample means) for a specific sample size (n) from the same population will lie within ± 1.96 standard errors of the (hypothesized) population parameter (e.g., population mean).
     Note:
     * Other levels (90% or 85%) have been used in practice.
     * For a 95% confidence interval, some of the intervals constructed will not include the parameter being estimated, and it will not necessarily be exactly 95 out of 100 intervals that contain it.
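The first interpretation (about 95% of similarly constructed intervals contain the parameter) can be explored with a small simulation. This is a sketch under assumptions: a normal population with a known standard deviation, and illustrative values µ = 24, σ = 4, n = 49 chosen only for the example.

```python
import random
import statistics
from math import sqrt


def coverage(mu, sigma, n, trials, z=1.96, seed=0):
    """Fraction of simulated intervals (xbar +/- z * sigma / sqrt(n)) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials


rate = coverage(mu=24, sigma=4, n=49, trials=2_000)
print(rate)  # close to 0.95, but usually not exactly 0.95
```

Changing the seed changes the observed rate slightly, which is the point of the note on this slide: the 95% figure describes the long-run behaviour of the procedure, not an exact count in any one batch of intervals.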

  10. Constructing Confidence Intervals
      * Standard Error of Sample Means: the standard deviation of the sampling distribution of sample means, a measure of sampling “representativeness”. When σ (the population standard deviation) is known:
            σ_x̄ = σ/√n
        If σ is unknown and n > 30, the sample standard deviation (s) can approximate the population standard deviation (σ):
            s_x̄ = s/√n
      Note: As both formulas show (in the denominator), as the sample size (n) becomes larger, the standard error becomes smaller, more reliable and more stable; i.e., larger samples provide more accurate estimates of the population parameters.
      * When n > 30, Confidence Intervals (CI) can be constructed as X̄ ± z (s/√n), where z is the value corresponding to the chosen confidence level. For example:
            CI (95%): X̄ ± 1.96 (s/√n)
            CI (99%): X̄ ± 2.58 (s/√n)
            CI (92%): X̄ ± 1.75 (s/√n)
      (Examples)
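The large-sample interval for a mean translates directly into code. The function name below is an assumption made for illustration; the figures in the usage line (a mean of 24, a standard deviation of 4, n = 49) are the ones used in Example-2 of this deck.

```python
from math import sqrt


def mean_confidence_interval(xbar, s, n, z=1.96):
    """Large-sample confidence interval for a mean: xbar +/- z * s / sqrt(n)."""
    standard_error = s / sqrt(n)
    margin = z * standard_error
    return xbar - margin, xbar + margin


low, high = mean_confidence_interval(xbar=24, s=4, n=49)
print(round(low, 2), round(high, 2))  # 22.88 25.12
```

Passing `z=2.58` gives the 99% interval, which is wider for the same data.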

  11. Confidence Interval for a Population Proportion
      * Sometimes a proportion of the population (rather than a mean) needs to be estimated from a sample, with very similar procedures. Let p stand for the proportion of interest:
            CI: p ± z √(p(1 − p)/n)
        where σ_p = √(p(1 − p)/n) is the standard error of the proportion. (Examples)
      * Finite Population Correction Factor: when the population is finite (has a fixed upper bound, possibly a relatively small one), the confidence interval estimate needs to be adjusted by multiplying the standard error by a “correction factor”:
            cf = √((N − n)/(N − 1))
        so that:
            σ_x̄ = (σ/√n)·cf   and   σ_p = √(p(1 − p)/n)·cf
        where N and n are the sizes of the population and the sample.
        -- If (n/N) < .05, this correction factor can be ignored.
        -- The effect of the factor is to reduce the standard error of the estimated parameter, by an amount that depends on the ratio (n/N).
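A short sketch of the proportion interval, with the finite population correction as an option. The function name and the optional `N` parameter are assumptions for illustration; the correction uses the factor √((N − n)/(N − 1)), the form applied in Example-4 of this deck. The figures in the usage lines (175 of 500, z = 2.33) come from Example-3, and the population size of 2,000 is purely hypothetical.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z, N=None):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n).

    If a finite population size N is given, the standard error is multiplied
    by the finite population correction factor sqrt((N - n) / (N - 1)).
    """
    p = successes / n
    standard_error = sqrt(p * (1 - p) / n)
    if N is not None:
        standard_error *= sqrt((N - n) / (N - 1))
    return p - z * standard_error, p + z * standard_error


low, high = proportion_confidence_interval(175, 500, z=2.33)
print(round(low, 4), round(high, 4))  # 0.3003 0.3997

# Hypothetical finite population of 2,000: the interval becomes narrower.
low_f, high_f = proportion_confidence_interval(175, 500, z=2.33, N=2000)
print(round(low_f, 4), round(high_f, 4))
```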

  12. Determining Sample Sizes
      * One key decision in sampling design is the size of the sample to be taken from the population:
        -- not “too” large (unnecessary time and cost), and
        -- not “too” small (a biased result).
      A few common misconceptions include:
        -- a certain percentage (e.g., 5%) of the population must be taken, or
        -- a certain portion of the population must be sampled more heavily than others, or
        -- there must be a proportional relationship between the sample size and the population size.
      * Three Key Factors in Determining Sample Size are:
        1) the Confidence Level selected (z): (e.g., 95%),
        2) the Maximum Allowable Sampling Error (E): (acceptable error),
        3) the Variation in the Population (s): measured by the standard deviation, which can be estimated through a pilot survey of the population,
        so the formula is given as: n = (z•s/E)²
      Note: The sample size (n) above is only a rough estimate, not the exact size actually needed.

  13. Determining Sample Size for Population Proportions
      Similar procedures are used to determine the sample size for estimating a population proportion:
      Step-1: Specify a Level of Confidence (z): (e.g., 95%).
      Step-2: Specify an Allowable Sampling Error (E).
      Step-3: Approximate the Population Proportion (p), through past experience or a small pilot survey.
      The formula is now: n = p•(1 − p)•(z/E)² (Examples)
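Both sample-size formulas, n = (z•s/E)² and n = p•(1 − p)•(z/E)², are easy to express in code. The function names are assumptions for illustration; rounding up with `ceil` reflects the usual practice of taking the next whole observation. The figures in the usage lines are those of Example-5 and Example-6 in this deck.

```python
from math import ceil


def sample_size_for_mean(z, s, E):
    """n = (z * s / E)^2, rounded up to the next whole observation."""
    return ceil((z * s / E) ** 2)


def sample_size_for_proportion(z, p, E):
    """n = p * (1 - p) * (z / E)^2, rounded up to the next whole observation."""
    return ceil(p * (1 - p) * (z / E) ** 2)


n_mean = sample_size_for_mean(z=2.58, s=20, E=5)
n_prop = sample_size_for_proportion(z=1.96, p=0.30, E=0.03)
print(n_mean)  # 107
print(n_prop)  # 897
```

Halving the allowable error E roughly quadruples the required sample size, because E enters the formula squared.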

  14. Summary
      * A key to successful sampling is the representativeness of the sample in relation to the population. As such, broad knowledge and a deep understanding of the population are critical.
      * The size of a sample depends mainly on the desired:
        -- Accuracy of estimation (the smaller the acceptable error, the larger the sample should be).
        -- Confidence level (the higher the confidence required for a given range, the larger the sample size).
        -- Variation in the population (the more heterogeneous the population, the larger the sample size needed).
      * Finally, the sample size (n) determined through the above formulas is only a rough estimate. Whenever cost is not a major concern, larger samples should be favored in any serious research project.

  15. Sampling Distribution of the Sample Means
      Example 1: The law firm of Typo and Associates has five partners. At their weekly partners meeting, each reported the number of hours they charged clients for their professional services last week. The results are given on the next slide.
      • Two partners are randomly selected. How many different samples are possible?
      • This is the combination of 5 objects taken 2 at a time. That is, 5C2 = (5!)/[(2!)(3!)] = 10.
      • List the possible samples of size 2 and compute the mean of each.

  16. Example 1 (continued)
      Organize the sample means into a sampling distribution. The sampling distribution is shown below.
      • Compute the mean of the sample means and compare it with the population mean.
      • The population mean: µ = (22+26+30+26+22)/5 = 25.2
      • The mean of the sample means = [(22)(1)+(24)(4)+(26)(3)+(28)(2)]/10 = 25.2
      • Observe that the mean of the sample means is equal to the population mean.
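Example 1 can be reproduced by enumerating every possible sample of two partners. The hours list below is taken from the population-mean calculation on this slide (22, 26, 30, 26, 22); the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hours charged by the five partners, as they appear in the population mean above.
hours = [22, 26, 30, 26, 22]

# All 5C2 = 10 possible samples of size 2 and their means.
samples = list(combinations(hours, 2))
sample_means = [mean(pair) for pair in samples]

# Sampling distribution: each distinct sample mean and how often it occurs.
distribution = Counter(sample_means)
for value in sorted(distribution):
    print(value, distribution[value], distribution[value] / len(samples))

print(mean(hours))         # population mean: 25.2
print(mean(sample_means))  # mean of the sample means: 25.2
```

The frequencies (1, 4, 3, 2 for means of 22, 24, 26, 28) match the weights used in the slide's calculation.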

  17. Example-2
      • The Dean of Students at Penta Tech wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours with a standard deviation of 4 hours.
      • What is the point estimate of the mean number of hours worked per week by students?
        The point estimate is 24 hours (the sample mean).
      • What is the 95% confidence interval for the average number of hours worked per week by the students?
        Using formula (8-3), we have 24 ± 1.96 (4/√49) = 24 ± 1.12, or 22.88 to 25.12.
      • What are the 95% confidence limits?
        The endpoints of the confidence interval are the confidence limits. The lower confidence limit is 22.88 and the upper confidence limit is 25.12.
      • What degree of confidence is being used?
        The degree of confidence (level of confidence) is 0.95.
      • If we had time to select 100 samples of size 49 from the population of the number of hours worked per week by students at Penta Tech and compute the sample means and 95% confidence intervals, the population mean of the number of hours worked by the students per week would be found in about 95 out of 100 confidence intervals. Either a confidence interval contains the population mean or it does not. About 5 out of the 100 confidence intervals would not contain the population mean.

  18. Example-3
      • Chris Cooper, a financial planner, is studying the retirement plans of young executives. A sample of 500 young executives who owned their own home revealed that 175 planned to sell their homes and retire to Arizona. Develop a 98% confidence interval for the proportion of executives that plan to sell and move to Arizona.
      • Here n = 500, p̄ = 175/500 = 0.35, and z = 2.33.
      • The 98% CI is 0.35 ± 2.33 √((0.35)(0.65)/500), or 0.35 ± 0.0497.
      • Interpretation: 0.35 + 0.0497 = 0.3997, and (500 × 0.3997) = 199.85; 0.35 − 0.0497 = 0.3003, and (500 × 0.3003) = 150.15.
        So, with 98% confidence:
      • The proportion of young executives who plan to sell their homes and retire to Arizona lies between about 30% and 40%, i.e., between about 150 and 200 out of every 500.
      • If many similar samples of 500 young executives were taken with the same questions, about 98% of the sample proportions would be within 2.33 standard errors of the (hypothesized) population proportion of 35%.

  19. Example-4
      • The Dean of Students at Penta Tech wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours with a standard deviation of 4 hours. Construct a 95% confidence interval for the mean number of hours worked per week by the students if there are only 500 students on campus.
      • Now n/N = 49/500 = 0.098 > 0.05, so we have to use the finite population correction factor.
      • 24 ± 1.96 × (4/√49) × √((500 − 49)/(500 − 1)) = [22.9352, 25.0648]
        Since √((500 − 49)/(500 − 1)) = √0.9038 = 0.95 and z (s/√n) = 1.96 (4/√49) = 1.12:
        µ(1) = 24 ± 1.12 = {22.88 to 25.12} (without the correction)
        µ(2) = 24 ± 1.12 × 0.95 = {22.94 to 25.06} (with the correction)
        That is, the sampling standard error is reduced by about 5%, giving a narrower (confidence) interval.
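The arithmetic of this example can be rechecked with a few lines of Python. The helper names are illustrative assumptions; the inputs (mean 24, standard deviation 4, n = 49, N = 500, z = 1.96) are the ones stated in the example.

```python
from math import sqrt


def finite_population_correction(N, n):
    """Correction factor sqrt((N - n) / (N - 1)) for sampling from a finite population."""
    return sqrt((N - n) / (N - 1))


def mean_ci_finite(xbar, s, n, N, z=1.96):
    """Interval xbar +/- z * (s / sqrt(n)) * correction factor."""
    margin = z * (s / sqrt(n)) * finite_population_correction(N, n)
    return xbar - margin, xbar + margin


cf = finite_population_correction(N=500, n=49)
low, high = mean_ci_finite(xbar=24, s=4, n=49, N=500)

print(round(cf, 4))
print(round(low, 2), round(high, 2))  # 22.94 25.06
```

The corrected margin is about 1.06 hours instead of 1.12, so the interval is symmetric about 24 and slightly narrower than the uncorrected one.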

  20. Example-5
      • A consumer group would like to estimate the mean monthly electric bill for a single-family house in July. Based on similar studies, the standard deviation is estimated to be $20.00. A 99% level of confidence is desired, with an accuracy of ±$5.00. How large a sample is required?
      • n = [(2.58)(20)/5]² = 106.5024 ≈ 107.

  21. Example-6
      • The American Kennel Club wanted to estimate the proportion of children that have a dog as a pet. If the club wanted the estimate to be within 3% of the population proportion, how many children would they need to contact? Assume a 95% level of confidence and that the Club estimated that 30% of the children have a dog as a pet.
      • n = (0.30)(0.70)(1.96/0.03)² = 896.3733 ≈ 897.

  22. Exercises
      8-A: The following is a list of Marco’s Pizza stores in Lucas County. Also noted is whether the store is corporate-owned (C) or manager-owned (M). A sample of four locations is to be selected and inspected for customer convenience, safety, cleanliness and other features.

  23. • The random numbers selected are 08, 18, 11, 54, 02, 41, and 54. Which stores are selected?
      • Use the table of random numbers to select your own sample of locations.
      • A sample is to consist of every seventh location. The number 03 is the starting point. Which locations will be included in the sample?
      • Suppose a sample is to consist of three locations, of which two are corporate-owned and one is manager-owned. Select a sample accordingly.
      8-B: A sample of 10 observations is selected from a normal population for which the population standard deviation is known to be 5. The sample mean is 20.
      • Determine the standard error of the mean.
      • Explain why we can use formula (8-3) to determine the 95 percent confidence interval even though the sample is less than 30.
      • Determine the 95 percent confidence interval for the population mean.
      8-C: A research firm conducted a survey to determine the mean amount steady smokers spend on cigarettes during a week. A sample of 49 steady smokers revealed that X̄ = $20 and s = $5.
      • What is the point estimate? Explain what it indicates.
      • Using the 95 percent level of confidence, determine the confidence interval for µ. Explain what it indicates.

  24. 8-D: The commercial banks in Region III are to be surveyed. Some of them are very large, with assets of more than $500 million; others are medium-size, with assets between $100 million and $500 million; and the remaining banks have assets of less than $100 million. Explain how you would select a sample of these banks.
      8-E: The Hunington National Bank, like most other large banks, found that using automatic teller machines (ATMs) reduces the cost of routine bank transactions. Hunington installed an ATM in the corporate offices of the Fun Toy Company. The ATM is for the exclusive use of Fun’s 605 employees. After several months of operation, a sample of 100 employees revealed the following use of the ATM by Fun employees in a month.

  25. • What is the estimate of the proportion of employees who do not use the ATM in a month?
      • Develop a 95 percent confidence interval for this estimate. Can Hunington be sure that at least 40 percent of the employees of Fun Toy Company will use the ATM?
      • How many transactions does the average Fun employee make per month?
      • Develop a 95 percent confidence interval for the mean number of transactions per month.
      • Is it possible that the population mean is 0? Explain.
