Statistics: Analyzing and Comparing Data. Module 3.
Outline • Estimating the true mean using sample average • Estimating the true variance using sample variance • Confidence intervals and hypothesis tests for means and variances K. McAuley
Population Mean and Sample Mean
If the complete population contains N values, the population mean is $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Often N is very large or infinite, so we collect a random sample of n values and estimate $\mu$ using the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Why is $\bar{X}$ a random variable?
What happens to the quality of the estimate as n changes?
Sample Average
• What is the difference between $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$?
Definition - Random Sample
Independent random variables $X_1, X_2, \ldots, X_n$ with the same underlying distribution are called a random sample.
A STATISTIC is any function of the random variables in a random sample.
• The statistics we calculate most often are the sample mean and sample variance.
• Parameter estimates in fitted models are also statistics.
Statistics
• Is $\bar{X}$ the same as $\mu$?
• Is $s^2$ the same as $\sigma^2$?
• Is $\bar{X}$ a random variable? What about $\mu$? What about $s^2$?
• The probability distribution of a statistic arises from the probability distribution of the underlying population.
• The variability of $X_i$ influences the variability of $\bar{X}$.
• Statisticians call the probability distribution of a statistic a sampling distribution.
Sampling Distribution for the Sample Average
• Let’s determine the mean and variance of the sample average.
The expected value of the sample average is the true mean of the population: $E[\bar{X}] = \mu$. We say that the sample average is an UNBIASED estimator for the mean of the population.
The variance of the sample average is smaller than the variance of the underlying population: $\mathrm{Var}(\bar{X}) = \sigma^2/n$.
Let’s do some proofs!
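Sampling Distribution for the Sample Average
Mean:
$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu$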
Variance
We use the following theorem to find the variance of $\bar{X}$. If X and Y are independent random variables and a and b are constants, then
$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$
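Sampling Distribution for the Sample Average
Variance:
$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$
• As n becomes larger, the variance of the sample average becomes smaller
• As more data are used, the estimate of the true mean becomes more precise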
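The shrinking variance of the sample average can be checked numerically. The sketch below is an illustration only: the Uniform(0, 1) population, the sample sizes, and the function name `variance_of_sample_means` are assumptions made for the demonstration and are not taken from the slides.

```python
import random
import statistics

POPULATION_VARIANCE = 1.0 / 12.0  # variance of a Uniform(0, 1) population


def variance_of_sample_means(n, replicates=10000, seed=0):
    """Estimate Var(X-bar) by drawing many samples of size n from Uniform(0, 1)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.random() for _ in range(n))
        for _ in range(replicates)
    ]
    return statistics.variance(means)


for n in (5, 25):
    simulated = variance_of_sample_means(n)
    theory = POPULATION_VARIANCE / n  # sigma^2 / n
    print(f"n={n}: simulated {simulated:.5f}, sigma^2/n = {theory:.5f}")
```

The simulated variances should land close to $\sigma^2/n$, and the value for n = 25 should be roughly one fifth of the value for n = 5.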
Distribution of the Sample Average
• In the preceding slides, no assumption was made about the distribution of the population (e.g., normal, uniform, exponential)
• The Central Limit Theorem says that the distribution of the sample average approaches a Normal distribution as the sample size n becomes large
• This holds even if the underlying population is non-Normal
• When using hypothesis tests, confidence limits and control charts, we will assume Normality for $\bar{X}$
Sample Variance
The variance of the population can be estimated using the sample variance
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
People are sloppy and use lower case $s^2$ for both the observed value and the random variable.
Sample Variance
Observed value: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Expected value of the sample variance: $E[s^2] = \sigma^2$
Sample variance is an UNBIASED estimator of population variance.
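A quick simulation can illustrate why the divisor is n-1. This is a sketch under assumed settings (Normal data with mean 10 and standard deviation 2, samples of size 5); the function name `average_variance_estimates` is hypothetical.

```python
import random


def average_variance_estimates(n=5, replicates=20000, sigma=2.0, seed=2):
    """Average two variance estimates over many samples of size n.

    Returns (mean of SS/(n-1), mean of SS/n), where SS is the sum of
    squared deviations from the sample average.
    """
    rng = random.Random(seed)
    total_unbiased = 0.0  # divisor n - 1 (sample variance)
    total_biased = 0.0    # divisor n
    for _ in range(replicates):
        sample = [rng.gauss(10.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        ss = sum((x - xbar) ** 2 for x in sample)
        total_unbiased += ss / (n - 1)
        total_biased += ss / n
    return total_unbiased / replicates, total_biased / replicates


unbiased, biased = average_variance_estimates()
print(f"true variance 4.0, divisor n-1: {unbiased:.3f}, divisor n: {biased:.3f}")
```

With the n-1 divisor the long-run average should sit near the true variance of 4; with the n divisor it should sit near (n-1)/n times that, i.e. about 3.2.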
Sample Standard Deviation
Sample standard deviation is simply the square root of the sample variance: $s = \sqrt{s^2}$
Outline • Estimating the true mean using sample average • Estimating the true variance using sample variance • Confidence intervals and hypothesis tests for means and variances
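Confidence Intervals
Consider the sample average: $\bar{X} \sim N(\mu, \sigma^2/n)$
• “$\sim$” reads “is distributed as”
• $N(\mu, \sigma^2/n)$ reads “Normally distributed with mean $\mu$ and variance $\sigma^2/n$”
We can standardize this to have zero mean and unit variance: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$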
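Getting Confidence Intervals for the True Mean using Sampled Data
Distribution for standard normal: start with
$P(-1.96 \le Z \le 1.96) = 0.95$
If $\bar{X}$ is normally distributed:
$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$
Let’s rearrange to get $\mu$ in the middle.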
Confidence Intervals
Rearranging gives:
$P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$
The two endpoints are RANDOM; $\mu$ in the middle is NOT random.
Interpretation -
• The limits of the interval have uncertainty - if we get a new set of samples, re-estimate the average and re-compute the limits, the endpoints change somewhat, BUT 95% of the time the interval will contain the true value of the mean. 5% of the time the true mean will be outside the limits.
• What is the “true mean” anyway?
Confidence Intervals
Imagine repeating a set of experiments eight times and calculating confidence intervals on the mean from each set of experiments. The true mean wouldn’t move, but the confidence limits would.
[Figure: eight confidence intervals plotted against the true value of the mean]
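The thought experiment of repeating the experiments can be sketched in code. Everything numeric here is assumed for illustration (a Normal population with true mean 70 and known standard deviation 2, ten data points per experiment), and the names `simulate_intervals` and `coverage` are hypothetical.

```python
import math
import random


def simulate_intervals(n_experiments, n=10, mu=70.0, sigma=2.0, z=1.96, seed=3):
    """Return one known-variance 95% interval per simulated set of experiments."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # the same for every experiment
    intervals = []
    for _ in range(n_experiments):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        intervals.append((xbar - half_width, xbar + half_width))
    return intervals


def coverage(intervals, mu=70.0):
    """Fraction of intervals that contain the true mean."""
    return sum(lo <= mu <= hi for lo, hi in intervals) / len(intervals)


# Eight repeats: the endpoints move, the true mean (70) does not.
for lo, hi in simulate_intervals(8):
    print(f"[{lo:.2f}, {hi:.2f}]  contains true mean: {lo <= 70.0 <= hi}")

# Many repeats: the fraction of intervals containing the true mean.
print("long-run coverage:", coverage(simulate_intervals(4000)))
```

Over many repeats the coverage should settle near 0.95, which is the sense in which the interval is a "95%" interval.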
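Confidence Intervals
What if we want a 90% or 99% confidence interval?
The 100(1-$\alpha$)% confidence interval is given by:
$\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
where -
• $z_{\alpha/2}$ is the “fence” value for which $P(Z > z_{\alpha/2}) = \alpha/2$
• the value is obtained from tables
• For 95%, $\alpha$ = 0.05 and $z_{\alpha/2}$ = 1.96
• For 99%, $\alpha$ = 0.01 and $z_{\alpha/2}$ = 2.576
• Why do we find $z_{\alpha/2}$ instead of $z_{\alpha}$?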
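The fence values can also be computed instead of read from tables. A minimal sketch using Python's standard library follows; the helper name `z_fence` is an assumption for this example.

```python
from statistics import NormalDist


def z_fence(confidence):
    """Return z_{alpha/2}, the value with upper-tail area alpha/2 under N(0, 1)."""
    alpha = 1.0 - confidence
    # The upper-tail area alpha/2 corresponds to cumulative probability 1 - alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_fence(level):.3f}")
```

The tail area is split in two because the interval has two fences, one on each side, and each fence leaves $\alpha/2$ outside.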
Confidence Intervals for Mean
When population variance is “known”, the 100(1-$\alpha$)% confidence interval is
$\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
Known variance -
• We might be comfortable assuming that we “know” the variance when the process has been operating steadily for a long period of time
• This is on the basis of extensive operating experience and a large number of data points
But we usually don’t know the variance!
Confidence Intervals for Mean
What if variance is unknown? This is the usual situation!
• Estimate $\sigma^2$ using the sample variance $s^2$
• Issue - $s^2$ is a random variable
• The approximate quantity $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ no longer has a standard Normal distribution
Solution -
• What is the probability distribution of $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ when data are Normally distributed?
• Student’s t distribution
Student’s t Distribution
When the data are from a Normally distributed population:
$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$ follows a Student’s t distribution with n-1 degrees of freedom
Degrees of freedom (read p. 24 of text)
• Number of independent pieces of information used to compute the sample variance.
• Recall that when we calculate $s^2$, we divide by n-1.
• One degree of freedom gets used up because we calculate $\bar{x}$ and use it to obtain $s^2$
Student’s t Distribution
• has a shape similar to that of the Normal distribution, but the tails are heavier
• symmetric
• Cumulative t distribution is in Table II on pg. A-4.
[Figure: t distribution with 3 degrees of freedom]
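The heavier tails can be seen by simulation. The sketch below assumes standard Normal data and samples of size 4 (3 degrees of freedom); the name `tail_fraction` is hypothetical. It counts how often the t statistic falls outside the Normal fences of ±1.96.

```python
import math
import random


def tail_fraction(n=4, replicates=20000, fence=1.96, seed=4):
    """Fraction of simulated t statistics with |T| > fence, using n-1 degrees of freedom."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(replicates):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean is 0
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = xbar / (s / math.sqrt(n))
        if abs(t) > fence:
            outside += 1
    return outside / replicates


print("fraction outside +/-1.96 with 3 degrees of freedom:", tail_fraction())
```

A standard Normal variable falls outside ±1.96 about 5% of the time; with only 3 degrees of freedom the fraction is noticeably larger, which is why the t fences are wider than the z fences.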
Confidence Intervals for Population Mean
True Variance Unknown
• 100(1-$\alpha$)% case: $\bar{X} - t_{\alpha/2,\nu}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,\nu}\,\frac{s}{\sqrt{n}}$
• $\nu$, the number of degrees of freedom, is (n-1) when n data points are used to compute the sample variance.
• Let’s do a proof.
• How do we get confidence intervals in general?
General Approach for Obtaining Confidence Intervals
• Determine a quantity with a known distribution that depends on the parameter of interest
• Write a probability statement using fences with a known probability
• Re-arrange the statement to obtain an interval specifying a range of values for the parameter of interest
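As a worked illustration, the three steps can be applied to the mean with unknown variance. This is a sketch of the standard argument, assuming Normally distributed data; the notation $t_{\alpha/2,\,n-1}$ for the fence with upper-tail area $\alpha/2$ is a convention chosen here.

```latex
% Step 1: a quantity with a known distribution that depends on \mu
T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}

% Step 2: probability statement using fences
P\left( -t_{\alpha/2,\,n-1} \le \frac{\bar{X} - \mu}{s/\sqrt{n}} \le t_{\alpha/2,\,n-1} \right) = 1 - \alpha

% Step 3a: multiply through by s/\sqrt{n}
P\left( -t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \bar{X} - \mu \le t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \right) = 1 - \alpha

% Step 3b: subtract \bar{X}, then multiply by -1 (which reverses the inequalities)
P\left( \bar{X} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \right) = 1 - \alpha
```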
Example #1
Conversion in a chemical reactor using new catalyst
• Average conversion computed using 10 data points is 76.1%
• Prior operating history indicates that the variance of conversion is 4.41 %²
• Determine the 95% confidence interval for mean conversion using the new catalyst, and use this to determine whether the new conversion is significantly different from the mean conversion obtained with the old catalyst, which is known to be 70%
• What assumptions will we need to make to find the answer? Do they bother you?
Example #1
• Confidence interval - 95%
• upper tail area is 2.5%, so $z_{0.025} = 1.96$
• confidence interval: $76.1 \pm 1.96\sqrt{4.41/10} = 76.1 \pm 1.30$, i.e. $74.8 \le \mu \le 77.4$
• conclusion - the interval doesn’t contain the conversion of 70% for the old catalyst, so we conclude that the new preparation is providing a significant change (increase) in conversion
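The arithmetic for Example #1 can be reproduced with a few lines of code. This is a sketch that treats the variance of 4.41 %² as known, as the example does; the function name `z_interval` is an assumption.

```python
import math
from statistics import NormalDist


def z_interval(xbar, variance, n, confidence=0.95):
    """Confidence interval for the mean when the population variance is known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(variance / n)
    return xbar - half_width, xbar + half_width


lower, upper = z_interval(xbar=76.1, variance=4.41, n=10)
print(f"95% CI for mean conversion: ({lower:.2f}, {upper:.2f})")
print("contains old-catalyst mean of 70:", lower <= 70.0 <= upper)
```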
Example #2
Conversion in a chemical reactor using new catalyst
• Average conversion computed using 10 data points is 76.1%
• The data set of 10 points was used to calculate the sample variance, which is 5.3 %²
• Determine the 95% confidence interval for mean conversion using the new catalyst, and use this to determine whether the new conversion is significantly different from the conversion obtained using the old catalyst, which is known to be 70%
• How would we calculate the sample variance?
Example #2
• Confidence interval - 95%
• variance UNKNOWN - need to use Student’s t distribution, with degrees of freedom = 10-1 = 9
• upper tail area is 2.5%, so $t_{0.025,9} = 2.262$
• confidence interval: $76.1 \pm 2.262\sqrt{5.3/10} = 76.1 \pm 1.65$, i.e. $74.5 \le \mu \le 77.7$
• Conclusion - the interval doesn’t contain the conversion of 70%, so the new catalyst is providing a significant change (increase) in conversion
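Example #2 can be checked the same way. Python's standard library has no t quantile function, so this sketch hard-codes the standard table value $t_{0.025,9} = 2.262$; the constant and function names are assumptions for the illustration.

```python
import math

T_0025_9DF = 2.262  # table value: upper 2.5% point of Student's t with 9 degrees of freedom


def t_interval(xbar, sample_variance, n, t_fence):
    """Confidence interval for the mean when the variance is estimated from the data."""
    half_width = t_fence * math.sqrt(sample_variance / n)
    return xbar - half_width, xbar + half_width


lower, upper = t_interval(xbar=76.1, sample_variance=5.3, n=10, t_fence=T_0025_9DF)
print(f"95% CI for mean conversion: ({lower:.2f}, {upper:.2f})")
print("contains old-catalyst mean of 70:", lower <= 70.0 <= upper)
```

For other sample sizes the fence would have to be looked up again, since it depends on the degrees of freedom.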
Confidence Intervals for Variance
First, we need to know the sampling distribution of the sample variance:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
If the data are Normally distributed, $s^2$ is proportional to a sum of squared Normal random variables.
Chi-squared distribution
• $\chi^2$ is the name given to the distribution of a squared standard Normal random variable
• $Z^2$ is a Chi-squared random variable with 1 degree of freedom
• degrees of freedom = number of independent standard Normal random variables being squared
• e.g., $Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$ has 3 degrees of freedom
[Figure: Chi-squared distribution with 3 degrees of freedom]
Chi-squared distribution
• Functional form of the $\chi^2$ distribution is in Montgomery and Runger.
• Integrals are available in Table III in Appendix A.
• The $\chi^2$ distribution is asymmetric. It goes from 0 to $\infty$.
• Why can’t random samples from $\chi^2$ be negative?
Sampling distribution - sample variance
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
• Looks like it might be the sum of n squared Normal random variables
• However, the calculated sample average introduces a constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the other n-1 values and the average)
• The sample variance really contains n-1 independent Normal random variables, so the Chi-squared distribution has n-1 degrees of freedom
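The n-1 degrees of freedom can be checked by simulation, using the standard result that $(n-1)s^2/\sigma^2$ follows a Chi-squared distribution with n-1 degrees of freedom for Normal data. A Chi-squared variable with k degrees of freedom has mean k and variance 2k. The population settings and the name `scaled_sample_variances` are assumptions for this sketch.

```python
import random
import statistics


def scaled_sample_variances(n=10, replicates=20000, mu=5.0, sigma=3.0, seed=5):
    """Simulate (n-1) * s^2 / sigma^2 for many Normal samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        ss = sum((x - xbar) ** 2 for x in sample)  # equals (n - 1) * s^2
        values.append(ss / sigma ** 2)
    return values


values = scaled_sample_variances()
# Theory for n = 10: mean n - 1 = 9 and variance 2(n - 1) = 18.
print("simulated mean:", round(statistics.fmean(values), 2))
print("simulated variance:", round(statistics.variance(values), 2))
print("smallest value:", round(min(values), 3))  # a sum of squares is never negative
```

If the statistic had n degrees of freedom instead, the simulated mean would sit near 10 rather than 9.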
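Confidence Intervals for True Variance
• Form probability statement, using $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$:
$P\left(\chi^2_{1-\alpha/2,\,n-1} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2,\,n-1}\right) = 1 - \alpha$
• Re-arrange statement:
$P\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}\right) = 1 - \alpha$
• 100(1-$\alpha$)% interval is
$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}$
Here $\chi^2_{p,\,n-1}$ is the fence value with upper tail area p.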
Confidence Limits for Variance
[Figure: Chi-squared density with equal tail areas]
Notes
1) If the tail areas are equal, the confidence interval is asymmetric about $s^2$
• This is a consequence of the asymmetry of the Chi-squared distribution
Variance Confidence Intervals - Example
Temperature controller has been implemented on a polymerization reactor -
• variance under previous operation was 4.7 °C²
• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
• is the variance under the new control operation significantly different?
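Variance Confidence Intervals - Example
Use confidence interval for variance
• n-1 = 10-1 = 9 degrees of freedom
• form 95% confidence interval ($\alpha$ = 0.05)
• from tables: $\chi^2_{0.025,9} = 19.02$ and $\chi^2_{0.975,9} = 2.70$
• interval for variance: $\frac{9(3.2)}{19.02} \le \sigma^2 \le \frac{9(3.2)}{2.70}$, i.e. $1.51 \le \sigma^2 \le 10.67$ °C²
• Conclusion: 4.7 is within this range, so the reduction in variance is not statistically significant
• Notice that the interval isn’t symmetric about 3.2 °C²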
Variance Confidence Intervals - Example
Comment
• Confidence intervals for variance are sensitive to degrees of freedom
• We need a larger number of data points to obtain a precise estimate
• e.g., if the variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be: $\frac{30(3.2)}{46.98} \le \sigma^2 \le \frac{30(3.2)}{16.79}$, i.e. $2.04 \le \sigma^2 \le 5.72$ °C²
• Compare with the previous interval with 10 data points
The conclusion still doesn’t change, however: 4.7 °C² is still inside the interval.
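The two variance intervals can be compared side by side. The sketch below hard-codes standard Chi-squared table values for 9 and 30 degrees of freedom, since the Python standard library has no Chi-squared quantile function; `CHI2_TABLE` and `variance_interval` are names invented for this illustration.

```python
# Degrees of freedom -> (fence with upper tail area 0.025, fence with upper tail area 0.975).
# Values are standard Chi-squared table entries.
CHI2_TABLE = {
    9: (19.02, 2.70),
    30: (46.98, 16.79),
}


def variance_interval(sample_variance, dof):
    """95% confidence interval for the true variance, with equal tail areas."""
    upper_fence, lower_fence = CHI2_TABLE[dof]
    sum_of_squares = dof * sample_variance  # (n - 1) * s^2
    # Dividing by the larger fence gives the lower limit, and vice versa.
    return sum_of_squares / upper_fence, sum_of_squares / lower_fence


for dof in (9, 30):
    lower, upper = variance_interval(3.2, dof)
    print(f"{dof} degrees of freedom: ({lower:.2f}, {upper:.2f}), "
          f"contains 4.7: {lower <= 4.7 <= upper}")
```

The interval with 30 degrees of freedom is much narrower, yet in both cases the previous variance of 4.7 °C² lies inside it.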