Characterizing Variability and Comparing Patterns from Data

  1. Characterizing Variability and Comparing Patterns from Data “Statistics” Module 3

  2. Outline
     • random samples
     • notion of a statistic
     • estimating the mean: sample average
     • assessing the impact of variation on estimates: sampling distribution
     • estimating variance: sample variance and standard deviation
     • making decisions: comparisons of means and variances using confidence intervals and hypothesis tests
     J. McLellan

  3. Random Samples
     Scenario:
     • we have an underlying pattern of variability for a process that we would like to characterize: the population
     • we perform a series of experiments on the process in such a way that the results are independent, i.e., the outcome of one experiment has no influence on any other experiment
     • the underlying distribution in place during each experimental run is identical to that of the population
     • when we run each experiment, we are collecting a value of the random variable $X_i$, which has uncertainty
     • $X_i$ represents the “i-th” act of sampling and is referred to as a sample random variable

  4. Definition - Random Sample
     A random sample of size “n” of a population random variable X is a collection of random variables $X_1, \ldots, X_n$ such that
     • the $X_i$’s are independent
     • the $X_i$’s have distributions identical to that of X, i.e.,
       $$ F_{X_i}(x) = F_X(x) $$
     Each $X_i$ represents a snapshot of the process. The $X_i$’s are referred to as sample random variables.
     What do we do with these sample values?

  5. Sample Average
     • used to estimate the mean
     • given “n” samples, $X_1, \ldots, X_n$, compute
       $$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i $$
     • interpretation: a rule for computing the sample average, involving sampling
     • $\bar{X}$ is a random variable
     • observed value:
       $$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
     Lower case is used to denote observed values of the sample random variables and average.

  6. Statistics
     • Sample average is an example of a “statistic”
     Definition: A statistic is a function of sample random variables that is used to estimate the value of a parameter, and does not depend on any unknown parameters.
     • e.g., the sample average estimates the mean $\mu$ and doesn’t depend on unknown parameters:
       $$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i $$

  7. Sampling Distribution
     A statistic is a random variable, with its own probability distribution.
     • the distribution arises from the probability distribution of the underlying population, via the sample random variables
     • the distribution of the statistic is called the sampling distribution
     • characteristics of the sampling distribution depend on:
       • the form of the statistic, e.g., linear function of the sample random variables
       • the distribution of the underlying population

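  8. Sampling Distribution for the Sample Average
     • determine the mean and variance of the sample average
     Mean:
     $$ E\{\bar{X}\} = E\left\{ \frac{1}{n} \sum_{i=1}^{n} X_i \right\} = \frac{1}{n} E\left\{ \sum_{i=1}^{n} X_i \right\} = \frac{1}{n} \sum_{i=1}^{n} E\{X_i\} = \frac{1}{n} \sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu $$
     The expectation passes through the sum by linearity, and each $E\{X_i\} = \mu$ because the sample random variables have the same distribution as the population. (Independence of the sample random variables is what the variance calculation relies on; it is not needed for this step.)
     The value expected on average of the sample average is the true mean of the process: the sample average is an UNBIASED estimator for the mean.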

  9. Sampling Distribution for the Sample Average
     Variance:
     $$ \mathrm{Var}(\bar{X}) = \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) = \frac{1}{n^2} \mathrm{Var}\left( \sum_{i=1}^{n} X_i \right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} $$

  10. Aside - Variance
      If we have a sum of independent random variables, X and Y, with “a” and “b” constants, then
      $$ \mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) $$

  11. Variance of Sample Average
      Interpretation:
      • variance of sample average is $\sigma^2 / n$
      • as n becomes larger, variance of sample average becomes smaller
      • as more data is used, estimate becomes more precise
      • sample average represents a concentration of information
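
The unbiasedness and variance results for the sample average can be checked numerically. The following is a minimal simulation sketch, not part of the original slides; the Normal population and the values of `mu`, `sigma`, `n`, and `n_repeats` are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population parameters and sample size (not from the slides).
mu, sigma, n = 10.0, 2.0, 25
n_repeats = 20_000

# Repeat the experiment many times, recording the sample average each time.
averages = []
for _ in range(n_repeats):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    averages.append(statistics.fmean(sample))

mean_of_averages = statistics.fmean(averages)
var_of_averages = statistics.pvariance(averages)
theoretical_var = sigma**2 / n

print(f"mean of sample averages:     {mean_of_averages:.3f} (mu = {mu})")
print(f"variance of sample averages: {var_of_averages:.4f} (sigma^2/n = {theoretical_var:.4f})")
```

Under these assumptions, the mean of the simulated averages should sit close to `mu` and their variance close to `sigma**2 / n`; increasing `n` should shrink the variance proportionally.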

  12. Distribution of the Sample Average
      • in the preceding slides, no assumption was made about the distribution of the population (e.g., normal, exponential)
      • the Central Limit Theorem implies that the distribution of the sample average approaches a Normal distribution when the number of samples becomes large
        • even if the underlying population is non-Normal
      • important consequences for comparing values: hypothesis tests and confidence limits
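
A rough way to see the Central Limit Theorem at work is to draw sample averages from a clearly non-Normal population and check how often the standardized average falls between the Normal fences of ±1.96. This sketch is an illustration only; the exponential population, `n = 50`, and the number of repetitions are assumptions, not values from the slides.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical setup: exponential population with mean 1 (so standard
# deviation 1), which is strongly skewed and clearly non-Normal.
pop_mean, pop_sd = 1.0, 1.0
n = 50
n_repeats = 20_000

inside = 0
for _ in range(n_repeats):
    sample = [random.expovariate(1.0 / pop_mean) for _ in range(n)]
    # Standardize the sample average using the known population values.
    z = (statistics.fmean(sample) - pop_mean) / (pop_sd / math.sqrt(n))
    if -1.96 < z < 1.96:
        inside += 1

fraction_inside = inside / n_repeats
print(f"fraction of standardized averages within +/-1.96: {fraction_inside:.3f}")
```

If the Normal approximation is reasonable at this sample size, the printed fraction should be near 0.95.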

  13. Outline
      • random samples
      • notion of a statistic
      • estimating the mean: sample average
      • assessing the impact of variation on estimates: sampling distribution
      • estimating variance: sample variance and standard deviation
      • making decisions: comparisons of means and variances using confidence intervals and hypothesis tests

  14. Sample Variance
      The variance is estimated using the following statistic:
      $$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 $$
      Observed value:
      $$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
      Mean of the sample variance:
      $$ E\{s^2\} = \sigma^2 $$
      Sample variance is an UNBIASED estimator of variance.

  15. Sample Standard Deviation
      The sample standard deviation is simply the square root of the sample variance, BUT
      • sample standard deviation is a biased estimator of population standard deviation
      • its value on average does not tend to the population value:
        $$ E\{s\} \neq \sigma $$
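
The contrast between the unbiased sample variance and the biased sample standard deviation shows up clearly in simulation with small samples. The sketch below is illustrative only; the Normal population, `sigma = 2`, and `n = 5` are assumptions made for the example.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population standard deviation and a deliberately small sample.
sigma, n = 2.0, 5
n_repeats = 40_000

variances, std_devs = [], []
for _ in range(n_repeats):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)  # sample variance, divides by n - 1
    variances.append(s2)
    std_devs.append(s2 ** 0.5)

mean_s2 = statistics.fmean(variances)
mean_s = statistics.fmean(std_devs)

print(f"average of s^2: {mean_s2:.3f} (sigma^2 = {sigma**2})")
print(f"average of s:   {mean_s:.3f} (sigma   = {sigma})")
```

Under these assumptions the average of `s^2` should land near `sigma**2`, while the average of `s` should fall noticeably below `sigma`.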

  16. Outline
      • random samples
      • notion of a statistic
      • estimating the mean: sample average
      • assessing the impact of variation on estimates: sampling distribution
      • estimating variance: sample variance and standard deviation
      • making decisions: comparisons of means and variances using confidence intervals and hypothesis tests

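  17. Confidence Intervals
      Consider the sample average:
      $$ \bar{X} \sim N(\mu_X, \sigma_X^2 / n) $$
      Here “$\sim$” reads “is distributed as”, and $N(\mu_X, \sigma_X^2/n)$ reads “Normally distributed with mean $\mu_X$ and variance $\sigma_X^2/n$”.
      We can standardize this to have zero mean and unit variance:
      $$ Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} $$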

  18. Confidence Intervals
      Start with the distribution for the standard Normal and consider Z:
      $$ P(-1.96 < Z < 1.96) = 0.95 $$
      $$ P\left( -1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96 \right) = 0.95 $$
      $$ \Leftrightarrow \quad P\left( \mu_X - 1.96\,\sigma_X/\sqrt{n} < \bar{X} < \mu_X + 1.96\,\sigma_X/\sqrt{n} \right) = 0.95 $$

  19. Confidence Intervals
      Rearrange this last statement to obtain:
      $$ P\left( \bar{X} - 1.96\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X/\sqrt{n} \right) = 0.95 $$
      The two endpoints are RANDOM; the mean $\mu_X$ in the middle is NOT random.
      Interpretation:
      • the limits of the interval have uncertainty: if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat, BUT 95% of the time the interval would contain the true value of the mean

  20. Confidence Intervals
      • this interval DOES NOT imply that the mean $\mu$ is uncertain
      [Figure: sequence of intervals associated with repeated experimentation, shown against the true value of the mean]
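
The "repeated experimentation" picture can be reproduced numerically: the true mean stays fixed, the interval endpoints move from one experiment to the next, and about 95% of the intervals should cover the true mean. This is a minimal sketch under assumed values (`mu`, `sigma`, `n`, and a Normal population are hypothetical and not from the slides).

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical true mean, known standard deviation, and sample size.
mu, sigma, n = 50.0, 3.0, 10
n_repeats = 10_000

half_width = 1.96 * sigma / math.sqrt(n)  # fixed because sigma is known

covered = 0
for _ in range(n_repeats):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    # The endpoints are random (they depend on x_bar); mu is not.
    if x_bar - half_width < mu < x_bar + half_width:
        covered += 1

coverage = covered / n_repeats
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```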

  21. Confidence Intervals
      General result for the mean: the 100(1-α)% confidence interval is given by
      $$ \bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n} $$
      where
      • $z_{\alpha/2}$ is the “fence”: the value for which $P(Z > z_{\alpha/2}) = \alpha/2$
      • the value is obtained from tables
        • 95%: value is 1.96, approximately 2
        • 99%: value is 2.576
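
The fence values normally read from tables can also be computed from the inverse cumulative distribution function of the standard Normal. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_fence` is a hypothetical name introduced for this illustration.

```python
from statistics import NormalDist


def z_fence(confidence):
    """Return z such that P(Z > z) = alpha / 2, where alpha = 1 - confidence."""
    alpha = 1.0 - confidence
    # Upper tail area alpha/2 corresponds to cumulative probability 1 - alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


z_95 = z_fence(0.95)
z_99 = z_fence(0.99)
print(f"95% fence: {z_95:.3f}")
print(f"99% fence: {z_99:.3f}")
```

These should agree with the standard table values of about 1.96 and 2.576.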

  22. Confidence Intervals
      General Approach
      • form a quantity with a known distribution that depends on the parameter of interest:
        $$ Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} $$
      • form a probability statement, choosing fences (limits) with a known probability:
        $$ P\left( -1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96 \right) = 0.95 $$
      • re-arrange the statement to obtain an interval specifying a range of values for the parameter of interest:
        $$ P\left( \bar{X} - 1.96\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X/\sqrt{n} \right) = 0.95 $$

  23. Confidence Intervals for Mean
      When the population variance is “known”, the 100(1-α)% confidence interval is
      $$ \bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n} $$
      Known variance:
      • knowledge of variance when the process has been operating steadily for a long period of time
      • on the basis of extensive operating experience
      • “large number of data points”

  24. Confidence Intervals for Mean
      What if the variance is unknown?
      • estimate it using the sample variance $s^2$
      Follow the previous approach by forming the standardized quantity:
      $$ \frac{\bar{X} - \mu_X}{s_X / \sqrt{n}} $$
      • issue: $s^2$ is a statistic itself, and is a random variable
      • this quantity no longer has a standard Normal distribution
      Solution:
      • what is the probability distribution of this quantity, when the data are Normally distributed?

  25. Student’s t Distribution
      When the data are Normally distributed,
      $$ \frac{\bar{X} - \mu_X}{s_X / \sqrt{n}} $$
      follows a Student’s t distribution with n-1 degrees of freedom.
      Degrees of freedom:
      • number of statistically independent pieces of information used to compute the sample variance
      • recall that in $s^2$, we divide by n-1, where n is the number of data points

  26. Student’s t Distribution
      The Student’s t distribution has a shape similar to that of the Normal distribution:
      • symmetric
      • values are available in tables
      • extra parameter in tables: degrees of freedom
      [Figure: Student’s t density with 3 degrees of freedom]

  27. Confidence Intervals for Mean
      Variance Unknown
      • estimated using the sample variance
      • 100(1-α)% case:
        $$ \bar{X} - t_{\nu,\alpha/2}\,s_X/\sqrt{n} < \mu_X < \bar{X} + t_{\nu,\alpha/2}\,s_X/\sqrt{n} $$
      • $\nu$ is the number of degrees of freedom (n-1), where n is the number of data points used to compute the sample variance (and average)
      • obtained following the identical argument used in the known variance case

  28. Example #1
      Conversion in a chemical reactor using a new catalyst preparation
      • data collected: average conversion computed using 10 data points is 76.1%
      • prior operating history indicates that the variance of conversion is 4.41 %²
      • determine the 95% confidence interval for mean conversion under the new preparation, and use this to determine whether the new conversion is significantly different from the current conversion, which is known to be 70%

  29. Example #1
      • confidence interval: 95%
      • upper tail area is 2.5%, so $z_{0.025} = 1.96$
      • standard deviation = sqrt(4.41) = 2.1
      • confidence interval:
        $$ 76.1 - (1.96)(2.1)/\sqrt{10} < \mu < 76.1 + (1.96)(2.1)/\sqrt{10} \;\Rightarrow\; 74.8 < \mu < 77.4 $$
      • conclusion: the interval doesn’t contain the current conversion of 70%, so the new preparation is providing a significant change (increase) in conversion
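
The arithmetic in Example #1 can be reproduced directly. This is a small sketch using only the numbers quoted in the slides; the variable names are chosen here for illustration.

```python
import math

# Values quoted in Example #1.
x_bar = 76.1               # average conversion, %
known_variance = 4.41      # variance from prior operating history, %^2
n = 10                     # number of data points
z = 1.96                   # Normal fence for 95% confidence
current_conversion = 70.0  # %

half_width_1 = z * math.sqrt(known_variance) / math.sqrt(n)
lower_1 = x_bar - half_width_1
upper_1 = x_bar + half_width_1
contains_current = lower_1 <= current_conversion <= upper_1

print(f"95% interval: {lower_1:.1f} < mu < {upper_1:.1f}")
print(f"interval contains {current_conversion}%: {contains_current}")
```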

  30. Example #2
      Conversion in a chemical reactor using a new catalyst preparation
      • data collected: average conversion computed using 10 data points is 76.1%
      • the current data set of 10 points is used to estimate the sample variance, which is 5.3 %²
      • determine the 95% confidence interval for mean conversion under the new preparation, and use this to determine whether the new conversion is significantly different from the current conversion, which is known to be 70%

  31. Example #2
      • confidence interval: 95%
      • variance UNKNOWN: need to use Student’s t distribution, with degrees of freedom = 10-1 = 9
      • upper tail area is 2.5%, so $t_{9,0.025} = 2.262$
      • standard deviation = sqrt(5.3) = 2.3
      • confidence interval:
        $$ 76.1 - (2.262)(2.3)/\sqrt{10} < \mu < 76.1 + (2.262)(2.3)/\sqrt{10} \;\Rightarrow\; 74.5 < \mu < 77.7 $$
      • conclusion: the interval doesn’t contain the current conversion of 70%, so the new preparation is providing a significant change (increase) in conversion
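
Example #2 can be checked the same way. The sketch below takes the t value 2.262 from the slide (a table value for 9 degrees of freedom) rather than computing it, and also evaluates the half-width that the Normal fence 1.96 would have given, purely as a comparison; the variable names are illustrative.

```python
import math

# Values quoted in Example #2.
x_bar = 76.1            # average conversion, %
sample_variance = 5.3   # estimated from the same 10 points, %^2
n = 10
t_fence = 2.262         # table value for 9 degrees of freedom, upper tail 2.5%
z_fence_95 = 1.96       # Normal fence, used only for comparison

s = math.sqrt(sample_variance)
t_half_width = t_fence * s / math.sqrt(n)
z_half_width = z_fence_95 * s / math.sqrt(n)

lower_2 = x_bar - t_half_width
upper_2 = x_bar + t_half_width

print(f"95% t interval: {lower_2:.1f} < mu < {upper_2:.1f}")
print(f"half-width with t: {t_half_width:.2f}, with z: {z_half_width:.2f}")
```

The t-based interval is wider than the z-based one would be, which reflects the extra uncertainty from estimating the variance with only 9 degrees of freedom.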

  32. Confidence Intervals for Variance
      First, we need to know the sampling distribution of the sample variance:
      $$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 $$
      • when data are Normally distributed, the sample variance is a sum of squared Normal random variables
      • squaring “folds over” the negative values of the Normal random variable and makes them positive, which produces asymmetry

  33. Chi-squared distribution
      • is the distribution of a squared standard Normal random variable:
        $$ Z^2 \sim \chi^2_1 $$
        • Chi-squared random variable with 1 degree of freedom
      • degrees of freedom = number of independent standard Normal random variables being squared
      • e.g., 3 degrees of freedom:
        $$ Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3 $$
      [Figure: Chi-squared density with 3 degrees of freedom]
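
The construction of a Chi-squared random variable can be mimicked by summing squared standard Normal draws. The following sketch is an illustration under assumed settings (`k = 3` to match the slide's example, 20,000 draws); it relies on the standard fact that a Chi-squared random variable has mean equal to its degrees of freedom.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is repeatable

k = 3             # degrees of freedom: number of squared standard Normals summed
n_draws = 20_000  # hypothetical number of simulated values

draws = [
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
    for _ in range(n_draws)
]

mean_draw = statistics.fmean(draws)
median_draw = statistics.median(draws)

print(f"mean:   {mean_draw:.3f} (degrees of freedom = {k})")
print(f"median: {median_draw:.3f}")
```

All simulated values are non-negative, and the mean lying above the median is a sign of the right-skewed (asymmetric) shape.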

  34. Sampling distribution - sample variance
      Sample variance
      • is built from a sum of n squared Normal random variables, BUT the terms being summed are squared deviations from the sample average
      • the given value of the sample average introduces a constraint: given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)
      • the sample variance contains n-1 independent Normal random variables, so the degrees of freedom for the Chi-squared distribution is n-1:
        $$ s^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1}, \quad \text{equivalently} \quad \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} $$

  35. Confidence Intervals - Sample Variance
      • form probability statement:
        $$ P\left( \chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2} \right) = 1 - \alpha $$
      • re-arrange statement:
        $$ P\left( \frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}} \right) = 1 - \alpha $$
      • 100(1-α)% interval is
        $$ \frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}} $$

  36. Confidence Limits for Variance
      Notes
      1) the tail areas are equal
         • the tail areas are symmetric; however, the interval can be asymmetric
         • this is a consequence of the asymmetry of the Chi-squared distribution
      2) $\chi^2_{n-1,\,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of 1-α/2 and n-1 degrees of freedom

  37. Variance Confidence Intervals - Example
      A temperature controller has been implemented on a polymer reactor:
      • variance under previous operation was 4.7 °C²
      • under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
      • is the variance under the new control operation significantly better?
        • i.e., is the variance under new operation significantly lower?

  38. Variance Confidence Intervals - Example
      Use the confidence interval for variance
      • n-1 = 10-1 = 9 degrees of freedom
      • form the 95% confidence interval (α = 0.05)
      • from tables: $\chi^2_{9,\,1-0.025} = 2.7$ and $\chi^2_{9,\,0.025} = 19.0$
      • interval for variance:
        $$ 1.52 < \sigma^2 < 10.67 $$
      • conclusion: the interval contains the previous variance of 4.7, so the variance reduction isn’t significant after background variation in the sample variance computation is taken into account
      • note that the interval isn’t symmetric
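
The interval in this example follows from the general formula $(n-1)s^2/\chi^2_{n-1,\alpha/2} < \sigma^2 < (n-1)s^2/\chi^2_{n-1,1-\alpha/2}$. A short sketch of the arithmetic, using the table values quoted in the slide rather than computing the Chi-squared fences (variable names are illustrative):

```python
# Values quoted in the example.
n = 10
sample_variance = 3.2    # under new operation
previous_variance = 4.7  # under previous operation
chi2_upper = 19.0        # table value: 9 degrees of freedom, upper tail area 0.025
chi2_lower = 2.7         # table value: 9 degrees of freedom, upper tail area 0.975

dof = n - 1
# The larger fence produces the lower limit, and the smaller fence the upper limit.
var_lower = dof * sample_variance / chi2_upper
var_upper = dof * sample_variance / chi2_lower
reduction_significant = not (var_lower <= previous_variance <= var_upper)

print(f"95% interval: {var_lower:.2f} < sigma^2 < {var_upper:.2f}")
print(f"variance reduction significant: {reduction_significant}")
```

The limits sit at unequal distances from the estimate 3.2, which is the asymmetry noted in the slide.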

  39. Variance Confidence Intervals - Example
      Comment
      • the variance interval is sensitive to the degrees of freedom
      • a larger number of data points is needed to obtain a precise estimate
      • e.g., if the variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:
        $$ 2.04 < \sigma^2 < 5.71 $$
      • cf. the previous interval with 10 data points:
        $$ 1.52 < \sigma^2 < 10.67 $$
      The conclusion still doesn’t change, however, because the previous variance of 4.7 lies inside both intervals.
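
The effect of degrees of freedom on the width of the variance interval can be tabulated with a small helper. In this sketch the fences for 9 degrees of freedom are those quoted in the slides; the fences for 30 degrees of freedom (46.98 and 16.79) are standard Chi-squared table values that do not appear in the slides and are supplied here as an assumption. The function name `variance_interval` is hypothetical.

```python
def variance_interval(s2, dof, chi2_upper, chi2_lower):
    """Return (lower, upper) limits for the variance from table fences."""
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower


s2 = 3.2  # variance estimate from the example

# dof -> (fence with upper tail area 0.025, fence with upper tail area 0.975)
fences = {
    9: (19.0, 2.7),      # from the slides
    30: (46.98, 16.79),  # standard table values, not given in the slides
}

intervals = {}
for dof, (upper_fence, lower_fence) in fences.items():
    lo, hi = variance_interval(s2, dof, upper_fence, lower_fence)
    intervals[dof] = (lo, hi)
    print(f"dof = {dof:2d}: {lo:.2f} < sigma^2 < {hi:.2f} (upper/lower ratio {hi / lo:.1f})")
```

With these fences the upper limit for 30 degrees of freedom comes out near 5.72, while the slide reports 5.71; the small difference presumably reflects rounding of the table value.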
