Characterizing Variability and Comparing Patterns from Data “Statistics” Module 3
Outline • random samples • notion of a statistic • estimating the mean - sample average • assessing the impact of variation on estimates - sampling distribution • estimating variance - sample variance and standard deviation • making decisions - comparisons of means, variances using confidence intervals, hypothesis tests J. McLellan
Random Samples
Scenario -
• we have an underlying pattern of variability for a process which we would like to characterize -- the population
• we perform a series of experiments on the process in such a way that the results are independent - the outcome of one experiment has no influence on any other experiment
• the underlying distribution in place during each experimental run is identical to that of the population
• when we run each experiment, we are collecting a value of the random variable $X_i$, which has uncertainty
• $X_i$ represents the “i-th” act of sampling - referred to as a sample random variable
Definition - Random Sample
A random sample of size “n” of a population random variable X is a collection of random variables $X_1, \ldots, X_n$ such that
• the $X_i$’s are independent
• the $X_i$’s have distributions identical to that of X, i.e., $F_{X_i}(x) = F_X(x)$
Each $X_i$ represents a snapshot of the process. The $X_i$’s are referred to as sample random variables.
What do we do with these sample values?...
Sample Average
• used to estimate the mean
• given “n” samples, $X_1, \ldots, X_n$, compute
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
• interpretation - a rule for computing the sample average, involving sampling
• $\bar{X}$ is a random variable
• observed value
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Lower case is used to denote observed values of the sample random variables and average.
Statistics
• Sample average is an example of a “statistic”
Definition - A statistic is a function of sample random variables that is used to estimate the value of a parameter, and does not depend on any unknown parameters.
• e.g., sample average estimates the mean and doesn’t depend on unknown parameters
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Sampling Distribution
A statistic is a random variable, with its own probability distribution
• distribution arises from probability distribution of underlying population, via the sample random variables
• distribution of the statistic is called the sampling distribution
• characteristics of the sampling distribution depend on:
  • the form of the statistic - e.g., linear function of the sample random variables
  • the distribution of the underlying population
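Sampling Distribution for the Sample Average
• determine the mean and variance of the sample average
Mean
$$E\{\bar{X}\} = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n} E\left\{\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}\sum_{i=1}^{n} E\{X_i\} = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu$$
The expectation moves inside the sum by linearity, and $E\{X_i\} = \mu$ because each sample random variable has the population distribution. Independence of the sample random variables is not needed for this step; it is needed for the variance.
The value expected on average of the sample average is the true mean of the process - sample average is an UNBIASED estimator for the mean.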
Sampling Distribution for the Sample Average
Variance
$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\,\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
Aside - Variance
If we have a sum of independent random variables, X and Y, with “a” and “b” constants, then
$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$
Variance of Sample Average
Interpretation
• variance of sample average is $\sigma^2 / n$
• as n becomes larger, variance of sample average becomes smaller
• as more data is used, estimate becomes more precise
• sample average represents a concentration of information
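The $\sigma^2 / n$ result can be checked numerically. The following is a minimal sketch, not part of the original slides: it assumes a Normal population with illustrative values $\mu = 10$ and $\sigma = 2$, and the function name `variance_of_sample_average` is hypothetical.

```python
import random
import statistics


def variance_of_sample_average(n, mu=10.0, sigma=2.0, repeats=5000, seed=0):
    """Estimate Var(X-bar) empirically by repeating an n-point experiment."""
    rng = random.Random(seed)
    averages = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        averages.append(statistics.fmean(sample))
    # Spread of the averages across the repeated experiments
    return statistics.variance(averages)


for n in (4, 16):
    observed = variance_of_sample_average(n)
    print(f"n={n:2d}  observed={observed:.4f}  sigma^2/n={2.0 ** 2 / n:.4f}")
```

The observed variance of the averages should sit close to $\sigma^2 / n$ and shrink by roughly a factor of four when n goes from 4 to 16.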
Distribution of the Sample Average
• in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)
• Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large
  • even if underlying population is non-Normal
• important consequences for comparing values - hypothesis tests and confidence limits
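One way to see the Central Limit Theorem at work is to watch the skewness of a non-Normal population wash out of the sample average. The sketch below is an illustration under assumptions that are not in the slides: the population is exponential with mean 1, and symmetry is judged by the fraction of averages falling below the true mean (a symmetric Normal distribution gives 0.5). The function name `fraction_below_mean` is hypothetical.

```python
import random
import statistics


def fraction_below_mean(n, repeats=5000, seed=1):
    """Fraction of n-point sample averages that fall below the true mean.

    The population is exponential with mean 1 (strongly skewed), so for a
    single observation this fraction is well above 0.5.
    """
    rng = random.Random(seed)
    below = 0
    for _ in range(repeats):
        average = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        if average < 1.0:
            below += 1
    return below / repeats


for n in (1, 10, 50):
    print(f"n={n:2d}  fraction below mean = {fraction_below_mean(n):.3f}")
```

As n grows the fraction should drift toward 0.5, which is consistent with the distribution of the sample average becoming more symmetric and more Normal-like.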
Sample Variance
The population variance is estimated using the following statistic:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
Observed value:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Mean of the sample variance:
$$E\{s^2\} = \sigma^2$$
Sample variance is an UNBIASED estimator of variance.
Sample Standard Deviation
… is simply the square root of the sample variance BUT
• sample standard deviation is a biased estimator of population standard deviation
• value on average does not equal the population value
$$E\{s\} \neq \sigma$$
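The contrast between the unbiased sample variance and the biased sample standard deviation can be seen in a small simulation. This is a sketch under assumptions not taken from the slides: a Normal population with $\sigma = 2$, samples of size 5, and a hypothetical helper named `average_s2_and_s`.

```python
import random
import statistics


def average_s2_and_s(n=5, sigma=2.0, repeats=10000, seed=2):
    """Average the sample variance and sample standard deviation over many samples."""
    rng = random.Random(seed)
    s2_values = []
    s_values = []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # divides by n - 1
        s2_values.append(s2)
        s_values.append(s2 ** 0.5)
    return statistics.fmean(s2_values), statistics.fmean(s_values)


mean_s2, mean_s = average_s2_and_s()
print(f"average s^2 = {mean_s2:.3f}  (sigma^2 = 4)")
print(f"average s   = {mean_s:.3f}  (sigma   = 2)")
```

The average of $s^2$ should land near 4, while the average of $s$ should fall noticeably below 2. The bias is largest for small n and fades as n grows.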
Confidence Intervals
Consider the sample average
$$\bar{X} \sim N(\mu_X, \sigma_X^2 / n)$$
where “~” reads “is distributed as” and $N(\cdot, \cdot)$ reads “Normally distributed with mean and variance”.
We can standardize this to have zero mean and unit variance:
$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$
Confidence Intervals
Distribution for standard normal: start with
$$P(-1.96 < Z < 1.96) = 0.95$$
and consider $Z = (\bar{X} - \mu_X) / (\sigma_X / \sqrt{n})$:
$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96\right) = 0.95$$
$$\Leftrightarrow \; P\left(\mu_X - 1.96\,\sigma_X / \sqrt{n} < \bar{X} < \mu_X + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$
Confidence Intervals
Rearrange this last statement to obtain:
$$P\left(\bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$
The two endpoints are RANDOM; $\mu_X$ is NOT random.
Interpretation -
• limits of interval have uncertainty - if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat, BUT 95% of the time the interval would contain the true value of the mean
Confidence Intervals
• this interval DOES NOT imply that the mean is uncertain
[Picture - sequence of intervals associated with repeated experimentation, with the true value of the mean marked]
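The repeated-experimentation picture can be imitated in code. The sketch below is only an illustration: it assumes a Normal process with a fixed true mean of 70 and a known standard deviation of 2.1 (values chosen for convenience), and the function name `coverage` is hypothetical. It counts how often the 95% interval, whose endpoints change from run to run, contains the fixed mean.

```python
import math
import random
import statistics


def coverage(n=10, mu=70.0, sigma=2.1, repeats=10000, seed=3):
    """Fraction of 95% intervals (known sigma) that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(repeats):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The endpoints move from experiment to experiment; mu does not.
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / repeats


print(f"fraction of intervals containing the true mean: {coverage():.3f}")
```

The fraction should come out close to 0.95, matching the interpretation that the interval, not the mean, is the random object.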
Confidence Intervals
General result for mean - 100(1-α)% confidence interval given by:
$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$
where -
• $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$
• value obtained from tables
• 95% - value is 1.96 - approximately 2
• 99% - value is 2.576
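As an alternative to tables, the fence value can be computed from the inverse of the standard Normal cumulative distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_fence` is hypothetical and not from the slides.

```python
from statistics import NormalDist


def z_fence(alpha):
    """Return z_{alpha/2}, the value with upper-tail area alpha/2 under N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for confidence in (0.90, 0.95, 0.99):
    alpha = 1.0 - confidence
    print(f"{confidence:.0%} confidence: z = {z_fence(alpha):.3f}")
```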
Confidence Intervals
General Approach
• form a quantity with a known distribution that depends on the parameter of interest
$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$
• form a probability statement - choose fences (limits) with a known probability
$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96\right) = 0.95$$
• re-arrange statement to obtain an interval specifying a range of values for the parameter of interest
$$P\left(\bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$
Confidence Intervals for Mean
When population variance is “known”, 100(1-α)% confidence interval is -
$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$
Known variance -
• knowledge of variance when process has been operating steadily for long period of time
• on basis of extensive operating experience
• “large number of data points”
Confidence Intervals for Mean
What if variance is unknown?
• Estimate using sample variance $s^2$
Follow previous approach by forming standardized quantity:
$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$
• issue - $s^2$ is a statistic itself, and is a random variable
• this quantity no longer has a standard Normal distribution
Solution -
• what is the probability distribution of this quantity, when data are Normally distributed?
Student’s t Distribution
When the data are Normally distributed,
$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$
follows a Student’s t distribution with n-1 degrees of freedom
Degrees of freedom -
• number of statistically independent pieces of information used to compute sample variance
• recall that in $s^2$, we divide by n-1 where n is the number of data points
Student’s t Distribution
… has a shape similar to that of Normal distribution
• symmetric
• values are available in tables
• extra parameter in tables - degrees of freedom
[Figure: Student’s t distribution with 3 degrees of freedom]
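A short simulation shows why the Normal fence of 1.96 is too narrow once the standard deviation is estimated from a small sample. This sketch assumes standard Normal data and samples of size 4 (3 degrees of freedom); the helper name `tail_fraction` is hypothetical, and 3.182 is the standard tabulated t value for 3 degrees of freedom and an upper tail area of 2.5%.

```python
import math
import random
import statistics


def tail_fraction(fence, n=4, repeats=10000, seed=4):
    """Fraction of t statistics with |t| > fence for Normal samples of size n."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    beyond = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # square root of the sample variance
        t = (xbar - mu) / (s / math.sqrt(n))
        if abs(t) > fence:
            beyond += 1
    return beyond / repeats


print(f"beyond the Normal fence 1.96: {tail_fraction(1.96):.3f}")
print(f"beyond the t fence 3.182:     {tail_fraction(3.182):.3f}")
```

With only 3 degrees of freedom, far more than 5% of the statistics should fall outside ±1.96, while roughly 5% fall outside ±3.182. This reflects the heavier tails of the t distribution.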
Confidence Intervals for Mean
Variance Unknown
• estimated using sample variance
• 100(1-α)% case
$$\bar{X} - t_{\nu,\alpha/2}\, s_X / \sqrt{n} < \mu_X < \bar{X} + t_{\nu,\alpha/2}\, s_X / \sqrt{n}$$
• $\nu$ is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)
• obtained following identical argument used in the known variance case
Example #1
Conversion in a chemical reactor using new catalyst preparation
• data collected, average conversion computed using 10 data points is 76.1%
• prior operating history indicates that variance of conversion is 4.41 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%
Example #1
• Confidence interval - 95%
• upper tail area is 2.5%
• standard devn = sqrt(4.41) = 2.1
• confidence interval
$$76.1 - (1.96)(2.1)/\sqrt{10} < \mu < 76.1 + (1.96)(2.1)/\sqrt{10} \;\Rightarrow\; 74.8 < \mu < 77.4$$
• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
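The arithmetic in Example #1 can be reproduced in a few lines. This is a sketch using the numbers stated in the example; the variable names are illustrative, and the fence is computed with `statistics.NormalDist` rather than read from a table.

```python
import math
from statistics import NormalDist

xbar = 76.1      # average conversion, %
variance = 4.41  # known variance from operating history, %^2
n = 10           # number of data points
current = 70.0   # current conversion, %

z = NormalDist().inv_cdf(0.975)  # upper tail area of 2.5%
half_width = z * math.sqrt(variance) / math.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width

print(f"{lower:.1f} < mu < {upper:.1f}")
print("current conversion inside interval:", lower <= current <= upper)
```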
Example #2
Conversion in a chemical reactor using new catalyst preparation
• data collected, average conversion computed using 10 data points is 76.1%
• current data set of 10 points used to estimate sample variance, which is 5.3 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%
Example #2
• Confidence interval - 95%
• variance UNKNOWN - need to use Student’s t distribution -- degrees of freedom = 10-1 = 9
• upper tail area is 2.5%
• standard devn = sqrt(5.3) = 2.3
• confidence interval
$$76.1 - (2.262)(2.3)/\sqrt{10} < \mu < 76.1 + (2.262)(2.3)/\sqrt{10} \;\Rightarrow\; 74.5 < \mu < 77.7$$
• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
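Example #2 follows the same arithmetic with the t fence in place of the Normal one. The sketch below takes the tabulated value 2.262 (9 degrees of freedom, upper tail area 2.5%) from the example as a constant, since the Python standard library has no t quantile function; the variable names are illustrative.

```python
import math

xbar = 76.1             # average conversion, %
sample_variance = 5.3   # estimated from the same 10 points, %^2
n = 10                  # number of data points
current = 70.0          # current conversion, %
t_fence = 2.262         # table value: 9 degrees of freedom, 2.5% upper tail

half_width = t_fence * math.sqrt(sample_variance) / math.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width

print(f"{lower:.1f} < mu < {upper:.1f}")
print("current conversion inside interval:", lower <= current <= upper)
```

The interval is wider than in the known-variance case because the t fence (2.262) exceeds the Normal fence (1.96) and the estimated standard deviation is larger.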
Confidence Intervals for Variance
First, we need to know the sampling distribution of the sample variance:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
• when data are Normally distributed, sample variance is the sum of squared Normal random variables
• squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry
Chi-squared distribution
• is the distribution of a squared standard Normal random variable
$$Z^2 \sim \chi^2_1$$
• Chi-squared random variable with 1 degree of freedom
• degrees of freedom = number of independent standard Normal random variables being squared
• e.g., 3 degrees of freedom
$$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$
[Figure: Chi-squared distribution with 3 degrees of freedom]
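The definition can be tried out directly by summing squared standard Normal draws. This sketch is an illustration, not the slides' method: the helper name `chi_squared_draws` is hypothetical, and 7.81 is the standard tabulated value for 3 degrees of freedom with an upper tail area of 5%.

```python
import random


def chi_squared_draws(k, repeats=20000, seed=5):
    """Draw sums of k squared independent standard Normal values."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        for _ in range(repeats)
    ]


draws = chi_squared_draws(3)
mean = sum(draws) / len(draws)
upper_tail = sum(d > 7.81 for d in draws) / len(draws)

print(f"minimum = {min(draws):.3f}, mean = {mean:.3f}")
print(f"fraction above 7.81 = {upper_tail:.3f}")
```

All draws are non-negative, the mean should be close to the degrees of freedom (3), and about 5% of the draws should exceed 7.81, so the distribution is skewed to the right.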
Sampling Distribution - Sample Variance
Sample variance
• is a sum of n squared Normal random variables, BUT the terms being squared are deviations from the sample average
• given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)
• sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1
$$s^2 \sim \frac{\sigma^2}{n-1}\, \chi^2_{n-1}$$
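The loss of one degree of freedom can be checked by simulation, using the fact that a Chi-squared random variable has a mean equal to its degrees of freedom. The sketch below assumes a Normal population with illustrative values (mean 5, $\sigma = 2$) and samples of size 10; the helper name `scaled_sample_variances` is hypothetical.

```python
import random
import statistics


def scaled_sample_variances(n=10, sigma=2.0, repeats=10000, seed=6):
    """Return values of (n - 1) * s^2 / sigma^2 from repeated Normal samples."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        sample = [rng.gauss(5.0, sigma) for _ in range(n)]
        values.append((n - 1) * statistics.variance(sample) / sigma ** 2)
    return values


values = scaled_sample_variances()
print(f"average of (n-1) s^2 / sigma^2 = {statistics.fmean(values):.2f}")
```

With n = 10 the average should be close to 9 rather than 10, consistent with n-1 degrees of freedom.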
Confidence Intervals - Sample Variance
• Form probability statement
$$P\left(\chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2}\right) = 1 - \alpha$$
• Re-arrange statement
$$P\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = 1 - \alpha$$
• 100(1-α)% interval is
$$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$
Confidence Limits for Variance
Notes
1) the tail areas are equal
• tail areas are symmetric; however, the interval can be asymmetric
• consequence of asymmetry of Chi-squared distribution
2) $\chi^2_{n-1,\,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of 1-α/2 and n-1 degrees of freedom
[Figure: Chi-squared distribution with equal tail areas marked]
Variance Confidence Intervals - Example
Temperature controller has been implemented on a polymer reactor -
• variance under previous operation was 4.7 °C²
• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
• is the variance under the new control operation significantly better?
• i.e., is variance under new operation significantly lower?
Variance Confidence Intervals - Example
Use confidence interval for variance
• n-1 = 10-1 = 9 degrees of freedom
• form 95% confidence interval (α = 0.05)
• from tables: $\chi^2_{9,\,1-0.025} = 2.7$ and $\chi^2_{9,\,0.025} = 19.0$
• interval for variance:
$$1.52 < \sigma^2 < 10.67$$
• conclusion - the previous variance of 4.7 lies inside the interval, so the variance reduction isn’t significant after background variation in sample variance computation is taken into account
• note that interval isn’t symmetric
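The interval in this example can be reproduced from the tabulated Chi-squared values. This is a minimal sketch: the function name `variance_interval` and its argument names are hypothetical, and the table values 2.7 and 19.0 are passed in as constants because the Python standard library has no Chi-squared quantile function.

```python
def variance_interval(s2, dof, chi2_lower, chi2_upper):
    """Confidence interval for sigma^2 from a sample variance.

    chi2_lower: Chi-squared value with upper tail area 1 - alpha/2
    chi2_upper: Chi-squared value with upper tail area alpha/2
    """
    # The larger Chi-squared value gives the lower limit, and vice versa.
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower


previous = 4.7  # variance under previous operation
lower, upper = variance_interval(s2=3.2, dof=9, chi2_lower=2.7, chi2_upper=19.0)

print(f"{lower:.2f} < sigma^2 < {upper:.2f}")
print("previous variance inside interval:", lower < previous < upper)
```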
Variance Confidence Intervals - Example
Comment
• variance is sensitive to degrees of freedom
• need larger number of data points to obtain precise estimate
• e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:
$$2.04 < \sigma^2 < 5.71$$
• cf. previous interval with 10 data points
$$1.52 < \sigma^2 < 10.67$$
Conclusion still doesn’t change, however.
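The effect of the degrees of freedom on interval width can be put side by side. In this sketch the 9-degree-of-freedom table values (2.7, 19.0) are the ones quoted in the example; the 30-degree-of-freedom values (16.8, 47.0) are not given in the slides and are assumed here as standard table values rounded to three figures. The helper name `variance_interval` is hypothetical.

```python
def variance_interval(s2, dof, chi2_lower, chi2_upper):
    """Interval for sigma^2: dof*s2/chi2_upper < sigma^2 < dof*s2/chi2_lower."""
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower


s2 = 3.2        # sample variance in both cases
previous = 4.7  # variance under previous operation

# 10 data points: table values quoted in the example
low_9, high_9 = variance_interval(s2, 9, chi2_lower=2.7, chi2_upper=19.0)
# 31 data points: assumed rounded table values for 30 degrees of freedom
low_30, high_30 = variance_interval(s2, 30, chi2_lower=16.8, chi2_upper=47.0)

print(f" 9 dof: {low_9:.2f} < sigma^2 < {high_9:.2f}  width = {high_9 - low_9:.2f}")
print(f"30 dof: {low_30:.2f} < sigma^2 < {high_30:.2f}  width = {high_30 - low_30:.2f}")
print("previous variance still inside:", low_30 < previous < high_30)
```

The interval narrows considerably with more data, yet the previous variance of 4.7 still falls inside it, which is why the conclusion is unchanged.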