Learn about continuous population distributions, central tendency measures, mean versus median, and skewed distributions in statistics.
STA 291 Spring 2010 Lecture 4 Dustin Lueker
Population Distribution • The population distribution for a continuous variable is usually represented by a smooth curve • Like a histogram that gets finer and finer • Similar to the idea of using smaller and smaller rectangles to calculate the area under a curve when learning how to integrate • Symmetric distributions • Bell-shaped • U-shaped • Uniform • Not symmetric distributions: • Left-skewed • Right-skewed • Skewed STA 291 Spring 2010 Lecture 4
Summarizing Data Numerically • Center of the data • Mean • Median • Mode • Dispersion of the data • Sometimes referred to as spread • Variance, Standard deviation • Interquartile range • Range
Measures of Central Tendency • Mean • Arithmetic average • Median • Midpoint of the observations when they are arranged in order • Smallest to largest • Mode • Most frequently occurring value
Sample Mean • Sample size n • Observations x1, x2, …, xn • Sample Mean “x-bar”: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population Mean • Population size N • Observations x1, x2, …, xN • Population Mean “mu”: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ • Note: This is for a finite population of size N
Mean • Requires numerical values • Only appropriate for quantitative data • Does not make sense to compute the mean for nominal variables • Can be calculated for ordinal variables, but this does not always make sense • Should be careful when using the mean on ordinal variables • Example: “Weather” (on an ordinal scale) Sun=1, Partly Cloudy=2, Cloudy=3, Rain=4, Thunderstorm=5 • Mean (average) weather = 2.8 • Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale
Mean • Center of gravity for the data set • Sum of the differences from values above the mean is equal to the sum of the differences from values below the mean
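The balance property can be checked numerically. The sketch below is only an illustration: it borrows the four values {7, 12, 11, 18} from the lecture's mean example, and the variable names are chosen here rather than taken from the slides.

```python
# Illustrative check of the "center of gravity" property of the mean.
data = [7, 12, 11, 18]
mean = sum(data) / len(data)

# Total distance of the values above the mean, and of those below it.
above = sum(x - mean for x in data if x > mean)
below = sum(mean - x for x in data if x < mean)

print(mean, above, below)  # 12.0 6.0 6.0
```

The value 18 sits 6 above the mean, while 7 and 11 sit 5 and 1 below it, so the two sides balance.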
Mean (Average) • Mean • Sum of observations divided by the number of observations • Example • {7, 12, 11, 18} • Mean = (7 + 12 + 11 + 18)/4 = 48/4 = 12
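As a minimal sketch, the definition translates directly into a few lines of Python. The helper name `sample_mean` is an assumption made for this illustration; `statistics.mean` is the standard-library equivalent and is used only as a cross-check.

```python
from statistics import mean


def sample_mean(observations):
    """Sum of the observations divided by the number of observations."""
    if not observations:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(observations) / len(observations)


data = [7, 12, 11, 18]
x_bar = sample_mean(data)    # (7 + 12 + 11 + 18) / 4
print(x_bar)                 # 12.0
print(mean(data) == x_bar)   # True
```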
Mean • Highly influenced by outliers • Data points that are far from the rest of the data • Not representative of a typical observation if the distribution of the data is highly skewed • Example • Monthly income for five people 1,000 2,000 3,000 4,000 100,000 • Average monthly income = 110,000/5 = 22,000 • Not representative of a typical observation
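One way to see how much a single outlier moves the mean is to compute it with and without that observation. This is a small sketch using the five incomes from the slide; dropping the largest value is done purely for illustration, not as a recommended way to handle outliers.

```python
from statistics import mean

incomes = [1000, 2000, 3000, 4000, 100000]

with_outlier = mean(incomes)          # all five people
without_outlier = mean(incomes[:-1])  # the first four people only

print(with_outlier)     # 22000
print(without_outlier)  # 2500
```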
Median • Measurement that falls in the middle of the ordered sample • When the sample size n is odd, there is a middle value • It has the ordered index (n+1)/2 • Ordered index is where that value falls when the sample is listed from smallest to largest • An index of 2 means the second smallest value • Example • 1.7, 4.6, 5.7, 6.1, 8.3 n=5, (n+1)/2=6/2=3, index = 3 Median = 3rd smallest observation = 5.7
Median • When the sample size n is even, average the two middle values • Example • 3, 5, 6, 9 • n=4, (n+1)/2=5/2=2.5, index = 2.5 • Median = midpoint between the 2nd and 3rd smallest observations = (5+6)/2 = 5.5
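The ordered-index rule for odd and even n can be sketched as follows. The function name `median_by_index` is hypothetical and chosen for this example; in practice `statistics.median` from the standard library does the same job.

```python
def median_by_index(values):
    """Median via the ordered index (n + 1) / 2, counting from 1."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    index = (n + 1) / 2  # 1-based position; ends in .5 when n is even
    if n % 2 == 1:
        # Odd n: the index is a whole number, so take that observation.
        return ordered[int(index) - 1]
    # Even n: average the observations on either side of the index.
    lower = ordered[int(index) - 1]
    upper = ordered[int(index)]
    return (lower + upper) / 2


print(median_by_index([1.7, 4.6, 5.7, 6.1, 8.3]))  # 5.7
print(median_by_index([3, 5, 6, 9]))               # 5.5
```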
Mean and Median • For skewed distributions, the median is often a more appropriate measure of central tendency than the mean • The median usually better describes a “typical value” when the sample distribution is highly skewed • Example • Monthly income for five people 1,000 2,000 3,000 4,000 100,000 • Median monthly income: 3,000 • Does this better describe a “typical value” in the data set than the mean of 22,000?
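A short comparison of the two measures on the income example, using the standard-library `statistics` module. This is only an illustration of the point above; the variable names are chosen here.

```python
from statistics import mean, median

incomes = [1000, 2000, 3000, 4000, 100000]

typical_by_mean = mean(incomes)      # pulled upward by the 100,000 value
typical_by_median = median(incomes)  # the 3rd smallest of the five values

print(typical_by_mean)    # 22000
print(typical_by_median)  # 3000
```

Four of the five people earn far less than 22,000, so the median of 3,000 is closer to what most of them earn.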
Measures of Central Tendency • Mean • Arithmetic average • Median • Midpoint of the observations when they are arranged in increasing order • Mode • Most frequent value • Notation: subscripted variables • n = number of units in the sample • N = number of units in the population • x = variable to be measured • xi = measurement of the ith unit
Median for Grouped or Ordinal Data • Example: Highest Degree Completed
Calculate the Median • n = 177,618 • (n+1)/2 = 88,809.5 • Median = midpoint between the 88,809th smallest and 88,810th smallest observations • Both are in the category “High school only” • Mean wouldn’t make sense here since the variable is only ordinal • Median • Can be used for interval data and for ordinal data • Cannot be used for nominal data because the observations cannot be ordered on a scale
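For grouped ordinal data, the median category can be found from cumulative counts. The sketch below assumes hypothetical category labels and counts, because the lecture's frequency table is not reproduced in this transcript; only the label “High school only” comes from the slides, and `median_category` is a name invented for this example.

```python
from bisect import bisect_left
from itertools import accumulate


def median_category(categories, counts):
    """Return the categories holding the two middle observations.

    `categories` must be listed in their natural order and `counts` must
    give the number of observations in each. For odd n both returned
    values refer to the same middle observation.
    """
    if not counts or sum(counts) == 0:
        raise ValueError("no observations")
    cumulative = list(accumulate(counts))
    n = cumulative[-1]
    lower_pos = (n + 1) // 2  # position of the lower middle observation
    upper_pos = (n + 2) // 2  # position of the upper middle observation
    # The first category whose cumulative count reaches the position.
    lower = categories[bisect_left(cumulative, lower_pos)]
    upper = categories[bisect_left(cumulative, upper_pos)]
    return lower, upper


# HYPOTHETICAL labels and counts, for illustration only.
categories = ["Not a high school graduate", "High school only",
              "Some college", "Bachelor's degree", "Advanced degree"]
counts = [30, 60, 35, 40, 12]

print(median_category(categories, counts))
```

With these made-up counts n = 177, the middle position is 89, and the cumulative counts 30 and 90 place it in the second category.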
Mean vs. Median • Mean • Interval data with an approximately symmetric distribution • Median • Interval data • Ordinal data • Mean is sensitive to outliers, median is not
Mean vs. Median • Symmetric distribution • Mean = Median • Skewed distribution • Mean lies farther than the median in the direction in which the distribution is skewed
Median • Disadvantage • Insensitive to changes within the lower or upper half of the data • Example • 1, 2, 3, 4, 5 • 1, 2, 3, 100, 100 • Both samples have median 3, although their upper halves differ greatly • Sometimes, the mean is more informative even when the distribution is skewed
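The two samples on this slide can be compared directly. This sketch uses the standard-library `statistics` module; the variable names are chosen for the illustration.

```python
from statistics import mean, median

sample_a = [1, 2, 3, 4, 5]
sample_b = [1, 2, 3, 100, 100]

# The median ignores what happens above the middle value...
print(median(sample_a), median(sample_b))  # 3 3

# ...while the mean reflects the change in the upper half.
print(mean(sample_a), mean(sample_b))      # 3 41.2
```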
Example • Keeneland Sales
Deviations • The deviation of the ith observation xi from the sample mean is the difference between them, $x_i - \bar{x}$ • Sum of all deviations is zero • Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation
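The claim that the deviations sum to zero follows in one line from the definition of the sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, which gives $\sum_{i=1}^{n} x_i = n\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Positive and negative deviations always cancel, which is why the raw sum cannot measure variation and absolute or squared deviations are used instead.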
Sample Variance • Variance of n observations is the sum of the squared deviations, divided by n-1 • $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
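A minimal sketch of the computation, step by step. The data {7, 12, 11, 18} are reused from the lecture's mean example because the worked example for this slide is not reproduced in the transcript; `sample_variance` is a name chosen for this illustration.

```python
def sample_variance(observations):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = sum(observations) / n
    squared_deviations = [(x - x_bar) ** 2 for x in observations]
    return sum(squared_deviations) / (n - 1)


data = [7, 12, 11, 18]
# Deviations from the mean of 12: -5, 0, -1, 6; their squares sum to 62.
s_squared = sample_variance(data)
print(s_squared)  # 62 / 3, about 20.67
```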
Example
Interpreting Variance • About the average of the squared deviations • “average squared distance from the mean” • Unit • Square of the unit for the original data • Difficult to interpret • Solution • Take the square root of the variance, and the unit is the same as for the original data • Standard Deviation
Properties of Standard Deviation • s ≥ 0 • s = 0 only when all observations are the same • If data is collected for the whole population instead of a sample, then n-1 is replaced by n • s is sensitive to outliers
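Variance and Standard Deviation • Sample • Variance $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$ • Standard Deviation $s = \sqrt{s^2}$ • Population • Variance $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$ • Standard Deviation $\sigma = \sqrt{\sigma^2}$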
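Python's standard-library `statistics` module provides both versions: `variance` and `stdev` divide by n-1, while `pvariance` and `pstdev` divide by the number of values. The sketch below applies all four to the values {7, 12, 11, 18} from the mean example; treating the same four numbers first as a sample and then as a whole population is done only to show the difference in divisor.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [7, 12, 11, 18]  # squared deviations from the mean of 12 sum to 62

sample_var = variance(data)       # 62 / 3
sample_sd = stdev(data)           # square root of 62 / 3
population_var = pvariance(data)  # 62 / 4 = 15.5
population_sd = pstdev(data)      # square root of 15.5

print(sample_var, sample_sd)
print(population_var, population_sd)
```

The sample versions are always at least as large as the population versions on the same numbers, because the divisor is smaller.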
Population Parameters and Sample Statistics • Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma) • They are unknown constants that we would like to estimate • Sample mean and sample standard deviation are denoted by x̄ (x-bar) and s • They are random variables, because their values vary according to the random sample that has been selected
Empirical Rule • If the data is approximately symmetric and bell-shaped then • About 68% of the observations are within one standard deviation from the mean • About 95% of the observations are within two standard deviations from the mean • About 99.7% of the observations are within three standard deviations from the mean
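The three percentages match what an exact normal distribution gives. As a sketch, and assuming the bell shape is exactly normal (an assumption of this illustration; the rule itself only asks for an approximately bell-shaped distribution), the share within k standard deviations of the mean is erf(k/√2). The function name `normal_share_within` is chosen here.

```python
from math import erf, sqrt


def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return erf(k / sqrt(2))


shares = {k: normal_share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares are roughly 68.3%, 95.4% and 99.7%, which the rule rounds to 68%, 95% and 99.7%.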
Example • Scores on a standardized test are scaled so they have a bell-shaped distribution with a mean of 1000 and standard deviation of 150 • About 68% of the scores are between 850 and 1150 • About 95% of the scores are between 700 and 1300 • If you have a score above 1300, you are in the top 2.5% • What percentile would this be? About the 97.5th
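The arithmetic behind this example can be sketched with the empirical rule's rounded percentages. The helper `interval_within` and the variable names are assumptions made for this illustration.

```python
def interval_within(mean, sd, k):
    """Interval reaching k standard deviations on either side of the mean."""
    return mean - k * sd, mean + k * sd


mean_score, sd_score = 1000, 150

about_68 = interval_within(mean_score, sd_score, 1)
about_95 = interval_within(mean_score, sd_score, 2)

# About 95% of scores lie within two standard deviations, so about 5% lie
# outside; by symmetry half of those are above the upper end of 1300.
top_percent = (100 - 95) / 2
percentile = 100 - top_percent

print(about_68)     # (850, 1150)
print(about_95)     # (700, 1300)
print(top_percent)  # 2.5
print(percentile)   # 97.5
```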