1. Descriptive Statistics for Numeric Variables Types of Measures:
measures of location
measures of spread
measures of shape
measures of relative standing
2. What to describe? What is the “location” or “center” of the data? (“measures of location”)
How do the data vary? (“measures of variability”)
3. Measures of Location Mean
Median
Mode
4. Mean Another name for average.
If describing a population, denoted as μ, the Greek letter m, i.e. “mu”. (PARAMETER)
If describing a sample, denoted as x̄, called “x-bar”. (STATISTIC)
Appropriate for describing measurement data.
Seriously affected by unusual values called “outliers”.
5. Calculating Sample Mean
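The worked formula for this slide is not shown here, so as a minimal sketch: the sample mean adds up the observations and divides by how many there are. The cholesterol readings and the name `sample_mean` are made up for illustration and are not taken from the heart attack data.

```python
# Hypothetical cholesterol readings (mg/dL); illustration only, not the course data.
sample = [210, 195, 240, 225, 180]


def sample_mean(values):
    """Return x-bar: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)


x_bar = sample_mean(sample)
print(x_bar)                          # 210.0

# One unusual value pulls the mean noticeably upward.
print(sample_mean(sample + [600]))    # 275.0
```

The jump from 210 to 275 after adding a single reading of 600 is the sensitivity to outliers noted on the Mean slide.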
6. Median Another name for 50th percentile.
Appropriate for describing measurement data.
“Robust to outliers,” that is, not affected much by unusual values.
7. Calculating Sample Median
8. Calculating Sample Median
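A minimal sketch of the usual calculation, assuming the standard rule: sort the data, take the middle value when n is odd, and average the two middle values when n is even. The data and the name `sample_median` are hypothetical.

```python
def sample_median(values):
    """Return the 50th percentile using the middle-value rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle pair


odd_sample = [7, 3, 9, 1, 5]        # sorted: 1 3 5 7 9
even_sample = [7, 3, 9, 1, 5, 11]   # sorted: 1 3 5 7 9 11
with_outlier = [7, 3, 900, 1, 5]    # largest value replaced by an outlier

print(sample_median(odd_sample))    # 5
print(sample_median(even_sample))   # 6.0
print(sample_median(with_outlier))  # 5
```

Replacing 9 with 900 leaves the median at 5, which is what “robust to outliers” means in practice.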
9. Mode The value that occurs most frequently.
One data set can have many modes.
Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.
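As a small sketch of how a data set can have more than one mode, the function below counts each value and returns every value tied for the highest count. The blood-type data and the name `modes` are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most frequently (there may be several)."""
    if not values:
        raise ValueError("need at least one observation")
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


blood_types = ["O", "A", "O", "B", "A", "AB"]   # hypothetical categorical data
print(modes(blood_types))    # ['O', 'A'] : two modes
print(modes([1, 2, 2, 3]))   # [2]
```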
10. In JMP: Heart Attack Data Select Analyze → Distribution (JMP Demo)
11. In JMP: Heart Attack Data
13. Most appropriate measure of location
Depends on whether or not data are “symmetric” or “skewed”.
Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.
14. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)
15. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)
16. Heights of College Students - Symmetric and Bimodal
17. Heights of College Students - Symmetric and Bimodal
18. Heights of College Students - Symmetric and Bimodal
19. Systolic Volume for Heart Attack Patients - Skewed Right
20. Time Until Outcome for Heart Attack Patients - Skewed Left
21. Choosing Appropriate Measure of Location If data are symmetric and unimodal, the mean, median, and mode will be approximately the same.
If data are multimodal, report the mean, median and/or mode for each subgroup.
If data are skewed, report the median.
22. Measures of Variability Range
Interquartile range (IQR)
Variance and standard deviation
Coefficient of variation (CV)
23. Range The difference between largest and smallest data point.
Highly affected by outliers.
Best for symmetric data with no outliers.
24. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)
25. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)
26. Interquartile range The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile). So, the “middle-half” of the values.
IQR = Q3-Q1
Robust to outliers or extreme observations.
Works well for skewed data.
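A rough sketch comparing the range and the IQR on made-up data. Quartile conventions differ between software packages, so the values JMP reports may differ slightly; the `method="inclusive"` choice below is an assumption for illustration.

```python
import statistics


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def iqr(values):
    """Q3 - Q1 using Python's 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]            # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]    # same data, one extreme value

print(data_range(clean), iqr(clean))                  # 8 4.0
print(data_range(with_outlier), iqr(with_outlier))    # 89 4.0
```

The single extreme value changes the range from 8 to 89 while the IQR stays at 4, which is why the IQR suits skewed data or data with outliers.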
27. Systolic Volume for Heart Attack Patients - Skewed Right
28. Variance If measuring variance of population, denoted by σ² (“sigma-squared”).
If measuring variance of sample, denoted by s² (“s-squared”).
Measures average squared deviation of data points from their mean.
Highly affected by outliers. Best for symmetric data.
Problem is units are squared.
29. Formula for the Sample Variance (s²)
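The formula itself is not shown on this slide here. The standard definition, with x̄ the sample mean and n the sample size, is given below; treat it as the usual textbook form rather than a copy of the original slide.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```

Dividing by n - 1 rather than n is the usual convention for a sample, so s² is only approximately the “average” squared deviation.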
30. Standard deviation Sample standard deviation is square root of sample variance, and so is denoted by s.
Units are the original units.
Measures “average” deviation of data points from their mean.
Also, highly affected by outliers.
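A minimal sketch of both calculations, using the usual n - 1 divisor for a sample. The times below are made up and are not the sleep study data; the function names are hypothetical.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance, so it is back in the original units."""
    return math.sqrt(sample_variance(values))


minutes = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical minutes to fall asleep
print(round(sample_variance(minutes), 3))   # in squared minutes
print(round(sample_sd(minutes), 3))         # in minutes
```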
31. Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers
32. Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers
33. Empirical Rule – The standard deviation and the normal distribution For unimodal, moderately symmetrical sets of data, approximately:
68% of observations lie within 1 standard deviation of the mean.
95% of observations lie within 2 standard deviations of the mean.
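The 68% and 95% figures come from the normal distribution. As a quick check, the sketch below computes the exact share of a normal distribution that lies within k standard deviations of its mean; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, sd=1.0):
    """Proportion of a normal distribution within k SDs of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


within_1 = share_within(1)
within_2 = share_within(2)
print(round(within_1, 4))   # about 0.68
print(round(within_2, 4))   # about 0.95
```

Real data are only approximately normal, so observed percentages will wander somewhat around these values.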
34. page 79 of text
35. Some students have difficulty understanding the idea of ‘within one standard deviation of the mean’. Emphasize that this means the interval from one standard deviation below the mean to one standard deviation above the mean.
38. Application of Empirical Rule – Medical Lab Tests When you have blood drawn and it is screened for different chemical levels, any results two standard deviations below or two standard deviations above the mean for healthy individuals will get flagged as being abnormal.
Example: For potassium, healthy individuals have a mean level of 4.4 meq/l with an SD of 0.45 meq/l.
Individuals with levels outside the range
4.4 – 2(0.45) to 4.4 + 2(0.45), that is,
3.5 meq/l to 5.3 meq/l
would be flagged as having abnormal potassium.
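The potassium arithmetic can be sketched as follows. The mean and SD are the ones given in the example; the function names and the two patient values are hypothetical.

```python
def normal_range(mean, sd, k=2):
    """Interval from k SDs below the mean to k SDs above it."""
    return mean - k * sd, mean + k * sd


def is_flagged(value, mean, sd, k=2):
    """True if the value falls outside the mean +/- k SD interval."""
    low, high = normal_range(mean, sd, k)
    return value < low or value > high


low, high = normal_range(4.4, 0.45)
print(round(low, 2), round(high, 2))   # 3.5 5.3

print(is_flagged(5.6, 4.4, 0.45))      # True: above 5.3
print(is_flagged(4.0, 4.4, 0.45))      # False: inside the interval
```

By the empirical rule, roughly 5% of healthy individuals would still be flagged under this scheme.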
39. Coefficient of Variation (CV) Ratio of sample standard deviation to sample mean multiplied by 100.
Measures relative variability, that is, variability relative to the magnitude of the data.
Unitless, so good for comparing variation between two groups and for comparing variability of measurements in completely different scales and/or units.
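A minimal sketch of the CV calculation on two made-up variables measured in different units. The data and the function name are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample SD divided by sample mean, times 100 (a percentage)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


weights_kg = [60, 70, 80]      # hypothetical: mean 70, SD 10
heights_cm = [160, 170, 180]   # hypothetical: mean 170, SD 10

print(round(coefficient_of_variation(weights_kg), 1))   # 14.3
print(round(coefficient_of_variation(heights_cm), 1))   # 5.9
```

Both variables have the same standard deviation (10), yet the weights vary more relative to their size, which is the comparison the CV is designed to make.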
40. Heart Attack Data: Which volume measure has more variation, systolic or diastolic?
42. Choosing Appropriate Measure of Variability If data are symmetric, with no serious outliers, use range and standard deviation.
If data are skewed, and/or have serious outliers, use IQR.
If comparing variation across two variables, use coefficient of variation if the variables are in different units and/or scales. If the scales and units are roughly the same, direct comparison of the standard deviations is fine.
43. Measures of Shape – Skewness and Kurtosis Statistical software packages will give some measure of skewness and kurtosis for a given numeric variable.
Skewness measures departure from symmetry and is usually characterized as being left or right skewed as seen previously.
Kurtosis measures “peakedness” of a distribution and comes in two forms, platykurtosis and leptokurtosis.
44. Skewness Pearson’s Skewness Coefficient
Fisher’s Measure of Skewness has a complicated formula but most software packages compute it.
Fisher’s Skewness > 1.00: moderate right skewness
Fisher’s Skewness > 2.00: severe right skewness
Fisher’s Skewness < -1.00: moderate left skewness
Fisher’s Skewness < -2.00: severe left skewness
45. Skewness
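The slides do not give the formula behind Fisher’s measure. One common definition that several packages report is the adjusted Fisher-Pearson coefficient, and the sketch below assumes that version; check your software’s documentation for the exact formula it uses. The labels follow the cutoffs listed earlier, reading values below -2.00 as severe left skewness. The function names and data are hypothetical.

```python
import math


def adjusted_skewness(values):
    """Adjusted Fisher-Pearson skewness (assumed definition, see lead-in)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


def describe_skewness(g1):
    """Label a skewness value using the cutoffs of 1.00 and 2.00."""
    if g1 > 2:
        return "severe right skewness"
    if g1 > 1:
        return "moderate right skewness"
    if g1 < -2:
        return "severe left skewness"
    if g1 < -1:
        return "moderate left skewness"
    return "not flagged by these cutoffs"


right_tailed = [1, 1, 1, 2, 10]   # hypothetical data with a long right tail
g1 = adjusted_skewness(right_tailed)
print(round(g1, 2), describe_skewness(g1))
```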
46. Kurtosis Measures peakedness of a distribution.
47. Kurtosis
48. Example 2: Kurtosis
49. Transformations to Improve Normality (removing skewness) Many statistical methods require that the numeric variables you are working with have an approximately normal distribution.
In reality, this is often not the case. One of the most common departures from normality is skewness, in particular, right skewness.
50. Because so many transformations are available, we need some way to organize them: Tukey’s ladder.
Upper rungs: squares, cubes, … that is, power > 1.
Lower rungs:
Roots, that is, 0 < power < 1.
Inverses, that is, power < 0.
Why multiply inverse transformations by -1?
Then, pop in the log:
What is a log? Ask them.
The log of a number is the power to which you raise a “base” to obtain the number itself:
log10(100) = 2, because 100 = 10^2. log10(1000) = 3, because 1000 = 10^3, etc.
What’s the log of 10?
What’s the log of 1?
What’s the log of 1/10?
What’s the log of 0?
What are logs to base 2? What are logs to base e?
Generally, the further “up” or “down” the ladder you go, the more dramatic the impact.
But, the question is:
How do you decide whether to go up or down?
How do you decide how far to go?
How do you decide whether to transform the outcome or the predictor?
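For anyone who wants to check the log questions above numerically, here is a small sketch. The helper `log10_or_none` is hypothetical; it returns `None` where the log is undefined instead of raising an error.

```python
import math


def log10_or_none(x):
    """Log base 10 of x, or None where the log is undefined (x <= 0)."""
    if x <= 0:
        return None
    return math.log10(x)


for x in (1000, 100, 10, 1, 0.1, 0):
    print(x, log10_or_none(x))

print(math.log2(8))        # log to base 2: 8 = 2**3
print(math.log(math.e))    # log to base e (the natural log)
```

The log of 0 does not exist because no power of 10 equals 0, so any variable containing zeros or negative values needs attention before a log transformation.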
51. Tukey’s Ladder of Powers To remove right skewness we typically take the square root, cube root, logarithm, or reciprocal of the variable, i.e. V^0.5, V^0.333, V^0 (the logarithm), V^-1, etc.
To remove left skewness we raise the variable to a power greater than 1, such as squaring or cubing the values, i.e. V^2, V^3, etc.
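A rough sketch of moving down the ladder on made-up right-skewed data. It assumes all values are positive, uses the natural log for power 0, and negates negative powers so the ordering of the data is preserved (one common convention, assumed here). The skewness used is the simple moment-based version, and all names are hypothetical.

```python
import math


def ladder(values, power):
    """Apply one rung of Tukey's ladder to positive values."""
    if power == 0:
        return [math.log(v) for v in values]      # the log sits at power 0
    if power < 0:
        return [-(v ** power) for v in values]    # negate to keep the ordering
    return [v ** power for v in values]


def skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]   # hypothetical data

for power in (1, 0.5, 0):
    print(power, round(skewness(ladder(right_skewed, power)), 2))
```

For these data the skewness shrinks at each step from the raw values to the square root to the log, in line with the point that rungs further down the ladder have a more dramatic effect.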
52. Removing Right Skewness