
Descriptive Statistics for Numeric Variables


caroun

Presentation Transcript


    1. Descriptive Statistics for Numeric Variables. Types of measures: measures of location, measures of spread, measures of shape, and measures of relative standing.

    2. What to describe? What is the “location” or “center” of the data? (“measures of location”) How do the data vary? (“measures of variability”)

    3. Measures of Location: mean, median, mode.

    4. Mean Another name for average. If describing a population, denoted as μ, the Greek letter “mu”. (PARAMETER) If describing a sample, denoted as x̄, called “x-bar”. (STATISTIC) Appropriate for describing measurement data. Seriously affected by unusual values called “outliers”.

    5. Calculating Sample Mean
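The formula shown on this slide did not survive in the transcript. As a reminder, and assuming the usual notation of n observations x_1, ..., x_n (symbols chosen here, not taken from the slide), the standard definition of the sample mean is:

```latex
\bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i \;=\; \frac{x_1 + x_2 + \cdots + x_n}{n}
```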

    6. Median Another name for 50th percentile. Appropriate for describing measurement data. “Robust to outliers,” that is, not affected much by unusual values.

    7. Calculating Sample Median

    8. Calculating Sample Median
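The worked examples on the two median slides are not in the transcript. A minimal sketch of the usual procedure (sort, then take the middle value, or the average of the two middle values when n is even) follows. The data and the function name `sample_median` are made up for illustration.

```python
import statistics


def sample_median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 3, 9, 5, 4]        # hypothetical data, n = 5
even_sample = [7, 3, 9, 5, 4, 40]   # same data plus one outlier, n = 6

median_odd = sample_median(odd_sample)    # sorted: 3 4 5 7 9     -> 5
median_even = sample_median(even_sample)  # sorted: 3 4 5 7 9 40  -> (5 + 7) / 2 = 6.0

# The mean is pulled toward the outlier far more than the median is.
mean_odd = statistics.mean(odd_sample)    # 28 / 5 = 5.6
mean_even = statistics.mean(even_sample)  # 68 / 6, roughly 11.3
```

Adding the single value 40 moves the median from 5 to 6 but roughly doubles the mean, which is what “robust to outliers” means in practice.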

    9. Mode The value that occurs most frequently. One data set can have many modes. Appropriate for all types of data, but most useful for categorical data or discrete data with only a small number of possible values.
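A short sketch of finding modes, including a data set with more than one mode. It uses `statistics.multimode` from the Python standard library (Python 3.8+); the data are hypothetical.

```python
from statistics import multimode

blood_types = ["O", "A", "A", "B", "O", "AB", "O"]  # hypothetical categorical data
shoe_sizes = [8, 9, 9, 10, 10, 11]                  # hypothetical discrete data with a tie

modes_blood = multimode(blood_types)  # ['O']: "O" occurs three times
modes_shoes = multimode(shoe_sizes)   # [9, 10]: two values tie, so there are two modes
```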

    10. In JMP: Heart Attack Data Select Analyze → Distribution (JMP Demo)

    11. In JMP: Heart Attack Data

    13. Most appropriate measure of location Depends on whether or not data are “symmetric” or “skewed”. Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

    14. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

    15. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

    16. Heights of College Students - Symmetric and Bimodal

    17. Heights of College Students - Symmetric and Bimodal

    18. Heights of College Students - Symmetric and Bimodal

    19. Systolic Volume for Heart Attack Patients - Skewed Right

    20. Time Until Outcome for Heart Attack Patients - Skewed Left

    21. Choosing Appropriate Measure of Location If data are symmetric and unimodal, the mean, median, and mode will be approximately the same. If data are multimodal, report the mean, median and/or mode for each subgroup. If data are skewed, report the median.

    22. Measures of Variability: range, interquartile range (IQR), variance and standard deviation, coefficient of variation (CV).

    23. Range The difference between largest and smallest data point. Highly affected by outliers. Best for symmetric data with no outliers.

    24. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

    25. Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

    26. Interquartile range The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile), so it is the width of the “middle half” of the values. IQR = Q3 − Q1. Robust to outliers or extreme observations. Works well for skewed data.
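A minimal sketch contrasting the range with the IQR on made-up data. Quartile conventions differ between software packages, so JMP may report slightly different quartiles than the `method="inclusive"` option of `statistics.quantiles` used here; the helper name `range_and_iqr` is hypothetical.

```python
from statistics import quantiles


def range_and_iqr(values):
    """Return (range, IQR) using the 'inclusive' quartile convention."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # largest value replaced by an outlier

range_clean, iqr_clean = range_and_iqr(clean)       # range 8,  IQR 4.0
range_out, iqr_out = range_and_iqr(with_outlier)    # range 89, IQR 4.0
```

The outlier inflates the range from 8 to 89 while the IQR stays at 4, since Q1 and Q3 depend only on the middle of the data.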

    27. Systolic Volume for Heart Attack Patients - Skewed Right

    28. Variance If measuring variance of a population, denoted by σ² (“sigma-squared”). If measuring variance of a sample, denoted by s² (“s-squared”). Measures average squared deviation of data points from their mean. Highly affected by outliers. Best for symmetric data. The drawback is that the units are squared.

    29. Formula for the Sample Variance (s²)
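The formula itself is missing from the transcript. The standard definition, assuming n observations x_1, ..., x_n with sample mean x-bar (notation assumed here), divides by n − 1 rather than n:

```latex
s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```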

    30. Standard deviation Sample standard deviation is square root of sample variance, and so is denoted by s. Units are the original units. Measures “average” deviation of data points from their mean. Also, highly affected by outliers.
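A step-by-step sketch of the sample variance and standard deviation on hypothetical data, using the usual n − 1 divisor and checked against the Python standard library.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
n = len(data)

mean = sum(data) / n                             # 40 / 8 = 5.0
squared_devs = [(x - mean) ** 2 for x in data]   # 9, 1, 1, 1, 0, 0, 4, 16 (sum = 32)
sample_variance = sum(squared_devs) / (n - 1)    # 32 / 7, in squared units
sample_sd = math.sqrt(sample_variance)           # back in the original units

# The standard library uses the same n - 1 definition.
library_variance = statistics.variance(data)
library_sd = statistics.stdev(data)
```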

    31. Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers

    32. Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers

    33. Empirical Rule – The standard deviation and the normal distribution For unimodal, moderately symmetric sets of data, approximately 68% of observations lie within 1 standard deviation of the mean, and approximately 95% of observations lie within 2 standard deviations of the mean.

    34. Page 79 of text.

    35. Some students have difficulty understanding the idea of ‘within one standard deviation of the mean’. Emphasize that this means the interval from one standard deviation below the mean to one standard deviation above the mean.
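One way to make “within k standard deviations of the mean” concrete is to count observations in the interval from mean − k·SD to mean + k·SD. The sketch below does this on simulated, approximately normal data; the mean of 100, SD of 15, sample size and seed are arbitrary assumptions, not values from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable
sample = [random.gauss(100, 15) for _ in range(10_000)]  # simulated data

mean = statistics.mean(sample)
sd = statistics.stdev(sample)


def share_within(k):
    """Fraction of observations between mean - k*sd and mean + k*sd."""
    low, high = mean - k * sd, mean + k * sd
    return sum(low <= x <= high for x in sample) / len(sample)


within_1sd = share_within(1)  # expected to be close to 0.68
within_2sd = share_within(2)  # expected to be close to 0.95
```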

    38. Application of Empirical Rule – Medical Lab Tests When you have blood drawn and it is screened for different chemical levels, any result more than two standard deviations below or above the mean for healthy individuals will get flagged as being abnormal. Example: For potassium, healthy individuals have a mean level of 4.4 meq/l with an SD of 0.45 meq/l. Individuals with levels outside the range 4.4 − 2(0.45) to 4.4 + 2(0.45), that is, 3.5 meq/l to 5.3 meq/l, would be flagged as having abnormal potassium.
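The potassium example can be sketched as a small calculation. The mean (4.4 meq/l) and SD (0.45 meq/l) come from the slide; the function names and the three test values are hypothetical.

```python
def reference_range(mean, sd, k=2):
    """Interval from k standard deviations below to k above the mean."""
    return mean - k * sd, mean + k * sd


def is_flagged(value, mean, sd, k=2):
    """True if the value falls outside the k-SD reference range."""
    low, high = reference_range(mean, sd, k)
    return value < low or value > high


POTASSIUM_MEAN = 4.4  # meq/l, from the slide
POTASSIUM_SD = 0.45   # meq/l, from the slide

low, high = reference_range(POTASSIUM_MEAN, POTASSIUM_SD)  # about 3.5 and 5.3 meq/l

flag_low = is_flagged(3.2, POTASSIUM_MEAN, POTASSIUM_SD)     # below 3.5, flagged
flag_normal = is_flagged(4.8, POTASSIUM_MEAN, POTASSIUM_SD)  # inside the range, not flagged
flag_high = is_flagged(5.6, POTASSIUM_MEAN, POTASSIUM_SD)    # above 5.3, flagged
```

By the empirical rule, roughly 5% of healthy individuals would still fall outside this range and be flagged.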

    39. Coefficient of Variation (CV) Ratio of sample standard deviation to sample mean multiplied by 100. Measures relative variability, that is, variability relative to the magnitude of the data. Unitless, so good for comparing variation between two groups and for comparing variability of measurements in completely different scales and/or units.
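A minimal sketch of the CV as defined above (sample SD divided by sample mean, times 100). The weights and heights are made-up data, chosen so that both variables have the same standard deviation but different means.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


weights_kg = [60, 65, 70, 75, 80]               # hypothetical data
weights_g = [1000 * w for w in weights_kg]      # same weights in grams
heights_cm = [160, 165, 170, 175, 180]          # hypothetical data, same SD as weights_kg

cv_weight_kg = coefficient_of_variation(weights_kg)  # about 11.3 (percent)
cv_weight_g = coefficient_of_variation(weights_g)    # same value: the CV is unitless
cv_height = coefficient_of_variation(heights_cm)     # about 4.7 (percent)
```

Changing units from kilograms to grams leaves the CV unchanged, and although weights and heights here have identical standard deviations, the weights vary more relative to their size.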

    40. Heart Attack Data: Which volume measure has more variation, systolic or diastolic?

    42. Choosing Appropriate Measure of Variability If data are symmetric, with no serious outliers, use range and standard deviation. If data are skewed, and/or have serious outliers, use IQR. If comparing variation across two variables, use coefficient of variation if the variables are in different units and/or scales. If the scales and units are roughly the same, direct comparison of the standard deviations is fine.

    43. Measures of Shape – Skewness and Kurtosis Statistical software packages will give some measure of skewness and kurtosis for a given numeric variable. Skewness measures departure from symmetry and is usually characterized as being left or right skewed as seen previously. Kurtosis measures “peakedness” of a distribution and comes in two forms, platykurtosis and leptokurtosis.

    44. Skewness Pearson’s Skewness Coefficient. Fisher’s Measure of Skewness has a complicated formula, but most software packages compute it. Fisher’s Skewness > 1.00: moderate right skewness; > 2.00: severe right skewness. Fisher’s Skewness < -1.00: moderate left skewness; < -2.00: severe left skewness.
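A sketch of a moment-based skewness coefficient, g1 = m3 / m2^1.5, paired with the cutoffs listed on the slide. This is an assumption about the general form only: software packages usually apply a small-sample adjustment, so their reported values will differ somewhat from this unadjusted version. Function names and data are hypothetical.

```python
def moment_skewness(values):
    """Unadjusted moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def describe_skewness(g1):
    """Apply the slide's cutoffs of 1.00 and 2.00 in either direction."""
    if g1 > 2:
        return "severe right skewness"
    if g1 > 1:
        return "moderate right skewness"
    if g1 < -2:
        return "severe left skewness"
    if g1 < -1:
        return "moderate left skewness"
    return "no marked skewness by these cutoffs"


symmetric = [1, 2, 3, 4, 5]               # hypothetical data
right_skewed = [1, 1, 1, 2, 2, 3, 10]     # long tail to the right
left_skewed = [-x for x in right_skewed]  # mirror image, long tail to the left

g1_symmetric = moment_skewness(symmetric)  # 0.0
g1_right = moment_skewness(right_skewed)   # positive, between 1 and 2
g1_left = moment_skewness(left_skewed)     # same magnitude, negative
```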

    45. Skewness

    46. Kurtosis Measures peakedness of a distribution.
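The slides do not give a formula for kurtosis. One common choice, assumed here, is the moment-based excess kurtosis g2 = m4 / m2² − 3, which is near 0 for normal data, negative for flat (platykurtic) data and positive for peaked, heavy-tailed (leptokurtic) data. Software packages typically report an adjusted version, so values will not match exactly. The data are made up.

```python
def excess_kurtosis(values):
    """Unadjusted moment-based excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # evenly spread values
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]    # most values at the center, two far out

g2_flat = excess_kurtosis(flat)      # negative: platykurtic
g2_peaked = excess_kurtosis(peaked)  # 125 / 25 - 3 = 2.0: leptokurtic
```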

    47. Kurtosis

    48. Example 2: Kurtosis

    49. Transformations to Improve Normality (removing skewness) Many statistical methods require that the numeric variables you are working with have an approximately normal distribution. In reality, this is often not the case. One of the most common departures from normality is skewness, in particular, right skewness.

    50. Because so many transformations are available, we need some way to organize them: Tukey’s ladder. Upper rungs: squares, cubes, and so on, that is, power > 1. Lower rungs: roots, that is, 0 < power < 1; inverses, that is, power < 0. Why multiply inverse transformations by -1? Then, pop in the log. What is a log? Ask them. The log of a number is the power to which you raise a “base” to obtain the number itself: log10(100) = 2 because 100 = 10^2; log10(1000) = 3 because 1000 = 10^3; etc. What’s the log of 10? What’s the log of 1? What’s the log of 1/10? What’s the log of 0? What are logs to base 2? What are logs to base e? Generally, the further “up” or “down” the ladder you go, the more dramatic the impact. But the question is: How do you decide whether to go up or down? How do you decide how far to go? How do you decide whether to transform the outcome or the predictor?

    51. Tukey’s Ladder of Powers To remove right skewness we typically take the square root, cube root, logarithm, or reciprocal of the variable, i.e. V^0.5, V^0.333, V^0 (by convention, the logarithm), V^-1, etc. To remove left skewness we raise the variable to a power greater than 1, such as squaring or cubing the values, i.e. V^2, V^3, etc.
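A minimal sketch of moving down the ladder on right-skewed data. The data, the function names, and the unadjusted moment skewness used to judge the result are all assumptions for illustration; power 0 is treated as the natural log, and negative powers are multiplied by -1 so that the ordering of the values is preserved.

```python
import math


def moment_skewness(values):
    """Unadjusted moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def ladder_transform(values, power):
    """Tukey's ladder for positive data: V**power, log for power 0, -(V**power) for power < 0."""
    if power == 0:
        return [math.log(v) for v in values]
    if power < 0:
        return [-(v ** power) for v in values]  # sign flip keeps larger V -> larger result
    return [v ** power for v in values]


right_skewed = [1, 2, 2, 3, 4, 6, 9, 15, 40]  # hypothetical positive, right-skewed data

skew_raw = moment_skewness(right_skewed)
skew_sqrt = moment_skewness(ladder_transform(right_skewed, 0.5))
skew_log = moment_skewness(ladder_transform(right_skewed, 0))

reciprocal = ladder_transform(right_skewed, -1)  # still in ascending order
```

On this data set the skewness shrinks at each rung (raw, then square root, then log), which illustrates the note that going further down the ladder has a more dramatic effect.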

    52. Removing Right Skewness
