“In God we trust. All others must use data” – W. Edwards Deming
Descriptive Statistics: Introduction to Summary Statistics
Overview • Data types • Summary statistics • Central tendency • Dispersion • Distribution shape • Relative position • Exercises
Data Types • Different types of concepts are represented with different types of data • Level of measurement determines the kinds of statistical analysis that can be performed with the data • Discrete & Continuous • Discrete data can only assume values within some finite or countable set (for example, counts) • Continuous data can take any value within some interval
Data Types
Scale | Definition | Examples | Descriptive Statistics
Nominal | Non-ordered categories | Race, gender, marital status | Percentages, mode
Ordinal | Ordered relation between categories | Attitudes, social class | Percentiles, median
Interval | Ordered relation, equality of differences | Temperature | Range, mean, standard deviation
Ratio | Ordered relation, equality of differences, absolute zero | Elapsed time, costs, number of customers | All of above, coefficient of variation
Summary Statistics • Summarizing a set of data typically involves describing three main attributes • Central tendency • Dispersion • Shape
Central Tendency • Measures of central tendency provide a focal point for making decisions based on the data • Types of measures • Mean (average) • Median • Mode • Trimmed means • Address problems with outliers
Mean • Mean is the arithmetic average of a data set • Data: $x_1, x_2, \dots, x_n$ • Average: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Can be applied to ratio and interval data • Exercise • Calculate averages of data sets in summary stats.xls, sheet mean
Mean • Mean serves as a measure of central tendency since it is the value that balances positive and negative deviations • Example: data 5, 9, 14, 16; average 11 • Deviations from the average: 5 - 11 = -6, 9 - 11 = -2, 14 - 11 = 3, 16 - 11 = 5 • The deviations sum to zero: -6 - 2 + 3 + 5 = 0
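The balancing property can be checked directly. The following is a minimal sketch using the four data points from the slide; the variable names are illustrative only.

```python
# Data from the slide's example.
data = [5, 9, 14, 16]

# Arithmetic mean: sum of the values divided by how many there are.
average = sum(data) / len(data)

# Deviation of each point from the mean; negatives and positives cancel.
deviations = [x - average for x in data]

print(average)          # 11.0
print(deviations)       # [-6.0, -2.0, 3.0, 5.0]
print(sum(deviations))  # 0.0
```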
Mean • The mean is sensitive to outlier values in the data set • Mean can change substantially because of a few very large or small data points • Mean is not a robust estimator of central tendency • Mean is sensitive to data entry errors in data set • Exercise • Vary first data point of data sets in summary stats.xls, sheet mean and note changes in mean values
Mean • Always check integrity of data before calculating statistics • Check reasonableness of maximum and minimum of data set • Exercise • Calculate maximum and minimum of flight time data in summary stats.xls, sheet flt time • Calculate mean with and without anomalous data
Mean and outliers • When possible, plot data in the order that it was collected to help spot outliers and identify possible data collection errors • Plotted example: mean = 170.35; mean without outliers = 150.14
Median • Median is that value such that half the data is less than the median and half is greater • Can be applied to ratio, interval and ordinal data
Median • Median is a more robust measure of central tendency than mean • Exercise • Calculate median of flight time data with and without anomalous data in summary stats.xls, sheet flt time • Median with outliers = 151; median without outliers = 149
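To see the difference in robustness, the sketch below compares mean and median before and after a single bad value is added. The flight times are hypothetical (they are not the data in summary stats.xls), and the outlier stands in for a data entry error.

```python
from statistics import mean, median

# Hypothetical flight times in minutes (NOT the spreadsheet data).
clean = [148, 149, 150, 151, 152]

# Same data plus one anomalous value, e.g. 1500 typed instead of 150.
with_outlier = clean + [1500]

# Without the outlier: mean 150, median 150.
print(mean(clean), median(clean))

# With the outlier: mean jumps to 375, median only moves to 150.5.
print(mean(with_outlier), median(with_outlier))
```

One bad point out of six more than doubles the mean, while the median shifts by half a minute.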
Trimmed Mean • Trimmed mean is the arithmetic mean after excluding the smallest and greatest x% of the data • More robust to outliers than standard mean • Typically eliminate smallest/greatest 5% or 10% • Exercise • Calculate 5% and 10% trimmed mean for flight time data in summary stats.xls, sheet flt time
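A trimmed mean can be sketched in a few lines. The helper name `trimmed_mean`, the choice to round the number of trimmed points down, and the sample data are all assumptions made for illustration; spreadsheet functions and statistics libraries may round differently.

```python
def trimmed_mean(data, percent):
    """Mean after dropping the smallest and largest `percent`% of the values.

    `percent` is a whole number such as 5 or 10. The count removed from
    each end is rounded down (an assumption; other tools may differ).
    """
    ordered = sorted(data)
    n = len(ordered)
    k = int(n * percent // 100)      # points cut from EACH end
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("trimming removed every data point")
    return sum(kept) / len(kept)


# Hypothetical data: the values 1..19 plus one extreme value.
data = list(range(1, 20)) + [1000]

print(trimmed_mean(data, 0))    # 59.5  (ordinary mean, pulled up by 1000)
print(trimmed_mean(data, 5))    # 10.5  (drops 1 and 1000)
print(trimmed_mean(data, 10))   # 10.5  (drops 1, 2, 19 and 1000)
```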
Which Central Tendency Measure? • Use median if ordinal data • If ratio or interval data, can calculate mean and median • Check data integrity • Plot data • If data is analyzed without outliers, report and explain the outliers • Use median or trimmed means if robust measure needed
Which Central Tendency Measure? • Create histogram to check shape of data • Many statistical studies involve studying the difference between population means • So reporting the mean may be dictated by the objective of the study
Which Central Tendency Measure? • If data is unimodal and fairly symmetric • Mean is approximately equal to median • Then mean is a reasonable measure of central tendency
Which Central Tendency Measure? • If data is unimodal and asymmetric • Median is better measure of central tendency • May report both median and mean • Difference between mean and median indicative of asymmetry
Asymmetric Distributions • Median better indicator of central tendency for asymmetric distributions • Life expectancy • U.S. males: mean = 80.1, median = 83 • U.S. females: mean = 84.3, median = 87 • Household income • Mean = $51,855, median = $38,885 • 0.3% account for 12% of income • Net worth • Mean = $282,500, median = $71,600
Which Central Tendency Measure? • If data is not unimodal • Then there is not a central tendency to the data • Neither mean nor median provide good summaries of data set • Analyze data for distinct groups • Identify groups and consider providing summary statistics for each group
Central Tendency and Time Series • Time series data is collected periodically over some time interval • Types of time series • Stationary processes • Data varies around some central value with approximately same variation over time • Nonstationary processes • Data has trend and/or changes in variation over time
Central Tendency and Time Series • Standard mean or median can be used as central tendency for stationary time series • Moving averages can be used to provide a (moving) central tendency value for nonstationary time series • Tends to smooth out random variations in data • Control amount of historical data used in average
Central Tendency and Time Series • Arithmetic moving average • Average of consecutive data points for a specified number of periods
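As a rough illustration, an arithmetic moving average can be computed by sliding a fixed-length window along the series. The function name `moving_average` and the sample series are hypothetical; the window length plays the role of the "specified number of periods".

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive points.

    Returns len(series) - window + 1 values, one per full window.
    """
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical series with an upward trend.
series = [1, 2, 3, 4, 5]

print(moving_average(series, 3))   # [2.0, 3.0, 4.0]
print(moving_average(series, 5))   # [3.0]
```

A longer window uses more history and smooths more, but reacts more slowly to a change in trend.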
Central Tendency and Time Series • Exercise • Calculate moving averages for data in summary stats.xls, sheet time series • Vary length of averaging interval
Lack of Central Tendency • Central tendency measures can be misleading or non-informative if there is not a “central tendency” in the data • Bi or multi-modal • U-shaped distributions • Uniform distributions • Highly skewed • Heavy tails
Limitations of Central Tendency • Any single number summary may not adequately represent data and may hide differences between data sets • Example
Measures of Dispersion • Measures of dispersion provide ways to quantify the amount of variation within a data set • Dispersion measures also provide context to evaluate significance of departures from central tendency • Types of measures • Range • Standard deviation • IQR
Range • Range: max - min
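Standard Deviation • Root mean square difference from the mean • Data: $x_1, x_2, \dots, x_n$ • Calculate mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Standard deviation: $SD = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$ (the sample version divides by $n - 1$ instead of $n$)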
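The "root mean square difference from the mean" can be spelled out step by step. This is a sketch on hypothetical data; it assumes the divisor is n (the literal root mean square), and also shows the n - 1 sample version, since the slide does not say which one the spreadsheet exercise uses.

```python
from math import sqrt
from statistics import pstdev, stdev

# Hypothetical data set.
data = [98, 99, 100, 101, 102]
n = len(data)

# Step 1: mean.
avg = sum(data) / n

# Step 2: squared differences from the mean.
squared_diffs = [(x - avg) ** 2 for x in data]

# Step 3: root of the mean squared difference (divisor n).
rms_sd = sqrt(sum(squared_diffs) / n)

# Sample version: divisor n - 1.
sample_sd = sqrt(sum(squared_diffs) / (n - 1))

# The standard library offers both forms.
print(rms_sd, pstdev(data))     # population form, divisor n
print(sample_sd, stdev(data))   # sample form, divisor n - 1
```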
Standard Deviation • Example [Figure: two example data sets, each with mean m = 100]
Standard Deviation • While the form of the standard deviation is not particularly intuitive, many data sets can be characterized using just the mean and SD • If the values of the data set are distributed in an approximately bell shape, then ~68% of the data will be within 1 SD unit of the mean, ~95% will be within 2 SD units and nearly all will be within 3 SD units
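The 68/95/nearly-all rule can be checked empirically. The sketch below draws simulated bell-shaped data; the mean of 100, SD of 15, sample size and random seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Simulated bell-shaped data (parameters are arbitrary).
data = [random.gauss(100, 15) for _ in range(10_000)]
m, sd = mean(data), pstdev(data)


def share_within(k):
    """Fraction of the data lying within k SD units of the mean."""
    return sum(abs(x - m) <= k * sd for x in data) / len(data)


for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

The printed shares should land near 0.68, 0.95 and just under 1.0. For data that is skewed or heavy-tailed the same check can give noticeably different shares.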
Coefficient of Variation • When comparing relative variation between data sets, often useful to adjust SD to a common scale • Coefficient of variation adjusts scale of SD using the mean: $CV = SD / \text{mean}$
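A short sketch shows why the coefficient of variation is useful for comparing data sets on different scales. The helper name and the weights are hypothetical, and the SD here uses divisor n.

```python
from statistics import mean, pstdev


def coefficient_of_variation(data):
    """SD divided by the mean; meaningful for ratio data with a nonzero mean."""
    m = mean(data)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return pstdev(data) / m


# The same hypothetical weights expressed in kilograms and in grams.
weights_kg = [60, 70, 80]
weights_g = [60_000, 70_000, 80_000]

# SD differs by a factor of 1000, but the CV is unchanged by the unit.
print(pstdev(weights_kg), pstdev(weights_g))
print(coefficient_of_variation(weights_kg), coefficient_of_variation(weights_g))
```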
Coefficient of Variation • Example
Standard Deviation • Exercise • Calculate range and standard deviation for data in summary stats.xls, sheet dispersion • Both range and standard deviation are sensitive to outliers • Exercise • Vary first data point of data sets in summary stats.xls, sheet dispersion and note changes in range and standard deviation
Measures of Dispersion • A robust measure of dispersion is the interquartile range (IQR) • The IQR specifies the range over which the middle 50% of the data is spread • Q1 or 25th percentile: value such that 25% of data less than, and 75% greater than • Q3: value such that 75% less than, and 25% greater than • IQR = Q3 - Q1
IQR • Example • Data with outlier: 1, 98, 99, 100, 100, 100, 102, 102, 104 • Data without outlier: 95, 98, 99, 100, 100, 100, 102, 102, 104 • For both data sets Q1 = 98.5 and Q3 = 102, so IQR = 102 – 98.5 = 3.5 • Like the median the IQR is less sensitive to outliers since it is based on relative ranking of data points as opposed to their actual values
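The slide's numbers can be reproduced with the standard library. Quartile conventions vary between tools; this sketch assumes the "exclusive" method of `statistics.quantiles`, which happens to match the slide's Q1 = 98.5 and Q3 = 102. Other methods can give slightly different quartiles for the same data.

```python
from statistics import quantiles


def iqr(data):
    """Interquartile range Q3 - Q1, using the 'exclusive' quartile method."""
    q1, _median, q3 = quantiles(data, n=4, method="exclusive")
    return q3 - q1


# The two data sets from the slide: they differ only in the first value.
with_outlier = [1, 98, 99, 100, 100, 100, 102, 102, 104]
without_outlier = [95, 98, 99, 100, 100, 100, 102, 102, 104]

print(iqr(with_outlier), iqr(without_outlier))    # 3.5 3.5

# The range, by contrast, changes sharply: 103 versus 9.
print(max(with_outlier) - min(with_outlier),
      max(without_outlier) - min(without_outlier))
```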
IQR • Exercise • Calculate IQR for data in summary stats.xls, sheet dispersion • Vary first data point of data sets in summary stats.xls, sheet dispersion and note change (or lack of) in IQR
Dispersion • The more spread out or dispersed the data, the larger the range, SD and IQR • The more concentrated or homogeneous the data, the smaller the range, SD, and IQR • If all the data elements are the same, then the dispersion will be 0 • Note neither range, SD nor IQR can be negative
Grouped Data • Often summary measures are given for groups of data • Then statistics are needed for the data aggregated together • Means • SD’s • Frequencies
Grouped Data • Aggregate mean • Aggregate SD
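The formulas on this slide did not survive extraction. One standard way to combine group summaries, assuming each group SD was computed with divisor $n_k$, is the size-weighted mean $\bar{x} = \frac{\sum_k n_k \bar{x}_k}{\sum_k n_k}$ together with $SD = \sqrt{\frac{\sum_k n_k \left(s_k^2 + (\bar{x}_k - \bar{x})^2\right)}{\sum_k n_k}}$. The sketch below implements that assumption and checks it against the pooled raw data; the function name and the two small groups are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def combine_groups(groups):
    """Aggregate mean and SD from (n, mean, sd) summaries.

    Assumes each group's sd uses divisor n (population form).
    """
    total_n = sum(n for n, _, _ in groups)
    agg_mean = sum(n * m for n, m, _ in groups) / total_n
    # Within-group variance plus the spread of group means around agg_mean.
    agg_var = sum(
        n * (sd ** 2 + (m - agg_mean) ** 2) for n, m, sd in groups
    ) / total_n
    return agg_mean, sqrt(agg_var)


# Two small hypothetical groups, summarized and then recombined.
group_a = [1, 2, 3]
group_b = [4, 5, 6, 7]
summaries = [(len(g), mean(g), pstdev(g)) for g in (group_a, group_b)]
agg_mean, agg_sd = combine_groups(summaries)

# Both lines should agree with the statistics of the pooled raw data.
print(agg_mean, mean(group_a + group_b))
print(agg_sd, pstdev(group_a + group_b))
```

Note that the aggregate SD is not a weighted average of the group SDs: the distance between the group means also contributes.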
Grouped Data • Exercise • Suppose average salary of group of 50 employees is $65K with SD of $2K, and average salary of a second group of 30 employees is $85K with SD of $4K • Find mean and SD of salary for entire group of 80 employees
Relative Position • An important aspect of data analysis is examining the relative position of individual data points within the entire data set • Standard units • Percentile
Standard Units and Z-scores • Translating a data point into standard units indicates the position of the data relative to the mean with respect to standard deviation units • The z-score of a data point $x$ is given by $z = \frac{x - \text{mean}}{\text{SD}}$
Z-scores • A z-score greater than 0 indicates the data point is greater than the mean • A z-score less than 0 indicates the data point is less than the mean • A z-score equal to 0 indicates the data point is equal to the mean • A z-score between –1 and 1 indicates that the data point is a fairly typical value • A z-score greater than ~ 2 or less than ~ –2 indicates an atypical value
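These rules of thumb translate into a short check. The sketch below assumes the usual definition, z = (x - mean) / SD, with SD computed using divisor n; the function name, data and the cutoff of 2 are illustrative.

```python
from statistics import mean, pstdev


def z_scores(data):
    """Each data point expressed in SD units above or below the mean."""
    m = mean(data)
    sd = pstdev(data)
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - m) / sd for x in data]


# Hypothetical data with one value far from the rest.
data = [98, 99, 100, 101, 102, 130]
scores = z_scores(data)

# Flag points more than about 2 SD units from the mean.
atypical = [x for x, z in zip(data, scores) if abs(z) > 2]

for x, z in zip(data, scores):
    print(x, round(z, 2))
print(atypical)   # [130]
```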
Typical Z-Scores