Learn about various descriptive measures used in statistics such as mean, median, mode, range, and more. Explore how these measures provide valuable insights into the data distribution.
Data Summary Using Descriptive Measures Sections 3.1 – 3.6, 3.8 Based on Introduction to Business Statistics Kvanli / Pavur / Keeling
|»| Summary of Descriptive Measures • DESCRIPTIVE MEASURE: A single number computed from the sample data that provides information about the data. An example of such a measure is the mean, which is the average of all the observations in a sample or a population.
• Measures of Central Tendency: Determine the center of the data values or possibly the most typical value. Mean: the average of the data values. Median: the value in the center of the ordered data values. Mode: the value that occurs more than once and the most often. Midrange: the average of the highest and the lowest values.
• Measures of Variation: Determine the spread of the data. Range: Range = H - L. Variance: the average of the squared differences between the individual values and the mean. Standard Deviation: the positive square root of the variance. Coefficient of Variation: the standard deviation in terms of the mean.
• Measures of Position: Indicate how a particular data point fits in with all the other data points. Percentile: P% of the values fall below the P-th percentile and (100 - P)% fall above it. Quartiles: the 25th, 50th and 75th percentiles. Z-Score: expresses the number of standard deviations the value x is from the mean.
• Measures of Shape: Indicate how the data points are distributed. Skewness: the tendency of a distribution to stretch out in a particular direction. Kurtosis: a measure of the peakedness of a distribution.
|»| The Mean • The mean represents the average of the data and is computed by dividing the sum of the data points by the number of the data points. • It is the most popular measure of central tendency. • We can easily compute and explain the mean. • We have two types of mean depending on whether the data set includes all items of a population or a subset of items of a population – Sample Mean and Population Mean.
|»| Sample Mean • It is the sum of the data values in a sample divided by the number of data values in that sample. • We use $\bar{x}$ (x-bar) to denote the sample mean, and n to denote the number of data values in a sample. • Therefore, for ungrouped data, we obtain $\bar{x} = \frac{\sum x}{n}$. • Example 3.1 (Accident Data): The following sample represents the number of accidents (monthly) over 11 months: 18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11. Compute the mean number of monthly accidents, i.e., compute the sample mean.
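Example 3.1 can be checked with a few lines of Python. This is only a minimal sketch of the definition (sum of the values divided by their count); the variable names are illustrative and not part of the slides.

```python
# Monthly accident counts from Example 3.1
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

n = len(accidents)          # number of data values in the sample
total = sum(accidents)      # sum of the data values
sample_mean = total / n     # x-bar = (sum of x) / n

print(f"n = {n}, sum = {total}, x-bar = {sample_mean:.2f}")
# prints: n = 11, sum = 160, x-bar = 14.55
```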
|»| Sample Mean (cont.) • Example 3.2: The mean of a sample with 5 observations is 20. If the sum of four of the observations is 75, what is the value of the fifth observation?
|»| Population Mean • It is the sum of the data values in a population divided by the number of data values in that population. • We use μ to denote the population mean, and N to denote the number of data values in a population. • Therefore, we obtain $\mu = \frac{\sum x}{N}$.
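Example 3.2 follows from rearranging the definition of the mean: the sum of all observations equals n times the mean. A short sketch of that arithmetic, with illustrative variable names:

```python
n = 5               # number of observations in the sample
sample_mean = 20    # given mean of the sample
known_sum = 75      # given sum of four of the observations

total = n * sample_mean      # sum of all five observations: 100
fifth = total - known_sum    # the remaining observation
print(fifth)                 # prints: 25
```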
|»| The Median, Md • The Median (Md) of a set of data is the value in the center of the data values when they are arranged from lowest to highest. • An equal number of data values lie to its left and to its right. • The median is preferred to the mean as a measure of central tendency for data sets with outliers. • Calculating the median from a sample involves the following steps: (1) Arrange the data values in ascending order. (2) Find the position of the median. The median position is the $\frac{n+1}{2}$-th ordered value. (3) Find the median value.
|»| The Median (cont.) • Example 3.3: Compute the median for the accident data given in Example 3.1. • Ascending order: 10, 11, 12, 13, 15, 15, 15, 16, 17, 18, 18. • n = 11, Median Position = (11+1)/2 = 6th ordered value. • Md = 15.
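The three steps for the median can be sketched as a small Python function. For an even number of values it averages the two middle values, which is the usual convention and matches how the median of the 50 aptitude scores is computed later in these slides; the function name is illustrative.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    mid = n // 2                      # step 2: locate the center
    if n % 2 == 1:
        return ordered[mid]           # the (n + 1)/2-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
md = median(accidents)                # step 3: read off the value
print(md)                             # prints: 15
```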
|»| The Mode, Mo • The Mode (Mo) of a data set is the value that occurs more than once and the most often. • The mode is not always a measure of central tendency; this value need not occur in the center of the data. • There may be more than one mode if several values occur the same (and the largest) number of times. • The mode is used extensively in areas such as the manufacturing of clothing, shoes, etc. • Example 3.4: Find the mode for the accident data given in Example 3.1. • The value 15 appears most often (three times), so Mo = 15.
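A minimal sketch of the mode as defined here: the most frequent value, provided it occurs more than once, with ties returned together. The helper name `modes` is an illustrative choice, and returning an empty list when nothing repeats is one possible way to encode "no mode".

```python
from collections import Counter

def modes(values):
    """Values that occur most often, provided they occur more than once."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top < 2:
        return []                     # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
print(modes(accidents))               # prints: [15]
```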
|»| The Midrange, Mr • It is the average of the highest and the lowest values of a data set. • Midrange provides an easy-to-grasp measure of central tendency. • If we use H to denote the highest value and L to denote the lowest value of a data set, we obtain $Mr = \frac{L + H}{2}$. • Example 3.5: Find the midrange for the accident data given in Example 3.1. • L = 10, H = 18, so Mr = (L + H)/2 = (10 + 18)/2 = 14.
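As a quick check of Example 3.5, the midrange of the accident data can be computed directly; (10 + 18)/2 evaluates to 14. The variable names are illustrative.

```python
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

low, high = min(accidents), max(accidents)   # L and H
midrange = (low + high) / 2                  # Mr = (L + H) / 2
print(low, high, midrange)                   # prints: 10 18 14.0
```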
|»| The Range, R • The range is the numerical difference between the largest value (H) and the smallest value (L). That is, Range = H – L. • Example 3.6: The range for the accident data given in Example 3.1 is H – L = 18 – 10 = 8. • The range is a crude measure of variation, but it is easy to calculate and contains valuable information in some situations. • For instance, stock reports cite the high and low prices of the day. • Similarly, weather forecasts use daily high and low temperatures. • The range is strongly influenced by outliers.
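The sensitivity of the range to outliers can be illustrated with a small sketch. The extra value 40 below is a made-up observation added purely for illustration; it is not part of the accident data.

```python
def data_range(values):
    """Range = H - L."""
    return max(values) - min(values)

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
r = data_range(accidents)                  # 18 - 10

with_outlier = accidents + [40]            # hypothetical outlier, for illustration only
r_outlier = data_range(with_outlier)       # 40 - 10

print(r, r_outlier)                        # prints: 8 30
```

A single extreme value more than triples the range here, while the other eleven observations are unchanged.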
|»| The Variance • Variance describes the spread of the data values about the mean. • It is based on the average of the squared differences between the individual values and the mean. • The two types of variance are (1) sample variance, and (2) population variance.
|»| Sample Variance, S² • S² describes the variation of the sample values about the sample mean. • It is the sum of the squared differences between the individual values and the sample mean, divided by n − 1 (rather than n). • That is, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}$.
|»| Sample Variance - Example • Example 3.7: Calculate the sample variance for the accident data.
|»| Sample Variance - Examples • Example 3.8: From a sample of 50 data values, the statistics ∑x and ∑x² are calculated to be 20 and 33, respectively. Compute the sample variance. • Example 3.9: The differences between the data values and the sample mean are -5, 1, -3, 2, 3, and 2. What is the variance of the data?
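Examples 3.7 to 3.9 can be worked with the sketch below. It assumes the usual n − 1 divisor for a sample variance (the slide's own formula did not survive extraction), and the function names are illustrative.

```python
def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_from_sums(n, sum_x, sum_x2):
    """Shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Example 3.7: accident data from Example 3.1
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
var_37 = sample_variance(accidents)

# Example 3.8: n = 50, sum of x = 20, sum of x^2 = 33
var_38 = sample_variance_from_sums(50, 20, 33)

# Example 3.9: the deviations from the mean are given directly
deviations = [-5, 1, -3, 2, 3, 2]
var_39 = sum(d ** 2 for d in deviations) / (len(deviations) - 1)

print(f"{var_37:.2f} {var_38:.2f} {var_39:.2f}")   # prints: 7.47 0.51 10.40
```

Both functions give the same result on raw data; the shortcut form is convenient when only ∑x and ∑x² are reported, as in Example 3.8.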
|»| Population Variance, σ² • σ² describes the variation of the population values about the population mean. • It is the average of the squared differences between the individual values and the population mean. • That is, $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$.
|»| The Standard Deviation • Standard deviation is the positive square root of the variance. • The positive square root of the sample variance is the sample standard deviation, denoted by S. • The positive square root of the population variance is the population standard deviation, denoted by σ.
|»| Standard Deviation • Example 3.10: Find the sample standard deviations for Examples 3.7, 3.8, and 3.9. • From Example 3.7: $s = \sqrt{7.47} \approx 2.73$. • From Example 3.8: $s = \sqrt{25/49} = 5/7 \approx 0.71$. • From Example 3.9: $s = \sqrt{10.4} \approx 3.22$.
|»| Coefficient of Variation, CV • Measures the standard deviation in terms of the mean, i.e., what percentage of $\bar{x}$ is s? • That is, $CV = \frac{s}{\bar{x}} \cdot 100$. • The Coefficient of Variation (CV) is used to compare the variation of two or more data sets where the values of the data differ greatly. • Example 3.11: The scores for team 1 were 70, 60, 65, and 69. The scores for team 2 were 72, 58, 61, and 73. Compare the coefficients of variation for these two teams. • For team 1: $\bar{x} = 66$, $s \approx 4.55$, CV ≈ 6.89. • For team 2: $\bar{x} = 66$, $s \approx 7.62$, CV ≈ 11.54. Both teams have the same mean, but team 2's scores vary more relative to it.
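A short sketch for Example 3.11, using Python's standard `statistics` module, whose `stdev` uses the n − 1 divisor of a sample standard deviation. Expressing CV as a percentage (multiplying by 100) follows the "what percentage of x-bar is s" wording; the function name is illustrative.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

team1 = [70, 60, 65, 69]
team2 = [72, 58, 61, 73]

cv1 = coefficient_of_variation(team1)
cv2 = coefficient_of_variation(team2)
print(f"team 1: {cv1:.2f}%, team 2: {cv2:.2f}%")
# prints: team 1: 6.89%, team 2: 11.54%
```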
|»| Percentile • The P-th percentile is a number such that P% of the measurements fall below the P-th percentile and (100-P)% fall above it. • It is the most common measure of position. • How to calculate a percentile: (1) Arrange the data in ascending order. (2) Find the location of the P-th percentile, n P/100. (3) Find the percentile using the following rules: • Location Rule 1: If n P/100 is not a counting number, round it up, and the P-th percentile will be the value in this position of the ordered data. • Location Rule 2: If n P/100 is a counting number, the P-th percentile is the average of the value in this location (of the ordered data) and the value in the next location.
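The two location rules can be sketched as a function. This is an illustrative implementation that assumes an integer P with 0 < P < 100; it uses integer arithmetic to decide whether n P/100 is a counting number, which avoids floating-point rounding surprises. It is demonstrated on the accident data of Example 3.1 because the aptitude data are not reproduced here.

```python
def percentile(values, p):
    """P-th percentile (integer p, 0 < p < 100) using the two location rules."""
    ordered = sorted(values)
    n = len(ordered)
    if (n * p) % 100 != 0:
        # Rule 1: n*p/100 is not a counting number, so round it up
        position = (n * p) // 100 + 1
        return ordered[position - 1]          # positions are 1-based
    # Rule 2: average the value at this location and at the next one
    position = (n * p) // 100
    return (ordered[position - 1] + ordered[position]) / 2

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
p35 = percentile(accidents, 35)   # 11 * 35 / 100 = 3.85, rounded up to the 4th value
print(p35)                        # prints: 13
```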
|»| Percentile - Example • Example 3.12: Find the 35th percentile from the following aptitude data (Aptitude Data). • Number of data values, n = 50 • 35th Percentile = P35. So, n P/100 = 50(35)/100 = 17.5. • 17.5 is NOT a counting number. So, using Location Rule 1, P35 = 18th value = 53.
|»| Quartiles and Interquartile Range • Quartiles are merely particular percentiles that divide the data into quarters, namely: • Q1 = 1st quartile = 25th percentile (P25) • Q2 = 2nd quartile = 50th percentile (P50) = Median • Q3 = 3rd quartile = 75th percentile (P75) • Example 3.13: Determine the quartiles for the aptitude data. • Q1 = 13th ordered value = 46 • Q2 = Median = (61+63)/2 = 62 • Q3 = 38th ordered value = 75 • Interquartile Range (IQR) • The range for the middle 50% of the data • IQR = Q3 – Q1. For aptitude data: IQR = 75 – 46 = 29.
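Because quartiles are just the 25th, 50th and 75th percentiles, they can be sketched with the same location rules. The example below uses the accident data of Example 3.1 rather than the aptitude data, which are not reproduced here; the function names are illustrative.

```python
def percentile(values, p):
    """P-th percentile (integer p, 0 < p < 100) using the two location rules."""
    ordered = sorted(values)
    n = len(ordered)
    if (n * p) % 100 != 0:
        return ordered[(n * p) // 100]        # rounded-up position, 1-based
    position = (n * p) // 100
    return (ordered[position - 1] + ordered[position]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) as the 25th, 50th and 75th percentiles."""
    return tuple(percentile(values, p) for p in (25, 50, 75))

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
q1, q2, q3 = quartiles(accidents)
iqr = q3 - q1                                 # range of the middle 50% of the data
print(q1, q2, q3, iqr)                        # prints: 12 15 17 5
```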
|»| Z-Scores • The Z-score determines the relative position of any particular data value x and is based on the mean and standard deviation of the data set. • The Z-score expresses the number of standard deviations the value x is from the mean: $z = \frac{x - \bar{x}}{s}$. • A negative Z-score implies that x is to the left of the mean and a positive Z-score implies that x is to the right of the mean. • Example 3.14: Find the z-score for an aptitude test score of 83.
|»| Standardizing Sample Data • The process of subtracting the mean and dividing by the standard deviation is referred to as standardizing the sample data. • The corresponding z-score is the standardized score.
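Example 3.14 can be sketched as follows, taking the aptitude-data mean (60.36) and standard deviation (18.61) from the values that Example 3.15 uses for the same data. The function name is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Aptitude score of 83, with mean 60.36 and standard deviation 18.61
z = z_score(83, 60.36, 18.61)
print(f"z = {z:.2f}")     # prints: z = 1.22
```

The positive value indicates that a score of 83 lies to the right of the mean, a little over one standard deviation above it.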
|»| Skewness, Sk • Skewness measures the tendency of a distribution to stretch out in a particular direction. • Pearson's coefficient of skewness is used to calculate skewness: $Sk = \frac{3(\bar{x} - Md)}{s}$. • Example 3.15: Find the skewness for the aptitude data. • Sk = 3(60.36 – 62)/18.61 = 3(-1.64)/18.61 = -4.92/18.61 = -0.26 • The values of Sk will always fall between -3 and 3. • A positive Sk value implies a shape that is skewed right, with mode < median < mean. • In a data set with a negative Sk value, mean < median < mode.
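The arithmetic in Example 3.15 can be reproduced with a one-line function. This sketch implements the Pearson coefficient in the form used by the example, 3(mean − median)/s; the function name is illustrative.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev

# Aptitude data: mean 60.36, median 62, standard deviation 18.61
sk = pearson_skewness(60.36, 62, 18.61)
print(f"Sk = {sk:.2f}")   # prints: Sk = -0.26
```

The negative sign reflects a mean that is smaller than the median, i.e. a mild left skew.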
|»| Skewness, Sk – In Graphs • Histogram of Symmetric Data (figure: frequency histogram in which $\bar{x}$ = Md = Mo)
|»| Skewness, Sk – In Graphs • Histogram with Right (Positive) Skew (figure: relative frequency histogram with Sk > 0; from left to right: Mode (Mo), Median (Md), Mean ($\bar{x}$))
|»| Skewness, Sk – In Graphs • Histogram with Left (Negative) Skew (figure: relative frequency histogram with Sk < 0; from left to right: Mean ($\bar{x}$), Median (Md), Mode (Mo))
|»| Kurtosis • Kurtosis is a measure of the peakedness of a distribution. • Large values occur when there is a high frequency of data near the mean and in the tails. • The calculation is cumbersome and the measure is used infrequently.
|»| Interpreting X-bar and S • How many, or what percentage, of the data values are within two standard deviations of the mean? • There are usually three ways to answer this: • Actual percentage based on the sample • Chebyshev’s Inequality • Empirical Rule
|»| Interpreting X-bar and S (cont.) • According to Chebyshev, in general, at least $1 - \frac{1}{k^2}$ of the data values lie between $\bar{x} - ks$ and $\bar{x} + ks$ (have z-scores between –k and k) for any k > 1. • Chebyshev’s Inequality is usually conservative but makes no assumption about the distribution of the population. • The Empirical Rule assumes a bell-shaped distribution of the population, i.e., a normal population. • Comparison of the three approaches:
• Between $\bar{x} - s$ and $\bar{x} + s$: actual percentage (Aptitude Data) 66% (33 out of 50); Chebyshev’s Inequality gives no bound (k must exceed 1); Empirical Rule ≈ 68%
• Between $\bar{x} - 2s$ and $\bar{x} + 2s$: actual percentage (Aptitude Data) 98% (49 out of 50); Chebyshev’s Inequality ≥ 75%; Empirical Rule ≈ 95%
• Between $\bar{x} - 3s$ and $\bar{x} + 3s$: actual percentage (Aptitude Data) 100% (50 out of 50); Chebyshev’s Inequality ≥ 89%; Empirical Rule ≈ 100%
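Chebyshev's lower bound, 1 − 1/k², and the actual share of a sample within k standard deviations of the mean can be compared with the sketch below. It is applied to the accident data of Example 3.1 because the aptitude data are not reproduced here; the function names are illustrative, and `statistics.stdev` uses the n − 1 divisor.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Actual fraction of the sample between x-bar - k*s and x-bar + k*s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    inside = [x for x in values if mean - k * s <= x <= mean + k * s]
    return len(inside) / len(values)

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    actual = fraction_within(accidents, k)
    print(f"k = {k}: at least {bound:.0%}, actual {actual:.0%}")
```

For this small sample every value lies within two standard deviations of the mean, so the actual share comfortably exceeds the 75% guaranteed by Chebyshev.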
|»| Bivariate Data • Data collected on two variables for each item. • Example 3.16: Data for 10 families on income (thousands of dollars) and square footage of home (hundreds of square feet) (Income-Footage Data).
|»| Scatter Diagram • Graphical illustration of bivariate data. • Each observation is represented by a point, where the X-axis is always horizontal and the Y-axis is vertical. (figure: scatter diagrams, panels (a) and (b), with X = Income (thousands) and Y = Square footage (hundreds))
|»| Coefficient of Correlation, r • Measures the strength of the linear relationship between the X variable and the Y variable. • r ranges from -1 to 1. • The larger |r| is, the stronger the linear relationship between X and Y. • If r = 1 or r = -1, X and Y are perfectly correlated. • If r > 0, X and Y have a positive relationship (i.e., large values of X are associated with large values of Y). • If r < 0, X and Y have a negative relationship (i.e., large values of X are associated with small values of Y).
|»| Coefficient of Correlation – Example • Example 3.17: Calculate r for Income-Footage Data. • Column totals from the worksheet: ∑x = 498, ∑y = 226, ∑xy = 11782, ∑x² = 26290, ∑y² = 5370.
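Example 3.17 can be sketched from the five totals alone using the standard computational form of the correlation coefficient. Two assumptions are made here: n = 10 families (from Example 3.16), and the totals are read as ∑x = 498, ∑y = 226, ∑xy = 11782, ∑x² = 26290 and ∑y² = 5370, a labeling inferred from their magnitudes because the worksheet itself did not survive extraction.

```python
import math

n = 10                                   # families in the Income-Footage Data
sum_x, sum_y = 498, 226                  # assumed: totals of income and square footage
sum_xy, sum_x2, sum_y2 = 11782, 26290, 5370   # assumed: totals of xy, x^2, y^2

# Corrected sums of cross-products and squares
sc_xy = sum_xy - sum_x * sum_y / n       # 527.2
ss_x = sum_x2 - sum_x ** 2 / n           # 1489.6
ss_y = sum_y2 - sum_y ** 2 / n           # 262.4

r = sc_xy / math.sqrt(ss_x * ss_y)
print(f"r = {r:.3f}")                    # prints: r = 0.843
```

Under these assumptions r is positive and fairly close to 1, which would indicate a strong positive linear relationship between income and square footage.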
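|»| Coefficient of Correlation, r – In Graphs • (figure: scatter diagrams of y against x, panels (a) to (d), illustrating r = 1, r = 0, r = .9, and r = -1)
|»| Coefficient of Correlation, r – In Graphs • (figure: scatter diagrams of y against x, panels (e) and (f), illustrating r = .5 and r = -.8)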