Understanding Data Measures and Transformation Methods

Learn about numerical data description, sample median computation, measures of spread, boxplots, standard deviation calculation, and transformation techniques to analyze and interpret data effectively.

mgambino

Presentation Transcript


  1. Set 3: Numerical description of data, data transformation

  2. Measures based on rank of data • Five-number summary • Minimum = the smallest data value • First quartile = Q1, 25% below, 75% above • Median = M, the middle value, 50% below, 50% above • Third quartile = Q3, 75% below, 25% above • Maximum = the largest data value • Measures of spread • Range = Max - Min • Interquartile range = range of the middle 50%, IQR = Q3 - Q1

  3. Computation of sample median • Order the data values in increasing order • n odd: M = the middle value • n even: M = the average of the two middle values • Data: 54, 9, 37, 15, 52, 40, 54, 128, 1 • Ordered data: 1, 9, 15, 37, 40, 52, 54, 54, 128 (ranks 1 to 9) • n = 9, an odd number • M = 40, the 5th ordered value
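
The ordering rule above can be sketched in a few lines of Python. This is only an illustration; the function name `sample_median` is a hypothetical helper, not something taken from the slides or from MINITAB.

```python
def sample_median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)          # step 1: order the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # n odd: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: average the two middle values

data = [54, 9, 37, 15, 52, 40, 54, 128, 1]
print(sample_median(data))  # 40
```

With one more observation (for example the value 3 used on the next slide), n becomes even and the function averages 37 and 40.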

  4. A simple example • Data: 54, 9, 37, 15, 52, 40, 54, 128, 1, 3 • Ordered data: 1, 3, 9, 15, 37, 40, 52, 54, 54, 128 • Min = 1, Max = 128 • Range = 128 - 1 = 127 • n = 10, M = average of 37 and 40 = 38.5 • Q1 = 9, approximately the median of the data below M • Q3 = 54, approximately the median of the data above M • IQR = 54 - 9 = 45, 1.5 IQR = 67.5 • Outlier: 128 > 54 + 67.5 = 121.5
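
A minimal Python sketch of this hand computation follows. It assumes the "median of each half" rule for the quartiles, which is what the slide describes; statistical packages often interpolate instead and can report slightly different quartiles. The helper name `five_number_summary` is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, Max, with quartiles taken as medians of each half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # values below the median position
    upper = ordered[half + n % 2:]      # values above the median position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data = [54, 9, 37, 15, 52, 40, 54, 128, 1, 3]
minimum, q1, m, q3, maximum = five_number_summary(data)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr            # the outlier rule from the slide
outliers = [x for x in data if x > upper_fence]

print(minimum, q1, m, q3, maximum)  # 1 9 38.5 54 128
print(iqr, upper_fence, outliers)   # 45 121.5 [128]
```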

  5. Use computer • MINITAB (Version 14): Stat >> Basic Statistics >> Display Descriptive Statistics >> Statistics • Example data (columns: Variable, N, Min, Q1, Med, Q3, Max): x, 10, 1.0, 7.5, 38.5, 78.0, 54.0 • Harris Bank 1977 Salary data (columns: Variable, N, Min, Q1, Med, Q3, Max): SALARY, 93, 3900.0, 4890.0, 5400.0, 6000.0, 8100.0

  6. Boxplot • Graph >> Boxplot >> One Y >> Simple (SALARY) • [Figure: boxplot of SALARY, with labels Max, Outlier (> Q3 + 1.5 IQR), Q3, Median, IQR, Q1, Min]

  7. Sample mean • Definition • Computation methods • Use the formula • Use a calculator • Use a computer • MINITAB: Stat >> Basic Statistics >> Display Descriptive Statistics >> Statistics, or Calc >> Column Statistics >> Mean • Mean of salaries = 5420.3
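
For reference, the standard definition of the sample mean, which is presumably the formula this slide refers to, can be written as follows. The symbols x_1, ..., x_n for the observations and n for the sample size are the usual notation, assumed here rather than taken from the slide.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
```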

  8. Interpretation of the mean • Center of gravity of the distribution • Data x: 1, 2, 6; mean = 3; deviations from the mean: -2, -1, +3 • Data x: 1, 2, 12; mean = 5; deviations from the mean: -4, -3, +7 • For any data, the sum of deviations from the mean is zero
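
The zero-sum property can be checked directly on the two small data sets from this slide. The sketch below uses a hypothetical helper named `deviations`.

```python
from statistics import mean

def deviations(values):
    """Deviation of each observation from the sample mean."""
    center = mean(values)
    return [x - center for x in values]

for data in ([1, 2, 6], [1, 2, 12]):
    devs = deviations(data)
    # The positive and negative deviations balance, so they sum to zero.
    print(mean(data), devs, sum(devs))
```

Moving the largest value from 6 to 12 pulls the mean from 3 to 5, which is the sense in which the mean acts as a center of gravity.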

  9. Sample variance & standard deviation • Squared deviation from the mean • Sum of squared deviations from the mean • Variance = an average of the squared deviations • Sample variance • SD = square root of the variance • Sample SD • SD is in the same unit as the variable • Sensitive to extreme values • Not suitable for skewed distributions
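
The quantities listed on this slide are usually written as below. The divisor n - 1 is an assumption about the slide's formula, but it is the one that reproduces the value 6.0828 reported for the data 1, 12, 2 on the computation slide.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
```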

  10. Computation • Use the variance formula • Sample variance • Use a calculator and compute s • Data: 1, 12, 2 • Standard deviation of x = 6.0828
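
The reported value can be reproduced step by step. This sketch assumes the n - 1 divisor for the sample variance and cross-checks the result against `statistics.stdev` from the Python standard library.

```python
from math import sqrt
from statistics import mean, stdev

data = [1, 12, 2]
center = mean(data)                                   # 5
squared_devs = [(x - center) ** 2 for x in data]      # [16, 49, 9]
variance = sum(squared_devs) / (len(data) - 1)        # 74 / 2 = 37.0
sd = sqrt(variance)

print(round(sd, 4))            # 6.0828
print(round(stdev(data), 4))   # 6.0828
```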

  11. Computation by MINITAB • MINITAB: Stat >> Basic Statistics >> Display Descriptive Statistics >> Statistics, or Calc >> Column Statistics >> Standard deviation • Harris Bank 1977 Salary data, variable SALARY: N = 93, Mean = 5420.3, St. Dev. = 709.6, Variance = 503514.0 • Min = 3900, Q1 = 4890, Med = 5400, Q3 = 6000, Max = 8100

  12. Descriptive statistics for two groups • Stat >> Basic Statistics >> Display Descriptive Stat >> By variable >> Statistics • Salaries, Gender = 0: N = 61, Mean = 5138.9, Median = 5220.0, Tr Mean = 5137.1, StDev = 539.9, Min = 3900.0, Max = 6300.0, Q1 = 4800.0, Q3 = 5400.0 • Salaries, Gender = 1: N = 32, Mean = 5957, Median = 6000, Tr Mean = 5927, StDev = 691, Min = 4620, Max = 8100, Q1 = 5400, Q3 = 6225

  13. Boxplot for two groups • Graph >> Boxplot >> One Y >> With groups

  14. Linear function of data • Compute y as y = a + b x • Multiply each observation by the constant b • Add the constant a • Relations between summary measures for x and y • Mean(y) = a + b Mean(x) • Also true for Min, Q1, median, Q3, Max when b > 0 (when b < 0 the order reverses, so Min and Max, and Q1 and Q3, trade places) • SD(y) = |b| SD(x) • Also true for Range and IQR • Variance(y) = b^2 Variance(x)
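
These relations are easy to verify numerically. The sketch below reuses the ten-value example from the earlier slide; the constants a and b passed in, and the helper name `check_linear_rules`, are illustrative choices and not part of the original material.

```python
from math import isclose
from statistics import mean, median, stdev, variance

def check_linear_rules(x, a, b):
    """Check the summary-measure relations for y = a + b x on one data set."""
    y = [a + b * xi for xi in x]
    return {
        "mean": isclose(mean(y), a + b * mean(x)),
        "median": isclose(median(y), a + b * median(x)),
        "sd": isclose(stdev(y), abs(b) * stdev(x)),
        "variance": isclose(variance(y), b ** 2 * variance(x)),
    }

x = [1, 3, 9, 15, 37, 40, 52, 54, 54, 128]
print(check_linear_rules(x, 1000, 1))    # flat shift: spread is unchanged
print(check_linear_rules(x, 0, 1.1))     # rescaling by 1.1
print(check_linear_rules(x, 5, -2))      # negative b: SD scales by |b| = 2
```

The comparisons use `isclose` rather than `==` because the rescaled values are floating-point numbers.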

  15. Example • Flat raise: y = 1000 + x • Percentage raise: w = 1.1x • [Figure: salary after % raise shown against salary before raise]

  16. Standardized data • Compute z-scores as z = (x - Mean(x)) / SD(x) • Average of the z-scores = 0 • SD of the z-scores = 1 • MINITAB: Calc >> Standardize (specify an output column)
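
A short Python sketch of standardization, assuming the usual definition (subtract the sample mean, divide by the sample SD). The function name `standardize` and the reuse of the earlier ten-value example are illustrative choices.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert observations to z-scores using the sample mean and sample SD."""
    center = mean(values)
    spread = stdev(values)
    return [(x - center) / spread for x in values]

z = standardize([1, 3, 9, 15, 37, 40, 52, 54, 54, 128])
# Up to floating-point rounding, the z-scores have mean 0 and SD 1.
print(round(mean(z), 10), round(stdev(z), 10))
```

Standardizing is the linear function y = a + b x with b = 1/SD(x) and a = -Mean(x)/SD(x), so the mean and SD of the z-scores follow from the relations on the linear-function slide.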

  17. Example • [Figure: Z Income and Income]

  18. Non-linear functions of data • Monotone functions (increasing or decreasing) • Example: y = log x, Median(y) = log[Median(x)] • Also true for Min, Q1, Q3, Max • NOT TRUE FOR MEAN & SD • MEAN & SD must be computed after transforming the data • Non-monotone functions • Example: y = x^2 • ALL MEASURES must be computed after transforming the data
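
The contrast between rank-based measures and the mean can be seen on the nine-value example from the median slide. The shift by 40 in the second part is a made-up step, used only to produce negative values so that squaring is genuinely non-monotone on the data.

```python
from math import isclose, log
from statistics import mean, median

x = [1, 9, 15, 37, 40, 52, 54, 54, 128]        # n = 9, so the median is a data value
log_x = [log(v) for v in x]

# Monotone transformation: the median carries over, the mean does not.
print(isclose(median(log_x), log(median(x))))  # True
print(isclose(mean(log_x), log(mean(x))))      # False

# Non-monotone transformation: squaring values that straddle zero.
shifted = [v - 40 for v in x]
squared = [v ** 2 for v in shifted]
print(median(squared), median(shifted) ** 2)   # 196 0
```

With an even number of observations the median is an average of two values, so the relation for log holds only approximately.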

  19. Example • Natural log of income • [Figure: Log of Income and Income, Med = 40950]

  20. Skewed distribution • Income distribution • Hypothesis H: the data are generated from a normal distribution • If H is true, the tail probability is P-value = P[R ≤ 0.7916] < 0.0100 (small values of R count against H) • Conclusion: the P-value is low, hence the data reject the normality hypothesis

  21. Transformation to normality • Distribution of log of income • Hypothesis H: the logs of income are generated from a normal distribution • If H is true, the tail probability is P-value = P[R ≤ 0.9950] > 0.10 (small values of R count against H) • Conclusion: the P-value is not low, hence the data do not give evidence against the normality hypothesis
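
The statistic R on these two slides looks like a normal probability plot correlation (the kind reported by a Ryan-Joiner type test), but that reading is an assumption. The sketch below computes such a correlation between the ordered data and approximate normal scores. The plotting positions (i - 3/8) / (n + 1/4), the helper names, and the income values are all illustrative: the incomes are made up to be exactly log-normal, and no P-value is computed.

```python
from math import exp, log, sqrt
from statistics import NormalDist, mean

def normal_scores(n):
    """Approximate expected normal order statistics (assumed plotting positions)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def pearson_r(u, v):
    """Ordinary correlation coefficient between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def probability_plot_r(values):
    """Correlation between the ordered data and the normal scores."""
    ordered = sorted(values)
    return pearson_r(ordered, normal_scores(len(ordered)))

# Made-up, right-skewed "incomes" whose logarithms are exactly normal scores.
incomes = [exp(10.5 + 0.8 * z) for z in normal_scores(50)]
r_raw = probability_plot_r(incomes)
r_log = probability_plot_r([log(v) for v in incomes])
print(round(r_raw, 4), round(r_log, 4))
```

A value of R close to 1 means the ordered data line up with the normal scores, so the log-transformed values give a larger R than the raw skewed values, matching the direction of the comparison on the two slides.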
