Understanding Data Measures and Transformation Methods

Set 3 Numerical description of data, data transformation

Measures based on rank of data
• Five-number summary
  • Minimum = the smallest data value
  • First quartile = Q1, 25% below, 75% above
  • Median = M, the middle value, 50% below, 50% above
  • Third quartile = Q3, 75% below, 25% above
  • Maximum = the largest data value
• Measures of spread
  • Range = Max - Min
  • Interquartile range = range of the middle 50%, IQR = Q3 - Q1

Computation of sample median
• Order the data values in increasing order
• n odd: M = the middle value
• n even: M = the average of the two middle values
• Data: 54, 9, 37, 15, 52, 40, 54, 128, 1
• Ordered data and order:
  Value  1  9  15  37  40  52  54  54  128
  Order  1  2   3   4   5   6   7   8    9
• n = 9, an odd number
• M = 40 (the 5th ordered value)
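The ordering rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `sample_median` is a hypothetical helper chosen for this example.

```python
def sample_median(values):
    """Sample median: middle value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)          # order the data in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # n odd: the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: average of two middle values

data = [54, 9, 37, 15, 52, 40, 54, 128, 1]
print(sample_median(data))  # 40
```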

A simple example
• Data: 54, 9, 37, 15, 52, 40, 54, 128, 1, 3
• Ordered data: 1, 3, 9, 15, 37, 40, 52, 54, 54, 128
• Min = 1, Max = 128
• Range = 128 - 1 = 127
• n = 10, M = average of 37 and 40 = 38.5
• Q1 = 9, approximately the median of the data below M
• Q3 = 54, approximately the median of the data above M
• IQR = 54 - 9 = 45, 1.5 IQR = 67.5
• Outlier: 128 > 54 + 67.5 = 121.5
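As a rough check of the example above, the following Python sketch computes the five-number summary with the "median of each half" rule and flags values above Q3 + 1.5 IQR. The helper name `five_number_summary` is hypothetical, and leaving the median out of both halves when n is odd is an assumption of this sketch; statistical packages often use interpolation rules that give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max), taking Q1 and Q3 as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + n % 2:]     # values above the median position (assumed rule for odd n)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data = [54, 9, 37, 15, 52, 40, 54, 128, 1, 3]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr           # the outlier threshold used in the example
outliers = [x for x in data if x > upper_fence]

print(minimum, q1, med, q3, maximum)   # 1 9 38.5 54 128
print(iqr, upper_fence, outliers)      # 45 121.5 [128]
```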

Use computer
• MINITAB (Version 14): Stat >> Basic Statistics >> Display Descriptive Statistics >> Statistics
• Example data
  Variable   N  Min   Q1   Med    Q3   Max
  x         10  1.0  7.5  38.5  78.0  54.0
• Harris Bank 1977 salary data
  Variable   N     Min      Q1     Med      Q3     Max
  SALARY    93  3900.0  4890.0  5400.0  6000.0  8100.0

Boxplot
• Graph >> Boxplot >> One Y >> Simple
[Figure: boxplot of SALARY, with labels marking Min, Q1, Median, Q3, IQR, Max, and an outlier beyond Q3 + 1.5 IQR]

Sample mean
• Definition: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Computation methods
  • Use the formula
  • Use a calculator
  • Use a computer
• MINITAB: Stat >> Basic Statistics >> Display Descriptive Statistics >> Statistics
  Or: Calc >> Column Statistics >> Mean
• Mean of salaries = 5420.3

Interpretation of the mean
• Center of gravity of the distribution
  Data x      Mean   Deviations from the mean
  1, 2, 6     3      -2, -1, +3
  1, 2, 12    5      -4, -3, +7
• For any data, the sum of deviations from the mean is zero
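The zero-sum property can be checked directly on the two small data sets from this slide. The sketch below is illustrative only; `deviations_from_mean` is a hypothetical helper name.

```python
from statistics import mean

def deviations_from_mean(values):
    """Return each value's deviation from the sample mean."""
    x_bar = mean(values)
    return [x - x_bar for x in values]

for data in ([1, 2, 6], [1, 2, 12]):
    devs = deviations_from_mean(data)
    # The deviations balance out, so their sum is zero.
    print(mean(data), devs, sum(devs))
```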

Sample variance & standard deviation
• Squared deviation from the mean: $(x_i - \bar{x})^2$
• Sum of squared deviations from the mean: $\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Variance = an average of the squared deviations
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• SD = square root of the variance
• Sample SD: $s = \sqrt{s^2}$
• SD is in the same unit as the variable
• SD is sensitive to extreme values
• SD is not suitable for skewed distributions

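Computation
• Use the variance formula
• Sample variance for the data 1, 12, 2: the mean is 5, so $s^2 = \frac{(-4)^2 + 7^2 + (-3)^2}{3 - 1} = \frac{74}{2} = 37$
• Use a calculator and compute s
• Data: 1, 12, 2; standard deviation of x = $\sqrt{37}$ = 6.0828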
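The same computation can be reproduced in Python. This sketch assumes the n - 1 divisor, which is the one that matches the reported value 6.0828; `sample_variance` is a hypothetical helper name.

```python
from math import sqrt

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [1, 12, 2]
s2 = sample_variance(data)
s = sqrt(s2)
print(s2, round(s, 4))  # 37.0 6.0828
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and should agree with these values.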

Computation by MINITAB
• MINITAB: Stat >> Basic Statistics >> Display Descriptive Statistics >> Statistics
  Or: Calc >> Column Statistics >> Standard deviation
• Harris Bank 1977 salary data
  Variable   N    Mean  St. Dev.  Variance
  SALARY    93  5420.3     709.6  503514.0

  Variable   Min    Q1   Med    Q3   Max
  SALARY    3900  4890  5400  6000  8100

Descriptive statistics for two groups
• Stat >> Basic Statistics >> Display Descriptive Statistics >> By variable >> Statistics
  Variable  Gender   N    Mean  Median  Tr Mean  StDev
  Salaries  0       61  5138.9  5220.0   5137.1  539.9
            1       32  5957    6000     5927    691

  Variable  Gender     Min     Max      Q1      Q3
  Salaries  0       3900.0  6300.0  4800.0  5400.0
            1       4620    8100    5400    6225

Boxplot for two groups
• Graph >> Boxplot >> One Y >> With groups

Linear function of data
• Compute y as y = a + b x
  • Multiply each observation by the constant b
  • Add the constant a
• Relations between summary measures for x and y
  • Mean(y) = a + b Mean(x)
    Also true for Min, Q1, median, Q3, Max when b > 0 (a negative b reverses the order, so Min and Max swap, as do Q1 and Q3)
  • SD(y) = |b| SD(x)
    Also true for Range and IQR
  • Variance(y) = $b^2$ Variance(x)
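These relations can be verified numerically. The sketch below reuses the ten-value example data from earlier in the deck and combines the constants a = 1000 and b = 1.1 purely for illustration; `linear_transform` is a hypothetical helper name.

```python
from math import isclose
from statistics import mean, median, stdev, variance

def linear_transform(x, a, b):
    """Return y = a + b * x for every observation."""
    return [a + b * xi for xi in x]

x = [54, 9, 37, 15, 52, 40, 54, 128, 1, 3]
a, b = 1000, 1.1                      # illustrative constants
y = linear_transform(x, a, b)

# Measures of location shift and scale; measures of spread only scale.
print(isclose(mean(y), a + b * mean(x)))
print(isclose(median(y), a + b * median(x)))
print(isclose(stdev(y), abs(b) * stdev(x)))
print(isclose(variance(y), b ** 2 * variance(x)))
```

Each comparison uses `isclose` rather than `==` because floating-point rounding can make the two sides differ in the last digits.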

Example
• Flat raise: y = 1000 + x
• Percentage raise: w = 1.1x
[Figure: salary after the % raise plotted against salary before the raise]

Standardized data
• Compute z's as $z_i = \frac{x_i - \bar{x}}{s}$
• Average of the z's = 0
• SD of the z's = 1
• MINITAB: Calc >> Standardize (specify an output column)
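A minimal Python sketch of standardization, assuming the usual definition (subtract the sample mean, divide by the sample SD). The helper name `standardize` and the reuse of the earlier ten-value example data are choices made for this illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample SD for each observation."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

z = standardize([54, 9, 37, 15, 52, 40, 54, 128, 1, 3])

# Up to floating-point rounding, the z-scores have mean 0 and SD 1.
print(round(mean(z), 10), round(stdev(z), 10))
```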

Example
[Figure: standardized income (Z Income) and Income]

Non-linear functions of data
• Monotone functions (increasing or decreasing)
  • Example: y = log x
  • Median(y) = log[Median(x)]
  • Also true for Min, Q1, Q3, Max (for a decreasing function the order reverses: Min and Max swap, as do Q1 and Q3)
  • NOT TRUE FOR MEAN & SD
  • MEAN & SD must be computed after transforming the data
• Non-monotone functions
  • Example: y = $x^2$
  • ALL MEASURES must be computed after transforming the data
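The difference between rank-based measures and the mean under a non-linear transformation can be seen in a small Python check. The sketch uses the nine-value example data from the median slide, so that n is odd and the median is an actual data value; the three-value list used for the squaring case is made up for illustration.

```python
from math import log, isclose
from statistics import mean, median

x = [1, 9, 15, 37, 40, 52, 54, 54, 128]     # n = 9, median is a data value
y = [log(v) for v in x]                     # monotone increasing transformation

print(isclose(median(y), log(median(x))))   # True: the median carries over
print(isclose(mean(y), log(mean(x))))       # False: the mean does not

# Non-monotone case: squaring data that contain negative values (illustrative data).
u = [-3, -1, 2]
w = [v ** 2 for v in u]
print(median(w), median(u) ** 2)            # 4 1: even the median must be recomputed
```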

Example
• Natural log of income
[Figure: distributions of Income and Log of Income; median income Med = 40950]

Skewed distribution
• Income distribution
• Hypothesis H: Data are generated from a normal distribution
• Tail probability computed assuming H is true: P-value = P[R > 0.7916] < 0.0100
• Conclusion: The P-value is low, hence the data reject the normality hypothesis

Transformation to normality
• Distribution of log of income
• Hypothesis H: Logs of incomes are generated from a normal distribution
• Tail probability computed assuming H is true: P-value = P[R > 0.9950] > 0.10
• Conclusion: The P-value is not low, hence the data do not give evidence against the normality hypothesis
