Descriptive Statistics

Descriptive Statistics Statistics Faculty of Information Technology King Mongkut’s University of Technology North Bangkok

Content • Data Preparation • Data Presentation • Descriptive Statistics

Data Preparation • Data checking for accuracy • Data cleaning • Removal of inaccurate data, errors, and outliers • Dealing with missing data • Data transformation • Application of a deterministic mathematical function to each point in a data set • The function used to transform the data is invertible and is generally continuous

Data Transformation • To comply with the requirements of statistical analysis • For better understanding of graphs • Ease of interpretation of data • Common methods • The logarithm and square root transformations are commonly used for positive data • The multiplicative inverse (reciprocal) transformation can be used for non-zero data

Example • Populations • See http://en.wikipedia.org/wiki/Data_transformation_(statistics) • Fuel consumption • Kilometers per litre • 10 km/l • Reciprocal: litres per 100 kilometers • 10l/100km • Why?
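The transformations named above can be sketched in a few lines of Python. This is only an illustration: the population figures are made up, and the function names are hypothetical. The fuel-consumption pair shows why the transformation must be invertible: applying the reciprocal conversion twice returns the original value.

```python
import math

# Hypothetical positive data spanning several orders of magnitude
populations = [1_000, 10_000, 100_000, 1_000_000]

log_values = [math.log10(v) for v in populations]   # logarithm transformation
sqrt_values = [math.sqrt(v) for v in populations]   # square root transformation


def km_per_litre_to_litres_per_100km(km_per_l: float) -> float:
    """Reciprocal transformation, rescaled to a distance of 100 km."""
    return 100.0 / km_per_l


def litres_per_100km_to_km_per_litre(l_per_100km: float) -> float:
    """Inverse of the conversion above (same reciprocal form)."""
    return 100.0 / l_per_100km


print(log_values)
print(km_per_litre_to_litres_per_100km(10))   # the slide's 10 km/l example
```

After the logarithm, the made-up populations are evenly spaced, which is the usual reason for plotting such data on a log scale.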

Data Presentation • Text • Table • Graphical • Pictograph • Bar Chart • Pie Chart • Line Chart • Histogram • Stem and Leaf • Scatter Plot • Box Plot • What is the difference between Bar Chart and Histogram?

Normal Curve and Skewed Curves • Positive Skewed Curve • Normal or Symmetrical Curve • Negative Skewed Curve

J-Shaped Curve • U-Shaped Curve • Multimodal Curve • J-Reversed Shaped Curve • Bimodal Curve

Cumulative Frequency Curve • Stem and Leaf • Scatter Plot

Box Plot • Shows data distribution and skewness • Right/Positive Skewed • Left/Negative Skewed • Normal

Descriptive Statistic • A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution. (Frank & Althoen, “Statistics: Concepts and applications”, 1994)

Descriptive Statistics • Frequency distribution table • Describe • Location of distribution – Mode, Median, Mean • Dispersion of distribution – Range, SD, Variance • Shape of distribution – Skewness, Kurtosis • Individuals in distributions – Percentile, Decile, Quartile • Joint distributions of data • Scatter Diagram • Correlation Coefficient • Linear Regression

Frequency Distribution • Grouped • Ungrouped • Raw data: 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, … • Can be visualized using graphs and charts • Determining number of intervals • k = 1 + 3.3 log N (base-10 logarithm, N = number of data entries) • Interval width = Range / k
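A minimal sketch of the interval rule, using only the eleven raw values that are visible on the slide (the rest of the data set is elided, so the result is illustrative). Rounding k up to a whole number is an assumption; the slide does not say how to round.

```python
import math


def number_of_intervals(n: int) -> int:
    """k = 1 + 3.3 * log10(N), rounded up (the rounding rule is an assumption)."""
    return math.ceil(1 + 3.3 * math.log10(n))


# Only the values shown on the slide; the full data set is not available.
scores = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

k = number_of_intervals(len(scores))
data_range = max(scores) - min(scores)
interval_width = data_range / k

print(k, data_range, interval_width)
```

In practice the width is usually rounded to a convenient number (for example 12 instead of 11.8) so that interval bounds are easy to read.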

Frequency Distribution Table • One-way • One variable – often used with percentage • Two-way • Two variables – shows rough relation between two variables • Etc.

Describing Location of distribution • Mode • The value with highest frequency • Applicable to nominal scale (and higher scale) • Can be more than one value for one set of data • fx : MODE

Arithmetic Mean • Considered the best of the three measures • Sum of values divided by total frequency • Can be affected by extreme values • Changing the value of any entry also changes the mean • Adding / subtracting a constant to / from all entries changes the mean by the same constant • Multiplying / dividing all entries by a constant multiplies / divides the mean by the same constant • The sum of the differences between each entry and the mean is always zero • For grouped data, use the sum of the products of each interval's midpoint and its frequency, divided by total frequency • fx : AVERAGE
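The properties listed for the mean can be checked numerically. The sketch below uses made-up entries and a hypothetical helper `grouped_mean`; the interval midpoints and frequencies are illustrative only.

```python
from statistics import mean

data = [2, 4, 6, 8]              # hypothetical entries
m = mean(data)

# Adding a constant to every entry shifts the mean by that constant.
shifted_mean = mean([x + 3 for x in data])

# Multiplying every entry by a constant multiplies the mean by it.
scaled_mean = mean([x * 2 for x in data])

# Deviations from the mean always sum to zero.
deviation_sum = sum(x - m for x in data)


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(midpoint * frequency) / total frequency."""
    total = sum(frequencies)
    return sum(mp * f for mp, f in zip(midpoints, frequencies)) / total


print(m, shifted_mean, scaled_mean, deviation_sum)
print(grouped_mean([15.5, 25.5, 35.5], [2, 5, 3]))
```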

Median • Better for data with extreme values • Ungrouped data • The value in the middle of the distribution after sorting • N is odd: the value at position (N+1) / 2 • N is even: the average of the values at positions N/2 and N/2 + 1 (the two middle values) • fx : MEDIAN • Grouped data • See percentile
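A short comparison of the three location measures, using Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The data sets are small illustrative examples; the last one replaces a single entry with an extreme value to show why the median is preferred in that case.

```python
from statistics import mean, median, multimode

odd_data = [18, 29, 31, 32, 33]        # N is odd: middle value
even_data = [18, 29, 31, 32]           # N is even: average of two middle values

median_odd = median(odd_data)
median_even = median(even_data)

# A data set can have more than one mode.
modes = multimode([1, 1, 2, 2, 3])

# One extreme value pulls the mean far more than the median.
with_extreme = [18, 29, 31, 32, 330]
mean_extreme = mean(with_extreme)
median_extreme = median(with_extreme)

print(median_odd, median_even, modes)
print(mean_extreme, median_extreme)
```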

Describing Dispersion • Range • Ungrouped: Max – Min (fx MAX – fx MIN) • Grouped: true highest upper bound – true lowest lower bound • True upper bound is average value between the upper bound of the interval and the (expected) lower bound of the higher interval • True lower bound is average value between the lower bound of the interval and the (expected) upper bound of the lower interval

Standard Deviation & Variance • Square root of the average squared difference between each entry and the arithmetic mean (higher value means data are more dispersed) • Standard Deviation (or SD, S.D., S) is the most popular measure for describing dispersion • Formulas: two alternative forms (joined by OR), each given for N >= 30 and for N < 30

Standard Deviation & Variance • Always SD >= 0 • SD of 0 means that all data entries have the same value • Adding / subtracting a constant to / from all entries does not affect SD • Multiplying / dividing all entries by a value m multiplies / divides SD by the absolute value of m • Variance is equal to SD^2 • Only the positive value of SD is of interest • fx : STDEV and VARA
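The SD properties can be verified with the standard `statistics` module. `pstdev` divides by N and `stdev` divides by N - 1. Which divisor the slide pairs with N >= 30 and N < 30 cannot be recovered from the text, so both are shown. Excel's STDEV uses the N - 1 form. The data values are illustrative.

```python
from statistics import pstdev, pvariance, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]        # illustrative entries, mean = 5

sd_population = pstdev(data)            # divisor N
sd_sample = stdev(data)                 # divisor N - 1
variance_population = pvariance(data)   # equals sd_population ** 2

# Shifting every entry leaves SD unchanged.
sd_shifted = pstdev([x + 10 for x in data])

# Scaling every entry by m multiplies SD by |m|.
sd_scaled = pstdev([x * -3 for x in data])

print(sd_population, sd_sample, variance_population)
print(sd_shifted, sd_scaled)
```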

Shape of Distribution • Skewness • 0 means there is no skewness (symmetric, as in the normal distribution) • Positive value means positive/right skewed • Negative value means negative/left skewed • Calculation? • Just use Excel or SPSS • fx : SKEW

Shape of Distribution • Kurtosis • 0 means normal distribution • Positive value means very peaked (less dispersed) • Negative value means less peaked (more dispersed) • Calculation? • Just use Excel or SPSS • fx : KURT
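For readers who want to see what is behind the software, here is a sketch based on central moments. It is an assumption that this simple moment form is acceptable for illustration: Excel's SKEW and KURT apply small-sample corrections, so their results differ somewhat from these values. The function names are hypothetical.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over all entries."""
    m = sum(data) / len(data)
    return sum((x - m) ** order for x in data) / len(data)


def skewness(data):
    """Third central moment divided by SD cubed; 0 for symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth central moment divided by variance squared, minus 3,
    so that a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```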

Describing Individuals in Distributions • Percentile • Quartile • Decile • Performed on data sorted in ascending order • Divide the data into 100, 4, or 10 parts and identify the value at the desired position

Percentile Rank • “The percentile rank of any particular score x is the percentage of observations equal to or less than x” • Divide sorted data set into 100 parts • “cent” = 100, thus “per” “cent” = /100 • Percentile rank of entry x_i = 100 * (cumulative frequency_i / N) • e.g. 18, 29, 31, 32, 33 • Percentile rank of 31 = 100*(3/5) = 60 • Be careful! • Percentile rank determines rank from data value • Excel uses 0.00 – 1.00 for fx: PERCENTRANK
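The percentile-rank definition translates directly into code. The function name is hypothetical, and the result is on the 0 to 100 scale used on the slide.

```python
def percentile_rank(data, x):
    """Percentage of observations equal to or less than x (0-100 scale)."""
    cumulative_frequency = sum(1 for value in data if value <= x)
    return 100 * cumulative_frequency / len(data)


scores = [18, 29, 31, 32, 33]
print(percentile_rank(scores, 31))   # the slide's example: 100 * (3/5)
```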

Percentile • “The kth percentile is the x-value at or below which fall k percent of observations” • Roughly • Position of data entry at kth percentile = k(n+1)/100 • e.g. 18, 29, 31, 32, 33 • 80th percentile: position = (80/100)(5+1) = 4.8, i.e. about the 5th position • Be careful! • Percentile determines data value from percentile rank • Excel uses 0.00 – 1.00 for fx: PERCENTILE

Quartile • The kth quartile is the x-value at or below which fall k quarters of observations • Roughly • Position of data entry at kth quartile = k(n+1)/4 • e.g. 18, 29, 31, 32, 33 • 3rd quartile: position = (3/4)(5+1) = 4.5, i.e. between the 4th and 5th positions • fx: QUARTILE

Decile • The kth decile is the x-value at or below which fall k tenths of observations • Roughly • Position of data entry at kth decile = k(n+1)/10 • e.g. 18, 29, 31, 32, 33 • 5th decile: position = (5/10)(5+1) = 3, i.e. the 3rd position • Excel does not have a direct decile function • Use fx: PERCENTILE with 0.1, 0.2, 0.3, … instead
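The three position rules share one form, k(n+1)/parts, with parts = 100, 4, or 10. The sketch below computes the position and then reads off a value by linear interpolation between neighbouring entries. The interpolation step is an assumption; the slides simply round (4.8 becomes the 5th position), and Excel's functions use a different position rule, so results can differ. Function names are hypothetical.

```python
def position(k, n, parts):
    """1-based position of the kth percentile (parts=100),
    quartile (parts=4), or decile (parts=10)."""
    return k * (n + 1) / parts


def value_at(sorted_data, pos):
    """Value at a possibly fractional 1-based position, interpolating
    linearly between neighbours (an assumption, see the text above)."""
    n = len(sorted_data)
    if pos <= 1:
        return float(sorted_data[0])
    if pos >= n:
        return float(sorted_data[-1])
    lower = int(pos)                     # 1-based index of the lower neighbour
    fraction = pos - lower
    low, high = sorted_data[lower - 1], sorted_data[lower]
    return low + fraction * (high - low)


data = sorted([18, 29, 31, 32, 33])
n = len(data)

print(position(80, n, 100), value_at(data, position(80, n, 100)))  # 80th percentile
print(position(3, n, 4), value_at(data, position(3, n, 4)))        # 3rd quartile
print(position(5, n, 10), value_at(data, position(5, n, 10)))      # 5th decile
```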

Percentile for Grouped Data • P = L + I × ((n × r)/100 − Σf) / fr • r: The percentile • P: Data value at given percentile r • L: True lower bound of the interval in which percentile r falls • I: Interval width • n: Number of data entries • Σf: Cumulative frequency of intervals below L • fr: Frequency of the interval starting at L • Determine the interval in which the percentile falls using (n × r)/100

Example • Percentile 60th • n = 72, thus P60 is at around 60/100*72 = 43rd entry which falls in interval 61 – 70 • Thus • P60 = 60.5 + (10{(60/100*72) - 36}) / 17 = 64.74
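The worked example can be reproduced with a small function. Parameter names are hypothetical and simply spell out the symbols L, I, n, Σf and fr; the input numbers are the ones used in the example (n = 72, interval 61 – 70 with true lower bound 60.5, width 10, cumulative frequency 36 below, frequency 17).

```python
def grouped_percentile(r, n, lower_bound, width, cum_freq_below, freq):
    """P = L + I * ((n * r / 100) - cumulative frequency below L) / fr."""
    target_position = n * r / 100
    return lower_bound + width * (target_position - cum_freq_below) / freq


p60 = grouped_percentile(r=60, n=72, lower_bound=60.5, width=10,
                         cum_freq_below=36, freq=17)
print(round(p60, 2))
```

Calling the same function with r = 50 gives the median of grouped data, provided the interval values passed in are those of the interval containing the 50th percentile.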

Median

Joint Distribution of Data • Scatter Diagram • Negatively related (imaginary line showing relation) • Not related • Positively related (imaginary line showing relation)

Correlation Coefficient • Pearson Product-Moment Correlation Coefficient • Denoted as rxy or r • Measures the linear correlation between two data sets • Can take values from -1 to 1 • Value of 1: the two data sets have a perfect positive linear relation • Value of -1: the two data sets have a perfect negative linear relation • Value of 0: the two data sets have no linear relation

Correlation Coefficient • Formula • fx: PEARSON (do not use in MS Excel earlier than 2003) • fx: CORREL
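The formula itself did not survive on the slide above. A common computational form of the Pearson coefficient, written with the same kind of sums used in the regression example later in the deck, is sketched below. Treat it as one standard way to compute r, not necessarily the exact form the slide showed; the function name and data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


print(pearson_r([1, 2, 3], [2, 4, 6]))     # perfect positive relation
print(pearson_r([1, 2, 3], [6, 4, 2]))     # perfect negative relation
print(pearson_r([1, 2, 3, 4], [1, 3, 2, 4]))
```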

Correlation for Ordinal Scale • Spearman Rank Correlation Coefficient • Two variables • Kendall’s Tau Rank Correlation Coefficient • Also two variables; for three or more variables, Kendall’s coefficient of concordance (W) is used

Linear Regression • Describes the relation between two interval-scale variables in the form of a regression equation • y = bx + a (Straight line) • y = a + bx + cx^2 (Parabola equation) • y = ab^x (Exponential equation) • x: independent variable • y: dependent variable • a: Y-intercept (where the line crosses the Y axis) • b: Slope

Simple Linear Regression • Find b then a • Then write the equation • y = bx + a • E.g. b = 31.4, a = 4.52 • y = 31.4x + 4.52

Example • Table shows the time (in minutes) each student spends reading for the exam and his/her score • b = {10 (45885) – (1035)(413)} / {10 (123375) – (1035)^2} • = 31395 / 162525 = 0.1932 • a = 41.3 – (0.1932) (103.5) = 21.3038 • y = 0.1932x + 21.3038 • Meaning • Each additional minute of reading is associated with a score increase of 0.1932 mark • A student who does not read at all is predicted to score 21.3038 marks
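The calculation above can be reproduced from the summary sums alone (n = 10, Σx = 1035, Σy = 413, Σxy = 45885, Σx² = 123375). The function name is hypothetical. Because the code keeps the full precision of b instead of rounding it to 0.1932 first, the intercept comes out slightly different from 21.3038 in the third decimal place.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares slope b and intercept a for y = bx + a."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)      # a = mean(y) - b * mean(x)
    return a, b


a, b = fit_line_from_sums(n=10, sum_x=1035, sum_y=413,
                          sum_xy=45885, sum_x2=123375)

print(round(b, 4), round(a, 4))

# Predicted score for a hypothetical 60 minutes of reading
predicted = b * 60 + a
print(round(predicted, 2))
```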

Multiple Linear Regression • More than one independent variable • Equation Y = a + b1x1 + b2x2 + b3x3… • Requirements • Normal distribution • No multicollinearity (independent variables do not depend on each other) • Selecting independent variables • All Entry – when you are not sure which variables have an effect • Stepwise – only use variables tested to be significant

Regression output (annotated figure) • Simple correlation • How much of the dependent variable can be explained by the independent variable • Is the model good (significant)? (yes, Sig. < 0.05) • Coefficient labels: a, b, b1, b2

Summary

Descriptive Statistics