Descriptive Statistics. Statistics. Faculty of Information Technology King Mongkut’s University of Technology North Bangkok. Content. Data Preparation Data Presentation Descriptive Statistics. Data Preparation. Data checking for accuracy Data cleaning
Descriptive Statistics Statistics Faculty of Information Technology King Mongkut’s University of Technology North Bangkok
Content • Data Preparation • Data Presentation • Descriptive Statistics
Data Preparation • Data checking for accuracy • Data cleaning • Removal of inaccurate data, errors, outliers • Dealing with missing data • Data transformation • Application of a deterministic mathematical function to each point in a data set • The function used to transform the data is invertible and generally continuous
Data Transformation • To comply with requirement of statistical analysis • For better understanding of graph • Ease of interpretation of data • Common method • The logarithm and square root transformations are commonly used for positive data • The multiplicative inverse (reciprocal) transformation can be used for non-zero data
Example • Populations • See http://en.wikipedia.org/wiki/Data_transformation_(statistics) • Fuel consumption • Kilometers per litre • 10 km/l • Reciprocal: litres per 100 kilometers • 10l/100km • Why?
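A minimal sketch of the common transformations named above, using only the Python standard library. The sample values and the helper name `km_per_litre_to_litres_per_100km` are illustrative assumptions, not taken from the slides.

```python
import math

# Illustrative positive values only (not data from the slides).
values = [1.0, 10.0, 100.0, 1000.0]

log_values = [math.log10(v) for v in values]    # logarithm: positive data only
sqrt_values = [math.sqrt(v) for v in values]    # square root: non-negative data
recip_values = [1.0 / v for v in values]        # reciprocal: non-zero data


def km_per_litre_to_litres_per_100km(km_per_litre: float) -> float:
    """Reciprocal transformation, rescaled to a distance of 100 km."""
    return 100.0 / km_per_litre


# 10 km/l corresponds to 10 l/100 km, as in the fuel consumption example.
print(km_per_litre_to_litres_per_100km(10.0))
```

Each function here is invertible on its stated domain, which is the property the Data Preparation slide asks of a transformation.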
Data Presentation • Text • Table • Graphical • Pictograph • Bar Chart • Pie Chart • Line Chart • Histogram • Stem and Leaf • Scatter Plot • Box Plot • What is the difference between Bar Chart and Histogram?
Normal Curve and Skewed Curves • Positive Skewed Curve • Normal or Symmetrical Curve • Negative Skewed Curve
J-Shaped Curve • U-Shaped Curve • Multimodal Curve • Reverse J-Shaped Curve • Bimodal Curve
Cumulative Frequency Curve • Stem and Leaf • Scatter Plot
Box Plot • Shows data distribution and skewness • Right/Positive Skewed • Left/Negative Skewed • Normal
Descriptive Statistic • A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution. (Frank & Althoen, “Statistics: Concepts and applications”, 1994)
Descriptive Statistics • Frequency distribution table • Describe • Location of distribution – Mode, Median, Mean • Dispersion of distribution – Range, SD, Variance • Shape of distribution – Skewness, Kurtosis • Individuals in distributions – Percentile, Decile, Quartile • Joint distributions of data • Scatter Diagram • Correlation Coefficient • Linear Regression
Frequency Distribution • Grouped • Ungrouped • Raw data: 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, … • Can be visualized using graphs and charts • Determining number of intervals (Sturges' rule) • k = 1 + 3.3 log10(N), where N is the number of data entries • Interval width = Range / k
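The interval rule above can be sketched as follows. The slide's raw data list is truncated, so only the eleven values actually shown are used. Rounding k up and choosing an integer width are assumptions made here; the slide does not say how to round.

```python
import math
from collections import Counter

# Only the values visible on the slide; the original list continues.
data = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

n = len(data)
k = math.ceil(1 + 3.3 * math.log10(n))   # number of intervals, rounded up (assumption)
data_range = max(data) - min(data)
width = data_range // k + 1              # integer width wide enough to cover the range (assumption)

low = min(data)
counts = Counter((x - low) // width for x in data)

# Each row: (lower limit, upper limit, frequency)
table = [(low + i * width, low + (i + 1) * width - 1, counts.get(i, 0))
         for i in range(k)]

for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```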
Frequency Distribution Table • One-way • One variable – often used with percentage • Two-way • Two variables – shows rough relation between two variables • Etc.
Describing Location of distribution • Mode • The value with highest frequency • Applicable to nominal scale (and higher scale) • Can be more than one value for one set of data • fx : MODE
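A short sketch of finding the mode with Python's standard library (`statistics.multimode`, available from Python 3.8). It returns every value tied for the highest frequency, which matches the point that one data set can have more than one mode. The second and third lists are made-up illustrations.

```python
from statistics import multimode

scores = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]   # values shown on the raw data slide
two_modes = [1, 1, 2, 2, 3]                              # illustrative: two values tie
colours = ["red", "blue", "red"]                         # illustrative nominal-scale data

print(multimode(scores))
print(multimode(two_modes))
print(multimode(colours))
```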
Arithmetic Mean • Considered the best of the three measures of location • Sum of all values divided by total frequency • Can be affected by extreme values (outliers) • Changing the value of any entry also changes the mean • Adding / subtracting a value to / from every entry changes the mean by the same value • Multiplying / dividing every entry by a value multiplies / divides the mean by the same value • Sum of the differences between each entry and the mean is always zero • For grouped data, use the sum of the products of each interval's midpoint and frequency, divided by total frequency • fx : AVERAGE
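The properties of the mean listed above can be checked numerically. This is a small sketch using the values shown on the raw data slide; the grouped intervals (midpoints and frequencies) are a hypothetical grouping chosen for illustration.

```python
from statistics import fmean

data = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

mean = fmean(data)
shifted_mean = fmean(x + 10 for x in data)    # every entry + 10 -> mean + 10
scaled_mean = fmean(x * 3 for x in data)      # every entry * 3  -> mean * 3
deviation_sum = sum(x - mean for x in data)   # always zero, up to rounding error

# Grouped data: hypothetical intervals 32-43, 44-55, 56-67, 68-79, 80-91.
midpoints = [37.5, 49.5, 61.5, 73.5, 85.5]
frequencies = [2, 3, 3, 1, 2]
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)

print(mean, shifted_mean, scaled_mean, deviation_sum, grouped_mean)
```

The grouped mean is only an approximation of the ungrouped mean, because every entry in an interval is treated as if it sat at the midpoint.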
Median • Better for data with extreme values (outliers) • Ungrouped data • The value in the middle of the distribution after sorting • N is odd: the value at position (N+1) / 2 • N is even: the average of the values at positions N/2 and N/2 + 1 (the two middle values) • fx : MEDIAN • Grouped data • See percentile
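A brief sketch of the odd and even cases, and of why the median copes with extreme values better than the mean. The data are the values shown on the raw data slide; the altered lists are illustrative.

```python
from statistics import fmean, median

odd_data = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]   # N = 11
even_data = sorted(odd_data)[:-1]                          # drop the largest value, N = 10

median_odd = median(odd_data)     # value at position (11 + 1) / 2 = 6 after sorting
median_even = median(even_data)   # average of the values at positions 5 and 6

# Replace the largest value with an extreme one: the median stays put, the mean moves.
with_outlier = [9100 if x == 91 else x for x in odd_data]
median_outlier = median(with_outlier)
mean_shift = fmean(with_outlier) - fmean(odd_data)

print(median_odd, median_even, median_outlier, mean_shift)
```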
Describing Dispersion • Range • Ungrouped: Max – Min (fx MAX – fx MIN) • Grouped: true highest upper bound – true lowest lower bound • True upper bound is average value between the upper bound of the interval and the (expected) lower bound of the higher interval • True lower bound is average value between the lower bound of the interval and the (expected) upper bound of the lower interval
Standard Deviation & Variance • Standard Deviation (or SD, S.D., S) is the most popular measure of dispersion • Square root of the average squared difference between each entry and the arithmetic mean (higher value means data are more dispersed) • [Formula images not recovered: two forms joined by OR, each given for N >= 30 and for N < 30]
Standard Deviation & Variance • SD >= 0 always • SD of 0 means that all data entries have the same value • Adding / subtracting a value to / from all entries does not affect SD • Multiplying / dividing all entries by a value m multiplies / divides SD by the absolute value of m • Variance is equal to SD² • Only the positive square root is used for SD • fx : STDEV and VARA
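The properties above can be verified with Python's `statistics` module. Here `pstdev` divides by N and `stdev` divides by N - 1; which of the two the slide's formulas pair with N >= 30 and N < 30 is not recoverable from the text, so both are shown. Excel's STDEV uses the N - 1 form.

```python
from statistics import pstdev, pvariance, stdev

data = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

sd_n = pstdev(data)          # divisor N
sd_n_minus_1 = stdev(data)   # divisor N - 1, always a little larger

shifted_sd = pstdev([x + 10 for x in data])   # adding a constant leaves SD unchanged
scaled_sd = pstdev([x * -2 for x in data])    # multiplying by m scales SD by |m|
variance = pvariance(data)                    # equals SD squared
constant_sd = pstdev([7, 7, 7])               # identical entries give SD = 0

print(sd_n, sd_n_minus_1, shifted_sd, scaled_sd, variance, constant_sd)
```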
Shape of Distribution • Skewness • 0 means there is no skewness (symmetrical distribution, e.g. normal) • Positive value means positive/right skewed • Negative value means negative/left skewed • Calculation? • Just use Excel or SPSS • fx : SKEW
Shape of Distribution • Kurtosis • 0 means the same peakedness as a normal distribution (fx : KURT reports excess kurtosis) • Positive value means more peaked than normal, with heavier tails • Negative value means flatter than normal, with lighter tails • Calculation? • Just use Excel or SPSS • fx : KURT
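For readers who want to see what sits behind SKEW and KURT, here is a minimal sketch of the moment-based definitions. This is the simple population form, chosen here for clarity; Excel and SPSS apply sample-size adjustments, so their results differ somewhat for small samples. The function names and data are illustrative assumptions.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over all entries."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Third central moment divided by SD cubed (population form)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth central moment divided by variance squared, minus 3 (0 for a normal curve)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
left_skewed = [-x for x in right_skewed]

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
print(excess_kurtosis(symmetric))
```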
Describing Individuals in Distributions • Percentile • Quartile • Decile • Performed on data sorted in ascending order • Dividing data in 100, 4, 10 parts and identify the value at the desired position
Percentile Rank • “The percentile rank of any particular score x is the percentage of observations equal to or less than x” • Divide sorted data set into 100 parts • “cent” = 100 thus “per”“cent” = /100 • Percentile rank of entry xi = 100*(cumulative frequencyi / N) • e.g. 18, 29, 31, 32, 33 • Percentile rank of 31 = 100*(3/5) = 60 • Be careful! • Percentile rank determines rank from data value • Excel uses 0.00 – 1.00 for fx: PERCENTRANK
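The percentile rank formula above, sketched directly from its definition (percentage of observations equal to or less than x). The function name is an illustrative choice. Excel's PERCENTRANK follows its own convention, so this sketch is not meant to reproduce Excel's output.

```python
def percentile_rank(data, x):
    """Percentage of observations in data that are equal to or less than x."""
    cumulative_frequency = sum(1 for value in data if value <= x)
    return 100 * cumulative_frequency / len(data)


scores = [18, 29, 31, 32, 33]
print(percentile_rank(scores, 31))   # 3 of 5 entries are <= 31
```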
Percentile • “The kth percentile is the x-value at or below which fall K percent of observations” • Roughly • Position of data entry at kth percentile = k(n+1)/100 • e.g. 18, 29, 31, 32, 33 • 80th percentile: position = (80/100)(5+1) = 4.8, roughly the 5th position • Be careful! • Percentile determines data value from percentile rank • Excel uses 0.00 – 1.00 for fx: PERCENTILE
Quartile • The kth quartile is the x-value at or below which fall k quarters of observations • Roughly • Position of data entry at kth quartile = k(n+1)/4 • e.g. 18, 29, 31, 32, 33 • 3rd quartile: position = (3/4)(5+1) = 4.5, between the 4th and 5th positions • fx: QUARTILE
Decile • The kth decile is the x-value at or below which fall k tenths of observations • Roughly • Position of data entry at kth decile = k(n+1)/10 • e.g. 18, 29, 31, 32, 33 • 5th decile: position = (5/10)(5+1) = 3, the 3rd position • Excel does not have a direct decile function • Use fx: PERCENTILE with 0.1, 0.2, 0.3, … instead
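Percentile, quartile and decile share the same position rule, k(n+1)/parts, with parts = 100, 4 or 10. The sketch below computes that position and then reads off a value. Linear interpolation between neighbouring entries is an assumption added here for fractional positions; the slides only round "roughly". Function names are illustrative, and Excel's PERCENTILE and QUARTILE use a different convention, so results may not match.

```python
def quantile_position(k, parts, n):
    """1-based position of the kth quantile when the data are split into `parts` parts."""
    return k * (n + 1) / parts


def value_at_position(sorted_data, position):
    """Value at a (possibly fractional) 1-based position, interpolating linearly."""
    n = len(sorted_data)
    position = min(max(position, 1), n)   # keep the position inside the data
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower >= n:
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)


scores = sorted([18, 29, 31, 32, 33])
n = len(scores)

p80 = value_at_position(scores, quantile_position(80, 100, n))   # position 4.8
q3 = value_at_position(scores, quantile_position(3, 4, n))       # position 4.5
d5 = value_at_position(scores, quantile_position(5, 10, n))      # position 3
print(p80, q3, d5)
```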
Percentile for Grouped Data • P = L + I((n*r)/100 − Σf) / fr • r: The percentile • P: Data value at given percentile r • L: True lower bound of the interval in which percentile r falls • I: Interval width • n: Number of data entries • Σf: Cumulative frequency of intervals below L • fr: Frequency of the L interval • Determine the interval in which the percentile falls using (n*r)/100
Example • Percentile 60th • n = 72, thus P60 is at around 60/100*72 = 43rd entry which falls in interval 61 – 70 • Thus • P60 = 60.5 + (10{(60/100*72) - 36}) / 17 = 64.74
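The worked example above can be reproduced with a small function. The parameter names are illustrative; the numbers (L = 60.5, I = 10, n = 72, Σf = 36, fr = 17) are the ones used in the example.

```python
def grouped_percentile(r, n, lower_bound, width, cum_freq_below, freq_in_interval):
    """P = L + I * ((n * r) / 100 - cumulative frequency below L) / frequency of the interval."""
    target_position = n * r / 100
    return lower_bound + width * (target_position - cum_freq_below) / freq_in_interval


p60 = grouped_percentile(r=60, n=72, lower_bound=60.5, width=10,
                         cum_freq_below=36, freq_in_interval=17)
print(round(p60, 2))
```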
Joint Distribution of Data • Scatter Diagram • [Figure: three scatter diagrams labelled Negatively related, Not related, Positively related; an imaginary line shows the relation in the two related cases]
Correlation Coefficient • Pearson Product-Moment Correlation Coefficient • Denoted as rxy or r • Measures the linear correlation between two data sets • Can take values from -1 to 1 • Value of 1: two data sets have a perfect positive linear relation • Value of -1: two data sets have a perfect negative linear relation • Value of 0: two data sets have no linear relation
Correlation Coefficient • Formula • fx: PEARSON (do not use in MS Excel earlier than 2003) • fx: CORREL
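The formula itself did not survive in the slide text. As a sketch, the standard computational form of Pearson's r is r = (NΣxy − ΣxΣy) / sqrt((NΣx² − (Σx)²)(NΣy² − (Σy)²)), which can be coded as below. The function name and sample data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation from raw sums (computational form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))    # perfectly positively related
print(pearson_r(xs, [10, 8, 6, 4, 2]))    # perfectly negatively related
```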
Correlation for Ordinal Scale • Spearman Rank Correlation Coefficient • Two variables • Kendall’s Tau Rank Correlation Coefficient • Also for two variables; for three or more variables, Kendall’s coefficient of concordance (W) is used
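As a rough illustration of rank correlation, the sketch below computes Spearman's coefficient with the shortcut formula 1 − 6Σd² / (n(n² − 1)), where d is the difference between the two ranks of each entry. The shortcut is only valid when there are no tied values, which is assumed here; the function names and data are illustrative.

```python
def ranks(values):
    """Rank of each value (1 = smallest). Assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


xs = [1, 2, 3, 4, 5]
print(spearman_rho(xs, [1, 4, 9, 16, 25]))   # same ordering, even though not linear
print(spearman_rho(xs, [25, 16, 9, 4, 1]))   # exactly reversed ordering
print(spearman_rho(xs, [2, 1, 4, 3, 5]))     # mostly the same ordering
```

Because only the ordering matters, the first pair scores a perfect 1 even though the relation between the raw values is not a straight line.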
Linear Regression • Describes the relation between two interval-scale variables in the form of a regression equation • y = bx + a (Straight line) • y = a + bx + cx² (Parabola equation) • y = ab^x (Exponential equation) • x: independent variable • y: dependent variable • a: Y-intercept (where the line crosses the Y axis) • b: Slope
Simple Linear Regression • Find b then a • b = (NΣxy − ΣxΣy) / (NΣx² − (Σx)²) • a = ȳ − b·x̄ (mean of y minus b times mean of x) • Then write the equation • y = bx + a • E.g. b = 31.4, a = 4.52 • y = 31.4x + 4.52
Example • Table shows the period of time each student spends reading for exam and his/her score • b = {10(45885) – (1035)(413)} / {10(123375) – (1035)²} • = 31395 / 162525 = 0.1932 • a = 41.3 – (0.1932)(103.5) = 21.3038 • y = 0.1932x + 21.3038 • Meaning • Each additional minute of reading is associated with a score increase of 0.1932 mark • If you don’t read at all, the predicted score is 21.3038 marks
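The arithmetic in this example can be checked from the sums alone (N = 10, Σx = 1035, Σy = 413, Σxy = 45885, Σx² = 123375), since the underlying table is not reproduced in the text. The function name is illustrative. Note that the slide rounds b to 0.1932 before computing a, so the unrounded intercept differs in the third decimal place.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares slope b and intercept a for y = bx + a, from raw sums."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)   # mean of y minus b times mean of x
    return a, b


a, b = fit_line_from_sums(n=10, sum_x=1035, sum_y=413, sum_xy=45885, sum_x2=123375)
print(f"y = {b:.4f}x + {a:.4f}")

# Predicted score for a hypothetical student who reads for 100 minutes.
predicted = b * 100 + a
print(round(predicted, 2))
```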
Multiple Linear Regression • More than one independent variable • Equation Y = a + b1x1 + b2x2 + b3x3 + … • Requirement • Normal distribution • No multicollinearity (independent variables do not depend on each other) • Selecting independent variables • All Entry – when you are not sure which variables have an effect • Stepwise – only use variables tested to be significant
[Figure: annotated regression output] • How much of the dependent variable can be explained by the independent variable • Simple correlation • Is the model good (significant)? (yes, Sig. < 0.05) • Coefficient labels marked on the output: a, b, b1, b2