Data Analysis and Presentation. By Temtim Assefa solomonmolla9@gmail.com
Data Type • Quantitative data are classified as categorical and numerical data • Categorical data are data whose values cannot be measured numerically but can be classified into sets (categories), such as sex (male and female), religion, or department • Numerical data, sometimes termed ‘quantifiable’, are data whose values are measured or counted numerically as quantities • The two types are analyzed by different techniques
Quantitative Data Analysis • Two common types of analysis • Descriptive statistics • used to describe, summarize, or explain a given set of data • Inferential statistics • use statistics computed from a sample to infer about the population • They are concerned with making inferences from samples about the populations from which they were drawn
Common data analysis techniques • Frequency distribution • Measures of central tendency • Measures of dispersion • Correlation • Regression • And more
Frequency distribution • It is simply a table in which the data are grouped into classes and the number of cases that fall in each class is recorded • It shows the frequency of occurrence of different values of a single phenomenon • Main purposes • To facilitate the analysis of data • To estimate frequencies of the unknown population distribution from the distribution of sample data • To facilitate the computation of various statistical measures
Example – Frequency Distribution • In a survey of 30 organizations, the number of computers registered in each organization is given in the following table • This data has no meaning unless it is summarized in some form
Example • The following table shows the frequency distribution of the number of computers
Example … • The above table can give us meaningful information, such as • How many computers do most organizations have? • How many organizations do not have computers? • How many organizations have more than five computers? • Why is the computer distribution not the same in all organizations? • And other questions
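As a minimal sketch of how a frequency table answers questions like those above, the Python snippet below counts how often each value occurs. The 30 values are invented for illustration (the survey table itself is not reproduced in this text), and all variable names are assumptions.

```python
from collections import Counter

# Hypothetical survey data: number of computers in each of 30 organizations.
# These values are invented; they are NOT the figures from the original table.
computers = [0, 2, 3, 1, 5, 2, 0, 4, 2, 6,
             3, 2, 1, 0, 7, 2, 3, 5, 2, 1,
             4, 2, 8, 3, 0, 2, 1, 6, 3, 2]

# Frequency distribution: value -> number of organizations with that value
freq = Counter(computers)

print("Computers  Organizations")
for value in sorted(freq):
    print(f"{value:>9}  {freq[value]:>13}")

# Questions the table can answer
most_common_value, most_common_count = freq.most_common(1)[0]
no_computers = freq[0]
more_than_five = sum(count for value, count in freq.items() if value > 5)

print("Most organizations have", most_common_value, "computers")
print("Organizations without computers:", no_computers)
print("Organizations with more than five computers:", more_than_five)
```

The last question on the slide (why the distribution differs between organizations) cannot be answered by counting alone; it needs further investigation.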
Continuous frequency distribution • A continuous frequency distribution is constructed when the variable does not take discrete values such as the number of computers • Examples are variables such as age and salary, which have continuous values
Constructing frequency table • The number of classes should preferably be between 5 and 20, although this is not a rigid rule • As far as possible, avoid class intervals such as 3, 7, 11, 26, etc.; prefer class intervals of five or multiples of five, such as 10, 20, 25, 100 • The starting point, i.e. the lower limit of the first class, should be zero, 5, or a multiple of 5 • To ensure continuity and to get correct class intervals, adopt the “exclusive” method, in which the upper limit of a class is excluded from that class and counted in the next one • Wherever possible, use class intervals of equal size
Constructing … • You can create a frequency table with two variables • This is called a bivariate frequency table
Graphs • You can plot your frequency distribution using a bar graph, pie chart, frequency polygon, and other types of charts • [Figure: Computer Import in Ethiopia in 2010]
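The rules above can be sketched in Python as follows. The ages are hypothetical, and the choice of a starting point of 20 with a class width of 5 (giving 8 equal classes) is an assumption made only to follow the guidelines on this slide.

```python
# Hypothetical ages of 20 employees (invented for illustration).
ages = [22, 25, 31, 38, 40, 27, 45, 52, 33, 29,
        36, 41, 48, 24, 30, 35, 39, 55, 28, 44]

lower_limit = 20   # starting point: a multiple of 5
width = 5          # equal class intervals of five
n_classes = 8      # classes 20-25, 25-30, ..., 55-60

table = []
for k in range(n_classes):
    lo = lower_limit + k * width
    hi = lo + width
    # Exclusive method: the lower limit belongs to the class,
    # the upper limit is counted in the next class.
    count = sum(1 for a in ages if lo <= a < hi)
    table.append((lo, hi, count))

for lo, hi, count in table:
    print(f"{lo}-{hi}: {count}")
```

With the exclusive method an age of exactly 25 is counted in the class 25-30, not in 20-25, so no value is counted twice.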
Measures of central tendency • Mode is the value that occurs most frequently • For categorical data, it is the only measure of central tendency that can be interpreted sensibly • Median identifies the mid-point of the data when the values are arranged in order
Central Tendency … • Mean is a measure of central tendency • It includes all data values in its calculation • Mean = sum of observations / total number of observations • The mean for grouped data is obtained from the following formula: $\bar{x} = \frac{\sum f x}{N}$ • where x = the mid-point of an individual class • f = the frequency of an individual class • N = the sum of the frequencies, or total frequency
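A small worked sketch of the grouped-data mean, using the definitions of x, f, and N given above. The class limits and frequencies are hypothetical and chosen only for illustration.

```python
# Hypothetical grouped data: (lower limit, upper limit, frequency f).
classes = [(20, 30, 6), (30, 40, 7), (40, 50, 5), (50, 60, 2)]

# N = sum of the frequencies
N = sum(f for _, _, f in classes)

# x = class mid-point; accumulate f * x over all classes
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

grouped_mean = sum_fx / N
print(f"N = {N}, sum of f*x = {sum_fx}, grouped mean = {grouped_mean}")
```

Here the mid-points are 25, 35, 45, and 55, so the sum of f·x is 150 + 245 + 225 + 110 = 730 and the mean is 730 / 20 = 36.5.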
Advantages of Mean • It is rigidly defined • It is easy to understand and compute • It is based on all items in the data • Its definition is in the form of a mathematical formula • It is capable of further algebraic treatment • It has sampling stability • It can be used in further statistical computations or processing • However, it is affected by extreme data values in skewed distributions • For a skewed distribution, use the median rather than the mean
Exercise • Do the following exercise using the IT staff data for 13 organizations, named O1 to O13 • 25, 18, 20, 10, 8, 30, 42, 20, 53, 25, 10, 20, 42 • What is the mode? • What is the median? • What is the mean? • Convert the data into a frequency table • Plot it on a bar graph and a pie chart • What do you interpret from the data?
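One possible way to check answers to the exercise is with Python's standard `statistics` module. This is a sketch for self-checking, not part of the original slides; the variable names are assumptions, and plotting is left out.

```python
import statistics
from collections import Counter

# IT staff data for organizations O1 to O13 (from the exercise)
staff = [25, 18, 20, 10, 8, 30, 42, 20, 53, 25, 10, 20, 42]

mode = statistics.mode(staff)       # most frequent value
median = statistics.median(staff)   # middle value of the sorted data
mean = statistics.mean(staff)       # sum of values / number of values

# Frequency table as (value, frequency) pairs in ascending order
freq_table = sorted(Counter(staff).items())

print("Mode:", mode)
print("Median:", median)
print("Mean:", round(mean, 2))
for value, frequency in freq_table:
    print(f"{value:>3}: {frequency}")
```

The mean comes out above the median because a few organizations with many IT staff pull it upward, which is the situation the previous slide warns about.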
Measures of Dispersion • Measures of central tendency serve to locate the center of the distribution • They do not measure how the items are spread out on either side of the center • This characteristic of a frequency distribution is commonly referred to as dispersion • Small dispersion indicates high uniformity of the items • Large dispersion indicates less uniformity • Less variation, i.e. more uniformity, is usually a desirable characteristic
Types of measures of dispersion • There are two types • Absolute measures of dispersion • Relative measures of dispersion • An absolute measure of dispersion indicates the amount of variation in a set of values in terms of the units of the observations. For example, if computers are measured by count, it shows dispersion as a number of computers • Relative measures of dispersion are free from the units of measurement of the observations. You may measure dispersion as a percentage • Range is an absolute measure, while coefficient of variation is a relative measure
Dispersion … • There are different types of dispersion measures • We look at standard deviation and coefficient of variation • Karl Pearson introduced the concept of standard deviation in 1893 • Standard deviation is the most frequently used measure • It is the square root of the mean of the squared deviations from the mean • The square of the standard deviation is called the variance
Standard Deviation • It is given by the formula $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$ • Calculate the standard deviation from the following data • 14, 22, 9, 15, 20, 17, 12, 11 • The answer is 4.18 (mean = 15, sum of squared deviations = 140, N = 8)
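The calculation can be checked step by step in Python. Dividing the sum of squared deviations by N reproduces the answer of 4.18 given on the slide; the sample version that divides by N − 1 is shown only for comparison and is an addition, not something the slide states.

```python
import statistics

data = [14, 22, 9, 15, 20, 17, 12, 11]

mean = statistics.mean(data)
squared_dev = sum((x - mean) ** 2 for x in data)

# Population form: divide by N (matches the slide's answer)
sd_population = (squared_dev / len(data)) ** 0.5

# Sample form: divide by N - 1 (shown for comparison only)
sd_sample = statistics.stdev(data)

print("Mean:", mean)
print("Sum of squared deviations:", squared_dev)
print("SD (divide by N):", round(sd_population, 2))
print("SD (divide by N - 1):", round(sd_sample, 2))
```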
Interpretation • We expect about two-thirds of the scores in a sample to lie within one standard deviation of the mean • Generally, most of the scores in a normal distribution cluster fairly close to the mean • There are fewer and fewer scores as you move away from the mean in either direction • In a normal distribution, 68.27% of the scores fall within one standard deviation of the mean • 95.45% fall within two standard deviations • 99.73% fall within three standard deviations
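These percentages follow from the normal distribution itself. As a hedged sketch, the share of a normal distribution lying within k standard deviations of the mean equals erf(k / √2), which Python's `math.erf` can evaluate; the function name `share_within` is an assumption.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k) * 100:.2f}%")
```

Published tables sometimes differ from these values in the second decimal place because of rounding.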
Advantage of SD • Assume the mean is 10.0 and the standard deviation is 3.36 • One standard deviation above the mean is 13.36, and one standard deviation below the mean is 6.64 • The standard deviation takes account of all of the scores and provides a sensitive measure of dispersion • It also has the advantage that it describes the spread of scores in a normal distribution with great precision • The most obvious disadvantage of the standard deviation is that it is much harder to work out than other measures of dispersion such as range and percentiles
Coefficient of Variation • The standard deviation is an absolute measure of dispersion • However, it may not always be applicable • The standard deviation of the number of computers cannot be compared with the standard deviation of computer use of students, as the two are expressed in different units • For the purpose of comparison, the standard deviation must be converted into a relative measure of dispersion: the coefficient of variation • It is obtained by dividing the standard deviation by the mean and multiplying by 100: coefficient of variation = $\frac{\sigma}{\bar{x}} \times 100$
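A minimal sketch of this comparison in Python. The first data set reuses the values from the standard deviation slide, treated here as numbers of computers; the hours-of-use values are invented for illustration. Using the population standard deviation (`statistics.pstdev`) is an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean, times 100 (a percentage)."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

# Two variables measured in different units
computers = [14, 22, 9, 15, 20, 17, 12, 11]               # number of computers
hours_of_use = [3.5, 4.0, 2.5, 5.0, 4.5, 3.0, 4.0, 3.5]   # hypothetical hours per day

cv_computers = coefficient_of_variation(computers)
cv_hours = coefficient_of_variation(hours_of_use)

print(f"CV of computers: {cv_computers:.1f}%")
print(f"CV of hours of use: {cv_hours:.1f}%")
```

Because both results are percentages, they can be compared directly even though the underlying units differ.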
Skewness • Skewness means ‘lack of symmetry’ • We study skewness to get an idea of the shape of the curve that can be drawn from the given data • If in a distribution mean = median = mode, the distribution is known as a symmetrical distribution • The spread of the frequencies is the same on both sides of the center point of the curve
Symmetrical distribution • Mean = Median = Mode
[Figures: positively skewed distribution; negatively skewed distribution]
Measures of Skewness • Karl Pearson’s coefficient of skewness • Bowley’s coefficient of skewness • Measure of skewness based on moments • We look at Karl Pearson’s measure; read about the others in the textbook • Karl Pearson’s absolute measure of skewness = mean – mode • It is not suitable for comparing data in different units of measurement • Use the relative measure instead, Karl Pearson’s coefficient of skewness, i.e. (mean – mode) / standard deviation • If the mode is ill-defined, use 3(mean – median) / standard deviation
Kurtosis • All frequency curves show different degrees of flatness or peakedness, called kurtosis • A measure of kurtosis tells us the extent to which a distribution is more peaked or more flat-topped than the normal curve • The normal curve, which is symmetrical and bell-shaped, is designated mesokurtic • If a curve is relatively narrower and more peaked at the top, it is designated leptokurtic • If the frequency curve is flatter than the normal curve, it is designated platykurtic
Interpretation • Many real-world quantities approximately follow a normal distribution pattern, i.e. a bell shape
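As an illustration, the sketch below applies Karl Pearson's measures to the IT staff data from the earlier exercise. Using the population standard deviation (`statistics.pstdev`) is an assumption; the slides do not say which form to use.

```python
import statistics

# IT staff data from the earlier exercise
staff = [25, 18, 20, 10, 8, 30, 42, 20, 53, 25, 10, 20, 42]

mean = statistics.mean(staff)
median = statistics.median(staff)
mode = statistics.mode(staff)
sd = statistics.pstdev(staff)

# Absolute measure: expressed in the units of the data
absolute_skewness = mean - mode

# Relative measures: unit-free coefficients
sk_mode = (mean - mode) / sd
sk_median = 3 * (mean - median) / sd

print("Absolute skewness:", round(absolute_skewness, 2))
print("Coefficient (mode-based):", round(sk_mode, 2))
print("Coefficient (median-based):", round(sk_median, 2))
```

Both coefficients are positive here, indicating a positively skewed distribution with a longer tail toward large values.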
Normal dist… • This implies that • about 68% of the population is within 1 standard deviation of the mean • about 95% of the population is within 2 standard deviations • about 99.7% of the population is within 3 standard deviations • So you need to select a confidence level to decide whether your result is statistically significant or not • For example, at the 95% confidence level, only about 5% of values are expected to fall more than 2 standard deviations from the mean, so a difference between two groups that lies within 2 standard deviations is not treated as statistically significant
Correlation • Correlation is used to measure the linear association between two variables • For example, assume X is IT skill and Y is IT use. Is there an association between these two variables?
Correlation … • Correlation expresses the inter-dependence of two sets of variables upon each other • One variable may be called the independent variable (IV) and the other the dependent variable (DV) • A change in the IV is associated with a change in the value of the DV • For example, IT use may increase organizational productivity because staff have better access to information and improve their skills and knowledge
[Figure: correlation lines, showing perfect correlation and no correlation]
Types of Correlation • Simple correlation • Multiple correlation • Partial correlation • In simple correlation, we study only two variables • For example, number of computers and organization efficiency • In multiple correlation, we study more than two variables simultaneously • For example, usefulness, ease of use, and IT adoption • In partial correlation, we study two variables while excluding the effect of some other variable
Karl Pearson’s coefficient of correlation • Karl Pearson, a great biometrician and statistician, suggested a mathematical method for measuring the magnitude of the linear relationship between two variables • Karl Pearson’s coefficient of correlation is the most widely used measure of correlation: $r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}$ • where $X = x - \bar{x}$, $Y = y - \bar{y}$
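A short Python sketch of the coefficient, written in terms of the deviations X = x − x̄ and Y = y − ȳ. The IT skill and IT use scores are hypothetical, and the function name `pearson_r` is an assumption.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation for paired observations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations from the means: X = x - mean_x, Y = y - mean_y
    X = [xi - mean_x for xi in x]
    Y = [yi - mean_y for yi in y]
    sum_xy = sum(a * b for a, b in zip(X, Y))
    sum_x2 = sum(a * a for a in X)
    sum_y2 = sum(b * b for b in Y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Hypothetical scores for six organizations
it_skill = [2, 4, 5, 7, 8, 10]
it_use = [1, 3, 4, 6, 9, 9]

r = pearson_r(it_skill, it_use)
print("r =", round(r, 3))
```

A value of r close to +1 indicates a strong positive linear association, a value close to −1 a strong negative one, and a value near 0 little or no linear association.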
Exercise Calculate the correlation for the following given data
Spearman Rank Correlation • Developed by Charles Edward Spearman in 1904 • It is used when no assumption about the parameters of the population is made • This method is based on ranks • It is useful for studying qualitative attributes such as honesty, colour, beauty, intelligence, character, morality, etc. • The individuals in the group can be arranged in order, thereby obtaining for each individual a number showing his/her rank in the group
Formula • $r = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$ • where $\sum D^2$ = sum of squares of differences between the pairs of ranks • n = number of pairs of observations • The value of r lies between –1 and +1. If r = +1, there is complete agreement in the order of ranks and the ranks are in the same direction. If r = –1, there is complete disagreement in the order of ranks and they are in opposite directions.
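The rank correlation can be sketched in Python as below, using the sum of squared rank differences and the number of pairs n as defined above. The two judges' scores are hypothetical, the helper names are assumptions, and the ranking helper assumes there are no tied values.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest; assumes no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """Spearman rank correlation: 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical scores given by two judges to five candidates
judge_a = [80, 65, 90, 70, 60]
judge_b = [75, 60, 85, 80, 55]

r = spearman_r(judge_a, judge_b)
print("Ranks by judge A:", ranks(judge_a))
print("Ranks by judge B:", ranks(judge_b))
print("Rank correlation:", r)
```

Here the two rankings differ only for two candidates (rank differences of −1 and +1), so the sum of squared differences is 2 and r = 1 − 12/120 = 0.9.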
Advantage of Correlation • It is a simple and attractive method of finding the nature of correlation between two variables • It is a non-mathematical method of studying correlation and is easy to understand • It is not affected by extreme items • It is the first step in finding out the relation between the two variables • We can get a rough idea at a glance whether the correlation is positive or negative • But we cannot get the exact degree of correlation between the two variables
The Pearson Chi-square • It is the most common test of association, calculated to assess the significance of the relationship between categorical variables • It is used to test the null hypothesis that the variables are independent of each other • It is computed from the differences between the observed frequencies shown in the cells of a cross-tabulation and the expected frequencies that would be obtained if the variables were truly independent
Chi-square … • $\chi^2 = \sum \frac{(O - E)^2}{E}$ • where O is the observed value • E is the expected value • $\chi^2$ is the measure of association • The $\chi^2$ value and its significance level depend on the total number of observations and the number of cells in the table
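A minimal sketch of the chi-square computation for a cross-tabulation, using O for observed and E for expected values. The 2×2 table is hypothetical; the expected frequency of a cell is taken as (row total × column total) / grand total, which is the standard value under independence.

```python
# Hypothetical cross-tabulation (rows: department A/B, columns: uses IT yes/no)
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, O in enumerate(row):
        # Expected frequency if the two variables were independent
        E = row_totals[i] * col_totals[j] / total
        chi_square += (O - E) ** 2 / E

# Degrees of freedom: (rows - 1) * (columns - 1)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print("Chi-square:", chi_square)
print("Degrees of freedom:", degrees_of_freedom)
```

Every expected frequency in this table is 25, so each cell contributes (5² / 25) = 1 and the statistic is 4.0 with 1 degree of freedom. The significance level is then read from a chi-square table for that number of degrees of freedom.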
Regression • Regression is used to estimate (predict) the value of one variable given the value of another • The variable predicted on the basis of other variables is called the ‘dependent’ or ‘explained’ variable, and the other the ‘independent’ or ‘predicting’ variable • The prediction is based on the average relationship derived statistically by regression analysis • For example, if we know that advertising and sales are correlated, we may find the expected amount of sales for a given advertising expenditure, or the expenditure required to attain a given amount of sales
Regression • Regression is the measure of the average relationship between two or more variables in terms of the original units of the data • Types of regression • Simple and Multiple • Linear and Non-linear • Total and Partial
Simple and Multiple: • In the case of a simple relationship, only two variables are considered, for example, the influence of advertising expenditure on sales turnover • In the case of a multiple relationship, more than two variables are involved. One variable is the dependent variable, while the remaining variables are independent ones • For example, the turnover (y) may depend on advertising expenditure (x) and the income of the people (z) • The functional relationship can then be expressed as y = f(x, z)
Linear and Non-linear • Linear relationships are based on a straight-line trend, the equation of which has no power higher than one. Remember that a linear relationship can be either simple or multiple • Normally a linear relationship is used because, besides its simplicity, it has a better predictive value; a linear trend can be easily projected into the future • In the case of a non-linear relationship, curved trend lines are derived. Their equations contain powers higher than one, for example a parabola
Total and Partial • In the case of total relationships, all the important variables are considered • Normally, they take the form of multiple relationships because most economic and business phenomena are affected by a multiplicity of causes • In the case of a partial relationship, one or more variables are considered, but not all, thus excluding the influence of those not found relevant for a given purpose
Regression analysis • The goal of regression analysis is to develop a regression equation from which we can predict one score on the basis of one or more other scores. • For example, it can be used to predict a job applicant's potential job performance on the basis of test scores and other factors
Linear regression equation • The linear regression equation of Y on X is Y = a + bX ……. (1) • and of X on Y is X = a′ + b′Y ……. (2), where a, b, a′, b′ are constants (in general the constants in (2) differ from those in (1)) • In regression equation (1), Y is the dependent variable, also called the criterion or outcome variable, which we would like to predict • X is the variable we are using to predict Y; X is called the predictor variable
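As a hedged sketch of how the constants a and b in Y = a + bX might be estimated, the Python code below uses the usual least-squares formulas, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The slides do not state an estimation method, so least squares is an assumption here, and the advertising and sales figures are invented so that the points lie exactly on a line.

```python
def fit_line(x, y):
    """Least-squares estimates of a and b in Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: advertising expenditure (X) and sales (Y)
advertising = [1, 2, 3, 4, 5]
sales = [5, 8, 11, 14, 17]

a, b = fit_line(advertising, sales)

# Predict sales for an advertising expenditure of 6
predicted_sales = a + b * 6

print(f"Y = {a} + {b}X")
print("Predicted sales for X = 6:", predicted_sales)
```

With real data the points will not fall exactly on a line, and the fitted equation then gives the average relationship rather than an exact one.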