MBA Statistics 51-651-00
COURSE #3
Is there a link?
Quantitative data analysis
Qualitative data analysis
Example: The human resources department of a large multinational enterprise carried out a study on the satisfaction level of the employees with respect to their jobs. A total of 527 employees took part in this study.
Here are the results obtained, presented in table format:

JOBS(jobs)           SATIS(satisfaction)
Frequency          |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |        17 |      62 |     79
-------------------|-----------|---------|
white collar worker|        50 |     112 |    162
-------------------|-----------|---------|
blue collar worker |        99 |     187 |    286
-------------------|-----------|---------|
Total                      166       361      527
Question: Is there a link between the type of employment and the level of satisfaction in this company?
• The "type of job" variable is a qualitative variable with three levels, i.e. three categories.
• In this example, the "satisfaction" variable is also qualitative, with two levels.
It is easier to answer the question, in a descriptive way, with percentages:

Type of job          SATIS(satisfaction)
Frequency          |
%                  |
% line             |
% column           |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |        17 |      62 |     79
                   |      3.23 |   11.76 |  14.99
                   |     21.52 |   78.48 |
                   |     10.24 |   17.17 |
-------------------|-----------|---------|
white collar worker|        50 |     112 |    162
                   |      9.49 |   21.25 |  30.74
                   |     30.86 |   69.14 |
                   |     30.12 |   31.02 |
-------------------|-----------|---------|
blue collar worker |        99 |     187 |    286
                   |     18.79 |   35.48 |  54.27
                   |     34.62 |   65.38 |
                   |     59.64 |   51.80 |
-------------------|-----------|---------|
Total                      166       361      527
                         31.50     68.50   100.00
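The three kinds of percentages in the table above (overall %, % line, % column) can be recomputed from the raw counts. The following is a minimal Python sketch; the function name `percentage_tables` and the dictionary layout are illustrative assumptions, not part of the course material.

```python
# Observed counts from the job-satisfaction study
# (each list is [unsatisfied, satisfied]).
observed = {
    "professional": [17, 62],
    "white collar worker": [50, 112],
    "blue collar worker": [99, 187],
}


def percentage_tables(table):
    """Return overall, row ("% line") and column percentages of a two-way table."""
    rows = list(table.values())
    n = sum(sum(r) for r in rows)                   # grand total
    col_totals = [sum(col) for col in zip(*rows)]   # one total per column
    overall = {k: [100 * f / n for f in r] for k, r in table.items()}
    by_row = {k: [100 * f / sum(r) for f in r] for k, r in table.items()}
    by_col = {
        k: [100 * f / c for f, c in zip(r, col_totals)]
        for k, r in table.items()
    }
    return overall, by_row, by_col


overall, by_row, by_col = percentage_tables(observed)
for job in observed:
    # Row percentages answer: "within this job type, how many are satisfied?"
    print(job, [round(p, 2) for p in by_row[job]])
```

Row percentages are the natural ones to compare here, since the question is whether satisfaction differs from one job type to another.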
Frequency tables make it possible:
• to summarize and present the information;
• to describe the presence or absence of a link between two qualitative variables (nominal and/or ordinal);
• to check, using a hypothesis test, whether there is a statistically significant link between two qualitative variables.
The two hypotheses we want to examine are:
H0: There is no link between the two qualitative variables, i.e. the two variables are independent.
H1: There is a link between the two qualitative variables, i.e. the two variables are dependent.
When two variables are independent, the percentage distribution of one variable is the same within every category of the other (apart from sampling variation in observed data).
To illustrate the concept of independence testing between two qualitative variables, let's take our previous example and suppose that we have the following numbers, chosen to make the calculations easier:

JOBS(jobs)           SATIS(satisfaction)
Frequency          |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |         0 |     100 |    100
-------------------|-----------|---------|
white collar worker|       100 |     200 |    300
-------------------|-----------|---------|
blue collar worker |       300 |     300 |    600
-------------------|-----------|---------|
Total                      400       600     1000
The distribution of percentages is:

JOBS(jobs)           SATIS(satisfaction)
Frequency          |
%                  |
% line             |
% column           |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |         0 |     100 |    100
                   |      0.00 |   10.00 |  10.00
                   |      0.00 |  100.00 |
                   |      0.00 |   16.67 |
-------------------|-----------|---------|
white collar worker|       100 |     200 |    300
                   |     10.00 |   20.00 |  30.00
                   |     33.33 |   66.67 |
                   |     25.00 |   33.33 |
-------------------|-----------|---------|
blue collar worker |       300 |     300 |    600
                   |     30.00 |   30.00 |  60.00
                   |     50.00 |   50.00 |
                   |     75.00 |   50.00 |
-------------------|-----------|---------|
Total                      400       600     1000
                         40.00     60.00   100.00
In the previous table, the two variables are dependent because:
• For each type of job, the distribution of employee satisfaction is different. Indeed, 100% of the professionals are satisfied, compared to 67% of the white collar workers and only 50% of the blue collar workers (% line);
• Or, for each category of satisfaction, the distribution of job types is different. Indeed, among the unsatisfied, 0% are professionals, 25% are white collar workers and 75% are blue collar workers, compared to 17%, 33% and 50% respectively in the satisfied group (% column).
If the two variables were completely independent, the cells of the table would contain the following frequencies (note: the row and column totals are unchanged):

JOBS(jobs)           SATIS(satisfaction)
Frequency          |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |        40 |      60 |    100
-------------------|-----------|---------|
white collar worker|       120 |     180 |    300
-------------------|-----------|---------|
blue collar worker |       240 |     360 |    600
-------------------|-----------|---------|
Total                      400       600     1000
The distribution of percentages is:

JOBS(jobs)           SATIS(satisfaction)
Frequency          |
%                  |
% line             |
% column           |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |        40 |      60 |    100
                   |      4.00 |    6.00 |  10.00
                   |     40.00 |   60.00 |
                   |     10.00 |   10.00 |
-------------------|-----------|---------|
white collar worker|       120 |     180 |    300
                   |     12.00 |   18.00 |  30.00
                   |     40.00 |   60.00 |
                   |     30.00 |   30.00 |
-------------------|-----------|---------|
blue collar worker |       240 |     360 |    600
                   |     24.00 |   36.00 |  60.00
                   |     40.00 |   60.00 |
                   |     60.00 |   60.00 |
-------------------|-----------|---------|
Total                      400       600     1000
                         40.00     60.00   100.00
In the previous table, the two variables are independent because:
• For each type of job, the distribution of employee satisfaction is the same, i.e. 60% of the employees are satisfied and 40% are unsatisfied (% line).
• Or, for each category of satisfaction, the distribution of job types is the same, i.e. 10% are professionals, 30% are white collar workers and 60% are blue collar workers (% column).
The cells (i, j) of the previous table contain "theoretical" frequencies, i.e. the frequencies we would have if the two variables were perfectly independent.
• If the hypothesis of independence is true, the theoretical frequency for each cell of the cross-tabulation is:

$$f^{\text{theo}}_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{n}$$

where n is the overall total.
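As an illustration, the theoretical frequencies of the simplified 1000-employee table can be recovered from its row and column totals alone. This is a small Python sketch; the function name `expected_frequencies` is a hypothetical choice.

```python
def expected_frequencies(observed):
    """Theoretical frequencies under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Simplified example: rows are professional, white collar, blue collar;
# columns are unsatisfied, satisfied.
hypothetical = [[0, 100], [100, 200], [300, 300]]
expected = expected_frequencies(hypothetical)
for row in expected:
    print(row)
```

The result reproduces the independence table (40/60, 120/180, 240/360): only the margins matter, not the individual observed cells.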
Testing the independence of two qualitative variables amounts to testing the difference between the observed frequencies and the theoretical frequencies.
• If the two variables are independent, the observed frequencies should be close to the theoretical frequencies.
• The test statistic is given by:

$$\chi^2_{\text{obs}} = \sum_{i,j} \frac{\left(f^{\text{obs}}_{ij} - f^{\text{theo}}_{ij}\right)^2}{f^{\text{theo}}_{ij}}$$
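To make the statistic concrete, here is a short Python sketch that applies it to the simplified 1000-employee example, using the observed table and the independence table shown earlier. The function name `chi_square_statistic` is an assumption made for illustration.

```python
def chi_square_statistic(f_obs, f_theo):
    """Sum over all cells of (observed - theoretical)^2 / theoretical."""
    return sum(
        (o - t) ** 2 / t
        for row_obs, row_theo in zip(f_obs, f_theo)
        for o, t in zip(row_obs, row_theo)
    )


# Observed counts of the simplified example (unsatisfied, satisfied).
f_obs = [[0, 100], [100, 200], [300, 300]]
# Theoretical counts under independence, same row and column totals.
f_theo = [[40, 60], [120, 180], [240, 360]]

chi2_obs = chi_square_statistic(f_obs, f_theo)
print(round(chi2_obs, 2))  # 40 + 26.67 + 3.33 + 2.22 + 15 + 10 = 97.22
```

The professionals row contributes most of the total (40 + 26.67), which matches the descriptive finding that this group departs most from the overall 40/60 split.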
We reject the hypothesis of independence if the value of the $\chi^2_{\text{obs}}$ statistic is large.
• The observed significance level (p-value) is computed using the chi-square probability distribution, with the number of degrees of freedom given by (# rows − 1) × (# columns − 1) of the table.
• Note: This test is only valid for large samples, i.e. when all the theoretical frequencies are ≥ 5 (or nearly so).
• It can be shown that $0 \le \chi^2_{\text{obs}} \le n(m-1)$, where m = minimum(# rows, # columns).
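The points above can be sketched in plain Python. All helper names (`degrees_of_freedom`, `chi2_sf`, `large_sample_ok`, `upper_bound`) are hypothetical, and the p-value routine relies on closed-form expressions of the chi-square upper tail that hold for integer degrees of freedom; a statistics package would normally be used instead.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    return (n_rows - 1) * (n_cols - 1)


def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def large_sample_ok(f_theo, threshold=5):
    """Rule of thumb: every theoretical frequency should be at least 5."""
    return min(min(row) for row in f_theo) >= threshold


def upper_bound(n, n_rows, n_cols):
    """Largest possible value of the statistic: n * (m - 1)."""
    return n * (min(n_rows, n_cols) - 1)


df = degrees_of_freedom(3, 2)   # 3 job types x 2 satisfaction levels
print(df, chi2_sf(5.991, df))   # 5.991 is the usual 5% critical value for df = 2
```

For a 3 x 2 table the test therefore has 2 degrees of freedom, and any observed statistic above roughly 5.99 gives a p-value below 5%.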
The value of the $\chi^2_{\text{obs}}$ statistic is 0 when the two variables are perfectly independent.
• It reaches its upper bound when one of the variables is functionally dependent on the other.
Example: independence

JOBS(jobs)           SATIS(satisfaction)
Frequency          |
% line             |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |        40 |      60 |    100
                   |     40.00 |   60.00 |
-------------------|-----------|---------|
white collar worker|       120 |     180 |    300
                   |     40.00 |   60.00 |
-------------------|-----------|---------|
blue collar worker |       240 |     360 |    600
                   |     40.00 |   60.00 |
-------------------|-----------|---------|
Total                      400       600     1000

Statistic      DF      Value      Prob
---------------------------------------
Chi-square      2      0.000     1.000

n = 1000
Example: dependence (functional link)

JOBS(jobs)           SATIS(satisfaction)
Frequency          |
% line             |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |         0 |     100 |    100
                   |      0.00 |  100.00 |
-------------------|-----------|---------|
white collar worker|         0 |     300 |    300
                   |      0.00 |  100.00 |
-------------------|-----------|---------|
blue collar worker |       600 |       0 |    600
                   |    100.00 |    0.00 |
-------------------|-----------|---------|
Total                      600       400     1000

Statistic      DF      Value      Prob
---------------------------------------
Chi-square      2   1000.000     0.000

n = 1000
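The two extreme cases can be checked numerically. The sketch below recomputes the statistic for the independence table and for the functional-link table, and compares the latter with the bound n(m − 1); the function name `chi_square` is an illustrative assumption.

```python
def chi_square(observed):
    """Return (chi-square statistic, n) for a two-way table of counts."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, f_obs in enumerate(row):
            f_theo = row_totals[i] * col_totals[j] / n
            stat += (f_obs - f_theo) ** 2 / f_theo
    return stat, n


independent = [[40, 60], [120, 180], [240, 360]]   # identical row percentages
functional = [[0, 100], [0, 300], [600, 0]]        # job type determines satisfaction

chi2_indep, n = chi_square(independent)
chi2_func, _ = chi_square(functional)

m = min(len(functional), len(functional[0]))       # minimum(# rows, # columns)
upper = n * (m - 1)
print(chi2_indep, chi2_func, upper)
```

With n = 1000 and m = 2, the bound is 1000, which is exactly the value reported for the functional-link table.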
Example:

JOBS(jobs)           SATIS(satisfaction)
Obs. frequency     |
Theo. frequency    |
%                  |
% line             |
% column           |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |        17 |      62 |     79
                   |    24.884 |  54.116 |
                   |      3.23 |   11.76 |  14.99
                   |     21.52 |   78.48 |
                   |     10.24 |   17.17 |
-------------------|-----------|---------|
white collar worker|        50 |     112 |    162
                   |    51.028 |  110.97 |
                   |      9.49 |   21.25 |  30.74
                   |     30.86 |   69.14 |
                   |     30.12 |   31.02 |
-------------------|-----------|---------|
blue collar worker |        99 |     187 |    286
                   |    90.087 |  195.91 |
                   |     18.79 |   35.48 |  54.27
                   |     34.62 |   65.38 |
                   |     59.64 |   51.80 |
-------------------|-----------|---------|
Total                      166       361      527
                         31.50     68.50   100.00
Results of the statistical test using CT.xls:
Thus, we do not reject the hypothesis of independence at the α = 5% level, because the p-value is greater than 5%.
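The CT.xls output itself is not reproduced in this text. As an independent check, the test can be recomputed from the 527 observed counts with a few lines of Python. This is a sketch under stated assumptions: the function name `independence_test_df2` is hypothetical, and the p-value formula exp(−χ²/2) is the exact chi-square upper tail only for 2 degrees of freedom, which is the case for this 3 x 2 table.

```python
import math


def independence_test_df2(observed):
    """Chi-square test of independence for a table with exactly 2 degrees of freedom."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    if df != 2:
        raise ValueError("closed-form p-value below is only valid for df = 2")
    chi2 = sum(
        (f - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(observed, row_totals)
        for f, c in zip(row, col_totals)
    )
    p_value = math.exp(-chi2 / 2.0)   # upper tail of chi-square with 2 df
    return chi2, df, p_value


# Observed counts: professional, white collar, blue collar
# (unsatisfied, satisfied).
survey = [[17, 62], [50, 112], [99, 187]]
chi2, df, p_value = independence_test_df2(survey)
print(round(chi2, 3), df, round(p_value, 4))
```

The statistic comes out near 4.96 with a p-value of roughly 0.08, consistent with the conclusion that independence is not rejected at the 5% level.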
What happens to the p-value if the size of the sample increases but the distributions stay the same?

JOBS(jobs)           SATIS(satisfaction)
Obs. frequency     |
Theo. frequency    |
%                  |
% line             |
% column           |unsatisfied|satisfied|  Total
-------------------|-----------|---------|
professional       |        34 |     124 |    158
                   |    49.769 |  108.23 |
                   |      3.23 |   11.76 |  14.99
                   |     21.52 |   78.48 |
                   |     10.24 |   17.17 |
-------------------|-----------|---------|
white collar worker|       100 |     224 |    324
                   |    102.06 |  221.94 |
                   |      9.49 |   21.25 |  30.74
                   |     30.86 |   69.14 |
                   |     30.12 |   31.02 |
-------------------|-----------|---------|
blue collar worker |       198 |     374 |    572
                   |    180.17 |  391.83 |
                   |     18.79 |   35.48 |  54.27
                   |     34.62 |   65.38 |
                   |     59.64 |   51.80 |
-------------------|-----------|---------|
Total                      332       722     1054
                         31.50     68.50   100.00
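Results of the statistical test:
Thus, we reject the hypothesis of independence at the α = 5% level, because the p-value is now less than 5%. The percentages are identical to those of the original sample, but doubling every count doubles the value of the chi-square statistic, so the same pattern becomes statistically significant.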
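The effect of sample size can be illustrated by multiplying every observed count by a factor k while leaving the percentages untouched. The sketch below is an assumption-laden illustration: the names `chi_square` and `scaled` are hypothetical, and the p-value uses exp(−χ²/2), which is exact only for the 2 degrees of freedom of this 3 x 2 table.

```python
import math


def chi_square(observed):
    """Chi-square statistic of a two-way table of counts."""
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (f - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(observed, row_totals)
        for f, c in zip(row, col_totals)
    )


def scaled(observed, k):
    """Same percentage distribution, k times as many respondents."""
    return [[k * f for f in row] for row in observed]


survey = [[17, 62], [50, 112], [99, 187]]      # original sample, n = 527

results = {}
for k in (1, 2, 4):
    chi2 = chi_square(scaled(survey, k))
    p_value = math.exp(-chi2 / 2.0)            # valid because df = 2
    results[k] = (chi2, p_value)
    print(k, round(chi2, 3), round(p_value, 5))
```

The statistic grows in proportion to k, so a difference that is not significant with 527 respondents becomes significant with 1054, even though the descriptive picture has not changed.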
2x2 tables: test of the difference between two proportions
• In two neighbouring municipalities, a survey was carried out to obtain the opinion of the taxpayers on the location of a garbage dump site. If the proportion of taxpayers in favour is significantly higher in one municipality than in the other, the site will probably be located in that municipality.
• In municipality 1, a sample of 130 individuals answered the survey and 84 were in favour (64.6%), while in municipality 2, 124 individuals answered and 62 were in favour (50%).
Equivalent formulations of the problem:
• H0: p1 = p2 vs H1: p1 ≠ p2 (two-tailed test)
• Is there a link between the municipality variable and the opinion on the location of a garbage dump site?
• H0: independence between municipality and opinion vs H1: dependence between municipality and opinion
Using CT.xls, one obtains:
One can reject H0. The two proportions are significantly different.
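The equivalence between the two formulations can be checked numerically: for a 2x2 table, the square of the pooled two-proportion z statistic equals the chi-square statistic. The Python sketch below is an independent recomputation, not the CT.xls output; it assumes no continuity correction is applied (CT.xls may or may not use one), and all variable names are illustrative.

```python
import math

n1, x1 = 130, 84    # municipality 1: respondents, number in favour
n2, x2 = 124, 62    # municipality 2: respondents, number in favour

# Formulation 1: two-tailed z-test of H0: p1 = p2 with a pooled proportion.
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
p_value_z = math.erfc(abs(z) / math.sqrt(2))

# Formulation 2: chi-square test of independence on the 2x2 table.
observed = [[x1, n1 - x1], [x2, n2 - x2]]      # columns: in favour, against
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)
chi2 = sum(
    (f - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for f, c in zip(row, col_totals)
)
p_value_chi2 = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, df = 1

print(round(z, 3), round(chi2, 3), round(p_value_chi2, 4))
```

Both routes give the same p-value, a little under 2%, which agrees with rejecting H0 at the 5% level.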