Introduction
• The last lecture dealt with statistical inference concerning one population.
• This lecture deals with statistical inference concerning two subpopulations.
• Is gender associated with lung cancer (a two-category response variable)?
• Is gender associated with dementia (a multi-category response variable)?
Test No Association: dichotomous response variable
• Consider a dichotomous response variable, Y, coded as 0 for its first category and 1 for its second category, and a dichotomous independent variable, X, coded as 0 for its first category and 1 for its second category. Our interest is to test whether or not X is associated with Y.
• One example is to test whether gender is associated with lung cancer.
Test No Association: dichotomous response variable
• Let $\pi_1$ denote the population proportion of the second category of the dichotomous response variable Y in the second subpopulation (X=1), i.e., $\pi_1 = P(Y=1 \mid X=1)$.
• Let $\pi_0$ denote the population proportion of the second category of the dichotomous response variable Y in the first subpopulation (X=0), i.e., $\pi_0 = P(Y=1 \mid X=0)$.
• Let $\delta$ denote the difference between $\pi_1$ and $\pi_0$, i.e., $\delta = \pi_1 - \pi_0$.
Test No Association: dichotomous response variable
• Our goal is to test whether or not the two subpopulation proportions are equal, i.e., $H_0: \pi_1 = \pi_0$ (equivalently $\delta = \pi_1 - \pi_0 = 0$) versus $H_a: \delta \neq 0$.
• The idea behind the test is to estimate $\delta$ and see how far the estimate is from 0. An estimate far from 0 is evidence against H0.
MLE of $\delta$
• Let $y_{11}, \ldots, y_{1n_1}$ be the observed Y values in the random sample from the subpopulation (X=1).
• Let $y_{01}, \ldots, y_{0n_0}$ be the observed Y values in the random sample from the subpopulation (X=0).
• The individual data can be grouped into four cells according to x value and y value. A two-by-two table is used to represent the grouped data:

            y=1     y=0     Total
  x=1       n11     n10     n1
  x=0       n01     n00     n0
  Total     m1      m0      n
MLE of $\delta$
• The Maximum Likelihood Estimate of $\pi_1$ is $\hat\pi_1 = n_{11}/n_1$, i.e., $\hat\pi_1$ is the sample proportion of Y=1 in the subpopulation (X=1).
• The Maximum Likelihood Estimate of $\pi_0$ is $\hat\pi_0 = n_{01}/n_0$, i.e., $\hat\pi_0$ is the sample proportion of Y=1 in the subpopulation (X=0).
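MLE of $\delta$
• The Maximum Likelihood Estimate of $\delta$ is $\hat\delta = \hat\pi_1 - \hat\pi_0$.
• A large or small value of $\hat\delta$ is evidence against H0. How large (or small) is large (or small)?
• Fact: When n1 and n0 are sufficiently large and H0 is true, the statistic $z = \hat\delta / \widehat{SE}(\hat\delta)$ is approximately standard normal, where $\widehat{SE}(\hat\delta)$ denotes the estimated standard error of $\hat\delta$.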
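Asymptotic z Test for testing $H_0: \delta = 0$
• Decision rule: Reject H0 if $|z_{obs}| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ critical value of the standard normal distribution.
• P-value: the probability of observing a test statistic as extreme as or more extreme than the observed $z_{obs}$ in the direction against H0, i.e., P-value $= P(|Z| \ge |z_{obs}|) = 2[1 - \Phi(|z_{obs}|)]$, where Z is standard normal.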
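The z test above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: it assumes the pooled standard error (the common proportion is estimated as m1/n under H0), which is one standard choice and may differ from the formula on the original slide. The function name `two_proportion_z_test` is hypothetical, and the counts are the ones in the `riskdiff` data set used in the SAS example of this lecture.

```python
import math

def two_proportion_z_test(n11, n1, n01, n0):
    """Two-sided asymptotic z test of H0: pi1 = pi0.

    Assumption: the standard error is the pooled one, computed under H0.
    """
    p1 = n11 / n1                      # sample proportion of Y=1 when X=1
    p0 = n01 / n0                      # sample proportion of Y=1 when X=0
    pooled = (n11 + n01) / (n1 + n0)   # estimate of the common proportion, m1/n
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    z = (p1 - p0) / se0
    # Two-sided P-value: 2 * P(Z >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts from the riskdiff data set: 189 of 11034 (treat=1), 104 of 11037 (treat=0)
z, p_value = two_proportion_z_test(189, 11034, 104, 11037)
print(f"z = {z:.3f}, two-sided P-value = {p_value:.2e}")
```

With these counts the statistic is far in the right tail, so H0 would be rejected at any conventional level.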
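Asymptotic Confidence Interval for $\delta$
• The idea of the confidence interval for a normal mean can be applied to construct an asymptotic confidence interval for $\delta$.
• The asymptotic $100(1-\alpha)\%$ confidence interval for $\delta$ is $\hat\delta \pm z_{\alpha/2} \sqrt{\hat\pi_1(1-\hat\pi_1)/n_1 + \hat\pi_0(1-\hat\pi_0)/n_0}$.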
Asymptotic Confidence Interval for $\delta$
data riskdiff;
  input treat y count;
  datalines;
1 1 189
1 0 10845
0 1 104
0 0 10933
;
run;
proc freq data=riskdiff order=data;
  weight count;
  tables treat*y / riskdiff;
run;
Asymptotic Confidence Interval for $\delta$
• ORDER=DATA controls the layout of the 2 × 2 table: rows and columns appear in the order in which their values first occur in the data set. Specifically, if the first observation in the data set has treat=i and y=j, then the first row of the table is for treat=i and the first column of the table is for y=j (i=0,1; j=0,1).
• The RISKDIFF option requests column 1 and column 2 risks (or binomial proportions), risk differences, and their confidence limits for 2 × 2 tables.
Asymptotic Confidence Interval for $\delta$

Column 1 Risk Estimates
                                 (Asymptotic) 95%       (Exact) 95%
              Risk       ASE     Confidence Limits      Confidence Limits
--------------------------------------------------------------------------
Row 1        0.0171    0.0012    0.0147    0.0195      0.0148    0.0197
Row 2        0.0094    0.0009    0.0076    0.0112      0.0077    0.0114
Total        0.0133    0.0008    0.0118    0.0148      0.0118    0.0149
Difference   0.0077    0.0015    0.0047    0.0107
Difference is (Row 1 - Row 2)

Column 2 Risk Estimates
                                 (Asymptotic) 95%       (Exact) 95%
              Risk       ASE     Confidence Limits      Confidence Limits
--------------------------------------------------------------------------
Row 1        0.9829    0.0012    0.9805    0.9853      0.9803    0.9852
Row 2        0.9906    0.0009    0.9888    0.9924      0.9886    0.9923
Total        0.9867    0.0008    0.9852    0.9882      0.9851    0.9882
Difference  -0.0077    0.0015   -0.0107   -0.0047
Difference is (Row 1 - Row 2)

Sample Size = 22071
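As a cross-check, the "Difference" row of the Column 1 risk estimates can be recomputed by hand. The sketch below assumes the usual Wald form (estimate plus or minus a normal critical value times the unpooled standard error); the function name `wald_ci_risk_difference` and the hard-coded 95% critical value are illustrative choices, not part of the SAS output.

```python
import math

def wald_ci_risk_difference(n11, n1, n01, n0, z_crit=1.959964):
    """Asymptotic (Wald) confidence interval for pi1 - pi0.

    z_crit defaults to the 97.5th percentile of N(0, 1), i.e. a 95% interval.
    """
    p1 = n11 / n1                      # Row 1 risk
    p0 = n01 / n0                      # Row 2 risk
    diff = p1 - p0                     # Row 1 - Row 2
    # Unpooled asymptotic standard error of the difference
    ase = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return diff, ase, diff - z_crit * ase, diff + z_crit * ase

# Row totals from the riskdiff data: 189 + 10845 = 11034 and 104 + 10933 = 11037
diff, ase, lower, upper = wald_ci_risk_difference(189, 11034, 104, 11037)
print(f"difference = {diff:.4f}, ASE = {ase:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The printed values should agree with the SAS listing to four decimal places. Because the interval excludes 0, it points to the same conclusion as the z test.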
Fisher’s Exact Test
• When the sample sizes n1 and n0 are small, the normal approximation is unreliable and the P-value of the asymptotic z test above is not valid. Fisher’s exact test should be used instead.
• The idea of Fisher’s exact test is to use the (1,1) cell count, n11, as the test statistic; a large or small value of n11 is evidence against H0. How large (or small) counts as extreme is governed by the distribution of n11 under H0.
Fisher’s Exact Test
• The following table is an example of the (1,1) cell count being large:

            y=1     y=0
  x=1        4       1
  x=0        1       4

  which leads to $\hat\pi_1 = 4/5$ and $\hat\pi_0 = 1/5$, so H0 is unlikely to be true.
• The following table is an example of the (1,1) cell count being small:

            y=1     y=0
  x=1        1       4
  x=0        4       1

  which leads to $\hat\pi_1 = 1/5$ and $\hat\pi_0 = 4/5$, so H0 is unlikely to be true.
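Fisher’s Exact Test
• Under H0, the conditional distribution of n11, given that m1 and n1 are fixed, is the hypergeometric distribution, which has the following probability mass function:
  $P(n_{11} = k \mid n_1, m_1) = \binom{n_1}{k}\binom{n_0}{m_1 - k} \Big/ \binom{n}{m_1}$, for $\max(0, m_1 - n_0) \le k \le \min(n_1, m_1)$,
  where $\binom{s}{r} = \frac{s!}{r!\,(s-r)!}$ is the number of ways of choosing r items from s.
• This distribution is denoted by Hypergeometric(n, n1, m1).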
Fisher’s Exact Test
• The hypergeometric distribution originates from the experiment of randomly drawing m1 balls without replacement from a box that contains n1 white balls and n0 black balls. The probability of obtaining k white balls and m1-k black balls is $\binom{n_1}{k}\binom{n_0}{m_1 - k} \Big/ \binom{n_1 + n_0}{m_1}$.
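The ball-drawing description translates directly into code. The following sketch evaluates the hypergeometric probabilities with `math.comb`; the function name `hypergeom_pmf` is hypothetical, and the example margins (n = 10, n1 = 5, m1 = 5) are those of the two small 2 × 2 tables shown earlier for Fisher's exact test.

```python
from math import comb

def hypergeom_pmf(k, n, n1, m1):
    """P(k white balls) when m1 balls are drawn without replacement
    from a box holding n1 white and n0 = n - n1 black balls."""
    n0 = n - n1
    # Outside the support the probability is zero
    if k < max(0, m1 - n0) or k > min(n1, m1):
        return 0.0
    return comb(n1, k) * comb(n0, m1 - k) / comb(n, m1)

# Margins of the example tables: n = 10, n1 = 5, m1 = 5
n, n1, m1 = 10, 5, 5
for k in range(m1 + 1):
    print(k, round(hypergeom_pmf(k, n, n1, m1), 4))
```

With these margins, n11 = 4 and n11 = 1 each have probability 25/252 and sit in opposite tails of a distribution centred at n1·m1/n = 2.5.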
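Fisher’s Exact Test
• Fisher’s exact test for testing $H_0: \delta = 0$
• Hypergeometric(n, n1, m1) has mean and standard deviation: $E(n_{11}) = \frac{n_1 m_1}{n}$ and $SD(n_{11}) = \sqrt{\frac{n_1 n_0 m_1 m_0}{n^2 (n-1)}}$.
• The exact P-value is computed from the tail of the hypergeometric distribution in which the observed n11 falls: the right tail if n11 is larger than the mean $n_1 m_1 / n$, or the left tail if n11 is smaller than the mean.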
Fisher’s Exact Test
SAS code for conducting Fisher’s exact test:
data fisher;
  input treat y count;
  datalines;
1 1 10
1 0 2
0 1 2
0 0 4
;
run;
proc freq data=fisher order=data;
  weight count;
  tables treat*y;
  exact fisher;
run;
Fisher’s Exact Test
Statistics for Table of treat by y

Statistic                        DF       Value      Prob
---------------------------------------------------------
Chi-Square                        1      4.5000    0.0339
Likelihood Ratio Chi-Square       1      4.4629    0.0346
Continuity Adj. Chi-Square        1      2.5313    0.1116
Mantel-Haenszel Chi-Square        1      4.2500    0.0393
Phi Coefficient                          0.5000
Contingency Coefficient                  0.4472
Cramer's V                               0.5000

WARNING: 75% of the cells have expected counts less than 5.
         Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        10
Left-sided Pr <= F          0.9961
Right-sided Pr >= F         0.0573

Table Probability (P)       0.0533
Two-sided Pr <= P           0.1070

Sample Size = 18
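The four quantities in the "Fisher's Exact Test" panel can be recomputed directly from the hypergeometric distribution. The sketch below is a plain Python cross-check, not SAS's implementation; the function name `fisher_exact_2x2` is hypothetical. The two-sided P-value is taken here as the sum of the probabilities of all tables that are no more likely than the observed one, which is what the label "Two-sided Pr <= P" describes.

```python
from math import comb

def fisher_exact_2x2(n11, n10, n01, n00):
    """Exact tail probabilities for the (1,1) cell of a 2x2 table,
    conditioning on the row totals and column totals."""
    n1, n0 = n11 + n10, n01 + n00      # row totals
    m1 = n11 + n01                     # first column total
    n = n1 + n0
    support = range(max(0, m1 - n0), min(n1, m1) + 1)
    pmf = {k: comb(n1, k) * comb(n0, m1 - k) / comb(n, m1) for k in support}
    p_table = pmf[n11]
    return {
        "left": sum(p for k, p in pmf.items() if k <= n11),
        "right": sum(p for k, p in pmf.items() if k >= n11),
        "table": p_table,
        # small relative tolerance guards against floating-point ties
        "two_sided": sum(p for p in pmf.values() if p <= p_table * (1 + 1e-9)),
    }

# Cell counts from the fisher data set: (treat=1: 10, 2), (treat=0: 2, 4)
for name, value in fisher_exact_2x2(10, 2, 2, 4).items():
    print(f"{name:10s} {value:.4f}")
```

The results should match the listing to four decimal places. The two-sided P-value is above 0.05 even though the asymptotic chi-square P-value is below it, which is why the listing warns about small expected counts.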
Fisher’s Exact Test
Whether or not ORDER=DATA is used does not affect the P-value of Fisher’s exact test:
data fisher;
  input treat y count;
  datalines;
1 1 10
1 0 2
0 1 2
0 0 4
;
run;
proc freq data=fisher;
  weight count;
  tables treat*y;
  exact fisher;
run;
Fisher’s Exact Test
Statistics for Table of treat by y

Statistic                        DF       Value      Prob
---------------------------------------------------------
Chi-Square                        1      4.5000    0.0339
Likelihood Ratio Chi-Square       1      4.4629    0.0346
Continuity Adj. Chi-Square        1      2.5313    0.1116
Mantel-Haenszel Chi-Square        1      4.2500    0.0393
Phi Coefficient                          0.5000
Contingency Coefficient                  0.4472
Cramer's V                               0.5000

WARNING: 75% of the cells have expected counts less than 5.
         Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)         4
Left-sided Pr <= F          0.9961
Right-sided Pr >= F         0.0573

Table Probability (P)       0.0533
Two-sided Pr <= P           0.1070

Sample Size = 18
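The invariance seen in the two listings can be checked numerically. In the sketch below (hypothetical helper `fisher_p_values`, standard library only), the same counts are arranged once in the order they were entered and once in sorted order, where cell (1,1) becomes treat=0, y=0.

```python
from math import comb, isclose

def fisher_p_values(table):
    """Return (right-sided, two-sided) exact P-values for a 2x2 table of counts."""
    (n11, n10), (n01, n00) = table
    n1, n0, m1 = n11 + n10, n01 + n00, n11 + n01
    n = n1 + n0
    support = range(max(0, m1 - n0), min(n1, m1) + 1)
    pmf = {k: comb(n1, k) * comb(n0, m1 - k) / comb(n, m1) for k in support}
    right = sum(p for k, p in pmf.items() if k >= n11)
    two_sided = sum(p for p in pmf.values() if p <= pmf[n11] * (1 + 1e-9))
    return right, two_sided

as_entered = [[10, 2], [2, 4]]     # ORDER=DATA: rows treat=1,0; columns y=1,0
sorted_order = [[4, 2], [2, 10]]   # default order: rows treat=0,1; columns y=0,1

print(fisher_p_values(as_entered))
print(fisher_p_values(sorted_order))
```

Reversing both the row order and the column order shifts the (1,1) cell count by a constant determined by the margins, so the tail probabilities are unchanged.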
Test No Association: Multi-category Response Variable
• Consider a categorical response variable Y that has r categories, with the jth category coded as j-1 (i.e., the first category is coded as 0, the second as 1, and so on), and a dichotomous independent variable, X, coded as 0 for its first category and 1 for its second category. Our interest is to test whether or not X is associated with Y.
• One example is to test whether or not gender is associated with the type of dementia, which has 4 categories: dementia free, AD, vascular dementia, and other.
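Test No Association: Multi-category Response Variable
• Let $\pi_{1j}$ be the population proportion of the category of Y coded j in the subpopulation (X=1), i.e., $\pi_{1j} = P(Y=j \mid X=1)$, j = 0, 1, ..., r-1.
• Let $\pi_{0j}$ be the population proportion of the category of Y coded j in the subpopulation (X=0), i.e., $\pi_{0j} = P(Y=j \mid X=0)$, j = 0, 1, ..., r-1.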
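Test No Association: Multi-category Response Variable
• Let $\delta_j$ be the difference between $\pi_{1j}$ and $\pi_{0j}$, i.e., $\delta_j = \pi_{1j} - \pi_{0j}$, j = 0, 1, ..., r-1.
• Our main interest is to test $H_0: \delta_j = 0$ for all j versus $H_a: \delta_j \neq 0$ for at least one j.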
Data representation
• Let $y_{11}, \ldots, y_{1n_1}$ be the observed Y values in the random sample from the subpopulation (X=1).
• Let $y_{01}, \ldots, y_{0n_0}$ be the observed Y values in the random sample from the subpopulation (X=0).
• The individual data can be grouped into 2 × r cells according to x value and y value, which is usually represented by a 2-by-r table:

            y=r-1      ...     y=1     y=0     Total
  x=1       n1,r-1     ...     n11     n10     n1
  x=0       n0,r-1     ...     n01     n00     n0
  Total     m(r-1)     ...     m1      m0      n
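MLE of $\pi_{1j}$
• The MLE of $\pi_{1j}$ using the data $y_{11}, \ldots, y_{1n_1}$ is $\hat\pi_{1j} = n_{1j}/n_1$.
• If H0 is true, both $y_{11}, \ldots, y_{1n_1}$ and $y_{01}, \ldots, y_{0n_0}$ can be used to obtain the MLE of $\pi_{1j}$, which is $\tilde\pi_j = m_j/n$.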
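MLE of $\pi_{0j}$
• The MLE of $\pi_{0j}$ using the data $y_{01}, \ldots, y_{0n_0}$ is $\hat\pi_{0j} = n_{0j}/n_0$.
• If H0 is true, both $y_{11}, \ldots, y_{1n_1}$ and $y_{01}, \ldots, y_{0n_0}$ can be used to obtain the MLE of $\pi_{0j}$, which is $\tilde\pi_j = m_j/n$.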
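Pearson Chi-Squared Test
• The Pearson chi-squared test can be used to test $H_0: \pi_{1j} = \pi_{0j}$ for all j.
• The idea of this test is to consider the “distance” between the two sets of MLEs:
  The MLEs without using H0: $\hat\pi_{1j} = n_{1j}/n_1$ and $\hat\pi_{0j} = n_{0j}/n_0$
  The MLEs using H0: $\tilde\pi_j = m_j/n$
  A large distance is evidence against H0.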
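Pearson Chi-Squared Test
• Pearson chi-squared test statistic: $X^2 = \sum_{i=0}^{1} \sum_{j=0}^{r-1} \frac{(n_{ij} - n_i m_j / n)^2}{n_i m_j / n}$, where $n_i m_j / n$ is the expected count of cell (i, j) under H0.
• Under H0, $X^2$ has approximately a chi-squared distribution with r-1 degrees of freedom.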
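Pearson Chi-Squared Test
• Decision rule: Reject H0 if the observed $X^2_{obs} > \chi^2_{r-1,\alpha}$, where $\chi^2_{r-1,\alpha}$ is the $100(1-\alpha)$th percentile of the chi-squared distribution with r-1 degrees of freedom.
• P-value: the probability of observing a test statistic as extreme as or more extreme than the observed $X^2_{obs}$ in the direction against H0, i.e., P-value $= P(\chi^2_{r-1} \ge X^2_{obs})$.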
Pearson Chi-Squared Test
• SAS code for conducting the Pearson chi-squared test:
data multi;
  input treat y count;
  datalines;
1 2 21
1 1 7
1 0 13
0 2 7
0 1 7
0 0 29
;
run;
proc freq data=multi order=data;
  weight count;
  table treat*y / chisq;
run;
Pearson Chi-Squared Test
The FREQ Procedure
Table of treat by y

treat     y
Frequency|
Percent  |
Row Pct  |
Col Pct  |       2|       1|       0|  Total
---------+--------+--------+--------+
1        |     28 |      7 |     13 |     48
         |  33.33 |   8.33 |  15.48 |  57.14
         |  58.33 |  14.58 |  27.08 |
         | 100.00 |  50.00 |  30.95 |
---------+--------+--------+--------+
0        |      0 |      7 |     29 |     36
         |   0.00 |   8.33 |  34.52 |  42.86
         |   0.00 |  19.44 |  80.56 |
         |   0.00 |  50.00 |  69.05 |
---------+--------+--------+--------+
Total          28       14       42       84
            33.33    16.67    50.00   100.00

Statistics for Table of treat by y

Statistic                        DF       Value      Prob
---------------------------------------------------------
Chi-Square                        2     33.0556    <.0001
Likelihood Ratio Chi-Square       2     43.3480    <.0001
Mantel-Haenszel Chi-Square        1     31.5424    <.0001
Phi Coefficient                          0.6273
Contingency Coefficient                  0.5314
Cramer's V                               0.6273

Sample Size = 84
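The "Chi-Square" line of the listing can be recomputed from the cell frequencies printed in the table above (28, 7, 13 in row 1 and 0, 7, 29 in row 2), which are the counts that reproduce the reported statistic. The sketch below uses only the Python standard library; the helper names `chi2_sf` and `pearson_chi2` are hypothetical, and the upper-tail probability is obtained from a standard recurrence for the regularized incomplete gamma function rather than from a statistics package.

```python
import math

def chi2_sf(x, df):
    """P(chi-squared with df degrees of freedom >= x), for a positive integer df.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    with a = df / 2 and z = x / 2, starting from df = 1 or df = 2.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # df = 1
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q

def pearson_chi2(table):
    """Pearson X^2, degrees of freedom, and asymptotic P-value for a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # n_i * m_j / n under H0
            x2 += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return x2, df, chi2_sf(x2, df)

# Frequencies as printed in the FREQ table (rows treat=1,0; columns y=2,1,0)
x2, df, p_value = pearson_chi2([[28, 7, 13], [0, 7, 29]])
print(f"X^2 = {x2:.4f}, df = {df}, P-value = {p_value:.2e}")
```

For a 2-by-r table the degrees of freedom (2-1)(r-1) reduce to r-1, which is 2 here, in agreement with the DF column of the listing.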
HW Assignments
• Problem 1 (2.11 a and c on page 47): Refer to Table 2.1 (given on slide 34).
  a. Construct a 90% confidence interval for the difference of proportions, and interpret.
  c. Conduct a test of statistical independence. Interpret.
• Problem 2 (2.15a on page 48): Table 2.13 (given on slide 35) was taken from the 1991 General Social Survey.
  a. Test the hypothesis of independence between party identification and race. Interpret.
• Problem 3 (2.25 on page 50): Table 2.16 (given on slide 36) contains results of a study comparing radiation therapy with surgery in treating cancer of the larynx. Use Fisher’s exact test to see whether radiation therapy is different from surgery in treating cancer of the larynx. Interpret the results.
Table 2.1 Cross Classification of Belief in Afterlife by Gender