Intermediate Applied Statistics STAT 460 Lecture 20, 11/19/2004 Instructor: Aleksandra (Seša) Slavković sesa@stat.psu.edu TA: Wang Yu wangyu@stat.psu.edu
Last lecture • Categorical Data
This lecture • Categorical Data/Response (ch. 18,19,20) • Odds
Review: Categorical Variable • Notation: • Population proportion = π (sometimes we use p) • Population size = N • Sample proportion = p̂ = X/n = # with trait / total # • Sample size = n • The Rule for Sample Proportions • If numerous samples of size n are taken, the frequency curve of the sample proportions (p̂'s) from the various samples will be approximately normal with mean π and standard deviation √(π(1−π)/n) • p̂ ~ N(π, π(1−π)/n), where the second argument is the variance
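The rule for sample proportions can be checked numerically. The following is a minimal Python sketch, not part of the lecture: the function names, the seed, and the values π = 0.4 and n = 100 are illustrative assumptions. It compares the theoretical standard deviation √(π(1−π)/n) with the spread of simulated sample proportions.

```python
import math
import random
import statistics

def sample_proportion_sd(pi, n):
    """Theoretical standard deviation of the sample proportion."""
    return math.sqrt(pi * (1 - pi) / n)

def simulate_sample_proportions(pi, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return their sample proportions."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < pi for _ in range(n)) / n
            for _ in range(num_samples)]

# Illustrative values (assumptions, not from the lecture)
p_hats = simulate_sample_proportions(pi=0.4, n=100, num_samples=2000)

print("simulated mean of p-hat:", round(statistics.mean(p_hats), 3))
print("simulated sd of p-hat:  ", round(statistics.stdev(p_hats), 4))
print("theoretical sd:         ", round(sample_proportion_sd(0.4, 100), 4))
```

The simulated mean should land close to π and the simulated standard deviation close to the theoretical value, which is what the normal approximation relies on.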
Difference between proportions These tests can be extended to test the difference in parameters π between two groups.
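One common way to carry out such a two-group comparison is the pooled two-proportion z-test. The sketch below is an illustration under assumptions: the function names are hypothetical, and the counts (35 of 80 males admitted, 20 of 60 females admitted) are the ones implied by the figures quoted in the admissions example later in this lecture.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: pi1 == pi2, using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def two_sided_p_value(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Counts implied by the admissions example (assumed here for illustration)
z = two_proportion_z(35, 80, 20, 60)
p_value = two_sided_p_value(z)
print("z =", round(z, 3), " two-sided p-value =", round(p_value, 3))
```

For a 2x2 table, the square of this z statistic equals the chi-squared statistic for independence, so the two approaches lead to the same conclusion.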
Warning: z-tests for proportions are based on an approximation. They don't work for small samples. It is often said that n is large enough if [condition missing in source]. Because of improved computing power, an exact test based on the binomial distribution rather than the normal is now available in most software.
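To give a sense of what an exact test does, here is a minimal sketch for a single proportion. It uses one common definition of the two-sided exact p-value (sum the probabilities of all outcomes that are no more likely than the observed one under H0); statistical software may use a different convention. The function names and the example values are assumptions for illustration.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """P(X = k) for X ~ Binomial(n, pi)."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def exact_binomial_p_value(x, n, pi0):
    """Two-sided exact p-value: total probability, under H0: pi = pi0,
    of all outcomes that are no more likely than the observed count x."""
    observed = binomial_pmf(x, n, pi0)
    cutoff = observed * (1 + 1e-9)  # small slack for floating-point ties
    return sum(binomial_pmf(k, n, pi0)
               for k in range(n + 1)
               if binomial_pmf(k, n, pi0) <= cutoff)

# Hypothetical small sample: 2 successes in 10 trials, H0: pi = 0.5
p_small = exact_binomial_p_value(2, 10, 0.5)
print("exact two-sided p-value:", p_small)
```

No normal approximation is involved, so the result remains valid for small n.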
Contingency Table • A statistical tool for summarizing and displaying results for categorical variables • A two-way table is for two categorical variables • 2x2 Table, for two categorical variables, each with two categories • Place the counts of each combination of the two variables in the appropriate cells of the table. • Explanatory variable as labels for the rows, response variable as labels for the columns.
Example • A university offers only two degree programs: English and Computer Science. Admission is competitive and there is a suspicion of discrimination against women in the admission process. Here is a two-way table of all applicants by sex and admission status:
         Male   Female   Total
Admit      35       20      55
Deny       45       40      85
Total      80       60     140
• These data show an association between the sex of the applicants and their success in obtaining admission.
Marginal & Conditional Distributions • Marginal Distributions: • Explanatory Variable: add up values for the rows; take away response variable • In our example the distribution is: 55, 85, 140 • Observed proportions: • 'admit' = 55/140 = 0.39 • 'deny' = 85/140 = 0.61 • NOTE: they add up to 1 • Response Variable: add up values for the columns; take away explanatory variable • In our example the distribution is? • Observed proportions are: • Do they add up to 1?
Marginal & Conditional Distributions • Conditional Distribution: • Conditional percentages: the percent of a particular row or column total that the count in a cell represents. • Conditional distribution of gender for those admitted: • % of admitted who are male = 35/55 = 0.64 = 64% • % of admitted who are female = ? • What is: • % of male applicants admitted = ? • % of female applicants admitted = ?
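The marginal and conditional calculations above can be sketched in a few lines of Python. The cell counts below are not typed from a table in this transcript; they are reconstructed from figures quoted throughout the lecture (35 and 20 admitted, 45 and 40 denied), and the layout and function names are assumptions for illustration.

```python
# Rows: admission decision; columns: sex (counts reconstructed from the lecture)
table = {
    "admit": {"male": 35, "female": 20},
    "deny": {"male": 45, "female": 40},
}

def row_totals(table):
    """Marginal counts for the row variable."""
    return {row: sum(cells.values()) for row, cells in table.items()}

def column_totals(table):
    """Marginal counts for the column variable."""
    totals = {}
    for cells in table.values():
        for col, count in cells.items():
            totals[col] = totals.get(col, 0) + count
    return totals

def proportions(counts):
    """Turn a dict of counts into proportions that add up to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

def conditional_given_row(table, row):
    """Distribution of the column variable within one row."""
    return proportions(table[row])

def conditional_given_column(table, col):
    """Distribution of the row variable within one column."""
    return proportions({row: cells[col] for row, cells in table.items()})

print("marginal (admission):", proportions(row_totals(table)))
print("marginal (sex):", proportions(column_totals(table)))
print("sex among admitted:", conditional_given_row(table, "admit"))
print("admission among males:", conditional_given_column(table, "male"))
print("admission among females:", conditional_given_column(table, "female"))
```

Every distribution returned by `proportions` adds up to 1, which is a quick sanity check on hand calculations.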
Statistical Significance • An observed relationship is statistically significant if the chances of observing the relationship in the sample when there is no actual relationship in the population are small (usually less than 5%) • In other words, a relationship is statistically significant if that relationship is stronger than 95% of the relationships we would expect to see just by chance. • If we say that there was no statistically significant relationship found, that does not mean that there is no relationship at all! • Warnings: • If a sample size is small, strong relationships may not achieve significance • If a sample size is large, even minor relationships could achieve significance but these might not then have practical importance
Chi-Squared Test (χ² Test) • A Chi-Squared Test for independence • The Chi-Squared Statistic (χ²) for a contingency table: • Follows a χ² distribution • Skewed to the right • Min = 0, Max = infinity • As the strength of the observed relationship in the sample increases, the statistic increases. • It combines information about the strength of the relationship and the sample size into one number • Can be calculated for any size contingency table • For a 2 x 2 table: if χ² > 3.84 then we have a statistically significant relationship (at the 5% level) • We either show (χ² > 3.84) or fail to show (χ² < 3.84) a significant relationship; that is, we either reject (χ² > 3.84) or fail to reject (χ² < 3.84) the null hypothesis that the two variables are independent. • H0: variables are independent HA: variables are NOT independent
χ² • The chi-squared distribution with k-1 degrees of freedom is the distribution of the sum of the squares of k-1 independent Normal(0,1) random variables. (Not that you need to know.) • See table on pages 1100-1101 in textbook.
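The link to the normal distribution explains where the 3.84 cutoff for 2x2 tables comes from: with 1 degree of freedom, the statistic behaves like the square of a single Normal(0,1) variable Z, so P(χ² > x) = P(|Z| > √x). A minimal sketch (the function name is an assumption):

```python
import math

def chi2_df1_tail(x):
    """P(chi-squared with 1 degree of freedom > x), computed as P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(x / 2))

# 3.84 is approximately 1.96 squared, the familiar two-sided 5% cutoff for Z
print("P(chi-squared > 3.84) =", round(chi2_df1_tail(3.84), 4))
print("1.96 squared =", round(1.96 ** 2, 4))
```

This only covers 1 degree of freedom; larger tables need the textbook table or software.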
You Must Know: • How to calculate the χ² statistic • Compute the expected numbers • Compare the expected and observed numbers • Compute the χ² statistic • How to compare it to 3.84 for 2x2 tables • How to make a proper conclusion about the statistical relationship, and in general about the question of interest, for any two-way and k-way tables.
For our example: • Computing the χ² statistic: • Expected number = the number of counts (individuals) that we expect to fall in a particular cell = (row total)(column total)/(table total) • Expected number of admitted male students = (55 x 80)/140 = 31.43 • Expected number of admitted female students = ? • Observed number = the number of counts in the cell • Observed number of admitted male students = 35 • Observed number of admitted female students = ? • Compare the observed and expected numbers: (observed - expected)²/(expected number) • For male students: (35 - 31.43)²/(31.43) = 0.41 • For female students: = ? • Compute the statistic = sum of the above calculated numbers over all the cells • In our case χ² = 1.56 • Compare it to 3.84 • Is it statistically significant? Are admission decisions independent of gender?
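The steps above can be sketched in Python. The counts are reconstructed from the figures quoted in this lecture (rows: admit, deny; columns: male, female), and the function names are assumptions for illustration. Computed without intermediate rounding, the statistic for this table is about 1.56.

```python
def expected_counts(observed):
    """Expected count per cell = (row total)(column total)/(table total)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_squared_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Rows: admit, deny; columns: male, female (reconstructed counts)
observed = [[35, 20],
            [45, 40]]

expected = expected_counts(observed)
chi2 = chi_squared_statistic(observed)
print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print("chi-squared statistic:", round(chi2, 2))
print("significant at the 5% level (2x2 table)?", chi2 > 3.84)
```

The same two functions work for tables of any size, but the 3.84 cutoff applies only to 2x2 tables.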
Relative Risk, Increased Risk, Odds Ratio • Quantifications of the chances of a particular outcome and of how these chances change • What are the chances that a randomly selected individual would fall into a particular category of a categorical variable? • There are two basic ways to express these chances: • Proportions = expressing one category as a proportion of the total • Proportion of admitted students who are female = 20/55 = 0.36 • Odds = comparing one category to another • Odds of being admitted = 55 to 85 = 55/85 to 1
Expressing Proportions & Odds • There are 4 equivalent ways to express proportions: • Percent = Proportion = Probability = Risk • 36% (percent) of all admitted students are female • The proportion of admitted students who are female is 0.36 • The probability that an admitted student is female is 0.36 • Among admitted students, the risk of being female is 0.36 • Odds = expressed by reducing the numbers with and without the characteristic we are interested in to the smallest possible whole numbers: • The odds of being admitted = 55 to 85 = 11 to 17 = 11/17 to 1 • Going back and forth between proportions and odds: • If the proportion has value p then the odds are p/(1-p) to 1 • If the odds of having a characteristic are a to b, then the proportion with the characteristic is a/(a+b)
Generalized forms for the expressions: • Percentage with the characteristic = (number with the characteristic/total) x 100% • Proportion with the characteristic = (number with the characteristic/total) • Probability of having the characteristic = (number with the characteristic/total) • Risk of having the characteristic = (number with the characteristic/total) • Odds of having the characteristic = (number with the characteristic/number without the characteristic) to 1 = p/(1-p) to 1, where p is the proportion with the characteristic
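The conversions between proportions and odds can be sketched as follows. The function names are assumptions for illustration; the counts are the lecture's 55 admitted and 85 denied applicants, for which the odds 55 to 85 reduce to 11 to 17.

```python
from fractions import Fraction

def odds_from_proportion(p):
    """Odds (to 1) of a characteristic whose proportion is p."""
    return p / (1 - p)

def proportion_from_odds(a, b):
    """Proportion with the characteristic when the odds are a to b."""
    return a / (a + b)

def reduced_odds(count_with, count_without):
    """Odds expressed in the smallest possible whole numbers."""
    ratio = Fraction(count_with, count_without)
    return ratio.numerator, ratio.denominator

admitted, denied = 55, 85
print("odds of admission, reduced:", reduced_odds(admitted, denied))
print("odds of admission, to 1:", round(odds_from_proportion(admitted / (admitted + denied)), 3))
print("proportion admitted from odds:", round(proportion_from_odds(admitted, denied), 3))
```

Going from counts to odds and back returns the original proportion, which confirms that the two formulas are inverses of each other.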
Types of Risk: Relative Risk & Increased Risk • Relative risk = the ratio of the risks for each category of the explanatory variable • Relative risk of being female based on whether you are rejected or accepted: • Risk of being female if you are rejected = 40/85 = 0.47 • Risk of being female if you are accepted = 20/55 = 0.36 • Relative risk = 0.47/0.36 = 1.31 to 1 • What does this mean? • What does a relative risk of 1 mean? • Increased Risk = usually, the percent increase in risk • Increased risk = (change in risk/original risk) x 100% • Change in risk = 0.47 - 0.36 = 0.11 • Original risk = Baseline risk = 0.36 • Increased risk = (0.11/0.36) x 100% = 31% • The chance of being female is 31% higher among rejected applicants than among accepted applicants • Increased risk = (relative risk - 1.0) x 100% • Increased risk = (1.31 - 1.0) x 100% = 31%
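The relative risk and increased risk formulas can be sketched in Python using the same counts (40 of 85 rejected and 20 of 55 accepted applicants are female). The function names are assumptions for illustration. The slide rounds each risk to two decimals before dividing, which gives 1.31 and 31%; without intermediate rounding the results are about 1.29 and 29%.

```python
def risk(count_with, total):
    """Proportion of a group that has the characteristic."""
    return count_with / total

def relative_risk(risk_group, risk_baseline):
    """Ratio of the risk in one group to the baseline risk."""
    return risk_group / risk_baseline

def increased_risk_percent(risk_group, risk_baseline):
    """Percent increase in risk relative to the baseline."""
    return (risk_group - risk_baseline) / risk_baseline * 100

risk_rejected = risk(40, 85)  # proportion female among rejected
risk_accepted = risk(20, 55)  # proportion female among accepted (baseline)

rr = relative_risk(risk_rejected, risk_accepted)
increase = increased_risk_percent(risk_rejected, risk_accepted)
print("relative risk:", round(rr, 2))
print("increased risk (%):", round(increase, 1))
```

The two formulas for increased risk agree exactly: (change in risk/baseline) x 100% is the same quantity as (relative risk - 1) x 100%.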
Odds Ratio • First calculate the odds of having a characteristic versus not having it: • Odds of being female (versus male) among those admitted = 20/35 = 0.571429 • Odds of being female (versus male) among those rejected = 40/45 = 0.888889 • Then take the ratio of these odds: • Odds ratio = 0.888889/0.571429 = 1.5556 • Not too close to the relative risk of 1.31 here, but sometimes the two can be close • Odds ratio = (upper left * lower right)/(upper right * lower left) • Sometimes you need to swap the numerator and denominator so that the ratio is greater than 1 (easier to interpret)
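The cross-product formula can be sketched as below. The table layout (rows: admit, deny; columns: male, female) and the function names are assumptions for illustration; the counts are those quoted in the lecture.

```python
def odds_ratio(table):
    """(upper left * lower right) / (upper right * lower left) for a 2x2 table."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def odds_ratio_at_least_one(table):
    """Odds ratio with numerator and denominator swapped if needed so it is >= 1."""
    ratio = odds_ratio(table)
    return ratio if ratio >= 1 else 1 / ratio

# Rows: admit, deny; columns: male, female (counts from the lecture)
admissions = [[35, 20],
              [45, 40]]

print("odds ratio:", round(odds_ratio(admissions), 4))
print("ratio of the two odds:", round((40 / 45) / (20 / 35), 4))
```

Both lines print the same value, which shows that the cross-product shortcut matches the two-step calculation of dividing one odds by the other.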
Misleading items about Risk/Odds • The baseline risk is missing • The time period of the risk is not identified • The reported risk is not necessarily your risk (relative risk vs. your risk) • Retrospective vs. Prospective study • Prospective: take a random sample and record success and failure in the future • Retrospective: take a random sample and record success and failure that happened in the past • In a retrospective study you can meaningfully interpret the odds ratio, but not the individual odds
Simpson's Paradox • Lurking variable = a variable that changes the nature of the association, or even reverses the direction of the relationship, between two other variables. • The nature of an association changes due to a lurking variable • In our example we didn't consider the type of program (major) as a variable. What happens if we do, and construct two separate tables, one for each major?
Example of Simpson's Paradox • Computer Science admits 50% of both male and female applicants • English admits ¼ of both male and female applicants • Now there doesn't seem to be an association between sex and admission decision in either program • Hence, type of program was a lurking variable
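The per-program tables are not preserved in this transcript, so the sketch below uses counts derived under stated assumptions: admission rates of exactly 50% (Computer Science) and 25% (English) for both sexes, and overall totals of 35 of 80 males and 20 of 60 females admitted. Under those assumptions the only counts that fit are the ones shown. The data layout and function names are hypothetical.

```python
# (admitted, applied) per program and sex; derived counts, see assumptions above
applicants = {
    "Computer Science": {"male": (30, 60), "female": (10, 20)},
    "English": {"male": (5, 20), "female": (10, 40)},
}

def admission_rate(admitted, applied):
    """Proportion of applicants who were admitted."""
    return admitted / applied

def pooled_rate(applicants, sex):
    """Admission rate for one sex with both programs combined."""
    admitted = sum(program[sex][0] for program in applicants.values())
    applied = sum(program[sex][1] for program in applicants.values())
    return admission_rate(admitted, applied)

for program, by_sex in applicants.items():
    for sex, (admitted, applied) in by_sex.items():
        print(program, sex, "admission rate:", admission_rate(admitted, applied))

print("pooled male rate:", round(pooled_rate(applicants, "male"), 3))
print("pooled female rate:", round(pooled_rate(applicants, "female"), 3))
```

Within each program the rates are identical for males and females, yet the pooled rates differ because most female applicants applied to the program with the lower admission rate.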
Commands in SAS • To create contingency tables, calculate chi-square statistic, etc… • Statistics/Table Analysis • To run the logistic regression • Statistics/Regression/Logistic
Next • Lab Monday • Categorical Data, • Logistic Regression -- we will work through the lab together and learn about logistic regression • Project II