Survey of Statistical Methods

Survey of Statistical Methods April 25, 2005

Sample Problem • Suppose a bill that proposes to lower the legal drinking age to 18 is pending before the state legislature. A political scientist is interested in determining whether there is an association between political affiliation and attitude toward the bill. He sends out a survey and receives answers from 400 Republicans and Democrats.

Data Collection • A questionnaire was sent to 1,200 individuals. A total of 500 questionnaires were returned (42%); however, due to incomplete data, the final sample size for this study was N=400. • Political affiliation • Response to question #4 (Do you consider yourself to be Republican or Democrat?) • Attitude toward the bill • Response to question #15 (How would you characterize your attitude toward lowering the drinking age to 18? For, Against, or Undecided?)

Chi Square Test of Independence • Purpose • To determine whether two variables of interest are independent (not related) or dependent (related). • When the variables are independent, we are saying that knowledge of one gives us no information about the other variable. When they are dependent, we are saying that knowledge of one variable is helpful in predicting the value of the other variable. • The chi-square test of independence tests whether a subject’s value on one variable is associated with the same subject’s value on a second variable. • Some examples where one might use the chi-square test of independence are: • Is level of education related to level of income? • Is the level of price related to the level of quality in production? • Is a person's party affiliation related to the person's preferred television network? • Hypotheses • The null hypothesis is that the two variables are independent. The data are consistent with it when the observed counts in the sample are similar to the expected counts. • H0: X and Y are independent • H1: X and Y are dependent

Displaying Independent and Dependent Relationships • When group membership makes a difference, the dependent relationship is indicated by one group having a higher proportion than the proportion for the total sample. When the variables are independent, the proportion in each group is close to the proportion for the total sample. From: http://www.utexas.edu/courses/schwab/sw318_spring_2004/SolvingProblems/Class24_ChiSquareTestOfIndependencePostHoc.ppt
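As a minimal sketch of the comparison described on this slide, the Python below computes the proportion in favor within each group and for the total sample. The counts are hypothetical and chosen only for illustration; they are not the survey's data.

```python
# Hypothetical 2 x 3 table of counts (NOT the survey data).
# Columns, in order: For, Against, Undecided.
observed = {
    "Republican": [60, 110, 30],
    "Democrat": [90, 80, 30],
}


def proportion_for(counts):
    """Share of a group's respondents in the first column (For)."""
    return counts[0] / sum(counts)


# Proportion in favor within each group.
group_props = {group: proportion_for(c) for group, c in observed.items()}

# Proportion in favor for the total sample: pool both groups.
total_for = sum(c[0] for c in observed.values())
total_n = sum(sum(c) for c in observed.values())
total_prop = total_for / total_n

for group, p in group_props.items():
    print(f"{group}: {p:.3f} For (total sample: {total_prop:.3f})")
```

In this made-up table one group sits below the total-sample proportion and the other above it, which is the pattern the slide describes for a dependent relationship. Whether such a gap is larger than chance would produce is what the chi-square test itself assesses.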

Chi Square Test of Independence • Wording of research questions • Are X and Y independent? • Are X and Y related? • The research hypothesis states that the two variables are dependent or related. The data support it when the observed counts for the categories of the variables in the sample differ from the expected counts by more than chance alone would explain. • Level of Measurement • Both X and Y are categorical
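The comparison of observed and expected counts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPSS procedure: the expected count for a cell is taken as (row total × column total) / N, the statistic is the sum of (observed − expected)² / expected over all cells, and the table is hypothetical rather than the survey data.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson chi-square: sum over cells of (O - E)^2 / E."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts (NOT the survey data).
# Rows: Republican, Democrat. Columns: For, Against, Undecided.
hypothetical = [
    [60, 110, 30],
    [90, 80, 30],
]

expected = expected_counts(hypothetical)
chi2 = chi_square_statistic(hypothetical)
print(expected)
print(round(chi2, 3))
```

The larger the gaps between observed and expected counts, the larger the statistic, and the stronger the evidence against H0.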

Assumptions: Chi Square Test of Independence • Each subject contributes data to only one cell • Finite values • Observations must be grouped in categories. No assumption is made about level of data. Nominal, ordinal, or interval data may be used with chi-square tests. • A sufficiently large sample size • In general N > 20. • There is no single accepted cutoff; the general rules are • No cells with observed frequency = 0 • No cells with expected frequency < 5 • Applying chi-square to small samples exposes the researcher to an unacceptable rate of Type II errors. Note: chi-square must be calculated on actual count data, not on percentages; substituting percentages would have the effect of pretending the sample size is 100.
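The sample-size rules above can be checked mechanically before running the test. The sketch below is one possible check, assuming the rules of thumb listed on this slide (N > 20, no observed count of 0, no expected count below 5). The function name and both tables are hypothetical.

```python
def check_assumptions(table, min_expected=5, min_n=20):
    """Return a list of rule-of-thumb violations for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    problems = []
    if n <= min_n:
        problems.append(f"N = {n} is not greater than {min_n}")

    # Cells with an observed frequency of zero.
    zero_cells = sum(1 for row in table for o in row if o == 0)
    if zero_cells:
        problems.append(f"{zero_cells} cell(s) with observed frequency 0")

    # Cells whose expected frequency (row total * column total / N) is too small.
    low_expected = sum(
        1 for r in row_totals for c in col_totals if r * c / n < min_expected
    )
    if low_expected:
        problems.append(
            f"{low_expected} cell(s) with expected frequency < {min_expected}"
        )
    return problems


# Hypothetical tables of raw counts (never percentages).
large_table = [[60, 110, 30], [90, 80, 30]]
small_table = [[3, 0], [2, 4]]

print(check_assumptions(large_table))
print(check_assumptions(small_table))
```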

Raw Data for 1st 20 subjects (N=400) • Political affiliation: 1 = Republican, 2 = Democrat • Voting on a bill that proposes to lower the legal drinking age to 18: 1 = For, 2 = Against, 3 = Undecided
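To show how coded raw data of this kind become a contingency table, here is a small Python sketch. The six records are invented for illustration and are not the study's raw data. Only the coding scheme (1 = Republican, 2 = Democrat; 1 = For, 2 = Against, 3 = Undecided) comes from this slide.

```python
from collections import Counter

# Coding scheme from the slide.
POLITICAL = {1: "Republican", 2: "Democrat"}
VOTE = {1: "For", 2: "Against", 3: "Undecided"}

# Hypothetical records as (political, vote) pairs; NOT the actual raw data.
records = [(1, 1), (1, 2), (2, 1), (2, 3), (1, 2), (2, 1)]

# Count how many subjects fall into each (political, vote) cell.
cell_counts = Counter(records)

# Arrange the counts as rows = political affiliation, columns = vote.
# A Counter returns 0 for combinations that never occur.
table = [
    [cell_counts[(p, v)] for v in sorted(VOTE)]
    for p in sorted(POLITICAL)
]

for p, row in zip(sorted(POLITICAL), table):
    labelled = ", ".join(f"{VOTE[v]}: {n}" for v, n in zip(sorted(VOTE), row))
    print(f"{POLITICAL[p]} -> {labelled}")
```

Each subject appears in exactly one cell, which matches the assumption that each subject contributes data to only one cell.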

Setup for Analysis • Research Question • Is there an association between political affiliation and attitude toward the bill? • Statistical Hypotheses • H0: Political affiliation and attitude toward the bill are independent • H1: Political affiliation and attitude toward the bill are not independent • Level of Significance • α = .05

How to Compute the Chi Square Test of Independence using SPSS • Analyze – Descriptive Statistics – Crosstabs • Do not go to • Analyze – Nonparametric – Chi Square • This is a different type of Chi Square Test (the one-sample goodness-of-fit test)

Set up in SPSS for a Chi Square Test of Independence • Analyze – Descriptive Statistics – Crosstabs

Refer to the handout of the SPSS output for information about how to interpret the results to reach the following conclusions • There is a significant association between political affiliation and attitude toward the bill [χ²(2) = 6.0, p = .05]. • More Democrats are FOR the bill. • More Republicans are AGAINST the bill.

Figure 1a. Histogram showing the association between political affiliation and attitude toward the bill. A Chi-Square Test of Independence revealed a significant relationship between these variables [χ²(2) = 6.0, p = .05]. Democrats were found to have a more favorable attitude toward the bill than Republicans.

Figure 1b. Histogram showing the association between political affiliation and attitude toward the bill. A Chi-Square Test of Independence revealed a significant relationship between these variables [χ²(2) = 6.0, p = .05]. Republicans were found to have a more unfavorable attitude toward the bill than Democrats.

How to determine the Critical Region for the Test Statistic by hand • Utilizes the Chi Square Distribution • df = (r-1)*(c-1) = (2-1)*(3-1) = 1*2 = 2

df = 2, α = .05: Critical χ² = 5.991
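The critical value for this example can be reproduced without a table. The sketch below relies on a special property of the chi-square distribution with 2 degrees of freedom: its upper-tail probability is exp(−x/2), so the critical value is −2·ln(α). This shortcut holds only for df = 2; other degrees of freedom need a chi-square table or a statistics library. The function names are hypothetical.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1) * (c - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_upper_tail_df2(x):
    """P(chi-square > x) when df = 2 (closed form, valid for df = 2 only)."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Critical value for df = 2: solve exp(-x / 2) = alpha for x."""
    return -2 * math.log(alpha)


df = degrees_of_freedom(2, 3)        # 2 affiliations x 3 attitudes
critical = chi2_critical_df2(0.05)   # cutoff at alpha = .05
p_value = chi2_upper_tail_df2(6.0)   # reported statistic from the slides

print(df, round(critical, 3), round(p_value, 4))
print("Reject H0" if 6.0 > critical else "Fail to reject H0")
```

The reported statistic of 6.0 falls just beyond the critical value, which agrees with the conclusion stated on the earlier slides.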

Post-Hoc Analysis for a Chi Square Test of Independence: Which Cell or Cells Caused the Difference • You can conduct a post-hoc procedure only if the result of the chi-square test was statistically significant. • Examination of percentages in the contingency table and expected frequency table can be misleading. • The residual, or the difference, between the observed frequency and the expected frequency is a more reliable indicator. • Notice the values labeled “standardized residual” that are computed for each cell. Each value is interpreted as a z-score. • Compare the value for each standardized residual against the critical z-value for your α level. • This is equivalent to testing the null hypothesis that the actual frequency equals the expected frequency for a specific cell versus the research hypothesis that the two differ. • There can be 0, 1, 2, or more cells with statistically significant standardized residuals to be interpreted.

Interpreting Standardized Residuals • Standardized residuals that have a positive value mean that the cell was over-represented in the actual sample, compared to the expected frequency, i.e. there were more subjects in this category than we expected. • Standardized residuals that have a negative value mean that the cell was under-represented in the actual sample, compared to the expected frequency, i.e. there were fewer subjects in this category than we expected.
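A rough Python sketch of the residual check follows. It assumes the standardized residual is (observed − expected) / √expected and uses 1.96 as the two-tailed critical z for α = .05. The counts are hypothetical, not the survey data.

```python
import math

CRITICAL_Z = 1.96  # two-tailed critical z-value for alpha = .05


def standardized_residuals(table):
    """(O - E) / sqrt(E) for every cell, with E = row total * col total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [(o - r * c / n) / math.sqrt(r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]


# Hypothetical counts (NOT the survey data).
# Rows: Republican, Democrat. Columns: For, Against, Undecided.
hypothetical = [
    [60, 110, 30],
    [90, 80, 30],
]

residuals = standardized_residuals(hypothetical)

# Cells whose residual exceeds the critical z in absolute value.
significant = [
    (i, j)
    for i, row in enumerate(residuals)
    for j, z in enumerate(row)
    if abs(z) > CRITICAL_Z
]

for row in residuals:
    print([round(z, 2) for z in row])
print("Significant cells:", significant)
```

A positive residual marks a cell with more subjects than expected and a negative residual a cell with fewer. In this made-up table no residual exceeds 1.96 in absolute value, so only the signs of the largest residuals could be interpreted, and with caution.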

Post Hoc Strategy for the Chi Square Test of Independence • If there is at least one cell with a significant standardized residual • Formulate your conclusion based on a comparison of all of the cells containing significant standardized residuals. • If none of the cells have a significant standardized residual • Interpret the findings based on a comparison of the ‘sign (+ or -)’ of the largest values for the standardized residuals. • Apply caution when this is the case!

What is a Categorical Variable? • A categorical variable represents a set of discrete events, such as groups, decisions, or anything else that can be classified into categories. In contrast to a continuous variable, a value of a categorical variable indicates a discrete category, whereas a value of a continuous variable can fall on any point on a numeric continuum. One example of a categorical variable is a person's sex, which can be represented by two exhaustive and mutually exclusive categories: male and female. A categorical variable may also consist of more than two categories. For example, a person's major in college can be categorized as biology, history, engineering, psychology, etc. • A categorical variable can be ordered or unordered. For instance, a person's level of schooling is an ordered variable; a person's sex is an unordered variable. Although the levels of a categorical variable are often represented by numerals, these symbols are not interpreted numerically if the variable is unordered. • Categorical data are often presented in a contingency table which tabulates the number of observations that fall into each cell of the table. The table above is a simple 2 x 2 contingency table that crosstabulates whether graduate school applicants were accepted or rejected and whether they were male or female. Each cell represents a joint event which is a unique combination of the categorical variables. The crosstabulation of the graduate school's decision and applicant's gender results in four possible outcomes, or joint events. For example, in Table 1, the joint event representing rejected males contains 166 observations. Marginal events refer to the total number of observations for a category of a particular variable. Here, the marginal event for the category female is 245. From: http://www.utexas.edu/cc/docs/stat57.html#variable
