Data Analysis The easy part.
Data Cleaning The first thing to do is "data cleaning." • Ensure that cases with missing answers on your variables will be excluded from analysis by SPSS • Code variables correctly: • Follow the logic of the response options: higher values reflect greater magnitude • Be consistent with the hypotheses' statements about the variables • Make sure the level of measurement matches the statistical test that will be used Remember levels of measurement? They guide choices of statistical procedures. Coding schemes and levels of measurement are integrally related.
Data Cleaning Coding for Level of Measurement • Ordinal variable: Determine whether the variable should be used as "nominal" or "interval-ratio," then follow the logic for coding at that level. • Dichotomous variable: Determine whether the variable should be used as "nominal" or "interval-ratio." • If nominal, follow the logic for coding at that level • If interval-ratio, code so that: • '1' represents presence of the concept • '0' represents absence of the concept.
Data Cleaning • Nominal variable • Code values are meaningless, so you can use any numbers when coding • Allow at most 5 categories (to aid analysis and interpretation) • If more than 5 categories exist, recode by grouping categories into larger sets. • Interval-ratio variable • Order and magnitude of code numbers should match the order and magnitude of the response options • Preserve the variation in the variable: do not reduce the number of options and do not dichotomize. • The exceptions to this rule: • The hypothesis requires the reduction in variation • The dependent variable is nominal; therefore, you must combine responses of the independent variable into 3 to 5 categories so that the variable may be treated as a nominal variable for use in crosstabs.
Recoding When recoding, create a new variable so that the original variable stays on hand in case of a screw-up. SPSS Commands for Recode • CLICK <Transform> • CLICK <Recode into Different Variables…> • CLICK ON <Reset> NOW! • [Highlight the target variable on the left and CLICK ON the arrow to put it into the "Input Variable -> Output Variable:" box] • [Write a new variable name in the "Name:" box and click <change>] You may also write in a "Label," but it is not necessary • CLICK ON <Old and New Values…> • [Code by code, enter the old code on the left and the corresponding new code on the right, then click <Add>, UNTIL ALL ORIGINAL CODES ARE CONVERTED TO A NEW ONE, INCLUDING "MISSING CODES"] • CLICK ON <Continue> • CLICK ON <OK> In "Variable View," the new variable appears in the last row at the end of the list of variables. In "Data View," its numbers appear in the last column on the right. If you are using the student version of the software, you may need to delete unused variables to make room for new ones.
Recoding Example: Dichotomous • In GSS, CAPPUN is coded as follows: 0 = NAP 1 = Favor 2 = Oppose 8 = DK 9 = NA • However, "Favor" should equal '1' and "Oppose" should equal '0' • NAP, DK, and NA are useless, so "0, 8, 9" should be treated as missing. SPSS Commands to Recode CAPPUN: • <Transform> • <Recode Into Different Variables…> • RESET • [Highlight cappun and click on arrow to put it into "Input Variable -> Output Variable:" box] • [Write in a new variable name, "newcappun" and click <change>] • <Old and New Values…> • [0 left, Missing right, click <Add>; 8 left, Missing right, click <Add>; 9 left, Missing right, click <Add>; 1 left, 1 right, click <Add>; 2 left, 0 right, click <Add>] • <Continue> • <OK>
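The same recode can also be run from an SPSS syntax window instead of the menus. A minimal sketch, assuming the GSS file is open and the variable is named cappun as above (newcappun is just the example name used here):
* Recode CAPPUN into NEWCAPPUN: treat 0, 8, and 9 as missing; Favor (1) stays 1, Oppose (2) becomes 0.
RECODE cappun (0,8,9=SYSMIS) (1=1) (2=0) INTO newcappun.
EXECUTE.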
Recoding Example: Nominal • In GSS, RELIG is coded as follows: 0 = NAP, 1 = Protestant, 2 = Catholic, 3 = Jewish, 4 = None, 6 = Buddhism, 7 = Hinduism, 8 = Other Eastern, 9 = Moslem/Islam, 10 = Orthodox-Christian, 11 = Christian, 12 = Native American, 13 = Inter-Nondenominational, 98 = DK, 99 = NA • Let's say you just want 3 groups: Christian, Other, and No Religion • The numbers assigned do not matter, so just use: All Christian = 1, Other Religions = 2, No Religion = 3 • NAP, DK, and NA are useless, so "0, 98, and 99" should be treated as missing. SPSS Commands to Recode RELIG: • <Transform> • <Recode Into Different Variables…> • RESET • [Highlight relig and click on arrow to put it into "Input Variable -> Output Variable:" box] • [Write in a new variable name, "newrelig" and click <change>] • <Old and New Values…> • [0 left, Missing right, click <Add>; 98 left, Missing right, click <Add>; 99 left, Missing right, click <Add>; 1 through 2 left, 1 right, click <Add>; 10 through 11 left, 1 right, click <Add>; 3 left, 2 right, click <Add>; 5 through 9 left, 2 right, click <Add>; 12 through 13 left, 2 right, click <Add>; 4 left, 3 right, click <Add>] • <Continue> • <OK>
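Again, a syntax equivalent; a minimal sketch assuming the variable is named relig as above (newrelig is just the example name):
* Recode RELIG into NEWRELIG: 1 = Christian, 2 = Other religion, 3 = No religion; 0, 98, and 99 become missing.
RECODE relig (0,98,99=SYSMIS) (1 THRU 2=1) (10 THRU 11=1) (3=2) (5 THRU 9=2) (12 THRU 13=2) (4=3) INTO newrelig.
EXECUTE.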
Recoding Example: Interval-ratio • In GSS, PARTYID is coded as follows: 0 = Strong Democrat; 1 = Not Very Strong Democrat; 2 = Independent, Close to Democrat; 3 = Independent (Neither, No Response); 4 = Independent, Close to Republican; 5 = Not Very Strong Republican; 6 = Strong Republican; 7 = Other Party, Refused to Say; 8 = DK; 9 = NA • To use PARTYID as a scale of party identification, remove items that do not indicate Democratic or Republican affiliation; "Independent" and "Other Party" are like that. Recode to: • 1 = Strong Democrat, 2 = Not Very Strong Democrat, 3 = Independent, Close to Democrat, 4 = Independent, Close to Republican, 5 = Not Very Strong Republican, and 6 = Strong Republican. • "Independent," "Other Party," DK, and NA are useless, so "3, 7, 8, and 9" will be converted to missing in the new variable. SPSS Commands to Recode PARTYID: • <Transform> • <Recode Into Different Variables…> • RESET • [Highlight partyid and click on arrow to put it into "Input Variable -> Output Variable:" box] • [Write in a new variable name, "newpartyid" and click <change>] • <Old and New Values…> • [3 left, Missing right, click <Add>; 7 through 9 left, Missing right, click <Add>; 0 left, 1 right, click <Add>; 1 left, 2 right, click <Add>; 2 left, 3 right, click <Add>; 4 through 6 left, Copy right, click <Add>] • <Continue> • <OK>
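And the syntax equivalent; a minimal sketch assuming the variable is named partyid as above (newpartyid is just the example name):
* Recode PARTYID into NEWPARTYID: 3 and 7 through 9 become missing; 0 through 2 shift up to 1 through 3; 4 through 6 are copied unchanged.
RECODE partyid (3=SYSMIS) (7 THRU 9=SYSMIS) (0=1) (1=2) (2=3) (4 THRU 6=COPY) INTO newpartyid.
EXECUTE.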
Descriptive Statistics • In research articles, one presents descriptive statistics for the variables used to test the research hypotheses. • The descriptive statistics produced are determined by level of measurement: • Ordinal variable that is used as • Nominal: follow the rules for a nominal variable • Interval-ratio: follow the rules for an interval-ratio variable • Dichotomous variable: follow the rules for a nominal variable (even if it is being treated as an interval-ratio variable) • Nominal variable: generate a frequency table that shows the percent of respondents in each category • Interval-ratio variable: generate a table for the mean and standard deviation of the variable • Never produce a frequency table for an interval-ratio variable. Never report a mean and standard deviation for a nominal variable.
Descriptive Statistics SPSS Commands to Generate Descriptive Statistics • Frequency Table(s) • CLICK ON <Analyze> • CLICK ON <Descriptive Statistics> • CLICK ON <Frequencies> • [Place your variable(s) in the “Variable(s):” box by highlighting it (them) and then clicking on the right-pointing arrow.] • CLICK ON <OK> • Mean(s) and Standard Deviation(s) • CLICK ON <Analyze> • CLICK ON <Descriptive Statistics> • CLICK ON <Descriptives…> • [Place your variable(s) in the “Variable(s):” box by highlighting it (them) and then clicking on the right-pointing arrow.] • CLICK ON <OK>
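The same output can be produced from a syntax window. A minimal sketch, assuming GSS variable names marital (marital status) and educ (years of education); substitute your own variables:
* Frequency table for a nominal variable.
FREQUENCIES VARIABLES=marital.
* Mean and standard deviation for an interval-ratio variable.
DESCRIPTIVES VARIABLES=educ /STATISTICS=MEAN STDDEV.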
Descriptive Statistics Frequency Table Output Table 1. Frequency Table for Marital Status of Respondents* *The table title is in the format that is required for the final paper.
Descriptive Statistics Mean and SD Output Table 2. Mean and Standard Deviation for Education. Note how ugly and confusing the SPSS output tables are. When you are a professional, NEVER use SPSS output in a report. ALWAYS make professional tables!
Descriptive Statistics, Professional Table Elements of a good table: a title stating which statistics are reported for which variables; variables in the left column; headers in the top row stating which statistics appear below; each statistical value placed at the intersection of its variable's row and its statistic's column; decimal-aligned columns; and attention to order and avoidance of clutter.
Tables for Paper • You may put SPSS output into your results section to substitute for professional tables. • Those with professional tables will get extra special grading on the papers. • The simplest way to reproduce the SPSS tables in your papers is to “copy and paste” them. They migrate from SPSS to Word quite well. • The tables should be inserted into the body of the paper and include a table number and title. They should NOT bleed across pages.
Inferential Statistics • The next step: test hypotheses. • Technically, to test a hypothesis, all one needs is an empirical observation. • However, to apply empirical observations to the populations they study, quantitative social scientists typically need INFERENTIAL STATISTICS. That is because they actually use empirical observations of samples that were taken from populations. Inferential statistics allow us to infer something about a population using a sample’s characteristics
Sampling [Diagram: the GSS sample is drawn from the population of persons in US households; inferences from the sample are applied back to the population.]
Sampling To be MOST representative of a target population, data should be collected from a sample that was generated in a RANDOM way (think coin flips, names drawn from a bag, a roulette wheel, etc.). Even though random sampling is the BEST way to get respondents, sampling error occurs naturally in the process of random sampling. Researchers aren't guaranteed representative precision. In randomness, patterns can emerge.
Sampling Patterns in Random Dots [Figure: an example of a high density of independent selections in a random sample. Notice how many apparent patterns occur throughout.]
Sampling Error • Researchers typically take and use one sample, but there are many, many other possible selection patterns that could be generated by random selection of the same number of people from populations. • Therefore, there are many, many possible random samples, each with its own pattern of differences between people. • Different people would lead to different recorded measurements of any variable from sample to sample. • Variation from sample to sample would lead to variations in statistics for any variable from sample to sample. For example: the difference between men and women in average money $pent on $hoes each year would vary from one possible sample to the next and on and on (e.g., 1000, 1500, 1400, 800, 170, 200, 1100, 500, 700).
Sampling Error Let's create a sampling distribution of the difference in $pent on $hoes… Take a sample of 1,500 men and women from the US. Record the variable, $pent on $hoes. Calculate the difference between men and women and record it. [Chart: each plotted point = a possible sample's difference between men and women on $pent on $hoes; the actual population difference is unknown.]
Sampling Error: Repeated Sampling [Chart: more possible samples' differences between men and women on $pent on $hoes accumulate; the actual population difference is still unknown.]
Sampling Error: Repeated Sampling The samples' differences of means would stack up in the shape of a normal curve: a normal sampling distribution. [Chart: each plotted point = a possible sample's difference between men and women on $pent on $hoes; the actual population difference is unknown.]
Statistical "Noise" The difference in averages for two groups on any variable sampled like this will have its own sampling distribution. [Chart: another sampling distribution of the difference in means, where each point = a possible sample's difference between men and women on $pent on $hoes using a different sample size; the actual population difference is unknown.]
Statistical "Noise" These distributions are known because statisticians have charted the characteristics of statistics when using random samples. Note: not all of them are cute, bell-shaped distributions. The sampling distribution reveals the extent of sampling error inherent in variables' statistics. • The statistic as measured in each sample would jump around from sample to sample; this is the "noise" of sampling error.
Statistical Noise Researchers use their knowledge of statistical noise to determine whether there is a relationship between two variables in the population. Here's the logic: Is some value "#" the population's relationship statistic? Knowing that two variables' relationship statistic has a "noise" pattern, one can tell whether that value is plausible by comparing the sample's statistic against the noise. Could a population where the statistic equals # produce this sample? The "noise" contains the most likely possibilities for sample values if # is the true value. • If the sample's relationship statistic is not among those central, most likely possibilities, it is assumed that the sample did not come from such a population. • Our sample is real; # is only a candidate value. So if the sample's statistic were that large (far outside the noise), we would conclude that # is not the population value.
Hypothesis Testing • Statistical "noise" is the basis of inferential statistics used for hypothesis testing. • Research hypotheses (like your paper's 3 hypotheses) imply a statistical pattern. • They are judged against a "null hypothesis" stating "no pattern exists." • A population with "no pattern" (null = 0) will have sampling error "noise." [Figure: a noise distribution centered on the null value for the relationship statistic, null = 0 (no relationship), with the question: What if my sample's statistic were this large?]
Hypothesis Testing • Finally, the "noise patterns" for statistics give probability readouts. • The further a sample statistic is from the null, the less likely the sample is consistent with the "noise pattern" that a population with null characteristics would produce. [Figure: the noise distribution centered on null = 0 (no relationship); likelihood is greater near the center and less in the tails. What if my sample's statistic were this large?]
Hypothesis Testing • Inferential statistics are typically reported as responses to tests of null hypotheses—you rarely see the nulls reported in articles. • A null hypothesis typically states that a statistic (such as the relationship between two variables) is zero in the population. • The significance test reports the likelihood that the statistic in the sample could have come from a population where the statistic is zero. • Researchers typically reject the null hypothesis when the significance test shows that there is a low likelihood (less than a 5% chance) that the sample statistic could have come from a population where the statistic is zero (null value). • The likelihood is determined by location on the noise pattern.
Hypothesis Testing • Location on the noise pattern determines the likelihood that a population with the null value could produce that statistic. [Figure: the noise distribution centered on null = 0 (no relationship); the outer 2.5% in each tail together make up the 5% of sample values considered too unlikely. What if my sample's statistic were this large?]
The way scientists/statisticians interpret the weather… .05 is the magic point! Less than 5% is extremely small, low. Greater than 5% is extremely large, high. • Sig. < .05, or less than a 5% chance that it's going to rain… NO UMBRELLA. It's nice and sunny out. (.041 = Sunny!!! .010 = Sunny!!! .000 = Sunny!!!) In statistics: the sample's relationship statistic is beyond the noise, indicating a relationship exists in the population. • Sig. > .05, or greater than a 5% chance that it's going to rain… GRAB YOUR UMBRELLA, it's awful out. (.061 = Rain .240 = Rain .735 = Rain) In statistics: the sample's relationship statistic is within the noise, indicating no relationship in the population.
Data Analysis • Inferential statistics typically test a null hypothesis that there is no relationship between variables in the population. • We reject the null if our statistics show evidence that our sample’s data are unlikely to have come from a population where the null (no relationship) is true. • Rejecting the null leads us to believe that the variables are related in the population.
Significance Tests Recap: • Since numbers will vary naturally from sample to sample, we use inferential statistics to tell us how much our statistics describing the relationships between variables (association) should “jump around by chance.” • We use a term, “significance,” to refer to whether our statistics could represent real relationships in the population or whether they reflect natural “jumping around.” • Are they beyond the noise?
Significance Tests In project 2, you will use an inferential statistics technique to test the NULL version of your four hypotheses: • Inferential statistics null: there is no relationship between the variables in the population • The technique will 1) give a statistic that describes the relationship between the variables in the sample, and 2) give information on whether the null should be REJECTED or not. • Check the p-value (probability, or in SPSS, "Sig.") for the statistics to find out the likelihood that the relationship statistics could have been produced by a population where the null is true. • If your p-value is less than .05, you can reject the null because there is little chance that your sample came from a population where there is no relationship. • If your p-value is greater than .05 (such as .26), you fail to reject the null. There is a high chance that your sample came from a population where the null is true, where there is no relationship between the variables.
Significance Tests The statistical test you choose depends on your variables' levels of measurement:
Dependent Variable               Independent Variable    Statistical Test
Interval-Ratio or Dichotomous    Dichotomous             Independent Samples t-test
Interval-Ratio or Dichotomous    Nominal                 ANOVA
Interval-Ratio or Dichotomous    Interval-Ratio          Correlation
Nominal or Dichotomous           Nominal*                Cross Tabs
*An interval-ratio variable may be used as an independent variable in crosstabs if response options are meaningfully merged to form five or fewer categories.
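For reference, each of these tests can also be run with a single syntax command. A minimal sketch using hypothetical placeholder names (depvar, indepvar, var1, var2); substitute your own variables and, for the t-test, your two group codes:
* Independent samples t-test (groups coded 1 and 2 here).
T-TEST GROUPS=indepvar(1 2) /VARIABLES=depvar.
* One-way ANOVA.
ONEWAY depvar BY indepvar.
* Correlation.
CORRELATIONS /VARIABLES=var1 var2.
* Cross tabs with a chi-square test.
CROSSTABS /TABLES=depvar BY indepvar /STATISTICS=CHISQ.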
Ind. Samples T-Test • Used when an independent variable is dichotomous and the dependent variable is interval-ratio or dichotomous. • It tests whether two groups likely have the same or different means (averages) for the dependent variable in the population. For example: sex -> education Research hypothesis: Men will have higher average education than women Null hypothesis: Male mean = Female mean in Population Alternative hypothesis: Male mean ≠ Female mean in Population • If "sig." is less than .05, reject the null hypothesis in favor of the alternative. • If the male mean in the sample is higher, the research hypothesis is supported. • If the male mean in the sample is lower, the research hypothesis is NOT supported. • If "sig." is greater than .05, fail to reject the null hypothesis. The research hypothesis is not supported.
Ind. Samples T-Test: Steps in SPSS Enter SPSS and open data file (make sure variables are coded properly) • Use commands (click on menu items): • Analyze • Compare Means • Independent Samples T-Test • Highlight dependent variable and click arrow to place into "Test Variable(s):" box. • Highlight independent variable and click arrow to place into "Grouping Variable:" box. • Click on "Define Groups…" • Enter the code number for each of the two groups (e.g., "1" for men in the box next to "Group 1" and "2" for women in the box next to "Group 2"). Click "Continue." • Click "OK"
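The same test can be run as syntax. A minimal sketch, assuming the GSS variables are named sex (coded 1 for men and 2 for women, as in the Define Groups example above) and educ for years of education; check your codebook for the actual names:
* Compare mean education for men (group 1) and women (group 2).
T-TEST GROUPS=sex(1 2) /VARIABLES=educ.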
Ind. Samples T-Test Output Mean education for men and mean education for women. Probability that this sample could have come from a population where the null (that men and women have equal education) is correct. P > .05, that’s high! Between the goal posts. Do not reject the null. No support for research hypothesis. Men and women may have the same average education.
Significance Tests The statistical test you choose depends on your variables' levels of measurement:
Dependent Variable               Independent Variable    Statistical Test
Interval-Ratio or Dichotomous    Dichotomous             Independent Samples t-test
Interval-Ratio or Dichotomous    Nominal                 ANOVA
Interval-Ratio or Dichotomous    Interval-Ratio          Correlation
Nominal or Dichotomous           Nominal*                Cross Tabs
*An interval-ratio variable may be used as an independent variable in crosstabs if response options are meaningfully merged to form five or fewer categories.
ANOVA • Used when an independent variable is nominal or dichotomous and the dependent variable is interval-ratio or dichotomous. • ANOVA tests whether 3 or more groups have the same or different means (averages) for the dependent variable in the population. For example: Views of the Bible -> Lacks Confidence in Science Research hypothesis: Those believing the Bible is only the word of God will lack confidence in science more than others. Null hypothesis: Word mean = Inspired mean = Fables mean in the Population Alternative hypothesis: One of the means is different in the Population • If "sig." is less than .05, reject the null hypothesis in favor of the alternative. • If the Word mean in the sample is higher, the research hypothesis is supported. • If the Word mean in the sample is lower, the research hypothesis is NOT supported. • If "sig." is greater than .05, fail to reject the null hypothesis. The research hypothesis is not supported.
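In syntax, this ANOVA might look like the sketch below, assuming hypothetical GSS variable names bible (views of the Bible) and consci (confidence in the scientific community); check your codebook for the actual names in your file:
* One-way ANOVA of confidence in science across Bible-view groups, with group means reported.
ONEWAY consci BY bible /STATISTICS DESCRIPTIVES.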