430 likes | 448 Views
CPSY 501: Lecture 3 – Sept. 19. Please open SPSS, and load up the 03-SPSSExam.sav data set: myCourses or http://cpsy501.seanho.com Hermeneutics of Data Preparation & SPSS tools (Reminder : Field text references old SPSS v.13) Creating “ secondary ” (derived) Variables
E N D
CPSY 501: Lecture 3 – Sept. 19 • Please open SPSS, and load up the 03-SPSSExam.sav data set: myCourses or http://cpsy501.seanho.com • Hermeneutics of Data Preparation & SPSS tools (Reminder: Field text references old SPSS v.13) • Creating “secondary” (derived) Variables • Describing a data set & Error checking • Missing Data & Outliers • Assumptions of Parametricity • “Informationally Adequate” Statistics; Project
Research Design & Planning Research Question Data Preparation & Exploration Data Gathering Analysis & Interpretation
Overview of Data Exploration and Preparation • Working with real data, far more time is spent examining and preparing the data than running the test statistics. • Failure to identify and correct problems will prevent the tests from working properly, or lead to misleading results that do not reflect real relationships / effects. • Stages: • Create / fix variables • Deal with missing data & outliers • Assess assumptions of tests
Plan Analysis Exploration Stage Clean / Fix Data Hermeneutics of Data Exploration and Preparation • The relationship between exploring, preparing and analyzing data is reciprocal and hermeneutic. • Each stage of exploration can precipitate changes to the data set, and possible re-evaluation of plans for analysis, resulting in the following circular process:
CPSY 501: Lecture 3 • Hermeneutics of Data Preparation & SPSS tools • Creating “Secondary” (derived) Variables • Describing a data set & Error checking • Missing Data & Outliers • Assumptions of Parametricity • “Informationally Adequate” Statistics
Generating “Secondary” Variables What are some of the reasons for generating secondary, “derived” variables in a data set? • To obtain “total” or “subscale” scores from test items that have been entered into SPSS (e.g., interest tests) • To create grouping variables (e.g., age groups by year of birth) or to alter grouping categories • To correct for problems with the data(e.g., normality transformations, missing data) • In SPSS you can create new variables from current variables through the Compute Variable and Recode into… options in the Transform menu
SPSS: Compute Variable • Transform > Compute Variable • Allows for all kinds of transformations and computations from existing variables into new variables • Can combine information from multiple old variables into a single new variable (but make sure you address missing scores first) • Can compute for specific sub-sets of a data set using the “if” option
Application Examples • Demonstration data set: 03-SPSSExam • Combining marks: “Average assignment mark” • Transform Compute Variable Mean • Mean vs. Sum: • Generally, means are easier to interpret: does not depend on number of assignments
Recode Options • Recoding can be done into the same variable, or into a new column • I generally recommend recode into new: • Transform Recode into different variables • Specify what variable you want to recode • Specify old and new values • We can recode specific parts of a variable using the “if” option
Example: Coding for Ethnicity • Generally, researchers in Canada draw upon categories used by StatsCan and information on generational status • Once data gathering has been completed and demographic description is conducted, then • That sample can be recoded for information on ethnic, cultural, linguistic, and generational status
CPSY 501: Lecture 3 • Hermeneutics of Data Preparation & SPSS tools • Creating “Secondary” (derived) Variables • Describing a data set & Error checking • Missing Data & Outliers • Assumptions of Parametricity • “Informationally Adequate” Statistics
Data Exploration • Purposes of data exploration: • Get overall sense of what is happening in the data-set • Identify problems with the data (for subsequent fixing) • (in final rounds)Report & describe basic characteristics of the data and sample (following APA guides). • Basic explorations can be aided through these several sub-menus of SPSS, including: • Analyze Descriptive Statistics Descriptives, or Frequencies, or P-P plots, etc.Explore has more stats, plots • Graphs Legacy Dialogues Boxplot, or Error Bar, or Histogram, or Scatterplot, etc.
Possible Signs of Data Entry Errors • Obtained scores that are outside of what is possible • Unexplained gaps in the frequency output / histogram, or odd patterns in boxplots • Cases on your boxplots or histograms that do not fit with the rest of the sample • Standard deviations that are much larger than anticipated • Means that are the opposite (or just very different) from what would be expected • For possible errors, compare the SPSS file against your original data source
CPSY 501: Lecture 3 • Hermeneutics of Data Preparation & SPSS tools • Creating “Secondary” (derived) Variables • Describing a data set & Error checking • Missing Data & Outliers • Assumptions of Parametricity • “Informationally Adequate” Statistics
Missing Data • What are some of the reasons for missing data to occur, in real life research? • How to cope with missing data depends on the nature of what is missing in which variables: • Any systematic gaps indicate a real problem • If you have random gaps, you may be okay as long as the gaps constitute less than 5% of the data • To check for missing data: • (a)code it as a separate variable, then • (b)compare the responses on the remaining variables for the “missing” vs. “not missing” groups
Missing Data (cont.) • For random missing, strategies for dealing with missing data include: Go back and obtain the bits that were missing(not always practical or ethical or effective) Eliminate the variables with most of the problems(unless they are central to the study) Drop cases/people with too many missing data(may result in insufficient sample size) Estimate and replace missing values(complicated, and there are problems with simple imputation methods, like mean substitution)
Missing Data (cont.) • For non-randommissing, strategies include: • Adding the missing vs. non-missing variable as a coded variable used in the analysis (may or may not help clarify the import of omissions for the findings) • Return to participants & collect the missing data (no guarantee of better response rates the second time) • Remove the kind of participants that choose not to respond from your sample (reduces generalizability of the results; may end up altering research question) • Start over again, changing the design and data collection procedures to ensure that systematic patterns of missing data do not re-occur
Univariate Outliers • Definition: Patterns that reflect persons from a different population from the rest of the sample • Note that multivariate outliers sometimes also exist • Univariate outliers can usually be identified by examining boxplots • From (a) knowledge of the literature and(b) extremity of an outlying score, decide • (1) whether to retain or exclude the outlying cases • (2) whether to use robust strategies for analysis
Univariate Outliers (cont.) When you have identified a potential outlier: • If you decide that it is a valid case (despite its score), retain it in the data-set & check for impact. • Otherwise, remove that entire case from the data set (other strategies exist, but are complicated and used only when necessary) • Process: • Check for univariate outliers ONCE, then • Check for multivariate outliers ONCE • Impact can be checked by comparing results with and without extreme scores. • Excessiveelimination of outliers may result in a very distorted data-set!
Multivariate Outliers • Scatterplots are useful ways to check for outliers • Graphs Legacy Dialogues Scatter/Dot • Choose “matrix scatter” to examine several quantitative variables at once. • Use the “simple scatter” option to “zoom in” on a variable of interest.
CPSY 501: Lecture 3 • Hermeneutics of Data Preparation & SPSS tools • Creating “Secondary” (derived) Variables • Describing a data set & Error checking • Missing Data & Outliers • Assumptions of Parametricity • “Informationally Adequate” Statistics
Examining Assumptions of the Chosen Statistical Test • Most statistical procedures assume that data have certain characteristics. Always check assumptions before using such a test. • Procedures based on the General Linear Model all assume that data are parametric. • This includes parametric r, t-tests, multiple regression, the ANOVA family, factor analysis, multi-level modeling, etc.! • If the data do not meet the assumptions of a procedure, it is necessary to • (a) use a different procedure, and/or • (b) “clean up” the data-set so that the assumptions will be met.
Assumptions of “Parametricity” • Data must be at interval / ratio level of measurement (often applies to DV / Outcome variables at least) • Independence:Data from one case should not be systematically linked to data from another case • Homogeneity of variance:Variability in scores should be similar across the range of scores for other variables & across different participant grps • Normally distributed:Scores in the population from which the data are obtained should be distributed along a normal curve (e.g., not skewed or kurtotic)
Interval / Ratio Data • Evaluation: determine each variable’s “level of measurement” • Solution: use alternative, non-parametric tests … Pearson rsSpearman’s corr. (rs), Kendall’s Tau (τ);cf. also Biserial / Point Biserial correlations Multiple Regression logistic regression, loglinear an. t–tests / ANOVAs Mann-Whitney U test, Wilcoxon Signed-Rank, Chi-square
Independence of Scores • Evaluation: no simple numeric procedure is sufficient; use logic to examine whether any cases are dependent on any others (same family, etc.) • Solutions: (a)eliminate the offending cases/var’s • (b) Use statistical tests that can handle “dependent” relations in data: • Repeated-measures ANOVA, multi-level (hierarchical) modelling, etc. … • (c) Break data into independent subsamples • E.g., separate groups by gender
Homogeneity of Variance • Evaluation: varies according to specific test. At minimum, check the variance ratio: • 4:1 ratio is one guide to use as a “max” ratio. • Levene’s test is sometimes helpful: • Obtain variance scores for each group: • Analyze Descriptive Statistics Explore • Choose group (factor) and quantitative variables • Plots Spread vs Level Untransformed • If p-value is significant, then assumption may be violated. (non-significance = good) • Use the score based on the mean, when reporting the results of your Levene’s test.
Homogeneity of Variance(cont.) • Variance ratio: • Obtain variance scores for each group: • Analyze Descriptive Statistics Explore Statistics… check Descriptives • Calculate score:highest variance group ÷ lowest variance group • If that score < 4, then you are safe; otherwise this assumption may be violated (depending on N). • Solution: (a) Use a more conservative alpha level (b) Use non-parametric tests instead
Normality • Evaluation: • Kolmogorov-Smirnov & Shapiro-Wilk tests • Descriptive and normality graphs, • Skewness / kurtosis results (use confidence intervals) • Analyze Descriptive Statistics ExplorePlots… check“Normality plots with tests” • If K-S & S-W p -value is significant, then assumption may be violated (non-significance = good “fit”) • Examine graphs & skewness/kurtosis & P-P plots to see what the violation may be. • Solutions: (a) perform appropriate transformation (b) Change specific problematic scores (not usually recommended!)
Normality(cf. Beherns, 1997) • Transforming to reshape non-normal disributions: • e.g., Square-root (sqrt) for minor deviations • Logarithmic (lg10) for medium deviations • Reciprocal (1/score) for severe deviations Note: if tail is to the right, data needs to be reflected first (Transform Compute Variable [maximum score – variable name]) • SPSS:TransformCompute VariableApply an appropriate transformation:e.g., square-root (from “Arithmetic” menu) or 1 / [variable name] …
Practice Time! • Practice assessing (and correcting) for violations of assumptions of parametricity for yourself! • E.g., try this: assume you are studying the “effect” of gender (IV) on number of lectures or ADD symptoms (DV) attended. • Data sets: 03-SPSSExam, AttnDefDis (from HW#1) • Interval/Ratio? Independence? Normality? Homogeneity of variances?
Order of Data Preparation • Examine the initial patterns of data; identify & correct data entry errors; use lots of graphs!!! • Explore & deal with missing data; create test scores, demographics & other secondary variables • Identify and choose how to deal with univariate and multivariate outliers • Evaluate assumptions of your chosen procedure (e.g., parametricity) and deal with any violations of assumptions • Obtain descriptive information for the final data-set, and proceed with the analysis
CPSY 501: Lecture 3 • Hermeneutics of Data Preparation & SPSS tools • Creating “Secondary” (derived) Variables • Describing a data set & Error checking • Missing Data & Outliers • Assumptions of Parametricity • “Informationally Adequate” Statistics
APA Standards for “Informationally Adequate” Statistical Reporting • Sample size (total N and ni for each sub-group) • Mean and SD scores for outcome variables, globally and for each sub-group • Statistical significance (p –values, “exact”) • Measures of effect size • Evidence of sufficient statistical power • Other procedure-specific information (see pp. 23-26 of APA publication manual for some details)
Further Reading -- Articles • Wintre & al. (2000). Generational status & ethnicity in Canada • Tabachnick & al. (2007). On cleaning data • Schafer & al. (2002). Missing data primer
APPENDICES Valuable background for research in Canada Valuable data archiving sources Available on-line
Example: StatsCan downloads • Selected Demographic and Cultural Characteristics (102), Visible Minority Groups (15), Age Groups (6) and Sex (3) for Population, for Canada, Provinces, Territories and Census Metropolitan Areas, 2001 Census - 20% Sample Data • http://www.statcan.ca/bsolc/english/bsolc?catno=97F0010X2001044 • Selected Demographic and Cultural Characteristics (105), Selected Ethnic Groups (100), Age Groups (6), Sex (3) and Single and Multiple Ethnic Origin Responses (3) for Population, for Canada, Provinces, Territories and Census Metropolitan Areas, 2001 Census - 20% Sample Data • http://www.statcan.ca/bsolc/english/bsolc?catno=97F0010X2001040 • Place of Birth of Father (35), Place of Birth of Mother (35) and Generation Status (4) for the Population 15 Years and Over of Canada, Provinces, Territories, Census Metropolitan Areas and Census Agglomerations, 2006 Census - 20% Sample Data • http://www.statcan.ca/bsolc/english/bsolc?catno=97-557-X2006009
Example: StatsCan definitions Source: Ethnic Diversity Survey -Methodology and Data Quality • “questions on the birthplace of respondents and their parents were used to establish the respondent’s generational status. The first generation includes respondents born outside Canada. The second generation includes respondents born in Canada with at least one parent born outside Canada. The third-plus generation includes respondents born in Canada to two Canadian-born parents.”
StatsCan definitions (cont.) • “Responses to the ethnic origin question were divided up to form the two main categories of interest: CBFA+ (Canadian or British or French or Americans or Australians and/or New Zealanders) and Non-CBFA+ (all other responses containing at least one origin other than CBFA+). The non-CBFA+ category was divided into European origins (for example, German, Italian, Dutch, Portuguese) and non-European (for example, Chinese, Jamaican, Lebanese, Iranian).”
StatsCan definitions (cont.) • “CBFA+ • Canadian only – Generation 1 and 2 • Canadian only – Generation 3 and more • Canadian with BFA+ – Generations 1 and 2 • Canadian with BFA+ – Generations 3 and more • BFA+ – Generation 1 and 2 • BFA+ – Generation 3 and more
StatsCan definitions (cont.) • Non-CBFA+ • Other Europeans with Canadian – Generation 1 and 2 • Other Europeans with Canadian – Generation 3 and more • Other Europeans – Generation 1 • Other Europeans – Generation 2 • Other Europeans – Generation 3 and more • Other non-Europeans with Canadian – All generations • Other non-Europeans – Generation 1 • Other non-Europeans – Generation 2 and more
ICPSR • “Established in 1962, ICPSR is the world's largest archive of digital social science data.” • http://www.icpsr.umich.edu/ICPSR/ • E.g., National Institute of Mental Health Collaborative Psychiatric Epidemiology Surveys (CPES)