Advanced Statistics for Linguistics Students
Syllabus • Data Screening • Data cleaning • Data transformation • Boxplots • Other charts and graphs • Analysis of Variance • Analysis of covariance • Two-way ANOVA • MANOVA
Syllabus • Regression • Simple linear regression • Multiple linear regression • Logistic regression • Data reduction • Exploratory Factor analysis • Structural Equation Modeling • Confirmatory factor analysis • Structural Equation Modeling
Syllabus • Reliability and validity • Reliability • Validity • Qualitative analysis • Item response theory • Classical test theory • Item response theory • Rasch analysis
Syllabus • Assignments and Grading: • Readings • Homework • References: • Relevant papers • Website: http://clal.gdufs.edu.cn/personal/statistics/phd/
Prerequisites • Variables • t-test • One-way ANOVA • Correlation • Excel, SPSS, Statistica
Session 1 Data Screening
Some statistical considerations and precautions • Do the data accurately reflect the responses made by the participants of my study? • Are all the data in place and accounted for, or are some of the data absent or missing? • Is there a pattern to the missing data? • Are there any unusual or extreme responses present in the data set that may distort my understanding of the phenomena under study? • Do these data meet the statistical assumptions that underlie the statistical technique I will be using? • What can I do if some of the statistical assumptions turn out to be violated?
Code and value cleaning • The data cleaning process ensures that once a given data set is in hand, a verification procedure is followed that checks for the appropriateness of numerical codes for the values of each variable under study. • Whether each variable contains only legitimate numerical codes or values • Whether these legitimate codes seem reasonable
Distribution diagnosis • Data screening • Frequency tables • Histograms and bar graphs • Skewness and kurtosis • Close to 0 • Conservative: ±0.5 • Liberal: ±1 • Heuristic (SPSS): 2 x standard error
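The skewness and kurtosis cut-offs above can be checked with a short script. This is a minimal sketch using only the Python standard library; the small-sample adjustment formulas are assumed to correspond to what SPSS reports, and the `scores` data are made up for illustration.

```python
import math

def skew_kurtosis(xs):
    """Return (skewness, SE of skewness, excess kurtosis, SE of kurtosis).

    Needs at least 4 observations.
    """
    n = len(xs)
    mean = sum(xs) / n
    # Central moments (population form).
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3.0
    # Small-sample adjustments (assumption: these match SPSS-style output).
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, se_skew, kurt, se_kurt

def exceeds_heuristic(stat, se):
    """Heuristic from the slide: flag a statistic larger than 2 x its standard error."""
    return abs(stat) > 2 * se

# Hypothetical, positively skewed scores.
scores = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 20]
skew, se_skew, kurt, se_kurt = skew_kurtosis(scores)
print("skewness flagged:", exceeds_heuristic(skew, se_skew))
print("within conservative +/-0.5:", abs(skew) <= 0.5)
print("within liberal +/-1:", abs(skew) <= 1.0)
```

The three criteria can disagree on borderline variables, so it is worth reporting which one was used.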
Distribution diagnosis • Data screening • Stem-and-leaf plots • Weight cases • Analyze → Explore • Box plots
Distribution diagnosis • Data screening • Scatterplot matrices
Dealing with missing values • Missing value patterns • Random patterns of missing data • Looking for patterns • Variables containing missing data on 5% or fewer of the cases can be ignored.
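The 5% guideline can be applied by computing the proportion of missing cases per variable. A minimal sketch, assuming the data are held as a list of dictionaries with `None` marking a missing value; the variable names `score` and `satisfaction` are hypothetical.

```python
def missing_rates(rows):
    """Proportion of cases with a missing value (None) for each variable."""
    n = len(rows)
    return {v: sum(r[v] is None for r in rows) / n for v in rows[0]}

def at_or_below_threshold(rates, threshold=0.05):
    """True for variables whose missing rate is 5% or less (the slide's guideline)."""
    return {v: rate <= threshold for v, rate in rates.items()}

# Hypothetical data set with 20 cases.
rows = [
    {
        "score": None if i == 5 else 50 + i,
        "satisfaction": None if i in (3, 7, 11) else 4,
    }
    for i in range(20)
]

rates = missing_rates(rows)
print(rates)
print(at_or_below_threshold(rates))
```

A low rate says nothing about whether the missing values are random, so the pattern still needs to be inspected.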
Dealing with missing values • Missing value patterns • Methods of handling missing data • Listwise deletion • Pairwise deletion • Imputation procedures • Mean substitution • Multiple regression imputation • Expectation maximization imputation (Missing value analysis in SPSS) • E step: calculates expected values of the missing data from the current parameter estimates • M step: calculates maximum likelihood estimates of the parameters • Example
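Two of the simpler imputation procedures can be sketched in a few lines. This is an illustration only, not the SPSS procedure: it assumes a single complete predictor `x`, a partly missing variable `y` with `None` for missing values, and made-up data. Expectation maximization is not shown.

```python
from statistics import mean

def mean_imputation(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def regression_imputation(x, y):
    """Replace missing y values with predictions from a least-squares line
    fitted on the complete (x, y) pairs. Assumes x has no missing values."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mx = mean([a for a, _ in pairs])
    my = mean([b for _, b in pairs])
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]

# Hypothetical data: y follows x closely and the last y is missing.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, None]
print(mean_imputation(y))           # fills with the observed mean
print(regression_imputation(x, y))  # fills with the value predicted from x
```

In this example the two methods give very different replacements, because filling with the mean ignores the relationship between `x` and `y`.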
Dealing with missing values • Missing value patterns • Methods of handling missing data • Recommendations: • Compare cases with and without missing values on variables of interest using an independent-samples t test. • Compare the results of your statistical analysis on the imputed data with the results from complete cases only. If no differences emerge between the ‘complete’ and ‘imputed’ data sets, you can have confidence that your missing value interventions reflect statistical reality. • Use listwise case deletion • Use regression imputation procedures • Use SPSS ‘Missing Values Analysis’
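The first recommendation can be sketched as follows: split the cases by whether a second variable is missing, then compare the two groups on a variable of interest. This is a minimal sketch with made-up data, using the pooled-variance (Student) t statistic; it returns only t and the degrees of freedom, so the p value still has to come from a t table or a statistics package.

```python
import math
from statistics import mean, variance

def missingness_t_test(outcome, other):
    """Compare `outcome` between cases where `other` is missing (None)
    and cases where it is present. Returns (t, degrees of freedom).
    Assumes `outcome` itself is complete and each group has >= 2 cases."""
    missing = [o for o, v in zip(outcome, other) if v is None]
    present = [o for o, v in zip(outcome, other) if v is not None]
    n1, n2 = len(missing), len(present)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(missing) + (n2 - 1) * variance(present)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(missing) - mean(present)) / se
    return t, df

# Hypothetical data: low scorers tend to skip the satisfaction item.
score = [10, 12, 11, 13, 20, 22, 21, 23]
satisfaction = [None, None, None, None, 4, 5, 3, 4]
t, df = missingness_t_test(score, satisfaction)
print(f"t({df}) = {t:.2f}")
```

A large absolute t suggests that the missing values are related to the variable of interest, that is, not missing at random.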
Regression • Figure: Hypothetical data showing the relationship between SAT scores and GPA, with a regression line drawn through the data points. The regression line defines a precise, one-to-one relationship between each X value (SAT score) and its corresponding predicted Y value (grade-point average, GPA).
Outliers • Causes of outliers • Data entry errors or improper attribute coding • A function of extraordinary events or unusual circumstances. Use this question to judge: “Does this outlier represent my sample?” • No explanation. Good candidate for deletion. • Pattern of combination of values on several variables, e.g., unusual combined patterns of age, gender, and number of arrests.
Outliers • Detection of univariate outliers • Explore descriptives • If outliers are few (less than 1% or 2% of n) and not very extreme, they are probably best left alone. • Detection of multivariate outliers • Scatterplots • Mahalanobis distance
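Both detection approaches can be sketched with the standard library. The univariate check uses the box-plot rule (values beyond 1.5 × IQR from the quartiles), with quartiles from `statistics.quantiles`, whose default method may differ slightly from the hinges SPSS uses. The multivariate check computes squared Mahalanobis distances for two variables only. The data are made up.

```python
from statistics import mean, quantiles

def boxplot_outliers(xs, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid,
    using the sample covariance matrix (two variables only)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2 x 2 covariance matrix
    return [
        ((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy + (y - my) ** 2 * sxx) / det
        for x, y in zip(xs, ys)
    ]

# Univariate: one extreme value.
print(boxplot_outliers([2, 3, 3, 4, 4, 5, 5, 6, 30]))

# Multivariate: the last case is ordinary on each variable alone,
# but its combination of values breaks the overall pattern.
x = [1, 2, 3, 4, 5, 6, 2]
y = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 5.5]
d2 = mahalanobis_sq(x, y)
print("most unusual case:", d2.index(max(d2)))
```

The last case would not be flagged by either univariate check, which is why multivariate outliers need a distance measure that takes the correlation between variables into account.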
Multivariate Statistical Assumptions • Normality • Statistical approach • Explore • Graphical approach • Linearity • Variables in the analysis are related to each other in a linear manner (MANOVA, factor analysis) • Scatterplots • Regression analysis (residuals) • Homoscedasticity • Equal variance – equal levels of variability • ANOVA – homogeneity of variance – Levene’s test • MANOVA – Box’s M
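To make the homogeneity-of-variance check concrete, Levene's statistic can be computed by hand. This is a minimal sketch of the mean-centred version (the median-centred variant is usually called the Brown-Forsythe test); the group names and data are hypothetical, and the resulting W still has to be compared with an F distribution on (k − 1, N − k) degrees of freedom to obtain a p value.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W: a one-way ANOVA F ratio computed on the absolute
    deviations of each score from its own group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations from each group's mean.
    z = [[abs(v - mean(g)) for v in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical scores: similar means, very different spread.
public = [70, 72, 74, 76, 78]
private = [60, 70, 75, 80, 90]
print(f"Levene W = {levene_w(public, private):.3f}")
```

Larger values of W indicate that the groups differ more in variability than would be expected by chance.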
Data transformation • Use Excel and SPSS • Square root • Logarithm • Inverse • Square of X • ‘double-edged sword’ • Can significantly improve the precision of a multivariate analysis • Can pose a formidable data interpretation problem
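The effect of these transformations on a skewed variable can be checked directly. A minimal sketch with made-up, positively skewed data: square root, logarithm and inverse are the usual choices for positive skew, while squaring is typically tried for negative skew. The logarithm and inverse assume strictly positive values.

```python
import math

def skewness(xs):
    """Moment-based skewness (0 for a symmetric distribution)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def transform(xs, kind):
    """Apply one of the transformations listed on the slide."""
    if kind == "sqrt":
        return [math.sqrt(x) for x in xs]
    if kind == "log":
        return [math.log10(x) for x in xs]
    if kind == "inverse":
        return [1 / x for x in xs]
    if kind == "square":
        return [x ** 2 for x in xs]
    raise ValueError(f"unknown transformation: {kind}")

# Hypothetical positively skewed variable.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
print(f"raw : {skewness(raw):.2f}")
for kind in ("sqrt", "log"):
    print(f"{kind:4}: {skewness(transform(raw, kind)):.2f}")
```

The interpretation cost mentioned on the slide shows up here too: after a log transformation, any reported mean or coefficient is in log units, not in the original scale.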
Homework • Scores, Student Satisfaction, and Type of School • This study was conducted to assess whether scores and student satisfaction differ between public and private schools. • Use the SPSS data file to answer the following questions: • Identify the independent variable. Identify the dependent variable(s). • Are there any missing values for any of the variables? If there are, what do you recommend doing to address this issue? • Are there any outliers in this data set? If outliers are present, what is your recommendation? • Check the independent and dependent variables for violations of statistical assumptions. If there are violations, what do you recommend? • Write a sample results section discussing your data screening activity.