Class 9 Basics of Quantitative Data Analysis
Class Outline • Data Files • Codebooks • Univariate Analysis • Mean • Variance • Proportions • Bivariate Analysis • Crosstabulation • Group Comparison of Means • Correlation • Stata commands • Graphs
Data • Social science datasets usually have a rectangular structure. • Columns: variables • Rows: observations • Data storage • in statistical packages: Stata (*.dta), SPSS (*.sav), SAS, R, S-Plus, etc. • in spreadsheets (*.csv, *.xls) • as raw text data • A data dictionary is needed.
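As a rough sketch of how the storage formats on this slide are read into Stata, the commands below use the auto example dataset that ships with Stata. The file name auto_copy.csv is made up for illustration and is written to the current working directory.

```stata
* Load a Stata-format (.dta) example dataset that ships with Stata
sysuse auto, clear

* Rows are observations, columns are variables
describe, short

* Write the data to a CSV file (hypothetical name), then read it back in
export delimited using "auto_copy.csv", replace
import delimited using "auto_copy.csv", clear
describe, short
```

Spreadsheet files (*.xls) can be read in a similar way with the import excel command.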
Codebooks • A codebook is a guide for locating variables and interpreting codes in the data file. It often includes the following contents: • Explanation of variable names • Code lists (variable values) • Variable frequencies • Questionnaire • Explanation of sampling and use of weights • Examples • The General Social Survey (web) • U.S. Census 2000 codebook (PDF) • Codebooks generated by Stata
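A minimal sketch of a Stata-generated codebook, assuming the auto example dataset that ships with Stata (it is not one of the datasets named on the slide):

```stata
sysuse auto, clear

* Full codebook entry for one variable: type, range, missing values, labels
codebook foreign

* One-line summary for every variable in the file
codebook, compact

* Show the code list (value label) attached to foreign
label list origin
```

In this dataset the value label attached to foreign is named origin; in other files, describe shows which value label belongs to which variable.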
Quantitative Analysis • Univariate analysis involves a single variable. • Bivariate analysis involves two variables simultaneously. • Example: gender difference in admission rate. • Multivariate analysis involves more than two variables simultaneously. • Example: gender difference in admission rate by field of studies.
Statistics • A statistic is a numerical quantity calculated from a sample. • Measures of central tendency • Mean and median, proportion • Measures of variability • Range, interquartile range, variance, and standard deviation • Measures of association • Correlation and Chi-square statistic
Categorical and Continuous Variables • Categorical variables have a limited number of possible values. • Continuous variables can potentially take an infinite number of values. • Nominal measures: always treat them as categorical variables. • Ordinal measures: intrinsically categorical. In practice, we sometimes treat them as continuous variables by assuming equal distance between adjacent categories. • Interval and ratio measures: always treat them as continuous. • Use different statistics and graphs when working with categorical and continuous variables.
Levels of Measurement and Choice of Statistics: An Example • Income can be measured at different levels of detail. • Nominal measure • 0: below poverty • 1: above poverty • Ordinal measure • 1: 0~15k • 2: 16~30k • 3: 31~50k • 4: 51~100k • 5: 101k+ • Ratio measure. “Round your income to the nearest hundred” • To summarize income, we use proportions when it is measured at the nominal or ordinal level, and the mean and variance when it is measured at the ratio level.
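The income example can be sketched in Stata as follows. The ten income values are invented for illustration, and the 15,000 poverty cutoff and the exact bracket boundaries are assumptions, since the slide does not state them.

```stata
* Made-up incomes, rounded to the nearest hundred (ratio measure)
clear
input income
8000
12000
22000
28000
41000
47000
63000
95000
120000
250000
end

* Nominal measure: 0 = below poverty, 1 = above (cutoff of 15000 is assumed)
generate byte abovepov = (income > 15000) if !missing(income)

* Ordinal measure: brackets 1 to 5, following the slide's categories
generate byte incomecat = 1 + (income > 15000) + (income > 30000) ///
    + (income > 50000) + (income > 100000) if !missing(income)

* Proportions for the nominal and ordinal versions
tabulate abovepov
tabulate incomecat

* Mean and spread for the ratio version
summarize income
```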
Univariate Statistics: Continuous Variables • Statistics to use with continuous variables • Measures of central tendency • Mean: arithmetic average • Median: 50th percentile • Proportion • Measures of variability • Range: (min, max) • Interquartile range: (25th percentile, 75th percentile) • Variance: average squared deviation of the observations from the mean • Standard deviation: square root of the variance
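The formulas for this slide are not shown here. Assuming the usual sample definitions, with n observations x_1, ..., x_n and an n - 1 denominator (the convention Stata's summarize command also uses), they can be written as:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\qquad
s = \sqrt{s^2}
```

Here \bar{x} is the mean, s^2 the variance, and s the standard deviation.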
Mean vs. Median • The mean and median differ when the distribution of the variable is skewed. For example, depression scores are right-skewed. In this case, the mean is greater than the median. Income is another example of a right-skewed distribution. • When the distribution is symmetric, the mean and median are the same. Data source: the Wisconsin Longitudinal Study
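To see the gap between mean and median in practice, here is a small sketch using the price variable in Stata's auto example dataset, which is right-skewed. This is a stand-in; the depression scores from the Wisconsin Longitudinal Study are not used here.

```stata
sysuse auto, clear

* detail adds percentiles (the 50% row is the median), variance, and skewness
summarize price, detail

* Difference between mean and median, from the results summarize stored
display "mean - median = " r(mean) - r(p50)

* Mean and median side by side
tabstat price, statistics(mean median)
```

A positive difference, together with positive skewness, is the pattern the slide describes for right-skewed variables.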
Stata Commands and Graphs • Use “summarize varname” to find the mean and standard deviation of a continuous variable; add the detail option to also report the variance. • Use a boxplot or histogram to display the distribution of a continuous variable. • Examples
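A short sketch of these commands, assuming the mpg variable from Stata's auto example dataset as the continuous variable:

```stata
sysuse auto, clear

* Number of observations, mean, standard deviation, minimum, and maximum
summarize mpg

* summarize also stores the variance, even though it does not print it
display "variance of mpg = " r(Var)

* Histogram with frequencies on the vertical axis
histogram mpg, frequency name(hist_mpg, replace)

* Boxplot of the same variable
graph box mpg, name(box_mpg, replace)
```

The name() option keeps both graphs available instead of letting the second one replace the first.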
Univariate Statistics: Categorical Variables • Statistics to use with categorical variables: • Proportions • When the categorical variable has only 2 categories, coded 0 and 1, we can calculate the mean to find the proportion of cases in category 1.
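The 0/1 shortcut can be checked with a quick sketch, assuming the foreign variable in Stata's auto example dataset (coded 0 for domestic and 1 for foreign cars):

```stata
sysuse auto, clear

* Frequencies and percentages of each category
tabulate foreign

* The mean of a 0/1 variable equals the proportion of cases coded 1
summarize foreign
```

The mean reported by summarize, multiplied by 100, should match the percentage that tabulate shows for category 1.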
Stata Commands and Graphs • Use “tabulate varname” to find the frequencies and proportions of a categorical variable. • Use a pie chart or bar chart to display the distribution of a categorical variable. • Examples
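As an illustration of these commands, the sketch below uses rep78 (a five-category repair record) from Stata's auto example dataset as the categorical variable:

```stata
sysuse auto, clear

* Frequencies and percentages; missing values are left out by default
tabulate rep78

* Same table with missing values shown as their own category
tabulate rep78, missing

* Pie chart: one slice per category
graph pie, over(rep78) name(pie_rep78, replace)

* Bar chart: number of cars (nonmissing mpg values) in each category
graph bar (count) mpg, over(rep78) name(bar_rep78, replace)
```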
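Bivariate Analysis • Analysis involving two variables simultaneously. • Example: • Gender • Attitudes toward premarital sex • Choose the appropriate bivariate analysis: • Two categorical variables: crosstabulation • One categorical and one continuous variable: group comparison of means • Two continuous variables: correlation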
Crosstabulations • It is customary to put the independent variable as the row variable and the dependent variable as the column variable. If the table is set up like this, calculate the row percentages, not the column percentages.

Key: frequency, row percentage, column percentage

RESPONDENT |          IS PREMARITAL SEX WRONG?
     S SEX | ALWAYS WR  ALMOST AL  WRONG ONL  NOT WRONG |     Total
-----------+--------------------------------------------+----------
      MALE |       374        153        329        723 |     1,579
           |     23.69       9.69      20.84      45.79 |    100.00
           |     35.18      35.75      45.82      48.23 |     42.58
-----------+--------------------------------------------+----------
    FEMALE |       689        275        389        776 |     2,129
           |     32.36      12.92      18.27      36.45 |    100.00
           |     64.82      64.25      54.18      51.77 |     57.42
-----------+--------------------------------------------+----------
     Total |     1,063        428        718      1,499 |     3,708
           |     28.67      11.54      19.36      40.43 |    100.00
           |    100.00     100.00     100.00     100.00 |    100.00
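A table laid out this way can be produced with tabulate and two variable names. The sketch below assumes Stata's auto example dataset rather than the survey data in the table, with foreign as the row (independent) variable and rep78 as the column (dependent) variable.

```stata
sysuse auto, clear

* Frequencies plus row percentages: each row sums to 100
tabulate foreign rep78, row

* Row percentages only, with a chi-square test of association
tabulate foreign rep78, row nofreq chi2
```

Adding the column option as well would print all three numbers per cell, as in the table above.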
Group Comparison of Means • Example: sex difference in years of schooling • Sometimes it is useful to “collapse” (i.e., combine) categories.
Group Comparison of Means • Bar Charts • graph bar (mean) educ, over(income) • graph bar (mean) educ, over(newincome)
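The two slides above can be sketched with Stata's auto example dataset, since the educ, income, and newincome variables are not available here. In this sketch mpg stands in for the continuous variable and rep78 for the grouping variable; rep3 and its labels are invented names for the collapsed version.

```stata
sysuse auto, clear

* Mean, standard deviation, and count of mpg within each group
tabulate foreign, summarize(mpg)

* Collapse the five repair-record categories into three (labels are made up)
recode rep78 (1/2 = 1 "poor") (3 = 2 "average") (4/5 = 3 "good"), generate(rep3)

* Check the recoding against the original variable
tabulate rep78 rep3, missing

* Bar charts of group means, before and after collapsing
graph bar (mean) mpg, over(rep78) name(bar_five, replace)
graph bar (mean) mpg, over(rep3) name(bar_three, replace)
```

Collapsing is most useful when some categories contain very few cases, because means based on a handful of observations are unstable.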
Scatter Plots and Correlations

. graph matrix popgrowth lexp gnppc
. correlate popgrowth lexp gnppc
(obs=63)

             | popgro~h     lexp    gnppc
-------------+---------------------------
   popgrowth |   1.0000
        lexp |  -0.4215   1.0000
       gnppc |  -0.3580   0.7182   1.0000
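The variable names in this output match the lifeexp example dataset that ships with Stata. Assuming that is the source, the output can be reproduced, and extended with pairwise correlations, roughly as follows:

```stata
sysuse lifeexp, clear

* Scatter plot matrix; half shows only the lower triangle
graph matrix popgrowth lexp gnppc, half

* Correlation matrix using only observations complete on all three variables
correlate popgrowth lexp gnppc

* Pairwise correlations, with the number of observations and p-value per pair
pwcorr popgrowth lexp gnppc, obs sig
```

correlate drops any observation with a missing value on any listed variable, which is why it reports a single number of observations, while pwcorr uses all observations available for each pair.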