This review covers topics such as sampling, measurement, measures of central tendency and dispersion, the normal distribution, and correlation in statistics.
2017 Statistics Review John Glenn College of Public Affairs Aditi Vaishali Thapar thapar.9@osu.edu
Outline • Sampling • Sampling Terms: An Example • Measurement • Descriptive Statistics: • Measures of Central Tendency • Measures of Dispersion • The Normal Distribution • Inferential Statistics: • Correlation vs. Causation • Hypothesis testing • P-values • Standard Error • Confidence Intervals and Z-Scores
Sampling Population vs. Sample • Population • The entire group of people or things about which we want information • Sample • Unlikely that we will be able to collect data for the entire population • Representative portion of population about which data is collected.
Sampling Statistics vs. Parameters • Parameters • Summarise data for an entire population • Statistics • Summarise data for a sample • Unit of Analysis: Entity that is being analyzed in a study • Variable: A characteristic of the unit of analysis Image source: https://www.cliffsnotes.com/study-guides/statistics/sampling/populations-samples-parameters-and-statistics
Sampling Terms: Example What is the demographic information for students who attend statistics boot camp? • Population: • Sample: • Unit of Analysis: • Variables: • Parameter: • Statistics:
Sampling Terms: Example What is the demographic information for students who attend statistics boot camp? • Population: • All students who attend statistics boot camp • Sample: • 20 randomly selected students at statistics boot camp • Unit of Analysis: • The individual (i.e. student) • Variables: • Age, gender, income, race, etc. • Parameter: • Average age of all students at statistics boot camp, etc. • Statistics: • Average age of the 20 randomly selected students at boot camp, etc.
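The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population below is made up for illustration (200 hypothetical student ages); only the sample size of 20 comes from the slide.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 200 boot camp students (made-up values).
population_ages = [random.randint(21, 45) for _ in range(200)]

# Parameter: summarises the entire population.
parameter_mean_age = statistics.mean(population_ages)

# Sample: 20 students drawn at random, without replacement.
sample_ages = random.sample(population_ages, 20)

# Statistic: summarises only the sample and serves as an estimate of the parameter.
statistic_mean_age = statistics.mean(sample_ages)

print(f"Parameter (mean age, all students): {parameter_mean_age:.2f}")
print(f"Statistic (mean age, 20 sampled):   {statistic_mean_age:.2f}")
```

The two numbers will usually be close but not identical; that gap is what the later slides on standard error and confidence intervals quantify.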
Measurement • Nominal • Numerical values just "name" the attribute uniquely • No ordering of the cases is implied • Example: Numbers on football/basketball jerseys • Ordinal • Attributes can be rank-ordered numerically • Distances between attributes do not have any meaning. • Example: Coding educational attainment as 0 = less than high school, 1 = high school degree, 2 = college degree, 3 = Masters, PhD, etc.
Measurement • Interval • The distance between attributes does have meaning • Example: When measuring temperature, the distance between 30F and 40F is the same as that between 70F and 80F. • Ratio • There is always an absolute zero that is meaningful. • i.e. you can construct a meaningful fraction/ratio Source: http://www.socialresearchmethods.net/kb/measlevl.php
Measures of Central Tendency Central tendencies tell us where most of the data lie • Mean: also known as the average • Add up all the values for your variable, then divide by the total number of values • Median: The middle score for a set of data that has been arranged in order of magnitude. • Mode: The most frequent value in the dataset
Which Measure Should We Use? It depends on both the type of variable and the distribution of the data • Mode: Typically used when we have categorical data (e.g. gender, race, educational attainment, etc.) • Mean: When we want the average value of a variable, UNLESS our data is skewed. • Median: When we have skewed data and/or outliers Question: What measure of central tendency would you use to calculate the average salary for a group of 10 people where 9 people earn $1 and 1 person earns $100?
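The salary question can be checked directly. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# 9 people earn $1 and 1 person earns $100.
salaries = [1] * 9 + [100]

mean_salary = statistics.mean(salaries)      # pulled upward by the single outlier
median_salary = statistics.median(salaries)  # middle of the ordered values
mode_salary = statistics.mode(salaries)      # most frequent value

print(f"Mean:   {mean_salary}")
print(f"Median: {median_salary}")
print(f"Mode:   {mode_salary}")
```

The mean works out to $10.90, more than ten times what nine of the ten people earn, while the median stays at $1. With skewed data like this, the median is the more representative measure.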
Measures of Dispersion Dispersion studies the spread of the data • Range • | Maximum – Minimum | • Variance • The average squared distance of the observations in the sample dataset from the mean • Standard Deviation • Square root of the variance • A low standard deviation tells us that data points tend to be close to the mean
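Written out, the definitions above take the following form. The notation is an assumption, since the slides do not fix symbols: $x_i$ are the observations, $n$ is the sample size, and the variance uses the sample ($n-1$) denominator, which is consistent with the worked answer that follows.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\qquad
s = \sqrt{s^2}
```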
Measures of Dispersion • Question: Given the data below on test scores, what are the sample size (N), mean, median, mode, range, variance and standard deviation?
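Measures of Dispersion Answer: • Start by ordering the data in order of magnitude: • 0, 3, 4, 6, 6, 6, 7, 8, 9, 10 • Sample size: 10 • Mean: $\bar{x} = 59 / 10 = 5.9$ • Median: 6 • Mode: 6 • Range: 10 – 0 = 10 • Variance: 8.77, calculated using $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 78.9 / 9$ • Standard deviation: $s = \sqrt{s^2} \approx 2.96$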
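The answer can be reproduced with Python's standard `statistics` module. This is a sketch, not the method used in the slides; note that `statistics.variance` and `statistics.stdev` use the sample ($n-1$) denominator.

```python
import statistics

scores = [0, 3, 4, 6, 6, 6, 7, 8, 9, 10]

n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
value_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)      # square root of the sample variance

print(f"N = {n}, mean = {mean}, median = {median}, mode = {mode}")
print(f"range = {value_range}, variance = {variance:.2f}, sd = {std_dev:.2f}")
```

Dividing by $n$ instead (`statistics.pvariance`) would give 7.89, so it matters which convention a course or software package uses.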
The Normal Distribution • The normal distribution is a symmetric, bell-shaped distribution that is completely described by the mean and the standard deviation • The mean describes the centre of the curve • The standard deviation determines the spread (width) of the curve
Central Limit Theorem As the sample size grows larger, the sampling distribution of the sample mean approaches a normal distribution, whatever the shape of the underlying population • What does this theorem tell us? • A sample with more observations gives us a truer picture of the actual population • Drawing conclusions from samples that are “too small” may make for an unreliable analysis
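A small simulation can make the theorem concrete. The sketch below assumes a deliberately skewed population (exponential with mean 1), draws many samples of each size, and summarises the distribution of the sample means. The population, sample sizes, and the helper names `sample_means` and `skewness` are all illustrative choices, not part of the slides.

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same numbers

def sample_means(sample_size, num_samples=2000):
    """Means of repeated samples drawn from a skewed exponential population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

def skewness(values):
    """Crude skewness measure: close to 0 for a symmetric, normal-like shape."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

results = {}
for size in (2, 30, 200):
    means = sample_means(size)
    results[size] = (statistics.fmean(means), statistics.stdev(means), skewness(means))
    centre, spread, skew = results[size]
    print(f"n = {size:3d}: centre = {centre:.3f}, spread = {spread:.3f}, skewness = {skew:.3f}")
```

As the sample size grows, the sample means cluster more tightly around the population mean and their skewness shrinks toward zero, which is the bell shape the theorem describes.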
Correlation • Correlation: A single number that describes the degree of relationship between two variables. • The value of correlation ranges from -1 to 1 • If the correlation coefficient is positive, this means that the two variables move together • Example: Education and salary (as the level of education increases, so does salary) • If the correlation coefficient is negative, this means that the two variables have an inverse relationship • Example: Education and unemployment rate (as the level of education increases, the unemployment rate decreases) • If the correlation coefficient is zero, the two variables do not have a linear relationship • Example: The weather and salary
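The three cases can be illustrated with the Pearson correlation coefficient, which is assumed here because the slides do not name a specific coefficient. All data values below are invented for illustration, and `pearson_r` is a hypothetical helper written out by hand.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)

# Hypothetical data, made up for this sketch.
years_education = [10, 12, 14, 16, 18, 20]
salary_thousands = [28, 35, 41, 52, 60, 75]
unemployment_pct = [9.0, 7.5, 6.0, 4.5, 3.5, 2.0]

r_salary = pearson_r(years_education, salary_thousands)        # positive
r_unemployment = pearson_r(years_education, unemployment_pct)  # negative

# A perfect but non-linear relationship (y = x squared) still gives r = 0.
xs = [-2, -1, 0, 1, 2]
r_curved = pearson_r(xs, [x ** 2 for x in xs])

print(f"education vs salary:       r = {r_salary:.3f}")
print(f"education vs unemployment: r = {r_unemployment:.3f}")
print(f"x vs x squared:            r = {r_curved:.3f}")
```

The last case is why a coefficient of zero is best read as "no linear relationship" rather than "no relationship at all".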
Causation • Causation is a much stronger relationship than just correlation Image source: https://www.dreamstime.com/royalty-free-stock-images-causation-correlation-difference-explained-image37881989;https://xkcd.com/925/
Hypothesis Testing Hypothesis testing is used to compare our observed statistic to another statistic or to a parameter. • But what does that really mean? • You’re testing whether your results could plausibly be a product of chance, by calculating how probable they would be if there were no real effect. • The null hypothesis (H0) is the hypothesis that we are trying to disprove. Usually, the null hypothesis is a statement of no effect or no difference • The alternative hypothesis (H1) describes the relationship as we expect it to be • Tests can be either one-tailed or two-tailed
Hypothesis Testing One-tailed test example: A researcher claims that individuals aged 17 have an average body temperature higher than the commonly accepted average of 98.6F. H0: Individuals aged 17 have an average body temperature that is not greater than 98.6F (average temp <= 98.6F) H1: Individuals aged 17 have an average body temperature that is greater than 98.6F (average temp > 98.6F) • Because H1 specifies a direction, this is a one-tailed test; a two-tailed version would test H1: average temp ≠ 98.6F against H0: average temp = 98.6F.
Hypothesis Testing One-tailed test example: A researcher claims that consuming a drug she developed increases student performance on exams. The average student test score is 87. H0: The drug will have no effect on average student test scores (i.e. they stay constant) average test score = 87 H1: The drug will increase average student test scores (i.e. they rise above 87) average test score > 87
P-values The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. • There are multiple significance levels (1%, 5% and 10%) that we use as thresholds to test our claims • The most frequently used significance level is 5% (0.05) • If the p-value obtained is higher than the 0.05 threshold, we say that our finding is not statistically significant • Therefore, we fail to reject our null hypothesis. • If the p-value obtained is lower than the 0.05 threshold, we say that our finding is statistically significant • Therefore, we can reject our null hypothesis in favour of the alternative hypothesis.
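To show how a p-value leads to a decision, here is a minimal sketch of a one-tailed, one-sample z-test applied to the drug example (H0: average test score = 87, H1: average test score > 87). The slides do not name a test, so the z-test is an assumption, and the sample figures (36 students, sample mean 89.5, known population standard deviation 6) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_tailed_z_test(sample_mean, null_mean, pop_sd, n):
    """Return (z, p_value) for H1: true mean > null_mean, with pop_sd assumed known."""
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Upper-tail probability: chance of a z at least this large if H0 is true.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical sample figures, made up for this sketch.
z, p_value = one_tailed_z_test(sample_mean=89.5, null_mean=87, pop_sd=6, n=36)

alpha = 0.05  # the 5% significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

With these made-up numbers the p-value falls well below 0.05, so the null hypothesis would be rejected. When the population standard deviation is unknown and the sample is small, a t-test is the more usual choice.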
Standard Error Standard error is how far the sample mean is likely to be from the population mean. • How does this differ from the standard deviation? • Standard deviation is the degree to which individuals within the sample differ from the sample mean. • Calculated using: $SE = s / \sqrt{n}$, where $s$ is the sample standard deviation and $n$ is the sample size • Example: if we only sample 5 universities to examine the impact of ownership on the test score, what is the likelihood that the true average test score is equal to that in our sample?
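The difference between the two quantities can be seen by computing both for the test scores from the dispersion example. This sketch assumes the usual formula for the standard error of the mean, the sample standard deviation divided by the square root of the sample size; `standard_error` is an illustrative helper name.

```python
from math import sqrt
import statistics

scores = [0, 3, 4, 6, 6, 6, 7, 8, 9, 10]

def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / sqrt(len(values))

sd = statistics.stdev(scores)  # spread of individual scores around the mean
se = standard_error(scores)    # uncertainty in the sample mean itself

print(f"standard deviation = {sd:.3f}")
print(f"standard error     = {se:.3f}")

# For a fixed standard deviation, quadrupling n halves the standard error.
for n in (10, 40, 160):
    print(f"n = {n:3d}: SE = {sd / sqrt(n):.3f}")
```

The standard deviation describes the data; the standard error describes how precisely the sample mean estimates the population mean, and it shrinks as the sample grows.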
Confidence Intervals and Z-Scores • A Z-score is a numerical measurement of a value's distance from the mean, in units of standard deviations. • If a Z-score is 0, the value is identical to the mean. • Calculated using: $z = (x - \mu) / \sigma$, where $x$ is the value, $\mu$ is the mean and $\sigma$ is the standard deviation • For a 95% confidence interval, we use a Z-score of 1.96 • A confidence interval is a range of values within which we expect the true mean to lie, with a stated level of confidence (e.g. 95%). • Calculated using: mean +/- (standard error * Z-score)
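The formula mean +/- (standard error * Z-score) can be sketched as follows, reusing the test scores from the dispersion example. The helper names are illustrative, and the critical value is taken from the standard normal distribution through `statistics.NormalDist` (Python 3.8 or later).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

scores = [0, 3, 4, 6, 6, 6, 7, 8, 9, 10]

def z_score(value, mu, sigma):
    """Distance of a value from the mean, in units of standard deviations."""
    return (value - mu) / sigma

def confidence_interval(values, level=0.95):
    """Normal-approximation interval: mean +/- (standard error * critical Z)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    centre = mean(values)
    se = stdev(values) / sqrt(len(values))
    return centre - z_crit * se, centre + z_crit * se

low, high = confidence_interval(scores)

print(f"Z-score of a test score of 10: {z_score(10, mean(scores), stdev(scores)):.2f}")
print(f"95% confidence interval for the mean: ({low:.2f}, {high:.2f})")
```

For a sample this small (n = 10), an interval based on the t-distribution would be somewhat wider; the Z-based version is used here only because it matches the formula on the slide.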
Finding the Confidence Interval Question: You want to investigate the impact of a college degree on income. Therefore, you sample 20 people who have a college degree (Group A) and 20 people who do not (Group B). You get the following statistics. What is the 95% confidence interval for each group? How can we interpret the results?