This lecture provides an introduction to variation and variability in data sets. It explains the concept of range as a measure of variability and introduces the measures of variance and standard deviation. The lecture also discusses the computational formulas for calculating these measures and the concept of degrees of freedom.
Basic Quantitative Methods in the Social Sciences (AKA Intro Stats) 02-250-01 Lecture 3
Variation • Variability: The extent to which the numbers in a data set are dissimilar (different) from each other • When all elements measured receive the same score (e.g., everyone in the data set is the same age, in years), there is no variability in the data set • As the scores in a data set become more dissimilar, variability increases
Variation: Range • The range tells us the span over which the data are distributed, and is only a very rough measure of variability • Range: The difference between the maximum and minimum scores • Example: The youngest student in a class is 19 and the oldest is 46. Therefore, the age range of the class is 46 – 19 = 27 years.
Variation • This is an example of data with NO variability
  $X$    $X - \bar{X}$
  5      0.00
  5      0.00
  5      0.00
  5      0.00
  5      0.00
  $\sum X = 25$, $n = 5$, $\bar{X} = 5$
Variation • This is an example of data with low variability
  $X$    $X - \bar{X}$
  6      +1.00
  4      -1.00
  6      +1.00
  5      0.00
  4      -1.00
  $\sum X = 25$, $n = 5$, $\bar{X} = 5$
Variation • This is an example of data with higher variability
  $X$    $X - \bar{X}$
  8      +3.00
  1      -4.00
  9      +4.00
  5      0.00
  2      -3.00
  $\sum X = 25$, $n = 5$, $\bar{X} = 5$
Note: • Let’s say we wanted to figure out the average deviation from the mean. Normally, we would want to sum all deviations from the mean and then divide by n, i.e., $\frac{\sum (X - \bar{X})}{n}$ • (recall: look at your formula for the mean from last lecture) • BUT: We have a problem. $\sum (X - \bar{X})$ will always add up to zero
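The cancellation can be checked directly on the three example data sets from the slides above. This is a minimal Python sketch; the names `datasets` and `deviations` are illustrative and not part of the lecture.

```python
# The three example data sets from the slides; each has n = 5 and a mean of 5.
datasets = {
    "no variability": [5, 5, 5, 5, 5],
    "low variability": [6, 4, 6, 5, 4],
    "higher variability": [8, 1, 9, 5, 2],
}


def deviations(scores):
    """Return each score's deviation from the mean of the scores."""
    mean = sum(scores) / len(scores)
    return [x - mean for x in scores]


for label, scores in datasets.items():
    devs = deviations(scores)
    # Positive and negative deviations cancel, so the sum is zero every time.
    print(label, devs, "sum =", sum(devs))
```

Whatever the spread of the scores, the deviations sum to zero, so the plain average deviation cannot tell the three data sets apart.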
Variation • However, if we square each of the deviations from the mean, we obtain a sum that is not equal to zero • This is the basis for the measures of variance and standard deviation, the two most common measures of variability of data
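Variation
  $X$    $X - \bar{X}$    $(X - \bar{X})^2$
  8      +3.00            9.00
  1      -4.00            16.00
  9      +4.00            16.00
  5      0.00             0.00
  2      -3.00            9.00
  $\sum X = 25$, $\sum (X - \bar{X}) = 0.00$, $\sum (X - \bar{X})^2 = 50.00$
Note: The $\sum (X - \bar{X})^2$ is called the Sum of Squares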
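As a quick check of the Sum of Squares for this data set, here is a short Python sketch (variable names are illustrative):

```python
scores = [8, 1, 9, 5, 2]
mean = sum(scores) / len(scores)  # 25 / 5 = 5.0

# Squaring removes the signs, so the deviations no longer cancel out.
squared_deviations = [(x - mean) ** 2 for x in scores]
sum_of_squares = sum(squared_deviations)

print(squared_deviations)  # [9.0, 16.0, 16.0, 0.0, 9.0]
print(sum_of_squares)      # 50.0
```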
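Variance of a Population • VARIANCE OF A POPULATION: the sum of squared deviations from the mean divided by the number of scores (sigma squared): $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Population Standard Deviation • Square root of the variance: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$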
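Sample Variance • The sum of squared deviations from the mean divided by the number of degrees of freedom, n-1 (this gives an estimate of the population variance): $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Sample Standard Deviation • Square root of the sample variance $s^2$: $s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$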
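The four definitions differ only in the denominator and in whether a square root is taken. The sketch below applies them to the scores 8, 1, 9, 5, 2 used earlier and compares the results with Python's standard `statistics` module. Treating the same five scores as both a population and a sample is an assumption made purely for illustration.

```python
import math
import statistics

scores = [8, 1, 9, 5, 2]
n = len(scores)
mean = sum(scores) / n
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population formulas divide by the number of scores.
population_variance = sum_of_squares / n
population_sd = math.sqrt(population_variance)

# Sample formulas divide by the degrees of freedom, n - 1.
sample_variance = sum_of_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(population_variance, round(population_sd, 2))  # 10.0 3.16
print(sample_variance, round(sample_sd, 2))          # 12.5 3.54

# Cross-check against the standard library's implementations.
print(math.isclose(population_sd, statistics.pstdev(scores)))  # True
print(math.isclose(sample_sd, statistics.stdev(scores)))       # True
```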
Why use Standard Deviation and not Variance!??! • Normally, you will only calculate variance in order to calculate standard deviation, as standard deviation is what we typically want • Why? Because standard deviation expresses variability in the same units as the data • Example: Standard deviation of ages in a class is 3.7 years
Variance • The above formulae are definitional - they are the mathematical representation of the concepts of variance and standard deviation • When calculating variance and standard deviation (especially when doing so by hand) the following computational formulae are easiest to use (trust us, they really are easier to use. You should however have a good understanding of the definitional formulae):
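Population Variance • Computational Formula: $\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$
Population Standard Deviation • Computational Formula: $\sigma = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$
Sample Variance • Computational Formula: $s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$
Sample Standard Deviation • Computational Formula: $s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}}$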
Sample Standard Deviation Example • Data:
  $X$    $X^2$
  8      64
  1      1
  9      81
  5      25
  2      4
  $\sum X = 25$, $\sum X^2 = 175$, $n = 5$, $\bar{X} = 5$
  $s^2 = \frac{175 - \frac{(25)^2}{5}}{4} = 12.50$
  $s = \sqrt{12.50} = 3.54$
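To see that the computational route gives the same Sum of Squares as the definitional one, the example can be reproduced in a few lines of Python. This is a sketch with illustrative variable names.

```python
import math

scores = [8, 1, 9, 5, 2]
n = len(scores)

sum_x = sum(scores)                     # sum of X   = 25
sum_x_sq = sum(x ** 2 for x in scores)  # sum of X^2 = 175

# Computational route: only the two column totals are needed.
ss_computational = sum_x_sq - sum_x ** 2 / n

# Definitional route: subtract the mean from every score first.
mean = sum_x / n
ss_definitional = sum((x - mean) ** 2 for x in scores)

sample_variance = ss_computational / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(ss_computational, ss_definitional)     # 50.0 50.0
print(sample_variance, round(sample_sd, 2))  # 12.5 3.54
```

The computational form is convenient by hand because it avoids one subtraction per score.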
Computing Standard Deviation • When calculating standard deviation, create a table with one column for the scores ($X$) and one for the squared scores ($X^2$), and total each column
Computing Standard Deviation • The values are then entered into the formula as follows: $\sum X^2 = 150$, $(\sum X)^2 = 22^2 = 484$, $n = 4$, $n - 1 = 3$ • $s = \sqrt{\frac{150 - \frac{484}{4}}{3}} = \sqrt{\frac{29}{3}} \approx 3.11$
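The substitution can also be done from the column totals alone. The sketch below reads the slide's values as a sum of squared scores of 150 and a squared sum of 22² = 484 (so the scores sum to 22), with n = 4. The function name `sample_sd_from_sums` is hypothetical.

```python
import math


def sample_sd_from_sums(sum_x, sum_x_sq, n):
    """Sample standard deviation from the totals of X and X^2 (computational formula)."""
    sum_of_squares = sum_x_sq - sum_x ** 2 / n
    return math.sqrt(sum_of_squares / (n - 1))


s = sample_sd_from_sums(sum_x=22, sum_x_sq=150, n=4)
print(round(s, 2))  # 3.11
```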
Degrees of Freedom • Degrees of Freedom: The number of independent observations, or, the number of observations that are free to vary • In our data example above, there are 5 numbers that total 25 ($\sum X = 25$, $n = 5$)
Degrees of Freedom • Many combinations of numbers can total 25, but only the first 4 can be any value • The 5th number cannot vary if $\sum X = 25$ • This example has 4 degrees of freedom, as four of the five numbers are free to vary • A standard deviation computed from a sample with n in the denominator usually underestimates the population standard deviation. Using n-1 in the denominator corrects for this and gives us a better estimate of the population standard deviation.
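Both ideas can be illustrated with a small Python sketch. It makes two assumptions that are not part of the lecture: the five scores 8, 1, 9, 5, 2 are treated as a complete population, and every possible sample of size 2 is drawn from it with replacement.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Part 1: with the total fixed at 25, the fifth score is not free to vary.
first_four = [8, 1, 9, 5]
fifth = 25 - sum(first_four)
print(fifth)  # 2

# Part 2: treat the five scores as a complete (hypothetical) population.
population = [8, 1, 9, 5, 2]
population_variance = pvariance(population)

# Every possible ordered sample of size 2, drawn with replacement (25 samples).
samples = list(product(population, repeat=2))

# Average the variance estimate over all samples, once per denominator.
average_dividing_by_n = mean([pvariance(s) for s in samples])
average_dividing_by_n_minus_1 = mean([variance(s) for s in samples])

print(population_variance)
print(average_dividing_by_n)
print(average_dividing_by_n_minus_1)
```

Under these assumptions the population variance is 10, the estimates that divide by n average 5, and the estimates that divide by n-1 average 10. The match is exact for the variance. The square root of the n-1 variance is still a slightly low estimate of the standard deviation, but it is closer than the n version.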
Degrees of Freedom • Degrees of freedom are usually n-1 (the total # of data points minus one)
Time for an example • Seven people were asked to rate the taste of McDonald’s french fries on a scale of 1 to 10. Their ratings are as follows: 8, 4, 6, 2, 5, 7, 7 • Calculate the population standard deviation • Calculate the sample variance • Class discussion: When would this be a population, and when would it be a sample?
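After working the example by hand, the results can be checked with Python's standard `statistics` module. This is a sketch for verification only; the hand calculation is the point of the exercise.

```python
import statistics

ratings = [8, 4, 6, 2, 5, 7, 7]

# pstdev divides the Sum of Squares by n (population formula).
population_sd = statistics.pstdev(ratings)

# variance divides the Sum of Squares by n - 1 (sample formula).
sample_variance = statistics.variance(ratings)

print(round(population_sd, 2))    # 1.92
print(round(sample_variance, 2))  # 4.29
```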
Why is Standard Deviation so Important? • What does the standard deviation really tell us? • Why would a sample’s standard deviation be small? • Why would a sample’s standard deviation be large?
An Example • You’re sitting in the CAW Student Centre with 4 of your friends. A member of the opposite sex walks by, and you and your friends rate this person’s attractiveness on a scale from 1 to 10 (where 1=very unattractive and 10=drop dead gorgeous)
Food for thought • 1) What would it mean if all five of you rated this person a 9 on 10? • 2) What would it mean if all five of you rated this person a 5 on 10? • 3) What would it mean if the five of you produced the following ratings: 1, 10, 2, 9, and 3 (note that the mean rating would be 5)? • Why would scenario #3 happen instead of scenario #2? What factors would lead to these different ratings? • These questions form the basis of why statisticians like to “explain variability”
An In-Depth Look at Scenario #3 • So if the five of you produced the following ratings: 1, 10, 2, 9, and 3, what is the standard deviation of these ratings? • Calculate! • What is the standard deviation in Scenario #2? Calculate!
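The three scenarios can be compared side by side once the hand calculations are done. The sketch below reports both the population and the sample standard deviation, since the slides leave open which one applies to the five raters. The dictionary labels are illustrative.

```python
import statistics

scenarios = {
    "Scenario 1": [9, 9, 9, 9, 9],
    "Scenario 2": [5, 5, 5, 5, 5],
    "Scenario 3": [1, 10, 2, 9, 3],
}

results = {}
for label, ratings in scenarios.items():
    results[label] = {
        "mean": statistics.mean(ratings),
        "population_sd": statistics.pstdev(ratings),  # divides by n
        "sample_sd": statistics.stdev(ratings),       # divides by n - 1
    }
    r = results[label]
    print(f"{label}: mean = {r['mean']}, "
          f"population SD = {r['population_sd']:.2f}, "
          f"sample SD = {r['sample_sd']:.2f}")
```

Scenarios 2 and 3 share a mean of 5, yet only Scenario 3 has a non-zero standard deviation (about 3.74 as a population, 4.18 as a sample). The mean alone does not show how much the raters disagree.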
Normal Distribution • The normal distribution is a theoretical distribution • “Normal” does not mean typical or average; it is a technical term given to this mathematical function • The normal distribution is unimodal and symmetrical, and is often referred to as the Bell Curve
Normal Distribution • [Figure: normal curve with the mean, median, and mode marked at the center]
Normal Distribution • We study the normal distribution because many naturally occurring events yield a distribution that approximates the normal distribution
Properties of Area Under the Normal Distribution • One of the properties of the Normal Distribution is the fixed area under the curve • If we split the distribution in half, 50% of the scores of the sample lie to the left of the mean (or median, or mode), and 50% of the scores lie to the right of the mean (or median, or mode)
Properties of Area Under the Normal Distribution • The mean, median, and mode always cut the Normal Distribution in half, and are equal since the Normal Distribution is unimodal and symmetrical:
Properties of Area Under the Normal Distribution • [Figure: normal curve divided at the mean, median, and mode, with 50% of scores on each side]
Properties of Area Under the Normal Distribution • The entire area under the normal curve can be considered to be a proportion of 1.0000 • Thus, half, or .5000 of the scores lie in the bottom half (i.e., left of the mean) of the distribution, and half, or .5000 of the scores lie in the top half (i.e., right of the mean)
Properties of Area Under the Normal Distribution • [Figure: normal curve divided at the mean, median, and mode, with .5000 of scores on each side]
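The half-and-half split can be confirmed numerically with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). The second distribution's mean of 100 and standard deviation of 15 are hypothetical values chosen only to show that the split does not depend on the particular mean or spread.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)
left_half = standard_normal.cdf(0)  # proportion of area to the left of the mean
right_half = 1 - left_half          # the total area is 1.0000

print(left_half, right_half)  # 0.5 0.5

# A normal distribution with a hypothetical mean and standard deviation.
other_normal = NormalDist(mu=100, sigma=15)
print(other_normal.cdf(100))  # 0.5
```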
Z-scores • Z-Scores (or standard scores) are a way of expressing a raw score’s place in a distribution • Z-score formula: $z = \frac{X - \mu}{\sigma}$
Z-scores • In the z-score formula, the mean and standard deviation are always notated in Greek letters ($\mu$ and $\sigma$) • Z-scores only reflect the data points’ position relative to the overall data set (so you’re now considering the data as a population, as you’re not looking to infer to a greater population) • This means you should use the population formula for standard deviation rather than the sample formula whenever you calculate Z
Z-scores • A z-score is a better indicator of where your score falls in a distribution than a raw score • A student could get a 75/100 on a test (75%) and consider this to be a very high score
Z-scores • If the average of the class marks is 89 and the (population) standard deviation is 5.2, then the z-score for a mark of 75 would be: • $\mu = 89$, $\sigma = 5.2$ • $z = (75 - 89)/5.2$ • $z = (-14)/5.2$ • $z = -2.69$
Z-scores • This means that a mark of 75% is actually 2.69 standard deviations BELOW the mean • The student would have done poorly on this test, as compared to the rest of the class
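The same calculation as a small Python sketch. The function name `z_score` and its parameter names are illustrative.

```python
def z_score(raw_score, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw_score - mean) / sd


z = z_score(75, mean=89, sd=5.2)
print(round(z, 2))  # -2.69
```

The negative sign shows that the mark sits below the class mean, even though 75% looks high on its own.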
Z-scores • z = 0 represents the mean score (which would be 89 in this example) • z < 0 represents a score less than the mean (which would be less than 89) • z > 0 represents a score greater than the mean (which would be greater than 89)
Z-scores • For any set of scores, the z-scores will: • sum to zero ($\sum z = 0.00$) • have a mean equal to zero ($\mu_z = 0.00$) • have a standard deviation equal to one ($\sigma_z = 1.00$)
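These three properties can be checked on any data set. The sketch below reuses the scores 8, 1, 9, 5, 2 from earlier in the lecture as an arbitrary example and uses the population standard deviation. The results hold up to floating-point rounding, so the checks use a small tolerance.

```python
import math
import statistics

scores = [8, 1, 9, 5, 2]
mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)  # population formula

z_scores = [(x - mu) / sigma for x in scores]

z_sum = sum(z_scores)
z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)

print(math.isclose(z_sum, 0.0, abs_tol=1e-9))   # True
print(math.isclose(z_mean, 0.0, abs_tol=1e-9))  # True
print(math.isclose(z_sd, 1.0))                  # True
```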
Z-scores • A z-score expresses the position of the raw score above or below the mean in standard deviation sized units • E.g., • z = +1.50 means that the raw score is 1 and one-half standard deviations above the mean • z = -2.00 means that the raw score is 2 standard deviations below the mean
Z-score Example • If you write two exams, in Math and English, and get the following scores: • Math 70% (class $\mu = 55$, $\sigma = 10$) • English 60% (class $\mu = 50$, $\sigma = 5$) • Which test mark represents the better performance (relative to the class)?
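Z-score Example cont. • Math mark: z = (70-55)/10 • z = +1.50 • English mark: z = (60-50)/5 • z = +2.00 • The English mark is the better performance relative to the class: it lies 2.00 standard deviations above its class mean, while the Math mark lies 1.50 above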
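The comparison can be written as a short Python sketch. The dictionary layout and variable names are illustrative.

```python
# Each exam: (your mark, class mean, class standard deviation).
exams = {
    "Math": (70, 55, 10),
    "English": (60, 50, 5),
}

# Convert each raw mark to a z-score so the two exams share a common scale.
z_scores = {
    name: (mark - mean) / sd
    for name, (mark, mean, sd) in exams.items()
}

# The exam with the highest z-score is the best result relative to its class.
best = max(z_scores, key=z_scores.get)

print(z_scores)  # {'Math': 1.5, 'English': 2.0}
print(best)      # English
```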