Outliers
Outliers are data points that are not like many of the other points, or values. Here we learn about some tools to detect them.
[Figure: number line for age, marked at 16, 18, 20, 22, 24, 26 and 28, with the mean x̄ = 22 labeled.]
I want to use an example here to introduce some ideas. Say a sample of data has been taken and age was one of the variables. Here I have a number line with age as the variable. Let’s say the mean, or average, age in the sample is 22. Let’s also say the sample standard deviation calculated out to be 2.
Would you like a cookie? Parker, what a silly question! I ask it because a cookie means 1 cookie. If you like the cookie, maybe you can have more. Well, in stats we say a standard deviation. This means 1 standard deviation. The standard deviation is calculated from the data. In the example on age I said the standard deviation was 2. In another data set a standard deviation might be 4. The standard deviation potentially changes from data set to data set. In our example, again I say the standard deviation was 2. Now, for another silly idea.
A digression
Say you have a board 60 inches long but you only want it to be 48 inches long. How many inches do you cut off? 60 - 48 = 12. How many feet do you cut off? 12 inches / 12 inches per foot = 1 foot. We do something close to this with the standard deviation: we take a distance in the original units and convert it into standard deviation units. The problem is that a standard deviation is not exactly like a foot. A foot is always 12 inches. The standard deviation can change from problem to problem.
Z-score
Imagine a little guy with a thick accent comes up to you at the ball game and says, “what is z-score?” You might say, “3 to 2, we’re up.” This has nothing to do with stats, but was fun to type.
Think about our age example where the mean was 22 and the standard deviation was 2. A z-score tells us how far a data point is from the mean, but in standard deviation units (in feet, not inches).
Example: 25 is 3 from the mean, but since the standard deviation is 2, 25 is only 1.5 standard deviations from the mean.
z scores
19 is also 1.5 standard deviations from the mean. Note 19 is on the low side and 25 is on the high side of the mean. In general, to get a z-score for a data point do this:
1) Take the point minus the mean.
2) Divide the result by the standard deviation.
In notation form, $z_i = (x_i - \bar{x})/s$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.
[Figure: the age number line again, marked from 16 to 28, with the mean x̄ = 22 labeled.]
Age 25 has z = (25 - 22)/2 = 1.5, so it is 1.5 standard deviations above the mean.
Age 19 has z = (19 - 22)/2 = -1.5, so it is 1.5 standard deviations below the mean.
A z that calculates out with a negative sign signifies the value is less than the mean; a positive z signifies the value is above the mean.
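The two-step recipe for a z-score can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` is made up here, and the mean of 22 and standard deviation of 2 are simply the values assumed in the age example.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    # Step 1: point minus the mean. Step 2: divide by the standard deviation.
    return (x - mean) / sd


# Values from the age example: mean 22, standard deviation 2.
print(z_score(25, 22, 2))  # 1.5  (above the mean)
print(z_score(19, 22, 2))  # -1.5 (below the mean)
```

The sign carries the direction and the size carries the distance, which is why 25 and 19 come out as mirror images of each other.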
More notes about the z-score:
• When a data point equals the sample mean, z = 0.
• Z is a measure of relative location: the number of standard deviations from the mean.
• Examples of two data sets:
  • Data set a: mean = 22, standard deviation = 2.
  • Data set b: mean = 41, standard deviation = 3.
  • The value 24 in data set a and the value 44 in data set b are similar in that both have z = 1, meaning they are both 1 standard deviation above their mean.
Chebyshev’s Theorem: a rule for any data set
At least $1 - 1/k^2$ of the data values must be within k standard deviations of the mean, where k is any value greater than 1.
Calculation example:
k = 2: we have $1 - 1/4 = 3/4$, or .75, or 75%
k = 3: we have $1 - 1/9 = 8/9 \approx .89$, or 89%
Note in the statement of the theorem the word within is used. This means that we can be on either side of the mean.
Example: At least what percent of scores are within 2.4 standard deviations of the mean, when the mean is 70 and the standard deviation is 5? $1 - 1/2.4^2 = 1 - 1/5.76 \approx .826$, or 82.6%. In other words, at least 82.6% of the scores are between 70 - 2.4(5) = 58 and 70 + 2.4(5) = 82.
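The Chebyshev calculations above are easy to reproduce. The sketch below is just one way to write it; the name `chebyshev_lower_bound` is hypothetical, and the mean of 70 and standard deviation of 5 come from the example.

```python
def chebyshev_lower_bound(k):
    """Minimum share of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


# The three values of k used in the examples.
for k in (2, 3, 2.4):
    print(k, round(chebyshev_lower_bound(k), 3))

# Interval covered in the last example: mean 70, standard deviation 5, k = 2.4.
mean, sd, k = 70, 5, 2.4
low, high = mean - k * sd, mean + k * sd
print(round(low, 1), round(high, 1))
```

Remember that the result is a floor, not an estimate: the theorem says at least this share of the data falls in the interval, whatever the shape of the data.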
Empirical rule based on the normal distribution
Chevy Chase’s theorem (OK, Chebyshev’s) was general. When data follow a normal distribution, a more specific rule applies.
Thought experiment: Imagine you put your eyes just above the top of a table. I then drop sugar in a small stream, with a steady hand, onto the table. The sugar will start to pile up like this picture below. Why does the sugar pile up like this? I have no idea! But it seems normal!
I have drawn here a histogram and I have put on top of it a bell shaped curve. In the case where you can put a bell shaped curve on top of a histogram to approximate the distribution of the variable, then the variable is called normal.
68 - 95 - 99.7 rule
This rule allows us to say approximately 68% of the people in the data set have a value on the variable within 1 standard deviation of the mean. Approximately 95% have a value within 2 standard deviations of the mean, and approximately 99.7% have a value within 3 standard deviations of the mean.
Let’s look at this idea again in the context of an example. Say we asked a whole bunch of people how many ounces of Mt. Dew they consume each year. Say the responses follow a normal distribution with mean = 5480 and standard deviation = 480.
The rule again
[Figure: normal curve of ounces of Mt. Dew per year, with the axis marked at 4040, 4520, 5000, 5480, 5960, 6440 and 6920. Brackets show about 68% of the values between 5000 and 5960, about 95% between 4520 and 6440, and about 99.7% between 4040 and 6920.]
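The rule
So, by the rule we know that about 68% of the people in the data set have between 5000 and 5960 ounces of Mt. Dew per year (5480 plus or minus 480). By the rule we know that about 95% of the people in the data set have between 4520 and 6440 ounces of Mt. Dew per year (5480 plus or minus 2 times 480). And about 99.7% have between 4040 and 6920 ounces per year (5480 plus or minus 3 times 480).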
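To see where those Mt. Dew ranges come from, here is a small sketch that builds each interval as the mean plus or minus k standard deviations and then checks the share of a normal distribution that falls inside it. It uses `NormalDist` from Python's standard `statistics` module; the helper name `interval` is made up for this illustration.

```python
from statistics import NormalDist

mean, sd = 5480, 480  # ounces of Mt. Dew per year, from the example
dist = NormalDist(mu=mean, sigma=sd)


def interval(k):
    """Return the range within k standard deviations of the mean."""
    return mean - k * sd, mean + k * sd


for k in (1, 2, 3):
    low, high = interval(k)
    # Share of a normal distribution that falls between low and high.
    share = dist.cdf(high) - dist.cdf(low)
    print(f"within {k} sd: {low} to {high} ounces, about {share:.1%}")
```

The computed shares land very close to 68%, 95% and 99.7%, which is why the rule quotes them as approximate figures.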
Outlier
A data point that has a z less than -3 or greater than +3 is likely to be an outlier. This means the data point is really not like the other points. Maybe the point should not be included in the statistical analysis.
Chebyshev’s Theorem and the empirical rule tell us about where most of the data should be. In this sense these rules can also assist us in thinking about data points that really do not belong.
Say I want to take 72 minus 43, but I type in my calculator 72 minus 34. I would get 38, but I really wanted to get 29. The 38 is off by 9 from what I wanted. Accounting folks know if you are off by 9 you should check to see if you made a transposition error.
In stats, the first thing we should do with an outlier is check the records and make sure a data entry error was not made. If it was an entry error, fix it! Otherwise, maybe you want to disregard that data point.
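The z rule for outliers can be turned into a quick check. The sketch below is an illustration only: `flag_outliers` is a made-up name, the list of ages is hypothetical, and the mean of 22 and standard deviation of 2 are taken as given from the age example rather than recomputed from the list.

```python
def flag_outliers(values, mean, sd, threshold=3):
    """Return the values whose z-score is below -threshold or above +threshold."""
    return [x for x in values if abs((x - mean) / sd) > threshold]


ages = [16, 19, 22, 25, 29]  # hypothetical ages, not from a real sample
print(flag_outliers(ages, 22, 2))  # [29]
```

Age 29 has z = 3.5 and gets flagged, while age 16 sits at exactly z = -3 and does not, since the rule asks for less than -3. A flagged value is a prompt to go check the records, not an automatic deletion.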