Elementary Statistics
NOTICE Please note that these slides are not a substitute for reading and interacting with your course material. Their intended purpose is to serve as a quick reference for studying concepts and to present the material from a different angle that might help you better understand the statistics.
For the Student: How do insurance companies determine premium rates for different age groups and sexes? What information does the government use to decide who gets taxed and how much? How are you going to determine which vehicle is the safest to drive? The answer to each of these questions relies heavily on statistics. Unfortunately, a great many of the statistics circulating in society have been badly manipulated and will give you unreliable results. Statistics is everywhere and affects virtually everyone. The question to ask yourself now is: “Are you going to become another victim of statistics?” Whether your future career is in education, politics, or firefighting, making decisions based on statistics will be inevitable, and those decisions will have consequences. Statistics is not hard; it only takes a little time and patience to gain a true understanding of what your information really means.
To the Student: Here is a hint to keep yourself from being overwhelmed while first learning statistics. It is not vital to memorize all of the equations. Although memorizing them can help, it is better to understand what the equations mean, why they are used, and when to use them. Some of the equations, especially any equation that has a Σ (summation) symbol, require adding a long series of numbers. Practically speaking, you should use a calculator or computer to evaluate such equations. Focus on understanding what the computer is doing, so that the number that pops out is more than just a number to you, because that number means something. Don’t memorize, just recognize.
Graphs and Summaries
A. One Categorical Variable
   1. Graphs
   2. Summaries
      Applets
B. One Numerical Variable
   1. Graphs
      a. Stemplot
      b. Histogram
      c. Boxplot
      d. Normal Quantile plot (Q-plot)
      e. Shapes
         i. Symmetric
            1) Normal
            2) Uniform
         ii. Skewed
            1) Left
            2) Right
   2. Summaries
      a. Locations
         i. Mean
         ii. Median
         iii. Mode
         iv. Min and Max
         v. Quartiles
         vi. Comparisons of Mean and Median
         vii. Z-scores
              Empirical Rule
      b. Spreads
         i. Variance
         ii. Standard Deviation
         iii. Range
         iv. Interquartile Range (IQR)
   3. Transformations
      a. Shift changes
         i. Centers
         ii. Spreads
      b. Scale changes
         i. Centers
         ii. Spreads
Beginning Definitions
Variable: the overall characteristic of interest that we want to understand.
- e.g. Percent of people in America who use deodorant
- e.g. Average debt of a college graduate from Texas A&M
Individual: a single member of the group being studied, who contributes one value of the variable.
- e.g. Bob, an American who does not use deodorant
- e.g. Jill, a college graduate of Texas A&M with $10,000 in debt
Variable Types
Categorical (Qualitative)
- Nominal, e.g. Colors {red, blue, green}
- Ordinal, e.g. Strength {weak, moderate, strong}
Numerical (Quantitative)
- Discrete, e.g. Number of children in a family
- Continuous, e.g. Amount of water the average house uses
{Depending on the context, certain discrete variables can be treated as continuous for practical purposes, and continuous data can be made discrete.}
Distribution
- Shape, e.g. unimodal, bimodal, multimodal, symmetric, skewed right, skewed left
- Center, e.g. Mean, Median
- Spread, e.g. Range, Standard Deviation, Variance, Interquartile Range
Categorical Graphs (Nominal or Ordinal) • Pie Charts • Bar Graphs
Bar Graphs
Numerical Graphs (Univariate) • Stemplots • Histograms • Boxplots • Normal Quantile Plots (Q-plots)
Stemplots
Back-to-back stemplot
 boys |   | girls
   18 | 3 | 7
   67 | 4 | 5
    0 | 5 | 9
    7 | 6 | 389
13379 | 7 | 0558
 1488 | 8 | 0025
      | 9 | 7
Stemplot of Scores
3 | 178
4 | 567
5 | 09
6 | 3789
7 | 013355789
8 | 00124588
9 | 7
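As a rough illustration of how the combined stemplot is built from the two groups, here is a minimal Python sketch. The function name `stemplot` and the list names are made up for this example, and the scores are read off the back-to-back stemplot above (tens digit as stem, ones digit as leaf).

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for two-digit whole numbers.

    The tens digit is the stem and the ones digit is the leaf.
    (Hypothetical helper written for this example.)
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    rows = []
    # Walk every stem from smallest to largest so empty stems still get a row.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {''.join(leaves.get(stem, []))}")
    return rows

# Scores read from the back-to-back stemplot.
boys = [31, 38, 46, 47, 50, 67, 71, 73, 73, 77, 79, 81, 84, 88, 88]
girls = [37, 45, 59, 63, 68, 69, 70, 75, 75, 78, 80, 80, 82, 85, 97]

rows = stemplot(boys + girls)
for row in rows:
    print(row)
```

The printed rows should match the "Stemplot of Scores" shown on the slide.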
Histogram Note that the bar heights of a histogram can show either counts or percents, just as with bar charts.
Boxplots Boxplots are made using the 5-number summary (minimum, 1st quartile, median, 3rd quartile, maximum) to define the box and whiskers, unless outliers are present. If the outliers are low values, the lower whisker ends at the smallest observation that is not an outlier; if the outliers are high values, the upper whisker ends at the largest observation that is not an outlier. Outlier? A number is considered an outlier if it lies more than 1.5 times the IQR (Interquartile Range) below the 1st quartile or above the 3rd quartile.
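The outlier rule can be sketched in a few lines of Python. This example borrows the ordered heights and the quartiles (68 and 71) that appear later in these slides; the function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs beyond which a value is an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]

# Quartiles taken from the Quartiles slide: Q1 = 68, Q3 = 71.
low, high = outlier_fences(68, 71)

outliers = [x for x in heights if x < low or x > high]
inside = [x for x in heights if low <= x <= high]

# Whiskers stop at the most extreme values that are not outliers.
whiskers = (min(inside), max(inside))

print(low, high)   # 63.5 75.5
print(outliers)    # [62]
print(whiskers)    # (65, 72)
```

Under this rule the height of 62 would be plotted as a separate point, and the lower whisker would end at 65.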
Normal Plots (aka Q-plots) Q-plots are used to judge how reasonable it is to assume that a sample comes from a normal distribution. If the sample does come from a normal distribution, the plotted points should fall along a straight line (a 45-degree line when both axes use the same standardized scale); when the Q-plot includes a Q-line, the points should follow the line closely. Unfortunately there is no clear rule for declaring a set of data normal or not. It takes practice examining patterns in Q-plots to recognize “close calls”, but if the data is strongly skewed it will be very easy to see the points bend away from the line.
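To make the idea concrete, the sketch below computes the points a Q-plot would draw, using only Python's standard library. It assumes the plotting positions (i − 0.5)/n, which is one common convention and not necessarily the one your software uses; the function name and the choice of the heights data from later slides are also just for illustration.

```python
from statistics import NormalDist, mean, stdev

def normal_quantile_points(sample):
    """Pair each ordered observation with the z-value a normal sample would give.

    Returns (theoretical z, observed z) pairs; plotting positions
    (i - 0.5) / n are an assumed convention.
    """
    data = sorted(sample)
    n = len(data)
    m, s = mean(data), stdev(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(data, start=1):
        p = (i - 0.5) / n
        theoretical = std_normal.inv_cdf(p)  # z-value with area p to its left
        observed = (x - m) / s               # z-score of the actual observation
        points.append((theoretical, observed))
    return points

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
points = normal_quantile_points(heights)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the two columns are roughly equal row by row, the points lie near the 45-degree line; large gaps at one end suggest skew.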
Shapes-Symmetric-Normal The blue histograms are samples from a population of test grades that has a mean of 65 and a standard deviation of 10. Notice that the histogram of the larger sample looks more like the density curve of a normal distribution (the red line).
Shapes-Symmetric-Normal [Figures: Boxplots, Normal plots]
Shapes-Skewed-Right and Left The other major pattern to recognize is skew. Think about a skewer on a barbecue grill: everything seems bunched toward one end of the stick. The pattern in graphs is similar. If the majority of the data lies on the left, with a long tail stretching to the right, the graph is right skewed, and vice versa for left skewed. [Figures: Right Skewed, Left Skewed]
Summaries Locations - Mean
Heights of students 71 70 68 69 68 65 72 69 71 62
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{685}{10} = 68.5$$
Summaries Location-Median
Heights of students 71 70 68 69 68 65 72 69 71 62
Ordered heights 62 65 68 68 69 69 70 71 71 72
Median = (69 + 69) / 2 = 69
If the number of observations is even, the Median is the average of the middle two numbers of the ordered data. If the number of observations is odd, the Median is the middle number of the ordered data.
Heights of male students 65 68 70 71 72
Median = 70
Summaries Location-Mode, Min, Max
Ordered heights 62 65 68 68 69 69 70 71 71 72
Mode = 68, 69 & 71
The Mode is the most common number. If several numbers tie for most common, there is more than one mode.
Min = 62 Max = 72
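The location summaries for the student heights can be checked with Python's built-in `statistics` module. This is only a quick sketch; the variable names are chosen for the example, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]
male_heights = [65, 68, 70, 71, 72]

modes = sorted(multimode(heights))  # every value tied for most frequent

print(mean(heights))               # 68.5
print(median(heights))             # 69.0
print(modes)                       # [68, 69, 71]
print(min(heights), max(heights))  # 62 72
print(median(male_heights))        # 70
```

Note that 68, 69 and 71 each appear twice in the data, so all three are reported as modes.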
Summaries Location-Quartiles
Ordered heights 62 65 68 68 69 69 70 71 71 72
1st Quartile = 68 3rd Quartile = 71
To find the 1st and 3rd Quartiles, consider the data to the left and to the right of the median separately. The median is the 2nd Quartile. The 1st Quartile is the middle number (or the average of the two middle numbers if the subset has an even count) of the values between the minimum and the median. The 3rd Quartile is calculated the same way, using the values between the median and the maximum.
Technical note: Include the median in each half when finding the 1st or 3rd Quartile if the number of observations is odd.
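The quartile procedure described on this slide can be sketched as follows. The function name `quartiles` is hypothetical, and this follows the slide's convention (median of each half, including the median itself when the count is odd); other textbooks and software, including Python's own `statistics.quantiles`, use slightly different conventions and may give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule from the slide."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    else:
        # Odd count: the median belongs to both halves (technical note above).
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
male_heights = [65, 68, 70, 71, 72]

q1, q2, q3 = quartiles(heights)
print(q1, q2, q3)               # 68 69.0 71
print(quartiles(male_heights))  # (68, 70, 71)
```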
Comparing Means and Medians Notice the blue and red lines on the distribution graphs below. The blue line represents the mean and the red line represents the median. This demonstrates that whenever data becomes skewed, the mean is affected more than the median: it is pulled toward the long tail. The bottom graph shows that the mean and median are about the same on a normal distribution. [Figures: Right Skewed, Left Skewed, Normal Distribution (Median and Mean the Same)]
Z-Scores Suppose we are given a set of data that has a normal distribution. Given that we already know the mean and the standard deviation, we want to find precisely how many standard deviations a certain value lies from the mean. That number is called a z-score. The equation is:
$$z = \frac{x - \mu}{\sigma}$$
Why is the z-score useful to us? If we compare our z-score to the 68-95-99.7 rule, we can learn roughly what percentage of values lie farther from the mean than our value. Suppose we had a z-score of 1.5. Since 1.5 is between 1 and 2, between 68% and 95% of the values lie closer to the mean than ours, so we had less than a 32% chance, and more than a 5% chance, of choosing a value at least this far from the mean at random. Now consider a value with a z-score of -2.5, meaning that it is 2 and 1/2 standard deviations to the left of the mean. This score lies between the 95% and 99.7% bands, which means that we had less than a 5% chance, and more than a 0.3% chance, of selecting a value at least this far from the mean at random. We can look up our z-score in a table of “Standard Normal Probabilities” to find our exact chances of being so lucky.
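A computer can stand in for the table of Standard Normal Probabilities. The sketch below uses `NormalDist` from Python's standard library (Python 3.8+). The grade of 80 is a made-up value, paired with the test-grade population from the earlier slide (mean 65, standard deviation 10); the helper names are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def chance_at_least_this_far(z):
    """Probability of landing at least |z| standard deviations from the mean."""
    return 2 * (1 - std_normal.cdf(abs(z)))

z = z_score(80, 65, 10)
print(z)                                        # 1.5
print(round(std_normal.cdf(z), 4))              # 0.9332 (area below z = 1.5)
print(round(chance_at_least_this_far(1.5), 4))  # 0.1336
print(round(chance_at_least_this_far(-2.5), 4)) # 0.0124
```

Both results fall inside the rough ranges given by the 68-95-99.7 rule: 0.1336 is between 5% and 32%, and 0.0124 is between 0.3% and 5%.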
Z-Scores Z-scores are based on the standard deviation and are used to determine how far away an observation is from the mean. A z-score of 1 corresponds to one standard deviation above the mean. The 68-95-99.7 rule is helpful in determining what the value of a z-score really means. Figure 2 is a density curve demonstrating what is meant by the 68-95-99.7 rule. The blue area under the curve contains 68% of the data; the blue ends at z = 1 and z = -1. The red plus the blue contains 95% of the data, with the outer edges at z = 2 and z = -2. Likewise, adding the green brings the total to 99.7% of the data. If we had a z-score of 0.5, we know that our number is somewhere in the blue. A z-score of 2.5 would lie somewhere in the green. [Figure 2 legend: Blue = 68%; Blue & Red = 95%; Blue, Red & Green = 99.7%]
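The three percentages in the rule are rounded areas under the standard normal curve, which can be verified with a short sketch. The color names mirror the figure legend; the function names are made up, and the assumption that the green band ends at z = ±3 follows from the 99.7% figure.

```python
from statistics import NormalDist

std_normal = NormalDist()

def area_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def band(z):
    """Colored band of the figure that a z-score falls in."""
    distance = abs(z)
    if distance <= 1:
        return "blue"
    if distance <= 2:
        return "red"
    if distance <= 3:
        return "green"
    return "outside"

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 1))  # 68.3, 95.4, 99.7

print(band(0.5))  # blue
print(band(2.5))  # green
```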
When to use 68-95-99.7 rule When do we use the Empirical Rule? It is better to make a decision based on graphs (histograms, boxplots, Q-plots), but if all we are given is the table above, we can still notice some features of the distribution by observing the frequency column. The tallies need to be fairly low at the top and bottom of this column, with the data building up near the middle. Notice that this is what we have in this example. If this pattern is apparent, it is then necessary to compare the standard deviation of the data with the percentiles. If the data is normal, then the interval within one standard deviation of the mean should contain about 68% of our data. According to the table, the middle 68% of the data lies between 5 and 14, a length of 9. The standard deviation is approximately 4.7, so the empirical rule says we expect this length to be about 2 × 4.7 = 9.4. Since 9 is close to 9.4, we conclude that the data has an approximately Normal distribution.
Empirical Rule usage Once again, the two things we need to check are: - pattern of the tallies - 68% interval. Here we see that the tallies at the ends of the frequency column are the same as or bigger than the tallies at the center. To be safe, we also compare the 68% interval with the standard deviation. The lower bound of the interval is between -1 and 0, and the upper bound is between 19 and 20, so the length of the interval is between 19 and 21. With the empirical rule we would expect this length to be around 2 × 8.34 = 16.68. Because 16.68 is clearly smaller than both 19 and 21, we conclude that the data is not normal.
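The comparison made in these two examples can be written as a small calculation. The sketch below is only an illustration: the function names are hypothetical, and the slides give no formal cutoff for how large a gap is "too large", so the code reports the relative gap and leaves the judgment to the reader.

```python
def expected_68_length(sd):
    """Length of the middle-68% interval the empirical rule predicts: 2 standard deviations."""
    return 2 * sd

def relative_gap(observed_length, sd):
    """How far the observed 68% interval length is from the predicted one, as a fraction."""
    expected = expected_68_length(sd)
    return abs(observed_length - expected) / expected

# First example: observed length 9, standard deviation about 4.7.
print(round(relative_gap(9, 4.7), 3))    # 0.043

# Second example: observed length between 19 and 21, standard deviation 8.34.
print(round(relative_gap(19, 8.34), 3))  # 0.139
print(round(relative_gap(21, 8.34), 3))  # 0.259
```

In the first example the observed length is within about 4% of the prediction; in the second it is off by roughly 14% to 26%, which matches the slides' conclusions.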
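Spreads- Variance Variance is a number that describes how much the data “varies”. There are two different formulas below. The first is used when the data is the entire population, so the population mean μ is known. The second is used for a sample and divides by n – 1 because deviations measured from the sample mean tend to come out smaller than deviations from the population mean, so dividing by n would tend to underestimate the variance of the population that the sample comes from. However, as n gets large there is very little difference between these two equations.
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$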
Spreads- Standard Deviation The Standard Deviation is just the square root of the variance. A value that lies exactly one standard deviation above the mean has a z-score of 1. Once again, the difference between the two formulas below is whether the data is the population or a sample from a population.
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Spread-Calculation of variance and standard deviation. Heights of male students 65 68 70 71 72
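One way to carry out this calculation is sketched below, first step by step and then with Python's `statistics` module as a check. The variable names are chosen for the example; `variance` and `stdev` divide by n − 1 (sample formulas), while `pvariance` and `pstdev` divide by n (population formulas).

```python
from statistics import mean, variance, stdev, pvariance, pstdev

male_heights = [65, 68, 70, 71, 72]

# Step by step: mean, squared deviations, then divide by n - 1.
x_bar = mean(male_heights)
squared_devs = [(x - x_bar) ** 2 for x in male_heights]
s2_by_hand = sum(squared_devs) / (len(male_heights) - 1)

print(x_bar)                             # 69.2
print(round(s2_by_hand, 2))              # 7.7

# Library versions of the same quantities.
print(round(variance(male_heights), 2))  # 7.7   (sample variance)
print(round(stdev(male_heights), 3))     # 2.775 (sample standard deviation)
print(round(pvariance(male_heights), 2)) # 6.16  (population variance)
print(round(pstdev(male_heights), 3))    # 2.482 (population standard deviation)
```

With only five observations the sample and population versions differ noticeably (7.7 versus 6.16); the gap shrinks as n grows.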
Summaries-Range and IQR Ordered heights 62 65 68 68 69 69 70 71 71 72 1st Quartile = 68 3rd Quartile = 71 Range = Max – Min = 72 – 62 = 10 Inter-Quartile Range (IQR) = 3rd Quartile – 1st Quartile = 71 – 68 = 3
Transformations A Transformation is when every value of a data set is put through the same function. For example, if we add a number n to every observation, we get a transformed data set that is shifted n units. If we multiply or divide every observation by the same number, the data set will have a new scale. If you are given a mean, $\bar{x}$ (or $\mu$), and a standard deviation, $s$ (or $\sigma$), and want to convert your data so you have a new mean, $\bar{x}_{new}$ (or $\mu_{new}$), and new standard deviation, $s_{new}$ (or $\sigma_{new}$), all you need to remember is what shift and scale changes affect. In our linear transformation formula, $a$ is the shift and $b$ is the scale:
$$x_{new} = a + b\,x$$
Transformation Standard deviations are only affected by scale changes, but means are affected by both shift and scale changes. This means that, for a linear transformation $x_{new} = a + b\,x$ with shift $a$ and scale $b$:
$$\bar{x}_{new} = a + b\,\bar{x} \qquad s_{new} = |b|\,s$$
For example, suppose College Station has an average annual temperature of 72 degrees Fahrenheit with a standard deviation of 10 degrees. We want to know what these statistics are in Celsius. The formula for Celsius is:
$$C = \frac{5}{9}(F - 32)$$
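The temperature example can be worked through with a short sketch. It assumes the usual conversion C = (5/9)(F − 32), rewritten as a shift of −160/9 plus a scale of 5/9; the function name `transform_stats` is hypothetical.

```python
def transform_stats(mean, sd, shift, scale):
    """New mean and standard deviation after x_new = shift + scale * x.

    The mean picks up both the shift and the scale;
    the standard deviation only picks up the scale.
    """
    return shift + scale * mean, abs(scale) * sd

# C = (5/9)(F - 32) = -160/9 + (5/9) F
scale = 5 / 9
shift = -32 * 5 / 9

mean_c, sd_c = transform_stats(72, 10, shift, scale)
print(round(mean_c, 2), round(sd_c, 2))  # 22.22 5.56
```

So an average of 72 °F with a standard deviation of 10 °F corresponds to about 22.2 °C with a standard deviation of about 5.6 °C. Notice that the 32 in the formula changes the mean but not the standard deviation.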
Transformations- Shifts Suppose we discover that the measurements were off by 3 inches because someone was measuring from the top of the shoe to the head. Obviously the recorded heights would not be the true heights of the subjects. If we assume every subject’s shoes were the same height of 3 inches, then we can fix the data with the equation:
$$x_{new} = x + 3$$
Ordered heights 62 65 68 68 69 69 70 71 71 72
Shifted heights 65 68 71 71 72 72 73 74 74 75
Notice what this does to the following statistics. What we see is that a shift change adds or subtracts the same amount from every statistic that is not related to spread. The statistics that describe spread (e.g. s² and IQR) are not affected by the shift.
Transformations - Scale Going back to our original subjects and their heights, suppose that instead of inches we wanted to know how tall everyone was in centimeters. Since 1 inch = 2.54 cm, we multiply every height by 2.54, and our new data is as follows:
Ordered heights 62 65 68 68 69 69 70 71 71 72
Heights in cm 157.48 165.10 172.72 172.72 175.26 175.26 177.80 180.34 180.34 182.88
Unlike with the shift, notice that every single one of these statistics is affected by the scale change.
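To see the effect of a shift and a scale change side by side, the sketch below recomputes a few summaries for the original heights, the heights shifted by 3 inches, and the heights converted to centimeters. The `summary` helper and the particular statistics chosen are only for illustration.

```python
from statistics import mean, median, stdev

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
shifted = [x + 3 for x in heights]     # shift: add 3 inches
scaled = [x * 2.54 for x in heights]   # scale: inches to centimeters

def summary(data):
    """A few center and spread statistics for one data set."""
    return {
        "mean": mean(data),
        "median": median(data),
        "min": min(data),
        "max": max(data),
        "range": max(data) - min(data),
        "sd": stdev(data),
    }

for label, data in (("original", heights), ("shifted", shifted), ("scaled", scaled)):
    rounded = {name: round(value, 2) for name, value in summary(data).items()}
    print(label, rounded)
```

The shifted row should show the centers (mean, median, min, max) moved up by 3 while the range and standard deviation are unchanged; the scaled row should show every statistic multiplied by 2.54.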