
Presentation Transcript


  1. Elementary Statistics

  2. NOTICE Please note that these slides are not a substitute for reading and interacting with your material. Their intended purpose is to serve as a quick reference for studying concepts and to present the material from a different angle that might help you better understand the statistics.

  3. For the Student: How do insurance companies determine premium rates for different age groups and sexes? What information does the government use to decide who gets taxed how much? How are you going to determine which vehicle is the safest to drive? The answer to each of these questions relies heavily on statistics. Unfortunately, a large share of the statistics in society has been badly manipulated and will give you unreliable results. Statistics is everywhere and affects virtually everyone. The question to ask yourself now is: are you going to become another “victim of statistics”? Whether your future career is in education, politics, or firefighting, making decisions based on statistics will be inevitable, and there will be consequences. Statistics is not hard; it only takes a little time and patience to gain a true understanding of what your information really means.

  4. To the Student A little hint to keep yourself from being overwhelmed while first learning statistics: it is not vital to memorize all of the equations. Although memorizing them can help, it is better to understand what the equations mean, why they are used, and when to use them. Some of the equations, especially any equation that has a Σ symbol, require adding a long series of numbers. Practically speaking, you should use a calculator or computer to evaluate such equations. Focus on understanding what the computer is doing, so that the number that pops out is more than a number to you, because that number means something. Don’t memorize, just recognize.

  5. Graphs and Summaries
     A. One Categorical Variable
        1. Graphs
        2. Summaries
        Applets
     B. One Numerical Variable
        1. Graphs
           a. Stemplot
           b. Histogram
           c. Boxplot
           d. Normal Quantile plot (Q-plot)
           e. Shapes
              i. Symmetric
                 1) Normal
                 2) Uniform
              ii. Skewed
                 1) Left
                 2) Right
        2. Summaries
           a. Locations
              i. Mean
              ii. Median
              iii. Mode
              iv. Min and Max
              v. Quartiles
              vi. Comparisons of Mean and Median
              vii. Z-scores
              Empirical Rule
           b. Spreads
              i. Variance
              ii. Standard Deviation
              iii. Range
              iv. Interquartile Range (IQR)
        3. Transformations
           a. Shift changes
              i. Centers
              ii. Spreads
           b. Scale changes
              i. Centers
              ii. Spreads

  6. Beginning Definitions Variable: the characteristic of interest that we want to understand. e.g. the percent of people in America who use deodorant; e.g. the average debt of a college graduate from Texas A&M. Individual: a single subject on which the variable is observed. e.g. Bob, an American who does not use deodorant; e.g. Jill, a college graduate of Texas A&M with $10,000 in debt.

  7. Variable Types Categorical (Qualitative) - Nominal, e.g. colors {red, blue, green} - Ordinal, e.g. strength {weak, moderate, strong} Numerical (Quantitative) - Discrete, e.g. number of children in a family - Continuous, e.g. amount of water the average house uses {Depending on the context, certain discrete variables can be treated as continuous for practical purposes, and continuous data can be made discrete.}

  8. Distribution - Shape, e.g. unimodal, bimodal, multimodal, symmetric, skewed right, skewed left - Center, e.g. mean, median - Spread, e.g. range, standard deviation, variance, interquartile range

  9. Categorical Graphs (Nominal or Ordinal) • Pie Charts • Bar Graphs

  10. Pie Charts (Counts and Percents)

  11. Bar Graphs

  12. Numerical Graphs (Univariate) • Stemplots • Histograms • Boxplots • Normal Quantile Plots (Q-plots)

  13. Stemplots
      Back-to-back stemplot
           boys |   | girls
             18 | 3 | 7
             67 | 4 | 5
              0 | 5 | 9
              7 | 6 | 389
          13379 | 7 | 0558
           1488 | 8 | 0025
                | 9 | 7
      Stemplot of Scores
          3 | 178
          4 | 567
          5 | 09
          6 | 3789
          7 | 013355789
          8 | 00124588
          9 | 7
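
As a rough illustration of how a stemplot is built, the sketch below regenerates the combined "Stemplot of Scores". The list `scores` is read off the stemplot itself (stem = tens digit, leaf = ones digit), and the helper name `stemplot` is a hypothetical name chosen for this example.

```python
from collections import defaultdict

# Scores read off the combined stemplot (stem = tens digit, leaf = ones digit).
scores = [31, 37, 38, 45, 46, 47, 50, 59, 63, 67, 68, 69,
          70, 71, 73, 73, 75, 75, 77, 78, 79,
          80, 80, 81, 82, 84, 85, 88, 88, 97]

def stemplot(values):
    """Return one text row per stem, with leaves in increasing order."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # .get() keeps empty stems visible without modifying the dict
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}")
    return rows

stem_rows = stemplot(scores)
print("\n".join(stem_rows))
```

Each row lists the leaves for one stem, so the length of a row shows how many scores fall in that range of ten.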

  14. Histogram Note that histograms can show either counts or percents, just as bar charts do.

  15. Boxplots Boxplots use the five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum) to define the box and whiskers, unless outliers are present. If there are low outliers, the lower whisker stops at the smallest value that is not an outlier; likewise, if there are high outliers, the upper whisker stops at the largest value that is not an outlier. Outlier? A number is considered an outlier if it lies more than 1.5 times the IQR (Interquartile Range) below the 1st quartile or above the 3rd quartile.
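
The 1.5 × IQR rule can be sketched as follows, assuming the ten student heights and the quartiles (68 and 71) that appear on later slides. The variable names are illustrative only.

```python
heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
q1, q3 = 68, 71                  # quartiles as given on the quartiles slide

iqr = q3 - q1                    # 3
lower_fence = q1 - 1.5 * iqr     # 63.5
upper_fence = q3 + 1.5 * iqr     # 75.5

# Values beyond the fences are flagged as outliers.
outliers = [h for h in heights if h < lower_fence or h > upper_fence]

# Whiskers stop at the most extreme values that are not outliers.
inside = [h for h in heights if lower_fence <= h <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(outliers, whisker_low, whisker_high)   # [62] 65 72
```

With these numbers the height 62 falls below the lower fence, so the lower whisker would stop at 65 and 62 would be drawn as a separate point.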

  16. Normal Plots (aka Q-plots) Q-plots are used to judge how reasonable it is to assume that a sample comes from a normal distribution. If it does, the points of the plot should fall along a straight 45-degree line or, when the plot includes a Q-line, follow the line closely. Unfortunately there is no clear rule for declaring a set of data normal or not. It takes practice examining patterns in Q-plots to judge the close calls, but if the data are strongly skewed, the departure from the line is very easy to see.
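
One possible way to compute the points of a normal quantile plot is sketched below. It assumes the ten student heights from later slides as the sample and uses the plotting positions (i - 0.5)/n, which is one common convention and not necessarily the one used to draw the slides.

```python
from statistics import NormalDist, mean, stdev

heights = sorted([71, 70, 68, 69, 68, 65, 72, 69, 71, 62])
n = len(heights)
m, s = mean(heights), stdev(heights)

# Plotting positions (i - 0.5) / n are an assumption; other conventions exist.
probs = [(i - 0.5) / n for i in range(1, n + 1)]

# Values a normal distribution with the same mean and SD would put at those positions.
theoretical = [NormalDist(m, s).inv_cdf(p) for p in probs]

for t, h in zip(theoretical, heights):
    print(f"expected {t:6.2f}   observed {h}")
```

Plotting the observed values against the expected ones gives the Q-plot; points near the 45-degree line suggest the normal assumption is reasonable.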

  17. Shapes-Symmetric-Normal The blue histograms are samples from a population of test grades with a mean of 65 and a standard deviation of 10. Notice that the histogram of the larger sample begins to look more like the density curve of a normal distribution (the red line).

  18. Shapes-Symmetric-Normal [Figures: boxplots and normal plots]

  19. Shapes-Symmetric-Uniform

  20. Shapes-Skewed-Right and Left The other major pattern to recognize is skew. Think about a skewer on a barbecue grill: everything seems lopsided toward one end of the stick. The pattern in graphs is similar. If the majority of the data lies on the left, leaving a long tail to the right, the graph is right skewed, and vice versa. [Figures: Right Skewed, Left Skewed]

  21. Shapes- Skewed Left

  22. Shapes-Skewed Right

  23. Summaries Locations - Mean Heights of students 71 70 68 69 68 65 72 69 71 62
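
The mean for these ten heights can be worked out as the sum divided by the count. A minimal sketch, with variable names chosen only for illustration:

```python
heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]

total = sum(heights)                 # add every observation
mean_height = total / len(heights)   # divide by the number of observations

print(total, mean_height)            # 685 68.5
```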

  24. Summaries Location - Median Heights of students 71 70 68 69 68 65 72 69 71 62 Ordered heights 62 65 68 68 69 69 70 71 71 72 Median = (69 + 69) / 2 = 69 If the number of observations is even, the median is the average of the middle two numbers. If the number of observations is odd, the middle number of the ordered data is the median. Heights of male students 65 68 70 71 72 Median = 70
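
The even/odd rule for the median can be sketched as a small function. The name `median` here is a local helper written for this example, not a reference to any particular library.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle number
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle two

all_median = median([71, 70, 68, 69, 68, 65, 72, 69, 71, 62])
male_median = median([65, 68, 70, 71, 72])

print(all_median, male_median)   # 69.0 70
```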

  25. Summaries Location - Mode, Min, Max Ordered heights 62 65 68 68 69 69 70 71 71 72 Mode = 68, 69 & 71 The mode is the most common number. If several numbers tie for most common, there is more than one mode. Min = 62 Max = 72
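
Counting how often each height occurs makes ties for the mode easy to spot. A short sketch using the standard library's `Counter`:

```python
from collections import Counter

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]

counts = Counter(heights)
highest = max(counts.values())   # the largest number of repeats

# Every value that reaches the highest count is a mode.
modes = sorted(v for v, c in counts.items() if c == highest)

print(modes, min(heights), max(heights))   # [68, 69, 71] 62 72
```

Here 68, 69, and 71 each appear twice, so all three are modes.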

  26. Summaries Location - Quartiles Ordered heights 62 65 68 68 69 69 70 71 71 72 1st Quartile = 68 3rd Quartile = 71 To find the 1st and 3rd quartiles, consider the data to the left and to the right of the median separately. The median is the 2nd quartile. The 1st quartile is the middle number (or the average of the two middle numbers if the subset has an even number of values) between the minimum and the median. The 3rd quartile is calculated the same way, using the maximum in place of the minimum. Technical note: Include the median when finding the 1st or 3rd quartile if the number of observations is odd.
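
The quartile procedure described on this slide, including the technical note about odd sample sizes, might be coded as below. The helper names are hypothetical, and note that other textbooks and software use slightly different quartile conventions.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the halves on each side of the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd count: the median itself is included in both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

q_all = quartiles([62, 65, 68, 68, 69, 69, 70, 71, 71, 72])
q_male = quartiles([65, 68, 70, 71, 72])

print(q_all)    # (68, 69.0, 71)
print(q_male)   # (68, 70, 71)
```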

  27. Comparing Means and Medians Notice the blue and red lines on the distribution graphs. The blue line represents the mean and the red line represents the median. This demonstrates that whenever data become skewed, the mean is affected more than the median. The bottom graph shows that the mean and median are about the same for a normal distribution. [Figures: Right Skewed, Left Skewed, and Normal Distribution (median and mean the same), each with the mean and median marked]

  28. Z-Scores Suppose we are given a set of data that has a normal distribution. Given that we already know the mean and the standard deviation, we want to find precisely how many standard deviations a certain value lies from the mean. That number is called a z-score. The equation is: $z = \frac{x - \mu}{\sigma}$ Why is the z-score useful to us? If we compare our z-score to the 68-95-99.7 rule, we can learn what percentage of values is greater than or less than our value. Suppose we had a z-score of 1.5. Our value lies between 1 and 2 standard deviations from the mean, so more than 68% but fewer than 95% of values are closer to the mean than ours; we would have less than a 32% chance, and more than a 5% chance, of picking a value at least this far from the mean at random. Now consider a z-score of -2.5, meaning our value is 2 and 1/2 standard deviations to the left of the mean. It lies between the 95% and 99.7% bands, which means we had less than a 5% chance, and more than a 0.3% chance, of picking a value at least this far from the mean at random. We can look up our z-score in a table of “Standard Normal Probabilities” to find our exact chances of being so lucky.
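
As a sketch of the table lookup, Python's `statistics.NormalDist` can stand in for a table of standard normal probabilities. The example grades 80 and 40 are made up for illustration and assume the test-grade population from the earlier slide (mean 65, standard deviation 10).

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical grades from a population with mean 65 and standard deviation 10.
z_high = z_score(80, 65, 10)   # 1.5
z_low = z_score(40, 65, 10)    # -2.5

std_normal = NormalDist()      # mean 0, standard deviation 1
below_high = std_normal.cdf(z_high)   # proportion of values below z = 1.5
below_low = std_normal.cdf(z_low)     # proportion of values below z = -2.5

print(f"{below_high:.4f} {below_low:.4f}")   # 0.9332 0.0062
```

So roughly 93% of values fall below a z-score of 1.5, and well under 1% fall below a z-score of -2.5.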

  29. Z-Scores Based on the standard deviation, z-scores tell us how far an observation is from the mean. A z-score of 1 corresponds to one standard deviation above the mean. The 68-95-99.7 rule helps in interpreting what the value of a z-score really means. Figure 2 is a density curve demonstrating what is meant by the 68-95-99.7 rule. The blue area contains 68% of the data; it ends at z = 1 and z = -1. The red plus the blue contains 95% of the data, with outer edges at z = 2 and z = -2. Likewise, adding the green brings the total to 99.7% of the data. If we had a z-score of 0.5, we know that our number is somewhere in the blue. A z-score of 2.5 would lie somewhere in the green. [Figure 2 legend: Blue = 68%; Blue & Red = 95%; Blue, Red & Green = 99.7%]

  30. When to use 68-95-99.7 rule When do we use the Empirical Rule? It is better to make a decision based on graphs (histograms, boxplots, Q-plots), but if all we are given is a frequency table like the one on this slide, we can still notice some features of the distribution by observing the frequency column. The tallies need to be fairly low at the top and bottom of this column, with the data building up near the middle. In this example that is what we have. If this pattern is apparent, it is then necessary to compare the standard deviation of the data with the percentiles. If the data are normal, the interval within one standard deviation of the mean should contain about 68% of the data. According to the table, the middle 68% of the data lies between 5 and 14, a length of 9. The standard deviation is approximately 4.7, so the empirical rule says we expect this length to be about 2 × 4.7 = 9.4. Since 9 is close to 9.4, we conclude that the data have an approximately Normal distribution.

  31. Empirical Rule usage Once again, the two things we need to check are the pattern of the tallies and the 68% interval. Here the tallies at the ends of the frequency column appear the same as or bigger than those at the center. But to be safe, we compare the 68% interval with the standard deviation. The lower bound of the interval is between -1 and 0, and the upper bound is between 19 and 20. Therefore the length of the interval is between 19 and 21. With the empirical rule we would expect this length to be around 2 × 8.34 = 16.68. Because 16.68 is clearly smaller than either 19 or 21, we conclude that the data are not normal.
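
The comparison made on the last two slides can be expressed as a ratio: the observed length of the middle-68% interval divided by the 2 × SD that the empirical rule predicts. The function name is hypothetical, and no cutoff for "close enough to 1" is implied; that remains a judgment call.

```python
def empirical_ratio(interval_length, sd):
    """Observed middle-68% interval length divided by the predicted length 2 * sd."""
    return interval_length / (2 * sd)

# First table: interval from 5 to 14 (length 9), standard deviation about 4.7.
first = empirical_ratio(9, 4.7)

# Second table: interval length between 19 and 21, standard deviation 8.34.
second_low = empirical_ratio(19, 8.34)
second_high = empirical_ratio(21, 8.34)

print(f"{first:.2f} {second_low:.2f} {second_high:.2f}")   # 0.96 1.14 1.26
```

A ratio near 1 is consistent with a normal distribution; the second data set's ratio is noticeably above 1 under either bound.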

  32. Spreads - Variance Variance is a number that describes how much the data “varies”. There are two formulas. The first is used when the data are the whole population, so the population mean $\mu$ is known: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$. The second is used for a sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$. It divides by n – 1 because deviations measured from the sample mean tend to be smaller than deviations from the population mean, so dividing by n would understate the variance of the population that the sample comes from. However, as n gets large there is very little difference between these two equations.

  33. Spreads - Standard Deviation The standard deviation is just the square root of the variance: $\sigma = \sqrt{\sigma^2}$ for a population and $s = \sqrt{s^2}$ for a sample. A value one standard deviation above the mean has a z-score of exactly 1. Once again, the difference between the two formulas is whether the data are the whole population or a sample from a population.

  34. Spread - Calculation of variance and standard deviation Heights of male students 65 68 70 71 72
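
The calculation for the five male heights can be followed step by step in the sketch below, which shows both the sample (divide by n - 1) and population (divide by n) versions. Variable names are illustrative only.

```python
from math import sqrt

male_heights = [65, 68, 70, 71, 72]
n = len(male_heights)

mean = sum(male_heights) / n                            # 69.2
squared_devs = [(x - mean) ** 2 for x in male_heights]  # squared distance from the mean
sum_sq = sum(squared_devs)                              # about 30.8

sample_var = sum_sq / (n - 1)   # treats the data as a sample
pop_var = sum_sq / n            # treats the data as the whole population
sample_sd = sqrt(sample_var)
pop_sd = sqrt(pop_var)

print(f"{sample_var:.2f} {sample_sd:.3f} {pop_var:.2f} {pop_sd:.3f}")
# 7.70 2.775 6.16 2.482
```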

  35. Summaries - Range and IQR Ordered heights 62 65 68 68 69 69 70 71 71 72 1st Quartile = 68 3rd Quartile = 71 Range = Max – Min = 72 – 62 = 10 Interquartile Range (IQR) = 3rd Quartile – 1st Quartile = 71 – 68 = 3

  36. Transformations A transformation applies the same function to every value of a data set. For example, if we add a number n to every observation, we get a transformed data set that is shifted n units. If we multiply or divide every observation by the same number, the data set has a new scale. If you are given a mean, $\bar{x}$ (or $\mu$), and a standard deviation, $s$ (or $\sigma$), and want to convert your data so that you have a new mean, $\bar{x}_{new}$ (or $\mu_{new}$), and a new standard deviation, $s_{new}$ (or $\sigma_{new}$), all you need to remember is what shift and scale changes affect. In our linear transformation formula, $x_{new} = a + b x$, $a$ is the shift and $b$ is the scale.

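  37. Transformation Standard deviations are affected only by scale changes, but means are affected by both shift and scale changes. This means that, for a linear transformation $x_{new} = a + b x$ with shift $a$ and scale $b$: $\bar{x}_{new} = a + b \bar{x}$ and $s_{new} = |b| s$. For example, suppose College Station has an average annual temperature of 72 degrees Fahrenheit with a standard deviation of 10 degrees. We want to know what these statistics are in Celsius. The formula for Celsius is: $C = \frac{5}{9}(F - 32)$, so the mean is $\frac{5}{9}(72 - 32) \approx 22.2$ degrees Celsius and the standard deviation is $\frac{5}{9} \times 10 \approx 5.6$ degrees Celsius.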
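
The temperature example can be checked with a small helper that applies a shift and a scale to a mean and standard deviation. The function name `transform_stats` and the parameter names `shift` and `scale` are assumptions made for this sketch.

```python
def transform_stats(mean, sd, shift, scale):
    """New mean and SD after the linear transformation x_new = shift + scale * x."""
    new_mean = shift + scale * mean   # the mean feels both shift and scale
    new_sd = abs(scale) * sd          # the SD feels only the scale
    return new_mean, new_sd

# Celsius = (5/9) * (Fahrenheit - 32) = -160/9 + (5/9) * Fahrenheit
scale = 5 / 9
shift = -32 * 5 / 9

mean_c, sd_c = transform_stats(72, 10, shift, scale)
print(f"{mean_c:.2f} {sd_c:.2f}")   # 22.22 5.56
```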

  38. Transformations - Shifts Suppose we discover that the measurements were off by 3 inches because someone was measuring from the top of the shoe to the head. Obviously the recorded heights would not be the true heights of the subjects. If we assume every subject’s shoes were the same height of 3 inches, we can fix the data with the equation: $x_{new} = x + 3$ Ordered heights 62 65 68 68 69 69 70 71 71 72 Shifted heights 65 68 71 71 72 72 73 74 74 75 Notice what this does to the statistics: a shift change adds or subtracts the same amount from every statistic that is not related to spread. The statistics that describe the spread (e.g. $s^2$ and IQR) are not affected by the shift.
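
A quick way to see the effect of a shift is to compute a few statistics before and after adding 3 inches. This sketch uses Python's `statistics` module; the dictionary layout and the `summarize` helper are just one convenient way to line up the numbers.

```python
from statistics import mean, median, variance

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
shifted = [h + 3 for h in heights]   # add the 3-inch shoe height back

def summarize(data):
    """A few location and spread statistics for one data set."""
    return {
        "mean": mean(data),
        "median": median(data),
        "min": min(data),
        "max": max(data),
        "range": max(data) - min(data),
        "variance": variance(data),   # sample variance (divides by n - 1)
    }

before, after = summarize(heights), summarize(shifted)
for name in before:
    print(f"{name:9s} {before[name]:8.3f} {after[name]:8.3f}")
```

The location statistics (mean, median, min, max) each go up by 3, while the range and variance stay the same.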

  39. Transformations - Scale Going back to our original subjects and their heights, suppose that instead of inches we wanted to know how tall everyone was in cm. 1 inch = 2.54 cm, so our new data are as follows: Ordered heights 62 65 68 68 69 69 70 71 71 72 Heights in cm 157.48 165.10 172.72 172.72 175.26 175.26 177.80 180.34 180.34 182.88 Unlike with the shift, notice that every single one of these statistics is affected by the scale change.
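
The same kind of check works for a scale change. In this sketch the mean and standard deviation are multiplied by 2.54, and the variance by 2.54 squared; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

heights_in = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
heights_cm = [h * 2.54 for h in heights_in]   # 1 inch = 2.54 cm

mean_in, mean_cm = mean(heights_in), mean(heights_cm)
sd_in, sd_cm = stdev(heights_in), stdev(heights_cm)
var_in, var_cm = variance(heights_in), variance(heights_cm)

print(f"mean     {mean_in:7.2f} in  {mean_cm:8.2f} cm")
print(f"sd       {sd_in:7.3f} in  {sd_cm:8.3f} cm")
print(f"variance {var_in:7.3f}     {var_cm:8.3f}")
```

Both the center and the spread change here, which is the contrast with the shift example.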
