Learn how to construct frequency tables and histograms to summarize data and understand the distribution. Includes definitions and examples.
Section 1B Descriptive Statistics William Christensen, Ph.D.
Definitions • Center • A value that indicates where the middle of the data set is located • Variation • A measure of the amount that the values vary among themselves
Definitions • Distribution • The nature or shape of the distribution of data (such as bell-shaped, uniform, or skewed) • Outliers • Sample values that are a lot smaller or larger than most of the other sample values
In learning to summarize data, one good way to start is by learning how to construct “Frequency Tables” and “Histograms” • Frequency Table – a table in which we list classes (categories) of values, along with the frequencies (counts) of the number of values that fall into each class. • …Let’s take a look at some actual data and use it to construct a Frequency Table
Sample Data 2 2 5 1 2 6 3 3 4 2 4 0 5 7 7 5 6 6 8 10 7 2 2 10 5 8 2 5 4 2 6 2 6 1 7 2 7 2 3 8 1 5 2 5 2 14 2 2 6 3 1 7
Frequency Table of Sample Data

Class      Frequency
0 - 2      20
3 - 5      14
6 - 8      15
9 - 11     2
12 - 14    1

Classes: Here are the classes or categories that we set up. YOU decide what these classes are, but you should normally have an odd number of categories (5 or 7 is good). Notice how each category is the same size (the difference between low-end and high-end is always 2 in this case). Also notice that none of the categories overlaps any other category – they are all separate.

Frequencies: Here are the frequencies or counts. We got these by actually going back to the data list and counting how many zeros, ones and twos we had (20 of them), then how many threes, fours and fives (14 of them), etc. No rocket science or formulas here, just simple counting. Notice that if we add up all the frequencies (20+14+15+2+1) we get 52, which is exactly how many data values we had (see previous slide).

STOP – You must understand this simple concept, so review it and ask questions until you do.
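The counting described above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides; the variable names are made up for the example, and the class limits are the ones chosen in the table.

```python
# The 52 sample values from the Sample Data slide.
data = [2, 2, 5, 1, 2, 6, 3, 3, 4, 2, 4, 0, 5, 7, 7, 5, 6, 6, 8, 10,
        7, 2, 2, 10, 5, 8, 2, 5, 4, 2, 6, 2, 6, 1, 7, 2, 7, 2, 3, 8,
        1, 5, 2, 5, 2, 14, 2, 2, 6, 3, 1, 7]

# Classes as (lower limit, upper limit) pairs, matching the table.
classes = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14)]

# Count how many values fall inside each class (both limits are inclusive).
frequencies = [sum(1 for x in data if low <= x <= high) for low, high in classes]

print("Class     Frequency")
for (low, high), count in zip(classes, frequencies):
    print(f"{low:>2} - {high:<2}   {count}")

# The frequencies must add up to the number of data values.
print("Total:", sum(frequencies), "of", len(data), "values")
```

Because the classes do not overlap and together cover every value, each data value is counted exactly once.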
Frequency Tables: Attributes & Definitions

• Lower Class Limits: The smallest numbers that can actually belong to the different classes or categories. In our table: 0, 3, 6, 9, 12.
• Upper Class Limits: The largest numbers that can actually belong to the different classes or categories. In our table: 2, 5, 8, 11, 14.
• Class Boundaries: The numbers exactly half-way between the classes or categories (no gaps). In our table: -0.5, 2.5, 5.5, 8.5, 11.5, 14.5.
• Class Midpoints: The numbers exactly in the middle of each class or category. In our table: 1, 4, 7, 10, 13.
• Class Width: The difference between two consecutive lower class limits (or two consecutive class boundaries). IT IS NOT the difference between the upper and lower class limits. In our table the class width is 3 (for example, 3 – 0 = 3).
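As a rough sketch, the attributes defined above can all be derived from the class limits. The code assumes equal-width classes with the same gap between every pair of neighboring classes, as in our table; the variable names are illustrative only.

```python
# Classes as (lower limit, upper limit) pairs from the frequency table.
classes = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14)]

lower_limits = [low for low, _ in classes]
upper_limits = [high for _, high in classes]

# Gap between one class's upper limit and the next class's lower limit.
gap = lower_limits[1] - upper_limits[0]

# Boundaries sit half a gap below the first class and half a gap above each upper limit.
boundaries = [lower_limits[0] - gap / 2] + [high + gap / 2 for high in upper_limits]

# Midpoint of each class: halfway between its lower and upper limit.
midpoints = [(low + high) / 2 for low, high in classes]

# Class width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]

print("Lower limits:", lower_limits)
print("Upper limits:", upper_limits)
print("Boundaries:  ", boundaries)
print("Midpoints:   ", midpoints)
print("Class width: ", class_width)
```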
Frequency Tables Summary
• Be sure the classes are mutually exclusive (i.e., they must not overlap)
• Include all intermediate classes, even if the frequency is zero (i.e., don’t skip a class/category just because the frequency is zero)
• Use the same width for all classes
STOP – before proceeding, find some data and use it to construct a frequency table – you must experience this concept in order to fully understand it. Data sets are available at http://www.awl.com/triolaexcel
Pictures of Data
Napoleon I, Emperor of France (a.k.a. short, dead dude), 1769 - 1821: “A picture is worth a thousand words.”
We can use pictures of data to show us the nature or shape of how data is distributed. A Histogram is a picture or graph of a Frequency Table. More specifically, it is a bar graph in which the horizontal scale represents the classes/categories (from our frequency table) and the vertical scale represents frequencies (from our frequency table).
Remember our Frequency Table:

Class      Frequency
0 - 2      20
3 - 5      14
6 - 8      15
9 - 11     2
12 - 14    1

To make a HISTOGRAM, we take our Classes/Categories and make them into the labels on the horizontal axis of our Histogram bar graph. The frequencies become the height of each bar on the Histogram bar graph.
[Figure: Histogram from our Frequency Table. Horizontal axis: Class (0 - 2, 3 - 5, 6 - 8, 9 - 11, 12 - 14); vertical axis: Frequency. Bar heights: 20, 14, 15, 2, 1.]
Frequencies become values for bars. Classes become labels on the horizontal axis.
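A quick way to see the shape without any charting software is a text histogram. The sketch below is only an illustration: it draws the bars sideways (one row per class) instead of upright, and the function name `text_histogram` is made up for this example.

```python
# Class labels and frequencies from the frequency table.
classes = ["0 - 2", "3 - 5", "6 - 8", "9 - 11", "12 - 14"]
frequencies = [20, 14, 15, 2, 1]

def text_histogram(labels, counts):
    """Return one line per class: the label, a bar of '#' marks, and the count."""
    lines = []
    for label, count in zip(labels, counts):
        lines.append(f"{label:>7} | {'#' * count} {count}")
    return lines

for line in text_histogram(classes, frequencies):
    print(line)
```

Each `#` stands for one data value, so the length of a bar plays the role of the bar height in the histogram.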
Information on how to use Excel to make a Frequency Table and Histogram is included at the end of this section
In addition to Histograms, there are other “pictures of data” or graphs that are commonly used and very helpful in explaining data to us. Pareto Charts are a special type of graph, often presented as vertical bar graphs or “pie charts”, that help us to prioritize categories of data. Pareto Charts are very helpful when we have many categories of data that we are trying to sort through, looking for what is most important or frequent versus what is less important or frequent. Pareto Charts always list the categories in order from the most frequent to the least. For example, the following Pareto Chart shows the frequencies for various causes of accidental death. If you wanted to help reduce the number of accidental deaths, would you find this chart helpful in deciding what things you should focus on?
[Figure: Pareto Chart (Bar Graph): Accidental Deaths by Type. Vertical axis: Frequency (0 to 45,000 in steps of 5,000). Categories: Motor Vehicle, Falls, Poison, Drowning, Fire, Ingestion of food or object, Firearms.]
Pareto Chart (Pie Chart): Accidental Deaths by Type

Type                           Deaths    Percent
Motor vehicle                  43,500    57.8%
Falls                          12,200    16.2%
Poison                          6,400     8.5%
Drowning                        4,600     6.1%
Fire                            4,200     5.6%
Ingestion of food or object     2,900     3.9%
Firearms                        1,400     1.9%
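The ordering that makes a chart a Pareto Chart is simply a sort from most frequent to least. A minimal sketch, using the accidental-death counts shown in the charts (the variable names are illustrative):

```python
# Accidental deaths by type, in no particular order.
deaths = {
    "Fire": 4200,
    "Falls": 12200,
    "Poison": 6400,
    "Firearms": 1400,
    "Drowning": 4600,
    "Motor vehicle": 43500,
    "Ingestion of food or object": 2900,
}

total = sum(deaths.values())

# Pareto order: categories sorted from the most frequent to the least.
pareto = sorted(deaths.items(), key=lambda item: item[1], reverse=True)

for cause, count in pareto:
    percent = 100 * count / total
    print(f"{cause:<28} {count:>6,} {percent:5.1f}%")
```

Reading from the top, the first one or two rows account for most of the total, which is exactly the prioritization the chart is meant to support.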
Another “picture of data” or graph that is commonly used and very helpful in explaining data to us is the Scatter Diagram: a simple point graph used to see the relationship between two factors/variables. For example, the following Scatter Diagram shows the Tar and Nicotine content for various types of cigarettes. When you look at this graph, can you see if there is a relationship between Tar and Nicotine in cigarettes? (By relationship, we mean either that as one goes up the other also tends to go up, or that as one goes down the other tends to go down.)
[Figure: Scatter Diagram. Horizontal axis: NICOTINE (0.0 to 1.5); vertical axis: TAR (0 to 20).]
Each point on this scatter diagram represents the tar & nicotine content for a particular brand/type of cigarette. It appears that as one goes up, so does the other, which suggests there is a relationship between Tar & Nicotine levels in cigarettes.
You can use your imagination to draw “pictures of data” to help express whatever story you are trying to tell. Example: The following chart was constructed by Florence Nightingale (1820-1910) who wanted to show the British government the great many deaths caused by poor sanitary conditions in British military hospitals during the Crimean War. Take a good look at the graph – it clearly shows that many more soldiers died due to unsanitary hospital conditions than died in combat. Because of her skilled use of statistics, Florence Nightingale was able to convince the government to improve conditions and save many lives.
[Figure: Deaths in British Military Hospitals During the Crimean War. Legend: deaths from wounds; deaths from other causes; deaths from preventable diseases. Graph developed in the 1850s by Florence Nightingale.]
Measures of Center
In describing data, one of the simplest and most meaningful things we can know is some kind of value to represent the center or middle. The measures of center are: • Mean (average) • Median • Mode • Midrange
Mean (average) • Surely you already know what an “average” is and how to calculate it • Mean or Average is the number obtained by adding the values and dividing by the number of values • The following slide shows the symbols and formula for doing this calculation – you must know and understand all of this
Mean (average) • ∑ the symbol that tells us to sum or add everything that comes after it • x the variable/letter usually used to represent the individual data values • n represents the number of data values in a sample • N represents the number of data values in an entire population (do you remember the difference between a sample and a population?)
Mean (average)
$\bar{x} = \frac{\sum x}{n}$
$\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values. We read the above formula as follows: The sample Mean (x-bar) equals the sum of x’s (each value in the sample) divided by n (the number of values in the sample). The Excel command =AVERAGE(range of values) calculates the mean of a data set.
Mean (average)
$\mu = \frac{\sum x}{N}$
$\mu$ is pronounced ‘mu’ and denotes the mean of all values in a population. We read the above formula as follows: The population Mean (mu) equals the sum of x’s (each value in the population) divided by N (the number of values in the population).
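To make the formula concrete, here is a small sketch that applies it to the 52 sample values used earlier for the frequency table. The variable names are illustrative; the calculation is just "add the values, divide by how many there are".

```python
# The 52 sample values from the Sample Data slide.
data = [2, 2, 5, 1, 2, 6, 3, 3, 4, 2, 4, 0, 5, 7, 7, 5, 6, 6, 8, 10,
        7, 2, 2, 10, 5, 8, 2, 5, 4, 2, 6, 2, 6, 1, 7, 2, 7, 2, 3, 8,
        1, 5, 2, 5, 2, 14, 2, 2, 6, 3, 1, 7]

n = len(data)        # number of values in the sample
total = sum(data)    # the "sum of x's"
x_bar = total / n    # sample mean: sum of x's divided by n

print(f"sum = {total}, n = {n}, mean = {x_bar:.1f}")
```

The same arithmetic gives the population mean when the list holds every value in the population; only the symbols (µ and N) change.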
Median • The middle value when the data values are arranged in order from low-to-high (or high-to-low)
Median • often denoted by $\tilde{x}$ (pronounced ‘x-tilde’) • is not affected by an extreme value • For example: if we had the data values 2, 4, 6, 8, 50 the median would simply be the middle value 6 and is not affected by the large value (50). However, the mean would be (2+4+6+8+50)/5 = 14, a value higher than all but one of the data values and strongly affected by the large value (50).
Median: Steps in calculating the Median
• Arrange the data in order low-to-high (or high-to-low)
• Pick the middle value
• If there are an even number of data values, take the average of the two middle values
• If there are an odd number of data values, simply take the middle value by itself.

Example with an even number of values:
6.72 3.46 3.60 6.44 (here is data)
3.46 3.60 6.44 6.72 (first, arrange in order)
No exact middle: it is shared by two numbers, so take the average of the two middle values.
(3.60 + 6.44) / 2 = 5.02
MEDIAN is 5.02

Example with an odd number of values:
6.72 3.46 3.60 6.44 26.70 (here is data)
3.46 3.60 6.44 6.72 26.70 (first, arrange in order)
With an odd number of data values, simply take the middle value.
MEDIAN is 6.44
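The steps for finding the median translate almost directly into code. The function below is a sketch written for this example (Python's built-in `statistics.median` does the same job) and assumes the list is not empty.

```python
def median(values):
    """Median by the slide's steps: sort, then take the middle value(s)."""
    ordered = sorted(values)      # step 1: arrange low-to-high
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: take the middle value by itself.
        return ordered[middle]
    # Even number of values: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(round(median([6.72, 3.46, 3.60, 6.44]), 2))   # even number of values
print(median([6.72, 3.46, 3.60, 6.44, 26.70]))      # odd number of values
print(median([2, 4, 6, 8, 50]))                     # the large value 50 does not move the median
```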
Mode (denoted by M) • Mode is simply the value that occurs most frequently • If only one value occurs more frequently than any other value then we have a simple “Mode” • If no value occurs more than once there is “No Mode” • If two values occur with exactly the same frequency, but more than any other values, then we say the data is “Bimodal” • If three or more values (but not all values) occur with exactly the same frequency, but more than any other values, then we say the data is “Multimodal”
Mode (denoted by M): Examples
a. 5 5 5 3 1 5 1 4 3 5: Mode is 5
b. 1 2 2 2 3 4 5 6 6 6 7 9: Bimodal (2 & 6)
c. 1 2 3 6 7 8 9 10: No Mode
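The three examples can be checked with a short counting sketch. The helper name `modes` is made up for this illustration; it returns a list so that bimodal and multimodal data can be reported, and an empty list when no value occurs more than once. It assumes a non-empty data set and does not treat the "all values tie" case specially.

```python
from collections import Counter

def modes(values):
    """Return the value(s) that occur most often, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value occurs more than once: "No Mode"
    return sorted(value for value, count in counts.items() if count == highest)

a = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5]
b = [1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
c = [1, 2, 3, 6, 7, 8, 9, 10]

print(modes(a))  # one mode
print(modes(b))  # two modes: bimodal
print(modes(c))  # empty list: no mode
```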
Midrange • The value midway between the highest and lowest values in the data set:
$\text{Midrange} = \frac{\text{highest score} + \text{lowest score}}{2}$
Round-off Rule for Measures of Center Carry one more decimal place than is present in the original set of values Example: If data is whole numbers (e.g., 1, 5, 10) then round the mean, median, mode, or midrange to one decimal place
Special Case: Weighted Mean
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
• Example: Given three test scores (85, 90, 75). The first test counts 20% of the grade, the second counts 30% and the third counts 50%. Weighted average = [(0.20*85)+(0.30*90)+(0.50*75)] / (0.20+0.30+0.50) = (17 + 27 + 37.5) / 1.00 = 81.5
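The weighted-mean formula can be sketched as a small function. The name `weighted_mean` is chosen for this example only; the scores and weights are the ones from the test-score example.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

scores = [85, 90, 75]
weights = [0.20, 0.30, 0.50]   # 20%, 30% and 50% of the grade

result = weighted_mean(scores, weights)
print(round(result, 1))
```

Because the formula divides by the sum of the weights, writing the weights as 20, 30 and 50 instead of 0.20, 0.30 and 0.50 gives the same answer.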
What is the Best Measure of Center? • Each measure (mean, median, mode, and midrange) has its advantages & disadvantages. Depending on what you are trying to show, you might use any of these. Perhaps the best/fairest thing to do is to report all the measures and tell the whole story. • Example: Let’s say we collect data on wages/salaries in St. George. What measure of center would be best? • If we report Mean/average wage in St. George it might be pulled up because of some very high wage earners and might be $15 - $20 an hour. • If we report Median wage it would be like putting everyone in St. George in a line arranged from lowest-to-highest wage and picking the person in the middle. This might be about $10/hour and tells a different story than the mean. • On the other hand, perhaps we want to report the wage Mode, which would be the wage that more people earn than any other wage. This would probably be minimum wage.
Summary: Measures of Center • If data is perfectly symmetrical (as shown by a bell-shaped histogram), the mean, median, and mode would all be equal. • Example: given data 1,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7 the mean=4, median=4 and mode=4. • A smoothed histogram (a line instead of bars) gives us a picture of what this kind of distribution looks like:
[Figure: smoothed histogram of a SYMMETRIC distribution, where Mode = Mean = Median]
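The symmetric example can be verified with Python's standard `statistics` module. This is only a check of the numbers quoted in the example, not part of the original slides.

```python
import statistics

# The perfectly symmetrical data set from the example.
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# For symmetrical data all three measures of center coincide.
print("mean =", mean, "median =", median, "mode =", mode)
```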
Summary: Measures of Center • If data is not symmetrical then we say it is “skewed” and the mean, median, and mode will NOT all be equal. • Example: the following graphs show skewed distributions of data
[Figure: two smoothed histograms, SKEWED LEFT (negatively) and SKEWED RIGHT (positively), each marking the Mean, Median, and Mode]
Measures of Variation
As we just learned, measures of center can be very helpful in understanding data. However, measures of center alone don’t tell the whole story and we need another way to describe how data is distributed. The following example illustrates this point.
Measures of Variation
Example: Waiting Times at Two Banks
Which bank would you prefer? Why?

We timed 10 customers at each bank and recorded the following waiting times in minutes.

Jefferson Valley Bank    Bank of Providence
6.5                      4.2
6.6                      5.4
6.7                      5.8
6.8                      6.2
7.1                      6.7
7.3                      7.7
7.4                      7.7
7.7                      8.5
7.7                      9.3
7.7                      10.0

When we calculate our Measures of Center, we obtain the following results and find they are exactly the same for both banks.

            Jefferson Valley Bank    Bank of Providence
Mean        7.15                     7.15
Median      7.20                     7.20
Mode        7.7                      7.7
Midrange    7.10                     7.10

So is there any difference between the banks? If so, it is not in their measures of center – there must be something else going on.
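The claim that both banks share the same measures of center can be checked with a short sketch. The helper `measures_of_center` is made up for this illustration; results are rounded to two decimal places to match the table.

```python
import statistics

# Waiting times in minutes for 10 customers at each bank.
jefferson = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def measures_of_center(values):
    """Mean, median, mode and midrange, rounded to two decimal places."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": round(statistics.median(values), 2),
        "mode": statistics.mode(values),
        "midrange": round((max(values) + min(values)) / 2, 2),
    }

print("Jefferson Valley Bank:", measures_of_center(jefferson))
print("Bank of Providence:   ", measures_of_center(providence))
```

Both dictionaries come out identical even though the individual waiting times are clearly spread out very differently.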
Measures of Variation Example: Waiting Times at two Banks Which Bank would you prefer? Why? Let’s take a look at the histograms for both banks and perhaps we can see some difference in distribution or variation. From the histograms it looks like Jefferson Valley Bank almost always serves their customers in 6.5 to 8.4 minutes, whereas the service times for Bank of Providence have a much wider range. Which would you prefer?
Measures of Variation We can express differences in data distribution or Variation through various Measures of Variation, including: • Range • Standard Deviation (the most important) • Variance
Measures of Variation • Range • This is the simplest measure of variation • Range = highest value – lowest value • Example: • Jefferson Valley Bank Highest (longest) wait time is 7.7 minutes Lowest (shortest) wait time is 6.5 minutes Range = 7.7 – 6.5 = 1.2 minutes • Bank of Providence Highest (longest) wait time is 10.0 minutes Lowest (shortest) wait time is 4.2 minutes Range = 10.0 – 4.2 = 5.8 minutes
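The range calculation for the two banks can be sketched as follows. The function name `value_range` is chosen for this example, and the result is rounded to one decimal place because subtracting decimal values on a computer can leave a tiny rounding error.

```python
# Waiting times in minutes for 10 customers at each bank.
jefferson = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

print("Jefferson Valley Bank range:", round(value_range(jefferson), 1), "minutes")
print("Bank of Providence range:   ", round(value_range(providence), 1), "minutes")
```

The much larger range for Bank of Providence is the difference that the measures of center could not show.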