Chapter 1 Examining Distributions
A class in statistics will have the student think about data. As in most subject areas in academia, early work in the area is devoted to definitions and learning the jargon of the folks who work in the area. In a way, one begins by learning the language of the subject matter. Let’s do this here. Data sets are made up of cases and variables. Cases are the objects described by a data set. As a student you are an object in my data set of grades on assignments and exams. On the next few slides I will show an example where each state in the United States is an object or case.
A variable is a characteristic of a case. Note that each case can have a different value on the variable. Final exam score is an example of a variable in the class grades data set I will be keeping. In the state data example to follow a variable is the unemployment rate in the state in 2008. Note also that the variable “State” is a label that helps me identify each case.
You can see here that the data set is initially organized in alphabetical order. The unemployment rate listed is not in any order, other than being matched with the state. But that makes sense because that is the way the data is set up and is easy for a user to check on each state.
In order to make sense out of the unemployment rate information I have rearranged or sorted the data from low to high unemployment rates. Note also that when the data was sorted on the unemployment rate the names of the states moved as well!
Here I even put the data in a horizontal presentation because with the numbers you and I typically think of the number line in this way.
Here I used line segments to mark off, or group, rates that all have the same starting percentage range. You see the 3, 4, 5, 6, 7, and 8 percent ranges.
[Slide figure: the sorted rates grouped by percentage range, with group counts 16, 12, 9, 6, 5, and 2.]
On the previous slide I began to do some additional processing of the information provided in the data set. Although what I have done is somewhat crude, you get a visual sense of what is happening in each range of percentages. You can see, for example, that unemployment rates in the 4% range occur most often across the 50 states in the year covered by this data set. The distribution of a variable has special meaning. In the distribution of a variable we see the values the cases have on the variable and how often each value occurs. What I presented on the unemployment variable is getting close to the idea of the distribution of unemployment rates across the states.
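The grouping idea can be sketched in a few lines of Python. This is only an illustration: the rates below are made-up values, not the 2008 state figures from the slide, and the names `rates` and `range_counts` are my own.

```python
from collections import Counter

# Hypothetical unemployment rates (percent) for a handful of states;
# these are made-up values, not the 2008 figures from the slide.
rates = [3.0, 3.4, 4.1, 4.4, 4.8, 5.3, 5.5, 6.4, 7.2, 8.4]

# Group each rate by its whole-number percentage range (4.1 -> the 4% range).
range_counts = Counter(int(rate) for rate in rates)

# Show how often each range occurs, from low to high.
for percent_range in sorted(range_counts):
    print(f"{percent_range}% range: {range_counts[percent_range]} states")
```

Listing each range next to how often it occurs is exactly the "values and how often" idea behind a distribution.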
Back on slide 4 I had an Excel spreadsheet. As is typical, there I had cases in the rows and variables in the columns. Variables are classified in 2 broad ways: 1) categorical or qualitative, and 2) numerical or quantitative. A categorical variable places cases into one of several groups or categories. In our example, the variable “State” is categorical and we in fact have a group for each state. (Note that sometimes when referring to states we may be interested in “regions” of the country; a category might be “Midwest,” and several states would be in this category.) A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
Some folks in statistics want to give 4 types of variables and talk about units of measurement. Categorical variables have 2 subgroups. Nominal variables just have different groups with no natural order to the groups. Examples would be hair color, national origin, and gender. Ordinal variables are still categorical, but the groups have a natural ordering. An example would be your response to how much you like a product. You could say not at all, a little, a lot, or maybe even a great deal. Numerical variables have 2 subgroups: interval variables and scale variables, which for us mean about the same thing. Temperature is a variable that is interval. Note that 2 degrees is not really half of 4 degrees because the zero point on the temperature scale is somewhat arbitrary. But we may still want to take an average, for example.
A variable such as the wage is a scale variable because $20 per hour is twice the scale of $10 per hour. We also might average the wage across many people. One other point. Sometimes it is useful to use numbers to represent categories. For example, in a data set at a beauty salon we might use a code for the categorical variable “hair color” where, for example, a 1 means blond, 2 means brown, and so on. But, even though we are using numbers in the data, the variable is still a categorical variable.
Let’s say I am interested in every person who is in the city limits of Wayne, Nebraska at the point in time January 10, 2012 at 12:30pm. Just for kicks, let’s imagine that every person in the city limits at this time is frozen in place. It would be like freezing the video of a movie at a particular frame. Let’s say then that you and I can walk around while all the rest of the folks in town are frozen in place. But when we get to a person we can talk to them. Say we ask each person: what was your income in 2011, what was your age on January 1, 2012, how much do you like Coca Cola (and the person has to answer either “hate it,” “it’s okay,” or “I love it”), and what continent would you like to live on in 2013 (and you have to say 1 of the 9 ;).
Data As we talk to each person we could record their responses. We would probably want to be organized, so let’s use the following (note each row represents measurements on a case and each column is a variable):
Person     Income    Age   Coke        Continent
Person 1   23,750    22    Love it     North America
Person 2   72,800    54    Hate it     Asia
Person 3   35,432    36    Hate it     North America
Person 4   10,000    29    It’s okay   Europe
We would have data on more people when we are done.
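The four recorded responses can also be written as a small data set in Python. This is just one possible layout, a list of dictionaries, and the key names are my own choice rather than anything from the slides.

```python
# The four people recorded above; each dict is one case (a row),
# and each key is one variable (a column).
people = [
    {"person": "Person 1", "income": 23750, "age": 22,
     "coke": "Love it", "continent": "North America"},
    {"person": "Person 2", "income": 72800, "age": 54,
     "coke": "Hate it", "continent": "Asia"},
    {"person": "Person 3", "income": 35432, "age": 36,
     "coke": "Hate it", "continent": "North America"},
    {"person": "Person 4", "income": 10000, "age": 29,
     "coke": "It's okay", "continent": "Europe"},
]

# Reading down one "column": the values of one variable across all cases.
incomes = [p["income"] for p in people]

# Arithmetic such as averaging makes sense for a quantitative variable.
mean_income = sum(incomes) / len(incomes)
print(mean_income)
```

Notice that averaging works for income, while the same operation on the Coke or Continent values would make no sense.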
Cases
• Any data set provides information about some group of individual cases.
• In my example, the folks in Wayne at the specified time are the cases.
• In other studies the cases can be people, states, organizations, objects, and many other things.
What is a variable? Each case in a data set may have 1 or more characteristics of interest. Each characteristic would be called a variable. For any variable in a study each case has to be assigned a value. So each case has a “measurement” taken and the value is assigned. For the most part, in our class the measurements have already taken place. We tend to look at variables on subjects or cases in which we are interested. Each case has a value on each variable.
Qualitative or Categorical variable The variable I labeled Coke, which is really about how much each person likes Coke, is an example of a qualitative or categorical variable. The data, or observed values, from the people on the variable just yield a categorical response. Note that sometimes in a data set numbers may be used to express the values on the variable, but all we really have are categories of responses. For example, we could have 1 = hate it, 2 = it’s okay, 3 = love it, and in the data set all you would see are the numbers. But the numbers really just represent different categories.
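Here is a small sketch of what number-coded categories look like in practice. The coding 1, 2, 3 is the one from the slide, the coded responses are those of Persons 1 through 4 from the data table, and the variable names are my own.

```python
from collections import Counter

# Codes from the slide: the numbers are only labels for categories.
codes = {1: "hate it", 2: "it's okay", 3: "love it"}

# Coded Coke responses for Persons 1-4 in the small data table.
coded_responses = [3, 1, 1, 2]

# Translate the codes back to category names before summarizing.
labels = [codes[code] for code in coded_responses]

# Counting cases per category is meaningful; averaging the codes is not.
counts = Counter(labels)
print(counts)
```

Even though the stored values are numbers, the sensible summary is a count per category, not an average.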
Quantitative or Numerical variable In our example the variable Income is an example of a quantitative or numerical variable. The data, or observed values, from the people on the variable yield a numerical response. What type of variable is Age and what type is Continent? Hint: 1 is qualitative and 1 is quantitative.
Population Often in statistics we are interested in a group. The group may be large, or even huge! Plus we want to be able to make statements or draw conclusions about the group. A population is the set of all cases we want to study or know something about. So, the population is the main group we want to know about or draw conclusions about. A census is conducted if we have measurements on all the cases in a population. Remember, a case is a single entity of the population.
Sample Many times in a study all the cases of the population will not be observed, so a sample is said to have been taken. A sample is a subset of the cases of a population – just part of the population.
Descriptive Statistics Describing data is a big part of statistics. A fair amount of time will be spent in this class describing data by calculating measures such as the mean and the standard deviation and we might use tables and graphs to assist in learning about the elements of the study. Descriptive statistics is the science of describing the important aspects of a set of measurements.
Inferential Statistics Inferential Statistics is a method used when only a sample from a population has been drawn, but we want to make statements about the important aspects of the larger population. Any cooks reading this? In order to tell if a pot of soup is ready to go, is taking a sample okay? Sure it is, but first make sure you have stirred the soup to mix in the ingredients. In statistics, we feel pretty good about samples as long as we have “mixed” things well.
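One common way to “stir the soup” is to draw a simple random sample, where every case has the same chance of being picked. The sketch below assumes that reading of the analogy; the population of 1,000 labeled cases and the sample size of 25 are made up for illustration.

```python
import random

# Hypothetical population: 1,000 case labels (made up for illustration).
population = [f"case_{i}" for i in range(1000)]

# A fixed seed makes the example repeatable from run to run.
rng = random.Random(0)

# Simple random sample without replacement: a subset of the population.
sample = rng.sample(population, k=25)

print(len(sample), "cases sampled out of", len(population))
```

Because the draw is without replacement, no case shows up in the sample twice.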
Examples Say we want to study faculty salaries at WSC. Our research topic is faculty salaries. The population is WSC faculty. Cases are individual faculty. Parker is a case of the population, as are Lutt, Paxton, Nelson, and others. Another example might be we want to study the budgets of state governments. The population is all 50 states. The cases are the states. What are the cases? (Did you say something like Ohio, Nebraska, Iowa….?) Our interest may be people, companies, states, etc…
Describing Categorical Data Here we study ways of describing a variable that is categorical or qualitative.
Say 50 people you know purchased a soft drink from a machine recently. A variable of interest might be the BRAND PURCHASED. Say the brands are made up of the 5 soft drinks Coke Classic, Diet Coke, Dr. Pepper, Pepsi-Cola, and Sprite (of course there are more varieties of soft drinks, but this is an illustrative example). Here each specific brand represents a different value on the variable brand purchased. Each specific brand represents a nonoverlapping class – each specific class represents a mutually exclusive category. Here the variable brand purchased is a categorical or qualitative variable - values of the variable represent categories. One thing that makes sense to do is ask each of the 50 people what they purchased. Then we could count the number of people who purchased Coke Classic and each of the others. The total number of people of the 50 who purchased Coke Classic would be the frequency for Coke Classic.
The first two columns on the previous screen, the Soft Drink and Frequency columns, make up what is called a frequency distribution. It is a tabular summary of data showing the number, or frequency, of items in each of several nonoverlapping classes. The third column shows the relative frequency. We need the second column to create the third. To get the relative frequency in each row take the frequency in that row and divide by the total frequency. The fourth column shows the percent frequency. The fourth column equals the third column multiplied by 100.
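The three columns can be computed with a short Python sketch. Only Coke Classic's count of 19 is tied to these slides (it matches the relative frequency of .38 used in the pie chart discussion); the other four counts are hypothetical numbers I picked so the total is 50.

```python
# Frequencies for the 50 purchases. Coke Classic's 19 matches the .38
# relative frequency in these slides; the other counts are hypothetical.
frequency = {
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi-Cola": 13,
    "Sprite": 5,
}

# Total frequency: the number of observations, n.
total = sum(frequency.values())

# Relative frequency: each frequency divided by the total frequency.
relative = {brand: count / total for brand, count in frequency.items()}

# Percent frequency: the relative frequency multiplied by 100.
percent = {brand: rel * 100 for brand, rel in relative.items()}

for brand in frequency:
    print(f"{brand:<14}{frequency[brand]:>4}{relative[brand]:>7.2f}{percent[brand]:>6.0f}")
```

Each printed row is one class, with its frequency, relative frequency, and percent frequency side by side.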
Do you know why we put information in columns? Because then we can call’um as we see’um. Sorry:) So, the frequency, relative frequency and percent frequency distributions are different ways of summarizing information about a categorical variable. Notes about our table. 1) The total, or sum, of the frequency column is equal to the number of observations, sometimes called n, in general. 2) The total, or sum, of the relative frequency column is equal to 1. 3) The total, or sum, of the percent frequency column is equal to 100 (sometimes it may be a little off due to rounding of decimal places).
In our example here we had 50 people and we asked what soft drink they purchased. Studies occur that have thousands of people and they are asked several questions. Using a computer can help in the counting of responses.
Bar Graphs Bar graphs just put the frequency, relative frequency and percent frequency distributions into visual form. The form is a graph with certain properties. The horizontal axis does not have numbers on it and the axis represents the categories. In our soft drink example we would put each brand in a different location on the axis.
Imagine you have a piece of construction paper that is red. Do you remember way back when in school you would cut strips of paper and then curl the paper with the scissors? Well, we will not need to curl the paper here! I mention this silly example because I want you to think about cutting strips that are one inch wide. The height of each strip would then represent the frequency, relative frequency or percent frequency on the variable. You would tape each strip onto the graph above each category. (You could also put the bars sideways.) So the vertical axis, or height, in the bar graph is either the frequency, relative frequency or percent frequency. In constructing the bar graph on a qualitative variable a space is left between each bar to help us remember we have a qualitative variable.
This is an example of what a percent frequency graph would look like. The variable is “what is the type of area in which you live” and the height of each bar is the percent frequency. (See how each bar is like a cut out from a piece of paper?)
Pie Charts Say we order a pizza pie and it is cut up into pieces. Below I show a pizza pie cut into slices, with every cut passing through the middle. If you get a quarter of the pie, you get one of the sections shown. 0.25 of the pie is an example of the relative frequency. So, pie charts show each category getting its relative share of the pie. A pie chart could really be the frequency, relative frequency or percent frequency pie, but the size of each piece of the pie is always the relative frequency.
Remember that a circle has 360 degrees. A way to think about this is if you go from “12 o’clock” on the pie to “3 o’clock” you have gone 90 degrees. A way to construct a pie chart is that each category will take up its respective relative frequency times the 360 degrees. From the earlier example Coke Classic had a relative frequency of .38 and will thus take .38(360) = 136.8 degrees. The same idea works with a strip of paper: you could take an 8.5 by 11 sheet of paper and cut out a one-inch strip 11 inches long. Then cut this strip into the same number of pieces as the number of categories, where the length of each piece is the relative frequency of the group times 11. For Coke you would have a piece .38(11) = 4.18 inches long. The relative frequency is a very important descriptor for a qualitative variable and is the basis for bar and pie charts.
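The two calculations for Coke Classic are easy to check in Python. The variable names are mine; the numbers (.38, 360 degrees, 11 inches) come from the example above.

```python
relative_frequency = 0.38  # Coke Classic, from the soft drink example

# Share of the full circle (360 degrees) for the pie slice.
slice_degrees = relative_frequency * 360

# Share of an 11-inch paper strip for the same category.
strip_inches = relative_frequency * 11

print(f"{slice_degrees:.1f} degrees")
print(f"{strip_inches:.2f} inches")
```

Swapping in another category's relative frequency gives that category's slice and strip length the same way.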
Describing Numerical Data Here we study ways of describing a variable that is numerical.
Numerical variables have values that are real numbers. Remember that categorical variables may use numbers, but the variable really has values that represent groups. Example of a categorical variable: eye color 1 = blue, 2 = green, 3 = red (especially on Friday morning). Our initial method of describing a numerical variable will be basically the same as with a qualitative variable, with some modification in our understanding. Let’s consider the variable age. Consider the first 20 people you see today. Consider yourself if you look in the mirror, but just count yourself once. The age of these folks could be 1 day to 110 years in Nebraska, right?
Remember, a frequency distribution is a tabular summary of data showing the number, or frequency, of items in each of several nonoverlapping classes. With a variable like eye color (qualitative), we typically make each color a class. But with a variable like age (quantitative), if we make each age a class then we could have so many classes that the distribution is hard to interpret. The authors suggest grouping the ages into classes and having anywhere from 5 to 15 classes. Let’s digress for a minute and think about a data set. Say I have data on people. Say I have social security number, eye color, age and blood alcohol level last Thursday night at 11:30. On the next screen I have what the data might look like in Excel, or other computer programs. Note each column is a variable. Each row represents a person in this example. Thus in each row we see the values of the variables for each person.
The reason for my digression was to have you begin to think about data sets. (Typically) A variable is in a column. The values down the column are for different people (or whatever the subject might be – the cases). I believe it is useful to think about data as you consider statistical ideas. Here we are looking at how to describe a column of data, one variable. Now, when we have a numerical variable like age we have to think about how many classes to have. We want each class to have more than a few people in it. For now, let’s not worry too much about how many classes to have. The “width” of each class should be equal. Using age as an example, we might have classes that have 5 consecutive ages included. The first class might be 10-14 year olds, then 15-19 year olds and so on.
Class “limits” need to be considered. Each person should be in only one class. Each class has a lower limit and an upper limit and these limits are exclusive to the class. On the next screen I have an example of the frequency, relative frequency and percent frequency distributions for the variable age for 50 people. The frequency column is just the counting of the number of people in each class. The relative frequency is the frequency of each class divided by the total number of people in the data set. The percent frequency is the relative frequency times 100. (Look back at the distributions we had for the qualitative variable. Does it look the same?)
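Grouping a numerical variable into equal-width classes can be sketched as follows. The 20 ages are made up, the class width of 10 is my choice for the illustration, and `class_label` is a hypothetical helper, not something from the slides.

```python
from collections import Counter

# Hypothetical ages for 20 people (made-up values for illustration).
ages = [12, 15, 19, 21, 23, 24, 27, 30, 31, 33,
        34, 35, 36, 38, 39, 41, 44, 47, 52, 58]

CLASS_WIDTH = 10  # every class has the same width


def class_label(age, width=CLASS_WIDTH):
    """Return the class an age falls in, e.g. 34 -> '30-39'."""
    lower = (age // width) * width   # lower class limit
    upper = lower + width - 1        # upper class limit
    return f"{lower}-{upper}"


# Each person lands in exactly one class, so the classes do not overlap.
frequency = Counter(class_label(age) for age in ages)
n = len(ages)

# Print classes from youngest to oldest with all three distributions.
for label in sorted(frequency, key=lambda s: int(s.split("-")[0])):
    count = frequency[label]
    print(f"{label:<7}{count:>3}{count / n:>7.2f}{100 * count / n:>6.0f}")
```

Because the limits are whole ages, nobody can fall on the boundary between two classes.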
Do you know why we put information in columns? Because then we can call’um as we see’um. Sorry:) So, the frequency, relative frequency and percent frequency distributions are different ways of summarizing information about a numerical variable. Notes about our table. 1) The total, or sum, of the frequency column is equal to the number of observations, n. 2) The total, or sum, of the relative frequency column is equal to 1. 3) The total, or sum, of the percent frequency column is equal to 100.
Bar graphs are used for qualitative variables. The graphs that do the same job for quantitative variables are called histograms. Histograms just put the frequency, relative frequency and percent frequency distributions into visual form. The form is a graph with certain properties. The variable of interest is put along the horizontal axis. We would have the variable age on the axis.
Imagine you have a piece of construction paper that is blue. Do you remember way back when in school you would cut strips of paper and then curl the paper with the scissors? Well, we will not need to curl the paper here! I mention this silly example because I want you to think about cutting strips that are all of the same width and are as wide as the class width (remember class widths are equal). The height of each strip would then represent the frequency, relative frequency or percent frequency on the variable. You would tape each strip onto the graph above each class. So the vertical axis, or height, in the histogram is either the frequency, relative frequency or percent frequency. In constructing the histogram on a quantitative variable THERE IS NO SPACE between each bar to help us remember we have a quantitative variable.
Pie Charts The authors do not mention it, but pie charts could be made in a similar fashion to what we saw before.
Cumulative Distributions Have you ever accumulated a bunch of junk in your room? Yeah, me too. Each day more stuff just shows up. So tomorrow I will have all the stuff I have today and more. Cumulative distributions are kind of like my story. When you look at the frequency distributions we just saw, a slight modification can turn them into cumulative distributions. For the cumulative frequency, start with the first class in the first row. The cumulative value for this row is the frequency.
But the cumulative value for the second row is the frequency for the first row plus the frequency for the second row. So to get the cumulative frequency for a given row, add up the frequencies for that row and all previous rows. The cumulative relative frequency and cumulative percent frequency are found as before: cumulative relative frequency is cumulative frequency divided by total and the cumulative percent frequency is the cumulative relative frequency times 100. What’s a henway? About 4 or 5 pounds! What’s an Ogive? It is what we call a graph of a cumulative frequency distribution. The horizontal axis has values of the variable and the vertical axis has the appropriate cumulative frequency.
What is the most frequently occurring age group in this example? How many times does it occur? (the group is 30-39 and the frequency 17)
This is a frequency Ogive (or polygon). Note here that what was accumulated was just the frequency. The highest frequency is 50 because that was the total number of folks in the study. What would be the highest value if we had a cumulative relative frequency? (1, right?)
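The accumulation can be sketched with `itertools.accumulate`. The frequency of 17 for the 30-39 class and the total of 50 come from the age example in these slides; the class labels and the other four frequencies are hypothetical, chosen so the total works out to 50.

```python
from itertools import accumulate

# Class frequencies for 50 people. Only the 17 for ages 30-39 and the total
# of 50 come from the slides; the other classes and counts are hypothetical.
classes = ["10-19", "20-29", "30-39", "40-49", "50-59"]
frequency = [5, 12, 17, 10, 6]

total = sum(frequency)

# Cumulative frequency: this row's frequency plus all previous rows.
cumulative = list(accumulate(frequency))

# Cumulative relative frequency: cumulative frequency divided by the total.
cumulative_relative = [c / total for c in cumulative]

# Cumulative percent frequency: cumulative relative frequency times 100.
cumulative_percent = [100 * r for r in cumulative_relative]

for label, c, r in zip(classes, cumulative, cumulative_relative):
    print(f"{label:<7}{c:>4}{r:>7.2f}")
```

Plotting the class labels against `cumulative` would give the points of a frequency ogive, with the last point at the total of 50.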
Summary With both categorical and numerical data one way to summarize the data is to look at frequency information of groups. The categorical data are already in natural groups and the numerical data have to be grouped into classes. Then we might look at the frequency, relative frequency, or the percent frequency of each group. The relative frequency of a group = group frequency divided by the total frequency across all groups. The percent frequency is the relative frequency times 100. Bar charts, pie charts and histograms are based on these ideas.