Graphical Methods for Describing Data Distributions

Chapter 2 Graphical Methods for Describing Data Distributions Created by Kathy Fritz

Variable: any characteristic whose value may change from one individual to another. Examples: political affiliation, number of textbooks purchased, distance from home to college.

Data: the values for a variable from individual observations. • Political affiliation: Democrat, Republican, etc. • Number of textbooks purchased: 1, 2, 3, 4, . . . • Distance from home to college: 25 miles, 53.5 miles, 347.2 miles, etc.

Suppose that a PE coach records the height of each student in his class. This is an example of univariate data. Univariate: data that consist of observations on a single variable made on individuals in a sample or population.

Suppose that the PE coach records the height and weight of each student in his class. This is an example of bivariate data. Bivariate: data that consist of pairs of numbers from two variables for each individual in a sample or population.

Suppose that the PE coach records the height, weight, number of sit-ups, and number of push-ups for each student in his class. This is an example of multivariate data. Multivariate: data that consist of observations on two or more variables.

Two types of variables categorical numerical

Categorical variables • Qualitative • Consist of categorical responses • Car model • Birth year • Type of cell phone • Your zip code • Which club you have joined Which of these variables are NOT categorical variables? They are all categorical variables!

Numerical variables • Quantitative • Observations or measurements take on numerical values • It makes sense to perform math operations on these values There are two types of numerical variables: discrete and continuous. Which of these variables are NOT numerical? • GPAs • Height of students • Codes to combination locks • Number of text messages per day • Weight of textbooks Does it make sense to find an average code to combination locks?

Two types of variables categorical numerical discrete continuous

Discrete (numerical) • Isolated points along a number line • Usually counts of items • Example: number of textbooks purchased

Continuous (numerical) • Variable that can be any value in a given interval • usually measurements of something • Example: GPAs

Identify the following variables: • the color of cars in the teacher’s lot: Categorical • the number of calculators owned by students at your college: Discrete numerical • the zip code of an individual: Categorical • the amount of time it takes students to drive to school: Continuous numerical • the appraised value of homes in your city: Discrete numerical (Is money a measurement or a count?)

Use the following table to determine an appropriate graphical display for a data set. What types of graphs can be used with categorical data? In section 2.3, we will see how the various graphical displays for univariate, numerical data compare.

Displaying Categorical Data Bar Charts Comparative Bar Charts

Bar Chart When to Use: Univariate, categorical data A bar chart is a graphical display for categorical data. To comply with new standards from the U.S. Department of Transportation, helmets should reach the bottom of the motorcyclist’s ears. The report “Motorcycle Helmet Use in 2005 – Overall Results” (National Highway Traffic Safety Administration, August 2005) summarized data collected by observing 1700 motorcyclists nationwide at selected roadway locations. Each time a motorcyclist passed by, the observer noted whether the rider was wearing no helmet (N), a noncompliant helmet (NC), or a compliant helmet (C). The data are summarized in a table called a frequency distribution: a table that displays the possible categories along with the associated frequencies or relative frequencies. The frequency for a particular category is the number of times that category appears in the data set. The frequencies should sum to the total number of observations.

Bar Chart (motorcycle helmet data, continued) The relative frequencies in the frequency distribution should sum to 1 (allowing for rounding).
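As a rough illustration of the two checks above, here is a minimal Python sketch that tallies a frequency distribution from raw observations. The codes N, NC, and C come from the slide; the ten observations are made up for illustration, since the report's table is not reproduced in this transcript.

```python
from collections import Counter

# Hypothetical observations using the slide's codes:
# N = no helmet, NC = noncompliant helmet, C = compliant helmet.
observations = ["C", "N", "C", "NC", "N", "C", "C", "N", "C", "NC"]

def frequency_distribution(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.items()}

table = frequency_distribution(observations)
for category, (freq, rel_freq) in table.items():
    print(f"{category:>2}  {freq:>3}  {rel_freq:.3f}")

# Frequencies should sum to the number of observations;
# relative frequencies should sum to 1 (allowing for rounding).
total_freq = sum(freq for freq, _ in table.values())
total_rel = sum(rel for _, rel in table.values())
```

Dividing each count by the same total is what guarantees that the relative frequencies add up to 1.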

Bar Chart How to construct • Draw a horizontal line; write the categories or labels below the line at regularly spaced intervals • Draw a vertical line; label the scale using frequency or relative frequency • Place a rectangular bar above each category label with a height determined by its frequency or relative frequency All bars should have the same width so that both the height and the area of the bar are proportional to the frequency or relative frequency of the corresponding category.
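The construction steps can be sketched in plain text. The example below is only an approximation of a bar chart: it draws the bars horizontally instead of vertically, and the relative frequencies are hypothetical, not the values from the helmet report. Every bar has the same thickness (one line), so its length is proportional to the relative frequency.

```python
def text_bar_chart(rel_freqs, width=40):
    """Build one line per category; bar length is proportional to relative frequency."""
    lines = []
    for category, rel in rel_freqs.items():
        bar = "#" * round(rel * width)
        lines.append(f"{category:<4}|{bar} {rel:.2f}")
    return lines

# Hypothetical relative frequencies (not taken from the report).
example = {"N": 0.30, "NC": 0.20, "C": 0.50}
chart = text_bar_chart(example)
print("\n".join(chart))
```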

Bar Chart What to Look For: Frequently or infrequently occurring categories. Here is the completed bar chart for the motorcycle helmet data. Describe this graph.

Comparative Bar Charts Bar charts can also be used to provide a visual comparison of two or more groups. When to Use: Univariate, categorical data for two or more groups How to construct • Constructed by using the same horizontal and vertical axes for the bar charts of two or more groups • Usually color-coded to indicate which bars correspond to each group • Should use relative frequencies on the vertical axis Why? You use relative frequency rather than frequency on the vertical axis so that you can make meaningful comparisons even if the sample sizes are not the same.

Each year the Princeton Review conducts a survey of students applying to college and of parents of college applicants. In 2009, 12,715 high school students responded to the question “Ideally how far from home would you like the college you attend to be?” Also, 3007 parents of students applying to college responded to the question “how far from home would you like the college your child attends to be?” Data is displayed in the frequency table below. What should you do first? Create a comparative bar chart with these data.

The relative frequencies for students are found by dividing each frequency by the total number of students (12,715); those for parents are found by dividing each frequency by the total number of parents (3007). What does this graph show about the ideal distance college should be from home?
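To see why relative frequency matters when the groups differ in size, here is a small Python sketch. The category names and counts are hypothetical (they are not the Princeton Review figures); only the idea of dividing by each group's own total comes from the slide.

```python
def relative_frequencies(counts):
    """Divide each frequency by the group's own total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Hypothetical counts for two groups of very different sizes.
students = {"near": 400, "far": 600}   # 1000 responses
parents = {"near": 120, "far": 80}     # 200 responses

student_rel = relative_frequencies(students)
parent_rel = relative_frequencies(parents)

# Raw counts suggest "near" is more popular with students (400 vs 120),
# but the proportions show the opposite: 0.40 of students vs 0.60 of parents.
for category in students:
    print(f"{category:<5} students={student_rel[category]:.2f} "
          f"parents={parent_rel[category]:.2f}")
```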

Displaying Numerical Data Dotplots Stem-and-leaf Displays Histograms

Dotplot When to Use: Univariate, numerical data How to construct • Draw a horizontal line and mark it with an appropriate numerical scale • Locate each value in the data set along the scale and represent it by a dot. If there are two or more observations with the same value, stack the dots vertically

Dotplot What to Look For • A representative or typical value (center) in the data set • The extent to which the data values spread out • The nature of the distribution (shape) along the number line • The presence of unusual values (gaps and outliers) An outlier is an unusually large or small data value. A precise rule for deciding when an observation is an outlier is given in Chapter 3. What we look for in univariate, numerical data sets is similar for dotplots, stem-and-leaf displays, and histograms.

Professor Norm gave a 10-question quiz last week in his introductory statistics class. The number of correct answers for each student is recorded below. First draw a horizontal line with an appropriate scale. The first three observations are plotted; note that you stack the points if values are repeated. This is the completed dotplot. Write a few sentences describing this distribution.

What to Look For • The representative or typical value (center) in the data set • The extent to which the data values spread out • The nature of the distribution (shape) along the number line • The presence of unusual values Professor Norm’s quiz continued: The center for the distribution of the number of correct answers is about 6. There is not a lot of variability in the observations. The distribution is approximately symmetrical with no unusual observations. A symmetrical distribution is one that has a vertical line of symmetry where the left half is a mirror image of the right half. If we draw a curve, smoothing out this dotplot, we will see that there is ONLY one peak. Distributions with a single peak are said to be unimodal. Distributions with two peaks are bimodal, and those with more than two peaks are multimodal.

Comparative Dotplots When to Use: Univariate, numerical data with observations from 2 or more groups How to construct • Constructed using the same numerical scale for two or more dotplots • Be sure to include group labels for the dotplots in the display What to Look For: Comment on the same four attributes, but comparing the dotplots displayed.

In another introductory statistics class, Professor Skew also gave a 10-question quiz. The number of correct answers for each student is recorded below. Is the distribution for Prof. Skew’s class symmetric? Why or why not? Notice that the left side (or lower tail) of the distribution is longer than the right side (or upper tail). This distribution is said to be negatively skewed (or skewed to the left). A distribution where the right tail is longer than the left is said to be positively skewed (or skewed to the right). The direction of skewness is always in the direction of the longer tail. Create a comparative dotplot with the data sets from the two statistics classes, Professor Norm’s and Professor Skew’s. Write a few sentences comparing these distributions. The center of the distribution of the number of correct answers in Prof. Skew’s class is larger than the center for Prof. Norm’s class. There is also more variability in Prof. Skew’s distribution. Prof. Skew’s distribution appears to have an unusual observation, where one student had only 2 answers correct, while there were no unusual observations in Prof. Norm’s class. The distribution for Prof. Skew is negatively skewed, while Prof. Norm’s distribution is more symmetrical.

Stem-and-Leaf Displays Stem-and-leaf displays are an effective way to summarize univariate numerical data when the data set is not too large. Each observation is split into two parts: Stem: consists of the first digit(s) Leaf: consists of the final digit(s) When to Use: Univariate, numerical data How to construct • Select one or more of the leading digits for the stem • List the possible stem values in a vertical column (be sure to list every stem from the smallest to the largest value) • Record the leaf for each observation beside the corresponding stem value • Indicate the units for stems and leaves someplace in the display
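The construction steps can be sketched in Python as follows. This is a minimal version that assumes whole-number data between 0 and 99, with the tens digit as the stem and the ones digit as the leaf; the data values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with the tens digit as stem and the ones digit as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, even those with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    # Indicate the units for stems and leaves in the display.
    lines.append("Stem: tens, Leaf: ones")
    return lines

# Hypothetical data (whole numbers between 0 and 99).
data = [12, 15, 21, 8, 34, 12, 19, 5, 11, 37]
display = stem_and_leaf(data)
print("\n".join(display))
```

Sorting the values first writes the leaves in increasing order, which is optional but makes the center easier to see.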

Stem-and-Leaf Displays What to Look For • A representative or typical value (center) in the data set • The extent to which the data values spread out • The presence of unusual values (gaps and outliers) • The extent of symmetry in the data distribution • The number and location of peaks

The article “Going Wireless” (AARP Bulletin, June 2009) reported the estimated percentage of households with only wireless phone service (no landline) for the 50 U.S. states and the District of Columbia. Data for the 19 Eastern states are given here. What is the variable of interest? Wireless percent. A stem-and-leaf display is an appropriate way to summarize these data. (A dotplot would also be a reasonable choice.) Let 5.6% be represented as 05.6% so that all the numbers have two digits in front of the decimal. If we used both of those digits, we would have stems from 05 to 20; that’s way too many stems! So let’s just use the first digit (tens) as our stems. The leaf will then be the last two digits. With 05.6%, the leaf is 5.6 and it will be written behind the stem 0. For the second number, 5.7 also is written behind the stem 0 (with a comma between). What is the leaf for 20.0% and where should that leaf be written? The completed stem-and-leaf display is shown below. However, it is somewhat difficult to read due to the 2-digit leaves. A common practice is to drop all but the first digit in the leaf. This makes the display easier to read, but DOES NOT change the overall distribution of the data set.

Going Wireless continued (19 Eastern states) While it is not necessary to write the leaves in order from smallest to largest, by doing so, the center of the distribution is more easily seen. Write a few sentences describing this distribution. The center of the distribution for the estimated percentage of households with only wireless phone service is approximately 11%. There does not appear to be much variability. This display appears to be a unimodal, symmetric distribution with no outliers.

Comparative Stem-and-Leaf Displays When to Use: Univariate, numerical data with observations from 2 or more groups How to construct • List the leaves for one data set to the right of the stems • List the leaves for the second data set to the left of the stems • Be sure to include group labels to identify which group is on the left and which is on the right

Going Wireless continued: Data for the 13 Western states are given here. Create a comparative stem-and-leaf display comparing the distributions of the Eastern and Western states. Write a few sentences comparing these distributions. The center of the distribution of the estimated percentage of households with only wireless phone service for the Western states is a little larger than the center for the Eastern states. Both distributions are symmetrical with approximately the same amount of variability.

Histograms Dotplots and stem-and-leaf displays are not effective ways to summarize numerical data when the data set contains a large number of data values. Histograms are displays that don’t work well for small data sets but do work well for larger numerical data sets. They are constructed differently for discrete versus continuous data. Discrete numerical data almost always result from counting. In such cases, each observation is a whole number. When to Use: Univariate numerical data How to construct (discrete data) • Draw a horizontal scale and mark it with the possible values for the variable • Draw a vertical scale and mark it with frequency or relative frequency • Above each possible value, draw a rectangle centered at that value with a height corresponding to its frequency or relative frequency What to look for: Center or typical value; spread; general shape and location and number of peaks; and gaps or outliers

Queen honey bees mate shortly after they become adults. During a mating flight, the queen usually takes multiple partners, collecting sperm that she will store and use throughout the rest of her life. A paper, “The Curious Promiscuity of Queen Honey Bees” (Annals of Zoology [2001]: 255-265), provided the following data on the number of partners for 30 queen bees. 12 2 4 6 6 7 8 7 8 11 8 3 5 6 7 10 1 9 7 6 9 7 5 4 7 4 6 7 8 10 Here is a dotplot of these data.
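Since the slide's dotplot image is not part of this transcript, here is a rough text version built from the 30 values listed above. It is only a sketch of the dotplot construction: one dot per observation, stacked vertically over a horizontal scale.

```python
from collections import Counter

# Number of partners for 30 queen bees (data from the slide).
partners = [12, 2, 4, 6, 6, 7, 8, 7, 8, 11, 8, 3, 5, 6, 7,
            10, 1, 9, 7, 6, 9, 7, 5, 4, 7, 4, 6, 7, 8, 10]

def text_dotplot(values):
    """Stack one dot per observation above each value on a horizontal scale."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)
    rows = []
    # Build rows from the tallest stack down to the first level of dots.
    for level in range(max(counts.values()), 0, -1):
        rows.append("".join(" . " if counts[v] >= level else "   " for v in scale))
    # The last row is the numerical scale.
    rows.append("".join(f"{v:^3}" for v in scale))
    return rows

dotplot = text_dotplot(partners)
print("\n".join(dotplot))
```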

Queen honey bees continued The variable, number of partners, is discrete. To create a histogram, we already have a horizontal axis (number of partners); we need to add a vertical axis for frequency. The bars should be centered over the discrete data values and have heights corresponding to the frequency of each data value. We built the histogram on top of the dotplot to show that the bars are centered over the discrete data values and that the heights of the bars are the frequencies of the data values. In practice, histograms for discrete data ONLY show the rectangular bars. The distribution of the number of partners of queen honey bees is approximately symmetric with a center at 7 partners and a somewhat large amount of variability. There don’t appear to be any outliers.

Here are two histograms showing the “queen bee data set”. One uses frequency on the vertical axis, while the other uses relative frequency. What do you notice about the shapes of these two histograms?
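One way to explore the question is to compute both sets of bar heights from the queen bee data. The sketch below tallies the frequency and relative frequency for every possible value; each relative frequency is the frequency divided by 30, so the two sets of heights differ only by a constant factor.

```python
from collections import Counter

# Number of partners for 30 queen bees (data from the slide).
partners = [12, 2, 4, 6, 6, 7, 8, 7, 8, 11, 8, 3, 5, 6, 7,
            10, 1, 9, 7, 6, 9, 7, 5, 4, 7, 4, 6, 7, 8, 10]

counts = Counter(partners)
n = len(partners)

# One bar per possible value, centered on that value.
histogram = {v: counts[v] for v in range(min(partners), max(partners) + 1)}
relative = {v: f / n for v, f in histogram.items()}

for v in histogram:
    print(f"{v:>2} {'#' * histogram[v]:<8} freq={histogram[v]} rel={relative[v]:.3f}")
```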

Histograms with equal width intervals When to Use: Univariate numerical data How to construct (continuous data) • Mark the boundaries of the class intervals on the horizontal axis • Use either frequency or relative frequency on the vertical axis • Draw a rectangle for each class interval directly above that interval. The height of each rectangle is the frequency or relative frequency of the corresponding interval What to look for: Center or typical value; spread; general shape and location and number of peaks; and gaps or outliers

Consider the following data on carry-on luggage weight for 25 airline passengers. This is a continuous numerical data set. Here is a dotplot of this data set. Looking at this dotplot, it is easy to see that we could use intervals with a width of 5. The interval 10 to < 15 includes 10 and all values up to but not including 15. The next intervals will include 15 and all values up to but not including 20, and so on. The top dotplot shows all the data values in each interval stacked in the middle of the interval. With continuous data, the rectangular bars cover an interval of data values (not just one value).
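The interval rule (include the lower endpoint, exclude the upper endpoint) can be sketched in Python as below. The weights are hypothetical, because the 25 values from the slide are not in this transcript; the starting point of 10 and width of 5 follow the slide.

```python
import math

def bin_counts(values, start, width):
    """Count values in intervals [start, start + width), [start + width, ...), etc."""
    counts = {}
    for v in values:
        k = math.floor((v - start) / width)   # index of the interval containing v
        lower = start + k * width             # lower endpoint of that interval
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical carry-on weights (not the data from the slide).
weights = [10.0, 12.3, 14.9, 15.0, 17.2, 19.9, 20.0, 23.5, 26.1]
counts = bin_counts(weights, start=10, width=5)
for lower, freq in counts.items():
    print(f"{lower} to < {lower + 5}: {freq}")
```

Note that 15.0 is counted in the interval 15 to < 20, not in 10 to < 15.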

From the dotplot, it is easy to see how the continuous histogram is created.

Comparative Histograms • Must use two separate histograms with the same horizontal axis and relative frequency on the vertical axis The article “Early Television Exposure and Subsequent Attention Problems in Children” (Pediatrics, April 2004) investigated the television viewing habits of U.S. children. These graphs show the viewing habits of 1-year-old and 3-year-old children. The biggest difference between the two histograms is at the low end, with a much higher proportion of 3-year-old children falling in the 0-2 TV hours interval than 1-year-old children.

Histograms with unequal width intervals When to use: when you have a concentration of data in the middle with some extreme values How to construct: construct as for histograms with continuous data, but with density on the vertical axis

When people are asked for values such as age or weight, they sometimes shade the truth in their responses. The article “Self-Report of Academic Performance” (Social Methods and Research [November 1981]: 165-185) focused on SAT scores and grade point average (GPA). For each student in the sample, the difference between reported GPA and actual GPA was determined. Positive differences resulted from individuals reporting GPAs larger than the correct value. When using relative frequency on the vertical axis with unequal interval widths, the proportional area principle is violated. Notice that the relative frequency for the interval 0.4 to < 2.0 is smaller than the relative frequency for the interval -0.1 to < 0, but the area of the bar is MUCH larger.

GPAs continued To fix this problem, we need to find the density of each interval. This is a correct histogram with unequal widths.
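The slide does not spell out the calculation; the usual definition is density = relative frequency / interval width, which makes the area of each bar equal to its relative frequency. A minimal sketch under that assumption follows. The interval endpoints come from the previous slide, but the relative frequencies are hypothetical placeholders, not the values from the article.

```python
def density(rel_freq, lower, upper):
    """Density = relative frequency / interval width."""
    return rel_freq / (upper - lower)

# Interval endpoints are from the slide; relative frequencies are hypothetical.
narrow = density(0.20, -0.1, 0.0)   # interval -0.1 to < 0, width 0.1
wide = density(0.08, 0.4, 2.0)      # interval 0.4 to < 2.0, width 1.6

# Area of each bar (density * width) recovers the relative frequency,
# so the wide interval no longer looks larger than it should.
narrow_area = narrow * 0.1
wide_area = wide * 1.6
print(f"narrow: density={narrow:.3f}, area={narrow_area:.3f}")
print(f"wide:   density={wide:.3f}, area={wide_area:.3f}")
```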

Cumulative Relative Frequency Plots When to use: when you want to show the approximate proportion of data at or below any given value How to construct • Mark the boundaries of the class intervals on a horizontal axis • Add a vertical axis with a scale that goes from 0 to 1 • For each class interval, plot the point represented by (upper endpoint of interval, cumulative relative frequency) • Add the point represented by (lower endpoint of first interval, 0) • Connect consecutive points in the display with line segments

Cumulative Relative Frequency Plots What to Look For Proportion of data falling at or below any given value along the x axis The cumulative relative frequency of a given interval is the sum of the current relative frequency and all the previous relative frequencies.

The National Climatic Data Center has been collecting weather data for many years. A frequency distribution for annual rainfall totals for Albuquerque, New Mexico, from 1950 to 2008 is shown in the table below. Relative frequency = frequency/58 Cumulative relative frequency = current relative frequency + all previous relative frequencies Cumulative relative frequencies: 0.052 0.155 0.241 0.344 0.516 0.585 0.792 0.895 0.947 0.999

Albuquerque rainfall continued To create the cumulative relative frequency plot: • Plot the point (smallest value of the first interval, 0) • For each interval, plot the point (upper value of the interval, cumulative relative frequency of the interval) Cumulative relative frequencies: 0.052 0.155 0.241 0.344 0.516 0.585 0.792 0.895 0.947 0.999
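The running-sum calculation and the plot points can be sketched in Python. Two assumptions are made here: the relative frequencies are recovered by differencing the cumulative values listed on the slide (the original table is not in this transcript), and the interval endpoints are placeholders chosen only to show the shape of the result.

```python
from itertools import accumulate

# Relative frequencies recovered by differencing the slide's cumulative values
# (an assumption; the original frequency table is not reproduced here).
relative = [0.052, 0.103, 0.086, 0.103, 0.172, 0.069, 0.207, 0.103, 0.052, 0.052]

# Each cumulative value is the current relative frequency plus all previous ones.
cumulative = [round(c, 3) for c in accumulate(relative)]
print(cumulative)

def plot_points(first_lower, upper_endpoints, cumulative_values):
    """Points to plot: (first lower endpoint, 0), then (upper endpoint, cumulative)."""
    return [(first_lower, 0.0)] + list(zip(upper_endpoints, cumulative_values))

# Placeholder endpoints: ten intervals of width 1 starting at 4 (hypothetical).
uppers = list(range(5, 15))
points = plot_points(4, uppers, cumulative)
```

The last cumulative value is 0.999 instead of 1 because the relative frequencies were rounded to three decimal places.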
