II. Graphical Displays of Data
Like many other things, statistical analysis can suffer from “garbage in, garbage out.” This often happens because no one bothered to look at the data. Simple data displays can convey a lot of information.
A. Stem-and-Leaf Displays
Purpose: To provide a basis for evaluating the “shape” of the data without the loss of any information.
1. Basic Stem-and-Leaf Display
This technique is best illustrated by an example. Pencil lead is actually a ceramic matrix filled with graphite. A measure of the quality of many ceramic bodies is the porosity, which measures the void space in the body. The following data set represents the result of a porosity test on “good” pencil lead.
Let the numbers to the left of the decimal point be the “stems”. Let the numbers to the right of the decimal point be the “leaves”.

Stem  Leaves
11:
12:
13:
14:

Representing the value 12.1:

Stem  Leaves
11:
12:   1
13:
14:
Representing the first row of data:

Stem  Leaves
11:   7
12:   1557
13:   5
14:

The entire data set:

Stem  Leaves
11.   7
12.   1557375836
13.   538552796633027
14.   3221

This display is the “raw” stem-and-leaf display.
Usually, we “refine” the stem-and-leaf display. First, we order the leaves on each stem.

Stem  Leaves
11.   7
12.   1335556778
13.   022333555667789
14.   1223
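The raw and ordered displays can be sketched in a few lines of Python. This is only an illustration: the values in `POROSITY` are read back from the raw display stem by stem (the order in the original data table is not known), and the function name `stem_and_leaf` is made up for this example.

```python
# Porosity values read back from the raw stem-and-leaf display, stem by stem.
# ASSUMPTION: the order inside the original data table is unknown; this order
# simply reproduces the raw leaves shown on the slide.
POROSITY = [
    11.7,
    12.1, 12.5, 12.5, 12.7, 12.3, 12.7, 12.5, 12.8, 12.3, 12.6,
    13.5, 13.3, 13.8, 13.5, 13.5, 13.2, 13.7, 13.9, 13.6, 13.6,
    13.3, 13.3, 13.0, 13.2, 13.7,
    14.3, 14.2, 14.2, 14.1,
]


def stem_and_leaf(values, ordered=True):
    """Return {stem: leaves}; the stem is the integer part, each leaf the tenths digit."""
    stems = {}
    for v in values:
        stem, leaf = divmod(round(v * 10), 10)  # e.g. 12.1 -> (12, 1)
        stems.setdefault(stem, []).append(leaf)
    return {
        stem: "".join(str(d) for d in (sorted(leaves) if ordered else leaves))
        for stem, leaves in sorted(stems.items())
    }


# Print the refined (ordered) display, one stem per line.
for stem, leaves in stem_and_leaf(POROSITY).items():
    print(f"{stem}. {leaves}")
```

Calling `stem_and_leaf(POROSITY, ordered=False)` keeps the leaves in input order, which corresponds to the raw display.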
Next, we add the depth information. The depth represents how far from the closest end of the data set a particular point is. For example, the data value 11.7 is the smallest observation; thus, it has a depth of 1. What is the depth of the data value 14.3? What is the depth of the data value 12.6? The completed stem-and-leaf gives the depth of the last value on the stems for the top part of the display. It gives the depth for the first value of the stems for the bottom part of the display.
We do not give the depth for the stem which contains the middle value of the data set. In this case, the depth information would be ambiguous. An aid for finding the depth is to report the number of leaves on each stem. Until we reach the middle stem, the depth for any stem is just the depth reported from the previous stem plus the number of leaves on the stem.

Stem  Leaves            No.  Depth
11.   7                  1     1
12.   1335556778        10    11
13.   022333555667789   15
14.   1223               4     4
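The depth rule above can be sketched as follows. The helper names `depth` and `stem_depths` are invented for this illustration, and leaving the depth blank (`None`) for an empty stem is an assumption taken from how the displays in these slides are drawn.

```python
def depth(position, n):
    """Depth of the value at a 1-based position in an ordered data set of size n."""
    return min(position, n + 1 - position)


def stem_depths(counts):
    """Depth column of a stem-and-leaf display, given the leaf count on each stem.

    Counts accumulate from the top down and from the bottom up. The stem that
    holds the middle of the data gets None (left blank), as do empty stems.
    """
    n = sum(counts)
    middle = (n + 1) / 2
    depths = []
    below = 0  # number of leaves on the stems already processed
    for c in counts:
        lo, hi = below + 1, below + c  # ordered positions covered by this stem
        if c == 0 or lo <= middle <= hi:
            depths.append(None)
        elif hi < middle:
            depths.append(hi)         # depth of the last value on the stem
        else:
            depths.append(n - below)  # depth of the first value on the stem
        below = hi
    return depths


print(depth(30, 30))                 # 14.3 is the largest of the 30 values
print(depth(8, 30))                  # 12.6 is the 8th smallest value
print(stem_depths([1, 10, 15, 4]))   # leaf counts for stems 11, 12, 13, 14
```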
2. Stretched Stem-and-Leaf
• Consider a “stretched” stem-and-leaf display.
• Basically, it splits each simple stem into two.
• Let X* hold the leaves 0 – 4 of stem X.
• Let X• hold the leaves 5 – 9 of stem X.

Stem  Leaves      No.  Depth
11*
11•   7            1     1
12*   133          3     4
12•   5556778      7    11
13*   022333       6
13•   555667789    9    13
14*   1223         4     4
14•

• Two other extensions of the basic stem-and-leaf display:
  • the squeezed stem-and-leaf display
  • the side-by-side or back-to-back stem-and-leaf display
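A minimal sketch of the stretched display, assuming one-decimal data such as the porosity values. The function name `stretched_stem_and_leaf` is hypothetical, and `POROSITY` lists the 30 values as read from the ordered display.

```python
# The 30 porosity values, read from the ordered stem-and-leaf display.
POROSITY = [
    11.7,
    12.1, 12.3, 12.3, 12.5, 12.5, 12.5, 12.6, 12.7, 12.7, 12.8,
    13.0, 13.2, 13.2, 13.3, 13.3, 13.3, 13.5, 13.5, 13.5, 13.6,
    13.6, 13.7, 13.7, 13.8, 13.9,
    14.1, 14.2, 14.2, 14.3,
]


def stretched_stem_and_leaf(values):
    """Split each stem X into X* (leaves 0-4) and X• (leaves 5-9)."""
    tenths = sorted(round(v * 10) for v in values)  # work in whole tenths
    display = {}
    for stem in range(tenths[0] // 10, tenths[-1] // 10 + 1):
        for mark, low, high in (("*", 0, 4), ("•", 5, 9)):
            leaves = [t % 10 for t in tenths
                      if t // 10 == stem and low <= t % 10 <= high]
            display[f"{stem}{mark}"] = "".join(str(d) for d in leaves)
    return display


for label, leaves in stretched_stem_and_leaf(POROSITY).items():
    print(f"{label:<4} {leaves:<10} {len(leaves) or ''}")
```

Empty half-stems such as 11* and 14• are kept so that the display preserves the scale.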
3. Reading a Data Display
• Goal of a data display: let the data speak to you.
• Like any conversation, some points are obvious; others come only from questioning the data.
• Some obvious questions:
  • What is the “center” of the data?
  • What is the “spread” of the data?
• More subtle questions:
  • Do the data follow some pattern?
  • Is the pattern symmetric?
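If the pattern is not symmetric, is it right-tailed or left-tailed?
• A right-tailed (right-skewed) pattern: the longer tail extends toward the larger values.
• A left-tailed (left-skewed) pattern: the longer tail extends toward the smaller values.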
Are there multiple peaks? • What do multiple peaks suggest? • Are there outliers?
B. Box Plots
• Purpose: To give a quick display of some important features of the data.
• Note: The box plot represents a distillation of the data.
  • The stem-and-leaf display only loses the time order of the data.
  • The box plot loses some of the information in the data.
  • However, under several very reasonable assumptions, the information lost is of little or no value.
1. Preliminaries
• The box plot is based upon:
  • the median
  • the quartiles
• To find these quantities, we first must order the data set.
Let $y_1, y_2, \cdots, y_n$ denote our data set. Rearrange the data in ascending order, and let the new data set be denoted by $y_{(1)}, y_{(2)}, \cdots, y_{(n)}$, where $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$. Note: the stem-and-leaf with ordered leaves is such an ordered data set.
a. The median
The median, $\tilde{y}$, is the middle value of the ordered data set and is a measure of the “center”. Literally, the median splits the data set into two equal parts.
Let $\ell_m = (n+1)/2$ denote the “location” of the median in the ordered data set. If n is odd, then $\ell_m$ is an integer; thus, $\tilde{y} = y_{(\ell_m)}$. If n is even, then $\ell_m$ contains the fraction 1/2. In such a case, the median is the average of the two values “closest” to the “center”: $\tilde{y} = \frac{1}{2}\left(y_{(n/2)} + y_{(n/2+1)}\right)$.
First Example: The following five values represent the ash content of pencil lead.
Second example: the porosities of good pencil lead
Note: the stem-and-leaf is an ordered data set.

Stem  Leaves      No.  Depth
11*
11•   7            1     1
12*   133          3     4
12•   5556778      7    11
13*   022333       6
13•   555667789    9    13
14*   1223         4     4
14•
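As a check on the second example, the median rule (location (n + 1)/2 in the ordered data, averaging the two middle values when n is even) can be sketched in Python. The function name `median` and the list `POROSITY`, read from the ordered display, are choices made for this illustration.

```python
# The 30 porosity values, read from the ordered stem-and-leaf display.
POROSITY = [
    11.7,
    12.1, 12.3, 12.3, 12.5, 12.5, 12.5, 12.6, 12.7, 12.7, 12.8,
    13.0, 13.2, 13.2, 13.3, 13.3, 13.3, 13.5, 13.5, 13.5, 13.6,
    13.6, 13.7, 13.7, 13.8, 13.9,
    14.1, 14.2, 14.2, 14.3,
]


def median(values):
    """Median via its location (n + 1) / 2 in the ordered data set."""
    y = sorted(values)
    n = len(y)
    loc = (n + 1) / 2  # 1-based location; ends in .5 when n is even
    if loc.is_integer():
        return y[int(loc) - 1]
    # n even: average the two ordered values on either side of the location
    return (y[int(loc) - 1] + y[int(loc)]) / 2


print(median(POROSITY))
```

Here n = 30, so the location is 15.5 and the median is the average of the 15th and 16th ordered values, both of which sit on the 13* stem.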
b. The upper and lower quartiles
While the median divides the data into two parts of equal size, the quartiles (Q1, Q3) divide the data into four parts. Note: the second quartile (Q2) is the median. Let $\ell_q$ be the location of the first and third quartiles.
If $\ell_q$ is an integer, then each quartile is the ordered data value at that location. If $\ell_q$ is not an integer, then the quartile is the average of the two values closest to it.
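The quartile location formula did not survive in this copy of the slides. The sketch below assumes one common rule, location = (⌊(n + 1)/2⌋ + 1)/2, counted from the low end for Q1 and from the high end for Q3. That rule is an assumption, not something these slides state, although it does reproduce the values Q1 = 12.6 and Q3 = 13.7 used later for the pencil lead data. The names `value_at` and `quartiles` are hypothetical.

```python
import math

# The 30 porosity values, read from the ordered stem-and-leaf display.
POROSITY = [
    11.7,
    12.1, 12.3, 12.3, 12.5, 12.5, 12.5, 12.6, 12.7, 12.7, 12.8,
    13.0, 13.2, 13.2, 13.3, 13.3, 13.3, 13.5, 13.5, 13.5, 13.6,
    13.6, 13.7, 13.7, 13.8, 13.9,
    14.1, 14.2, 14.2, 14.3,
]


def value_at(ordered, loc):
    """Value at a 1-based location; averages the two neighbours if loc ends in .5."""
    lower = ordered[math.floor(loc) - 1]
    upper = ordered[math.ceil(loc) - 1]
    return (lower + upper) / 2


def quartiles(values):
    """Return (Q1, Q3) using the ASSUMED location rule described above."""
    y = sorted(values)
    loc_m = (len(y) + 1) / 2               # location of the median
    loc_q = (math.floor(loc_m) + 1) / 2    # ASSUMED quartile location rule
    q1 = value_at(y, loc_q)                # count up from the smallest value
    q3 = value_at(y[::-1], loc_q)          # count down from the largest value
    return q1, q3


print(quartiles(POROSITY))
```

Other textbooks and software packages use different quartile conventions, so results can differ slightly from this sketch.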
2. The Box Plot Itself
• We shall illustrate this technique through the porosity data for the “good” pencil lead.
• Construct a horizontal scale, marked conveniently, which covers at least the range of the data.
• Find the median, Q1, and Q3.
Use Q1 and Q3 to make a rectangular box above the scale. Draw a vertical line across the box for the median.
Determine the “Step”
• The Interquartile Range is a measure of variability or spread defined by
  Q3 – Q1
• We define the step size by
  Step = (1.5)(Q3 – Q1)
• For the good pencil lead data,
  Q3 – Q1 = 13.7 – 12.6 = 1.1
  Step = 1.5(1.1) = 1.65
Determine the “inner fences”
• The fences help us isolate possible outliers.
• The inner fences define the bounds for the unquestionably good data.
• The Upper Inner Fence (UIF) is
  UIF = Q3 + Step
• The Lower Inner Fence (LIF) is
  LIF = Q1 – Step
• For the good pencil lead data,
  UIF = 13.7 + 1.65 = 15.35
  LIF = 12.6 – 1.65 = 10.95
5. Locate the most extreme data points which are on or within the inner fences. These data values are called the adjacents. Draw vertical lines at these points, and connect these points to the “box” with a horizontal line. This line is called a whisker. For the good pencil lead data, all of the values fall within the inner fences. Thus, the adjacents are: 11.7 and 14.3
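The step, inner fences, and adjacents for the pencil lead data can be checked with a short sketch. The function names `inner_fences` and `adjacents` are invented here; Q1 = 12.6 and Q3 = 13.7 are the values used in the slides, and `POROSITY` is read from the ordered stem-and-leaf display.

```python
# The 30 porosity values, read from the ordered stem-and-leaf display.
POROSITY = [
    11.7,
    12.1, 12.3, 12.3, 12.5, 12.5, 12.5, 12.6, 12.7, 12.7, 12.8,
    13.0, 13.2, 13.2, 13.3, 13.3, 13.3, 13.5, 13.5, 13.5, 13.6,
    13.6, 13.7, 13.7, 13.8, 13.9,
    14.1, 14.2, 14.2, 14.3,
]


def inner_fences(q1, q3):
    """Return (LIF, UIF) with Step = 1.5 * (Q3 - Q1)."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def adjacents(values, q1, q3):
    """Most extreme data values that lie on or within the inner fences."""
    lif, uif = inner_fences(q1, q3)
    inside = [v for v in values if lif <= v <= uif]
    return min(inside), max(inside)


lif, uif = inner_fences(12.6, 13.7)
print(round(lif, 2), round(uif, 2))     # fences, rounded for display
print(adjacents(POROSITY, 12.6, 13.7))  # whisker end points
```

The rounding is only for display, since floating-point arithmetic on values such as 13.7 and 12.6 is not exact.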
6. Calculate the “outer fences”
The outer fences allow us to discriminate between “mild” and “extreme” outliers. Data values between the inner and outer fences are considered mild. Data values beyond the outer fences are considered extreme.
The Upper Outer Fence (UOF) is
UOF = Q3 + 2(Step)
The Lower Outer Fence (LOF) is
LOF = Q1 – 2(Step)
For the good pencil lead data,
UOF = 13.7 + 2(1.65) = 17.0
LOF = 12.6 – 2(1.65) = 9.3
7. Mark possible “outliers”
We use a ◦ to denote the mild outliers. We use a • to denote the extreme outliers.
Note: No outliers occur in our example.
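The fence rules can be collected into one small classifier. This is a sketch under stated assumptions: the function name `classify` and its labels are made up, a value exactly on an outer fence is treated as mild (the slides do not say), and the values 16.0 and 18.0 in the usage lines are hypothetical, not part of the data.

```python
# The 30 porosity values, read from the ordered stem-and-leaf display.
POROSITY = [
    11.7,
    12.1, 12.3, 12.3, 12.5, 12.5, 12.5, 12.6, 12.7, 12.7, 12.8,
    13.0, 13.2, 13.2, 13.3, 13.3, 13.3, 13.5, 13.5, 13.5, 13.6,
    13.6, 13.7, 13.7, 13.8, 13.9,
    14.1, 14.2, 14.2, 14.3,
]


def classify(value, q1, q3):
    """Label a value as 'inside', 'mild' or 'extreme' using the fences."""
    step = 1.5 * (q3 - q1)
    if q1 - step <= value <= q3 + step:          # on or within the inner fences
        return "inside"
    if q1 - 2 * step <= value <= q3 + 2 * step:  # ASSUMPTION: on an outer fence counts as mild
        return "mild"
    return "extreme"


# Every observed porosity lies within the inner fences, so there are no outliers.
print({classify(v, 12.6, 13.7) for v in POROSITY})

# Hypothetical values, for illustration only.
print(classify(16.0, 12.6, 13.7))  # between UIF = 15.35 and UOF = 17.0
print(classify(18.0, 12.6, 13.7))  # beyond UOF = 17.0
```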
Parallel Box Plots allow us to compare two or more sets of data. The Key: must use a common scale. Place box plots above each other or side-by-side.

              ____________
  o   |-------|_____|______|--------|   o  o
              ____________
      |-------|_____|______|--------|   o
              ____________
      |-------|_____|______|--------|
  |---------------------------------------------------|
                        scale
Box Plots can also be used to analyze designed experiments. When there are categorical factors, the design can be “unstripped” and analyzed using parallel box plots. Example: Consider an experiment to study the influence of operating temperature and glass type on light output.
The resulting box plot is given below. The box plots show that a higher temperature yields higher light output. Also at the low temperature, glass type does not affect light output, but at the high temperature, glass type A produces higher light output.
Importance of Box Plots: Box plots allow us to tell at a glance: 1. center 2. spread 3. outliers
Other important data displays:
• histograms
• time plots
We generally use software to generate all data displays.
The instructor should do a class demonstration using the software selected by the instructor.