Learn about percentiles, quartiles, interquartile range, outlier detection, and how to create boxplots and quantile plots for effective data visualization and analysis.
Lecture 7
Sections 2.3 – 2.4
Objectives:
• More Detailed Summary Quantities
• Quartiles and IQR
• Boxplots
• Quantile Plots
More Detailed Summary Quantities
Percentiles
The median divides a data set into two equal parts. A finer partition can be obtained by dividing a data set into more than two parts. The (100p)th percentile separates the smallest 100p% of the data or distribution from the remaining values.
Quartiles and the Interquartile Range
Certain percentiles are particularly important. The quartiles (first quartile, median, third quartile) separate a data set or distribution into four equal parts:
• 25th percentile = first quartile or lower quartile, denoted by Q1
• 50th percentile = median
• 75th percentile = third quartile or upper quartile, denoted by Q3
Sample quartiles
Separate the n ordered sample observations into a lower half and an upper half. If n is odd, include the median in each half. Then
• Q1 = median of the lower half of the data
• Q3 = median of the upper half of the data
Note that there are several different sensible ways to define the sample quartiles. R offers several different ways of computing sample quartiles, so its output may not match the rule above.
Examples
Example. n = 15
20 25 25 27 28 31 33 34 36 37 44 50 59 85 86
Find Q1, the median, and Q3.
Example. n = 14
20 25 25 27 28 31 33 34 36 37 44 50 59 85
Find Q1, the median, and Q3.
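The half-splitting rule for sample quartiles can be sketched in Python as follows. This is only an illustration under the rule stated above: the function name `lecture_quartiles` is made up for this sketch, and other software may use a different quartile definition and return slightly different values.

```python
from statistics import median

def lecture_quartiles(data):
    """Return (Q1, median, Q3) using the half-splitting rule.

    Hypothetical helper: when n is odd, the median is included in both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half + (n % 2)]  # n odd: the middle value closes the lower half
    upper = xs[half:]            # n odd: the middle value opens the upper half
    return median(lower), median(xs), median(upper)

sample_15 = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]
sample_14 = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85]

print(lecture_quartiles(sample_15))  # (27.5, 34, 47.0)
print(lecture_quartiles(sample_14))  # (27, 33.5, 44)
```

For n = 15 the median (34) belongs to both halves, so each half has eight values and each quartile is the average of two observations. For n = 14 each half has seven values and the quartiles are actual observations.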
IQR and Outlier Detection
• Interquartile range (IQR)
• IQR = Q3 - Q1
• Resistant to the effect of outliers.
• Useful for estimating variability when the distribution is skewed.
Determining outliers
Suspected (mild) outlier – any observation farther than 1.5 IQR from the closest quartile (i.e., below Q1 - 1.5 IQR or above Q3 + 1.5 IQR).
Highly suspected (extreme) outlier – any observation farther than 3 IQR from the nearest quartile (i.e., below Q1 - 3 IQR or above Q3 + 3 IQR).
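As a rough sketch, the two outlier rules can be written as a pair of small Python functions. The names `outlier_fences` and `classify` are illustrative only. The quartiles Q1 = 27.5 and Q3 = 47 used below are those of the n = 15 example data when the median is included in each half.

```python
def outlier_fences(q1, q3):
    """Return the mild (1.5 IQR) and extreme (3 IQR) cutoffs as (low, high) pairs."""
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(x, q1, q3):
    """Label one observation as 'extreme', 'mild', or 'not an outlier'."""
    fences = outlier_fences(q1, q3)
    low_extreme, high_extreme = fences["extreme"]
    low_mild, high_mild = fences["mild"]
    if x < low_extreme or x > high_extreme:
        return "extreme"
    if x < low_mild or x > high_mild:
        return "mild"
    return "not an outlier"

q1, q3 = 27.5, 47.0  # quartiles of the n = 15 example data
data = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]

flagged = {x: classify(x, q1, q3) for x in data
           if classify(x, q1, q3) != "not an outlier"}

print(outlier_fences(q1, q3))  # {'mild': (-1.75, 76.25), 'extreme': (-31.0, 105.5)}
print(flagged)                 # {85: 'mild', 86: 'mild'}
```

Here IQR = 19.5, so 85 and 86 lie above Q3 + 1.5 IQR = 76.25 but below Q3 + 3 IQR = 105.5, which makes them mild rather than extreme outliers.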
Boxplots
A boxplot is a visual display of data based on the following five-number summary:
Min, Q1, Median, Q3, Max
Note: Boxplots are always drawn so that values increase from bottom to top or from left to right.
• A central box spans from Q1 to Q3, and a line in the box marks the median.
• The upper whisker extends to the largest data value within the upper limit, Q3 + 1.5 IQR, and the lower whisker extends to the smallest data value within the lower limit, Q1 - 1.5 IQR.
• Outliers, the observations beyond these limits, are marked with “o”.
Boxplot Examples
• Ultrasound was used to gather the accompanying corrosion data on the thickness of the floor plate of an aboveground tank used to store crude oil (“Statistical Analysis of UT Corrosion Data from Floor Plates of a Crude Oil Aboveground Storage Tank”, Material Eval., 1994: 846-849). Each observation is the largest pit depth in the plate, expressed in milli-in.
40 52 55 60 70 75 85 85 90 90 92 94 94 95 98 100 115 125 125
Find the five-number summary and plot the boxplot.
• The effects of partial discharges on the degradation of insulation cavity material have important implications for the lifetimes of high-voltage components. Consider the following sample of n = 25 pulse widths from slow discharges in a cylindrical cavity made of polyethylene:
5.3 8.2 13.8 74.1 85.3 88.0 90.2 91.5 92.4 92.9 93.6 94.3 94.8 94.9 95.5 95.8 95.9 96.6 96.7 98.1 99.0 101.4 103.7 106.0 113.5
Find the five-number summary and plot the boxplot.
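The numbers needed to draw these two boxplots by hand can be computed with a short Python sketch. It assumes the quartile rule that includes the median in each half when n is odd; `boxplot_stats` is a hypothetical helper name, and a plotting library may use a different quartile rule and so draw a slightly different box.

```python
from statistics import median

def boxplot_stats(data):
    """Five-number summary, whisker ends, and outliers (1.5 IQR rule)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    q1 = median(xs[:half + (n % 2)])  # lower half (includes the median if n is odd)
    q3 = median(xs[half:])            # upper half (includes the median if n is odd)
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_limit <= x <= high_limit]
    return {
        "five_number": (xs[0], q1, median(xs), q3, xs[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in xs if x < low_limit or x > high_limit],
    }

pit_depths = [40, 52, 55, 60, 70, 75, 85, 85, 90, 90,
              92, 94, 94, 95, 98, 100, 115, 125, 125]
pulse_widths = [5.3, 8.2, 13.8, 74.1, 85.3, 88.0, 90.2, 91.5, 92.4, 92.9,
                93.6, 94.3, 94.8, 94.9, 95.5, 95.8, 95.9, 96.6, 96.7, 98.1,
                99.0, 101.4, 103.7, 106.0, 113.5]

pit_stats = boxplot_stats(pit_depths)
pulse_stats = boxplot_stats(pulse_widths)
print(pit_stats)
print(pulse_stats)
```

Under this rule the pit-depth data have five-number summary (40, 72.5, 90, 96.5, 125) and no outliers, so the whiskers reach the minimum and maximum. The pulse-width data have Q1 = 90.2 and Q3 = 96.7, whiskers ending at 85.3 and 106.0, and five flagged observations: 5.3, 8.2, 13.8, 74.1, and 113.5.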
Comparative Boxplots
A comparative boxplot (or side-by-side boxplot) provides a very effective way of revealing similarities and differences between two or more data sets consisting of observations on the same variable.
Example. The article “Compression of Single-Wall Corrugated Shipping Containers Using Fixed and Floating Test Platens” (J. of Testing and Evaluation, 1992: 318-320) describes an experiment in which several different types of boxes were compared with respect to compression strength. Consider the following observations on four different types of boxes:
Type of Box    Compression Strength (lb)
1              655.5  788.3  734.3  721.4  679.1  699.4
2              789.2  772.5  786.9  686.1  732.1  774.8
3              737.1  639.0  696.3  671.7  717.2  727.1
4              535.1  628.7  542.4  559.0  586.9  520.0
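Before drawing the side-by-side boxes, it can help to tabulate one five-number summary per box type. The sketch below does this in Python under the half-splitting quartile rule; the names `strength`, `five_number`, and `summaries` are illustrative choices, not part of the lecture.

```python
from statistics import median

# Compression strength (lb) for the four box types in the table.
strength = {
    1: [655.5, 788.3, 734.3, 721.4, 679.1, 699.4],
    2: [789.2, 772.5, 786.9, 686.1, 732.1, 774.8],
    3: [737.1, 639.0, 696.3, 671.7, 717.2, 727.1],
    4: [535.1, 628.7, 542.4, 559.0, 586.9, 520.0],
}

def five_number(data):
    """Return (min, Q1, median, Q3, max) using the half-splitting rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    return (xs[0], median(xs[:half + n % 2]), median(xs), median(xs[half:]), xs[-1])

summaries = {box_type: five_number(values) for box_type, values in strength.items()}

for box_type, (low, q1, med, q3, high) in summaries.items():
    print(f"Type {box_type}: min={low:.1f}  Q1={q1:.1f}  "
          f"median={med:.2f}  Q3={q3:.1f}  max={high:.1f}")
```

Type 4's largest value (628.7) is below the smallest value of every other type, a separation that a comparative boxplot makes visible at a glance.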
Quantile Plots
An investigator frequently wishes to know whether data were selected from a particular type of population distribution (e.g., a normal distribution). For one thing, many inferential procedures are based on the assumption that the underlying distribution is of a specified type. The use of such procedures is inappropriate if the actual distribution differs greatly from the assumed type. Additionally, understanding the underlying distribution can sometimes give insight into the physical mechanisms involved in generating the data.
An effective way to check a distributional assumption is to construct a quantile plot (or probability plot).
Idea: Plot the sample quantiles vs. the theoretical quantiles (population quantiles).
• If the data come from the assumed distribution, the points in the plot will fall close to a straight line.
• If the actual distribution is quite different from the one used to construct the plot, the points should depart substantially from a linear pattern.
Normal Quantile Plot
A Normal Quantile Plot is a plot of the (z quantile, sample quantile) pairs.
Example. The accompanying sample consisting of n = 20 observations on dielectric breakdown voltage of a piece of epoxy resin appeared in the article “Maximum Likelihood Estimation in the 3-Parameter Weibull Distribution” (IEEE Trans on Dielectrics and Elec. Insul., 1996: 43-55).
24.46 25.61 26.25 26.42 26.66 27.15 27.31 27.54 27.74 27.94 27.98 28.04 28.28 28.49 28.50 28.87 29.11 29.13 29.50 30.88
Is the population distribution of dielectric breakdown voltage normal?
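The (z quantile, sample quantile) pairs for this example can be sketched in Python as below. One assumption needs flagging: the slide does not say which probabilities to use for the z quantiles, so the sketch pairs the i-th smallest observation with the standard normal quantile at (i - 0.5)/n, a common convention. The function names are illustrative, and the correlation is only an informal numerical companion to looking at the plot.

```python
from math import sqrt
from statistics import NormalDist, mean

voltages = [24.46, 25.61, 26.25, 26.42, 26.66, 27.15, 27.31, 27.54, 27.74, 27.94,
            27.98, 28.04, 28.28, 28.49, 28.50, 28.87, 29.11, 29.13, 29.50, 30.88]

def normal_quantile_pairs(data):
    """Pair each ordered observation with a standard normal quantile.

    Assumption: the i-th smallest value is matched to probability (i - 0.5) / n.
    """
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

def correlation(pairs):
    """Sample correlation between the z quantiles and the sample quantiles."""
    zs = [z for z, _ in pairs]
    xs = [x for _, x in pairs]
    z_bar, x_bar = mean(zs), mean(xs)
    s_zx = sum((z - z_bar) * (x - x_bar) for z, x in pairs)
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_zx / sqrt(s_zz * s_xx)

pairs = normal_quantile_pairs(voltages)
r = correlation(pairs)

for z, x in pairs:
    print(f"{z:7.3f}  {x:6.2f}")
print(f"correlation between z quantiles and data: {r:.3f}")
```

Plotting the second column against the first gives the normal quantile plot. For these data the correlation comes out close to 1, which is consistent with a roughly linear pattern, though the plot itself should still be inspected for curvature or stray points.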