CHEE320 Module 1: Graphical Methods for Analyzing Data, and Descriptive Statistics J. McLellan
Graphical Methods for Analyzing Data
What is the pattern of variability?
Techniques
• histograms
• dot plots
• stem and leaf plots
• box plots
• quantile plots
Histogram
• summary of the frequency with which certain ranges of values occur
• the ranges are called “bins”
• choice of bin size influences the ability to recognize a pattern
  • too large - data are clustered in a few bins - no indication of the spread of the data
  • too small - data are distributed with only a few points in each bin - no indication of where the data are concentrated
• there are quantitative rules for choosing the number of bins - typically automated in statistical software
  • not automated in Excel!
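The slide does not say which quantitative rule is meant. As one illustration, the sketch below assumes Sturges' rule (number of bins = ceil(log2 n) + 1) and counts observations in equal-width bins. The function names are hypothetical, and the data are the 20 observations listed on the stem and leaf slide.

```python
import math

def sturges_bin_count(n):
    """Sturges' rule (an assumed choice of rule): ceil(log2(n)) + 1 bins."""
    return math.ceil(math.log2(n)) + 1

def histogram_counts(data, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value would land one bin past the end, so clamp it.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts

tooth = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
         11, 15, 22, 21, 27, 25, 18, 21, 18, 20]

n_bins = sturges_bin_count(len(tooth))
counts = histogram_counts(tooth, n_bins)
for k, c in enumerate(counts):
    print(f"bin {k}: {'#' * c}")
```

Calling `histogram_counts` with a much smaller or much larger `n_bins` shows the two failure modes from the slide: everything piles into a few bins, or each bin holds only one or two points.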
Histogram - Important Features
• symmetry?
• number of peaks
• tails? - extreme data points
• max, min data values - range of values
• spread in the data
• centre of gravity
Dot Plots
• similar to histogram
• plot data by value on horizontal axis
• stack repeated values vertically
• look for similar shape features as for histogram
• e.g., data set for solder thickness: {0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1}
[Figure: dot plot of the solder thickness data; horizontal axis from 0.06 to 0.13]
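A dot plot of the solder thickness data can be sketched in plain text. This is only an illustration: the function names and the 0.01 tick spacing are assumptions, chosen to match the resolution of the data.

```python
from collections import Counter

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]

def dot_counts(data, step=0.01):
    """Count observations at each tick; tick k stands for the value k * step."""
    return Counter(round(x / step) for x in data)

def dot_plot(data, step=0.01, decimals=2):
    """Text dot plot: values along the bottom, repeated values stacked upward."""
    counts = dot_counts(data, step)
    ticks = range(min(counts), max(counts) + 1)   # keeps empty ticks such as 0.08
    width = decimals + 3                          # room for a label like "0.07"
    lines = []
    for level in range(max(counts.values()), 0, -1):
        cells = [("o" if counts[t] >= level else "").center(width) for t in ticks]
        lines.append("".join(cells).rstrip())
    labels = [f"{t * step:.{decimals}f}".center(width) for t in ticks]
    lines.append("".join(labels).rstrip())
    return "\n".join(lines)

print(dot_plot(solder))
```

The tallest stack sits at 0.10 (three observations), and the empty column at 0.08 is kept so that gaps in the data remain visible.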
Stem and Leaf Plots
• illustrate variability pattern using the numerical data itself
• choose base division - “stem”
• build “leaves” by taking digit next to base division
Data - Tooth Discoloration by Fluoride
12.00 10.00 14.00 20.00 18.00 18.00 25.00 21.00 36.00 44.00
11.00 15.00 22.00 21.00 27.00 25.00 18.00 21.00 18.00 20.00
Decimal point is 1 place to the right of the colon
Stems : Leaves
1 : 0124      (10-14)
1 : 58888     (15-19)
2 : 001112    (20-24)
2 : 557       (25-29)
3 :
3 : 6
Stem and Leaf Plots - Solder example
• numbers viewed as 0.070, 0.080, 0.090, 0.100, 0.110,…
• decision - what is the stem?
• considerations similar to histogram - size of bins
Decimal point is 2 places to the left of the colon
 7 : 0
 8 :
 9 : 00
10 : 000
11 : 0
12 : 0
13 : 0
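The construction for the tooth discoloration data can be sketched as follows. The function name is hypothetical, and the split of each stem into two rows (leaves 0-4 and 5-9) is taken from the display above. The sketch also produces a final row for the value 44, which does not appear in the display as given.

```python
def stem_and_leaf(values):
    """Rows of a split-stem display for non-negative integer data.

    The stem is the tens digit; each stem gets two rows,
    one for leaves 0-4 and one for leaves 5-9.
    """
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        key = (stem, leaf // 5)            # (tens digit, lower or upper half)
        rows.setdefault(key, []).append(str(leaf))
    key, last = min(rows), max(rows)
    lines = []
    while key <= last:
        # Empty rows are kept so that gaps in the data stay visible.
        lines.append(f"{key[0]} : {''.join(rows.get(key, []))}".rstrip())
        key = (key[0], 1) if key[1] == 0 else (key[0] + 1, 0)
    return lines

tooth = [12, 10, 14, 20, 18, 18, 25, 21, 36, 44,
         11, 15, 22, 21, 27, 25, 18, 21, 18, 20]

for line in stem_and_leaf(tooth):
    print(line)
```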
Box Plots
• graphical representation of “quartile” information and extreme data values
• quartiles - describe how the ordered data are distributed
  • 1st quartile (Q1) - separates bottom 25% of data
  • 2nd quartile (median) - separates bottom 50% of data
  • 3rd quartile (Q3) - separates bottom 75% of data
• interquartile range: IQR = Q3 - Q1
• add “whiskers” - extend from the box to the most extreme data points that still lie within
  • upper limit: Q3 + 1.5 * IQR
  • lower limit: Q1 - 1.5 * IQR
• plot outliers individually - data points above Q3 + 1.5*IQR or below Q1 - 1.5*IQR
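The numbers behind a box plot can be computed directly from these rules. The sketch below uses the solder thickness data and Python's `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions k(N+1)/4 in the ordered data. The function name and the term “fence” for the 1.5*IQR limits are illustrative choices, not from the slides.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, 1.5*IQR limits ("fences"), whisker ends and outliers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside),     # most extreme points inside the fences
        "whisker_high": max(inside),
        "outliers": sorted(x for x in data if x < lower_fence or x > upper_fence),
    }

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
summary = box_plot_summary(solder)
for name, value in summary.items():
    print(name, value)
```

For the solder data the outlier list is empty and the whiskers end at the smallest and largest observations. Replacing 0.13 with 0.5 moves that point outside the upper limit, so it would be plotted individually and the upper whisker would stop at 0.12.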
Box Plot - for solder data
[Figure: box plot of the solder data]
Interpretation
• no outliers
• relatively symmetric distribution
• longer tails on both sides
• fairly tightly clustered about centre
Box Plot - for teeth discoloration
[Figure: box plot of the teeth discoloration data]
Interpretation
• no outliers
• asymmetric distribution - long lower tail
• some tails on both sides
• fairly tightly clustered at higher range of discoloration
Quantile Plots
• plot cumulative progression of data
  • values vs. cumulative fraction of data
• comparison to standard distribution shapes
  • e.g., normal distribution, lognormal distribution, …
• can be plotted on special axes
  • analogous to semi-log graphs
  • provide a visual test for closeness to a given distribution
  • e.g., test to see if data are normally distributed
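The coordinates of a normal quantile plot can be sketched as below. Each ordered observation is paired with the standard normal quantile of its cumulative fraction, which plays the role of the special axis. The plotting position (i - 0.5)/n is one common convention and is an assumption here, as is the function name.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Pair each ordered observation with a standard normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()               # mean 0, standard deviation 1
    points = []
    for i, y in enumerate(ordered, start=1):
        p = (i - 0.5) / n                   # assumed cumulative fraction for point i
        points.append((std_normal.inv_cdf(p), y))
    return points

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
for z, y in normal_quantile_points(solder):
    print(f"{z:7.3f}  {y:.2f}")
```

If the data are close to normally distributed, the printed pairs fall near a straight line when the second column is plotted against the first.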
Quantile Plot - teeth discoloration
[Figure: normal quantile plot of the teeth discoloration data]
Interpretation
• data don’t follow linear progression
• underlying distribution not normal?
Note the irregular spacing of the axis - similar to “semi-log” paper - cumulative points should follow a linear progression on this scale if the distribution is normal.
Graphical Methods for Quality Investigations
• primary purpose - help organize information in quality investigation
Examples
• Pareto Charts
• Fishbone diagrams - Ishikawa diagrams
Pareto Chart
• used to rank factors
• typically presented as a bar chart, listing factors in descending order of significance
• significance can be determined by
  • number count - e.g., of defects attributed to specific causes
  • size of effect - e.g., based on coefficients in regression model
Example - Circuit Defects
Number of defects attributed to each cause (data from Montgomery):
Cause                   Defects
Stamping_Oper_ID        1
Stamping_Missing        1
Sold._Short             1
Wire_Incorrect          1
Raw_Cd_Damaged          1
Comp._Extra_Part        2
Comp._Missing           2
Comp._Damaged           2
TST_Mark_White_Mark     3
Tst._Mark_EC_Mark       3
Raw_CD_Shroud_Re.       3
Sold._Splatter          5
Comp._Improper_1        6
Sold._Opens             7
Sold._Cold_Joint        20
Sold._Insufficient      40
Pareto Chart - for circuit defect data
[Figure: Pareto chart of the circuit defect counts]
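The ranking behind the Pareto chart can be sketched from the defect counts in the example. The cumulative percentage column is a common addition to Pareto charts and is an assumption here, as is the function name.

```python
defects = {
    "Stamping_Oper_ID": 1, "Stamping_Missing": 1, "Sold._Short": 1,
    "Wire_Incorrect": 1, "Raw_Cd_Damaged": 1, "Comp._Extra_Part": 2,
    "Comp._Missing": 2, "Comp._Damaged": 2, "TST_Mark_White_Mark": 3,
    "Tst._Mark_EC_Mark": 3, "Raw_CD_Shroud_Re.": 3, "Sold._Splatter": 5,
    "Comp._Improper_1": 6, "Sold._Opens": 7, "Sold._Cold_Joint": 20,
    "Sold._Insufficient": 40,
}

def pareto_table(counts):
    """Rank causes by count (descending) and attach cumulative percentages."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    table, running = [], 0
    for cause, n in ranked:
        running += n
        table.append((cause, n, 100.0 * running / total))
    return table

for cause, n, cum_pct in pareto_table(defects):
    print(f"{cause:<22}{n:>4}{cum_pct:>8.1f}%")
```

The two largest causes, Sold._Insufficient and Sold._Cold_Joint, account for 60 of the 98 defects, which is the kind of concentration a Pareto chart is meant to expose.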
Fishbone Diagrams
• organize causes in analysis
• have spine, with cause types branching from spine, and sub-groups branching further
Example - factors influencing poor conversion in reactive extrusion
[Figure: fishbone diagram with “poor conversion” at the head of the spine; branch labels: catalyst used - metallocene/Ziegler-Natta; half-life; initiator type; polymer grade; barrel temperature; temperature distribution along barrel; temperature control]
Graphical Methods for Analyzing Data
Looking for time trends in data...
• Time sequence plot
• look for
  • jumps - indicate shift in mean operation
  • ramps to new values - indicate shift in mean operation
  • meandering - indicates time correlation in data
  • large amount of variation about general trend - indication of large variance
Time Sequence Plot - for naphtha 90% point
• naphtha 90% point - indicates amount of heavy hydrocarbons present in gasoline range material
[Figure: time sequence plot of the naphtha 90% point, annotated as follows]
• excursion - sudden shift in operation
• meandering about average operating point - time correlation in data
Graphical Methods for Analyzing Data
Monitoring process operation
• Quality Control Charts
  • time sequence plots with added indications of variation
  • account for fluctuations in values associated with natural process noise
  • look for significant jumps - shifts - that exceed normal range of variation of values
  • if significant shift occurs, stop and look for “assignable causes”
  • essentially graphical “hypothesis tests”
  • can plot - measurements, sample averages, ranges, standard deviations, ...
Example - Monitoring Process Mean
• is the average process operation constant?
• collect samples at time intervals, compute average, and plot in time sequence plot
• indication of process variation - standard deviation estimated from prior data
  • propagates through sample average calculation
• if “s” is the sample standard deviation and “n” is the number of observations in each sample, calculated averages will lie within $\pm 3s/\sqrt{n}$ of the historical average more than 99% of the time (about 99.7% for normally distributed averages) if the mean operation has NOT shifted
• values outside this range suggest that a shift in the mean operation has occurred - alarm - “something has happened”
Example - Monitoring Process Mean
• time sequence plot with these alarm limits is referred to as a “Shewhart X-bar Chart”
• X-bar - sample mean of X
[Figure: X-bar chart, annotated as follows]
• centre-line or target line - indicates mean when process is operating properly
• upper and lower control limits
• no points exceed limits - in a state of statistical control
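The alarm limits for an X-bar chart can be sketched as below, with limits placed 3 s / sqrt(n) either side of the centre-line. The function names are hypothetical and every number (target, s, n and the sample averages) is made up for illustration, not taken from the slides.

```python
import math

def xbar_limits(target, s, n):
    """Lower and upper control limits: target -/+ 3 * s / sqrt(n)."""
    half_width = 3.0 * s / math.sqrt(n)
    return target - half_width, target + half_width

def out_of_control(sample_means, target, s, n):
    """Indices of sample averages that fall outside the control limits."""
    lcl, ucl = xbar_limits(target, s, n)
    return [i for i, xbar in enumerate(sample_means) if xbar < lcl or xbar > ucl]

# Hypothetical values: historical average 100, s = 4 from prior data,
# and n = 4 observations averaged in each sample.
target, s, n = 100.0, 4.0, 4
means = [99.0, 101.5, 97.2, 103.0, 107.1, 100.4]

print(xbar_limits(target, s, n))
print(out_of_control(means, target, s, n))
```

With these made-up numbers the limits are 94 and 106, so only the fifth average (107.1) raises an alarm. The others are treated as natural process noise.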
Example - Monitoring Process Mean
• X-bar chart
[Figure: X-bar chart with one point outside the control limits]
• point exceeds region of natural variation - significant shift has occurred
Graphical Methods for Analyzing Data
Visualizing relationships between variables
Techniques
• scatterplots
• scatterplot matrices
  • also referred to as “casement plots”
Scatterplots
… are also referred to as “x-y diagrams”
• plot values of one variable against another
• look for systematic trend in data
  • nature of trend - linear? exponential? quadratic?
  • degree of scatter - does spread increase/decrease over range?
    • indication that variance isn’t constant over range of data
Scatterplots - Example
• tooth discoloration data - discoloration vs. fluoride
• trend - possibly nonlinear?
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
• significant trend? - doesn’t appear to be present
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
• variance appears to decrease as # of brushings increases
Scatterplot matrices
… are a table of scatterplots for a set of variables
Look for -
• systematic trend between “independent” variable and dependent variables - to be described by estimated model
• systematic trend between supposedly independent variables - indicates that these quantities are correlated
  • correlation can negatively influence model estimation results
  • not independent information
• scatterplot matrices can be generated automatically with statistical software, manually using Excel
Scatterplot Matrices - tooth data
[Figure: scatterplot matrix for the tooth data]
Describing Data Quantitatively
Approach - describe the pattern of variability using a few parameters
• efficient means of summarizing
Techniques
• average - (sample “mean”)
• sample standard deviation and variance
• median
• quartiles
• interquartile range
• ...
Sample Mean - “Average”
Given “n” observations $x_i$:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Notes -
• sensitive to extreme data values - outliers - value can be artificially raised or lowered
Sample Variance
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
• sum of squared deviations about the average, divided by n-1
• squaring - notion of distance (squared)
• average - is the centre of gravity
• sample variance provides a measure of dispersion - spread - about the centre of gravity
Note - there is an alternative form of this equation which is more convenient for computation.
Note that we divide by “n-1”, and NOT “n” - degrees of freedom argument
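The slide does not state the alternative form. Presumably it is the usual expansion of the squared deviations, which follows from the definition of the sample mean, $\sum_{i=1}^{n} x_i = n\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2
  = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\quad\Longrightarrow\quad
s^2 = \frac{1}{n-1}\left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)
```

This form needs only the running sums of $x_i$ and $x_i^2$, which is why it is convenient by hand. In floating-point arithmetic it can lose precision when the mean is large relative to the spread.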
Sample Standard Deviation
… is simply
$$s = \sqrt{s^2}$$
• sample standard deviation provides a more direct link to dispersion
• e.g., for Normal distribution
  • about 95% of values lie within 2 standard deviations of the mean
  • about 99.7% of values lie within 3 standard deviations of the mean
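As a check on the three definitions, the sketch below computes the sample mean, variance and standard deviation of the solder thickness data directly from the formulas and compares them with Python's `statistics` module. The function names are illustrative.

```python
import math
import statistics

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]

def sample_mean(x):
    """Sum of the observations divided by n."""
    return sum(x) / len(x)

def sample_variance(x):
    """Sum of squared deviations about the mean, divided by n - 1."""
    xbar = sample_mean(x)
    return sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)

def sample_std(x):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(x))

print(sample_mean(solder), statistics.mean(solder))
print(sample_variance(solder), statistics.variance(solder))
print(sample_std(solder), statistics.stdev(solder))
```

`statistics.variance` and `statistics.stdev` also divide by n - 1, so each pair of printed values should agree to within rounding.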
Range
• provides a measure of spread in the data
• defined as maximum data value - minimum data value
• can be sensitive to extreme data points
• is often monitored in quality control charts to see if process variance is changing
“Order” Statistics
… summarize the progression of observations in the data set
Quartiles
• divide the data in quarters
Deciles
• divide the data in tenths
...
Quartiles
• order data - N data points $\{y_i\}$, i=1,…,N
• if N is odd,
  • median is observation $y_{(N+1)/2}$
• if N is even,
  • median is $\dfrac{y_{N/2} + y_{N/2+1}}{2}$
  • i.e., midpoint between two middle points
Quartiles - Q1 and Q3
• Q1: Compute (N+1)/4 = A.B (A is the integer part; B denotes the fractional part, e.g., for 2.5, A = 2 and B = 0.5)
$$Q_1 = y_A + B\,(y_{A+1} - y_A)$$
• Q3: Compute 3(N+1)/4 = A.B and apply the same formula with the new A and B
• i.e., interpolate between adjacent points
• Note - there are other conventions as well - e.g., for Q1, take bottom half of data set, and take midpoint between middle two points if there are an even number of points...
Quartiles - Example
• solder data set
• observations: 0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1
• ordered: 0.07, 0.09, 0.09, 0.1, 0.1, 0.1, 0.11, 0.12, 0.13
• 9 points --> median is 5th observation: 0.1
• Q1: (N+1)/4 = 2.5
  • Q1 = 0.09 + 0.5*(0.09-0.09) = 0.09
• Q3: 3(N+1)/4 = 7.5
  • Q3 = 0.11 + 0.5*(0.12-0.11) = 0.115
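The interpolation rule used in this example can be sketched as a small function. The function name and the handling of positions that fall before the first or after the last observation (possible for very small N) are assumptions.

```python
import math

def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) by the k(N+1)/4 interpolation rule."""
    y = sorted(data)
    n = len(y)
    position = k * (n + 1) / 4      # 1-based position "A.B" in the ordered data
    a = math.floor(position)        # integer part A
    b = position - a                # fractional part B
    if a < 1:                       # assumed handling for very small N
        return y[0]
    if a >= n:
        return y[-1]
    # y[a - 1] is y_A and y[a] is y_(A+1), since Python lists are 0-based
    return y[a - 1] + b * (y[a] - y[a - 1])

solder = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
print(quartile(solder, 1), quartile(solder, 2), quartile(solder, 3))
```

With k = 2 the same rule reproduces the median: for odd N the position is a whole number, and for even N it falls halfway between the two middle points.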
Robustness
… refers to whether a given descriptive statistic is sensitive to extreme data points
Examples
• sample mean
  • is sensitive to extreme points - extreme value pulls average toward the extreme
• sample variance
  • sensitive to extreme points - large deviation from the sample mean leads to inflated variance
• median, quartiles
  • relatively insensitive to extreme data points
Robustness - Solder Data Example
• replace 0.13 by 0.5 - output from Excel
[Table: Excel descriptive statistics in two columns - “With 0.13” and “With 0.5”]
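The comparison can be recomputed with a short sketch. It uses Python's `statistics` module rather than Excel, so it shows the effect of the replacement but not the layout of the original output. The function name and the quartile convention (`method="exclusive"`, positions k(N+1)/4) are assumptions.

```python
import statistics

def describe(x):
    """A few descriptive statistics for comparing the two data sets."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="exclusive")
    return {
        "mean": statistics.mean(x),
        "variance": statistics.variance(x),
        "median": statistics.median(x),
        "q1": q1,
        "q3": q3,
    }

with_013 = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
with_05 = [0.5 if x == 0.13 else x for x in with_013]

base, changed = describe(with_013), describe(with_05)
for name in base:
    print(f"{name:<10}{base[name]:>10.5f}{changed[name]:>10.5f}")
```

The mean and variance shift noticeably when 0.13 becomes 0.5, while the median and both quartiles stay where they were.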
Robustness
• Other robust statistics
  • “m-estimator” - involves iterative filtering out of extreme data values, based on data distribution
  • trimmed mean - other bases for eliminating extreme data point effect
  • median absolute deviation
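Two of these statistics are simple enough to sketch. The definitions below are assumptions, since the slide does not spell them out: the trimmed mean drops a fixed number of observations from each end before averaging, and the median absolute deviation is the median of the absolute deviations from the median, with no scaling constant applied.

```python
import statistics

def trimmed_mean(data, n_trim=1):
    """Mean after dropping the n_trim smallest and n_trim largest values."""
    y = sorted(data)
    if 2 * n_trim >= len(y):
        raise ValueError("n_trim removes every observation")
    return statistics.mean(y[n_trim:len(y) - n_trim])

def median_absolute_deviation(data):
    """Median of |x - median(x)| (unscaled)."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

with_013 = [0.1, 0.12, 0.09, 0.07, 0.09, 0.11, 0.1, 0.13, 0.1]
with_05 = [0.5 if x == 0.13 else x for x in with_013]

print(trimmed_mean(with_013), trimmed_mean(with_05))
print(median_absolute_deviation(with_013), median_absolute_deviation(with_05))
```

For the solder data both statistics are essentially unchanged by replacing 0.13 with 0.5, in contrast to the sample mean and variance.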