Application of Statistical Techniques to Interpretation of Water Monitoring Data. Eric Smith, Golde Holtzman, and Carl Zipper
Outline
I. Water quality data: program design (CEZ, 15 min)
II. Characteristics of water-quality data (CEZ, 15 min)
III. Describing water quality (GIH, 30 min)
IV. Data analysis for making decisions
A. Compliance with numerical standards (EPS, 45 min)
Dinner Break
B. Locational / temporal comparisons (“cause and effect”) (EPS, 45 min)
C. Detection of water-quality trends (GIH, 60 min)
III. Describing water quality (GIH, 30 min) • Rivers and streams are an essential component of the biosphere • Rivers are alive • Life is characterized by variation • Statistics is the science of variation • Statistical Thinking/Statistical Perspective • Thinking in terms of variation • Thinking in terms of distribution
The present problem is multivariate • WATER QUALITY as a function of • TIME, under the influence of co-variates like • FLOW, at multiple • LOCATIONS
[Figure: WQ variable versus time; y-axis: water variable, x-axis: time in years]
[Figure: Univariate WQ variable; axes: water quality, time]
The three most important pieces of information in a sample: • Central Location: Mean, Median, Mode • Dispersion: Range, Standard Deviation, Interquartile Range • Shape: Symmetry, skewness, kurtosis; no mode, unimodal, bimodal, multimodal
Central Location: Sample Mean • (Sum of all observations) / (sample size) • Center of gravity of the distribution • depends on each observation • therefore sensitive to outliers
Central Location: Sample Median
• Center of the ordered array, i.e., the (0.5)(n + 1)th observation in the ordered array.
• If the sample size n is odd, the median is the middle value in the ordered array.
Example A: 1, 1, 0, 2, 3
Ordered: 0, 1, 1, 2, 3
n = 5 (odd), (0.5)(n + 1) = 3, so Median = 1
• If the sample size n is even, the median is the average of the two middle values in the ordered array.
Example B: 1, 1, 0, 2, 3, 6
Ordered: 0, 1, 1, 2, 3, 6
n = 6 (even), (0.5)(n + 1) = 3.5, so Median = (1 + 2)/2 = 1.5
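The location rule in Examples A and B can be sketched in a few lines of Python. This is only an illustration; the function name `sample_median` is an assumption made for this sketch and is not part of the original slides.

```python
def sample_median(values):
    """Median located at position 0.5 * (n + 1) in the ordered array (1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number (0-based index mid).
        return ordered[mid]
    # Even n: position (n + 1) / 2 falls halfway between two observations,
    # so average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


median_a = sample_median([1, 1, 0, 2, 3])      # Example A from the slide
median_b = sample_median([1, 1, 0, 2, 3, 6])   # Example B from the slide
print(median_a, median_b)
```

Sorting a copy of the data, rather than sorting in place, leaves the original sampling order available for time-based plots.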
Central Location: Sample Median • Center of the ordered array • depends on the magnitude of the central observations only • therefore NOT sensitive to outliers
Central Location: Mean vs. Median • Mean is influenced by outliers • Median is robust against (resistant to) outliers • Mean “moves” toward outliers • Median almost always represents the bulk of the observations • Comparing the mean and the median tells us about outliers
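A small sketch may make the contrast concrete. It reuses the five values of Example A and then replaces the largest value with a hypothetical outlier of 30; that outlier is an assumption chosen purely for illustration, not monitoring data.

```python
import statistics

clean = [0, 1, 1, 2, 3]          # Example A values
with_outlier = [0, 1, 1, 2, 30]  # hypothetical: largest value replaced by 30

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

# The mean moves toward the outlier; the median does not move at all.
print("mean:  ", mean_clean, "->", mean_outlier)
print("median:", median_clean, "->", median_outlier)
```

A large gap between the two statistics in real data is a cue to look for outliers or skewness before summarizing with the mean.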
Dispersion • Range • Standard Deviation • Inter-quartile Range
Dispersion: Range • Maximum - Minimum • Easy to calculate • Easy to interpret • Depends on sample size (biased) • Therefore not good for statistical inference
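The dependence of the range on sample size can be seen in a short simulation. The data below are simulated, not measured: the normal distribution, its mean of 7.6 and SD of 0.41, and the seed are all assumptions chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Simulated (hypothetical) observations from a normal distribution.
sample = [rng.gauss(7.6, 0.41) for _ in range(200)]


def sample_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


sizes = [5, 20, 50, 200]
# Range of the first n observations, for growing n.
ranges = [sample_range(sample[:n]) for n in sizes]

for n, r in zip(sizes, ranges):
    print(f"n = {n:3d}  range = {r:.2f}")
```

Because each larger sample contains the smaller one, the range can only stay the same or grow as observations are added, which is why ranges from samples of different sizes are not directly comparable.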
Dispersion: Standard Deviation [Figure: two distributions, one with SD = 1 (interval −1 to +1 marked) and one with SD = 2 (interval −2 to +2 marked)]
Dispersion: Properties of SD • SD ≥ 0 for all data • SD = 0 if and only if all observations are the same (no variation) • Familiar intervals for a normal distribution: • about 68% expected within 1 SD, • about 95% expected within 2 SD, • about 99.7% expected within 3 SD • These percentages hold for a normal distribution and are only a ballpark for other distributions • For any distribution, at least 8/9 (about 89%) of observations lie within 3 SD (Chebyshev's inequality); in practice, usually nearly all do
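The familiar intervals can be checked numerically. The sketch below uses the standard identity that a normal observation falls within k SD of its mean with probability erf(k/√2), and adds Chebyshev's inequality as a distribution-free lower bound. The function names are assumptions made for this sketch.

```python
import math


def normal_coverage(k):
    """Probability that a normal observation lies within k SD of its mean."""
    return math.erf(k / math.sqrt(2))


def chebyshev_lower_bound(k):
    """Chebyshev's inequality: minimum share within k SD for any
    distribution with finite variance (informative only for k > 1)."""
    return 1 - 1 / k**2


coverage = {k: normal_coverage(k) for k in (1, 2, 3)}
bound_3sd = chebyshev_lower_bound(3)

for k, p in coverage.items():
    print(f"within {k} SD (normal): {p:.4f}")
print(f"within 3 SD (any distribution): at least {bound_3sd:.4f}")
```

The gap between the normal value at 3 SD and the Chebyshev bound shows how much the "nearly all within 3 SD" rule relies on the data being roughly bell-shaped.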
Interpretation of SD [Figure: sample with n = 200, Mean = 7.6, Median = 7.6, SD = 0.41]
Quartiles, Percentiles, Quantiles, Five Number Summary, Boxplot
Quartiles (undergrad classes) E.g., Sample: 0, −3.1, −0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3; n = 10 Note: Quartiles Q0, Q1, Q2, Q3, Q4 = Quantiles Q0.00, Q0.25, Q0.50, Q0.75, Q1.00
Terminology Warning: Quartiles are also known as percentiles or quantiles, but percentiles and quantiles are more general.
Quantile Location and Quantiles by weighted averages (graduate classes)
E.g., Sample: 0, −3.1, −0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3; n = 10
Ordered: −3.1, −0.4, 0, 0, 2.2, 2.3, 3.8, 3.8, 3.9, 5.1
Example: Find the 20th percentile of the sample above.
Step 1: q = 0.20, n = 10, so L = q(n + 1) = 0.20(10 + 1) = 2.2, indicating the “2.2th” observation in the ordered array.
Step 2: The 0.20 quantile is therefore a weighted average of the 2nd and 3rd observations in the ordered array, a = −0.4 and b = 0, with weight w = 0.2:
Q = −0.4 + 0.2(0 − (−0.4)) = −0.40 + 0.08 = −0.32
Quantile Location and Quantiles by weighted averages (graduate classes)
E.g., Sample: 0, −3.1, −0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3; n = 10
Step 2: a = −0.4, b = 0, w = 0.2
Q = a + w(b − a) = −0.4 + 0.2(0 − (−0.4)) = −0.4 + 0.2(0.4) = −0.40 + 0.08 = −0.32
[Figure: number line from a = −0.4 to b = 0 (width 0.4), with Q = −0.32 marked]
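The two steps above can be sketched as a small function: compute the location L = q(n + 1), then interpolate between the neighboring ordered values with Q = a + w(b − a). The function name `quantile_weighted` is an assumption, and so is the handling of locations that fall outside the sample (they are clamped to the minimum or maximum), since the slides do not cover that case.

```python
import math


def quantile_weighted(values, q):
    """Quantile by weighted average, using location L = q * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    loc = q * (n + 1)      # Step 1: 1-based location in the ordered array
    k = math.floor(loc)    # whole part: position of the lower neighbor a
    w = loc - k            # fractional part: the weight
    # Assumption: clamp to the extremes when L falls outside 1..n.
    if k < 1:
        return ordered[0]
    if k >= n:
        return ordered[-1]
    a, b = ordered[k - 1], ordered[k]
    return a + w * (b - a)  # Step 2: Q = a + w(b - a)


sample = [0, -3.1, -0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3]
q20 = quantile_weighted(sample, 0.20)
print(round(q20, 2))
```

The result is rounded for display only, because the location 0.20 × 11 is not represented exactly in floating point.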
Quantile Location and Quantiles Example: 0, − 3.1, − 0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3, n = 10
5-Number Summary and Boxplot using weighted averages for quantiles. Note: results differ slightly when weighted averages are used.
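As a hedged sketch, the five-number summary of the ten-value sample can be computed with Python's standard library. `statistics.quantiles` with `method="exclusive"` places the cut points at location q(n + 1) and interpolates, which matches the weighted-average approach; choosing this function is a decision made for this sketch, not something the slides prescribe.

```python
import statistics

sample = [0, -3.1, -0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3]

# n=4 requests the three quartile cut points; "exclusive" uses
# location q * (n + 1) with linear interpolation between neighbors.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="exclusive")

five_number = (min(sample), q1, q2, q3, max(sample))
print([round(v, 3) for v in five_number])
```

The box of a boxplot spans q1 to q3, with a line at q2. Other quantile conventions give slightly different quartiles for small samples, which is the point of the note on this slide.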
Dispersion: IQR (Inter-Quartile Range) • (3rd Quartile) − (1st Quartile) • Robust against outliers
Interpretation of IQR [Figure: sample with n = 200, Mean = 7.6, Median = 7.6, SD = 0.41, IQR = 0.54] For a normal distribution, Median ± 2 IQR includes 99.3% of observations.
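The 99.3% figure can be verified for a standard normal distribution with the standard library's `NormalDist`. This is a numerical check under the normality assumption stated on the slide; the variable names are chosen for this sketch.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1 (median is also 0)

# IQR of a normal distribution, in SD units.
iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)

# Probability of falling within median +/- 2 IQR.
coverage = z.cdf(2 * iqr) - z.cdf(-2 * iqr)

print(f"IQR = {iqr:.3f} SD")
print(f"coverage of median +/- 2 IQR = {coverage:.4f}")
```

Because the IQR of a normal distribution is about 1.35 SD, the interval Median ± 2 IQR is roughly Mean ± 2.7 SD, which explains why its coverage sits just below the 3 SD figure.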
Shape: Symmetry and Skewness • “Symmetry” means bilateral symmetry, skewness = 0 • Mean = Median (approximately) • Positive Skewness (asymmetric “tail” in positive direction) • Mean > Median • Negative Skewness (asymmetric “tail” in negative direction) • Mean < Median • Comparing the mean and the median tells us about shape
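The link between the sign of the skewness and the order of mean and median can be illustrated with a short sketch. The data are hypothetical, and the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second) is one common definition assumed here; the slides do not say which formula they use.

```python
import statistics


def moment_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (one common definition)."""
    m = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # hypothetical, long right tail
left_skewed = [-x for x in right_skewed]        # mirror image, long left tail

for name, data in (("right", right_skewed), ("left", left_skewed)):
    print(name,
          "skewness =", round(moment_skewness(data), 2),
          "mean =", statistics.fmean(data),
          "median =", statistics.median(data))
```

Mirroring the data flips the sign of the skewness and reverses the order of mean and median, while the spread stays the same.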
Box Plot [Figure: annotated box plot; labels: outliers, whiskers, 75th percentile = 3rd quartile, median, 25th percentile = 1st quartile, IQR (the height of the box)]
Wise, VA, below STP [Figures: pH; TKN (mg/l)]
Wise, VA, below STP [Figures: BOD (mg/l); DO (% saturation)]
Wise, VA, below STP [Figures: fecal coliforms; total phosphorus (mg/l)]