Data, Statistics, and Environmental Regulation: Avoiding Pitfalls

Mike Aucott, Ph.D., Division of Science, Research & Technology, maucott@dep.state.nj.us, 292-7530

* Value of statistics
* Some common problems: non-detects, central tendency, uncertainty, significance
* Some suggestions for dealing with common problems
* Some examples based on real data

Once data are properly collected and quantified, statistics come into play. Statistics can help make sense of a jumble of data, can help identify meaningful inferences, and can prevent blunders that could lead to serious consequences. But they can be misused.

“The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you.” (Rita Mae Brown)

Identification of some potential problems can help avoid misuse and help harness the full power of statistics. The first problem we often face is dealing with non-detects (NDs); i.e., values that are below the minimum detection limit (MDL). What to do with these data? We typically use substitution methods. There are more sophisticated statistical approaches, but they are more complicated. With the substitution approach, NDs are typically replaced with the MDL itself, or with 1/2 the MDL. Sometimes, the NDs are discarded (not recommended) or, worse, replaced with zero.
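The effect of the substitution choice can be checked directly. The sketch below uses made-up readings and an assumed MDL of 1.0; the names `substitute` and `summarize` are illustrative only. It compares the mean obtained under each rule mentioned above.

```python
from statistics import mean

# Hypothetical concentrations; None marks a non-detect (ND).
MDL = 1.0  # assumed detection limit, in the same units as the readings
readings = [None, None, 4.0, None, 10.0, None]

def substitute(values, fill):
    """Replace each non-detect (None) with the chosen fill value."""
    return [fill if v is None else v for v in values]

def summarize(values, mdl):
    """Mean of the data under each common treatment of non-detects."""
    detected = [v for v in values if v is not None]
    return {
        "ND = MDL": mean(substitute(values, mdl)),
        "ND = MDL/2": mean(substitute(values, mdl / 2)),
        "ND = 0": mean(substitute(values, 0.0)),
        "NDs discarded": mean(detected),
    }

results = summarize(readings, MDL)
for rule, value in results.items():
    print(f"{rule:15s} mean = {value:.2f}")
```

With four of six readings below the detection limit, the mean in this made-up case ranges from about 2.3 to 7.0 depending on the rule, which is the kind of sensitivity that calls for a closer look.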

An example of a problem with non-detects, based on typical data: discharges from different facilities to a waste stream. Quantifying flows, or characterizing the entire group in some other way, requires replacing each ND with a number.

Problems arise when a large proportion, or a large weighted proportion, of the data are below the detection limit.

If the way that non-detects are treated influences the conclusions that are drawn from the data in an important way, seek expert advice on the best way to treat the values that are below the detection limit.

Assuming that the issue of non-detects is adequately dealt with, another frequently-faced challenge is describing or summarizing the data, i.e., determining the central tendency of the data. Here again there are some pitfalls.

Most widely-used statistical methods assume a normal distribution of data (also called a “Gaussian” or “bell-shaped” distribution).[2] Unfortunately, environmental data often do not have a normal distribution. But there are methods to assess the central tendency of non-normally distributed data that are relatively robust. And there are some pitfalls that should be avoided.

[2] There are other statistical methods that don’t assume normality, e.g. distribution of ranks and resampling methods. These are not discussed here.

[Table: a typical environmental data set.]

[Figure: its histogram, showing a non-normal distribution.]

Non-normal distributions often look normal if the log of each value is used instead of the actual value.
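One rough way to see this numerically is to compare the skewness of the raw values with the skewness of their logs. This is a minimal sketch on made-up, right-skewed readings; the `skewness` helper is illustrative, and skewness alone is not a full test of normality.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: near 0 if symmetric, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

# Hypothetical right-skewed readings: many low values, a few high ones.
readings = [1, 1, 2, 2, 3, 4, 5, 8, 15, 40, 120]
logs = [math.log10(v) for v in readings]

raw_skew = skewness(readings)
log_skew = skewness(logs)
print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

For these made-up readings the log values are much less skewed than the raw values, which is why a histogram of the logs tends to look closer to bell-shaped.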

One can convert the values of “log-normal” data to the logarithms of the values, and then perform statistical tests using common statistical methods. However, with environmental data, use of the logs of the actual data is often questionable, and can lead to conclusions that are not conservative and may be bogus. This is especially apparent in materials accounting contexts. (“There is no law of conservation of logarithms of mass.”[1])

[1] Parkhurst, David F., 1999, Arithmetic Versus Geometric Means for Environmental Concentration Data, Environ. Sci. Technol., 32, 92A-98A.

Another example of environmental data... What is the central tendency?

Three measures of central tendency: which is best? It depends. If you want to know what sort of value you are most likely to encounter on any given occasion, median or geometric mean may be best. But...

In a materials accounting context, the median and geometric mean are non-conservative and can lead to bogus conclusions regarding the central tendency.

[Figure: stream system with five tributaries, all with equal flows and varying concentrations (C) of a pollutant: 1, 1, 100, 10, and 1. What is the concentration of the pollutant at point P?]

Mean = 113/5 = 22.6
Median = 1
Geometric mean = $10^{(2+1+0+0+0)/5} = 10^{0.6} = 4.0$

C at point P = ?

There are more sophisticated ways to estimate the central tendency, but they require additional expertise. In a materials accounting context, i.e., where you care about the mass of something, the simplest and most conservative approach is to use the mean (arithmetic average). Unless you have a very good reason, do not use the median or geometric mean to describe or summarize non-normally distributed environmental data. Both the median and geometric mean undervalue high readings.
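The five-tributary example can be worked through as a mass balance. The sketch below assumes complete mixing and no loss of pollutant between the tributaries and point P; the flow value of 1.0 is an arbitrary placeholder for "equal flows", and the function names are illustrative.

```python
import math
from statistics import mean, median

# Concentrations in the five tributaries from the example (arbitrary units).
concentrations = [1, 1, 100, 10, 1]
# Equal flows, as stated in the example; 1.0 is an arbitrary placeholder value.
flows = [1.0] * 5

def geometric_mean(values):
    """Antilog of the mean of the base-10 logs."""
    return 10 ** mean(math.log10(v) for v in values)

def mixed_concentration(concs, flows):
    """Mass balance: total pollutant load divided by total flow."""
    load = sum(c * q for c, q in zip(concs, flows))
    return load / sum(flows)

c_at_p = mixed_concentration(concentrations, flows)
print(f"arithmetic mean = {mean(concentrations):.1f}")
print(f"median          = {median(concentrations):.1f}")
print(f"geometric mean  = {geometric_mean(concentrations):.1f}")
print(f"C at point P    = {c_at_p:.1f}")
```

Under these assumptions the concentration at P equals the arithmetic mean, 22.6. The median (1) and geometric mean (about 4.0) would understate it severalfold.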

Once we’ve estimated the central tendency, how confident are we about its accuracy? How much uncertainty is associated with the estimate? Estimating the uncertainty is important; it may be the most important tool of statistics. With a mean value, uncertainty is typically expressed as the range in which we can be 95% (or sometimes 90% or 99%) certain that the actual mean of the entire population will be.

With non-normally distributed data, especially if the number of samples is relatively small, a simple, relatively robust approach to determine the confidence interval is to use the formula:

$\mu = \bar{x} \pm t_{\alpha/2} \, s / \sqrt{n}$

where $\mu$ is the population mean, $\bar{x}$ is the sample mean, $t_{\alpha/2}$ is the critical value of the t distribution with $n-1$ degrees of freedom, $s$ is the standard deviation of the sample, and $n$ is the number of samples.

For Site A, the mean is 16.8 pg/m3, and the 95% confidence interval of the mean is ± 9.5 pg/m3. So, the true mean, which can be expected to emerge if enough samples are taken, may be as low as 7.3 pg/m3 or as high as 26.3 pg/m3. The same method can be applied to the other sites’ data. This can have important implications relative to standards or criteria.
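The confidence-interval calculation is short enough to sketch directly. The readings below are made up (they are not the Site A data), the function name `mean_ci95` is illustrative, and the lookup table holds standard two-sided 95% critical values of the t distribution for up to 10 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def mean_ci95(samples):
    """Return (sample mean, half-width of the 95% confidence interval)."""
    n = len(samples)
    x_bar = mean(samples)
    s = stdev(samples)  # sample standard deviation (n - 1 in the denominator)
    half_width = T_95[n - 1] * s / math.sqrt(n)
    return x_bar, half_width

# Hypothetical readings in pg/m3; these are NOT the Site A data.
samples = [5.0, 9.0, 12.0, 14.0, 20.0, 30.0]
x_bar, half = mean_ci95(samples)
print(f"mean = {x_bar:.1f} pg/m3, 95% CI = ± {half:.1f} pg/m3")
print(f"range: {x_bar - half:.1f} to {x_bar + half:.1f} pg/m3")
```

If a standard or criterion falls inside that range, the data alone cannot show whether the true mean is above or below it.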

Confidence intervals also apply to the assessment of differences and trends. The human brain excels at finding patterns, even from random data. Statistics can prevent us from thinking we see a pattern in what is actually random variation. The term significant is used to denote that some result or conclusion is not likely due to chance. Are differences or trends significant? In statistical terms, significant means there’s only a small probability (usually <5%, sometimes <1% or <10%) that the results are due to random variability. Sometimes a little more data can clarify whether an apparent trend is in fact significant.

[Figure: NJ GHG emissions, 1985 through 1999; trend not significant.]

The apparent positive trend is not significant at the 95% confidence level because there’s at least a 5% chance the true slope could be zero or negative.

[Figure: NJ GHG emissions, 1985 through 2000; significant trend.]
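A trend test of this kind can be sketched as a confidence interval on a fitted slope. The sketch below assumes an ordinary least-squares line with independent, roughly normal residuals, which may not be the method behind the charts; the two series are made up (they are not the NJ emissions data), and `slope_ci95` and `is_significant` are illustrative names.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
        11: 2.201, 12: 2.179, 13: 2.160, 14: 2.145}

def slope_ci95(x, y):
    """Least-squares slope and the half-width of its 95% confidence interval."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, T_95[n - 2] * se_slope

def is_significant(x, y):
    """True when the 95% interval for the slope excludes zero."""
    slope, half_width = slope_ci95(x, y)
    return abs(slope) > half_width

years = [0, 1, 2, 3, 4]                 # hypothetical time index
noisy = [10.0, 14.0, 9.0, 15.0, 12.0]   # upward drift buried in scatter
steady = [10.0, 11.5, 12.0, 13.5, 15.0] # upward drift with little scatter

for name, series in (("noisy", noisy), ("steady", steady)):
    slope, half = slope_ci95(years, series)
    print(f"{name}: slope = {slope:.2f} ± {half:.2f}, "
          f"significant = {is_significant(years, series)}")
```

Both made-up series slope upward, but only the one with little scatter has an interval that excludes zero.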

Statistics can be misused! Important potential pitfalls:
* inappropriate handling of values below detection limit
* inappropriate use of median and geometric mean
* failure to consider uncertainty
* failure to consider significance

But used properly, statistics will
* help make sense of a jumble of data
* help identify meaningful inferences, and
* prevent unwarranted conclusions.
