Statistical basis for dynamic prediction of beach bacteria concentrations. Zhongfu Ge National Research Council Walter E. Frick Ecosystems Research Div., NERL, USEPA, Athens, GA. NOAA’s Ocean and Human Health Initiative All PI’s 2006 Annual Meeting January 18-20, 2006 Charleston, SC.
Objectives • To demonstrate multiple linear regression (MLR) modeling of E. coli concentrations • To clarify a few misunderstandings and pitfalls in practice • To promote the idea of dynamic modeling, based on a growing database
Example of modeling at a Lake Erie beach • Huntington Beach, OH: data from 247 days in 2001; only four explanatory variables available
Cross-correlation with time delays • Do the data records need to be synchronized? Not for this case: the highest correlation is at zero time delay, and correlations at other delays are insignificant [Figure: correlation coefficient versus time delay]
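The lag check on this slide can be sketched as follows. This is a minimal illustration on synthetic data, not the Huntington Beach records; the function name `lagged_correlation` and the variable names are assumptions made for the example.

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Pearson correlation between y[t] and x[t - lag] for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for lag in range(max_lag + 1):
        # Pair each response value with the explanatory value `lag` steps earlier.
        xs = x[: len(x) - lag]
        ys = y[lag:]
        result[lag] = float(np.corrcoef(xs, ys)[0, 1])
    return result

# Synthetic illustration only: the response depends on the same-day predictor.
rng = np.random.default_rng(0)
predictor = rng.normal(size=247)
response = 0.8 * predictor + rng.normal(scale=0.5, size=247)

corr = lagged_correlation(predictor, response, max_lag=3)
best_lag = max(corr, key=lambda k: abs(corr[k]))
print(best_lag, corr)
```

If the largest absolute correlation sits at lag 0, as in the case described on the slide, the records can be used without shifting them in time.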
Transformation is necessary • Inspect scatter plots to decide whether a transformation is needed to equalize the spread
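A numerical counterpart to inspecting the scatter plot is to compare the residual spread in the upper and lower halves of the explanatory variable, before and after transforming. The sketch below assumes a log10 transformation, which is common for bacteria concentrations but is not stated on the slide; the data and the helper `spread_ratio` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, size=500)
# Synthetic response with multiplicative noise: its spread grows with its level.
y = 10.0 ** (1.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size))

def spread_ratio(x, y):
    """Residual std about a straight-line fit, high-x half divided by low-x half."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    high = x > np.median(x)
    return float(resid[high].std() / resid[~high].std())

ratio_raw = spread_ratio(x, y)            # far from 1: unequal spread
ratio_log = spread_ratio(x, np.log10(y))  # close to 1: roughly equal spread
print(ratio_raw, ratio_log)
```

A ratio near 1 suggests roughly equal spread; a ratio far from 1 suggests that a transformation is still needed.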
Interaction terms • Including interaction terms can greatly improve fitting performance • Transformations for equal spread are still needed
Categorization of wind direction • Wind direction was categorized into northerly (WD=0) and southerly (WD=1) winds • Histograms show equal spreads; the categorized data are nearly normally distributed [Figure: histograms for northerly winds and southerly winds]
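To illustrate how an interaction term enters the regression, the sketch below fits two models by least squares: one with main effects only and one with a product term between a continuous variable and the wind direction indicator WD. The choice of which variables interact, the coefficients, and the data are all assumptions for illustration; the slides do not list the actual terms.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and R^2. X must include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, float(r2)

rng = np.random.default_rng(2)
n = 247
x1 = rng.normal(size=n)                         # hypothetical continuous variable
wd = rng.integers(0, 2, size=n).astype(float)   # 0 = northerly, 1 = southerly
# Synthetic response whose slope on x1 depends on wind direction.
y = 1.0 + 0.2 * x1 + 0.5 * wd + 0.9 * x1 * wd + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
X_main = np.column_stack([ones, x1, wd])           # main effects only
X_int = np.column_stack([ones, x1, wd, x1 * wd])   # plus the interaction term

_, r2_main = fit_ols(X_main, y)
beta_int, r2_int = fit_ols(X_int, y)
print(r2_main, r2_int, beta_int)
```

The interaction coefficient measures how much the slope on `x1` differs between the two wind categories, which a main-effects model cannot represent.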
Other issues with data inspection • Multicollinearity: check the variance inflation factor (VIF) for each explanatory variable • Adjustment for “time-series effect” [Table: correlation coefficients]
MLR fitting of the full model and the normality of the residuals • The residuals are very close to normally distributed
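The VIF for variable j is 1 / (1 - R_j^2), where R_j^2 is obtained by regressing variable j on the remaining explanatory variables. A minimal sketch with synthetic columns follows; the function name and the data are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(3)
a = rng.normal(size=247)
b = rng.normal(size=247)
c = a + 0.1 * rng.normal(size=247)   # nearly a copy of `a`, so collinear with it
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

A VIF above about 10 is a widely used rule of thumb for problematic multicollinearity; that threshold is not taken from the slides.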
Outlier identification • Adjustment for serial correlation • Table 3: Influential cases and outliers from the full model; total number of cases: 247; numbers in red are influential outliers (appearing in both rows) • Outliers are not simple to deal with; if there is no evidence of measurement error, they should be kept
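The slide does not say which diagnostics produced the two rows of Table 3. One common pairing, assumed here, is studentized residuals for outliers and Cook's distance for influential cases; a case flagged by both would be an influential outlier. The thresholds (|t| > 2 and D > 4/n) are rules of thumb, and the data, including the planted case, are synthetic.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, internally studentized residuals and Cook's distance for an OLS fit."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Diagonal of the hat matrix H = X (X'X)^-1 X', computed as diag(X pinv(X)).
    h = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    s2 = resid @ resid / (n - p)
    student = resid / np.sqrt(s2 * (1.0 - h))
    cooks = student ** 2 * h / (p * (1.0 - h))
    return h, student, cooks

rng = np.random.default_rng(4)
n = 247
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[10], y[10] = 4.0, 3.0   # planted case: high leverage and far from the trend

X = np.column_stack([np.ones(n), x])
leverage, student, cooks = influence_measures(X, y)
# Cases that are both outliers and influential (assumed rule-of-thumb cutoffs).
flagged = np.flatnonzero((np.abs(student) > 2.0) & (cooks > 4.0 / n))
print(flagged)
```

Flagging a case only marks it for inspection; as the slide notes, it should be kept unless there is evidence of a measurement error.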
Model selection • Backward elimination: Cp + R2 or BIC + R2 • Symbols in the selection criteria (lost in extraction): number of variables; sample size; standard deviation [Figure: sequence of elimination, with the best models marked]
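The formulas on this slide were lost, so the sketch below uses textbook forms as an assumption: Cp = SSE_p / s_full^2 + 2p - n and BIC = n log(SSE_p / n) + p log(n), where p counts the fitted parameters including the intercept, n is the sample size, and s_full^2 is the residual variance of the full model. The elimination rule (drop the variable whose removal gives the lowest Cp) and the data are likewise illustrative.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def backward_elimination(X, y, names):
    """Drop one variable per step, always the one whose removal gives the lowest Cp."""
    n, k = X.shape
    ones = np.ones((n, 1))

    def design(cols):
        return np.hstack([ones, X[:, cols]])

    sigma2_full = sse(design(list(range(k))), y) / (n - k - 1)

    def cp(cols):
        p = len(cols) + 1   # parameters, including the intercept
        return sse(design(cols), y) / sigma2_full + 2 * p - n

    def bic(cols):
        p = len(cols) + 1
        return n * np.log(sse(design(cols), y) / n) + p * np.log(n)

    current = list(range(k))
    path = [(tuple(names[c] for c in current), cp(current), bic(current))]
    while len(current) > 1:
        candidates = [[c for c in current if c != drop] for drop in current]
        current = min(candidates, key=cp)
        path.append((tuple(names[c] for c in current), cp(current), bic(current)))
    return path

rng = np.random.default_rng(5)
n = 247
X = rng.normal(size=(n, 4))
# Only the first two columns affect the synthetic response.
y = 1.0 + 1.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

path = backward_elimination(X, y, ["x0", "x1", "x2", "x3"])
for model, cp_value, bic_value in path:
    print(model, round(cp_value, 2), round(bic_value, 2))
```

Under these definitions the full model always has Cp equal to p, so a reduced model is attractive when its Cp is small and close to its own p. R2 can then be compared among the models that pass that check.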
Model selection • What if we didn’t have interaction terms? All models are biased, so R2 by itself means nothing
Dynamic modeling • Models are updated when new observations are added to the database [Figure: predictions with (left) and without (right) outliers (days #1 and #135)]
Dynamic modeling • The table below shows how models change with time • R2 is consistently around 48% [Table legend: a mark indicates that the variable is in the model]
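The updating idea can be sketched as an expanding-window refit: each day, the regression is re-estimated on all observations collected so far and used to predict the next one. This simplified version only re-estimates the coefficients of a fixed set of variables; re-running model selection at each update, as the table on this slide implies, is left out. The function name, the 30-day minimum, and the data are assumptions.

```python
import numpy as np

def expanding_window_forecasts(X, y, min_train):
    """For each day t >= min_train, fit OLS on days [0, t) and predict day t."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    preds = np.full(n, np.nan)   # days without enough history stay NaN
    for t in range(min_train, n):
        beta, *_ = np.linalg.lstsq(A[:t], y[:t], rcond=None)
        preds[t] = A[t] @ beta
    return preds

rng = np.random.default_rng(6)
n = 247
X = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=0.5, size=n)

preds = expanding_window_forecasts(X, y, min_train=30)
rmse = float(np.sqrt(np.mean((preds[30:] - y[30:]) ** 2)))
print(rmse)
```

Because every prediction uses only earlier days, the resulting error is an out-of-sample measure, which is the relevant one for day-to-day beach forecasts.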
Conclusions • Model selection should use Cp and R2 together as criteria; R2 or a t-statistic alone is not a sufficient basis • Transformations are needed for the model assumptions (equal spread) to hold • Including interaction terms can improve the R2 of the model, which is especially useful when few variables are available (48% in the current case, compared with 41% in previous work without interactions) • Optimal models are both beach-specific and time-varying
References • Francy, D.S. and R.A. Darner 1998. Factors affecting Escherichia coli concentrations at Lake Erie public bathing beaches. USGS Water Resources Investigations Report 98-4241. Columbus, Ohio • Nevers, M.B. and R.L. Whitman 2004. Protecting visitor health in beach waters of Lake Michigan: problems and opportunities. In: The State of Lake Michigan: Ecology, Health and Management, eds. T. Edsall and M. Munawar. Ecovision World Monograph Series, Aquatic Ecosystem Health and Management Society • Ramsey, F.L. and D.W. Schafer 2002. The statistical sleuth: a course in methods of data analysis, second edition. Duxbury Thomson Learning
Acknowledgements