K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications). Balaji Rajagopalan Somkiat Apipattanavis & Erin Towler Department of Civil, Environmental and Architectural Engineering University of Colorado Boulder, CO Denver Water February 2007.
“Translation” of Climate Info • Users most interested in sectoral outcomes (streamflows, crop yields, risk of disease X) • Climate Forecast / Projection → Translation → Process Models → Distribution of Outcomes
Translation • Historical Data → Synthetic series → Process model → Frequency distribution of outcomes
Why Simulation? • Limited historical data • Cannot capture the full range of variability • Selecting a single year (or a set of years) from the historical record with equal chance: Unconditional bootstrap, Index Sequential Method • Need: a tool to generate ‘scenarios’ that capture the historical statistical properties • Several statistical techniques are available (e.g., time series techniques, Monte Carlo techniques) • These are cumbersome and restrictive in their assumptions • Re-sampling techniques are simple and robust • The unconditional bootstrap, the conditional bootstrap and the K-nearest neighbor (K-NN) bootstrap offer attractive alternatives.
Re-sampling Techniques • Drawing cards from a well-shuffled deck • Selecting a single year (or a set of years) from the historical record with equal chance: Unconditional bootstrap, Index Sequential Method • Drawing cards from a biased deck • Selecting a single year (or a set of years) with unequal chance, e.g., selecting only El Niño years: Conditional bootstrap • K-Nearest Neighbor Bootstrap – “pattern matching” • Select the ‘K’ nearest neighbors (e.g., years) to the current ‘feature’ • Select one of the K neighbors at random • Repeat to produce an ensemble
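The K-NN bootstrap steps above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `knn_bootstrap`, the one-dimensional climate index used as the ‘feature’, and all numeric values are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def knn_bootstrap(features, current, k, n_draws, rng):
    """Return indices of historical years resampled from the K nearest neighbors."""
    features = np.asarray(features, dtype=float)
    current = np.asarray(current, dtype=float)
    # Euclidean distance from the current feature to every historical year (assumed metric)
    distances = np.linalg.norm(features - current, axis=1)
    # Indices of the K closest years
    neighbors = np.argsort(distances)[:k]
    # Pick one of the K neighbors at random, repeated n_draws times to form an ensemble
    return rng.choice(neighbors, size=n_draws, replace=True)

rng = np.random.default_rng(0)
years = np.arange(1931, 1941)
# Hypothetical climate index for each year (one feature per year)
index = np.array([[-1.2], [0.3], [1.5], [0.1], [-0.4],
                  [2.0], [0.2], [-1.0], [0.9], [0.0]])
picked = knn_bootstrap(index, current=[0.2], k=3, n_draws=5, rng=rng)
ensemble_years = years[picked]
print(ensemble_years)
```

With these made-up values, the three nearest neighbors to a current index of 0.2 are 1932, 1934 and 1937, so every ensemble member is one of those years.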
Examples • Ensemble Weather Generation • Scenario generation • Forecast Argentina - Pampas Region • Water Quality Modeling (Boulder Water Utility)
Two Step Weather Generator • Estimate the transition probabilities (wet to dry, etc.) of a first-order Markov chain from historical data, separately for each month • Generate the precipitation state time series using the Markov chain • Suppose we need to simulate weather for January 5th, and January 4th is a wet day • Get neighbors from a 7-day window (7*50 days) centered on January 4th • Screen these days using the precipitation state [(1,0), days in blue] to obtain the “potential neighbors” • Calculate the distances between the weather variables of the current day’s feature vector and those of the potential neighbors • Select the K nearest neighbors • Assign them weights • Pick a day from the K nearest neighbors using the weight function, say Jan 1st 1953 • The simulated weather for Jan 5th is that of the following day, Jan 2nd 1953 • Repeat
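The first step (the Markov chain for precipitation state) can be sketched as follows. This is an illustrative sketch under stated assumptions: states are coded 0 = dry and 1 = wet, the short `historical` series is invented, and a single transition matrix is estimated, whereas the slide calls for one per month. The K-NN resampling of weather variables in the second step is not shown here.

```python
import numpy as np

def estimate_transitions(states):
    """Estimate first-order transition probabilities from a 0/1 (dry/wet) series."""
    states = np.asarray(states, dtype=int)
    counts = np.zeros((2, 2))
    for today, tomorrow in zip(states[:-1], states[1:]):
        counts[today, tomorrow] += 1
    # Row i holds P(tomorrow's state | today's state = i).
    # Assumes both states occur in the record, otherwise a row sum would be zero.
    return counts / counts.sum(axis=1, keepdims=True)

def generate_states(transition, n_days, first_state, rng):
    """Generate a precipitation state series from the transition matrix."""
    states = [first_state]
    for _ in range(n_days - 1):
        p_wet = transition[states[-1], 1]
        states.append(int(rng.random() < p_wet))
    return np.array(states)

rng = np.random.default_rng(1)
# Hypothetical historical wet/dry record for one month
historical = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0]
P = estimate_transitions(historical)
simulated = generate_states(P, n_days=31, first_state=1, rng=rng)
print(P)
print(simulated)
```

Each consecutive pair of generated states, such as (1,0) for a wet day followed by a dry day, is what the generator then uses to screen the potential neighbors.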
Single Site Simulation • Pergamino, Argentina • Daily weather variables 1931-2003 • Precipitation • Max. Temperature • Min. Temperature • 100 simulations, each 73 years long (the length of the record) • Statistics of the simulated and historical data are compared
Spell Properties Pergamino, Argentina
Conditional K-NN Re-sampling • Conditioned on the IRI seasonal forecast • Get the forecast probabilities (A:N:B = 40:35:25, i.e., above normal : normal : below normal) • Divide the historical seasonal totals into 3 tercile categories • Bootstrap 40, 35 and 25 samples of historical years from the wet, normal and dry categories, respectively • Apply the two-step weather generator to this sample.
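The conditional sampling of years can be sketched as below. The function name `conditional_bootstrap`, the twelve-year record and the seasonal totals are hypothetical, and tercile boundaries are taken from the empirical 1/3 and 2/3 quantiles, which is one reasonable reading of the slide rather than a confirmed detail.

```python
import numpy as np

def conditional_bootstrap(years, seasonal_totals, counts, rng):
    """Resample historical years from tercile categories in forecast proportions."""
    years = np.asarray(years)
    totals = np.asarray(seasonal_totals, dtype=float)
    # Tercile boundaries of the historical seasonal totals
    lower, upper = np.quantile(totals, [1 / 3, 2 / 3])
    categories = {
        "above": years[totals > upper],
        "normal": years[(totals >= lower) & (totals <= upper)],
        "below": years[totals < lower],
    }
    sample = []
    for name, n in counts.items():
        # Draw n years with replacement from this category
        sample.extend(rng.choice(categories[name], size=n, replace=True))
    return np.array(sample), categories

rng = np.random.default_rng(3)
years = np.arange(1931, 1943)
# Hypothetical seasonal precipitation totals (mm)
totals = [310, 420, 150, 275, 500, 380, 220, 460, 190, 340, 410, 260]
# A:N:B = 40:35:25 expressed as counts out of 100 sampled years
counts = {"above": 40, "normal": 35, "below": 25}
sample, categories = conditional_bootstrap(years, totals, counts, rng)
print(len(sample), categories)
```

The resulting list of 100 years is the pool on which the two-step weather generator would then be run.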
Multi-site extension • The same procedure as for a single site is used, except: • Calculate the average time series across stations (“single site virtual weather data”) • Apply the two-step generator • Select the weather at all the locations on the picked day to obtain the multi-site simulation • Stations in the Pampas region, Argentina • Pergamino • Junin • Nueve de Julio
Multisite Case wet and dry spell Statistics Pergamino, Argentina
Motivation • Finished water must comply with a given regulation • Influent Water Quality (TOC, TSUVA, Alkalinity, pH, Turbidity, Temperature) → Water Treatment Plant → Finished Water Quality
Motivation • Uncertainty helps us to understand the risk of non-compliance with a given regulation • Input distribution → WTP → Output distribution (comply / non-compliance) • The possibilities!
Data Set Information Collection Rule (ICR) • Monitoring effort mandated by USEPA • Large public water systems • Water quality and operating data • Disinfection by-products (DBPs) and microorganisms to support rulemakings • Most comprehensive view of large drinking water systems to date
Data Set ICR • 18 months (Jul. 1997 – Dec. 1998) • 458 continental US locations
Data Set ICR Database • Water Quality • Influent • Intermediate • Finished • Distribution system • Chemical Additions
Characterize Variability Influent water quality has significant variability due to - climate - geology - water management practices Source Water • TOC • TSUVA • Alkalinity • pH • Turbidity • Temperature • Total Hardness
Variability • Examine influent water quality for surface waters (SWs) • Spatial variability • Temporal variability • Focus on total organic carbon (TOC) • TOC is a precursor in formation of DBPs • Methods extend to other water quality parameters
Variability Spatial Variability • Local polynomial approach • Find best K and P combination • Contour estimates
Variability Spatial Variability SW Average Annual TOC (mg/L)
Variability Spatial Variability Similar spatial patterns found for • Finished water TOC (lower) • Distribution system DBPs • TTHM (total trihalomethanes) • HAA5 (five haloacetic acids)
Variability Spatial Variability Spatial patterns consistent with previous research for other influent water quality variables • Alkalinity • Bromide
Variability Temporal Variability City of Boulder’s Betasso Water Treatment Plant (CO) [Figure: influent TOC (mg/L) by month, January to December]
Variability Temporal Variability • Some locations exhibited seasonal trends, others did not • Month to month variations should be considered
Variability • Inherent variability in water quality contributes to uncertainty • How can we quantify uncertainty?
Quantify Uncertainty Simulate “ensembles” of influent water quality (Monte Carlo) Ensembles Observed data
Quantify Traditional Method • Fit a probability density function (pdf) to the data (Normal, Lognormal, etc.) • Simulate from the pdf
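A minimal sketch of the traditional method, assuming a lognormal pdf and 18 invented monthly TOC values (the variable names and the ensemble size are also illustrative). The lognormal parameters are estimated from the mean and standard deviation of the log-transformed data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 18 months of influent TOC observations (mg/L)
toc = np.array([1.2, 1.4, 1.1, 2.8, 3.5, 3.1, 2.2, 1.9, 1.6,
                1.5, 1.3, 1.2, 1.3, 1.5, 1.2, 2.9, 3.6, 3.0])

# Fit: a lognormal pdf is a normal pdf on the log scale
log_toc = np.log(toc)
mu = log_toc.mean()
sigma = log_toc.std(ddof=1)

# Simulate: 500 ensemble members of 12 monthly values each
ensembles = rng.lognormal(mean=mu, sigma=sigma, size=(500, 12))
print(ensembles.shape, ensembles.mean())
```

Note that values drawn this way are independent of month, so any seasonal pattern in the observations is lost unless a separate pdf is fitted per month.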
Quantify Limitations - What if the pdf is not a good fit? - What if you don’t have enough data to fit a pdf? E.g., 18 months per location in the ICR database
Quantify Space-Time Bootstrapping Method • Skip fitting a pdf to the data • Simulate by bootstrapping • Randomly sample data with replacement • Expand bootstrapping pool to include “similar” locations (nearest neighbors) • What is limited in time is available in space
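The idea of trading space for time can be sketched as follows: monthly observations from the user’s location and its “similar” locations are pooled, and values are sampled with replacement from the pool. The location identifiers, the nested-dictionary layout and the TOC values are all hypothetical, and neighbors are treated as equally likely in this simplified version.

```python
import numpy as np

def pooled_bootstrap(monthly_obs, neighbor_ids, month, n_draws, rng):
    """Sample with replacement from one month's values pooled over locations."""
    pool = np.concatenate([monthly_obs[loc][month] for loc in neighbor_ids])
    return rng.choice(pool, size=n_draws, replace=True), pool

rng = np.random.default_rng(4)
# Hypothetical June TOC observations (mg/L); each location has only one or two values
monthly_obs = {
    "site_A": {"Jun": [3.1, 3.4]},
    "site_B": {"Jun": [2.8]},
    "site_C": {"Jun": [3.9, 3.6]},
}
draws, pool = pooled_bootstrap(monthly_obs, ["site_A", "site_B", "site_C"],
                               month="Jun", n_draws=100, rng=rng)
print(pool, draws[:10])
```

A single site offers at most two June values here, while the pooled sample offers five.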
Quantify • Find nearest neighbors (locations) in terms of a feature vector that includes variables of interest • Feature vector includes: - Average Annual Concentration - Latitude - Longitude
Quantify • Average annual concentration helps find neighbors that are similar but may not be geographically nearby • Some locations are geographically close but are not good “neighbors” for bootstrapping [Figure: average annual TOC (mg/L) for Ohio surface waters]
Quantify • Sample monthly TOC values based on feature vector • Conditional probability
Quantify Simulation Algorithm 1) User inputs their location and their average annual TOC concentration 2) The ICR database is queried for all eligible entries
Quantify Algorithm- cont. 3) Calculate distances, d, between the x_user vector and the x_ICR vector
Quantify Algorithm- cont. 3) Calculate distances using weighted Mahalanobis equation
Quantify Algorithm- cont. Without the weights (W) and the covariance matrix (S), this reduces to the Euclidean distance
Quantify Algorithm- cont. Because the covariance matrix S is included, the components of the feature vector do not have to be scaled (Davis 1986)
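The distance equation itself did not survive on these slides, so the sketch below uses one common way to write a weighted Mahalanobis distance, d = sqrt((x_ICR - x_user)ᵀ W S⁻¹ W (x_ICR - x_user)) with W a diagonal weight matrix. This form, the function name and the sample feature vectors (average annual TOC, latitude, longitude) are assumptions for illustration and may differ from the authors' equation.

```python
import numpy as np

def weighted_mahalanobis(x_user, x_icr, weights, cov):
    """Distance from x_user to each row of x_icr (assumed form, see lead-in)."""
    diff = np.asarray(x_icr, dtype=float) - np.asarray(x_user, dtype=float)
    W = np.diag(weights)
    M = W @ np.linalg.inv(cov) @ W
    # Quadratic form diff_i^T M diff_i for every row i
    q = np.einsum("ij,jk,ik->i", diff, M, diff)
    return np.sqrt(np.maximum(q, 0.0))

# Hypothetical ICR feature vectors: [average annual TOC (mg/L), latitude, longitude]
x_icr = np.array([[2.1, 40.0, -105.3],
                  [3.4, 39.7, -105.0],
                  [1.8, 40.6, -105.1],
                  [5.2, 38.8, -104.8],
                  [2.9, 41.1, -104.6]])
x_user = np.array([2.5, 40.0, -105.2])

S = np.cov(x_icr, rowvar=False)
d = weighted_mahalanobis(x_user, x_icr, weights=[1.0, 1.0, 1.0], cov=S)

# With unit weights and S replaced by the identity, the result is the Euclidean distance
d_euclid = weighted_mahalanobis(x_user, x_icr, weights=[1.0, 1.0, 1.0], cov=np.eye(3))
print(d, d_euclid)
```

Because S carries the variances of TOC, latitude and longitude, the unscaled features can be mixed directly, which is the point made on the slide.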
Quantify Algorithm- cont. Weights are assigned as
Quantify Weights offer flexibility in neighbor selection (a) (b) (c) (d)
Quantify Algorithm- cont. 4) Obtain observed monthly data for each nearest neighbor
Quantify Algorithm- cont. 5) Bootstrap x_NN using a weight function • Increases likelihood of picking nearer neighbors
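The weight function is not reproduced on the slide. A sketch under the assumption of a 1/rank kernel, which is often used with K-NN bootstraps, is shown below: the j-th nearest neighbor gets a probability proportional to 1/j. The function names and the distances are hypothetical.

```python
import numpy as np

def rank_weights(k):
    """Probabilities proportional to 1/rank for the k nearest neighbors (assumed kernel)."""
    w = 1.0 / np.arange(1, k + 1)
    return w / w.sum()

def weighted_pick(distances, k, n_draws, rng):
    """Resample neighbor indices, favoring the nearer ones."""
    nearest = np.argsort(distances)[:k]  # ordered from nearest to farthest
    return rng.choice(nearest, size=n_draws, replace=True, p=rank_weights(k))

rng = np.random.default_rng(5)
# Hypothetical distances from the user's feature vector to five ICR locations
distances = np.array([0.8, 0.1, 1.5, 0.4, 0.9])
picks = weighted_pick(distances, k=3, n_draws=1000, rng=rng)
print(rank_weights(3))
print(np.bincount(picks, minlength=5))
```

For K = 3 the probabilities are 6/11, 3/11 and 2/11, so the nearest location (index 1 here) is picked about half the time.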
Quantify Apply algorithm to quantify uncertainty in influent TOC concentration City of Boulder’s Betasso Water Treatment Plant (CO) Boulder SWs only, N = 334