K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications)

Balaji Rajagopalan, Somkiat Apipattanavis & Erin Towler
Department of Civil, Environmental and Architectural Engineering, University of Colorado, Boulder, CO
Denver Water, February 2007

“Translation” of Climate Info
• Users are most interested in sectoral outcomes (streamflows, crop yields, risk of disease X)
• Climate Forecast / Projection → Translation → Process Models → Distribution of Outcomes

Translation
[Figure: historical data are translated into synthetic series; each series is run through the process model, giving a frequency distribution of outcomes]

Why Simulation?
• Limited historical data cannot capture the full range of variability
• Selecting a single historical year (or a set of years) from the record with equal chance: unconditional bootstrap, Index Sequential Method
• Need: a tool to generate ‘scenarios’ that capture the historical statistical properties
• Several statistical techniques are available (e.g., time series techniques, Monte Carlo techniques)
• These are cumbersome and restrictive in their assumptions
• Re-sampling techniques are simple and robust
• The unconditional and conditional bootstrap and the K-nearest neighbor (K-NN) bootstrap offer attractive alternatives

Re-sampling Techniques
• Drawing cards from a well-shuffled deck: unconditional bootstrap, Index Sequential Method
 - Select a single historical year (or a set of years) from the record with equal chance
• Drawing cards from a biased deck: conditional bootstrap
 - Select a single historical year (or a set of years) with unequal chance, e.g., selecting only El Niño years
• K-Nearest Neighbor bootstrap: “pattern matching”
 - Select the ‘K’ nearest neighbors (e.g., years) to the current ‘feature’
 - Select one of the K neighbors at random
 - Repeat to produce an ensemble
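The three re-sampling schemes can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the scalar feature, and the El Niño flags are all made up for the example.

```python
import random

def unconditional_bootstrap(years, n, rng):
    """Well-shuffled deck: every year has the same chance."""
    return [rng.choice(years) for _ in range(n)]

def conditional_bootstrap(years, is_eligible, n, rng):
    """Biased deck: draw only from years that meet a condition."""
    pool = [y for y in years if is_eligible(y)]
    return [rng.choice(pool) for _ in range(n)]

def knn_bootstrap(features, current, k, rng):
    """Pattern matching: pick one of the k years closest to `current`."""
    ranked = sorted(features, key=lambda y: abs(features[y] - current))
    return rng.choice(ranked[:k])

rng = random.Random(0)
years = list(range(1931, 2004))
# Illustrative flags only, not an authoritative list of El Nino years.
el_nino = {1958, 1966, 1973, 1983, 1987, 1992, 1998}
# Hypothetical scalar feature per year (e.g., a scaled climate index).
features = {y: (y % 10) / 10.0 for y in years}

sample_u = unconditional_bootstrap(years, 5, rng)
sample_c = conditional_bootstrap(years, lambda y: y in el_nino, 5, rng)
# Repeating the K-NN draw produces an ensemble.
ensemble = [knn_bootstrap(features, 0.35, 5, rng) for _ in range(10)]
```

Here the K-NN draw is uniform over the K neighbors, as on the slide; a rank-based weight could be used instead.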

Examples • Ensemble Weather Generation • Scenario generation • Forecast Argentina - Pampas Region • Water Quality Modeling (Boulder Water Utility)

Two-Step Weather Generator
• Estimate the transition probabilities (wet to dry, etc.) of an order-1 Markov chain from historical data, for each month
• Generate the precipitation state time series using the Markov chain
• Suppose we need the weather simulation for January 5th, and January 4th is a wet day
• Get neighbors from a 7-day window centered on January 4th (7*50 days, e.g., 7 days in each of 50 years)
• Screen days using the precipitation state pair, here (1,0), i.e., a wet day followed by a dry day: these are the “potential neighbors” (days in blue in the figure)
• Calculate the distances between the weather variables of the current day's feature vector and those of the potential neighbors
• Select the K nearest neighbors
• Assign them weights
• Pick a day from the K nearest neighbors using the weight function, say Jan 1st 1953
• The simulated weather for Jan 5th is then the weather of the following day, Jan 2nd 1953
• Repeat
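The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the record layout (`doy`, `wet`, `tmax`, `tmin`), the unscaled Euclidean distance, and the 1/rank weight function are choices made for the example, since the slide does not specify them. The window ignores year-end wraparound, and all data are synthetic.

```python
import math
import random

def estimate_transitions(states):
    """Estimate P(wet tomorrow | today's state) from a 0/1 (dry/wet) series."""
    counts = {0: [0, 0], 1: [0, 0]}  # counts[today][tomorrow]
    for today, tomorrow in zip(states, states[1:]):
        counts[today][tomorrow] += 1
    return {s: (counts[s][1] / sum(counts[s]) if sum(counts[s]) else 0.0)
            for s in (0, 1)}

def generate_states(p_wet, first_state, n_days, rng):
    """Step 1: simulate the precipitation state with the order-1 Markov chain."""
    states = [first_state]
    for _ in range(n_days - 1):
        states.append(1 if rng.random() < p_wet[states[-1]] else 0)
    return states

def knn_next_day(history, doy, state_today, state_next, weather_today, k, rng):
    """Step 2: resample the next day's weather from the K nearest neighbors."""
    # Potential neighbors: inside the 7-day window, matching the state pair.
    candidates = [
        i for i in range(len(history) - 1)
        if abs(history[i]["doy"] - doy) <= 3
        and history[i]["wet"] == state_today
        and history[i + 1]["wet"] == state_next
    ]
    if not candidates:
        raise ValueError("no potential neighbors in the window")

    def dist(i):
        return math.hypot(history[i]["tmax"] - weather_today[0],
                          history[i]["tmin"] - weather_today[1])

    nearest = sorted(candidates, key=dist)[:k]
    # Assumed weight function: the j-th nearest neighbor gets weight 1/j.
    weights = [1.0 / rank for rank in range(1, len(nearest) + 1)]
    picked = rng.choices(nearest, weights=weights, k=1)[0]
    return history[picked + 1]  # the day after the picked neighbor

rng = random.Random(42)
observed = [1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0]
p_wet = estimate_transitions(observed)
states = generate_states(p_wet, 1, 31, rng)

# Synthetic history: the first 8 days of January for 5 years.
history = [
    {"year": y, "doy": d, "wet": 1 if (y + d) % 2 == 0 else 0,
     "tmax": 25.0 + (y - 1950) + 0.5 * d, "tmin": 12.0 + 0.3 * d}
    for y in range(1950, 1955) for d in range(1, 9)
]
nxt = knn_next_day(history, doy=4, state_today=1, state_next=0,
                   weather_today=(27.0, 13.0), k=5, rng=rng)
```

Because the successor of a picked historical day is taken whole, the simulated variables keep their observed day-to-day and cross-variable relationships.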

Single Site Simulation • Pergamino, Argentina • Daily weather variables, 1931-2003 • Precipitation • Max. Temperature • Min. Temperature • 100 simulations, each 73 years long (the length of the record) • Statistics of simulated and historical data are compared

Spell Properties Pergamino, Argentina

wet and dry spell statistics

Moments (wet month - Jan)

Moments (dry month - July)

Conditional K-NN Re-sampling
• Conditioned on the IRI seasonal forecast
• Get the forecast probabilities, e.g., A:N:B = 40:35:25 (above : near : below normal, in percent)
• Divide the historical seasonal totals into 3 tercile categories
• Bootstrap 40, 35 and 25 historical years from the wet, normal and dry categories, respectively
• Apply the two-step weather generator to this sample
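A small sketch of the conditional step, assuming the forecast is given as counts per category. The seasonal totals are synthetic and the function names are hypothetical.

```python
import random

def tercile_categories(seasonal_totals):
    """Rank years by seasonal total and split them into three groups."""
    ranked = sorted(seasonal_totals, key=seasonal_totals.get)
    n = len(ranked)
    return {
        "below": ranked[: n // 3],
        "normal": ranked[n // 3: 2 * n // 3],
        "above": ranked[2 * n // 3:],
    }

def conditional_sample(categories, forecast, rng):
    """Draw years with replacement; `forecast` maps category -> count."""
    sample = []
    for name, count in forecast.items():
        sample += [rng.choice(categories[name]) for _ in range(count)]
    return sample

rng = random.Random(1)
# Synthetic seasonal precipitation totals (mm), one per historical year.
totals = {y: 200 + 37 * ((y * 7) % 11) + 0.1 * (y - 1931)
          for y in range(1931, 2004)}
cats = tercile_categories(totals)
# A:N:B = 40:35:25 becomes 40 wet, 35 normal and 25 dry years.
sampled_years = conditional_sample(
    cats, {"above": 40, "normal": 35, "below": 25}, rng)
```

The two-step weather generator would then be run on the daily data of `sampled_years` instead of the full record.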

Conditional Weather Generation (results)

Multi-site Extension
• The same procedure as for the single site is used, but:
• Calculate the average time series across stations: the “single site virtual weather data”
• Apply the two-step generator to it
• Select the weather at all the locations on the picked day to obtain the multi-site simulation
• Stations in the Pampas region, Argentina: Pergamino, Junin, Nueve de Julio
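The multi-site idea fits in a few lines. In this sketch the precipitation values are made up, and the picked day is set by hand where the two-step generator would normally choose it.

```python
def virtual_site(series_by_station):
    """Average the station series day by day into one 'virtual' series."""
    stations = list(series_by_station)
    n_days = len(series_by_station[stations[0]])
    return [sum(series_by_station[s][d] for s in stations) / len(stations)
            for d in range(n_days)]

def multisite_weather(series_by_station, picked_day):
    """Take the observed values at every station on the picked day."""
    return {s: values[picked_day] for s, values in series_by_station.items()}

# Made-up daily precipitation (mm) for three days at each station.
precip = {
    "Pergamino": [0.0, 12.0, 3.0],
    "Junin": [0.0, 9.0, 0.0],
    "Nueve de Julio": [0.0, 15.0, 0.0],
}
virtual = virtual_site(precip)
# Stand-in for the day index that the two-step generator would pick.
simulated = multisite_weather(precip, picked_day=1)
```

Since all stations take their values from the same historical day, the observed spatial dependence between stations carries over to the simulation.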

Multisite Case wet and dry spell Statistics Pergamino, Argentina

Basic Distribution Properties

Spatial Correlation

Motivation
• Finished water must comply with a given regulation
• Influent Water Quality (TOC, TSUVA, Alkalinity, pH, Turbidity, Temperature) → Water Treatment Plant → Finished Water Quality

Motivation
• Uncertainty helps us understand the risk of non-compliance with a given regulation
• Input distribution → WTP → output distribution (Comply / Non-Compliance): the possibilities!

Data Set Information Collection Rule (ICR) • Monitoring effort mandated by USEPA • Large public water systems • Water quality and operating data • Disinfection by-products (DBPs) and microorganisms to support rulemakings • Most comprehensive view of large drinking water systems to date

Data Set ICR • 18 months (Jul. 1997 – Dec. 1998) • 458 continental US locations

Data Set ICR Database • Water Quality • Influent • Intermediate • Finished • Distribution system • Chemical Additions

Characterize Variability Influent water quality has significant variability due to - climate - geology - water management practices Source Water • TOC • TSUVA • Alkalinity • pH • Turbidity • Temperature • Total Hardness

Variability • Examine influent water quality for surface waters (SWs) • Spatial variability • Temporal variability • Focus on total organic carbon (TOC) • TOC is a precursor in formation of DBPs • Methods extend to other water quality parameters

Variability Spatial Variability • Local polynomial approach • Find best K and P combination • Contour estimates

Variability Spatial Variability: SW Average Annual TOC (mg/L)

Variability Spatial Variability Similar spatial patterns found for • Finished water TOC (lower) • Distribution system DBPs • TTHM (total trihalomethanes) • HAA5 (five haloacetic acids)

Variability Spatial Variability Spatial patterns consistent with previous research for other influent water quality variables • Alkalinity • Bromide

Variability Temporal Variability • City of Boulder’s Betasso Water Treatment Plant (CO) [Figure: influent TOC (mg/L) by month, January through December]

Variability Temporal Variability • Some locations exhibited seasonal trends, others did not • Month to month variations should be considered

Variability • Inherent variability in water quality contributes to uncertainty • How can we quantify uncertainty?

Quantify Uncertainty • Simulate “ensembles” of influent water quality (Monte Carlo) [Figure: ensembles and observed data]

Quantify Traditional Method • Fit a probability density function (pdf) to the data (normal, lognormal, etc.) • Simulate from the pdf [Figure panels: Normal, Lognormal]
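For comparison with the bootstrap approach, the traditional method can be sketched as below. It assumes a lognormal pdf fitted by the mean and standard deviation of the log data; the 18 monthly TOC values are made up.

```python
import math
import random
import statistics

def fit_lognormal(data):
    """Fit by moments of the logs: returns (mu, sigma) of log(data)."""
    logs = [math.log(x) for x in data]
    return statistics.mean(logs), statistics.stdev(logs)

def simulate_lognormal(mu, sigma, n, rng):
    """Draw n values from the fitted lognormal pdf."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

rng = random.Random(7)
# Made-up influent TOC (mg/L): 18 monthly values, as in an ICR-length record.
toc = [2.1, 1.8, 1.9, 2.4, 3.6, 3.9, 2.8, 2.2, 2.0,
       1.9, 1.7, 1.8, 2.0, 1.9, 2.3, 3.1, 3.8, 2.9]
mu, sigma = fit_lognormal(toc)
simulated_toc = simulate_lognormal(mu, sigma, 1000, rng)
```

With only 18 values the fitted parameters are themselves uncertain, and every simulated value inherits the assumed distributional shape.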

Quantify Limitations - What if the pdf is not a good fit? - What if you don’t have enough data to fit the pdf? (e.g., 18 months per location in the ICR database)

Quantify Space-Time Bootstrapping Method • Skip fitting a pdf to the data • Simulate by bootstrapping • Randomly sample data with replacement • Expand bootstrapping pool to include “similar” locations (nearest neighbors) • What is limited in time is available in space

Quantify • Find nearest neighbors (locations) in terms of a feature vector that includes variables of interest • Feature vector includes: - Average Annual Concentration - Latitude - Longitude

Quantify Average annual concentration helps find neighbors that are similar but may not be geographically nearby. [Figure: average annual TOC (mg/L) for Ohio surface waters, with sites that are geographically close but not good “neighbors” for bootstrapping]

Quantify • Sample monthly TOC values based on feature vector • Conditional probability

Quantify Simulation Algorithm 1) User inputs their location and their average annual TOC concentration 2) The ICR database is queried for all eligible entries

Quantify Algorithm- cont. 3) Calculate distances, d, between the x_user vector and the x_ICR vector

Quantify Algorithm- cont. 3) Calculate distances using weighted Mahalanobis equation

Quantify Algorithm- cont. Without the weights (W) and the covariance matrix (S), this reduces to the Euclidean distance
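A sketch of the distance calculation, assuming one common form of the weighted Mahalanobis distance, d = sqrt((W·Δ)ᵀ S⁻¹ (W·Δ)) with Δ = x_user − x_ICR and W applied elementwise. The exact equation on the original slide is not reproduced here, so this form is an assumption, as are the feature values.

```python
import math

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z

def weighted_mahalanobis(x_user, x_icr, cov, weights):
    """Assumed form: sqrt((w*diff)^T S^-1 (w*diff))."""
    diff = [w * (u - v) for w, u, v in zip(weights, x_user, x_icr)]
    z = solve(cov, diff)  # z = S^-1 (w*diff), without forming the inverse
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))

# Feature vector: (average annual TOC in mg/L, latitude, longitude).
x_user = [3.0, 40.0, -105.0]
x_icr = [2.0, 38.0, -103.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unit_w = [1.0, 1.0, 1.0]

# With W = 1 and S = I the result is the Euclidean distance (here 3.0).
d_euclid = weighted_mahalanobis(x_user, x_icr, identity, unit_w)

# A made-up diagonal covariance rescales each component by its variance.
cov = [[0.25, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
d_scaled = weighted_mahalanobis(x_user, x_icr, cov, unit_w)
```

In the second call the TOC difference dominates, because its variance is small relative to the variances of latitude and longitude.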

Quantify Algorithm- cont. Because the covariance matrix S is included, the components of the feature vector do not have to be scaled (Davis 1986)

Quantify Algorithm- cont. Weights are assigned as

Quantify Weights offer flexibility in neighbor selection [Figure: panels (a) to (d)]

Quantify Algorithm- cont. 4) Obtain observed monthly data for each nearest neighbor

Quantify Algorithm- cont. 5) Bootstrap x_NN using a weight function that increases the likelihood of picking nearer neighbors
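Steps 4 and 5 can be sketched as follows. The weight function here, with the j-th nearest neighbor weighted in proportion to 1/j, is a kernel commonly used with K-NN bootstraps and is an assumption, not necessarily the one on the slide. The neighbor data are made up.

```python
import random

def rank_weights(k):
    """Assumed kernel: j-th nearest neighbor has probability ~ 1/j."""
    raw = [1.0 / j for j in range(1, k + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def simulate_month(neighbors_monthly, month, n, rng):
    """Bootstrap n values for `month`; neighbors are ordered nearest first."""
    weights = rank_weights(len(neighbors_monthly))
    ensemble = []
    for _ in range(n):
        # Pick a neighbor location, favoring nearer ones ...
        neighbor = rng.choices(neighbors_monthly, weights=weights, k=1)[0]
        # ... then pick one of its observed values for that month.
        ensemble.append(rng.choice(neighbor[month]))
    return ensemble

rng = random.Random(3)
# Made-up June TOC observations (mg/L) at three neighbors, nearest first.
neighbors = [
    {"Jun": [3.4, 3.9]},
    {"Jun": [3.1, 4.2]},
    {"Jun": [2.8]},
]
w = rank_weights(3)
june_ensemble = simulate_month(neighbors, "Jun", 500, rng)
```

For three neighbors the probabilities are 6/11, 3/11 and 2/11, so the nearest location contributes just over half of the draws on average.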

Quantify Apply algorithm to quantify uncertainty in influent TOC concentration City of Boulder’s Betasso Water Treatment Plant (CO) Boulder SWs only, N = 334
