Quantifying Monte Carlo Uncertainty in the Ensemble Kalman Filter. Kristian Thulin* (CIPR), Geir Nævdal (IRIS), Hans Julius Skaug (UiB, Dpt. Math.) and Sigurd Ivar Aanonsen (CIPR). EnKF Workshop, 18-20 June 2008, Park Hotel, Voss, Norway.
Example: Synthetic 2D case • Reservoir model is simplified to a parabolic equation for single-phase flow • p – dynamic variable • κ – static unknown parameter • g – five-spot sink/source term
Example: Synthetic 2D case • Square triangulated grid: 665 nodes (p), 1248 triangles (κ) • Red crosses – sources • Black crosses – sinks
Ensemble Kalman filter (EnKF) • Estimates the posterior probability distribution (PDF/CDF) • Statistics are estimated from a finite ensemble of realizations • Solution depends on the sampling of the initial ensemble (and of the data perturbations)
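The dependence on the initial sampling can be illustrated with a toy problem. The sketch below is not the reservoir model from the slides: it assumes a scalar state with a standard normal prior, a single direct observation, and a perturbed-observation update. The names `enkf_update_scalar` and `run_once`, and all numerical values, are hypothetical choices made for illustration.

```python
import random
import statistics


def enkf_update_scalar(prior, obs, obs_var, rng):
    """One perturbed-observation EnKF update for a scalar state observed directly."""
    c = statistics.variance(prior)      # sample variance of the forecast ensemble
    gain = c / (c + obs_var)            # scalar Kalman gain built from ensemble statistics
    sd = obs_var ** 0.5
    # Each member assimilates its own randomly perturbed copy of the observation.
    return [x + gain * (obs + rng.gauss(0.0, sd) - x) for x in prior]


def run_once(seed, n_members=100):
    """Same prior distribution every time; only the random seed changes."""
    rng = random.Random(seed)
    prior = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
    post = enkf_update_scalar(prior, obs=1.0, obs_var=0.25, rng=rng)
    return statistics.mean(post), statistics.variance(post)


# Ten repeated runs: the posterior statistics differ from seed to seed.
results = [run_once(seed) for seed in range(10)]
for seed, (mean, var) in enumerate(results):
    print(f"seed {seed}: posterior mean {mean:.3f}, variance {var:.3f}")
```

Because the gain is computed from the ensemble itself, every updated member depends on all the others, so the spread between runs is not reduced by simply inspecting one run more closely.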
Motivation: Inconsistent posterior CDFs • Have noticed that different initial ensembles resulted in visually very different CDFs • 10 ensembles with 100 members each (figure panels: Initial, Updated)
Motivation • For a non-linear problem the posterior CDF is only estimated • Would like to sample from the same distribution in each repeated EnKF run • Using a different initial ensemble (e.g. same prior, different seed)
Kolmogorov-Smirnov test • K-S test for two samples • Null-hypothesis: • The two samples are from the same underlying distribution • Test the runs pair-wise • Bonferroni correction • Null-hypotheses: all samples are from the same underlying distribution
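The pair-wise test with a Bonferroni correction can be sketched as follows. This is a minimal standard-library version, assuming the plain asymptotic Kolmogorov series for the p-value (an approximation that is rough for small samples); the slides do not say which implementation was used. The function names and the significance level `alpha=0.05` are assumptions made for illustration.

```python
import math
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:   # step past all values <= x (handles ties)
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n1, n2):
    """Asymptotic p-value from the Kolmogorov series (approximate)."""
    lam = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if lam < 0.2:                         # the series is numerically 1 in this range
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def same_distribution(runs, alpha=0.05):
    """Test all pairs of runs; Bonferroni divides alpha by the number of pairs."""
    pairs = list(combinations(range(len(runs)), 2))
    threshold = alpha / len(pairs)
    pvals = {(i, j): ks_pvalue(ks_statistic(runs[i], runs[j]),
                               len(runs[i]), len(runs[j]))
             for i, j in pairs}
    # The global null hypothesis is kept only if no pair falls below the threshold.
    return all(p >= threshold for p in pvals.values()), pvals
```

With 10 runs there are 45 pairs, so each pair-wise test is compared against alpha/45 rather than alpha.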
Kolmogorov-Smirnov test (static variable) • K-S test for all variables • After second assimilation step • Points with p-value below the critical value are blank
Kolmogorov-Smirnov test • This confirms what we have seen visually • Ensemble members become positively correlated during the update • Lorentzen et al. (2005 at SPE ATCE) pointed out that forecasts from a PUNQ-S3 study did not pass the K-S test for 100 members
Motivation – ensemble size • Typically 100 ensemble members are used • Good history matches and mean estimates have been reported using 30-40 ensemble members • Good estimates of the uncertainty require a much larger ensemble size
Proposed methodology: Multiple runs • Propose multiple EnKF runs (m), each with fewer ensemble members (n) • Keeping the total (n x m) fixed • Gain more independent information • Members from different runs will be independent samples from the distribution we are seeking • Can construct a confidence interval on the estimated CDF
Confidence interval • Pink background: span of the CDFs from the m runs • Solid blue line: mean over the runs • Dashed blue lines: confidence interval on the estimated CDF • Black line: "infinite" ensemble run (10,000 members)
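One way such a point-wise interval could be computed from m independent runs is sketched below. It assumes that the empirical CDF values of the runs at each x are treated as m independent estimates and that a two-sided 95% Student-t interval is placed around their mean; the slides do not spell out the exact construction. The names `ecdf`, `cdf_band` and the small table `T95` are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (m - 1).
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def ecdf(sample, x):
    """Empirical CDF of one run evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)


def cdf_band(runs, grid):
    """Return (lower, mean, upper) for each grid point, from m runs (2 <= m <= 10)."""
    m = len(runs)
    t = T95[m - 1]
    band = []
    for x in grid:
        values = [ecdf(run, x) for run in runs]        # one CDF estimate per run
        mean = statistics.mean(values)                 # final estimate: mean over runs
        half = t * statistics.stdev(values) / math.sqrt(m)
        band.append((max(0.0, mean - half), mean, min(1.0, mean + half)))
    return band
```

The critical value grows quickly as m shrinks (12.706 for two runs versus 2.262 for ten), which is one way to see why very few runs give a wide interval.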
Optimal combination of ensemble size (n) and number of EnKF runs (m) • Keeping their product (n*m) fixed, what is the optimum combination of ensemble size and number of runs? • Too few members will give a biased result (no history match) • Too few runs will give a very large uncertainty in the estimated CDF
Mean Square Error • Find the optimum combination by minimizing the Mean Square Error: MSE = Variance + Bias² • Variance and bias are calculated for each value on the x-axis, and then integrated • A t-distribution standard error is used for the variance to account for the large uncertainty with very few runs
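The integrated MSE can be sketched as below. This is a simplified illustration: it uses the ordinary sample variance of the mean over runs rather than the t-distribution standard error mentioned on the slide, takes the "infinite" ensemble CDF as the reference for the bias, and integrates with the trapezoidal rule. The function name and argument layout are assumptions.

```python
def integrated_mse(run_cdfs, reference_cdf, grid):
    """Integrate Variance + Bias^2 of the run-averaged CDF over the x-axis.

    run_cdfs:      list of m lists, each holding one run's CDF values on `grid`
    reference_cdf: CDF values of the reference ("infinite") run on `grid`
    """
    m = len(run_cdfs)
    mse = []
    for k in range(len(grid)):
        values = [cdf[k] for cdf in run_cdfs]
        mean = sum(values) / m
        # Variance of the mean over runs (simplification: no t-distribution factor).
        var_of_mean = sum((v - mean) ** 2 for v in values) / (m - 1) / m
        bias = mean - reference_cdf[k]
        mse.append(var_of_mean + bias ** 2)
    # Trapezoidal rule over the grid gives a single number per (m, n) combination.
    return sum(0.5 * (mse[k] + mse[k + 1]) * (grid[k + 1] - grid[k])
               for k in range(len(grid) - 1))
```

Evaluating this for each candidate (m, n) with n x m fixed gives one number per combination, and the smallest value marks the preferred split.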
Example • Focus on one selected point in space (0,0) • One dynamic and one static parameter • Look at the posterior CDF after two data assimilation steps at this point
Example 1 – 1000 members • Given a total of 1000 ensemble members • Different combinations of m and n • Mean of the m runs gives the final estimate • Compare with an “infinite” ensemble run
Example (static variable) • Panels (runs x members): 200 x 5, 50 x 20, 10 x 100, 3 x 333 • Black line: infinite run • Blue lines: 95% confidence interval
Example - Bias • Means over a large number of runs • The trend of the bias is clear • Runs with 5 and 10 members do not give satisfactory results • Similar for the dynamic variable
Example • Similar behaviour for both variables • Too few ensemble members give a biased estimate • Too few runs give a very large variance • Want to optimize n (or m) for a given n x m
Example (MSE, static variable) • Calculate the bias and variance along the x-axis • Integrate over x to obtain a single number for each combination of m and n • Want to minimize the integrated MSE
Example • The integrated MSE (IMSE) has a clear minimum • The minimum may be an interval or a point • As long as n > 50-100 and m > 3-4 we have a satisfactory combination in this example • Difficult to conclude on a specific minimum without more information
Example 2 – 200 members • Given a total of 200 ensemble members • Different combinations of m and n
Example (dynamic variable) • Panels (runs x members): 40 x 5, 20 x 10, 5 x 40, 3 x 66 • Black line: infinite run • Blue lines: 95% confidence interval
Example (MSE, static variable) • Calculate the bias and variance along the x-axis • Integrate over x to obtain a single number for each combination of m and n • Want to minimize the integrated MSE
Example • Similar results to the 1000-member case • As long as n > approx. 20 and m > approx. 3 we have a satisfactory result for this example
MSE plots (figure panels: 200 members and 1000 members; dynamic and static variables)
Summary and Conclusions • Observed inconsistencies in the estimated posterior CDFs when running with different seeds in the initial sampling • Used the Kolmogorov-Smirnov test to verify that most updated variables become positively correlated
Summary and Conclusions • Suggest running multiple EnKF runs, each with fewer ensemble members • Obtain more independent information • Gain information about the size of the Monte Carlo error in the final estimate • Allows for a point-wise confidence interval on the CDF • Present a methodology for finding an optimal combination of the number of runs (m) and ensemble members (n) • Keeping n*m (resources) fixed
Summary and Conclusions • Methodology tested with a simplified single-phase flow example in 2D • The final estimated CDF from multiple runs gives equally good or better results than one single run • For our examples • Found an interval minimum • Difficult to conclude any further on a specific minimum • More than 3-4 runs are required (in general?) • MSE for 1000 members was ≈ 1/10 of the MSE for 200 members
Summary and Conclusions • It might be better to perform multiple runs with fewer ensemble members • Should be able to perform at least around 4 runs with a "large enough" ensemble size to have an effect • General guidelines for the optimum can be made from experience with synthetic problems