Data-driven Kriging models based on FANOVA decomposition
O. Roustant, Ecole des Mines de St-Etienne, www.emse.fr/~roustant
Joint work with T. Muehlenstädt¹, L. Carraro² and S. Kuhnt¹
¹ University of Dortmund - ² Telecom St-Etienne
Cliques of FANOVA graph and block additive decomposition
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
f(x) = f1,2,3(x1,x2,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Z(x) = Z1,2,3(x1,x2,x3) + Z4,5,6(x4,x5,x6) + Z3,4(x3,x4)
k(h) = k1,2,3(h1,h2,h3) + k4,5,6(h4,h5,h6) + k3,4(h3,h4)
Cliques: {1,2,3}, {4,5,6}, {3,4}
This talk presents, with many pictures, the main ideas of the corresponding paper; we refer to it for details.
Computer experiments
[Diagram: a computer code f maps inputs x1, x2, …, xd to an output y]
A keyword associated with the analysis of time-consuming computer codes
Metamodeling and Kriging
• Metamodeling: construct a cheap-to-evaluate model of the simulator (which itself models the reality)
• Kriging: basically an interpolation method based on Gaussian processes
Kriging model (definition)
Y(x) = b0 + b1 g1(x) + … + bk gk(x) + Z(x)
linear trend (deterministic) + centered stationary Gaussian process (stochastic)
[Figure: some conditional simulations]
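To make the definition concrete, here is a minimal base-R sketch, not from the talk, that simulates one unconditional path of Y(x) = b0 + b1 x + Z(x) in one dimension, with a Gaussian kernel for Z (all names and parameter values are illustrative):

```r
# Minimal sketch: one path of Y(x) = b0 + b1*x + Z(x) on [0,1], where Z is a
# centered stationary Gaussian process with a Gaussian kernel.
set.seed(1)
x  <- seq(0, 1, length.out = 200)
b0 <- 1; b1 <- 2                                    # linear trend coefficients
k  <- function(h, sigma2 = 1, theta = 0.2) sigma2 * exp(-0.5 * (h / theta)^2)
K  <- outer(x, x, function(u, v) k(u - v))          # covariance matrix of Z
L  <- chol(K + 1e-8 * diag(length(x)))              # jitter for numerical stability
Z  <- as.vector(t(L) %*% rnorm(length(x)))          # one path of Z
Y  <- b0 + b1 * x + Z                               # one unconditional path of Y
```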
Kriging model (prediction)
[Figure: conditional mean and 95% confidence interval, with some conditional simulations]
Kriging model (kernel)
• The Kriging model is a kernel-based method: K(x, x') = cov(Z(x), Z(x')), which makes it FLEXIBLE (see below)
• When Z is stationary, K(x, x') depends only on h = x - x'; we denote k(h) = K(x, x')
Kriging model (kernel)
• "Making new kernels from old" (Rasmussen and Williams, 2006):
  • K1 + K2
  • cK, with c > 0
  • K1 K2
  • …
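A quick sketch of these combination rules as R closures (the helper names are hypothetical, not from any package); each rule returns a new valid kernel:

```r
# "Making new kernels from old": each combinator returns a new kernel function.
k_sum   <- function(k1, k2) function(x, y) k1(x, y) + k2(x, y)   # K1 + K2
k_scale <- function(k1, c)  function(x, y) c * k1(x, y)          # c*K, with c > 0
k_prod  <- function(k1, k2) function(x, y) k1(x, y) * k2(x, y)   # K1*K2
```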
Kriging model (common choice)
• Tensor-product structure: k(h) = k1(h1) k2(h2) … kd(hd), with hi = xi - xi', and ki Gaussian, Matérn 5/2, …
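For instance, a tensor-product Gaussian kernel can be sketched as follows (theta is the vector of range parameters; names are illustrative):

```r
# Sketch of a tensor-product kernel: one 1D Gaussian kernel per coordinate,
# multiplied together (theta = vector of range parameters).
k_gauss  <- function(h, theta) exp(-0.5 * (h / theta)^2)
k_tensor <- function(x, xp, theta) prod(k_gauss(x - xp, theta))
```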
The main idea on an example
• Ishigami function, defined on D = [-π, π]^3, with A = 7, B = 0.1:
f(x) = sin(x1) + A sin^2(x2) + B x3^4 sin(x1)
• Its structure suggests a block additive decomposition:
f(x) = f2(x2) + f1,3(x1,x3)
Z(x) = Z2(x2) + Z1,3(x1,x3)
k(h) = k2(h2) + k1,3(h1,h3)
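As a sketch (the parameter names sigma2_* and theta_* are hypothetical), the Ishigami function and the corresponding block additive Gaussian kernel could be written:

```r
# The Ishigami function on [-pi, pi]^3.
ishigami <- function(x, A = 7, B = 0.1)
  sin(x[1]) + A * sin(x[2])^2 + B * x[3]^4 * sin(x[1])

# Block additive kernel k(h) = k2(h2) + k1,3(h1, h3), with Gaussian blocks.
k_add <- function(h, sigma2_2, theta_2, sigma2_13, theta_13)
  sigma2_2  * exp(-0.5 * (h[2] / theta_2)^2) +
  sigma2_13 * exp(-0.5 * sum((h[c(1, 3)] / theta_13)^2))
```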
The main idea on an example
• Comparison of the two Kriging models (standard tensor-product kernel vs. block additive kernel)
• Training set: 100 points from a maximin Latin hypercube
• Test set: 1000 additional points from a uniform distribution
The schema to be generalized
[Diagram: from the function f, derive the additive kernel k = k2 + k1,3]
Outline
Introduction [How to choose a Kriging model for the Ishigami function]
1. From FANOVA graphs to block additive kernels [Generalizes the introduction]
2. Estimation methodologies [With a new sensitivity index]
3. Applications
Some comments
FANOVA decomposition (Efron and Stein, 1981)
• Assume that X1, …, Xd are independent random variables. Let f be a function defined on D1 x … x Dd and dν = dν1 … dνd an integration measure. Then:
f(X) = μ0 + Σ μi(Xi) + Σ μi,j(Xi,Xj) + Σ μi,j,k(Xi,Xj,Xk) + …
where all terms are centered and orthogonal to each other. They are given by:
μ0 := E(f(X)), μi(Xi) := E(f(X)|Xi) - μ0,
μi,j(Xi,Xj) := E(f(X)|Xi,Xj) - μi(Xi) - μj(Xj) - μ0,
and so on.
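A crude Monte Carlo sketch of the first terms for the Ishigami function, reusing the ishigami function defined above (the sample size is arbitrary):

```r
# Monte Carlo estimates of mu0 and of the main effect mu1(x1) for the
# Ishigami function, with uniform measure on [-pi, pi]^3.
set.seed(1)
n   <- 1e4
X   <- matrix(runif(3 * n, -pi, pi), ncol = 3)
y   <- apply(X, 1, ishigami)
mu0 <- mean(y)                                    # E[f(X)]
mu1 <- function(x1)                               # E[f(X) | X1 = x1] - mu0
  mean(apply(cbind(x1, X[, 2:3]), 1, ishigami)) - mu0
```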
FANOVA decomposition
Example: Ishigami function, with uniform measure on D = [-π, π]^3. With a = 1/2 and b = π^4/5 (so that all terms are centered), we have:
f(x) = sin(x1) + A sin^2(x2) + B x3^4 sin(x1)
     = aA + sin(x1)(1 + bB) + A(sin^2(x2) - a) + B(x3^4 - b) sin(x1)
[Figure: μ0, the main effects μ1(x1) and μ2(x2), and the 2nd-order interaction μ1,3(x1,x3)]
FANOVA decomposition
Example (continued)
• Some terms can vanish due to averaging, such as μ3, or μ1 if B = -1/b. This depends on the integration measure, and only happens when there exist terms of higher order.
• On the other hand, and for the same reason, we always have μ1,2 = μ2,3 = 0 and, under mild conditions, μ1,3 ≠ 0.
FANOVA decomposition
• The name "FANOVA" comes from the relation on variances implied by orthogonality:
var(f(X)) = Σ var(μi(Xi)) + Σ var(μi,j(Xi,Xj)) + …
which measures the importance of each term.
• var(μJ(XJ)) / var(f(X)) is often called a Sobol index
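Continuing the sketch above, the first-order Sobol index of x1 can be approximated as var(μ1(X1)) / var(f(X)); this nested Monte Carlo scheme is slow but makes the definition explicit:

```r
# Crude estimate of the first-order Sobol index S1 = var(mu1(X1)) / var(f(X)),
# reusing mu1, y and the sample X from the previous sketch.
x1_sample <- runif(200, -pi, pi)
S1 <- var(sapply(x1_sample, mu1)) / var(y)
```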
FANOVA graph
• Vertices: the variables
• Edges: drawn if there is at least one interaction (at any order) between the two variables
• Edge width: proportional to the interaction variances
Here (example above): μ1,2 = μ2,3 = 0
The graph does not depend on the integration measure (under mild conditions)
FANOVA graph and cliques
• A complete subgraph: all edges exist
• A clique: a maximal complete subgraph
Cliques here: {1,3} and {2}
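In R, the maximal cliques of a graph can be extracted with the igraph package, e.g. for the graph above with the single edge {1,3} and isolated vertex 2:

```r
# Maximal cliques of the FANOVA graph with edge {1,3} and isolated vertex 2.
library(igraph)
g <- graph_from_edgelist(matrix(c(1, 3), ncol = 2), directed = FALSE)
max_cliques(g)   # returns the cliques {2} and {1,3}
```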
Cliques of FANOVA graph and block additive decomposition
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
f(x) = f1,2,3(x1,x2,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Cliques: {1,2,3}, {4,5,6}, {3,4}
Why cliques?
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
Taking the incomplete subgraph {1,2,3,4,5,6} leads to the coarse model form f(x) = f1,…,6(x1,…,x6) !!!
Incomplete subgraphs → rough model forms
Why cliques?
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
Taking the non-maximal complete subgraphs {1,2}, {2,3}, {1,3}, {3,4}, {4,5,6} leads to:
f(x) = f1,2(x1,x2) + f2,3(x2,x3) + f1,3(x1,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Non-maximality → wrong model forms
Cliques of FANOVA graph and Kriging models
Cliques: {1,2,3}, {4,5,6}, {3,4}
f(x) = f1,2,3(x1,x2,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Z(x) = Z1,2,3(x1,x2,x3) + Z4,5,6(x4,x5,x6) + Z3,4(x3,x4)
k(h) = k1,2,3(h1,h2,h3) + k4,5,6(h4,h5,h6) + k3,4(h3,h4)
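A sketch of the resulting block additive kernel, with one Gaussian block per clique (the parameter names are hypothetical):

```r
# Block additive kernel for cliques {1,2,3}, {4,5,6} and {3,4}: one Gaussian
# block per clique, each with its own variance and range parameters.
k_clique <- function(h, idx, sigma2, theta)
  sigma2 * exp(-0.5 * sum((h[idx] / theta)^2))
k_block <- function(h, par)   # par: list of per-clique variances and ranges
  k_clique(h, 1:3, par$s123, par$t123) +
  k_clique(h, 4:6, par$s456, par$t456) +
  k_clique(h, 3:4, par$s34,  par$t34)
```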
Graph estimation
• Challenge: estimate all interactions (at any order) involving two given variables
• Two issues:
  • The computer code is time-consuming
  • The usual Sobol indices would require a huge number of combinations
Graph estimation
• Solutions (a sketch follows):
• Replace the computer code by a metamodel, for instance a Kriging model with a standard kernel
• Fix x3, …, xd, and consider the 2nd-order interaction of the 2-dimensional function (x1, x2) ↦ f(x):
f(x) = f1(x_{-2}) + f2(x_{-1}) + f12(x1, x2; x_{-{1,2}})
Denote by D12(x3,…,xd) the unnormalized Sobol index of this interaction, and define D12 = E(D12(X3,…,Xd)).
Then D12 > 0 if and only if (1,2) is an edge of the FANOVA graph.
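A minimal sketch of this scheme (grid sizes, bounds and names are assumptions; in practice f would be the Kriging metamodel rather than the expensive code itself):

```r
# Sketch: estimate D12 = E[D12(X3,...,Xd)] by averaging, over M random fixings
# of (x3,...,xd), the interaction variance of the 2D slice
# (x1, x2) -> f(x1, x2, xfix), computed on an n x n grid over [lower, upper]^d.
D12_hat <- function(f, d, lower = -1, upper = 1, M = 50, n = 21) {
  grid <- seq(lower, upper, length.out = n)
  mean(replicate(M, {
    xfix <- runif(d - 2, lower, upper)
    G <- outer(grid, grid,
               Vectorize(function(u1, u2) f(c(u1, u2, xfix))))
    # subtract both main effects and add back the grand mean:
    # what remains is the 2nd-order interaction part of the slice
    R <- G - outer(rowMeans(G), colMeans(G), "+") + mean(G)
    mean(R^2)                                  # D12(xfix)
  }))
}
```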
Graph estimation
• Comments:
• The new sensitivity index is computed by averaging 2nd-order Sobol indices, and is thus numerically tractable
• In practice, "D12 > 0" is replaced by "D12 > δ"; different thresholds δ give different FANOVA graphs
Kriging model estimation
• Assume that there are L cliques, of sizes d1, …, dL. The total number of parameters to be estimated is:
ntrend + (d1 + 1) + … + (dL + 1)
(trend, "range" and variance parameters)
• MLE is used; three numerical procedures were tested
Kriging model estimation
• Isotropic kernels are useful in high dimensions
• Example: suppose that C1 = {1,2,3}, C2 = {4,5,6}, C3 = {3,4}, and that x7, …, x16 have a smaller influence
1st solution: C4 = {x7}, …, C13 = {x16}
N = ntrend + 4 + 4 + 3 + 10×2 = ntrend + 31
2nd solution: C4 = {x7, …, x16}, with an isotropic kernel (as sketched below)
N = ntrend + 4 + 4 + 3 + 2 = ntrend + 13
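A sketch of the isotropic alternative for a clique C: a single shared range parameter instead of one per variable (names are illustrative):

```r
# Isotropic Gaussian kernel over a clique given by indices idx: one range
# parameter theta shared by all its variables, hence 2 parameters in total.
k_iso <- function(h, idx, sigma2, theta)
  sigma2 * exp(-0.5 * sum(h[idx]^2) / theta^2)
```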
A 6D analytical case
• f(x) = cos(-0.8 - 1.1 x1 + 1.1 x2 + x3) + sin(-0.5 + 0.9 x4 + x5 - 1.1 x6) + (0.5 + 0.35 x3 - 0.6 x4)^2
• Domain: [-1, 1]^6
• Integration measure: uniform
• Training set: 100 points from a maximin LHD
• Test set: 1000 points drawn from a uniform distribution
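Transcribed in R for reference (the name f6 is ours); by construction its expected cliques are {1,2,3}, {4,5,6} and {3,4}, as in the earlier example:

```r
# The 6D test function: the cosine term links {1,2,3}, the sine term links
# {4,5,6}, and the squared term creates the interaction {3,4}.
f6 <- function(x)
  cos(-0.8 - 1.1 * x[1] + 1.1 * x[2] + x[3]) +
  sin(-0.5 + 0.9 * x[4] + x[5] - 1.1 * x[6]) +
  (0.5 + 0.35 * x[3] - 0.6 * x[4])^2
```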
A 6D analytical case
[Figure: estimated FANOVA graph, alongside a usual sensitivity analysis (from the R package sensitivity)]
A 16D analytical case
Consider the same function, but assume that it lives in a 16D space (with 10 additional inactive variables).
Including all the inactive variables in one clique improves the prediction.
A 6D case study
• Piston slap data set (Fang et al., 2006)
• Unwanted engine noise, simulated using a finite element method
• Training set: 100 points
• Test set: 12 points
• Leave-one-out is also considered
A 6D case study
Leave-one-out RMSE: 0.0864 (standard Kriging), 0.0371 (modified Kriging)
Some comments
• Main strengths:
  • Adapting the kernel to the data in a flexible manner
  • A substantial improvement in prediction may be expected, depending on the function's complexity
• Some drawbacks:
  • Dependence on the initial metamodel
  • Sometimes a large number of parameters to be estimated, which may decrease the prediction power