Quantifying Uncertainty in Inverse Problems

Quantifying Uncertainty in Inverse Problems Workshop on Statistical Methods for Inverse Problems Institute for Pure and Applied MathematicsUniversity of California, Los Angeles5-6 November 2003 P.B. Stark Department of Statistics University of California Berkeley, CA 94720-3860 www.stat.berkeley.edu/~stark

Abstract Some qualitative and quantitative statistical measures of uncertainty apply to inverse problems. Decision theory provides a framework for thinking about inverse problems. The intrinsic uncertainty in an inverse problem usually is not equal to the uncertainty of a “solution” to the inverse problem constructed by any particular technique. The intrinsic uncertainty depends crucially on the prior constraints on the unknown (including prior probability distributions in the case of Bayesian analyses), on the forward operator, on the distribution of the observational errors, and on the kinds of properties of the unknown one wishes to estimate. Some aspects of uncertainty in inverse problems can be understood geometrically.

My Assumptions about You • All of you know much more math than I do. • I know a little more statistics than some of you do. • You haven’t heard me talk about this before. • You are more interested in theory than numerical algorithms or specific applications.

Outline • Inverse Problems as Statistics • Ingredients; Models • Forward and Inverse Problems—applied perspective • Statistical point of view • Some connections • Notation; linear problems; illustration • Example: geomagnetism from satellite observations • Qualitative uncertainty: Identifiability and uniqueness • Sketch of identifiablity and extremal modeling • Backus-Gilbert theory & extensions

Outline, contd. • Quantitative uncertainty: Decision Theory • Decision rules and estimators • Comparing decision rules: Loss and Risk • Example: Shrinkage estimators and MSE Risk • Strategies. Bayes/Minimax duality • Mean distance error and bias • Illustration: bounded normal mean • Illustration: Regularization • Illustration: Minimax estimation of linear functionals • Example: Gauss coefficients of the magnetic field • Distinguishing models: metrics and consistency

Inverse Problems as Statistics • Measurable space X of possible data. • Set  of descriptions of the world—models. • Family P = {Pq : q2Q} of probability distributions on X indexed by models . • Forward operatorqaPq maps model  into a probability measure on X. X –valued data X are a sample from Pq. Pq is everything: randomness in the “truth,” measurement error, systematic error, censoring, etc.

Models •  usually has special structure. •  could be a convex subset of a separable Banach space T. (geomag, seismo, grav, MT, …) • Physical significance of  generally gives qaPq reasonable analytic properties, e.g., continuity.

Forward Problems in Physical Science Often thought of as a composition of steps: • transform idealized model q into perfect, noise-free, infinite-dimensional data (“approximate physics”) • keep a finite number of the perfect data, because can only measure, record, and compute with finite lists • possibly corrupt the list with measurement error. Equivalent to single-step procedure with corruption on par with physics, and mapping incorporating the censoring.

Inverse Problems Observe data X drawn from Pθ for some unknown . (Assume  contains at least two points; otherwise, data superfluous.) Use X and the knowledge that  to learn about ; for example, to estimate a parameter g() (the value g(θ) at θ of a continuous G-valued function g defined on ).

Example: Geomagnetism

Geomagetic model parametrization

Geomagnetic inverse problem

Inverse Problems in Physical Science Inverse problems in science often “solved” using applied math methods for Ill-posed problems (e.g., Tichonov regularization, analytic inversions) Those methods are designed to answer different questions; can behave poorly with data (e.g., bad bias & variance) Inference  construction: Statistical viewpoint may be more appropriate for interpreting real data with stochastic errors.

Elements of the Statistical View Distinguish between characteristics of the problem, and characteristics of methods used to draw inferences. One fundamental qualitative property of a parameter: g is identifiable if for all η, z Θ, {g(η)  g(z)}  {PhPz}. In most inverse problems, g(θ) = θ not identifiable, and few linear functionals of θ are identifiable.

Deterministic and Statistical Connections Identifiability—distinct parameter values yield distinct probability distributions for the observables— is similar to uniqueness—forward operator maps at most one model into the observed data. Consistency—parameter can be estimated with arbitrary accuracy as the number of data grows— is related to stability of a recovery algorithm—small changes in the data produce small changes in the recovered model.  quantitative connections too.

More Notation Let T be a separable Banach space, T* its normed dual. Write the pairing between T and T* <•, •> : T*xTR.

Linear Forward Problems A forward problem is linear if • Θ is a subset of a separable Banach space T • X= Rn, X = (Xj)j=1n • For some fixed sequence (κj)j=1n of elements of T*, Xj = hkj, qi + ej, q2Q, where e = (ej)j=1n is a vector of stochastic errors whose distribution does not depend on θ.

Linear Forward Problems, contd. • Linear functionals {κj} are the “representers” • Distribution Pθ is the probability distribution of X. • Typically, dim(Θ) = ; at least, n < dim(Θ), so estimating θ is an underdetermined problem. Define K : TRn q(<κj, θ>)j=1n . Abbreviate forward problem by X = Kθ + ε, θΘ.

Linear Inverse Problems Use X = Kθ + ε, and the constraint θΘ to estimate or draw inferences about g(θ). Probability distribution of X depends on θ only through Kθ, so if there are two points θ1, θ2Θ such that Kθ1 = Kθ2 but g(θ1)g(θ2), then g(θ) is not identifiable.

Example: Sampling w/ systematic & random error • Observe • Xj = f(tj) + rj + ej, j = 1, 2, …, n, • f 2C, a set of smooth of functions on [0, 1] • tj2 [0, 1] • |rj| 1, j=1, 2, … , n • jiid N(0, 1) • Take Q = C£ [-1, 1]n, X = Rn, and q = (f, r1, …, rn). • Then Pq has density • (2p)-n/2 exp{-åj=1n (xj – f(tj)-rj)2}.

Sketch: Identifiability Pz = Ph Pq  X = Rn K K X = K    g() g(h) g(z) R {Pz = Ph} ; {h = z}, so q not identifiable g cannot be estimated with bounded bias {Pz = Ph} ; {g(h) = g(z)}, so g not identifiable

Backus-Gilbert Theory Let Q = T be a Hilbert space. Let g 2T = T* be a linear parameter. Let {kj}j=1nµT*. Then: g(q) is identifiable iff g = L¢ K for some 1 £ n matrix L. If also E[e] = 0, then L¢ X is unbiased for g. If also e has covariance matrix S = E[eeT], then the MSE of L¢ X is L¢S¢LT.

Sketch: Backus-Gilbert Pq  X = Rn K X = K  L¢ X R g() = L¢ Kq

Backus-Gilbert++: Necessary conditions Let g be an identifiable real-valued parameter. Suppose θ0Θ, a symmetric convex set Ť T, cR, and ğ: ŤR such that: • θ0 + ŤΘ • For t Ť, g(θ0 + t) = c + ğ(t), and ğ(-t) = -ğ(t) • ğ(a1t1 + a2t2) = a1ğ(t1) + a2ğ(t2), t1, t2 Ť, a1, a2 0, a1+a2 = 1, and • supt Ť | ğ(t)| <. Then  1×n matrix Λ s.t. the restriction of ğ to Ť is the restriction of Λ.K to Ť.

Backus-Gilbert++: Sufficient Conditions Suppose g = (gi)i=1m is an Rm-valued parameter that can be written as the restriction to Θ of Λ.K for some m×n matrix Λ. Then • g is identifiable. • If E[ε] = 0, Λ.X is an unbiased estimator of g. • If, in addition, ε has covariance matrix Σ = E[εεT], the covariance matrix of Λ.X is Λ.Σ.ΛT whatever be Pθ.

Decision Rules A (randomized) decision rule δ: X M1(A) x δx(.), is a measurable mapping from the space X of possible data to the collection M1(A) of probability distributions on a separable metric space A of actions. Anon-randomized decision rule is a randomized decision rule that, to each x X, assigns a unit point mass at some value a = a(x)  A.

Why randomized rules? • In some problems, have better behavior. • Allowing randomized rules can make the set of decisions convex (by allowing mixtures of different decisions), which makes the math easier. • If the risk is convex, Rao-Blackwell theorem says that the optimal decision is not randomized. (More on this later.)

Example: randomization natural Coin has chance 1/3 of landing with one side showing; chance 2/3 of the other showing. Don’t know which side is which. Want to decide whether P(heads) = 1/3 or 2/3. Toss coin 10 times. X = #heads. Toss fair coin once. U = #heads. Use data to pick the more likely scenario, but if data don’t help, decide by tossing a fair coin.

Estimators An estimator of a parameter g(θ) is a decision ruled for which the space A of possible actions is the space G of possible parameter values. ĝ=ĝ(X) is common notation for an estimator of g(θ). Usually write non-randomized estimator as a G-valued function of x instead of a M1(G)-valued function.

Comparing Decision Rules • Infinitely many decision rules and estimators. Which one to use? The best one! But what does best mean?

Loss and Risk • 2-player game: Nature v. Statistician. • Nature picks θ from Θ. θ is secret, but statistician knows Θ. • Statistician picks δ from a set D of rules. δ is secret. • Generate data X from Pθ, apply δ. • Statistician pays loss L(θ, δ(X)). L should be dictated by scientific context, but… • Riskis expected loss: r(θ, δ) = EqL(θ, δ(X)) • Good rule d has small risk, but what does small mean?

Strategy Rare that one d has smallest risk 8qQ. • d is admissible if not dominated (no estimator does at least as well for every q, and better for at least one q). • Minimaxdecision minimizes rQ(d) ´ supqQr(θ, δ) over dD • Minimax risk is rQ*´ infd2DrQ(δ) • Bayes decisionminimizes rp(d) ´ sQr(q,d)p(dq) over dD for a given priorprobability distributionp on Q. • Bayes risk is rp*´ infd2D rp(d).

Minimax is Bayes for least favorable prior If minimax risk >> Bayes risk, prior π controls the apparent uncertainty of the Bayes estimate. Pretty generally for convex , D, concave-convexlike r,

Common Risk: Mean Distance Error (MDE) Let dG denote the metric on G, and let ĝ be an estimator of g. MDE at θ of ĝ is MDEθ(ĝ, g) = Eq d(ĝ(X), g(θ)). If metric derives from norm, MDE is mean norm error (MNE). If the norm is Hilbertian, MNE2 is mean squared error (MSE).

Shrinkage Suppose X » N(q, I) with dim(q) ´ d ¸ 3. X not admissible for q for squared-error loss (Stein, 1956). Dominated by X(1 – a/(b + ||X||2)) for small a and big b. James-Stein better: dJS(X) ´ X(1-a/||X||2), 0 < a· 2(d-2). Even better to take positive part of shrinkage factor: dJS+(X) ´ X(1-a/||X||2)+, 0 < a· 2(d-2). dJS+ isn’t minimax for MSE, but close. Implications for Backus-Gilbert estimates of d¸ 3 linear functionals. 9 extensions to other distributions; see Evans & Stark (1996).

Bias When G is a Banach space, can define bias atθofĝ: biasθ(ĝ, g) ´Eq [ĝ - g(θ)] (when the expectation is well-defined). • If biasθ(ĝ, g) = 0, say ĝis unbiased atθ (for g). • If ĝ is unbiased at θ for g for every θ, say ĝ is unbiased for g. If such ĝ exists, say g is unbiasedly estimable. • If g is unbiasedly estimablethen g is identifiable.

Example: Bounded Normal Mean Observe X »N (q, 1). Know a prioriq2 [-t, t]. Want to estimate g(q) ´ q. f(¢): standard normal density.F(¢): standard normal cumulative distribution function. Suppose we choose to use squared-error loss: L(q, d) = (q - d)2 r(q, d) = Eq L(q, d(X)) = Eq (q - d(X))2 rQ(d) = supq2Q r(q, d) = supq2Q Eq (q - d(X))2 rQ* = infd2D supq2Q Eq (q - d(X))2

Risk of X for bounded normal mean Consider the simple (maximum likelihood) estimator d(X) ´ X. EX = q, so X is unbiased for q, and q is unbiasedly estimable. r(q, X) = Eq (q – X)2 = Var(X) = 1. Consider uniform Bayesian prior to capture constraint q2 [-t, t]: • »p = U[-t, t], the uniform distribution on the interval [-t, t]. rp(X) = s-tt r(q, X) p(dq) = s-tt1£ (2t)-1 dq = 1. In this example, frequentist risk of X equals Bayes risk of X for uniform prior p (but X is not the Bayes estimator).

Truncation is better (but not best) Easy to find an estimator better than X from both frequentist and Bayes perspectives. Truncation estimate dT dT is biased, but has smaller MSE than X, whatever be q2Q. (dT is the constrained maximum likelihood estimate.)

Risk of dT x Pq(X < -t) dT f(x|q) 0 t -t q -t 0 q t

Minimax MSE Estimate of BNM Truncation estimate better than X, but neither minimax nor Bayes. Clear that r*¸ min(1, t2): MSE(X) = 1, and rQ(0) = t2. Minimax MSE estimator is a nonlinear shrinkage estimator. Minimax MSE risk is t2/(1+t2).

Bayes estimation of BNM Posterior density of q given x is

Posterior Mean The mean of the posterior density minimizes the Bayes risk, when the loss is squared error:

Bayes estimator is also nonlinear shrinkage Philip B. Stark: function f = bayesUnif(x, tau) f = x - (normpdf(tau - x) - normpdf(-tau-x))./(normcdf(tau-x) - normcdf(-tau-x)); return 6 X dT 4 dp* 2 0 Bayes estimatordp*, t=3 -2 -4 -6 -6 -4 -2 0 2 4 6 For t = 3, Bayes risk rp*¼ 0.7 (by simulation) .Minimax risk rQ*= 0.75.

Bayes/Minimax Risks Philip B. Stark: nsim = 20000; risk = 0; tau = 5; for theta = -tau:.01:tau pred = bayesunif(theta + randn(nsim,1),tau); risk = risk + (pred - theta)'*(pred - theta)/nsim; end risk/length(-tau:.01:tau) Philip B. Stark: Philip B. Stark: Difference between knowing q2 [-t, t], and q» U[-t, t].

Confidence Interval for BNM • Might be interested in a confidence set for q instead of point estimate. A 1-a confidence set I satisfies Pq (I (X) 3q) ¸ 1-a, 8q2Q. • Actions are now sets, not points. Decision rules are probability measures on sets. • Sensible loss? Lebesgue measure of confidence set. But consider the application! • Risk is expected measure.

Some Confidence Procedures • Naïve: [X – za/2, X+ za/2] • Truncated: [X – za/2, X+ za/2] Å [-t, t] • Affine fixed-length: [aX+b–c/2, aX+b+c/2] • Nonlinear fixed length: [f(X)–c/2, f(X)+c/2], f measurable • Variable length: [l(X), u(X)], l, u measurable • Likelihood ratio test (LRT): Include all h2 [-t, t] s.t. f(h)/f(0) ¸ch. Choose ch to get 1-a coverage.

Minimax Expected Length • Seek procedure w/ min max expected length for q2 [-t, t]. • For t· 2za, minimax is Truncated Pratt; simple form. (Evans et al., 2003; Schafer & Stark, 2003.) Table for a = 0.05. Naïve has length 3.92.

Regularization • From statistical perspective, regularization tries to exploit a bias/variance tradeoff to reduce MNE. • There are situations in which regularization is optimal—depends on prior information, forward operator, loss function, regularization functional. See, e.g., Donoho (1995), O’Sullivan (1986). • Generally need some prior information about q to know whether a given amount of “smoothing” increases bias2 by more than it decreases variance. (But c.f. shrinkage.) • Can think of regularization in dual ways: minimizing a measure of size s.t. fitting data adequately, or minimizing measure of misfit subject to keeping the model small. Complementary interpretations of the Lagrange multiplier. • Generally, to get consistency, need the smoothing to decrease at the right rate as the number of data increases.

Sketch: Regularization  X = Rn K X = K  K error 0  g() g() bias R

Quantifying Uncertainty in Inverse Problems