CS&E and Statistics James Berger Duke University and Statistical and Applied Mathematical Sciences Institute (SAMSI)
Outline • A Glimpse of the World of Statistical Modeling in Science, Engineering and Society from the Viewpoint of a Statistician • Bringing the CS&E and Statistics Communities Together • Research Challenges
I. An Idiosyncratic Glimpse of the World of Statistical Modeling in Science, Engineering, and Society • Example 1: Predicting Fuel Economy Improvements • Example 2: Understanding the Orbital Composition of Galaxies • Example 3: Protecting Confidentiality in Government Databases, while Allowing for their Use in Research
Example 1: An early-1990s study of the potential available gain in fuel economy, to gauge the possibility of changing CAFE (Corporate Average Fuel Economy) standards • Statistical modeling of EPA data involved • physics/engineering-based data transformations; • ‘multilevel random effects’ models, accounting for vehicle-model effects, manufacturer effects, technology type, … (about 3000 parameters); • physics/engineering knowledge of the effect of technology changes on vehicle performance, needed to implement a ‘constant performance’ condition, some of it from simulation.
Prediction of the effect of technology change (highly non-linear) • was done in a Bayesian fashion; • involved thousands of 3000-dimensional integrals; • utilized Markov Chain Monte Carlo methods. • The total estimated fuel economy gains available by 1995 and 2001 were (within 2%) • 11% and 20% (Automobile) • 8% and 16% (Truck) (Note that legislation had proposed CAFE increases of 20% by 1995 and 40% by 2001.) See http://www.stat.duke.edu/~berger/papers/fuel.html
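To make the computational pattern concrete, here is a minimal Python sketch (not the study’s actual code) of random-walk Metropolis MCMC on a toy multilevel random-effects model; the two-level structure, the data, and all constants are illustrative assumptions standing in for the 3000-parameter EPA model.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: fuel-economy residuals grouped by vehicle model (all values illustrative).
n_groups, n_per = 20, 5
group_means = rng.normal(0.0, 1.0, n_groups)
y = group_means[:, None] + rng.normal(0.0, 0.5, (n_groups, n_per))

def log_posterior(theta):
    # theta = (mu, log_tau, log_sigma, group effects); flat priors throughout.
    mu, tau, sigma = theta[0], np.exp(theta[1]), np.exp(theta[2])
    alpha = theta[3:]  # one random effect per vehicle model
    lp = np.sum(-0.5 * ((alpha - mu) / tau) ** 2 - np.log(tau))               # level 2
    lp += np.sum(-0.5 * ((y - alpha[:, None]) / sigma) ** 2 - np.log(sigma))  # level 1
    return lp

# Random-walk Metropolis over the full (3 + n_groups)-dimensional parameter.
dim = 3 + n_groups
theta = np.zeros(dim)
lp = log_posterior(theta)
samples = []
for _ in range(20000):
    prop = theta + 0.05 * rng.normal(size=dim)
    lp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject step
        theta, lp = prop, lp_prop
    samples.append(theta.copy())

samples = np.array(samples[5000:])  # discard burn-in
print("posterior mean of the overall effect mu:", samples[:, 0].mean())

The high-dimensional posterior expectations (the 3000-dimensional integrals above) are then approximated by averages of such draws.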
Example 2: Understanding the orbital composition of galaxies • Consider a galaxy as a collection of ‘rings’ of orbiting stars, each ring specified by • its location; • a given velocity for the stars in the ring. • The available data are the luminosities in the (location, velocity) slits of the galaxy; • they are measured with noise. • Goal: find the luminosity ‘weight’ of each ‘ring’.
Finding the weights appears to be a linearly constrained quadratic minimization problem, but • there are many local minima with nearly the same minimum value, so the actual minimum is unimportant; • characterization of the uncertainty in the weights is crucial, leading to identification of the computationally ‘stable’ and ‘transient’ orbits. • A solution is to employ Bayesian analysis, leading to the posterior distribution of the weights (a toy sketch follows): • here, the dimension of integration is roughly equal to the number of orbits considered; • new Markov Chain Monte Carlo methods for highly constrained spaces are required.
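A minimal sketch of this kind of constrained posterior sampling, assuming a linear forward model (luminosity = ring templates × nonnegative weights + noise); the templates, noise level, and the rejection-based handling of the constraint are all illustrative assumptions, not the methods actually developed for this problem.

import numpy as np

rng = np.random.default_rng(0)

# Toy forward model: luminosity in each (location, velocity) slit is a weighted
# sum of ring templates plus Gaussian noise; weights must be nonnegative.
n_slits, n_rings = 50, 10
A = np.abs(rng.normal(size=(n_slits, n_rings)))  # hypothetical ring templates
w_true = np.abs(rng.normal(size=n_rings))
sigma = 0.1
data = A @ w_true + sigma * rng.normal(size=n_slits)

def log_posterior(w):
    # Flat prior on w >= 0; -inf outside the constraint set.
    if np.any(w < 0):
        return -np.inf
    resid = data - A @ w
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

# Random-walk Metropolis; proposals violating w >= 0 are simply rejected,
# one (crude) way of handling a highly constrained space.
w = np.full(n_rings, 0.5)
lp = log_posterior(w)
samples = []
for _ in range(50000):
    prop = w + 0.02 * rng.normal(size=n_rings)
    lp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        w, lp = prop, lp_prop
    samples.append(w.copy())

samples = np.array(samples[10000:])
print("posterior mean weights:", samples.mean(axis=0).round(2))
print("posterior std (uncertainty):", samples.std(axis=0).round(2))

The posterior spreads then quantify which weights the data pin down and which they barely constrain, the distinction drawn above between ‘stable’ and ‘transient’ orbits.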
Example 3: Protecting Privacy in an Electronic, Post-9/11/01 World • Underlying Tension: Federal statistical agencies must • protect the confidentiality of data (and the privacy of individuals and organizations), yet • disclose information to the public, researchers, … • Current Milieu: Sophisticated ways to break confidentiality exist. • Example: Linkage to (many) external databases using powerful software tools. • The need: equally powerful models and tools to protect confidentiality.
Full Data: A large (e.g., 40 dimensions × 10 categories) contingency table corresponding to a categorical database; note that there are 10^40 cells in the full table (but most are 0). • To Disseminate to Researchers: A set of marginal sub-tables that maximizes the utility of the released information subject to a disclosure-risk constraint (a toy sketch follows). • The difficult computational challenges include • computation via MCMC or integer programming or ?? with huge contingency tables; • optimization of the utility, subject to the constraint; • determination of the statistical utility of sub-tables. See NISS Digital Government Project: http://www.niss.org/dg
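A toy numpy sketch of the marginal-sub-table idea, using a small 3-way table in place of the intractable 40-way one; the ‘risky small cell’ count at the end is only an illustrative stand-in for a real disclosure-risk measure.

import numpy as np

rng = np.random.default_rng(0)

# Toy 3-way contingency table standing in for the huge 40-way table;
# each axis is a categorical variable with a few levels.
table = rng.poisson(2.0, size=(4, 3, 5))  # counts; many cells small or zero

def marginal(table, keep_axes):
    # Sum out every axis not in keep_axes, producing a marginal sub-table.
    drop = tuple(ax for ax in range(table.ndim) if ax not in keep_axes)
    return table.sum(axis=drop)

# Release only lower-dimensional margins, e.g. the (0,1) and (1,2) sub-tables.
released = {axes: marginal(table, axes) for axes in [(0, 1), (1, 2)]}
for axes, sub in released.items():
    print("margin over variables", axes, "has shape", sub.shape)

# Crude disclosure-risk proxy (illustrative only): small released cells can
# identify individuals, so flag any released counts of 1 or 2.
for axes, sub in released.items():
    risky = np.logical_and(sub > 0, sub <= 2).sum()
    print("margin", axes, "has", risky, "small (risky) cells")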
II. Bringing the CS&E and Statistics Communities Together • Example: Inverse problems and validation for complex computer models • Barriers to closer association • Mechanisms for closer association
Example: Development, Analysis and Validation of Computer Models • Consider computer models of processes, created via applied mathematical modeling, statistical modeling, microsimulation, or other strategy. • Collect data from the real process, to • Find unknown parameters of the computer model (the inverse problem), and characterize uncertainty • Find inadequacies of the computer model and suggest improvements • Predict accuracy of the computer model
[Figure: computer-model bias b(x) as a function of the input x]
Illustration: Math modeling of vehicle crashes • A finite-element applied math model • 100,000 elements • developed using LS-DYNA • 12+ hours per run • Accelerometer data are available at different vehicle velocities • 36 computer runs • 36 field tests
Statistical modeling of velocities as a function of time: v_field(t) = v_true(t) + e(t), v_model(t) = v_true(t) + b(t), where e(t) is measurement noise and b(t) is the computer-model bias. Analysis: Use Bayesian analysis with a Markov Chain Monte Carlo implementation to • provide estimates (with uncertainties) of unknown coefficients in the math model, e.g., damping; • assess the accuracy of the computer model’s predictions (e.g., at initial velocity v = 30 mph, there is a 90% chance that the computer-model prediction is within 1.5 of the true process value); • allow prediction of key engineering quantities, such as CRITV, the airbag deployment time. See: http://www.niss.org/technicalreports/tr128.pdf
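The talk’s analysis couples the two equations above through a full Bayesian/MCMC treatment; the following is a deliberately simplified, pointwise conjugate sketch of the same bias model, with the velocity profile, bias curve, and noise level all assumed for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: v_field(t) = v_true(t) + e(t), v_model(t) = v_true(t) + b(t).
# With repeated field tests, d(t) = v_model(t) - v_field(t) = b(t) - e(t), so
# the bias b(t) can be estimated pointwise from the discrepancies d(t).
t = np.linspace(0.0, 0.1, 50)      # time after impact (s), illustrative
v_true = 30.0 - 250.0 * t          # hypothetical true velocity profile
b_true = 0.5 * np.sin(20 * t)      # hypothetical computer-model bias
sigma_e = 0.3                      # assumed field-measurement noise sd

n_tests = 36
v_field = v_true + sigma_e * rng.normal(size=(n_tests, t.size))
v_model = v_true + b_true          # one (deterministic) model run

# Pointwise posterior for b(t) under a flat prior and known sigma_e:
# b(t) | data ~ N(mean of d(t), sigma_e^2 / n_tests).
d = v_model - v_field              # shape (n_tests, n_times)
b_hat = d.mean(axis=0)
b_sd = sigma_e / np.sqrt(n_tests)

# 90% posterior band for the bias.
lo, hi = b_hat - 1.645 * b_sd, b_hat + 1.645 * b_sd
print("max |posterior mean bias|:", np.abs(b_hat).max().round(3))
print("90% band half-width:", round(1.645 * b_sd, 3))

A band like this on b(t) is what backs statements of the form “the model prediction is within 1.5 of the true value with 90% probability.”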
Barriers to Bringing the CS&E and Statistics Communities Together • To many disciplinary scientists, • we are each ‘providers of tools they can use’; • we are indistinguishable quantitative experts. • Program and project funding rarely encourages inclusion of both CS&E and statistical scientists. • Our traditional application areas generally differ • CS&E tradition: physical sciences and engineering • Statistics tradition: strongest, as the statistics discipline, in the social sciences, medical sciences, … (This could be an organizational strength for the CS&E initiative, but it is a barrier at the personal level.)
Mechanisms for Bringing the CS&E and Statistics Communities Together • Most important is simply to bring them together on interdisciplinary teams. • Institute programs (e.g., at SAMSI) for extended cooperation • joint workshops • joint working groups • Emphasize the need for joint funding of interdisciplinary projects. • At universities?
Organizing and Delivering Joint CS&E and Statistics Educational Programs At SAMSI, we • provide integrated courses, jointly taught; • provide graduate students and postdocs with year-long exposure to joint programs; • provide one-week outreach programs for undergraduates and high-school teachers, and two-week outreach programs for beginning graduate students, to introduce them to the CS&E and Statistics worlds; • begin each program’s opening workshop with extensive tutorials.
III. Research Challenges • Statistical computational research challenges: • MCMC development and implementation • data confidentiality and large contingency tables • dealing with large data sets • in real time • off-line • bioinformatics, gene regulation, protein folding, … • data mining • utilizing multiscale data • data fusion, data assimilation • graphical models/causal networks • open-source software environments • visualization • many, many more.
Challenges in the synthesis of statistics and computer modeling: • Statistical analysis in non-linear settings can require thousands of model evaluations (e.g., for MCMC), so the ‘real’ computational problem is the product of two very intensive computational problems (a toy sketch of this cost structure follows); such analyses are needed for • designing effective evaluation experiments; • estimating unknown model parameters (the inverse problem), with uncertainty evaluation; • assessing model bias and the predictive capability of the model; • detecting inadequate model components.
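A small sketch of why the costs multiply: every posterior evaluation inside the MCMC loop requires a full run of the computer model. The simulator, data, and constants below are placeholders, not any real code or model.

import time
import numpy as np

rng = np.random.default_rng(0)

def run_simulator(theta):
    # Stand-in for an expensive computer model (e.g., a 12-hour finite-element
    # run); here a cheap quadratic plus a sleep to make the cost visible.
    time.sleep(0.001)  # pretend this is hours of computing
    return theta[0] + theta[1] ** 2

data, sigma = 1.5, 0.1  # one observed summary, assumed noise sd

def log_posterior(theta):
    # Each posterior evaluation requires a full simulator run: total cost is
    # (number of MCMC iterations) x (cost of one model evaluation).
    pred = run_simulator(theta)
    return -0.5 * ((data - pred) / sigma) ** 2

theta = np.zeros(2)
lp = log_posterior(theta)
n_iter = 200  # thousands would be needed in practice
start = time.perf_counter()
for _ in range(n_iter):
    prop = theta + 0.1 * rng.normal(size=2)
    lp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
print(f"{n_iter} iterations took {time.perf_counter() - start:.2f} s;")
print("a 12-hour model would need ~", n_iter * 12, "CPU-hours at this chain length")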
Simultaneous use of statistical and applied-mathematical modeling is needed for • effective utilization of many types of data, such as • data that occur at multiple scales; • data/models that are individual-specific; • replacing unresolvable determinism with stochastic or statistically modeled components (parameterization); a toy sketch of such a parameterization follows. This general area of validation of computer models should be a Grand Challenge.
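As a toy illustration of the parameterization point (all dynamics and coefficients invented for the sketch): an unresolved fast forcing term is replaced by a fitted stochastic (AR(1)) component inside an otherwise deterministic model.

import numpy as np

rng = np.random.default_rng(1)

# Toy model dx/dt = -x + f(t), where the fast forcing f(t) cannot be resolved
# and is replaced by an AR(1) stochastic parameterization with assumed
# coefficients phi and sigma_f (in practice these would be fit to data).
dt, n_steps = 0.01, 1000
phi, sigma_f = 0.9, 0.3  # hypothetical fitted parameterization constants
x, f = 0.0, 0.0
traj = []
for _ in range(n_steps):
    f = phi * f + sigma_f * rng.normal()  # stochastic stand-in for unresolved physics
    x = x + dt * (-x + f)                 # resolved dynamics, Euler step
    traj.append(x)
print("sample variance of resolved state:", np.var(traj))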