
CS&E and Statistics

This article provides a glimpse into the world of statistical modeling in science, engineering, and society from the viewpoint of a statistician. It covers research themes such as predicting fuel economy improvements, understanding the orbital composition of galaxies, and protecting confidentiality in government databases. It also discusses bringing the computer science and statistics communities together, focusing on the development, analysis, and validation of computer models.


Presentation Transcript


  1. CS&E and Statistics James Berger Duke University and Statistical and Applied Mathematical Sciences Institute (SAMSI)

  2. Outline • A Glimpse of the World of Statistical Modeling in Science, Engineering and Society from the Viewpoint of a Statistician • Bringing the CS&E and Statistics Communities Together • Research Themes

  3. I. An Idiosyncratic Glimpse of the World of Statistical Modeling in Science, Engineering, and Society • Example 1: Predicting Fuel Economy Improvements • Example 2: Understanding the Orbital Composition of Galaxies • Example 3: Protecting Confidentiality in Government Databases, while Allowing for their Use in Research

  4. Example 1: An early-1990s study of the potential available gain in fuel economy, to gauge the possibility of changing CAFE (Corporate Average Fuel Economy) standards • Statistical modeling of EPA data involved • physics/engineering-based data transformations • ‘multilevel random effects’ models, accounting for vehicle model effects, manufacturer effects, technology type, … (about 3000 parameters) • physics/engineering knowledge of the effect of technology changes on vehicle performance, necessary to implement a ‘constant performance’ condition, some from simulation.

  5. Prediction of the effect of technology change (highly non-linear) • was done in a Bayesian fashion; • involved thousands of 3000-dimensional integrals; • utilized Markov Chain Monte Carlo methods. • The total estimated fuel economy gains available by 1995 and 2001 were (within 2%) • 11% and 20% (Automobile) • 8% and 16% (Truck) (Note that legislation had proposed CAFE increases of 20% by 1995 and 40% by 2001.) See http://www.stat.duke.edu/~berger/papers/fuel.html
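A minimal sketch of how Markov Chain Monte Carlo turns a posterior integral into an average over draws. It is a toy one-parameter problem, not the 3000-parameter fuel economy model; the data values, the flat prior, the known noise level, and the function names are all assumptions made for illustration.

```python
import math
import random

def log_posterior(mu, data, sigma):
    """Log posterior of a normal mean under a flat prior (up to a constant)."""
    return -0.5 * sum((y - mu) ** 2 for y in data) / sigma ** 2

def metropolis(data, sigma, n_steps=20000, step=0.5, seed=0):
    """Random-walk Metropolis draws from the posterior of mu."""
    rng = random.Random(seed)
    mu = sum(data) / len(data)          # start at the sample mean
    lp = log_posterior(mu, data, sigma)
    draws = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step)
        lp_prop = log_posterior(proposal, data, sigma)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            mu, lp = proposal, lp_prop
        draws.append(mu)
    return draws

# Hypothetical measurements with an assumed known noise standard deviation of 1.0.
data = [9.8, 10.4, 10.1, 9.6, 10.6]
draws = metropolis(data, sigma=1.0)

# The posterior mean is an integral; here it is approximated by the draw average.
posterior_mean = sum(draws) / len(draws)
```

In this toy case the exact posterior mean is the sample mean of the data, so the draw average can be checked directly; in a 3000-dimensional model no such closed form is available, which is why sampling is used.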

  6. Example 2: Understanding the orbital composition of galaxies • Consider a galaxy as made of a collection of ‘rings’ of orbiting stars; each ring specified by • its location • a given velocity for the stars in the ring. • Available data is the luminosity in each (location,velocity) slit of the galaxy; • it is measured with noise. • Goal: find the luminosity ‘weight’ of each ‘ring’.

  7. Finding the weights appears to be a linearly constrained quadratic minimization problem, but • there are many local minima, with nearly the same minimum value, so the actual minimum is unimportant • characterization of the uncertainty in the weights is crucial, leading to identification of the computationally ‘stable’ and ‘transient’ orbits. • A solution is to employ Bayesian analysis, leading to the posterior distribution of weights: • here, dimensions of integration are roughly equal to the number of orbits considered; • new Markov Chain Monte Carlo methods for highly constrained spaces are required.
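The point about uncertainty in the weights can be illustrated with a small sketch. It assumes two hypothetical 'rings' whose luminosity profiles are nearly identical, Gaussian noise with a known standard deviation, and a flat prior on nonnegative weights. The sampler is plain random-walk Metropolis that rejects proposals outside the constrained space; it is not one of the new methods the slide calls for, and the matrix `A`, the data `y`, and all function names are invented for the example.

```python
import math
import random

def log_posterior(w, A, y, sigma):
    """Gaussian log-likelihood with a flat prior restricted to w >= 0."""
    if any(wi < 0.0 for wi in w):
        return -math.inf                 # outside the constrained space
    rss = 0.0
    for row, yi in zip(A, y):
        fitted = sum(a * wi for a, wi in zip(row, w))
        rss += (yi - fitted) ** 2
    return -0.5 * rss / sigma ** 2

def sample_weights(A, y, sigma, start, n_steps=50000, burn_in=5000,
                   step=0.05, seed=1):
    """Random-walk Metropolis; proposals with a negative weight are rejected."""
    rng = random.Random(seed)
    w = list(start)
    lp = log_posterior(w, A, y, sigma)
    draws = []
    for i in range(n_steps):
        proposal = [wi + rng.gauss(0.0, step) for wi in w]
        lp_prop = log_posterior(proposal, A, y, sigma)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            w, lp = proposal, lp_prop
        if i >= burn_in:
            draws.append(list(w))
    return draws

def std(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# Hypothetical luminosity contributions of two near-identical rings in four slits.
A = [[1.0, 1.0], [1.0, 0.9], [1.0, 1.1], [0.5, 0.5]]
y = [3.0, 2.9, 3.1, 1.5]                 # noise-free data generated from weights (2, 1)
draws = sample_weights(A, y, sigma=0.1, start=[1.5, 1.5])

w1 = [d[0] for d in draws]
total = [d[0] + d[1] for d in draws]
mean_total = sum(total) / len(total)
sd_w1, sd_total = std(w1), std(total)
```

Under these assumptions the total luminosity is pinned down tightly while each individual weight remains highly uncertain, which a single minimizer would not reveal.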

  8. Example 3: Protecting Privacy in an Electronic, Post-9/11/01 World • Underlying Tension: Federal statistical agencies must • protect confidentiality of data (and privacy of individuals and organizations), • disclose information to the public, researchers, … • Current Milieu: Sophisticated ways to break confidentiality. • Example: Linkage to external databases (many) using powerful software tools. • The need: equally powerful models and tools to protect confidentiality.

  9. Full Data: Large (e.g., 40 dimensions x 10 categories) contingency table corresponding to a categorical database; note that there are 10^40 cells in the full table (but most are 0). • To Disseminate to Researchers: Set of marginal sub-tables that maximize utility of released information subject to a risk disclosure constraint • The difficult computational challenges include • computation via MCMC or integer programming or ?? with huge contingency tables; • optimization of the utility, subject to the constraint; • determination of the statistical utility of sub-tables. See NISS Digital Government Project: http://www.niss.org/dg
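A small sketch of what 'marginal sub-tables' of a sparse contingency table look like. The records and the helper name `marginal_table` are hypothetical, and the sketch only counts; it does not attempt the utility optimization or the disclosure-risk constraint described above.

```python
from collections import Counter

def marginal_table(records, dims):
    """Count records by the categories they take on the chosen dimensions."""
    return Counter(tuple(record[d] for d in dims) for record in records)

# Cells in the full table from the slide: 10 categories in each of 40 dimensions.
n_cells = 10 ** 40

# Hypothetical records with three categorical variables each.
records = [(0, 1, 2), (0, 1, 0), (1, 1, 2), (0, 0, 2), (0, 1, 2)]

full_table = Counter(records)            # sparse: only non-empty cells are stored
table_01 = marginal_table(records, (0, 1))   # two-way margin over variables 0 and 1
table_2 = marginal_table(records, (2,))      # one-way margin over variable 2
```

Storing only the non-empty cells is what makes a table with 10^40 potential cells representable at all; each released margin sums over the variables that are left out.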

  10. II. Bringing the CS&E and Statistics Communities Together • Example: Inverse problems and validation for complex computer models • Barriers to closer association • Mechanisms for closer association

  11. Example: Development, Analysis and Validation of Computer Models • Consider computer models of processes, created via applied mathematical modeling, statistical modeling, microsimulation, or another strategy. • Collect data from the real process, to • Find unknown parameters of the computer model (the inverse problem), and characterize uncertainty • Find inadequacies of the computer model and suggest improvements • Predict accuracy of the computer model

  12. [Figure: b(x) plotted against x]

  13. Illustration: Math modeling of vehicle crashes • A finite element applied math model • 100,000 elements • developed using LS-DYNA • 12+ hours to run • Accelerometer data is available at differing vehicle velocities • 36 computer runs • 36 field tests

  14. Statistical modeling of velocities as a function of time: $v_{\mathrm{field}}(t) = v_{\mathrm{true}}(t) + e(t)$, $v_{\mathrm{model}}(t) = v_{\mathrm{true}}(t) + b(t)$, where $e(t)$ is noise and $b(t)$ is computer model bias. Analysis: Use Bayesian analysis and Markov Chain Monte Carlo implementation to • provide estimates (with uncertainties) of unknown coefficients in the math model, e.g., damping; • assess accuracy of predictions of the computer model (e.g., at initial velocity v=30 mph, there is a 90% chance that the computer model prediction is within 1.5 of the true process value); • predict key engineering quantities, such as CRITV, the airbag deployment time. See: http://www.niss.org/technicalreports/tr128.pdf
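The two model equations imply that the computer model output minus the field measurement equals b(t) - e(t), so averaging over field tests exposes the bias. The sketch below simulates this with invented functions for the true velocity and the bias, an assumed noise level, and 36 replicate field tests under identical conditions (the actual tests were run at differing velocities). It is a crude difference-of-means estimate, not the Bayesian analysis described above.

```python
import random

def v_true(t):
    """Hypothetical true velocity (mph) at time t; a placeholder, not crash physics."""
    return 30.0 - 2.0 * t

def bias(t):
    """Hypothetical computer model bias b(t)."""
    return 0.5 * t

def simulate_field_test(times, noise_sd, rng):
    """One field test: v_field(t) = v_true(t) + e(t)."""
    return [v_true(t) + rng.gauss(0.0, noise_sd) for t in times]

def estimate_bias(model_run, field_tests):
    """Pointwise estimate of b(t): model output minus the mean field measurement."""
    estimates = []
    for i in range(len(model_run)):
        field_mean = sum(test[i] for test in field_tests) / len(field_tests)
        estimates.append(model_run[i] - field_mean)
    return estimates

rng = random.Random(2)
times = [0.0, 1.0, 2.0, 3.0, 4.0]

# Computer model output: v_model(t) = v_true(t) + b(t).
model_run = [v_true(t) + bias(t) for t in times]

# 36 simulated field tests with an assumed noise standard deviation of 0.3.
field_tests = [simulate_field_test(times, 0.3, rng) for _ in range(36)]

bias_hat = estimate_bias(model_run, field_tests)
```

With 36 replicates the noise in each pointwise estimate shrinks by a factor of six, which is why `bias_hat` tracks the assumed `bias(t)` closely in this simulation.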

  15. Barriers to Bringing the CS&E and Statistics Communities Together • To many disciplinary scientists • we are each ‘providers of tools they can use’ • we are indistinguishable quantitative experts • Program and project funding rarely encourage inclusion of both CS&E and statistical scientists. • Our traditional application areas generally differ • CS&E tradition: physical sciences and engineering • Statistics tradition: strongest – as the statistics discipline – in social sciences, medical sciences,… (This could be an organizational strength for the CS&E initiative, but is a barrier at the personal level.)

  16. Mechanisms for Bringing the CS&E and Statistics Communities Together • Most important is simply to bring them together on interdisciplinary teams. • Institute programs (e.g., at SAMSI), for extended cooperation • joint workshops • joint working groups • Emphasize need for joint funding on interdisciplinary projects. • At Universities?

  17. Organizing and Delivering Joint CS&E and Statistics Educational Programs At SAMSI, we • provide integrated courses, jointly taught; • provide graduate students and postdocs with year-long exposure to joint programs; • provide one-week outreach programs to undergraduates and high-school teachers, and two-week outreach programs to beginning graduate students, to introduce them to the CS&E and Statistics worlds; • begin opening program workshops with extensive tutorials.

  18. Research Challenges • Statistical computational research challenges: • MCMC development and implementation • data confidentiality and large contingency tables • dealing with large data sets • in real time • off-line • bioinformatics, gene regulation, protein folding, … • data mining • utilizing multiscale data • data fusion, data assimilation • graphical models/causal networks • open source software environments • visualization • many, many more.

  19. Challenges in the synthesis of statistics and development of computer modeling: • Statistical analysis in non-linear situations can require thousands of model evaluations (e.g., using MCMC), so the ‘real’ computational problem is the product of two very intensive computational problems; this is needed for • designing effective evaluation experiments; • estimating unknown model parameters (inverse problem), with uncertainty evaluation; • assessing model bias and predictive capability of the model; • detecting inadequate model components.
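A back-of-the-envelope sketch of why the product of the two computational problems matters. The 12-hour run time is borrowed from the crash model illustration ('12+ hours to run'); the count of 5000 evaluations is an assumption standing in for 'thousands', and the function name is hypothetical.

```python
def serial_cost_hours(n_evaluations, hours_per_run):
    """Total wall-clock hours if every model evaluation runs one after another."""
    return n_evaluations * hours_per_run

# Assumed: 5000 MCMC evaluations, each a 12-hour computer model run.
hours = serial_cost_hours(5000, 12)
years = hours / (24 * 365)
```

Under these assumptions the serial cost is 60,000 hours, or nearly seven years, which is why running the full computer model inside every MCMC step is impractical without further strategy.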

  20. Simultaneous use of statistical and applied mathematical modeling is needed for • effective utilization of many types of data, such as • data that occurs at multiple scales; • data/models that are individual-specific. • replacing unresolvable determinism by stochastic or statistically modeled components (parameterization) This general area of validation of computer models should be a Grand Challenge.
