
Statistics and Data Sciences Group Computer Science and Mathematics Division

Applied Statistics for the Office of Science Understanding Variability and Bringing Rigor to Scientific Investigation George Ostrouchov. Statistics and Data Sciences Group Computer Science and Mathematics Division Oak Ridge National Laboratory. U.S. Department of Energy. Office of Science.





Presentation Transcript


  1. Applied Statistics for the Office of Science: Understanding Variability and Bringing Rigor to Scientific Investigation. George Ostrouchov, Statistics and Data Sciences Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory

  2. U.S. Department of Energy, Office of Science. Filling a Gap in Statistics to Address Office of Science Needs
     Office of Science Response to the Data Challenge: The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.” Raymond L. Orbach, AAAS, Feb. 19, 2006
     ASCR Strategic Plan: “[AMR] weaknesses include an underinvestment or lack of investment in several critical areas: • . . . • Underinvestment in statistics”
     “The following gaps in the [AMR] program have been identified: • Multiscale mathematics • Ultrascale algorithms • Discrete mathematics • Statistics – investments in this area are required to deal with extracting knowledge from the oceans of data that large-scale simulations will produce. • Multiphysics”
     Through Applied Statistics, ASCR has the opportunity to engage the dominant segment of Applied Mathematics for its goals. The ORNL Applied Statistics program can address the curse of dimensionality and other Office of Science goals.

  3. Statistics Brings Rigor and Efficiency to Scientific Investigation and Technology • Karl Pearson (1857-1936) • “The Grammar of Science” (1892) – Relativity • First Department of Statistics (1911) UCL • Founding editor of Biometrika
     [Photo caption] Conrad Habicht, Maurice Solovine, and Albert Einstein, the self-styled Olympia Academy, in about 1903. At Einstein’s suggestion, the first book read was Pearson’s “The Grammar of Science.” CREDIT: IMAGE ARCHIVE ETH-BIBLIOTHEK, ZÜRICH

  4. Common Evolutionary Steps: Experimental Science and Computational Science • Early computational science relies largely on intuitive design and visual validation • Computational experiments are expensive • Petascale data sets are nearly as opaque as real systems – statistical analysis must select what to visualize • Uncertainty analysis is in its infancy • Statistics is a major partner in bringing computational science to the rigor and efficiency standards of experimental science • Methods to see through, examine, and classify variability • Uncertainty quantification • Statistical design of experiments • Fusion of data and computational experiment

  5. Statistics: the Study of Variability • The discipline concerned with the study of variability, with the study of uncertainty, and with the study of decision-making in the face of uncertainty. • Large-scale user of mathematical and computational tools with a focused scientific agenda • Inherently interdisciplinary. Cuts through the fog of variability and brings efficiency to science. Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century.

  6. Mathematics, Computer Science, and Statistics are Biology’s Next Microscope, Only Better (also Materials’ and Chemistry’s; Particle Physics’ Device; Astrophysics’ Telescope)
     Mathematics is Biology’s Next Microscope, Only Better. Cohen JE (2004). PLoS Biol 2(12): e439. Fellow AAAS, Fellow AmPhilSoc, Member NAS
     [Diagram labels: Computer Science and Mathematics; Multiscale Math; Statistics; Computer Science]
     Here are five mathematical challenges that would contribute to the progress of biology. (1) Understand computation. Find more effective ways to gain insight and prove theorems from numerical or symbolic computations and agent-based models. We recall Hamming: “The purpose of computing is insight, not numbers” (Hamming 1971, p. 31). (2) Find better ways to model multi-level systems, for example, cells within organs within people in human communities in physical, chemical, and biotic ecologies. (3) Understand probability, risk, and uncertainty. Despite three centuries of great progress, we are still at the very beginning of a true understanding. Can we understand uncertainty and risk better by integrating frequentist, Bayesian, subjective, fuzzy, and other theories of probability, or is an entirely new approach required? (4) Understand data mining, simultaneous inference, and statistical de-identification (Miller 1981). Are practical users of simultaneous statistical inference doomed to numerical simulations in each case, or can general theory be improved? What are the complementary limits of data mining and statistical de-identification in large linked databases with personal information? (5) Set standards for clarity, performance, publication and permanence of software and computational results.

  7. Particle Physics Embraces Statistics. “… since 1900 … statistics … takes over field after field … [as] … the methodology of choice … … people in astronomy and physics … are starting to use statistics a lot more for the simple reason that they have to be efficient now. … I don't see any area where it's being resisted much.” Bradley Efron, Chair, Department of Statistics, Stanford University and Max H. Stein Professor of Humanities and Sciences; 2005 National Medal of Science Recipient

  8. Citations to Statistics Comprise the Dominant Group within Mathematics
     Highly Cited Authors in Mathematics for period 1991-2001

     | Rank | Name | Affiliation | Department / Field | Papers | Citations |
     |---|---|---|---|---|---|
     | 1 | Pierre-Louis Lions | University of Paris 9 | Mathematics | 75 | 1207 |
     | 2 | David L. Donoho | Stanford University | Statistics | 27 | 1182 |
     | 3 | Adrian F.M. Smith | Univ. London | Statistics | 40 | 1026 |
     | 4 | Elizabeth A. Thompson | U. Washington | Biostatistics | 11 | 973 |
     | 5 | Iain M Johnstone | Stanford University | Statistics | 17 | 968 |
     | 6 | Jianqing Fan | Chinese U. Hong Kong | Statistics | 53 | 901 |
     | 7 | Donald B. Rubin | Harvard University | Statistics | 38 | 854 |
     | 8 | Ingrid Daubechies | Princeton University | Mathematics | 20 | 807 |
     | 9 | Adrian E. Raftery | U. Washington | Statistics/Sociol. | 31 | 804 |
     | 10 | Alan E. Gelfand | U. Connecticut | Statistics | 35 | 747 |
     | 11 | Sun-Wei Guo | Med. Coll. Wisconsin | Biostatistics | 6 | 737 |
     | 12 | Scott L. Zeger | Johns Hopkins Univ. | Biostatistics | 23 | 723 |
     | 13 | Peter J. Green | University of Bristol | Statistics | 14 | 667 |
     | 14 | Bradley P. Carlin | University of Minnesota | Biostatistics | 28 | 663 |
     | 15 | J. Stephen Marron | U. North Carolina | Statistics | 43 | 618 |
     | 16 | David G. Clayton | MRC, Cambridge | Biostatistics | 4 | 598 |
     | 17 | Gareth O. Roberts | Lancaster Univ. | Statistics | 41 | 598 |
     | 18 | Albert Cohen | University of Paris | Mathematics | 61 | 572 |
     | 19 | Michael Rockner | Univ. Bielefeld, Germany | Mathematics | 69 | 572 |
     | 20 | Yangbo Ye | University of Iowa | Mathematics | 42 | 567 |
     | 21 | Jinchao Xu | Pennsylvania St. U. | Mathematics | 22 | 566 |
     | 22 | Xiao-Li Meng | University of Chicago | Statistics | 27 | 561 |
     | 23 | Matthew P. Wand | Harvard University | Biostatistics | 31 | 558 |
     | 24 | Wally R. Gilks | MRC | Biostatistics | 16 | 551 |
     | 25 | M. Chris Jones | Open University | Statistics | 52 | 542 |

     Highly Cited Journals in Mathematics

     | Rank | Journal | 1991-2001 Citations |
     |---|---|---|
     | 1 | J. American Statistical Assn. | 16,457 |
     | 2 | Biometrics | 10,854 |
     | 3 | J. Math. Analysis | 9,845 |
     | 4 | Annals of Statistics | 9,702 |
     | 5 | Proc. Amer. Math Soc. | 9,237 |
     | 6 | C.R. Acad. Sci. Ser. I Math. | 9,153 |
     | 7 | Trans. Amer. Math. Soc. | 8,586 |
     | 8 | Journal of Algebra | 8,531 |
     | 9 | J. Functional Analysis | 7,999 |
     | 10 | Biometrika | 7,911 |
     | 11 | SIAM J. Numer. Anal. | 7,383 |
     | 12 | Inventiones Mathematicae | 7,382 |
     | 13 | J. Royal Stat. Soc. B | 6,575 |
     | 14 | Mathemat. Programming | 6,444 |
     | 15 | Linear Algebra Appl. | 6,112 |

     19 of the Top 25 most cited mathematics authors are from Statistics or Biostatistics! Statistics is Highly Interdisciplinary! Citations per paper: Statistics and Biostatistics – 27; Rest of Mathematics – 15. SOURCE: ISI Essential Science Indicators, Sci. Citation Index (300 Journals in pure mathematics, applied mathematics, statistics and probability)
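The slide does not say how its "citations per paper" figures were computed. As a rough check, the sketch below pools the Papers and Citations columns of the 25 listed authors by field group. The group labels `"stat"` (Statistics or Biostatistics, including the joint Statistics/Sociol. entry) and `"math"` are assumptions made here for illustration, not part of the source data.

```python
# (group, papers, citations) for the 25 authors, in the order listed on the slide.
# "stat" = Statistics or Biostatistics; "math" = everything else (labels assumed here).
AUTHORS = [
    ("math", 75, 1207), ("stat", 27, 1182), ("stat", 40, 1026),
    ("stat", 11, 973),  ("stat", 17, 968),  ("stat", 53, 901),
    ("stat", 38, 854),  ("math", 20, 807),  ("stat", 31, 804),
    ("stat", 35, 747),  ("stat", 6, 737),   ("stat", 23, 723),
    ("stat", 14, 667),  ("stat", 28, 663),  ("stat", 43, 618),
    ("stat", 4, 598),   ("stat", 41, 598),  ("math", 61, 572),
    ("math", 69, 572),  ("math", 42, 567),  ("math", 22, 566),
    ("stat", 27, 561),  ("stat", 31, 558),  ("stat", 16, 551),
    ("stat", 52, 542),
]


def count_authors(rows, group):
    """Number of listed authors in the given group."""
    return sum(1 for g, _, _ in rows if g == group)


def citations_per_paper(rows, group):
    """Pooled ratio: total citations divided by total papers for the group."""
    papers = sum(p for g, p, _ in rows if g == group)
    citations = sum(c for g, _, c in rows if g == group)
    return citations / papers


if __name__ == "__main__":
    for group in ("stat", "math"):
        print(group, count_authors(AUTHORS, group),
              round(citations_per_paper(AUTHORS, group), 1))
```

Pooling the 25 rows this way gives 19 authors in the statistics group and ratios that round to 27 and 15, which matches the slide's summary line.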

  9. Statistics Disseminates Data Analysis Ideas Across Science Domains. Of 500 recent citations of Efron’s “Bootstrap” paper, 348 were outside statistics. [NSF2004] Mitchell’s “Detmax Algorithm” paper (funded by AMR at ORNL) has 200+ citations; those shown in red are outside statistics.

  10. Statistics Core Research Disseminates and Unifies Data Analysis Ideas: Addressing the Curse of Dimensionality. Tames the explosion of data analytic methods by • Providing portability between science domains • Deriving properties of new data analytic methods • Building bridges between data analytic methods. Examples: • Latent Semantic Indexing (Dumais+ 1991) and Correspondence Analysis (Benzecri 1969, 1980, 1992, Greenacre 1984) • Empirical Orthogonal Functions (Lorenz 1956) and a climate time series application of Principal Components Analysis (Pearson 1902, Hotelling 1935) • Support Vector Machines (Vapnik 1995) and Logistic Regression (Cox 1970) via hinge loss function (Hastie+ 2001) • FastMap approximation to Principal Components (Faloutsos+ 1995): Bridge to Convex Hull and new methods, RobustMap (Ostrouchov+ 2005) and to right Householder transformations (Ostrouchov+ 2006)
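One of the bridges named on this slide, between Support Vector Machines and Logistic Regression, can be made concrete by writing both methods as minimizers of a loss that depends only on the margin y·f(x), with labels y in {-1, +1}. The sketch below is a minimal illustration of that standard textbook view, not code from the presentation; the function names are chosen here for illustration.

```python
import math


def hinge_loss(margin):
    """SVM loss: zero once the margin reaches 1, linear penalty below that."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic regression loss (binomial deviance): log(1 + exp(-margin)).

    Written in a numerically stable form so that very negative margins
    do not overflow math.exp.
    """
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Both losses decrease as the margin grows; hinge is exactly zero past 1,
    # while the logistic loss only approaches zero.
    for m in (-2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
        print(f"margin={m:5.1f}  hinge={hinge_loss(m):.4f}  "
              f"logistic={logistic_loss(m):.4f}")
```

Both losses penalize misclassified points roughly linearly and correctly classified points little or not at all. Seeing them side by side is what lets properties derived for one method be carried over to the other.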

  11. Quantitative Rigor for Science: Transfer From Medicine via Core Statistics to Big Bang. Statistics core is the hub that disseminates and unifies data analysis ideas. Critical mass engagement is needed to reap short term and long term returns.
      Family-wise error rate of statistical tests: • One test: 0.05 probability of a false positive • Fifty tests: about 0.92 probability of at least one false positive (1 − 0.95^50, assuming independent tests at the 0.05 level); need simultaneous inference (SI) • Thousand tests: SI too conservative; need the False Discovery Rate (FDR)
      Science Applications: Science publication on Big Bang while others still plow through plethora of data. “I … emphasize the symbiotic relationship … between the Statisticians and Astrophysicists …. It is now … clear that there are core common problems …” Bob Nichol (CMU Physics) • Miller, CJ; Genovese, C; Nichol, RC; et al. Controlling the false-discovery rate in astrophysical data analysis. ASTRONOMICAL JOURNAL, 122 (6): 3492-3505 DEC 2001 • Miller, CJ; Nichol, RC; Batuski, DJ. Acoustic oscillations in the early universe and today. SCIENCE, 292 (5525): 2302-2303 JUN 22 2001
      “Interdisciplinary” “Decision-making in the face of uncertainty” Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century.
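The family-wise error rates on this slide follow from 1 − (1 − α)^m for m independent tests at level α; for α = 0.05 and m = 50 this is approximately 0.92. The sketch below computes that quantity and also illustrates false discovery rate control using the Benjamini-Hochberg step-up procedure. The slide does not say which FDR procedure it has in mind, so that choice, the function names, and the example p-values are assumptions made for illustration.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """P(at least one false positive) for independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: sort the p-values, find the largest rank k with
    p_(k) <= q * k / m, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    for m in (1, 50, 1000):
        print(m, round(familywise_error_rate(m), 3))

    # Hypothetical p-values, for illustration only.
    pvals = [0.041, 0.001, 0.216, 0.008, 0.039, 0.06, 0.074, 0.205, 0.212, 0.042]
    print(benjamini_hochberg(pvals, q=0.05))
```

With a thousand tests the family-wise error rate is essentially 1, and corrections that control it become very conservative. FDR procedures instead bound the expected fraction of false positives among the rejections, which is why the slide turns to them for large numbers of tests.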

  12. Engage Core Statistics for OASCR Goals • A gap exists between statistics research and simulation science • Engage statistics with leadership computing • Engage statistics with simulation science data • Engage statistics with Office of Science experimental data (neutron science)
      [Diagram labels: Science Applications; Statistics Core; Computational Chemistry Astrophysics Simulation Tuning Leadership Facilities Ontologies for Energy Climate Simulation Fusion Simulation Combustion Simulation Superscalable Algorithms Neutron Science Genome Science]
