A Toolkit for Statistical Data Analysis

B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo
CHEP 2004, Interlaken, 26-30 September 2004
http://www.ge.infn.it/geant4/analysis/HEPstatistics
Work supported and partially funded by the European Space Agency (ESA) under Contract No. 16339/02/NL/FM

A project to develop a statistical analysis system
The project: provide tools for the statistical comparison of distributions (Goodness-of-Fit tests)
Typical use cases in HEP:
• Regression testing
  • Throughout the software life-cycle
• Online DAQ
  • Monitoring detector behaviour w.r.t. a reference
• Simulation validation
  • Comparison with experimental data
• Reconstruction
  • Comparison of reconstructed vs. expected distributions
• Physics analysis
  • Comparisons of experimental distributions (ATLAS vs. CMS Higgs?)
  • Comparison with theoretical distributions (data vs. Standard Model)

Software tools
• Commercial products used by “professional” statisticians
  • SPSS, NCSS...
In HEP:
• A lot of activity:
  • workshops/conferences (CERN, Durham, SLAC etc.)
  • books (F. James et al., L. Lyons, R. Barlow etc.)
  • sophisticated statistical algorithms applied in various data analyses
• ...but, in spite of the relevant role played by statistics in HEP, very limited availability of software tools for statistics in our field
  • and in open-source software in general

We need it, let’s do the work ourselves...
A project to develop an open-source software system for statistical analysis
• Provide tools for the statistical comparison of distributions
• Create a hub to aggregate expertise and collaborative contributions from scientists interested in statistical methods

Vision: the basics
• Have a vision for the project
  • General purpose tool for statistical analysis
  • Toolkit approach (choice open to users)
  • Open source product
• Clearly define scope, objectives
• Software quality
  • Flexible, extensible, maintainable system
• Build on a solid architecture
• Rigorous software process

Architectural guidelines
• The project adopts a solid architectural approach
  • to offer the functionality and the quality needed by the users
  • to be maintainable over a long time scale
  • to be extensible, to accommodate future evolutions of the requirements
• Component-based architecture
  • to facilitate re-use and integration in diverse frameworks
• Dependencies
  • adopt a standard (AIDA) for the user layer
  • no dependence on any specific analysis tool
• Python
  • the “glue” for interactivity
• The approach adopted is compatible with the recommendations of the LCG Architecture Blueprint Report

Software process
• Unified Software Development Process, specifically tailored to the project
  • practical guidance and tools from the RUP
  • both rigorous and lightweight
  • mapping onto ISO 15504
• Significant experience gained in the group from other projects
• Incremental and iterative life-cycle model

User Requirements
• User requirements elicited, analysed and formally specified
  • Functional (capability) and non-functional (constraint) requirements
  • User Requirements Document available from the web site
• Requirement traceability
  • Requirements
  • Design
  • Implementation
  • Test & test results
  • Documentation

Simple user layer
• Shields the user from the complexity of the underlying algorithms and design
• The user only deals with AIDA objects and the choice of comparison algorithm

GoF algorithms
• Algorithms for binned distributions
  • Anderson-Darling Test
  • Chi-squared Test
  • Fisz-Cramer-von Mises Test
  • Tiku Test (Cramer-von Mises test in chi-squared approximation)
• Algorithms for unbinned distributions
  • Anderson-Darling Test
  • Fisz-Cramer-von Mises Test
  • Goodman Test (Kolmogorov-Smirnov test in chi-squared approximation)
  • Kolmogorov-Smirnov Test
  • Kuiper Test
  • Tiku Test (Cramer-von Mises test in chi-squared approximation)

Chi-squared test
• Applies to binned distributions
• It can also be used for unbinned distributions, but the data must first be grouped into classes
• Cannot be applied if the expected (theoretical) frequency in any class is < 5
• When a class falls below this threshold, one can try merging contiguous classes until the minimum theoretical frequency is reached
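
The merging of contiguous classes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the toolkit's own code: the function names `merge_sparse_bins` and `chi_squared`, the left-to-right merging order, and the example counts are all hypothetical.

```python
from typing import List, Sequence, Tuple


def merge_sparse_bins(
    observed: Sequence[float],
    expected: Sequence[float],
    min_expected: float = 5.0,
) -> Tuple[List[float], List[float]]:
    """Merge contiguous bins, left to right, until each has expected >= min_expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    merged_obs: List[float] = []
    merged_exp: List[float] = []
    acc_obs = acc_exp = 0.0
    for obs, exp in zip(observed, expected):
        acc_obs += obs
        acc_exp += exp
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # A sparse remainder at the right edge is folded into the last merged bin.
    if acc_exp > 0.0 or acc_obs > 0.0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi_squared(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Pearson statistic: sum over bins of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: the two outer bins on each side are too sparse on their own.
obs, exp = merge_sparse_bins([1, 3, 9, 12, 4, 1], [2, 4, 8, 10, 4, 2])
print(obs, exp, chi_squared(obs, exp))
```

After merging, the number of degrees of freedom must be counted from the merged classes, not from the original binning.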

Tests based on a supremum statistic
Unbinned distributions
• Kolmogorov-Smirnov Test
• Goodman approximation of KS Test
• Kuiper Test
[Figure: original distributions and their empirical distribution functions, with the supremum distance Dmn marked]
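
As an illustration of what a supremum statistic measures, the sketch below computes the two-sample Kolmogorov-Smirnov distance Dmn (the largest absolute gap between the two empirical distribution functions) and the Kuiper statistic (the sum of the largest positive and largest negative gaps). The function name `supremum_statistics` and the sample values are hypothetical; this is not the toolkit's implementation.

```python
from typing import Sequence, Tuple


def supremum_statistics(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (D, V): two-sample Kolmogorov-Smirnov and Kuiper statistics."""
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d_plus = d_minus = 0.0  # largest F_m - G_n and largest G_n - F_m seen so far
    while i < m or j < n:
        # Next point of the pooled sample.
        if j >= n or (i < m and xs[i] <= ys[j]):
            point = xs[i]
        else:
            point = ys[j]
        # Step both empirical distribution functions past this point (handles ties).
        while i < m and xs[i] == point:
            i += 1
        while j < n and ys[j] == point:
            j += 1
        diff = i / m - j / n  # F_m(point) - G_n(point)
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return max(d_plus, d_minus), d_plus + d_minus


d, v = supremum_statistics([1.0, 4.0], [2.0, 3.0])
print(d, v)
```

In this small example the two samples differ in spread rather than location, so the Kuiper statistic, which adds the gaps of both signs, is larger than the Kolmogorov-Smirnov distance.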

Tests containing a weighting function
Unbinned distributions
• Cramer-von Mises Test
• Anderson-Darling Test
Binned distributions
• Fisz-Cramer-von Mises Test
• k-sample Anderson-Darling Test
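
Tests of this family accumulate the squared distance between the two empirical distribution functions over the whole pooled sample instead of keeping only the largest gap. A minimal sketch of the two-sample Cramer-von Mises statistic in its usual form, T = mn/(m+n)² · Σ [F_m(z) − G_n(z)]² summed over all pooled observations z, is given below. It assumes samples without ties; the function name `cramer_von_mises_2samp` is hypothetical and the code is not taken from the toolkit. The Anderson-Darling statistic follows the same pattern but divides each squared term by a weight that emphasises the tails.

```python
from bisect import bisect_right
from typing import Sequence


def cramer_von_mises_2samp(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample Cramer-von Mises statistic (sketch, assumes no tied values)."""
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    total = 0.0
    for point in xs + ys:  # every observation of the pooled sample
        f = bisect_right(xs, point) / m  # empirical distribution function of x
        g = bisect_right(ys, point) / n  # empirical distribution function of y
        total += (f - g) ** 2
    return m * n / (m + n) ** 2 * total


print(cramer_von_mises_2samp([1.0, 4.0], [2.0, 3.0]))
```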

Test Power Characteristics
Comparative evaluation of tests
• Topic still subject to research activity in the domain of statistics
• More about a comparative evaluation of tests in the User Documentation on our web site

Power of tests
The power of a test is the probability of correctly rejecting the null hypothesis, i.e. of rejecting it when it is false
In terms of power: χ² < supremum statistics tests < tests containing a weight function
• χ² loses information in a test for unbinned distributions by grouping the data into cells
• Kac, Kiefer and Wolfowitz (1955) showed that the Kolmogorov-Smirnov test requires n^(4/5) observations compared to n observations for χ² to attain the same power
• Cramer-von Mises and Anderson-Darling statistics are expected to be superior to Kolmogorov-Smirnov’s, since they compare the two distributions all along the range of x, rather than looking for a marked difference at one point
Talk at IEEE NSS, Rome, 16-22 October 2004 + paper submitted for publication November 2004
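
The definition of power can be made concrete with a small Monte Carlo estimate: generate many pairs of samples, apply a test, and count how often the null hypothesis is rejected. The sketch below does this for the two-sample Kolmogorov-Smirnov test only. Everything specific in it is an assumption made for illustration and is not the comparative study mentioned in the talk: Gaussian samples, a shift of one standard deviation as the alternative, 50 observations per sample, and the large-sample critical value sqrt(−ln(α/2)/2) · sqrt((m+n)/(mn)).

```python
import math
import random
from bisect import bisect_right
from typing import Sequence


def ks_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in xs + ys
    )


def estimated_rejection_rate(
    shift: float,
    size: int = 50,
    trials: int = 300,
    alpha: float = 0.05,
    seed: int = 1,
) -> float:
    """Fraction of trials in which the KS test rejects 'same distribution'."""
    rng = random.Random(seed)
    # Large-sample critical value for two samples of equal size (an approximation).
    threshold = math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt(2.0 / size)
    rejected = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(size)]
        b = [rng.gauss(shift, 1.0) for _ in range(size)]
        if ks_distance(a, b) > threshold:
            rejected += 1
    return rejected / trials


# With shift = 0 the null hypothesis is true, so this estimates the false-rejection rate.
false_rejection_rate = estimated_rejection_rate(shift=0.0)
# With shift = 1 the null hypothesis is false, so this estimates the power.
power = estimated_rejection_rate(shift=1.0)
print(false_rejection_rate, power)
```

Replacing `ks_distance` and its critical value with another statistic would give a power estimate for that test under the same alternative, which is the kind of comparison the ordering above summarises.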

Months Unit test: 2 Test from PICCOLO BOOK (STATISTICS - page 711) Exact p-value = 0.200758 Expected p-value = 0.200757 2 test-statistics = 15.8 Expected 2 = 15.8 Binned data Test from CRAMER BOOK (MATHEMATICAL METHODS OF STATISTICS - page 447) Exact p-value = 0 Expected p-value = 0 2 test-statistics = 123.203 Expected 2 = 123.203

Unit test: K-S Goodman
Test from PICCOLO BOOK (STATISTICS - page 711)
• χ² test statistic = 3.9, expected χ² = 3.9
• Exact p-value = 0.140974, expected p-value = 0.140991
Test from LANDENNA BOOK (NONPARAMETRIC TESTS BASED ON FREQUENCIES - page 287)
• χ² test statistic = 1.5, expected χ² = 1.5
• Exact p-value = 0.472367, expected p-value = 0.472367
[Figure: cumulative functions; axis labels: Months, Body lengths]
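
The slide does not spell out the Goodman approximation, so the sketch below assumes its standard form: the Kolmogorov-Smirnov distance D between samples of sizes m and n is converted into χ² = 4 D² mn/(m+n), which is compared with a χ² distribution with 2 degrees of freedom. For 2 degrees of freedom the p-value has the closed form exp(−χ²/2). The function names are hypothetical. Under this assumption, χ² = 1.5 gives exp(−0.75) ≈ 0.472367, in agreement with the second test quoted above.

```python
import math


def goodman_chi_squared(d: float, m: int, n: int) -> float:
    """Goodman approximation: chi-squared value from the KS distance d (assumed standard form)."""
    return 4.0 * d * d * m * n / (m + n)


def chi_squared_p_value_2dof(chi2: float) -> float:
    """Upper-tail probability of a chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-0.5 * chi2)


# Check against the value quoted for the Landenna test case (chi-squared = 1.5).
p_landenna = chi_squared_p_value_2dof(1.5)
print(p_landenna)

# Hypothetical example: D = 0.5 between two samples of 10 observations each.
chi2_example = goodman_chi_squared(0.5, 10, 10)
print(chi2_example, chi_squared_p_value_2dof(chi2_example))
```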

Unit test: Kolmogorov-Smirnov
Test from LANDENNA BOOK (NONPARAMETRIC TESTS BASED ON FREQUENCIES - page 318-325)
• D test statistic = 0.65, expected D = 0.65
• Exact p-value = 2 × 10^-19, expected p-value = 8 × 10^-19
[Figure: cumulative distributions]
…this is just a sample of the test process and results!

GPL License Feedback from users is welcome!

User Documentation • Download • Installation • User Guide • Statistics Reference Guide

Example of application results
• Validation of Geant4 physics models w.r.t. NIST reference: photon attenuation coefficient, Al (NIST, Geant4 Standard, Geant4 LowE)
• ESA Bepi Colombo mission to Mercury, test beam at Bessy: fluorescence spectrum from Icelandic basalt (Mars-like rock), experimental data and simulation
• Dosimetry at IST Cancer Inst.: Monte Carlo and experimental data
Test results quoted on the plots:
• Chi-squared Test: χ²(N-L) = 13.1, ν = 20, p = 0.87; χ²(N-S) = 23.2, ν = 15, p = 0.08
• Anderson-Darling Test: Ac (95%) = 0.752
• Kolmogorov-Smirnov Test

A toolkit for modeling multi-parametric fit problems
F. Fabozzi, L. Lista, INFN Napoli
• Initially developed while rewriting a FORTRAN fitter for a BaBar analysis
  • Simultaneous estimate of:
    • B(B → J/ψ π) / B(B → J/ψ K)
    • direct CP asymmetry
  • More control over the code was needed to explain a bias that appeared in the original fitter
• New components included in the Statistical Toolkit: Toy Monte Carlo, PDF modelling, Max Likelihood Fits
• Architecture open to extension and evolution

Feel free to contact us!

Conclusions
• A project to develop an open source, general purpose software toolkit for statistical data analysis is in progress
  • to provide a product of common interest to user communities
• Rigorous software process
  • to contribute to the quality of the product
• Component-based architecture, OO methods + generic programming
  • to ensure openness to evolution, maintainability, ease of use
• GoF component
• Component for modeling multi-parametric fit problems
• Software released and application results available
  • toolkit in use for Geant4 physics validation and in experiments
  • paper published in IEEE Trans. Nucl. Sci., 3 October 2004
Thanks to Fred James (CERN) and Louis Lyons (Oxford) for many useful suggestions, discussions, encouragement.
