Application of statistical methods for the comparison of data distributions
Susanna Guatelli, Barbara Mascialino, Andreas Pfeiffer, Maria Grazia Pia, Alberto Ribon, Paolo Viarengo
Outline
• The comparison of two data distributions is fundamental in experimental practice:
  - Detector monitoring (current versus reference data)
  - Simulation validation (experiment versus simulation)
  - Reconstruction versus expectation
  - Regression testing (two versions of the same software)
  - Physics analysis (measurement versus theory, experiment A versus experiment B)
• Many algorithms are available for the comparison of two data distributions (the two-sample problem):
  - Parametric statistics
  - Non-parametric statistics (Goodness-of-Fit testing)
Aim of this study: compare the algorithms available in the statistics literature and select the most appropriate one for each specific case
The two-sample problem
EXAMPLE 1 (binned data): X-ray fluorescence spectrum
EXAMPLE 2 (unbinned data): dosimetric distribution from a medical LINAC
Which is the most suitable goodness-of-fit test?
Chi-squared test
• Applies to binned distributions
• It can also be useful for unbinned distributions, but the data must first be grouped into classes
• Cannot be applied if the theoretical (expected) frequency in any class is < 5
• When this requirement is not met, one can merge contiguous classes until the minimum theoretical frequency is reached
• Otherwise one can use Yates' formula
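The merging of contiguous classes described above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the function names `merge_classes` and `chi_squared`, the left-to-right merging order, and the example counts are all assumptions made for the sketch.

```python
def merge_classes(observed, expected, min_expected=5.0):
    """Merge contiguous classes, left to right, until every merged class
    has a theoretical (expected) frequency of at least min_expected.
    Assumes the total expected frequency is positive."""
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # Fold any leftover tail (still below the minimum) into the last class.
    if acc_exp > 0.0 or acc_obs > 0.0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: the outer classes have expected frequencies below 5.
observed = [1, 3, 8, 12, 9, 4, 2, 1]
expected = [2.0, 4.0, 7.0, 11.0, 9.0, 4.0, 2.0, 1.0]

merged_obs, merged_exp = merge_classes(observed, expected)
chi2 = chi_squared(merged_obs, merged_exp)
print(merged_obs, merged_exp, chi2)
```

Merging reduces the number of classes, so the number of degrees of freedom used to turn the statistic into a p-value has to be reduced accordingly.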
Tests based on the supremum statistics (unbinned distributions)
• Kolmogorov-Smirnov test
• Goodman approximation of KS test
• Kuiper test
Scheme: ORIGINAL DISTRIBUTIONS → EMPIRICAL DISTRIBUTION FUNCTION → SUPREMUM STATISTICS (Dmn)
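As a rough sketch of how supremum statistics are obtained from two empirical distribution functions, the snippet below computes the Kolmogorov-Smirnov distance (the largest absolute difference) and the Kuiper statistic (the largest positive plus the largest negative difference). The function names and the sample values are illustrative assumptions, and no p-value is computed.

```python
from bisect import bisect_right


def ecdf_differences(sample1, sample2):
    """Return F1(x) - F2(x) at every pooled observation x, where F1 and F2
    are the empirical distribution functions of the two samples."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n, m = len(s1), len(s2)
    return [bisect_right(s1, x) / n - bisect_right(s2, x) / m
            for x in s1 + s2]


def ks_statistic(sample1, sample2):
    """Kolmogorov-Smirnov distance: sup |F1 - F2|."""
    return max(abs(d) for d in ecdf_differences(sample1, sample2))


def kuiper_statistic(sample1, sample2):
    """Kuiper statistic: sup (F1 - F2) + sup (F2 - F1)."""
    diffs = ecdf_differences(sample1, sample2)
    return max(diffs) + max(-d for d in diffs)


# Hypothetical unbinned samples of different sizes.
sample_a = [0.1, 0.4, 0.7, 1.2]
sample_b = [0.05, 0.5, 0.9, 1.5, 2.0]

d_ks = ks_statistic(sample_a, sample_b)
v_kuiper = kuiper_statistic(sample_a, sample_b)
print(d_ks, v_kuiper)
```

Because both empirical distribution functions are step functions that change only at observed values, evaluating the difference at the pooled observations is enough to find the suprema.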
Tests containing a weighting function (binned/unbinned distributions)
• Fisz-Cramér-von Mises test
• k-sample Anderson-Darling test
Scheme: ORIGINAL DISTRIBUTIONS → EMPIRICAL DISTRIBUTION FUNCTION → QUADRATIC STATISTICS + WEIGHTING FUNCTION (sum/integral of all the distances)
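A possible way to see the role of the weighting function is to write both statistics as the same weighted sum of squared distances between the two empirical distribution functions. The sketch below assumes two samples without ties; the function name `quadratic_statistic` and the sample values are hypothetical. A constant weight gives the two-sample Cramér-von Mises statistic, while the weight 1/(H(1-H)), with H the pooled empirical distribution function, gives the two-sample Anderson-Darling statistic.

```python
from bisect import bisect_right


def quadratic_statistic(sample1, sample2, weight):
    """Weighted sum of squared EDF distances over the pooled observations:
    (n*m / (n+m)^2) * sum_i weight(H(x_i)) * (F1(x_i) - F2(x_i))^2.
    Assumes continuous data (no ties)."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n, m = len(s1), len(s2)
    pooled = sorted(s1 + s2)
    total = 0.0
    for x in pooled:
        h = bisect_right(pooled, x) / (n + m)   # pooled EDF
        if h >= 1.0:
            continue  # last point: F1 = F2 = 1, the term vanishes
        f1 = bisect_right(s1, x) / n
        f2 = bisect_right(s2, x) / m
        total += weight(h) * (f1 - f2) ** 2
    return n * m / (n + m) ** 2 * total


def cramer_von_mises(sample1, sample2):
    """Constant weight: every distance counts the same."""
    return quadratic_statistic(sample1, sample2, lambda h: 1.0)


def anderson_darling(sample1, sample2):
    """Weight 1 / (H (1 - H)): distances in the tails count more."""
    return quadratic_statistic(sample1, sample2,
                               lambda h: 1.0 / (h * (1.0 - h)))


# Hypothetical unbinned samples.
sample_a = [0.1, 0.4, 0.7, 1.2]
sample_b = [0.05, 0.5, 0.9, 1.5, 2.0]

cvm = cramer_von_mises(sample_a, sample_b)
ad = anderson_darling(sample_a, sample_b)
print(cvm, ad)
```

The Anderson-Darling weight grows without bound as H approaches 0 or 1, which is why this test reacts more strongly to differences in the tails of the distributions.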
G.A.P. Cirrone, S. Donadio, S. Guatelli, A. Mantero, B. Mascialino, S. Parlati, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo, “A Goodness-of-Fit Statistical Toolkit”, IEEE Transactions on Nuclear Science 51 (5), October 2004.
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Power evaluation
The power of a test is the probability of correctly rejecting the null hypothesis.
Pseudoexperiment: a random drawing of two samples (Sample 1 of size n, Sample 2 of size m) from two parent distributions (Parent distribution 1, Parent distribution 2), to which the GoF test is applied
N = 1000 Monte Carlo replications
Confidence Level (CL) = 0.05
Power = (# pseudoexperiments with p-value < (1 - CL)) / (# pseudoexperiments)
For each test, the p-value computed by the GoF Toolkit derives from an analytical calculation of the asymptotic distribution, often depending on the sample sizes.
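The pseudoexperiment procedure can be sketched as a small Monte Carlo loop. This is only an illustration under stated assumptions: it uses a two-sample Kolmogorov-Smirnov test with the standard asymptotic Kolmogorov series for the p-value (not the GoF Toolkit code), it assumes the null hypothesis is rejected when the p-value falls below a threshold `alpha` of 0.05, and the names `ks_two_sample` and `estimate_power`, the parent distributions, and the seed are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(sample1, sample2):
    """Return (D, p) for the two-sample KS test, with p taken from the
    asymptotic Kolmogorov distribution."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n, m = len(s1), len(s2)
    d = max(abs(bisect_right(s1, x) / n - bisect_right(s2, x) / m)
            for x in s1 + s2)
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return d, 1.0  # the p-value is numerically 1 in this range
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def estimate_power(draw1, draw2, n, m, alpha=0.05,
                   replications=1000, seed=12345):
    """Fraction of pseudoexperiments in which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        sample1 = [draw1(rng) for _ in range(n)]
        sample2 = [draw2(rng) for _ in range(m)]
        _, p_value = ks_two_sample(sample1, sample2)
        if p_value < alpha:
            rejections += 1
    return rejections / replications


def standard_gaussian(rng):
    return rng.gauss(0.0, 1.0)


def shifted_gaussian(rng):
    return rng.gauss(1.0, 1.0)


# Same parent: the rejection rate should stay near or below alpha.
rate_same = estimate_power(standard_gaussian, standard_gaussian, 50, 50)
# Different parents (location shift): the rejection rate is the power.
power_shift = estimate_power(standard_gaussian, shifted_gaussian, 50, 50)
print(rate_same, power_shift)
```

Running the same loop with identical parents is a useful sanity check, since the rejection rate then estimates the false-positive rate of the test rather than its power.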
Parent distributions
• Gaussian
• Uniform
• Double exponential
• Cauchy
• Exponential
• Contaminated Normal Distribution 1
• Contaminated Normal Distribution 2
Skewness and tailweight
[Figure panels: Skewness; Tailweight]
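The formulas on this slide are not preserved in the text. One plausible reading, stated here as an assumption, is a pair of quantile-based measures: S = (x0.975 - x0.5) / (x0.5 - x0.025) and T = (x0.975 - x0.025) / (x0.875 - x0.125). These definitions give S = 1 for symmetric distributions and T = 2.161 for the double exponential, which agrees with the values quoted later in the slides. The function names below are hypothetical.

```python
import math


def skewness_tailweight(quantile):
    """Quantile-based skewness S and tailweight T (assumed definitions).
    `quantile` maps a probability p in (0, 1) to the p-quantile."""
    x025, x125, x500, x875, x975 = (
        quantile(p) for p in (0.025, 0.125, 0.5, 0.875, 0.975))
    skewness = (x975 - x500) / (x500 - x025)
    tailweight = (x975 - x025) / (x875 - x125)
    return skewness, tailweight


def laplace_quantile(p):
    """Quantile function of the standard double exponential distribution."""
    return math.log(2.0 * p) if p < 0.5 else -math.log(2.0 * (1.0 - p))


def exponential_quantile(p):
    """Quantile function of the unit-rate exponential distribution."""
    return -math.log(1.0 - p)


s_laplace, t_laplace = skewness_tailweight(laplace_quantile)
s_expo, t_expo = skewness_tailweight(exponential_quantile)
print(round(s_laplace, 3), round(t_laplace, 3))  # symmetric: S = 1, T = 2.161
print(s_expo, t_expo)                            # skewed to the right: S > 1
```

For data samples, the same function can be fed with empirical quantiles instead of an analytical quantile function.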
Case Parent1 = Parent2: the “location-scale problem”
Kolmogorov-Smirnov test, CL = 0.05
Power increases as a function of the sample size (analytical calculation of the asymptotic distribution)
[Plot: power versus sample size N, from small sized to moderate sized samples, for the Uniform, Normal, Exponential, Double Exponential, Contaminated Normal 1, Contaminated Normal 2 and Cauchy parent distributions]
Case Parent1 ≠ Parent2: the “general shape problem”
CL = 0.05
A) Symmetric versus symmetric (S1 = S2 = 1)
Distribution 1: Double exponential (T1 = 2.161)
[Plot: power versus tailweight T2 of Distribution 2, for the Kolmogorov-Smirnov (KS), Cramér-von Mises (CVM) and Anderson-Darling (AD) tests]
For short-medium tailed distributions: KS ~ CVM < AD
For very long tailed distributions: KS ~ CVM ~ AD
B) Skewed versus symmetric
Comparative evaluation of tests (as a function of skewness and tailweight)
Chi-squared < Supremum statistics tests < Tests containing a weight function
Results for the data examples
EXAMPLE 1: binned data
X-variable: Ŝ = 4, T̂ = 1.43
Y-variable: Ŝ = 4, T̂ = 1.50
Extremely skewed, medium tail
ANDERSON-DARLING TEST: A2 = 0.085, p > 0.05
EXAMPLE 2: unbinned data
X-variable: Ŝ = 1.53, T̂ = 1.36
Y-variable: Ŝ = 1.27, T̂ = 1.34
Moderately skewed, medium tail
KOLMOGOROV-SMIRNOV TEST: D = 0.27, p > 0.05
Conclusions
• Studied several goodness-of-fit tests for location-scale alternatives and general alternatives
• There is no clear winner for all the considered distributions in general
• To select one test in practice:
  1. first classify the type of the distributions in terms of skewness S and tailweight T
  2. choose the most appropriate test for the classified type of distribution
Topic still subject to research activity in the domain of statistics