Statistical Toolkit
S. Donadio, B. Mascialino
July 2nd, 2003
Status of algorithms
• Chi2 (binned distributions)
• Chi2 (curves – sets of points)
• Kolmogorov-Smirnov-Goodman
• Kolmogorov-Smirnov
• Cramer-von Mises (binned)
• Cramer-von Mises (unbinned)
• Anderson-Darling (binned)
• Anderson-Darling (unbinned)
• Kuiper
Status of Quality Checkers
• Chi2
• Kolmogorov-Smirnov-Goodman
• Kolmogorov-Smirnov
• Cramer-von Mises
• Anderson-Darling
• Kuiper
Last algorithm (still to be added)
The Lilliefors test is similar to the Kolmogorov-Smirnov test, but it is based on the null hypothesis that the continuous random variable is distributed as a normal N(μ, σ²), where μ and σ² are unknown. In practice, since the parameters are unknown, the researcher must estimate them from the sample itself (x1, x2, ..., xn), which makes it possible to study the standardized sample (z1, z2, ..., zn). The test is performed by comparing the empirical distribution function FO of (z1, z2, ..., zn) with that of the standard normal distribution, Φ(z):
D* = sup |FO(z) – Φ(z)|
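As a minimal sketch (not part of the toolkit), the statistic above can be computed by standardizing the sample with the estimated mean and standard deviation and taking the sup distance between the empirical CDF and Φ; function and variable names here are illustrative.

```python
import numpy as np
from scipy.stats import norm

def lilliefors_statistic(x):
    """Lilliefors test statistic D*: the KS distance between the ECDF of
    the standardized sample and the standard normal CDF, with mu and
    sigma estimated from the sample itself."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Estimate the unknown parameters from the sample.
    mu = x.mean()
    sigma = x.std(ddof=1)
    z = np.sort((x - mu) / sigma)
    cdf = norm.cdf(z)
    # Sup distance: check both one-sided gaps at each jump of the ECDF.
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)
d_star = lilliefors_statistic(sample)
```

Note that because μ and σ are estimated from the data, D* must be compared against Lilliefors' critical values, not the standard Kolmogorov-Smirnov ones.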
Lilliefors needs a theoretical function as input
TOOLKIT INPUT:
• binned distributions
• unbinned distributions
• theoretical distributions (test for normality, …)
[Diagram: DISTRIBUTION 1 and DISTRIBUTION 2 feed the TOOLKIT; Lilliefors instead compares one distribution against a THEORETICAL FUNCTION.]
New algorithm: Cramer-von Mises-Tiku
It approximates the Cramer-von Mises test statistic with a χ². It uses the χ² Quality Checker.
Tiku M.L., Chi-squared approximations for the distributions of goodness of fit UN2 and WN2, Biometrika, 52 (1965b), 630.
New algorithm: Kolmogorov-Smirnov (binned)
It allows the calculation of the Kolmogorov-Smirnov test statistic in the case of binned distributions. It uses a different quality checker (see Conover (1971), Gibbons and Chakraborti (1992)). We must still find it!
Treatment of uncertainties
We must decide how to treat errors inside the statistical toolkit. Distributions are entered as a pair of DataPointSets:
• Data
• Weight
The handling of Data and Weight in the computation of the test statistic differs between distributions on the one hand and curves or sets of points on the other.
An example
χ² = Σi (y1i – y2i)² / [(σ1i)² + (σ2i)²]
In the case of two distributions, χ² is computed using only the Weights. In the case of two curves or sets of points, the numerator involves the Data and the denominator uses the Weights. THIS COULD BE MISLEADING!
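A small sketch of the χ² formula above for the curves / sets-of-points case (the function name and call are illustrative, not the toolkit's API):

```python
import numpy as np

def chi2_two_samples(y1, y2, sigma1, sigma2):
    """Chi-squared statistic for two curves or sets of points:
    chi2 = sum_i (y1_i - y2_i)^2 / (sigma1_i^2 + sigma2_i^2)."""
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    s1, s2 = np.asarray(sigma1, float), np.asarray(sigma2, float)
    return np.sum((y1 - y2) ** 2 / (s1 ** 2 + s2 ** 2))

# Point-by-point: (10-11)^2/(1+1) = 0.5 and (12-10)^2/(1+1) = 2.0
chi2 = chi2_two_samples([10.0, 12.0], [11.0, 10.0], [1.0, 1.0], [1.0, 1.0])
```

Here the y values play the role of the "Data" and the σ values the role of the "Weights", which is exactly the naming mismatch the slide warns about.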
Data – Weights – Errors
So, in order to have a coherent language for all the algorithms, we should have:
• Data
• Weights
• Errors
Whenever errors are not necessary for the computation of the test statistic, we could fill them as a null vector.
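A hypothetical sketch of such an input container (the class and attribute names are illustrative only, not the toolkit's actual interface):

```python
import numpy as np

class InputDistribution:
    """Illustrative container: every input carries Data, Weights and
    Errors, so all algorithms share one coherent vocabulary."""
    def __init__(self, data, weights, errors=None):
        self.data = np.asarray(data, float)
        self.weights = np.asarray(weights, float)
        # When errors are not needed for a test statistic,
        # fill them as a null vector, as proposed above.
        if errors is None:
            errors = np.zeros_like(self.data)
        self.errors = np.asarray(errors, float)

d = InputDistribution(data=[1.0, 2.0, 3.0], weights=[5.0, 7.0, 2.0])
```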
Selecting data 1
Elimination of data points if n ≥ 30
CRITERION OF 3 SIGMA: if a point is 3 standard deviations away from the mean of the data points, there is only about a 0.3% probability of obtaining, in a single measurement, a value that far from the mean. We can choose to eliminate this data point.
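A minimal sketch of this criterion (function name and sample data are illustrative); note that with a small sample a single outlier inflates the standard deviation enough that it may never exceed 3σ, which is why the criterion is reserved for larger n:

```python
import numpy as np

def three_sigma_filter(x):
    """Discard points lying more than 3 standard deviations
    from the sample mean."""
    x = np.asarray(x, float)
    mu, sigma = x.mean(), x.std(ddof=1)
    keep = np.abs(x - mu) <= 3.0 * sigma
    return x[keep]

# Illustrative sample: 29 measurements near 10, plus one suspect value.
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=10.0, scale=0.5, size=29), 25.0)
clean = three_sigma_filter(data)
```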
Selecting data 2
Elimination of data points if n ≤ 10
CHAUVENET'S CRITERION: given n sample observations from a Gaussian distribution, we expect n′ of them to deviate from the mean by z_sus standard deviations or more, where
P(|z| ≥ z_sus) = n′/n, equivalently P(|z| < z_sus) = 1 – n′/n
"If n′ < 0.5, even one observation with this amount of error is unlikely. We can discard a data point if we expect less than half an event to be further from the mean than the suspect data point."
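A minimal sketch of Chauvenet's criterion as stated above (function name and sample values are illustrative): for each point compute the expected count n′ = n · P(|z| ≥ z_i) and flag it for rejection when n′ < 0.5.

```python
import numpy as np
from scipy.stats import norm

def chauvenet_reject(x):
    """Chauvenet's criterion: a point is a rejection candidate when the
    expected number of observations at least as far from the mean,
    n' = n * P(|z| >= z_i), falls below 0.5."""
    x = np.asarray(x, float)
    n = len(x)
    mu, sigma = x.mean(), x.std(ddof=1)
    z = np.abs(x - mu) / sigma
    # Two-sided tail probability times n gives the expected count n'.
    n_expected = n * 2.0 * (1.0 - norm.cdf(z))
    keep = n_expected >= 0.5
    return x[keep], x[~keep]

# Illustrative sample of n = 10 with one suspect point.
data = [10.2, 10.4, 9.8, 10.1, 10.3, 9.9, 9.7, 10.0, 10.2, 15.0]
kept, rejected = chauvenet_reject(data)
```

Unlike the 3-sigma rule, this criterion scales the rejection threshold with n, which is what makes it usable for small samples.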