Selecting the Right Distribution Using MKSE • Jerzy Wieczorek, Portland State University
My statistics master’s project • I’m reading… • Weber, Leemis, and Kincaid, 2006, “Minimum Kolmogorov-Smirnov test statistic parameter estimates.” Journal of Statistical Computation and Simulation, vol. 76, no. 3, 195–206. • …and extending it to right-censored data.
“Parameter estimates”? • Statisticians will tell you that your data sample is from some larger underlying population • μ, σ²: parameters of the population • x̄, s²: summary measures on the sample which estimate the parameters • Examples of distributions and their parameters: • Normal: mean, variance • Exponential: failure rate • Uniform: lower bound, upper bound
“Parameter estimates”? • Most common way to estimate parameters: Maximum Likelihood Estimators (MLEs). • For a given distribution, find values of the distribution parameters that maximize the likelihood of seeing your data sample. • My project: computational tool for finding Minimum Kolmogorov-Smirnov Estimates (MKSEs) of parameters.
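As a concrete case of maximizing a likelihood, take an exponential distribution written in terms of its mean θ (a parametrization chosen here for illustration; the failure rate is 1/θ). The maximization can be done by hand:

```latex
L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-x_i/\theta},
\qquad
\log L(\theta) = -n \log\theta - \frac{1}{\theta} \sum_{i=1}^{n} x_i,
\qquad
\frac{d}{d\theta} \log L(\theta)
  = -\frac{n}{\theta} + \frac{1}{\theta^{2}} \sum_{i=1}^{n} x_i = 0
\;\Longrightarrow\;
\hat{\theta} = \bar{x}.
```

So for this distribution the MLE of the mean is simply the sample mean, and the MLE of the failure rate is its reciprocal.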
“Kolmogorov-Smirnov”? • Photographs of Kolmogorov and Smirnov
“Kolmogorov-Smirnov”? • The Kolmogorov-Smirnov test compares the empirical CDF to a proposed population CDF, finds the maximum difference in heights, and tells you whether the data are consistent with the proposed CDF.
“Kolmogorov-Smirnov”? • MKSE: choose parameter values that give the lowest K-S test statistic (i.e., the smallest maximum vertical difference between CDFs).
“Right-censored”? • Monitoring failure times, but for some observations you only know “censoring time”: • Patients drop out of medical study before having a relapse • Equipment test run ends before all components have failed • (For MKSE on right-censored data, we can use the “Kaplan-Meier” estimator of the CDF.)
[Figure: MP 302.5, All Incident Free, 6 PM]
Current MKSE software output
Contributions of MKSFitter • Sanity check: Is proposed parametric model reasonable, or do others have a far better fit? • Easy “black box” parameter estimation tool when other estimators have no closed form or are cumbersome to calculate
Strengths and weaknesses • Most useful when fit of CDF is the most important consideration, e.g., for simulation • Performance comparable to MLE… • …but then again, performs no better than MLE • Does not provide standard error estimates • Requires optimization algorithm parameters to be prescribed
Next steps • Embed C code directly within R in order to… • Make it easier to use! • Test performance on censored data • Estimate standard error via bootstrapping • Make use of other CDF estimates available in R • Publish MKSFitter R package
Resources • Weber, Leemis, and Kincaid, 2006, “Minimum Kolmogorov-Smirnov test statistic parameter estimates.” Journal of Statistical Computation and Simulation, vol. 76, no. 3, 195–206. • Meead’s rainy-day speed data from PORTAL • Photographs: • http://en.wikipedia.org/wiki/Andrey_Kolmogorov • http://en.wikipedia.org/wiki/Vladimir_Ivanovich_Smirnov_(mathematician)
Review of the K-S statistic • Let X1, …, Xn be an i.i.d. random sample from a continuous distribution, and let F(x) be the CDF of some continuous distribution. • We wish to test the following: • H0: the Xi are from the distribution F(x) • HA: the Xi come from some other distribution • Construct the empirical distribution function • Fn(x) = (number of sample X’s ≤ x) / n
Review of the K-S statistic • The one-sample K-S statistic is • Dn = sup over all x of |Fn(x) – F(x)| • If Dn is “large enough,” reject H0 in favor of HA; otherwise do not reject H0. • In other words, small Dn → proposed distribution is a good fit.
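The statistic reviewed above can be computed directly from the sorted sample. The following is a minimal sketch, not the MKSFitter code: the function names `ks_statistic` and `exponential_cdf` and the sample values are made up for illustration. Because Fn is a step function, the largest gap must occur at a data point, either just before or just after the step.

```python
import math

def ks_statistic(sample, cdf):
    """One-sample K-S statistic: largest gap between the empirical CDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # so check the gap on both sides of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def exponential_cdf(mean):
    """Return the CDF of an exponential distribution with the given mean."""
    return lambda x: 1.0 - math.exp(-x / mean) if x > 0 else 0.0

# Made-up sample, for illustration only.
sample = [0.2, 0.5, 1.1, 1.9, 3.0]
d = ks_statistic(sample, exponential_cdf(1.0))
print(f"D_n = {d:.4f}")
```

Any continuous CDF can be passed in as `cdf`, so the same function serves every candidate distribution.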
Minimum K-S Estimation (MKSE) • Calculate Fn(x) from your observations. • For a given set of parameter values for F(x), calculate and save the K-S statistic. • Repeat for many different sets of parameter values. The MKSE is the set of parameter values for which the K-S statistic is lowest.
One-parameter example • Using Lieblein and Zelen (1956) dataset, 23 observations of ball-bearing failure time (in millions of revolutions): • 17.88 28.92 33.00 41.52 42.12 45.60 48.48 51.84 • 51.96 54.12 55.56 67.80 68.64 68.64 68.88 84.12 • 93.12 98.64 105.12 105.84 127.92 128.04 173.40 • Fit this to exponential distribution, which has a single parameter.
One-parameter example • MLE of the mean: 72.22 • MKSE of the mean: 96.10
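The one-parameter example can be checked with a brute-force search. This is a sketch under stated assumptions, not the MKSFitter implementation: the exponential is parametrized by its mean, the search range (20 to 200) and step (0.01) are arbitrary choices, and `ks_exponential` is a hypothetical helper name.

```python
import math

# Ball-bearing failure times (millions of revolutions), as listed in the example.
data = [17.88, 28.92, 33.00, 41.52, 42.12, 45.60, 48.48, 51.84,
        51.96, 54.12, 55.56, 67.80, 68.64, 68.64, 68.88, 84.12,
        93.12, 98.64, 105.12, 105.84, 127.92, 128.04, 173.40]

def ks_exponential(sample, mean):
    """K-S statistic of the sample against an exponential CDF with this mean."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = 1.0 - math.exp(-x / mean)
        # Check the gap just after and just before the step at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# MLE of the exponential mean is the sample mean.
mle_mean = sum(data) / len(data)

# MKSE by grid search: try many candidate means, keep the one with the lowest D.
# Assumed grid: 20.00, 20.01, ..., 200.00.
candidates = [20.0 + 0.01 * k for k in range(18001)]
mkse_mean = min(candidates, key=lambda m: ks_exponential(data, m))

print(f"MLE mean  = {mle_mean:.2f}, D = {ks_exponential(data, mle_mean):.4f}")
print(f"MKSE mean = {mkse_mean:.2f}, D = {ks_exponential(data, mkse_mean):.4f}")
```

A grid is used here only because it is easy to reason about; with a single parameter any one-dimensional minimizer would do, and the grid resolution limits the precision of the estimate.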
Multiple-parameter case • Use “Bell-Curve Based” (BCB) evolutionary algorithm to minimize K-S statistic over a range of parameter values in two (or more) dimensions. • BCB has been shown to perform well at finding global optimum in multidimensional problems with many local optima.
BCB algorithm • Optimize over k-dimensional parameter space. • For each of 100 generations: • Set 25 best points as “parents.” • Create 25 “children” at normally-distributed distances from weighted means of pairs of parents, with z ~ N(0, 1) and r ~ N(0, 4). • [Diagram: parents P1 and P2, their weighted mean M, offsets z and r, and child C1,2]
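The generation loop described above can be illustrated with a simplified evolutionary search. This is only a loose sketch in the spirit of the slide, not the published BCB algorithm or the MKSFitter code. Several details are assumptions: N(0, 4) is read as variance 4, the r offset is applied in a random direction rather than constructed geometrically, offsets are scaled by the distance between the parents, the hypothetical `radial_scale` factor damps the r offset, parents survive into the next generation, and the objective is assumed non-negative (as a K-S statistic is).

```python
import math
import random

def evolutionary_minimize(objective, lower, upper, generations=100,
                          n_parents=25, n_children=25,
                          radial_scale=0.25, seed=0):
    """Minimize a non-negative objective over a box with a simple parent/child loop."""
    rng = random.Random(seed)
    k = len(lower)

    def clip(point):
        # Keep every coordinate inside the search box.
        return [min(max(v, lo), hi) for v, lo, hi in zip(point, lower, upper)]

    # Assumption: start from uniformly random points in the box.
    population = [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
                  for _ in range(n_parents + n_children)]

    for _ in range(generations):
        # The best points (lowest objective) become this generation's parents.
        parents = sorted(population, key=objective)[:n_parents]
        children = []
        for _ in range(n_children):
            p1, p2 = rng.sample(parents, 2)
            f1, f2 = objective(p1), objective(p2)
            # Weighted mean M of the pair: the better parent gets more weight.
            w1 = f2 / (f1 + f2) if f1 + f2 > 0 else 0.5
            m = [w1 * a + (1.0 - w1) * b for a, b in zip(p1, p2)]
            z = rng.gauss(0.0, 1.0)  # offset along the line through the parents
            r = rng.gauss(0.0, 2.0)  # assumption: N(0, 4) means variance 4
            # Assumption: apply the r offset in a uniformly random direction.
            u = [rng.gauss(0.0, 1.0) for _ in range(k)]
            norm = math.sqrt(sum(c * c for c in u)) or 1.0
            gap = math.dist(p1, p2)
            child = [mi + z * (b - a) + radial_scale * r * gap * ui / norm
                     for mi, a, b, ui in zip(m, p1, p2, u)]
            children.append(clip(child))
        # Assumption: parents survive alongside their children.
        population = parents + children

    return min(population, key=objective)

# Toy objective with its minimum at (3, -1), for illustration only.
def quadratic(p):
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

best = evolutionary_minimize(quadratic, lower=[-10.0, -10.0], upper=[10.0, 10.0])
print(best, quadratic(best))
```

To use it for MKSE, the objective would be the K-S statistic as a function of the parameter vector, with the box set to a plausible range for each parameter.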
MKSFitter output example • Software covers a wide range of distributions.
MKSE performance • Tested on simulated data from several different known distributions. • As sample size ↑: • frequency of selecting correct distribution ↑, mean K-S value ↓. • At large sample sizes, mean distance from estimated to true parameters is similar for MKSE and MLE.
MKSE performance • 100 random samples at n = 10 • 100 random samples at n = 100
Modification for censored data • Replace the usual empirical CDF with one minus the Kaplan-Meier estimate of the survival function (or another estimate, as appropriate for the censoring type). • Continue as before.
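The censored-data modification can be sketched as follows. The Kaplan-Meier product-limit formula itself is standard; everything else is an assumption made for illustration, including the function names `kaplan_meier` and `ks_censored` and the choice to compare the two curves only up to the last observed event time (how MKSFitter treats the tail beyond it is not stated here).

```python
def kaplan_meier(times, observed):
    """Return [(t, S(t))] at each distinct event time; observed[i] is False if censored."""
    pairs = sorted(zip(times, observed))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        leaving = 0
        # Group all observations that share the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            # Product-limit step: multiply by the fraction surviving past t.
            surv *= 1.0 - events / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

def ks_censored(curve, cdf):
    """Largest gap between 1 - S(t) and cdf, checked on both sides of each step."""
    d = 0.0
    before = 0.0  # estimated CDF just before the current event time
    for t, s in curve:
        after = 1.0 - s
        f = cdf(t)
        d = max(d, abs(after - f), abs(f - before))
        before = after
    return d

# Made-up example: the observation at time 2 is censored.
curve = kaplan_meier([1.0, 2.0, 3.0, 4.0], [True, False, True, True])
print(curve)
```

With no censored observations, `1 - S(t)` is the ordinary empirical CDF, so `ks_censored` reduces to the usual K-S statistic.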
Censored MKSE performance • Mini-test: 10 random samples of size n = 23, with event times and censoring times both Exp(1). • Mean of estimates of λ: 0.8985 • Standard deviation of estimates: 0.2747 • Each time, at least 4 other distributions had lower minimum K-S value than exponential did. • Exponential Power had lowest K-S value 4 of 10 times.
Censored MKSE performance • For proper analysis I must learn to embed this code in R, so I can run and save results hundreds of times. I will evaluate: • frequency of selecting correct distribution • accuracy of parameter estimates vs. MLE • (Cannot compare mean K-S value for different numbers of noncensored observations…)
Next steps • Embed C code directly within R in order to… • Test performance for censored data • Bootstrap for standard error estimates • Make use of other CDF estimates available in R • Publish MKSFitter R package • Enable alternate statistics (Cramér-von Mises is consistent for distributions where K-S is not)