Learn how to model input processes in simulation, from data analysis to distribution fitting. Understand the assumptions, activities, and tools such as the chi-square test. Discover the benefits and challenges of fitting distributions to simulation inputs.
Simulation Modelling IE 519
Contents • Input Modelling • Random Number Generation • Generating Random Variates • Output Analysis • Resampling Methods • Comparing Multiple Systems • Simulation Optimization • Metamodels • Variance Reduction • Case Study IE 519
Input Modelling IE 519
Input Modelling • You make custom Widgets • How do you model the input process? • Is it deterministic? • Is it random? • Look at some data IE 519
Orders Now what? IE 519
Histogram IE 519
Other Observations • Trend? • Stationary or non-stationary process • Seasonality • May require multiple processes IE 519
Choices for Modelling • Use the data directly (trace-driven simulation) • Use the data to fit an empirical distribution • Use the data to fit a theoretical distribution IE 519
Assumptions • To fit a distribution, the data should be independent and identically distributed (IID) observations • Could the data come from more than one distribution? • Statistical test • Are the observations independent? • Statistical test IE 519
Activity I • Hypothesize families of distributions • Look at the data • Determine what is a reasonable process • Summary statistics • Histograms • Quantile summaries and box plots IE 519
Activity II • Estimate the parameters • Maximum likelihood estimator (MLE) • Sometimes a very simple statistic (e.g., the sample mean) • Sometimes requires numerical calculations IE 519
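The two MLE cases can be sketched as follows. This is a minimal illustration, not the course's own code: the exponential mean has a closed-form MLE (the sample mean), while the Weibull shape requires a numerical root search. The function names, the bisection bounds, and the simulated data are all assumptions made for the example.

```python
import math
import random

def exponential_mle(data):
    """Closed-form MLE of the exponential mean: the sample mean."""
    return sum(data) / len(data)

def weibull_mle(data, lo=0.05, hi=20.0, tol=1e-8):
    """MLE of Weibull (shape, scale); the shape is found by bisection."""
    n = len(data)
    mean_log = sum(math.log(x) for x in data) / n

    def score(k):
        # Likelihood equation for the shape k; this function is increasing in k.
        num = sum(x**k * math.log(x) for x in data)
        den = sum(x**k for x in data)
        return num / den - 1.0 / k - mean_log

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = (sum(x**shape for x in data) / n) ** (1.0 / shape)
    return shape, scale

# Hypothetical data: 5000 draws from a Weibull with scale 2.0 and shape 1.5.
rng = random.Random(519)
sample = [rng.weibullvariate(2.0, 1.5) for _ in range(5000)]
shape_hat, scale_hat = weibull_mle(sample)
print(f"shape = {shape_hat:.3f}, scale = {scale_hat:.3f}")
```

The bisection assumes strictly positive data and a shape parameter inside the hypothetical search interval [0.05, 20].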
Activity III • Determine quality of fit • Compare theoretical distribution with observations graphically • Goodness of fit tests • Chi-square tests • Kolmogorov-Smirnov test • Software IE 519
Chi-Square Test • Formal comparison of a histogram and the probability density/mass function • Divide the range of the fitted distribution into intervals • Count the number of observations in each interval IE 519
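Chi-Square Test • Compute the expected proportion $p_j$ of observations in interval $[a_{j-1}, a_j)$ under the fitted distribution: $p_j = \int_{a_{j-1}}^{a_j} \hat{f}(x)\,dx$ (or the sum of the fitted mass function over the interval) • Test statistic is $\chi^2 = \sum_{j=1}^{k} \frac{(N_j - n p_j)^2}{n p_j}$, where $N_j$ is the observed count in interval $j$, $n$ is the number of observations, and $k$ is the number of intervals • Reject the fitted distribution if $\chi^2$ is too large IE 519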
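The chi-square steps (divide into intervals, count, compare with expected counts) can be sketched in a few lines. This is an illustrative sketch only: the function name, the simulated data, and the choice of ten equiprobable intervals for a fitted exponential are assumptions, not part of the slides.

```python
import math
import random

def chi_square_statistic(data, cdf, edges):
    """Chi-square statistic for intervals [edges[j], edges[j+1]), j = 0..k-1."""
    n = len(data)
    k = len(edges) - 1
    counts = [0] * k
    for x in data:
        for j in range(k):
            if edges[j] <= x < edges[j + 1]:
                counts[j] += 1
                break
    stat = 0.0
    for j in range(k):
        p_j = cdf(edges[j + 1]) - cdf(edges[j])  # expected proportion in interval j
        stat += (counts[j] - n * p_j) ** 2 / (n * p_j)
    return stat

# Hypothetical data: 200 interarrival times, fitted by an exponential distribution.
rng = random.Random(1)
data = [rng.expovariate(1 / 4.0) for _ in range(200)]
mean_hat = sum(data) / len(data)  # MLE of the exponential mean
k = 10                            # equiprobable intervals, so each p_j = 1/k
edges = [-mean_hat * math.log(1 - j / k) for j in range(k)] + [math.inf]
stat = chi_square_statistic(data, lambda x: 1 - math.exp(-x / mean_hat), edges)
print(f"chi-square statistic with {k} intervals: {stat:.2f}")
```

The statistic would then be compared with a chi-square critical value; because one parameter was estimated from the data, the appropriate degrees of freedom lie between k-2 and k-1.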
How good is the data? • Assumption of IID observations • Sometimes time-dependent (non-stationary) • Assessment • Correlation plot • Scatter diagram • Nonparametric tests IE 519
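Correlation Plot • Calculate and plot the sample correlation $\hat{\rho}_j$ between observations $j$ apart, for lags $j = 1, 2, \ldots$ • For independent data, all $\hat{\rho}_j$ should be close to zero IE 519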
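A sample correlation at a given lag might be computed as below. The estimator used (lag covariance divided by the sample variance) is one common textbook choice and is an assumption here, since the slide's own formula did not survive extraction; the alternating data set is invented to show a strongly negative lag-1 correlation.

```python
def sample_correlation(data, lag):
    """Estimate the correlation between observations `lag` positions apart."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    cov = sum(
        (data[i] - mean) * (data[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return cov / var

# Hypothetical data: large and small orders alternating perfectly.
orders = [1.0, -1.0] * 50
correlations = [sample_correlation(orders, j) for j in range(1, 6)]
print(correlations)
```

Plotting `correlations` against the lag gives the correlation plot; values far from zero suggest the observations are not independent.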
Scatter Diagram • Plot pairs of consecutive observations $(X_i, X_{i+1})$ • For independent data, the points should be scattered randomly through the plane • A pattern indicates correlation IE 519
Multiple Data Sets • Often you have multiple data sets (e.g., different days, weeks, operators) • Is the data drawn from the same process (homogeneous) and can thus be combined? • Kruskal-Wallis test IE 519
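Kruskal-Wallis (K-W) Statistic • Pool all $n = \sum_{i=1}^{k} n_i$ observations from the $k$ data sets • Assign rank 1 to the smallest observation, rank 2 to the second smallest, etc. • Let $R_i$ be the sum of the ranks of the $n_i$ observations from data set $i$ • Calculate $T = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1)$ IE 519
K-W Test • The null hypothesis is H0: All the population distributions are identical H1: At least one distribution tends to yield larger values than at least one other • We reject H0 at level $\alpha$ if the K-W statistic $T$ satisfies $T > \chi^2_{k-1,\,1-\alpha}$ • In other words, under H0 the test statistic approximately follows a chi-square distribution with $k-1$ degrees of freedom, where $k$ is the number of data sets IE 519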
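The K-W computation can be sketched as follows, using the standard rank-sum form of the statistic (12 / (n(n+1)) times the sum of squared rank sums over sample sizes, minus 3(n+1)). The function name and the three small data sets are hypothetical, and the sketch assumes there are no tied observations.

```python
import math

def kruskal_wallis(samples):
    """K-W statistic for a list of samples (assumes no tied observations)."""
    pooled = sorted((x, i) for i, sample in enumerate(samples) for x in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(samples)
    for rank, (_, i) in enumerate(pooled, start=1):
        rank_sums[i] += rank
    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)

# Hypothetical data: order sizes recorded on three different days.
days = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
t_stat = kruskal_wallis(days)

# With k = 3 data sets the chi-square quantile (2 degrees of freedom)
# has the closed form -2 * ln(alpha).
critical = -2.0 * math.log(0.05)
print(f"T = {t_stat:.2f}, critical value = {critical:.3f}, reject = {t_stat > critical}")
```

For other numbers of data sets the critical value would come from a chi-square table or a statistics library.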
Absence of Data • We have assumed that we had data to fit a distribution • Sometimes no data is available • Try to obtain minimum, maximum, and mode and/or mean of the distribution • Documentation • SMEs IE 519
Triangular Distribution IE 519
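When only a minimum, maximum, and mode are available, a triangular variate can be generated by inverting the triangular CDF. The sketch below is illustrative: the parameter names `a` (minimum), `b` (maximum), and `c` (mode) and the service-time numbers are assumptions.

```python
import math
import random

def triangular_variate(u, a, b, c):
    """Map a U(0,1) value u to a triangular variate with min a, max b, mode c."""
    if u < (c - a) / (b - a):
        return a + math.sqrt(u * (b - a) * (c - a))
    return b - math.sqrt((1 - u) * (b - a) * (b - c))

# Hypothetical expert estimates: service takes 2 to 10 minutes, most often 5.
rng = random.Random(3)
times = [triangular_variate(rng.random(), 2.0, 10.0, 5.0) for _ in range(20000)]
print(f"sample mean = {sum(times) / len(times):.2f}")  # theoretical mean is (a+b+c)/3
```

Python's standard library offers the same distribution directly as `random.triangular(low, high, mode)`.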
Symmetric Beta Distributions a=b=2 a=b=3 a=b=5 a=b=10 IE 519
Skewed Beta Distributions a=2, b=4 IE 519
Beta Parameters IE 519
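The content of this slide was not recoverable, so the following is only one possible approach, offered as an assumption: choose the two beta shape parameters so that a beta distribution rescaled to [min, max] matches an estimated mean and mode. The function name and the numbers are hypothetical.

```python
import random

def beta_shapes_from_mean_mode(lo, hi, mean, mode):
    """Shape parameters (a1, a2) of a beta on [lo, hi] with the given mean and mode."""
    if not (lo < mean < hi and lo < mode < hi) or mean == mode:
        raise ValueError("need lo < mean, mode < hi and mean != mode")
    a1 = (mean - lo) * (2 * mode - lo - hi) / ((mode - mean) * (hi - lo))
    a2 = a1 * (hi - mean) / (mean - lo)
    if a1 <= 1 or a2 <= 1:
        raise ValueError("these estimates do not give a unimodal beta")
    return a1, a2

# Hypothetical expert estimates: min 2, max 10, mode 4, mean 5.
a1, a2 = beta_shapes_from_mean_mode(2.0, 10.0, 5.0, 4.0)
rng = random.Random(11)
draws = [2.0 + (10.0 - 2.0) * rng.betavariate(a1, a2) for _ in range(20000)]
print(f"a1 = {a1}, a2 = {a2}, sample mean = {sum(draws) / len(draws):.2f}")
```

The formulas follow from the mean a1/(a1+a2) and mode (a1-1)/(a1+a2-2) of a beta distribution on [0, 1], which hold when both shape parameters exceed 1.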
Benefits of Fitting a Parametric Distribution • We have focused mainly on the approach where we fit a distribution to data • Benefits: • Fill in gaps and smooth data • Make sure tail behavior is represented • Extreme events are very important to the simulation but may not be represented • Can easily incorporate changes in the input process • Change mean, variability, etc. • Reflect dependencies in the inputs IE 519
What About Dependencies? • Assumed so far an IID process • Many processes are not: • A customer places a monthly order. Since the customer keeps inventory of the product, a large order is often followed by a small order • A distributor with several warehouses places monthly orders, and these warehouses can supply the same customers • The behavior of customers logging on to a web site depends on age, gender, income, and where they live • Do not ignore dependence! IE 519
Solutions • A customer places a monthly order. • Should use a time-series model that captures the autocorrelation • A distributor with several warehouses • Need a vector time-series model • Customers logging on to a web site • Need a random vector model where each component may have a different distribution IE 519
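For the monthly-order case, the simplest time-series model that captures autocorrelation is a first-order autoregressive, AR(1), process. The sketch below assumes normally distributed noise and invented parameter values; a negative coefficient `phi` reproduces the "large order followed by small order" pattern.

```python
import random

def ar1_orders(n, mean, phi, sigma, rng):
    """Generate n values of X_t = mean + phi * (X_{t-1} - mean) + noise."""
    x = mean
    orders = []
    for _ in range(n):
        x = mean + phi * (x - mean) + rng.gauss(0.0, sigma)
        orders.append(x)
    return orders

# Hypothetical customer: average order 100 units, lag-1 dependence of -0.6.
monthly = ar1_orders(24, mean=100.0, phi=-0.6, sigma=10.0, rng=random.Random(7))
print([round(x) for x in monthly[:6]])
```

This sketch does not prevent negative order sizes; a real input model would need to handle that, for example by transforming the series.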
Taxonomy of Input Models (examples of models in parentheses) • Time-independent models • Univariate: discrete (binomial, etc.), continuous (normal, gamma, beta, etc.), mixed (empirical/trace-driven) • Multivariate: discrete (independent binomial), continuous (multivariate normal), mixed (bivariate exponential) • Stochastic processes • Discrete-time: discrete-state (Markov chains; stationary?), continuous-state (time-series models) • Continuous-time: discrete-state (Poisson process; stationary?), continuous-state (Markov process) IE 519
What if it Changes over Time? • Do not ignore it! • Non-stationary input process • Examples: • Arrivals of customers to a restaurant • Arrivals of email to a server • Arrivals of bug discovery in software • Could model as nonhomogeneous Poisson process IE 519
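Arrivals from a nonhomogeneous Poisson process can be generated by thinning: propose candidate arrivals at a constant rate that bounds the true rate, and keep each candidate with probability equal to the ratio of the two rates. The sketch below is illustrative; the rate function (a restaurant with a midday peak over a 12-hour day) and all names are assumptions.

```python
import math
import random

def nhpp_arrivals(rate, rate_max, horizon, rng):
    """Thinning: arrival times on [0, horizon] for a rate function bounded by rate_max."""
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_max)          # next candidate arrival
        if t > horizon:
            return arrivals
        if rng.random() <= rate(t) / rate_max:  # keep with probability rate(t)/rate_max
            arrivals.append(t)

def restaurant_rate(t):
    """Hypothetical arrival rate (customers per hour), peaking at t = 6."""
    return 10.0 + 8.0 * math.sin(math.pi * t / 12.0)

arrivals = nhpp_arrivals(restaurant_rate, rate_max=18.0, horizon=12.0, rng=random.Random(5))
print(f"{len(arrivals)} arrivals in 12 hours")
```

The expected number of arrivals is the integral of the rate function over the horizon, roughly 181 for this hypothetical rate.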
Goodness-of-Fit Test • The distribution fitted is tested using goodness-of-fit tests (GoF) • How good are those tests? • The null hypothesis is that the data is drawn from the chosen distribution with the estimated parameters • Is it true? IE 519
Power of GoF Tests • The null hypothesis is always false! • If the GoF test is powerful enough then it will always be rejected • What we see in practice: • Few data points: no distribution is rejected • A great deal of data: all distributions are rejected • At best, GoF tests should be used as a guide IE 519
Input Modeling Software • Many software packages exist for input modeling (fitting distributions) • Each has at least 20-30 distributions • You input IID data, the software gives you a ranked list of distributions (according to GoF tests) • Pitfalls? IE 519
Why Fit a Distribution at All? • There is a growing sentiment that we should never fit distributions (not consensus, just growing) • A couple of issues: • You don’t always benefit from data • Fitting a distribution can be misleading IE 519
Is Data Reality? • Data is often • Distorted • Poorly communicated, mistranslated or recorded • Dated • Data is always old by definition • Deleted • Some of the data is often missing • Dependent • Often only summaries, or collected at certain times • Deceptive • This may all be on purpose! IE 519
Problems with Fitting • Fitting an input distribution can be misleading for numerous reasons • There is rarely a theoretical justification for the distribution. Simulation is often sensitive to the tails and this is where the problem is! • Selecting the correct model is futile • The model gives the simulation practitioner a false sense of the model being well-defined IE 519
Alternative • Use empirical/trace-driven simulation when there is sufficient data • Treat other cases as if there is no data, and use a beta distribution IE 519
Empirical Distribution IE 519
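Sampling from an empirical distribution can be sketched as below. This version assumes a continuous, piecewise-linear empirical CDF that interpolates between the sorted observations, which is one common construction and not necessarily the one pictured on the slide; the data values are hypothetical.

```python
import random

def empirical_variate(sorted_data, u):
    """Invert a piecewise-linear empirical CDF at u in [0, 1] (needs >= 2 points)."""
    n = len(sorted_data)
    p = (n - 1) * u
    i = min(int(p), n - 2)  # index of the interval containing the quantile
    return sorted_data[i] + (p - i) * (sorted_data[i + 1] - sorted_data[i])

# Hypothetical observed service times, sorted in increasing order.
observed = sorted([4.0, 1.0, 2.0])
rng = random.Random(2)
draws = [empirical_variate(observed, rng.random()) for _ in range(5)]
print(draws)
```

A variate generated this way never falls outside the observed range, which is the tail limitation the parametric approach tries to address.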
Beta Distribution Shapes IE 519
What to Do? • Old rule of thumb based on number of data points available: • Fewer than 20 : Not enough data to fit • 20-50 : Fit, rule out poor choices • 51-200 : Fit a distribution • More than 200 : Use empirical distribution IE 519
Random Number Generation IE 519
Random-Number Generation • Any simulation with random components requires generating a sequence of random numbers • E.g., we have talked about arrival times and service times being drawn from a particular distribution • We do this by first generating a random number (uniform on [0,1]) and then transforming it appropriately IE 519
Three Alternatives • True random numbers • Throw a die • Not possible with a deterministic algorithm • Pseudo-random numbers • Deterministic sequence that is statistically indistinguishable from a random sequence • Quasi-random numbers • A regular distribution of numbers over the desired interval IE 519
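The "generate a uniform, then transform" idea can be sketched with the inverse-transform method for an exponential service time. The function name and the mean of 3 are assumptions for illustration.

```python
import math
import random

def exponential_variate(u, mean):
    """Transform a U(0,1) value u into an exponential variate with the given mean."""
    return -mean * math.log(1.0 - u)

rng = random.Random(4)
service_times = [exponential_variate(rng.random(), mean=3.0) for _ in range(20000)]
print(f"sample mean = {sum(service_times) / len(service_times):.2f}")
```

The transformation inverts the exponential CDF F(x) = 1 - exp(-x / mean), so uniform inputs map to exponentially distributed outputs.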
Why is this Important? • Validity • The simulation model may not be valid due to cycles and dependencies in the model • Precision • You can improve the output analysis by carefully choosing the random numbers IE 519
Pseudo-Random Numbers • Want an iterative algorithm that outputs numbers on a fixed interval • When we subject this sequence to a number of statistical tests, we cannot distinguish it from a random sequence • In reality, it is completely deterministic IE 519
Linear Congruential Generators (LCG) • Introduced in the early 1950s and still in very wide use today • Recursive formula $Z_i = (a Z_{i-1} + c) \bmod m$ • Every number is determined by these four values: the modulus $m$, the multiplier $a$, the increment $c$, and the seed $Z_0$ IE 519
Transform to Unit Uniform • Simply divide each generated integer $Z_i$ by $m$: $U_i = Z_i / m$ • What values can we take? Only $0, 1/m, 2/m, \ldots, (m-1)/m$ IE 519
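A linear congruential generator, with the recursion Z_i = (a Z_{i-1} + c) mod m followed by division by m, fits in a few lines. The small parameter values below (m = 16, a = 5, c = 3, seed 7) are chosen only so the sequence can be checked by hand; they are far too small for real use.

```python
from itertools import islice

def lcg(m, a, c, seed):
    """Yield (Z_i, U_i) pairs from Z_i = (a * Z_{i-1} + c) mod m, U_i = Z_i / m."""
    z = seed
    while True:
        z = (a * z + c) % m
        yield z, z / m

first = list(islice(lcg(m=16, a=5, c=3, seed=7), 5))
print(first)  # [(6, 0.375), (1, 0.0625), (8, 0.5), (11, 0.6875), (10, 0.625)]
```

With m = 16 the uniform values can only be multiples of 1/16, which illustrates why practical generators use a very large modulus.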
Examples IE 519
Characteristics • All LCGs loop • The length of the cycle is the period • LCGs with period m have full period • This happens if and only if • The only positive integer that divides both m and c is 1 • If q is a prime that divides m, then q divides a-1 • If 4 divides m then 4 divides a-1 IE 519
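The three full-period conditions can be checked mechanically and compared against a brute-force cycle count for a small modulus. This is a sketch with hypothetical function names; the brute-force check is only practical for small m.

```python
import math

def has_full_period(m, a, c):
    """Check the three full-period conditions for an LCG with modulus m."""
    if math.gcd(m, c) != 1:
        return False
    # Collect the prime factors of m by trial division.
    q, rest, primes = 2, m, set()
    while q * q <= rest:
        if rest % q == 0:
            primes.add(q)
            while rest % q == 0:
                rest //= q
        q += 1
    if rest > 1:
        primes.add(rest)
    if any((a - 1) % p != 0 for p in primes):
        return False
    return m % 4 != 0 or (a - 1) % 4 == 0

def period(m, a, c, seed=0):
    """Brute-force cycle length of the LCG started from seed (small m only)."""
    seen = {}
    z, step = seed, 0
    while z not in seen:
        seen[z] = step
        z = (a * z + c) % m
        step += 1
    return step - seen[z]

print(has_full_period(16, 5, 3), period(16, 5, 3))  # True 16
print(has_full_period(16, 3, 3), period(16, 3, 3))
```

For m = 16 and a = 3 the third condition fails (4 divides 16 but not a - 1 = 2), so the generator cannot reach all 16 values.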
Types of LCGs • If c=0 then it is called a multiplicative LCG, otherwise a mixed LCG • Mixed and multiplicative LCGs behave rather differently IE 519