Statistical Methods for Computer Science
Marie desJardins (mariedj@cs.umbc.edu)
CMSC 601, April 9, 2012
Material adapted from slides by Tom Dietterich, with permission
Statistical Analysis of Data
• Given a set of measurements of a value, how certain can we be of the value?
• Given a set of measurements of two values, how certain can we be that the two values are different?
• Given a measured outcome, along with several condition or treatment values, how can we remove the effect of unwanted conditions or treatments on the outcome?
Measuring CPU Time
• Here are 37 measurements of the CPU time required to compute C(10000, 500):
  0.27 0.25 0.23 0.24 0.26 0.24 0.26 0.25 0.24 0.25
  0.25 0.24 0.25 0.24 0.25 0.26 0.24 0.25 0.25 0.25
  0.25 0.25 0.24 0.25 0.24 0.25 0.25 0.24 0.25 0.25
  0.24 0.25 0.24 0.24 0.25 0.25 0.26
• What is the "true" CPU cost of this computation?
• Before doing any calculations, always visualize your data!
Kernel Density Estimate
• Kernel density: place a small Gaussian distribution (a "kernel") around each data point, and sum them
• Useful for visualization; also often used as a regression technique
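To make the idea concrete, here is a minimal sketch (not from the original slides) that computes and plots a kernel density estimate of the 37 measurements, assuming numpy, scipy, and matplotlib are available:

```python
# Sketch: kernel density estimate of the CPU-time data.
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

kde = gaussian_kde(times)            # one Gaussian kernel per point, summed
xs = np.linspace(0.22, 0.28, 200)
plt.plot(xs, kde(xs))
plt.xlabel("CPU time (s)")
plt.ylabel("estimated density")
plt.show()
```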
Sample Mean
• The data seem reasonably close to a normal (Gaussian, or bell-curve) distribution
• Given this assumption, we can compute a sample mean: μ' = (x1 + x2 + ... + xn) / n ≈ 0.248
• How certain can we be that this is the true value?
• Confidence interval [min, max]: suppose we drew many random samples of size n = 37 and computed their sample means; 95% of the time, the sample mean would lie between min and max
Confidence Intervals via Resampling
• We can simulate this process algorithmically (sketched below)
• Draw 1000 random subsamples (with replacement) from the original 37 points
• This process makes no assumption about a Gaussian distribution!
• Sort the means of these subsamples
• Choose the 26th and 975th values as the min and max of a 95% confidence interval (this range includes 95% of the sample means!)
• Result: the resampled confidence interval is [0.245, 0.251]
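A minimal Python sketch of this resampling recipe (numpy assumed; illustrative, not the slides' code):

```python
# Bootstrap confidence interval for the mean.
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the sketch is reproducible
times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

boot_means = np.sort([rng.choice(times, size=len(times), replace=True).mean()
                      for _ in range(1000)])
lo, hi = boot_means[25], boot_means[974]   # the 26th and 975th sorted means
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```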
Confidence Intervals via Distributional Theory
• The Central Limit Theorem says that the distribution of the sample means is (approximately) normal
• If the original data are normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ' = σ/√n
• We don't know the original μ and σ, so we estimate them from the sample; the 95% confidence interval on the mean is then approximately μ' ± 1.96σ'
• It isn't important to remember this formula, since Matlab, R, etc. will compute it for you. But it is very important to understand why you are computing it!
t Distribution
• Instead of assuming a normal distribution, we can use a t distribution (sometimes called "Student's t distribution"), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n − 1)
• The probability density function looks much like a normal distribution, but with heavier tails; as n increases, it approaches the normal distribution
• For small samples, this distribution yields slightly wider (more conservative) confidence limits than the normal approximation of the central limit theorem
Distributional Confidence Intervals
• We can use the mathematical formula for the t distribution to compute a p (typically, p = 0.95) confidence interval
• The 0.025 t-value, t0.025, is the value such that the probability that (μ − μ')/σ' < t0.025 is 0.975
• The 95% confidence interval is then [μ' − t0.025σ', μ' + t0.025σ']
• For the CPU example, the half-width t0.025σ' is 0.028, so the distributional confidence interval is [0.220, 0.276] – wider than the bootstrapped CI of [0.245, 0.251]
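In practice a statistics library does the t-value lookup for you; a sketch using scipy (an assumption about the toolchain, since the slides mention Matlab and R):

```python
# t-distribution confidence interval for the sample mean.
import numpy as np
from scipy import stats

times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

sem = stats.sem(times)               # estimated sigma' = s / sqrt(n)
lo, hi = stats.t.interval(0.95, df=len(times) - 1,
                          loc=times.mean(), scale=sem)
print(f"95% t CI: [{lo:.3f}, {hi:.3f}]")
```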
Bootstrap Computations of Other Statistics
• The bootstrap method can be used to compute other sample statistics for which the distributional method isn't appropriate:
• median
• mode
• variance
• Because the tails and outlying values may not be well represented in a sample, the bootstrap method is not as useful for statistics involving the "ends" of the distribution:
• minimum
• maximum
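The recipe is unchanged for these statistics; for instance, a sketch of a bootstrap CI for the median (same assumptions as the earlier sketch):

```python
# Bootstrap confidence interval for the median: same recipe, different statistic.
import numpy as np

rng = np.random.default_rng(0)
times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

boot_medians = np.sort([np.median(rng.choice(times, len(times), replace=True))
                        for _ in range(1000)])
print(f"95% bootstrap CI for the median: "
      f"[{boot_medians[25]:.3f}, {boot_medians[974]:.3f}]")
```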
Measuring the Number of Occurrences of Events
• In CS, we often want to know how often something occurs:
• How many times does a process complete successfully?
• How many times do we correctly predict membership in a class?
• How many times do we find the top search result?
• Again, the sample rate θ' is what we have observed, but we would like to know the "true" rate θ
Bootstrap Confidence Intervals for Rates
• Suppose we have observed 100 predictions of a decision tree, and 88 of them were correct
• Draw many (say, 1000) samples of size n, with replacement, from the n observed predictions (here, n = 100), and compute the sample classification rate
• Sort the sample rates θi in increasing order
• Choose the 26th and 975th values as the ends of the confidence interval: here, the confidence interval is [0.81, 0.94]
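A sketch of the same idea for a rate (numpy assumed; the 88-of-100 outcomes encoded as 1s and 0s):

```python
# Bootstrap confidence interval for a classification rate.
import numpy as np

rng = np.random.default_rng(0)
outcomes = np.array([1] * 88 + [0] * 12)   # 88 correct predictions out of 100
boot_rates = np.sort([rng.choice(outcomes, size=100, replace=True).mean()
                      for _ in range(1000)])
print(f"95% bootstrap CI for the rate: "
      f"[{boot_rates[25]:.2f}, {boot_rates[974]:.2f}]")
```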
Binomial Distributional Confidence
• If we assume that the classifier is a "biased coin" with probability θ of coming up heads, then we can use the binomial distribution to analytically compute the confidence interval
• This requires a small correction, because the binomial distribution is actually discrete but we want a continuous estimate
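For illustration, scipy (1.7 or later, an assumption about the toolchain) exposes this analytic interval directly; the exact Clopper-Pearson method accounts for the discreteness:

```python
# Exact (Clopper-Pearson) binomial confidence interval.
from scipy.stats import binomtest

result = binomtest(k=88, n=100)                    # 88 successes in 100 trials
ci = result.proportion_ci(confidence_level=0.95)   # method='exact' by default
print(f"95% binomial CI: [{ci.low:.3f}, {ci.high:.3f}]")
```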
Comparing Two Measurements
• Consider the CPU measurements of the earlier example, and suppose we have performed the same computation on a different machine, yielding these CPU times:
  0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19
  0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20
  0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29
  0.21 0.23 0.20
• These times certainly seem faster than those of the first machine, which yielded a distributional confidence interval of [0.220, 0.276] – but how can we be sure?
Kernel Density Comparison
• Visually, the second machine (Shark3) is much faster than the first (Darwin)
Difference Estimation
• Bootstrap testing: repeat many times: draw a bootstrap sample from each of the machines and compute the two sample means
• If Shark3 is faster than Darwin in more than 95% of these trials, we can be 95% confident that it really is faster
• We can also compute a 95% bootstrap confidence interval on the difference between the means – this turns out to be [0.0461, 0.0553]
• If the samples are drawn from t distributions, then the difference between the sample means also has (approximately) a t distribution
• Confidence interval on this difference: [0.0463, 0.0555]
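A sketch of this bootstrap comparison (numpy assumed; both data sets are from the slides):

```python
# Bootstrap comparison of the two machines' mean CPU times.
import numpy as np

rng = np.random.default_rng(0)
darwin = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                   0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                   0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                   0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])
shark3 = np.array([0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19,
                   0.19, 0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20,
                   0.19, 0.20, 0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29,
                   0.21, 0.23, 0.20])

diffs = np.sort([rng.choice(darwin, len(darwin), replace=True).mean()
                 - rng.choice(shark3, len(shark3), replace=True).mean()
                 for _ in range(1000)])
print(f"fraction of trials where Shark3 was faster: {np.mean(diffs > 0):.3f}")
print(f"95% bootstrap CI on the difference: [{diffs[25]:.4f}, {diffs[974]:.4f}]")
```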
Hypothesis Testing
• Is the true difference zero, or more than zero?
• Use classical statistical rejection testing
• Null hypothesis: the two machines have the same speed (i.e., μ, the true difference in means, is equal to zero)
• Can we reject this hypothesis, based on the observed data? In other words: if the null hypothesis were true, what is the probability that we would have observed this data?
• We can measure this probability using the t distribution
• In this case, the computed t value = (μ1 − μ2) / σ' = 21.69
• The probability of seeing this t value if μ were actually zero is vanishingly small: the 99.999% confidence interval (for the null hypothesis) is [−4.59, 4.59], so the probability of this t value is (much) less than 0.00001
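The same comparison as a classical two-sample t test, sketched with scipy (equal variances are assumed by ttest_ind's default; illustrative only, not the slides' computation):

```python
# Two-sample t test of the null hypothesis "same mean speed".
from scipy import stats

darwin = [0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25, 0.25,
          0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25, 0.25, 0.25,
          0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.24,
          0.24, 0.25, 0.25, 0.26]
shark3 = [0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19, 0.19,
          0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20, 0.19, 0.20,
          0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29, 0.21, 0.23, 0.20]

t, p = stats.ttest_ind(darwin, shark3)
print(f"t = {t:.2f}, p = {p:.2e}")   # a tiny p-value rejects the null hypothesis
```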
Paired Differences
• Suppose we had a set of 10 different benchmark programs that we ran on the two machines, yielding a table of CPU times (not reproduced here)
• Obviously, we don't want to just compare the means, since the programs have such different running times
Kernel Density Visualization
• CPU1 seems to be systematically faster (its density is offset to the left) than CPU2
Scatterplot Visualization
• CPU1 is always faster than CPU2 (i.e., its points lie above the diagonal line that corresponds to equal speed)
Sequential Visualization
• The correlation between program "difficulty" on the two machines (and the consistently faster speed of CPU1) is even more obvious in this line plot, ordered by program number:
Distribution Analysis I
• If the differences are in the same "units," we can subtract the CPU times for the "paired" tests and assume a t distribution on these differences
• The probability of observing a sample mean difference as large as 0.02779, given the null hypothesis that the machines have the same speed, is 0.0000466 – so we can reject the null hypothesis
• If we have no prior belief about which machine is faster, we should use a "two-tailed test"
• The probability of observing a sample mean difference this large in either direction is 0.0000932 – slightly larger, but still sufficiently improbable that we can be sure the machines have different speeds
• Note that we can also use a bootstrap analysis on this type of paired data
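For the distributional version, here is a sketch of the paired t test with scipy; since the benchmark table is not reproduced above, the cpu1/cpu2 values below are hypothetical placeholders:

```python
# Paired t test on per-program differences (the data here are made up,
# because the slide's benchmark table is not reproduced).
from scipy import stats

cpu1 = [1.2, 5.4, 0.8, 10.1, 3.3, 7.7, 2.0, 4.4, 6.1, 0.5]   # hypothetical times
cpu2 = [1.5, 6.0, 1.0, 11.2, 3.9, 8.6, 2.4, 5.0, 6.9, 0.7]   # hypothetical times
t, p = stats.ttest_rel(cpu1, cpu2)   # two-tailed p-value by default
print(f"t = {t:.2f}, two-tailed p = {p:.5f}")
```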
Paired vs. Non-Paired
• If we don't pair the data (i.e., we just compare the overall means, not the per-program differences):
• Distributional analysis doesn't let us reject the null hypothesis
• Bootstrap analysis doesn't let us reject the null hypothesis
Sign Tests
• I mentioned before that the paired t-test is appropriate if the measurements are in the same "units"
• If the magnitude of the difference is not important, or not meaningful, we can still compare performance
• Look at the sign of the difference (here, CPU1 is faster 10 out of 10 times; in another case, it might only be faster 9 out of 10 times)
• Use the binomial distribution (flip a coin to get the sign) to compute a confidence interval for the probability that CPU1 is faster
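A sign test reduces to a binomial tail probability; a sketch with scipy's binomtest, illustrating the 9-of-10 case mentioned above:

```python
# Sign test: could "CPU1 faster 9 times out of 10" happen with a fair coin?
from scipy.stats import binomtest

result = binomtest(9, 10, p=0.5, alternative='greater')
print(f"one-sided p-value: {result.pvalue:.4f}")   # P(>= 9 heads in 10 fair flips)
```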
Other Important Topics
• Regression analysis
• Cross-validation
• Human subjects analysis and user study design
• Analysis of Variance (ANOVA)
• For your particular investigation, you need to know which of these topics are relevant, and to learn about them!
Statistically Valid Experimental Design
• Make sure you understand the nuances before you design your experiments...
• ...and definitely before you analyze your experimental data!
• Designing the statistical methods (and hypotheses) after the fact is not valid!
• You can often find a hypothesis and an associated statistical method ex post facto – i.e., fitting the analysis to the data instead of the other way around
• In the worst case, doing this is downright unethical
• In the best case, it shows a lack of clear research objectives, and the results may not be reproducible or meaningful