Empirical Research Methods in Computer Science Lecture 4 November 2, 2005 Noah Smith
Today • Review bootstrap estimate of se (from homework). • Review sign and permutation tests for paired samples. • Lots of examples of hypothesis tests.
Recall ... • There is a true value of the statistic. But we don’t know it. • We can compute the sample statistic. • We know sample means are normally distributed (as n gets big): x̄ ≈ N(μ, σ²/n), so se(x̄) = σ/√n.
But we don’t know anything about the distribution of other sample statistics (medians, correlations, etc.)!
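The contrast above is easy to check empirically. A minimal simulation (not from the lecture; standard-library Python only) that draws repeated samples from a non-normal distribution and compares the spread of the sample means against the σ/√n prediction:

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    # Mean of n draws from a decidedly non-normal distribution:
    # Uniform(0, 1).
    return statistics.mean(random.random() for _ in range(n))

n, trials = 50, 2000
means = [sample_mean(n) for _ in range(trials)]

# CLT prediction: the sample mean is approximately N(mu, sigma^2 / n),
# with mu = 0.5 and sigma^2 = 1/12 for Uniform(0, 1).
predicted_se = (1 / 12) ** 0.5 / n ** 0.5
observed_se = statistics.stdev(means)
print(round(predicted_se, 4), round(observed_se, 4))
```

The observed spread of the means matches σ/√n even though the underlying distribution is uniform; no such closed form is available for the median or a correlation, which is why the bootstrap is needed.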
Bootstrap world • Real world: unknown distribution F → observed random sample X → statistic of interest θ̂ • Bootstrap world: empirical distribution F̂ → bootstrap random sample X* → bootstrap replication θ* • From the replications: statistics about the estimate (e.g., standard error)
Bootstrap estimate of se • Run B bootstrap replicates, and compute the statistic each time: θ*[1], θ*[2], θ*[3], ..., θ*[B] • θ̄* = (1/B) Σb θ*[b] (mean of θ* across replications) • se ≈ √( Σb (θ*[b] − θ̄*)² / (B − 1) ) (sample standard deviation of θ* across replications)
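The procedure above can be sketched directly. A minimal standard-library Python implementation (mine, not the course's; the data values are made up for illustration): resample with replacement B times, compute the statistic on each replicate, and take the sample standard deviation of the replicates.

```python
import random
import statistics

def bootstrap_se(sample, statistic, B=2000, seed=0):
    """Bootstrap estimate of the standard error of `statistic`:
    draw B resamples with replacement, compute the statistic on
    each replicate, and return the sample standard deviation of
    the B replicated values."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [
        statistic([rng.choice(sample) for _ in range(n)])
        for _ in range(B)
    ]
    return statistics.stdev(replicates)

# Illustrative (made-up) data; the median has no simple closed-form
# standard error, which is where the bootstrap earns its keep.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
print(bootstrap_se(data, statistics.median))
```

As a sanity check, applying the same function to the mean should land near the closed-form s/√n, and it does.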
Paired-Sample Design • pairs (xi, yi) • x ~ distribution F • y ~ distribution G • How do F and G differ?
Sign Test • H0: F and G have the same median: median(F) – median(G) = 0 • under H0, Pr(x > y) = 0.5 • sign(x – y) ~ binomial distribution • let N+ = #{i : xi > yi}; compute the probability of a count at least as extreme as N+ under Binomial(n, 0.5)
Sign Test • nonparametric (no assumptions about the data) • closed form (no random sampling)
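Since the test is closed-form, it fits in a few lines. A sketch using only the standard library (`sign_test_p` is my name, not the course's): count the positive signs and compute a two-sided binomial tail probability under p = 0.5.

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Two-sided sign test p-value. Under H0 (equal medians),
    sign(x - y) is +1 or -1 with equal probability, so the number
    of positive signs N+ ~ Binomial(n, 0.5). Ties are dropped
    before counting."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    # P(at least k heads in n fair coin flips), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# The gzip example from the slides: -O2 faster on about 650 of 1000 files.
print(sign_test_p(650, 350))
```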
Example: gzip speed • build gzip with -O2 or with -O0 • on about 650 files out of 1000, gzip -O2 was faster • binomial distribution, p = 0.5, n = 1000 • p < 3 × 10^-24
Permutation Test • H0: F = G • Suppose the difference in sample means is d. • How likely is a difference this large (or larger) under H0? • For r = 1 to P • Randomly permute within each pair (xi, yi) • Compute the difference in sample means • p ≈ fraction of permutations with a difference at least as extreme as d
Permutation Test • nonparametric (no assumptions about the data) • randomized test
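A minimal sketch of the paired permutation test (standard library only; names and data are mine). Swapping x and y within a pair just flips the sign of that pair's difference, so each permutation amounts to flipping each difference's sign with probability 1/2:

```python
import random

def permutation_test(xs, ys, P=10000, seed=0):
    """Paired permutation test of H0: F = G. Swapping x and y
    within a pair flips the sign of that pair's difference, so the
    null distribution comes from random sign flips."""
    rng = random.Random(seed)
    n = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    observed = sum(diffs) / n
    extreme = 0
    for _ in range(P):
        # Flip each paired difference's sign with probability 1/2.
        d = sum(di if rng.random() < 0.5 else -di for di in diffs) / n
        if abs(d) >= abs(observed):
            extreme += 1
    return extreme / P

# Made-up paired runtimes with a clear systematic difference.
xs = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.2]
ys = [2.0, 2.1, 1.9, 2.2, 2.4, 2.0, 1.8, 2.3]
print(permutation_test(xs, ys))
```

Because it is a randomized test, the returned p-value is itself an estimate; larger P makes it more precise.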
Example: gzip speed • 1000 permutations: the difference of sample means under H0 is centered on 0 • the observed difference, −1579, is very extreme; p ≈ 0
Comparing speed is tricky! • It is very difficult to control for everything that could affect runtime. • Solution 1: do the best you can. • Solution 2: many runs, and then do ANOVA tests (or their nonparametric equivalents). “Is there more variance between conditions than within conditions?”
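The between-versus-within question is exactly the one-way ANOVA F statistic. A minimal sketch (not from the lecture; `f_statistic` is a hypothetical helper, and the runtimes are made up):

```python
import statistics

def f_statistic(groups):
    """One-way ANOVA F statistic: mean square between conditions
    over mean square within conditions."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Between-condition sum of squares (k - 1 degrees of freedom).
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2
                     for g in groups)
    # Within-condition sum of squares (n - k degrees of freedom).
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up runtimes: three conditions, three runs each.
runs = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9], [3.0, 3.1, 2.9]]
print(f_statistic(runs))
```

A large F means the conditions differ by more than the run-to-run noise within each condition; the classical test then compares F against an F distribution, which assumes normality, hence the slide's pointer to nonparametric equivalents.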
Sampling method 1 • for r = 1 to 10 • for each file f • for each program p • time p on f
Result (gzip first) student 2’s program faster than gzip!
Result (student first) student 2’s program is slower than gzip!
Sampling method 1 • for r = 1 to 10 • for each file f • for each program p • time p on f
Order effects • Well-known in psychology. • What the subject does at time t will affect what she does at time t+1.
Sampling method 2 • for r = 1 to 10 • for each program p • for each file f • time p on f
Result gzip wins
Sign and Permutation Tests • [diagram: the space of all distribution pairs (F, G)] • The sign test rejects H0 where median(F) ≠ median(G). • The permutation test rejects H0 wherever F ≠ G, a larger region that contains the sign test's.
There are other tests! • We have chosen two that are • nonparametric • easy to implement • Others include: • Wilcoxon Signed Rank Test • Kruskal-Wallis (nonparametric “ANOVA”)
Pre-increment? • Conventional wisdom: “Better to use ++x than to use x++.” • Really, with a modern compiler?
Two (toy) programs • for (i = 0; i < (1 << 30); ++i) j = ++k; • for (i = 0; i < (1 << 30); i++) j = k++; • ran each 200 times (interleaved) • mean runtimes were 2.835 and 2.735 • significant well below .05
What?
++k:
leal -8(%ebp), %eax
incl (%eax)
movl -8(%ebp), %eax
k++:
movl -8(%ebp), %eax
leal -8(%ebp), %edx
incl (%edx)
%edx is not used anywhere else
Conclusion • Compile with -O and the assembly code is identical!
Pre-increment, take 2 • Take the gzip source code. • Replace all post-increments with pre-increments, in places where the semantics won't change. • Run on 1000 files, 10 times each. • Compare average runtime by file.
Sign test p = 8.5 × 10^-8
Conclusion • Preincrementing is faster! • ... but what about -O? • sign test: p = 0.197 • permutation test: p = 0.672 • Preincrement matters only without an optimizing compiler.
Your programs ... • 8 students had a working program both weeks. • 6 people changed their code. • 1 person changed nothing. • 1 person changed to -O3. • 3 people's programs were lossy in week 1. • Everyone's was lossy in week 2!
Your programs! • Was there an improvement on compression between the two versions? • H0: No. • Find sampling distribution of difference in means, using permutations.
Homework Assignment 2 6 experiments: • Does your program compress text or images better? • What about variance of compression? • What about gzip’s compression? • Variance of gzip’s compression? • Was there a change in the compression of your program from week 1 to week 2? • In the runtime?
Remainder of the course • 11/9: EDA • 11/16: Regression and learning • 11/23: Happy Thanksgiving! • 11/30: Statistical debugging • 12/7: Review, Q&A • Saturday 12/17, 2-5pm: Exam