Exploratory Data Analysis Hal Varian 20 March 2006
What is EDA? • Goals • Examine and summarize data • Look for patterns and suggest hypotheses • Provide guidance for more systematic analysis • Methods of analysis • Primarily graphics and tables • Online reference • http://www.itl.nist.gov/div898/handbook/eda/eda.htm • http://www.math.yorku.ca/SCS/Courses/eda/
Tools for EDA • We will use R = open source S • Very widely used by statisticians • Libraries for all sorts of things are available • Download from • cran.stat.ucla.edu • http://www.r-project.org/ • Recommend ESS (=Emacs Speaks Statistics) for interactive use • Windows interface is not bad
Interactive R session
> library("foreign")
> dat <- read.spss("GSS93 subset.sav")
> attach(dat)
> summary(AGE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    33.0    43.0    46.4    59.0    99.0
> hist(AGE)
Recode missing data • AGE[AGE>90] <- NA • plot(density(AGE,na.rm=T)) • #plot both together • hist(AGE,freq=F) • lines(density(AGE,na.rm=T))
Boxplot • Parts of a boxplot, from top to bottom: • Outlier • Whisker, reaching at most 1.5 × interquartile range beyond the box • 3rd quartile • Median • 1st quartile • Smallest value
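The pieces of the boxplot can also be computed without drawing anything. The sketch below uses a small made-up sample (not the GSS data) and base R's boxplot.stats to show where the 1.5 × interquartile range rule puts the fences and which point is flagged as an outlier.

```r
# Hypothetical sample with one extreme value (not from the GSS data)
x <- c(21, 25, 29, 33, 38, 43, 47, 52, 59, 64, 120)

s <- boxplot.stats(x, coef = 1.5)   # coef = 1.5 is R's default
hinges <- s$stats[c(2, 4)]          # 1st and 3rd quartile (Tukey hinges)
iqr <- diff(hinges)                 # interquartile range of the box

# Points beyond these fences are drawn individually as outliers
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

s$stats   # lower whisker, 1st quartile, median, 3rd quartile, upper whisker
s$out     # values outside the fences
```

The whiskers stop at the most extreme data points that are still inside the fences, so they are usually shorter than 1.5 × interquartile range.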
Boxplot enhancements • notch=T: notches show an approximate confidence interval for the median • varwidth=T: width of box is proportional to sqrt(n) • Useful for comparisons
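As a rough check on what the notches mean, the sketch below recomputes them by hand for simulated data (the variable x is made up, not the GSS AGE variable). R documents the notch as the median plus or minus 1.58 × IQR / sqrt(n), and boxplot.stats returns the same interval in its conf component.

```r
set.seed(1)
x <- rnorm(200, mean = 45, sd = 15)   # hypothetical ages, for illustration only

s <- boxplot.stats(x)
iqr <- diff(s$stats[c(2, 4)])         # distance between the hinges

# Notch limits computed by hand from the documented formula
notch <- s$stats[3] + c(-1.58, 1.58) * iqr / sqrt(s$n)

notch    # hand-computed notch limits
s$conf   # the limits R uses when notch = TRUE
```

When the notches of two boxes do not overlap, that is usually read as evidence that the two medians differ.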
Comparing distributions • boxplot(AGE~RACE) • boxplot(AGE~RACE,notch=T,varwidth=T) • There doesn’t seem to be a big difference in the age distributions
EDUC v RACE • boxplot(EDUC[EDUC<90]~RACE[EDUC<90],notch=T,varwidth=T)
Violin plot • Combines density plot and boxplot • Good for weird shaped distributions…
Back to Back Histogram • library("Hmisc") • histbackback(EDUC[RACE=="black"],EDUC[RACE=="white"],probability=T)
Two-way table
• GT12 <- EDUC>12
• temp <- table(GT12,RACE)
• temp
GT12    white black other
  FALSE   614   100    37
  TRUE    640    67    38
• prop.table(temp,2)
GT12        white     black     other
  FALSE 0.4896332 0.5988024 0.4933333
  TRUE  0.5103668 0.4011976 0.5066667
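The column proportions can be reproduced without the GSS file by typing in the counts from the table above. This is only a sketch: the object names counts, tab and col_share are made up for the illustration.

```r
# Counts copied from the two-way table on the slide
counts <- matrix(c(614, 640, 100, 67, 37, 38), nrow = 2,
                 dimnames = list(GT12 = c("FALSE", "TRUE"),
                                 RACE = c("white", "black", "other")))
tab <- as.table(counts)

# margin = 2 divides each cell by its column total,
# giving the share with more than 12 years of education within each race
col_share <- prop.table(tab, margin = 2)

round(col_share, 3)
colSums(col_share)   # each column sums to 1
```

Using margin = 1 instead would give shares within each row, which answers a different question (the racial composition of each education group).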
Comparing distributions • qqplot = quantile-quantile plot • For each fraction of the data, plot the value below which that fraction of x falls against the corresponding value for y • Shapes • Straight line y = x: same distribution • Vertical intercepts differ: different mean • Slopes differ: different variance • Reference distribution can be a theoretical distribution • qqnorm: compare to standardized normal • With the sample on the horizontal axis (R's qqnorm puts the sample on the vertical axis, which reverses these directions): • Skew to right: both tails below straight line • Heavy tails: lower tail above, upper tail below line
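The shape rules can be checked on simulated data. In the sketch below all samples are made up: one pair differs only in mean, another only in standard deviation. A straight line fitted through the matched quantiles should then have roughly slope 1 and intercept 2 in the first case, and roughly slope 2 in the second.

```r
set.seed(42)
n <- 2000
x       <- rnorm(n, mean = 0, sd = 1)
y_shift <- rnorm(n, mean = 2, sd = 1)   # same spread, different mean
y_scale <- rnorm(n, mean = 0, sd = 2)   # same mean, different spread

# plot.it = FALSE returns the matched quantiles without drawing
qq_shift <- qqplot(x, y_shift, plot.it = FALSE)
qq_scale <- qqplot(x, y_scale, plot.it = FALSE)

# Summarise each QQ plot by the line through its points
fit_shift <- coef(lm(qq_shift$y ~ qq_shift$x))
fit_scale <- coef(lm(qq_scale$y ~ qq_scale$x))
fit_shift   # intercept near 2, slope near 1
fit_scale   # intercept near 0, slope near 2

# Sample against the standard normal, with a reference line
qqnorm(y_scale)
qqline(y_scale)
```

Fitting a line is only a convenient numeric summary here; in practice the plot itself is what gets inspected, especially in the tails.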
qqplot(x,y) examples • Figure panels: identical distributions; mean1=0 v mean2=2; s1=1 v s2=2; sample v N(0,1), with reference line
More qqnorm examples • Figure panels: skewed to right; heavy tails • Source: www.maths.murdoch.edu.au/units/statsnotes/samplestats/qqplot.html
Pairs of variables • Is one variable related to another? • Scatterplot • Basic: plot(x,y) • Enhanced from library("car"): scatterplot(x,y) • Scatterplot matrix • Basic: pairs(data.frame(x,y,z)) • Enhanced: scatterplot.matrix(data.frame(x,y,z)) (named scatterplotMatrix in current versions of car)
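A minimal base-R sketch of the basic versions, using simulated variables (x, y and z are made up: y depends on x, z is unrelated noise). The enhanced versions from the car package are left out so that nothing beyond base R is needed.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)   # related to x by construction
z <- rnorm(n)           # unrelated to both
d <- data.frame(x, y, z)

plot(x, y)       # single scatterplot
pairs(d)         # scatterplot matrix of all pairs

round(cor(d), 2) # numeric companion to the scatterplot matrix
```

The x-y panel should show a clear upward band, while the panels involving z should look like shapeless clouds.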
Labeling points in scatterplots • identify(x,y,labels="foo") • Color is also useful
Cigarettes and taxes • Discussant on paper by Austan Goolsbee, “Playing with Fire” • Question: did Internet purchases of cigarettes affect state tobacco tax revenues?
Price elasticity of use/sales • Across all states and years • Taxable sales elasticity: -0.802 • Use elasticity: -0.440 • Sales are much more responsive to price than usage, suggesting that there is some cross-border trade (aka “buttlegging”)
Reduced form • dp = log(p2001) – log(p1995) • dq = log(q2001) – log(q1995) • Regress dq/dp on internet penetration in 2000 • See next slide for result
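The reduced form can be sketched as follows. Everything in the data is simulated for illustration: the state count, prices, quantities, penetration rates and the assumed relationship between elasticity and Internet penetration are all made up, and none of it comes from the Goolsbee paper.

```r
set.seed(7)
n_states <- 50
internet <- runif(n_states, 0.3, 0.6)        # hypothetical penetration in 2000
p1995 <- runif(n_states, 1.5, 2.5)           # hypothetical prices
p2001 <- p1995 * runif(n_states, 1.2, 1.8)   # prices rose in every state

# Assumption for the simulation only: demand is more elastic
# where Internet penetration is higher
true_elasticity <- -0.4 - 1.0 * internet + rnorm(n_states, sd = 0.1)
q1995 <- runif(n_states, 50, 150)
q2001 <- q1995 * (p2001 / p1995)^true_elasticity

# The reduced form from the slide
dp <- log(p2001) - log(p1995)
dq <- log(q2001) - log(q1995)
elasticity <- dq / dp                        # state-level elasticity
fit <- lm(elasticity ~ internet)
summary(fit)$coefficients
```

A negative coefficient on internet would mean taxable sales respond more strongly to price changes in states with more Internet access.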
What is Internet providing? • It was always a good deal for some to buy cigarettes out-of-state (in high tax states) • Mail order has been around for a long time and is certainly cost-effective • Internet makes it easier to find merchants – just type into search engine • Internet is great at matching buyers and sellers
Price of a match • Google doesn’t accept cigarette advertisements, but Overture does • Price for top listing: $1.20 per click • Avg price for click on Overture is 40 cents • Conversion rates might be 5%, so at $1.20 per click the advertiser is paying $1.20/0.05 = $24 per introduction • But think of lifetime value…
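The arithmetic behind the $24 figure, as a small sketch. The 5% conversion rate is the guess from the slide; the result for the average click price is derived here for comparison and is not stated in the slides.

```r
cost_per_click  <- c(top_listing = 1.20, average = 0.40)  # dollars, from the slide
conversion_rate <- 0.05                                   # the slide's guess

# Clicks needed per customer is 1 / conversion_rate,
# so cost per customer is cost per click divided by the rate
cost_per_customer <- cost_per_click / conversion_rate
round(cost_per_customer, 2)
```

Whether that is expensive depends on how much a repeat customer is worth over time, which is the point of the last bullet.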
Straightening out and scaling data • Find a transform so that the data look linear, or normal, or fit on the same scale • Log10 (easier to interpret than natural log) • Square root • Reciprocal • Box-Cox transform (x^r - 1)/r, which combines many of the above; the limit as r → 0 is log(x)
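A minimal sketch of the Box-Cox transform, written out by hand. The function name box_cox and the sample values are made up for the illustration, and it assumes strictly positive data.

```r
# Box-Cox transform: (x^r - 1) / r, with log(x) as the r = 0 case
box_cox <- function(x, r) {
  stopifnot(all(x > 0))                    # only defined here for positive data
  if (abs(r) < 1e-8) log(x) else (x^r - 1) / r
}

x <- c(0.5, 1, 2, 10, 100)

box_cox(x, 1)     # x - 1: no change of shape
box_cox(x, 0.5)   # 2 * (sqrt(x) - 1): square-root scale
box_cox(x, 0)     # log(x)
box_cox(x, -1)    # 1 - 1/x: reciprocal scale

# The transform approaches log(x) as r shrinks toward 0
max(abs(box_cox(x, 1e-6) - log(x)))
```

Subtracting 1 and dividing by r changes only location and scale, which is why r = 0.5 and r = -1 behave like the square root and the reciprocal while keeping the ordering of the data.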