
correlation and percentages



Presentation Transcript


  1. correlation and percentages • association between variables can be explored using counts • are high counts of bone needles associated with high counts of end scrapers? • similar questions can be asked using percent-standardized data • are high proportions of decorated pottery associated with high proportions of copper bells?

  2. but… • these are different questions with different implications for formal regression • percents will show some correlation even if underlying counts do not… • ‘spurious’ correlation (negative) • “closed-sum” effect
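The closed-sum effect can be illustrated with a small simulation. This is only a sketch: it draws independent counts in the same way as the `rnorm(…, 50, 15)` call on the next slide, but uses many more cases so the correlations are stable. The names `n_cases`, `n_vars`, and `mean_offdiag_cor()` are illustrative and not from the slides.

```r
# Independent simulated counts vs. their row percentages.
set.seed(1)
n_cases <- 1000   # assumed sample size, larger than the slide's 10 rows

mean_offdiag_cor <- function(m) {
  r <- cor(m)
  mean(r[upper.tri(r)])   # average correlation over all pairs of columns
}

for (n_vars in c(10, 5, 3, 2)) {
  counts <- matrix(round(rnorm(n_cases * n_vars, 50, 15)), nrow = n_cases)
  counts[counts < 1] <- 1                  # guard against non-positive counts
  pcts <- 100 * counts / rowSums(counts)   # percent-standardize each case (row)
  cat(n_vars, "vars: counts r =", round(mean_offdiag_cor(counts), 2),
      " percents r =", round(mean_offdiag_cor(pcts), 2), "\n")
}
```

The counts are uncorrelated by construction, yet the percentages pick up a negative correlation that grows stronger as the number of variables shrinks; with only two variables the percentages must sum to 100, so they are perfectly negatively correlated.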

  3. matrix(round(rnorm(100, 50, 15)), nrow=10) • 10 vars. • 5 vars. • 3 vars. • 2 vars.

  4.  original counts  %s (10 vars.)  %s (5 vars.)  %s (3 vars.)  %s (2 vars.)

  5. original counts %s 10 vars. %s 5 vars. %s 3 vars. %s 2 vars.

  6. outliers

  7. including outliers in regression analyses is usually a bad idea… • Tukey-line / least squares discrepancies are good red-flag signals
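One way to see the red-flag signal is to fit both lines to the same data and compare slopes. The sketch below uses made-up data with a single artificial outlier; `line()` from the base `stats` package fits Tukey's resistant line. The variable names and the true slope of 0.5 are assumptions for illustration.

```r
set.seed(2)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
y[20] <- y[20] + 15            # one artificial high-leverage outlier

ls_fit    <- lm(y ~ x)         # least squares
tukey_fit <- line(x, y)        # Tukey's resistant line

coef(ls_fit)                   # slope pulled toward the outlier
coef(tukey_fit)                # slope based on medians, largely unaffected

plot(x, y)
abline(ls_fit, lty = 2)                  # dashed: least squares
abline(coef = coef(tukey_fit))           # solid: Tukey line
```

A large gap between the two slopes suggests that a few points are driving the least-squares fit.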

  8. “convex hull trimming”

  9. “convex hull trimming” > hull1 <- chull(x, y) > plot(x, y) > polygon(x[hull1], y[hull1]) > abline(lm(y[-hull1] ~ x[-hull1]))

  10. transformation

  11. transformation • at least two major motivations in regression analysis: • create/improve a linear relationship • correct skewed distribution(s)

  12. ex: density of obsidian vs. distance from the quarry:

  13. > LG_DENS <- log(DENSITY) > old.par <- par(no.readonly = TRUE) > plot(DIST, DENSITY, log="y") > par(old.par)
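To show why the log transformation helps here, the following sketch compares a straight-line fit on raw and logged densities. The data are simulated under an assumed exponential fall-off with distance; the values of `DIST` and `DENSITY` are hypothetical and not the obsidian data from the slides.

```r
set.seed(3)
DIST    <- seq(5, 100, by = 5)                                 # hypothetical distances
DENSITY <- 200 * exp(-0.05 * DIST) * exp(rnorm(20, sd = 0.2))  # hypothetical densities

raw_fit <- lm(DENSITY ~ DIST)        # straight line through a curved relationship
log_fit <- lm(log(DENSITY) ~ DIST)   # straight line after logging y

summary(raw_fit)$r.squared
summary(log_fit)$r.squared
coef(log_fit)                        # slope estimates the assumed decay rate (-0.05)
```

If the relationship really is exponential, the logged fit should be much closer to linear than the raw one.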

  14. > VAR1T <- sqrt(VAR1) > plot(VAR1T, VAR2)

  15. transformation summary • correcting left skew: x^4 stronger, x^3 strong, x^2 mild • correcting right skew: sqrt(x) weak, log(x) mild, -1/x strong, -1/x^2 stronger
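The right-skew half of this ladder can be checked on a simulated variable. This is a sketch only: `x` is a hypothetical lognormal variable, and `skew()` is a simple moment-based skewness helper written for the example, not a function from the slides.

```r
set.seed(4)
x <- rlnorm(500, meanlog = 3, sdlog = 0.8)   # hypothetical right-skewed variable

# Simple moment-based skewness: positive = right skew, negative = left skew
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

ladder <- sapply(list(raw = x, sqrt = sqrt(x), log = log(x),
                      neg_recip = -1 / x, neg_recip_sq = -1 / x^2), skew)
round(ladder, 2)
```

Each step down the ladder pulls in the right tail more strongly than the one before, so the skewness values decrease from left to right; a transformation that is too strong for the data overcorrects into left skew.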

  16. “coefficient of determination”

  17. regression/correlation • the strength of a relationship can be assessed by seeing how knowledge of one variable improves the ability to predict the other

  18. if you ignore x, the best predictor of y will be the mean of all y values (y-bar) • if the y measurements are widely scattered, prediction errors will be greater than if they are close together • we can assess the dispersion of y values around their mean by:

  19. r2 = • “coefficient of determination” (r2) • describes the proportion of variation that is “explained” or accounted for by the regression line… • r2 = .5 → half of the variation is explained by the regression… → half of the variation in y is explained by variation in x…
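The formulas on these slides did not survive in the transcript. The sketch below assumes the usual sum-of-squares definition: dispersion of y around its mean, compared with dispersion of y around the regression line. The data and the names `ss_total` and `ss_resid` are illustrative.

```r
set.seed(5)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 4)

fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)       # dispersion of y around y-bar (ignoring x)
ss_resid <- sum((y - fitted(fit))^2)   # dispersion of y around the regression line

r2 <- (ss_total - ss_resid) / ss_total # proportion of variation "explained"
c(by_hand = r2, from_lm = summary(fit)$r.squared, cor_squared = cor(x, y)^2)
```

For a simple regression with one predictor, all three values agree.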

  20. x “explaining variance” range vs.

  21. vs.

  22. multiple regression

  23. residuals • vertical deviations of points around the regression • for case i, residual = y_i - ŷ_i = y_i - (a + b*x_i) • residuals in y should not show patterned variation either with x or y-hat • should be normally distributed around the regression line • residual error should not be autocorrelated (errors/residuals in y are independent…)
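These checks can be run informally in base R. The sketch below uses simulated data and a few standard diagnostics; the choice of plots and of `shapiro.test()` is an assumption, not something the slides prescribe.

```r
set.seed(6)
x <- runif(60, 0, 10)
y <- 1 + 0.8 * x + rnorm(60)
fit <- lm(y ~ x)
res <- resid(fit)                # y_i minus y-hat_i for each case

old.par <- par(mfrow = c(1, 3))
plot(fitted(fit), res); abline(h = 0, lty = 2)   # any pattern against y-hat?
qqnorm(res); qqline(res)                          # roughly normal?
acf(res)                                          # autocorrelation (only meaningful if cases are ordered)
par(old.par)

shapiro.test(res)                # formal normality test, as a supplement to the plots
```

Curvature or a funnel shape in the first panel, or strong departures from the line in the second, would suggest that a transformation or an additional variable is needed.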

  24. residuals may show patterning with respect to other variables… • explore this with a residual scatterplot • ŷ vs. other variables… • are there suggestions of linear or other kinds of relationships? • if r2 < 1, some of the remaining variation may be explainable with reference to other variables

  25. paying close attention to outliers in a residual plot may lead to important insights • e.g.: outlying residuals from quantities of exotic flint ~ distance from quarries • sites with special access through transport routes, political alliances… • residuals from regressions are often the main payoff

  26. Middle Formative, Basin of Mexico

  27. Formative Basin of Mexico • settlement survey • 3 variables recorded from sites: • site size (proxy for population) • amount of arable land in standard “catchment” • productivity index for soils

  28. SIZE (ha) • AGLAND (km2) • PROD (index) How are these variables related? Do any make sense as dependent or independent variables?

  29. SIZE ~ AGLAND

  30. SIZE (ha) vs. AGLAND (km2) • r2 = .75 • y = 35.4 + .66x • SIZE = 35.38 + .66*AGLAND

  31. residuals??

  32. residual SIZE = SIZE – SIZE-hat > resSize <- frmdat$size - (35.4 + .66 * frmdat$agland)
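The same residuals can be taken directly from a fitted model rather than typing the rounded coefficients by hand. In this sketch `frmdat` is a simulated stand-in with the column names used on the slide; the values are hypothetical and are not the survey data.

```r
set.seed(7)
# Hypothetical stand-in for the survey data; values are simulated.
frmdat <- data.frame(agland = runif(30, 10, 100))
frmdat$size <- 35.4 + 0.66 * frmdat$agland + rnorm(30, sd = 10)

fit <- lm(size ~ agland, data = frmdat)
resSize <- resid(fit)            # SIZE minus SIZE-hat, using unrounded coefficients

# Same calculation by hand from the fitted intercept and slope
byHand <- frmdat$size - (coef(fit)[1] + coef(fit)[2] * frmdat$agland)
all.equal(unname(resSize), unname(byHand))

plot(frmdat$agland, resSize); abline(h = 0, lty = 2)
```

Using `resid()` avoids small discrepancies introduced by rounding the intercept and slope.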

  33. PROD & SIZE SIZE = -29 + 98 * PROD r2 = .69

  34. r2 = .75 What have we “explained” about site size?? r2 = .69

  35. r2 = .55

  36. X0 X1 X2 multiple regression…

  37. X0 • 1 = total variance observed in dependent variable (x0)

  38. X0 X1 variance in x0 explained by x1, by itself… variance in x0 unexplained by x1…

  39. X0 X2 variance in x0 explained by x2, by itself… variance in x0 unexplained by x2…

  40. X0 X1 (total variance in x0 explained by x1, that is not explained by x2…) partial correlation coefficient: proportion of variance in x0 explained by x1, that is not explained by x2…

  41. multiple coefficient of determination: variance in x0 explained by x1 and x2, both separately, and together…
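The overlapping-variance picture can be sketched numerically. The data below are simulated so that two correlated predictors both influence the response; the variable names echo the slides, but the values and coefficients are hypothetical. The partial correlation is computed by correlating residuals, which is one standard way to obtain it.

```r
set.seed(8)
n <- 40
# Hypothetical data: prod and agland are correlated, and both influence size.
prod   <- runif(n, 0.5, 1.5)
agland <- 40 * prod + rnorm(n, sd = 10)
size   <- 10 + 0.5 * agland + 40 * prod + rnorm(n, sd = 8)

r2_agland <- summary(lm(size ~ agland))$r.squared         # x1 by itself
r2_prod   <- summary(lm(size ~ prod))$r.squared           # x2 by itself
r2_both   <- summary(lm(size ~ agland + prod))$r.squared  # multiple r2

# Partial correlation of size with agland, controlling for prod:
# correlate the parts of each that prod does not account for.
partial_r <- cor(resid(lm(size ~ prod)), resid(lm(agland ~ prod)))

c(agland = r2_agland, prod = r2_prod, both = r2_both, partial_r2 = partial_r^2)
```

Because the predictors overlap, the multiple r2 is smaller than the sum of the two separate r2 values. The squared partial correlation equals the gain from adding agland, expressed as a share of the variance that prod left unexplained: (r2_both - r2_prod) / (1 - r2_prod).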

  42. productivity agricultural land SITE-SIZE
