Spatial Modelling 2 Multivariate statistics using R

Presentation Transcript


  1. Spatial Modelling 2: Multivariate statistics using R. Rich Harris

  2. Quick test • Is the relationship positive or negative? • What is the r (Pearson correlation) value? • What would the gradient of the line be if r = 0?

  3. 4) What is the equation of the line of best fit shown? 5) What is the probability the gradient is actually zero?

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -48.319      1.436  -33.65   <2e-16 ***
    X              2.947      0.049   60.15   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 5 on 48 degrees of freedom
    Multiple R-squared: 0.9869, Adjusted R-squared: 0.9866
    F-statistic: 3618 on 1 and 48 DF, p-value: < 2.2e-16

  4. TRUE or FALSE? • The standard error is a measure of uncertainty. • Holding all other things equal, as the number of observations (n) increases, the standard error decreases. • All else equal, as the variation around the line of best fit increases, the standard error decreases.

  5. TRUE or FALSE? • The t-statistic = estimate of the gradient of the line of best fit / standard error of the estimate • If the p value is greater than 0.05 we can be 95% confident the relationship (the slope) is not due to chance.

  6. Quick test • Is the relationship positive or negative? • Negative • What is the r (Pearson correlation) value? • -1 • What would the gradient of the line be if r = 0? • Zero (flat)

  7. 4) What is the equation of the line of best fit shown? y = 2.95x – 48.3. 5) What is the probability the gradient is actually zero? Almost zero.

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -48.319      1.436  -33.65   <2e-16 ***
    X              2.947      0.049   60.15   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 5 on 48 degrees of freedom
    Multiple R-squared: 0.9869, Adjusted R-squared: 0.9866
    F-statistic: 3618 on 1 and 48 DF, p-value: < 2.2e-16
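
A minimal sketch, not part of the original slides, of how output like this is produced; the data are simulated to resemble the numbers above, and the fitted equation can be read straight from the model object:

    set.seed(1)
    X = 1:50
    Y = -48.3 + 2.95*X + rnorm(50, sd=5)         # simulated to mimic the slide
    model0 = lm(Y ~ X)
    coef(model0)                                 # intercept and gradient: y = b0 + b1*x
    summary(model0)$coefficients[, "Pr(>|t|)"]   # p-values for each estimate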

  8. TRUE or FALSE? • The standard error is a measure of uncertainty. • TRUE • Holding all other things equal, as the number of observations (n) increases, the standard error decreases. • TRUE • All else equal, as the variation around the line of best fit increases, the standard error decreases. • FALSE

  9. TRUE or FALSE? • The t-statistic = estimate of the gradient of the line of best fit / standard error of the estimate • TRUE • If the p value is greater than 0.05 we can be 95% confident the relationship (the slope) is not due to chance. • FALSE
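
The first statement can be checked numerically; a sketch using model0 from the simulation above:

    cf = summary(model0)$coefficients
    cf[, "Estimate"] / cf[, "Std. Error"]   # reproduces the t value column
    cf[, "t value"]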

  10. Reading • Core reading • The chapter I sent around • Though you should have read it by now! • Additional reading • Chapters 6-7 of Allison, PD (1999) Multiple Regression: A Primer; or • Chapter 9 of Rogerson, PA (2006) Statistical Methods for Geography (2nd edition) (or chapter 7 in the 2001 edition)

  11. Regression modelling • Need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression

  12. Words to describe the relationship • Positive • Negative

  13. Words to describe the relationship • Linear • Non-linear

  14. For linear regression • The relationship between Y and X in the model is assumed linear. • It’s the simplest relationship!

  15. Uses of regression • For verification • For exploration • For prediction • For explanation • For some sense of statistical “proof” but not for naive “data mining” • You need to think about the model! • It should have substantive as well as statistical meaning. • In terms of explanation, modelling Y as a function of X (Y ~ X) is not the same as modelling X as a function of Y (X ~ Y)

  16. Regression vs correlation • Regression: CAUSAL. y = dependent variable; x = independent variable (y is modelled as a function of x) • Correlation: ASSOCIATION. No causality implied
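
A small simulated sketch, not from the slides, of the distinction: correlation is symmetric in the two variables, but swapping their roles in a regression changes the fitted line:

    set.seed(2)
    x = rnorm(100)
    y = 2*x + rnorm(100)
    cor(x, y); cor(y, x)   # identical: correlation measures association only
    coef(lm(y ~ x))        # slope of y on x
    coef(lm(x ~ y))        # slope of x on y: a different line, not simply the inverse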

  17. Regression modelling • Need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression

  18. General process to fit the model in R • Read in the data • Take “a look at it” numerically and visually • Plot Y against X • If appropriate to do so, fit the model • Add the line to the plot and obtain a summary of the results

  19. General process to fit the model in R

    mydata = read.table("my_file.dat", ...)
    summary(mydata)
    names(mydata)
    # assume there are two variables in mydata
    # called response and predictor
    plot(response ~ predictor, data=mydata)
    # assume an apparent linear relationship
    model1 = lm(response ~ predictor, data=mydata)
    summary(model1)
    abline(model1)
    save.image("regression.RData")   # always save your work!
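
The same process as a runnable sketch, substituting R's built-in cars data frame (stopping distance against speed) for the hypothetical my_file.dat:

    summary(cars)
    plot(dist ~ speed, data=cars)             # the relationship looks roughly linear
    model.cars = lm(dist ~ speed, data=cars)
    summary(model.cars)
    abline(model.cars)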

  21. Interpreting the output • The equation of the fitted line is • ŷ = 1195.92 + 48.75 × TEMP • The t value is 1.107 • The probability of drawing that t value by chance alone is p=0.2836 (28%) • Therefore, “72% confident” of the relationship • Not significant at a 95% level

  22. Interpreting the output • The equation of the fitted line is • ŷ = 2198 – 0.36 × PRECIP • The t value is -5.236 (the negative sign only indicates the direction of the relationship) • The probability of drawing that t value by chance alone is very small! • Therefore, (more than) “99.9% confident” of the relationship • Significant at a 95% level (and above)
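
A sketch of the same conclusion from another angle, using model.cars from the earlier sketch: a 95% confidence interval for the slope that excludes zero corresponds to significance at the 95% level.

    confint(model.cars, level=0.95)   # the slope's interval excludes zero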

  23. The R-squared values • The R-squared value measures the proportion of the variation in the Y variable that is explained by the model • For simple linear regression, if you calculate the (Pearson) correlation between the X and Y variables then square it, you’ll arrive at R2 • More generally, if you calculated the square of the correlation between the actual Y values and those predicted by the model, you’d obtain R2 • The adjusted R2 adjusts the R2 value for the number of explanatory/predictor variables (the number of Xs) in the model. • The logic is that if you include more Xs you will appear to explain more of the Y due to chance alone. • Adjusted R2 is useful for multiple regression
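
Both claims are easy to verify numerically; a sketch using model.cars from the earlier example:

    cor(cars$speed, cars$dist)^2           # squared Pearson correlation of X and Y
    cor(cars$dist, fitted(model.cars))^2   # squared correlation of actual vs predicted Y
    summary(model.cars)$r.squared          # matches both of the above
    summary(model.cars)$adj.r.squared      # penalised for the number of predictors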

  24. And that’s pretty much it… • Except we need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression

  25. Regression assumptions and checks • The regression line is fitted by “ordinary least squares” • OLS regression • If certain assumptions are met the result will be BLUE • Best Linear Unbiased Estimator • The estimates of β0 and β1 will be unbiased, consistent (as n → ∞, the estimates converge to the true values) and most efficient (minimum variance among linear unbiased estimators). The same is true of the standard errors and t-values. • If certain assumptions are met

  26. Assumptions • Assumptions about linearity • Assumptions about the residuals • “the error”

  27. For linear regression • The relationship between Y and X in the model is assumed linear. • It’s the simplest relationship! • Which may involve transforming the variables

  28. Common transformations

    plot(y ~ x)
    x2 = log10(x)
    plot(y ~ x2)

  29. Common transformations

    plot(y ~ x)
    x2 = log10(x); y2 = log10(y)
    plot(y2 ~ x2)

  30. Common transformations

    plot(y ~ x)
    x2 = x^2
    plot(y ~ x2)

  31. Note • Graphical methods are very useful for “exploring” data and spotting potential problems. • It is safer to transform the variables, creating new variables (objects) in R, and then fit the model to them than to try a ‘one shot’ process such as: model1 = lm(y ~ log10(x)) • This has potential for error (see the sketch below)
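
A minimal sketch of the safer pattern, using simulated data (all names here are illustrative):

    set.seed(3)
    mydata = data.frame(x = runif(50, min=1, max=100))
    mydata$y = 1 + 2*log10(mydata$x) + rnorm(50, sd=0.1)
    mydata$x2 = log10(mydata$x)        # create the transformed variable first...
    plot(y ~ x2, data=mydata)          # ...check the relationship now looks linear...
    model2 = lm(y ~ x2, data=mydata)   # ...then fit the model to the new variable
    abline(model2)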

  32. Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback

  33. What are residuals? • O – E (observed minus expected) • y – ŷ • y – (β0 + β1x)
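
A sketch confirming that the three definitions agree, using model.cars from the earlier example:

    r1 = residuals(model.cars)
    r2 = cars$dist - fitted(model.cars)         # y minus y-hat
    b = coef(model.cars)
    r3 = cars$dist - (b[1] + b[2]*cars$speed)   # y minus (b0 + b1*x)
    all.equal(as.numeric(r1), as.numeric(r2))   # TRUE
    all.equal(as.numeric(r2), as.numeric(r3))   # TRUE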

  34. Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback

  35. Normal distribution?

    model1 = lm(ELA ~ PRECIP, data=glacier_data)
    model1.resids = residuals(model1)
    hist(model1.resids)

  36. Normal distribution?

    xx = seq(from=-150, to=150, by=1)
    mean.r = mean(model1.resids)
    sd.r = sd(model1.resids)
    bin = 50                              # the histogram's bin width
    n = nrow(glacier_data)
    yy = dnorm(xx, mean.r, sd.r)*bin*n    # Normal density scaled to the count scale
    points(xx, yy, type="l", col="red")   # overlay the curve on the histogram

  37. Normal distribution?

    qqnorm(model1.resids)   # quantile-quantile plot against the Normal
    qqline(model1.resids)   # reference line through the quartiles
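
A complementary formal check, not in the original slides: the Shapiro-Wilk test, sketched here on the residuals of the model.cars example (for the glacier model, substitute model1.resids):

    shapiro.test(residuals(model.cars))   # a small p-value is evidence against Normality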

  38. Normal distribution?

  39. Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback

  40. Outliers • May be extreme residuals, leverage points or both • plot(ELA ~ PRECIP) • identify(ELA ~ PRECIP) • Extreme residuals • Perhaps 7 and 8? • Leverage points • Perhaps 5, 14, 8, 9?

  41. Watch out for outliers (plots: without an outlier; with an outlier, a leverage point)

  42. To help identify

    plot(model1, which=1)   # residuals vs fitted values
    plot(model1, which=2)   # Normal Q-Q plot of the residuals

  43. To help identify

    plot(model1, which=5)   # residuals vs leverage, with Cook's distance contours

• Points 5 and 14 appear on all the plots • But their studentised residual value is not especially high • Rule of thumb: worry more if above +2 or below -2 • Nor is their “Cook’s distance” especially high • Rule of thumb: worry more if they approach the 1 contour • What to do if they were more problematic?
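
The same diagnostics are available as numbers rather than plots; a sketch using model.cars from the earlier example:

    rstudent(model.cars)                   # studentised residuals: worry beyond +/- 2
    cooks.distance(model.cars)             # Cook's distance: worry as values approach 1
    which(abs(rstudent(model.cars)) > 2)   # flag candidate extreme residuals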

  44. Consolidation • Regression assumes the X leads to the Y, not the other way around • Linear regression is for when there is a straight line relationship between the X and Y variables, or they can be transformed for it to be so • The line of best fit cannot genuinely be so if it is distorted by the influence of outliers or if the relationship isn’t linear • The regression residuals should be Normally distributed • We use various visual tools to check for linearity, Normality and for the presence of extreme residuals / leverage points
