Spatial Modelling 2: Multivariate statistics using R. Rich Harris
Quick test • Is the relationship positive or negative? • What is the r (Pearson correlation) value? • What would the gradient of the line be if r = 0?
4) What is the equation of the line of best fit shown?
5) What is the probability the gradient is actually zero?

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.319      1.436  -33.65   <2e-16 ***
X             2.947      0.049   60.15   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5 on 48 degrees of freedom
Multiple R-squared: 0.9869, Adjusted R-squared: 0.9866
F-statistic: 3618 on 1 and 48 DF, p-value: < 2.2e-16
TRUE or FALSE? • The standard error is a measure of uncertainty. • Holding all other things equal, as the number of observations (n) increases, the standard error decreases. • All else equal, as the variation around the line of best fit increases, the standard error decreases.
TRUE or FALSE? • The t-statistic = estimate of the gradient of the line of best fit / standard error of the estimate • If the p value is greater than 0.05 we can be 95% confident the relationship (the slope) is not due to chance.
Quick test • Is the relationship positive or negative? • Negative • What is the r (Pearson correlation) value? • -1 • What would the gradient of the line be if r = 0? • Zero (flat)
4) What is the equation of the line of best fit shown? y = 2.95x – 48.3
5) What is the probability the gradient is actually zero? Almost zero

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.319      1.436  -33.65   <2e-16 ***
X             2.947      0.049   60.15   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5 on 48 degrees of freedom
Multiple R-squared: 0.9869, Adjusted R-squared: 0.9866
F-statistic: 3618 on 1 and 48 DF, p-value: < 2.2e-16
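The fitted equation can also be read straight from the model object rather than from the printed summary. A minimal sketch, assuming the model shown above was stored as model1 (the object name is illustrative):

coef(model1)    # returns the intercept and slope
# (Intercept)           X
#     -48.319       2.947
# i.e. y = 2.947x - 48.319, matching the answer above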
TRUE or FALSE? • The standard error is a measure of uncertainty. • TRUE • Holding all other things equal, as the number of observations (n) increases, the standard error decreases. • TRUE • All else equal, as the variation around the line of best fit increases, the standard error decreases. • FALSE
TRUE or FALSE? • The t-statistic = estimate of the gradient of the line of best fit / standard error of the estimate • TRUE • If the p value is greater than 0.05 we can be 95% confident the relationship (the slope) is not due to chance. • FALSE
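As a quick check on the first TRUE above, the slope's t value from the earlier output can be recomputed by hand; the small discrepancy is just rounding in the printed table:

2.947 / 0.049    # estimate / standard error = 60.14 (printed as 60.15)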
Reading • Core reading • The chapter I sent around • Though you should have read it by now! • Additional reading • Chapters 6-7 of Allison, PD (1999) Multiple Regression: A Primer; or • Chapter 9 of Rogerson, PA (2006) Statistical Methods for Geography (2nd edition) (or chapter 7 in the 2001 edition)
Regression modelling • Need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression
Words to describe the relationship • Positive • Negative (example scatterplots not reproduced)
Words to describe the relationship • Linear • Non-linear (example scatterplots not reproduced)
For linear regression • The relationship between Y and X in the model is assumed linear. • It’s the simplest relationship!
Uses of regression • For verification • For exploration • For prediction • For explanation • For some sense of statistical "proof", but not for naive "data mining" • You need to think about the model! • It should have substantive as well as statistical meaning. • In terms of explanation, modelling Y as a function of X (Y ~ X) is not the same as modelling X as a function of Y (X ~ Y)
Regression vs correlation • Regression (y ~ x): CAUSAL; y = dependent variable, x = independent variable • Correlation (y, x): ASSOCIATION; no causality implied
Regression modelling • Need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression
General process to fit the model in R • Read in the data • Take “a look at it” numerically and visually • Plot Y against X • If appropriate to do so, fit the model • Add the line to the plot and obtain a summary of the results
General process to fit the model in R

mydata = read.table("my_file.dat", …)
summary(mydata)
names(mydata)
# assume there are two variables in mydata
# called response and predictor
plot(response ~ predictor, data=mydata)
# assume an apparent linear relationship
model1 = lm(response ~ predictor, data=mydata)
summary(model1)
abline(model1)
save.image("regression.RData")    # always save your work!
Interpreting the output • The equation of the fitted line is • ŷ = 1195.92 + 48.75 × TEMP • The t value is 1.107 • The probability of drawing that t value by chance alone is p = 0.2836 (28%) • Therefore, only "72% confident" of the relationship (1 − p) • Not significant at the 95% level
Interpreting the output • The equation of the fitted line is • ŷ = 2198 – 0.36 × PRECIP • The t value is −5.236 (the negative sign only indicates the direction of the relationship) • The probability of drawing that t value by chance alone is very small! • Therefore, (more than) "99.9% confident" of the relationship • Significant at the 95% level (and above)
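A minimal sketch of where such a p value comes from, using base R's t distribution function pt() with the t value and degrees of freedom from the output shown earlier (two-sided test):

2 * pt(-abs(60.15), df = 48)    # vanishingly small; summary() prints anything below 2e-16 as <2e-16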
The R-squared values • The R-squared value measures the proportion of the variance in the Y variable that is explained by the model • For simple linear regression, if you calculate the (Pearson) correlation between the X and Y variables then square it, you'll arrive at R2 • More generally, if you calculate the square of the correlation between the actual Y values and those predicted by the model, you'll obtain R2 • The adjusted R2 adjusts the R2 value for the number of explanatory/predictor variables (the number of Xs) in the model • The logic is that if you include more Xs you will explain more of the Y due to chance alone • Adjusted R2 is useful for multiple regression
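A minimal sketch verifying both claims in R, assuming a simple model model1 fitted from variables x and y (the names are illustrative):

cor(x, y)^2                  # squared Pearson correlation: R-squared for simple regression
cor(y, fitted(model1))^2     # squared correlation of actual vs predicted: the general case
summary(model1)$r.squared    # R's own value, for comparison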
And that’s pretty much it… • Except we need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression
Regression assumptions and checks • The regression line is fitted by “ordinary least squares” • OLS regression • If certain assumptions are met the result will be BLUE • Best Linear Unbiased Estimator • The estimates of β0 and β1 will be unbiased, consistent (as n → ∞, the estimates converge to the true values) and most-efficient. The same is true of the standard errors and t-values. • If certain assumptions are met
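For simple linear regression the OLS estimates have a closed form; a minimal sketch checking lm() against it, with illustrative variable names x and y:

b1 = cov(x, y) / var(x)        # slope: covariance of x and y over variance of x
b0 = mean(y) - b1 * mean(x)    # intercept: the fitted line passes through the means
coef(lm(y ~ x))                # should match b0 and b1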
Assumptions • Assumptions about linearity • Assumptions about the residuals • “the error”
For linear regression • The relationship between Y and X in the model is assumed linear. • It’s the simplest relationship! • Which may involve transforming the variables
Common transformations

plot(y ~ x)      # plot y against x
x2 = log10(x)    # log-transform x
plot(y ~ x2)     # re-plot using the transformed x
Common transformations

plot(y ~ x)                     # plot y against x
x2 = log10(x); y2 = log10(y)    # log-transform both variables
plot(y2 ~ x2)                   # re-plot using both transformed variables
Common transformations

plot(y ~ x)     # plot y against x
x2 = x^2        # square x
plot(y ~ x2)    # re-plot using the squared x
Note • Graphical methods are very useful for "exploring" data and spotting potential problems. • It is safer to transform the variables, create new variables (objects) in R and then fit the model to them than to try a 'one shot' process such as: model1 = lm(y ~ log10(x)) • This has potential for error
Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback
What are residuals? • O – E (observed minus expected) • y – ŷ • y – (β0 + β1x)
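A minimal sketch confirming that definition in R, assuming a fitted model model1 and its response variable y (names illustrative):

residuals(model1)     # R's residuals
y - fitted(model1)    # the same thing: observed minus predicted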
Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback
Normal distribution?

model1 = lm(ELA ~ PRECIP, data=glacier_data)
model1.resids = residuals(model1)    # extract the residuals
hist(model1.resids)                  # should look roughly bell-shaped
Normal distribution?

xx = seq(from=-150, to=150, by=1)    # range of residual values to evaluate over
mean.r = mean(model1.resids)
sd.r = sd(model1.resids)
bin = 50                             # the histogram's bin width
n = nrow(glacier_data)
yy = dnorm(xx, mean.r, sd.r) * bin * n    # scale the Normal density to the histogram's counts
points(xx, yy, type="l", col="red")       # overlay the curve on the histogram
Normal distribution?

qqnorm(model1.resids)    # quantile-quantile plot: points should fall on a straight line if Normal
qqline(model1.resids)    # add the reference line
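As a numerical complement to these plots (not covered in the slides themselves), base R also offers the Shapiro–Wilk test of Normality; a large p value is consistent with Normally distributed residuals:

shapiro.test(model1.resids)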
Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback
Outliers • May be extreme residuals, leverage points or both

plot(ELA ~ PRECIP, data=glacier_data)
identify(glacier_data$PRECIP, glacier_data$ELA)    # click on points in the plot to label them

• Extreme residuals • Perhaps 7 and 8? • Leverage points • Perhaps 5, 14, 8, 9?
Watch out for outliers • Without an outlier • With an outlier (a leverage point) (comparison scatterplots not reproduced)
To help identify

plot(model1, which=1)    # residuals vs fitted values
plot(model1, which=2)    # Normal Q-Q plot of the residuals
To help identify

plot(model1, which=5)    # standardized residuals against leverage, with Cook's distance contours

• 5 and 14 appear on all the plots • But their studentized residual value is not especially high • Rule of thumb: worry more if above +2 or below -2 • And nor is their "Cook's distance" • Rule of thumb: worry more if they approach the 1 contour • What to do if they were more problematic?
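A minimal sketch of applying those rules of thumb numerically rather than by eye, using base R's influence measures:

rstudent(model1)                    # studentized residuals: worry above +2 or below -2
cooks.distance(model1)              # Cook's distance: worry as values approach 1
which(abs(rstudent(model1)) > 2)    # observation numbers breaching the rule of thumb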
Consolidation • Regression assumes the X leads to the Y, not the other way around • Linear regression is for when there is a straight-line relationship between the X and Y variables, or they can be transformed for it to be so • The line of best fit cannot be genuinely so if it is distorted by the influence of outliers or if the relationship isn't linear • The regression residuals should be Normally distributed • We use various visual tools to check for linearity, Normality and the presence of extreme residuals / leverage points