Spatial Modelling 2: Multivariate statistics using R. Rich Harris
Quick test • Is the relationship positive or negative? • What is the r (Pearson correlation) value? • What would the gradient of the line be if r = 0?
4) What is the equation of the line of best fit shown?
5) What is the probability the gradient is actually zero?

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.319      1.436  -33.65   <2e-16 ***
X             2.947      0.049   60.15   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5 on 48 degrees of freedom
Multiple R-squared: 0.9869, Adjusted R-squared: 0.9866
F-statistic: 3618 on 1 and 48 DF, p-value: < 2.2e-16
TRUE or FALSE? • The standard error is a measure of uncertainty. • Holding all other things equal, as the number of observations (n) increases, the standard error decreases. • All else equal, as the variation around the line of best fit increases, the standard error decreases.
TRUE or FALSE? • The t-statistic = estimate of the gradient of the line of best fit / standard error of the estimate • If the p value is greater than 0.05 we can be 95% confident the relationship (the slope) is not due to chance.
Quick test • Is the relationship positive or negative? • Negative • What is the r (Pearson correlation) value? • -1 • What would the gradient of the line be if r = 0? • Zero (flat)
4) What is the equation of the line of best fit shown? y = 2.95x – 48.3
5) What is the probability the gradient is actually zero? Almost zero

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.319      1.436  -33.65   <2e-16 ***
X             2.947      0.049   60.15   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5 on 48 degrees of freedom
Multiple R-squared: 0.9869, Adjusted R-squared: 0.9866
F-statistic: 3618 on 1 and 48 DF, p-value: < 2.2e-16
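The fitted equation can also be read straight from the model object rather than from the printed summary. A minimal sketch, assuming the model shown above was stored as model1 (the object name is illustrative):

coef(model1)    # returns the intercept and slope
# (Intercept)           X
#     -48.319       2.947
# i.e. y = 2.947x - 48.319, matching the answer above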
TRUE or FALSE? • The standard error is a measure of uncertainty. • TRUE • Holding all other things equal, as the number of observations (n) increases, the standard error decreases. • TRUE • All else equal, as the variation around the line of best fit increases, the standard error decreases. • FALSE
TRUE or FALSE? • The t-statistic = estimate of the gradient of the line of best fit / standard error of the estimate • TRUE • If the p value is greater than 0.05 we can be 95% confident the relationship (the slope) is not due to chance. • FALSE
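As a quick check on the first TRUE above, the slope's t value from the earlier output can be recomputed by hand; the small discrepancy is just rounding in the printed table:

2.947 / 0.049    # estimate / standard error = 60.14 (printed as 60.15)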
Reading • Core reading • The chapter I sent around • Though you should have read it by now! • Additional reading • Chapters 6-7 of Allison, PD (1999) Multiple Regression: A Primer; or • Chapter 9 of Rogerson, PA (2006) Statistical Methods for Geography (2nd edition) (or chapter 7 in the 2001 edition)
Regression modelling • Need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression
Words to describe the relationship • Positive • Negative (example scatterplots not reproduced)
Words to describe the relationship • Linear • Non-linear (example scatterplots not reproduced)
For linear regression • The relationship between Y and X in the model is assumed linear. • It’s the simplest relationship!
Uses of regression • For verification • For exploration • For prediction • For explanation • For some sense of statistical "proof", but not for naive "data mining" • You need to think about the model! • It should have substantive as well as statistical meaning. • In terms of explanation, modelling Y as a function of X (Y ~ X) is not the same as modelling X as a function of Y (X ~ Y)
Regression vs correlation • Regression (y ~ x): CAUSAL; y = dependent variable, x = independent variable • Correlation (y, x): ASSOCIATION; no causality implied
Regression modelling • Need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression
General process to fit the model in R • Read in the data • Take “a look at it” numerically and visually • Plot Y against X • If appropriate to do so, fit the model • Add the line to the plot and obtain a summary of the results
General process to fit the model in R

mydata = read.table("my_file.dat", …)
summary(mydata)
names(mydata)
# assume there are two variables in mydata
# called response and predictor
plot(response ~ predictor, data=mydata)
# assume an apparent linear relationship
model1 = lm(response ~ predictor, data=mydata)
summary(model1)
abline(model1)
save.image("regression.RData")    # always save your work!
Interpreting the output • The equation of the fitted line is • ŷ = 1195.92 + 48.75 × TEMP • The t value is 1.107 • The probability of drawing that t value by chance alone is p = 0.2836 (28%) • Therefore, only "72% confident" of the relationship (1 − p) • Not significant at the 95% level
Interpreting the output • The equation of the fitted line is • ŷ = 2198 – 0.36 × PRECIP • The t value is −5.236 (the negative sign only indicates the direction of the relationship) • The probability of drawing that t value by chance alone is very small! • Therefore, (more than) "99.9% confident" of the relationship • Significant at the 95% level (and above)
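A minimal sketch of where such a p value comes from, using base R's t distribution function pt() with the t value and degrees of freedom from the output shown earlier (two-sided test):

2 * pt(-abs(60.15), df = 48)    # vanishingly small; summary() prints anything below 2e-16 as <2e-16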
The R-squared values • The R-squared value measures the proportion of the variance in the Y variable that is explained by the model • For simple linear regression, if you calculate the (Pearson) correlation between the X and Y variables then square it, you'll arrive at R2 • More generally, if you calculate the square of the correlation between the actual Y values and those predicted by the model, you'll obtain R2 • The adjusted R2 adjusts the R2 value for the number of explanatory/predictor variables (the number of Xs) in the model • The logic is that if you include more Xs you will explain more of the Y due to chance alone • Adjusted R2 is useful for multiple regression
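A minimal sketch verifying both claims in R, assuming a simple model model1 fitted from variables x and y (the names are illustrative):

cor(x, y)^2                  # squared Pearson correlation: R-squared for simple regression
cor(y, fitted(model1))^2     # squared correlation of actual vs predicted: the general case
summary(model1)$r.squared    # R's own value, for comparison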
And that’s pretty much it… • Except we need to think about • How to formally define the line • “the relationship” between Y and X • The strength of the relationship between Y and X • The nature of the relationship • How to fit a model in R • Other assumptions and extensions to simple linear regression
Regression assumptions and checks • The regression line is fitted by “ordinary least squares” • OLS regression • If certain assumptions are met the result will be BLUE • Best Linear Unbiased Estimator • The estimates of β0 and β1 will be unbiased, consistent (as n → ∞, the estimates converge to the true values) and most-efficient. The same is true of the standard errors and t-values. • If certain assumptions are met
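For simple linear regression the OLS estimates have a closed form; a minimal sketch checking lm() against it, with illustrative variable names x and y:

b1 = cov(x, y) / var(x)        # slope: covariance of x and y over variance of x
b0 = mean(y) - b1 * mean(x)    # intercept: the fitted line passes through the means
coef(lm(y ~ x))                # should match b0 and b1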
Assumptions • Assumptions about linearity • Assumptions about the residuals • “the error”
For linear regression • The relationship between Y and X in the model is assumed linear. • It’s the simplest relationship! • Which may involve transforming the variables
Common transformations

plot(y ~ x)      # plot y against x
x2 = log10(x)    # log-transform x
plot(y ~ x2)     # re-plot using the transformed x
Common transformations

plot(y ~ x)                     # plot y against x
x2 = log10(x); y2 = log10(y)    # log-transform both variables
plot(y2 ~ x2)                   # re-plot using both transformed variables
Common transformations

plot(y ~ x)     # plot y against x
x2 = x^2        # square x
plot(y ~ x2)    # re-plot using the squared x
Note • Graphical methods are very useful for "exploring" data and spotting potential problems. • It is safer to transform the variables, create new variables (objects) in R and then fit the model to them than to try a 'one shot' process such as: model1 = lm(y ~ log10(x)) • This has potential for error
Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback
What are residuals? • O – E (observed minus expected) • y – ŷ • y – (β0 + β1x)
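A minimal sketch confirming that definition in R, assuming a fitted model model1 and its response variable y (names illustrative):

residuals(model1)     # R's residuals
y - fitted(model1)    # the same thing: observed minus predicted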
Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback
Normal distribution?

model1 = lm(ELA ~ PRECIP, data=glacier_data)
model1.resids = residuals(model1)    # extract the residuals
hist(model1.resids)                  # should look roughly bell-shaped
Normal distribution?

xx = seq(from=-150, to=150, by=1)    # range of residual values to evaluate over
mean.r = mean(model1.resids)
sd.r = sd(model1.resids)
bin = 50                             # the histogram's bin width
n = nrow(glacier_data)
yy = dnorm(xx, mean.r, sd.r) * bin * n    # scale the Normal density to the histogram's counts
points(xx, yy, type="l", col="red")       # overlay the curve on the histogram
Normal distribution?

qqnorm(model1.resids)    # quantile-quantile plot: points should fall on a straight line if Normal
qqline(model1.resids)    # add the reference line
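As a numerical complement to these plots (not covered in the slides themselves), base R also offers the Shapiro–Wilk test of Normality; a large p value is consistent with Normally distributed residuals:

shapiro.test(model1.resids)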
Assumptions about the residuals • Normal (“bell shaped”) distribution • No outliers (extreme residuals or leverage points) • independent and identically distributed (i.i.d.) • ‘Homoscedastic’ distribution • ‘even’ spread of noise around line • No autocorrelation • If the errors are independent and unrelated then there cannot be spatial or temporal patterns • No Endogeneity • X is exogenous to (“causes”) Y; no feedback
Outliers • May be extreme residuals, leverage points or both

plot(ELA ~ PRECIP, data=glacier_data)
identify(glacier_data$PRECIP, glacier_data$ELA)    # click on points in the plot to label them

• Extreme residuals • Perhaps 7 and 8? • Leverage points • Perhaps 5, 14, 8, 9?
Watch out for outliers • Without an outlier • With an outlier (a leverage point) (comparison scatterplots not reproduced)
To help identify

plot(model1, which=1)    # residuals vs fitted values
plot(model1, which=2)    # Normal Q-Q plot of the residuals
To help identify

plot(model1, which=5)    # standardized residuals against leverage, with Cook's distance contours

• 5 and 14 appear on all the plots • But their studentized residual value is not especially high • Rule of thumb: worry more if above +2 or below -2 • And nor is their "Cook's distance" • Rule of thumb: worry more if they approach the 1 contour • What to do if they were more problematic?
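A minimal sketch of applying those rules of thumb numerically rather than by eye, using base R's influence measures:

rstudent(model1)                    # studentized residuals: worry above +2 or below -2
cooks.distance(model1)              # Cook's distance: worry as values approach 1
which(abs(rstudent(model1)) > 2)    # observation numbers breaching the rule of thumb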
Consolidation • Regression assumes the X leads to the Y, not the other way around • Linear regression is for when there is a straight-line relationship between the X and Y variables, or they can be transformed for it to be so • The line of best fit cannot be genuinely so if it is distorted by the influence of outliers or if the relationship isn't linear • The regression residuals should be Normally distributed • We use various visual tools to check for linearity, Normality and the presence of extreme residuals / leverage points