410 likes | 425 Views
This lecture introduces the concept of econometrics and predictive modeling using linear regression. It discusses how to analyze relationships between variables and predict outcomes. The lecture also covers simple linear regression, multiple regression, and provides examples using real-world datasets.
E N D
IS6146 Databases for Management Information SystemsLecture 8: Predictive statistics: an intro to econometrics and linear regression Rob Gleasure R.Gleasure@ucc.ie robgleasure.com
IS6146 • Today’s session • Simple linear regression • Multiple regression
Econometrics and predictive modelling • Sometimes we want to go beyond analysing general patterns and within-variable trends and look at whether one variable predicts another • Does the salary spent on salespeople predict sales revenue? • Does the amount of coffee consumed on a particular premises predict overtime? • Do secondary school results predict university grades? • Do changes in the euro, yen, or dollar predict bitcoin prices? • Econometrics basically refers to the use of statistics to test theoretical (abstract) relationships with real world estimators • Cause represented by predictor/independent variables • Effect represented by outcome/dependent variables
Econometrics and predictive modelling • Imagine we are designing a catering strategy for UCC and one of us presents the argument that catering should focus on students who live far away, as other students can eat at home • Hypothesis: students living further from UCC are more likely to eat meals there. • Imagine we go around the class with 2 questions • How many minutes does it take (roughly) for you to get to UCC? • How many meals/week do you eat in UCC? • Now imagine we sketch this out on a 2-dimensional graph
Econometrics and predictive modelling • Let’s go back to our dataset at http://corvus2vm.ucc.ie/phd/rgleasure//rgleasure/databases/tomslee_airbnb_boston_1429_2017-07-10.csv • Is there a relationship between bedrooms and the number that can be accommodated, i.e. does one predict the other? • Select both columns of data and go to INSERT Charts Scatter
Econometrics and predictive modelling • Are more southerly lets more expensive? • Let’s select latitude and price and INSERT Charts Scatter ?
Simple linear regression • Problems such as whether age (a continuous variable) is a reliable predictor of income (also a continuous variable) lend themselves towards a simple linear regression • Arguably the most fundamental statistical test • Basically graphs variables against each other and checks if a line can be drawn that fits the distribution of data
Simple linear regression • A simple linear regression basically graphs variables against each other and checks a line can be drawn that fits the scatterplot 65 60 55 50 45 Income 40 35 30 25 80 30 60 70 20 40 50 Age
Simple linear regression • Most tests adopt ‘ordinary least squares’ approach to judge a line • Formally referred to as Pearson’s r 65 60 55 50 45 Income 40 35 30 25 80 30 60 70 20 40 50 Age
Simple linear regression • The basic formula for a simple linear regression is • Predicted value = slope of line * value of predicting variable * Intercept * Error i.e. Y’ = bX + A + E • Overall, we tend to prioritise two things • Does the predictor variable reliably predict the outcome variable? • We often refer to this as the ‘p value’, which states how likely the relationship could have occurred by natural variation • How strong is the relationship (does a change in the predictor cause a big or small change in the outcome)? • We often refer to this as the ‘beta value’, which states how much of the variation in the outcome variable is explained by variation in the predictor
Simple linear regression • Go to Data Data Analysis Regression and select the outcome variable cells under Y and the predictor variable cells under X
Simple linear regression • We should get the following results Amount of variance explained (0 being none, 1 being 100%) Probability pattern occurred by fluke
Multiple regression • Sometimes we want to run multiple predictors simultaneously • i.e. Y’ = b1X1+ b2X2 + … + bnXn+ A + E • Line these up in adjacent cells and select them all for X Range
Multiple regression • We should get the following results Coefficient for each individual predictor Probability for each individual predictor
OK… Break for 5?
Regression in R • Let’s have a look at a scatterplot of price and latitude plot(my_data$latitude, my_data$price)
Regression in R • Let’s have a look at a scatterplot of price and longitude plot(my_data$longitude, my_data$price)
Regression in R • And a similar plot of room type and latitude plot(my_data$room_type,my_data$latitude)
Regression in R • And level of education and age plot(my_data$room_type,my_data$longitude)
Running the model • To do this, we create a new variable to store the results of the model our_model = lm(my_data$price ~ my_data$latitude + my_data$longitude) • The show a summary of that variable summary(our_model)
Regression in R • We can also add the line using the abline() function if we wish to see it on the scatterplot plot(my_data$latitude,my_data$price) simple_model = lm(my_data$price ~ my_data$latitude) abline(simple_model, col="red")
Assumptions made by regression • A regression will break down if • The relationship isn’t linear, .e.g. it plateaus or changes direction after some point • The variance around the regression line is uneven • The predictions errors are not normally distributed, e.g. the ‘residuals’ fit well at one end of the line and not the other • The predictor variables are closely correlated themselves (only for multiple regression)
Checking assumptions in R (cont) • R has built in diagnostics, which we can cycle through by entering the following code and hitting enter to cycle through plots plot(our_model) • The first one looks for abnormally distributed errors in the residuals (we want a flat line)
Checking assumptions in R (cont) • The second one also looks for abnormally distributed errors in the standardised residuals (we want them to fit on the diagonal) • It can also help to spot weird patterns in variance, e.g. if one side falls away more than the other
Checking assumptions in R (cont) • The third and fourth ones also looks for abnormally distributed standardised residuals (again, we want a neat line) • Where data don’t meet the assumptions of a regression • Transformed tests, e.g. logarithmic transformations • Ranked tests, e.g. Spearman’s rho
What if the assumptions don’t hold? • A logistic/logit regression may be an option, which runs a similar regression formula against a binomial outcome, e.g. True or False, 0 or 1, etc. • Imagine we want to check whether the number of reviews is great than one - we could create a new binomial variable (either 0 or 1) my_data$has_reviews= ifelse(my_data$reviews>0,1,0) • The rerun the test with family set to ‘binomial’ logit_model = lm(my_data$has_reviews ~ my_data$price, family = "binomial") summary(logit_model)
What if the assumptions don’t hold? • Not very interesting • Nor do the longitude and latitude seem very informative. Oh well. logit_model2 = lm(my_data$has_reviews ~ my_data$latitude + my_data$longitude, family = "binomial") summary(logit_model2)
What if the assumptions don’t hold? • Back to pricing - let’s create e new variable to see if a price is above the median price median_price = median(my_data$price) my_data$isExpensive = ifelse(my_data$price>median_price,1,0) • A little bit more encouraging logit_model3 = lm(my_data$isExpensive ~ my_data$latitude + my_data$longitude, family = "binomial") summary(logit_model3)
What if the assumptions don’t hold? • Let’s get a little more detailed • It looks like the problem stems from the shape of our data hist(my_data$price, breaks = 200) plot(my_data$price,my_data$latitude) plot(my_data$price,my_data$longitude)
What if the assumptions don’t hold? • We can transform our data to stretch it back into a more easily analysed shape • Note – this won’t create any new false correlations, though it may make them harder to interpret • A log10 transform may fix a negative skew (a square root transform fixes a positive skew) my_data$logPrice = log10(my_data$price) • We’ll also need to fix any instances where original value was zero, as this will break the transformation my_data$logPrice = ifelse(my_data$logPrice=="-Inf",0,my_data$logPrice)
What if the assumptions don’t hold? • Let’s have a look at the transformed value hist(my_data$logPrice, breaks = 200) plot(my_data$logPrice,my_data$latitude) plot(my_data$logPrice,my_data$longitude)
What if the assumptions don’t hold? • Now let’s run our tests and check the assumptions log_transformed__model = lm(my_data$logPrice ~ my_data$latitude + my_data$longitude) summary(log_transformed__model) plot(log_transformed__model)
Want to read more? • http://www.r-tutor.com • https://www.r-bloggers.com/ • https://www.ablebits.com/office-addins-blog/2016/05/11/make-histogram-excel/ • Nice discussion of regressions • http://www.biddle.com/documents/bcg_comp_chapter4.pdf
Regression by hand Image from https://www.pinterest.ie/pin/497295983830246961/?lp=true
Regression by hand • Don’t worry if this feels intimidating – remember, you’ll never be asked to do it – it’s purely for your own sense of deep understanding • To keep this simple, let’s walk through 7 rows of fictional data
Regression by hand • As mentioned already, the basic formula for a regression is • Predicted value = slope of line * value of predicting variable * Intercept * Error i.e. Y’ = bX + A + E • To work out the slope (b), we need • The mean of X and Y (MX and My) • The standard deviation of X and Y (SXand Sy) • The correlation between X and Y (r)
Regression by hand • So how do we work out which line is the best fit? • First, we compute the mean for X (45) and subtract this from all values of X. The new variable is called x (the deviation scores for X). We do the same for Y (mean = 43) to create a new variable y.
Regression by hand • We then calculate the sum of deviation for each x and y individually
Regression by hand • The correlation (r) is then calculated as • In other words The total sum of xy values √ (the total sum of x values * the total sum of y values) • In our case 804 / √(2485*561) • Which equals 804 / √1394085 • Which equals 804 / 1180.7 • Hence, r = 0.68
Regression by hand • Now that we have the correlation, we can now calculate the slope of the line (b) based on the standard deviation b = r sY/sX • In other words b = correlation * standard deviation of Y / standard deviation of X • In our case b = 0.68 * 6.9 / 22.18 = 0.21
Regression by hand • The intercept (A) can be calculated from the means as A = MY– bMX • In our case A = 92.9 - 0.68 * 44.4 = 62.7 • We can now test to see if this is significant using the following formula where N is the number of value pairs • In our case t = 0.68 * √ (7-2) / √1-(0.68)2 • Equals 0.68 * √5 / √1-0.46 • Equals 0.68 * 2.23 / √0.53 • Equals 1.52 / 0.728 = 2.08 This value has to be mapped to a separate table to be translated into % confidence