Linear Regression CSC 576: Data Mining
Today… • Linear Regression
Advertising Dataset https://www.kaggle.com/sazid28/advertising.csv import pandas as pd advertising = pd.read_csv('../datasets/Advertising.csv') advertising.head(5)
Advertising Dataset • Scatter plot visualization for TV and Sales. %matplotlib inline advertising.plot.scatter(x='TV', y='Sales');
Advertising Dataset • Simple Linear Model in Python (using pandas and scikit-learn): • Predictor: x (TV) • Response: y (Sales)
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(advertising['TV'].values.reshape(-1, 1), advertising['Sales'].values.reshape(-1, 1))
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
Coefficients: [[ 0.04753664]] Intercept: [ 7.03259355]
Sales = 7.03259 + 0.04754 * TV
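Once fit, the model can generate predictions for new budgets; a minimal sketch (the 100 and 200 TV budgets are illustrative values, not from the slides):
import numpy as np
new_budgets = np.array([[100.0], [200.0]])  # hypothetical TV budgets
print(reg.predict(new_budgets))  # roughly 7.03 + 0.0475 * budget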
Assessing the Accuracy of the Model • Trying to quantify the extent to which the model fits the data • Typically assessed with: • Residual standard error (RSE) • R2 statistic • Different from measuring how well the model's predictions do on a test set, which is typically assessed with Root Mean Squared Error (RMSE)
Residual Standard Error (RSE) • RSE is roughly the average amount that the response will deviate from the true regression line • (we can never perfectly predict Y from X because of the error term ε)
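The standard definition, for a simple linear regression fit to n observations:
RSE = √( RSS / (n − 2) ), where RSS = Σi (yi − ŷi)2 is the residual sum of squares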
Advertising Dataset • RSE = 3.26 • Actual sales in each market deviate from the true regression line by approximately 3.26 units, on average. • Is this error amount acceptable? • Business answer: depends on problem context • Worth noting the percentage error: RSE relative to the mean response, 3.26 / 14.02 ≈ 23%
Concluding Thoughts on RSE • RSE measures the “lack of fit” that a model may have. • Measured in the units of Y • Not always clear what constitutes a good RSE
R2 Statistic • Proportion of variance explained • Always a value between 0 and 1 • Independent of the scale of Y (unlike RSE)
R2 Statistic • TSS: total variance in the response Y • Amount of variability inherent in the response, before the regression is performed • RSS: amount of variability that is left unexplained after performing the regression • TSS-RSS : the amount of variability that is explained
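In symbols, with ȳ the mean response and ŷi the ith fitted value:
TSS = Σi (yi − ȳ)2    RSS = Σi (yi − ŷi)2    R2 = (TSS − RSS) / TSS = 1 − RSS / TSS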
Advertising Dataset • R2 = 0.61 • Just under two-thirds of the variability in sales is explained by a linear regression on TV.
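scikit-learn reports R2 via the score method of a fitted LinearRegression; a quick check on the TV-only model from earlier:
X = advertising['TV'].values.reshape(-1, 1)
y = advertising['Sales'].values.reshape(-1, 1)
print('R2:', reg.score(X, y))  # ~0.61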
Q: What is a good R2 value? • A: Depends on the application. • (R2 has an interpretational advantage over RSE, since it does not depend on the units of Y.) • Example: in a problem from physics where a linear relationship is known to exist, we can expect a high R2 value • Example: in other domains where the linear model is only a rough approximation, a much lower R2 may still be informative
R2 Statistic vs. Correlation • Correlation is also a measure of the linear relationship between X and Y. • For simple linear regression (one predictor): R2 = r2 • But correlation only quantifies the association between a single pair of variables. • Next: multiple linear regression (more than one predictor), where pairwise correlation no longer suffices and R2 fills this role
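The identity R2 = r2 is easy to verify numerically on the TV-only model; a quick sketch with numpy:
import numpy as np
r = np.corrcoef(advertising['TV'], advertising['Sales'])[0, 1]
print(r ** 2)  # matches the TV-only model's R2, ~0.61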
Multiple Linear Regression • In practice, we often have more than one predictor • We could run three separate simple linear regressions for the Advertising dataset • But: • It is unclear how to make a single prediction of sales given all three predictor values • Each regression equation ignores the other two media • BAD! The media budgets may be correlated with each other
Multiple Linear Regression Model • Extends the simple linear regression model with a term for each predictor • Response variable Y is numeric (continuous) • For p predictor variables: Y = β0 + β1X1 + β2X2 + … + βpXp + ε • Since the error ε has mean zero (variance σ2, normally distributed), we usually omit it when writing the predicted mean response. • A one-unit change in a predictor Xj changes the expected mean response by βj units, holding all other predictors fixed.
Estimating the Parameters β0, β1, β2, … • Parameters (regression coefficients) are typically estimated through the method of least squares • Just like with simple linear regression • Automatic in R and Python (data mining toolkits) • We want to minimize the RSS (see below)
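Concretely, the least squares estimates are the coefficient values that minimize the residual sum of squares of the fitted values ŷi:
RSS = Σi (yi − ŷi)2 = Σi (yi − β0 − β1xi1 − β2xi2 − … − βpxip)2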
Advertising Dataset Sales = 2.938889 + 0.045765 * TV + 0.188530 * radio − 0.001037 * newspaper
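These coefficients can be reproduced with scikit-learn; a minimal sketch (assuming the CSV capitalizes the column names as TV, Radio, Newspaper, Sales):
X = advertising[['TV', 'Radio', 'Newspaper']]
y = advertising['Sales']
mlr = linear_model.LinearRegression()
mlr.fit(X, y)
print('Coefficients:', mlr.coef_)    # ~[ 0.0458, 0.1885, -0.0010]
print('Intercept:', mlr.intercept_)  # ~2.9389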
Simple and Multiple Linear Regression Coefficients can be Quite Different
Separate simple models: TV Model: [[ 0.04753664]] [ 7.03259355] Radio Model: [[ 0.20249578]] [ 9.3116381] Newspaper Model: [[ 0.0546931]] [ 12.35140707] • In the simple model, the newspaper slope represents the average effect of a $1,000 increase in newspaper advertising, ignoring the other predictors (TV and radio).
Multiple model: Coefficients: [[ 0.04576465 0.18853002 -0.00103749]] Intercept: [ 2.93888937] • In the multiple model, the newspaper coefficient represents the average effect of increasing newspaper spending by $1,000 while holding TV and radio fixed.
Correlation Matrix • Correlation between radio and newspaper is 0.35 • Barely any correlation (or “not correlated”) for TV/radio and TV/newspaper • Reveals tendency to spend more on Newspaper advertising in markets where more is spent on Radio advertising. • Sales higher in markets where more is spent on Radio, but more also tends to be spent on Newspaper. • In Simple LM: Newspaper “gets credit” for effect of Radio on Sales.
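The matrix itself can be computed directly from the DataFrame; a minimal sketch (column names as in the CSV, which capitalizes Radio and Newspaper):
print(advertising[['TV', 'Radio', 'Newspaper', 'Sales']].corr())
# Radio/Newspaper correlation ~0.35; TV/Radio and TV/Newspaper near zero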
Qualitative Predictors • So far we have assumed that all variables in the linear regression model are quantitative. • How do we deal with qualitative variables?
Credit Dataset • Response: • Balance (individual’s average credit card debt) • Quantitative Predictors: • Age (years) • Cards (number of credit cards) • Education (years of education) • Income (in thousands of dollars) • Limit (credit limit) • Rating (credit rating) • Qualitative Predictors: • Gender {Male, Female} • Student {Yes, No} • Married {Yes, No} • Ethnicity {Caucasian, African American, Asian}
Qualitative Predictors: Two Levels • Levels (sometimes called factors): the possible values of a discrete variable • Solution: create a dummy variable (or indicator) that takes on two possible numerical values • Credit dataset, Gender variable: {Male, Female} • Create a new dummy variable: xi = 1 if the ith person is female, xi = 0 if the ith person is male
Qualitative Predictors: Two Levels … for now assuming that Gender is the only predictor in the model … • Simple linear regression model with the dummy variable: yi = β0 + β1xi + εi • This gives β0 + β1 + εi for females and β0 + εi for males (the β1xi term zeros out for males) • Estimate coefficients β0, β1
Qualitative Predictors: Two Levels Balance = 509.80 + 19.73 * xi • Interpretation: • β0: average credit card balance among males • β0 + β1: average credit card balance among females • β1: average difference in credit card balance between females and males • Average credit card debt for males is estimated to be $509.80. • Females are estimated to carry $19.73 in additional debt, for a total of $509.80 + $19.73 = $529.53.
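A minimal sketch of fitting this dummy-variable model with scikit-learn, assuming the Credit dataset sits alongside the Advertising one (the path and the exact Gender labels are assumptions; ISLR's CSV pads the Male label with a space, hence the strip):
credit = pd.read_csv('../datasets/Credit.csv')  # assumed path
credit['female'] = (credit['Gender'].str.strip() == 'Female').astype(int)  # dummy: 1 = female, 0 = male
reg_gender = linear_model.LinearRegression()
reg_gender.fit(credit[['female']], credit['Balance'])
print(reg_gender.intercept_, reg_gender.coef_)  # ~509.80 and ~19.73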
Qualitative Predictors: Two Levels • Decision to code females as 1 and males as 0 is arbitrary. • It does alter the interpretation of the coefficients • What would happen if we coded males as 1 and females as 0?
Qualitative Predictors: Two Levels Balance = 529.54 − 19.73 * xi • Interpretation: • β0: average credit card balance among females • β0 + β1: average credit card balance among males • β1: average difference in credit card balance between males and females • Average credit card debt for females is estimated to be $529.54. • Males are estimated to carry $19.73 less debt, for a total of $529.54 − $19.73 = $509.80. • Same exact model!
Qualitative Predictors: Two Levels • A third option: code females as +1 and males as −1 Balance = 519.67 + 9.865 * xi • Interpretation: • β0: overall average credit card balance (ignoring gender) • β1: amount that females are above the average, and males are below the average • Average credit card debt, ignoring gender, is $519.67. • The average difference between males and females is 2 × $9.865 = $19.73. • Same exact model! • It doesn't matter which coding scheme is used, as long as the coefficients are correctly interpreted.
Qualitative Predictors: More than Two Levels • A single dummy variable cannot represent all possible values of a qualitative predictor with more than two levels • Solution: create additional dummy variables • For the Ethnicity variable, create two dummies (see below) • Simple linear model, ignoring all other predictors • Always one fewer dummy variable than the number of levels.
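Using African American as the baseline level (the coding implied by the interpretation on the next slide):
xi1 = 1 if the ith person is Asian, 0 otherwise
xi2 = 1 if the ith person is Caucasian, 0 otherwise
Model: yi = β0 + β1xi1 + β2xi2 + εi (both dummies are zero for African Americans)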
Qualitative Predictors: More than Two Levels • Interpretation: • β0: average credit card balance for African Americans • β1: difference in average balance between Asians and African Americans • β2: difference in average balance between Caucasians and African Americans Balance = 531.00 − 18.69 * xi1 − 12.50 * xi2 • Estimated balance for African Americans is $531.00 • The Asian category is estimated to carry $18.69 less debt than the African American category • The Caucasian category is estimated to carry $12.50 less debt than the African American category Once again, the coding scheme is arbitrary.
Qualitative Predictors: More than Two Levels • An alternative (also arbitrary) coding: one dummy variable per level, xi1 (African American), xi2 (Asian), xi3 (Caucasian) Balance = 520.60 + 10.38 * xi1 − 8.29 * xi2 − 2.11 * xi3 Coefficients: [[ 10.39626236 -8.29001215 -2.10625021]] Intercept: [ 520.60373764] • Estimated balance for African Americans: 520.60 + 10.38 ≈ $531.00 • The Asian category is still estimated to carry $18.69 less debt than the African American category • The Caucasian category is still estimated to carry $12.50 less debt than the African American category • Same exact model, different coefficients!
Multiple Quantitative and Qualitative Predictors • Not a problem • Use as many dummy variables as needed • pandas (pd.get_dummies) or scikit-learn's OneHotEncoder can create the dummy variables for the qualitative predictors
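A minimal sketch of preparing mixed quantitative and qualitative predictors with pandas, reusing the credit DataFrame from earlier (pd.get_dummies expands each qualitative column into dummies; drop_first keeps one fewer dummy than levels):
X = pd.get_dummies(credit[['Income', 'Student', 'Ethnicity']], drop_first=True)
mixed = linear_model.LinearRegression()
mixed.fit(X, credit['Balance'])
print(X.columns.tolist())  # e.g. ['Income', 'Student_Yes', 'Ethnicity_Asian', 'Ethnicity_Caucasian']
print(mixed.coef_, mixed.intercept_)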
In conclusion… • Pros of Linear Regression Model: • Provides nice interpretable results • Works well on many real-world problems • Cons of Linear Regression Model: • Assumes a linear relationship between response and predictors: • The change in the response Y due to a one-unit change in Xi is constant • Assumes an additive relationship: • The effect of changes in a predictor Xi on the response Y is independent of the values of the other predictors
Extensions of the Linear Model • Beyond the scope of this course… • Can remove the additive assumption by specifying interaction terms • Can remove the linear assumption using polynomial regression
References • Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al. • Data Science from Scratch, 1st edition, Grus • Data Mining and Business Analytics with R, 1st edition, Ledolter • An Introduction to Statistical Learning, 1st edition, James et al. • Discovering Knowledge in Data, 2nd edition, Larose et al. • Introduction to Data Mining, 1st edition, Tan et al.