Understanding Linear Regression for Data Analysis

Explore the theory, methods, and applications of linear regression in data science and big data analytics, including use cases, model descriptions, and practical examples using R programming. Learn to analyze relationships between variables and make predictions based on statistical insights.

dixons

Presentation Transcript


  1. Data Science and Big Data Analytics Chap 6: Adv Analytical Theory and Methods: Regression Charles Tappert Seidenberg School of CSIS, Pace University

  2. Chapter Sections • 6.1 Linear Regression • 6.2 Logistic Regression • 6.3 Reasons to Choose and Cautions • 6.4 Additional Regression Models • Summary

  3. 6 Regression • Regression analysis attempts to explain the influence that input (independent) variables have on the outcome (dependent) variable • Questions regression might answer • What is a person’s expected income? • What is the probability that an applicant will default on a loan? • Regression can find the input variables having the greatest statistical influence on the outcome • One can then try to produce better values of those input variables • E.g., if 10-year-old reading level predicts students’ later success, then try to improve early-age reading levels

  4. 6.1 Linear Regression • Models the relationship between several input variables and a continuous outcome variable • Assumption is that the relationship is linear • Various transformations can be used to achieve a linear relationship • Linear regression models are probabilistic • Involves randomness and uncertainty • Not deterministic like Ohm’s Law (V=IR)

  5. 6.1.1 Use Cases • Real estate example • Predict residential home prices • Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes • Demand forecasting example • Restaurant predicts quantity of food needed • Possible inputs – weather, day of week, etc. • Medical example • Analyze effect of proposed radiation treatment • Possible inputs – radiation treatment duration, frequency

  6. 6.1.2 Model Description

  7. 6.1.2 Model Description: Example • Predict person’s annual income as a function of age and education • Ordinary Least Squares (OLS) is a common technique to estimate the parameters

  8. 6.1.2 Model Description: Example OLS
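
As a minimal sketch of what OLS computes, the R code below estimates the parameters from the normal equations and compares them with lm(). The data are simulated: the variable names, sample size, and coefficient values are illustrative assumptions, not the book's income data.

```r
# Simulated data (assumption): income depends linearly on age and education
set.seed(1)
n <- 200
age <- runif(n, 18, 70)
education <- runif(n, 10, 20)
income <- 5 + 1.0 * age + 1.8 * education + rnorm(n, sd = 10)

# OLS picks the coefficients that minimize the sum of squared residuals.
# Closed form: solve (X'X) b = X'y for b
X <- cbind(Intercept = 1, age, education)
beta_hat <- solve(crossprod(X), crossprod(X, income))

# lm() returns the same estimates
fit <- lm(income ~ age + education)
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit)))
```

In practice lm() is preferable because it uses a numerically more stable QR decomposition; the explicit solve() is shown only to make the least-squares idea concrete.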

  9. 6.1.2 Model Description: Example

  10. 6.1.2 Model Description: With Normally Distributed Errors • Making additional assumptions on the error term provides further capabilities • It is common to assume the error term is a normally distributed random variable • Mean zero and constant variance • That is

  11. 6.1.2 Model Description: With Normally Distributed Errors • With this assumption, the expected value is • And the variance is

  12. 6.1.2 Model Description: With Normally Distributed Errors • Normality assumption with one input variable • E.g., for x=8, E(y)~20 but varies 15-25

  13. 6.1.2 Model Description: Example in R • Be sure to get publisher's R downloads: http://www.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html > income_input = as.data.frame(read.csv("c:/data/income.csv")) > income_input[1:10,] > summary(income_input) > library(lattice) > splom(~income_input[c(2:5)], groups=NULL, data=income_input, axis.line.tck=0, axis.text.alpha=0)

  14. 6.1.2 Model Description: Example in R, Scatterplot • Examine bottom line • income~age: strong + trend • income~educ: slight + trend • income~gender: no trend

  15. 6.1.2 Model Description: Example in R

  16. 6.1.2 Model Description: Example in R • Quantify the linear relationship trends > results <- lm(Income~Age+Education+Gender,income_input) > summary(results) • Intercept: income of $7263 for newborn female • Age coef: ~1, one-year age increase -> $1k income increase • Education coef: ~1.76, one-year education increase -> $1.76k income increase • Gender coef: ~-0.93, male income decreases $930 • Residuals – assumed to be normally distributed – vary from -37 to +37 (more information coming)

  17. 6.1.2 Model Description: Example in R • Examine residuals – uncertainty or sampling error • Small p-values indicate statistically significant results • Age and Education highly significant, p<2e-16 • Gender p=0.13 large, not significant at 90% confidence level • Therefore, drop variable Gender from linear model > results2 <- lm(Income~Age+Education,income_input) > summary(results2) # results about same as before • Residual standard error: residual standard deviation • R-squared (R2): proportion of variation in the data explained by the model • Here ~64% (R2 = 1 means model explains data perfectly) • F-statistic: tests entire model – here p value is small
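
The quantities discussed above can also be pulled out of the fitted objects programmatically. The sketch below uses simulated data as a stand-in for income.csv (the data frame income_sim and the coefficients used to generate it are assumptions), with Gender generated to have no true effect on Income.

```r
# Simulated stand-in for the income data (assumption, not the book's file)
set.seed(42)
n <- 1500
income_sim <- data.frame(
  Age = round(runif(n, 18, 70)),
  Education = round(runif(n, 10, 20)),
  Gender = rbinom(n, 1, 0.5)
)
income_sim$Income <- 7 + 1.0 * income_sim$Age + 1.76 * income_sim$Education +
  rnorm(n, sd = 12)

full <- lm(Income ~ Age + Education + Gender, data = income_sim)
reduced <- lm(Income ~ Age + Education, data = income_sim)

# Coefficient table: estimate, standard error, t value, p-value
print(coef(summary(full)))

# R-squared barely changes when the uninformative variable is dropped
r2 <- c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
print(r2)
```

Comparing R-squared before and after dropping a variable is one quick way to confirm that the simpler model loses little explanatory power.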

  18. 6.1.2 Model Description: Categorical Variables • In the example in R, Gender is a binary variable • Variables like Gender are categorical variables in contrast to numeric variables where numeric differences are meaningful • The book section discusses how income by state could be implemented
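
A rough sketch of how a multi-level categorical input such as state might be handled in R: declaring it as a factor makes lm() expand it into indicator variables measured against a reference level. The three states, their effects, and all other values below are made up for illustration.

```r
# Simulated data with a three-level categorical input (all values are assumptions)
set.seed(7)
n <- 300
state <- factor(sample(c("CA", "NY", "TX"), n, replace = TRUE))
age <- runif(n, 18, 70)
state_effect <- c(CA = 0, NY = 5, TX = -3)
income <- 10 + 1.0 * age + unname(state_effect[as.character(state)]) +
  rnorm(n, sd = 5)

# A factor is expanded into indicator (dummy) columns;
# the first level (CA) is the reference and gets no column of its own
design <- model.matrix(~ age + state)
print(head(design))

fit <- lm(income ~ age + state)
print(coef(fit))  # stateNY and stateTX are differences relative to CA
```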

  19. 6.1.2 Model Description: Confidence Intervals on the Parameters • Once an acceptable linear model is developed, it is often useful to draw some inferences • R provides confidence intervals using confint() function > confint(results2, level = .95) • For example, the Education coefficient was 1.76, and now the corresponding 95% confidence interval is (1.53, 1.99) which provides the amount of uncertainty in the estimate

  20. 6.1.2 Model Description: Confidence Interval on Expected Outcome • In the income example, the regression line provides the expected income for a given Age and Education • Using the predict() function in R, a confidence interval on the expected outcome can be obtained > Age <- 41 > Education <- 12 > new_pt <- data.frame(Age, Education) > conf_int_pt <- predict(results2,new_pt,level=.95, interval="confidence") > conf_int_pt • Expected income = $68699, conf interval ($67831,$69567)

  21. 6.1.2 Model Description: Prediction Interval on a Particular Outcome • The predict() function in R also provides upper/lower bounds on a particular outcome, prediction intervals > pred_int_pt <- predict(results2,new_pt,level=.95, interval="prediction") > pred_int_pt • Expected income = $68699, pred interval ($44988,$92409) • This is a much wider interval because the confidence interval applies to the expected outcome that falls on the regression line, but the prediction interval applies to an outcome that may appear anywhere within the normal distribution
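
To see the difference between the two interval types without the book's data file, the following sketch fits the same kind of model to simulated data (the data frame sim and its generating coefficients are assumptions) and compares interval widths at one point.

```r
# Simulated stand-in for the income data (assumption)
set.seed(3)
n <- 500
sim <- data.frame(Age = runif(n, 18, 70), Education = runif(n, 10, 20))
sim$Income <- 7 + 1.0 * sim$Age + 1.76 * sim$Education + rnorm(n, sd = 12)
fit <- lm(Income ~ Age + Education, data = sim)

new_pt <- data.frame(Age = 41, Education = 12)
conf_int <- predict(fit, new_pt, level = 0.95, interval = "confidence")
pred_int <- predict(fit, new_pt, level = 0.95, interval = "prediction")

# Both intervals share the same center (the fitted value) but differ in width
widths <- c(confidence = unname(conf_int[, "upr"] - conf_int[, "lwr"]),
            prediction = unname(pred_int[, "upr"] - pred_int[, "lwr"]))
print(conf_int)
print(pred_int)
print(widths)
```

The prediction interval adds the variance of an individual error term to the uncertainty of the estimated mean, which is why it is always the wider of the two.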

  22. 6.1.3 Diagnostics: Evaluating the Linearity Assumption • A major assumption in linear regression modeling is that the relationship between the input and output variables is linear • The most fundamental way to evaluate this is to plot the outcome variable against each input variable • In the following figure a linear model would not apply • In such cases, a transformation might allow a linear model to apply Class of dataset Groceries is transactions, containing 3 slots • transactionInfo # data frame with vectors having length of transactions • itemInfo # data frame storing item labels • data # binary evidence matrix of labels in transactions > Groceries@itemInfo[1:10,] > apply(Groceries@data[,10:20],2,function(r) paste(Groceries@itemInfo[r,"labels"],collapse=", "))

  23. 6.1.3 Diagnostics: Evaluating the Linearity Assumption • Income as a quadratic function of Age
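
One way to model income as a quadratic function of age, sketched here on simulated data, is to add a squared term to the formula. The model remains linear in its parameters, so lm() still applies. The curve and noise level below are assumptions chosen only to produce a visible peak.

```r
# Simulated income that rises and then falls with age (assumption)
set.seed(11)
n <- 400
age <- runif(n, 18, 70)
income <- -20 + 4 * age - 0.04 * age^2 + rnorm(n, sd = 5)

linear_fit <- lm(income ~ age)
quad_fit <- lm(income ~ age + I(age^2))  # I() keeps ^ as arithmetic inside a formula

r2 <- c(linear = summary(linear_fit)$r.squared,
        quadratic = summary(quad_fit)$r.squared)
print(r2)
print(coef(quad_fit))
```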

  24. 6.1.3 Diagnostics: Evaluating the Residuals • The error term was assumed to be normally distributed with zero mean and constant variance > with(results2,{plot(fitted.values,residuals,ylim=c(-40,40)) })

  25. 6.1.3 Diagnostics: Evaluating the Residuals • Next four figs don’t fit zero mean, const variance assumption • Nonlinear trend in residuals • Residuals not centered on zero

  26. 6.1.3 Diagnostics: Evaluating the Residuals • Residuals not centered on zero • Variance not constant

  27. 6.1.3 Diagnostics: Evaluating the Normality Assumption • The normality assumption still has to be validated > hist(results2$residuals) • Residuals centered on zero and appear normally distributed

  28. 6.1.3 Diagnostics: Evaluating the Normality Assumption • Another option is to examine a Q-Q plot comparing observed data against quantiles (Q) of assumed distribution > qqnorm(results2$residuals) > qqline(results2$residuals)
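
The Q-Q comparison can also be summarized numerically. The sketch below, on simulated data, contrasts residuals from a model with normal errors against residuals from a model with skewed errors. The helper qq_cor() is a hypothetical name introduced here, and the Shapiro-Wilk test is an addition that does not appear in the slides.

```r
# Two simulated outcomes: one with normal errors, one with skewed errors (assumptions)
set.seed(5)
n <- 300
x <- runif(n, 0, 10)
y_normal <- 2 + 3 * x + rnorm(n)
y_skewed <- 2 + 3 * x + (rexp(n) - 1)  # mean-zero but right-skewed errors

res_normal <- residuals(lm(y_normal ~ x))
res_skewed <- residuals(lm(y_skewed ~ x))

# Correlation between sample and theoretical quantiles:
# close to 1 when the points hug the Q-Q line
qq_cor <- function(r) {
  q <- qqnorm(r, plot.it = FALSE)
  cor(q$x, q$y)
}
print(c(normal = qq_cor(res_normal), skewed = qq_cor(res_skewed)))

# Shapiro-Wilk test: a small p-value is evidence against normality
print(shapiro.test(res_skewed)$p.value)
```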

  29. 6.1.3 Diagnostics: Evaluating the Normality Assumption • Normally distributed residuals • Non-normally distributed residuals

  30. 6.1.3 Diagnostics: N-Fold Cross-Validation • To prevent overfitting, a common practice splits the dataset into training and test sets, develops the model on the training set and evaluates it on the test set • If the quantity of the dataset is insufficient for this, an N-fold cross-validation technique can be used • Dataset randomly split into N datasets of approximately equal size • Model trained on N-1 of the sets, tested on remaining one • Process repeated N times • Average the N model errors over the N folds • Note: if N = size of dataset, this is leave-one-out procedure
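
The N-fold procedure above can be sketched in a few lines of base R. The data are simulated, and the choice of N = 5 and root-mean-square error as the model error are assumptions made for illustration.

```r
# Simulated stand-in for the income data (assumption)
set.seed(2)
n <- 200
sim <- data.frame(Age = runif(n, 18, 70), Education = runif(n, 10, 20))
sim$Income <- 7 + 1.0 * sim$Age + 1.76 * sim$Education + rnorm(n, sd = 12)

N <- 5
fold <- sample(rep(1:N, length.out = n))  # random fold label for every row

# Train on N-1 folds, test on the held-out fold, repeat for each fold
fold_rmse <- sapply(1:N, function(k) {
  train <- sim[fold != k, ]
  test <- sim[fold == k, ]
  fit <- lm(Income ~ Age + Education, data = train)
  sqrt(mean((test$Income - predict(fit, test))^2))
})

cv_rmse <- mean(fold_rmse)  # average the N model errors
print(fold_rmse)
print(cv_rmse)
```

Setting N equal to nrow(sim) in this sketch would give the leave-one-out procedure mentioned in the note.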

  31. 6.1.3 Diagnostics: Other Diagnostic Considerations • The model might be improved by including additional input variables • However, the adjusted R2 applies a penalty as the number of parameters increases • Residual plots should be examined for outliers • Points markedly different from the majority of points • They result from bad data, data processing errors, or actual rare occurrences • Finally, the magnitude and signs of the estimated parameters should be examined to see if they make sense

  32. 6.2 Logistic Regression: Introduction • In linear regression modeling, the outcome variable is continuous – e.g., income ~ age and education • In logistic regression, the outcome variable is categorical, and this chapter section focuses on two-valued outcomes like true/false, pass/fail, or yes/no

  33. 6.2.1 Logistic Regression: Use Cases • Medical • Probability of a patient’s successful response to a specific medical treatment – input could include age, weight, etc. • Finance • Probability an applicant defaults on a loan • Marketing • Probability a wireless customer switches carriers (churns) • Engineering • Probability a mechanical part malfunctions or fails

  34. 6.2.2 Logistic Regression: Model Description • Logistic regression is based on the logistic function f(y) = e^y / (1 + e^y) • As y -> infinity, f(y)->1; and as y->-infinity, f(y)->0

  35. 6.2.2 Logistic Regression: Model Description • With the range of f(y) as (0,1), the logistic function models the probability of an outcome occurring • In contrast to linear regression, the values of y are not directly observed; only the values of f(y) in terms of success or failure are observed • With p = f(y), the quantity ln(p/(1-p)) = y is called the log odds ratio, or logit of p • Maximum Likelihood Estimation (MLE) is used to estimate model parameters; MLE is beyond the scope of this book
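
A small numeric illustration of the logistic function and its inverse, the logit, may help here. The function names logistic() and logit() are defined locally for this sketch; base R provides the same pair as plogis() and qlogis().

```r
# Logistic function: maps any real y to a probability in (0, 1)
logistic <- function(y) exp(y) / (1 + exp(y))
# Logit (log odds): the inverse mapping from a probability back to y
logit <- function(p) log(p / (1 - p))

y <- c(-10, -1, 0, 1, 10)
p <- logistic(y)
print(round(p, 4))  # approaches 0 for very negative y and 1 for very positive y

# Round trip: logit(logistic(y)) recovers y
print(logit(p))

# Agreement with the built-in equivalents
print(all.equal(p, plogis(y)))
print(all.equal(logit(p), qlogis(p)))
```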

  36. 6.2.2 Logistic Regression: Model Description (customer churn example) • A wireless telecom company estimates probability of a customer churning (switching companies) • Variables collected for each customer: age (years), married (y/n), duration as customer (years), churned contacts (count), churned (true/false) • After analyzing the data and fitting a logistic regression model, age and churned contacts were selected as the best predictor variables

  37. 6.2.2 Logistic Regression: Model Description (customer churn example)

  38. 6.2.3 Diagnostics: Model Description (customer churn example) > head(churn_input) # Churned = 1 if cust churned > sum(churn_input$Churned) # 1743/8000 churned • Use the Generalized Linear Model function glm() > Churn_logistic1<-glm(Churned~Age+Married+Cust_years+Churned_contacts,data=churn_input,family=binomial(link="logit")) > summary(Churn_logistic1) # Age + Churned_contacts best > Churn_logistic3<-glm(Churned~Age+Churned_contacts,data=churn_input,family=binomial(link="logit")) > summary(Churn_logistic3) # Age + Churned_contacts

  39. 6.2.3 Diagnostics: Deviance and the Pseudo-R2 • In logistic regression, deviance = -2 log L • where L is the maximized value of the likelihood function used to obtain the parameter estimates • Two deviance values are provided • Null deviance = deviance based on only the y-intercept term • Residual deviance = deviance based on all parameters • Pseudo-R2 measures how well fitted model explains the data • Value near 1 indicates a good fit over the null model
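
Both deviance values are stored on a fitted glm object, so a pseudo-R2 can be computed directly. The sketch below assumes the common definition 1 - (residual deviance / null deviance) and uses simulated churn data; the data frame churn_sim and the coefficients that generate it are made up.

```r
# Simulated stand-in for the churn data (assumption)
set.seed(9)
n <- 2000
churn_sim <- data.frame(Age = round(runif(n, 18, 70)),
                        Churned_contacts = rpois(n, 1))
churn_sim$Churned <- rbinom(n, 1, plogis(3 - 0.15 * churn_sim$Age +
                                           0.4 * churn_sim$Churned_contacts))

fit <- glm(Churned ~ Age + Churned_contacts, data = churn_sim,
           family = binomial(link = "logit"))

# Pseudo-R2 under the assumed definition: 1 - residual deviance / null deviance
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance
print(c(null = fit$null.deviance, residual = fit$deviance, pseudo_R2 = pseudo_r2))

# For 0/1 outcomes the residual deviance equals -2 * log-likelihood
print(all.equal(fit$deviance, -2 * as.numeric(logLik(fit))))
```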

  40. 6.2.3 Diagnostics: Receiver Operating Characteristic (ROC) Curve • Logistic regression is often used to classify • In the Churn example, a customer can be classified as Churn if the model predicts high probability of churning • Although 0.5 is often used as the probability threshold, other values can be used based on desired error tradeoff • For two classes, C and nC, we have • True Positive: predict C, when actually C • True Negative: predict nC, when actually nC • False Positive: predict C, when actually nC • False Negative: predict nC, when actually C
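
The four outcome counts depend on the chosen threshold, which the following sketch makes concrete. It uses simulated churn data (churn_sim and its generating coefficients are assumptions), and outcome_counts() is a hypothetical helper written for this example.

```r
# Simulated stand-in for the churn data (assumption)
set.seed(9)
n <- 2000
churn_sim <- data.frame(Age = round(runif(n, 18, 70)),
                        Churned_contacts = rpois(n, 1))
churn_sim$Churned <- rbinom(n, 1, plogis(3 - 0.15 * churn_sim$Age +
                                           0.4 * churn_sim$Churned_contacts))
fit <- glm(Churned ~ Age + Churned_contacts, data = churn_sim,
           family = binomial(link = "logit"))
prob <- predict(fit, type = "response")  # estimated churn probabilities

# Count the four outcomes for a given probability threshold
outcome_counts <- function(threshold) {
  pred_C <- prob > threshold
  is_C <- churn_sim$Churned == 1
  c(TP = sum(pred_C & is_C), TN = sum(!pred_C & !is_C),
    FP = sum(pred_C & !is_C), FN = sum(!pred_C & is_C))
}

print(rbind(`0.5` = outcome_counts(0.5), `0.2` = outcome_counts(0.2)))
```

Lowering the threshold can only move customers from the nC prediction to the C prediction, so true positives and false positives both rise or stay the same.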

  41. 6.2.3 Diagnostics: Receiver Operating Characteristic (ROC) Curve • The Receiver Operating Characteristic (ROC) curve • Plots TPR against FPR

  42. 6.2.3 Diagnostics: Receiver Operating Characteristic (ROC) Curve > library(ROCR) > Pred = predict(Churn_logistic3, type="response")

  43. 6.2.3 Diagnostics: Receiver Operating Characteristic (ROC) Curve
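
As a rough base-R sketch of what an ROC curve contains (this is not the ROCR workflow from the slides), the code below sweeps the probability threshold and records the true positive rate, TP / (TP + FN), and the false positive rate, FP / (FP + TN), at each step. The data are simulated, and the trapezoid-rule area under the curve is an added summary.

```r
# Simulated stand-in for the churn data (assumption)
set.seed(9)
n <- 2000
churn_sim <- data.frame(Age = round(runif(n, 18, 70)),
                        Churned_contacts = rpois(n, 1))
churn_sim$Churned <- rbinom(n, 1, plogis(3 - 0.15 * churn_sim$Age +
                                           0.4 * churn_sim$Churned_contacts))
fit <- glm(Churned ~ Age + Churned_contacts, data = churn_sim,
           family = binomial(link = "logit"))
prob <- predict(fit, type = "response")
is_C <- churn_sim$Churned == 1

# Sweep the threshold from high to low; Inf classifies nobody as C
thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(prob[is_C] >= t))
fpr <- sapply(thresholds, function(t) mean(prob[!is_C] >= t))

# Area under the ROC curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
# plot(fpr, tpr, type = "l") would draw the curve
```

The curve starts at (0, 0), where nothing is classified as C, and ends at (1, 1), where everything is. An area near 1 indicates good separation, while 0.5 corresponds to random guessing.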

  44. 6.2.3 Diagnostics: Histogram of the Probabilities • It is interesting to visualize the counts of the customers who churned and who didn’t churn against the estimated churn probability.

  45. 6.3 Reasons to Choose and Cautions • Linear regression – outcome variable continuous • Logistic regression – outcome variable categorical • Both models assume a linear additive function of the input variables • If this is not true, the models perform poorly • In linear regression, the further assumption of normally distributed error terms is important for many statistical inferences • Although a set of input variables may be a good predictor of an output variable, “correlation does not imply causation”

  46. 6.4 Additional Regression Models • Multicollinearity is the condition when several input variables are highly correlated • This can lead to inappropriately large coefficients • To mitigate this problem • Ridge regression applies a penalty based on the size of the coefficients • Lasso regression applies a penalty proportional to the sum of the absolute values of the coefficients • Multinomial logistic regression – used for a more-than-two-state categorical outcome variable
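
To illustrate how a ridge penalty tames coefficients under multicollinearity, here is a minimal base-R sketch using the closed-form ridge solution on two nearly identical simulated inputs. The helper ridge() and the penalty value of 10 are assumptions chosen for illustration. Lasso has no closed form and is not shown.

```r
# Two almost perfectly correlated inputs (simulated, assumption)
set.seed(4)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1
y <- 1 + 2 * x1 + rnorm(n)

X <- scale(cbind(x1, x2))  # standardize so the penalty treats inputs alike
yc <- y - mean(y)          # centering the outcome removes the intercept

# Ridge estimate: solve (X'X + lambda * I) b = X'y; lambda = 0 is plain OLS
ridge <- function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, yc)))
}

ols <- ridge(0)
shrunk <- ridge(10)
print(rbind(ols = ols, ridge = shrunk))
```

With correlated inputs the OLS coefficients can be large and offsetting, while the ridge coefficients are pulled toward smaller, more similar values.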
