Lecture 4: Non-Linear Patterns. January 22, 2014
Question In my opinion the first Quiz was: • Very Easy • Somewhat Easy • Neither easy nor hard • Somewhat Hard • Very Hard
Administrative • Problem Set 2 due Monday. • Quiz 2 next Wednesday • Exam 1 two weeks from this coming Monday • Questions?
Last time • Regression “by hand” • Interpreting the slope and intercept. • Properties of Residuals
Properties of Residuals Residual Plots: • If the least squares line captures the association between x and y, then a plot of residuals versus x should stretch out horizontally with consistent vertical scatter. • Use the visual test for association to check for the absence of a pattern. • Don't look too long: if you stare long enough, you'll always find a pattern. You want to check whether there is an obvious and immediate one. • Is there a pattern in the example plot? It's subtle: the vertical scatter increases as x increases.
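The residual-plot check above can be sketched in a few lines. This is a minimal illustration with synthetic, truly linear data (the variable names and numbers are made up for the sketch); the actual plotting call is left as a comment:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=x.size)  # truly linear data

# Fit the least squares line and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# OLS residuals always average to ~0 and are uncorrelated with x by
# construction; what the *plot* adds is a view of curvature or of
# changing vertical spread.
print(round(residuals.mean(), 6))

# To look for a pattern visually:
#   import matplotlib.pyplot as plt
#   plt.scatter(x, residuals); plt.axhline(0); plt.show()
```

If the scatter of residuals fans out or bends as x grows, the linear model is missing something, even when the fit statistics look fine.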
Properties of Residuals Standard Deviation of the Residuals (se) • Measures how much the residuals vary around the fitted line. • Also known as the standard error of the regression or the root mean squared error (RMSE). • For the diamond example, se = $170.21. • Since the residuals are approximately normal, the empirical rule implies that about 95% of the prices are within about $340 (2 × se) of the regression line.
Explaining Variation R-squared (r2) • Is the square of the correlation between x and y • 0 ≤ r2 ≤ 1 • Is the fraction of the variation accounted for by the least squares regression line. • Higher is obviously better • For the diamond example, r2 = 0.4297 (i.e., the fitted line explains 42.97% of the variation in price). • But I see r-squared and "adjusted r-squared" reported. What's the difference? • We'll get there… Always report both r2 and se so others can judge how well the regression equation describes the data.
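Both summary statistics on this slide can be computed by hand from a fit. A sketch with made-up data loosely shaped like the diamond example (the numbers are illustrative, not the class data set):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.3, 1.2, 40)                    # e.g., carat weight
y = 100 + 2700 * x + rng.normal(0, 170, 40)      # e.g., price

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# se as defined on the slide: residual variation around the fitted line.
# The denominator is n - 2 because two coefficients were estimated.
se = np.sqrt(np.sum(resid**2) / (len(x) - 2))

# r^2 is the squared correlation between x and y
r2 = np.corrcoef(x, y)[0, 1] ** 2

print(round(se, 1), round(r2, 3))
```

For simple regression with an intercept, this r² is identical to 1 minus the residual sum of squares over the total sum of squares, which is the "fraction of variation explained" reading on the slide.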
Conditions for Simple Regression • Linear: look at a scatterplot. Does the pattern resemble a straight line? • Random residual variation: look at the residual plot to make sure no pattern exists. • No obvious lurking variable: think about whether other explanatory variables might better explain the linear association between x and y. • Pay attention to the substantive context of the model • Be very cautious about making predictions outside the range of observed conditions. • Look at the plots; look at the data!
Example 2: Gas Consumption Data: gas_consumption.csv • Use a simple regression model to predict gas consumption – Gas (CCF) – by Average Temp • Are the conditions for simple regression met? • Yes • No • What is simple regression? • I have no idea what language you're speaking
Example 2: Gas Consumption Data: gas_consumption.csv • Use a simple regression model to predict gas consumption – Gas (CCF) – by Average Temp • Using a simple regression model, what is your estimate of the intercept? • 338.76 • -4.33 • 287.46 • 12.25 • None of the above.
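A sketch of the workflow behind this exercise. The column names below are assumptions taken from the slide's variable names, and the numbers are illustrative stand-ins, not the real `gas_consumption.csv`:

```python
import numpy as np

# In class you would load the real file, e.g.:
#   import pandas as pd
#   df = pd.read_csv("gas_consumption.csv")
#   temp, gas = df["Average Temp"].values, df["Gas (CCF)"].values
# Illustrative numbers (hypothetical, roughly "gas use falls as it warms"):
temp = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
gas = 287.0 - 4.3 * temp + np.array([5.0, -3.0, 2.0, -4.0, 1.0, -1.0])

# np.polyfit with degree 1 returns (slope, intercept) for the OLS line
slope, intercept = np.polyfit(temp, gas, 1)
print(round(intercept, 1), round(slope, 2))
```

The intercept is the first clicker question's target: the predicted gas use when Average Temp is 0, read straight off the fitted equation.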
Non-linear Patterns • When is a linear model appropriate? • Ask yourself: will changes in the explanatory variable produce equal-sized changes in the estimated response, regardless of the value of x? • For example: does trimming 200 pounds from a large SUV have the same effect on mileage as trimming 200 pounds from a small compact? • Often the variables we want to model are not linearly related. Do we give up? No.
Estimating the Model • Data: 20_cars.csv • Cars data: MPG by Weight (1000's of lbs) • The fitted line: Estimated MPG City = 43.3 – 5.17 Weight • r2 = 0.702 and se = 2.95 • The equation estimates that mileage would increase by how much, on average, if car weight were reduced by 200 lbs? • 1.0 MPG • 4.5 MPG • 2.9 MPG • I have no idea
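The slope interpretation behind the question above is quick arithmetic. Weight is measured in 1000s of lbs, so trimming 200 lbs is a 0.2-unit change (coefficients are from the slide's fitted line):

```python
slope = -5.17          # estimated MPG change per 1000 lbs of weight
delta_weight = -0.2    # trimming 200 lbs = -0.2 in the model's units

# Estimated change in MPG City for a 200 lb reduction
delta_mpg = slope * delta_weight
print(round(delta_mpg, 2))  # about +1.03 MPG
```

This is the sense in which a linear model assumes "equal-sized changes": the same 200 lb cut is predicted to add about 1 MPG whether the car is an SUV or a compact.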
Estimating the Model Cars data: MPG by Weight (1000's of lbs) • It's very easy to estimate an OLS regression model, but often a simple linear model isn't appropriate. • Sometimes we can detect non-linearity with scatterplots. • In practice it's often hard to tell, especially once we start to consider outliers.
Look at Plots of the Residuals • Nonlinear patterns are often easier to spot when looking at the residuals (residuals by x-values):
What to do? Transformations. • Create a new variable in the data set by applying a function to each observation • Two nonlinear transformations useful in many business applications: reciprocals and logarithms • Transformations (sometimes) allow the use of linear regression analysis to describe a curved pattern • How to decide? • Use theory for insight. Often thinking about the data will tell you what you should do. • Try different ones. Iterate. • Among the possible choices, select the one that captures the curvature of the data and produces an interpretable equation
Choosing a Transformation • There are several suggested transformations, depending on the curvature of your data (but don't forget to use the context of the problem) • What was the shape of our MPG and Weight data?
Reciprocal Transformation • The reciprocal transformation is useful when dealing with variables that are already in the form of a ratio, such as miles per gallon • In the context of our car data, a reciprocal transformation makes sense: • Instead of miles per gallon, use gallons per mile. • But there aren't many cars that burn more than one gallon per mile. So… • Transform the response variable (MPG → GPM) and multiply by 100. The resulting response is the number of gallons it takes to go 100 miles
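The transformation the slide describes is just 100 divided by MPG. A small sketch (the MPG values are made up), which also shows why equal MPG gaps are not equal fuel gaps:

```python
import numpy as np

mpg = np.array([16.0, 20.0, 25.0, 33.0, 50.0])
gp100m = 100.0 / mpg   # gallons needed to drive 100 miles
print(gp100m)

# The same 4-MPG improvement saves very different amounts of fuel:
save_low = 100 / 16 - 100 / 20    # 16 -> 20 MPG: 1.25 gal per 100 mi
save_high = 100 / 46 - 100 / 50   # 46 -> 50 MPG: ~0.17 gal per 100 mi
print(round(save_low, 2), round(save_high, 2))
```

This asymmetry is exactly why the reciprocal scale fits the economics of fuel use better than raw MPG.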
Reciprocal Transformation • Estimating the model with a transformed dependent variable: • Estimated Gallons/100 Miles = -0.112 + 1.204 Weight • r2 = 0.713, se = 0.667
Residual Plot: • Outliers are now clearer: the sports cars.
Comparing Models • Original linear model: • Estimated MPG City = 43.3 – 5.17 Weight • r2 = 0.702 and se = 2.95 • Model with transformed dependent variable: • Estimated Gallons/100 Miles = -0.112 + 1.204 Weight • r2 = 0.713, se = 0.667 • It can be tempting to say that model 1 is about the same as model 2 because it has a similar r2 (70.2% of variation explained vs 71.3%). • Not a valid comparison. Don't compare r2 between models fit to different data! (i.e., different observations or response variables)
Reciprocal Transformation • Visually, what we did:
Reciprocal Transformation • But what if we really wanted to predict MPG? • We do not stop with just fitting the linear regression model. • We can transform back to MPG. • Estimated Gallons/100 Miles = -0.112 + 1.204 Weight • Given the above model, what is the MPG of a car that weighs 3,000 lbs? • 28.7 • 27.76 • 3.5 • 35
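The back-transformation in the question can be computed directly from the slide's fitted equation. Weight is in 1000s of lbs, so a 3,000 lb car has Weight = 3.0:

```python
weight = 3.0
gp100m = -0.112 + 1.204 * weight   # predicted gallons per 100 miles
mpg = 100.0 / gp100m               # invert the transformation back to MPG
print(round(mpg, 2))
```

Note the order of operations: predict on the transformed scale first, then invert. Compare the result against the answer choices above.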
Next time • More on transformations • E.g., log transformations (very useful and common) • Exam 1 is two weeks from this coming Monday!
Comparing Models • Original linear model: • Estimated MPG City = 35.6 – 4.52 Weight • r2 = 0.57 and se = 2.9 • Model with transformed dependent variable: • Estimated Gallons/100 Miles = 1.111 + 1.21 Weight • r2 = 0.412, se = 1.04 • A Hummer H2 (weight = 6,400 lbs) is predicted to get 6.7 MPG from model 1. What MPG is it predicted to get from model 2? • 8.8 • 11.3 • 7.7 • Not possible.
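Both predictions for the Hummer H2 follow from the slide's two fitted equations, with Weight in 1000s of lbs:

```python
weight = 6.4                            # Hummer H2: 6,400 lbs

mpg_model1 = 35.6 - 4.52 * weight       # linear model predicts MPG directly
gp100m = 1.111 + 1.21 * weight          # reciprocal model predicts gal/100 mi
mpg_model2 = 100.0 / gp100m             # back-transform to MPG

print(round(mpg_model1, 1), round(mpg_model2, 1))
```

The two models disagree most at extreme weights, which is exactly where the choice of functional form matters.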
Substantive Comparison • The reciprocal equation treats weight differently than the linear equation does • In the reciprocal equation, differences in weight matter less as cars get heavier • A diminishing effect of changes in weight makes more sense than a constant decrease • Substantive knowledge / theory is important! • Knowledge of market forces (economics) is very important
Log Transformations • Another very useful transformation: logarithms • Useful for distributions with positive skew (long right tail) • Useful when the association between variables is more meaningful on a percentage scale. • Price Elasticity of Demand • Percentage change in quantity demanded given a 1% change in price • Key to figuring out the optimal price to charge.
Log Transformations • Estimated Sales Volume = 190,480 – 125,190 × Price • r2 = 0.83
Log Transformations • Residual Plot: systematic patterns easier to spot
Log Transformations • If we take the log of both the independent and dependent variables (also called a log-log regression): • Estimated log(Sales Volume) = 11.05 – 2.442 × log(Price) • r2 = 0.955 • (Slide shows the fitted line and residual plot.)
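A sketch of what the log-log fit does, using synthetic demand data generated with a known elasticity. The slide's coefficients come from the class data set, not from this simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
price = rng.uniform(1.0, 5.0, 60)
# Demand curve with a true elasticity of -2.4, plus multiplicative noise
volume = np.exp(11.0) * price**-2.4 * np.exp(rng.normal(0, 0.1, 60))

# Fit a straight line on the log-log scale
slope, intercept = np.polyfit(np.log(price), np.log(volume), 1)

# The slope of a log-log regression is the price elasticity of demand:
# a 1% increase in price is associated with about a slope-percent
# change in quantity demanded.
print(round(slope, 2))
```

On the slide's fitted equation, the same reading gives an elasticity of about -2.44: raising price 1% is associated with roughly a 2.44% drop in sales volume, which is the quantity you need to find the revenue-maximizing price.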