Learn how to introduce a regression model, correctly interpret intercept and slope, make predictions, and avoid common pitfalls in regression analysis. Understand the correlation coefficient's role in quantifying the association between variables. Discover the best ways to determine the "best" model and rank different lines numerically.
Regression FPP 10 kind of
Plan of attack • Introduce regression model • Correctly interpret intercept and slope • Prediction • Pitfalls to avoid
Regression line • The correlation coefficient is a nice numerical summary of two quantitative variables • It indicates the direction and strength of the association • But does it quantify the association? • It would be of interest to do this for • Predictions • Understanding phenomena
Regression line • Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables • If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot • This line represents a mathematical model. Later we will make the mathematical model a statistical one.
Regression line • Slope-intercept form notation: y = mx + b • Regression form notation: y-hat = α + βx, with intercept α and slope β
Regression Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945
Which line is best Price = -90.2458 + 0.1598SQFT (red) Price = -300 + 0.3SQFT (blue) Price = 0 + 0.1SQFT (green)
Which model to use • Different people might draw different lines by eye on a scatterplot • What are some ways we can determine which model (line) out of all the possible models (lines) is the "best" one? • What are some ways that we can numerically rank the different models (i.e., the different lines)? • This will come later in the course
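One common way to rank candidate lines numerically is by the sum of squared vertical distances between the points and each line (the least-squares criterion). The sketch below scores the three lines from the previous slide. The five homes are hypothetical values invented for illustration, not the data behind the slide, and least squares is only one possible criterion, not necessarily the one the course adopts later.

```python
# Hypothetical (sqft, price in $1000s) pairs, invented for illustration only.
homes = [(1500, 150.0), (2000, 230.0), (2500, 310.0), (3000, 390.0), (3500, 470.0)]

# The three candidate lines from the slide, stored as (intercept, slope).
lines = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}


def sum_squared_errors(intercept, slope, data):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in data)


# Score every line, then sort the names from smallest to largest error.
scores = {name: sum_squared_errors(a, b, homes) for name, (a, b) in lines.items()}
ranking = sorted(scores, key=scores.get)

for name in ranking:
    print(f"{name}: SSE = {scores[name]:.1f}")
```

A smaller score means the line passes closer to the points overall, so the first name printed is the best of the three under this criterion.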
Slope interpretation • The slope, β, of a regression line is almost always important for interpreting the data. • The slope is a rate of change. It is the mean amount of change in y-hat when x increases by 1
Slope interpretation Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945 Price is in thousands of dollars, so for every 1 sqft increase in the size of a home, the house price increases by $159.80 on average
Intercept interpretation • The intercept, α, of the regression line is the value of y-hat when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.
Intercept interpretation Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945 If the sqft of a home were 0, the predicted house price would be -$90,245.80 This doesn't make much sense here because x (sqft) doesn't take on values close to zero.
Prediction Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945 For a 3500 sqft home we would predict the selling price to be price = -90.2458 + 0.1598*3500 = 469.0542 (thousands of dollars), i.e. price = $469,054.20
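The prediction and the slope interpretation can both be checked with a few lines of code. This is a minimal sketch using the fitted coefficients from the slide. The function name `predict_price` is made up for illustration, and the units (thousands of dollars) are an assumption inferred from the dollar figures quoted on the slides.

```python
# Coefficients from the slide. Units are assumed to be thousands of dollars.
INTERCEPT = -90.2458  # predicted price (in $1000s) when sqft = 0
SLOPE = 0.1598        # change in predicted price (in $1000s) per extra sqft


def predict_price(sqft):
    """Predicted selling price, in thousands of dollars."""
    return INTERCEPT + SLOPE * sqft


price_3500 = predict_price(3500)
print(f"Predicted price: ${price_3500 * 1000:,.2f}")

# Adding one square foot changes the prediction by exactly the slope.
slope_check = predict_price(3501) - predict_price(3500)
print(f"Change per extra sqft: ${slope_check * 1000:,.2f}")
```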
OECD data: Income and unemployment in the U.S. • What is the relationship between households’ disposable income and the nation’s unemployment rate? • Data from the U.S. 1980 to 1998 • (data provided by the economics department at Duke)
Facts about regression • There is a close relationship between the correlation coefficient and the slope of a regression line • They have the same sign • They are proportional to each other: slope = r × (SD of y) / (SD of x) • The intercept has no such direct relationship with the correlation coefficient; it is computed as intercept = (mean of y) − slope × (mean of x)
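For the least-squares line, the slope equals r times the ratio of the standard deviations, and the intercept is chosen so that the line passes through the point of means. The sketch below computes both from r, using a small hypothetical data set (the x and y values are invented for illustration).

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """Pearson correlation coefficient r."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(x, y)
slope = r * std(y) / std(x)             # same sign as r, proportional to r
intercept = mean(y) - slope * mean(x)   # line passes through (mean x, mean y)

print(f"r = {r:.4f}, slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Because the standard deviations are always positive, the slope and r necessarily share the same sign.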
Facts about regression • The distinction between explanatory and response variable is essential in regression • If you have a slope computed using x as the explanatory and y as the response variable you can’t “back solve” to get a slope and intercept for the regression model with x being the response and y the explanatory variables. • If you want to predict x given a y then you must find the intercept and slope with y being the explanatory variable and x being the response
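The sketch below illustrates why back-solving fails: it fits the least-squares line both ways on a small hypothetical data set (values invented for illustration) and compares the slope for predicting x from y with the reciprocal of the slope for predicting y from x.

```python
def fit_line(x, y):
    """Least-squares fit of y on x. Returns (intercept, slope)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

a_yx, b_yx = fit_line(x, y)   # y is the response, x is explanatory
a_xy, b_xy = fit_line(y, x)   # x is the response, y is explanatory

# Algebraically inverting the first line would give a slope of 1 / b_yx.
back_solved_slope = 1 / b_yx

print(f"slope of y on x:            {b_yx:.4f}")
print(f"slope of x on y (refitted): {b_xy:.4f}")
print(f"slope from back-solving:    {back_solved_slope:.4f}")
```

The two slopes multiply to r squared rather than to 1, so the refitted and back-solved lines agree only when the points lie exactly on a line.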
Facts about regression • R² (coefficient of determination) provides a one-number summary of how well the regression line fits the data • R² is the proportion of variation in the y's explained by the regression line • R² lies between 0 and 1 • Values near 1 indicate the regression predicts the y's in the data set very closely • Values near 0 indicate the regression does not predict the y's in the data set very closely
Facts about regression • Example: • The correlation coefficient between sale price and square feet was r = 0.8718945 • Thus the coefficient of determination is R² = (0.8719)² ≈ 0.76 • So 76% of the variability in sale price is explained by (taken into account by) the regression line with square feet.
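The arithmetic in this example is easy to reproduce. The sketch below squares the slide's r, then checks on a small hypothetical data set (values invented for illustration) that r squared matches the "proportion of variation explained" definition, 1 − SSE/SST, for a least-squares line with one explanatory variable.

```python
from math import sqrt

# Value of r from the slide.
r = 0.8718945
r_squared = r ** 2
print(f"R^2 from the slide's r: {r_squared:.4f}")

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

mx = sum(x) / len(x)
my = sum(y) / len(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)   # total variation in y (SST)

# Least-squares line and its residual variation (SSE).
slope = sxy / sxx
intercept = my - slope * mx
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

r_xy = sxy / sqrt(sxx * syy)
r2_from_residuals = 1 - sse / syy

print(f"r^2 = {r_xy ** 2:.4f}, 1 - SSE/SST = {r2_from_residuals:.4f}")
```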
Does regression fit data well? • A regression line is reasonable if • The association between the two variables is indeed linear • The points are randomly scattered around the line • The income/unemployment rate data are well described by a regression line.
Regression of AIDS rates per 1000 people on GNP per capita • Line is too low for GNP values near zero and too high for big GNP values. • We shouldn't use the line for predictions
Changing the response variable • When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line. • With monetary variables, typically this can be accomplished by taking logarithms.
Regression of log(AIDS) on log(GNP) • Much better fit • Predict log(AIDS) from log(GNP). Exponentiate to estimate the AIDS rate
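The transform, fit, and back-transform steps can be sketched as follows. The data are hypothetical and generated from an exact power law purely for illustration; they are not the AIDS/GNP data, and the helper names are made up.

```python
from math import exp, log


def fit_line(x, y):
    """Least-squares fit of y on x. Returns (intercept, slope)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical curved data: y = 2 * x^1.5, so a straight line fits poorly.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.0 * v ** 1.5 for v in x]

# Step 1: take logs of both variables, then fit a line on the log scale.
log_x = [log(v) for v in x]
log_y = [log(v) for v in y]
intercept, slope = fit_line(log_x, log_y)


def predict_y(new_x):
    """Step 2: predict log(y) from log(x), then exponentiate."""
    return exp(intercept + slope * log(new_x))


predicted = predict_y(10.0)
print(f"log-scale slope = {slope:.3f}, prediction at x = 10: {predicted:.3f}")
```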
Warnings about regression • Predicting y at values of x beyond the range of x in the data is called extrapolation • This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values • Extrapolated predictions can be absolutely wrong
Extrapolation • Diamond price and carat • Explanatory variable is measured in carats and response variable in dollars • Predict the price of the Hope Diamond
Extrapolation • The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4
Extrapolation • Green line is a linear fit using only diamonds of less than 0.4 carats • Blue line is a linear fit using all carat sizes • Red curve is a quadratic fit
Lurking variable • A variable not being considered could be driving the relationship • In practice this is a difficult issue to tackle, especially when everything seems OK
Influential point • An outlier in either the X or Y direction which, if removed, would markedly change the value of the slope and y-intercept.
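The danger of extrapolating a line fitted on a narrow range can be sketched with made-up numbers. The "true" price curve below is a hypothetical quadratic invented for illustration; it is not the diamond data, and the names are placeholders.

```python
def fit_line(x, y):
    """Least-squares fit of y on x. Returns (intercept, slope)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def true_price(carat):
    """Hypothetical curved relationship, invented for illustration only."""
    return 1000.0 * carat ** 2


# Fit a line using only small stones (0.1 to 0.4 carats).
small_carats = [0.1, 0.2, 0.3, 0.4]
small_prices = [true_price(c) for c in small_carats]
intercept, slope = fit_line(small_carats, small_prices)

# Inside the fitted range the line is a fair approximation ...
inside_error = abs((intercept + slope * 0.25) - true_price(0.25))

# ... but far outside it the prediction is badly off.
big_carat = 2.0
predicted_big = intercept + slope * big_carat
outside_error = abs(predicted_big - true_price(big_carat))

print(f"error at 0.25 carats: {inside_error:.1f}")
print(f"error at {big_carat} carats:  {outside_error:.1f}")
```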
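The effect of a single influential point can be seen by fitting the line with and without it. The data below are hypothetical, invented for illustration: five points lying exactly on a line, plus one point far out in the x direction.

```python
def fit_line(x, y):
    """Least-squares fit of y on x. Returns (intercept, slope)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: five points exactly on the line y = x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

# One extra point that is an outlier in the x direction.
outlier = (20.0, 2.0)

intercept_without, slope_without = fit_line(x, y)
intercept_with, slope_with = fit_line(x + [outlier[0]], y + [outlier[1]])

print(f"without outlier: slope = {slope_without:.3f}, intercept = {intercept_without:.3f}")
print(f"with outlier:    slope = {slope_with:.3f}, intercept = {intercept_with:.3f}")
```

A single point is enough to flatten the line here, which is why it is worth refitting without any suspicious observation and comparing the two results.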
Causality • On its own, regression only quantifies an association between x and y • It does not prove causality • Under a carefully designed experiment (or in some cases observational studies) regression can be used to show causality.