Regression FPP 10 kind of
Plan of attack • Introduce regression model • Correctly interpret intercept and slope • Prediction • Pitfalls to avoid
Regression line • The correlation coefficient is a nice numerical summary of two quantitative variables • It indicates the direction and strength of the association • But does it quantify the association? • It would be of interest to do this for • Predictions • Understanding phenomena
Regression line • Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables • If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot • This line represents a mathematical model. Later we will make the mathematical model a statistical one.
Regression line • Slope intercept form notation • Regression form notation
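Written out with the symbols used on the later slides (intercept α, slope β, predicted value y-hat), the two notations would look roughly as follows. The letters m and b for the slope-intercept form are the usual textbook convention and are an assumption here.

```latex
\begin{align*}
\text{Slope-intercept form:} \quad & y = mx + b \\
\text{Regression form:} \quad & \hat{y} = \alpha + \beta x
\end{align*}
```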
Regression Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945
Which line is best Price = -90.2458 + 0.1598SQFT (red) Price = -300 + 0.3SQFT (blue) Price = 0 + 0.1SQFT (green)
Which model to use • Different people might draw different lines by eye on a scatterplot • What are some ways we can determine which model (line) out of all the possible models (lines) is the “best” one? • What are some ways that we can numerically rank the different models (i.e., the different lines)? • This will come later in the course
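One common way to rank candidate lines numerically is the sum of squared residuals. The slides defer this topic, so the criterion is an assumption here. The sketch below scores the red, blue, and green lines from the previous slide against a handful of made-up homes. The data are hypothetical and chosen only to illustrate the calculation.

```python
# Hypothetical (made-up) homes: square feet and price in thousands of dollars.
sqft = [1500, 2000, 2500, 3000, 3500]
price = [150, 230, 310, 390, 470]

# Candidate lines from the slide, stored as (intercept, slope).
lines = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}

def sse(intercept, slope, xs, ys):
    """Sum of squared residuals: observed y minus the line's prediction."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

scores = {name: sse(a, b, sqft, price) for name, (a, b) in lines.items()}
best = min(scores, key=scores.get)  # the smallest score is the closest fit

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.1f}")
```

A smaller score means the line passes closer to the points overall. The ranking depends entirely on the data used, so the made-up values above say nothing about the real housing data.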
Slope interpretation • The slope, β, of a regression line is almost always important for interpreting the data. • The slope is a rate of change. It is the mean amount of change in y-hat when x increases by 1
Slope interpretation Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945 For every 1 sqft increase in the size of a home, the house price increases on average by $159.80 (price is measured in thousands of dollars)
Intercept interpretation • The intercept, α, of the regression line is the value of y-hat when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.
Intercept interpretation Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945 If the sqft of a home were 0, the house price would on average be -$90,245.80. This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.
Prediction Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945 For a 3500 sqft home we would predict the selling price to be price = -90.2458 + 0.1598*3500 = 469.0542 (thousands of dollars), i.e. price = $469,054.20
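The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes price is recorded in thousands of dollars, as the dollar figures on the slides imply. The function name `predict_price` is made up for illustration.

```python
INTERCEPT = -90.2458  # thousands of dollars
SLOPE = 0.1598        # thousands of dollars per square foot

def predict_price(sqft):
    """Predicted price, in thousands of dollars, from the fitted line."""
    return INTERCEPT + SLOPE * sqft

pred = predict_price(3500)
print(f"${pred * 1000:,.2f}")  # prints $469,054.20

# The slope is the change in the prediction for one extra square foot.
one_more_sqft = predict_price(2001) - predict_price(2000)
```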
OECD data: Income and unemployment in the U.S. • What is the relationship between households’ disposable income and the nation’s unemployment rate? • Data from the U.S. 1980 to 1998 • (data provided by the economics department at Duke)
Facts about regression • There is a close relationship between the correlation coefficient and the slope of a regression line • They have the same sign • They are proportional to each other • The intercept has no relationship with the correlation coefficient but here is the formula
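The standard least-squares formulas, written with the α and β used on the other slides, are given below. The symbols for the sample means and sample standard deviations of x and y are notation assumed here.

```latex
\begin{align*}
\beta &= r \,\frac{s_y}{s_x} \\
\alpha &= \bar{y} - \beta \bar{x}
\end{align*}
```

Because the standard deviations are positive, the first formula shows why the slope and r share a sign and why the slope is proportional to r.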
Facts about regression • The distinction between explanatory and response variables is essential in regression • If you have a slope computed using x as the explanatory variable and y as the response variable, you can’t “back solve” to get the slope and intercept for the regression model with x as the response and y as the explanatory variable. • If you want to predict x given y, then you must find the intercept and slope with y as the explanatory variable and x as the response
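A small numerical check makes the "no back solving" point concrete. The sketch below fits both regressions by least squares on five made-up points (hypothetical data, not from the slides) and compares the slope of x on y with the reciprocal of the slope of y on x.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

a_yx, b_yx = fit_line(xs, ys)  # y regressed on x
a_xy, b_xy = fit_line(ys, xs)  # x regressed on y

back_solved_slope = 1 / b_yx   # what algebraic inversion would give

print(f"slope of y on x:       {b_yx:.3f}")
print(f"slope of x on y:       {b_xy:.3f}")
print(f"back-solved 1 / slope: {back_solved_slope:.3f}")
```

The two fitted slopes multiply to r squared rather than to 1, so inverting one line recovers the other only when the points fall exactly on a line.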
Facts about regression • R² (the coefficient of determination) provides a one-number summary of how well the regression line fits the data • R² is the proportion of the variation in the y’s explained by the regression line • R² lies between 0 and 1 • Values near 1 indicate the regression predicts the y’s in the data set very closely • Values near 0 indicate the regression does not predict the y’s in the data set very closely
Facts about regression • Example: • The correlation coefficient between sale price and square feet was r = 0.8718945 • Thus the coefficient of determination is R² = (0.8719)² ≈ 0.76 • So 76% of the variability in sale price is explained by (taken into account by) the regression line with square feet.
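The link between r and the "variation explained" reading can be checked numerically. The sketch below squares the r from the slide. It then confirms, on five made-up points (hypothetical data), that 1 minus SSE/SST for a least-squares line equals the squared correlation.

```python
import math

# From the slide: correlation between sale price and square feet.
r_homes = 0.8718945
r2_homes = r_homes ** 2  # about 0.76

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y (SST)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / math.sqrt(sxx * syy)

# Variation left unexplained by the line (SSE).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

print(f"homes: R^2 = {r2_homes:.2f}")
print(f"toy data: r^2 = {r ** 2:.2f}, 1 - SSE/SST = {r_squared:.2f}")
```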
Does regression fit data well? • A regression line is reasonable if • The association between the two variables is indeed linear • The points are randomly scattered around the line • The income/unemployment rate data are well described by a regression line.
Regression of AIDS rates per 1000 people on GNP per capita • The line is too low for GNP values near zero and too high for big GNP values. • We shouldn’t use the line for predictions
Changing the response variable • When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line. • With monetary variables, typically this can be accomplished by taking logarithms.
Regression of log(AIDS) on log(GNP) • Much better fit • Predict log(AIDS) from log(GNP). Exponentiate to estimate the AIDS rate
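The fit-on-the-log-scale, then exponentiate workflow can be sketched as follows. The data are synthetic and follow an exact power law chosen for illustration. They are not the AIDS/GNP data, and the variable names are placeholders.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Synthetic data generated from y = 3 * x**-0.5 (an assumption for the demo).
xs = [100, 400, 1600, 6400]
ys = [3 * x ** -0.5 for x in xs]

# Fit a straight line to log(y) versus log(x).
log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]
a, b = fit_line(log_x, log_y)

def predict(x):
    """Predict on the log scale, then exponentiate back to the original units."""
    return math.exp(a + b * math.log(x))

print(f"slope on log scale: {b:.3f}")
print(f"prediction at x = 900: {predict(900):.4f}")
```

With real, noisy data, exponentiating a log-scale prediction estimates something closer to a typical (median) value than a mean, so the back-transformed number should be read with that in mind.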
Warnings about regression • Predicting y at values of x beyond the range of x in the data is called extrapolation • This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values • Extrapolated predictions can be absolutely wrong
Extrapolation • Diamond price and carat • The explanatory variable is measured in carats and the response variable in dollars • Predict the price of the Hope Diamond
Extrapolation • The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4
Extrapolation • The green line is a linear fit using only diamonds of less than 0.4 carats • The blue line is a linear fit using all carat sizes • The red curve is a quadratic fit
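The danger can be reproduced in miniature. In the sketch below the true relationship is curved (y = x squared, a synthetic stand-in rather than the diamond data), but the line is fitted only on a narrow range of x where the curve looks nearly straight.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def true_y(x):
    """Synthetic curved relationship, assumed for the demo."""
    return x ** 2

# Fit using only small x values, like fitting only the small diamonds.
xs = [0.1, 0.2, 0.3, 0.4]
ys = [true_y(x) for x in xs]
a, b = fit_line(xs, ys)

inside = a + b * 0.25   # prediction within the observed range
outside = a + b * 2.0   # extrapolation far beyond it

print(f"x = 0.25: line {inside:.4f} vs truth {true_y(0.25):.4f}")
print(f"x = 2.00: line {outside:.4f} vs truth {true_y(2.0):.4f}")
```

Inside the observed range the line is off by a small amount. At x = 2 it misses by more than a factor of four, even though nothing about the fit itself signalled a problem.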
Lurking variable • A variable not being considered could be driving the relationship • In practice this is a difficult issue to tackle, especially when everything seems OK
Influential point • An outlier in either the X or Y direction which, if removed, would markedly change the value of the slope and y-intercept. • applet
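A quick way to see the effect of an influential point is to fit the line with and without it. The sketch below uses made-up points (hypothetical data): five that lie exactly on a line, plus one outlier far out in the x direction.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: five points on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]

# Add one outlier far from the rest in x, with an unusually low y.
xs_with = xs + [10]
ys_with = ys + [0]

a_without, b_without = fit_line(xs, ys)
a_with, b_with = fit_line(xs_with, ys_with)

print(f"without outlier: intercept {a_without:.3f}, slope {b_without:.3f}")
print(f"with outlier:    intercept {a_with:.3f}, slope {b_with:.3f}")
```

In this toy case a single point is enough to flip the sign of the slope, which is why it is worth refitting without suspicious points before interpreting a regression line.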
Causality • On its own, regression only quantifies an association between x and y • It does not prove causality • Under a carefully designed experiment (or in some cases observational studies) regression can be used to show causality.