Simple linear regression Gediminas Murauskas Vilnius University
Introduction How do the values of variable Y depend on the values of X? This question arises when there is a need • to explore the relation between income and expenditure, • to evaluate the impact of advertising on sales volume, etc. A simple linear regression model is a statistical model describing the dependence of one interval variable on another interval variable.
Objectives Simple linear regression analysis has several possible objectives, including • assessment of the relationship between the explanatory variable and the response, • prediction of future observations.
Analysis steps Is there a significant linear relationship between x and y? • No: end. • Yes: estimate the model parameters → evaluate model fit and analyse the residuals → forecast → end.
Investigate the possible relationship between two variables • Graph the points (x1, y1), (x2, y2), …, (xn, yn) in a scatter plot. • Calculate the correlation coefficient. (Both steps are sketched in R below.)
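A minimal sketch of both steps in R; the vectors x and y below are made-up illustrative data, not from the slides:

x <- c(1.2, 2.0, 2.9, 4.1, 5.0)   # hypothetical explanatory variable
y <- c(2.1, 3.9, 6.2, 8.0, 9.8)   # hypothetical response
plot(x, y)                        # scatter plot of the (xi, yi) pairs
cor(x, y)                         # Pearson correlation coefficient r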
Scatter plot If the points cluster in a band running from lower left to upper right (or from upper left to lower right), there is a correlation.
Correlation coefficient What information is provided by the correlation coefficient? • The correlation coefficient indicates how strong the linear relationship between the measured variables is. • It is also possible to test whether the correlation is significant. If this information is sufficient, then regression is unnecessary.
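Significance of the correlation can be checked in R with cor.test(); a brief sketch, reusing the illustrative x and y from above:

cor.test(x, y)   # tests H0: true correlation is 0; prints r, a t statistic and the p-value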
Simple linear regression model • Theoretical model: $Y = \beta_0 + \beta_1 X + e$. • Y is called the response variable, or dependent variable. • X is called the explanatory variable, predictor variable or independent variable. • $\beta_0$ and $\beta_1$ are parameters, e is the error term.
Simple linear regression model (another form) • Suppose that we have a response variable Y and an explanatory variable x; then the simple linear regression model for Y on x is given by $Y_i = \beta_0 + \beta_1 x_i + e_i$, $i = 1, \dots, n$.
Simple linear regression model Assumptions: • $e_i$ are independent, normally distributed random variables. • $E(e_i) = 0$, with constant variance $\mathrm{Var}(e_i) = \sigma^2$ (homoscedasticity).
Least squares estimation • The regression coefficients are chosen in such a way that the Y determined by the regression equation is as close as possible to the Y from the data. That is, we choose the parameter estimates $b_0$ and $b_1$ that minimize $\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$. • $\hat{y} = b_0 + b_1 x$ is the fitted regression function.
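As a sketch of how the least squares estimates can be computed directly (x and y are again the illustrative vectors from above; lm() gives the same answer):

b1 <- cov(x, y) / var(x)       # slope estimate b1
b0 <- mean(y) - b1 * mean(x)   # intercept estimate b0
c(b0, b1)
coef(lm(y ~ x))                # lm() returns identical estimates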
Goodness of fit • R square (coefficient of determination, R²). • ANOVA table.
R square • The coefficient of determination R² (or simply R square) shows the overall goodness of fit of the regression model (it indicates how well the model fits the data). • The closer R² is to 1, the better the fit. • When R² < 0.20, the model is generally considered "poor". • The adjusted R² takes into account the number of degrees of freedom and is preferable to R².
R square $R^2 = \mathrm{SSR}/\mathrm{SST}$, where • $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares, • $\mathrm{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$ is the explained sum of squares (sum of squares due to regression), • $\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2$ is the error (unexplained) sum of squares, and $\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$.
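The decomposition can be verified numerically in R; a sketch using the illustrative x and y from above:

fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)             # total sum of squares
SSR <- sum((fitted(fit) - mean(y))^2)   # explained (regression) sum of squares
SSE <- sum(resid(fit)^2)                # error (unexplained) sum of squares
SSR / SST                               # R squared; matches summary(fit)$r.squared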
ANOVA • ANOVA in simple linear regression tests whether the explanatory variable can be dropped from the model (H0: $\beta_1 = 0$) or not (H1: $\beta_1 \neq 0$). • The model is bad (the regression model does not fit the data) when the p-value ≥ 0.05, i.e. when H0 is not rejected.
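In R the ANOVA table for a fitted model is printed with anova(); a one-line sketch, continuing from the fit above:

anova(fit)   # F test of H0: beta1 = 0; a small p-value means x should be kept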
Regression diagnostics (checking the model) • Whether there are outliers (influential points): use measures such as Cook's distance (which is larger than 1 for influential points; sometimes the cutoff 4/n, where n is the number of observations, is used). • Whether the normality assumption holds: plot the residuals versus the expected order statistics of the standard normal distribution (Q–Q plot); we should get approximately a straight line. • Whether the homoscedasticity assumption holds: plot the residuals versus the predicted values and check whether the residuals get more spread out as a function of the predicted value.
Regression diagnostics (checking the model) For time-series data it is very important to check the assumption of uncorrelated errors. • Plot $e_t$ versus $e_{t-1}$. • Use formal tests like the Durbin–Watson test.
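A sketch of both checks in R, assuming fit is an lm() model fitted to time-series data (durbinWatsonTest() is from the car package, which the later slides also use):

e <- resid(fit)
plot(head(e, -1), tail(e, -1),
     xlab = "e[t-1]", ylab = "e[t]")   # lag plot of the residuals
library(car)
durbinWatsonTest(fit)                  # formal Durbin-Watson test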
Prediction (forecasting) • One of the reasons to fit a line to data is to predict the response at a particular value of the explanatory variable. • Forecasting is more reliable if we use explanatory variable values from within the range of the data.
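In R, predictions with prediction intervals come from predict(); a sketch continuing the fit above, where the new x values are hypothetical but chosen inside the observed range:

newdata <- data.frame(x = c(2.5, 3.5))           # new values within the data range
predict(fit, newdata, interval = "prediction")   # fitted values with prediction intervals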
Example Anthropometry plays an important role in industrial design, clothing design and other fields where statistical data about the distribution of body dimensions in the population are used to optimize products. How well can foot size predict height? A sample of 33 (foot size; height) points (Source: William Harkness, Pennsylvania State University): (27.0; 169.0) (29.0; 187.0) (25.5; 178.0) (27.9; 180.0) (27.0; 185.0) (26.0; 180.0) (29.0; 180.0) (27.0; 176.5) (29.0; 185.0) (27.0; 180.0) (29.0; 175.0) (27.2; 175.0) (29.0; 185.0) (29.0; 190.5) (27.2; 185.0) (27.5; 183.0) (25.0; 175.0) (25.0; 173.0) (28.0; 184.0) (31.5; 198.0) (30.0; 201.0) (28.0; 180.0) (29.0; 188.0) (25.5; 168.0) (26.7; 180.0) (29.0; 180.0) (28.0; 180.0) (27.0; 213.0) (29.0; 196.0) (28.0; 183.0) (26.0; 178.0) (30.0; 193.0) (27.0; 173.0)
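A sketch of how this dataset might be entered and the model fitted in R; the column names foot and height are my assumption, but duom and regr match the object names used on the later slides:

foot <- c(27.0, 29.0, 25.5, 27.9, 27.0, 26.0, 29.0, 27.0, 29.0, 27.0,
          29.0, 27.2, 29.0, 29.0, 27.2, 27.5, 25.0, 25.0, 28.0, 31.5,
          30.0, 28.0, 29.0, 25.5, 26.7, 29.0, 28.0, 27.0, 29.0, 28.0,
          26.0, 30.0, 27.0)
height <- c(169.0, 187.0, 178.0, 180.0, 185.0, 180.0, 180.0, 176.5, 185.0,
            180.0, 175.0, 175.0, 185.0, 190.5, 185.0, 183.0, 175.0, 173.0,
            184.0, 198.0, 201.0, 180.0, 188.0, 168.0, 180.0, 180.0, 180.0,
            213.0, 196.0, 183.0, 178.0, 193.0, 173.0)
duom <- data.frame(foot, height)
plot(duom$foot, duom$height)             # scatter plot (next slide)
cor(duom$foot, duom$height)              # r = 0.5619425
regr <- lm(height ~ foot, data = duom)   # fit the simple linear regression
summary(regr)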
Simple regression with R The points cluster in a band running from lower left to upper right; there is a correlation (r = 0.5619425, indicating a positive relationship).
Simple regression with R [summary(regr) output: the estimated regression equation; R² = r² ≈ 0.32] There is a weak linear relationship.
Simple regression with R (ANOVA table) [anova output: the regression (SSR) and error (SSE) sums of squares; the F test p-value is below 0.05, so H0 is rejected.]
Simple regression with R (checking for outliers)
> cutoff <- 4/((nrow(duom)-length(regr$coefficients)-2))
> plot(regr, which=4, cook.levels=cutoff)
All Cook's distances are < 1. One point (observation 28) exceeds the 4/n cutoff (4/33). The Bonferroni-adjusted outlier test shows it is an outlier. Outliers are sometimes of substantive interest, but in this case we remove it from the dataset.
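A sketch of the test and the refit; outlierTest() is car's Bonferroni-adjusted outlier test, and foot and height are the assumed column names from the data-entry sketch above:

library(car)
outlierTest(regr)                        # flags observation 28
duom <- duom[-28, ]                      # drop the outlying observation
regr <- lm(height ~ foot, data = duom)   # refit the model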
Simple regression with R (results; outlier removed) [summary output for the refitted model] There is a good linear relationship.
Simple regression with R (checking the normality assumption)
qqPlot(regr, distribution="norm", main="QQ Plot")
The points lie approximately on a straight line: the residuals have an approximately normal distribution.
Simple regression with R (checking homoscedasticity)
> plot(resid(regr) ~ fitted(regr), xlab="Predicted", ylab="Residuals", col="black", pch=24)
> abline(h=0)
The plot of residuals versus predicted values shows no evidence of the residuals getting more spread out as a function of the predicted value.