Introduction: Correlation and Regression • The General Linear Model is a phrase used to indicate a class of statistical models which include simple linear regression analysis. • Regression is the predominant statistical tool used in the social sciences due to its simplicity and versatility. • Also called Linear Regression Analysis. • We will examine regression first, and then see how correlation is one portion of regression analysis.
Simple Linear Regression: The Basic Mathematical Model • Regression is based on the concept of the simple proportional relationship - also known as the straight line. • We can express this idea mathematically! • Theoretical aside: All theoretical statements of relationship imply a mathematical theoretical structure. • Just because it isn’t explicitly stated doesn’t mean that the math isn’t implicit in the language itself!
Simple Linear Relationships • Alternate mathematical notation for the straight line - don't ask why! • 10th Grade Geometry: y = mx + b • Statistics Literature: Yi = B0 + B1Xi + ei • Econometrics Literature: Yi = a + bXi + ei
Alternate Mathematical Notation for the Line • These are all equivalent. We simply have to live with this inconsistency. • We won’t use the geometric tradition, and so you just need to remember that B0 and a are both the same thing.
Linear Regression: the Linguistic Interpretation • In general terms, the linear model states that the dependent variable is directly proportional to the value of the independent variable. • Thus if we state that some variable Y increases in direct proportion to some increase in X, we are stating a specific mathematical model of behavior - the linear model.
The Mathematical Interpretation of the Regression Parameters • a = the intercept • the point where the line crosses the Y-axis. • (the value of the dependent variable when all of the independent variables = 0) • b = the slope • the increase in the dependent variable per unit change in the independent variable (also known as the 'rise over the run')
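These two interpretations can be made concrete with a minimal sketch in Python; the intercept and slope values below are invented purely for illustration:

```python
# Hypothetical intercept and slope, for illustration only.
a = 2.0   # intercept: value of Y when X = 0
b = 0.5   # slope: change in Y per one-unit change in X

def predict(x):
    """Predicted value of the dependent variable for a given x."""
    return a + b * x

print(predict(0.0))                 # the intercept itself
print(predict(4.0) - predict(3.0))  # the slope: the "rise" for a one-unit "run"
```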
The Error Term • Such models do not predict behavior perfectly. • So we must add a component to compensate for the errors in prediction - the error term e, giving Yi = a + bXi + ei. • Having fully described the linear model, one could now spend several entire courses on the error term alone.
The 'Goal' of Ordinary Least Squares • Ordinary Least Squares (OLS) is a method of finding the linear model which minimizes the sum of the squared errors. • Such a model provides the best explanation/prediction of the data.
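A minimal sketch of the idea, using the standard closed-form OLS formulas for one predictor on a small made-up data set (the numbers are invented for illustration):

```python
# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Closed-form OLS: the slope is the sum of cross-products over the sum of
# squared deviations of X; the intercept makes the line pass through the means.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

print(round(b, 4), round(a, 4))  # the line minimizing the sum of squared errors
```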
Why Least Squared Error? • Why not simply minimum error? • It is similar to the problem with the average deviation: the errors about the line sum to 0.0! • Minimum absolute deviation (error) models now exist, but they are mathematically cumbersome. • Try doing algebra with |absolute value| signs! • We square the errors to get rid of the negative signs (and can take the square root to get back to the "root mean squared error," which we don't use very much). • Some also feel that big errors should be more influential than small errors.
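The sum-to-zero problem can be demonstrated directly. Below, the data are made up and a and b are the OLS estimates for those points (computed beforehand with the usual closed-form formulas):

```python
# Made-up data; a and b are the OLS estimates for these four points.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
a, b = 0.5, 0.8

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(sum(residuals))                  # ~0.0: positive and negative errors cancel
print(sum(e ** 2 for e in residuals))  # ~1.8: squaring stops the cancellation
```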
Other models are possible... • Best parabola...? • (i.e. nonlinear or curvilinear relationships) • Best maximum likelihood model ... ? • Best expert system...? • Complex Systems…? • Chaos models • Catastrophe models • others
The Notion of Linear Change • The linear aspect means that the same increase in unemployment will have the same effect on the crime rate at both low and high levels of unemployment. • A nonlinear change would mean that as unemployment increased, its impact upon the crime rate might increase at higher unemployment levels.
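The contrast can be made concrete. In the linear model below, a one-unit increase in X always moves Y by the same amount; adding a squared term (one common nonlinear form) makes the impact grow with X. All coefficients are invented for illustration:

```python
# Hypothetical coefficients, for illustration only.
def linear(x):
    return 1.0 + 0.5 * x                   # constant marginal effect

def nonlinear(x):
    return 1.0 + 0.5 * x + 0.1 * x ** 2    # effect grows with x

# Effect of a one-unit increase at low vs. high X:
print(linear(2) - linear(1), linear(11) - linear(10))              # same both times
print(nonlinear(2) - nonlinear(1), nonlinear(11) - nonlinear(10))  # larger at high X
```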
Minimizing the Sum of Squared Errors • Who put the Least in OLS? • In mathematical jargon we seek to minimize the Unexplained Sum of Squares (USS), where: USS = Σ(Yi − Ŷi)² = Σei²
T-Tests • Since we wish to make probability statements about our model, we must do tests of inference. • Fortunately, the ratio of each estimated coefficient to its standard error follows Student's t distribution (with n − 2 degrees of freedom in simple regression).
Measures of Goodness of fit • The Correlation coefficient • r-squared • The F test
The correlation coefficient • A measure of how closely the data points fall to the regression line • It ranges between -1.0 and +1.0 • It is closely related to the slope.
Goodness of fit • The correlation coefficient • A measure of how closely the data points fall to the regression line • It ranges between -1.0 and +1.0 • r2 (r-square) • The r-square (or R-square, or r2) is also called the coefficient of determination • Ranges between 0.0 and 1.0 • Expresses the % of the variance in Y explained by X
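A sketch of both measures on a small made-up data set, using the definitional formulas (sums of squared deviations and cross-products):

```python
import math

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)  # correlation: between -1.0 and +1.0
r2 = r ** 2                     # share of the variance in Y explained by X
print(round(r, 4), round(r2, 4))
```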
Tests of Inference • t-tests for coefficients • F-test for entire model • Since we are interested in how well the model performs at reducing error, we need to develop a means of assessing that error reduction. Since the mean of the dependent variable represents a good benchmark for comparing predictions, we calculate the improvement in the prediction of Yi relative to the mean of Y (the best guess of Y with no other information).
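A sketch of both tests for the simple (one-predictor) model, computed from the definitional sums of squares on made-up data. The F statistic compares the error reduction relative to the mean-only prediction against the remaining error; in simple regression the slope's t statistic squared equals the model F:

```python
import math

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)

# OLS estimates.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
a = mean_y - b * mean_x

# Sums of squares: total (around the mean) and unexplained (around the line).
tss = sum((yi - mean_y) ** 2 for yi in y)
uss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

s2 = uss / (n - 2)               # residual variance estimate
t = b / math.sqrt(s2 / sxx)      # t-test for the slope (n - 2 df)
F = ((tss - uss) / 1) / s2       # F-test for the whole model (1 and n - 2 df)
print(round(t, 4), round(F, 4))  # in simple regression, t**2 equals F
```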