Regression and Correlation CSLU 2850.Lo1 Spring 2008 Cameron McInally mcinally@fordham.edu Fordham University May contain work from the Creative Commons.
Regression and Correlation • Regression and Correlation • In this lecture, we will cover: • Linear Regression: Trying to find one line that best fits data points on a Cartesian Plane. • Correlation: Measuring the strength of a relationship between two variables.
Regression and Correlation • Linear Regression • When we plot two variables against each other (i.e. an XY graph), the values usually do not fall in a straight line. • Linear Regression attempts to find the straight line that best estimates the relationship between the two variables.
Regression and Correlation • Linear Regression • There are two important parts: • Fitted Regression Line: The line that best fits the data points on the graph. • Regression Equation: The equation that generates the Fitted Regression Line.
Regression and Correlation • Fitted Regression Line • If the data points fall approximately along a straight line, then we can fit a line to the points. • Some points may be above the line and others may be below the line. • In order to find the best-fit line, we use a Regression Equation.
Regression and Correlation • Regression Equation • Slope-Intercept Form: $y = mx + b$ • In grade school we learn Slope-Intercept Form for plotting a line. Here, m is the slope of the line and b is where the line crosses the y-axis.
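Regression and Correlation • Regression Equation • In Statistics and Excel, the regression equation for the fitted regression line will be of the form: $y = a + bx$ • This and the slope-intercept form $y = mx + b$ say exactly the same thing: the intercept is written first and called a, and the slope is called b.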
Regression and Correlation • Regression Equation • Terms for the equation: • y: is called the dependent variable; • x: is called the independent (or predictor) variable; • a: is a coefficient called the intercept or constant term; • b: is a coefficient called the slope;
Regression and Correlation • Residuals • The black vertical lines you see represent the error between a particular data point and the line we have chosen. • We call these lines the residuals. These are the differences between the observed dependent values and the predicted values.
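As a small illustration of the residuals described above, here is a minimal sketch in Python. The data points and the candidate line (intercept 0.5, slope 1.8) are made up for this example and are not from the lecture.

```python
# Hypothetical data and a hypothetical candidate line y = a + b*x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 8.0]
a, b = 0.5, 1.8  # assumed intercept and slope, chosen by eye


def residuals(xs, ys, a, b):
    """Return observed y minus predicted y (a + b*x) for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


res = residuals(xs, ys, a, b)
for x, y, r in zip(xs, ys, res):
    side = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"x={x}: observed={y}, predicted={a + b * x:.2f}, "
          f"residual={r:+.2f} ({side} the line)")
```

A positive residual means the point sits above the line and a negative residual means it sits below.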
Regression and Correlation • Fitting the Regression Line • Assuming our data fits a linear model: $y = \alpha + \beta x + \varepsilon$ • Where, • α is the true intercept; • β is the true slope; • ε is the error term;
Regression and Correlation • Least Squares Method • When fitting the line, we do not know the true intercept and slope. So, we must estimate them. • To get estimates for α and β, we find values for a and b that minimize the value of the sum of squared residuals: $\sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2$
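To make the idea of minimizing the sum of squared residuals concrete, the sketch below scores a few candidate lines on a small made-up data set and picks the one with the smallest sum. The data and the candidate (intercept, slope) pairs are hypothetical, and comparing a handful of candidates is only an illustration, not how the estimates are actually found.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared differences between observed y and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data and hypothetical candidate (intercept, slope) pairs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 8.0]
candidates = [(0.0, 2.0), (0.5, 1.8), (0.25, 1.9)]

for a, b in candidates:
    ssr = sum_squared_residuals(xs, ys, a, b)
    print(f"a={a}, b={b}: sum of squared residuals = {ssr:.4f}")

# The best candidate is the one with the smallest sum.
best = min(candidates, key=lambda ab: sum_squared_residuals(xs, ys, *ab))
print("best candidate:", best)
```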
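Regression and Correlation • Least Squares Estimates • To find the estimates for a and b: $b = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $a = \bar{y} - b\,\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means of the x and y values.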
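The following is a minimal sketch of the least squares estimates, assuming the standard closed-form solution: the slope b is the sum of (x - mean of x)(y - mean of y) divided by the sum of (x - mean of x) squared, and the intercept a is (mean of y) - b * (mean of x). The function name and the data are made up for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b): the intercept and slope minimizing squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 8.0]
a, b = least_squares(xs, ys)
print(f"fitted line: y = {a:.2f} + {b:.2f}x")
```

The function divides by the spread of the x values, so it needs at least two distinct x values to work.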
Regression and Correlation • Coefficient of determination • Also called the $R^2$ value. • Measures the percentage of the variation in the values of the dependent variable that can be explained by changes in the independent variable. • If $R^2 = 0.61803$, then 61.803% of the variation in the dependent variable can be explained by changes in the independent variable.
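One common way to compute the coefficient of determination is 1 minus (sum of squared residuals divided by total sum of squares around the mean of y). The sketch below assumes that definition. The data are hypothetical, and the intercept and slope used are the least squares values for that data.

```python
def r_squared(xs, ys, a, b):
    """Share of the variation in ys explained by the line a + b*x."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total
    return 1 - ss_res / ss_tot


# Hypothetical data with its least squares line y = 0.25 + 1.9x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 8.0]
a, b = 0.25, 1.9
r2 = r_squared(xs, ys, a, b)
print(f"R^2 = {r2:.4f}, i.e. about {100 * r2:.1f}% of the variation is explained")
```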
Regression and Correlation • Checking the Regression Model • When we perform a regression on a set of data, we are making 4 important assumptions: • The straight-line model is correct; • The error term, ε, is normally distributed with mean 0; • The errors have constant variance; • The errors are independent of each other;
Regression and Correlation • Straight-line assumption • Plot the observed data points, with the predicted data points, on the same scatter-plot. • Do the observed data points appear to follow the straight line? Do the data points appear curved?
Regression and Correlation • Normal Distribution of the Residuals • To test this assumption, create a normal probability plot of the residuals. • If the residuals follow a normal distribution, the points should fall approximately along a straight line on the normal probability plot.
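As a rough sketch of what goes into a normal probability plot, the code below pairs each sorted residual with the standard normal quantile expected at its position. The residual values are hypothetical, and the plotting positions (i - 0.5)/n are one common convention assumed here; Excel or other software may use a slightly different one.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its expected standard-normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


residuals = [-0.15, 0.45, -0.45, 0.15]  # hypothetical residuals
points = normal_plot_points(residuals)
for q, r in points:
    print(f"theoretical quantile {q:+.3f}   residual {r:+.2f}")
```

Plotting the residuals against the theoretical quantiles gives the normal probability plot; points lying close to a straight line are consistent with normally distributed residuals.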
Regression and Correlation • Constant variance assumption • Next, we will look at the assumption of constant variance in the residuals. To view this graphically, plot the residual values against the predicted values. • If the spread of the residuals grows or shrinks as the predicted value changes, then there is reason to doubt the assumption of constant variance. • Outliers may have a significant impact on this test.
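The check above is graphical, but a crude numeric stand-in can show the idea: build the (predicted, residual) pairs that would be plotted, then compare the spread of the residuals for the lower and upper halves of the predicted values. The data, the line y = 0 + 1x, and the split-in-half comparison are all assumptions made for this sketch; it is not a formal test.

```python
def residuals_vs_predicted(xs, ys, a, b):
    """Return (predicted, residual) pairs sorted by predicted value."""
    pairs = [(a + b * x, y - (a + b * x)) for x, y in zip(xs, ys)]
    return sorted(pairs)


def spread_by_half(pairs):
    """Range (max - min) of the residuals in the lower and upper halves."""
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return max(low) - min(low), max(high) - min(high)


# Hypothetical data whose scatter widens as x grows, with an assumed line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.1, 3.9, 6.0, 5.0, 8.0, 7.0]
a, b = 0.0, 1.0

pairs = residuals_vs_predicted(xs, ys, a, b)
low_spread, high_spread = spread_by_half(pairs)
print(f"spread for small predicted values: {low_spread:.2f}")
print(f"spread for large predicted values: {high_spread:.2f}")
```

A much larger spread in one half than in the other is the kind of pattern that would show up as a fan shape in the residual plot.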
Regression and Correlation • Independent errors • This final assumption should only be considered when there is a defined order of the observations. • We would like to make sure that the residual of a data point is not influenced by the surrounding data points.
Regression and Correlation • Correlation and Causation • Correlation: indicates the relationship between two variables without assuming that a change in one causes a change in the other. • Causation: indicates the relationship between two variables where a change in one causes a change in the other.
Regression and Correlation • Correlation • Expresses the strength of a relationship as a number between -1 and 1. • Positive Correlation: indicates that an increase in one variable tends to go with an increase in the other variable. • Negative Correlation: indicates that an increase in one variable tends to go with a decrease in the other. • Zero Correlation: indicates that there is no linear relationship between the variables.
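Regression and Correlation • Pearson Correlation Coefficient • The most often used measure of correlation: $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$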
Regression and Correlation • Pearson Correlation Coefficient • Notice that the sign of this coefficient and the slope of the line are the same. • The slope can be any real number, but the coefficient must be between -1 and 1. • Perfect Fit: A value of 1 or -1 means that all the data points fall exactly on the line, i.e. all the residuals are 0. • Does not work well with outliers. • Does not work well with a curved relationship.
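A minimal sketch of the Pearson correlation coefficient, assuming the usual definition: the sum of (x - mean of x)(y - mean of y), divided by the square root of the product of the sums of squared deviations of x and of y. The function name and both data sets are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical scattered data with an upward trend.
r_scatter = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.5, 5.5, 8.0])
# Hypothetical data lying exactly on a line (all residuals 0).
r_line = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

print(f"scattered data:  r = {r_scatter:.4f}")
print(f"points on a line: r = {r_line:.4f}")
```

The numerator is the same sum that appears in the numerator of the least squares slope, which is why the coefficient and the slope always share a sign.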
Regression and Correlation • Spearman Correlation Coefficient • Works well with outliers. • Works well with a curved (non-linear) relationship, as long as it consistently increases or consistently decreases. • We will work with this coefficient in the lab.
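One standard way to obtain the Spearman coefficient is to replace each value by its rank (tied values share the average of their ranks) and then compute the Pearson coefficient on the ranks. The sketch below assumes that approach; the helper names and the data are made up, and the lab may use Excel rather than code.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


def spearman(xs, ys):
    """Spearman coefficient: Pearson coefficient computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical curved but always-increasing relationship: y = x cubed.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(f"Pearson:  {pearson(xs, ys):.4f}")
print(f"Spearman: {spearman(xs, ys):.4f}")
```

Because the ranks of y follow the ranks of x exactly in this example, the Spearman coefficient reaches 1 even though the points do not lie on a straight line.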
Regression and Correlation • Homework (Always Due in One Week) • Read Chapter 8. • Complete Chapter 8, pg. 326: 1, 2, 4