12b. Regression Analysis, Part 2

CSCI N207 Data Analysis Using Spreadsheet 12b. Regression Analysis, Part 2 Lingma Acheson linglu@iupui.edu Department of Computer and Information Science, IUPUI

Fitting the Data

Fitting the Data • If there are more than two data points, chances are they don’t all fit in one straight line. • We need to find the equation for a straight line that does the “best job” of reproducing the data. • About half of the data points should fall above our line (“positive residual”) and about half should fall below (“negative residual”).

Residual • Difference between the measured and the calculated Y-values:

Finding the Slope (m) of an Estimated Line • The slope of the estimated line is given by the ratio of the covariance between the X and Y data sets and of the variance of the X data set:

Finding the y-Intercept (b) of an Estimated Line • Once we’ve found the slope, we can find the Y-intercept using the standard equation for a line, with one exception: we must use the means of the X and Y data sets as our coordinates (since the actual data points are unlikely to be on the estimated line): • Excel functions: • m: SLOPE(..,..) • b: INTERCEPT(..,..)

Practice • Find an equation for the trendline of the following data set and predict the reading hours when aptitude is 25, 33 or 45.

Predicting Values • Once we get the slope (m) and the y-intercept (b) of the estimated line, we have a mathematical relation that ties the X variable to the Y variable. • Once we have this relation, we can use it to predict X- and Y- coordinates that are not part of the data sets. • E.g. What is the estimated reading hours if two new students coming in, one has a reading aptitude of 25 and another one 46? y = 0.2029x + 0.9429 x = 25, y = 0.2029*25 + 0.9429 = 6.0154 x = 46, y = 0.2029*46 + 0.9429 = 10.2763

Interpolation • Interpolation is the process by which we use the formula for estimated line to predict a value of Y for a given value of X that is not included in the data set, but is within the range of the data set. • The given value of X and the predicted Y-value will be on the estimated line.

Extrapolation • Extrapolation is the process by which we use the formula for estimated line to predict a value of Y for a given value of X that is not included in the data set AND is not within the range of the data set. • The given value of X and the predicted Y-value will be on the estimated line, but outside of the range of the data set.

R2 Value • How good is the line? How confident is the prediction? • R : Correlation Coefficient, -1 ≤ R≤ 1 • R2 :Coefficient of Determination, 0 ≤ R2 ≤ 1 • The Coefficient of Determination is used to measure the certainty of making predictions from a graph. It represents the percent of data closest to the trendline. • The closer it is to 1, the more confident the prediction is. - From "Correlation Coefficient" (http://mathbits.com/MathBits/TISection/Statistics2/correlation.htm)

Excel Functions • TREND() - Returns predicted Y values in a linear trend when passed X data. • Add Trendline(from the Chart menu) Returns the trendline, equation, and correlation coefficient for a set of X,Y data.

12b. Regression Analysis, Part 2

12b. Regression Analysis, Part 2

Presentation Transcript

Regression Analysis: Part 2

Regression Analysis

Regression Analysis Simple Regression

Regression Analysis

Regression Analysis

Regression Analysis

Regression: (2) Multiple Linear Regression and Path Analysis

Chapter 4 Multiple Regression Analysis (Part 2)

Regression analysis

Regression Analysis

Regression Analysis Part B Calculation Procedures

Logistic Regression Analysis of Matched Case-Control Data- Part 2

Multiple Regression Analysis: Part 4

Regression Analysis

Multiple Regression Analysis: Part 2

2. Regression Analysis

Multiple Regression Analysis: Part 1

Penalized Regression, Part 2

Part 1: Regression Analysis Estimating Relationships

Regression Analysis Simple Regression

Part 1: Regression Analysis Estimating Relationships

Regression analysis