120 likes | 281 Views
CSCI N207 Data Analysis Using Spreadsheet. 12b. Regression Analysis, Part 2. Lingma Acheson linglu@iupui.edu. Department of Computer and Information Science, IUPUI. Fitting the Data. Fitting the Data.
E N D
CSCI N207 Data Analysis Using Spreadsheet 12b. Regression Analysis, Part 2 Lingma Acheson linglu@iupui.edu Department of Computer and Information Science, IUPUI
Fitting the Data • If there are more than two data points, chances are they don’t all fit in one straight line. • We need to find the equation for a straight line that does the “best job” of reproducing the data. • About half of the data points should fall above our line (“positive residual”) and about half should fall below (“negative residual”).
Residual • Difference between the measured and the calculated Y-values:
Finding the Slope (m) of an Estimated Line • The slope of the estimated line is given by the ratio of the covariance between the X and Y data sets and of the variance of the X data set:
Finding the y-Intercept (b) of an Estimated Line • Once we’ve found the slope, we can find the Y-intercept using the standard equation for a line, with one exception: we must use the means of the X and Y data sets as our coordinates (since the actual data points are unlikely to be on the estimated line): • Excel functions: • m: SLOPE(..,..) • b: INTERCEPT(..,..)
Practice • Find an equation for the trendline of the following data set and predict the reading hours when aptitude is 25, 33 or 45.
Predicting Values • Once we get the slope (m) and the y-intercept (b) of the estimated line, we have a mathematical relation that ties the X variable to the Y variable. • Once we have this relation, we can use it to predict X- and Y- coordinates that are not part of the data sets. • E.g. What is the estimated reading hours if two new students coming in, one has a reading aptitude of 25 and another one 46? y = 0.2029x + 0.9429 x = 25, y = 0.2029*25 + 0.9429 = 6.0154 x = 46, y = 0.2029*46 + 0.9429 = 10.2763
Interpolation • Interpolation is the process by which we use the formula for estimated line to predict a value of Y for a given value of X that is not included in the data set, but is within the range of the data set. • The given value of X and the predicted Y-value will be on the estimated line.
Extrapolation • Extrapolation is the process by which we use the formula for estimated line to predict a value of Y for a given value of X that is not included in the data set AND is not within the range of the data set. • The given value of X and the predicted Y-value will be on the estimated line, but outside of the range of the data set.
R2 Value • How good is the line? How confident is the prediction? • R : Correlation Coefficient, -1 ≤ R≤ 1 • R2 :Coefficient of Determination, 0 ≤ R2 ≤ 1 • The Coefficient of Determination is used to measure the certainty of making predictions from a graph. It represents the percent of data closest to the trendline. • The closer it is to 1, the more confident the prediction is. - From "Correlation Coefficient" (http://mathbits.com/MathBits/TISection/Statistics2/correlation.htm)
Excel Functions • TREND() - Returns predicted Y values in a linear trend when passed X data. • Add Trendline(from the Chart menu) Returns the trendline, equation, and correlation coefficient for a set of X,Y data.