160 likes | 383 Views
Prediction. Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among variables is high, the success in prediction will be high.
E N D
Prediction • Prediction is a scientific guess of unknown by using the known data. • Statistical prediction is based on correlation. If the correlation among variables is high, the success in prediction will be high. • So, if there is a perfect correlation among variables, then the prediction will be perfect. • Today, we will study on the bivariate linear correlation and regression. That is, linear relation between two variables.
Prediction • Today, we will study on the bivariate linear correlation and regression. That is, linear relation between two variables. • In a graduate level of statistic class, you will learn curvilinear correlation/regression for U-shaped relations and Cannonical Correlation for the relations among three or more variables • By using statistical predictions (regression), we can answer such questions: • What is the height of a person whose weight is 80 kg? • If a students’ OSS score is 168, what will be his/her GPA in university?
Prediction • As you remember, correlation coefficient provides us a measure of how strongly two variables are related. • Pearson r is one of the correlation coefficients and it is calculated on the least squared deviations. • By using that method, we can draw a hypothetic linebetween the scores on a scatter diagram. • This line represents the best fit of the predicted scores
In this example, the correlation between weight and height is perfect (r= +1). So, all the dots in the diagram are on the hypothetic line (the line of best fit). • The formula of the line is Y=aX+c. For these distributions, it is Y=1x + 100.
Terminology: • The line in this diagram is regression line. • The equation of the line is regression equation (Y=aX+c). • For correlation, it is not important which axis represents which variable, since there is no direction in correlation. • For regression, Y axis is always shows the predicted (dependent) variable, and X axis represents the predictor (independent variable) • Be careful about the language of regression. The equation is always read as regression of Y on X.
Now let’s focus on the other examples in which the relation between two variables is less perfect. • As you can see, when the correlation become less perfect, dots in the diagram starts to diverge from the regression line
In this example, the correlation is far from being perfect. • So, the dots are far away from the regression line.
The Criterion of Best Fit • By using the regression equation, we can calculate predicted values of Y. • Let’s say these predicted values areYl • As you can see in the scatter diagram, these predicted values of Yl are not the same of Y. • The discrepancy with Y and Yl is the error of prediction. • These discrepancies are presented as vertical lines in the figure. • As you can see, the higher the r, the lower the error is.
The Regression Equation • The regression of Y on X (Standart-Score Formula) • zly = rzx • zly = the predicted standart-score value of Y • r = correlation coefficient • rzx= the standart-score value of X from which zly is predicted • As you can see, • 1. zly =rzxwhen r=1 • 2. zlyis 0 when zxis 0. That is mean of X overlaps with mean of Yl • To calculate the predicted values of Y from raw scores of X, we can use a simpler formula
The Regression Equation • In this formula • Yl= the predicted raw score in Y • Sx and Sy= the two standart deviations • r= correlation coefficient • Now, lets try to calculate Yl for the distribution presented in Table 1. Note that this form of the formula is similar to Y=aX+c
Error of PredictionThe Standard Error of Estimate • The regression line is a kind of a mean of the score in the scatter plot • The discrepancy from regression line is then a kind of a deviation form mean (line) • So, we can use a similar formula to SD in order to calculate standard discrepancy
Error of PredictionThe Standard Error of Estimate • A preferred formula for Syx is
Limits of Error in Estimating Y from X • Assume that actual scores of Y are normally distributed about predicted scores • So, we can use normal curve to calculate the limits of error in prediction • In a Normal Distribuion • 68% of the Y scores fall within the limits Y(mean) +/- 1 SD • 95 % of the Y scores fall within the limits Y(mean) +/- 1.96 SD • 99 % of the Y scores fall within the limits Y(mean) +/- 2.58 SD
Limits of Error in Estimating Y from X • Given that regression line is a kind of a mean and standard error is a kind of a SD • We can see that • 68% of actual Y values fall within the limits Y(mean) +/- 1 Syx • 95 % of actual Y values fall within the limits Y(mean) +/- 1.96 Syx • 99 % of actual Y values fall within the limits Y(mean) +/- 2.58 Syx