BUSINESS STATISTICS, 2/E

BUSINESS STATISTICS, 2/E by G C Beri, Chapter 15: Regression Analysis

What is Regression? It was Sir Francis Galton who first used the term regression as a statistical concept in 1877. He made a statistical study which showed that the height of children born to tall parents tends to ‘regress’ towards the mean height of the population. Galton used the term regression for a statistical technique to predict one variable (the height of children) from another variable (the height of parents). When confined to bivariate data, this is called ‘regression’ or ‘simple regression’. The variable that forms the basis for predicting another variable is known as the independent or predictor variable, and the variable that is predicted is known as the dependent variable.

Regression Model A statistical model is a set of mathematical formulas and assumptions which describe a real world situation. In this sense, simple linear regression and multiple regression are both statistical models. A statistical model tries to capture the systematic behaviour of the given data, leaving out those factors that cannot be foreseen or predicted. These factors are the errors. A good statistical model is one which provides as large a systematic component as possible, minimising errors.

Regression Model (Contd…) As a first step, we choose a particular model, say a linear regression model, for describing the relationship between the two variables. As a second step, we work out the estimates of the model parameters on the basis of random sample data. The third step is to consider the errors, called residuals, that arise when the model is fitted to the data. When we are convinced that the residuals contain only pure randomness, we consider our model quite appropriate for its intended purpose, which is invariably to make predictions.

Estimation Using the Regression Line A scatter diagram can give us a broad idea of the type of relationship (or even the absence of any relationship) between the two variables under study. The equation for a straight line is Y = a + bX, where Y is the dependent variable, X is the independent variable, a is the Y-intercept, which is the point at which the regression line crosses the Y-axis (the vertical axis), and b is the slope of the regression line. It should be noted that the values of both a and b remain constant for any given straight line.

The Method of Least Squares • In order to explain the method of least squares, it is necessary to introduce a new symbol. • A new symbol Ŷ (the computed or estimated value of Y) is used to represent individual values of the estimated points, that is, those points that actually lie on the estimating line. In view of this, the equation for the estimating line becomes Ŷ = a + bX.

The Method of Least Squares (Contd…) The two normal equations are:
ΣY = na + bΣX
ΣXY = aΣX + bΣX²
where ΣY = the total of the Y series, n = the number of observations, ΣX = the total of the X series, ΣXY = the sum of the XY column, ΣX² = the total of the squares of the individual items in the X series, and a and b are the Y-intercept and the slope of the regression line, respectively.
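As a rough illustration, the two normal equations can be solved for a and b directly from the column totals. The sketch below is only one way of doing this; the function name fit_line and the five data points are made up for the example and do not come from the text.

```python
def fit_line(xs, ys):
    """Solve the two normal equations for the intercept a and slope b."""
    n = len(xs)
    sum_x = sum(xs)                               # total of X series
    sum_y = sum(ys)                               # total of Y series
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of XY column
    sum_x2 = sum(x * x for x in xs)               # sum of squares of X

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are identical; slope is undefined")

    # Eliminating a from the two normal equations gives b,
    # and the first normal equation then gives a.
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical sample data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"Estimating line: Y = {a:.2f} + {b:.2f} X")
```

For these made-up values the totals are ΣX = 15, ΣY = 20, ΣXY = 66 and ΣX² = 55, so the fit works out to a = 2.2 and b = 0.6.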

Alternative Approach

Use of Deviations from Means of X & Y

Use of Deviations from the Assumed Means

Regression in Case of Bivariate Grouped Frequency Distributions

Regression Coefficient

Properties of Regression Coefficients

The Standard Error of Estimate It is the measure of the spread of the observed values around the estimated values given by the regression equation. This concept is similar to the standard deviation, which measures the variation of individual items about the arithmetic mean.

The Standard Error of Estimate (Short-cut method)

Interpreting Standard Error of Estimate The higher the magnitude of the standard error of estimate, the greater is the dispersion or variability of points around the regression line. In contrast, if the standard error of estimate is zero, then we may take it that the estimating equation is the best estimator of the dependent variable. In such a case, all the points would lie on the regression line. As such, there would be no point scattered around the regression line.
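The standard error of estimate can be computed from the residuals once a and b are known. The sketch below assumes the common definition with n − 2 in the denominator, which is not stated in the text above; the function name and the data are likewise made up for illustration.

```python
import math


def standard_error_of_estimate(xs, ys, a, b):
    """Spread of observed Y values around the estimated values a + bX.

    Assumes the n - 2 divisor commonly used for simple regression.
    """
    n = len(xs)
    if n <= 2:
        raise ValueError("need more than two observations")
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))


# Hypothetical data scattered around the line Y = 2.2 + 0.6X.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
se_scattered = standard_error_of_estimate(xs, ys, a=2.2, b=0.6)

# Hypothetical data lying exactly on the line Y = 1 + 2X.
ys_on_line = [3, 5, 7, 9, 11]
se_perfect = standard_error_of_estimate(xs, ys_on_line, a=1, b=2)

print(f"scattered points: {se_scattered:.4f}")
print(f"points on the line: {se_perfect:.4f}")
```

The second data set shows the limiting case described above: when every point lies on the regression line, the standard error of estimate is zero.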

Hypothesis Tests about Regression Relationship

Interval Estimate of B We recall that Y = a + bX is really a sample regression line and, as such, is only one of several possible sample regression lines. The population regression line is Y = A + BX, where A is the population equivalent of the sample a. Similarly, B is the parameter analogous to b, which is the slope of the sample regression line. In order to determine the interval estimate of B, the formula is b ± t Sb, where Sb is the standard error of b and t is the critical value of the t distribution with n − 2 degrees of freedom.
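A minimal sketch of the interval b ± t Sb follows. It assumes the usual expression for the standard error of the slope, Sb = Se / √Σ(X − X̄)², with Se computed using n − 2 degrees of freedom; neither formula is given in the text above. The critical value t is passed in by the caller (read from a t table), and the function name and data are hypothetical.

```python
import math


def slope_interval(xs, ys, t_crit):
    """Return (b, lower, upper) for the interval b ± t * Sb."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope and intercept from deviations about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x

    # Standard error of estimate (n - 2 divisor), then standard error of b.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    s_b = s_e / math.sqrt(sxx)

    return b, b - t_crit * s_b, b + t_crit * s_b


# Hypothetical data; 3.182 is the two-tailed 95% t value for 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, lower, upper = slope_interval(xs, ys, t_crit=3.182)
print(f"b = {b:.2f}, 95% interval for B: ({lower:.2f}, {upper:.2f})")
```

With so few made-up observations the interval is wide and includes zero, which is a reminder that a sample slope on its own says little about the population slope B.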

How Good is the Regression

Strength of the Association To assess the strength of the association, we have to calculate the coefficient of determination, i.e. r² = SSR/SST, which shows the variation in Y explained by the regression compared to the total variation. It should be obvious that the greater r² is, the higher is the degree of association. The range of r² is 0 to 1, while r varies from –1 to +1.
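The ratio SSR/SST can be checked numerically. In the sketch below, SSR is taken as the sum of squared deviations of the estimated values from the mean of Y, and SST as the sum of squared deviations of the observed values from that mean; the function name and data are made up for the example.

```python
import math


def coefficient_of_determination(xs, ys):
    """Return (r_squared, r) for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x

    ssr = sum((a + b * x - mean_y) ** 2 for x in xs)   # explained variation
    sst = sum((y - mean_y) ** 2 for y in ys)           # total variation
    r_squared = ssr / sst

    # r carries the sign of the slope b.
    r = math.copysign(math.sqrt(r_squared), b)
    return r_squared, r


# Hypothetical sample data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_squared, r = coefficient_of_determination(xs, ys)
print(f"r^2 = {r_squared:.2f}, r = {r:.3f}")
```

For these values SSR = 3.6 and SST = 6, so r² = 0.6: about 60 per cent of the variation in Y is explained by the regression.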

Cautions in the Use of Regression Analysis
• The inclusion of one or two extreme items can completely change a given relationship between the variables. As such, extreme values should be excluded from the data.
• It is advisable to first draw a scatter diagram so that one can have an idea of the possible relationship between X and Y. In the absence of a scatter diagram, one may attempt a linear regression model when the given set of data actually shows a non-linear relationship.
• When predictions based on regression analysis are made, one should be sure that the nature and extent of the relationship between X and Y will remain the same. This assumption is at times completely overlooked, which may lead to errors in prediction.
• In many cases the regression line computed is a sample regression line. This implies that the constant a and the regression coefficient b are for the sample. It is advisable to make some refinement by providing an interval within which the true population regression line lies.
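The first caution, about extreme items, can be seen in a small numerical experiment. The sketch below fits the slope with and without a single extreme point; the data and the extreme point (6, −10) are invented purely for illustration.

```python
def slope(xs, ys):
    """Least-squares slope b, using deviations from the means of X and Y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data with a mild upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b_without = slope(xs, ys)

# The same data with one made-up extreme item added.
b_with = slope(xs + [6], ys + [-10])

print(f"slope without the extreme item: {b_without:.3f}")
print(f"slope with the extreme item:    {b_with:.3f}")
```

Here one added observation is enough to turn a positive slope into a negative one, which is why a scatter diagram is worth drawing before any line is fitted.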
