REGRESSION ANALYSIS
REGRESSION ANALYSIS • Regression analysis attempts to establish nature of relation between variables • Measure of average relation between two or more variables • Most frequently used technique in economics and business research
Historical Origin of Regression • Regression Analysis was first developed by Sir Francis Galton, who studied the relation between heights of sons and fathers. • Heights of sons of both tall and short fathers appeared to “revert” or “regress” to the mean of the group.
REGRESSION ANALYSIS • Statistical tool to estimate the unknown values of one variable from known values of another variable • Independent (X) and dependent variable (Y) • Simple linear regression analysis: only one predictor and straight line • Dependent and independent refer to the mathematical or functional meaning • Values of Y are dependent on values of X, X may or may not be causing change in Y
USES • Provides estimates of values of the dependent variable from values of the independent variable, using regression lines • Gives a measure of the error involved in using the regression line as a basis for estimation • The correlation coefficient can be calculated with the help of the regression coefficients
DIFFERENCES WITH CORRELATION • Correlation : Measures the degree of relationship, i.e. the degree of covariability • Regression : Studies the nature of the relationship • Correlation : Cannot tell which variable is cause and which is effect • Regression : One variable is dependent, the other independent
REGRESSION LINES • The two regression lines (Y on X and X on Y) cut each other at the point of the averages of X and Y • Drawn on the assumption of least squares
REGRESSION EQUATIONS • Regression equation of ‘Y’ on ‘X’ is expressed as:- • Y = a + bX • ‘Y’ is dependent variable, ‘X’ is independent • ‘a’ is ‘Y-Intercept’, ‘b’ is slope (change in Y for unit change in X) • Values of ‘a’ and ‘b’ by method of least squares
REGRESSION EQUATIONS • Least Squares Method : the line should be drawn through the plotted points in such a manner that the sum of squares of deviations of actual ‘y’ values from computed ‘y’ values ($y_e$) is the least • $\sum (y - y_e)^2$ should be minimum to obtain the best-fitting line
CHARACTERISTICS OF STRAIGHT LINE (BEST FIT) • Gives the best fit of data • $\sum (y - y_e)^2$ is minimum; the deviations above the line balance those below the line, so $\sum (y - y_e) = 0$ • Straight line goes through the overall mean of the data • For data representing a sample from a population, the least squares line is the ‘best’ estimate of the population regression line
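The properties listed above follow from the least squares condition itself. The derivation below is a sketch for the line y = a + bX; the index i and the symbols $S$, $\bar{x}$, $\bar{y}$ are notation introduced here, not taken from the slides.

```latex
% Minimise the sum of squared deviations of actual y from computed y
S(a, b) = \sum_i (y_i - a - b x_i)^2

% Set both partial derivatives to zero (the "normal equations")
\frac{\partial S}{\partial a} = -2 \sum_i (y_i - a - b x_i) = 0
\qquad
\frac{\partial S}{\partial b} = -2 \sum_i x_i (y_i - a - b x_i) = 0

% The first equation says the deviations sum to zero,
% and that the line passes through the point of means
\sum_i (y_i - a - b x_i) = 0 \;\Rightarrow\; \bar{y} = a + b \bar{x}

% Solving the pair for b and a
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b \bar{x}
```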
REGRESSION EQUATIONS • SIMILARLY, REGRESSION EQUATION OF ‘X’ ON ‘Y’ IS EXPRESSED AS:- • X = a + bY • ‘X’ IS DEPENDENT VARIABLE, ‘Y’ IS INDEPENDENT. • ‘a’ IS “X-INTERCEPT”, ‘b’ IS SLOPE (CHANGE IN ‘X’ FOR UNIT CHANGE IN ‘Y’). • FIND VALUES OF ‘a’ AND ‘b’ BY METHOD OF LEAST SQUARES.
EXPRESSION FOR A LINE • [Figure: plot of the line y = 4 + 0.3x through points P and Q; slope b = y’/x’ (change in y over change in x), a = intercept on the y-axis.]
REGRESSION ANALYSIS : LIMITATIONS • Assumption; relationship has not changed since regression equation was computed • Relationship shown by the scatter diagram may not be the same if equation is extended beyond the values used in computing the equation
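LINE OF BEST FIT • Regression equation is given by $\hat{y} = a + bx$, where $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$ • The numerator of the equation for b is called the Sum of Products, SPxy. • The denominator is the Sum of Squared Deviations from the mean, SSx. • The denominator is always positive, so the sign of the slope of the line is determined by the sign of the numerator.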
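The slope and intercept can be computed directly from the two sums named on this slide (SPxy and SSx). The sketch below assumes the standard least squares intercept a = ȳ − b·x̄ and uses the sleep-deprivation sample from the illustration later in the deck; the function name `fit_line` is illustrative only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator: Sum of Products, SPxy
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: Sum of Squared Deviations from the mean, SSx
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp_xy / ss_x
    a = y_bar - b * x_bar  # assumes the standard intercept formula
    return a, b


# Sample data from the deck's air traffic controller illustration
hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]

a, b = fit_line(hours, errors)
print(f"y = {a:.3f} + {b:.3f} x")
```

Because SSx is a sum of squares, a negative slope can only come from a negative SPxy, which is the point the last bullet makes.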
REGRESSION EQUATION FOR POINT ESTIMATE • If the number of hours of study is 4, what is the estimate of marks in the exam? • ‘Point Estimate’ of y using the regression equation: • Y = a + b * x • = 1.0277 + 5.1389 * 4 • = 21.58 • {The value of ‘x’ for which you wish to estimate y should lie within the range of the given data (i.e. 3-10)}. • Reliability of the Point Estimate depends on:- • Sample size. • Amount of variation within the sample. • The value of ‘x’ (how far it lies from the mean of x). • Therefore, an ‘Interval Estimate’ is generally more informative than a point estimate.
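The point estimate and the range warning can be combined in one small helper. This is a sketch: the coefficients and the 3-10 range come from the slide, while the function name `point_estimate` and the choice to raise an error outside the range are assumptions made for illustration.

```python
def point_estimate(a, b, x, x_min, x_max):
    """Return a + b*x, refusing to extrapolate beyond the observed x range."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x = {x} lies outside the data range [{x_min}, {x_max}]"
        )
    return a + b * x


# Coefficients and data range (3 to 10 hours) as given on the slide
marks = point_estimate(1.0277, 5.1389, 4, x_min=3, x_max=10)
print(round(marks, 2))
```

Raising an error is one possible design; a softer variant could return the estimate together with a warning flag.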
STD ERROR OF ESTIMATE (Measure of Goodness of Fit) (Std Error of Regression)
ASSUMPTIONS LINE
Assumptions of the Simple Linear Regression Model • LINE assumptions: Linear, Independent, Normal and Equal variance • [Figure: identical normal distributions of errors, all centered on the regression line $\mu_{y|x} = \alpha + \beta x$; at each x, y is distributed as $N(\mu_{y|x}, \sigma_{y|x}^2)$.]
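REPRESENTING STANDARD ERROR OF ESTIMATE • [Figure: regression line y = a + bx with bands at ±1 $s_{y,x}$, ±2 $s_{y,x}$ and ±3 $s_{y,x}$ around it; horizontal axis: independent variable X, vertical axis: dependent variable y.]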
STANDARD ERROR OF ESTIMATE • Standard error of estimate: $s_{y,x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$ • In the hours-of-study example, the standard error of estimate would be √2.884 = 1.698 marks. What does it mean?
INTERPRETING STD ERROR OF ESTIMATE • If the errors are normally distributed, we can expect to find: • 68.26% of the points (y values) within ±1 $s_{y,x}$ of the estimated y ($\hat{y}$) • 95.45% of the points within ±2 $s_{y,x}$ • 99.7% of the points within ±3 $s_{y,x}$ • The larger the std error of estimate, the greater the scattering of points around the regression line. • Conversely, if $s_{y,x}$ = 0, the estimating equation would be a perfect estimator of the dependent variable.
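A short sketch of how the standard error of estimate could be computed, and how many sample points fall within one standard error of the line. It assumes the usual n − 2 divisor and reuses the sleep-deprivation sample from the illustration later in the deck; the function name is illustrative.

```python
from math import sqrt
from statistics import mean


def standard_error_of_estimate(xs, ys):
    """Return (s_yx, residuals) for the least squares line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)  # unexplained variation
    # Assumption: n - 2 degrees of freedom (two estimated coefficients)
    return sqrt(sse / (len(xs) - 2)), residuals


hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]

s_yx, residuals = standard_error_of_estimate(hours, errors)
within_one = sum(1 for e in residuals if abs(e) <= s_yx)
print(f"s_yx = {s_yx:.3f}, points within 1 s_yx: {within_one} of {len(hours)}")
```

With only ten points the observed share will not match 68.26% exactly; the percentages describe what to expect under normally distributed errors.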
INTERVAL ESTIMATION • Interval estimate of y for an x value (for a given level of significance and sample size): $\hat{y} - t\,s_{y,x}$ to $\hat{y} + t\,s_{y,x}$, where t is the tabulated value for n − 2 degrees of freedom • Accuracy of this interval estimate depends on the distance of x from its mean ($\bar{x}$). • The closer the value of x is to $\bar{x}$, the more reliable the estimate • Hence, for x values other than $\bar{x}$, a correction factor is used
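CONFIDENCE INTERVAL FOR ESTIMATION OF MEAN • Confidence interval for the mean value of y at a given value $x_0$ (using the correction factor) is given by:- $\hat{y} - t\,s_{y,x}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}}$ to $\hat{y} + t\,s_{y,x}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}}$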
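PREDICTION INTERVAL FOR AN INDIVIDUAL Y VALUE • Interval for an individual value of y (and not the mean value of y) at a given value $x_0$ is given by:- $\hat{y} - t\,s_{y,x}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}}$ to $\hat{y} + t\,s_{y,x}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}}$ • Therefore the interval for an individual y is wider than the interval for the mean of y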
Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y • [Figure: regression line with a narrower band for the mean of Y and a wider band for an individual Y.]
AN ILLUSTRATION : LRCA Qn. A study was conducted by the Air Force on the effect of sleep deprivation on air traffic controllers’ performance whilst on watch. The sample data is as follows:
No of hrs w/o Sleep | No of Errors
8 | 8
8 | 6
12 | 6
12 | 10
16 | 8
16 | 14
20 | 14
20 | 12
24 | 16
24 | 12
Estimate the No of errors if the No of hrs w/o sleep were 10, at 95% CL.
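One way to work this illustration is sketched below. It assumes the standard interval formulas with n − 2 degrees of freedom and a two-sided 95% t value of 2.306 for 8 degrees of freedom, taken from a standard t-table; the function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Assumption: two-sided 95% critical t value for n - 2 = 8 degrees of freedom
T_CRIT = 2.306


def interval_estimates(xs, ys, x0, t_crit):
    """Return (point, ci_mean, pred_interval) for y at x0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = sqrt(sse / (n - 2))  # standard error of estimate
    point = a + b * x0
    # Correction factor grows as x0 moves away from the mean of x
    correction = 1 / n + (x0 - x_bar) ** 2 / ss_x
    half_mean = t_crit * s_yx * sqrt(correction)
    half_indiv = t_crit * s_yx * sqrt(1 + correction)
    return (point,
            (point - half_mean, point + half_mean),
            (point - half_indiv, point + half_indiv))


hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]

point, ci_mean, pred = interval_estimates(hours, errors, 10, T_CRIT)
print(f"point estimate: {point:.2f}")
print(f"95% interval for mean errors: {ci_mean[0]:.2f} to {ci_mean[1]:.2f}")
print(f"95% interval for an individual: {pred[0]:.2f} to {pred[1]:.2f}")
```

The interval for an individual value comes out wider than the interval for the mean because of the extra 1 under the square root.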
CORRELATION ANALYSIS • How strong is the relationship between the dependent and independent variables? • How are the variables correlated? • Statistical tool to describe the degree to which one variable is linearly related to another. • Measures for describing the correlation between two variables: - Coefficient of Determination, r² - Coefficient of Correlation, r
COEFFICIENT OF DETERMINATION • Measures extent or strength of association. • It is the % of explained variation in the dependent variable (y). • Coeff of Determination = (Total Variation − Unexplained Variation) / Total Variation = (SST − SSE) / SST • For the ATC case (No of errors and hrs w/o sleep), the sample data give SST = 112.4 and SSE = 40.2, so r² = (112.4 − 40.2) / 112.4 = 0.64. What does it mean? It means 64% of the variation in errors is explained by lack of sleep; the balance could be due to other factors such as poor training.
COEFFICIENT OF DETERMINATION • Measures extent or strength of association. • It is the % of explained variation in the dependent variable (y). • Coeff of Determination:- $r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$ • $r^2 = 0$ if $\hat{y} = \bar{y}$ for all values of x, showing no correlation. • $r^2 = 1$ if $\hat{y} = y$ for all values of x, showing perfect correlation.
CORRELATION ANALYSIS: INTERPRETING r² ANOTHER WAY • Interpret the coeff of determination by looking at the amount of the variation in y that can be explained by the regression line. • Total variation = Explained variation + Unexplained variation • [Figure: for a point y at a given x, the total variation $(y - \bar{y})$ splits into the explained variation $(\hat{y} - \bar{y})$ and the unexplained variation $(y - \hat{y})$.]
CORRELATION ANALYSIS: the Coefficient of Correlation, r • $r = \pm\sqrt{r^2}$ • Measures the strength of the relationship, i.e. how strongly the variables are related • Multiple r = 0.8 in the ATC case (Errors & Hrs w/o sleep) means a strong relationship between the two variables • The sign of ‘r’ is the same as the sign of the slope (b) of the regression line • A negative sign indicates an inverse relationship between the two variables
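The two correlation measures can be checked against the ATC sample. The sketch below computes r² as (SST − SSE) / SST and gives r the sign of the slope; the function name `determination_and_correlation` is illustrative.

```python
from math import copysign, sqrt
from statistics import mean


def determination_and_correlation(xs, ys):
    """Return (sst, sse, r2, r) for the least squares line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    sst = sum((y - y_bar) ** 2 for y in ys)  # total variation
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    r2 = (sst - sse) / sst
    r = copysign(sqrt(r2), b)  # r takes the sign of the slope b
    return sst, sse, r2, r


hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]

sst, sse, r2, r = determination_and_correlation(hours, errors)
print(f"SST = {sst:.1f}, SSE = {sse:.1f}, r2 = {r2:.2f}, r = {r:.2f}")
```

Running the same function on data with a downward trend returns a negative r, matching the inverse relationship described in the last bullet.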
PROPERTIES OF SAMPLE CORRELATION COEFFICIENT (r) • Ranges from -1 to +1. • Sign of r tells whether the relationship is positive or negative. • A larger absolute value of r indicates a stronger relationship. • An r value near zero indicates ‘no or poor’ linear relationship between x and y. • r = +1 or -1 indicates a perfect linear relationship. • r values of exactly 0, 1 or -1 are rare in practice.