15.15-17.00, December 10, 2012
Survey of Quantitative Research, NORSI
Session 3: Basic techniques for innovation data analysis
Part II: Introducing regression analysis
Taehyun Jung (taehyun.jung@circle.lu.se), CIRCLE, Lund University
Bivariate Linear Regression Model
The simple linear regression model (also called the bivariate or two-variable linear regression model):
\[ y = \beta_0 + \beta_1 x + u \]
• \(y\) = dependent variable, outcome variable, response variable, explained variable, predicted variable, regressand
• \(x\) = independent variable, explanatory variable, control variable, predictor variable, regressor, covariate
• \(u\) = error term, disturbance
• \(\beta_0\) = intercept parameter
• \(\beta_1\) = slope parameter
Bivariate Linear Regression Model
\(y\) and \(x\) are observable (we have data/observations on them); \(\beta_0\) and \(\beta_1\) are unobserved but estimable under certain conditions; \(u\) is unobservable.
• The model implies that \(u\) captures everything that determines \(y\) except for \(x\):
• Omission of the influence of innumerable chance events: systematic influences (specification error) and random influences (e.g. weather variations, the net influence of a large number of small and independent causes)
• Measurement error
• Human indeterminacy: inherent randomness in human behavior
"Preferred" or good estimators of \(\beta_0\) and \(\beta_1\)
Parameters cannot be calculated, only estimated, because we do not know the actual values of the disturbances inherent in a sample of data.
Estimator: the method of estimation, i.e. the formula or recipe by which the data are transformed into an actual estimate.
• Notation conventions for estimates: \(\hat{\beta}_0\) and \(\hat{\beta}_1\), or \(b_0\) and \(b_1\) (the Ordinary Least Squares estimator)
What makes a "good" or "preferred" estimator?
• Computational cost
• Least squares
• Highest \(R^2\)
• Unbiasedness
• Efficiency
Least squares residuals
Minimize the residuals:
• Residual \(\hat{u}_i\) = actual value \(y_i\) of the dependent variable minus fitted value \(\hat{y}_i\) of the dependent variable, i.e. \(\hat{u}_i = y_i - \hat{y}_i\)
• The criterion is to minimize the sum of squared residuals, \(SSR = \sum_i \hat{u}_i^2\)
• This criterion is, by construction, always met by the Ordinary Least Squares (OLS) estimator
Highest R2
How much of the variation in the dependent variable is explained by variation in the independent variables? The OLS estimator minimizes SSR and therefore automatically maximizes \(R^2\).
• Do not use \(R^2\) for determining the proper functional form or the appropriate independent variables.
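For reference, the standard definition behind this (a reconstruction, using SST for the total and SSR for the residual sum of squares, which correspond to the Total and Residual rows of the Stata output shown later in this section):
\[ R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]
With the earnings regression below, \(1 - 92688.67/112010.23 = 0.1725\), exactly as Stata reports.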
Unbiasedness
The sampling distribution is centered over the true population parameter:
• The expected value of the estimated parameter is equal to the true value of the parameter: \(E(\hat{\beta}) = \beta\)
• Equivalently, \(\beta\) is the mean of the sampling distribution of \(\hat{\beta}\)
• Unbiasedness is only one of several good properties that the sampling distribution of an estimator can have
The OLS criterion (so far) has nothing to do with the sampling distribution:
• We need further assumptions to make the OLS estimator unbiased
• How the disturbance is distributed is the most important of these
Efficiency
Would you prefer to obtain your estimate by making a single random draw out of an unbiased sampling distribution with a small variance, or out of an unbiased sampling distribution with a large variance? The unbiased estimator with the smallest variance is called efficient.
BLUE: Best Linear Unbiased Estimator.
Each value of \(Y\) thus has a non-random component, \(\beta_0 + \beta_1 X_i\), and a random component, \(u_i\). The first observation has been decomposed into these two components.
The discrepancies between the actual and fitted values of Y are known as the residuals.
• Note that the values of the residuals are not the same as the values of the disturbance term: the residuals are computed from the fitted line, whereas the disturbances are deviations from the unknown true line.
Deriving the linear regression coefficients: conditions for minimizing SSR
To minimize \(SSR = \sum_i (y_i - b_0 - b_1 x_i)^2\), set the partial derivatives with respect to \(b_0\) and \(b_1\) equal to zero (the first-order conditions).
We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for \(b_0\) and \(b_1\).
[Figure: the true model \(Y = \beta_0 + \beta_1 X + u\) and the fitted line \(\hat{Y} = b_0 + b_1 X\), plotted against \(X\) for observations \(X_1, \ldots, X_n\).]
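Solving the two first-order conditions yields the familiar closed-form expressions (a standard result, stated here because the slide's own formulas were lost in extraction):
\[ b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]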
[Figure: hourly earnings in 2002 plotted against years of schooling, defined as highest grade completed, for a sample of 540 respondents from the National Longitudinal Survey of Youth.]
Interpretation of a regression equation
In this case there is only one variable, S, and its coefficient is 2.46. _cons, in Stata, refers to the constant. The estimate of the intercept is -13.93.

. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------
Interpretation of a regression equation
The coefficient on S indicates that hourly earnings increase by $2.46 for each extra year of schooling. Literally, the constant indicates that an individual with no years of education would have to pay $13.93 per hour to be allowed to work.
• Nonsense!
• The only function of the constant term is to enable you to draw the regression line at the correct height on the scatter diagram.
Testing a hypothesis relating to a regression coefficient
You can see that the t statistic for the coefficient of S is enormous. We would reject the null hypothesis that schooling does not affect earnings at the 1% significance level (critical value about 2.59). (Stata output as above.)
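The t statistic is simply the estimated coefficient divided by its standard error; with the numbers from the output:
\[ t = \frac{b_1}{\mathrm{s.e.}(b_1)} = \frac{2.455321}{0.2318512} \approx 10.59 \]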
Testing a hypothesis relating to a regression coefficient: confidence intervals
The critical value of t at the 5% significance level with 538 degrees of freedom is 1.965, so the 95% confidence interval for the slope parameter is
2.455 - 0.232 x 1.965 ≤ β1 ≤ 2.455 + 0.232 x 1.965
that is, 1.999 ≤ β1 ≤ 2.911, matching the interval reported in the Stata output above.
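A minimal sketch of verifying this interval directly in Stata (assuming the same dataset is in memory; invttail returns the upper-tail critical value of the t distribution):

. display invttail(538, 0.025)
. display 2.455321 - invttail(538, 0.025)*0.2318512
. display 2.455321 + invttail(538, 0.025)*0.2318512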
Hypotheses concerning goodness of fit are tested via the F statistic
The null hypothesis that we are going to test is that the model has no explanatory power. Here k is the number of parameters in the regression equation, which at present is just 2, and n - k is, as with the t statistic, the number of degrees of freedom:
\[ F(k-1,\, n-k) = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} \]
F is a monotonically increasing function of \(R^2\).
• Why do we perform the test indirectly, through F, instead of directly through \(R^2\)? After all, it would be easy to compute the critical values of \(R^2\) from those for F.
For simple regression analysis, the F statistic is the square of the t statistic.
Calculation of the F statistic
(Stata output as above: F(1, 538) = 112.15, Prob > F = 0.0000.)
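Plugging the reported numbers into the F formula, and checking the F = t-squared identity from the previous slide:
\[ F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} = \frac{0.1725/1}{0.8275/538} \approx 112.15, \qquad t^2 = 10.59^2 \approx 112.1 \]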
Assumptions for OLS 1:
A.1 The model is linear in parameters and correctly specified.
"Linear in parameters" means that each term on the right side includes a \(\beta\) as a simple factor and there is no built-in relationship among the \(\beta\)s.
• An example of a model that is not linear in parameters (a standard illustration; the slide's own examples were lost in extraction): \(Y = \beta_0 + \beta_1 X^{\beta_2} + u\), where \(\beta_2\) enters as an exponent.
A.2 There is some variation in the regressor in the sample.
If we tried to regress Y on X when X is constant, that is, if \(x_i = \bar{x}\) for all \(i\), we would find that we could not compute the regression coefficients: both the numerator and the denominator of the expression for \(b_1\) would be equal to zero, and we could not obtain \(b_0\) either.
A.3 The disturbance term has zero expectation: \(E(u_i) = 0\) for all \(i\).
Sometimes the disturbance term will be positive, sometimes negative, but it should not have a systematic tendency in either direction. Actually, if an intercept is included in the regression equation, it is usually reasonable to assume that this condition is satisfied automatically: the role of the intercept is to pick up any systematic but constant tendency in Y not accounted for by the regressor(s).
A.4 The disturbance term is homoscedastic.
We assume that the disturbance term is homoscedastic, meaning that its value in each observation is drawn from a distribution with constant population variance: \(\mathrm{Var}(u_i) = \sigma_u^2\) for all \(i\). Once we have generated the sample, the disturbance term will turn out to be greater in some observations and smaller in others, but there should not be any reason for it to be more erratic in some observations than in others.
Consequences of using OLS in the presence of heteroskedasticity
OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE. If we still use OLS in the presence of heteroskedasticity, our standard errors could be inappropriate, and hence any inferences we make could be misleading. Whether the standard errors calculated using the usual formulae are too big or too small depends upon the form of the heteroskedasticity.
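A minimal sketch of how one might diagnose and respond to this in Stata (assuming the earnings data used above; estat hettest runs the Breusch-Pagan test after regress, and vce(robust) requests heteroskedasticity-robust standard errors, a standard remedy not covered on the slide):

. reg EARNINGS S EXP
. estat hettest
. reg EARNINGS S EXP, vce(robust)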
An earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP.
• Note that the interpretation of the model does not depend on whether S and EXP are correlated or not.
• However, we do assume that the effects of S and EXP on EARNINGS are additive: the impact of a difference in S on EARNINGS is not affected by the value of EXP, or vice versa. The model is written out below.
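The model itself, whose equation was lost in extraction, is the standard additive form implied by the text:
\[ \text{EARNINGS}_i = \beta_0 + \beta_1 S_i + \beta_2 \text{EXP}_i + u_i \]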
Calculating the regression coefficients
The expression for the intercept \(b_0\) is a straightforward extension of its counterpart in simple regression analysis. However, the expressions for the slope coefficients are considerably more complex than the expression for the slope coefficient in simple regression analysis.
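For reference, with two regressors the OLS slope estimate takes the following form (a standard result; the slide's own expression was lost in extraction):
\[ b_1 = \frac{\sum (x_{1i}-\bar{x}_1)(y_i-\bar{y}) \sum (x_{2i}-\bar{x}_2)^2 - \sum (x_{2i}-\bar{x}_2)(y_i-\bar{y}) \sum (x_{1i}-\bar{x}_1)(x_{2i}-\bar{x}_2)}{\sum (x_{1i}-\bar{x}_1)^2 \sum (x_{2i}-\bar{x}_2)^2 - \left( \sum (x_{1i}-\bar{x}_1)(x_{2i}-\bar{x}_2) \right)^2} \]
with \(b_2\) obtained symmetrically and \(b_0 = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2\).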
Interpretation of a regression equation
The estimates indicate that earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience. The intercept of -26.49 would literally mean negative hourly earnings for an individual with no schooling and no experience. Obviously, this is impossible. The lowest value of S in the sample was 6; we have obtained a nonsense estimate because we have extrapolated too far from the data range.

. reg EARNINGS S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
Properties of the multiple regression coefficients: the assumptions (only A.2 differs from the simple regression case)
A.1 The model is linear in parameters and correctly specified.
A.2 There does not exist an exact linear relationship among the regressors in the sample.
A.3 The disturbance term has zero expectation.
A.4 The disturbance term is homoscedastic.
A.5 The values of the disturbance term have independent distributions.
A.6 The disturbance term has a normal distribution.
Multicollinearity
The inclusion of the new term EXPSQ has had a dramatic effect on the coefficient of EXP. The high correlation between EXP and EXPSQ causes the standard error of EXP to be larger than it would have been had they been less highly correlated, warning us that the point estimate is unreliable.

. reg EARNINGS S EXP EXPSQ

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------

. reg EARNINGS S EXP

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------

. cor EXP EXPSQ
(obs=540)

             |      EXP    EXPSQ
-------------+------------------
         EXP |   1.0000
       EXPSQ |   0.9812   1.0000
When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors, and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.
• Nevertheless, the standard errors and t tests remain valid.
Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two, an approximate linear relationship means there will be a high correlation, but this is not always the case when there are more than two.
Note that multicollinearity does not cause the regression coefficients to be biased. A diagnostic sketch follows.
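A sketch of the usual follow-up diagnostic in Stata (assuming the dataset above; estat vif reports variance inflation factors after regress, a check not shown on the slides; skip the gen line if EXPSQ already exists):

. gen EXPSQ = EXP^2
. reg EARNINGS S EXP EXPSQ
. estat vif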
What can be done about multicollinearity?
• Reduce the variance of the disturbance term by including further relevant variables in the model
• Increase the number of observations
• Increase MSD(X), the variation in the explanatory variables: for example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households
• Reduce the correlation between the explanatory variables (see the sketch after this list)
• Combine the correlated variables
• Drop some of the correlated variables. However, this approach to multicollinearity is dangerous: it is possible that some of the variables with insignificant coefficients really do belong in the model, and that the only reason their coefficients are insignificant is the multicollinearity
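One concrete illustration of reducing the correlation between regressors (a standard trick for polynomial terms, offered here as a sketch rather than something taken from the slides): center EXP before squaring. This typically cuts the 0.9812 correlation between EXP and EXPSQ dramatically without changing the fitted values.

. summarize EXP
. gen EXPC = EXP - r(mean)    // experience centered at its sample mean
. gen EXPCSQ = EXPC^2         // square of the centered variable
. correlate EXPC EXPCSQ       // usually far below the 0.9812 seen for EXP, EXPSQ
. reg EARNINGS S EXPC EXPCSQ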
Kennedy's 10 commandments of applied econometrics
• Look long and hard at your results; apply the laugh test.
• Beware the costs of data mining (e.g. tailoring one's specification to the data, resulting in a specification that is misleading).
• Be prepared to compromise: should a proxy be used? Can sample attrition be ignored?
• Do not confuse significance with substance.
• Report a sensitivity analysis.
• Use common sense and economic theory.
• Avoid Type III errors (producing the right answer to the wrong question is called a Type III error); place relevance before mathematical elegance.
• Know the context: do not perform ignorant statistical analyses.
• Inspect the data: place data cleanliness ahead of econometric godliness.
• Keep it sensibly simple: do not talk Greek without knowing the English translation.