
Class 5 Multiple Regression

SKEMA Ph.D programme 2010-2011. Class 5 Multiple Regression. Lionel Nesta Observatoire Français des Conjonctures Economiques Lionel.nesta@ofce.sciences-po.fr. Introduction to Regression.


Presentation Transcript


  1. SKEMA Ph.D. programme 2010-2011 Class 5 Multiple Regression Lionel Nesta Observatoire Français des Conjonctures Economiques Lionel.nesta@ofce.sciences-po.fr

  2. Introduction to Regression • Typically, the social scientist deals with multiple and complex webs of interactions between variables. An immediate and appealing extension of simple linear regression is to enlarge the set of explanatory variables. • Multiple regression includes several explanatory variables in the empirical model

  3. Introduction to Regression (continued)

  4. OLS chooses the parameter estimates so as to minimize the sum of squared errors

  5. Multivariate Least Squares Estimator • Usually, the multivariate model is written in matrix notation: y = Xβ + u • With the following least squares solution: β̂ = (X′X)⁻¹X′y
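The matrix solution above can be checked numerically. The sketch below is illustrative only: the data are made up, and the true parameters (1, 2, −1) are an assumption of the example, chosen so that the closed-form estimator recovers them exactly.

```python
import numpy as np

# Toy data: y is an exact linear function of x1 and x2, so the
# estimates recover the assumed true parameters (beta0=1, beta1=2, beta2=-1).
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1, x2])  # first column: intercept
y = 1.0 + 2.0 * x1 - 1.0 * x2

# Least squares solution: beta_hat = (X'X)^(-1) X'y
# (solve() is used instead of an explicit inverse, which is numerically safer)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [ 1.  2. -1.]
```

With noisy data the estimates would only approximate the true values; here they match because the model fits exactly.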

  6. Assumption OLS 1 Linearity The model is linear in its parameters • It is possible to apply non-linear transformations to the variables (e.g. the log of x), but not to the parameters: a model that is non-linear in its parameters cannot be estimated by OLS

  7. Assumption OLS 2 Random Sampling The n observations are a random sample of the whole population • There is no selection bias in the sample: the results pertain to the whole population • All observations are independent from one another (no serial or cross-sectional correlation)

  8. Assumption OLS 3 No Perfect Collinearity There is no perfect collinearity between independent variables • No independent variable is constant: each has a variance which, together with the variance of the dependent variable, is used to compute the parameters • No exact linear relationships amongst independent variables
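Why this assumption matters can be shown directly: if one regressor is an exact linear combination of the others, X′X is singular and the least squares formula breaks down. A minimal sketch with made-up data:

```python
import numpy as np

# x3 is an exact linear combination of x1 and x2, so the design matrix
# loses a rank and (X'X)^(-1) does not exist.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
x3 = x1 + x2                        # perfect collinearity
X = np.column_stack([np.ones_like(x1), x1, x2, x3])

rank = np.linalg.matrix_rank(X)     # 3, although X has 4 columns
print(rank, X.shape[1])
```

Dropping the redundant column (or one category of a set of dummies, as later slides show) restores full rank.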

  9. Assumption OLS 4 Zero Conditional Mean The error term u has an expected value of zero • Given any values of the independent variables (IV), the error term must have an expected value of zero. • In this case, all independent variables are exogenous. Otherwise, at least one IV suffers from an endogeneity problem.

  10. Sources of endogeneity • Wrong specification of the model • Omitted variable correlated with one RHS. • Measurement errors of RHS • Mutual causation between LHS and RHS • Simultaneity

  11. Assumption OLS 5 Homoskedasticity The variance of the error term, u, conditional on RHS, is the same for all values of RHS. Otherwise we speak of heteroskedasticity.

  12. Assumption OLS 6 Normality of error term The error term is independent of all RHS and follows a normal distribution with zero mean and variance σ²

  13. Assumptions OLS OLS1 Linearity OLS2 Random Sampling OLS3 No perfect Collinearity OLS4 Zero Conditional Mean OLS5 Homoskedasticity OLS6 Normality of error term

  14. Theorem 1 • OLS1 – OLS4: Unbiasedness of OLS. The expected value of the estimated parameters equals the true unknown values: E(β̂) = β

  15. Theorem 2 • OLS1 – OLS5: Variance of the OLS estimator. The variance of the OLS estimator is Var(β̂j) = σ² / [SSTj (1 – R²j)], where SSTj is the total variation in xj and R²j is the R-squared from regressing xj on all other independent variables. But how can we measure σ²?

  16. Theorem 3 • OLS1 – OLS5: The standard error of the regression is defined as σ̂ = √(SSR / (n – k – 1)), where SSR is the sum of squared residuals, n the number of observations and k the number of regressors. This is also called the standard error of the estimate or the root mean squared error (RMSE)
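The RMSE computation can be sketched on a tiny made-up dataset (four observations, one regressor plus an intercept), dividing the sum of squared residuals by n − k − 1 degrees of freedom:

```python
import numpy as np

# Simple regression on 4 points; k = 1 regressor plus an intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
n, n_params = X.shape                      # n_params = k + 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
ssr = residuals @ residuals                # sum of squared residuals
sigma_hat = np.sqrt(ssr / (n - n_params))  # RMSE with n - k - 1 df
print(round(sigma_hat, 4))  # 0.3162
```

Here β̂ = (1.1, 0.6), the residuals are (−0.1, 0.3, −0.3, 0.1), SSR = 0.2, and σ̂ = √(0.2 / 2) ≈ 0.316.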

  17. Standard Error of Each Parameter • Combining theorems 2 and 3 yields the standard error of each parameter: se(β̂j) = σ̂ / √[SSTj (1 – R²j)]

  18. Theorem 4 • Under assumptions OLS1 – OLS5, the OLS estimators are the Best Linear Unbiased Estimators (BLUE) of the parameters β. This result is the Gauss-Markov theorem, which stipulates that under OLS1 – OLS5, OLS is the best estimation method • The estimates are unbiased (OLS1 – OLS4) • The estimates have the smallest variance among linear unbiased estimators (OLS5)

  19. Theorem 5 • Under assumptions OLS1 – OLS6, the standardized OLS estimates follow a t distribution: (β̂j – βj) / se(β̂j) ~ t(n – k – 1)

  20. Extension of theorem 5: Inference • We can define the 95% confidence interval of βj as β̂j ± tα/2 × se(β̂j). If the 95% CI does not include 0, then βj is significantly different from 0.

  21. Student t Test for H0: βj = 0 • We are also in a position to draw inferences on βj • H0: βj = 0 • H1: βj ≠ 0 Decision rule: Accept H0 if | t | < tα/2 Reject H0 if | t | ≥ tα/2, where t = β̂j / se(β̂j)
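The test and the confidence interval of the previous slide can be sketched together. The estimate and its standard error below are hypothetical numbers invented for the example, and 1.96 is used as an approximate large-sample 5% two-sided critical value (with small n the exact t critical value would be larger):

```python
# Illustrative t test for H0: beta_j = 0 (hypothetical estimate and se).
beta_hat_j = 0.25
se_j = 0.08
t_stat = beta_hat_j / se_j             # 3.125

t_crit = 1.96   # approximate 5% two-sided critical value, large n
reject_h0 = abs(t_stat) >= t_crit      # True: beta_j significantly != 0

# 95% confidence interval: beta_hat +/- t_crit * se
ci = (beta_hat_j - t_crit * se_j, beta_hat_j + t_crit * se_j)
print(reject_h0, ci)
```

The interval (≈ 0.093 to 0.407) excludes 0, which agrees with rejecting H0 at the 5% level.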

  22. Summary OLS1 Linearity OLS2 Random Sampling OLS3 No perfect Collinearity OLS4 Zero Conditional Mean OLS5 Homoskedasticity OLS6 Normality of error term T1 Unbiasedness T2-T4 BLUE T5 β ~ t

  23. Application 1: seminal model The knowledge production function

  24. Application 1: baseline model

  25. Application 2: Changing specification The knowledge production function

  26. Application 2: Changing specification

  27. Application 3: Adding variables The knowledge production function

  28. Application 3: Adding variables

  29. Qualitative variables used as independent variables

  30. Qualitative variables as indep. variables • Qualitative variables • Dummy variables • Generating dummy variables using STATA • Interpretation of coefficients in OLS • Interaction effects between continuous and dummy var.

  31. Qualitative variables • Qualitative variables provide information on discrete characteristics • The number of categories taken by a qualitative variable is generally small • These can be numerical values, but each number denotes an attribute – a characteristic • A qualitative variable may have several categories • Two categories: male – female • Three categories: nationality (French, German, Turkish) • More than three categories: sectors (car, chemical, steel, electronic equip., etc.)

  32. Qualitative variables • There are several ways to code a qualitative variable with n categories • Using one categorical variable • Producing n – 1 dummy variables • A dummy variable is a variable which takes the value 0 or 1 • Dummy variables are also called binary or dichotomous variables

  33. Qualitative variables • Coding using one categorical variable • Two categories: we generate a categorical variable called “gender” set to 1 if the observation is female, 2 if male • Three categories: we generate a categorical variable called “country” set to 1 if the observation is French, 2 if German, 3 if Turkish • More than three categories: we generate a categorical variable called “sector” set to 1 if the observation is in the car industry, 2 for the chemical industry, 3 for the steel industry, 4 for the electronic equip. industry, etc. This requires the use of labels in order to know which category a given number denotes

  34. Labelling variables • Labelling is tedious, boring and uninteresting • But there are clear consequences when one must interpret the results • label variable. Describes a variable, qualitative or quantitative • label variable asset "real capital" • label define. Defines a label (the meaning of the numbers) • label define firm_type 1 "biotech" 0 "Pharma" • label values. Applies the label to a given variable • label values type firm_type

  35. Example of labelling
***** CREATION OF INDUSTRY LABELS *****
egen industrie = group(isic_oecd)
#delimit ;
label define induscode 1 "Text. Habill. & Cuir" 2 "Bois" 3 "Pap. Cart. & Imprim." 4 "Coke Raffin. Nucl." 5 "Chimie" 6 "Caoutc. Plast." 7 "Aut. Prod. min." 8 "Métaux de base" 9 "Travail des métaux" 10 "Mach. & Equip." 11 "Bureau & Inform." 12 "Mach. & Mat. Elec." 13 "Radio TV Telecom." 14 "Instrum. optique" 15 "Automobile" 16 "Aut. transp." 17 "Autres";
#delimit cr
label values industrie induscode

  36. Exercise • Open SKEMA_BIO.dta • Create variable firm_type from type • Label variable firm_type • Define a label for firm_type and apply it

  37. Dummy variables • Coding categorical variables using dummy variables only • Two categories • We generate one dummy variable “female” set to 1 if the obs. is female, 0 otherwise • We generate one dummy variable “male” set to 1 if the obs. is male, 0 otherwise • But one of the dummy variables is simply redundant: when female = 0, then necessarily male = 1 (and vice versa) • Hence with two categories, we only need one dummy variable

  38. Dummy variables • Coding categorical variables using dummy variables only • Three categories • We generate one dummy variable “France” set to 1 if the obs. is French, 0 otherwise • We generate one dummy variable “Germany” set to 1 if the obs. is German, 0 otherwise • We generate one dummy variable “Turkish” set to 1 if the obs. is Turkish, 0 otherwise • But one of the dummy variables is simply redundant: when France = 0 and Germany = 0, then Turkish = 1. For a variable with n categories, we must create n – 1 dummy variables, each representing one particular category
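The n − 1 coding can be sketched outside Stata as well. Below, a made-up list of observations is turned into two dummies, with "TURKEY" arbitrarily chosen as the omitted reference category:

```python
# Build n-1 dummy variables from a categorical list, leaving one
# category (here "TURKEY") as the omitted reference group.
countries = ["FRANCE", "GERMANY", "TURKEY", "FRANCE", "GERMANY"]
categories = sorted(set(countries))   # ['FRANCE', 'GERMANY', 'TURKEY']
reference = "TURKEY"                  # omitted category

dummies = {
    c: [1 if obs == c else 0 for obs in countries]
    for c in categories
    if c != reference
}
print(dummies)
# {'FRANCE': [1, 0, 0, 1, 0], 'GERMANY': [0, 1, 0, 0, 1]}
```

An observation with FRANCE = 0 and GERMANY = 0 is then, by construction, in the reference category.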

  39. Generation of dummies with STATA • Using the if condition • generate DEU = 0 • replace DEU = 1 if country=="GERMANY" • generate LDF = 1 if size > 100 • replace LDF = 0 if size < 101 • Avoiding the use of the if condition • generate FRA = country=="FRANCE" • generate LDF = size > 100

  40. Generation of dummies with STATA • With n categories and n being large, generating dummy variables can become really tedious • The tabulate command has a very convenient extension, since it will generate n dummy variables at once • tabulate varcat, gen(v_) • tabulate country, gen(c_) • This will create n dummy variables, with n being the number of countries in the dataset, c_1 being the first country, c_2 the second, c_3 the third, etc.

  41. Reading coefficients of dummy variables • Remember! A coefficient tells us the increase in y associated with a one-unit increase in x, other things held constant (ceteris paribus) • Suppose the knowledge production function has “y” as the number of patents and “biotech” as a dummy variable set to 1 for biotech firms, 0 otherwise

  42. Reading coefficients of dummy variables • If the firm is a biotech company, then the dummy variable “biotech” is equal to unity, and the expected outcome includes the dummy's coefficient • If the firm is a pharma company, then the dummy variable “biotech” is equal to zero, and the dummy's coefficient drops out of the expected outcome

  43. Reading coefficients of dummy variables • The coefficient reads as the variation in the dependent variable when the dummy variable is set to 1 relative to the situation where the dummy variable is set to 0. • With two categories, I must introduce one dummy variable. • With three categories, I must introduce two dummy variables. • With n categories, I must introduce (n-1) dummy variables.
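This reading can be verified numerically. In the sketch below the patent counts are invented; with only an intercept and the dummy in the model, the dummy's coefficient equals the difference in group means (biotech minus pharma):

```python
import numpy as np

# Made-up data: 3 pharma firms (dummy = 0), 3 biotech firms (dummy = 1).
patents = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
biotech = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(biotech), biotech])

beta_hat = np.linalg.solve(X.T @ X, X.T @ patents)
print(beta_hat)  # [2. 3.]: pharma mean is 2, biotech mean is 2 + 3 = 5
```

The intercept is the mean of the omitted (pharma) group, and the dummy coefficient is the biotech group's deviation from it.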

  44. Exercise • Regress the following model: • Predict the number of patents for both biotech and pharma companies • Produce descriptive statistics of PAT for each type of company using the command table • What do you observe?

  45. Reading coefficients of dummy variables • For semi logarithmic forms (log Y), coefficient β must be read as an approximation of the percent change in Y associated with a variation of 1 unit of the explanatory variable. • This approximation is acceptable for small β (β < 0.1). When β is large (β ≥ 0.1), the exact percent change in Y is: 100 × (eβ – 1)
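The gap between the approximate and the exact reading can be checked directly for a small and a large coefficient:

```python
import math

# Approximate vs exact percent change in Y for a semi-log model.
for b in (0.05, 0.5):
    approx = 100 * b                    # rule-of-thumb reading of beta
    exact = 100 * (math.exp(b) - 1)     # exact percent change
    print(b, approx, round(exact, 2))
# b = 0.05: approx 5.0,  exact 5.13  (approximation acceptable)
# b = 0.5:  approx 50.0, exact 64.87 (approximation breaks down)
```

This illustrates the slide's rule: below roughly 0.1 the two readings nearly coincide, above it the exact formula should be used.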

  46. Application 4: dummy variable The knowledge production function

  47. Application 4: dummy variable

  48. Application 4: dummy variable Patent ln(PAT) size

  49. Application 5: Interacting variables The knowledge production function

  50. Application 5: Interacting variables
