Fitting the Data Lecture 2
Today’s Plan • Finishing off the examples from Lecture 1 • Introducing different types of data • Fitting the data • One of the most important lectures of the course • There will be a question on this on a midterm and the final! (Almost guaranteed!) • You can find this material in Appendix 4.2
Experimental vs Observational • Because of financial/practical/ethical concerns, experiments in economics are rare (examples: SIME/DIME, Tennessee STAR). • Economists tend to use observational data, obtained from real world behavior and collected using surveys/administrative records. • Observational data pose problems: causal effects are hard to estimate without random assignment, and data definitions often do not quite match what economic theory requires. • Much of econometrics is devoted to estimation in the presence of the problems that observational data create.
Cross-Section Data • We have already seen 2 examples of cross-section data: • Wages and years of education • Voting polls in Florida • Cross-section data sets provide information about individual/agent behavior at a moment in time • The Current Population Survey is a cross-section survey that generates monthly detail about the US work force • Data on counties, states, or even countries at a moment in time are also cross-section data.
Time Series Data Sets (1) • Time series data sets provide information about individual/agent behavior over time • A time unit of observation (day, week, month, year) defines a time series • We hear about time series data every day: • Nasdaq • Financial Times Stock Exchange Index (FTSE) • Dow Jones • Government data: GDP/Unemployment/Inflation
Time Series Data Sets (2) • Composition of unit can change • FTSE gives information on the top 100 stocks each day, not necessarily the same 100 stocks every day • CPS gives monthly data on the number of people who are unemployed. Not the same people (we hope!) from month to month. • Characteristics of time series data sets • set of observations over time • composition of unit can change • compositional changes are dealt with using weighting schemes (Lecture 3)
Longitudinal Data Sets • Longitudinal data sets provide information on a particular group of individuals/agents over time. • For example: following Econ140, Fall 2002 over time. Alternatively, a set of firms over time. • Example we will use: Production functions (Cobb-Douglas) - following firms over time. • Book example: Traffic Deaths and Alcohol Taxes - following states over time.
Ordinary Least Squares (OLS) • Learning how to calculate a straight line (Appendix 4.2) • Recall the scatter plot of earnings vs. years of education: there was a mess of data! • We can use Ordinary Least Squares (OLS) to fit a straight line through these data points • This line is called the least squares line or line of best fit • Why is it called the ‘least squares line’? • The least squares line minimizes the sum of squared errors: of all possible straight lines, the OLS regression line has the smallest total squared vertical distance between the data points and the line
Two Parts to OLS 1. Derive estimators for a (intercept) and b (slope coefficient) • this means using differential calculus! 2. Calculate values for a and b from data • this means mechanically using the derived formulas for a & b • How to calculate a regression line through a mass of data points that do not necessarily lie on a straight line? • Each data point (X,Y) has a value.
OLS Line • We’ll call the regression line $\hat{Y}$ • this is an estimate of the true Y • The errors will be the difference between $\hat{Y}$ and Y • errors can be positive or negative • We can write the following general equations: $\hat{Y}_i = a + bX_i$, $Y_i = a + bX_i + e_i$, and $e_i = Y_i - \hat{Y}_i$, where i = 1 … n.
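A minimal sketch of these definitions, assuming made-up data and an arbitrary candidate line (the values of `X`, `Y`, `a`, and `b` are hypothetical, not the course data set). It computes the fitted values $\hat{Y}_i = a + bX_i$ and the errors $e_i = Y_i - \hat{Y}_i$, and shows that errors come out with both signs.

```python
# Hypothetical data and a hypothetical candidate line; not the course data set.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = 2.0, 0.5  # an arbitrary guess, not the OLS estimates


def fitted_values(a, b, xs):
    """Return Y-hat_i = a + b * X_i for every observation."""
    return [a + b * x for x in xs]


def errors(ys, y_hats):
    """Return e_i = Y_i - Y-hat_i for every observation."""
    return [y - y_hat for y, y_hat in zip(ys, y_hats)]


Y_hat = fitted_values(a, b, X)
e = errors(Y, Y_hat)

for x, y, y_hat, e_i in zip(X, Y, Y_hat, e):
    print(f"X={x}  Y={y}  Y_hat={y_hat}  e={e_i}")
```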
OLS Line • A data set example is available at the course web site. It consists of five points. Using that output I can calculate the regression equation. • Keeping this equation in mind we can find estimates of a and b given our general formulas for $Y$ and $\hat{Y}$ • We derive a and b from two different types of regression equations: a from $Y_i = a + e_i$ (intercept only) and b from $Y_i = bX_i + e_i$ (slope only)
OLS Line: Deriving a (1) • We can rewrite $Y_i = a + e_i$ as $e_i = Y_i - a$ • We could write the objective function for a as the sum of the errors: $\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - a)$ • Go back to the regression analysis example: notice that the sum of errors is zero! • Why? The positive and negative errors from the line of best fit always cancel out • For a minimum you need a first order condition (FOC) set to zero. • We need an objective whose FOC we set to zero, not a sum that is already zero to start with!
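OLS Line: Deriving a (2) • We can’t just minimize the sum of the errors because positive and negative errors cancel: $\sum_{i=1}^{n} e_i = 0$ at the line of best fit • Instead, we have to minimize the sum of the errors squared (hence - least squares): $\min_a \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - a)^2$, where $e_i = Y_i - a$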
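To see the difference numerically, here is a small sketch with hypothetical Y values (not the course data set). For the intercept-only model the raw sum of errors is $\sum Y_i - na$, which just keeps falling as a rises, so it has no minimum. The sum of squared errors, by contrast, is smallest at one particular a.

```python
# Hypothetical Y values; not the course data set.
Y = [2.0, 4.0, 5.0, 4.0, 5.0]


def sum_errors(a, ys):
    """Raw sum of errors for the intercept-only model Y_i = a + e_i."""
    return sum(y - a for y in ys)


def sum_squared_errors(a, ys):
    """Sum of squared errors for the same model."""
    return sum((y - a) ** 2 for y in ys)


candidates = [0.0, 3.0, 4.0, 5.0, 100.0]
raw = [sum_errors(a, Y) for a in candidates]
squared = [sum_squared_errors(a, Y) for a in candidates]

for a, r, s in zip(candidates, raw, squared):
    print(f"a={a}  sum of errors={r}  sum of squared errors={s}")
```

In this toy data the raw sum passes through zero at a = 4 and goes on falling, while the squared sum bottoms out there.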
OLS Line: Deriving a (3) • Differentiate $\sum e_i^2 = \sum (Y_i - a)^2$ with respect to a to find the formula for the OLS estimator a • Note that you set the first order condition to zero to find a minimum: $-2\sum e_i = 0$ (don’t worry about the second order derivative - which will be positive). • Remember that $e_i = Y_i - a$, so the FOC is $-2\sum (Y_i - a) = 0$, i.e. $\sum Y_i = na$ • Solve for a: $a = \sum Y_i / n = \bar{Y}$.
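A quick check of this result, again assuming hypothetical Y values rather than the course data set: the sample mean satisfies the first order condition, and moving a slightly in either direction raises the sum of squared errors.

```python
# Hypothetical Y values; not the course data set.
Y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(Y)

# OLS estimator for the intercept-only model: a = sum(Y_i) / n
a = sum(Y) / n

# First order condition: -2 * sum(e_i) should be zero at the estimate
e = [y - a for y in Y]
foc = -2 * sum(e)


def sse(a_candidate):
    """Sum of squared errors for a candidate intercept."""
    return sum((y - a_candidate) ** 2 for y in Y)


print("a =", a)
print("FOC =", foc)
print("SSE at a - 0.1, a, a + 0.1:", sse(a - 0.1), sse(a), sse(a + 0.1))
```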
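OLS Line: Deriving b (1) • Now consider the slope regression where $Y_i = bX_i + e_i$, so that $e_i = Y_i - bX_i$ • We use the same principles as before: minimize $\sum e_i^2 = \sum (Y_i - bX_i)^2$ with respect to b, which gives the FOC $-2\sum X_i (Y_i - bX_i) = 0$, i.e. $\sum X_i e_i = 0$ • Note: this condition only holds if there’s no correlation between X and the errors • So: $b = \sum X_i Y_i / \sum X_i^2$ (keep in mind that this expression only holds for the regression with a zero intercept and a non-zero slope)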
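A sketch of the slope-only case, assuming hypothetical data (not the course data set). It computes $b = \sum X_i Y_i / \sum X_i^2$ for a line forced through the origin and confirms that $\sum X_i e_i$ is zero at that value.

```python
# Hypothetical data; not the course data set.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Slope estimator for the zero-intercept model Y_i = b * X_i + e_i
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_xx = sum(x * x for x in X)
b = sum_xy / sum_xx

# First order condition: sum(X_i * e_i) should be zero at the estimate
e = [y - b * x for x, y in zip(X, Y)]
foc = sum(x * e_i for x, e_i in zip(X, e))

print("b =", b)
print("sum of X_i * e_i =", foc)
```

Note that the plain sum of errors need not be zero here, because this model has no intercept.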
OLS Line: Collect a & b • We know a regression line with a non-zero intercept and a non-zero slope coefficient looks like: $Y_i = a + bX_i + e_i$ • We also know: $e_i = Y_i - a - bX_i$ • From the derivations of a and b we have the necessary first order conditions: $\sum e_i = 0$ and $\sum X_i e_i = 0$
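OLS Line: Collect a & b (2) • Plug $e_i = Y_i - a - bX_i$ into the FOC from our derivation of a: $\sum (Y_i - a - bX_i) = 0$, which gives $a = \bar{Y} - b\bar{X}$ • Plugging into the FOC from the derivation of b: $\sum X_i (Y_i - a - bX_i) = 0$, which gives $b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$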
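The algebra behind this step can be sketched as follows. This is a reconstruction under the definitions used in this lecture ($e_i = Y_i - a - bX_i$, with $\bar{X}$ and $\bar{Y}$ the sample means), not necessarily the exact presentation in Appendix 4.2.

```latex
\begin{align*}
\sum_{i=1}^{n} (Y_i - a - bX_i) &= 0
  \;\Rightarrow\; \sum Y_i - na - b\sum X_i = 0
  \;\Rightarrow\; a = \bar{Y} - b\bar{X} \\[4pt]
\sum_{i=1}^{n} X_i (Y_i - a - bX_i) &= 0
  \;\Rightarrow\; \sum X_i Y_i - (\bar{Y} - b\bar{X})\sum X_i - b\sum X_i^2 = 0 \\[4pt]
  &\Rightarrow\; \sum X_i Y_i - n\bar{X}\bar{Y} = b\left(\sum X_i^2 - n\bar{X}^2\right) \\[4pt]
  &\Rightarrow\; b = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}
     = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
\end{align*}
```

The second line uses $\sum X_i = n\bar{X}$ after substituting the expression for a from the first line.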
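Example • From the data set posted on the web • To calculate the regression line you need: $n$, $\sum X_i$, $\sum Y_i$, $\sum X_i Y_i$, and $\sum X_i^2$ • Solve for a & b given the formulas: $b = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - (\sum X_i)^2}$ and $a = \bar{Y} - b\bar{X}$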
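The posted data set is not reproduced here, so the sketch below uses five made-up points purely for illustration (the values in `X` and `Y` and the helper name `ols_from_sums` are hypothetical). It computes a and b mechanically from the sums and then checks both first order conditions.

```python
# Five hypothetical points; NOT the data set posted on the course web site.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 5.0]


def ols_from_sums(xs, ys):
    """Return (a, b) for the line Y = a + b*X using only sums of the data."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope: (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - sum(X)^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Intercept: mean(Y) - b * mean(X)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = ols_from_sums(X, Y)

# Check the two first order conditions at the estimates
e = [y - a - b * x for x, y in zip(X, Y)]
foc_a = sum(e)                               # sum of errors
foc_b = sum(x * e_i for x, e_i in zip(X, e))  # sum of X_i * e_i

print("a =", a, " b =", b)
print("sum e_i =", foc_a, " sum X_i e_i =", foc_b)
```

Both conditions should come out at zero up to floating point rounding.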
Wrap Up • Introduced three data types: cross-section, time series, and longitudinal • Used the OLS technique to derive formulas for an intercept and a slope coefficient • We estimated the intercept-only and slope-only regression lines • We set the FOCs equal to zero • Then we put everything together to estimate a and b for the line $\hat{Y}_i = a + bX_i$