
Data Structures



Presentation Transcript


  1. Data Structures Methods of Economic Investigation Lecture 2

  2. Why are we doing this? • Thus far: Most of econometrics teaching has been theory based • Type of data can drive what you can do • Type of data affects credibility and problems with analysis • Can be hard to translate equations into applications and even into reading papers • Rest of this course based on applications: this lecture will help with both lectures and exercises

  3. Choosing your data • Suppose you are interested in the causal effect of X on y: how would you test this? • If you could choose the way in which X is determined in your sample, what would you do? • This may seem fanciful, but field experiments are becoming more common in economics • Good thought experiment: if you could have any data in the world, would this question be answerable? (If not, move on!) • Good reason to choose to run a randomized controlled experiment
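
The thought experiment on this slide can be sketched as a small simulation. This is a minimal illustration under assumed numbers, not part of the original lecture: the true effect of 2, the unobserved "ability" variable and the sample size are all hypothetical. It compares a naive difference in means when X is self-selected with the same estimate when X is assigned at random.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 2.0  # hypothetical causal effect of X on y

# Unobserved characteristic that also drives the outcome (an assumption).
ability = rng.normal(size=n)


def outcome(x):
    """Hypothetical outcome: y = 2*X + 3*ability + noise."""
    return true_effect * x + 3.0 * ability + rng.normal(size=n)


def difference_in_means(y, x):
    """Naive estimate of the effect: mean(y | X=1) - mean(y | X=0)."""
    return y[x == 1].mean() - y[x == 0].mean()


# Case 1: X is self-selected, so high-ability people are more likely to have X=1.
x_selected = (ability + rng.normal(size=n) > 0).astype(int)
est_selected = difference_in_means(outcome(x_selected), x_selected)

# Case 2: X is assigned by coin flip, independent of ability.
x_random = rng.integers(0, 2, size=n)
est_random = difference_in_means(outcome(x_random), x_random)

print(f"self-selected X: {est_selected:.2f}")
print(f"randomized X:    {est_random:.2f}")
```

In this setup the randomized estimate should land close to the assumed effect, while the self-selected estimate is pushed upward because it also picks up the effect of ability.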

  4. Where does data come from? • Surveys: response rate; stratification/clusters; reporting error/measurement error • Administrative records: come from lots of different places; often kept in real time (which addresses “reporting” or “recollection” errors); may be missing, and that might not be random… • Researchers (and you!): often collected for a specific project, so be careful what it contains; more “unique”, with different types of data (e.g. content analysis)

  5. Who Collects Data • Government: official statistics (unemployment, GDP, etc.); surveys (labor force, consumption, etc.); records (justice system, social programs) • Service providers: often this is administrative data (e.g. hospital records); sometimes internal surveys or evaluations, which can be useful if you can get them • Third parties: critical for places with limited capacity (e.g. the World Bank is a big source of this for developing countries); university or survey research programs; newspapers and media sources compile LOTS of things

  6. Different Types of data • Cross-Sectional Data • Time Series Data • Panel Data • Repeated Cross-Section

  7. Cross-Sectional Data • Cross-sectional data cover a cross section of the population, with information collected from that cross section during a given period of time. • What does this look like? • Rows are units of observation (e.g. individuals) • Columns are variables

  8. Cross-Section Data • Simple descriptive statistics across individuals: can get sample mean and variance of various X’s • Regressions: The standard formulas
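
As a minimal sketch of the descriptive statistics mentioned here, the snippet below computes the sample mean and variance of each variable across individuals. The five rows are made-up values for illustration only, not the lecture's data; the layout (one row per individual, one column per variable) follows the cross-section description in the slides.

```python
import numpy as np

# Hypothetical cross-section: one row per individual, one column per variable.
# Columns: years of education, income (thousands). Values are made up.
data = np.array([
    [10.0, 20.0],
    [12.0, 24.0],
    [12.0, 25.0],
    [16.0, 38.0],
    [18.0, 45.0],
])

# Statistics are taken down the rows, i.e. across individuals.
sample_mean = data.mean(axis=0)
sample_var = data.var(axis=0, ddof=1)  # ddof=1 gives the unbiased sample variance

print("mean:    ", sample_mean)
print("variance:", sample_var)
```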

  9. AlgebraReality: Outcome Variables • Try to get a sense of data, to translate the matrix algebra into reality. • What is the effect of education on income? • We have an Outcome “y”, for example income

  10. AlgebraReality: RHS Variables • There may be several (labeled by k) different X’s. So usually we think of this as meaning that: • X is of dimensionality kxn • We will estimate k coefficients • Our X variables looks like:

  11. Our Data Example • Our data looks like: • N=5 • k=3 • We can index our individuals by ID (useful later)

  12. What does a regression tell us? • Remember, it’s minimizing the sum of squared errors and will pick the 3 coefficients (one on race, one on sex, and one on education) to do that • We are interested in the coefficient on education to tell us the “effect of education on earnings” • We might still care about the effect of race and gender as “control” variables

  13. Stata Output

  14. AlgebraReality: Stata Output • Using our “data” if we regress y on our X’s • To do this in stata we would tell stata: regress income race sex education • Output: • Coefficients • Standard errors • R-squared

  15. Limitations… • Lots of things vary over time • Can’t control for these in cross-section data • The only source of variation is across individuals (or whatever the unit of observation is) • Identification: need observations that share similar time characteristics (because we can’t control for these) but differ on some variable of interest

  16. Now to time series data • Pretty similar to cross-section data, except the data are indexed by time instead of individual

  17. Why is time series different? • Correlation between different observations • Violates OLS assumptions (estimates ok but can’t do inference) • More on this later… • Lots of things about individuals are time-invariant so they don’t make sense in this context. • Other things, often in time series data, are common across individuals (e.g. macroeconomic trends) • Limits what we can do with these variables—we CAN’T “control” for time-invariant characteristics so all variation comes from time variation…
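
The "correlation between different observations" point can be illustrated with a simulated series. This is a sketch under assumed values: the AR(1) form and the persistence parameter of 0.8 are hypothetical choices, not taken from the lecture. It contrasts the lag-1 correlation of a persistent series with that of independent draws.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 20_000, 0.8  # hypothetical sample length and persistence

# AR(1): each observation carries over a share rho of the previous one.
shocks = rng.normal(size=n)
series = np.empty(n)
series[0] = shocks[0]
for t in range(1, n):
    series[t] = rho * series[t - 1] + shocks[t]


def lag1_autocorrelation(x):
    """Sample correlation between x_t and x_{t-1}."""
    return np.corrcoef(x[1:], x[:-1])[0, 1]


print(f"AR(1) series: {lag1_autocorrelation(series):.2f}")
print(f"i.i.d. draws: {lag1_autocorrelation(shocks):.2f}")
```

Independent draws show essentially no correlation between neighbouring observations, which is what the usual OLS inference assumes; the persistent series does not satisfy that.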

  18. Estimating with Time Series Data • Two critical issues: • Stationarity: mean and variance do not change over time; a stronger condition is sometimes required, namely that the whole distribution (i.e. all moments) is the same over time/space; you may need to do something to make your data stationary (e.g. de-mean, detrend, difference, etc.) • Ergodicity: given a sufficiently long set of realizations, you can estimate the statistical properties • Worry about unit roots (more on this later)
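
One of the fixes named above, differencing, can be sketched on a simulated series. The random walk with drift of 0.5 is an assumed example of a unit-root process chosen for illustration. Comparing the mean of the first and second half of the sample is only a crude check, not a formal stationarity test.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Random walk with drift (a unit-root process): y_t = 0.5 + y_{t-1} + e_t.
levels = np.cumsum(0.5 + rng.normal(size=n))
differences = np.diff(levels)  # y_t - y_{t-1} = 0.5 + e_t


def half_means(x):
    """Mean of the first and second half of a series."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()


level_first, level_second = half_means(levels)
diff_first, diff_second = half_means(differences)

print(f"levels:      {level_first:8.1f} vs {level_second:8.1f}")
print(f"differences: {diff_first:8.2f} vs {diff_second:8.2f}")
```

The mean of the levels drifts between the two halves, while the mean of the first differences stays roughly constant.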

  19. Panel Data • Repeated observations on individuals • Common example: Labor Force Surveys • Take information about individuals • Usually contains variables that are time-invariant for any individual (race, sex, education level) • Usually contains variables that are time-varying for any given individual (employed last week) • Can contain or link to variables that are time-varying but the same across groups of individuals (local unemployment rate)

  20. Example of Panel Data • Multi-dimensional—so indexed by time & individual

  21. Panel Data Regressions • Regressions need to be indexed by all dimensions (our example is time and individual but it could be time, state, and individual) • May allow intercept shift (e.g. add a dummy for each year) • May allow a slope shift (e.g. allow different coefficients for men and women)
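
A minimal sketch of how the intercept shift and slope shift described here might enter the regressor matrix. The three-person, three-year panel and all variable names are hypothetical; the code only builds the matrix and does not estimate anything.

```python
import numpy as np

# Hypothetical long-format panel: one row per (individual, year).
ids = np.repeat([1, 2, 3], 3)
years = np.tile([2000, 2001, 2002], 3)
female = np.repeat([0.0, 1.0, 1.0], 3)
educ = np.array([12, 12, 13, 16, 17, 17, 10, 10, 11], dtype=float)

# Intercept shift: one dummy per year, omitting the first year as the
# base category so the dummies are not collinear with the constant.
year_levels = np.unique(years)
year_dummies = (years[:, None] == year_levels[None, 1:]).astype(float)

# Slope shift: interacting education with the female dummy lets the
# education coefficient differ between men and women.
educ_x_female = educ * female

# Columns: constant, educ, female, educ*female, d2001, d2002.
X = np.column_stack([np.ones(len(ids)), educ, female, educ_x_female, year_dummies])
print(X)
```

Each row stays indexed by both individual and year; the dummies and the interaction are just extra columns in the same matrix.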

  22. What’s so great about Panel Data? • We can control for individual-specific factors (e.g. error component models, ECM) • ECM may solve some of our omitted variable bias issues (individual controls) • Can use both “within” variation (for an individual over time) and “between” variation (across individuals in a given time) • Drawbacks: long panels are rare • Panels tend to span very short periods of time • This may make it difficult to study trends: can only see “breaks” at big changes
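
The claim that panel data lets us control for individual-specific factors can be sketched with a simulated panel. Everything here is assumed for illustration: the true slope of 1, the panel dimensions, and an unobserved individual effect that is correlated with x. The "within" estimator subtracts each individual's own time average before running OLS. This is one common way to remove a fixed individual effect and is not necessarily the specific model used later in the course.

```python
import numpy as np

rng = np.random.default_rng(3)
n_ind, n_periods, beta_true = 500, 5, 1.0  # hypothetical panel dimensions

ids = np.repeat(np.arange(n_ind), n_periods)

# Individual effect alpha_i is unobserved and correlated with x.
alpha = rng.normal(size=n_ind)
x = alpha[ids] + rng.normal(size=n_ind * n_periods)
y = beta_true * x + 2.0 * alpha[ids] + rng.normal(size=n_ind * n_periods)


def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)


def demean_by_individual(v):
    """Subtract each individual's own time average (the 'within' transform)."""
    means = np.bincount(ids, weights=v) / np.bincount(ids)
    return v - means[ids]


pooled = slope(x, y)  # ignores alpha_i, so it suffers omitted variable bias
within = slope(demean_by_individual(x), demean_by_individual(y))

print(f"pooled OLS: {pooled:.2f}")
print(f"within:     {within:.2f}")
```

In this setup pooled OLS overstates the slope because it also picks up the individual effect, while the within estimate should sit near the assumed value.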

  23. Repeated Cross-Section Data • More common—Annual or Frequent Surveys—not always same people • Get repeated cross-section, of different cohorts of individuals • Can do several things: • Construct panel at more aggregate level • Use time-series aspects to compare cohorts
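
"Construct panel at more aggregate level" can be sketched as collapsing individual records into cell means. The survey rows below are made up, and grouping by birth cohort is one assumed choice of group. Because the same cohort appears in every survey year, the cohort means can be followed over time even though the individuals differ.

```python
from collections import defaultdict

# Hypothetical repeated cross-section: different people are sampled each year.
# Each row: (survey year, birth cohort, income in thousands). Values are made up.
rows = [
    (2000, 1960, 30.0),
    (2000, 1960, 34.0),
    (2000, 1970, 22.0),
    (2000, 1970, 26.0),
    (2001, 1960, 33.0),
    (2001, 1960, 35.0),
    (2001, 1970, 25.0),
    (2001, 1970, 29.0),
]

# Collapse to one observation per (year, cohort) cell.
cells = defaultdict(list)
for year, cohort, income in rows:
    cells[(year, cohort)].append(income)

pseudo_panel = {cell: sum(v) / len(v) for cell, v in sorted(cells.items())}

# Follow each cohort over time, as one would follow an individual in a panel.
change_by_cohort = {
    cohort: pseudo_panel[(2001, cohort)] - pseudo_panel[(2000, cohort)]
    for cohort in (1960, 1970)
}

print(pseudo_panel)
print(change_by_cohort)
```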

  24. Example of Repeated Cross-Section Data • Multi-dimensional, so indexed by time & individual

  25. Repeated Cross-Section Regressions • Index by time and whatever “group” you want to use; for example, if group 1 is men and group 2 is women, you estimate the regression indexed by group and time • Uses similarities between groups but can’t control for individual-specific issues • Cohort-specific changes, e.g. selection issues • Can allow ‘fixed effects’ for time or group, but these are not as believable as a control for unobservables

  26. Next Steps: • Using data can we: • Describe the data to understand what we’ve got • Develop some “questions” to answer • Test our hypotheses • Application based class—will use Stata examples
