
Data, Models and the Search for Exchangeability

Data, Models and the Search for Exchangeability. Mark Hopkins, Department of Economics Math Department Colloquium Gettysburg College April 14, 2005. “Torture the data, and they will confess…”. Theory: Is data mining a dirty word?


Presentation Transcript


  1. Data, Models and the Search for Exchangeability • Mark Hopkins, Department of Economics • Math Department Colloquium, Gettysburg College, April 14, 2005

  2. “Torture the data, and they will confess…” • Theory: • Is data mining a dirty word? • Statistics vs. econometrics and the role of the ex ante theory • Information extraction amounts to a conditioning problem • Conditioning: bias vs. variance, or a search for exchangeability…? • Propagating “model uncertainty” into our parameter estimates • Using new Bayesian statistical methods in econometrics • What do economists have to learn from statisticians? • Application: • Why do some countries become rich faster than others?

  3. Preliminaries: Recalling Bayes’ Rule • Bayes’ Rule tells us how we can update our beliefs (about event A) given some data (knowledge that event B happened) • Example: What is the probability that Saddam had weapons of mass destruction (WMD), given that none have been found (NF)? • The answer depends both on the “strength of the data” p(NF|WMD) and one’s own (subjective) prior beliefs about p(WMD) • The statistician's job is (should be) to help you update your own personal beliefs… all truth is “subjective” in a Bayesian world
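The update described on this slide can be sketched numerically. The likelihood values below are made-up illustrations, not figures from the talk, and `posterior` is a hypothetical helper name; the point is only that the same data move two different priors to two different posteriors.

```python
def posterior(prior, lik_if_true, lik_if_false):
    """Return p(A|B) from p(A), p(B|A) and p(B|not A) via Bayes' Rule."""
    # p(B) by the law of total probability
    evidence = lik_if_true * prior + lik_if_false * (1.0 - prior)
    return lik_if_true * prior / evidence


# Assumed (illustrative) strength of the data:
p_nf_given_wmd = 0.2      # p(NF | WMD): weapons exist but none are found
p_nf_given_no_wmd = 1.0   # p(NF | no WMD): nothing to find

# Two observers with different subjective priors p(WMD)
for prior in (0.5, 0.9):
    post = posterior(prior, p_nf_given_wmd, p_nf_given_no_wmd)
    print(f"prior p(WMD) = {prior:.2f} -> posterior p(WMD|NF) = {post:.4f}")
```

Both observers revise downward, but they do not end up agreeing, which is the sense in which the answer depends on the prior as well as on the data.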

  4. Prior beliefs modify our view of the information contained in “data”

  5. Statistical Inference: A Review • The goal: observe the world (gather data, D) and then draw conclusions and/or make predictions • This requires a theory (or model, M) to organize relationships • Mathematics (Probability Theory) • A statistical model is simply a probability distribution, p(D|M), where M = {θ, A} consists of • a set of structural assumptions (A), and • some vector (θ) parameterizing the probability distribution. This usually represents the “question of interest”: e.g. {μ, σ²} • Statistical inference: • “Drawing conclusions” refers to p(θ|D,A) • “Making predictions” refers to p(D_new|θ,A)

  6. Estimating p(θ|D,A): Two Practical (& Related) Problems • #1: Inference about θ is conditional on model assumptions • In practice, we don’t know the true structural assumptions (A) • What do we know? Bayes’ Rule: p(M|D) ∝ p(D|M) p(M) • Hypothesis testing can reject a model, but it can neither confirm it nor tell you the correct alternative! • Statistics vs. econometrics: what role does the prior p(M) play? • Traditional statistics recognizes uncertainty about θ but not A. • Result: run a specification search for A, but pretend you didn’t! • #2: What if data are not drawn from the same distribution? • Inference about θ is based on averaging repeated draws • A fundamental statistical issue: “We are each a population of 1!” • A methodological guide for “”: conditional exchangeability

  7. The Conditioning Problem: A Familiar Example • Data D = {X,Y}; we want to know the “effect of X on Y” • We are interested in the regression (or C.E.F.): E[Y|X] • Define the residual or “error” as: ε ≡ Y – E[Y|X] • Familiar Linear Example: model M is E[Y|X] = β₀ + β₁X • so Y = β₀ + β₁X + ε • Estimation / inference: • Estimation: find {β₀, β₁} that minimize some loss function L(ε) • Inference: conditional on our information set, ε must be exchangeable
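As a minimal sketch of the estimation step, assuming squared-error loss (the slide leaves the loss function unspecified), the closed-form least-squares solution for the linear example looks like this. The data and the name `ols_fit` are made up for illustration.

```python
def ols_fit(x, y):
    """Least-squares fit of y = b0 + b1*x + e; returns (b0, b1, residuals)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope minimizing the sum of squared residuals
    b0 = mean_y - b1 * mean_x    # intercept puts the line through the means
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, residuals


# Illustrative data only
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
b0, b1, resid = ols_fit(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("residuals:", [round(e, 3) for e in resid])
```

The residuals returned here are the estimated counterparts of ε, the quantities that have to be exchangeable before the usual inference is justified.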

  8. The Benefits of Using the Bayesian Approach of “Exchangeability” • Classical (frequentist) “i.i.d.” vs. Bayesian “exchangeability” • A foundation for statistical inference on population data • De Finetti’s Representation Theorem states… • If a sample {X1, X2, …, Xn} is a subset of an infinite exchangeable sequence, {X}, then it is “as if” p(D|θ,A) exists, where θ ~ p(θ) • Clarifies the goal of the conditioning / model search process • We are trying to achieve “anonymity” of regression residuals • Clarifies the relationship between model search and prediction • What is the basis for using the past to make predictions of the future? … when the past and future are part of an exchangeable sequence!
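Stated a little more formally, the “as if” claim says that the joint distribution of any finite sample from an infinite exchangeable sequence can be written as a mixture of i.i.d. distributions. The following is a sketch in the slide's notation with regularity conditions omitted; writing the mixing distribution as p(θ|A) rather than p(θ) is a notational assumption made here to keep the structural assumptions explicit.

```latex
p(x_1, \dots, x_n \mid A)
  \;=\; \int \prod_{i=1}^{n} p(x_i \mid \theta, A)\; p(\theta \mid A)\, d\theta
```

Conditional on θ the observations behave as i.i.d. draws, and the prior p(θ|A) arises from the exchangeability judgment itself rather than being imposed separately.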

  9. Example of a Conditioning Problem: The Sources of Economic Growth • Why have some countries grown richer faster than others? • Data (D): growth rates (g) & assorted country characteristics (X) • Observations are countries (n ≈ 100) • Ex ante theory: The Solow Model of Capital Accumulation • The Problem: What about other variables that may affect g? • Omitted variable bias & “robustness” problems • D.o.F. problem: # Theories > # Observations … (plus multicollinearity!) • Specifying functional forms for variables like democracy, ethnic diversity • Population heterogeneity… Are France, Taiwan, and Sudan really all “draws from the same distribution”? Inference about σ²…?

  10. Exchangeability in Cross-Country Growth Regressions • Inference requires conditional exchangeability • France, Taiwan, and Sudan are not exchangeable, but can we find an appropriate vector X such that the residuals ε ≡ g – E[g|X] are exchangeable? • Conditioning just boils down to a problem of model selection! • The classical approach to model selection is “hypothesis testing” • However, the D.o.F. problem has led to upward “specification search”! • In summary: • Two types of uncertainty: sampling (variance), model (bias) • Model selection usually involves an artful trade-off of bias vs. variance • However, classical methods do not propagate our model uncertainty into coefficient estimates • Can Bayesian statistics help us bring science to the art of selection?
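One rough way to see what “conditional exchangeability” asks for: if the residuals still have different means across identifiable groups, then the group labels carry information and the residuals are not “anonymous”. The sketch below uses made-up data and a hypothetical `region` label. It compares a pooled regression with one that also conditions on region (implemented by within-group demeaning, which is equivalent to adding group dummies). Equal group means are a necessary check only, not a proof of exchangeability.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def ols_residuals(x, y):
    """Residuals from a least-squares fit of y on a constant and x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def group_means(values, labels):
    buckets = defaultdict(list)
    for value, label in zip(values, labels):
        buckets[label].append(value)
    return {label: mean(vs) for label, vs in buckets.items()}


def demean_within(values, labels):
    """Subtract each group's mean (equivalent to including group dummies)."""
    gm = group_means(values, labels)
    return [value - gm[label] for value, label in zip(values, labels)]


# Made-up data: two regions with the same slope but different intercepts
region = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]   # e.g. an investment measure
g = [1.4, 2.1, 2.5, 3.6, 3.9, 4.5]   # growth rates

pooled = group_means(ols_residuals(x, g), region)
conditioned = group_means(
    ols_residuals(demean_within(x, region), demean_within(g, region)), region
)
print("mean residual by region, g on x only:      ", pooled)
print("mean residual by region, g on x and region:", conditioned)
```

In the pooled fit, knowing the region predicts the sign of the residual; once region is part of the conditioning set, that particular source of non-exchangeability disappears.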

  11. The Growth Literature, Take 1: OLS estimates w/ controls & dummies

  12. The Growth Literature, Take 2: “Explaining” Parameter Heterogeneity • Tree Regressions • Local Linear Regressions (Spline models) • Varying Coefficient / Hierarchical Models

  13. A Tree Regression • [Figure: regression tree with root split s60 < 0.095; further splits on EQINV, laam, s60, NONEQINV, DEMOC65, FRAC and lny60; eleven leaf values ranging from -0.0072 to 0.0532]

  14. An Additive Spline Model: Investment

  15. An Additive Spline Model: Schooling

  16. An Additive Spline Model: Population Growth

  17. Using splines to reveal non-linearities: Solow + s(FRAC)

  18. Does democracy modify effects of investment and schooling?

  19. A Varying Coefficient Model

  20. Specification Searches • A specification search is a search for the mode of p(M|D)… • Bayes’ Rule: p(M|D) ∝ p(D|M) p(M) • Problem #1: How strong is your prior belief about M? • Problem #2: Can you characterize your prior beliefs? • Problem #3: Using the same data to find M and to estimate θ? • Danger! Why? • Problem #4: By conditioning on a single M [not on the distribution p(M)], you are understating uncertainty about coefficient estimates!
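Problem #3 can be illustrated with a small simulation, offered here as a sketch rather than anything from the talk. The outcome is pure noise, the “search” keeps whichever of `k` unrelated candidate regressors has the largest |t|, and the winner is then tested as if it had been chosen in advance. The sample size, the number of candidates, the seed and the approximate critical value 2.01 are all assumptions made for the illustration.

```python
import math
import random


def corr(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_abs_t(n, k, rng):
    """Largest |t| on the slope across k single-regressor models of pure noise."""
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(k):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r = corr(x, y)
        # In simple regression the slope t-statistic is r * sqrt((n-2)/(1-r^2))
        t = r * math.sqrt((n - 2) / (1.0 - r * r))
        best = max(best, abs(t))
    return best


rng = random.Random(0)        # fixed seed so the run is repeatable
n, k, sims = 50, 20, 200
T_CRIT = 2.01                 # approx. two-sided 5% critical value, 48 d.o.f.

hits = sum(best_abs_t(n, k, rng) > T_CRIT for _ in range(sims))
rate = hits / sims
print(f"share of searches ending in a 'significant' regressor: {rate:.2f}")
print("nominal rate if the model had been fixed in advance: 0.05")
```

With 20 independent candidates, the chance that at least one clears a 5% test is roughly 1 − 0.95²⁰ ≈ 0.64, so the reported significance of the selected model badly overstates the evidence.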

  21. Bayesian Model Averaging (BMA) • An alternative to trying to find the single best model (i.e., the mode of p(M|D)) is to consider the entire distribution of specifications… • Suppose you assign probability p(A_k) to each of K specifications; then the posterior for θ is a probability-weighted average over those specifications • Averaging over model space improves statistical inference • Coefficient estimates tend to have better predictive ability • Standard errors reflect model, as well as parametric, uncertainty
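A minimal sketch of the averaging step, assuming the posterior model probabilities p(A_k|D) and each model's posterior mean and variance for one coefficient are already available. All numbers and the name `bma_summary` are hypothetical. The variance follows the law of total variance, so it includes the spread of the estimates across models as well as the uncertainty within each model.

```python
import math


def bma_summary(model_probs, means, variances):
    """Model-averaged posterior mean and variance of a single coefficient."""
    total = sum(model_probs)
    weights = [p / total for p in model_probs]   # normalize, just in case
    avg_mean = sum(w * m for w, m in zip(weights, means))
    # E[theta^2] averaged over models, minus the squared averaged mean
    second_moment = sum(w * (v + m * m)
                        for w, m, v in zip(weights, means, variances))
    return avg_mean, second_moment - avg_mean ** 2


# Hypothetical inputs for three specifications
probs = [0.5, 0.3, 0.2]        # p(A_k | D)
means = [0.10, 0.06, 0.00]     # posterior mean under each A_k
                               # (0 where the variable is excluded)
variances = [0.0004, 0.0009, 0.0]

bma_mean, bma_var = bma_summary(probs, means, variances)
within_only = sum(p * v for p, v in zip(probs, variances))
print(f"BMA mean = {bma_mean:.4f}")
print(f"BMA std. dev. = {math.sqrt(bma_var):.4f}")
print(f"std. dev. ignoring between-model spread = {math.sqrt(within_only):.4f}")
```

The gap between the last two lines is the model uncertainty that conditioning on a single specification would leave out.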

  22. Some nasty theoretical details • Choosing the space of models and model priors • Managing summation in BMA can be tricky… with 12 possible covariates, there are 2^12 = 4,096 different models to combine! • “Occam’s Window” suggested by Raftery (1994): eliminate larger and/or less probable models • MC³ techniques transit across model space. Compute p(θ,A) from p(θ|A) and p(A|D) • Computing the integral p(D|A) = ∫ p(D|θ,A) p(θ|A) dθ • This is done directly in MC³ techniques for BMA, otherwise… • Can approximate using p(D|θ_MLE,A)
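The size of the model space, and one common shortcut for the integral, can be sketched as follows. The covariate names are placeholders. The BIC-based weighting is a standard approximation built on the maximized likelihood, with BIC = −2 log p(D|θ_MLE,A) + d·log n for d parameters and n observations; whether the talk used exactly this approximation is an assumption.

```python
import math
from itertools import combinations


def all_subsets(names):
    """Every subset of candidate covariates, from the empty model to the full one."""
    return [subset
            for size in range(len(names) + 1)
            for subset in combinations(names, size)]


def bic_model_probs(bics, priors=None):
    """Approximate p(A_k|D), taking p(D|A_k) as proportional to exp(-BIC_k / 2)."""
    if priors is None:
        priors = [1.0 / len(bics)] * len(bics)   # uniform prior over models
    best = min(bics)                             # subtract the minimum for stability
    raw = [p * math.exp(-0.5 * (b - best)) for b, p in zip(bics, priors)]
    total = sum(raw)
    return [r / total for r in raw]


covariates = [f"x{i}" for i in range(1, 13)]     # 12 placeholder covariates
models = all_subsets(covariates)
print("number of candidate models:", len(models))

# Hypothetical BIC values for three competing specifications
probs = bic_model_probs([100.0, 102.0, 110.0])
print("approximate posterior model probabilities:",
      [round(p, 4) for p in probs])
```

Exhaustive enumeration is feasible at 12 covariates but doubles with every added variable, which is why pruning rules such as Occam's Window or sampling schemes such as MC³ become necessary.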

  23. Bayesian Model Selection Results

  24. Bayesian Model Averaging Results

  25. Conclusions • Standard statistical inference is conditional on the chosen model • A data-driven model search is usually an unavoidable fact of life • Model must include an appropriate vector of controls (bias vs. variance) • Model should address parameter heterogeneity and functional form • A methodological guide for conditioning is exchangeability • Of course, the very fact that we are searching for a model means we are really less certain about our estimates than we are stating… • BMA techniques help to “propagate model uncertainty” into coefficient estimates and standard errors
