Causal Search in the Real World


Presentation Transcript


  1. Causal Search in the Real World

  2. A menu of topics • Some real-world challenges: • Convergence & error bounds • Sample selection bias • Simpson’s paradox • Some real-world successes: • Learning based on more than just independence • Learning about latents & their structure

  3. Short-run causal search • Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies • Of course, this is a problem for all inference: we might just be really unlucky • Note: This is not (really) the problem of unrepresentative samples (e.g., black swans)

  4. Convergence in search • In search, we would like to bound our possible error as we acquire data • I.e., we want search procedures that have uniform convergence • Without uniform convergence, • We cannot set confidence intervals for inference • Bayesians with different priors over hypotheses need not agree on probable bounds, no matter how loose

  5. Pointwise convergence • Assume hypothesis H is true • Then • For any standard of “closeness” to H, and • For any standard of “successful refutation,” • For every hypothesis that is not “close” to H, there is a sample size for which that hypothesis is refuted

  6. Uniform convergence • Assume hypothesis H is true • Then • For any standard of “closeness” to H, and • For any standard of “successful refutation,” • There is a sample size such that for all hypotheses H* that are not “close” to H, H* is refuted at that sample size.
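The only difference between the two definitions is the order of the quantifiers. Schematically (the notation d for the "closeness" standard, n for sample size, and H for the true hypothesis is illustrative, not from the slides):

```latex
% Illustrative notation (not from the slides): d = "closeness" standard, n = sample size, H = true hypothesis.
\begin{align*}
\text{Pointwise: } & \forall \varepsilon\; \forall H^{*}\ \bigl[\, d(H^{*},H) > \varepsilon \;\Rightarrow\; \exists n \text{ such that } H^{*} \text{ is refuted at sample size } n \,\bigr]\\
\text{Uniform: }   & \forall \varepsilon\; \exists n \text{ such that } \forall H^{*}\ \bigl[\, d(H^{*},H) > \varepsilon \;\Rightarrow\; H^{*} \text{ is refuted at sample size } n \,\bigr]
\end{align*}
```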

  7. Two theorems about convergence • There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993) • There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)

  8. Two theorems about convergence • What if we didn’t care about “small” causes? • ε-Faithfulness: If X & Y are d-connected given S, then |ρ_{XY.S}| > ε • I.e., every association predicted by d-connection is larger than ε • For any ε, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness • So we have error bounds, confidence intervals, etc.
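A minimal sketch of the kind of threshold test ε-Faithfulness licenses, using simulated linear-Gaussian data; the model, variable names, and the value of eps here are invented for illustration:

```python
import numpy as np

def partial_corr(x, y, S):
    """Partial correlation of x and y given the columns of S (via regression residuals)."""
    S1 = np.column_stack([np.ones(len(x)), S])           # add intercept
    rx = x - S1 @ np.linalg.lstsq(S1, x, rcond=None)[0]  # residual of x after regressing on S
    ry = y - S1 @ np.linalg.lstsq(S1, y, rcond=None)[0]  # residual of y after regressing on S
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n, eps = 5_000, 0.05                       # eps plays the role of the epsilon in epsilon-Faithfulness
z = rng.normal(size=n)                     # Z -> X, Z -> Y, so X and Y are d-separated given Z
x = 0.8 * z + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)

print(abs(np.corrcoef(x, y)[0, 1]) > eps)               # True: marginal association exceeds eps
print(abs(partial_corr(x, y, z.reshape(-1, 1))) > eps)  # False: conditioning on Z removes it
```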

  9. Sample selection bias • Sometimes, a variable of interest is a cause of whether people get into the sample • E.g., measuring various skills or knowledge using only college students (those skills influence who gets into college) • Or measuring joblessness by a phone survey in the middle of the day • Simple problem: You might get a skewed picture of the population

  10. Sample selection bias • [Diagram: Factor A and Factor B both cause Sample] • If two variables matter, then we have: • Sample = 1 for everyone we measure • That is equivalent to conditioning on Sample • ⇒ Induces an association between A and B!
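A quick simulation of the collider effect described above (a sketch; the selection rule, coefficients, and sample size are arbitrary): A and B are independent in the full population, but conditioning on Sample = 1 induces an association.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
a = rng.normal(size=n)                 # Factor A
b = rng.normal(size=n)                 # Factor B, independent of A
sample = (a + b > 1.0)                 # both factors raise the chance of being sampled

print(np.corrcoef(a, b)[0, 1])                    # ~0: no association in the population
print(np.corrcoef(a[sample], b[sample])[0, 1])    # clearly negative: induced by conditioning on Sample
```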

  11. Simpson’s Paradox • Consider the following data: • Men: P(A | T) = 0.5, P(A | U) = 0.45… • Women: P(A | T) = 0.39, P(A | U) = 0.333 • Treatment is superior in both groups!

  12. Simpson’s Paradox • Consider the following data: • Pooled: P(A | T) = 0.404, P(A | U) = 0.434 • In the “full” population, you’re better off not being Treated!
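The reversal is pure arithmetic: it only requires that treatment be concentrated in the group with the lower baseline recovery rate. The counts below are hypothetical, chosen only to roughly mimic the slide's proportions; they are not the slide's actual data.

```python
# Hypothetical counts (recovered, total) chosen to mimic the slide's proportions.
men   = {"T": (10, 20),  "U": (45, 100)}   # P(A|T)=0.50, P(A|U)=0.45
women = {"T": (39, 100), "U": (4, 12)}     # P(A|T)=0.39, P(A|U)=0.33

for arm in ("T", "U"):
    rec = men[arm][0] + women[arm][0]
    tot = men[arm][1] + women[arm][1]
    print(arm, round(rec / tot, 3))        # pooled: T ~0.41 < U ~0.44 -- the comparison flips
```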

  13. Simpson’s Paradox • Berkeley Graduate Admissions case

  14. More than independence • Independence & association can reveal only the Markov equivalence class • But our data contain more statistical information! • Algorithms that exploit this additional info can sometimes learn more (including unique graphs) • Example: the LiNGAM algorithm for non-Gaussian data

  15. Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = BD + ε • where B is permutable to lower triangular (because the graph is acyclic)

  16. Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = Aε • where A = (I – B)⁻¹

  17. Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = Aε • where A = (I – B)⁻¹ • ICA is an efficient estimator for A • ⇒ Efficient causal search that reveals direction! • C ⟶ E iff the corresponding entry of A is non-zero
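A rough sketch of the ICA step on a two-variable example (X → Y with uniform noise; coefficients invented), using scikit-learn's FastICA. The full LiNGAM procedure adds a permutation/scaling step and coefficient pruning, which are omitted here.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
n = 20_000
e1 = rng.uniform(-1, 1, size=n)        # independent non-Gaussian noise terms
e2 = rng.uniform(-1, 1, size=n)
x = e1                                 # X := e1
y = 2.0 * x + e2                       # Y := 2*X + e2, so B has a single non-zero entry

D = np.column_stack([x, y])
ica = FastICA(n_components=2, random_state=0).fit(D)
print(ica.mixing_)   # estimate of A = (I - B)^(-1), up to permutation and scaling of the sources
# LiNGAM's extra steps (permute/rescale so the implied B is strictly lower triangular)
# would recover X -> Y here; with Gaussian noise the direction would not be identifiable.
```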

  18. Non-Gaussian data • Why can we learn the directions in this case? • [Scatterplots of A vs. B: Gaussian noise vs. uniform noise]
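One way to see the asymmetry the scatterplots illustrate (a sketch with invented coefficients): regress in both directions and check whether the residual is independent of the regressor, here with a crude check based on squared values. With Gaussian noise both directions pass the check; with uniform noise only the true direction does.

```python
import numpy as np

def dep_score(reg, target):
    """Crude dependence check between a regressor and its OLS residual: corr of their squares."""
    slope = np.cov(reg, target)[0, 1] / np.var(reg)
    resid = target - slope * reg
    return np.corrcoef(reg**2, resid**2)[0, 1]

rng = np.random.default_rng(3)
n = 50_000
for name, noise in [("gaussian", rng.normal), ("uniform", lambda size: rng.uniform(-1, 1, size=size))]:
    a = noise(size=n)                   # A := noise
    b = a + noise(size=n)               # B := A + noise
    print(name, round(dep_score(a, b), 3), round(dep_score(b, a), 3))
    # gaussian: both directions ~0 (indistinguishable)
    # uniform:  A->B direction ~0, B->A direction clearly non-zero
```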

  19. Non-Gaussian data • Case study: European electricity cost

  20. Learning about latents • Sometimes, our real interest… • is in variables that are only indirectly observed • or observed by their effects • or unknown altogether but influencing things behind the scenes • [Diagram with variables: Sociability, General IQ, Other factors, Math skills, Test score, Size of social network, Reading level]

  21. Factor analysis • Assume linear equations • Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error

  22. Factor analysis • If we have one factor, then we find coefficients to minimize error in: F_i = a_i + b_i·U where U is the unobserved variable (with fixed mean and variance) • Two factors ⇒ Minimize error in: F_i = a_i + b_{i,1}·U_1 + b_{i,2}·U_2
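A minimal one-factor sketch with scikit-learn's FactorAnalysis (the loadings b_i, the noise level, and the number of features are invented): fit the model and read the estimated b_i off the loadings.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
n = 10_000
u = rng.normal(size=n)                                # the single unobserved factor U
b = np.array([0.9, 0.7, 0.5])                         # true loadings b_i (invented)
F = u[:, None] * b + 0.3 * rng.normal(size=(n, 3))    # F_i = b_i * U + error (a_i = 0 here)

fa = FactorAnalysis(n_components=1).fit(F)
print(fa.components_)   # estimated loadings b_i, up to sign -- close to [0.9, 0.7, 0.5]
print(fa.mean_)         # estimated intercepts a_i -- close to 0 here
```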

  23. Factor analysis • The decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff • Also, the interpretation of the unobserved factors must be provided by the scientist • The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)

  24. Factor analysis as graph search • One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: • [Graph: U → F1, F2, …, Fn]

  25. Factor analysis as graph search • Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: • [Graph: U1, U2 → F1, F2, …, Fn]

  26. Better methods for latents • Two different types of algorithms: • Determine which observed variables are caused by shared latents • BPC, FOFC, FTFC, … • Determine the causal structure among the latents • MIMBuild • Note: these need additional parametric assumptions • Usually linearity, though weaker assumptions can suffice

  27. Discovering latents • Key idea: For many parameterizations, the association between X & Y can be decomposed • Linearity ⇒ Cov(X, Y) decomposes into a sum of contributions from the paths connecting X & Y (products of edge coefficients) • ⇒ can use patterns in the precise associations to discover the number of latents • Using the ranks of different sub-matrices of the covariance matrix
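A sketch of the rank idea in the simplest case, one latent with four measured children (coefficients invented): the 2×2 cross-covariance of {A, B} with {C, D} should have rank 1, i.e., a near-zero second singular value, equivalently a vanishing tetrad difference cov(A,C)·cov(B,D) − cov(A,D)·cov(B,C) ≈ 0.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
U = rng.normal(size=n)                                  # single latent
A, B, C, D = (c * U + rng.normal(size=n) for c in (1.0, 0.8, 0.6, 0.4))

S = np.cov(np.vstack([A, B, C, D]))                     # 4x4 sample covariance
cross = S[:2, 2:]                                       # covariances of {A, B} with {C, D}
print(np.linalg.svd(cross, compute_uv=False))           # second singular value ~0  =>  rank 1
print(S[0, 2] * S[1, 3] - S[0, 3] * S[1, 2])            # vanishing tetrad difference ~0
```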

  28. Discovering latents • [Diagram: latent U with measured variables A, B, C, D]

  29. Discovering latents • [Diagram: latents U and L with measured variables A, B, C, D]

  30. Discovering latents • There are many instantiations of this type of search, for different parametric knowledge, numbers of observed variables (⇒ numbers of discoverable latents), etc. • And once we have one of these “clean” models, we can use “traditional” search algorithms (with modifications) to learn the structure between the latents

  31. Other Algorithms • CCD: Learn a DCG (directed cyclic graph, with non-obvious semantics) • ION: Learn global features from overlapping local variable sets (including relations between variables never measured together) • SAT-solver: Learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints • LoSST: Learn causal structure while that structure potentially changes over time • And lots of other ongoing research!

  32. Tetrad project • http://www.phil.cmu.edu/projects/tetrad/current.html
