Causal Search in the Real World
A menu of topics • Some real-world challenges: • Convergence & error bounds • Sample selection bias • Simpson’s paradox • Some real-world successes: • Learning based on more than just independence • Learning about latents & their structure
Short-run causal search • Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies • Of course, this is a problem for all inference: we might just be really unlucky • Note: This is not (really) the problem of unrepresentative samples (e.g., black swans)
Convergence in search • In search, we would like to bound our possible error as we acquire data • I.e., we want search procedures that have uniform convergence • Without uniform convergence: • We cannot set confidence intervals for inference • Bayesians with different priors over hypotheses need not agree on probable error bounds, no matter how loose
Pointwise convergence • Assume hypothesis H is true • Then • For any standard of “closeness” to H, and • For any standard of “successful refutation,” • For every hypothesis that is not “close” to H, there is a sample size for which that hypothesis is refuted
Uniform convergence • Assume hypothesis H is true • Then • For any standard of “closeness” to H, and • For any standard of “successful refutation,” • There is a sample size such that for all hypotheses H* that are not “close” to H, H* is refuted at that sample size.
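The difference between the two notions is only the order of the quantifiers. A hedged formalization, where the distance d, the closeness threshold δ, the refutation standard α, and the predicate Ref_α(H*, n) ("H* is refuted at sample size n") are notation assumed here and not taken from the slides:

```latex
\begin{align*}
\text{Pointwise:}\quad & \forall \delta\ \forall \alpha\ \forall H^{*}\
  \bigl[\, d(H^{*}, H) > \delta \;\Rightarrow\; \exists n\ \mathrm{Ref}_{\alpha}(H^{*}, n) \,\bigr] \\
\text{Uniform:}\quad & \forall \delta\ \forall \alpha\ \exists n\ \forall H^{*}\
  \bigl[\, d(H^{*}, H) > \delta \;\Rightarrow\; \mathrm{Ref}_{\alpha}(H^{*}, n) \,\bigr]
\end{align*}
```

In the pointwise case the required sample size n may depend on H*, so no single n is guaranteed to work for every alternative at once.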
Two theorems about convergence • There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993) • There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)
Two theorems about convergence • What if we didn’t care about “small” causes? • ε-Faithfulness: If X & Y are d-connected given S, then |ρ_{XY.S}| > ε • I.e., every association predicted by d-connection has magnitude greater than ε • For any ε > 0, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness • So we have error bounds, confidence intervals, etc.
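A minimal sketch of what ε-Faithfulness rules out, assuming a hypothetical linear model X → Z → Y plus a direct edge X → Y whose coefficient almost cancels the path through Z. The coefficients, the sample size, and the threshold ε = 0.05 are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

a, b = 0.8, 0.5        # hypothetical coefficients: X -> Z and Z -> Y
delta = 0.01           # what is left over after the two paths nearly cancel
c = -a * b + delta     # direct X -> Y coefficient

x = rng.normal(size=n)
z = a * x + rng.normal(size=n)
y = b * z + c * x + rng.normal(size=n)

# X and Y are d-connected (there is a direct edge), yet the marginal
# correlation is tiny because the total effect of X on Y is only delta.
rho_xy = np.corrcoef(x, y)[0, 1]

epsilon = 0.05
violates_eps_faithfulness = abs(rho_xy) <= epsilon
print(f"rho_XY = {rho_xy:.4f}, violates eps-faithfulness: {violates_eps_faithfulness}")
```

Associations this weak can be made arbitrarily hard to detect by shrinking delta, which is why excluding them by assumption restores a sample size that works for all remaining models.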
Sample selection bias • Sometimes, a variable of interest is a cause of whether people get in the sample • E.g., measure various skills or knowledge in college students • Or measuring joblessness by a phone survey during the middle of the day • Simple problem: You might get a skewed picture of the population
Sample selection bias • If two variables matter, then we have: [Graph: Factor A → Sample ← Factor B] • Sample = 1 for everyone we measure • That is equivalent to conditioning on Sample • ⇒ Induces an association between A and B!
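The induced association can be seen in a small simulation. This is a sketch under assumed details: the two factors are independent standard normals, and the hypothetical selection rule admits a person into the sample when the sum of the factors exceeds 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two factors that are independent in the full population.
factor_a = rng.normal(size=n)
factor_b = rng.normal(size=n)

# Hypothetical selection rule: both factors raise the chance of being sampled.
in_sample = (factor_a + factor_b) > 1.0

corr_population = np.corrcoef(factor_a, factor_b)[0, 1]
corr_sample = np.corrcoef(factor_a[in_sample], factor_b[in_sample])[0, 1]

print(f"population correlation: {corr_population:.3f}")
print(f"correlation among the selected: {corr_sample:.3f}")
```

Among the selected, a low value of one factor has to be offset by a high value of the other, so the two become negatively correlated even though neither causes the other.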
Simpson’s Paradox • Consider the following data: • Men: P(A | T) = 0.5, P(A | U) = 0.45… • Women: P(A | T) = 0.39, P(A | U) = 0.333 • Treatment is superior in both groups!
Simpson’s Paradox • Consider the following data: • Pooled: P(A | T) = 0.404, P(A | U) = 0.434 • In the “full” population, you’re better off not being Treated!
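The reversal is plain arithmetic once the group sizes are unequal. The counts below are hypothetical: the slides give only rates, so these (outcome count, group size) pairs were chosen to give rates close to those quoted and are not the original data.

```python
# Hypothetical (outcome A count, group size) pairs; NOT the original data.
counts = {
    "men":   {"treated": (7, 14),   "untreated": (20, 44)},
    "women": {"treated": (39, 100), "untreated": (3, 9)},
}
arms = ("treated", "untreated")

# Rate of outcome A within each group and arm.
subgroup = {
    group: {arm: counts[group][arm][0] / counts[group][arm][1] for arm in arms}
    for group in counts
}

# Rate of outcome A after pooling the groups: add the counts, then divide.
pooled = {}
for arm in arms:
    successes = sum(counts[group][arm][0] for group in counts)
    total = sum(counts[group][arm][1] for group in counts)
    pooled[arm] = successes / total

for group, rates in subgroup.items():
    print(group, {arm: round(rate, 3) for arm, rate in rates.items()})
print("pooled", {arm: round(rate, 3) for arm, rate in pooled.items()})
```

With these counts most treated people are women, whose rate of A is lower in both arms, so pooling drags the treated rate below the untreated rate.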
Simpson’s Paradox • Berkeley Graduate Admissions case
More than independence • Independence & association can reveal only the Markov equivalence class • But our data contain more statistical information! • Algorithms that exploit this additional info can sometimes learn more (including unique graphs) • Example: LiNGaM algorithm for non-Gaussian data
Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = BD + ε • where B is permutable to lower triangular (because graph is acyclic)
Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = Aε • where A = (I − B)⁻¹
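Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = Aε • where A = (I − B)⁻¹ • ICA is an efficient estimator for A • ⇒ Efficient causal search that reveals direction! • C has a directed path to E iff the corresponding entry in A is non-zero (barring exact cancellation); the direct edges C ⟶ E are the non-zero entries of B = I − A⁻¹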
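The algebra linking B and A can be checked numerically. A minimal sketch, assuming a hypothetical three-variable chain X1 → X2 → X3 with made-up coefficients and uniform noise; it only verifies the identities and does not run ICA.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical chain X1 -> X2 -> X3; B[i, j] is the direct effect of X_j on X_i.
B = np.array([
    [0.0,  0.0, 0.0],
    [0.8,  0.0, 0.0],
    [0.0, -0.5, 0.0],
])
I = np.eye(3)

# Reduced form: D = B D + eps  <=>  D = (I - B)^{-1} eps = A eps.
A = np.linalg.inv(I - B)

eps = rng.uniform(-1.0, 1.0, size=(3, 1000))   # independent non-Gaussian noise
D = A @ eps

# A holds total effects along directed paths: A[2, 0] is non-zero although
# there is no direct edge X1 -> X3. The direct edges are recovered from A.
B_recovered = I - np.linalg.inv(A)

print(np.round(A, 3))
print(np.round(B_recovered, 3))
```

In an actual search A is estimated from data (for example by ICA, up to permutation and scaling of its columns), and the acyclicity constraint is then used to fix the ordering.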
Non-Gaussian data • Why can we learn the directions in this case? • [Figure: two panels for variables A and B, one with Gaussian noise and one with uniform noise]
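One way to see the asymmetry is to regress in both directions and ask whether the residual is independent of the regressor. The sketch below assumes a hypothetical model A → B with coefficient 1 and uses a crude dependence score, the correlation between squared residual and squared regressor. This is an illustration only, not the LiNGAM algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def residual_dependence(cause, effect):
    """Regress effect on cause; return corr(residual^2, centered cause^2)."""
    cause_c = cause - cause.mean()
    effect_c = effect - effect.mean()
    slope = (cause_c @ effect_c) / (cause_c @ cause_c)
    residual = effect_c - slope * cause_c
    return np.corrcoef(residual**2, cause_c**2)[0, 1]

def both_directions(noise):
    """Simulate A -> B with the given noise sampler and score both directions."""
    a = noise(n)
    b = a + noise(n)
    return residual_dependence(a, b), residual_dependence(b, a)

uniform_forward, uniform_backward = both_directions(lambda k: rng.uniform(-1, 1, size=k))
gauss_forward, gauss_backward = both_directions(lambda k: rng.normal(size=k))

print(f"uniform:  A->B {uniform_forward:+.3f}   B->A {uniform_backward:+.3f}")
print(f"gaussian: A->B {gauss_forward:+.3f}   B->A {gauss_backward:+.3f}")
```

With uniform noise only the true direction leaves a residual that looks independent of the regressor; with Gaussian noise both directions do, so the data cannot distinguish A → B from B → A.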
Non-Gaussian data • Case study: European electricity cost
Learning about latents • Sometimes, our real interest… • is in variables that are only indirectly observed • or observed by their effects • or unknown altogether but influencing things behind the scenes • [Figure: graph over Sociability, General IQ, Other factors, Math skills, Test score, Size of social network, Reading level]
Factor analysis • Assume linear equations • Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error
Factor analysis • If we have one factor, then we find coefficients to minimize error in: F_i = a_i + b_i U where U is the unobserved variable (with fixed mean and variance) • Two factors ⇒ Minimize error in: F_i = a_i + b_{i,1} U_1 + b_{i,2} U_2
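A minimal sketch of the one-factor case, assuming U has mean 0 and variance 1 and each feature has independent unit-variance noise. The intercepts and loadings are made-up values. The loadings are recovered with the classical triad identity cov(F_i, F_j) = b_i b_j, which is simpler than, and not the same as, the ML fit used in practice.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Hypothetical intercepts a_i and loadings b_i for three observed features.
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.9, 0.7, 0.5])

U = rng.normal(size=n)                                    # unobserved factor
F = a[:, None] + b[:, None] * U + rng.normal(size=(3, n))

S = np.cov(F)

# Since cov(F_i, F_j) = b_i * b_j for i != j, each squared loading is a
# ratio of covariances, e.g. b_1^2 = S12 * S13 / S23 (sign fixed as positive).
b_hat = np.sqrt(np.array([
    S[0, 1] * S[0, 2] / S[1, 2],
    S[0, 1] * S[1, 2] / S[0, 2],
    S[0, 2] * S[1, 2] / S[0, 1],
]))
a_hat = F.mean(axis=1)    # E[F_i] = a_i because U has mean 0

print("loadings:  ", np.round(b_hat, 3))
print("intercepts:", np.round(a_hat, 3))
```

The loadings are identified only up to a common sign flip of U, which is one reason the mean and variance of U have to be fixed in advance.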
Factor analysis • Decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff • Also, the interpretation of the unobserved factors must be provided by the scientist • The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)
Factor analysis as graph search • One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: [Graph: U → F1, U → F2, …, U → Fn]
Factor analysis as graph search • Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: [Graph: U1 and U2 each → F1, F2, …, Fn]
Better methods for latents • Two different types of algorithms: • Determine which observed variables are caused by shared latents • BPC, FOFC, FTFC, … • Determine the causal structure among the latents • MIMBuild • Note: need additional parametric assumptions • Usually linearity, but can do it with weaker info
Discovering latents • Key idea: For many parameterizations, association between X & Y can be decomposed • Linearity ⇒ • ⇒ can use patterns in the precise associations to discover the number of latents • Using the ranks of different sub-matrices
Discovering latents • [Figure: graph with latent U and observed variables A, B, C, D]
Discovering latents • [Figure: graph with latents U and L and observed variables A, B, C, D]
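The rank idea can be illustrated with 2×2 sub-matrices of the covariance matrix, whose determinants are the classical "tetrad" differences. The sketch assumes two hypothetical linear models with made-up loadings: one where a single latent U causes A, B, C, D, and one where U causes A, B, a second latent L causes C, D, and U → L. It is not an implementation of BPC, FOFC, or FTFC.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

def indicators(latent, loadings):
    """One observed variable per loading: loading * latent + unit-variance noise."""
    return np.array([w * latent + rng.normal(size=latent.size) for w in loadings])

def tetrads(X):
    """Two tetrad differences for rows ordered (A, B, C, D)."""
    S = np.cov(X)
    ab, ac, ad = S[0, 1], S[0, 2], S[0, 3]
    bc, bd, cd = S[1, 2], S[1, 3], S[2, 3]
    return ab * cd - ac * bd, ac * bd - ad * bc

U = rng.normal(size=n)
one_latent = indicators(U, [0.9, 0.8, 0.7, 0.6])

L = 0.5 * U + rng.normal(size=n)   # second latent, caused by U
two_latent = np.vstack([indicators(U, [0.9, 0.8]), indicators(L, [0.7, 0.6])])

one_t = tetrads(one_latent)
two_t = tetrads(two_latent)
print("one latent: ", np.round(one_t, 3))
print("two latents:", np.round(two_t, 3))
```

Under one latent every tetrad difference vanishes; under the two-latent model only the one that crosses the {A, B} | {C, D} split vanishes, and that pattern is what indicates which observed variables share a latent.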
Discovering latents • Many instantiations of this type of search for different parametric knowledge, # of observed variables (⇒ # of discoverable latents), etc. • And once we have one of these “clean” models, can use “traditional” search algorithms (with modifications) to learn structure between the latents
Other Algorithms • CCD: Learn DCG (with non-obvious semantics) • ION: Learn global features from overlapping local sets (including between not co-measured variables) • SAT-solver: Learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints • LoSST: Learn causal structure while that structure potentially changes over time • And lots of other ongoing research!
Tetrad project • http://www.phil.cmu.edu/projects/tetrad/current.html