Causal Search in the Real World
A menu of topics • Some real-world challenges: • Convergence & error bounds • Sample selection bias • Simpson’s paradox • Some real-world successes: • Learning based on more than just independence • Learning about latents & their structure
Short-run causal search • Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies • Of course, this is a problem for all inference: we might just be really unlucky • Note: This is not (really) the problem of unrepresentative samples (e.g., black swans)
Convergence in search • In search, we would like to bound our possible error as we acquire data • I.e., we want search procedures that have uniform convergence • Without uniform convergence: • Cannot set confidence intervals for inference • Bayesians need not agree on probable bounds (regardless of their priors over hypotheses), no matter how loose
Pointwise convergence • Assume hypothesis H is true • Then • For any standard of “closeness” to H, and • For any standard of “successful refutation,” • For every hypothesis that is not “close” to H, there is a sample size for which that hypothesis is refuted
Uniform convergence • Assume hypothesis H is true • Then • For any standard of “closeness” to H, and • For any standard of “successful refutation,” • There is a sample size such that for all hypotheses H* that are not “close” to H, H* is refuted at that sample size.
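One way to make the contrast precise (a sketch in standard notation, not taken from the slides: B_ε(H) stands for the hypotheses "close" to H, and 1 − δ for the refutation standard) is that the two notions differ only in the order of two quantifiers:

```latex
% Pointwise convergence: the required sample size may depend on the alternative H*
\forall \varepsilon, \delta > 0 \;\; \forall H^{*} \notin B_{\varepsilon}(H) \;\;
  \exists n : \; P_{n}\!\left(H^{*} \text{ is refuted}\right) \geq 1 - \delta

% Uniform convergence: a single sample size works for all alternatives at once
\forall \varepsilon, \delta > 0 \;\; \exists n \;\;
  \forall H^{*} \notin B_{\varepsilon}(H) : \; P_{n}\!\left(H^{*} \text{ is refuted}\right) \geq 1 - \delta
```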
Two theorems about convergence • There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993) • There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)
Two theorems about convergence • What if we didn’t care about “small” causes? • ε-Faithfulness: If X & Y are d-connected given S, then |ρ_XY.S| > ε • Every association predicted by d-connection exceeds ε in magnitude • For any ε > 0, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness • So we have error bounds, confidence intervals, etc.
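As a concrete illustration of how ε-Faithfulness plugs into a constraint-based test (a minimal sketch, not from the slides; the helper names, the toy chain X → Z → Y, and the threshold eps = 0.05 are assumptions for the demo):

```python
import numpy as np

def partial_corr(x, y, S):
    """Partial correlation of x and y given the columns of S,
    computed by correlating the residuals of linear regressions on S."""
    S1 = np.column_stack([np.ones(len(x)), S])            # add intercept
    rx = x - S1 @ np.linalg.lstsq(S1, x, rcond=None)[0]   # residual of x ~ S
    ry = y - S1 @ np.linalg.lstsq(S1, y, rcond=None)[0]   # residual of y ~ S
    return np.corrcoef(rx, ry)[0, 1]

def independent_under_eps_faithfulness(x, y, S, eps=0.05):
    # Under eps-Faithfulness, any genuine d-connection gives |rho_{XY.S}| > eps,
    # so |rho_{XY.S}| <= eps is taken as evidence of d-separation.
    return abs(partial_corr(x, y, S)) <= eps

# Toy chain X -> Z -> Y: X and Y should test as independent given Z.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
z = 0.8 * x + rng.normal(size=5000)
y = 0.8 * z + rng.normal(size=5000)
print(independent_under_eps_faithfulness(x, y, z.reshape(-1, 1)))     # True (typically)
print(independent_under_eps_faithfulness(x, y, np.empty((5000, 0))))  # False: marginally dependent
```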
Sample selection bias • Sometimes, a variable of interest is a cause of whether people get in the sample • E.g., measure various skills or knowledge in college students • Or measuring joblessness by a phone survey during the middle of the day • Simple problem: You might get a skewed picture of the population
Sample selection bias • [Diagram: Factor A → Sample ← Factor B] • If two variables matter, then we have: • Sample = 1 for everyone we measure • That is equivalent to conditioning on Sample • ⇒ Induces an association between A and B!
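A quick simulation of that effect (a sketch with made-up numbers: A and B are generated independently, and the selection rule depending on both is an assumption for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
A = rng.normal(size=n)                  # Factor A
B = rng.normal(size=n)                  # Factor B, independent of A
# Selection: a unit only enters the sample if A + B (plus noise) is large enough
in_sample = (A + B + rng.normal(size=n)) > 1.0

print(np.corrcoef(A, B)[0, 1])                        # ~ 0.00 in the population
print(np.corrcoef(A[in_sample], B[in_sample])[0, 1])  # clearly negative within the sample
```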
Simpson’s Paradox • Consider the following data: • Men: P(A | T) = 0.5, P(A | U) = 0.45… • Women: P(A | T) = 0.39, P(A | U) = 0.333 • Treatment is superior in both groups!
Simpson’s Paradox • Consider the following data: • Pooled: P(A | T) = 0.404, P(A | U) = 0.434 • In the “full” population, you’re better off not being Treated!
Simpson’s Paradox • Berkeley Graduate Admissions case
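The reversal is easy to reproduce with illustrative counts (these numbers are invented for the demo; they are not the counts behind the percentages above or the Berkeley data):

```python
# (treated_recovered, treated_total, untreated_recovered, untreated_total)
men   = (18, 20, 64, 80)    # 0.90 vs 0.80 -> treatment looks better
women = (16, 80,  2, 20)    # 0.20 vs 0.10 -> treatment looks better

def rates(group):
    tr, tn, ur, un = group
    return tr / tn, ur / un

print("men:   ", rates(men))
print("women: ", rates(women))

# Pool the two groups: treatment now looks worse, because most treated
# patients are women (the low-recovery group) and most untreated are men.
pooled = tuple(a + b for a, b in zip(men, women))   # (34, 100, 66, 100)
print("pooled:", rates(pooled))                     # 0.34 vs 0.66
```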
More than independence • Independence & association can reveal only the Markov equivalence class • But our data contain more statistical information! • Algorithms that exploit this additional info can sometimes learn more (including unique graphs) • Example: LiNGAM algorithm for non-Gaussian data
Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = BD + ε • where B is permutable to lower triangular (because graph is acyclic)
Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = Aε • where A = (I – B)⁻¹
Non-Gaussian data • Assume linearity & independent non-Gaussian noise • Linear causal DAG functions are: D = Aε • where A = (I – B)⁻¹ • ICA is an efficient estimator for A • ⇒ Efficient causal search that reveals direction! • C ⟶ E iff the corresponding entry of A is non-zero
Non-Gaussian data • Why can we learn the directions in this case? • [Figure: scatter plots of A vs. B under Gaussian noise and under uniform noise]
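A small numerical illustration of that asymmetry (a sketch, not the actual LiNGAM/ICA estimator; the model A → B with uniform noise and the cubic-moment check below are assumptions for the demo, the latter being only a crude proxy for a real independence test): in the true direction the regression residual is independent of the regressor, while in the reverse direction it is merely uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# True model: A -> B with uniform (non-Gaussian) noise
A = rng.uniform(-1, 1, n)
B = 2.0 * A + rng.uniform(-1, 1, n)

def residual(target, regressor):
    """OLS residual of target regressed on regressor (with intercept)."""
    X = np.column_stack([np.ones(len(target)), regressor])
    return target - X @ np.linalg.lstsq(X, target, rcond=None)[0]

def higher_order_dependence(resid, regressor):
    # Residuals are uncorrelated with the regressor by construction, so look at
    # a higher-order statistic; values far from 0 signal remaining dependence.
    return np.corrcoef(resid, regressor ** 3)[0, 1]

print(higher_order_dependence(residual(B, A), A))  # ~ 0: the A -> B direction fits
print(higher_order_dependence(residual(A, B), B))  # clearly nonzero: B -> A does not
```

With Gaussian noise both numbers would be near zero, which is why direction is only identifiable in the non-Gaussian case.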
Non-Gaussian data • Case study: European electricity cost
Learning about latents • Sometimes, our real interest… • is in variables that are only indirectly observed • or observed by their effects • or unknown altogether but influencing things behind the scenes • [Diagram: latent factors Sociability, General IQ, and Other factors influencing observed measures Math skills, Test score, Size of social network, Reading level]
Factor analysis • Assume linear equations • Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error
Factor analysis • If we have one factor, then we find coefficients to minimize error in: F_i = a_i + b_i·U, where U is the unobserved variable (with fixed mean and variance) • Two factors ⇒ Minimize error in: F_i = a_i + b_i,1·U_1 + b_i,2·U_2
Factor analysis • Decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff • Also, the interpretation of the unobserved factors must be provided by the scientist • The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”)
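A minimal sketch with scikit-learn's FactorAnalysis (the two-latent generating model and loadings below are made up; in practice the choice of n_components would come from a simplicity-vs-fit criterion such as held-out likelihood or BIC):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 5000
# Two latent factors driving six observed features, plus independent noise
U = rng.normal(size=(n, 2))
loadings = np.array([[1.0, 0.0],   # F1 loads on U1
                     [0.8, 0.0],   # F2 loads on U1
                     [0.6, 0.0],   # F3 loads on U1
                     [0.0, 1.0],   # F4 loads on U2
                     [0.0, 0.9],   # F5 loads on U2
                     [0.0, 0.7]])  # F6 loads on U2
X = U @ loadings.T + 0.3 * rng.normal(size=(n, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(fa.components_.round(2))  # estimated loadings (recovered only up to rotation/sign)
print(fa.score(X))              # average log-likelihood: the "fit" side of the trade-off
```

Note that the recovered loadings carry no labels: as the slide says, interpreting U1 and U2 is up to the scientist.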
Factor analysis as graph search • One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: [U → F1, F2, …, Fn]
Factor analysis as graph search • Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph: [U1, U2 → F1, F2, …, Fn]
Better methods for latents • Two different types of algorithms: • Determine which observed variables are caused by shared latents • BPC, FOFC, FTFC, … • Determine the causal structure among the latents • MIMBuild • Note: need additional parametric assumptions • Usually linearity, but can do it with weaker info
Discovering latents • Key idea: For many parameterizations, the association between X & Y can be decomposed • Linearity ⇒ the association factors through the latent (e.g., Cov(X, Y) = λ_X·λ_Y·Var(U) when U is their only common cause) • ⇒ can use patterns in the precise associations to discover the number of latents • Using the ranks of different sub-matrices
Discovering latents • [Diagram: latent U with observed variables A, B, C, D]
Discovering latents • [Diagram: latents U and L with observed variables A, B, C, D]
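A sketch of the rank idea for a one-latent model like the first diagram above (assuming U is a common cause of A, B, C, D; the coefficients are made up for the demo): the cross-covariance block between {A, B} and {C, D} then has rank 1, so the corresponding tetrad difference vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
U = rng.normal(size=n)                     # single latent common cause
A = 1.0 * U + rng.normal(size=n)
B = 0.8 * U + rng.normal(size=n)
C = 0.7 * U + rng.normal(size=n)
D = 0.5 * U + rng.normal(size=n)

cov = np.cov([A, B, C, D])                 # 4x4 sample covariance
block = cov[np.ix_([0, 1], [2, 3])]        # Cov({A, B}, {C, D})

# Tetrad difference Cov(A,C)Cov(B,D) - Cov(A,D)Cov(B,C):
# it vanishes exactly when this block has rank <= 1.
print(np.linalg.det(block))                    # ~ 0 -> consistent with one latent
print(np.linalg.matrix_rank(block, tol=0.01))  # 1
```

With two latents (the second diagram), some of these sub-matrix ranks increase, which is the signal the search algorithms exploit.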
Discovering latents • Many instantiations of this type of search for different parametric knowledge, # of observed variables (⇒ # of discoverable latents), etc. • And once we have one of these “clean” models, can use “traditional” search algorithms (with modifications) to learn structure between the latents
Other Algorithms • CCD: Learn directed cyclic graphs (with non-obvious semantics) • ION: Learn global features from overlapping local variable sets (including relations between variables never measured together) • SAT-solver: Learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints • LoSST: Learn causal structure while that structure potentially changes over time • And lots of other ongoing research!
Tetrad project • http://www.phil.cmu.edu/projects/tetrad/current.html