Principles Underlying Causal Search Algorithms
Course files http://www.andrew.cmu.edu/~ddanks/NASSLLI/
Fundamental problem • As we have all heard many times… “Correlation is not causation!”
Fundamental problem • Why is this slogan correct? • Causal hypotheses make implicit claims about the effects of intervening on (manipulating) one or more variables • Hypotheses about association or correlation make no such claims • Correlation or probabilistic dependence can be produced in many ways
Fundamental problem • Some of the possible reasons why X and Y might be associated are: • Sheer chance • X causes Y • Y causes X • Some third variable Z influences X and Y • The value of X (or a cause of X) and the value of Y (or a cause of Y) can be causes/reasons for whether an individual is in the sample (sample selection bias)
Fundamental problem • Fundamental problem of causal search: • For any particular set of data, there are often many different causal structures that could have produced that data • Causation → Association map is many → one
Fundamental problem • Okay, so what can we do about this? • Use the data to figure out as much as possible (though it usually won’t be everything) • Requires developing search procedures • And then try to narrow the possibilities • Use other knowledge (e.g., time order, interventions) • Get better / different data (e.g., run an experiment)
Always remember… Even if we cannot discover the whole truth, we might be able to find some of the truth!
Markov equivalence • Formally, we say that: • Two causal graphs are members of the same Markov Equivalence Class iff they imply the exact same (un)conditional independence relations among the observed variables • By the Markov and Faithfulness assumptions • Remember that d-separation gives a purely graphical criterion for determining all of the (un)conditional independencies
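To make the d-separation criterion concrete, here is a minimal Python sketch. It uses the moralized-ancestral-graph formulation, which is equivalent to the path-based definition but is swapped in here because it is shorter to code. The function name `d_separated` and the edge-set representation (a set of `(parent, child)` pairs) are assumptions for illustration, not part of the course files.

```python
from itertools import combinations

def d_separated(edges, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `edges`."""
    given = set(given)
    # 1. Keep only x, y, the conditioning set, and all of their ancestors.
    relevant = {x, y} | given
    frontier = list(relevant)
    while frontier:
        node = frontier.pop()
        for parent, child in edges:
            if child == node and parent not in relevant:
                relevant.add(parent)
                frontier.append(parent)
    sub = [(p, c) for p, c in edges if p in relevant and c in relevant]
    # 2. Moralize: link parents that share a child, then drop directions.
    undirected = {frozenset(e) for e in sub}
    for (p1, c1), (p2, c2) in combinations(sub, 2):
        if c1 == c2 and p1 != p2:
            undirected.add(frozenset((p1, p2)))
    # 3. Remove the conditioning set and test whether y is reachable from x.
    reached, frontier = {x}, [x]
    while frontier:
        node = frontier.pop()
        for edge in undirected:
            if node in edge:
                (other,) = edge - {node}
                if other not in given and other not in reached:
                    reached.add(other)
                    frontier.append(other)
    return y not in reached

collider = {("X", "Z"), ("Y", "Z")}
chain = {("X", "Z"), ("Z", "Y")}
print(d_separated(collider, "X", "Y"))       # True
print(d_separated(collider, "X", "Y", "Z"))  # False
print(d_separated(chain, "X", "Y", "Z"))     # True
```

Under the Markov and Faithfulness assumptions, each `True` corresponds to an independence in the data and each `False` to a dependence.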
Markov equivalence • The “Fundamental Problem of Causal Inference” can be restated as: • For some sets of independence relations, the Markov equivalence class is not a singleton • Markov equivalence classes give a precise characterization of what can be inferred from independencies alone
Markov equivalence • Examples: • X ⫫ {Y, Z} ⇒ Y → Z or Y ← Z, with X unconnected • X ⫫ Y | Z ⇒ X → Z → Y, X ← Z ← Y, or X ← Z → Y • X ⫫ Y ⇒ X → Z ← Y
Markov equivalence • Two more examples: • Are these graphs Markov equivalent? [Figure: two graphs over X, Y, Z] • Are these two graphs? [Figure: two graphs over X, Y, Z]
Shared structure • What is shared by all of the graphs in a Markov equivalence class? • Same “skeleton” • I.e., they all have the same adjacency relations • Same “unshielded colliders” • I.e., X → Y ← Z with no edge between X and Z • Sometimes, other edges have the same direction • In these last two cases, we can infer that the true graph contains the shared directed edges.
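The two shared features give a simple test for Markov equivalence of two DAGs over the same variables: compare skeletons and unshielded colliders. A minimal Python sketch follows, assuming each graph is a set of `(parent, child)` pairs; the function names are illustrative only.

```python
from itertools import combinations

def skeleton(edges):
    """Adjacencies of a DAG, ignoring edge direction."""
    return {frozenset(e) for e in edges}

def unshielded_colliders(edges):
    """All triples (a, c, b) with a -> c <- b and no edge between a and b."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(g1, g2):
    """Same skeleton and same unshielded colliders."""
    return (skeleton(g1) == skeleton(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))

chain = {("X", "Z"), ("Z", "Y")}
fork = {("Z", "X"), ("Z", "Y")}
collider = {("X", "Z"), ("Y", "Z")}
print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False
```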
Shared structure as patterns • Since every Markov equivalent graph has the same adjacencies, we can represent the whole class using a pattern • A pattern is itself a graph, but the edges represent edges in other graphs
Shared structure as patterns • A pattern can have directed and undirected edges • It represents all graphs that can be created by adding arrowheads to the undirected edges without creating either: (i) a cycle; or (ii) an unshielded collider • Let’s try some examples…
Shared structure as patterns • Pattern: Nitrogen — PlantGrowth — Bees • Graphs it represents: • Nitrogen → PlantGrowth → Bees • Nitrogen ← PlantGrowth → Bees • Nitrogen ← PlantGrowth ← Bees
Shared structure as patterns • Pattern: Nitrogen → PlantGrowth ← Bees • Graph it represents: • Nitrogen → PlantGrowth ← Bees
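The rule for reading a pattern can be sketched in Python: try every orientation of the undirected edges and keep those that create neither a cycle nor a new unshielded collider. This is a brute-force illustration under assumed names (`graphs_in_pattern` and its helpers are hypothetical), suitable only for small patterns.

```python
from itertools import combinations, product

def has_cycle(nodes, edges):
    """Detect a directed cycle by repeatedly removing parentless nodes."""
    remaining = set(nodes)
    while remaining:
        roots = {n for n in remaining
                 if not any(c == n and p in remaining for p, c in edges)}
        if not roots:
            return True
        remaining -= roots
    return False

def colliders(edges, skel):
    """Unshielded colliders (a, c, b): a -> c <- b with a, b non-adjacent."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    return {(a, c, b)
            for c, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset((a, b)) not in skel}

def graphs_in_pattern(directed, undirected):
    """All DAGs obtained by orienting the undirected edges of a pattern."""
    undirected = sorted(undirected)
    all_edges = list(directed) + undirected
    nodes = {n for e in all_edges for n in e}
    skel = {frozenset(e) for e in all_edges}
    allowed = colliders(directed, skel)  # colliders already in the pattern
    result = []
    for flips in product([False, True], repeat=len(undirected)):
        oriented = {(b, a) if flip else (a, b)
                    for (a, b), flip in zip(undirected, flips)}
        edges = set(directed) | oriented
        if not has_cycle(nodes, edges) and colliders(edges, skel) == allowed:
            result.append(edges)
    return result

first = graphs_in_pattern(
    set(), [("Nitrogen", "PlantGrowth"), ("PlantGrowth", "Bees")])
second = graphs_in_pattern(
    {("Nitrogen", "PlantGrowth"), ("Bees", "PlantGrowth")}, [])
print(len(first), len(second))  # 3 1
```

The first pattern yields the three non-collider graphs; the second is already fully oriented, so it represents only itself.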
Formal problem of search • Given some dataset D, find: • Markov equivalence class, represented as a pattern P, that predicts exactly the independence relations found in the data • More colloquially, find the causal graphs that could have produced data like this
Hard to find a pattern • “Gee, how hard could this be? Just test all of the associations, find the Markov equivalence class, then write down the pattern for it. Voila! We’re doing causal learning!” • Big problem: # of independencies to test grows exponentially in # of variables: • 2 variables ⇒ 1 test • 3 variables ⇒ 6 tests • 4 variables ⇒ 24 tests • 5 variables ⇒ 80 tests • 6 variables ⇒ 240 tests • and so on…
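The listed counts agree with one test per pair of variables and per conditioning subset of the remaining variables, i.e. C(n, 2) · 2^(n−2). That formula is an inference from the numbers on the slide rather than something the slides state; the short Python check below (with the hypothetical helper `num_independence_tests`) reproduces them.

```python
from math import comb

def num_independence_tests(n):
    """Pairs of variables times subsets of the other n - 2 variables."""
    return comb(n, 2) * 2 ** (n - 2)

for n in range(2, 7):
    print(n, "variables =>", num_independence_tests(n), "tests")
```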
General features of causal search • Huge model and parameter spaces • Even when we (necessarily) use prior information about the family of probability distributions. • Relevant statistics must be rapidly computed • But substantive knowledge about the domain may restrict the space of alternative models • Time order of variables • Required cause/effect relationships • Existence or non-existence of latent variables
Three schemata for search • Bayesian / score-based • Find the graph(s) with highest P(graph | data) • Constraint-based • Find the graph(s) that predict exactly the observed associations and independencies • Combined • Get “close” with constraint-based, and then find the best graph using score-based
Bayesian / score-based • Informally: • Give each model an initial score using “prior beliefs” • Update each score based on the likelihood of the data if the model were true • Output the highest-scoring model • Formally: • Specify P(M, v) for all models M and possible parameter values v of M • For any data D, P(D | M, v) can easily be calculated • P(M | D) ∝ ∫ P(D | M, v) P(M, v) dv
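As a toy illustration of the scoring formula, the Python sketch below compares two models of binary variables X and Y: no edge versus X → Y. Everything specific here is an assumption made for the example: the coarse parameter grid `GRID` stands in for the integral, the priors are uniform over models and grid points, and the data are made up.

```python
from itertools import product

GRID = [0.1, 0.3, 0.5, 0.7, 0.9]  # assumed parameter grid (replaces the integral)

def bern(p, value):
    """Probability of a binary value under success probability p."""
    return p if value == 1 else 1 - p

def lik_independent(data, p, q):
    """P(D | no edge, v) with v = (P(X=1), P(Y=1))."""
    out = 1.0
    for x, y in data:
        out *= bern(p, x) * bern(q, y)
    return out

def lik_edge(data, p, q0, q1):
    """P(D | X -> Y, v) with v = (P(X=1), P(Y=1|X=0), P(Y=1|X=1))."""
    out = 1.0
    for x, y in data:
        out *= bern(p, x) * bern(q1 if x == 1 else q0, y)
    return out

def posterior(data):
    """P(M | D), summing P(D | M, v) P(M, v) over the grid for each model."""
    prior_m = 0.5  # assumed uniform prior over the two models
    s_indep = sum(lik_independent(data, p, q) * prior_m / len(GRID) ** 2
                  for p, q in product(GRID, repeat=2))
    s_edge = sum(lik_edge(data, p, q0, q1) * prior_m / len(GRID) ** 3
                 for p, q0, q1 in product(GRID, repeat=3))
    total = s_indep + s_edge
    return {"no edge": s_indep / total, "X -> Y": s_edge / total}

# Made-up sample in which X and Y usually agree.
data = [(1, 1)] * 8 + [(0, 0)] * 8 + [(1, 0)] * 2 + [(0, 1)] * 2
print(posterior(data))
```

Because X → Y and Y → X are Markov equivalent, a score of this kind cannot separate them; the comparison only speaks to whether an edge is present.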
Bayesian / score-based • In practice, this strategy is completely computationally intractable • There are too many graphs to check them all • So, we use a greedy search strategy • Start with an initial graph • Iteratively compare the current graph’s score (∝ posterior probability) with that of each 1- or 2-step modification of that graph • By edge addition, deletion or reversal
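The greedy strategy can be sketched in Python as hill-climbing over DAGs. This sketch considers only 1-step modifications, and the `score` argument is a placeholder for any function proportional to the posterior. The toy score used in the demo (closeness to a fixed target graph) is purely an assumption to exercise the loop, not a real scoring rule.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """True if the directed graph has no cycle."""
    remaining = set(nodes)
    while remaining:
        roots = {n for n in remaining
                 if not any(c == n and p in remaining for p, c in edges)}
        if not roots:
            return False
        remaining -= roots
    return True

def neighbors(nodes, edges):
    """Graphs one edge addition, deletion, or reversal away from `edges`."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}               # deletion
            yield (edges - {(a, b)}) | {(b, a)}  # reversal
        elif (b, a) not in edges:
            yield edges | {(a, b)}               # addition

def greedy_search(nodes, score, start=frozenset()):
    """Move to the best-scoring neighbor until no neighbor improves."""
    current = frozenset(start)
    while True:
        candidates = [g for g in neighbors(nodes, current)
                      if is_acyclic(nodes, g)]
        best = max(candidates, key=score, default=current)
        if score(best) <= score(current):
            return current
        current = frozenset(best)

# Toy score (assumption): negative number of edges differing from a target.
target = frozenset({("X", "Z"), ("Y", "Z")})
found = greedy_search(["X", "Y", "Z"], lambda g: -len(g ^ target))
print(sorted(found))
```

With a real score the loop can stop at a local maximum, which is the problem the next slide takes up.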
Bayesian / score-based • Problem #1: Local maxima • Often, greedy searches get stuck • Solution: • Greedy search over Markov equivalence classes, rather than graphs (Meek) • Has a proof of correctness and convergence (Chickering) • But it gets to the right answer slowly
Bayesian / score-based • Problem #2: Unobserved variables • Huge number of graphs • Huge number of different parameterizations • No fast, general way to compute likelihoods from latent variable models • Partial solution: • Focus on a small, “plausible” set of models for which we can compute scores
Constraint-based • Implementation of the earlier idea • “Build” the Markov equivalence class that predicts the pattern of association actually found in the data • Compatible with a variety of statistical techniques • Note that we might have to introduce a latent variable to explain the pattern of statistics • Important constraints on search: • Minimize the number of statistical tests • Minimize the size of the conditioning sets (Why?)
Constraint-based • Algorithm step #1: Discover the adjacencies • Create the complete graph with undirected edges • Test all pairs X, Y for unconditional independence • Remove X—Y edge if they are independent • Test all adjacent X, Y for independence given single N • Remove X—Y edge if they are independent • Test adjacent pairs given two neighbors • …
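The adjacency phase can be sketched in Python as below, assuming an independence oracle `independent(x, y, cond)` in place of a statistical test. The function name `find_skeleton` and the oracle are hypothetical; the oracle encodes three independencies over {X, Y, Z, W} as a made-up test case.

```python
from itertools import combinations

def find_skeleton(nodes, independent):
    """Remove edges from the complete graph using growing conditioning sets."""
    adj = {n: set(nodes) - {n} for n in nodes}
    sepset = {}  # the conditioning set that separated each removed pair
    size = 0
    # Continue while some node has enough neighbors to form a set of `size`.
    while any(len(adj[x]) - 1 >= size for x in nodes):
        for x in nodes:
            for y in sorted(adj[x]):
                # Condition only on current neighbors of x (other than y).
                for cond in combinations(sorted(adj[x] - {y}), size):
                    if independent(x, y, set(cond)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        size += 1
    return adj, sepset

# Assumed oracle: X _||_ Y, X _||_ W | Z, Y _||_ W | Z, and nothing else.
FACTS = {
    (frozenset("XY"), frozenset()),
    (frozenset("XW"), frozenset("Z")),
    (frozenset("YW"), frozenset("Z")),
}

def oracle(x, y, cond):
    return (frozenset((x, y)), frozenset(cond)) in FACTS

adj, sepset = find_skeleton(list("XYZW"), oracle)
print({n: sorted(nbrs) for n, nbrs in adj.items()})
```

Restricting conditioning sets to current neighbors is what keeps both the number of tests and the size of the conditioning sets small.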
Constraint-based • Algorithm step #2: (Try to) Orient edges • “Unshielded triple”: X — C — Y, but X, Y not adjacent • If X & Y are independent given S containing C, then C must be a non-collider • Since we have to condition on it to achieve d-separation • If X & Y are independent given S not containing C, then C must be a collider • Since the path is not active when not conditioning on C • And then do further orientations to ensure acyclicity and nodes being non-colliders
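The orientation step can be sketched in Python as follows. The inputs are an adjacency map and the separating sets recorded during the adjacency phase; both are written out by hand here as an assumed example over {X, Y, Z, W}. Only one of the "further orientation" rules is included (orienting away from a collider to keep a node a non-collider), and the function names are hypothetical.

```python
from itertools import combinations

def orient_colliders(adj, sepset):
    """Orient x -> c <- y for unshielded triples whose sepset omits c."""
    directed = set()
    for c in adj:
        for x, y in combinations(sorted(adj[c]), 2):
            if y in adj[x]:
                continue  # x and y are adjacent, so the triple is shielded
            if c not in sepset[frozenset((x, y))]:
                directed.add((x, c))
                directed.add((y, c))
    return directed

def orient_non_colliders(adj, directed):
    """If a -> c, c - b is unoriented, and a, b are non-adjacent: c -> b."""
    directed = set(directed)
    changed = True
    while changed:
        changed = False
        for a, c in list(directed):
            for b in sorted(adj[c]):
                if b == a or b in adj[a]:
                    continue
                if (c, b) in directed or (b, c) in directed:
                    continue
                directed.add((c, b))  # otherwise c would be a new collider
                changed = True
    return directed

# Assumed skeleton and separating sets for a four-variable example.
adj = {"X": {"Z"}, "Y": {"Z"}, "Z": {"X", "Y", "W"}, "W": {"Z"}}
sepset = {
    frozenset("XY"): set(),
    frozenset("XW"): {"Z"},
    frozenset("YW"): {"Z"},
}
edges = orient_non_colliders(adj, orient_colliders(adj, sepset))
print(sorted(edges))  # [('X', 'Z'), ('Y', 'Z'), ('Z', 'W')]
```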
Constraint-based example • Variables are {X, Y, Z, W} • Only independencies are: • X ⫫ Y • X ⫫ W | Z • Y ⫫ W | Z
Constraint-based example • Step 1: Form the complete graph using undirected edges
Constraint-based example • Step 2: For each pair of variables, remove the edge between them if they’re unconditionally independent • X ⫫ Y ⇒ remove the X — Y edge
Constraint-based example • Step 3: For each adjacent pair, remove the edge if they’re independent conditional on some variable adjacent to one of them • {X, Y} ⫫ W | Z ⇒ remove the X — W and Y — W edges
Constraint-based example • Step 4: Continue removing edges, checking independence conditional on 2 (or 3, or 4, or…) variables
Constraint-based example • Step 5: Orientation • For X – Z – Y, since X ⫫ Y without conditioning on Z, make Z a collider: X → Z ← Y • Since Z is a non-collider between X and W, though, we must orient Z – W away from Z: Z → W
Constraint-based output • Searches that allow for latent variables can also have edges of the form X o→Y • This indicates one of three possibilities: • X→Y • At least one unobserved common cause of X and Y • Both of these
Interventions to the rescue? • Interventions helped us solve an earlier equivalence class problem • Randomization meant that: Treatment-Effect association ⇒ T → E • Interventions alter equivalence classes, but don’t make them all into singletons • The fundamental problem of search remains
[Figure: candidate graphs over X, Y, Z before X-intervention]
[Figure: candidate graphs over X, Y, Z after X-intervention]
Search with interventions • Search with interventions is the same as search with observations, except • We adjust the graphs in the search space to account for the intervention • For multiple experiments, we search for graphs in every output equivalence class • More complicated than this in the real world due to sampling variation
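Adjusting a graph for an intervention can be sketched in Python as deleting every edge into the intervened variable, since the intervention (rather than its usual causes) now sets its value. The helper name `manipulate` and the three candidate graphs are assumptions for illustration.

```python
def manipulate(edges, targets):
    """Graph after intervening on `targets`: drop all edges into them."""
    return {(p, c) for p, c in edges if c not in targets}

# Three graphs that are Markov equivalent under passive observation.
candidates = {
    "Y -> X -> Z": {("Y", "X"), ("X", "Z")},
    "Y <- X <- Z": {("X", "Y"), ("Z", "X")},
    "Y <- X -> Z": {("X", "Y"), ("X", "Z")},
}

for name, graph in candidates.items():
    print(name, "=>", sorted(manipulate(graph, {"X"})))
```

After intervening on X the three candidates yield different manipulated graphs, so they no longer predict the same independencies.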
Example • Observation • Y ⫫ Z | X ⇒ Y → X → Z, Y ← X ← Z, or Y ← X → Z • Intervention on X • Y ⫫ {X, Z} ⇒ Y → X → Z & X → Z with Y unconnected • Only possible graph: Y → X → Z
Looking ahead… • Have: • Basic formal representation for causation • Fundamental causal asymmetry (of intervention) • Inference & reasoning methods • Search & causal discovery principles • Need: • Search & causal discovery methods that work in the real world