Automatic Causal Discovery Richard Scheines Peter Spirtes, Clark Glymour Dept. of Philosophy & CALD Carnegie Mellon
Outline • Motivation • Representation • Discovery • Using Regression for Causal Discovery
1. Motivation Non-experimental Evidence Typical Predictive Questions: • Can we predict aggressiveness from Day Care attendance? • Can we predict crime rates from abortion rates 20 years ago? Causal Questions: • Does attending Day Care cause Aggression? • Does abortion reduce crime?
Causal Estimation When and how can we use non-experimental data to tell us about the effect of an intervention, i.e., to infer the manipulated probability P(Y | X set= x, Z=z) from the unmanipulated probability P(Y | X = x, Z=z)?
Conditioning vs. Intervening P(Y | X = x1) vs. P(Y | X set= x1) (Stained Teeth slides)
2. Representation • Representing causal structure, and connecting it to probability • Modeling Interventions
Causation & Association X is a cause of Y iff ∃ x1 ≠ x2 such that P(Y | X set= x1) ≠ P(Y | X set= x2) X and Y are associated iff ∃ x1 ≠ x2 such that P(Y | X = x1) ≠ P(Y | X = x2)
Direct Causation X is a direct cause of Y relative to S iff ∃ z, x1 ≠ x2 such that P(Y | X set= x1, Z set= z) ≠ P(Y | X set= x2, Z set= z), where Z = S − {X, Y}
Association X and Y are associated iff ∃ x1 ≠ x2 such that P(Y | X = x1) ≠ P(Y | X = x2) X and Y are independent iff X and Y are not associated
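The association and independence definitions above can be checked mechanically from a joint distribution table. The sketch below uses a made-up joint over two binary variables (the numbers are illustrative assumptions, not from the talk):

```python
# Hypothetical joint distribution P(X, Y) over binary variables,
# given as {(x, y): probability} -- illustrative numbers only.
P = {(0, 0): 0.32, (0, 1): 0.08,   # X=0: P(Y=1 | X=0) = 0.2
     (1, 0): 0.18, (1, 1): 0.42}   # X=1: P(Y=1 | X=1) = 0.7

def cond(P, y, x):
    """P(Y=y | X=x) computed from the joint table."""
    px = sum(p for (xv, _), p in P.items() if xv == x)
    return P[(x, y)] / px

def associated(P, xs=(0, 1), ys=(0, 1)):
    """X and Y are associated iff P(Y | X=x) differs for some x1 != x2."""
    for y in ys:
        vals = [cond(P, y, x) for x in xs]
        if any(abs(a - b) > 1e-12 for a in vals for b in vals):
            return True
    return False

print(associated(P))   # True: P(Y=1|X=0)=0.2 vs P(Y=1|X=1)=0.7
```

Independence is then just `not associated(P)`, matching the definition on the slide.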
Causal Graphs Causal Graph G = {V, E} Each edge X → Y represents a direct causal claim: X is a direct cause of Y relative to V
Modeling Ideal Interventions Ideal interventions (on a variable X): • Completely determine the value or distribution of X • Directly target only X (no "fat hand") E.g., Variables: Confidence, Athletic Performance; Intervention 1: hypnosis for confidence; Intervention 2: anti-anxiety drug (also a muscle relaxer)
Modeling Ideal Interventions: Intervening on the Effect [figure: the Smoking → Teeth Stains system, pre-experimental vs. post-intervention]
Modeling Ideal Interventions: Intervening on the Cause [figure: the Smoking → Teeth Stains system, pre-experimental vs. post-intervention]
Interventions & Causal Graphs • Model an ideal intervention by adding an "intervention" variable outside the original system • Erase all arrows pointing into the variable intervened upon [figure: pre-intervention graph vs. post-intervention graph after intervening to change Inf]
Calculating the Effect of Interventions P(YF,S,L) = P(S) P(YF|S) P(L|S) Replace the pre-manipulation cause of the manipulated variable with the manipulation: P(YF,S,L)m = P(S) P(YF|Manip) P(L|S)
Calculating the Effect of Interventions P(YF,S,L) = P(S) P(YF|S) P(L|S) P(L | YF) ≠ P(L | YF set by Manip) P(YF,S,L)m = P(S) P(YF|Manip) P(L|S)
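The manipulated factorization can be computed directly. The sketch below uses the slides' Smoking (S) → Yellow Fingers (YF), Smoking → Lung cancer (L) graph with hypothetical conditional probabilities (the numbers are assumptions for illustration, not from the talk); conditioning on YF and setting YF give different answers because YF carries information about S only in the unmanipulated system:

```python
# Hypothetical parameters for S -> YF, S -> L (illustrative numbers).
pS  = {0: 0.7, 1: 0.3}            # P(S)
pYF = {0: 0.1, 1: 0.9}            # P(YF=1 | S)
pL  = {0: 0.05, 1: 0.4}           # P(L=1 | S)

def p_L1_given_YF1_observed():
    """P(L=1 | YF=1) from the unmanipulated factorization
    P(YF,S,L) = P(S) P(YF|S) P(L|S)."""
    joint = sum(pS[s] * pYF[s] * pL[s] for s in (0, 1))
    pyf1 = sum(pS[s] * pYF[s] for s in (0, 1))
    return joint / pyf1

def p_L1_given_YF_set():
    """P(L=1 | YF set by Manip): replacing P(YF|S) with the manipulation
    distribution cuts the S -> YF link, so the answer is the marginal P(L=1)."""
    return sum(pS[s] * pL[s] for s in (0, 1))

print(round(p_L1_given_YF1_observed(), 4))  # 0.3279 -- conditioning
print(round(p_L1_given_YF_set(), 4))        # 0.155  -- intervening
```

Under these assumed numbers, conditioning overstates the effect of yellow fingers on lung cancer, since both are effects of smoking.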
The Markov Condition Causal Graphs: Causal Structure →(Markov Condition)→ Statistical Predictions Independence: X _||_ Z | Y, i.e., P(X | Y) = P(X | Y, Z)
Causal Markov Axiom In a Causal Graph G, each variable V is independent of its non-effects, conditional on its direct causes in every probability distribution that G can parameterize (generate)
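The Causal Markov axiom can be verified numerically for a small graph. The sketch below builds a joint distribution that factorizes according to the chain X → Y → Z, using made-up conditional probability tables (illustrative assumptions, not from the talk), and checks the entailed independence X _||_ Z | Y via P(X | Y) = P(X | Y, Z):

```python
from itertools import product

# Hypothetical CPTs for the chain X -> Y -> Z (illustrative numbers).
pX = {0: 0.6, 1: 0.4}
pY = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}    # P(Y | X)
pZ = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}  # P(Z | Y)

# Joint distribution factored according to the graph.
joint = {(x, y, z): pX[x] * pY[x][y] * pZ[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def cond_indep(joint):
    """True iff P(X | Y) = P(X | Y, Z) for all values, i.e., X _||_ Z | Y."""
    for y in (0, 1):
        py = sum(p for (_, yv, _), p in joint.items() if yv == y)
        for x in (0, 1):
            pxy = sum(p for (xv, yv, _), p in joint.items()
                      if xv == x and yv == y)
            p_x_given_y = pxy / py
            for z in (0, 1):
                pyz = sum(p for (_, yv, zv), p in joint.items()
                          if yv == y and zv == z)
                if abs(joint[(x, y, z)] / pyz - p_x_given_y) > 1e-9:
                    return False
    return True

print(cond_indep(joint))   # True: X is independent of Z given its direct cause Y
```

The independence holds for any choice of CPT numbers, because it follows from the factorization itself.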
Causal Graphs → Independence Acyclic causal graphs: d-separation equivalent to the Causal Markov axiom Cyclic causal graphs: • Linear structural equation models: d-separation holds, Causal Markov does not • Some discrete-variable models: d-separation holds, Causal Markov does not • Non-linear cyclic SEMs: neither holds
Causal Discovery: Statistical Data + Background Knowledge (e.g., X2 before X3) → Causal Structure
Equivalence Classes • D-separation equivalence • D-separation equivalence over a set O • Distributional equivalence • Distributional equivalence over a set O Two causal models M1 and M2 are distributionally equivalent iff for any parameterization θ1 of M1, there is a parameterization θ2 of M2 such that M1(θ1) = M2(θ2), and vice versa.
Equivalence Classes For example, interpreted as SEMs: M1 and M2 are d-separation equivalent & distributionally equivalent; M3 and M4 are d-separation equivalent but not distributionally equivalent
D-separation Equivalence Over a Set X Let X = {X1, X2, X3}. Then Ga and Gb 1) are not d-separation equivalent, but 2) are d-separation equivalent over X
D-separation Equivalence D-separation Equivalence Theorem (Verma and Pearl, 1988) Two acyclic graphs over the same set of variables are d-separation equivalent iff they have: • the same adjacencies • the same unshielded colliders
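The Verma–Pearl criterion makes equivalence checking purely graphical. The sketch below (a minimal implementation of the theorem as stated, with hypothetical example graphs) compares two DAGs' adjacencies and unshielded colliders:

```python
from itertools import combinations

def adjacencies(dag):
    """Undirected adjacencies of a DAG given as {node: set of parents}."""
    return {frozenset((a, b)) for b, pars in dag.items() for a in pars}

def unshielded_colliders(dag):
    """Triples (a, c, b) with a -> c <- b where a and b are not adjacent."""
    adj = adjacencies(dag)
    out = set()
    for c, pars in dag.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in adj:
                out.add((a, c, b))
    return out

def dsep_equivalent(g1, g2):
    """Verma & Pearl (1988): two acyclic graphs over the same variables are
    d-separation equivalent iff they share adjacencies and unshielded colliders."""
    return (adjacencies(g1) == adjacencies(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))

# X1 -> X2 -> X3 vs. X1 <- X2 <- X3: same adjacencies, no unshielded colliders
chain1 = {"X1": set(), "X2": {"X1"}, "X3": {"X2"}}
chain2 = {"X1": {"X2"}, "X2": {"X3"}, "X3": set()}
# X1 -> X2 <- X3: an unshielded collider, so not equivalent to the chains
collider = {"X1": set(), "X2": {"X1", "X3"}, "X3": set()}

print(dsep_equivalent(chain1, chain2))    # True
print(dsep_equivalent(chain1, collider))  # False
```

The two chains entail exactly the same independence (X1 _||_ X3 | X2), while the collider entails X1 _||_ X3 marginally instead.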
Representations ofD-separation Equivalence Classes We want the representations to: • Characterize the Independence Relations Entailed by the Equivalence Class • Represent causal features that are shared by every member of the equivalence class
Patterns & PAGs • Patterns (Verma and Pearl, 1990): graphical representation of an acyclic d-separation equivalence class with no latent variables. • PAGs (Richardson 1994): graphical representation of an equivalence class, including latent variable models and sample selection bias, that are d-separation equivalent over a set of measured variables X
Patterns Not all Boolean combinations of orientations of unoriented pattern adjacencies occur in the equivalence class.
PAGs: Partial Ancestral Graphs What PAG edges mean.
Search Difficulties • The number of graphs is super-exponential in the number of observed variables (if there are no hidden variables) or infinite (if there are hidden variables) • Because some graphs are equivalent, we can only predict those effects that are the same for every member of an equivalence class • This problem can be resolved by outputting equivalence classes
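The super-exponential growth is concrete even for tiny problems. The sketch below counts labeled DAGs on n nodes using Robinson's recurrence (a standard result, not from the slides):

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k), a(0) = 1.
    The term picks k nodes with no incoming edges, by inclusion-exclusion."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 7):
    print(n, num_dags(n))   # 1, 3, 25, 543, 29281, 3781503 -- already huge at n=6
```

By n = 10 the count exceeds 10^18, which is why exhaustive search over graphs is hopeless and equivalence classes matter.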
What Isn’t Possible • Given just data, and the Causal Markov and Causal Faithfulness Assumptions: • Can’t get probability of an effect being within a given range without assuming a prior distribution over the graphs and parameters
What Is Possible • Given just data, and the Causal Markov and Causal Faithfulness Assumptions: • There are procedures which are asymptotically correct in predicting effects (or saying “don’t know”)
Overview of Search Methods • Constraint-Based Searches • TETRAD • Scoring Searches • Scores: BIC, AIC, etc. • Search: Hill Climb, Genetic Alg., Simulated Annealing • Very difficult to extend to latent variable models Heckerman, Meek and Cooper (1999), "A Bayesian Approach to Causal Discovery," ch. 4 in Computation, Causation, and Discovery, ed. Glymour and Cooper, MIT Press, pp. 141-166
Constraint-based Search • Construct the graph that most closely implies the conditional independence relations found in the sample • Doesn't allow for comparing how much better one model is than another • It is important not to test all of the possible conditional independence relations, for both speed and accuracy reasons; the FCI search selects a subset of independence relations to test
Constraint-based Search • Can trade off informativeness versus speed, without affecting correctness • Can be applied to distributions where tests of conditional independence are known, but scores aren’t • Can be applied to hidden variable models (and selection bias models) • Is asymptotically correct
Search for Patterns • Adjacency: • X and Y are adjacent if they are dependent conditional on every subset that doesn't include X and Y • X and Y are not adjacent if they are independent conditional on some subset that doesn't include X and Y
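The adjacency criterion above can be sketched as a PC-style search. The version below starts from the complete graph and removes an edge whenever some conditioning set separates its endpoints; in place of a statistical test it takes an independence oracle, here hand-coded (an assumption for illustration) with the independencies entailed by the graph X1 → X3 ← X2, X3 → X4:

```python
from itertools import combinations

def adjacency_search(variables, indep):
    """PC-style adjacency phase: start complete, remove X - Y whenever
    some set S (with X, Y not in S) makes them independent.
    `indep(x, y, S)` is an independence oracle (a statistical test, in practice)."""
    adj = {frozenset(p) for p in combinations(variables, 2)}
    for x, y in combinations(variables, 2):
        rest = [v for v in variables if v not in (x, y)]
        for size in range(len(rest) + 1):
            if any(indep(x, y, frozenset(S)) for S in combinations(rest, size)):
                adj.discard(frozenset((x, y)))
                break
    return adj

def oracle(x, y, S):
    """Hand-coded independencies of X1 -> X3 <- X2, X3 -> X4 (illustrative)."""
    pair = frozenset((x, y))
    if pair == frozenset(("X1", "X2")):
        return len(S) == 0          # conditioning on X3 (or X4) opens the collider
    if "X4" in pair and pair != frozenset(("X3", "X4")):
        return "X3" in S            # X3 screens X4 off from X1 and X2
    return False

found = adjacency_search(["X1", "X2", "X3", "X4"], oracle)
print(sorted(tuple(sorted(a)) for a in found))
# [('X1', 'X3'), ('X2', 'X3'), ('X3', 'X4')] -- the true skeleton
```

The real FCI search is more careful about which subsets S it tries (only subsets of current neighbors), which is what buys its speed and accuracy; this sketch tries all subsets for clarity.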
Search: Orientation After the orientation phase X1 _||_ X2 X1 _||_ X4 | X3 X2 _||_ X4 | X3
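The orientation phase can be sketched from its collider rule: for each unshielded triple a - c - b, orient a → c ← b exactly when c is not in the set that separated a and b during the adjacency phase. The sketch below (a minimal version under that assumption) applies the rule to the skeleton and separating sets from the example above:

```python
from itertools import combinations

def orient_colliders(adj, sepset):
    """Collider orientation: for each unshielded triple a - c - b (a, b not
    adjacent), orient a -> c <- b iff c is NOT in sepset[{a, b}]."""
    arrows = set()   # (tail, head) pairs
    nodes = {v for pair in adj for v in pair}
    for c in nodes:
        nbrs = sorted(v for v in nodes if frozenset((v, c)) in adj)
        for a, b in combinations(nbrs, 2):
            if (frozenset((a, b)) not in adj
                    and c not in sepset[frozenset((a, b))]):
                arrows.update({(a, c), (b, c)})
    return arrows

# Skeleton X1 - X3, X2 - X3, X3 - X4 with separating sets recorded during
# the adjacency phase: X1 _||_ X2 (empty set), X1/X2 _||_ X4 given {X3}
adj = {frozenset(("X1", "X3")), frozenset(("X2", "X3")), frozenset(("X3", "X4"))}
sepset = {frozenset(("X1", "X2")): set(),
          frozenset(("X1", "X4")): {"X3"},
          frozenset(("X2", "X4")): {"X3"}}

print(sorted(orient_colliders(adj, sepset)))
# [('X1', 'X3'), ('X2', 'X3')] -- the unshielded collider X1 -> X3 <- X2
```

A further propagation rule (omitted here) would then orient X3 → X4, since X4 → X3 would create a new unshielded collider that the independencies rule out.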
PAG Knowing when we know enough to calculate the effect of interventions Observation: IQ _||_ Lead Background Knowledge: Lead prior to IQ Question: P(IQ | Lead) vs. P(IQ | Lead set=) Conclusion: P(IQ | Lead) = P(IQ | Lead set=)