CAUSAL INFERENCE AS A MACHINE LEARNING EXERCISE Judea Pearl Computer Science and Statistics UCLA www.cs.ucla.edu/~judea/
OUTLINE • Learning: Statistical vs. Causal concepts • Causal models and identifiability • Learnability of three types of causal queries: • Effects of potential interventions, • Queries about attribution (responsibility) • Queries about direct and indirect effects
TRADITIONAL MACHINE LEARNING PARADIGM

Data → Learning → P (joint distribution) → Q(P) (aspects of P)

e.g., learn whether customers who bought product A would also buy product B: Q = P(B | A).
THE CAUSAL ANALYSIS PARADIGM

Data → Learning → M (data-generating model) → Q(M) (aspects of M)

• Some Q(M) cannot be inferred from P alone.
  e.g., learn whether customers who bought product A would still buy A if we doubled the price.
• Data mining vs. knowledge mining.
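The point is easy to demonstrate. Below is a minimal sketch (two hypothetical toy models, not from the slides) of structural models that induce the same joint distribution P(X, Y) yet give different answers to P(Y=1 | do(X=1)); no learner that sees only P can tell them apart:

```python
import random

# Two toy models (hypothetical, for illustration) inducing the SAME joint
# P(X, Y) but different answers to the interventional query P(Y=1 | do(X=1)).

def chain(u, x=None):             # M1:  U -> X -> Y
    x = u if x is None else x     # X := U, unless set by intervention
    return x, x                   # Y := X

def confounded(u, x=None):        # M2:  X <- U -> Y  (Y ignores X)
    x = u if x is None else x     # X := U, unless set by intervention
    return x, u                   # Y := U

def estimate(model, do_x=None, n=200_000):
    draws = [model(random.random() < 0.5, x=do_x) for _ in range(n)]
    p_y = sum(y for _, y in draws) / n
    with_x = [y for x, y in draws if x]
    return p_y, sum(with_x) / max(1, len(with_x))   # P(Y=1), P(Y=1 | X=1)

for name, m in (("chain", chain), ("confounded", confounded)):
    _, p_y_given_x = estimate(m)               # observational: identical (1.00)
    p_y_do, _ = estimate(m, do_x=1)            # interventional: 1.00 vs 0.50
    print(f"{name:10s}  P(Y=1|X=1) = {p_y_given_x:.2f}   "
          f"P(Y=1|do(X=1)) = {p_y_do:.2f}")
```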
THE SECRETS OF CAUSAL MODELS Causal Model = Data-generating model satisfying: • Modularity (Symbol-mechanism correspondence) • Uniqueness (Variable-mechanism correspondence)
THE SECRETS OF CAUSAL MODELS (EXAMPLE)

Causal model:                          Joint distribution:
PR = f(CP, PS, e1)                     P(PR, PS, CP, ME, QS)
QS = g(ME, PR, e2);  P(e1, e2)

Variables: PRICE (PR), Quantity Sold (QS), Marketing (ME), Cost Proj. (CP), Prev. Sale (PS), Others (e1, e2).

Q1: P(QS | PR = 2) is computable from P (and from M).
Q2: P(QS | do(PR = 2)) is computable from M only (not from P).
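A minimal simulation of this model makes Q1 and Q2 concrete. The slide leaves f, g, and P(e1, e2) unspecified, so the linear mechanisms and the correlated Gaussian disturbances below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Background variables (hypothetical distributions; correlated e1, e2
# stand in for the joint P(e1, e2) left unspecified on the slide).
CP = rng.normal(size=n)                  # cost projection
PS = rng.normal(size=n)                  # previous sale
ME = rng.normal(size=n)                  # marketing
e1 = rng.normal(size=n)
e2 = 0.8 * e1 + rng.normal(size=n)       # P(e1, e2): dependent disturbances

# Structural equations (hypothetical linear forms for f and g):
PR = CP + 0.5 * PS + e1                  # PR = f(CP, PS, e1)
QS = 2.0 * ME - 1.5 * PR + e2            # QS = g(ME, PR, e2)

# Q1: E[QS | PR = 2], a property of P, estimable by conditioning.
q1 = QS[np.abs(PR - 2.0) < 0.05].mean()

# Q2: E[QS | do(PR = 2)], a property of M: replace PR's equation by the
# constant PR := 2 and leave every other mechanism untouched.
QS_do = 2.0 * ME - 1.5 * 2.0 + e2
print(f"Q1  E[QS | PR = 2]     = {q1:+.2f}")        # shifted by the e1-e2 link
print(f"Q2  E[QS | do(PR = 2)] = {QS_do.mean():+.2f}")
```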
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES

Probability and statistics deal with static relations:
Data → (statistics) → joint distribution → (probability) → inferences from passive observations

• Causal analysis deals with changes (dynamics), i.e., what remains invariant when P changes.
• P does not tell us how it ought to change.
  e.g., curing symptoms vs. curing diseases
  e.g., analogy: mechanical deformation
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES

Probability and statistics deal with static relations:
Data → joint distribution → predictions from passive observations

Causal analysis deals with changes (dynamics):
Data + causal assumptions + experiments → causal model → effects of interventions, causes of effects, explanations
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES (CONT)

Causal and statistical concepts do not mix:

CAUSAL: spurious correlation, randomization, confounding / effect, instrument, holding constant, explanatory variables
STATISTICAL: regression, association / independence, "controlling for" / conditioning, odds and risk ratios, collapsibility, propensity score

• No causes in – no causes out (Cartwright, 1989):
  causal assumptions + statistical assumptions + data ⇒ causal conclusions
• Causal assumptions cannot be expressed in the mathematical language of standard statistics.
• Non-standard mathematics:
  • Structural equation models (SEM)
  • Counterfactuals (Neyman-Rubin)
  • Causal diagrams (Wright, 1920)
WHAT'S IN A CAUSAL MODEL?

Oracle that assigns truth values to causal sentences:
• Action sentences: B if we do A.
• Counterfactuals: B would be different if A were true.
• Explanation: B occurred because of A.
Optional: with what probability?
FAMILIAR CAUSAL MODEL: ORACLE FOR MANIPULATION

[Circuit diagram: input and output lines wired through internal variables X, Y, Z.]
CAUSAL MODELS AND CAUSAL DIAGRAMS

Definition: A causal model is a 3-tuple
M = ⟨V, U, F⟩
with a mutilation operator do(x): M → Mx, where:
(i) V = {V1, …, Vn} are endogenous variables,
(ii) U = {U1, …, Um} are background variables,
(iii) F is a set of n functions fi: (V \ Vi) × U → Vi, written vi = fi(pai, ui), with PAi ⊆ V \ Vi and Ui ⊆ U.
CAUSAL MODELS AND CAUSAL DIAGRAMS (EXAMPLE)

[Causal diagram over endogenous variables I, W, Q, P with background variables U1, U2; PAQ marks the parents of Q in the diagram.]
CAUSAL MODELS AND MUTILATION

The definition above is completed with:
(iv) Mx = ⟨U, V, Fx⟩, X ⊆ V, x a realization of X, where
Fx = {fi : Vi ∉ X} ∪ {X = x}
(replace all functions fi corresponding to variables in X with the constant functions X = x).
CAUSAL MODELS AND MUTILATION (EXAMPLE)

(iv) Mp: the model of the previous example with the equation for P removed and replaced by the constant P = p0.
[Diagram: I, W, Q with background variables U1, U2, and P clamped at p0.]
PROBABILISTIC CAUSAL MODELS

Definition (probabilistic causal model): a pair ⟨M, P(u)⟩, where M is a causal model (with mutilation operator Mx as in (iv) above) and P(u) is a probability assignment to the variables in U.
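The definition is compact enough to implement directly. In the minimal sketch below (the dictionary-of-mechanisms encoding and all names are my assumptions, not the slides'), ⟨M, P(u)⟩ is a sampler for U plus one function per endogenous variable, and do(x) swaps mechanisms for constants:

```python
import random

class CausalModel:
    """A probabilistic causal model <M, P(u)>: a sampler for P(u) plus one
    mechanism f_i per endogenous variable, evaluated in the given order.
    (A minimal hypothetical encoding.)"""

    def __init__(self, p_u, mechanisms):
        self.p_u = p_u          # () -> dict of background values u
        self.f = mechanisms     # insertion-ordered {name: f(env) -> value}

    def do(self, **x):
        """Mutilation M_x: replace f_i by the constant V_i = x_i for every
        intervened variable; all other mechanisms are left untouched."""
        fx = {v: ((lambda env, c=x[v]: c) if v in x else f)
              for v, f in self.f.items()}
        return CausalModel(self.p_u, fx)

    def sample(self):
        env = self.p_u()                 # draw u ~ P(u)
        for v, f in self.f.items():      # solve the equations in order
            env[v] = f(env)
        return env

# Example: U -> X, U -> Y, X -> Y with Y := X or U (hypothetical mechanisms).
m = CausalModel(
    p_u=lambda: {"u": random.random() < 0.5},
    mechanisms={"x": lambda e: e["u"],
                "y": lambda e: e["x"] or e["u"]},
)
n = 100_000
p_y = sum(m.sample()["y"] for _ in range(n)) / n
p_y_do = sum(m.do(x=True).sample()["y"] for _ in range(n)) / n
print(f"P(y) = {p_y:.2f}   P(y | do(X=1)) = {p_y_do:.2f}")   # 0.50 vs 1.00
```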
CAUSAL MODELS AND COUNTERFACTUALS Definition: Potential Response The sentence: “Y would be y (in unit u), had X been x,” denoted Yx(u) = y, is the solution for Y in a mutilated model Mx, with the equations for X replaced by X = x. (“unit-based potential outcome”)
CAUSAL MODELS AND COUNTERFACTUALS

Joint probabilities of counterfactuals:
P(Yx = y, Zw = z) = Σ{u : Yx(u) = y, Zw(u) = z} P(u)
In particular:
P(y | do(x)) = P(Yx = y) = Σ{u : Yx(u) = y} P(u)
P(Yx′ = y′ | x, y) = Σ{u : Yx′(u) = y′} P(u | x, y)
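For a fully specified ⟨M, P(u)⟩, the last expression is computable by the three-step recipe implicit in the definition: abduction (update P(u) by the evidence x, y), action (mutilate to Mx′), prediction (solve for Y). A sketch on a hypothetical two-bit model:

```python
from itertools import product

# Hypothetical model: X := U1,  Y := X or U2,  U1, U2 ~ Bernoulli(1/2), iid.
def solve(u1, u2, do_x=None):
    x = u1 if do_x is None else do_x
    y = int(x or u2)
    return x, y

# Step 1, abduction: worlds u = (u1, u2) consistent with evidence X=1, Y=1.
# The prior over u is uniform, so the posterior is uniform over this set.
worlds = [(u1, u2) for u1, u2 in product((0, 1), repeat=2)
          if solve(u1, u2) == (1, 1)]

# Steps 2-3, action + prediction: set X := 0 and re-solve for Y in each world.
pn = sum(solve(u1, u2, do_x=0)[1] == 0 for u1, u2 in worlds) / len(worlds)
print(f"P(Y_0 = 0 | X=1, Y=1) = {pn:.2f}")   # 0.50 in this toy model
```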
CAUSAL INFERENCE MADE EASY (1985-2000) • Inference with Nonparametric Structural Equations • made possible through Graphical Analysis. • Mathematical underpinning of counterfactuals • through nonparametric structural equations • Graphical-Counterfactuals symbiosis
NON-PARAMETRIC STRUCTURAL MODELS

Given P(x,y,z), should we ban smoking?

[Diagram: X (Smoking) → Z (Tar in Lungs) → Y (Cancer), with disturbances u1, u2, u3, where u1 affects both X and Y.]

Linear analysis:              Nonparametric analysis:
x = u1,                       x = f1(u1),
z = x + u2,                   z = f2(x, u2),
y = z + u1 + u3.              y = f3(z, u1, u3).
Find: the effect of X on Y.   Find: P(y | do(x)).
LEARNING THE EFFECTS OF ACTIONS

Given P(x,y,z), should we ban smoking?

[Diagram: the same model, mutilated so that X is set to a constant.]

Linear analysis:              Nonparametric analysis:
x = u1,                       x = const.,
z = x + u2,                   z = f2(x, u2),
y = z + u1 + u3.              y = f3(z, u1, u3).
Find: the effect of X on Y.   Find: P(y | do(x)) = P(Y = y) in the new model.
IDENTIFIABILITY

Definition: Let Q(M) be any quantity defined on a causal model M, and let A be a set of assumptions. Q is identifiable relative to A iff
P(M1) = P(M2) ⇒ Q(M1) = Q(M2)
for all M1, M2 that satisfy A.
In other words, Q can be determined uniquely from the probability distribution P(v) of the endogenous variables V together with the assumptions A.
In this talk:
A: assumptions encoded in the diagram.
Q1: P(y | do(x)), the causal effect (= P(Yx = y))
Q2: P(Yx′ = y′ | x, y), the probability of necessity
Q3: the direct effect
THE FUNDAMENTAL THEOREM OF CAUSAL INFERENCE

Causal Markov Theorem: Any distribution generated by a Markovian structural model M (recursive, with independent disturbances) can be factorized as
P(v1, v2, …, vn) = Πi P(vi | pai)
where pai are the (values of the) parents of Vi in the causal diagram associated with M.
Corollary (Truncated Factorization; Manipulation Theorem): The distribution generated by an intervention do(X = x) in a Markovian model M is given by the truncated factorization
P(v1, v2, …, vn | do(x)) = Π{i : Vi ∉ X} P(vi | pai)
for all v consistent with X = x.
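For discrete Markovian models the truncated factorization can be evaluated directly from conditional probability tables. A sketch on a hypothetical three-variable model Z → X, Z → Y, X → Y (all CPT numbers invented for illustration):

```python
import numpy as np

# Hypothetical CPTs for a Markovian model Z -> X, Z -> Y, X -> Y.
P_z = np.array([0.6, 0.4])                     # P(Z = z)
P_x_z = np.array([[0.8, 0.2],                  # P(X = x | Z = z), rows: z
                  [0.3, 0.7]])
P_y_xz = np.array([[[0.9, 0.1],                # P(Y = y | X, Z), index [z][x][y]
                    [0.4, 0.6]],
                   [[0.7, 0.3],
                    [0.2, 0.8]]])

# Pre-intervention factorization: P(z, x, y) = P(z) P(x|z) P(y|x,z)
P = np.einsum("z,zx,zxy->zxy", P_z, P_x_z, P_y_xz)

# Truncated factorization for do(X = 1): drop the factor P(x|z),
# keep all other factors, and clamp X at 1.
x0 = 1
P_do = np.einsum("z,zy->zy", P_z, P_y_xz[:, x0, :])       # P(z, y | do(X=1))
print("P(Y=1 | do(X=1)) =", round(P_do[:, 1].sum(), 3))   # 0.68

# Plain conditioning keeps the dropped factor and gives a different number:
P_cond = P[:, x0, :] / P[:, x0, :].sum()
print("P(Y=1 | X=1)     =", round(P_cond[:, 1].sum(), 3)) # 0.74
```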
RAMIFICATIONS OF THE FUNDAMENTAL THEOREM

Given P(x,y,z), should we ban smoking?

[Diagrams: pre-intervention, Smoking (X) → Tar in Lungs (Z) → Cancer (Y) with unobserved U affecting both X and Y; post-intervention, the same model with X set to x.]

To compute P(y, z | do(x)), we must eliminate u (a graphical problem).
THE BACK-DOOR CRITERION

Graphical test of identification: P(y | do(x)) is identifiable in G if there is a set Z of variables such that Z d-separates X from Y in Gx, the subgraph of G with all arrows emanating from X removed.

[Diagrams G and Gx over Z1, …, Z6, X, Y, with a set Z blocking every back-door path from X to Y.]
Moreover,
P(y | do(x)) = Σz P(y | x, z) P(z)
("adjusting" for Z).
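The adjustment formula is straightforward to apply once a back-door set is chosen. A sketch on simulated data (a hypothetical binary model with a single confounder Z, which by itself is a back-door set):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical binary model: Z confounds X and Y; the only back-door
# path is X <- Z -> Y, so Z satisfies the back-door criterion.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.8, 0.2)
y = rng.random(n) < 0.2 + 0.3 * x + 0.3 * z

# Adjustment formula: P(y | do(x)) = sum_z P(y | x, z) P(z)
def p_y_do(x0):
    return sum(y[(x == x0) & (z == z0)].mean() * (z == z0).mean()
               for z0 in (False, True))

naive = y[x].mean() - y[~x].mean()               # confounded contrast
adjusted = p_y_do(True) - p_y_do(False)          # back-door adjusted
print(f"naive    P(y|x) - P(y|x')   = {naive:.3f}")     # ~0.48, biased
print(f"adjusted effect of do(x)    = {adjusted:.3f}")  # ~0.30, the true effect
```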
RULES OF CAUSAL CALCULUS

Rule 1 (ignoring observations):
P(y | do{x}, z, w) = P(y | do{x}, w)
  if (Y ⊥⊥ Z | X, W) holds in G with arrows into X removed.
Rule 2 (action/observation exchange):
P(y | do{x}, do{z}, w) = P(y | do{x}, z, w)
  if (Y ⊥⊥ Z | X, W) holds in G with arrows into X and arrows out of Z removed.
Rule 3 (ignoring actions):
P(y | do{x}, do{z}, w) = P(y | do{x}, w)
  if (Y ⊥⊥ Z | X, W) holds in G with arrows into X and into Z(W) removed, where Z(W) is the set of Z-nodes that are not ancestors of any W-node.
DERIVATION IN CAUSAL CALCULUS

Genotype (unobserved)
Smoking (S) → Tar (T) → Cancer (C)

P(c | do{s})
= Σt P(c | do{s}, t) P(t | do{s})                    (probability axioms)
= Σt P(c | do{s}, do{t}) P(t | do{s})                (Rule 2)
= Σt P(c | do{s}, do{t}) P(t | s)                    (Rule 2)
= Σt P(c | do{t}) P(t | s)                           (Rule 3)
= Σs′ Σt P(c | do{t}, s′) P(s′ | do{t}) P(t | s)     (probability axioms)
= Σs′ Σt P(c | t, s′) P(s′ | do{t}) P(t | s)         (Rule 2)
= Σs′ Σt P(c | t, s′) P(s′) P(t | s)                 (Rule 3)
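The last line is the front-door formula: every term in it is estimable from nonexperimental data. A sketch that computes it from simulated draws and checks it against direct simulation of the mutilated model (binary mechanisms and all parameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Hypothetical binary mechanisms: U (genotype) -> S and C;  S -> T -> C.
u = rng.random(n) < 0.5
s = rng.random(n) < np.where(u, 0.8, 0.2)        # smoking
t = rng.random(n) < np.where(s, 0.9, 0.1)        # tar in lungs
c = rng.random(n) < 0.1 + 0.5 * t + 0.3 * u      # cancer

# Front-door estimand derived above:
#   P(c | do(s)) = sum_t P(t | s) * sum_s' P(c | t, s') P(s')
def p_c_do_s(s0):
    total = 0.0
    for t0 in (False, True):
        p_t = (t[s == s0] == t0).mean()
        inner = sum(c[(t == t0) & (s == s1)].mean() * (s == s1).mean()
                    for s1 in (False, True))
        total += p_t * inner
    return total

# Ground truth: simulate the mutilated model with S clamped to True.
t_do = rng.random(n) < 0.9
c_do = rng.random(n) < 0.1 + 0.5 * t_do + 0.3 * u
print(f"front-door estimate P(c|do(s)) = {p_c_do_s(True):.3f}")   # ~0.70
print(f"simulated truth     P(c|do(s)) = {c_do.mean():.3f}")      # ~0.70
```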
A RECENT IDENTIFICATION RESULT

Theorem [Tian and Pearl, 2001]: The causal effect P(y | do(x)) is identifiable whenever the ancestral graph of Y contains no confounding path (a path of bidirected arcs, X ↔ … ↔ V) between X and any of its children.

[Three example graphs, (a), (b), (c), over X, Z1, Z2, Y.]
OUTLINE • Learning: Statistical vs. Causal concepts • Causal models and identifiability • Learnability of three types of causal queries: • Effects of potential interventions, • Queries about attribution (responsibility) • Queries about direct and indirect effects
DETERMINING THE CAUSES OF EFFECTS (The Attribution Problem) • Your Honor! My client (Mr. A) died BECAUSE • he used that drug. • Court to decide if it is MORE PROBABLE THAN • NOT that A would be alive BUT FOR the drug! • P(? | A is dead, took the drug) > 0.50
THE PROBLEM

• Theoretical problems: What is the meaning of PN(x, y), the "probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur"?
• Answer:
PN(x, y) = P(Yx′ = y′ | X = x, Y = y)
• Under what conditions can PN(x, y) be learned from statistical data, i.e., observational, experimental, and combined?
WHAT IS LEARNABLE FROM EXPERIMENTS?

Simple experiment: Q = P(Yx = y | z), Z nondescendants of X.
Compound experiment: Q = P(YX(z) = y | z).
Multi-stage experiment: etc.
CAN FREQUENCY DATA DECIDE LEGAL RESPONSIBILITY?

                     Experimental        Nonexperimental
                     do(x)    do(x′)       x        x′
Deaths (y)              16       14         2        28
Survivals (y′)         984      986       998       972
                     1,000    1,000     1,000     1,000

• Nonexperimental data: drug usage predicts longer life.
• Experimental data: drug has negligible effect on survival.
• Plaintiff: Mr. A is special.
  • He actually died.
  • He used the drug by choice.
• Court to decide (given both data): Is it more probable than not that A would be alive but for the drug?
TYPICAL THEOREMS (Tian and Pearl, 2000)

• Identifiability under monotonicity (combined data): PN is given by a corrected excess risk ratio.
• Bounds on PN given combined nonexperimental and experimental data.
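A sketch of the second result applied to the drug table above, assuming the Tian-Pearl bounds PN ≥ [P(y) − P(y | do(x′))] / P(x, y) and PN ≤ [P(y′ | do(x′)) − P(x′, y′)] / P(x, y), clipped to [0, 1]:

```python
# Tian-Pearl (2000) bounds on the probability of necessity
#   PN = P(Y_x' = y' | x, y)
# from combined experimental and nonexperimental data.
def pn_bounds(p_y_do_x, p_y_do_xp, p_xy, p_xpy, p_xpyp):
    # p_y_do_x is listed for completeness; these two bounds do not use it.
    p_y = p_xy + p_xpy                               # P(y), observational
    lower = max(0.0, (p_y - p_y_do_xp) / p_xy)
    upper = min(1.0, ((1.0 - p_y_do_xp) - p_xpyp) / p_xy)
    return lower, upper

# Frequencies from the drug table above (1,000 subjects per arm):
lo, hi = pn_bounds(
    p_y_do_x=16 / 1000,     # P(y | do(x)):  deaths under forced treatment
    p_y_do_xp=14 / 1000,    # P(y | do(x')): deaths under forced no-treatment
    p_xy=2 / 2000,          # P(x, y):   chose the drug and died
    p_xpy=28 / 2000,        # P(x', y):  avoided the drug and died
    p_xpyp=972 / 2000,      # P(x', y'): avoided the drug and survived
)
print(f"PN is bounded in [{lo:.2f}, {hi:.2f}]")      # [1.00, 1.00]
```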
SOLUTION TO THE ATTRIBUTION PROBLEM (CONT)

With probability one: P(Yx′ = y′ | x, y) = 1.

• From population data to the individual case.
• Combined data tell more than each study alone.
OUTLINE • Learning: Statistical vs. Causal concepts • Causal models and identifiability • Learnability of three types of causal queries: • Effects of potential interventions, • Queries about attribution (responsibility) • Queries about direct and indirect effects
QUESTIONS ADDRESSED • What is the semantics of direct and indirect effects? • Can we estimate them from data? Experimental data?
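As a pointer to where the talk is heading, one standard answer, the mediation formulas of Pearl (2001), which this slide only alludes to, is estimable from observational data when the mediator is unconfounded. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Hypothetical unconfounded mediation model: X -> Z -> Y and X -> Y.
x = rng.random(n) < 0.5
z = rng.random(n) < np.where(x, 0.7, 0.3)            # mediator
y = rng.random(n) < 0.1 + 0.2 * x + 0.4 * z          # outcome

def E_y(x0, z0):                                     # E[Y | X=x0, Z=z0]
    return y[(x == x0) & (z == z0)].mean()

def P_z(z0, x0):                                     # P(Z=z0 | X=x0)
    return (z[x == x0] == z0).mean()

# Natural direct / indirect effects for the transition x: 0 -> 1
# (mediation formulas; Pearl 2001):
NDE = sum((E_y(1, z0) - E_y(0, z0)) * P_z(z0, 0) for z0 in (0, 1))
NIE = sum(E_y(0, z0) * (P_z(z0, 1) - P_z(z0, 0)) for z0 in (0, 1))
TE = y[x].mean() - y[~x].mean()                      # total effect (no confounding)
print(f"NDE = {NDE:.3f}  NIE = {NIE:.3f}  TE = {TE:.3f}")  # ~0.20, ~0.16, ~0.36
```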