THE MATHEMATICS OF CAUSE AND EFFECT: With Reflections on Machine Learning Judea Pearl Departments of Computer Science and Statistics UCLA
OUTLINE • The causal revolution – from statistics to counterfactuals • The fundamental laws of causal inference • From counterfactuals to practical victories • policy evaluation • attribution • mediation • generalizability – external validity • missing data
TRADITIONAL STATISTICAL INFERENCE PARADIGM
[Diagram: Data → (inference) → P, the joint distribution → Q(P), aspects of P]
e.g., Infer whether customers who bought product A would also buy product B: Q = P(B | A)
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES
[Diagram: Data → P, the joint distribution → (change) → P′, the modified joint distribution → Q(P′), aspects of P′]
How does P change to P′? A new oracle is needed.
e.g., Estimate P′(cancer) if we ban smoking.
FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES
[Diagram: Data → P → (change) → P′ → Q(P′), aspects of P′]
e.g., Estimate the probability that a customer who bought A would buy B if we were to double the price.
THE STRUCTURAL MODEL PARADIGM
[Diagram: Data → (inference) → M, the data-generating model → Joint Distribution → Q(M), aspects of M]
• M – Invariant strategy (mechanism, recipe, law, protocol) by which Nature assigns values to variables in the analysis.
"A painful de-crowning of a beloved oracle!"
WHAT KIND OF QUESTIONS SHOULD THE ORACLE ANSWER?
• Observational questions: "What if we see A?" (What is?) – P(y | A)
• Action questions: "What if we do A?" (What if?) – P(y | do(A))
• Counterfactual questions: "What if we did things differently?" (Why?) – P(y_{A'} | A)
• Options: "With what probability?"
THE CAUSAL HIERARCHY – A SYNTACTIC DISTINCTION
STRUCTURAL CAUSAL MODELS: THE WORLD AS A COLLECTION OF SPRINGS
• Definition: A structural causal model is a 4-tuple <V, U, F, P(u)>, where
• V = {V1,...,Vn} are endogenous variables
• U = {U1,...,Um} are background (exogenous) variables
• F = {f1,...,fn} are functions determining V: vi = fi(v, u)
• P(u) is a distribution over U
• P(u) and F induce a distribution P(v) over the observable variables.
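A minimal sketch of such a model in Python, using the sprinkler story that appears later in the talk (the functional forms and all numeric probabilities below are invented for illustration): the exogenous U's are drawn from P(u), the functions F then determine every endogenous variable, and together they induce the distribution P(v).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n=100_000):
    """Draw n samples from a toy sprinkler SCM.
    The exogenous U's are independent; each endogenous variable is a
    deterministic function of its parents and its own U."""
    u_c, u_s, u_r, u_w = (rng.random(n) for _ in range(4))
    C = (u_c < 0.5).astype(int)                                          # climate: 1 = dry season
    S = ((C == 1) & (u_s < 0.8) | (C == 0) & (u_s < 0.2)).astype(int)    # sprinkler
    R = ((C == 0) & (u_r < 0.7) | (C == 1) & (u_r < 0.1)).astype(int)    # rain
    W = (((S == 1) | (R == 1)) & (u_w < 0.95)).astype(int)               # wet pavement
    return C, S, R, W

C, S, R, W = sample_scm()
print("P(W=1) ≈", W.mean())   # an aspect Q(M) of the induced distribution P(v)
```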
COUNTERFACTUALS ARE EMBARRASSINGLY SIMPLE
Definition: The sentence "Y would be y (in situation u), had X been x," denoted Yx(u) = y, means: the solution for Y in the mutilated model Mx (i.e., the equation for X replaced by X = x), with input U = u, is equal to y.
The Fundamental Equation of Counterfactuals: Yx(u) = Y_{Mx}(u)
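A sketch of the mutilation operation on a two-equation model (the variable names, linear forms, and numbers are illustrative, not taken from the slides): to evaluate Yx(u), replace the equation for X by the constant x and re-solve with the same background values u.

```python
def solve(u_x, u_y, x_intervention=None):
    """Solve a toy linear SCM:  X := u_x,  Y := 2*X + u_y.
    If x_intervention is given, the equation for X is replaced by X := x
    (the mutilated model M_x); the background values u stay fixed."""
    X = u_x if x_intervention is None else x_intervention
    Y = 2 * X + u_y
    return X, Y

u = (1.0, 0.5)                       # a particular "situation" u
print(solve(*u))                     # factual world: X = 1.0, Y = 2.5
print(solve(*u, x_intervention=3))   # counterfactual Y_{x=3}(u) = 6.5
```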
THE TWO FUNDAMENTAL LAWS OF CAUSAL INFERENCE
• The Law of Counterfactuals: Yx(u) = Y_{Mx}(u) (M generates and evaluates all counterfactuals.)
• The Law of Conditional Independence (d-separation): separation in the model ⇒ independence in the distribution.
THE LAW OF CONDITIONAL INDEPENDENCE
[Graph: C (Climate) → S (Sprinkler), C → R (Rain), S → W, R → W (Wetness); each variable also has its own exogenous U.]
Each function summarizes millions of micro processes. Still, if the U's are independent, the observed distribution P(C,R,S,W) must satisfy certain constraints that are (1) independent of the f's and of P(U) and (2) can be read from the structure of the graph.
D-SEPARATION: NATURE'S LANGUAGE FOR COMMUNICATING ITS STRUCTURE
[Graph: C (Climate) → S (Sprinkler), C → R (Rain), S → W, R → W (Wetness)]
Every missing arrow advertises an independency, conditional on a separating set.
• Applications:
• Model testing
• Structure learning
• Reducing "what if I do" questions to symbolic calculus
• Reducing scientific questions to symbolic calculus
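A self-contained sketch of a d-separation test, written here from scratch via the standard ancestral-moral-graph criterion (not code from the talk): restrict the graph to ancestors of the variables involved, moralize, delete the conditioning set, and check graph separation. On the sprinkler graph it confirms that S and R are separated given C but not given {C, W}, because W is a collider.

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """True iff the node sets xs and ys are d-separated given zs in the DAG
    described by `parents` (a dict: node -> set of its parents)."""
    # 1. Keep only ancestors of xs | ys | zs.
    relevant, frontier = set(), set(xs) | set(ys) | set(zs)
    while frontier:
        n = frontier.pop()
        if n not in relevant:
            relevant.add(n)
            frontier |= parents[n]
    # 2. Moralize: undirected parent-child edges, plus edges between co-parents.
    adj = {n: set() for n in relevant}
    for child in relevant:
        pas = parents[child] & relevant
        for p in pas:
            adj[p].add(child); adj[child].add(p)
        for p, q in combinations(pas, 2):
            adj[p].add(q); adj[q].add(p)
    # 3. Remove the conditioning set and check reachability from xs to ys.
    blocked = set(zs)
    seen, stack = set(), [n for n in xs if n not in blocked]
    while stack:
        n = stack.pop()
        if n in ys:
            return False          # path found => not d-separated
        if n not in seen:
            seen.add(n)
            stack += [m for m in adj[n] if m not in blocked and m not in seen]
    return True

sprinkler = {"C": set(), "S": {"C"}, "R": {"C"}, "W": {"S", "R"}}
print(d_separated(sprinkler, {"S"}, {"R"}, {"C"}))        # True:  S ⟂ R | C
print(d_separated(sprinkler, {"S"}, {"R"}, {"C", "W"}))   # False: W opens the path
```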
SEEING vs. DOING: the effect of turning the sprinkler ON. Seeing means conditioning in the observed distribution, P(C, R, W | S = on); doing means computing P(C, R, W | do(S = on)) in the mutilated model, where the equation for S is replaced by S = on, so that S no longer carries information about its usual cause C.
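A short numeric illustration using the toy sprinkler SCM sketched earlier (the probabilities are invented): conditioning on S = 1 shifts our belief toward the dry season and hence lowers the rain probability, while do(S = 1) leaves the distribution of C and R untouched.

```python
import numpy as np
rng = np.random.default_rng(0)
n = 500_000

# Same toy sprinkler model as before (illustrative numbers).
u_c, u_s, u_r = rng.random(n), rng.random(n), rng.random(n)
C = (u_c < 0.5).astype(int)                               # dry season
S = np.where(C == 1, u_s < 0.8, u_s < 0.2).astype(int)    # sprinkler
R = np.where(C == 0, u_r < 0.7, u_r < 0.1).astype(int)    # rain

# Seeing: condition on the sprinkler happening to be on.
print("P(R=1 | S=1)     ≈", R[S == 1].mean())   # low: S=1 hints at a dry season

# Doing: replace the equation for S by S := 1.  C and R keep their own
# equations, so the rain probability is simply the marginal P(R=1).
print("P(R=1 | do(S=1)) ≈", R.mean())
```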
THE LOGIC OF CAUSAL ANALYSIS
[Flowchart: CAUSAL MODEL (M_A)]
• A – causal assumptions; A* – logical implications of A
• Causal inference: Q – queries of interest → Q(P) – identified estimands; T(M_A) – testable implications
• Statistical inference: Data (D) → Q̂ – estimates of Q(P); goodness of fit → model testing
• Provisional claims
THE MACHINERY OF CAUSAL CALCULUS
• Rule 1 (ignoring observations): P(y | do{x}, z, w) = P(y | do{x}, w), valid when Z is d-separated from Y (given X, W) in the graph with arrows into X removed.
• Rule 2 (action/observation exchange): P(y | do{x}, do{z}, w) = P(y | do{x}, z, w), valid when Z is d-separated from Y (given X, W) in the graph with arrows into X removed and arrows out of Z removed.
• Rule 3 (ignoring actions): P(y | do{x}, do{z}, w) = P(y | do{x}, w), valid when Z is d-separated from Y (given X, W) in the graph with arrows into X removed and arrows into Z(W) removed, where Z(W) is the set of Z-nodes that are not ancestors of W in that graph.
Completeness Theorem (Shpitser, 2006)
DERIVATION IN CAUSAL CALCULUS
[Front-door graph: Smoking → Tar → Cancer, with an unobserved Genotype confounding Smoking and Cancer.]
Applying the probability axioms together with Rules 2 and 3, step by step, reduces P(c | do(s)) to the do-free (front-door) expression
P(c | do(s)) = Σ_t P(t | s) Σ_s' P(c | t, s') P(s').
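A numeric sanity check of that front-door expression on a simulated model (a sketch; the confounder strength and link probabilities are made up): the adjustment formula, computed from observational quantities alone, should match the effect obtained by actually intervening in the simulation.

```python
import numpy as np
rng = np.random.default_rng(1)
n = 1_000_000

# Observational world (front-door structure, illustrative parameters).
G = rng.random(n) < 0.5                        # unobserved genotype
S = rng.random(n) < np.where(G, 0.8, 0.2)      # smoking, confounded by G
T = rng.random(n) < np.where(S, 0.9, 0.1)      # tar, caused by S only
C = rng.random(n) < 0.1 + 0.5 * T + 0.3 * G    # cancer, caused by T and G

# Ground truth by actually intervening: force S = 1, regenerate T and C.
T1 = rng.random(n) < 0.9
C1 = rng.random(n) < 0.1 + 0.5 * T1 + 0.3 * G
print("simulated  P(C=1 | do(S=1)):", round(C1.mean(), 3))

# Front-door estimate, using observational data only:
#   P(C=1 | do(s)) = sum_t P(t | s) * sum_s' P(C=1 | t, s') P(s')
p_s = {0: 1 - S.mean(), 1: S.mean()}
est = 0.0
for t in (0, 1):
    p_t = (T[S == 1] == t).mean()                                  # P(t | S=1)
    adj = sum(C[(T == t) & (S == s)].mean() * p_s[s] for s in (0, 1))
    est += p_t * adj
print("front-door P(C=1 | do(S=1)):", round(est, 3))
```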
EFFECT OF WARM-UP ON INJURY (after Shrier & Platt, 2008)
TRANSPORTABILITY OF KNOWLEDGE ACROSS DOMAINS (with E. Bareinboim)
• A theory of causal transportability: When can causal relations learned from experiments be transferred to a different environment in which no experiment can be conducted?
• A theory of statistical transportability: When can statistical information learned in one domain be transferred to a different domain in which only a subset of variables can be observed, or only a few samples are available?
EXTERNAL VALIDITY (how transportability is seen in other sciences)
• Extrapolation across studies requires "some understanding of the reasons for the differences." (Cox, 1958)
• "'External validity' asks the question of generalizability: To what population, settings, treatment variables, and measurement variables can this effect be generalized?" (Shadish, Cook and Campbell, 2002)
• "An experiment is said to have 'external validity' if the distribution of outcomes realized by a treatment group is the same as the distribution of outcome that would be realized in an actual program." (Manski, 2007)
• "A threat to external validity is an explanation of how you might be wrong in making a generalization." (Trochim, 2006)
MOVING FROM THE LAB TO THE REAL WORLD . . .
[Figure: the same graph over X, Z, W, Y drawn for the lab and for the real world, under two extreme hypotheses.]
H1: everything is assumed to be the same – trivially transportable!
H2: everything is assumed to be different – not transportable!
MOTIVATION: WHAT CAN EXPERIMENTS IN LA TELL US ABOUT NYC?
Experimental study in LA: X (intervention) → Y (outcome), with Z (age) affecting Y. Measured: P(y | do(x), z) and P(z).
Observational study in NYC: X (observation), Y (outcome), Z (age). Measured: P*(x, y, z). Needed: P*(y | do(x)).
Transport formula (calibration): P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z)
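A small numeric sketch of that calibration step (all numbers invented): the z-specific effects are taken from the LA experiment and re-weighted by New York's age distribution.

```python
# z-specific causal effects P(y=1 | do(x=1), z) estimated in the LA experiment
p_y_do_x_given_z = {"young": 0.30, "middle": 0.50, "old": 0.70}

# Age distributions in the two cities: P(z) in LA, P*(z) in NYC
p_z_LA  = {"young": 0.5, "middle": 0.3, "old": 0.2}
p_z_NYC = {"young": 0.2, "middle": 0.3, "old": 0.5}

effect_LA  = sum(p_y_do_x_given_z[z] * p_z_LA[z]  for z in p_z_LA)
effect_NYC = sum(p_y_do_x_given_z[z] * p_z_NYC[z] for z in p_z_NYC)

print("P(y | do(x)) in LA              :", effect_LA)    # what the experiment reports
print("P*(y | do(x)) transported to NYC:", effect_NYC)   # calibrated by P*(z)
```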
TRANSPORT FORMULAS DEPEND ON THE STORY
[Figure: selection diagrams (a) and (b) over X, Y, Z, each with a selection node S marking the factors producing differences between populations; the position of S differs between the two panels.]
a) Z represents age
b) Z represents language skill
Which transport formula applies in each case?
TRANSPORT FORMULAS DEPEND ON THE STORY
[Figure: selection diagrams (a), (b), and (c) over X, Y, Z with selection node S.]
a) Z represents age
b) Z represents language skill
c) Z represents a bio-marker
Which transport formula applies in each case?
GOAL: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
[Figure: an annotated causal graph over X, Y, Z, W, T, U, V with selection nodes S marking the factors creating differences.]
• INPUT: annotated causal graph (selection diagram)
• OUTPUT:
• Transportable or not?
• Measurements to be taken in the experimental study
• Measurements to be taken in the target population
• A transport formula
TRANSPORTABILITY REDUCED TO CALCULUS
[Figure: a selection diagram over X, W, Z, Y with selection node S.]
Theorem: A causal relation R is transportable from ∏ to ∏* if and only if it is reducible, using the rules of do-calculus, to an expression in which S is separated from do(·).
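A sketch of how such a reduction looks for the age example above (the standard derivation, reconstructed here rather than copied from the slides): S enters only through Z, so conditioning on Z lets the selection node be dropped and the remaining factor be replaced by the target population's age distribution.

```latex
\begin{align*}
P^*(y \mid do(x))
  &= P(y \mid do(x), s) \\
  &= \sum_z P(y \mid do(x), z, s)\, P(z \mid do(x), s)
     && \text{(conditioning on } Z\text{)} \\
  &= \sum_z P(y \mid do(x), z)\, P(z \mid do(x), s)
     && \text{(Rule 1: } S \text{ separated from } Y \text{ given } Z\text{)} \\
  &= \sum_z P(y \mid do(x), z)\, P(z \mid s)
     && \text{(Rule 3: } Z \text{ is not affected by } X\text{)} \\
  &= \sum_z P(y \mid do(x), z)\, P^*(z).
\end{align*}
```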
RESULT: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
[Figure: the same annotated causal graph with selection nodes S marking the factors creating differences.]
• INPUT: annotated causal graph (selection diagram)
• OUTPUT:
• Transportable or not?
• Measurements to be taken in the experimental study
• Measurements to be taken in the target population
• A transport formula
• Completeness (Bareinboim, 2012)
WHICH MODEL LICENSES THE TRANSPORT OF THE CAUSAL EFFECT X → Y?
[Figure: six selection diagrams (a)–(f) over X, Y and, in some panels, Z and W, each with an external selection node S creating disparities; for some placements of S the effect is transportable (Yes), for others it is not (No).]
STATISTICAL TRANSPORTABILITY (Transfer Learning)
Why should we transport statistical information? I.e., why not re-learn things from scratch?
• Measurements are costly. Limit measurements to a subset V* of variables, called the "scope".
• Samples are scarce. Pooling samples from diverse populations will improve precision, if differences can be filtered out.
STATISTICAL TRANSPORTABILITY
[Figure: selection diagram over the chain X → Z → Y with selection node S.]
Definition (Statistical Transportability): A statistical relation R(P) is said to be transportable from ∏ to ∏* over V* if R(P*) is identified from P, P*(V*), and D, where P*(V*) is the marginal distribution of P* over the subset of variables V*.
Example: R = P*(y | x) is transportable over V* = {X, Z}, i.e., R is estimable without re-measuring Y:
P*(y | x) = Σ_z P*(z | x) P(y | z)
Transfer Learning: If few samples (N2) are available from ∏* and many samples (N1) from ∏, then estimating R = P*(y | x) by Σ_z P*(z | x) P(y | z) achieves a much higher precision than estimating it directly from the N2 target samples.
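A sketch of that pooling idea on synthetic data (the structure and the numbers are illustrative, not from the slides): P(y | z) is borrowed from the large source sample, only P*(z | x) is estimated from the small target sample, and the composite estimate is compared with estimating P*(y | x) directly from the target data.

```python
import numpy as np
rng = np.random.default_rng(2)

def sample(n, p_z_given_x):
    """Chain X -> Z -> Y with binary variables.  Only P(z|x) differs
    between source and target; P(x) and P(y|z) are shared."""
    X = (rng.random(n) < 0.5).astype(int)
    Z = (rng.random(n) < p_z_given_x[X]).astype(int)
    Y = (rng.random(n) < np.array([0.2, 0.8])[Z]).astype(int)
    return X, Z, Y

Xs, Zs, Ys = sample(100_000, np.array([0.3, 0.7]))   # many source samples (Pi)
Xt, Zt, Yt = sample(200,     np.array([0.6, 0.9]))   # few target samples (Pi*)

# Composite estimate of P*(y=1 | x=1): P*(z|x) from target, P(y|z) from source.
p_z1_x1_tgt = Zt[Xt == 1].mean()
p_y1_z = np.array([Ys[Zs == 0].mean(), Ys[Zs == 1].mean()])
composite = (1 - p_z1_x1_tgt) * p_y1_z[0] + p_z1_x1_tgt * p_y1_z[1]

# Naive estimate: use the 200 target samples directly.
naive = Yt[Xt == 1].mean()

print("composite (transfer) estimate:", round(composite, 3))
print("direct target-only estimate  :", round(naive, 3))
```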
META-ANALYSIS OR MULTI-SOURCE LEARNING
Target population: R = P*(y | do(x))
[Figure: nine candidate study populations (a)–(i), each represented by a causal diagram over X, W, Z, Y; some panels carry a selection node S, and the panels differ in which variables are confounded or observed.]
CAN WE GET A BIAS-FREE ESTIMATE OF THE TARGET QUANTITY?
Target population: R = P*(y | do(x)). Is R identifiable from studies (d) and (h)?
[Figure: diagrams (a), (d), (h), (i) from the previous slide.]
R(∏*) is identifiable from studies (d) and (h).
R(∏*) is not identifiable from studies (d) and (i).
FROM META-ANALYSIS TO META-SYNTHESIS The problem How to combine results of several experimental and observational studies, each conducted on a different population and under a different set of conditions, so as to construct an aggregate measure of effect size that is "better" than any one study in isolation.
META-SYNTHESIS REDUCED TO CALCULUS
Theorem: Let {∏1, ∏2, …, ∏K} be a set of studies and {D1, D2, …, DK} their selection diagrams (relative to ∏*). A relation R(∏*) is "meta-estimable" if it can be decomposed into terms Qk such that each Qk is transportable from some Dk.
Open problem: systematic decomposition.
BIAS VS. PRECISION IN META-SYNTHESIS
[Figure: studies (a), (g), (h), (i), (d), with arrows indicating calibration and pooling of their estimates.]
Principle 1: Calibrate estimands before pooling (to minimize bias).
Principle 2: Decompose into sub-relations before calibrating (to improve precision).
BIAS VS. PRECISION IN META-SYNTHESIS
[Figure: the same studies (a), (g), (h), (i), (d), with the estimate now built by pooling and composition of sub-relations.]
MISSING DATA: A SEEMINGLY STATISTICAL PROBLEM (Mohan & Pearl, 2012)
• Pervasive in every experimental science.
• Huge literature, powerful software industry, deeply entrenched culture.
• Current practices are based on a statistical characterization (Rubin, 1976) of a problem that is inherently causal.
• Consequence: like alchemy before Boyle and Dalton, the field is craving (1) theoretical guidance and (2) performance guarantees.
WHAT CAN CAUSAL THEORY DO FOR MISSING DATA?
• Q-1. What should the world be like for a given statistical procedure to produce the expected result?
• Q-2. Can we tell from the postulated world whether any method can produce a bias-free result? How?
• Q-3. Can we tell from data if the world does not work as postulated?
• None of these questions can be answered by a statistical characterization of the problem.
• All can be answered using causal models.
MISSING DATA: TWO PERSPECTIVES
• Causal inference is a missing-data problem. (Rubin, 2012)
• Missing data is a causal inference problem. (Pearl, 2012)
Why is missingness a causal problem?
• The mechanism that causes missingness determines whether, and how, information can be recovered from the data.
• Mechanisms require causal language to be described properly – statistics is not sufficient.
• Different causal assumptions lead to different routines for recovering information from data, even when the assumptions are indistinguishable by any statistical means.
ESTIMATE P(X,Y,Z)
[Missingness graph: each partially observed variable X has a proxy X* and a missingness indicator R_X; X* equals X when R_X = 0 and is missing when R_X = 1.]
NAIVE ESTIMATE OF P(X,Y,Z)
Complete cases: estimate P(X,Y,Z) by P(X,Y,Z | Rx = 0, Ry = 0, Rz = 0).
• This line-deletion (listwise-deletion) estimate is unbiased under MCAR, where the missingness indicators Rx, Ry, Rz are independent of X, Y, Z, but it is generally biased otherwise.
[Figure: MCAR missingness graph – no edges between {X, Y, Z} and {Rx, Ry, Rz}.]
SMART ESTIMATE OF P(X,Y,Z)
[Figure: a missingness graph over X, Y, Z and Rx, Ry, Rz under which the joint is recoverable even though the data are not MCAR.]
Factorize, and estimate each factor from the cases in which only the variables it mentions need to be observed:
• Compute P(Y | Ry = 0)
• Compute P(X | Y, Rx = 0, Ry = 0)
• Compute P(Z | X, Y, Rx = 0, Ry = 0, Rz = 0)
P(X,Y,Z) = P(Z | X, Y, Rx=0, Ry=0, Rz=0) · P(X | Y, Rx=0, Ry=0) · P(Y | Ry=0)
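A sketch of that factorized, deletion-based estimator with pandas (the variable names and the particular missingness mechanism are invented for illustration, chosen so that each factor above is estimable): each factor uses the rows in which only the variables it mentions are required to be observed, rather than fully complete cases only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 300_000

# Underlying complete data (chain Y -> X -> Z; parameters are illustrative).
Y = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.3 + 0.4 * Y)
Z = rng.binomial(1, 0.2 + 0.5 * X)

# Missingness mechanism consistent with the factorization above:
# Ry is purely random, Rx depends only on Y, Rz depends only on X.
Ry = rng.binomial(1, 0.2, n)
Rx = rng.binomial(1, 0.1 + 0.4 * Y)
Rz = rng.binomial(1, 0.1 + 0.4 * X)

df = pd.DataFrame({"X": np.where(Rx == 0, X, np.nan),
                   "Y": np.where(Ry == 0, Y, np.nan),
                   "Z": np.where(Rz == 0, Z, np.nan)})

# Smart (factorized) estimate of P(X=1, Y=1, Z=1):
p_y = df["Y"].dropna().mean()                                     # P(Y=1 | Ry=0)
xy  = df.dropna(subset=["X", "Y"])
p_x = xy.loc[xy["Y"] == 1, "X"].mean()                            # P(X=1 | Y=1, Rx=Ry=0)
xyz = df.dropna(subset=["X", "Y", "Z"])
p_z = xyz.loc[(xyz["X"] == 1) & (xyz["Y"] == 1), "Z"].mean()      # P(Z=1 | X=Y=1, all R=0)
print("smart estimate       :", round(p_y * p_x * p_z, 4))

# Naive complete-case estimate and the ground truth, for comparison.
cc = df.dropna()
print("complete-case (naive):", round(((cc["X"] == 1) & (cc["Y"] == 1) & (cc["Z"] == 1)).mean(), 4))
print("ground truth         :", round(((X == 1) & (Y == 1) & (Z == 1)).mean(), 4))
```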
INESTIMABLE P(X,Y,Z)
[Figure: a missingness graph over X, Y, Z and Rx, Ry, Rz under which no such decomposition exists and P(X,Y,Z) is not recoverable.]
RECOVERABILITY FROM MISSING DATA
Definition: Given a missingness model M, a probabilistic quantity Q is said to be recoverable if there exists an algorithm that produces a consistent estimate of Q for every dataset generated by M. (That is, in the limit of a large sample, Q is estimable as if no data were missing.)
Theorem: Q is recoverable iff it is decomposable into terms of the form Qj = P(Sj | Tj), such that:
• For each variable V that is in Tj, RV is also in Tj.
• For each variable V that is in Sj, RV is either in Tj or in Sj.
e.g., the factorization of P(X,Y,Z) used in the "smart estimate" above.