
THE MATHEMATICS OF CAUSE AND EFFECT: With Reflections on Machine Learning Judea Pearl



Presentation Transcript


  1. THE MATHEMATICS OF CAUSE AND EFFECT: With Reflections on Machine Learning Judea Pearl Departments of Computer Science and Statistics UCLA

  2. OUTLINE • The causal revolution – from statistics to counterfactuals • The fundamental laws of causal inference • From counterfactuals to practical victories • policy evaluation • attribution • mediation • generalizability – external validity • missing data

  3. TRADITIONAL STATISTICAL INFERENCE PARADIGM Data → Inference → P (Joint Distribution) → Q(P) (Aspects of P) e.g., Infer whether customers who bought product A would also buy product B. Q = P(B | A)
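A minimal sketch of this paradigm in code: estimate the aspect Q = P(B | A) of the joint distribution directly from data. The purchase records below are invented for illustration.

```python
# Traditional statistical inference: estimate Q = P(B | A) from data.
# Hypothetical purchase records (1 = bought, 0 = did not buy).
records = [
    {"A": 1, "B": 1}, {"A": 1, "B": 0},
    {"A": 1, "B": 1}, {"A": 0, "B": 1},
    {"A": 0, "B": 0}, {"A": 1, "B": 1},
]

# Condition on A = 1 and take the frequency of B = 1 among those rows.
bought_a = [r for r in records if r["A"] == 1]
p_b_given_a = sum(r["B"] for r in bought_a) / len(bought_a)
print(p_b_given_a)  # 3 of the 4 A-buyers also bought B -> 0.75
```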

  4. FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES Data → Inference → P (Joint Distribution) → change → P′ (Joint Distribution) → Q(P′) (Aspects of P′) How does P change to P′? New oracle. e.g., Estimate P′(cancer) if we ban smoking.

  5. FROM STATISTICAL TO CAUSAL ANALYSIS: 1. THE DIFFERENCES Data → Inference → P (Joint Distribution) → change → P′ (Joint Distribution) → Q(P′) (Aspects of P′) e.g., Estimate the probability that a customer who bought A would buy B if we were to double the price.

  6. THE STRUCTURAL MODEL PARADIGM Data → Inference → M (Data-Generating Model) → Joint Distribution → Q(M) (Aspects of M) • M – Invariant strategy (mechanism, recipe, law, protocol) by which Nature assigns values to variables in the analysis. “A painful de-crowning of a beloved oracle!”

  7. WHAT KIND OF QUESTIONS SHOULD THE ORACLE ANSWER? (THE CAUSAL HIERARCHY, A SYNTACTIC DISTINCTION) • Observational questions: “What if we see A?” (What is?) P(y | A) • Action questions: “What if we do A?” (What if?) P(y | do(A)) • Counterfactual questions: “What if we did things differently?” (Why?) P(yA′ | A) • Options: “With what probability?”

  8. e.g., STRUCTURAL CAUSAL MODELS: THE WORLD AS A COLLECTION OF SPRINGS • Definition: A structural causal model is a 4-tuple • <V,U, F, P(u)>, where • V = {V1,...,Vn} are endogenous variables • U={U1,...,Um} are background variables • F = {f1,...,fn} are functions determining V, • vi = fi(v, u) • P(u) is a distribution over U • P(u) and F induce a distribution P(v) over observable variables
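The 4-tuple ⟨V, U, F, P(u)⟩ can be sketched directly in code. The particular mechanisms and probabilities below are invented for illustration, using the sprinkler variables that appear on later slides.

```python
import random

# A sketch of an SCM <V, U, F, P(u)>: background variables U drawn
# from P(u), and functions F assigning values to each endogenous variable.
def sample_scm(rng):
    # P(u): four independent uniform background variables
    u = {"u1": rng.random(), "u2": rng.random(),
         "u3": rng.random(), "u4": rng.random()}
    c = u["u1"] < 0.5                      # C = f1(u1)      (climate: wet?)
    s = (not c) and (u["u2"] < 0.8)        # S = f2(C, u2)   (sprinkler on?)
    r = c and (u["u3"] < 0.7)              # R = f3(C, u3)   (rain?)
    w = (s or r) or (u["u4"] < 0.05)       # W = f4(S, R, u4) (wet pavement?)
    return {"C": c, "S": s, "R": r, "W": w}

# F and P(u) together induce a distribution P(v) over the observables:
rng = random.Random(0)
samples = [sample_scm(rng) for _ in range(10_000)]
p_w = sum(v["W"] for v in samples) / len(samples)
print(round(p_w, 2))
```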

  9. COUNTERFACTUALS ARE EMBARRASSINGLY SIMPLE Definition: The sentence “Y would be y (in situation u), had X been x,” denoted Yx(u) = y, means: the solution for Y in the mutilated model Mx (i.e., the model with the equation for X replaced by X = x), with input U = u, is equal to y. The Fundamental Equation of Counterfactuals: Yx(u) = YMx(u)
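A sketch of the definition in code: Yx(u) is computed by solving a mutilated model in which the equation for X is replaced by the constant X = x. The structural equations here are hypothetical, not from the talk.

```python
# Computing Y_x(u): solve the mutilated model M_x, in which the
# equation for X is replaced by X = x (illustrative linear equations).
def solve(u, x=None):
    z = u["uz"]                                     # Z := u_z
    x_val = x if x is not None else 2 * z + u["ux"] # X := 2Z + u_x (replaced in M_x)
    y = 3 * x_val + z + u["uy"]                     # Y := 3X + Z + u_y
    return y

u = {"uz": 1.0, "ux": 0.5, "uy": -0.2}
factual = solve(u)              # Y(u) in M:  X = 2.5, so Y = 7.5 + 1 - 0.2 = 8.3
counterfactual = solve(u, x=0)  # Y_0(u) in M_0: X held at 0, Y = 0 + 1 - 0.2 = 0.8
print(factual, counterfactual)
```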

  10. THE TWO FUNDAMENTAL LAWS OF CAUSAL INFERENCE The Law of Counterfactuals (M generates and evaluates all counterfactuals.) The Law of Conditional Independence (d-separation) (Separation in the model ⇒ independence in the distribution.)

  11. THE LAW OF CONDITIONAL INDEPENDENCE C (Climate), S (Sprinkler), R (Rain), W (Wetness), with background variables U1–U4. Each function summarizes millions of micro-processes.

  12. THE LAW OF CONDITIONAL INDEPENDENCE C (Climate), S (Sprinkler), R (Rain), W (Wetness), with background variables U1–U4. Each function summarizes millions of micro-processes. Still, if the U's are independent, the observed distribution P(C,R,S,W) must satisfy certain constraints that are (1) independent of the f's and of P(U), and (2) readable from the structure of the graph.

  13. D-SEPARATION: NATURE’S LANGUAGE FOR COMMUNICATING ITS STRUCTURE C (Climate) S (Sprinkler) R (Rain) W (Wetness) Every missing arrow advertises an independency, conditional on a separating set. • Applications: • Model testing • Structure learning • Reducing "what if I do" questions to symbolic calculus • Reducing scientific questions to symbolic calculus
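One independency that the graph advertises: there is no arrow between S and R, so d-separation implies S ⊥ R | C. A simulation sketch with invented mechanisms checks that P(S | C) ≈ P(S | C, R):

```python
import random

# Checking the independence S ⊥ R | C implied by the missing S-R arrow.
# Mechanisms are illustrative: given climate C, sprinkler S and rain R
# are generated by separate, independent background factors.
def draw(rng):
    c = rng.random() < 0.5
    s = rng.random() < (0.1 if c else 0.6)   # sprinkler used less in wet climate
    r = rng.random() < (0.7 if c else 0.1)   # rain more likely in wet climate
    return c, s, r

rng = random.Random(1)
data = [draw(rng) for _ in range(200_000)]

# Condition on C = wet, then compare P(S | C) with P(S | C, R):
given_c = [(s, r) for c, s, r in data if c]
p_s = sum(s for s, _ in given_c) / len(given_c)
p_s_given_r = (sum(s for s, r in given_c if r)
               / sum(1 for _, r in given_c if r))
print(round(p_s, 3), round(p_s_given_r, 3))  # approximately equal
```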

  14. SEEING VS. DOING Effect of turning the sprinkler ON
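The seeing/doing contrast can be simulated; the mechanisms below are illustrative. Observing the sprinkler ON is evidence of a dry climate and so lowers the probability of rain, while turning it ON by intervention leaves that probability untouched:

```python
import random

# Seeing vs. doing in the sprinkler model (illustrative mechanisms):
# P(R | S=1) conditions on observing the sprinkler on; P(R | do(S=1))
# sets S=1 by intervention, cutting the C -> S mechanism.
def sample(rng, do_s=None):
    c = rng.random() < 0.5                           # climate
    if do_s is None:
        s = rng.random() < (0.1 if c else 0.6)       # sprinkler (observational)
    else:
        s = do_s                                     # sprinkler set by intervention
    r = rng.random() < (0.7 if c else 0.1)           # rain
    return c, s, r

rng = random.Random(2)
obs = [sample(rng) for _ in range(200_000)]
p_r_seeing = (sum(r for _, s, r in obs if s)
              / sum(1 for _, s, _ in obs if s))          # P(R | S = 1)

intervened = [sample(rng, do_s=True) for _ in range(200_000)]
p_r_doing = sum(r for *_, r in intervened) / len(intervened)  # P(R | do(S = 1))

print(round(p_r_seeing, 2), round(p_r_doing, 2))  # seeing lowers P(R); doing does not
```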

  15. THE LOGIC OF CAUSAL ANALYSIS A – causal assumptions; Q – queries of interest. Causal inference: the causal model MA yields A* (logical implications of A), Q(P) (identified estimands), and T(MA) (testable implications). Statistical inference: data D yield estimates of Q(P), goodness of fit, and model testing, leading to provisional claims.

  16. THE MACHINERY OF CAUSAL CALCULUS • Rule 1: Ignoring observations P(y | do{x}, z, w) = P(y | do{x}, w) if Z is d-separated from Y given {X, W} in the graph with arrows into X deleted • Rule 2: Action/observation exchange P(y | do{x}, do{z}, w) = P(y | do{x}, z, w) if Z is d-separated from Y given {X, W} in the graph with arrows into X and arrows out of Z deleted • Rule 3: Ignoring actions P(y | do{x}, do{z}, w) = P(y | do{x}, w) if Z is d-separated from Y given {X, W} in the graph with arrows into X deleted and arrows into those Z-nodes that are not ancestors of any W-node deleted Completeness Theorem (Shpitser, 2006)

  17. DERIVATION IN CAUSAL CALCULUS Smoking → Tar → Cancer, with Genotype (unobserved) affecting both Smoking and Cancer. The derivation applies, in order: Probability Axioms, Rule 2, Rule 2, Rule 3, Probability Axioms, Rule 2, Rule 3.
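The equations of this derivation can be reconstructed from the listed rule sequence. Assuming the standard front-door setting (Smoking s, Tar t, Cancer c, with the unobserved genotype confounding smoking and cancer), the steps read:

```latex
\begin{align*}
P(c \mid do(s))
  &= \textstyle\sum_t P(c \mid do(s), t)\, P(t \mid do(s))
     && \text{Probability Axioms} \\
  &= \textstyle\sum_t P(c \mid do(s), t)\, P(t \mid s)
     && \text{Rule 2} \\
  &= \textstyle\sum_t P(c \mid do(s), do(t))\, P(t \mid s)
     && \text{Rule 2} \\
  &= \textstyle\sum_t P(c \mid do(t))\, P(t \mid s)
     && \text{Rule 3} \\
  &= \textstyle\sum_t \sum_{s'} P(c \mid do(t), s')\, P(s' \mid do(t))\, P(t \mid s)
     && \text{Probability Axioms} \\
  &= \textstyle\sum_t \sum_{s'} P(c \mid t, s')\, P(s' \mid do(t))\, P(t \mid s)
     && \text{Rule 2} \\
  &= \textstyle\sum_t \sum_{s'} P(c \mid t, s')\, P(s')\, P(t \mid s)
     && \text{Rule 3}
\end{align*}
```

The final line is the front-door formula: the effect of smoking on cancer expressed entirely in observational quantities, despite the unobserved genotype.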

  18. EFFECT OF WARM-UP ON INJURY (After Shrier & Platt, 2008) “No, no!”

  19. TRANSPORTABILITY OF KNOWLEDGE ACROSS DOMAINS (with E. Bareinboim) • A theory of causal transportability: When can causal relations learned from experiments be transferred to a different environment in which no experiment can be conducted? • A theory of statistical transportability: When can statistical information learned in one domain be transferred to a different domain in which only a subset of variables can be observed, or only a few samples are available?

  20. EXTERNAL VALIDITY (how transportability is seen in other sciences) • Extrapolation across studies requires “some understanding of the reasons for the differences.” (Cox, 1958) • “‘External validity’ asks the question of generalizability: To what population, settings, treatment variables, and measurement variables can this effect be generalized?” (Shadish, Cook and Campbell, 2002) • “An experiment is said to have ‘external validity’ if the distribution of outcomes realized by a treatment group is the same as the distribution of outcomes that would be realized in an actual program.” (Manski, 2007) • “A threat to external validity is an explanation of how you might be wrong in making a generalization.” (Trochim, 2006)

  21. MOVING FROM THE LAB TO THE REAL WORLD . . . H1: everything is assumed to be the same, trivially transportable! H2: everything is assumed to be different, not transportable! (Lab vs. real-world diagrams over X, W, Y, Z.)

  22. MOTIVATION: WHAT CAN EXPERIMENTS IN LA TELL US ABOUT NYC? Experimental study in LA: Z (Age), X (Intervention), Y (Outcome); measured: P(y | do(x)), P(y | do(x), z). Observational study in NYC: Z (Age), X (Observation), Y (Outcome); measured: P*(x, y, z). Needed: P*(y | do(x)). Transport formula (calibration): P*(y | do(x)) = Σz P(y | do(x), z) P*(z)
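A numerical sketch of the calibration: combine the z-specific causal effect measured in the LA experiment with the age distribution observed in NYC. All probability tables below are made up for illustration.

```python
# Transport formula: P*(y | do(x)) = sum_z P(y | do(x), z) * P*(z)
# Hypothetical numbers; Z is binary (e.g., young / old).
p_y_do_x_z = {0: 0.2, 1: 0.6}   # P(Y=1 | do(X=1), Z=z), from the LA experiment
p_star_z = {0: 0.3, 1: 0.7}     # P*(Z=z), observed in NYC

# Re-weight the LA effect by NYC's covariate distribution:
p_star_y_do_x = sum(p_y_do_x_z[z] * p_star_z[z] for z in p_star_z)
print(round(p_star_y_do_x, 2))  # 0.2*0.3 + 0.6*0.7 = 0.48
```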

  23. TRANSPORT FORMULAS DEPEND ON THE STORY Two selection diagrams, (a) and (b), over X, Y, Z, with S marking the factors producing differences. a) Z represents age; b) Z represents language skill. Which formula applies in each case?

  24. TRANSPORT FORMULAS DEPEND ON THE STORY Three selection diagrams, (a), (b), and (c), over X, Y, Z. a) Z represents age; b) Z represents language skill; c) Z represents a bio-marker. Which formula applies in each case?

  25. GOAL: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE • INPUT: Annotated causal graph, with S marking the factors creating differences • OUTPUT: • Transportable or not? • Measurements to be taken in the experimental study • Measurements to be taken in the target population • A transport formula

  26. TRANSPORTABILITY REDUCED TO CALCULUS Theorem: A causal relation R is transportable from ∏ to ∏* if and only if it is reducible, using the rules of do-calculus, to an expression in which S is separated from do(·).

  27. RESULT: ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE • INPUT: Annotated causal graph, with S marking the factors creating differences • OUTPUT: • Transportable or not? • Measurements to be taken in the experimental study • Measurements to be taken in the target population • A transport formula • Completeness (Bareinboim, 2012)

  28. WHICH MODEL LICENSES THE TRANSPORT OF THE CAUSAL EFFECT X → Y? Six selection diagrams, (a)–(f), over X, W, Z, Y, with S marking external factors creating disparities. Answers, in slide order: yes, yes, yes, no, yes, no.

  29. STATISTICAL TRANSPORTABILITY (Transfer Learning) Why should we transport statistical information, i.e., why not re-learn things from scratch? • Measurements are costly. Limit measurements to a subset V* of variables, called the “scope”. • Samples are scarce. Pooling samples from diverse populations will improve precision, if differences can be filtered out.

  30. STATISTICAL TRANSPORTABILITY Definition (Statistical Transportability): A statistical relation R(P) is said to be transportable from ∏ to ∏* over V* if R(P*) is identified from P, P*(V*), and D, where P*(V*) is the marginal distribution of P* over a subset of variables V*. Example: R = P*(y | x) is transportable over V* = {X, Z}, i.e., R is estimable without re-measuring Y: P*(y | x) = Σz P(y | x, z) P*(z | x). Transfer learning: if few samples (N2) are available from ∏* and many samples (N1) from ∏, then estimating R = P*(y | x) by this formula achieves a much higher precision.
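A numerical sketch of the estimator: R = P*(y | x) is assembled from P(y | x, z) learned in ∏ and P*(z | x) measured in ∏*, so Y never needs to be re-measured in ∏*. The tables are illustrative.

```python
# Statistical transportability: R = P*(y | x) = sum_z P(y | x, z) * P*(z | x)
# Hypothetical binary tables for X = 1.
p_y_given_xz = {(1, 0): 0.3, (1, 1): 0.8}   # P(Y=1 | X=1, Z=z), learned in source domain
p_star_z_given_x = {0: 0.4, 1: 0.6}          # P*(Z=z | X=1), measured in target domain

# Assemble R without ever measuring Y in the target domain:
r = sum(p_y_given_xz[(1, z)] * p for z, p in p_star_z_given_x.items())
print(round(r, 2))  # 0.3*0.4 + 0.8*0.6 = 0.6
```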

  31. META-ANALYSIS OR MULTI-SOURCE LEARNING Target population: R = P*(y | do(x)). Nine candidate source populations, (a)–(i), each described by a selection diagram over X, W, Z, Y (some containing S-nodes).

  32. CAN WE GET A BIAS-FREE ESTIMATE OF THE TARGET QUANTITY? Target population: R = P*(y | do(x)). Is R identifiable from (d) and (h)? R(∏*) is identifiable from studies (d) and (h); R(∏*) is not identifiable from studies (d) and (i).

  33. FROM META-ANALYSIS TO META-SYNTHESIS The problem How to combine results of several experimental and observational studies, each conducted on a different population and under a different set of conditions, so as to construct an aggregate measure of effect size that is "better" than any one study in isolation.

  34. META-SYNTHESIS REDUCED TO CALCULUS Theorem: {∏1, ∏2, …, ∏K} – a set of studies; {D1, D2, …, DK} – selection diagrams (relative to ∏*). A relation R(∏*) is “meta estimable” if it can be decomposed into terms Qk such that each Qk is transportable from Dk. Open problem: systematic decomposition.

  35. BIAS VS. PRECISION IN META-SYNTHESIS Studies (a), (g), (h), (i), (d): calibration, then pooling. Principle 1: Calibrate estimands before pooling (to minimize bias). Principle 2: Decompose to sub-relations before calibrating (to improve precision).

  36. BIAS VS. PRECISION IN META-SYNTHESIS Studies (a), (g), (h), (i), (d): pooling, composition, pooling.

  37. MISSING DATA: A SEEMINGLY STATISTICAL PROBLEM (Mohan & Pearl, 2012) • Pervasive in every experimental science. • Huge literature, powerful software industry, deeply entrenched culture. • Current practices are based on a statistical characterization (Rubin, 1976) of a problem that is inherently causal. • Consequence: like alchemy before Boyle and Dalton, the field is craving (1) theoretical guidance and (2) performance guarantees.

  38. ESTIMATE P(X,Y,Z)

  39. WHAT CAN CAUSAL THEORY DO FOR MISSING DATA? • Q-1. What should the world be like, for a given statistical procedure to produce the expected result? • Q-2. Can we tell from the postulated world whether any method can produce a bias-free result? How? • Q-3. Can we tell from data if the world does not work as postulated? • None of these questions can be answered by statistical characterization of the problem; all can be answered using causal models.

  40. MISSING DATA: TWO PERSPECTIVES • Causal inference is a missing-data problem. (Rubin, 2012) • Missing data is a causal inference problem. (Pearl, 2012) Why is missingness a causal problem? • The mechanism that causes missingness makes a difference in whether, and how, we can recover information from the data. • Mechanisms require causal language to be described properly; statistics is not sufficient. • Different causal assumptions lead to different routines for recovering information from data, even when the assumptions are indistinguishable by any statistical means.

  41. ESTIMATE P(X,Y,Z) Missingness graph: each partially observed variable X is accompanied by a missingness indicator RX and a proxy X*, where X* = X when RX = 0 and X* is missing when RX = 1.

  42. NAIVE ESTIMATE OF P(X,Y,Z) Complete cases only (MCAR graph: X, Y, Z with indicators Rx, Ry, Rz independent of the variables). Under the MCAR model shown, the complete-case estimate is consistent; in general, however, the line-deletion estimate is biased.
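A simulation sketch of why line deletion can fail: it is consistent when missingness is independent of the data (MCAR), and biased when the missingness of X depends on X itself. The mechanisms and numbers are invented for illustration.

```python
import random

# Line (listwise) deletion: keep only the fully observed values.
# MCAR: R_x independent of X  -> complete-case mean is consistent.
# MNAR: R_x depends on X itself -> complete-case mean is biased.
rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
true_mean = sum(xs) / len(xs)

mcar = [x for x in xs if rng.random() < 0.7]                      # keep 70%, at random
mnar = [x for x in xs if rng.random() < (0.9 if x > 0 else 0.3)]  # positives kept more often

mean_mcar = sum(mcar) / len(mcar)
mean_mnar = sum(mnar) / len(mnar)
print(round(mean_mcar - true_mean, 2))  # near zero
print(round(mean_mnar - true_mean, 2))  # clearly positive
```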

  43. SMART ESTIMATE OF P(X,Y,Z) (Missingness graph over X, Y, Z with indicators Rx, Ry, Rz.)

  44. SMART ESTIMATE OF P(X,Y,Z) Factorize: P(X,Y,Z) = P(Z | X,Y) P(X | Y) P(Y), then estimate each factor from the cases in which the needed variables are observed.

  45. SMART ESTIMATE OF P(X,Y,Z) Compute P(Y|Ry=0)

  46. SMART ESTIMATE OF P(X,Y,Z) Compute P(Y|Ry=0) Compute P(X|Y,Rx=0,Ry=0)

  47. SMART ESTIMATE OF P(X,Y,Z) Compute P(Y|Ry=0) Compute P(X|Y,Rx=0,Ry=0) Compute P(Z|X,Y,Rx=0,Ry=0,Rz=0)
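The three compute steps above can be sketched on toy binary data. The rows, the missingness pattern, and the helper `cond_prob` are all illustrative (a missing value is marked `None`, i.e., its R-indicator is 1):

```python
# Sequential estimate of P(X,Y,Z) from partially observed rows:
#   P(Y | Ry=0), then P(X | Y, Rx=0, Ry=0), then P(Z | X, Y, all R=0).
rows = [
    {"x": 1, "y": 1, "z": 1},
    {"x": 1, "y": 1, "z": 0},
    {"x": 0, "y": 1, "z": 1},
    {"x": 1, "y": 0, "z": None},   # Rz = 1: Z missing
    {"x": None, "y": 1, "z": 1},   # Rx = 1: X missing
    {"x": 0, "y": 0, "z": 0},
]

def cond_prob(rows, target, value, given):
    # P(target = value | given), using only rows in which the target and
    # every conditioning variable are actually observed (R = 0).
    match = [r for r in rows
             if r[target] is not None
             and all(r[k] == v for k, v in given.items())]
    return sum(1 for r in match if r[target] == value) / len(match)

p_y = cond_prob(rows, "y", 1, {})                 # P(Y=1 | Ry=0)
p_x = cond_prob(rows, "x", 1, {"y": 1})           # P(X=1 | Y=1, Rx=0, Ry=0)
p_z = cond_prob(rows, "z", 1, {"x": 1, "y": 1})   # P(Z=1 | X=1, Y=1, all R=0)

# Chain the factors to estimate P(X=1, Y=1, Z=1):
print(round(p_y, 3), round(p_x, 3), round(p_z, 3), round(p_y * p_x * p_z, 3))
```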

  48. INESTIMABLE P(X,Y,Z) (A missingness graph over X, Y, Z and Rx, Ry, Rz under which P(X,Y,Z) is not recoverable.)

  49. RECOVERABILITY FROM MISSING DATA Definition: Given a missingness model M, a probabilistic quantity Q is said to be recoverable if there exists an algorithm that produces a consistent estimate of Q for every dataset generated by M. Theorem: Q is recoverable iff it is decomposable into terms of the form Qj = P(Sj | Tj), such that: (1) for each variable V that is in Tj, RV is also in Tj; (2) for each variable V that is in Sj, RV is either in Tj or in Sj. (That is, in the limit of large samples, Q is estimable as if no data were missing.)
