Causal Data Mining. Richard Scheines, Dept. of Philosophy, Machine Learning, & Human-Computer Interaction, Carnegie Mellon.
Causal Graphs. Causal Graph G = {V, E}. Each edge X → Y represents a direct causal claim: X is a direct cause of Y relative to V. [Figure: example causal graph (Chicken Pox)]
Causal Bayes Networks
The joint distribution factors according to the causal graph, i.e., with one factor for each X in V:
P(V) = ∏_{X ∈ V} P(X | Immediate Causes of(X))
P(S = 0) = .7              P(S = 1) = .3
P(YF = 0 | S = 0) = .99    P(LC = 0 | S = 0) = .95
P(YF = 1 | S = 0) = .01    P(LC = 1 | S = 0) = .05
P(YF = 0 | S = 1) = .20    P(LC = 0 | S = 1) = .80
P(YF = 1 | S = 1) = .80    P(LC = 1 | S = 1) = .20
P(S, YF, LC) = P(S) P(YF | S) P(LC | S)
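The factorization on this slide can be checked numerically. The following is a minimal Python sketch, not part of the original deck: it copies the conditional probabilities for S, YF, and LC from the slide, builds the full joint table from P(S) P(YF | S) P(LC | S), and then reads off two derived quantities. The dictionary and function names are illustrative only.

```python
from itertools import product

# Probabilities copied from the slide.
p_s = {0: 0.7, 1: 0.3}
# Indexed as table[s][value]: P(YF = value | S = s), P(LC = value | S = s).
p_yf_given_s = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.20, 1: 0.80}}
p_lc_given_s = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.80, 1: 0.20}}

def joint(s, yf, lc):
    """P(S, YF, LC) = P(S) P(YF | S) P(LC | S)."""
    return p_s[s] * p_yf_given_s[s][yf] * p_lc_given_s[s][lc]

# Full joint table over the 8 assignments of (S, YF, LC).
table = {(s, yf, lc): joint(s, yf, lc)
         for s, yf, lc in product((0, 1), repeat=3)}
total = sum(table.values())  # a valid joint distribution sums to 1

# Marginal and conditional probabilities obtained by summing the joint.
p_lc1 = sum(p for (s, yf, lc), p in table.items() if lc == 1)
p_yf1 = sum(p for (s, yf, lc), p in table.items() if yf == 1)
p_lc1_and_yf1 = sum(p for (s, yf, lc), p in table.items()
                    if lc == 1 and yf == 1)
p_lc1_given_yf1 = p_lc1_and_yf1 / p_yf1

print(f"sum of joint        = {total:.3f}")
print(f"P(LC = 1)           = {p_lc1:.3f}")            # 0.095
print(f"P(LC = 1 | YF = 1)  = {p_lc1_given_yf1:.3f}")  # 0.196
```

In this parameterization, observing YF = 1 raises the probability of LC = 1 from about .095 to about .196, even though the graph has no edge between YF and LC: the association comes entirely from their common cause S.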
Structural Equation Models
• Structural Equations: one equation for each variable V in the graph: V = f(parents(V), error_V). For SEM (linear regression), f is a linear function.
• Statistical Constraints: joint distribution over the error terms.
[Figure: Causal Graph]
Structural Equation Models
[Figure panels: Causal Graph; SEM Graph (path diagram)]
Equations:
Education = ε_ed
Income = β₁ Education + ε_Income
Longevity = β₂ Education + ε_Longevity
Statistical Constraints: (ε_ed, ε_Income, ε_Longevity) ~ N(0, Σ²), with Σ² diagonal and no variance equal to zero
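To make the structural equations concrete, here is a small simulation sketch in Python (standard library only). It is an illustration under assumptions, not the deck's own example: the coefficient values 0.8 and 0.5, the unit error variances, the sample size, and all function names are hypothetical. It draws independent Gaussian errors, generates Education, Income, and Longevity from the three linear equations, and compares the plain correlation between Income and Longevity with their partial correlation given Education.

```python
import random
import statistics

def simulate(n=20000, b1=0.8, b2=0.5, seed=0):
    """Draw n samples from a linear SEM with independent Gaussian errors.

    b1, b2 and the unit error variances are hypothetical values.
    """
    rng = random.Random(seed)
    education, income, longevity = [], [], []
    for _ in range(n):
        ed = rng.gauss(0, 1)               # Education = error_ed
        inc = b1 * ed + rng.gauss(0, 1)    # Income = b1 * Education + error
        lon = b2 * ed + rng.gauss(0, 1)    # Longevity = b2 * Education + error
        education.append(ed)
        income.append(inc)
        longevity.append(lon)
    return education, income, longevity

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

education, income, longevity = simulate()
r_marginal = corr(income, longevity)
r_partial = partial_corr(income, longevity, education)
print(f"corr(Income, Longevity)             = {r_marginal:.3f}")
print(f"corr(Income, Longevity | Education) = {r_partial:.3f}")
```

Under these assumed values, Income and Longevity are clearly correlated, while their partial correlation given Education is close to zero. Vanishing partial correlations of this kind are the sort of statistical constraint that a diagonal error covariance imposes on the data.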
Tetrad 4: Demo www.phil.cmu.edu/projects/tetrad
Causal Data Mining in Ed. Research
• Collect Raw Data
• Build Meaningful Variables
• Constrain Model Space with Background Knowledge
• Search for Models
• Estimate and Test
• Interpret
CSR Online
Are online students learning as much? What features of online behavior matter?
CSR Online
Are online students learning as much?
Raw Data: Pitt 2001, 87 students
For everyone: pre-test, recitation attendance, final exam
For online students, logged: voluntary question attempts, online quizzes, requests to print modules
CSR Online
Build Meaningful Variables:
• Online [0,1]
• Pre-test [%]
• Recitation Attendance [%]
• Final Exam [%]
CSR Online Data: Correlation Matrix (corrs.dat, N=83)
CSR Online
Background Knowledge: Temporal Tiers:
• Online, Pre
• Rec
• Final
CSR Online
Model Search:
No latents (patterns, with PC or GES search)
- no time order: 729 models
- temporal tiers: 96 models
With latents (PAGs, with FCI search)
- no time order: 4,096 models
- temporal tiers: 2,916 models
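The slide does not say how these counts were obtained. As a hedged illustration, the Python sketch below uses one simple counting convention (an assumption, not the deck's stated method) that happens to reproduce three of the four figures: every pair of the four variables is assigned one of a fixed number of edge states independently, and acyclicity is not checked. Without latents a pair has 3 states (no edge, or an edge in either direction); temporal tiers remove the backward direction for pairs in different tiers, leaving 2; allowing one extra state per pair for a latent common cause gives 4. The sketch does not attempt the fourth figure (2,916).

```python
from itertools import combinations

variables = ["Online", "Pre", "Rec", "Final"]
# Temporal tiers from the background-knowledge slide.
tier = {"Online": 0, "Pre": 0, "Rec": 1, "Final": 2}

def count_models(states_same_tier, states_cross_tier):
    """Multiply the number of allowed edge states over all variable pairs.

    Assumed convention: pairs are treated independently and
    acyclicity is ignored.
    """
    total = 1
    for a, b in combinations(variables, 2):
        if tier[a] == tier[b]:
            total *= states_same_tier
        else:
            total *= states_cross_tier
    return total

# No latents, no time order: 3 states for each of the 6 pairs.
patterns_no_order = count_models(3, 3)   # 3**6 = 729
# No latents, temporal tiers: cross-tier edges can only point forward.
patterns_tiered = count_models(3, 2)     # 3 * 2**5 = 96
# Latents, no time order: one extra state per pair.
latents_no_order = count_models(4, 4)    # 4**6 = 4096

print(patterns_no_order, patterns_tiered, latents_no_order)
```

Whatever the exact convention, the tiers illustrate how much background knowledge shrinks the search space: in this sketch, from 729 candidate structures to 96.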
Tetrad Demo Online vs. Lecture Data file: corrs.dat
Estimate and Test: Results
• Model fit excellent
• Online students attended 10% fewer recitations
• Each recitation gives an increase of 2% on the final exam
• Online students did half a standard deviation better than lecture students (p = .059)