Multiple Testing of Causal Hypotheses

Presentation Transcript


  1. Multiple Testing of Causal Hypotheses Samantha Kleinberg NYU Bioinformatics Group, Courant Institute, NYU 9/12/08 (Jointly with Bud Mishra.)

  2. Motivation • It is frequently said that "smoking causes lung cancer" • But what about other ways of developing cancer, and other conditions required to develop it? • Goal: find the details of this relationship • How probable is it that someone will get cancer if they smoke? • How long will this take to happen? [Diagram: Smoking → Lung Cancer]

  3. Motivation, continued • Compare: • A. Smoking causes lung cancer with probability ≈ 1 after 90 years • B. Smoking causes lung cancer with probability = ½ in less than 10 years. • Different implications! • Also, consider other conditions that will make cancer more likely

  4. Neural spike train data • Simulation of neural spike trains • 26 neurons, 5 causal structures • At each time point: • A neuron can fire randomly (dependent on the noise level) • A neuron can be triggered by one of the neurons that causes it to fire • Known information: • A neuron has a 20 time unit refractory period • There is a window of 20 time units after the refractory period during which it can activate another neuron [Timeline: neuron A fires at t; refractory until t+20; may trigger neuron B between t+20 and t+40] Data from the 2006 KDD workshop on temporal data mining (K.P. Unnikrishnan, Naren Ramakrishnan, P.S. Sastry).
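
A minimal simulation sketch of this setup (the function name, the noise model, and the update order are assumptions, not details of the KDD workshop data):

```python
import random

REFRACTORY = 20   # time units a neuron stays silent after firing
WINDOW = 20       # window after the refractory period in which it can trigger a target

def simulate(n_steps, targets, n_neurons=26, noise=0.01, seed=0):
    """Simulate spike trains: a neuron fires randomly (noise) or is triggered
    by a parent that fired 20-40 time units earlier.  `targets` maps a neuron
    to the set of neurons it can cause to fire."""
    rng = random.Random(seed)
    last_fire = [float("-inf")] * n_neurons   # time of each neuron's last spike
    spikes = [[] for _ in range(n_neurons)]
    for t in range(n_steps):
        for n in range(n_neurons):
            if t - last_fire[n] <= REFRACTORY:
                continue                      # still refractory, cannot fire
            triggered = any(
                n in targets.get(p, set()) and
                REFRACTORY < t - last_fire[p] <= REFRACTORY + WINDOW
                for p in range(n_neurons)
            )
            if triggered or rng.random() < noise:
                last_fire[n] = t
                spikes[n].append(t)
    return spikes
```

Here a triggered neuron fires deterministically; in the actual simulation the triggering could itself be probabilistic.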

  5. Patterns 1-3 [Figures: causal structures for Patterns 1, 2, and 3]

  6. Patterns 4 and 5 [Figures: causal structures for Patterns 4 and 5]

  7. Desiderata • A (philosophically) sound notion of causality • It should work with the kinds of data that are available, in a variety of domains: politics, finance, biology • A (logically) rigorous method of expressing these notions of causality • It should capture the probabilistic nature of the data • It should be able to reason about time; time must be metric, capturing a notion of locality • An (algorithmic) automated method for finding all prima facie causes • Model checking • A (statistically) sound method for finding all genuine causes

  8. Causal relationships as logical formulae • Prima facie causes: earlier than the effect, raise the probability of the effect • Represent both cause and effect as formulae in a probabilistic temporal logic • Example: (a ∧ b) U c • Neural spike trains: the time window between cause and effect is 20-40 time units
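
As a concrete reading of the prima facie test on the spike data: check whether a spike of c is followed by a spike of e in the 20-40 window more often than e fires overall. This is only a sketch; the storage format and the baseline estimate are assumptions:

```python
def p_e_given_c(c_times, e_times, lo=20, hi=40):
    """Estimate P(e occurs in [t+lo, t+hi] | c fired at t)."""
    if not c_times:
        return 0.0
    hits = sum(any(t + lo <= s <= t + hi for s in e_times) for t in c_times)
    return hits / len(c_times)

def p_e(e_times, n_steps, width=21):
    """Crude baseline: P(e fires somewhere in a window of the same width)."""
    return min(1.0, len(e_times) * width / n_steps)

def is_prima_facie(c_times, e_times, n_steps):
    """c is a prima facie cause of e if it occurs earlier and raises e's probability."""
    return p_e_given_c(c_times, e_times) > p_e(e_times, n_steps)
```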

  9. Causal Hypotheses • Need not consider all other events; just the other prima facie causes of e • Why? • We are testing whether, in the presence of some other factor x, c is still correlated with e • For x to remove c's influence, it must be correlated with e itself • This provides a way to narrow down the factors that must be considered • Compute the difference the cause makes to the effect • But which values of this difference are significant?
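
One way to read "the difference the cause makes" is an average, over the other prima facie causes x of e, of P(e | c ∧ x) − P(e | ¬c ∧ x). A sketch under that reading; cond_prob is an assumed helper that estimates conditional probabilities from the data:

```python
def eps_avg(c, e, prima_facie, cond_prob):
    """Average difference c makes to e, measured against every other
    prima facie cause x of e.  `cond_prob(e, conditions)` is assumed to
    return an estimate of P(e | all events in conditions)."""
    others = [x for x in prima_facie[e] if x != c]
    if not others:
        return cond_prob(e, [c]) - cond_prob(e, [("not", c)])
    diffs = [cond_prob(e, [c, x]) - cond_prob(e, [("not", c), x]) for x in others]
    return sum(diffs) / len(diffs)
```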

  10. FDR • FDR = V/R: the proportion of false discoveries, where V is the number of null hypotheses called significant and R is the total number called significant • Local FDR (fdr): for each hypothesis, compute the probability that it is null given its z-value

  11. Two groups of data • Two classes with prior probabilities: • p0 = Pr(uninteresting), with density f0(z) • p1 = Pr(interesting), with density f1(z) • Assume p0 is large • Mixture density: f(z) = p0 f0(z) + p1 f1(z) • Probability of being uninteresting given z-value z: fdr(z) ≈ Pr(null | z) = p0 f0(z) / f(z)
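
A small numerical illustration of the two-groups formula, with an assumed null N(0,1), an assumed alternative N(3,1), and p0 = 0.95; in the actual method f and the null are estimated from the data (next slide):

```python
from math import exp, pi, sqrt

def normal_pdf(z, mu=0.0, sigma=1.0):
    return exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def local_fdr(z, p0, f0, f1):
    """fdr(z) = Pr(null | z) = p0*f0(z) / f(z), where f = p0*f0 + (1-p0)*f1."""
    f = p0 * f0(z) + (1 - p0) * f1(z)
    return p0 * f0(z) / f

# With these assumed densities, fdr(0) is essentially 1 (looks null)
# while fdr(3) drops to about 0.17.
print(local_fdr(0.0, 0.95, normal_pdf, lambda z: normal_pdf(z, mu=3.0)))
print(local_fdr(3.0, 0.95, normal_pdf, lambda z: normal_pdf(z, mu=3.0)))
```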

  12. Steps • 1. Estimate the distribution of the data, f(z) • E.g., using splines or Poisson regression • 2. Define the null density f0(z) from the data • One method is to fit to the central peak of the data • 3. Calculate fdr(z) • 4. Call hypotheses Hi with fdr(zi) < threshold interesting • A common threshold is 0.10
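
A rough sketch of these four steps; the quadratic log-count fit (a stand-in for a spline or Poisson-regression fit), the central-matching window for the null, and treating p0 ≈ 1 are all simplifying assumptions:

```python
import numpy as np

def local_fdr_pipeline(z, threshold=0.10, n_bins=60):
    """1. estimate f(z); 2. estimate the empirical null f0(z) from the
    central peak; 3. fdr(z) = p0*f0(z)/f(z) with p0 taken as 1;
    4. flag hypotheses with fdr below the threshold."""
    z = np.asarray(z, dtype=float)

    # 1. Mixture density f(z): quadratic fit to log histogram counts
    counts, edges = np.histogram(z, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    coef = np.polyfit(mids, np.log(counts + 0.5), deg=2)
    width = edges[1] - edges[0]
    f = np.exp(np.polyval(coef, z)) / (len(z) * width)

    # 2. Empirical null: normal fitted to the central z-values
    central = z[np.abs(z - np.median(z)) < np.std(z)]
    mu0, sigma0 = central.mean(), central.std()
    f0 = np.exp(-0.5 * ((z - mu0) / sigma0) ** 2) / (sigma0 * np.sqrt(2 * np.pi))

    # 3-4. Local fdr and the "interesting" calls
    fdr = np.clip(f0 / f, 0.0, 1.0)
    return fdr, fdr < threshold
```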

  13. Causal Inference • Enumerate logical formulae describing possible causes • From the experimental data, determine the prima facie causes • Calculate ε for each and translate these to z-values • From the set of z-values, estimate the empirical null, and label prima facie causes whose z-value has fdr(z) < threshold as genuine
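
Tying the steps together, reusing the local_fdr_pipeline sketch above; the simple standardization of ε into z-values is an assumption, not necessarily the transformation used in the talk:

```python
import numpy as np

def infer_genuine_causes(eps_by_hypothesis, threshold=0.10):
    """eps_by_hypothesis: {(cause, effect): eps} for every prima facie cause.
    Standardize the eps values into z-values, run the local-fdr pipeline,
    and return the hypotheses labelled genuine."""
    hyps = list(eps_by_hypothesis)
    eps = np.array([eps_by_hypothesis[h] for h in hyps])
    z = (eps - eps.mean()) / eps.std()          # assumed z-transformation
    fdr, interesting = local_fdr_pipeline(z, threshold)
    return [h for h, keep in zip(hyps, interesting) if keep]
```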

  14. Examples

  15. Neural data • We used the multiple hypothesis testing framework • Empirical null: N(-0.15, 0.39) • Genuine causes have z > 3

  16. Political data • Empirical null: N(0.39, 0.96) • No genuine causes with z > 0, but look at z < 0 • 3 phrases with local false discovery rate fdr < 0.1, all with z around -3: homes, progress, Lebanon • What does this mean? • For example, "had President Bush NOT said 'homes', his rating would have gone down"

  17. Cellular data • Looked at relationships between pairs of genes where the relationship takes place at the next unit of time • Empirical null: N(-1.00, 0.89) • Thousands of prima facie causes with fdr(z) < 0.1

  18. Conclusion • New method of representing causality • Can describe probabilistic relationships with a temporal component • Allows for arbitrarily complex causes • Can infer relationships with model-checking • Automated way of finding prima facie causes • Statistical method of determining genuine causality • What’s next? • Inferring time between cause and effect • Magnitude of relationship • Testing on larger data sets with more complex structures Questions? samantha@cs.nyu.edu
