Testing hypotheses using model selection. Eric D. Stolen, InoMedic Health Applications, Ecological Program, Kennedy Space Center, Florida. NASA Environmental Management Branch.
We hve inv st d a l t of t m ndeffrt in cratng R, pls cte it whn usng it fr dtnlyss.
We have invested a lot of time and effort in creating R, please cite it when using it for data analysis.
“The human understanding, once it has adopted an opinion, collects any instances that confirm it, and though the contrary instances may be more numerous and more weighty, it either does not notice them or else rejects them, in order that this opinion will remain unshaken.” - Francis Bacon (1620)
Outline • Science issues • The method of multiple working hypotheses • Statistical models as science tools • Making inference in science • Information-theoretic model selection • Multi-model inference
Science What is it?
Science is the organized process of creating testable explanations of how the natural world works.
Understanding Theory Hypothesis
Hypothetico-deductive model • Generate hypothesis (from theory) • Make a prediction from the hypothesis • Conduct experiment to test prediction • Decide whether or not the theory is supported
Hypothetico-deductive model • Taught from primary school through graduate school • Not the way science is done in many fields • Modern science is largely inductive
Null hypothesis testing H0: No effect HA: Effect of interest Probability{ data | H0 } Is this what we want to know?
Null hypothesis testing • Karl Pearson (1857-1936) • Jerzy Neyman (1894-1981) • R. A. Fisher (1890-1962) • Known as the frequentist approach • Not what Fisher, Neyman, or Pearson intended!
NHT problems • Some problems: • Silly nulls • Slow progress • Many systems not amenable • Inference dependent upon the sample space • Fosters unthinking approaches
An alternative: Probability{ HA | data }
Multiple working hypotheses • Thomas C. Chamberlin (1843-1928) • Geologist • President, University of Wisconsin • Director, Walker Museum, and Chair, Dept. of Geology, University of Chicago • President of the American Association for the Advancement of Science Chamberlin, T. C. 1890. The method of multiple working hypotheses. Science 15:92-96 (reprinted 1965, Science 148:754-759)
Reality Theory Data Alternative Hypotheses
Multiple working hypotheses Wading bird group foraging • H1: No effect • H2: Group effect same for all species • H3: Group effect differs by species • H4: (Group by species) + prey density • H5: Group + prey density • H6: (Group by species) + prey + habitat
Mathematical models in science “Nature's great book is written in mathematics.” - Galileo Galilei
Mathematical models in science Empirical Models Mechanistic Models Ecology Chemistry in 19th Century Climatology Physics Modern Chemistry Molecular biology
Generalized Linear Model • Three parts • Probability distribution (error): $Y_i \sim N(\mu_i, \sigma^2)$ • Link function: $E(Y_i) = \mu_i$ • Linear equation: $\mu_i = \eta(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{iq})$
Generalized Linear Model Y = b0 + b1X1 + b2X2 + e • Linear regression and ANOVA • Link function – Identity link • linear equation • error distribution – Normal Distribution (Gaussian)
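The linear-regression case above can be sketched numerically. This is a minimal illustration, assuming simulated data: the coefficient values, sample size, and noise level below are made up for the example and do not come from the slides. With an identity link and Normal errors, the maximum likelihood estimates of b0, b1, b2 coincide with the ordinary least-squares solution.

```python
import numpy as np

# Simulated data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
b_true = np.array([1.0, 2.0, -0.5])      # assumed b0, b1, b2
e = rng.normal(scale=0.3, size=n)        # Normal (Gaussian) error
Y = b_true[0] + b_true[1] * X1 + b_true[2] * X2 + e

# Design matrix: a column of ones for the intercept b0, then X1 and X2
X = np.column_stack([np.ones(n), X1, X2])

# Least-squares estimates of b0, b1, b2
b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residuals are the estimated errors e
resid = Y - X @ b_hat
print("estimates:", np.round(b_hat, 3))
```

The estimates should land close to the assumed coefficients, and the residuals are orthogonal to every column of the design matrix, which is the defining property of the least-squares fit.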
Generalized Linear Model Logit(p) = b0 + b1X1 + b2X2 • Logistic Regression • Link function - Logit link: ln(p / (1-p)) • linear equation • error distribution – Binomial Distribution (the randomness enters through the Binomial distribution, not through an additive error term e)
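A small sketch of how the logit link connects the linear equation to a probability. The coefficients and covariate values are hypothetical, chosen only to show the arithmetic.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: maps the linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and covariate values (not from the slides)
b0, b1, b2 = -1.0, 0.8, 0.5
x1, x2 = 2.0, 1.0

eta = b0 + b1 * x1 + b2 * x2   # linear equation on the logit scale
p = inv_logit(eta)             # implied probability of "success"
print(f"eta = {eta:.3f}, p = {p:.3f}")
```

Applying `logit` to `p` recovers `eta`, which is what it means for the model to be linear on the logit scale rather than on the probability scale.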
Maximum likelihood estimation • R. A. Fisher (1890-1962) • The parameter estimates that are most likely, given the data and the model • Example • Receive a cookie from the cafeteria on each of 11 days • Observe 7 chocolate chip and 4 oatmeal raisin • What is the best estimate of p = proportion chocolate chip (given the observed data)?
Maximum likelihood estimation “CC” “CC” “OR” “CC” “CC” “OR” “OR” “CC” “OR” “CC” “CC”
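The cookie example can be worked through directly. Assuming the 11 cookies are independent draws with a constant probability p of chocolate chip (an assumption of this sketch), the likelihood is $L(p) = p^{7}(1-p)^{4}$. The sketch below finds the maximum by a simple grid search and compares it with the closed-form answer 7/11; the grid resolution is an arbitrary choice.

```python
import math

# Observed sequence from the slide: CC = chocolate chip, OR = oatmeal raisin
cookies = ["CC", "CC", "OR", "CC", "CC", "OR", "OR", "CC", "OR", "CC", "CC"]
n_cc = cookies.count("CC")
n_or = cookies.count("OR")

def log_likelihood(p):
    """Log-likelihood of p under independent Bernoulli draws."""
    return n_cc * math.log(p) + n_or * math.log(1.0 - p)

# Grid search over candidate values of p in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

# Closed-form maximum likelihood estimate: the observed proportion
p_closed = n_cc / (n_cc + n_or)

print(f"grid search: {p_grid:.3f}, closed form: {p_closed:.3f}")
```

Both routes give an estimate of about 0.636, the observed proportion of chocolate chip cookies.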
Multiple working hypotheses Wading bird group foraging, with each hypothesis written as a statistical model • H1: No effect: Foraging rate (FR) = b0 + e • H2: Group effect same for all species: FR = b0 + b1 * Group + e • H3: Group effect differs by species • H4: (Group by species) + prey density • H5: Group + prey density • H6: (Group by species) + prey + habitat
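To make the step from hypothesis to model concrete, here is a minimal sketch that fits H1 and H2 as linear models. The data are simulated: the group indicator, effect size, and noise level are hypothetical and stand in for real foraging observations. Each fit reports its residual sum of squares and its maximized Gaussian log-likelihood.

```python
import numpy as np

# Simulated stand-in data (hypothetical, not the wading bird data)
rng = np.random.default_rng(1)
n = 100
group = rng.integers(0, 2, size=n)    # 1 = foraging in a group, 0 = alone
fr = 3.0 + 1.5 * group + rng.normal(scale=1.0, size=n)   # foraging rate

def fit(X, y):
    """Least-squares fit; returns coefficients, RSS, and maximized log-likelihood."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ b) ** 2))
    sigma2 = rss / len(y)             # ML estimate of the error variance
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return b, rss, loglik

X_h1 = np.ones((n, 1))                        # H1: FR = b0 + e
X_h2 = np.column_stack([np.ones(n), group])   # H2: FR = b0 + b1 * Group + e

b_h1, rss_h1, ll_h1 = fit(X_h1, fr)
b_h2, rss_h2, ll_h2 = fit(X_h2, fr)

print(f"H1: b0 = {b_h1[0]:.2f}, log-likelihood = {ll_h1:.1f}")
print(f"H2: b0 = {b_h2[0]:.2f}, b1 = {b_h2[1]:.2f}, log-likelihood = {ll_h2:.1f}")
```

Because H1 is nested within H2, the larger model can never fit worse by log-likelihood alone, so raw fit cannot be used on its own to choose among hypotheses.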
Approaches to science [Figure: Observational Study and Experimental Study placed along an axis labeled Strength of Inference]
Experimental study What is the effect of a particular treatment (or series of treatments) on a particular aspect of the system
Experimental study Treatments: A, B, C, D Replicates: 1, 2, 3, …, n • A: 1, 4, 5, 38, 62, 99 • B: 10, 15, 41, 44, 88 • C: 7, 22, 21, 54, 67, 81 • D: 6, 29, 33, 61, 77, 79 • control: 11, 12, 69, 74, 91, 92
Experimental study Treatments: A, B, C, D Replicates: 1, 2, 3, …, n Randomization • A: 1, 4, 5, 38, 62, 99 • B: 10, 15, 41, 44, 88 • C: 7, 22, 21, 54, 67, 81 • D: 6, 29, 33, 61, 77, 79 • control: 11, 12, 69, 74, 91, 92
Observational study Treatments: A, B, C, D Replicates: 1, 2, 3, …, n Bias • A: 1, 4, 5, 38, 62, 99 • B: 10, 15, 41, 44, 88 • C: 7, 22, 21, 54, 67, 81 • D: 6, 29, 33, 61, 77, 79 • control: 11, 12, 69, 74, 91, 92
Approaches to science [Figure: Observational Study, Confirmatory Study, and Experimental Study placed along an axis labeled Strength of Inference]
Confirmatory study • Make predictions a priori • Design collection of observational data including as much replication and control as possible • Weakness is still lack of randomization (not assigning treatment)
Summary so far • Science is a process to postulate and refine reliable descriptions (explanations) of reality • The method of multiple working hypotheses is a particularly useful science tool • Mathematics is the language of science • Experiments are golden, confirmatory studies are helpful
Next… Statistical model selection theory Information-theoretic tools R Model selection in practice Multi-model inference
Precision-Bias Trade-off Y = b0 + b1X1 + b2X2 + e [Figure: Bias² and variance plotted against Model Complexity, i.e., increasing number of parameters]
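The trade-off can be illustrated with a small simulation. This sketch swaps in polynomial regression for the slide's two-predictor model so that complexity can be varied with a single number (the polynomial degree). The "true" curve, noise level, and number of replicate data sets are all assumptions of the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 30)
truth = np.sin(2 * np.pi * x)     # assumed true relationship
sigma = 0.3                       # assumed error standard deviation
n_rep = 200                       # number of simulated data sets
degrees = [1, 3, 9]               # model complexity (parameters = degree + 1)

bias2 = {}
variance = {}
for d in degrees:
    preds = np.empty((n_rep, x.size))
    for r in range(n_rep):
        # New noisy data set from the same truth, refit each time
        y = truth + rng.normal(scale=sigma, size=x.size)
        preds[r] = Polynomial.fit(x, y, d)(x)
    mean_pred = preds.mean(axis=0)
    # Squared bias: how far the average fit is from the truth
    bias2[d] = float(np.mean((mean_pred - truth) ** 2))
    # Variance: how much individual fits scatter around their average
    variance[d] = float(np.mean(preds.var(axis=0)))

for d in degrees:
    print(f"degree {d}: bias^2 = {bias2[d]:.4f}, variance = {variance[d]:.4f}")
```

In this setup the simplest model has the largest squared bias and the most complex model has the largest variance, matching the two opposing curves on the slide.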
Kullback-Leibler information • Solomon Kullback (1907-1994) • Richard A. Leibler (1914-2003) Kullback, S., and R. A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22:79-86
Kullback-Leibler information divergence [Figure: Full Truth compared with candidate models G1 (best model in set), G2, and G3]
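A minimal sketch of the quantity in the figure, using discrete distributions. The "truth" and the three candidate models G1, G2, G3 are invented numbers for illustration. In real problems full truth is unknown, so this direct calculation is only possible in a toy setting.

```python
import math

def kl_divergence(f, g):
    """Kullback-Leibler divergence from truth f to model g (discrete case)."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

# Hypothetical "full truth" over three outcomes
truth = [0.5, 0.3, 0.2]

# Hypothetical candidate models approximating the truth
models = {
    "G1": [0.45, 0.35, 0.20],
    "G2": [0.60, 0.20, 0.20],
    "G3": [1 / 3, 1 / 3, 1 / 3],
}

kl = {name: kl_divergence(truth, g) for name, g in models.items()}
best = min(kl, key=kl.get)    # model losing the least information

for name, value in kl.items():
    print(f"{name}: KL = {value:.4f}")
print("best model in set:", best)
```

The divergence is zero only when a model matches the truth exactly, and the model with the smallest value is the one that loses the least information when used in place of the truth.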