Practical Model Selection and Multi-model Inference using R

Presented by: Eric Stolen and Dan Hunt

Foundation: Theory, hypotheses, and models

Theory • This is the link with science, which is about understanding how the world works

Theory • “A set of propositions set out as an explanation.” • “Theories are generalizations.” • “Theories contain questions.” • “Theories continually change…” (Ford, E. D. 2000. Scientific Method for Ecological Research. Cambridge University Press.)

Theory • Example 1 – Wading bird foraging: • Ideal Free Distribution • Marginal Value Theorem • Scramble Competition

Theory • Example 2 – Indigo Snake Habitat selection • Animal perception • Evolutionary Biology • Population Demography

Hypotheses • Many views – confusing! • A hypothesis is a statement derived from scientific theory that postulates something about how the world works • A testable hypothesis is a hypothesis that can be falsified by a contradiction between a prediction derived from the hypothesis and data measured in the appropriate way

Hypotheses • To use the Information-theoretic toolbox, we must be able to state a hypothesis as a statistical model (or more precisely an equation which allows us to calculate the maximum likelihood of the hypothesis)

Multiple Working Hypotheses • We operate with a set of multiple alternative hypotheses (models) • The many advantages include safeguarding objectivity and allowing rigorous inference • Chamberlain (1890) • Strong Inference – Platt (1964) • Karl Popper (ca. 1960) – Bold Conjectures

Deriving the model set • This is the tough part (but also the creative part) • much thought needed, so don’t rush • collaborate, seek outside advice, read the literature, go to meetings… • How and When hypotheses are better than What hypotheses (strive to predict rather than describe)

Models – Indigo Snake example • Study of indigo snake habitat use • Response variable: home range size, ln(ha) • SEX • Land cover – 2-3 levels (lc2) • weeks = effort/exposure • Science question: “Is there a seasonal difference in habitat use between sexes?”

Models – Indigo Snake example (candidate model set; lc2 = land cover type) • SEX • lc2 • weeks • SEX + lc2 • SEX + weeks • lc2 + weeks • SEX + lc2 + weeks • SEX + lc2 + SEX * lc2 • SEX + lc2 + weeks + SEX * lc2
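One way to hold this candidate set in R is as a named list of formulas, written down before any model is fit. This is only a sketch: the response name `lnHR` and the list name `cand` are hypothetical, and `SEX`, `lc2`, and `weeks` are assumed to be columns of whatever data frame is used later.

```r
# Candidate model set for the indigo snake example, written as R formulas.
# 'lnHR' (log home range size) is a hypothetical column name.
cand <- list(
  m1 = lnHR ~ SEX,
  m2 = lnHR ~ lc2,
  m3 = lnHR ~ weeks,
  m4 = lnHR ~ SEX + lc2,
  m5 = lnHR ~ SEX + weeks,
  m6 = lnHR ~ lc2 + weeks,
  m7 = lnHR ~ SEX + lc2 + weeks,
  m8 = lnHR ~ SEX + lc2 + SEX:lc2,
  m9 = lnHR ~ SEX + lc2 + weeks + SEX:lc2
)

# Terms in each model, to check the set against the a priori list
term_list <- lapply(cand, function(f) attr(terms(f), "term.labels"))
n_terms   <- sapply(term_list, length)
print(n_terms)
```

Keeping the set in one object makes it harder to add models quietly after seeing the results.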

Models – fish habitat use example • Study of fish habitat use in salt marsh • Response variable was density, ln(fish m⁻² + 1) • Habitat – vegetated or unvegetated • Site – 7 impoundments • Season – 4 seasons • Science questions: • “Is there evidence for a difference in density between habitats?” • “Is there a seasonal difference in habitat use by resident marsh fish?”

Models – fish habitat use example (candidate model set) • Site + Season + Habitat + Site*Habitat + Season*Habitat + Site*Season • Site + Season + Habitat + Site*Habitat + Season*Habitat • Site + Season + Habitat + Site*Season + Site*Habitat • Site + Season + Habitat + Site*Season + Season*Habitat • Site + Season + Habitat + Site*Habitat • Site + Habitat + Site*Habitat • Site + Season + Habitat + Season*Habitat • Season + Habitat + Season*Habitat • Site + Season + Habitat + Site*Season • Site + Season + Site*Season • Site + Season + Habitat • Site + Season • Site + Habitat • Season + Habitat • Site • Season • Habitat
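A note on notation when typing this set into R: in an R formula, `A*B` already expands to `A + B + A:B`, so the interaction on its own is written `A:B`. The sketch below checks that the notation used above and the explicit form give the same terms. The response name `lnDens` is hypothetical.

```r
# In R formulas, A*B is shorthand for A + B + A:B.
f_slide    <- lnDens ~ Site + Habitat + Site*Habitat   # notation as listed above
f_explicit <- lnDens ~ Site + Habitat + Site:Habitat   # interaction written explicitly

labels_slide    <- attr(terms(f_slide), "term.labels")
labels_explicit <- attr(terms(f_explicit), "term.labels")

print(labels_slide)
print(identical(labels_slide, labels_explicit))
```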

The importance of a priori thinking…You can’t go back home!

Modeling • Trade-off between precision and bias • Trying to derive knowledge / advance learning; not “fit the data” • Relationship between data (quantity and quality) and sophistication of the model

Precision-Bias Trade-off • [Figure: curves labeled Bias² and variance plotted against model complexity (increasing number of parameters)]

Kullback-Leibler Information • Basic concept from Information theory • The information lost when a model is used to represent full reality • Can also think of it as the distance between a model and full reality

Kullback-Leibler Information • [Figure: truth / reality and the candidate models G1 (best model in set), G2, G3] • The relative difference between models is constant
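A toy illustration of why only relative distances matter, using the standard discrete form of K-L information, I(f, g) = Σ f log(f / g). The three probability vectors are made-up numbers, not data from the talk. In practice truth `f` is unknown, which is exactly why the cancelling term matters.

```r
# Kullback-Leibler information for discrete distributions:
# I(f, g) = sum(f * log(f / g)), with f = "truth" and g = approximating model.
kl <- function(f, g) sum(f * log(f / g))

f  <- c(0.5, 0.3, 0.2)   # toy "truth" (hypothetical numbers)
g1 <- c(0.4, 0.4, 0.2)   # toy model 1
g2 <- c(0.2, 0.3, 0.5)   # toy model 2

kl1 <- kl(f, g1)
kl2 <- kl(f, g2)

# The term sum(f * log(f)) is common to both distances and cancels,
# so the difference only needs the expectations of log(g) under f.
diff_direct   <- kl2 - kl1
diff_relative <- sum(f * log(g1)) - sum(f * log(g2))

print(c(kl1 = kl1, kl2 = kl2, difference = diff_direct))
```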

Akaike’s Contributions • Figured out how to estimate the relative Kullback-Leibler (K-L) distance between models in a set of models • Figured out how to link maximum likelihood estimation theory with expected K-L information • An (Akaike’s) Information Criterion • $\mathrm{AIC} = -2\log_e\big(\mathcal{L}(\text{model}_i \mid \text{data})\big) + 2K$
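The AIC formula can be checked by hand against R's built-in `AIC()`. A minimal sketch, using the `mtcars` data that ships with R purely for illustration (it is not the talk's data):

```r
# AIC = -2 * log-likelihood + 2 * K, checked against stats::AIC().
fit <- lm(mpg ~ wt, data = mtcars)   # illustrative model only

ll <- logLik(fit)          # maximized log-likelihood
K  <- attr(ll, "df")       # parameters counted by R: 2 coefficients + sigma
aic_by_hand <- -2 * as.numeric(ll) + 2 * K

print(c(by_hand = aic_by_hand, builtin = AIC(fit)))
```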

I-T mechanics • $\mathrm{AICc}_i = -2\log_e\big(\mathcal{L}(\text{model}_i \mid \text{data})\big) + 2K\,\frac{n}{n-K-1}$ • or, equivalently, $\mathrm{AICc}_i = \mathrm{AIC}_i + \frac{2K(K+1)}{n-K-1}$ • (where K = the number of parameters estimated and n = the sample size)
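Base R has no AICc function, so a small helper is one option. The sketch below writes both forms of the formula and confirms they agree. The function names `aicc_form1` and `aicc_form2` are made up for this example, and `mtcars` is used only as convenient built-in data.

```r
# Small-sample AICc, written in the two equivalent forms given above.
aicc_form1 <- function(loglik, K, n) -2 * loglik + 2 * K * (n / (n - K - 1))
aicc_form2 <- function(loglik, K, n) {
  aic <- -2 * loglik + 2 * K
  aic + 2 * K * (K + 1) / (n - K - 1)
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
ll  <- as.numeric(logLik(fit))
K   <- attr(logLik(fit), "df")   # 3 coefficients + sigma
n   <- nobs(fit)                 # number of observations used in the fit

print(c(AIC = AIC(fit), AICc = aicc_form1(ll, K, n)))
```

The correction term grows as K approaches n, so AICc is always at least as large as AIC.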

I-T mechanics • $\mathrm{AICc}_{\min}$ = AICc for the model with the lowest AICc value • $\Delta_i = \mathrm{AICc}_i - \mathrm{AICc}_{\min}$

I-T mechanics • Model probabilities: $w_i = \mathrm{Prob}\{g_i \mid \text{data}\}$ • Evidence ratio of model i to model j $= w_i / w_j$
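The differences, model probabilities, and evidence ratios take only a few lines of R. The formula for the weights is not shown in the text above; this sketch uses the standard Akaike weight, $w_i = \exp(-\Delta_i/2) / \sum_r \exp(-\Delta_r/2)$. The AICc values are hypothetical numbers chosen for illustration.

```r
# Hypothetical AICc values for four candidate models (not from the talk)
aicc <- c(m1 = 102.3, m2 = 100.1, m3 = 105.8, m4 = 100.9)

delta <- aicc - min(aicc)                         # Delta_i = AICc_i - AICc_min
w     <- exp(-delta / 2) / sum(exp(-delta / 2))   # model probabilities, sum to 1

# Evidence ratio of the best model (m2) to the second-best (m4)
er_m2_m4 <- w[["m2"]] / w[["m4"]]

tab <- data.frame(AICc = aicc, delta = delta, weight = round(w, 3))
print(tab[order(tab$delta), ])
print(er_m2_m4)
```

Because the weights share a denominator, an evidence ratio depends only on the difference in AICc between the two models being compared.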

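I-T mechanics • Least Squares Regression • $\mathrm{AICc} = n\log_e(\hat{\sigma}^2) + 2K\,\frac{n}{n-K-1}$ • where $\hat{\sigma}^2 = \mathrm{RSS}/n$ • This form drops the constant part of the log-likelihood, $n\log_e(2\pi) + n$, so it is offset by that constant from a value computed from the full likelihood; the offset is the same for every model fit to the same data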
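The offset can be seen directly by comparing the least-squares shortcut with R's `AIC()`, which uses the full Gaussian log-likelihood. A sketch on the built-in `mtcars` data, chosen only for convenience; the uncorrected 2K penalty is used on both sides so that only the constant differs.

```r
# Least-squares shortcut: n * log(RSS / n) + 2K differs from AIC() by a constant.
fit <- lm(mpg ~ wt, data = mtcars)
n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
K   <- length(coef(fit)) + 1   # intercept + slope + residual variance

aic_ls   <- n * log(rss / n) + 2 * K
aic_full <- AIC(fit)           # based on the full Gaussian log-likelihood

# Constant dropped by the shortcut: n * log(2 * pi) + n
offset <- aic_full - aic_ls
print(c(aic_ls = aic_ls, aic_full = aic_full, offset = offset))
```

Mixing the two conventions within one model set would shift some models by the constant and not others, so it is safest to compute every model's value the same way.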

I-T mechanics • Counting Parameters: K = number of parameters estimated • Least Squares Regression: K = number of predictor coefficients + 2 (for the intercept and σ)

I-T mechanics • Counting Parameters: K = number of parameters estimated • Logistic Regression: K = number of predictor coefficients + 1 (for the intercept)

I-T mechanics Counting Parameters: Non-identifiable parameters
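R reports its own parameter count as the `df` attribute of `logLik()`, which is a handy cross-check on a hand count. The sketch below uses `mtcars` for illustration and builds an artificial non-identifiable parameter; the column `wt_kg` is invented for this example.

```r
# K as counted by R for a least-squares fit and a logistic fit
fit_ls  <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- glm(am ~ wt, data = mtcars, family = binomial)

K_ls  <- attr(logLik(fit_ls), "df")    # 2 slopes + intercept + sigma
K_log <- attr(logLik(fit_log), "df")   # 1 slope + intercept

# A non-identifiable parameter: wt_kg is an exact multiple of wt, so its
# coefficient cannot be estimated (reported as NA) and is not counted in K.
d <- transform(mtcars, wt_kg = wt * 453.6)
fit_alias <- lm(mpg ~ wt + wt_kg, data = d)
K_alias   <- attr(logLik(fit_alias), "df")   # intercept + 1 slope + sigma

print(coef(fit_alias))
print(c(K_ls = K_ls, K_log = K_log, K_alias = K_alias))
```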

Comparing Models • [Model selection tables from the slides are not recoverable; the annotations that accompanied them follow] • Combined model weight = 0.995 • Evidence Ratio = 4.52 • Evidence Ratio = 3.03 • Evidence Ratio = 4.28, from summed model weights: (.34+.22+.14+.08) / (.11+.04+.02+.01)
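An evidence ratio for two groups of models is the ratio of their summed weights. The sketch below repeats that arithmetic with the rounded weights shown; the variable names are made up. With these rounded values the ratio comes out near 4.33 rather than 4.28, so the reported figure was presumably computed from unrounded weights (an assumption; the underlying table is not available).

```r
# Evidence ratio for two groups of models, using the rounded weights shown
w_group_a <- c(0.34, 0.22, 0.14, 0.08)
w_group_b <- c(0.11, 0.04, 0.02, 0.01)

er_groups <- sum(w_group_a) / sum(w_group_b)
print(round(er_groups, 2))
```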

Generalized Linear Models

Mathematical details • Three parts to a GLM • Link function • linear equation • error distribution

Mathematical details • General Linear Models – linear regression and ANOVA • Link function – Identity link • linear equation • error distribution – Normal Distribution (Gaussian) • $Y = b_0 + b_1 X_1 + b_2 X_2 + e$

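Mathematical details • Logistic Regression • Link function – Logit link: ln(p / (1-p)) • linear equation • error distribution – Binomial Distribution • $\mathrm{logit}(p) = b_0 + b_1 X_1 + b_2 X_2$ (no additive error term on the logit scale; the random variation enters through the binomial distribution of the response)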
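In R the three parts of a GLM map onto the arguments of `glm()`: the formula is the linear equation, and `family` sets the error distribution and link. A minimal sketch on the built-in `mtcars` data, used only for illustration:

```r
# General linear model: identity link, Gaussian errors. lm() and glm() agree.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian(link = "identity"))

# Logistic regression: logit link, binomial errors.
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# The linear equation lives on the link scale; plogis() inverts the logit.
eta <- predict(fit_logit, type = "link")       # b0 + b1 * wt
p   <- predict(fit_logit, type = "response")   # fitted probabilities

print(all.equal(coef(fit_lm), coef(fit_glm)))
print(range(p))
```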

Mathematical details • What types of models can be compared within a single I-T analysis? • The data must be fixed: every model is fit to exactly the same observations and the same response variable • Must be able to calculate maximum likelihood • (there are ways to deal with quasi-likelihood) • Models do not need to be nested • In some cases AIC is additive
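The "fixed data" requirement is easy to break by accident in R, because model-fitting functions silently drop rows with missing values in the variables each model uses. A sketch of the problem and one possible fix, using the built-in `airquality` data for illustration:

```r
# airquality ships with R and has missing values in Ozone and Solar.R.
m1 <- lm(Ozone ~ Temp, data = airquality)
m2 <- lm(Ozone ~ Temp + Solar.R, data = airquality)
same_data_raw <- nobs(m1) == nobs(m2)   # rows with NA are dropped per model

# Fix the data first: keep only rows complete for every variable in the set.
vars <- c("Ozone", "Temp", "Solar.R")
aq   <- airquality[complete.cases(airquality[, vars]), vars]
m1b  <- lm(Ozone ~ Temp, data = aq)
m2b  <- lm(Ozone ~ Temp + Solar.R, data = aq)
same_data_fixed <- nobs(m1b) == nobs(m2b)

print(c(raw = same_data_raw, fixed = same_data_fixed))
```

Checking `nobs()` across the whole model set before building an AIC table is a cheap safeguard.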

Model Fitting Preliminaries • Understanding the data/variables • Avoid data dredging! • safe data screening practices • Detect outliers, scale issues, collinearity • Tools in R
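A few base R tools cover the screening steps listed above. This is a sketch on the built-in `mtcars` data, not the talk's data, and the choice of variables is arbitrary. Screening of this kind looks at the variables themselves, not at how candidate models perform.

```r
# Illustrative screening of the built-in mtcars data
vars <- mtcars[, c("mpg", "wt", "disp", "hp")]

print(summary(vars))                          # ranges and scale of each variable
cors <- cor(vars[, c("wt", "disp", "hp")])    # pairwise correlation among predictors
out  <- boxplot.stats(vars$hp)$out            # values beyond the boxplot whiskers

print(round(cors, 2))
print(out)
# pairs(vars) would draw a scatterplot matrix for a visual check
```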
