
Bayesian Statistics



Presentation Transcript


  1. Bayesian Statistics Lecture 8 Likelihood Methods in Forest Ecology October 9th – 20th , 2006

  2. “Real knowledge is to know the extent of one’s ignorance” -Confucius

  3. How do we measure our knowledge (ignorance)? • Scientific point of view: Knowledge is acceptable if it explains a body of natural phenomena (Scientific model). • Statistical point of view: Knowledge is uncertain but we can use it if we can measure its uncertainty. The question is how to measure uncertainty and make use of available knowledge.

  4. Limitations of likelihoodist & frequentist approaches • Parsimony is often an insufficient criterion for inference particularly if our objective is forecasting. • Model selection uncertainty is the big elephant in the room. • Since parameters do not have probability distributions, error propagation in models cannot be interpreted in a probabilistic manner. • Cannot deal with multiple sources of error and complex error structures in an efficient way. • New data require new analyses.

  5. Standard statistics revisited: Complex Variance Structures

  6. Inference Addresses three basic questions: • What do I believe now that I have these data? [Credibility or confidence] • What should I do now that I have these data? [Decision] • How should I interpret these data as evidence of one hypothesis vs. other competing hypotheses? [Evidence]

  7. (Flow diagram: Body of knowledge, Scientific Hypothesis, Scientific Model, DATA, Statistical Model, Statistical Hypothesis.)

  8. An example. Body of knowledge = fruit production in trees; Scientific Hypothesis: yi = DBH^b; Scientific Explanation = physiology, life history; DATA; Statistical Model = Poisson dist.; Statistical Hypothesis: b = value, Pred(y).

  9. Body of knowledge = fruit production in trees; Scientific Hypothesis: log yi = b log(DBH); Scientific Explanation = physiology; DATA; Statistical Model = Poisson dist.; Statistical Hypothesis: b ≠ 0. The Frequentist Take: b = 0.4. Belief: only with reference to an infinite series of trials. Decision: accept or reject that b = 0. Evidence: none.

  10. Body of knowledge = fruit production in trees; Scientific Hypothesis: log yi = b log(DBH); Scientific Explanation = physiology; DATA; Statistical Model = Poisson dist.; Statistical Hypothesis: b ≠ 0. The Likelihoodist Take: b = 0.4. Belief: none, only relevant to the data at hand. Decision: only with reference to alternate models. Evidence: Likelihood Ratio Test or AIC.

  11. Body of knowledge = fruit production in trees; Scientific Hypothesis: log yi = b log(DBH); Scientific Explanation = physiology; DATA; Statistical Model = Poisson dist.; Statistical Hypothesis: b ≠ 0. The Bayesian Take: b = 0.4. Belief: credible intervals. Decision: parameter in or out of a distribution. Evidence: none.

  12. Parallels and differences in Bayesian & Frequentist statistics • Bayesian and frequentist approaches use the data to derive a parameter estimate and a measure of uncertainty around the parameter that can be interpreted using probability. • In Bayesian inference, parameters are treated as random variables that have a distribution. • If we know their distribution, we can assess the probability that they will take on a particular value (posterior ratios or credible intervals).

  13. Evidence vs Probability. “As a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency, of such evidence.” Fisher 1959

  14. Frequentist vs Bayesian (source: “All of Statistics”, Larry Wasserman). Frequentist: probability = objective relative frequencies; parameters are fixed unknown constants, so we cannot write e.g. P(θ = 0.5|D); estimators should be good when averaged across many trials. Bayesian: probability = degrees of belief (uncertainty); we can write P(anything|D); estimators should be good for the available data.

  15. Frequentism • Probability is only defined as a long-term average in an infinite sequence of trials (that typically never happen!). • The p-value is the probability of an outcome at least as extreme as the one observed, given a specified null hypothesis. • Null hypotheses are often straw men set up to be rejected. • Improperly used, p-values are poor tools for statistical inference. • We are interested in parameter estimation rather than p-values per se.

  16. Frequentist statistics violates the likelihood principle “The use of p-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” Jeffreys, 1961

  17. Some rules of probability, assuming independence. (Venn diagram: events A and B.)

  18. Bayes Theorem

  19. Bayes Theorem

  20. Bayes Theorem ?
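The equation images on slides 18–20 did not survive the transcript. The standard statement of Bayes' theorem that these slides build up to, written here for two events A and B (matching the notation of slide 17), is:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)}
```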

  21. For a set of mutually exclusive hypotheses…..
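The formula for this case is also missing from the transcript. For mutually exclusive and exhaustive hypotheses H_1, …, H_k, Bayes' theorem is conventionally written with the law of total probability in the denominator:

```latex
P(H_i \mid D) \;=\; \frac{P(D \mid H_i)\, P(H_i)}{\sum_{j=1}^{k} P(D \mid H_j)\, P(H_j)}
```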

  22. (Figure source: Bolker.)

  23. An example from medical testing

  24. An example from medical testing

  25. (Diagram: probabilities of “ill” and “not ill” given a positive test, Test +.)
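The numbers from the original medical-testing example are not preserved in the transcript. The short sketch below uses assumed (hypothetical) values for prevalence, sensitivity, and the false-positive rate to show how Bayes' theorem converts P(test+ | ill) into P(ill | test+):

```python
# Hypothetical values -- the original slide's numbers are not in the transcript.
prevalence = 0.01        # P(ill)
sensitivity = 0.95       # P(test+ | ill)
false_positive = 0.05    # P(test+ | not ill)

# Marginal probability of a positive test (law of total probability)
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes' theorem: probability of being ill given a positive test
p_ill_given_positive = sensitivity * prevalence / p_positive
print(f"P(ill | test+) = {p_ill_given_positive:.3f}")   # about 0.161 with these values
```

Even with a fairly accurate test, the posterior probability of illness is modest because the prior (the prevalence) is low, which is the usual point of this example.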

  26. Bayes Theorem: the normalizing denominator is rarely known in closed form and is a hard-to-integrate function, which is why we turn to MCMC methods.

  27. Joint and marginal distributions: probability that 2 pigeon species (S & R) occupy an island (Diamond 1975).

  28. Conjugacy • In Bayesian probability theory, a conjugate prior is a family of prior probability distributions which has the property that the posterior probability distribution also belongs to that family. • A conjugate prior is an algebraic convenience: otherwise a difficult numerical integration may be necessary.
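As a hedged illustration of conjugacy (not necessarily the example used in the original lecture): a Beta prior combined with a binomial likelihood gives a Beta posterior, so all the algebra stays inside one family:

```latex
\theta \sim \mathrm{Beta}(a, b), \qquad
y \mid \theta \sim \mathrm{Binomial}(n, \theta)
\;\;\Longrightarrow\;\;
\theta \mid y \sim \mathrm{Beta}(a + y,\; b + n - y)
```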

  29. Jointly distributed random variables We have to normalize this to turn it into a probability

  30. Hierarchical Bayes • Ecological models tend to be high-dimensional and include many sources of stochasticity. • These sources of “noise” often don’t comply with the assumptions of traditional statistics: independence (spatial or temporal), balance among groups, distributional assumptions. • HB can deal with these problems by partitioning a complex problem into a series of univariate distributions that we can solve, typically using sophisticated computational methods.

  31. Hierarchical Bayes

  32. Clark et al. 2004

  33. Hierarchical Bayes • Marginal distribution of a parameter averaged over all other parameters and hyperparameters:
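The equation this bullet introduces is missing from the transcript. The marginal distribution of one parameter, averaged (integrated) over all other parameters and hyperparameters, is conventionally written as:

```latex
p(\theta_1 \mid y) \;=\; \int \!\cdots\! \int p(\theta_1, \theta_2, \ldots, \theta_n \mid y)\; d\theta_2 \cdots d\theta_n
```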

  34. Hierarchical Bayes: Advantages • Complex models can be constructed from simple, conditional relationships. We don’t need an integrated specification of the problem, only the conditional components. We are drawing boxes and arrows. • We relax the traditional requirement for independent data. Conditional independence is enough. We typically take up the relationships that cause correlation at a lower process stage. • We can accommodate multiple data types within a single analysis, even treating model output as ‘data’. More on this later. • Sampling-based approaches (MCMC) can do the integration for us (the thing we avoided in advantage 1).

  35. Why Hierarchical Bayes? A useful approach for understanding ecological processes because: • It incorporates uncertainty using a probabilistic framework • Model parameters are random variables – the output is a probability distribution (the posterior distribution) • Complex models are partitioned into a hierarchical structure • It performs well for high-dimensional models (i.e., many parameters) with little data

  36. Bayes’ Rule: p(θ|y) = p(θ) p(y|θ) / p(y), where θ is the set of model parameters and y is the observed data. p(θ|y) is the posterior distribution, p(θ) is the prior distribution, p(y|θ) is the likelihood, and p(y) = ∫ p(θ) p(y|θ) dθ is the normalizing density (the marginal distribution of y, or prior predictive distribution). • The posterior distribution is affected by the data only through the likelihood function. • If the prior distribution is non-informative, then the data dominate the outcome.

  37. How do we do this? Baby steps: rejection sampling • Suppose we have a target distribution p(x) that we wish to sample from.

  38. Bound the target distribution with a function f(x) so that C·f(x) >= p(x). • For a candidate draw X from f(x), calculate the ratio a = p(X) / (C·f(X)).

  39. (Figure: proposal distribution C·f(x) enveloping the target distribution p(x).) • With probability a, accept this value of X as a random draw from p(x); with probability 1 − a, reject it and repeat the procedure. To do this, draw a random variate z from the uniform density: if z < a, accept X.

  40. Build an empirical distribution of the accepted draws, which approximates the target distribution. (Figure: theoretical distribution overlaid on the smoothed empirical distribution.)
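A minimal sketch of the rejection-sampling recipe from slides 37–40, assuming a standard-normal target p(x) and a wider normal proposal f(x); both densities and the bounding constant C are illustrative choices, not taken from the original slides:

```python
import numpy as np

rng = np.random.default_rng(42)

def p(x):                      # target density: standard normal
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def f(x):                      # proposal density: normal with sd = 3 (easy to sample)
    return np.exp(-0.5 * (x / 3)**2) / (3 * np.sqrt(2 * np.pi))

C = 3.0                        # chosen so that C * f(x) >= p(x) for all x

samples = []
while len(samples) < 5000:
    x = rng.normal(0, 3)                 # candidate draw from the proposal f(x)
    a = p(x) / (C * f(x))                # acceptance ratio (slide 38)
    if rng.uniform() < a:                # accept with probability a (slide 39)
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.std())     # close to 0 and 1 for the standard-normal target
```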

  41. MCMC Methods • Markov process – a random process whose next step depends only on the prior realization (lag of 1) • The joint probability distribution (p(θ|y), which is the posterior distribution of the parameters) is generally impossible to integrate in closed form • So…use a simulation approach based on conditional probability • The goal is to sample from this joint distribution of all parameters in the model, given the data (the target distribution), in order to estimate the parameters, but… • …we don’t know what the target distribution looks like, so we have to make a proposal distribution

  42. Monte Carlo principle (figure: distribution p(x) over the set X) • Given a very large set X and a distribution p(x) over it • We draw an i.i.d. set of N samples • We can then approximate the distribution using these samples
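The approximation the slide refers to is usually written as an empirical average over the N i.i.d. draws x^(1), …, x^(N):

```latex
p_N(x) \;=\; \frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x),
\qquad
\mathbb{E}_{p}\!\left[g(x)\right] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} g\!\left(x^{(i)}\right)
```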

  43. Markov Chain Monte Carlo (MCMC) (figure: distribution p(x) over the set X) • Recall again the set X and the distribution p(x) we wish to sample from • Suppose that it is hard to sample p(x) but that it is possible to “walk around” in X using only local state transitions • Insight: we can use a “random walk” to help us draw random samples from p(x)

  44. MCMC Methods • Metropolis-Hastings algorithms are a way to construct a Markov chain in such a way that its equilibrium (or stationary) distribution is the target distribution. • Proposal is some kind of bounding distribution that completely contains the target distribution • Acceptance-rejection methods are used to decide whether a proposed value is accepted or rejected as being drawn from the target distribution • Jumping rules determine when and how the chain moves on to new proposal values

  45. MCMC • The basic rule is that the ratio of successful jump probabilities is proportional to the ratio of posterior probabilities. • This means that over the long term, we stay in areas with high probability and the long-term occupancy of the chain matches the posterior distribution of interest.
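A minimal random-walk Metropolis sketch of this jump rule, assuming a standard-normal target and a symmetric normal jumping distribution; the target, step size, and starting value are illustrative choices, not the lecture's own model:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):                     # log of the (unnormalized) target density
    return -0.5 * x**2                 # standard normal, up to a constant

n_iter, step = 10_000, 1.0
chain = np.empty(n_iter)
chain[0] = 5.0                         # deliberately poor starting value

for t in range(1, n_iter):
    proposal = chain[t - 1] + rng.normal(0, step)             # symmetric jump
    log_ratio = log_target(proposal) - log_target(chain[t - 1])
    if np.log(rng.uniform()) < log_ratio:                     # accept with prob min(1, ratio)
        chain[t] = proposal                                   # move to the proposed value
    else:
        chain[t] = chain[t - 1]                               # stay where we are

posterior_sample = chain[n_iter // 2:]   # discard the first half as burn-in (slides 46 and 49)
print(posterior_sample.mean(), posterior_sample.std())        # roughly 0 and 1
```

Because only the ratio of target densities enters the acceptance step, the unknown normalizing constant cancels, which is the point developed on the next slide.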

  46. MCMC Methods • Eventually, through many proposals that are updated iteratively (based on jumping rules), the Markov chain will converge to the target distribution, at which time it has reached equilibrium (or stationarity) • This is achieved after the so-called “burn-in” (“the chain converged”) • Simulations (proposals) made prior to reaching stationarity (i.e., during burn-in) are not used in estimating the target • Burning questions: When have you achieved stationarity, and how do you know??? (There are some diagnostics, but no objective answer, because the target distribution is not known.)

  47. More burning questions • How can you pick a proposal distribution when you don’t know what the target distribution is? (this is what M-H figured out!) • Series of proposals depends on a ratio involving the target distribution, which itself cancels out in the ratio • So you don’t need to know the target distribution in order to make a set of proposals that will eventually converge to the target • This is (vaguely) analogous in K-L information theory to not having to “know the truth” in order to estimate the difference between 2 models in their distance from the truth (truth drops out in the comparison)

  48. Posterior Distributions • Assuming the chain converged, you obtain an estimate for each parameter of its marginal distribution, p(θ1|θ2, θ3, …, θn, y). That is, the distribution of θ1, averaged over the distributions of all other parameters in the model and given the data • This marginal distribution is the posterior distribution that represents the probability distribution of this parameter, given the data and the other parameters in the model • These posterior distributions of the parameters are the basis of inference

  49. Assessing Convergence • Run multiple chains (chains are independent) • Many iterations (>2000) • First half are burn-in • “Thin” the chain (take every xth value; depends on auto-correlation) • Compare the traces of the chains (figure: trace plots for Chain 1, Chain 2, and Chain 3; not converged)
