Chapter 8: Model Inference and Averaging Presented by Hui Fang
Basic Concepts • Statistical inference • Using data to infer the distribution that generated the data • We observe data X1, …, Xn drawn from a distribution F. • We want to infer (or estimate or learn) F, or some feature of F such as its mean. • Statistical model • A set of distributions (or a set of densities) • Parametric model • Non-parametric model
Statistical Model (1) • Parametric model • A set of distributions that can be parameterized by a finite number of parameters • E.g., assume the data come from a normal distribution; the model is { N(μ, σ²) : μ ∈ R, σ > 0 } • In general, a parametric model takes the form { f(x; θ) : θ ∈ Θ }, where θ is an unknown parameter (or vector of parameters) and Θ is the parameter space
Statistical Model (2) • Non-parametric model • A set that cannot be parameterized by a finite number of parameters • E.g., assume the data come from the set of all distributions (all CDFs) • Cumulative distribution function (CDF), F(x) = Pr(X ≤ x) • Probability density function (PDF), f(x) = F′(x)
Outline • Model Inference • Maximum likelihood inference (8.2.2) • EM algorithm (8.5) • Bayesian inference (8.3) • Gibbs sampling (8.6) • Bootstrap (8.2.1, 8.2.3, 8.4) • Model Averaging and Improvement • Bagging (8.7) • Bumping (8.9)
Parametric Inference • Parametric models: { f(x; θ) : θ ∈ Θ } • The problem of inference reduces to the problem of estimating the parameter θ • Methods • Maximum Likelihood Inference • Bayesian Inference
An Example of MLE • Suppose you have data z1, …, zN drawn from a normal distribution N(μ, σ²), but you don't know μ or σ². • MLE: for which values of (μ, σ²) is the observed data most likely?
A General MLE Strategy • Suppose θ = (θ1, …, θk) is a vector of parameters. Task: find the MLE of θ. • Likelihood function: L(θ; Z) = ∏i g(zi; θ) • Maximum likelihood estimator: the value of θ that maximizes the likelihood function • 1. Write the log-likelihood function ℓ(θ; Z) = Σi log g(zi; θ) • 2. Work out ∂ℓ/∂θ using high-school calculus • 3. Solve the set of simultaneous equations ∂ℓ/∂θ = 0 • 4. Check you are at a maximum
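As a concrete, minimal illustration of this strategy (not from the slides), the sketch below maximizes a normal log-likelihood numerically with NumPy/SciPy; the data-generating values, sample size, and starting point are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: N draws from a normal with unknown mean/variance.
rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_log_likelihood(params, data):
    """Negative log-likelihood of N(mu, sigma^2); sigma passed on log scale."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)          # keeps sigma > 0 during optimization
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu)**2 / (2 * sigma**2))

# Steps 1-4 of the strategy, done numerically instead of by hand.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(z,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)               # close to the sample mean and std
```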
Properties of MLE • The sampling distribution of the maximum likelihood estimator has a limiting normal distribution (p. 230): θ̂ → N(θ0, i(θ0)⁻¹), where θ0 is the true value of θ • Information matrix: I(θ) = −Σi ∂²ℓ(θ; zi)/∂θ ∂θᵀ • Fisher information: i(θ) = E[I(θ)]
An Example for EM Algorithm (1) • Model Y as a mixture of two normal distributions: Y = (1 − Δ)·Y1 + Δ·Y2, where Y1 ~ N(μ1, σ1²), Y2 ~ N(μ2, σ2²), and Δ ∈ {0, 1} with Pr(Δ = 1) = π • The parameters are θ = (π, μ1, σ1², μ2, σ2²) • The log-likelihood based on the N training cases is ℓ(θ; Z) = Σi log[(1 − π)·φθ1(yi) + π·φθ2(yi)], where φθ denotes the normal density • A sum of terms is inside the logarithm => difficult to maximize directly
An Example for EM Algorithm (2) • Consider unobserved latent variables Δi: if Δi = 1, yi comes from model 2; otherwise it comes from model 1. If we knew the values of the Δi, maximization would be easy. • 1. Take initial guesses for the parameters • 2. Expectation Step: compute the responsibilities γi = E(Δi | θ, Z) = Pr(Δi = 1 | θ, Z) • 3. Maximization Step: compute the values of the parameters that maximize the log-likelihood, given the current γi • 4. Iterate steps 2 and 3 until convergence.
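A minimal NumPy sketch of this EM recipe for the two-component mixture (not from the book; the synthetic data, initial guesses, and fixed number of iterations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data from a hypothetical two-component mixture.
y = np.concatenate([rng.normal(-1.0, 0.7, 120), rng.normal(3.0, 1.2, 80)])

def normal_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Step 1: initial guesses.
pi, mu1, s1, mu2, s2 = 0.5, y.min(), np.var(y), y.max(), np.var(y)

for _ in range(200):
    # Step 2 (E-step): responsibilities gamma_i = Pr(Delta_i = 1 | theta, Z).
    p1 = (1 - pi) * normal_pdf(y, mu1, s1)
    p2 = pi * normal_pdf(y, mu2, s2)
    gamma = p2 / (p1 + p2)

    # Step 3 (M-step): weighted means/variances and mixing proportion.
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = np.mean(gamma)

print(pi, mu1, np.sqrt(s1), mu2, np.sqrt(s2))
```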
Bayesian Inference • Prior (knowledge before we see the data): Pr(θ) • Sampling model: Pr(Z | θ) • After observing data Z, we update our beliefs and form the posterior distribution: Pr(θ | Z) = Pr(Z | θ)·Pr(θ) / ∫ Pr(Z | θ)·Pr(θ) dθ • Posterior is proportional to likelihood times prior: Pr(θ | Z) ∝ Pr(Z | θ)·Pr(θ) • Doesn't it cause a problem to throw away the normalizing constant? No: we can always recover it, since the posterior must integrate to 1.
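As a small worked illustration of "posterior ∝ likelihood × prior" (not from the slides), here is a conjugate normal-mean example in which the posterior has a closed form; the prior parameters, known variance, and data are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0                                              # assumed known sampling variance
z = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)   # hypothetical data

# Prior on the mean: theta ~ N(m0, tau0^2).
m0, tau0_sq = 0.0, 10.0

# Posterior is also normal (conjugacy): precision and mean update.
n = len(z)
post_var = 1.0 / (1.0 / tau0_sq + n / sigma2)
post_mean = post_var * (m0 / tau0_sq + n * z.mean() / sigma2)
print(post_mean, np.sqrt(post_var))
```

Because the prior and likelihood are both normal, the normalizing constant never needs to be computed explicitly: the posterior is known to be normal, so its mean and variance determine it completely.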
Prediction Using Inference • Task: predict the values of a future observation z_new • Bayesian approach: use the predictive distribution Pr(z_new | Z) = ∫ Pr(z_new | θ)·Pr(θ | Z) dθ, which accounts for the uncertainty in estimating θ • Maximum likelihood approach: plug in the MLE and use Pr(z_new | θ̂), which ignores that uncertainty
MCMC (1) • General problem: evaluating E[f(X)] = ∫ f(x)·p(x) dx can be difficult. • However, if we can draw samples X1, …, XT ~ p(x), then we can estimate E[f(X)] ≈ (1/T)·Σt f(Xt). • This is Monte Carlo (MC) integration.
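A tiny sketch of MC integration under an assumed target: estimating E[X²] for a standard normal, whose exact value is 1; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)     # samples from the target p(x) = N(0, 1)
estimate = np.mean(x ** 2)           # MC estimate of E[f(X)] with f(x) = x^2
print(estimate)                      # close to the exact value, 1.0
```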
MCMC (2) • A stochastic process is an indexed random variable X(t), where t may be time and X is a random variable. • A Markov chain is generated by sampling X(t+1) ~ p(x | X(t)). So X(t+1) depends only on X(t), not on X(0), X(1), …, X(t−1). • p is the transition kernel. • As t → ∞, the Markov chain converges to its stationary distribution.
MCMC (3) • Problem: how do we construct a Markov chain whose stationary distribution is our target distribution π(x)? This is called Markov chain Monte Carlo (MCMC). • Two key objectives: • Generate a sample from a joint probability distribution • Estimate expectations using sample averages (i.e., doing MC integration)
Gibbs Sampling (1) • Purpose: draw from a joint (target) distribution π(x1, …, xK) • Method: iterative conditional sampling, i.e., draw each component from its conditional distribution given the current values of all the other components
Gibbs Sampling (2) • Suppose that X = (X1, …, XK) has target distribution π(x1, …, xK) • Sample or update in turn: X1(t+1) ~ π(x1 | X2(t), …, XK(t)); X2(t+1) ~ π(x2 | X1(t+1), X3(t), …, XK(t)); …; XK(t+1) ~ π(xK | X1(t+1), …, XK−1(t+1)) • Always use the most recent values of the other components
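A minimal Gibbs-sampling sketch (not from the slides) for an assumed target, a standard bivariate normal with correlation rho, where both full conditionals are univariate normals with a known form:

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                    # assumed correlation of the bivariate normal target
T = 10_000
x = np.zeros((T, 2))

for t in range(1, T):
    # Update x1 from p(x1 | x2) = N(rho * x2, 1 - rho^2), using the latest x2.
    x[t, 0] = rng.normal(rho * x[t - 1, 1], np.sqrt(1 - rho ** 2))
    # Update x2 from p(x2 | x1), using the freshly drawn x1.
    x[t, 1] = rng.normal(rho * x[t, 0], np.sqrt(1 - rho ** 2))

# Sample correlation should be close to rho after burn-in.
print(np.corrcoef(x[1000:, 0], x[1000:, 1])[0, 1])
```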
An Example for Conditional Sampling • Target distribution: • How to draw samples?
Same Example as for EM (1) • Recall: model Y as a mixture of two normal distributions, Y = (1 − Δ)·Y1 + Δ·Y2, where Y1 ~ N(μ1, σ1²), Y2 ~ N(μ2, σ2²), and Pr(Δ = 1) = π • For simplicity, assume the variances σ1², σ2² and the mixing proportion π are known, so the parameters to be sampled are θ = (μ1, μ2)
Comparison Between EM and Gibbs Sampling
EM: • 1. Take initial guesses for the parameters • 2. Expectation Step: compute the responsibilities γi • 3. Maximization Step: compute the values of the parameters that maximize the log-likelihood, given the γi • 4. Iterate steps 2 and 3 until convergence
Gibbs: • 1. Take initial guesses for the parameters θ(0) = (μ1(0), μ2(0)) • 2. Repeat for t = 1, 2, …: for i = 1, 2, …, N generate Δi(t) ∈ {0, 1} with Pr(Δi(t) = 1) = γi(θ(t−1)); then generate μ1(t) and μ2(t) from their conditional distributions given the Δi(t) and the data • 3. Continue step 2 until the joint distribution of (Δ(t), μ1(t), μ2(t)) doesn't change
Bootstrap (0) • Basic idea: • Randomly draw datasets with replacement from the training data • Each bootstrap sample has the same size as the original training set • (Diagram: training sample → bootstrap samples)
Example for Bootstrap (1) • (Table: bioequivalence data, with a measurement Z and a measurement Y for each subject)
Example for Bootstrap (2) • We want to estimate the ratio θ = E(Y)/E(Z) • The plug-in estimator is θ̂ = ȳ / z̄ • What is the accuracy of this estimator?
Bootstrap (1) • The bootstrap was introduced as a general method for assessing the statistical accuracy of an estimator. • Data: Z = (z1, …, zN), drawn from a distribution F • Statistic (any function of the data): S(Z) • We want to know the sampling distribution of S(Z), e.g., its variance VarF(S(Z)) • Real world: F → Z → S(Z); Bootstrap world: F̂ → Z* → S(Z*) • VarF(S(Z)) can be estimated with VarF̂(S(Z*))
Bootstrap (2) --- Detour • Suppose we draw a sample Z = (z1, …, zN) from a distribution F.
Bootstrap (3) • Real world: F → Z = (z1, …, zN) → S(Z) • Bootstrap world: F̂ → Z* = (z1*, …, zN*) → S(Z*) • Bootstrap variance estimation: • 1. Draw a bootstrap sample Z* by sampling N points with replacement from Z • 2. Compute S(Z*) • 3. Repeat steps 1 and 2, B times, to get S(Z*1), …, S(Z*B) • 4. Let the bootstrap variance estimate be (1/(B − 1))·Σb [S(Z*b) − S̄*]², where S̄* = (1/B)·Σb S(Z*b)
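A minimal nonparametric-bootstrap sketch of this variance recipe; the statistic (the median), the synthetic data, and B are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.exponential(scale=2.0, size=60)    # hypothetical training sample
statistic = np.median                      # S(Z): any function of the data

B = 2000
s_star = np.empty(B)
for b in range(B):
    # Draw Z* by sampling N points with replacement from the original data.
    z_star = rng.choice(z, size=len(z), replace=True)
    s_star[b] = statistic(z_star)

# Bootstrap estimates of the variance / standard error of S(Z).
print(s_star.var(ddof=1), s_star.std(ddof=1))
```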
Bootstrap (4) • Non-parametric bootstrap • Uses the raw data, not a specific parametric model, to generate new datasets • Parametric bootstrap • Simulates new responses by adding Gaussian noise to the predicted values • Example from the book: fit μ̂(x) to the training data and estimate the noise variance σ̂² • We simulate new (x, y) pairs by keeping the xi fixed and setting yi* = μ̂(xi) + εi*, with εi* ~ N(0, σ̂²)
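A short parametric-bootstrap sketch in the same spirit; a simple linear fit stands in for the book's smoother, and the data, noise level, and B are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical training data.
x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=x.size)

# Fit a simple model and estimate the residual variance.
coef = np.polyfit(x, y, deg=1)            # stand-in for the book's spline fit
y_hat = np.polyval(coef, x)
sigma2_hat = np.mean((y - y_hat) ** 2)

B = 1000
boot_coefs = np.empty((B, 2))
for b in range(B):
    # Parametric bootstrap: new responses = fitted values + Gaussian noise.
    y_star = y_hat + rng.normal(0, np.sqrt(sigma2_hat), size=y_hat.size)
    boot_coefs[b] = np.polyfit(x, y_star, deg=1)

# Bootstrap standard errors of the fitted coefficients (slope, intercept).
print(boot_coefs.std(axis=0, ddof=1))
```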
Bootstrap (5) --- Summary • Nonparametric bootstrap: no underlying distribution assumption • Parametric bootstrap agrees with maximum likelihood • The bootstrap distribution approximates the posterior distribution of the parameters under a non-informative prior
Bagging (1) • Bootstrap: • A way of assessing the accuracy of a parameter estimate or a prediction • Bagging (Bootstrap Aggregating): • Fit a classifier to each bootstrap sample and combine the bootstrap estimators to predict new data • For classification, combining the bootstrap classifiers becomes majority voting • (Diagram: original sample → bootstrap samples → bootstrap estimators)
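A bare-bones bagging sketch with majority voting (not from the book); the base classifier is a hypothetical one-dimensional threshold "stump", chosen only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical binary-classification data in one dimension.
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.8, size=200) > 0).astype(int)

def fit_stump(x, y):
    """Pick the threshold that minimizes training error; predict 1 above it."""
    thresholds = np.unique(x)
    errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
    return thresholds[int(np.argmin(errors))]

B = 50
votes = np.zeros((B, x.size), dtype=int)
for b in range(B):
    # Train one classifier per bootstrap sample.
    idx = rng.choice(x.size, size=x.size, replace=True)
    t_b = fit_stump(x[idx], y[idx])
    votes[b] = (x > t_b).astype(int)

# Bagged prediction = majority vote across the B bootstrap classifiers.
y_bagged = (votes.mean(axis=0) > 0.5).astype(int)
print("training error of bagged classifier:", np.mean(y_bagged != y))
```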
Bagging (2) • Pros • The estimator can be significantly improved if the learning algorithm is unstable, i.e., a small change to the training set causes a large change in the output hypothesis • Reduces the variance while leaving the bias essentially unchanged • Cons • Can degrade the performance of stable procedures • The simple structure of the base model is lost after bagging (e.g., a bagged tree is no longer a tree)
Bumping • A stochastic flavor of model selection • Bootstrap Umbrella of Model Parameters • Draw bootstrap samples and fit the model to each; compare the resulting models on the original training data and keep the best one (repeat until we are satisfied or tired) • (Diagram: original sample → bootstrap samples → bootstrap estimators)
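A compact bumping sketch (not from the book), reusing the same hypothetical threshold classifier as in the bagging sketch above: each bootstrap fit is scored on the original training data and the single best model is kept.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.8, size=200) > 0).astype(int)   # hypothetical data

def fit_stump(x, y):
    """Threshold classifier: predict 1 for x > t, choosing t by training error."""
    thresholds = np.unique(x)
    errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
    return thresholds[int(np.argmin(errors))]

def training_error(t, x, y):
    return np.mean((x > t).astype(int) != y)

# Bumping: fit on bootstrap samples, score every fit on the ORIGINAL data,
# and keep the single model with the smallest training error.
candidates = [fit_stump(x, y)]              # include the fit to the original data
for b in range(25):
    idx = rng.choice(x.size, size=x.size, replace=True)
    candidates.append(fit_stump(x[idx], y[idx]))

best_t = min(candidates, key=lambda t: training_error(t, x, y))
print("chosen threshold:", best_t, "training error:", training_error(best_t, x, y))
```

Note that, as in the chapter's description of bumping, the fit to the original (un-resampled) data is included among the candidates, so the method is free to keep the original model if it already has the lowest training error.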
Conclusions • Maximum Likelihood vs. Bayesian Inference • EM vs. Gibbs Sampling • Bootstrap • Bagging • Bumping