Chapter 8: Model Inference and Averaging Presented by Hui Fang
Basic Concepts • Statistical inference • Using data to infer the distribution that generated the data • We observe data X1, …, Xn drawn from a distribution F. • We want to infer (or estimate or learn) F, or some feature of F such as its mean. • Statistical model • A set of distributions (or a set of densities) • Parametric model • Non-parametric model
Statistical Model (1) • Parametric model • A set of distributions that can be parameterized by a finite number of parameters • E.g., assume the data come from a normal distribution; the model is { N(μ, σ²) : μ ∈ R, σ > 0 } • In general, a parametric model takes the form { f(x; θ) : θ ∈ Θ }, where θ is an unknown parameter (or vector of parameters) and Θ is the parameter space
Statistical Model (2) • Non-parametric model • A set that cannot be parameterized by a finite number of parameters • E.g., assume the data come from the set of all distributions (all CDFs) • Cumulative distribution function (CDF), F(x) = Pr(X ≤ x) • Probability density function (PDF), f(x) = F′(x)
Outline • Model Inference • Maximum likelihood inference (8.2.2) • EM algorithm (8.5) • Bayesian inference (8.3) • Gibbs sampling (8.6) • Bootstrap (8.2.1, 8.2.3, 8.4) • Model Averaging and Improvement • Bagging (8.7) • Bumping (8.9)
Parametric Inference • Parametric models: { f(x; θ) : θ ∈ Θ } • The problem of inference reduces to the problem of estimating the parameter θ • Methods • Maximum Likelihood Inference • Bayesian Inference
An Example of MLE • Suppose you have data z1, …, zN drawn from a normal distribution N(μ, σ²), but you don't know μ or σ². • MLE: for which values of (μ, σ²) is the observed data most likely?
A General MLE Strategy • Suppose θ = (θ1, …, θk) is a vector of parameters. Task: find the MLE of θ. • Likelihood function: L(θ; Z) = ∏i g(zi; θ) • Maximum likelihood estimator: the value of θ that maximizes the likelihood function • 1. Write the log-likelihood function ℓ(θ; Z) = Σi log g(zi; θ) • 2. Work out ∂ℓ/∂θ using high-school calculus • 3. Solve the set of simultaneous equations ∂ℓ/∂θ = 0 • 4. Check you are at a maximum
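As a concrete, minimal illustration of this strategy (not from the slides), the sketch below maximizes a normal log-likelihood numerically with NumPy/SciPy; the data-generating values, sample size, and starting point are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: N draws from a normal with unknown mean/variance.
rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_log_likelihood(params, data):
    """Negative log-likelihood of N(mu, sigma^2); sigma passed on log scale."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)          # keeps sigma > 0 during optimization
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu)**2 / (2 * sigma**2))

# Steps 1-4 of the strategy, done numerically instead of by hand.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(z,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)               # close to the sample mean and std
```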
Properties of MLE • The sampling distribution of the maximum likelihood estimator has a limiting normal distribution (p. 230): θ̂ → N(θ0, i(θ0)⁻¹), where θ0 is the true value of θ • Information matrix: I(θ) = −Σi ∂²ℓ(θ; zi)/∂θ ∂θᵀ • Fisher information: i(θ) = E[I(θ)]
An Example for EM Algorithm (1) • Model Y as a mixture of two normal distributions: Y = (1 − Δ)·Y1 + Δ·Y2, where Y1 ~ N(μ1, σ1²), Y2 ~ N(μ2, σ2²), and Δ ∈ {0, 1} with Pr(Δ = 1) = π • The parameters are θ = (π, μ1, σ1², μ2, σ2²) • The log-likelihood based on the N training cases is ℓ(θ; Z) = Σi log[(1 − π)·φθ1(yi) + π·φθ2(yi)], where φθ denotes the normal density • A sum of terms is inside the logarithm => difficult to maximize directly
An Example for EM Algorithm (2) • Consider unobserved latent variables Δi: if Δi = 1, yi comes from model 2; otherwise it comes from model 1. If we knew the values of the Δi, maximization would be easy. • 1. Take initial guesses for the parameters • 2. Expectation Step: compute the responsibilities γi = E(Δi | θ, Z) = Pr(Δi = 1 | θ, Z) • 3. Maximization Step: compute the values of the parameters that maximize the log-likelihood, given the current γi • 4. Iterate steps 2 and 3 until convergence.
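A minimal NumPy sketch of this EM recipe for the two-component mixture (not from the book; the synthetic data, initial guesses, and fixed number of iterations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data from a hypothetical two-component mixture.
y = np.concatenate([rng.normal(-1.0, 0.7, 120), rng.normal(3.0, 1.2, 80)])

def normal_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Step 1: initial guesses.
pi, mu1, s1, mu2, s2 = 0.5, y.min(), np.var(y), y.max(), np.var(y)

for _ in range(200):
    # Step 2 (E-step): responsibilities gamma_i = Pr(Delta_i = 1 | theta, Z).
    p1 = (1 - pi) * normal_pdf(y, mu1, s1)
    p2 = pi * normal_pdf(y, mu2, s2)
    gamma = p2 / (p1 + p2)

    # Step 3 (M-step): weighted means/variances and mixing proportion.
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = np.mean(gamma)

print(pi, mu1, np.sqrt(s1), mu2, np.sqrt(s2))
```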
Bayesian Inference • Prior (knowledge before we see the data): Pr(θ) • Sampling model: Pr(Z | θ) • After observing data Z, we update our beliefs and form the posterior distribution: Pr(θ | Z) = Pr(Z | θ)·Pr(θ) / ∫ Pr(Z | θ)·Pr(θ) dθ • Posterior is proportional to likelihood times prior: Pr(θ | Z) ∝ Pr(Z | θ)·Pr(θ) • Doesn't it cause a problem to throw away the normalizing constant? No: we can always recover it, since the posterior must integrate to 1.
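As a small worked illustration of "posterior ∝ likelihood × prior" (not from the slides), here is a conjugate normal-mean example in which the posterior has a closed form; the prior parameters, known variance, and data are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0                                              # assumed known sampling variance
z = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)   # hypothetical data

# Prior on the mean: theta ~ N(m0, tau0^2).
m0, tau0_sq = 0.0, 10.0

# Posterior is also normal (conjugacy): precision and mean update.
n = len(z)
post_var = 1.0 / (1.0 / tau0_sq + n / sigma2)
post_mean = post_var * (m0 / tau0_sq + n * z.mean() / sigma2)
print(post_mean, np.sqrt(post_var))
```

Because the prior and likelihood are both normal, the normalizing constant never needs to be computed explicitly: the posterior is known to be normal, so its mean and variance determine it completely.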
Prediction Using Inference • Task: predict the values of a future observation z_new • Bayesian approach: use the predictive distribution Pr(z_new | Z) = ∫ Pr(z_new | θ)·Pr(θ | Z) dθ, which accounts for the uncertainty in estimating θ • Maximum likelihood approach: plug in the MLE and use Pr(z_new | θ̂), which ignores that uncertainty
MCMC (1) • General problem: evaluating E[f(X)] = ∫ f(x)·p(x) dx can be difficult. • However, if we can draw samples X1, …, XT ~ p(x), then we can estimate E[f(X)] ≈ (1/T)·Σt f(Xt). • This is Monte Carlo (MC) integration.
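A tiny sketch of MC integration under an assumed target: estimating E[X²] for a standard normal, whose exact value is 1; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(100_000)     # samples from the target p(x) = N(0, 1)
estimate = np.mean(x ** 2)           # MC estimate of E[f(X)] with f(x) = x^2
print(estimate)                      # close to the exact value, 1.0
```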
MCMC (2) • A stochastic process is an indexed random variable X(t), where t may be time and X is a random variable. • A Markov chain is generated by sampling X(t+1) ~ p(x | X(t)). So X(t+1) depends only on X(t), not on X(0), X(1), …, X(t−1). • p is the transition kernel. • As t → ∞, the Markov chain converges to its stationary distribution.
MCMC (3) • Problem: how do we construct a Markov chain whose stationary distribution is our target distribution π(x)? This is called Markov chain Monte Carlo (MCMC). • Two key objectives: • Generate a sample from a joint probability distribution • Estimate expectations using sample averages (i.e., doing MC integration)
Gibbs Sampling (1) • Purpose: draw from a joint (target) distribution π(x1, …, xK) • Method: iterative conditional sampling, i.e., draw each component from its conditional distribution given the current values of all the other components
Gibbs Sampling (2) • Suppose that X = (X1, …, XK) has target distribution π(x1, …, xK) • Sample or update in turn: X1(t+1) ~ π(x1 | X2(t), …, XK(t)); X2(t+1) ~ π(x2 | X1(t+1), X3(t), …, XK(t)); …; XK(t+1) ~ π(xK | X1(t+1), …, XK−1(t+1)) • Always use the most recent values of the other components
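A minimal Gibbs-sampling sketch (not from the slides) for an assumed target, a standard bivariate normal with correlation rho, where both full conditionals are univariate normals with a known form:

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                    # assumed correlation of the bivariate normal target
T = 10_000
x = np.zeros((T, 2))

for t in range(1, T):
    # Update x1 from p(x1 | x2) = N(rho * x2, 1 - rho^2), using the latest x2.
    x[t, 0] = rng.normal(rho * x[t - 1, 1], np.sqrt(1 - rho ** 2))
    # Update x2 from p(x2 | x1), using the freshly drawn x1.
    x[t, 1] = rng.normal(rho * x[t, 0], np.sqrt(1 - rho ** 2))

# Sample correlation should be close to rho after burn-in.
print(np.corrcoef(x[1000:, 0], x[1000:, 1])[0, 1])
```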
An Example for Conditional Sampling • Target distribution: • How to draw samples?
Same Example as for EM (1) • Recall: model Y as a mixture of two normal distributions, Y = (1 − Δ)·Y1 + Δ·Y2, where Y1 ~ N(μ1, σ1²), Y2 ~ N(μ2, σ2²), and Pr(Δ = 1) = π • For simplicity, assume the variances σ1², σ2² and the mixing proportion π are known, so the parameters to be sampled are θ = (μ1, μ2)
Comparison Between EM and Gibbs Sampling
EM: • 1. Take initial guesses for the parameters • 2. Expectation Step: compute the responsibilities γi • 3. Maximization Step: compute the values of the parameters that maximize the log-likelihood, given the γi • 4. Iterate steps 2 and 3 until convergence
Gibbs: • 1. Take initial guesses for the parameters θ(0) = (μ1(0), μ2(0)) • 2. Repeat for t = 1, 2, …: for i = 1, 2, …, N generate Δi(t) ∈ {0, 1} with Pr(Δi(t) = 1) = γi(θ(t−1)); then generate μ1(t) and μ2(t) from their conditional distributions given the Δi(t) and the data • 3. Continue step 2 until the joint distribution of (Δ(t), μ1(t), μ2(t)) doesn't change
Bootstrap (0) • Basic idea: • Randomly draw datasets with replacement from the training data • Each bootstrap sample has the same size as the original training set • (Diagram: training sample → bootstrap samples)
Example for Bootstrap (1) • (Table: bioequivalence data, with a measurement Z and a measurement Y for each subject)
Example for Bootstrap (2) • We want to estimate the ratio θ = E(Y)/E(Z) • The plug-in estimator is θ̂ = ȳ / z̄ • What is the accuracy of this estimator?
Bootstrap (1) • The bootstrap was introduced as a general method for assessing the statistical accuracy of an estimator. • Data: Z = (z1, …, zN), drawn from a distribution F • Statistic (any function of the data): S(Z) • We want to know the sampling distribution of S(Z), e.g., its variance VarF(S(Z)) • Real world: F → Z → S(Z); Bootstrap world: F̂ → Z* → S(Z*) • VarF(S(Z)) can be estimated with VarF̂(S(Z*))
Bootstrap (2) --- Detour • Suppose we draw a sample Z = (z1, …, zN) from a distribution F.
Bootstrap (3) • Real world: F → Z = (z1, …, zN) → S(Z) • Bootstrap world: F̂ → Z* = (z1*, …, zN*) → S(Z*) • Bootstrap variance estimation: • 1. Draw a bootstrap sample Z* by sampling N points with replacement from Z • 2. Compute S(Z*) • 3. Repeat steps 1 and 2, B times, to get S(Z*1), …, S(Z*B) • 4. Let the bootstrap variance estimate be (1/(B − 1))·Σb [S(Z*b) − S̄*]², where S̄* = (1/B)·Σb S(Z*b)
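A minimal nonparametric-bootstrap sketch of this variance recipe; the statistic (the median), the synthetic data, and B are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.exponential(scale=2.0, size=60)    # hypothetical training sample
statistic = np.median                      # S(Z): any function of the data

B = 2000
s_star = np.empty(B)
for b in range(B):
    # Draw Z* by sampling N points with replacement from the original data.
    z_star = rng.choice(z, size=len(z), replace=True)
    s_star[b] = statistic(z_star)

# Bootstrap estimates of the variance / standard error of S(Z).
print(s_star.var(ddof=1), s_star.std(ddof=1))
```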
Bootstrap (4) • Non-parametric bootstrap • Uses the raw data, not a specific parametric model, to generate new datasets • Parametric bootstrap • Simulates new responses by adding Gaussian noise to the predicted values • Example from the book: fit μ̂(x) to the training data and estimate the noise variance σ̂² • We simulate new (x, y) pairs by keeping the xi fixed and setting yi* = μ̂(xi) + εi*, with εi* ~ N(0, σ̂²)
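A short parametric-bootstrap sketch in the same spirit; a simple linear fit stands in for the book's smoother, and the data, noise level, and B are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical training data.
x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=x.size)

# Fit a simple model and estimate the residual variance.
coef = np.polyfit(x, y, deg=1)            # stand-in for the book's spline fit
y_hat = np.polyval(coef, x)
sigma2_hat = np.mean((y - y_hat) ** 2)

B = 1000
boot_coefs = np.empty((B, 2))
for b in range(B):
    # Parametric bootstrap: new responses = fitted values + Gaussian noise.
    y_star = y_hat + rng.normal(0, np.sqrt(sigma2_hat), size=y_hat.size)
    boot_coefs[b] = np.polyfit(x, y_star, deg=1)

# Bootstrap standard errors of the fitted coefficients (slope, intercept).
print(boot_coefs.std(axis=0, ddof=1))
```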
Bootstrap (5) --- Summary • Nonparametric bootstrap: no underlying distribution assumption • Parametric bootstrap agrees with maximum likelihood • The bootstrap distribution approximates the posterior distribution of the parameters under a non-informative prior
Bagging (1) • Bootstrap: • A way of assessing the accuracy of a parameter estimate or a prediction • Bagging (Bootstrap Aggregating): • Fit a classifier to each bootstrap sample and combine the bootstrap estimators to predict new data • For classification, combining the bootstrap classifiers becomes majority voting • (Diagram: original sample → bootstrap samples → bootstrap estimators)
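A bare-bones bagging sketch with majority voting (not from the book); the base classifier is a hypothetical one-dimensional threshold "stump", chosen only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical binary-classification data in one dimension.
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.8, size=200) > 0).astype(int)

def fit_stump(x, y):
    """Pick the threshold that minimizes training error; predict 1 above it."""
    thresholds = np.unique(x)
    errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
    return thresholds[int(np.argmin(errors))]

B = 50
votes = np.zeros((B, x.size), dtype=int)
for b in range(B):
    # Train one classifier per bootstrap sample.
    idx = rng.choice(x.size, size=x.size, replace=True)
    t_b = fit_stump(x[idx], y[idx])
    votes[b] = (x > t_b).astype(int)

# Bagged prediction = majority vote across the B bootstrap classifiers.
y_bagged = (votes.mean(axis=0) > 0.5).astype(int)
print("training error of bagged classifier:", np.mean(y_bagged != y))
```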
Bagging (2) • Pros • The estimator can be significantly improved if the learning algorithm is unstable, i.e., a small change to the training set causes a large change in the output hypothesis • Reduces the variance while leaving the bias essentially unchanged • Cons • Can degrade the performance of stable procedures • The simple structure of the base model is lost after bagging (e.g., a bagged tree is no longer a tree)
Bumping • A stochastic flavor of model selection • Bootstrap Umbrella of Model Parameters • Draw bootstrap samples and fit the model to each; compare the resulting models on the original training data and keep the best one (repeat until we are satisfied or tired) • (Diagram: original sample → bootstrap samples → bootstrap estimators)
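A compact bumping sketch (not from the book), reusing the same hypothetical threshold classifier as in the bagging sketch above: each bootstrap fit is scored on the original training data and the single best model is kept.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.8, size=200) > 0).astype(int)   # hypothetical data

def fit_stump(x, y):
    """Threshold classifier: predict 1 for x > t, choosing t by training error."""
    thresholds = np.unique(x)
    errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
    return thresholds[int(np.argmin(errors))]

def training_error(t, x, y):
    return np.mean((x > t).astype(int) != y)

# Bumping: fit on bootstrap samples, score every fit on the ORIGINAL data,
# and keep the single model with the smallest training error.
candidates = [fit_stump(x, y)]              # include the fit to the original data
for b in range(25):
    idx = rng.choice(x.size, size=x.size, replace=True)
    candidates.append(fit_stump(x[idx], y[idx]))

best_t = min(candidates, key=lambda t: training_error(t, x, y))
print("chosen threshold:", best_t, "training error:", training_error(best_t, x, y))
```

Note that, as in the chapter's description of bumping, the fit to the original (un-resampled) data is included among the candidates, so the method is free to keep the original model if it already has the lowest training error.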
Conclusions • Maximum Likelihood vs. Bayesian Inference • EM vs. Gibbs Sampling • Bootstrap • Bagging • Bumping