Bayesian Learning

Rong Jin

Outline • MAP learning vs. ML learning • Minimum description length principle • Bayes optimal classifier • Bagging

Maximum Likelihood Learning (ML) • Find the model that best fits the training data by maximizing its log-likelihood • Example: logistic regression • Parameters are found by maximizing the likelihood of the training data
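The slide's formulas did not survive in this text. As a sketch of what ML learning means for logistic regression, assume labels y_i in {-1, +1}, a weight vector w with any bias term folded into the features, and n training pairs (x_i, y_i). This notation is an assumption, not taken from the slides:

```latex
\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{n} \log \Pr(y_i \mid x_i; \theta)

% Logistic regression, labels y_i \in \{-1, +1\}:
\Pr(y \mid x; w) = \frac{1}{1 + \exp(-y\, w^{\top} x)},
\qquad
w_{\mathrm{ML}} = \arg\max_{w} \sum_{i=1}^{n} \log \frac{1}{1 + \exp(-y_i\, w^{\top} x_i)}
```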

Maximum A Posteriori Learning (MAP) • In ML learning, models are determined solely by the training examples • Very often, we have prior knowledge or preferences about parameters/models • ML learning cannot incorporate this prior knowledge or preference • Maximum a posteriori (MAP) learning • Knowledge/preferences about parameters/models are incorporated through a prior for the parameters

Example: Logistic Regression • ML learning • Prior knowledge/preference • No feature should dominate all other features → prefer small weights • Gaussian prior for parameters/models:
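The prior's formula is missing from this text. A common way to encode "prefer small weights" is a zero-mean isotropic Gaussian over the d weights; the variance symbol \sigma^2 and the isotropic form are assumptions here:

```latex
\Pr(w) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^{2}}}
         \exp\!\left(-\frac{w_j^{2}}{2\sigma^{2}}\right)
\quad\Longrightarrow\quad
\log \Pr(w) = -\frac{\lVert w \rVert_2^{2}}{2\sigma^{2}} + \text{const}
```

Under this prior, large weights have low prior probability, and a smaller \sigma^2 expresses a stronger preference for small weights.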

Example (cont’d) • MAP learning for logistic regression • Compare with regularized logistic regression: the log of the Gaussian prior plays the role of the L2 penalty on the weights
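To make the ML vs. MAP contrast concrete, here is a minimal sketch using plain gradient ascent. It assumes labels in {-1, +1}, a zero-mean Gaussian prior with variance sigma2, and a tiny hand-made data set; the function names, step size, and data are all hypothetical and only meant for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, sigma2=None, lr=0.1, steps=2000):
    """Gradient ascent on the log-likelihood (sigma2=None, i.e. ML) or on
    log-likelihood - ||w||^2 / (2 * sigma2) (MAP with a Gaussian prior)."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coef = yi * (1.0 - sigmoid(margin))  # gradient of log sigmoid(y * w.x)
            for j in range(d):
                grad[j] += coef * xi[j]
        if sigma2 is not None:
            for j in range(d):
                grad[j] -= w[j] / sigma2  # gradient of the log Gaussian prior
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))

# Hypothetical, linearly separable toy data; the first feature is a constant bias.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, -1, -1]

w_ml = fit_logistic(X, y)               # weights keep growing on separable data
w_map = fit_logistic(X, y, sigma2=1.0)  # the prior keeps the weights small
print("ML :", w_ml, norm(w_ml))
print("MAP:", w_map, norm(w_map))
```

On separable data the likelihood alone has no finite maximizer, so the ML weights grow with the number of steps, while the MAP weights settle at a finite value. That is the behavior the "prefer small weights" prior is meant to produce.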

Minimum Description Length Principle • Occam’s razor: prefer the simplest hypothesis • Simplest hypothesis → hypothesis with the shortest description length • Minimum description length: prefer the shortest hypothesis, h_MDL = argmin over h in H of [ L_C1(h) + L_C2(D|h) ] • L_C(x) is the description length of message x under coding scheme C • L_C1(h): number of bits to encode hypothesis h (complexity of the model) • L_C2(D|h): number of bits to encode data D given h (number of mistakes)

Minimum Description Length Principle • A sender wants to transmit the data D to a receiver • Send only D? • Send only h? • Send h plus D given h (the hypothesis and the exceptions to it)?

Example: Decision Tree • H = decision trees, D = training data labels • L_C1(h) is the number of bits to describe tree h • L_C2(D|h) is the number of bits to describe D given tree h • Note L_C2(D|h) = 0 if the examples are classified perfectly by h • Only the exceptions need to be described • h_MDL trades off tree size for training errors
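The trade-off between tree size and training errors can be sketched numerically. The coding scheme below is an assumption (each exception costs log2(m) bits to name the example plus log2(k) bits for its correct label), and the three candidate trees with their bit counts and error counts are hypothetical.

```python
import math

def description_length(tree_bits, n_errors, m, k):
    # Assumed coding scheme: an exception is identified by its index among
    # the m examples (log2(m) bits) plus its correct label (log2(k) bits).
    bits_per_exception = math.log2(m) + math.log2(k)
    return tree_bits + n_errors * bits_per_exception

# Hypothetical candidates: (name, bits to encode the tree, training errors).
candidates = [("small", 20, 30), ("medium", 120, 5), ("large", 400, 0)]
m, k = 1024, 2  # number of training examples and number of classes

lengths = {name: description_length(bits, errs, m, k)
           for name, bits, errs in candidates}
h_mdl = min(lengths, key=lengths.get)
print(lengths, "->", h_mdl)
```

With these made-up numbers the perfectly fitting large tree is not selected: its own description is more expensive than a medium tree plus a handful of exceptions.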

MAP vs. MDL • MAP learning: choose the hypothesis that maximizes the posterior Pr(h|D) • Fact from information theory • The optimal code (shortest expected coding length) uses –log2 p bits for an event with probability p • Interpret MAP using the MDL principle • –log2 Pr(h) is the description length of h under optimal coding • –log2 Pr(D|h) is the description length of the exceptions under optimal coding
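The link between the two principles can be written out as a short derivation. It relies only on Bayes' rule and on dropping Pr(D), which does not depend on h; identifying the two code lengths with the negative log-probabilities is the assumption that the optimal codes are used.

```latex
h_{\mathrm{MAP}}
  = \arg\max_{h \in H} \Pr(h \mid D)
  = \arg\max_{h \in H} \Pr(D \mid h)\,\Pr(h)
  = \arg\min_{h \in H} \bigl[ -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) \bigr]

% With optimal codes, L_{C_1}(h) = -\log_2 \Pr(h) and
% L_{C_2}(D \mid h) = -\log_2 \Pr(D \mid h), so
h_{\mathrm{MAP}} = \arg\min_{h \in H} \bigl[ L_{C_1}(h) + L_{C_2}(D \mid h) \bigr] = h_{\mathrm{MDL}}
```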

Problems with Maximum Approaches • Consider • Three possible hypotheses: • Maximum approaches will pick h1 • Given a new instance x • Maximum approaches will output + • However, is this the most probable result?

Bayes Optimal Classifier (Bayesian Average) • Bayes optimal classification: • Example: • The most probable class is –
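The numbers of the example are not preserved in this text. The sketch below uses assumed, purely illustrative values (posteriors 0.4, 0.3, 0.3, with h1 predicting + and the other two predicting -) that reproduce the behavior described: the maximum approach outputs +, while the Bayes optimal rule, argmax over classes c of the sum over h of Pr(c|h) Pr(h|D), outputs -.

```python
# Assumed, illustrative values; the slide's own numbers are not in this text.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # Pr(h | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}  # label each h assigns to x

def map_predict(posterior, prediction):
    # Maximum approach: use only the single most probable hypothesis.
    best = max(posterior, key=posterior.get)
    return prediction[best]

def bayes_optimal_predict(posterior, prediction):
    # Bayesian average: weight every hypothesis' vote by its posterior.
    votes = {}
    for h, p in posterior.items():
        votes[prediction[h]] = votes.get(prediction[h], 0.0) + p
    return max(votes, key=votes.get), votes

print(map_predict(posterior, prediction))
print(bayes_optimal_predict(posterior, prediction))
```

Here each hypothesis is deterministic, so Pr(c|h) is 1 for the label it predicts and 0 otherwise.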

When Do We Need Bayesian Average? • Bayes optimal classification • When do we need Bayesian average? • The posterior has multiple modes • The optimal mode is flat • When should we NOT use Bayesian average? • When Pr(h|D) cannot be estimated accurately

Computational Issues with Bayes Optimal Classifier • Bayes optimal classification • Computational issues: • Need to sum over all possible models/hypotheses h • This is expensive or impossible when the model/hypothesis space is large • Example: decision trees • Solution: sampling!

Gibbs Classifier • Gibbs algorithm • Choose one hypothesis at random, according to P(h|D) • Use it to classify the new instance • Surprising fact: • Improve by sampling multiple hypotheses from P(h|D) and averaging their classification results • Markov chain Monte Carlo (MCMC) sampling • Importance sampling
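The statement of the "surprising fact" is missing here; it is presumably the classical bound that the expected error of the Gibbs classifier is at most twice that of the Bayes optimal classifier, which holds when target concepts are drawn according to the prior. The sketch below only illustrates the two sampling strategies on a small discrete posterior. The posterior values and predictions are assumed for illustration, and a real application would need MCMC or importance sampling rather than direct sampling.

```python
import random
from collections import Counter

# Assumed, illustrative posterior over three hypotheses and their labels for x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def sample_hypotheses(n, rng):
    # Draw n hypotheses independently according to P(h | D).
    names = list(posterior)
    weights = [posterior[h] for h in names]
    return rng.choices(names, weights=weights, k=n)

def gibbs_predict(rng):
    # Gibbs classifier: classify with a single sampled hypothesis.
    (h,) = sample_hypotheses(1, rng)
    return prediction[h]

def sampled_average_predict(n, rng):
    # Improvement: sample n hypotheses and take a majority vote.
    votes = Counter(prediction[h] for h in sample_hypotheses(n, rng))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
print([gibbs_predict(rng) for _ in range(10)])
print(sampled_average_predict(2001, rng))
```

A single Gibbs draw answers + about 40% of the time with these numbers, whereas the vote over many samples approaches the Bayes optimal answer.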

Bagging Classifiers • In general, sampling from P(h|D) is difficult because • P(h|D) is rather difficult to compute • Example: how to compute P(h|D) for a decision tree? • P(h|D) is impossible to compute for non-probabilistic classifiers such as SVMs • P(h|D) is extremely small when the hypothesis space is large • Bagging classifiers: • Approximate sampling from P(h|D) by sampling the training examples

Bootstrap Sampling • Bagging = Bootstrap aggregating • Bootstrap sampling: given a set D containing m training examples • Create D_i by drawing m examples at random with replacement from D • D_i is expected to leave out about 0.37 of the examples in D
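The 0.37 figure follows from the chance that a given example is never drawn in m draws, (1 - 1/m)^m, which approaches 1/e ≈ 0.368 as m grows. A minimal sketch that checks this empirically; the function names and the choice m = 1000 are illustrative:

```python
import math
import random

def bootstrap_sample(data, rng):
    # Draw len(data) examples uniformly at random, with replacement.
    m = len(data)
    return [data[rng.randrange(m)] for _ in range(m)]

def left_out_fraction(m, trials, rng):
    # Average fraction of the m examples that a bootstrap sample misses.
    data = list(range(m))
    total = 0.0
    for _ in range(trials):
        drawn = set(bootstrap_sample(data, rng))
        total += (m - len(drawn)) / m
    return total / trials

rng = random.Random(0)
estimate = left_out_fraction(1000, 50, rng)
exact = (1 - 1 / 1000) ** 1000
print(estimate, exact, math.exp(-1))
```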

Bagging Algorithm • Create k bootstrap samples D_1, D_2, …, D_k • Train a distinct classifier h_i on each D_i • Classify a new instance by a vote of the classifiers with equal weights
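The three steps can be sketched in a few lines. The base learner here is a one-dimensional decision stump chosen only to keep the example self-contained; the stump, the toy data, and all function names are assumptions, and any classifier could take its place.

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a 1-D decision stump on (x, label) pairs by minimizing training
    errors over candidate thresholds and label assignments."""
    labels = sorted({y for _, y in data})
    xs = sorted({x for x, _ in data})
    # Candidate thresholds: one below all points, then midpoints between neighbors.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for left in labels:
            for right in labels:
                errors = sum((left if x <= t else right) != y for x, y in data)
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(data, k, rng, train=train_stump):
    m = len(data)
    models = []
    for _ in range(k):
        sample = [data[rng.randrange(m)] for _ in range(m)]  # bootstrap sample D_i
        models.append(train(sample))                         # classifier h_i
    def predict(x):
        votes = Counter(h(x) for h in models)                # equal-weight vote
        return votes.most_common(1)[0][0]
    return predict

# Hypothetical toy data: "-" below 10, "+" from 10 up, with two noisy labels.
data = [(float(x), "-" if x < 10 else "+") for x in range(20)]
data[3] = (3.0, "+")
data[15] = (15.0, "-")

predict = bagging(data, k=25, rng=random.Random(0))
print(predict(1.0), predict(18.0))
```

An odd k avoids ties in a two-class vote; with ties, `Counter.most_common` falls back to insertion order.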

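Bayesian Average vs. Bagging • Bayesian average: sample h_1, h_2, …, h_k from the posterior P(h|D) • Bagging: draw bootstrap samples D_1, D_2, …, D_k from D and train h_1, h_2, …, h_k on them • Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D) • Bagging ≈ Bayesian average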

Empirical Study of Bagging • Bagging decision trees • Draw 50 different bootstrap samples from the original training data • Learn a decision tree on each bootstrap sample • Predict the class labels of test instances by the majority vote of the 50 decision trees • Bagged decision trees perform better than a single decision tree

Bias-Variance Tradeoff • Why does bagging work better than a single classifier? • Bias-variance tradeoff • Real-valued case • Output y for x follows y = f(x) + ε, ε ~ N(0, σ²) • ŷ(x|D) is a predictor learned from training data D • Bias-variance decomposition: expected error = irreducible variance + model bias + model variance • Model bias: the simpler ŷ(x|D), the larger the bias • Model variance: the simpler ŷ(x|D), the smaller the variance
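The decomposition itself is not preserved in this text. In its standard form for squared error, with the predictor written as \hat{y}(x \mid D) (an assumed symbol) and the noise \epsilon zero-mean with variance \sigma^2 and independent of D, it reads:

```latex
\mathbb{E}_{D,\epsilon}\!\left[ \bigl( y - \hat{y}(x \mid D) \bigr)^{2} \right]
  = \underbrace{\sigma^{2}}_{\text{irreducible variance}}
  + \underbrace{\bigl( f(x) - \mathbb{E}_{D}[\hat{y}(x \mid D)] \bigr)^{2}}_{\text{model bias (squared)}}
  + \underbrace{\mathbb{E}_{D}\!\left[ \bigl( \hat{y}(x \mid D) - \mathbb{E}_{D}[\hat{y}(x \mid D)] \bigr)^{2} \right]}_{\text{model variance}}
```

The cross terms vanish because \epsilon has zero mean and \hat{y}(x \mid D) - \mathbb{E}_{D}[\hat{y}(x \mid D)] has zero mean over D.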

Bias-Variance Tradeoff • Fit with complicated models (figure: fitted models against the true model) • Small model bias • Large model variance

Bias-Variance Tradeoff • Fit with simple models (figure: fitted models against the true model) • Large model bias • Small model variance

Bagging • Bagging performs better than a single classifier because it effectively reduces the model variance • Figure: bias and variance of a single decision tree vs. a bagged decision tree
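The variance reduction can be checked with a small simulation. This sketch swaps the decision tree for a 1-nearest-neighbor regressor, which is another high-variance learner and keeps the code short; the true model sin(2πx), the noise level, the query point, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics

def f(x):
    return math.sin(2 * math.pi * x)  # assumed "true model"

def make_data(n, noise_sd, rng):
    # Fixed design points on [0, 1]; only the noise changes between data sets.
    xs = [i / (n - 1) for i in range(n)]
    return [(x, f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def one_nn(data, x0):
    # Predict with the target value of the nearest training point.
    return min(data, key=lambda p: abs(p[0] - x0))[1]

def bagged_one_nn(data, x0, k, rng):
    # Average the 1-NN predictions of k bootstrap replicates.
    m = len(data)
    preds = []
    for _ in range(k):
        sample = [data[rng.randrange(m)] for _ in range(m)]
        preds.append(one_nn(sample, x0))
    return sum(preds) / k

def prediction_variances(x0=0.4, n=30, noise_sd=0.3, k=25, trials=200, seed=0):
    # Variance of each predictor at x0 across independently drawn training sets.
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(trials):
        data = make_data(n, noise_sd, rng)
        single.append(one_nn(data, x0))
        bagged.append(bagged_one_nn(data, x0, k, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)

var_single, var_bagged = prediction_variances()
print("single:", var_single, "bagged:", var_bagged)
```

The bagged predictor mixes the targets of several nearby points, so its prediction fluctuates less from one training set to the next than the single 1-NN prediction does.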
