Bayesian Learning Rong Jin
Outline • MAP learning vs. ML learning • Minimum description length principle • Bayes optimal classifier • Bagging
Maximum Likelihood Learning (ML) • Find the model that best fits the data by maximizing the log-likelihood of the training data • Example: logistic regression • Parameters are found by maximizing the likelihood of the training data
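The slide's formulas did not survive extraction. As a sketch of the ML objective for logistic regression, assuming labels $y_i \in \{-1, +1\}$, a weight vector $w$, and $n$ training examples (this notation is an assumption, not taken from the slides):

```latex
\Pr(y \mid x; w) = \frac{1}{1 + \exp(-y\, w^\top x)},
\qquad
w_{ML} = \arg\max_{w} \sum_{i=1}^{n} \log \frac{1}{1 + \exp(-y_i\, w^\top x_i)}
```

Nothing in this objective depends on anything other than the training examples, which is the limitation that MAP learning addresses.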
Maximum a Posteriori Learning (MAP) • In ML learning, models are determined solely by the training examples • Very often, we have prior knowledge/preference about parameters/models • ML learning is unable to incorporate this prior knowledge/preference • Maximum a posteriori learning (MAP): knowledge/preference about parameters/models is incorporated through a prior over the parameters
Example: Logistic Regression • ML learning • Prior knowledge/preference • No feature should dominate over all other features, so prefer small weights • Gaussian prior for parameters/models: $\Pr(w) \propto \exp\left(-\|w\|_2^2 / (2\sigma^2)\right)$
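Example (cont'd) • MAP learning for logistic regression: maximize the log-likelihood plus the log of the Gaussian prior • Compare with regularized logistic regression: the Gaussian prior acts as an L2 penalty on the weights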
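A minimal sketch of the difference between ML and MAP for logistic regression, assuming labels in {-1, +1} and a zero-mean Gaussian prior with variance `sigma2`. The function names, toy data, learning rate, and step count are all hypothetical choices for illustration. With the prior, the gradient gains the term `-w / sigma2`, which is exactly the gradient of an L2 penalty.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, sigma2=None, lr=0.1, steps=2000):
    """Gradient ascent on the log-likelihood (sigma2=None, ML) or on the
    log-posterior under an assumed prior w ~ N(0, sigma2 * I) (MAP)."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # d/dw log sigmoid(y * w.x) = y * sigmoid(-y * w.x) * x
            coeff = yi * sigmoid(-margin)
            for j in range(d):
                grad[j] += coeff * xi[j]
        if sigma2 is not None:
            # Gradient of the log Gaussian prior: -w / sigma2 (an L2 penalty).
            for j in range(d):
                grad[j] -= w[j] / sigma2
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Toy, linearly separable data (illustrative only); first column is a bias term.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, -1, -1]

w_ml = fit_logistic(X, y)               # weights keep growing on separable data
w_map = fit_logistic(X, y, sigma2=1.0)  # prior keeps the weights small
```

On separable data the ML weights grow without bound as training continues, while the MAP weights settle at a finite value, which is the "prefer small weights" behaviour the prior is meant to encode.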
Minimum Description Length Principle • Occam's razor: prefer the simplest hypothesis • Simplest hypothesis = hypothesis with the shortest description length • Minimum description length: prefer the shortest hypothesis, $h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D|h)$ • $L_C(x)$ is the description length for message x under coding scheme C • $L_{C_1}(h)$: # of bits to encode hypothesis h (complexity of the model) • $L_{C_2}(D|h)$: # of bits to encode data D given h (# of mistakes)
Minimum Description Length Principle • [Figure: a sender transmits data D to a receiver. Send only D? Send only h? Send h plus D given h?]
Example: Decision Tree • H = decision trees, D = training data labels • LC1(h) is # bits to describe tree h • LC2(D|h) is # bits to describe D given tree h • Note LC2(D|h)=0 if examples are classified perfectly by h. • Only need to describe exceptions • hMDL trades off tree size for training errors
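MAP vs. MDL • MAP learning: $h_{MAP} = \arg\max_{h} \Pr(D|h)\Pr(h) = \arg\min_{h} \left[-\log_2 \Pr(D|h) - \log_2 \Pr(h)\right]$ • Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses $-\log_2 p$ bits • Interpret MAP using the MDL principle: $-\log_2 \Pr(h)$ is the description length of h under optimal coding, and $-\log_2 \Pr(D|h)$ is the description length of the exceptions under optimal coding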
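The correspondence between MAP and MDL can be checked numerically. The sketch below uses hypothetical priors and likelihoods for three candidate trees (the names and numbers are invented for illustration) and confirms that maximizing the posterior and minimizing the total code length pick the same hypothesis.

```python
import math

# Hypothetical Pr(h) and Pr(D|h) for three candidate hypotheses (illustrative only).
prior = {"small_tree": 0.5, "medium_tree": 0.3, "large_tree": 0.2}
likelihood = {"small_tree": 0.01, "medium_tree": 0.08, "large_tree": 0.10}

def code_length_bits(p):
    # Optimal code length, in bits, for an event with probability p.
    return -math.log2(p)

def h_map(prior, likelihood):
    # MAP: maximize Pr(D|h) * Pr(h).
    return max(prior, key=lambda h: likelihood[h] * prior[h])

def h_mdl(prior, likelihood):
    # MDL: minimize bits for h plus bits for D given h.
    return min(
        prior,
        key=lambda h: code_length_bits(prior[h]) + code_length_bits(likelihood[h]),
    )

best_map = h_map(prior, likelihood)
best_mdl = h_mdl(prior, likelihood)
```

Because the logarithm is monotonic, the two selections agree for any choice of probabilities, not only for these numbers.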
Problems with Maximum Approaches • Consider • Three possible hypotheses: • Maximum approaches will pick h1 • Given new instance x • Maximum approaches will output + • However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average) • Bayes optimal classification: $c^* = \arg\max_{c} \sum_{h \in H} \Pr(c|h, x) \Pr(h|D)$ • Example: • The most probable class is -
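The numbers for the three-hypothesis example were lost in extraction. The sketch below assumes posteriors of 0.4, 0.3, and 0.3 with h1 predicting + and the other two predicting -; these values are an assumption chosen to match the conclusions stated on the slides, not a recovery of the original figures.

```python
# Assumed posterior Pr(h|D) and each hypothesis's prediction for the new instance x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def map_prediction(posterior, prediction):
    # Maximum approach: use only the single most probable hypothesis.
    best_h = max(posterior, key=posterior.get)
    return prediction[best_h]

def bayes_optimal(posterior, prediction):
    # Bayesian average: weight each hypothesis's vote by its posterior.
    votes = {}
    for h, p in posterior.items():
        votes[prediction[h]] = votes.get(prediction[h], 0.0) + p
    return max(votes, key=votes.get), votes

map_label = map_prediction(posterior, prediction)
bayes_label, class_mass = bayes_optimal(posterior, prediction)
```

Under these assumed values the maximum approach outputs +, while the posterior mass on - is 0.6, so the Bayes optimal classifier outputs -.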
When Do We Need Bayesian Average? • Bayes optimal classification • When do we need Bayesian average? • Multiple mode case • Optimal mode is flat • When NOT to use Bayesian average? • Can't estimate Pr(h|D) accurately
Computational Issues with Bayes Optimal Classifier • Bayes optimal classification • Computational issues: • Need to sum over all possible models/hypotheses h • This is expensive or impossible when the model/hypothesis space is large • Example: decision trees • Solution: sampling
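Gibbs Classifier • Gibbs algorithm • Choose one hypothesis at random, according to P(h|D) • Use this to classify the new instance • Surprising fact: under certain conditions, the expected error of the Gibbs classifier is at most twice that of the Bayes optimal classifier, $E[err_{Gibbs}] \le 2\,E[err_{BayesOptimal}]$ • Improve by sampling multiple hypotheses from P(h|D) and averaging their classification results • Sampling methods: Markov chain Monte Carlo (MCMC) sampling, importance sampling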
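A minimal sketch of the Gibbs classifier and its averaged variant, assuming a small discrete posterior that can be sampled directly. The posterior values, predictions, and function names are hypothetical; for a large hypothesis space, MCMC or importance sampling would replace the direct draw.

```python
import random
from collections import Counter

# Assumed posterior Pr(h|D) and per-hypothesis predictions (illustrative only).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def gibbs_classify(rng):
    # Draw ONE hypothesis according to Pr(h|D) and use its prediction.
    h = rng.choices(list(posterior), weights=list(posterior.values()), k=1)[0]
    return prediction[h]

def averaged_gibbs(num_samples, rng):
    # Draw several hypotheses and return the majority prediction,
    # which approximates the Bayesian average.
    votes = Counter(gibbs_classify(rng) for _ in range(num_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
single_draw = gibbs_classify(rng)
averaged = averaged_gibbs(1001, rng)
```

A single draw returns + roughly 40% of the time under these assumed values, while the vote over many draws converges to the Bayes optimal answer.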
Bagging Classifiers • In general, sampling from P(h|D) is difficult because • P(h|D) is rather difficult to compute • Example: how to compute P(h|D) for a decision tree? • P(h|D) is impossible to compute for non-probabilistic classifiers such as SVMs • P(h|D) is extremely small when the hypothesis space is large • Bagging classifiers: • Approximate sampling from P(h|D) by sampling the training examples
Bootstrap Sampling • Bagging = Bootstrap aggregating • Bootstrap sampling: given a set D containing m training examples • Create D_i by drawing m examples at random with replacement from D • D_i is expected to leave out about 0.37 of the examples from D
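The 0.37 figure follows from the chance that a given example is never drawn in m draws, $(1 - 1/m)^m$, which approaches $e^{-1} \approx 0.368$ as m grows. A small sketch that checks this empirically (function names and the seed are illustrative choices):

```python
import math
import random

def bootstrap_sample(data, rng):
    # Draw len(data) examples uniformly at random WITH replacement.
    m = len(data)
    return [data[rng.randrange(m)] for _ in range(m)]

def left_out_fraction(m, rng):
    # Fraction of the m original examples that never appear in one bootstrap sample.
    drawn = set(bootstrap_sample(list(range(m)), rng))
    return 1.0 - len(drawn) / m

rng = random.Random(0)
m = 10000
empirical = left_out_fraction(m, rng)
theoretical = (1.0 - 1.0 / m) ** m  # probability that a fixed example is never drawn
limit = math.exp(-1)                # limiting value as m grows
```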
Bagging Algorithm • Create k bootstrap samples D1, D2, …, Dk • Train a distinct classifier hi on each Di • Classify a new instance by classifier vote with equal weights
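The three steps above can be sketched as follows. The base learner here is a hypothetical one-dimensional threshold stump, swapped in only to keep the example self-contained; any classifier with a train step would do.

```python
import random
from collections import Counter

def train_stump(data):
    """Hypothetical base learner: a 1-D threshold classifier.
    data is a list of (x, label) pairs with label in {-1, +1}."""
    best = None
    for t in sorted({x for x, _ in data}):
        for sign in (1, -1):
            errors = sum(1 for x, y in data if (sign if x >= t else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign if x >= t else -sign

def bagging(data, k, train, rng):
    m = len(data)
    classifiers = []
    for _ in range(k):
        # Step 1: create a bootstrap sample D_i (m draws with replacement).
        sample = [data[rng.randrange(m)] for _ in range(m)]
        # Step 2: train a distinct classifier h_i on D_i.
        classifiers.append(train(sample))

    def predict(x):
        # Step 3: equal-weight majority vote over h_1, ..., h_k.
        votes = Counter(h(x) for h in classifiers)
        return votes.most_common(1)[0][0]

    return predict

# Toy data (illustrative only): label is +1 when x >= 5.
data = [(x, 1 if x >= 5 else -1) for x in range(10)]
predict = bagging(data, k=25, train=train_stump, rng=random.Random(0))
```

Calling `predict(9)` and `predict(0)` returns the voted labels for those inputs; an odd `k` avoids ties in two-class problems.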
Bayesian Average vs. Bagging • [Figure: Bayesian average samples hypotheses h1, h2, …, hk from the posterior P(h|D); bagging draws bootstrap samples D1, D2, …, Dk from D and trains h1, h2, …, hk on them] • Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)
Empirical Study of Bagging • Bagging decision trees • Bootstrap 50 different samples from the original training data • Learn a decision tree over each bootstrap sample • Predict the class labels for test instances by the majority vote of the 50 decision trees • Bagged decision trees perform better than a single decision tree
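Bias-Variance Tradeoff • Why does bagging work better than a single classifier? • Bias-variance tradeoff • Real-valued case • Output y for x follows $y = f(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$ • $h(x|D)$ is a predictor learned from training data D • Bias-variance decomposition: $E_{D,\epsilon}\left[(y - h(x|D))^2\right] = \sigma^2 + \left(f(x) - E_D[h(x|D)]\right)^2 + E_D\left[\left(h(x|D) - E_D[h(x|D)]\right)^2\right]$ • $\sigma^2$: irreducible variance • $\left(f(x) - E_D[h(x|D)]\right)^2$: model bias; the simpler the $h(x|D)$, the larger the bias • $E_D\left[\left(h(x|D) - E_D[h(x|D)]\right)^2\right]$: model variance; the simpler the $h(x|D)$, the smaller the variance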
Bias-Variance Tradeoff • [Figure: true model vs. fit with complicated models] • Small model bias • Large model variance
Bias-Variance Tradeoff • [Figure: true model vs. fit with simple models] • Large model bias • Small model variance
Bagging • Bagging performs better than a single classifier because it effectively reduces the model variance • [Figure: bias and variance of a single decision tree vs. bagged decision trees]
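The variance reduction can be illustrated with a small simulation. This sketch swaps in a 1-nearest-neighbour regressor as the unstable base learner instead of a decision tree, and assumes a sine target with Gaussian noise; the target, noise level, sample sizes, and function names are all illustrative assumptions. It estimates the variance of the prediction at one test point across many independently drawn training sets.

```python
import math
import random

def f(x):
    # Assumed true model (illustrative only).
    return math.sin(2 * math.pi * x)

def make_data(m, rng, noise=0.3):
    # m training pairs (x, y) with y = f(x) + Gaussian noise.
    xs = [rng.random() for _ in range(m)]
    return [(x, f(x) + rng.gauss(0, noise)) for x in xs]

def one_nn(data):
    # High-variance base learner: predict the y of the nearest training x.
    return lambda x0: min(data, key=lambda p: abs(p[0] - x0))[1]

def bagged(data, k, rng):
    # Average k one-NN predictors, each trained on a bootstrap sample.
    m = len(data)
    models = [one_nn([data[rng.randrange(m)] for _ in range(m)]) for _ in range(k)]
    return lambda x0: sum(h(x0) for h in models) / k

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def compare(trials=200, m=30, k=25, x0=0.5, seed=0):
    rng = random.Random(seed)
    single, bag = [], []
    for _ in range(trials):
        data = make_data(m, rng)  # a fresh training set D for every trial
        single.append(one_nn(data)(x0))
        bag.append(bagged(data, k, rng)(x0))
    return variance(single), variance(bag)

var_single, var_bagged = compare()
```

Comparing `var_single` with `var_bagged` shows how much of the spread across training sets is removed by averaging over bootstrap replicates.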