Chapter 6: Bayesian Learning
Bayesian Learning • Bayes theorem • MAP and ML hypotheses • MAP learner • Minimum description length principle • Bayes optimal classifier • Naive Bayes learner • Example: learning over text data • Bayesian belief networks • Expectation Maximization algorithm
Two Roles for Bayesian Methods Provides practical learning algorithms • Naive Bayes learning • Bayesian belief network learning • Combine prior knowledge (prior probabilities) with observed data • Requires prior probabilities
Two Roles for Bayesian Methods Provides useful conceptual framework • Provides a gold standard for evaluating other learning algorithms • Additional insight into Occam's razor
Choosing Hypotheses: MAP • Maximum a posteriori (MAP) hypothesis hMAP
Choosing Hypotheses: ML • If we assume P(hi) = P(hj) for all i and j, we can simplify further and choose the maximum likelihood (ML) hypothesis
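The slides' formula images are not reproduced here; the standard definitions of the two hypotheses are:

```latex
\[ h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D)
          = \arg\max_{h \in H} \frac{P(D \mid h) P(h)}{P(D)}
          = \arg\max_{h \in H} P(D \mid h) P(h) \]
\[ h_{ML} \equiv \arg\max_{h \in H} P(D \mid h) \]
```

Dropping P(D) is valid because it does not depend on h; dropping P(h) in the ML case is valid under the equal-priors assumption above.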
Example: does the patient have cancer? • Two alternative hypotheses: (1) the patient has a particular form of cancer (cancer), (2) the patient does not have cancer (¬cancer). • Prior knowledge: only 0.008 of the entire population has this disease, so P(cancer) = 0.008 and P(¬cancer) = 0.992. • The available data come from a laboratory test with two possible outcomes: ⊕ (positive) and ⊖ (negative). The test is an imperfect indicator of the disease: for patients who actually have the disease it returns a correct ⊕ result 98% of the time (P(⊕|cancer) = 0.98, P(⊖|cancer) = 0.02), and for patients without the disease it returns a correct ⊖ result 97% of the time (P(⊕|¬cancer) = 0.03, P(⊖|¬cancer) = 0.97).
Example: does the patient have cancer? • A new patient's lab test comes back ⊕. Should we diagnose the patient as having cancer?
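Applying Bayes theorem to the numbers above answers the question (a worked calculation added here for completeness):

```latex
\[ P(\oplus \mid cancer) P(cancer) = 0.98 \times 0.008 = 0.0078 \]
\[ P(\oplus \mid \neg cancer) P(\neg cancer) = 0.03 \times 0.992 = 0.0298 \]
```

Since 0.0298 > 0.0078, hMAP = ¬cancer; normalizing, P(cancer | ⊕) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21, so even with a positive test the patient most probably does not have cancer.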
Basic Probability Formulas (1) • Product rule: the probability P(A ∧ B) of the conjunction of two events A and B: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A) • Sum rule: the probability P(A ∨ B) of the disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Basic Probability Formulas (2) • Bayes theorem: the posterior probability P(h | D) of h given D: P(h | D) = P(D | h) P(h) / P(D) • Theorem of total probability: if the events A1, …, An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B | Ai) P(Ai)
Brute Force MAP Hypothesis Learner • For each hypothesis h in H, calculate the posterior probability P(h | D) = P(D | h) P(h) / P(D) • Output the hypothesis hMAP with the highest posterior probability
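A minimal Python sketch of this brute-force learner, assuming the hypothesis space is small enough to enumerate; the argument names (priors, likelihood) are illustrative and not part of the original slide:

```python
def brute_force_map(hypotheses, priors, likelihood, data):
    """Return the hypothesis maximizing P(D|h) * P(h).

    hypotheses : iterable of candidate hypotheses h
    priors     : dict mapping h -> P(h)
    likelihood : function (data, h) -> P(D|h)
    """
    # P(D) is the same for every h, so it can be dropped when
    # comparing posteriors.
    return max(hypotheses, key=lambda h: likelihood(data, h) * priors[h])
```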
Relation to Concept Learning (problem) Consider our usual concept learning task • instance space X, hypothesis space H, training examples D • consider the Find-S learning algorithm, which outputs the most specific hypothesis from the version space VSH,D • What would Bayes rule produce as the MAP hypothesis? • Does Find-S output a MAP hypothesis?
Relation to Concept Learning (Task for Brute Force) Assume a fixed set of instances <x1, …, xm>, and assume D is the set of classifications D = <c(x1), …, c(xm)>. Choose P(D | h): • P(D | h) = 1 if h is consistent with D • P(D | h) = 0 otherwise Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H
Relation to Concept Learning • Then P(h | D) = 1/|VSH,D| if h is consistent with D, and P(h | D) = 0 otherwise. • Conclusion: under these choices of P(h) and P(D | h), every hypothesis consistent with D has posterior probability 1/|VSH,D|, and every inconsistent hypothesis has posterior probability 0. Therefore every consistent hypothesis is a MAP hypothesis.
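The step the slide omits is evaluating P(D) with the theorem of total probability; a sketch of that calculation:

```latex
\[ P(D) = \sum_{h \in H} P(D \mid h) P(h)
        = \sum_{h \in VS_{H,D}} 1 \cdot \frac{1}{|H|}
        = \frac{|VS_{H,D}|}{|H|} \]
\[ P(h \mid D) = \frac{1 \cdot 1/|H|}{|VS_{H,D}| / |H|}
              = \frac{1}{|VS_{H,D}|} \quad \mbox{for consistent } h \]
```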
Characterizing Learning Algorithms by Equivalent MAP Learners • Implicit inductive bias: the target concept c is contained in the hypothesis space H
Characterizing Learning Algorithms by Equivalent MAP Learners • Explicit prior assumptions
Learning A Real Valued Function • Consider any real-valued target function f, with training examples <xi, di>, where di = f(xi) + ei is a noisy training value and ei is a random noise variable drawn independently for each xi from a Gaussian distribution with zero mean.
Learning A Real Valued Function • Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors: hML = argminh Σi (di − h(xi))²
Learning A Real Valued Function • The training examples are assumed to be mutually independent
Learning A Real Valued Function • Maximize the natural log of this instead…
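A sketch of the derivation under the zero-mean Gaussian noise assumption (the standard argument; the slide's intermediate formulas are not reproduced). Taking the natural log turns the product into a sum, and the terms that do not depend on h drop out:

```latex
\[ h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)
          = \arg\max_{h \in H} \prod_{i=1}^{m}
            \frac{1}{\sqrt{2\pi\sigma^2}}
            e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}} \]
\[ h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m}
            \left( -\frac{1}{2}\ln(2\pi\sigma^2)
                   - \frac{(d_i - h(x_i))^2}{2\sigma^2} \right)
          = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2 \]
```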
Conclusion on maximum likelihood and least-squared-error hypotheses • Under certain assumptions, any learning algorithm that minimizes the squared error between its output hypothesis's predictions and the training data will output a maximum likelihood hypothesis.
Learning to Predict Probabilities Consider predicting survival probability from patient data • Training examples <xi, di>, where di is 0 or 1 • Want to train a neural network to output a probability given xi (not a 0 or 1) • In this case, minimize the cross entropy.
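The corresponding maximum likelihood hypothesis, in its standard form (the slide's formula image is not reproduced), maximizes the negative cross entropy:

```latex
\[ h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m}
            \left( d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i)) \right) \]
```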
Minimum Description Length Principle • Occam's razor: prefer the shortest hypothesis. • MDL: prefer the hypothesis h that minimizes hMDL = argminh LC1(h) + LC2(D | h), where LC(x) is the description length of x under encoding C
Minimum Description Length Principle • Interesting fact from information theory: the optimal (shortest expected coding length) code assigns −log2 p bits to an event with probability p.
Minimum Description Length Principle Then: • −log2 P(h) is the length of h under the optimal code • −log2 P(D | h) is the length of D given h under the optimal code • So prefer the hypothesis that minimizes length(h) + length(misclassifications)
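Written out, the connection between the MAP hypothesis and the MDL principle is:

```latex
\[ h_{MAP} = \arg\max_{h \in H} P(D \mid h) P(h)
           = \arg\min_{h \in H} \left( -\log_2 P(D \mid h) - \log_2 P(h) \right) \]
```

which matches hMDL = argminh LC1(h) + LC2(D | h) when C1 and C2 are the optimal codes for hypotheses and for data given a hypothesis, respectively.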
Most Probable Classification of New Instances The most probable hypothesis given the data D is hMAP. Given a new instance x, what is its most probable classification? • hMAP(x) is not the most probable classification!
Consider • Three possible hypotheses: P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3 • Given a new instance x: h1(x) = ⊕, h2(x) = ⊖, h3(x) = ⊖ • What is the most probable classification of x?
Bayes Optimal Classifier • Bayes optimal classification • In the example above, V = {⊕, ⊖} and P(h1 | D) = 0.4, P(⊖ | h1) = 0, P(⊕ | h1) = 1; P(h2 | D) = 0.3, P(⊖ | h2) = 1, P(⊕ | h2) = 0; P(h3 | D) = 0.3, P(⊖ | h3) = 1, P(⊕ | h3) = 0
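The Bayes optimal classification rule, with the example worked out to its conclusion (the slide lists only the conditional probabilities):

```latex
\[ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i) P(h_i \mid D) \]
\[ \sum_{h_i} P(\oplus \mid h_i) P(h_i \mid D) = 1(0.4) + 0(0.3) + 0(0.3) = 0.4 \]
\[ \sum_{h_i} P(\ominus \mid h_i) P(h_i \mid D) = 0(0.4) + 1(0.3) + 1(0.3) = 0.6 \]
```

So the Bayes optimal classification of x is ⊖, even though the single most probable hypothesis h1 predicts ⊕.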
Gibbs Classifier • The Bayes optimal classifier gives the best result, but it can be expensive when there are many hypotheses. • Gibbs algorithm: • 1. Choose one hypothesis at random, according to the posterior distribution P(h | D) • 2. Use this hypothesis to classify the new instance
Gibbs Classifier Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then: E[errorGibbs] ≤ 2 E[errorBayesOptimal] • So, given a correct uniform prior distribution over H: • Pick any hypothesis from VSH,D with uniform probability • Its expected error is no worse than twice that of the Bayes optimal classifier
Naive Bayes Classifier Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods. When to use: • A moderate or large training set is available • The attributes that describe instances are conditionally independent given the classification Successful applications: • Diagnosis • Classifying text documents
Naive Bayes Classifier • Naive Bayes assumption: P(a1, a2, …, an | vj) = Πi P(ai | vj) • Naive Bayes classifier: vNB = argmaxvj∈V P(vj) Πi P(ai | vj)
Naive Bayes Algorithm • Naive_Bayes_Learn(examples) • For each target value vj, estimate P(vj) • For each attribute value ai of each attribute a, estimate P(ai | vj) • Classify_New_Instance(x): vNB = argmaxvj∈V P(vj) Πai∈x P(ai | vj), using the estimated probabilities
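A minimal Python sketch of these two procedures for discrete attributes; the data format (each example as a dict of attribute values plus a target label) is an assumption for illustration, not something specified on the slide:

```python
from collections import Counter, defaultdict
import math

def naive_bayes_learn(examples):
    """Estimate P(v_j) and P(a_i | v_j) from (attributes, target) pairs."""
    class_counts = Counter(target for _, target in examples)
    # attr_counts[v][(attr, value)] = count of examples with target v
    # whose attribute `attr` equals `value`
    attr_counts = defaultdict(Counter)
    for attrs, target in examples:
        for attr, value in attrs.items():
            attr_counts[target][(attr, value)] += 1
    n = len(examples)
    priors = {v: c / n for v, c in class_counts.items()}
    cond = {v: {av: c / class_counts[v] for av, c in counts.items()}
            for v, counts in attr_counts.items()}
    return priors, cond

def classify_new_instance(x, priors, cond):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
    def log_score(v):
        s = math.log(priors[v])
        for attr, value in x.items():
            # Unseen attribute values get a tiny floor here; the m-estimate
            # discussed later in this section is the principled fix.
            s += math.log(cond[v].get((attr, value), 1e-9))
        return s
    return max(priors, key=log_score)
```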
Naive Bayes: Example • Consider PlayTennis again and the new instance <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong> • Want to compute vNB = argmaxvj∈{yes,no} P(vj) P(sunny | vj) P(cool | vj) P(high | vj) P(strong | vj)
Table 3-2: Training examples for the target concept PlayTennis (table not reproduced here)
Naive Bayes: Example • P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053 • P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206 • vNB = no
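Normalizing the two quantities also gives the conditional probability that naive Bayes assigns to its answer (a small extra step not on the slide):

```latex
\[ P(no \mid sunny, cool, high, strong)
   \approx \frac{0.0206}{0.0206 + 0.0053} \approx 0.795 \]
```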
Naive Bayes: Subtleties • The conditional independence assumption is often violated • …but it works surprisingly well anyway. Note that the estimated posteriors need not be correct; we need only that they pick out the same most probable class as the true probabilities would • See [Domingos & Pazzani, 1996]
Naive Bayes: Subtleties • What if none of the training instances with target value vj have attribute value ai? Then the estimate of P(ai | vj) is 0, and so the whole product P(vj) Πi P(ai | vj) is estimated as 0 • The typical solution is a Bayesian (m-)estimate for P(ai | vj)
Naive Bayes: Subtleties The typical solution is the Bayesian m-estimate for P(ai | vj): • where n is the number of training examples for which v = vj • nc is the number of examples for which v = vj and a = ai • p is a prior estimate for P(ai | vj) • m is the weight given to the prior (i.e., the number of "virtual" examples)
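The m-estimate itself, in its standard form (the formula image on the slide is not reproduced):

```latex
\[ \hat{P}(a_i \mid v_j) = \frac{n_c + m p}{n + m} \]
```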
6.11 Bayesian Belief Networks • The naive Bayes assumption of conditional independence is too restrictive • Bayesian belief networks describe conditional independence among subsets of variables.
6.11 Bayesian Belief Networks • The network represents a set of conditional independence assumptions: each node, given its immediate parents, is conditionally independent of its nondescendants.
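These independence assumptions let the joint distribution over the network variables Y1, …, Yn factor into the local conditional probability tables; the standard factorization (stated here because the slide's network figure is not reproduced) is:

```latex
\[ P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i)) \]
```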
6.12 Expectation Maximization (EM) When to use: • Data is only partially observable • Unsupervised clustering (target value unobservable) • Supervised learning (some instance attributes unobservable)
Generating Data from a Mixture of k Gaussians • Each instance x is generated by: (1) choosing one of the k Gaussians with uniform probability, and (2) generating an instance at random according to that Gaussian
EM for Estimating k Means • Given: instances from X generated by a mixture of k Gaussian distributions; the means <μ1, …, μk> of the k Gaussians are unknown; we don't know which instance xi was generated by which Gaussian • Determine: the maximum likelihood estimates of <μ1, …, μk>
EM for Estimating k Means • k = 2 (illustrative figure not reproduced here)
EM for Estimating k Means Think of the full description of each instance as <xi, zi1, zi2>, where • zij is 1 if xi was generated by the jth Gaussian • xi is observable • zij is unobservable
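A minimal Python sketch of the resulting EM procedure for k = 2 with a known common variance; the variable names, the fixed iteration count, and the random initialization are illustrative assumptions rather than part of the slides:

```python
import math
import random

def em_two_means(xs, sigma2=1.0, iters=50):
    """Estimate the means mu1, mu2 of a mixture of two equal-variance Gaussians.

    E-step: E[z_ij] = p(x_i | mu_j) / sum_n p(x_i | mu_n)
    M-step: mu_j = sum_i E[z_ij] * x_i / sum_i E[z_ij]
    """
    mu = [random.choice(xs), random.choice(xs)]  # initial guesses for the means
    for _ in range(iters):
        # E-step: expected value of each hidden indicator z_ij
        resp = []
        for x in xs:
            # tiny floor guards against floating-point underflow
            p = [math.exp(-(x - m) ** 2 / (2 * sigma2)) + 1e-300 for m in mu]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each mean as a responsibility-weighted average
        for j in range(2):
            w = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / w
    return mu
```

The two steps repeat until the means stop changing; here a fixed iteration count stands in for a convergence test to keep the sketch short.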