Learn how to use Bayesian learning to classify oranges as good or bad based on sensor measurements. Understand the principles of decision-making and the use of loss functions to minimize risk.
Bayesian Learning. Thanks to Nir Friedman, HU.
Example • Suppose we are required to build a controller that removes bad oranges from a packaging line • Decisions are made based on a sensor that reports the overall color of the orange [Figure: a sensor observing oranges on the packaging line and diverting the bad oranges.]
Classifying oranges Suppose we know all the aspects of the problem: Prior probabilities: • Probability of good (+1) and bad (-1) oranges • P(C = +1) = probability of a good orange • P(C = -1) = probability of a bad orange • Note: P(C = +1) + P(C = -1) = 1 • Assumption: oranges are independent, so the occurrence of a bad orange does not depend on previous oranges
Classifying oranges (cont.) Sensor performance: • Let X denote the sensor measurement for each type of orange • The sensor is characterized by the class-conditional densities p(X | C = +1) and p(X | C = -1)
Bayes Rule • Given this knowledge, we can compute the posterior probabilities using Bayes rule: P(C | X) = p(X | C) P(C) / p(X)
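To make the computation concrete, here is a minimal Python sketch of Bayes rule for the orange example. The priors, the Gaussian shapes of the sensor densities, and all numbers are hypothetical choices for illustration, not values from the slides.

```python
from scipy.stats import norm

# Hypothetical priors and class-conditional densities for the orange example.
prior_good, prior_bad = 0.9, 0.1          # P(C = +1), P(C = -1)
lik_good = norm(loc=6.0, scale=1.0)       # p(X | C = +1): color score of good oranges
lik_bad  = norm(loc=3.0, scale=1.5)       # p(X | C = -1): color score of bad oranges

def posterior_good(x):
    """P(C = +1 | X = x) via Bayes rule."""
    joint_good = lik_good.pdf(x) * prior_good
    joint_bad  = lik_bad.pdf(x)  * prior_bad
    return joint_good / (joint_good + joint_bad)   # normalization by p(x)

print(posterior_good(5.0))   # e.g. probability that an orange with color 5.0 is good
```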
Posterior of Oranges [Figure: the data likelihoods p(X | C = +1) and p(X | C = -1), combined with the priors P(C = +1) and P(C = -1), and, after normalization, the posteriors P(C = +1 | X) and P(C = -1 | X).]
Decision making [Figure: the posteriors P(C = +1 | X) and P(C = -1 | X), with the range of X divided into "bad", "good", and "bad" regions.] Intuition: • Predict "Good" if P(C = +1 | X) > P(C = -1 | X) • Predict "Bad" otherwise
Loss function • Assume we have classes +1, -1 • Suppose we can make predictions a1, …, ak • A loss function L(ai, cj) describes the loss associated with making prediction ai when the class is cj [Table: loss values L(ai, cj), with predictions as rows and real labels as columns.]
Expected Risk • Given the estimates of P(C | X) we can compute the expected conditional risk of each decision: R(a | X) = Σc L(a, c) P(C = c | X)
The Risk in Oranges [Figure: the posteriors P(C = +1 | X) and P(C = -1 | X) together with the conditional risks R(Good | X) and R(Bad | X) as functions of the sensor reading X.]
Optimal Decisions Goal: • Minimize risk Optimal decision rule: "Given X = x, predict ai if R(ai | X = x) = min_a R(a | X = x)" • (break ties arbitrarily) Note: randomized decisions do not help
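The risk computation and the risk-minimizing rule can be sketched in a few lines of Python. The loss values and the posterior below are made-up numbers; only the formula R(a | X) = Σc L(a, c) P(C = c | X) and the arg-min rule come from the slides.

```python
# Hypothetical loss matrix L[(prediction, class)]: shipping a bad orange
# (predict "good" when the class is -1) is assumed to cost more than
# discarding a good one.
loss = {("good", "+1"): 0.0, ("good", "-1"): 10.0,
        ("bad",  "+1"): 1.0, ("bad",  "-1"): 0.0}

def expected_risk(action, posterior):
    """R(a | X) = sum over classes c of L(a, c) * P(C = c | X)."""
    return sum(loss[(action, c)] * p for c, p in posterior.items())

def optimal_decision(posterior):
    """Pick the action minimizing the conditional risk (ties broken arbitrarily)."""
    return min(["good", "bad"], key=lambda a: expected_risk(a, posterior))

posterior = {"+1": 0.7, "-1": 0.3}      # e.g. the output of Bayes rule for some x
print(optimal_decision(posterior))       # "bad": risk 0.7 versus risk 3.0 for "good"
```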
0-1 Loss • If we don't have prior knowledge, it is common to use the 0-1 loss • L(a, c) = 0 if a = c • L(a, c) = 1 otherwise Consequence: • R(a | X) = P(a ≠ C | X) • Decision rule: "choose ai if P(C = ai | X) = max_a P(C = a | X)"
Bayesian Decisions: Summary Decisions based on two components: • Conditional distribution P(C | X) • Loss function L(A, C) Pros: • Specifies optimal actions in the presence of noisy signals • Can deal with skewed loss functions Cons: • Requires P(C | X)
Simple Statistics: Binomial Experiment • When a thumbtack is tossed, it can land in one of two positions: Head or Tail • We denote by θ the (unknown) probability P(H) Estimation task: • Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ
Why Learning is Possible? Suppose we perform M independent flips of the thumbtack • The number of heads NH we see has a binomial distribution: P(NH = k) = C(M, k) θ^k (1 - θ)^(M - k) • and thus E[NH / M] = θ, with a variance θ(1 - θ) / M that shrinks as M grows • This suggests that we can estimate θ by NH / M
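A quick simulation illustrates this concentration. The true θ and the random seed are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                  # true (unknown) P(H), chosen for illustration

# As M grows, the fraction of heads NH / M concentrates around theta,
# which is why estimating theta from data is possible at all.
for M in [10, 100, 1000, 100000]:
    flips = rng.random(M) < theta            # True = Head, with probability theta
    print(M, flips.mean())                   # the estimate NH / M
```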
Maximum Likelihood Estimation MLE Principle: Learn the parameters that maximize the likelihood function L(θ | D) = P(D | θ) • This is one of the most commonly used estimators in statistics • Intuitively appealing • Well-studied properties
Computing the Likelihood Functions • To compute the likelihood in the thumbtack example we only require NH and NT (the number of heads and the number of tails): L(θ | D) = θ^NH (1 - θ)^NT • Applying the MLE principle we get the estimate θ = NH / (NH + NT) • NH and NT are sufficient statistics for the binomial distribution
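The following sketch evaluates the log-likelihood from the sufficient statistics NH and NT alone and checks that its maximizer over a grid agrees with NH / (NH + NT). The counts are hypothetical.

```python
import numpy as np

def log_likelihood(theta, n_h, n_t):
    """log L(theta | D) = NH * log(theta) + NT * log(1 - theta).
    Only the sufficient statistics NH, NT are needed, not the full sequence."""
    return n_h * np.log(theta) + n_t * np.log(1.0 - theta)

n_h, n_t = 7, 3                              # hypothetical counts
thetas = np.linspace(0.01, 0.99, 99)
mle = thetas[np.argmax(log_likelihood(thetas, n_h, n_t))]
print(mle, n_h / (n_h + n_t))                # both are approximately 0.7
```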
Sufficient Statistics • A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood • Formally, s(D) is a sufficient statistic if for any two datasets D and D': s(D) = s(D') implies L(θ | D) = L(θ | D')
Maximum A Posteriori (MAP) • Suppose we observe the sequence H, H • The MLE estimate is P(H) = 1, P(T) = 0 • Should we really believe that tails are impossible at this stage? • Such an estimate can have disastrous effects • If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible
Laplace Correction Suppose we observe n coin flips with k heads • MLE: k / n • Laplace correction: (k + 1) / (n + 2), as though we observed one additional H and one additional T • Can we justify this estimate? Uniform prior!
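A tiny sketch contrasting the two estimators on the H, H sequence discussed above; the function names are invented for illustration.

```python
def mle_estimate(k, n):
    """MLE: the raw fraction of heads, extreme at the boundaries."""
    return k / n

def laplace_estimate(k, n):
    """Laplace correction: pretend we saw one extra H and one extra T."""
    return (k + 1) / (n + 2)

# After observing H, H the MLE claims tails are impossible; Laplace does not.
print(mle_estimate(2, 2))       # 1.0
print(laplace_estimate(2, 2))   # 0.75
```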
Bayesian Reasoning • In Bayesian reasoning we represent our uncertainty about the unknown parameter by a probability distribution • This probability distribution can be viewed as subjective probability • This is a personal judgment of uncertainty
Bayesian Inference We start with • P(θ) - prior distribution over the values of θ • P(x1, …, xn | θ) - likelihood of the examples given a known value θ Given examples x1, …, xn, we can compute the posterior distribution on θ: P(θ | x1, …, xn) = P(x1, …, xn | θ) P(θ) / P(x1, …, xn), where the marginal likelihood is P(x1, …, xn) = ∫ P(x1, …, xn | θ) P(θ) dθ
Binomial Distribution: Laplace Est. • In this case the unknown parameter is θ = P(H) • Simplest prior: P(θ) = 1 for 0 < θ < 1 (uniform) • Likelihood: P(x1, …, xn | θ) = θ^k (1 - θ)^(n - k), where k is the number of heads in the sequence • Marginal likelihood: P(x1, …, xn) = ∫0^1 θ^k (1 - θ)^(n - k) dθ
Marginal Likelihood Using integration by parts we have: ∫0^1 θ^k (1 - θ)^(n - k) dθ = ((n - k) / (k + 1)) ∫0^1 θ^(k + 1) (1 - θ)^(n - k - 1) dθ Multiplying both sides by C(n, k), we have C(n, k) ∫0^1 θ^k (1 - θ)^(n - k) dθ = C(n, k + 1) ∫0^1 θ^(k + 1) (1 - θ)^(n - k - 1) dθ
Marginal Likelihood - Cont • The recursion terminates when k = n, where C(n, n) ∫0^1 θ^n dθ = 1 / (n + 1) Thus P(x1, …, xn) = ∫0^1 θ^k (1 - θ)^(n - k) dθ = 1 / ((n + 1) C(n, k)) We conclude that the posterior is P(θ | x1, …, xn) = (n + 1) C(n, k) θ^k (1 - θ)^(n - k)
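The closed form can be checked numerically, e.g. with scipy; the counts n and k below are arbitrary example values.

```python
from math import comb
from scipy.integrate import quad

n, k = 10, 3                                             # hypothetical counts

# Numerical integral of theta^k * (1 - theta)^(n - k) over [0, 1] ...
numeric, _ = quad(lambda t: t**k * (1 - t)**(n - k), 0.0, 1.0)

# ... matches the closed form 1 / ((n + 1) * C(n, k)) derived above.
closed_form = 1.0 / ((n + 1) * comb(n, k))
print(numeric, closed_form)
```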
Bayesian Prediction • How do we predict using the posterior? • We can think of this as computing the probability of the next element in the sequence: P(xn+1 | x1, …, xn) = ∫ P(xn+1 | θ) P(θ | x1, …, xn) dθ • Assumption: if we know θ, the probability of Xn+1 is independent of X1, …, Xn
Bayesian Prediction • Thus, we conclude that P(Xn+1 = H | x1, …, xn) = ∫0^1 θ (n + 1) C(n, k) θ^k (1 - θ)^(n - k) dθ = (k + 1) / (n + 2) • This is exactly the Laplace correction
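Numerically, integrating θ against the posterior derived above indeed reproduces the Laplace-corrected estimate (k + 1) / (n + 2); the counts are again arbitrary.

```python
from math import comb
from scipy.integrate import quad

n, k = 10, 3                                             # hypothetical counts

# P(X_{n+1} = H | data) = integral of theta * posterior(theta) d(theta), where the
# posterior is (n + 1) * C(n, k) * theta^k * (1 - theta)^(n - k).
posterior = lambda t: (n + 1) * comb(n, k) * t**k * (1 - t)**(n - k)
pred, _ = quad(lambda t: t * posterior(t), 0.0, 1.0)
print(pred, (k + 1) / (n + 2))                           # both are 4/12, about 0.333
```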
Bayesian Classification: Binary Domain Consider the following situation: • Two classes: -1, +1 • Each example is described by N attributes • Each attribute Xi is a binary variable with values 0, 1 Example dataset:
Binary Domain - Priors How do we estimate P(C) ? • Simple Binomial estimation • Count # of instances with C = -1, and with C = +1
Binary Domain - Attribute Probability How do we estimate P(X1, …, XN | C)? Two sub-problems: • Training set for P(X1, …, XN | C = +1) • Training set for P(X1, …, XN | C = -1)
Naïve Bayes Naïve Bayes: • Assume P(X1, …, XN | C) = Πi P(Xi | C) • This is an independence assumption • Each attribute Xi is independent of the other attributes once we know the value of C
Naïve Bayes: Boolean Domain • Parameters: θi|+1 = P(Xi = 1 | C = +1) and θi|-1 = P(Xi = 1 | C = -1) for each i How do we estimate θ1|+1? • Simple binomial estimation • Count the #1 and #0 values of X1 in instances where C = +1
Interpretation of Naïve Bayes • Each Xi “votes” about the prediction • If P(Xi|C=-1) = P(Xi|C=+1) then Xi has no say in classification • If P(Xi|C=-1) = 0 then Xi overrides all other votes (“veto”)
Interpretation of Naïve Bayes Set the log-odds score: log(P(C = +1 | X) / P(C = -1 | X)) = log(P(C = +1) / P(C = -1)) + Σi log(P(Xi | C = +1) / P(Xi | C = -1)) Classification rule: predict +1 if the score is positive, -1 otherwise
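Putting the pieces together, here is a minimal Boolean-domain naïve Bayes sketch: binomial parameter estimation with Laplace smoothing, and prediction by the sign of the log-odds score. The toy dataset and the smoothing constant alpha are invented for illustration.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate P(C) and theta_{i|c} = P(Xi = 1 | C = c) with Laplace smoothing.
    X: (m, N) array of 0/1 attributes, y: array of class labels in {-1, +1}."""
    params = {}
    for c in (-1, +1):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "theta": (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha),
        }
    return params

def predict(params, x):
    """Each attribute 'votes' through its log-likelihood ratio; the prior adds a bias."""
    score = np.log(params[+1]["prior"]) - np.log(params[-1]["prior"])
    for c, sign in ((+1, +1.0), (-1, -1.0)):
        theta = params[c]["theta"]
        score += sign * np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
    return +1 if score > 0 else -1

# Toy data: attribute 0 is informative, attribute 1 is pure noise.
X = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]])
y = np.array([+1, +1, +1, -1, -1, -1])
params = fit_naive_bayes(X, y)
print(predict(params, np.array([1, 1])))   # +1: attribute 0 dominates the vote
```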
Normal Distribution The Gaussian distribution: p(x) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)) [Figure: the densities of N(0, 1²) and N(4, 2²).]
Maximum Likelihood Estimate • Suppose we observe x1, …, xm • Simple calculations show that the MLE is μ = (1/m) Σi xi and σ² = (1/m) Σi (xi - μ)² • Sufficient statistics are Σi xi and Σi xi²
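A short sketch of the Gaussian MLE computed from the sufficient statistics Σ xi and Σ xi²; the data are simulated from an arbitrary N(4, 2²).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=4.0, scale=2.0, size=10_000)   # hypothetical samples

# MLE from the sufficient statistics sum(x) and sum(x**2).
m = len(x)
s1, s2 = x.sum(), (x**2).sum()
mu_hat = s1 / m
sigma2_hat = s2 / m - mu_hat**2                    # equals mean((x - mu_hat)**2)
print(mu_hat, sigma2_hat)                          # close to 4.0 and 4.0
```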
Naïve Bayes with Gaussian Distributions • Recall the log-odds score: log(P(C = +1) / P(C = -1)) + Σi log(p(Xi | C = +1) / p(Xi | C = -1)) • Assume: p(Xi | C = c) = N(μi,c, σi²) • The mean of Xi depends on the class • The variance of Xi does not
Naïve Bayes with Gaussian Distributions Recall the per-attribute term; with equal variances it becomes log(p(Xi | C = +1) / p(Xi | C = -1)) = ((μi,+1 - μi,-1) / σi²) (Xi - (μi,+1 + μi,-1) / 2) • (μi,+1 - μi,-1) is the distance between the means • (Xi - (μi,+1 + μi,-1) / 2) is the distance of Xi to the midway point
Different Variances? • If we allow different variances for the two classes, the classification rule is more complex • The per-attribute log-likelihood-ratio term is then quadratic in Xi
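To close, a sketch of Gaussian naïve Bayes under the shared-variance assumption from the previous slides, where the log-odds contribution of each attribute is exactly the linear term derived above. The pooled-variance estimate and the toy data are illustrative choices, not part of the original slides.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors, per-class means, and a pooled per-attribute variance
    (the shared-variance assumption)."""
    model = {"prior": {}, "mu": {}}
    for c in (-1, +1):
        Xc = X[y == c]
        model["prior"][c] = len(Xc) / len(X)
        model["mu"][c] = Xc.mean(axis=0)
    resid = np.concatenate([X[y == c] - model["mu"][c] for c in (-1, +1)])
    model["var"] = (resid**2).mean(axis=0)       # one variance per attribute
    return model

def log_odds(model, x):
    """Prior log-odds plus, for each attribute i, the linear term
    ((mu_{i,+1} - mu_{i,-1}) / sigma_i^2) * (x_i - (mu_{i,+1} + mu_{i,-1}) / 2)."""
    mu_p, mu_m, var = model["mu"][+1], model["mu"][-1], model["var"]
    score = np.log(model["prior"][+1]) - np.log(model["prior"][-1])
    return score + np.sum((mu_p - mu_m) / var * (x - (mu_p + mu_m) / 2))

# Toy data: class-dependent mean on attribute 0, shared unit variance.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([0.0, 0.0], 1.0, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)
model = fit_gaussian_nb(X, y)
print(log_odds(model, np.array([2.0, 0.0])))     # positive: classified as +1
```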