Bayesian Decision Theory (Classification)
Lecturer: 虞台文
Contents • Introduction • Generalized Bayesian Decision Rule • Discriminant Functions • The Normal Distribution • Discriminant Functions for the Normal Populations • Minimax Criterion • Neyman-Pearson Criterion
Bayesian Decision Theory (Classification) Introduction
What is Bayesian Decision Theory? • A mathematical foundation for decision making. • It uses a probabilistic approach to make decisions (e.g., classification) so as to minimize the risk (cost).
Preliminaries and Notations • ωi: a state of nature • P(ωi): prior probability • x: feature vector • p(x|ωi): class-conditional density • P(ωi|x): posterior probability
Decision By Bayes rule, P(ωi|x) = p(x|ωi)P(ωi) / p(x), where p(x) = Σj p(x|ωj)P(ωj). The evidence p(x) is the same for every class, so it is unimportant in making the decision.
Decision Decide ωi if P(ωi|x) > P(ωj|x) ∀j ≠ i, or equivalently, decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) ∀j ≠ i. • Special cases: • P(ω1) = P(ω2) = … = P(ωc): the decision depends only on the class-conditional densities. • p(x|ω1) = p(x|ω2) = … = p(x|ωc): the decision depends only on the priors.
Two Categories Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2. Equivalently, decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2. • Special cases: • 1. P(ω1) = P(ω2): decide ω1 if p(x|ω1) > p(x|ω2); otherwise decide ω2. • 2. p(x|ω1) = p(x|ω2): decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
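To make the rule concrete, here is a minimal Python sketch of the two-category decision rule. The Gaussian class-conditional densities and the priors are illustrative assumptions, not values from the slides.

```python
# A minimal sketch of the two-category Bayes decision rule.
# The class-conditional densities and priors below are made-up examples.
from scipy.stats import norm

prior = {1: 0.5, 2: 0.5}                   # P(ω1), P(ω2)
cond = {1: norm(loc=-1.0, scale=1.0),      # p(x|ω1)
        2: norm(loc=+2.0, scale=1.5)}      # p(x|ω2)

def decide(x):
    """Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise ω2."""
    s1 = cond[1].pdf(x) * prior[1]
    s2 = cond[2].pdf(x) * prior[2]
    return 1 if s1 > s2 else 2

print(decide(0.0))   # -> 1 here: the scaled likelihood of ω1 is larger
```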
Example (figure): class-conditional densities and the resulting decision regions R1 and R2 for equal priors P(ω1) = P(ω2).
Example (figure): decision regions R1 and R2 for priors P(ω1) = 2/3, P(ω2) = 1/3. Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2.
Classification Error Consider two categories: decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2. Whichever we decide, the probability of error given x is P(error|x) = min[P(ω1|x), P(ω2|x)], and the average error is P(error) = ∫ P(error|x) p(x) dx.
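As a sketch, the average error can be evaluated numerically as the integral of min[p(x|ω1)P(ω1), p(x|ω2)P(ω2)]; the densities and priors below are illustrative assumptions, not from the slides.

```python
# Sketch: numerically evaluate
#   P(error) = ∫ min[p(x|ω1)P(ω1), p(x|ω2)P(ω2)] dx
# for made-up Gaussian class-conditional densities and priors.
import numpy as np
from scipy.stats import norm

p1, p2 = 0.5, 0.5                          # priors P(ω1), P(ω2)
f1, f2 = norm(-1.0, 1.0), norm(2.0, 1.5)   # class-conditional densities

xs = np.linspace(-10.0, 12.0, 20001)
integrand = np.minimum(f1.pdf(xs) * p1, f2.pdf(xs) * p2)
p_error = integrand.sum() * (xs[1] - xs[0])   # simple Riemann sum
print(f"P(error) ~ {p_error:.4f}")
```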
Bayesian Decision Theory (Classification) Generalized Bayesian Decision Rule
The Generalization • Ω = {ω1, …, ωc}: a set of c states of nature. • A = {α1, …, αa}: a set of a possible actions. • λ(αi|ωj): the loss incurred for taking action αi when the true state of nature is ωj. The loss can be zero. We want to minimize the expected loss in making decisions.
Conditional Risk Given x, the expected loss (risk) associated with taking action αi: R(αi|x) = Σj λ(αi|ωj) P(ωj|x).
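The conditional risk is just a loss-matrix/posterior product. A minimal sketch, with a made-up loss matrix and posterior:

```python
# Sketch: conditional risk R(αi|x) = Σj λ(αi|ωj) P(ωj|x)
# as a loss-matrix / posterior product. All numbers are illustrative.
import numpy as np

lam = np.array([[0.0, 2.0],        # row i: action αi, column j: state ωj
                [1.0, 0.0]])
posterior = np.array([0.7, 0.3])   # P(ω1|x), P(ω2|x)

R = lam @ posterior                # R[i] = R(α(i+1) | x)
best_action = np.argmin(R)         # Bayes rule: take the minimum-risk action
print(R, best_action)
```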
Decision Bayesian Decision Rule: α(x) = argmin_i R(αi|x), i.e., take the action whose conditional risk is minimum.
Overall Risk R = ∫ R(α(x)|x) p(x) dx, where α(·) is the decision function. • Bayesian decision rule: the optimal one to minimize the overall risk. • Its resulting overall risk is called the Bayes risk.
Two-Category Classification Loss function λij = λ(αi|ωj), with action αi and state of nature ωj: α1 incurs λ11 under ω1 and λ12 under ω2; α2 incurs λ21 under ω1 and λ22 under ω2.
Two-Category Classification Perform α1 if R(α2|x) > R(α1|x); otherwise perform α2, where R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x) and R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x).
Two-Category Classification The rule becomes: perform α1 if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x). Both (λ21 − λ11) and (λ12 − λ22) are positive (a wrong action costs more than a correct one), so the posterior probabilities are scaled before comparison.
Two-Category Classification By Bayes rule, the evidence p(x) is irrelevant: perform α1 if (λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2).
Two-Category Classification (this slide will be recalled later) Likelihood ratio vs. threshold: perform α1 if p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)], where the right-hand side is a threshold θ that does not depend on x.
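A sketch of the likelihood-ratio rule, with the threshold θ built from assumed losses and priors (all numbers illustrative):

```python
# Sketch of the likelihood-ratio rule: perform α1 iff
#   p(x|ω1)/p(x|ω2) > θ = [(λ12-λ22)/(λ21-λ11)] * [P(ω2)/P(ω1)].
# Losses, priors, and densities are made-up examples.
import numpy as np
from scipy.stats import norm

lam = np.array([[0.0, 2.0],        # λij = λ(αi|ωj)
                [1.0, 0.0]])
p1, p2 = 0.6, 0.4                  # priors P(ω1), P(ω2)
f1, f2 = norm(-1.0, 1.0), norm(2.0, 1.5)

theta = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * (p2 / p1)

def action(x):
    ratio = f1.pdf(x) / f2.pdf(x)  # likelihood ratio
    return 1 if ratio > theta else 2

print(theta, action(0.0))
```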
Bayesian Decision Theory (Classification) Discriminant Functions
The Multicategory Classification (figure: a network computes g1(x), g2(x), …, gc(x) from x and takes the action, e.g., classification, with the maximum output.) The gi(x)'s are called the discriminant functions. Assign x to ωi if gi(x) > gj(x) for all j ≠ i. How to define discriminant functions?
Simple Discriminant Functions If f(·) is a monotonically increasing function, then the f(gi(·))'s are also discriminant functions. • Minimum Risk case: gi(x) = −R(αi|x). • Minimum Error-Rate case: gi(x) = P(ωi|x).
Decision Regions (figure: two-category example) The discriminant functions partition the feature space into decision regions, separated by decision boundaries.
Bayesian Decision Theory (Classification) The Normal Distribution
Basics of Probability • Discrete random variable X (assume integer-valued): probability mass function (pmf) P(x) = Pr[X = x]; cumulative distribution function (cdf) F(x) = Pr[X ≤ x] = Σ(k ≤ x) P(k). • Continuous random variable X: probability density function (pdf) p(x), which is not a probability (it may exceed 1); cdf F(x) = ∫ p(t) dt over t ≤ x.
Expectations Let g be a function of a random variable X: E[g(X)] = Σx g(x)P(x) (discrete) or ∫ g(x)p(x) dx (continuous). • The kth moment: E[X^k]. • The 1st moment: E[X] (the mean). • The kth central moment: E[(X − E[X])^k].
Important Expectations • Mean: μ = E[X]. • Variance: Var[X] = σ² = E[(X − μ)²]. Fact: Var[X] = E[X²] − (E[X])².
Entropy H[p] = −∫ p(x) ln p(x) dx. The entropy measures the fundamental uncertainty in the value of points selected randomly from a distribution.
Univariate Gaussian Distribution X ~ N(μ, σ²): p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)), with E[X] = μ and Var[X] = σ². (Figure: the bell-shaped density with the regions μ ± σ, μ ± 2σ, μ ± 3σ marked.) • Properties: • Maximizes the entropy among all distributions with a given mean and variance. • Central limit theorem: sums of many independent random variables tend toward a Gaussian.
Random Vectors A d-dimensional random vector X = (X1, …, Xd)ᵀ. • Mean vector: μ = E[X] = (E[X1], …, E[Xd])ᵀ. • Covariance matrix: Σ = E[(X − μ)(X − μ)ᵀ].
Multivariate Gaussian Distribution X ~ N(μ, Σ) for a d-dimensional random vector: p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)), with E[X] = μ and E[(X − μ)(X − μ)ᵀ] = Σ.
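A quick sketch evaluating this density with scipy; the mean and covariance are illustrative assumptions:

```python
# Sketch: evaluate the multivariate Gaussian pdf with scipy
# (mean and covariance are made-up examples).
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

p = multivariate_normal(mean=mu, cov=Sigma)
print(p.pdf(np.array([0.5, 0.5])))   # density at a single point
```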
Properties of N(μ, Σ) Let X ~ N(μ, Σ) be a d-dimensional random vector, and let Y = AᵀX, where A is a d × k matrix. Then Y ~ N(Aᵀμ, AᵀΣA): any linear transform of a Gaussian vector is Gaussian.
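This property can be checked empirically by sampling; the dimensions, μ, Σ, and A below are arbitrary choices for illustration:

```python
# Sketch: empirically check that Y = AᵀX has mean Aᵀμ and covariance AᵀΣA.
# μ, Σ, and A are arbitrary; d = 3, k = 2.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = rng.standard_normal((3, 2))                  # d × k

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A                                        # each row is y = Aᵀx

print(np.allclose(Y.mean(axis=0), A.T @ mu, atol=0.05))       # -> True
print(np.allclose(np.cov(Y.T), A.T @ Sigma @ A, atol=0.05))   # -> True
```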
On Parameters of N(μ, Σ) For X ~ N(μ, Σ): μi = E[Xi], σij = Cov(Xi, Xj) = E[(Xi − μi)(Xj − μj)], and σii = Var(Xi).
More On Covariance Matrix Σ is symmetric and positive semidefinite, so it admits the spectral decomposition Σ = ΦΛΦᵀ. • Φ: orthonormal matrix whose columns are the eigenvectors of Σ. • Λ: diagonal matrix of the eigenvalues.
Whitening Transform X ~ N(μ, Σ); a linear transform Y = AᵀX gives Y ~ N(Aᵀμ, AᵀΣA). Let Aw = ΦΛ^(−1/2). Then AwᵀΣAw = Λ^(−1/2)Φᵀ(ΦΛΦᵀ)ΦΛ^(−1/2) = I, so Y = AwᵀX ~ N(Awᵀμ, I): Φᵀ projects onto the eigenvector directions, Λ^(−1/2) rescales each coordinate, and the result is whitened.
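A minimal numpy sketch of the whitening transform Aw = ΦΛ^(−1/2), using an assumed covariance matrix:

```python
# Sketch of the whitening transform A_w = Φ Λ^(-1/2):
# after Y = A_wᵀ X the covariance becomes the identity.
import numpy as np

Sigma = np.array([[2.0, 0.8],     # an assumed covariance matrix
                  [0.8, 1.0]])

eigvals, Phi = np.linalg.eigh(Sigma)   # Σ = Φ diag(eigvals) Φᵀ
A_w = Phi @ np.diag(eigvals ** -0.5)   # A_w = Φ Λ^(-1/2)

print(np.allclose(A_w.T @ Sigma @ A_w, np.eye(2)))   # -> True
```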
Mahalanobis Distance For X ~ N(μ, Σ), r² = (x − μ)ᵀΣ⁻¹(x − μ) is the squared Mahalanobis distance from x to μ. The loci of points of constant density are hyperellipsoids of constant r²; their size depends on the value of r².
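A small sketch of the squared Mahalanobis distance, with assumed μ and Σ:

```python
# Sketch: squared Mahalanobis distance r² = (x - μ)ᵀ Σ⁻¹ (x - μ).
# μ and Σ are made-up examples.
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

def mahalanobis_sq(x):
    d = x - mu
    # Solving Σz = d avoids forming the explicit inverse.
    return float(d @ np.linalg.solve(Sigma, d))

print(mahalanobis_sq(np.array([1.0, 1.0])))
```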
Bayesian Decision Theory (Classification) Discriminant Functions for the Normal Populations
Minimum-Error-Rate Classification Assume Gaussian class-conditional densities, Xi ~ N(μi, Σi). Then gi(x) = ln p(x|ωi) + ln P(ωi) = −½ (x − μi)ᵀΣi⁻¹(x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi).
Minimum-Error-Rate Classification Three Cases: • Case 1: Σi = σ²I — classes are centered at different means, and their feature components are pairwise independent with the same variance. • Case 2: Σi = Σ — classes are centered at different means but share the same covariance matrix. • Case 3: Σi arbitrary.
Case 1. Σi = σ²I Then gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi), dropping the constant terms (irrelevant). Expanding ‖x − μi‖² = xᵀx − 2μiᵀx + μiᵀμi, the xᵀx term is the same for every class (irrelevant), leaving the linear discriminant gi(x) = (μiᵀx)/σ² − (μiᵀμi)/(2σ²) + ln P(ωi).
Case 1. Σi = σ²I Boundary between ωi and ωj: setting gi(x) = gj(x) gives the hyperplane wᵀ(x − x0) = 0, where w = μi − μj and x0 = ½(μi + μj) − [σ² / ‖μi − μj‖²] ln [P(ωi)/P(ωj)] (μi − μj).
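A sketch of the Case-1 linear discriminant with made-up means, variance, and priors:

```python
# Sketch of the Case-1 (Σi = σ²I) linear discriminant
#   gi(x) = (μiᵀx)/σ² - (μiᵀμi)/(2σ²) + ln P(ωi).
# Means, variance, and priors are illustrative assumptions.
import numpy as np

sigma2 = 1.0
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]   # μ1, μ2
priors = [0.6, 0.4]                                  # P(ω1), P(ω2)

def g(i, x):
    return (mus[i] @ x) / sigma2 \
           - (mus[i] @ mus[i]) / (2 * sigma2) \
           + np.log(priors[i])

x = np.array([1.5, 0.5])
print(1 if g(0, x) > g(1, x) else 2)   # assign x to the larger discriminant
```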