This chapter discusses probability distributions and density estimation, as well as the concept of Bayesian inference and the use of conjugate priors for binary and multinomial variables. It also introduces the Gaussian distribution and its applications.
Pattern Recognition and Machine Learning
Chapter 2: Probability Distributions
Chonbuk National University
Probability Distributions: General
Density estimation: given a finite set x_1, ..., x_N of observations, find the distribution p(x) of x.
Frequentist's way: choose specific parameter values by optimizing a criterion (e.g., the likelihood).
Bayesian way: place a prior distribution over the parameters and compute the posterior distribution with Bayes' rule.
Conjugate prior: leads to a posterior distribution of the same functional form as the prior (which makes life a lot easier).
Binary Variables: Frequentist's Way
Given a binary random variable x ∈ {0, 1} (e.g., tossing a coin) with
p(x=1|µ) = µ,  p(x=0|µ) = 1 − µ.   (2.1)
p(x) can be described by the Bernoulli distribution:
Bern(x|µ) = µ^x (1 − µ)^(1−x).   (2.2)
The maximum likelihood estimate for µ is
µ_ML = m / N,  with m = (# observations of x = 1).   (2.8)
Yet this can lead to overfitting (especially for small N), e.g., N = m = 3 yields µ_ML = 1.
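To make the estimate concrete, here is a minimal NumPy sketch of the ML estimate (2.8) and the small-N overfitting effect; the coin data and sample sizes are made up for illustration.

```python
import numpy as np

def bernoulli_mle(x):
    """ML estimate of mu for Bernoulli data: mu_ML = m / N (Eq. 2.8)."""
    x = np.asarray(x)
    return x.sum() / x.size

rng = np.random.default_rng(0)
true_mu = 0.5

# Small sample: three heads in a row gives mu_ML = 1 (overfitting).
small = np.array([1, 1, 1])
print("N=3    :", bernoulli_mle(small))

# Larger sample: the estimate approaches the true value.
large = rng.binomial(1, true_mu, size=10_000)
print("N=10000:", bernoulli_mle(large))
```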
Binary Variables: Bayesian Way (1)
The binomial distribution describes the number m of observations of x = 1 out of a data set of size N (N = fixed number of trials, m = number of successes in N trials, µ = probability of success, 1 − µ = probability of failure):
Bin(m|N, µ) = (N choose m) µ^m (1 − µ)^(N−m),   (2.9)
where
(N choose m) = N! / ((N − m)! m!).   (2.10)
[Figure: histogram plot of the binomial distribution for N = 10 and µ = 0.25.]
Binary Variables: Bayesian Way (2)
For a Bayesian treatment, we take the beta distribution as conjugate prior by introducing a prior distribution over the parameter µ:
Beta(µ|a, b) = Γ(a + b) / (Γ(a) Γ(b)) · µ^(a−1) (1 − µ)^(b−1).   (2.13)
(The gamma function extends the factorial to real numbers, i.e., Γ(n) = (n − 1)!.)
Mean and variance are given by
E[µ] = a / (a + b),   (2.15)
var[µ] = ab / ((a + b)^2 (a + b + 1)).   (2.16)
Binary Variables: Beta Distribution
[Figure: plots of the beta distribution (2.13) as a function of µ for various values of the hyperparameters: (a, b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4).]
Binary Variables: Bayesian Way (3)
Multiplying the binomial likelihood function (2.9) and the beta prior (2.13), the posterior is again a beta distribution and has the form
p(µ|m, l, a, b) ∝ µ^(m+a−1) (1 − µ)^(l+b−1),   (2.17)
with l = N − m.
This gives a simple interpretation of the hyperparameters a and b as effective numbers of observations of x = 1 and x = 0 (a priori).
As we observe new data, a and b are updated.
As N → ∞, the variance (uncertainty) decreases and the mean converges to the ML estimate.
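A minimal sketch of this conjugate update, using the posterior mean (2.15) and variance (2.16); the prior hyperparameters and counts below are assumed for illustration.

```python
# Prior hyperparameters (illustrative): Beta(mu | a, b).
a, b = 2.0, 2.0

# Hypothetical data: m observations of x = 1 and l = N - m observations of x = 0.
m, l = 7, 3

# Posterior is Beta(mu | a + m, b + l), Eq. (2.17).
a_post, b_post = a + m, b + l

post_mean = a_post / (a_post + b_post)                                          # Eq. (2.15)
post_var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))   # Eq. (2.16)

print("posterior mean    :", post_mean)
print("posterior variance:", post_var)
print("ML estimate       :", m / (m + l))   # posterior mean approaches this as N grows
```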
Multinomial Variables: Frequentist's Way
A random variable with K mutually exclusive states can be represented as a K-dimensional vector x in which one element x_k equals 1 and all remaining elements equal 0 (1-of-K coding).
The Bernoulli distribution can be generalized to
p(x|µ) = ∏_{k=1..K} µ_k^(x_k).   (2.26)
For a data set D with N independent observations, the corresponding likelihood function takes the form
p(D|µ) = ∏_{k=1..K} µ_k^(m_k),  with m_k = Σ_n x_nk.   (2.29)
The maximum likelihood estimate for µ_k is
µ_k^ML = m_k / N.   (2.33)
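A brief sketch of (2.33) on made-up 1-of-K data.

```python
import numpy as np

# Hypothetical 1-of-K observations over K = 3 states (one row per observation).
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [1, 0, 0]])

m_k = X.sum(axis=0)          # counts of each state
mu_ml = m_k / X.shape[0]     # mu_k^ML = m_k / N, Eq. (2.33)
print(mu_ml)                 # [0.6 0.2 0.2]
```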
Multinomial Variables: Bayesian Way (1)
The multinomial distribution is the joint distribution of the counts m_1, ..., m_K, conditioned on µ and N:
Mult(m_1, m_2, ..., m_K | µ, N) = (N choose m_1 m_2 ... m_K) ∏_{k=1..K} µ_k^(m_k).   (2.34)
The normalization coefficient is the number of ways of partitioning N objects into K groups of size m_1, ..., m_K and is given by
(N choose m_1 m_2 ... m_K) = N! / (m_1! m_2! ... m_K!),   (2.35)
where the variables m_k are subject to the constraint
Σ_{k=1..K} m_k = N.   (2.36)
Multinomial Variables: Bayesian Way (2)
For a Bayesian treatment, the Dirichlet distribution can be taken as conjugate prior:
Dir(µ|α) = Γ(α_0) / (Γ(α_1) ... Γ(α_K)) ∏_{k=1..K} µ_k^(α_k − 1),   (2.38)
where Γ(·) is the gamma function and
α_0 = Σ_{k=1..K} α_k.   (2.39)
Multinomial Variables: Dirichlet Distribution
[Figure: plots of the Dirichlet distribution over three variables for various values of the parameters, e.g., α = (0.1, 0.1, 0.1), (1, 1, 1), (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).]
Multinomial Variables: Bayesian Way (3)
Multiplying the prior (2.38) by the likelihood function (2.34) yields the posterior for {µ_k} in the form
p(µ|D, α) ∝ ∏_{k=1..K} µ_k^(α_k + m_k − 1).   (2.40)
Again the posterior takes the form of a Dirichlet distribution, which is the conjugate prior for the multinomial. This allows the normalization coefficient to be found by comparison with (2.38):
p(µ|D, α) = Dir(µ|α + m) = Γ(α_0 + N) / (Γ(α_1 + m_1) ... Γ(α_K + m_K)) ∏_{k=1..K} µ_k^(α_k + m_k − 1),   (2.41)
with m = (m_1, ..., m_K)^T. Similarly to the binomial distribution with its beta prior, α_k can be interpreted as an effective number of observations of x_k = 1 (a priori).
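A minimal sketch of the Dirichlet update (2.41); the prior parameters and counts are made up.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # assumed symmetric Dirichlet prior
m = np.array([12, 3, 5])            # hypothetical counts m_k (N = 20)

alpha_post = alpha + m              # posterior is Dir(mu | alpha + m), Eq. (2.41)
post_mean = alpha_post / alpha_post.sum()

print("posterior parameters:", alpha_post)
print("posterior mean      :", post_mean)
print("ML estimate         :", m / m.sum())   # posterior mean approaches this as N grows
```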
The Gaussian Distribution
The Gaussian density of a D-dimensional vector x is
N(x|µ, Σ) = 1 / ((2π)^(D/2) |Σ|^(1/2)) exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ)),   (2.43)
where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ.
Motivations: maximum entropy, central limit theorem.
[Figure: histogram plots of the mean of N uniformly distributed numbers for various values of N; as N increases, the distribution tends towards a Gaussian.]
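A small sketch evaluating (2.43) directly for an arbitrary (made-up) 2-D mean and covariance.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, Eq. (2.43)."""
    D = mu.size
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])            # illustrative parameters
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```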
The Gaussian Distribution: Properties
The density is a function of the Mahalanobis distance from x to µ:
Δ^2 = (x − µ)^T Σ^(−1) (x − µ).   (2.44)
The expectation of x under the Gaussian distribution is
E[x] = µ.   (2.59)
The covariance matrix of x is
cov[x] = Σ.   (2.64)
The Gaussian Distribution: Properties
The density is constant on elliptical surfaces.
[Figure: the red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space x = (x_1, x_2), on which the density is exp(−1/2) of its value at x = µ. The major axes of the ellipse are defined by the eigenvectors u_i of the covariance matrix Σ, with corresponding eigenvalues λ_i, and the axis lengths are proportional to λ_i^(1/2).]
The Gaussian Distribution: Conditional and Marginal Distributions
Given a Gaussian distribution N(x|µ, Σ) with
x = (x_a, x_b)^T,  µ = (µ_a, µ_b)^T,   (2.94)
Σ = ( Σ_aa  Σ_ab
      Σ_ba  Σ_bb ),   (2.95)
the conditional distribution p(x_a|x_b) is again a Gaussian with parameters
µ_{a|b} = µ_a + Σ_ab Σ_bb^(−1) (x_b − µ_b),   (2.81)
Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^(−1) Σ_ba.   (2.82)
The marginal distribution p(x_a) is a Gaussian with parameters (µ_a, Σ_aa).
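A sketch of (2.81)-(2.82) in NumPy, partitioning an assumed joint covariance (all numbers illustrative).

```python
import numpy as np

# Joint Gaussian over (x_a, x_b); here x_a and x_b are both one-dimensional.
mu = np.array([0.0, 1.0])                 # (mu_a, mu_b)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # [[S_aa, S_ab], [S_ba, S_bb]]

S_aa, S_ab = Sigma[:1, :1], Sigma[:1, 1:]
S_ba, S_bb = Sigma[1:, :1], Sigma[1:, 1:]

x_b = np.array([2.0])                     # observed value of x_b

# Conditional p(x_a | x_b): Eqs. (2.81) and (2.82).
gain = S_ab @ np.linalg.inv(S_bb)
mu_cond = mu[:1] + gain @ (x_b - mu[1:])
Sigma_cond = S_aa - gain @ S_ba

print("conditional mean      :", mu_cond)
print("conditional covariance:", Sigma_cond)
```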
The Gaussian Distribution: Bayes' Theorem
Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x of the form
p(x) = N(x|µ, Λ^(−1)),   (2.113)
p(y|x) = N(y|Ax + b, L^(−1))   (2.114)
(i.e., y = Ax + b + ε, where x is Gaussian and ε is zero-mean Gaussian noise; Λ and L are precision matrices, and A and b are parameters governing the mean), then
p(y) = N(y|Aµ + b, L^(−1) + AΛ^(−1)A^T),   (2.115)
p(x|y) = N(x|Σ(A^T L(y − b) + Λµ), Σ),   (2.116)
where
Σ = (Λ + A^T L A)^(−1).   (2.117)
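A sketch of (2.115)-(2.117) with made-up A, b, and precision matrices.

```python
import numpy as np

# Assumed linear-Gaussian model: x in R^2, y in R^2, y = A x + b + noise.
Lam = np.diag([1.0, 2.0])          # precision of p(x)
L = np.diag([4.0, 4.0])            # precision of p(y|x)
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])
b = np.array([0.1, -0.2])
mu = np.array([0.0, 0.0])

Lam_inv, L_inv = np.linalg.inv(Lam), np.linalg.inv(L)

# Marginal p(y), Eq. (2.115).
mean_y = A @ mu + b
cov_y = L_inv + A @ Lam_inv @ A.T

# Posterior p(x|y) for an observed y, Eqs. (2.116)-(2.117).
y = np.array([1.0, 0.5])
Sigma = np.linalg.inv(Lam + A.T @ L @ A)
mean_x_given_y = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)

print(mean_y, cov_y, mean_x_given_y, Sigma, sep="\n")
```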
The Gaussian Distribution: Maximum Likelihood
Assume we have a data set X of N i.i.d. observations drawn from a Gaussian distribution. The ML estimates of the mean and covariance are
µ_ML = (1/N) Σ_{n=1..N} x_n,   (2.121)
Σ_ML = (1/N) Σ_{n=1..N} (x_n − µ_ML)(x_n − µ_ML)^T.   (2.122)
The empirical mean is unbiased, but this is not the case for the empirical covariance. The bias can be corrected by multiplying Σ_ML by the factor N/(N − 1).
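A brief sketch of (2.121)-(2.122) and the bias correction on synthetic data (the true parameters below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)

N = X.shape[0]
mu_ml = X.mean(axis=0)                    # Eq. (2.121)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N              # Eq. (2.122), biased
Sigma_unbiased = Sigma_ml * N / (N - 1)   # bias-corrected covariance

print(mu_ml)
print(Sigma_ml)
print(Sigma_unbiased)
```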
The Gaussian Distribution: Maximum Likelihood (Sequential Estimation)
The mean estimated from N data points can be written as a revision of the estimate obtained from the first N − 1 data points:
µ_ML^(N) = µ_ML^(N−1) + (1/N)(x_N − µ_ML^(N−1)).   (2.126)
This is a particular case of the Robbins-Monro algorithm, which iteratively searches for the root of a regression function.
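A small sketch checking that the sequential update (2.126) reproduces the batch ML mean (synthetic data).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.0, size=1000)

mu_seq = 0.0
for n, x_n in enumerate(x, start=1):
    mu_seq += (x_n - mu_seq) / n        # Eq. (2.126)

print(mu_seq, x.mean())                 # both give the same ML estimate
```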