This chapter discusses probability distributions and density estimation, as well as the concept of Bayesian inference and the use of conjugate priors in binary and multinomial variables. It also introduces the Gaussian distribution and its applications.
Pattern Recognition and Machine Learning, Chapter 2: Probability Distributions. Chonbuk National University
Probability Distributions: General
Density estimation: given a finite set x1, . . . , xN of observations, find the distribution p(x) of x.
Frequentist's way: choose specific parameter values by optimizing a criterion (e.g., the likelihood).
Bayesian way: place a prior distribution over the parameters and compute the posterior distribution with Bayes' rule.
Conjugate prior: leads to a posterior distribution of the same functional form as the prior (makes life a lot easier).
Chonbuk National University
Binary Variables: Frequentist's Way
Given a binary random variable x ∈ {0, 1} (tossing a coin) with
p(x = 1 | µ) = µ,  p(x = 0 | µ) = 1 − µ.  (2.1)
p(x) can be described by the Bernoulli distribution:
Bern(x | µ) = µ^x (1 − µ)^(1−x).  (2.2)
The maximum likelihood estimate for µ is:
µ_ML = m / N,  where m = (# observations of x = 1).  (2.8)
Yet this can lead to overfitting (especially for small N), e.g., N = m = 3 yields µ_ML = 1.
Chonbuk National University
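As a minimal illustration (Python, not part of the original slides), the ML estimate is just the empirical fraction of ones, and three successes in a row already give µ_ML = 1:

```python
import numpy as np

# Maximum likelihood estimate for the Bernoulli parameter: mu_ML = m / N (2.8).
def bernoulli_ml(x):
    x = np.asarray(x)
    return x.sum() / x.size   # m / N

print(bernoulli_ml([1, 0, 1, 1, 0]))  # 0.6
print(bernoulli_ml([1, 1, 1]))        # 1.0 -- the N = m = 3 overfitting case
```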
Binary Variables: Bayesian Way (1)
The binomial distribution describes the number m of observations of x = 1 out of a data set of size N: N = fixed number of trials, m = number of successes in N trials, µ = probability of success, (1 − µ) = probability of failure:
Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^(N−m)  (2.9)
where the binomial coefficient is
(N choose m) = N! / ((N − m)! m!).  (2.10)
[Figure: histogram plot of the binomial distribution for N = 10 and µ = 0.25.]
Bishop Chapter 2: Probability Distributions
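A small sketch of the binomial pmf (2.9)-(2.10), assuming plain Python with the standard library only; the values N = 10 and µ = 0.25 match the histogram on this slide:

```python
from math import comb

# Binomial probability of m successes in N trials with success probability mu.
def binom_pmf(m, N, mu):
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)

probs = [binom_pmf(m, 10, 0.25) for m in range(11)]
print([round(p, 3) for p in probs])
print(sum(probs))  # sums to 1 (up to rounding)
```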
Binary Variables: Bayesian Way (2)
For a Bayesian treatment, we take the beta distribution as conjugate prior by introducing a prior distribution over the parameter µ:
Beta(µ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · µ^(a−1) (1 − µ)^(b−1)  (2.13)
(The gamma function extends the factorial to real numbers, i.e., Γ(n) = (n − 1)!.)
Mean and variance are given by:
E[µ] = a / (a + b)  (2.15)
var[µ] = ab / ((a + b)^2 (a + b + 1))  (2.16)
Bishop Chapter 2: Probability Distributions
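A quick numerical check of (2.13), (2.15) and (2.16), as a sketch using only math.gamma; the values a = 2, b = 3 are chosen arbitrarily for illustration:

```python
from math import gamma

# Beta density (2.13).
def beta_pdf(mu, a, b):
    coeff = gamma(a + b) / (gamma(a) * gamma(b))
    return coeff * mu ** (a - 1) * (1 - mu) ** (b - 1)

# Mean (2.15) and variance (2.16).
def beta_mean_var(a, b):
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

print(beta_pdf(0.5, 2, 3))   # density at mu = 0.5
print(beta_mean_var(2, 3))   # (0.4, 0.04)
```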
Binary Variables: Beta Distribution
Some plots of the beta distribution:
[Figure: plots of the beta distribution (2.13) as a function of µ for hyperparameter values (a, b) = (0.1, 0.1), (1, 1), (2, 3), and (8, 4).]
Bishop Chapter 2: Probability Distributions
Binary Variables: Bayesian Way (3)
Multiplying the binomial likelihood function (2.9) and the beta prior (2.13), the posterior is a beta distribution and has the form:
p(µ | m, l, a, b) ∝ µ^(m+a−1) (1 − µ)^(l+b−1)  (2.17)
with l = N − m.
Simple interpretation of the hyperparameters a and b as effective numbers of observations of x = 1 and x = 0 (a priori).
As we observe new data, a and b are updated.
As N → ∞, the variance (uncertainty) decreases and the mean converges to the ML estimate.
Bishop Chapter 2: Probability Distributions
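The conjugate update itself is just count accumulation; a minimal sketch (the prior counts a = b = 2 are assumed for illustration) showing that the posterior mean stays away from the degenerate ML estimate for N = m = 3:

```python
# Observing m ones and l = N - m zeros turns Beta(a, b) into Beta(a + m, b + l) (2.17).
def beta_posterior(a, b, data):
    m = sum(data)
    l = len(data) - m
    return a + m, b + l

a, b = 2, 2                        # assumed prior pseudo-counts
a_post, b_post = beta_posterior(a, b, [1, 1, 1])
print(a_post / (a_post + b_post))  # posterior mean 5/7 ~ 0.71, not 1.0
```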
Multinomial Variables: Frequentist's Way
A random variable with K mutually exclusive states can be represented as a K-dimensional vector x in which one element x_k = 1 and all remaining elements equal 0.
The Bernoulli distribution can be generalized to
p(x | µ) = ∏_k µ_k^(x_k).  (2.26)
For a data set D with N independent observations, the corresponding likelihood function takes the form
p(D | µ) = ∏_k µ_k^(m_k),  where m_k = Σ_n x_nk.  (2.29)
The maximum likelihood estimate for µ_k is:
µ_k^ML = m_k / N.  (2.33)
Bishop Chapter 2: Probability Distributions
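A short sketch of the ML estimate (2.33) from one-hot (1-of-K) observations, using NumPy; the data below are made up for illustration:

```python
import numpy as np

X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])      # N = 4 one-hot observations, K = 3 states

m_k = X.sum(axis=0)            # counts m_k of each state
mu_ml = m_k / X.shape[0]       # mu_k^ML = m_k / N (2.33)
print(mu_ml)                   # [0.5, 0.25, 0.25]
```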
Multinomial Variables: Bayesian Way (1)
The multinomial distribution is a joint distribution of the counts m1, . . . , mK, conditioned on µ and N:
Mult(m1, m2, . . . , mK | µ, N) = (N choose m1 m2 . . . mK) ∏_k µ_k^(m_k).  (2.34)
The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m1, . . . , mK and is given by:
(N choose m1 m2 . . . mK) = N! / (m1! m2! · · · mK!)  (2.35)
where the variables m_k are subject to the constraint:
Σ_k m_k = N.  (2.36)
Bishop Chapter 2: Probability Distributions
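A direct transcription of (2.34)-(2.36) as a sketch with the standard library; it assumes the counts m_k already satisfy the constraint Σ_k m_k = N:

```python
from math import factorial

# Multinomial probability (2.34) with normalizer N! / (m_1! ... m_K!) (2.35).
def multinomial_pmf(m, mu):
    N = sum(m)                  # constraint (2.36) defines N
    coeff = factorial(N)
    for m_k in m:
        coeff //= factorial(m_k)
    p = float(coeff)
    for m_k, mu_k in zip(m, mu):
        p *= mu_k ** m_k
    return p

print(multinomial_pmf([2, 1, 1], [0.5, 0.25, 0.25]))  # 0.1875
```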
Multinomial Variables: Bayesian Way (2)
For a Bayesian treatment, the Dirichlet distribution can be taken as conjugate prior:
Dir(µ | α) = Γ(α_0) / (Γ(α_1) · · · Γ(α_K)) ∏_k µ_k^(α_k − 1)  (2.38)
where Γ(x) is the gamma function and
α_0 = Σ_k α_k.  (2.39)
Bishop Chapter 2: Probability Distributions
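A sketch of the Dirichlet density (2.38)-(2.39) using math.gamma; the evaluation point and α values below are arbitrary, and µ is assumed to lie on the simplex:

```python
from math import gamma

# Dirichlet density (2.38) with alpha_0 = sum_k alpha_k (2.39).
def dirichlet_pdf(mu, alpha):
    coeff = gamma(sum(alpha))
    for a_k in alpha:
        coeff /= gamma(a_k)
    p = coeff
    for mu_k, a_k in zip(mu, alpha):
        p *= mu_k ** (a_k - 1)
    return p

print(dirichlet_pdf([0.2, 0.3, 0.5], [2.0, 3.0, 4.0]))
```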
Multinomial Variables: Dirichlet Distribution
Some plots of a Dirichlet distribution over 3 variables:
[Figure: plots of the Dirichlet distribution over the simplex for various parameter values, e.g. α = (0.1, 0.1, 0.1), (1, 1, 1), (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).]
Bishop Chapter 2: Probability Distributions
Multinomial Variables: Bayesian Way (3)
Multiplying the prior (2.38) by the likelihood function (2.34) yields the posterior for {µ_k} in the form:
p(µ | D, α) ∝ p(D | µ) p(µ | α) ∝ ∏_k µ_k^(α_k + m_k − 1).  (2.40)
Again the posterior distribution takes the form of a Dirichlet distribution, which is the conjugate prior for the multinomial. This allows the normalization coefficient to be found by comparison with (2.38):
p(µ | D, α) = Dir(µ | α + m) = Γ(α_0 + N) / (Γ(α_1 + m_1) · · · Γ(α_K + m_K)) ∏_k µ_k^(α_k + m_k − 1)  (2.41)
with m = (m1, . . . , mK)^T. Similarly to the binomial distribution with its beta prior, α_k can be interpreted as an effective number of observations of x_k = 1 (a priori).
Bishop Chapter 2: Probability Distributions
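As with the beta case, the Dirichlet update just adds observed counts to the pseudo-counts; a minimal sketch with an assumed uniform prior α = (1, 1, 1):

```python
# Posterior Dirichlet parameters: alpha_k + m_k (2.41).
def dirichlet_posterior(alpha, counts):
    return [a_k + m_k for a_k, m_k in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]                     # assumed uniform prior
counts = [2, 1, 1]                          # observed m_k
alpha_post = dirichlet_posterior(alpha, counts)
total = sum(alpha_post)
print([a_k / total for a_k in alpha_post])  # posterior mean, shrunk toward the prior
```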
The Gaussian Distribution
The Gaussian law of a D-dimensional vector x is:
N(x | µ, Σ) = 1 / ((2π)^(D/2) |Σ|^(1/2)) exp{ −(1/2)(x − µ)^T Σ^(−1) (x − µ) }  (2.43)
where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix and |Σ| denotes the determinant of Σ.
Motivations: maximum of the entropy, central limit theorem.
[Figure: histogram plots of the mean of N uniformly distributed numbers for various values of N; as N increases, the distribution tends towards a Gaussian.]
Bishop Chapter 2: Probability Distributions
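A sketch of the density (2.43) plus a quick Monte Carlo look at the central limit effect shown in the figure, using NumPy with arbitrary example parameters:

```python
import numpy as np

# Multivariate Gaussian density (2.43).
def gaussian_pdf(x, mu, Sigma):
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(gaussian_pdf(np.array([0.2, -0.1]), mu, Sigma))

# Central limit theorem: means of N uniform samples look increasingly Gaussian.
means = np.random.rand(10000, 10).mean(axis=1)   # N = 10
print(means.mean(), means.var())                 # ~0.5 and ~1/120
```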
The Gaussian Distribution: Properties
The law is a function of the Mahalanobis distance from x to µ:
∆^2 = (x − µ)^T Σ^(−1) (x − µ).  (2.44)
The expectation of x under the Gaussian distribution is:
E[x] = µ.  (2.59)
The covariance matrix of x is:
cov[x] = Σ.  (2.64)
Bishop Chapter 2: Probability Distributions
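A small numerical check, assuming NumPy and arbitrary µ and Σ: the Mahalanobis distance (2.44), and sample estimates that approach E[x] = µ and cov[x] = Σ:

```python
import numpy as np

# Squared Mahalanobis distance (2.44).
def mahalanobis_sq(x, mu, Sigma):
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(mahalanobis_sq(np.array([0.0, 0.0]), mu, Sigma))

samples = np.random.multivariate_normal(mu, Sigma, size=100_000)
print(samples.mean(axis=0))   # close to mu     (2.59)
print(np.cov(samples.T))      # close to Sigma  (2.64)
```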
The Gaussian Distribution: Properties
The law is constant on elliptical surfaces.
[Figure: the red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space x = (x1, x2), on which the density is exp(−1/2) of its value at x = µ. The major axes of the ellipse are defined by the eigenvectors u_i of the covariance matrix, with corresponding eigenvalues λ_i; the axes have lengths proportional to λ_i^(1/2).]
Here λ_i are the eigenvalues of Σ and u_i are the associated eigenvectors.
Bishop Chapter 2: Probability Distributions
The Gaussian Distribution: Conditional and Marginal Laws
Given a Gaussian distribution N(x | µ, Σ) with:
x = (x_a, x_b)^T,  µ = (µ_a, µ_b)^T  (2.94)
Σ = ( Σ_aa  Σ_ab ; Σ_ba  Σ_bb )  (2.95)
The conditional distribution p(x_a | x_b) is again a Gaussian law with parameters:
µ_a|b = µ_a + Σ_ab Σ_bb^(−1) (x_b − µ_b)  (2.81)
Σ_a|b = Σ_aa − Σ_ab Σ_bb^(−1) Σ_ba.  (2.82)
The marginal distribution p(x_a) is a Gaussian law with parameters (µ_a, Σ_aa).
Bishop Chapter 2: Probability Distributions
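A sketch of the conditioning formulas (2.81)-(2.82), assuming NumPy; the index lists a and b define the partition, and the 2-D example values are made up:

```python
import numpy as np

# Parameters of p(x_a | x_b) for a joint Gaussian N(x | mu, Sigma).
def gaussian_conditional(mu, Sigma, a, b, x_b):
    S_aa, S_ab = Sigma[np.ix_(a, a)], Sigma[np.ix_(a, b)]
    S_ba, S_bb = Sigma[np.ix_(b, a)], Sigma[np.ix_(b, b)]
    S_bb_inv = np.linalg.inv(S_bb)
    mu_cond = mu[a] + S_ab @ S_bb_inv @ (x_b - mu[b])   # (2.81)
    Sigma_cond = S_aa - S_ab @ S_bb_inv @ S_ba          # (2.82)
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
print(gaussian_conditional(mu, Sigma, [0], [1], np.array([2.0])))
```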
The Gaussian Distribution: Bayes' Theorem
Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x in the form
p(x) = N(x | µ, Λ^(−1))  (2.113)
p(y | x) = N(y | Ax + b, L^(−1))  (2.114)
(i.e., y = Ax + b + s, where x is Gaussian and s is a centered Gaussian noise). Λ and L are precision matrices; A and b are parameters governing the mean. Then,
p(y) = N(y | Aµ + b, L^(−1) + A Λ^(−1) A^T)  (2.115)
p(x | y) = N(x | Σ(A^T L (y − b) + Λµ), Σ)  (2.116)
where
Σ = (Λ + A^T L A)^(−1).  (2.117)
Bishop Chapter 2: Probability Distributions
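A direct evaluation of (2.113)-(2.117) as a sketch, with one-dimensional x and y and made-up values for A, b, Λ and L:

```python
import numpy as np

Lam = np.array([[2.0]])     # precision of p(x)
L = np.array([[4.0]])       # precision of p(y | x)
A = np.array([[3.0]])
b = np.array([1.0])
mu = np.array([0.5])

mean_y = A @ mu + b                                        # mean of p(y)       (2.115)
cov_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T    # covariance of p(y) (2.115)

Sigma = np.linalg.inv(Lam + A.T @ L @ A)                   # (2.117)
y = np.array([2.0])
mean_x_given_y = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)    # mean of p(x | y)   (2.116)

print(mean_y, cov_y)
print(mean_x_given_y, Sigma)
```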
The Gaussian Distribution: Maximum Likelihood
Assume we have X, a set of N i.i.d. observations following a Gaussian law. The ML estimates of the mean and covariance are:
µ_ML = (1/N) Σ_n x_n  (2.121)
Σ_ML = (1/N) Σ_n (x_n − µ_ML)(x_n − µ_ML)^T.  (2.122)
The empirical mean is unbiased, but this is not the case for the empirical covariance. The bias can be corrected by multiplying Σ_ML by the factor N/(N − 1).
Bishop Chapter 2: Probability Distributions
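A quick sketch with NumPy on synthetic data, showing the biased estimate (2.122) and its N/(N − 1) correction (np.cov applies the unbiased normalization by default):

```python
import numpy as np

X = np.random.multivariate_normal([0.0, 0.0], np.eye(2), size=500)
N = X.shape[0]

mu_ml = X.mean(axis=0)                      # (2.121)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N                # (2.122), biased
Sigma_unbiased = Sigma_ml * N / (N - 1)     # bias-corrected

print(mu_ml)
print(np.allclose(Sigma_unbiased, np.cov(X.T)))  # True
```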
The Gaussian Distribution: Maximum Likelihood
The mean estimated from N data points is a revision of the estimator obtained from the first (N − 1) data points:
µ_ML^(N) = µ_ML^(N−1) + (1/N)(x_N − µ_ML^(N−1)).  (2.126)
This is a particular case of the Robbins-Monro algorithm, which iteratively searches for the root of a regression function.
Bishop Chapter 2: Probability Distributions
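A one-pass sketch of the sequential update (2.126); on any data stream it reproduces the batch mean exactly (up to floating-point rounding):

```python
import numpy as np

# Running ML mean: mu_N = mu_{N-1} + (x_N - mu_{N-1}) / N  (2.126).
def sequential_mean(xs):
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu += (x - mu) / n
    return mu

xs = np.random.randn(1000) + 3.0
print(sequential_mean(xs), xs.mean())   # agree up to rounding
```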