Gaussians
Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm  awm@cs.cmu.edu  412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Gaussians in Data Mining • Why we should care • The entropy of a PDF • Univariate Gaussians • Multivariate Gaussians • Bayes Rule and Gaussians • Maximum Likelihood and MAP using Gaussians
Why we should care • Gaussians are as natural as Orange Juice and Sunshine • We need them to understand Bayes Optimal Classifiers • We need them to understand regression • We need them to understand neural nets • We need them to understand mixture models • … (You get the idea)
The “box” distribution
p(x) = 1/w for −w/2 ≤ x ≤ w/2, and p(x) = 0 elsewhere: a flat density of height 1/w on an interval of width w centered at 0. Its mean is 0 and its variance is w²/12.
Entropy of a PDF
H[X] = −∫ p(x) ln p(x) dx  (using the natural log, ln or log_e)
The larger the entropy of a distribution…
…the harder it is to predict
…the harder it is to compress it
…the less spiky the distribution
The “box” distribution
Entropy of the box: H[X] = −∫ p(x) ln p(x) dx = ln w, since p(x) = 1/w on an interval of width w; see the numerical check below.
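To make the ln w result concrete, here is a minimal numerical check in Python (numpy assumed; the width w = 4 is arbitrary):

```python
import numpy as np

# Differential entropy of the "box" density p(x) = 1/w on [-w/2, w/2]:
# H = -integral of p(x) ln p(x) dx = ln(w). A crude Riemann-sum check:
w = 4.0
xs = np.linspace(-w / 2, w / 2, 100_000)
dx = xs[1] - xs[0]
p = np.full_like(xs, 1.0 / w)
H_numeric = -np.sum(p * np.log(p)) * dx
print(H_numeric, np.log(w))  # both ~ 1.386
```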
The “2 spikes” distribution
p(x) = ½δ(x + 1) + ½δ(x − 1): two Dirac delta spikes at x = −1 and x = +1. It has mean 0 and unit variance, yet its entropy is −∞: a sample is perfectly predictable once you know its sign.
Entropies of unit-variance distributions
The unit-variance Gaussian achieves the largest possible entropy of any unit-variance distribution, H = ½ ln(2πe) ≈ 1.42 nats; the 2-spikes distribution sits at the opposite extreme with H = −∞.
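For reference, the three unit-variance entropies just mentioned have closed forms; a quick sketch (values in nats, i.e., natural log):

```python
import math

# Closed-form entropies of three unit-variance densities:
H_gaussian = 0.5 * math.log(2 * math.pi * math.e)  # ~1.419, the maximum possible
H_box      = math.log(2 * math.sqrt(3.0))          # box with variance 1 has w = 2*sqrt(3), H = ln w ~ 1.242
H_spikes   = float("-inf")                         # two Dirac spikes at +/-1: perfectly predictable
print(H_spikes, H_box, H_gaussian)
```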
General Gaussian
p(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))
(Figure: μ = 100, σ = 15.)
General Gaussian
p(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))
Also known as the normal distribution or Bell-shaped curve. (Figure: μ = 100, σ = 15.)
Shorthand: We say X ~ N(μ, σ²) to mean “X is distributed as a Gaussian with parameters μ and σ²”. In the figure above, X ~ N(100, 15²).
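A minimal sketch of evaluating this density in Python, using the figure’s μ = 100, σ = 15 (the function name is mine, not from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_pdf(100, 100, 15))  # peak height, ~0.0266
print(gaussian_pdf(130, 100, 15))  # two sigmas out: much smaller
```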
The Error Function
Assume X ~ N(0, 1). Define ERF(x) = P(X < x) = the cumulative distribution of X = ∫_{−∞}^{x} (1/√(2π)) exp(−z²/2) dz.
Using The Error Function
Assume X ~ N(μ, σ²). Then, by standardizing, P(X < x | μ, σ²) = ERF((x − μ)/σ).
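Python’s math.erf is the classical error function, which differs from Moore’s ERF (the standard normal CDF) by a change of variable; a small sketch of the relationship:

```python
import math

def ERF(x):
    """Moore's ERF: the standard normal CDF, P(X < x) for X ~ N(0,1).
    Related to the classical error function by Phi(x) = (1 + erf(x/sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2), by standardizing."""
    return ERF((x - mu) / sigma)

print(normal_cdf(115, 100, 15))  # one sigma above the mean: ~0.841
```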
The Central Limit Theorem
• If (X1, X2, … Xn) are i.i.d. continuous random variables
• Then define z = X1 + X2 + … + Xn
• As n → ∞, p(z) approaches a Gaussian with mean nE[Xi] and variance nVar[Xi]
Somewhat of a justification for the common practice of assuming Gaussian noise
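A quick Monte-Carlo illustration of the theorem, assuming numpy; sums of Uniform(0,1) variables (mean ½, variance 1/12) already look Gaussian for modest n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum n i.i.d. Uniform(0,1) variables many times; the sums should be
# approximately Gaussian with mean n/2 and variance n/12.
n, trials = 30, 100_000
z = rng.random((trials, n)).sum(axis=1)
print(z.mean(), n / 2)   # ~15.0
print(z.var(), n / 12)   # ~2.5
```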
Other amazing facts about Gaussians • Wouldn’t you like to know? • We will not examine them until we need to.
Bivariate Gaussians
Write the random variable X = (X, Y)ᵀ. Then define X ~ N(μ, Σ) to mean
p(x) = (1 / (2π|Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Where the Gaussian’s parameters are…
μ = (μx, μy)ᵀ,  Σ = [ σx²  σxy ; σxy  σy² ]
Where we insist that Σ is symmetric non-negative definite.
It turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition)*
*This note rates 7.4 on the pedanticness scale
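A sketch of the E[X] = μ, Cov[X] = Σ claim, checked by sampling: draw X = μ + Lz with L the Cholesky factor of Σ (the numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # symmetric, positive definite

# Sample X = mu + L z with z ~ N(0, I) and L the Cholesky factor of Sigma,
# then check that E[X] ~ mu and Cov[X] ~ Sigma.
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((100_000, 2))
X = mu + z @ L.T
print(X.mean(axis=0))            # ~[ 2, -1]
print(np.cov(X, rowvar=False))   # ~Sigma
```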
Evaluating p(x): Step 1
• Begin with vector x
(Figure: the point x and the mean μ.)
Evaluating p(x): Step 2
• Begin with vector x
• Define d = x − μ
Evaluating p(x): Step 3
• Begin with vector x
• Define d = x − μ
• Count the number of contours of Σ⁻¹’s ellipsoids that are crossed (contours defined by √(dᵀΣ⁻¹d) = constant)
• D = this count = √(dᵀΣ⁻¹d) = the Mahalanobis distance between x and μ
Evaluating p(x): Step 4
• Begin with vector x
• Define d = x − μ
• D = √(dᵀΣ⁻¹d) = the Mahalanobis distance between x and μ
• Define w = exp(−D²/2)
x close to μ in squared Mahalanobis distance gets a large weight; far away gets a tiny weight.
Evaluating p(x): Step 5
• Begin with vector x
• Define d = x − μ
• D = √(dᵀΣ⁻¹d) = the Mahalanobis distance between x and μ
• Define w = exp(−D²/2)
• Normalize so the density integrates to 1: p(x) = w / (2π|Σ|^(1/2))
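The five steps translate almost line-for-line into code; a minimal bivariate sketch (the helper name and numbers are mine):

```python
import numpy as np

def bivariate_gaussian_pdf(x, mu, Sigma):
    """Follow the slide's steps: d, Mahalanobis distance D, weight w, normalize."""
    d = x - mu                                    # step 2
    D2 = d @ np.linalg.solve(Sigma, d)            # step 3: D^2 = d^T Sigma^{-1} d
    w = np.exp(-0.5 * D2)                         # step 4: the weight
    return w / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))  # step 5: normalize

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
print(bivariate_gaussian_pdf(np.array([1.0, 1.0]), mu, Sigma))
```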
Example
Observe: the mean, the principal axes, the implication of the off-diagonal covariance term, and the zone of maximum gradient of p(x).
Common convention: show the contour corresponding to 2 standard deviations from the mean.
Example In this example, x and y are almost independent
Example In this example, x and “x+y” are clearly not independent
Example In this example, x and “20x+y” are clearly not independent
Multivariate Gaussians
Write the m-dimensional random variable X = (X1, X2, …, Xm)ᵀ. Then define X ~ N(μ, Σ) to mean
p(x) = (1 / ((2π)^(m/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Where the Gaussian’s parameters have… an m-dimensional mean vector μ and an m × m covariance matrix Σ, where we insist that Σ is symmetric non-negative definite.
Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition)
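For general m, the density is best evaluated in log space; a sketch using a log-determinant and a linear solve rather than an explicit matrix inverse (function name is mine):

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """log N(x | mu, Sigma) for an m-dimensional Gaussian, computed stably."""
    m = len(mu)
    d = x - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "Sigma must be positive definite"
    D2 = d @ np.linalg.solve(Sigma, d)  # squared Mahalanobis distance
    return -0.5 * (m * np.log(2 * np.pi) + logdet + D2)

mu = np.zeros(3)
Sigma = np.diag([1.0, 2.0, 0.5])
print(np.exp(gaussian_logpdf(np.array([0.5, -1.0, 0.0]), mu, Sigma)))
```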
General Gaussians
Σ is any symmetric non-negative definite matrix: the contours are ellipses at an arbitrary orientation in the (x1, x2) plane.
Axis-Aligned Gaussians
Σ is diagonal: the contour ellipses are parallel to the x1 and x2 axes, and the components are independent.
Spherical Gaussians
Σ = σ²I: the contours are circles, with the same variance in every direction. (A sketch of all three covariance shapes follows.)
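The three shapes correspond to three progressively more constrained covariance matrices; a small sketch in 2-D (the numbers are arbitrary):

```python
import numpy as np

# The three covariance shapes from the last three slides, in 2-D:
general      = np.array([[2.0, 0.8],   # arbitrary symmetric positive-definite:
                         [0.8, 1.0]])  #   tilted elliptical contours
axis_aligned = np.diag([2.0, 0.5])     # diagonal: ellipses parallel to the axes,
                                       #   x1 and x2 independent
spherical    = 1.5 * np.eye(2)         # sigma^2 * I: circular contours

for name, S in [("general", general), ("axis-aligned", axis_aligned), ("spherical", spherical)]:
    print(name, np.linalg.eigvalsh(S))  # eigenvalues = variances along the principal axes
```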
Degenerate Gaussians
|Σ| = 0: all of the probability mass lies on a lower-dimensional subspace of (x1, x2).
What’s so wrong with clipping one’s toenails in public?
Where are we now? • We’ve seen the formulae for Gaussians • We have an intuition of how they behave • We have some experience of “reading” a Gaussian’s covariance matrix • Coming next: Some useful tricks with Gaussians
Subsets of variables
Write X = (X1, X2, …, Xm)ᵀ as X = (U, V)ᵀ, where U = (X1, …, Xu)ᵀ and V = (X(u+1), …, Xm)ᵀ. This will be our standard notation for breaking an m-dimensional distribution into subsets of variables.
Gaussian Marginals are Gaussian
Marginalize: IF
(U, V)ᵀ ~ N( (μu, μv)ᵀ, [ Σuu  Σuv ; Σuvᵀ  Σvv ] )
THEN U is also distributed as a Gaussian: U ~ N(μu, Σuu).
This fact is not immediately obvious… though it is obvious once we know the marginal is a Gaussian (why?). How would you prove this?
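In code, the marginal’s parameters are obtained by simply slicing μ and Σ; a sketch (the helper name and numbers are mine):

```python
import numpy as np

def marginal(mu, Sigma, idx):
    """Marginal of N(mu, Sigma) over the coordinates in idx: slice mu and Sigma."""
    idx = np.asarray(idx)
    return mu[idx], Sigma[np.ix_(idx, idx)]

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
mu_u, Sigma_u = marginal(mu, Sigma, [0, 1])  # U = (X1, X2)
print(mu_u, Sigma_u, sep="\n")
```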
Linear Transforms remain Gaussian
Multiply by a matrix A: assume X is an m-dimensional Gaussian r.v., X ~ N(μ, Σ). Define Y to be a p-dimensional r.v. thusly (note p ≤ m):
Y = AX
…where A is a p × m matrix. Then Y ~ N(Aμ, AΣAᵀ).
Note: the “subset” result is a special case of this result.
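A Monte-Carlo check that Y = AX has mean Aμ and covariance AΣAᵀ (A and the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.4],
                  [0.0, 0.4, 1.0]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])  # p x m with p = 2, m = 3

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T
print(Y.mean(axis=0), A @ mu)          # empirical vs. A mu
print(np.cov(Y, rowvar=False))         # empirical covariance...
print(A @ Sigma @ A.T)                 # ...vs. A Sigma A^T
```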
Adding samples of 2 independent Gaussians is Gaussian
IF X ~ N(μx, Σx), Y ~ N(μy, Σy) and X, Y are independent, THEN X + Y ~ N(μx + μy, Σx + Σy).
• Why doesn’t this hold if X and Y are dependent?
• Which of the below statements is true?
• If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance
• If X and Y are dependent, then X+Y might be non-Gaussian
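A sketch verifying the independent case empirically (numbers illustrative); the dependent-case quiz above is left for the reader:

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent X ~ N(mu_x, Sx), Y ~ N(mu_y, Sy):
# X + Y should be Gaussian with mean mu_x + mu_y and covariance Sx + Sy.
mu_x, Sx = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
mu_y, Sy = np.array([2.0, -1.0]), np.array([[0.5, 0.0], [0.0, 2.0]])

Z = (rng.multivariate_normal(mu_x, Sx, 200_000)
     + rng.multivariate_normal(mu_y, Sy, 200_000))
print(Z.mean(axis=0), mu_x + mu_y)
print(np.cov(Z, rowvar=False))
print(Sx + Sy)
```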
Conditional of Gaussian is Gaussian
Conditionalize: IF
(U, V)ᵀ ~ N( (μu, μv)ᵀ, [ Σuu  Σuv ; Σuvᵀ  Σvv ] )
THEN
U | V = v ~ N( μu + Σuv Σvv⁻¹ (v − μv), Σuu − Σuv Σvv⁻¹ Σuvᵀ )
(Figure: the marginal density P(w) together with the conditionals P(w|m=82) and P(w|m=76).)
Note: when the given value of v is μv, the conditional mean of u is μu.
Note: the conditional mean of u is a linear function of v.
Note: the conditional variance can only be equal to or smaller than the marginal variance.
Note: the conditional variance is independent of the given value of v.
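A sketch of the conditioning formula, which also exhibits the notes above: the mean is linear in v and the covariance does not depend on v (the helper name and the (w, m) numbers are made up for illustration):

```python
import numpy as np

def condition(mu, Sigma, u_idx, v_idx, v):
    """Parameters of p(U | V = v) for a joint Gaussian N(mu, Sigma).
    Conditional mean: mu_u + S_uv S_vv^{-1} (v - mu_v)   (linear in v);
    conditional cov:  S_uu - S_uv S_vv^{-1} S_vu         (independent of v)."""
    u, w = np.asarray(u_idx), np.asarray(v_idx)
    Suu = Sigma[np.ix_(u, u)]
    Suv = Sigma[np.ix_(u, w)]
    Svv = Sigma[np.ix_(w, w)]
    K = Suv @ np.linalg.inv(Svv)
    return mu[u] + K @ (v - mu[w]), Suu - K @ Suv.T

mu = np.array([70.0, 78.0])                  # made-up (w, m) means
Sigma = np.array([[9.0, 6.0], [6.0, 16.0]])  # made-up joint covariance
print(condition(mu, Sigma, [0], [1], np.array([82.0])))  # P(w | m=82)
print(condition(mu, Sigma, [0], [1], np.array([76.0])))  # P(w | m=76): same variance
```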
Gaussians and the chain rule
Chain rule: let A be a constant matrix. IF V ~ N(μv, Σv) and U | V = v is Gaussian with mean Av + b and a covariance that does not depend on v, THEN the joint (U, V)ᵀ is also Gaussian; the sketch below assembles its mean and covariance.
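A minimal sketch of the linear-Gaussian chain rule in the scalar case, assembling the joint mean and covariance from p(v) and p(u|v) = N(Av + b, Q) (A, b, Q and the numbers are illustrative, not from the slides):

```python
import numpy as np

# If V ~ N(mu_v, S_v) and U | V = v ~ N(A v + b, Q), then (U, V) is jointly
# Gaussian with E[U] = A mu_v + b, Cov[U] = A S_v A^T + Q, Cov[U, V] = A S_v.
mu_v = np.array([1.0])
S_v  = np.array([[2.0]])
A    = np.array([[3.0]])
b    = np.array([0.5])
Q    = np.array([[0.25]])

mu_joint = np.concatenate([A @ mu_v + b, mu_v])
top = np.hstack([A @ S_v @ A.T + Q, A @ S_v])
bot = np.hstack([(A @ S_v).T, S_v])
S_joint = np.vstack([top, bot])
print(mu_joint)
print(S_joint)
```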
Available Gaussian tools
• Marginalize: Gaussian marginals are Gaussian
• Conditionalize: conditionals of a Gaussian are Gaussian
• Multiply by a matrix A: linear transforms of Gaussians remain Gaussian
• +: sums of independent Gaussians are Gaussian
• Chain rule: a Gaussian p(v) and a linear-Gaussian p(u|v) give a Gaussian joint p(u,v)
Assume… • You are an intellectual snob • You have a child