CS 5331: Applied Machine Learning, Spring 2011. Probability Distributions (1/26/2011). Mohan Sridharan
Overview • Probability density estimation given a set of i.i.d. data observations: ill-posed problem! • Parametric methods: • Specific functional form with parameters to estimate. • Binomial and Multinomial distributions for discrete RVs. • Gaussian distribution for continuous RVs. • Conjugate priors: exponential family of distributions. • Non-parametric methods: • Parameters control model complexity instead of functional form. • Histograms, Nearest-neighbors, Kernels.
Parametric Distributions • Basic building blocks: parametric densities $p(x|\theta)$. • Need to determine $p(x|\theta)$ given observations $\{x_1, \ldots, x_N\}$. • Representation: a point estimate $\theta^\star$ or a distribution $p(\theta)$?
Binary Variables (1) • Coin flipping: heads = 1, tails = 0, with $p(x=1|\mu) = \mu$. • Bernoulli distribution: $\mathrm{Bern}(x|\mu) = \mu^x (1-\mu)^{1-x}$, with $\mathbb{E}[x] = \mu$ and $\mathrm{var}[x] = \mu(1-\mu)$.
Parameter Estimation (1) • ML for Bernoulli. Given $D = \{x_1, \ldots, x_N\}$ with $m$ heads: $p(D|\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n}$. • Compute: $\ln p(D|\mu) = \sum_{n=1}^{N} [x_n \ln\mu + (1-x_n)\ln(1-\mu)]$, maximized by $\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{m}{N}$.
Parameter Estimation (2) • Example: $D = \{1, 1, 1\}$, i.e., three heads in a row, gives $\mu_{ML} = 1$. • Prediction: all future tosses will land heads up! • Overfitting to $D$.
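A minimal NumPy sketch of this overfitting effect (illustrative, not from the slides):

```python
import numpy as np

# Three tosses that all land heads: D = {1, 1, 1}.
D = np.array([1, 1, 1])

# ML estimate of the Bernoulli parameter: mu_ML = (1/N) sum x_n = m/N.
mu_ml = D.mean()
print(mu_ml)  # 1.0 -- ML predicts heads on every future toss (overfitting to D)
```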
Binary Variables (2) • $N$ coin flips, $m$ heads. • Binomial distribution: $\mathrm{Bin}(m|N,\mu) = \binom{N}{m}\mu^m(1-\mu)^{N-m}$, with $\mathbb{E}[m] = N\mu$ and $\mathrm{var}[m] = N\mu(1-\mu)$.
Beta Distribution • Distribution over the mean $\mu$: $\mathrm{Beta}(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$, with $\mathbb{E}[\mu] = \frac{a}{a+b}$. • Hyper-parameters: $a$, $b$.
Bayesian Bernoulli • The Beta distribution provides the conjugate prior for the Bernoulli distribution: with $m$ heads and $l = N - m$ tails, $p(\mu|m,l,a,b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1} = \mathrm{Beta}(\mu|a+m,\, b+l)$.
Properties of the Posterior • As the size of the dataset ($N$) increases: $\mathbb{E}[\mu|D] \to \mu_{ML}$ and $\mathrm{var}[\mu|D] \to 0$. • Is this true for all of Bayesian learning?
Prediction under the Posterior • What is the probability that the next coin toss will land heads up? $p(x=1|D) = \int_0^1 \mu\, p(\mu|D)\, d\mu = \mathbb{E}[\mu|D] = \frac{m+a}{m+a+l+b}$.
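A sketch of the Beta-Bernoulli update and posterior prediction (hyper-parameter values are illustrative, not from the slides):

```python
# Beta(mu | a, b) prior; observing m heads and l tails gives the
# posterior Beta(mu | a + m, b + l).
a, b = 2.0, 2.0   # illustrative prior hyper-parameters
m, l = 3, 0       # e.g., three heads observed, no tails

a_post, b_post = a + m, b + l

# Predictive probability of heads: p(x=1|D) = E[mu|D] = (m + a)/(m + a + l + b).
p_heads = a_post / (a_post + b_post)
print(p_heads)  # 5/7 ~ 0.714, instead of the ML prediction of 1.0
```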
Multinomial Variables • 1-of-K coding scheme: e.g., $x = (0,0,1,0,0,0)^T$ with $\sum_k x_k = 1$, and $p(x|\mu) = \prod_{k=1}^{K} \mu_k^{x_k}$ where $\sum_k \mu_k = 1$.
ML Parameter Estimation • Given $D = \{x_1, \ldots, x_N\}$ with counts $m_k = \sum_n x_{nk}$: $\ln p(D|\mu) = \sum_k m_k \ln \mu_k$. • Ensure $\sum_k \mu_k = 1$ using a Lagrange multiplier $\lambda$: $\mu_k^{ML} = \frac{m_k}{N}$.
The Dirichlet Distribution • Conjugate prior for the multinomial distribution: $\mathrm{Dir}(\mu|\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$, where $\alpha_0 = \sum_k \alpha_k$.
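The multinomial analogue as a short NumPy sketch (counts and hyper-parameters are made up for illustration):

```python
import numpy as np

# Dir(mu | alpha) prior; observed counts m_k give posterior Dir(mu | alpha + m).
alpha = np.array([1.0, 1.0, 1.0])   # symmetric prior over K = 3 outcomes
m = np.array([5, 2, 3])             # counts m_k from N = 10 observations

alpha_post = alpha + m

print(alpha_post / alpha_post.sum())  # posterior mean: (alpha_k + m_k)/(alpha_0 + N)
print(m / m.sum())                    # ML estimate mu_k = m_k / N, for comparison
```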
Central Limit Theorem • The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. • Example: N uniform [0,1] random variables.
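A quick empirical check of the uniform example (a sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean of N i.i.d. uniform [0,1] variables: its distribution approaches
# the Gaussian N(1/2, 1/(12N)) as N grows.
for N in (1, 2, 10):
    means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    print(N, means.mean(), means.var(), 1.0 / (12 * N))  # empirical vs. limit variance
```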
Moments of the Multivariate Gaussian (1) • $\mathbb{E}[x] = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\{-\frac{1}{2} z^T \Sigma^{-1} z\}\,(z + \mu)\, dz = \mu$, where $z = x - \mu$: the term linear in $z$ vanishes thanks to the anti-symmetry of $z$.
Maximum Likelihood for the Gaussian (1) • Sections 2.3.1--2.3.3: conditional and marginal Gaussians. • Given i.i.d. data $X = (x_1, \ldots, x_N)^T$, the log likelihood function is: $\ln p(X|\mu,\Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N} (x_n-\mu)^T\Sigma^{-1}(x_n-\mu)$. • Sufficient statistics: $\sum_n x_n$ and $\sum_n x_n x_n^T$.
Maximum Likelihood for the Gaussian (2) • Set the derivative of the log likelihood function to zero: $\frac{\partial}{\partial\mu}\ln p(X|\mu,\Sigma) = \sum_{n=1}^{N} \Sigma^{-1}(x_n - \mu) = 0$. • To obtain: $\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$. • Similarly: $\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T$.
Maximum Likelihood for the Gaussian (3) • Under the true distribution: $\mathbb{E}[\mu_{ML}] = \mu$, but $\mathbb{E}[\Sigma_{ML}] = \frac{N-1}{N}\Sigma$. • Hence define the unbiased estimator $\tilde{\Sigma} = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T$.
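Both estimators in NumPy (a minimal sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))   # N = 1000 draws from a D = 2 Gaussian

mu_ml = X.mean(axis=0)           # ML mean
diff = X - mu_ml
sigma_ml = diff.T @ diff / len(X)            # ML covariance: biased by (N-1)/N
sigma_tilde = diff.T @ diff / (len(X) - 1)   # unbiased estimator

print(mu_ml)
print(sigma_ml)
print(sigma_tilde)
```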
Sequential Estimation • Contribution of the $N$th data point $x_N$: $\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right)$, i.e., old estimate + correction weight $\times$ correction given $x_N$.
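The same update as a small function (a sketch; it agrees with the batch mean):

```python
import numpy as np

def sequential_mean(xs):
    """Update the ML mean one point at a time:
    new = old + (1/N) * (x_N - old)."""
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu += (x - mu) / n   # old estimate + correction weight * correction
    return mu

xs = np.random.default_rng(0).normal(loc=3.0, size=500)
print(sequential_mean(xs), xs.mean())   # the two estimates coincide
```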
Bayesian Inference for the Gaussian (1) • Assume the variance $\sigma^2$ is known. Given data $X = \{x_1, \ldots, x_N\}$, the likelihood function is: $p(X|\mu) = \prod_{n=1}^{N} p(x_n|\mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\{-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2\}$. • This has the form of an exponential of a quadratic in $\mu$, but it is not a distribution over $\mu$.
Bayesian Inference for the Gaussian (2) • Combined with a Gaussian prior: $p(\mu) = \mathcal{N}(\mu|\mu_0, \sigma_0^2)$. • Gives the posterior: $p(\mu|X) \propto p(X|\mu)\,p(\mu)$. • Completing the square in the exponent: $p(\mu|X) = \mathcal{N}(\mu|\mu_N, \sigma_N^2)$.
Bayesian Inference for the Gaussian (3) • Where: $\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\mu_{ML}$ and $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$. • Note: as $N \to \infty$, $\mu_N \to \mu_{ML}$ and $\sigma_N^2 \to 0$.
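These two formulas transcribed into a NumPy sketch (the prior and noise values are illustrative):

```python
import numpy as np

def posterior_over_mean(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu | mu_N, sigma_N^2) for the mean of a Gaussian with
    known variance sigma^2 and prior N(mu | mu0, sigma0^2)."""
    N, mu_ml = len(x), np.mean(x)
    mu_N = (sigma_sq * mu0 + N * sigma0_sq * mu_ml) / (N * sigma0_sq + sigma_sq)
    sigma_N_sq = 1.0 / (1.0 / sigma0_sq + N / sigma_sq)
    return mu_N, sigma_N_sq

x = np.random.default_rng(0).normal(loc=0.8, scale=0.1, size=10)
print(posterior_over_mean(x, mu0=0.0, sigma0_sq=0.1, sigma_sq=0.01))
```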
Bayesian Inference for the Gaussian (4) • Example: posterior over $\mu$ for $N = 0, 1, 2$ and $10$ data points; the posterior sharpens around the true mean as $N$ grows (figure omitted).
Bayesian Inference for the Gaussian (5) • Sequential estimation: • The posterior obtained after observing N-1 data points becomes the prior when we observe the Nth data point.
Bayesian Inference for the Gaussian (6) • Now assume the mean $\mu$ is known and we need to estimate the precision $\lambda = 1/\sigma^2$: $p(X|\lambda) = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu, \lambda^{-1}) \propto \lambda^{N/2}\exp\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n-\mu)^2\}$. • This has a Gamma shape as a function of $\lambda$.
Bayesian Inference for the Gaussian (7) • The Gamma distribution: $\mathrm{Gam}(\lambda|a,b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} e^{-b\lambda}$, with $\mathbb{E}[\lambda] = \frac{a}{b}$ and $\mathrm{var}[\lambda] = \frac{a}{b^2}$.
Bayesian Inference for the Gaussian (8) • Now we combine a Gamma prior $\mathrm{Gam}(\lambda|a_0, b_0)$ with the likelihood function to obtain: $p(\lambda|X) \propto \lambda^{a_0 - 1 + N/2}\exp\{-b_0\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_n-\mu)^2\}$. • Which is the same as $\mathrm{Gam}(\lambda|a_N, b_N)$ with $a_N = a_0 + \frac{N}{2}$ and $b_N = b_0 + \frac{1}{2}\sum_n (x_n-\mu)^2 = b_0 + \frac{N}{2}\sigma_{ML}^2$.
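The precision update in code (a sketch; prior hyper-parameters are illustrative):

```python
import numpy as np

def posterior_over_precision(x, mu, a0, b0):
    """Posterior Gam(lambda | a_N, b_N) for the precision of a Gaussian
    with known mean mu: a_N = a0 + N/2, b_N = b0 + (1/2) sum (x_n - mu)^2."""
    a_N = a0 + len(x) / 2.0
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_N, b_N

x = np.random.default_rng(0).normal(loc=0.0, scale=2.0, size=100)
a_N, b_N = posterior_over_precision(x, mu=0.0, a0=1.0, b0=1.0)
print(a_N / b_N, 0.25)   # posterior mean of lambda vs. true precision 1/sigma^2
```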
Bayesian Inference for the Gaussian (9) • If both mean and precision are unknown, the joint likelihood function is: $p(X|\mu,\lambda) = \prod_{n=1}^{N}\left(\frac{\lambda}{2\pi}\right)^{1/2}\exp\{-\frac{\lambda}{2}(x_n-\mu)^2\}$. • We need a prior with the same functional dependence on $\mu$ and $\lambda$.
Bayesian Inference for the Gaussian (10) • The Gaussian-gamma distribution: $p(\mu,\lambda) = \mathcal{N}(\mu|\mu_0, (\beta\lambda)^{-1})\,\mathrm{Gam}(\lambda|a,b)$. • Exponent quadratic in $\mu$, linear in $\lambda$. • Gamma distribution over the precision $\lambda$, independent of $\mu$.
Bayesian Inference for the Gaussian (11) • The Gaussian-gamma distribution: contour plot (figure omitted).
Bayesian Inference for the Gaussian (12) • Multivariate conjugate priors. • Mean unknown, precision known: Gaussian prior. • Precision unknown, mean known: Wishart prior. • Mean and precision unknown: Gaussian-Wishart prior.
Student’s t-Distribution • $p(x|\mu, a, b) = \int_0^\infty \mathcal{N}(x|\mu, \tau^{-1})\,\mathrm{Gam}(\tau|a, b)\, d\tau = \mathrm{St}(x|\mu, \lambda, \nu)$, where $\lambda = a/b$ and $\nu = 2a$. • $\mathrm{St}(x|\mu,\lambda,\nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2 - 1/2}$. • Infinite mixture of Gaussians.
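The scale-mixture construction makes the t-distribution easy to sample (a sketch; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# St(x | mu, lam, nu) as an infinite mixture of Gaussians: draw a precision
# scale eta ~ Gam(nu/2, nu/2) (rate form), then x ~ N(mu, (eta * lam)^{-1}).
mu, lam, nu = 0.0, 1.0, 3.0
eta = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=100_000)  # NumPy uses scale = 1/rate
x = rng.normal(loc=mu, scale=1.0 / np.sqrt(eta * lam))

print((np.abs(x) > 4).mean())  # ~3% beyond 4 units, vs. ~0.006% for a Gaussian
```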
Student’s t-Distribution • Robustness to outliers: Gaussian vs. t-distribution.
Periodic Variables • Examples: calendar time, wind direction, etc. • We require: $p(\theta) \geq 0$, $\int_0^{2\pi} p(\theta)\, d\theta = 1$, and $p(\theta + 2\pi) = p(\theta)$.
von Mises Distribution (1) • This requirement is satisfied by: $p(\theta|\theta_0, m) = \frac{1}{2\pi I_0(m)}\exp\{m\cos(\theta - \theta_0)\}$. • Where $I_0(m) = \frac{1}{2\pi}\int_0^{2\pi}\exp\{m\cos\theta\}\, d\theta$ is the 0th-order modified Bessel function of the 1st kind.
Maximum Likelihood for von Mises • Given a data set $D = \{\theta_1, \ldots, \theta_N\}$, the log likelihood function is: $\ln p(D|\theta_0, m) = -N\ln(2\pi) - N\ln I_0(m) + m\sum_{n=1}^{N}\cos(\theta_n - \theta_0)$. • Maximizing with respect to $\theta_0$ we directly obtain: $\theta_0^{ML} = \tan^{-1}\left(\frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n}\right)$. • Similarly, maximizing with respect to $m$ we get: $A(m_{ML}) = \frac{I_1(m_{ML})}{I_0(m_{ML})} = \frac{1}{N}\sum_{n=1}^{N}\cos(\theta_n - \theta_0^{ML})$, which can be solved numerically for $m_{ML}$.
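A sketch of the full fit with SciPy, using a root finder to solve $A(m) = r$ numerically (the bracket [1e-6, 500] is an assumption that covers moderate concentrations):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import i0, i1

def fit_von_mises(theta):
    # theta_0_ML from the mean direction of the sample:
    theta0 = np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())
    # Solve A(m) = I_1(m)/I_0(m) = (1/N) sum cos(theta_n - theta0) for m:
    r = np.cos(theta - theta0).mean()
    m = brentq(lambda m: i1(m) / i0(m) - r, 1e-6, 500.0)
    return theta0, m

theta = np.random.default_rng(0).vonmises(mu=1.0, kappa=5.0, size=2000)
print(fit_von_mises(theta))   # should recover roughly (1.0, 5.0)
```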
Mixtures of Gaussians (1) • Old Faithful data set: fit with a single Gaussian vs. a mixture of two Gaussians (scatter-plot figures omitted).
Mixtures of Gaussians (2) • Combine simple models into a complex model: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x|\mu_k, \Sigma_k)$, where $\mathcal{N}(x|\mu_k, \Sigma_k)$ is a component and $\pi_k$ is its mixing coefficient. • Example: $K = 3$.
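The mixture density written out for a 1-D example with K = 3 (all parameter values are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

pi = np.array([0.5, 0.3, 0.2])       # mixing coefficients pi_k, sum to 1
mus = np.array([-2.0, 0.0, 3.0])     # component means mu_k
vars_ = np.array([0.5, 1.0, 0.25])   # component variances

def mixture_pdf(x):
    # p(x) = sum_k pi_k * N(x | mu_k, var_k)
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(pi, mus, vars_))

print(mixture_pdf(0.0))
```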