EE3J2 Data Mining
Lecture 10: Statistical Modelling
Martin Russell
Objectives
• To review basic statistical modelling
• To review the notion of a probability distribution
• To review the notion of a probability density function
• To introduce mixture densities
• To introduce the multivariate Gaussian density
Discrete variables
• Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, …, x_M}
• Suppose that y_1, y_2, …, y_N are samples of the random variable Y
• If c_m is the number of times that y_n = x_m, then an estimate of the probability that y_n takes the value x_m is given by:

  P(x_m) ≈ c_m / N
Discrete Probability Mass Function

  Symbol             1    2    3    4    5    6    7    8    9   Total
  Num. Occurrences  120  231   90   87   63   57  156  203   91   1098
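The counting estimate above can be sketched in a few lines of Python (a minimal illustration using the counts from the table; the variable names are not from the lecture):

```python
# Estimating a discrete probability mass function by counting.
# Occurrence counts c_m for each symbol x_m, taken from the table above.
counts = {1: 120, 2: 231, 3: 90, 4: 87, 5: 63, 6: 57, 7: 156, 8: 203, 9: 91}
total = sum(counts.values())  # N = 1098

# P(x_m) is estimated as c_m / N
pmf = {symbol: c / total for symbol, c in counts.items()}

print(round(pmf[2], 4))            # most frequent symbol, ≈ 0.2104
print(round(sum(pmf.values()), 6)) # estimated probabilities sum to 1
```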
Continuous Random Variables
• In most practical applications the data are not restricted to a finite set of values – they can take any value in N-dimensional space
• Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
• …but there are generalisations of this approach which are applicable to continuous variables – these are referred to as non-parametric methods
Continuous Random Variables
• An alternative is to use a parametric model
• In a parametric model, probabilities are defined by a small set of parameters
• The simplest example is a normal, or Gaussian, model
• A Gaussian probability density function (PDF) is defined by two parameters – its mean and variance
Gaussian PDF
• ‘Standard’ 1-dimensional Gaussian PDF:
• mean μ = 0
• variance σ² = 1
Gaussian PDF
• The probability that the variable lies between a and b is the area under the PDF between those points:

  P(a ≤ X ≤ b) = ∫_a^b p(x) dx
Gaussian PDF
• For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:

  p(y) = (1 / √(2πσ²)) · exp( −(y − μ)² / (2σ²) )

• The constant 1/√(2πσ²) ensures the area under the curve is 1; the exponential term defines the ‘bell’ shape
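The density formula can be evaluated directly; a minimal sketch (the function name gaussian_pdf is illustrative, not from the lecture):

```python
import math

def gaussian_pdf(y, mu, var):
    """1-D Gaussian density: (1 / sqrt(2*pi*var)) * exp(-(y - mu)**2 / (2*var))."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Peak of the standard Gaussian (mu = 0, var = 1) is 1/sqrt(2*pi) ≈ 0.3989
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))
```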
More examples
• [Figure: Gaussian PDFs with variances σ² = 0.1, 1.0, 5.0, 10.0]
Fitting a Gaussian PDF to Data
• Suppose y = y_1, …, y_n, …, y_N is a set of N data values
• Given a Gaussian PDF p with mean μ and variance σ², define:

  p(y | μ, σ²) = ∏_{n=1}^{N} p(y_n | μ, σ²)

• How do we choose μ and σ² to maximise this probability?
Fitting a Gaussian PDF to Data
• [Figure: a Gaussian curve that follows the shape of the data (good fit) versus one that does not (poor fit)]
Maximum Likelihood Estimation
• Define the best-fitting Gaussian to be the one such that p(y | μ, σ²) is maximised
• Terminology:
• p(y | μ, σ²), thought of as a function of y, is the probability (density) of y
• p(y | μ, σ²), thought of as a function of μ, σ², is the likelihood of μ, σ²
• Maximising p(y | μ, σ²) with respect to μ, σ² is called Maximum Likelihood (ML) estimation of μ, σ²
ML estimation of μ, σ²
• Intuitively:
• The maximum likelihood estimate of μ should be the average value of y_1, …, y_N (the sample mean)
• The maximum likelihood estimate of σ² should be the variance of y_1, …, y_N (the sample variance)
• This turns out to be true: p(y | μ, σ²) is maximised by setting:

  μ̂ = (1/N) ∑_{n=1}^{N} y_n,    σ̂² = (1/N) ∑_{n=1}^{N} (y_n − μ̂)²
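These closed-form ML estimates translate directly into code; a minimal sketch (the helper name ml_estimates is illustrative):

```python
def ml_estimates(ys):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    n = len(ys)
    mu = sum(ys) / n                               # sample mean
    var = sum((y - mu) ** 2 for y in ys) / n       # sample variance (divide by N, not N-1)
    return mu, var

mu, var = ml_estimates([1.0, 2.0, 3.0, 4.0, 5.0])
print(mu, var)  # 3.0 2.0
```

Note the ML variance divides by N; the unbiased estimator would divide by N − 1, but that is not the likelihood-maximising choice.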
Multi-modal distributions
• In practice the distributions of many naturally occurring phenomena do not follow the simple bell-shaped Gaussian curve
• For example, if the data arises from several different sources, there may be several distinct peaks (e.g. the distribution of heights of adults)
• These peaks are the modes of the distribution, and the distribution is called multi-modal
Gaussian Mixture PDFs
• Gaussian Mixture PDFs, or Gaussian Mixture Models (GMMs), are commonly used to model multi-modal or other non-Gaussian distributions
• A GMM is just a weighted average of several Gaussian PDFs, called the component PDFs
• For example, if p_1 and p_2 are Gaussian PDFs, then p(y) = w_1 p_1(y) + w_2 p_2(y) defines a 2-component Gaussian mixture PDF
Gaussian Mixture – Example
• 2-component mixture model
• Component 1: μ = 0, σ² = 0.1
• Component 2: μ = 2, σ² = 1
• w_1 = w_2 = 0.5

Example 2
• 2-component mixture model
• Component 1: μ = 0, σ² = 0.1
• Component 2: μ = 2, σ² = 1
• w_1 = 0.2, w_2 = 0.8

Example 3
• 2-component mixture model
• Component 1: μ = 0, σ² = 0.1
• Component 2: μ = 2, σ² = 1
• w_1 = 0.2, w_2 = 0.8

Example 4
• 5-component Gaussian mixture PDF
Gaussian Mixture Model
• In general, an M-component Gaussian mixture PDF is defined by:

  p(y) = ∑_{m=1}^{M} w_m p_m(y)

  where each p_m is a Gaussian PDF and ∑_{m=1}^{M} w_m = 1, w_m ≥ 0
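The weighted-sum definition can be sketched as follows (function names are illustrative; the example parameters are those of the two-component mixture from the earlier slides):

```python
import math

def gaussian_pdf(y, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(y, weights, means, variances):
    """M-component Gaussian mixture: weighted sum of component densities."""
    return sum(w * gaussian_pdf(y, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Two-component mixture: mu_1=0, var_1=0.1; mu_2=2, var_2=1; equal weights
p = gmm_pdf(0.0, [0.5, 0.5], [0.0, 2.0], [0.1, 1.0])
```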
Estimating the parameters of a Gaussian mixture model
• A Gaussian Mixture Model with M components has:
• M means: μ_1, …, μ_M
• M variances: σ²_1, …, σ²_M
• M mixture weights: w_1, …, w_M
• Given a set of data y = y_1, …, y_N, how can we estimate these parameters?
• I.e. how do we find a maximum likelihood estimate of μ_1, …, μ_M, σ²_1, …, σ²_M, w_1, …, w_M?
Parameter Estimation
• If we knew which component each sample y_n came from, then parameter estimation would be easy:
• Set μ_m to be the average value of the samples which belong to the mth component
• Set σ²_m to be the variance of the samples which belong to the mth component
• Set w_m to be the proportion of samples which belong to the mth component
• But we don’t know which component each sample belongs to
Solution – the E-M algorithm
• Guess initial values for the parameters
• For each n calculate the probabilities λ_nm = P(m | y_n) – a measure of how much y_n ‘belongs to’ the mth component
• Use these probabilities to estimate how much each sample y_n ‘belongs to’ the mth component
• Calculate:

  μ̂_m = ( ∑_{n=1}^{N} λ_nm y_n ) / ( ∑_{n=1}^{N} λ_nm )
  σ̂²_m = ( ∑_{n=1}^{N} λ_nm (y_n − μ̂_m)² ) / ( ∑_{n=1}^{N} λ_nm )
  ŵ_m = (1/N) ∑_{n=1}^{N} λ_nm

• REPEAT
The E-M algorithm
• [Figure: the likelihood p(y | θ) plotted against the parameter set; successive estimates θ^(0), …, θ^(i) climb towards a local optimum]
E-M Algorithm
• Let’s just look at estimation of the mean μ of a single component of a GMM
• In fact, λ_nm = P(m | y_n)
• In other words, λ_nm is the probability of the mth component given the data point y_n
E-M continued
• From Bayes’ theorem:

  λ_nm = P(m | y_n) = w_m p_m(y_n) / ∑_{k=1}^{M} w_k p_k(y_n)

• The numerator is the mth weight multiplied by the mth Gaussian component, calculated at y_n; the denominator sums the same quantity over all components
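Putting the Bayes’-theorem E-step together with the weighted re-estimation M-step gives one E-M iteration. A minimal 1-dimensional sketch, assuming the standard GMM update formulas (the function name em_step is illustrative):

```python
import math

def gaussian_pdf(y, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(ys, weights, means, variances):
    """One E-M iteration for a 1-D Gaussian mixture model."""
    M, N = len(weights), len(ys)
    # E-step: responsibilities lam[n][m] = P(m | y_n), via Bayes' theorem
    lam = []
    for y in ys:
        joint = [w * gaussian_pdf(y, mu, var)
                 for w, mu, var in zip(weights, means, variances)]
        total = sum(joint)
        lam.append([j / total for j in joint])
    # M-step: weighted re-estimates of each component's parameters
    new_w, new_mu, new_var = [], [], []
    for m in range(M):
        occ = sum(lam[n][m] for n in range(N))                     # soft count
        mu = sum(lam[n][m] * ys[n] for n in range(N)) / occ        # weighted mean
        var = sum(lam[n][m] * (ys[n] - mu) ** 2 for n in range(N)) / occ
        new_w.append(occ / N)
        new_mu.append(mu)
        new_var.append(var)
    return new_w, new_mu, new_var
```

Each call increases (or leaves unchanged) the likelihood p(y | θ); iterating to convergence reaches a local optimum, as sketched on the previous slide.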
Example – initial model
• [Figure: a two-component initial model; for the data point y_6, component m_1 has responsibility P(m_1 | y_6) = λ_1 and component m_2 has responsibility P(m_2 | y_6) = λ_2]
Example – after 1st iteration of E-M

Example – after 2nd iteration of E-M

Example – after 4th iteration of E-M

Example – after 10th iteration of E-M
Multivariate Gaussian PDFs
• All PDFs so far have been 1-dimensional – they take scalar values
• But most real data will be represented as D-dimensional vectors
• The vector equivalent of a Gaussian PDF is called a multivariate Gaussian PDF
Multivariate Gaussian PDFs
• [Figures: 1-dimensional Gaussian PDFs alongside 2-dimensional multivariate Gaussian PDFs, the latter shown both as surfaces and as contours of equal probability]
Multivariate Gaussian PDF
• The parameters of a multivariate Gaussian PDF are:
• The (vector) mean μ
• The covariance matrix Σ, which contains the variances of the individual dimensions on its diagonal and the covariances between dimensions off the diagonal
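For the 2-dimensional case, the multivariate Gaussian density can be evaluated with an explicit 2×2 matrix inverse; a minimal sketch (the function name mvn_pdf_2d is illustrative, not from the lecture):

```python
import math

def mvn_pdf_2d(y, mean, cov):
    """2-D multivariate Gaussian density with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c                      # determinant of the covariance matrix
    inv = [[d / det, -b / det],
           [-c / det, a / det]]              # inverse of a 2x2 matrix
    dy = [y[0] - mean[0], y[1] - mean[1]]
    # quadratic form dy^T * inv * dy
    q = (dy[0] * (inv[0][0] * dy[0] + inv[0][1] * dy[1])
         + dy[1] * (inv[1][0] * dy[0] + inv[1][1] * dy[1]))
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# Standard 2-D Gaussian (zero mean, identity covariance) at its mean: 1/(2*pi)
p = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With identity covariance the contours of equal probability are circles; unequal diagonal entries stretch them into axis-aligned ellipses, and non-zero off-diagonal covariances rotate them.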
Multivariate Gaussian PDFs
• Multivariate Gaussian PDFs are commonly used in pattern processing and data mining
• Vector data is often not unimodal, so we use mixtures of multivariate Gaussian PDFs
• The E-M algorithm works for multivariate Gaussian mixture PDFs
Summary
• Basic statistical modelling
• Probability distributions
• Probability density functions
• Gaussian PDFs
• Gaussian mixture PDFs and the E-M algorithm
• Multivariate Gaussian PDFs