A latent Gaussian model for compositional data with structural zeroes

Adam Butler & Chris Glasbey, Biomathematics & Statistics Scotland

1. Application to seabird diet
• How does the composition of seabird diet vary between colonies, years and seasons?
• Kittiwake data from four islands on the east coast of Scotland for 1997-2000
• Previously analysed by Bull et al. (2004)

Relative proportions of D=3 food types:
- SE0: juvenile sandeels
- SE1: adult sandeels
- Other species (aggregated)
• 543 individual birds:
- 251 have SE0 only
- 51 have SE1 only
- 80 have “other” only
- 158 have a mix

2. Compositional data
• Compositional data refer to relative frequencies (proportions), and frequently arise in fields such as geology, economics and ecology.
• If x denotes data on the proportions of D components then x must lie on the unit simplex: $S^D = \{x \in \mathbb{R}^D : x_i \ge 0,\ x^T 1 = 1\}$
• Such data cannot be analysed using standard methods because of the sum constraint that $x^T 1 = 1$.

• A well-established approach deals with compositional data by modelling log-ratios of x using a multivariate normal distribution: Aitchison (1986)
• If x lies in the interior of the simplex this works well, but it cannot be applied when some components of x are zero
• There is no general approach for the situation in which zero values of x may correspond to genuine absences of a component: “structural zeroes”
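To illustrate why zero components break the log-ratio approach, here is a minimal sketch of the additive log-ratio transform, which is one of the log-ratio transforms associated with Aitchison (1986). The function name `alr`, the choice of the last component as the reference, and the example compositions are illustrative assumptions, not taken from the slides.

```python
import math


def alr(x):
    """Additive log-ratio transform: log(x_i / x_D) for i = 1, ..., D-1.

    Using the last component as the reference is an assumption made for
    this illustration.
    """
    if any(v <= 0 for v in x):
        # log(0) and division by zero are undefined, so the transform
        # cannot be applied on the boundary of the simplex.
        raise ValueError("log-ratios are undefined when a component is zero")
    return [math.log(v / x[-1]) for v in x[:-1]]


# A composition in the interior of the simplex transforms without trouble.
interior = [0.2, 0.3, 0.5]
print(alr(interior))

# A composition with a zero component cannot be transformed.
boundary = [0.0, 0.4, 0.6]
try:
    alr(boundary)
except ValueError as err:
    print("cannot transform:", err)
```

In the seabird data most birds have a single food type, so almost every observation would fail this transform.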

3. A latent Gaussian model

We assume that x = g(y), where:
• y has a D-dimensional multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, where $\mu^T 1 = 1$ and $\Sigma 1 = 0$.
• g is the function which performs a Euclidean projection of y onto the unit simplex $S^D$

• Parsimonious: (D-1)(D+2)/2 parameters
• Relatively flexible: can cope with a high proportion of zero values
• There is no mathematical justification for our model, so it is important to check the fit to the data
• Diagnostic: compare patterns of zero values in the data with those given by the model
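The zero-pattern diagnostic can be sketched as a simple tabulation: for each composition, record which components are zero, then compare the relative frequency of each pattern between observed and model-simulated data. The function name `zero_patterns` and the two small data sets below are made-up placeholders for illustration only.

```python
from collections import Counter


def zero_patterns(compositions, tol=0.0):
    """Count how often each pattern of zero / non-zero components occurs.

    A pattern such as "+00" means the first component is positive and the
    other two are zero. `tol` is an illustrative tolerance for "zero".
    """
    counts = Counter()
    for x in compositions:
        pattern = "".join("0" if v <= tol else "+" for v in x)
        counts[pattern] += 1
    return counts


# Toy placeholder data (not the kittiwake data).
observed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0], [1.0, 0.0, 0.0]]
simulated = [[1.0, 0.0, 0.0], [0.2, 0.3, 0.5], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

obs = zero_patterns(observed)
sim = zero_patterns(simulated)

# Print the relative frequency of each pattern side by side.
for pattern in sorted(set(obs) | set(sim)):
    print(pattern, obs[pattern] / len(observed), sim[pattern] / len(simulated))
```

Large discrepancies between the two columns would suggest that the fitted model does not reproduce the pattern of zeroes seen in the data.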

4. Inference
• The log-likelihood function is expressed in terms of:
• $\phi_D(x; \mu, \Sigma)$, the PDF of a multivariate normal distribution
• $h(x) = \{y : g(y) = x\}$, the “inverse” of g(y)

For general D the likelihood cannot be evaluated analytically, because:
• There are no explicit formulae for either g(y) or h(x)
• Even if we could evaluate h(x), the likelihood would still contain intractable integrals

But in order to simulate from the model we only need to find the Euclidean projection of y onto the unit simplex:
• We propose an iterative algorithm for doing this, which reaches the solution in at most D-1 steps
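One well-known iterative scheme for this projection repeatedly projects onto the hyperplane where the components sum to one, fixes any negative components at zero, and repeats on the remaining components. The sketch below follows that scheme; it is offered as an illustration of the idea and is not claimed to be the authors' exact algorithm. The function name `project_to_simplex` is hypothetical.

```python
def project_to_simplex(y):
    """Euclidean projection of the vector y onto the unit simplex.

    Illustrative sketch: project the active components onto the hyperplane
    sum(x) = 1, drop those that become negative (they are set to zero),
    and repeat until every active component is non-negative.
    """
    d = len(y)
    x = [0.0] * d
    active = list(range(d))
    while True:
        # Equal shift that makes the active components sum to one.
        shift = (1.0 - sum(y[i] for i in active)) / len(active)
        still_active = [i for i in active if y[i] + shift >= 0.0]
        if len(still_active) == len(active):
            for i in active:
                x[i] = y[i] + shift
            return x
        # The shifted active components sum to one, so at least one is
        # positive and the active set never becomes empty.
        active = still_active


# A point outside the simplex is mapped to a composition with a zero,
# here approximately [0.5, 0.5, 0.0].
print(project_to_simplex([0.6, 0.6, -0.2]))

# A point already on the simplex is left where it is.
print(project_to_simplex([0.2, 0.3, 0.5]))
```

Each pass removes at least one component from the active set or terminates, which is consistent with the bound of at most D-1 steps quoted above. Simulating from the model then amounts to drawing y from the multivariate normal distribution and applying this function.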

5. Approximate Bayesian Computation
• “ABC” is a methodology for drawing inferences by Monte Carlo simulation when the likelihood is intractable but the model is easy to simulate from
• In usual MCMC we tend to accept parameter values that have relatively high values of the likelihood
• In ABC we tend to accept parameter values that simulate data with summary statistics similar to those of the real data

Elements of ABC:
• Prior distribution $\pi(\theta)$
• Summary statistics S
• Distance measure D, threshold $\varepsilon$
• Number of samples N

Basic ABC algorithm:
for (i = 1,…,N) {
  (1) Generate values $\theta^*$ by simulating from the prior $\pi(\theta)$
  (2) Simulate $y^*$ from the model with parameters $\theta^*$
  (3) If $D(S(y^*), S(y)) < \varepsilon$ then set $\theta^{(i)} = \theta^*$; else go to (1)
}
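The three steps of the basic algorithm map directly onto a rejection loop. The sketch below applies them to a deliberately simple toy problem (estimating the mean of a normal distribution with known unit variance), which is an assumption chosen for illustration and is unrelated to the seabird model. All function names, the prior range, the threshold and the seed are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def abc_rejection(y_obs, prior_draw, simulate, summary, distance, eps, n_samples, rng):
    """Basic ABC: keep prior draws whose simulated summaries are close to the data."""
    s_obs = summary(y_obs)
    accepted = []
    while len(accepted) < n_samples:
        theta_star = prior_draw(rng)            # (1) draw from the prior
        y_star = simulate(theta_star, rng)      # (2) simulate from the model
        if distance(summary(y_star), s_obs) < eps:
            accepted.append(theta_star)         # (3) accept, otherwise retry
    return accepted


# Toy setup (illustrative assumption): data are Normal(theta, 1).
rng = random.Random(1)
y_obs = [rng.gauss(2.0, 1.0) for _ in range(50)]

posterior = abc_rejection(
    y_obs,
    prior_draw=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda theta, r: [r.gauss(theta, 1.0) for _ in range(50)],
    summary=mean,
    distance=lambda a, b: abs(a - b),
    eps=0.2,
    n_samples=200,
    rng=rng,
)
print("observed mean:", mean(y_obs))
print("ABC posterior mean:", mean(posterior))
```

A smaller threshold gives a better approximation to the posterior but lowers the acceptance rate, which is the inefficiency that sequential versions of ABC aim to reduce.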

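Sequential ABC algorithm (Sisson et al., 2006):
Generate values $\{\theta_0^{(1)}, \ldots, \theta_0^{(N)}\}$ by simulating from the prior $\pi(\theta)$ and applying the basic ABC algorithm with threshold $\varepsilon_0$
for (t = 1,…,T) {
  Generate values $\{\theta_t^{(1)}, \ldots, \theta_t^{(N)}\}$ by sampling from $\{\theta_{t-1}^{(1)}, \ldots, \theta_{t-1}^{(N)}\}$, proposing a move using q, and applying the basic ABC algorithm with threshold $\varepsilon_t$
}
Take $\varepsilon_T = \varepsilon$; need a proposal distribution q and thresholds $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_{T-1}$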
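The population-based scheme can be sketched as follows, again on a toy normal-mean problem chosen purely for illustration. This is a simplified version: it resamples the previous population with equal weights and ignores the importance weights that a full sequential Monte Carlo treatment carries, so it should be read as an outline of the idea rather than as the published algorithm. The function names, the Gaussian random-walk proposal q, the threshold schedule and the seed are all assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(theta, rng, n=50):
    # Toy model (illustrative assumption): Normal(theta, 1) observations.
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def distance(y_star, y_obs):
    # Summary statistic is the sample mean; distance is the absolute difference.
    return abs(mean(y_star) - mean(y_obs))


def sequential_abc(y_obs, thresholds, n_particles, rng, lo=-5.0, hi=5.0, step=0.5):
    """Simplified sequential ABC with a decreasing sequence of thresholds."""
    # t = 0: basic ABC from the uniform prior with the loosest threshold.
    particles = []
    while len(particles) < n_particles:
        theta = rng.uniform(lo, hi)
        if distance(simulate(theta, rng), y_obs) < thresholds[0]:
            particles.append(theta)
    # t = 1, ..., T: resample, move with q, accept under a tighter threshold.
    for eps_t in thresholds[1:]:
        new_particles = []
        while len(new_particles) < n_particles:
            theta = rng.choice(particles) + rng.gauss(0.0, step)
            if not lo <= theta <= hi:
                continue  # proposal fell outside the prior support
            if distance(simulate(theta, rng), y_obs) < eps_t:
                new_particles.append(theta)
        particles = new_particles
    return particles


rng = random.Random(2)
y_obs = [rng.gauss(2.0, 1.0) for _ in range(50)]
final = sequential_abc(y_obs, thresholds=[2.0, 1.0, 0.5, 0.2], n_particles=200, rng=rng)
print("observed mean:", mean(y_obs))
print("final population mean:", mean(final))
```

Because each population starts from parameter values that already passed a looser threshold, far fewer simulations are wasted than when every proposal is drawn from a wide prior.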

Elements of ABC (our choices):
• Prior distribution $\pi(\theta)$: uniform over a wide interval
• Summary statistics S:
- marginal means, marginal variances (×2);
- means of differences between components (/2);
- proportions of zero and one values for each component
• Distance measure D: mean of absolute values of the elements of $S(y^*) - S(y)$
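A minimal sketch of these summary statistics and the distance measure is given below. Several details are not specified on the slide and are assumptions here: the "(×2)" and "(/2)" annotations are read as scaling factors, variances are computed with divisor n, and "differences between components" is read as the mean absolute difference for each pair of components. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def summary_statistics(data):
    """Summary vector S for a list of compositions (rows of equal length)."""
    n = len(data)
    d = len(data[0])
    cols = [[row[j] for row in data] for j in range(d)]
    means = [sum(c) / n for c in cols]
    # Marginal variances, scaled by 2 (assumed reading of "x2").
    variances = [2.0 * sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    # Mean absolute pairwise differences, scaled by 1/2 (assumed reading of "/2").
    diffs = [
        0.5 * sum(abs(row[i] - row[j]) for row in data) / n
        for i, j in combinations(range(d), 2)
    ]
    p_zero = [sum(1 for v in c if v == 0.0) / n for c in cols]
    p_one = [sum(1 for v in c if v == 1.0) / n for c in cols]
    return means + variances + diffs + p_zero + p_one


def distance(s_star, s_obs):
    """Mean of the absolute values of the elements of S(y*) - S(y)."""
    return sum(abs(a - b) for a, b in zip(s_star, s_obs)) / len(s_obs)


# Toy placeholder data (not the kittiwake data).
y_obs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
y_star = [[1.0, 0.0, 0.0], [0.2, 0.3, 0.5], [0.0, 0.0, 1.0]]
print(distance(summary_statistics(y_star), summary_statistics(y_obs)))
```

Including the proportions of zero and one values means that parameter values are only accepted if they reproduce the amount of structural zeroes seen in the data.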

6. Results – simulated data
• D=3 components
• Compare ABC (black) and analytic MLEs (red)
• Generate n=200 observations from a symmetric model with marginal SDs of 1

6. Results – seabird data
Aim in future to apply the model to:
- individual groups
- more diet classes

7. Conclusions
• Parsimonious model for compositional data that contain structural zeroes
• Developed an iterative algorithm to simulate from the model
• The likelihood cannot be computed analytically, so we use ABC methods to draw inferences
• Sequential ABC algorithm (Sisson et al., 2006) is much more efficient than other ABC algorithms

Further information
Email: adam@bioss.ac.uk
Manuscript: www.bioss.ac.uk/staff/adam/publications.html
http://www.rolexawards.com/special-feature/creatures/img/large506.jpg
