Multivariate Analysis

Harry R. Erwin, PhD, School of Computing and Technology, University of Sunderland

Resources • Everitt, BS, and G Dunn (2001) Applied Multivariate Data Analysis, London: Arnold. • Everitt, BS (2005) An R and S-PLUS® Companion to Multivariate Analysis, London: Springer.

Roadmap • PBL group assignments • Multivariate data graphics tutorials • Testing distributional assumptions • Principal components analysis • Cluster analysis • Summary

PBL group assignments • Two groups

Multivariate data graphics tutorials • Available on the module website • Covers both standard and lattice graphics

Testing distributional assumptions • For these techniques to work well, the data should follow (at least approximately) a multivariate normal distribution. • There are two ways of checking this: • Examine each variable separately with a normal probability plot (normal marginals are necessary but not sufficient: they do not imply that the data are jointly multivariate normal). • Convert each observation to a single number (a generalised distance from the sample mean) and plot the ordered distances against the quantiles of the appropriate chi-squared distribution.

Separate Examination • X has two columns, and the combined data are bivariate normal:
par(mfrow = c(1, 2))
qqnorm(X[, 1], ylab = "Ordered observations")
qqline(X[, 1])
qqnorm(X[, 2], ylab = "Ordered observations")
qqline(X[, 2])
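
The plotting calls on this slide need a two-column matrix X to exist. The slides do not say where X comes from, so the following is only a sketch of one way to simulate it in base R. The seed, the sample size of 200, and the correlation of 0.6 are arbitrary assumptions made for illustration.

```r
# Hypothetical example data: 200 draws from a bivariate normal distribution.
# The seed, sample size and covariance matrix are illustrative choices only.
set.seed(1)
Sigma <- matrix(c(1, 0.6,
                  0.6, 1), nrow = 2)         # target covariance matrix
Z <- matrix(rnorm(2 * 200), ncol = 2)        # independent standard normals
X <- Z %*% chol(Sigma)                       # rows of X now have covariance Sigma
colnames(X) <- c("x1", "x2")
round(cov(X), 2)                             # sample covariance, close to Sigma
```

Because chol(Sigma) returns an upper triangular factor R with t(R) %*% R equal to Sigma, multiplying the independent normals by R gives the required covariance structure.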

Comparison to a chi-squared distribution • Same data, using chisplot, available at http://biostatistics.iop.kcl.ac.uk/publications/everitt/
par(mfrow = c(1, 1))
chisplot(X)
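
If chisplot cannot be downloaded, the idea can be approximated in base R. The sketch below is not Everitt's code: chisq_plot is a hypothetical name, and the function assumes the generalised distance is the squared Mahalanobis distance of each row from the sample mean, plotted against chi-squared quantiles with degrees of freedom equal to the number of variables. The simulated X is an arbitrary stand-in.

```r
# Sketch of a chi-squared plot (hypothetical helper, not Everitt's chisplot).
chisq_plot <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  # Squared generalised (Mahalanobis) distance of each row from the mean
  d2 <- sort(mahalanobis(X, center = colMeans(X), cov = cov(X)))
  # Matching quantiles of a chi-squared distribution with p degrees of freedom
  q <- qchisq(ppoints(n), df = p)
  plot(q, d2,
       xlab = "Chi-squared quantiles",
       ylab = "Ordered squared distances")
  abline(0, 1)  # points near this line are consistent with multivariate normality
  invisible(data.frame(quantile = q, distance = d2))
}

# Illustrative use on simulated data (assumed, not from the slides)
set.seed(42)
X <- matrix(rnorm(200), ncol = 2)
res <- chisq_plot(X)
```

Points that curve away from the reference line, especially in the upper tail, suggest outliers or a departure from multivariate normality.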

Principal components analysis (PCA) • Describes the variation in a set of multivariate data in terms of a set of uncorrelated variables, each a linear combination of the original variables. • The goal is to replace the original variables with a small number of components that summarise most of the variation in the data set. • Useful when the explanatory variables are highly correlated. • Representative of projection pursuit methods.
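
A minimal sketch of PCA in base R, using prcomp from the stats package. The USArrests data set (shipped with R) is an assumption chosen only because it is always available; it is not used in the slides. Scaling the variables is also a choice made here, because the columns are measured in different units.

```r
# PCA sketch on a built-in data set (USArrests is an illustrative choice).
pca <- prcomp(USArrests, scale. = TRUE)  # centre and scale each variable first

summary(pca)        # proportion of variance explained by each component
pca$rotation        # loadings: each component as a linear combination of the variables

# The component scores are uncorrelated with one another
score_cor <- cor(pca$x)
round(score_cor, 10)

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)
```

The cumulative proportions give a rough guide to how many components are worth keeping.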

Cluster analysis • A tool for classifying a phenomenon by sorting the samples into a small number of groups or clusters, usually non-overlapping. • These clusters may not be unique; the same data can be grouped in different ways depending on the purpose: • Predictive clustering • Clustering based on causation • Hence a cluster analysis is neither true nor false, but is simply useful.

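Cluster analysis approaches • Agglomerative hierarchical clustering (fusion from the bottom up) • K-means type methods (partition from the top down) • Classification maximum likelihood methods (assume a model for the shape of the clusters) • Or you can simply use the tree library, which splits the samples into groups by recursive partitioning on the explanatory variables:
library(tree)
# ozone.pollution must already be loaded as a data frame with a column named ozone
model <- tree(ozone ~ ., data = ozone.pollution)
plot(model)
text(model)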
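
The first two approaches can be sketched with base R functions hclust and kmeans. The data set (USArrests), the choice of three clusters, complete linkage, and the seed are all assumptions made for illustration; none of them come from the slides.

```r
# Clustering sketch on a built-in data set (all settings are illustrative).
Z <- scale(USArrests)                      # standardise so no variable dominates

# Agglomerative hierarchical clustering: fuse from the bottom up
hc <- hclust(dist(Z), method = "complete")
plot(hc)                                   # dendrogram
groups_hc <- cutree(hc, k = 3)             # cut the tree into 3 groups

# K-means: partition the samples into 3 groups
set.seed(1)                                # k-means starts from random centres
km <- kmeans(Z, centers = 3, nstart = 20)

# Compare the two partitions; they need not agree
agreement <- table(hierarchical = groups_hc, kmeans = km$cluster)
agreement
```

Disagreement between the two partitions illustrates the point that clusters are not unique. For the model-based approach, the Mclust function in the add-on mclust package is one option, though it is not part of base R.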

Summary • Multivariate statistics is usually done from the point of view that there are no laws of scientific inference—‘anything goes’. • First, you explore the data to come up with hypotheses—the models. • Then you confirm the models on a second data set. • If you have a single data set, split it into two parts, one for exploration and one for confirmation. • Good data analysis is based on the skilful interpretation of evidence and the subsequent development of hunches.
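
Splitting a single data set into exploration and confirmation halves can be sketched as follows. The data set (USArrests), the even split, and the seed are assumptions for illustration only.

```r
# Sketch: random split into an exploration half and a confirmation half.
set.seed(2024)                                   # arbitrary seed for reproducibility
n <- nrow(USArrests)
explore_idx <- sample(n, size = floor(n / 2))    # row numbers for exploration

explore <- USArrests[explore_idx, ]              # use this half to form hypotheses
confirm <- USArrests[-explore_idx, ]             # hold this half back for confirmation

c(exploration = nrow(explore), confirmation = nrow(confirm))
```

The confirmation half should stay untouched until the models suggested by the exploration half are fixed.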
