Introduction to multivariate analysis
• Data does not always come with a single response
• Nor does it always have a response
• A data set may consist simply of n measurements on p variables
• For example, a doctor might record a patient's height, weight, blood pressure and pulse.
• We could envision situations where any one of these variables is the response and the others are predictors.
• Or we could have a situation where we just wanted to examine similarities (and differences) between patients
Multivariate data
• So we can see that in some ways we've already encountered multivariate data – multiple regression is an example
• But we really haven't learned how to deal with anything other than a single response
• And we're not going to!
• There is a whole body of multivariate data analysis literature devoted to extensions of the techniques we've seen so far:
• Hotelling's T² – a multivariate extension of the t-test
• MANOVA – multivariate ANOVA, with more than one response
• Multiple regression where the response is multidimensional (canonical correlation)
Multivariate data
• The fact is that in some ways these extensions are trivial
• Sure, the interpretation issues are harder
• And the assumptions are just about impossible to verify
• For this reason, we will concentrate on multivariate data description
• This is by no means easy
• For example, how do we visualize data in more than 3 dimensions?
• Like EDA for low-dimensional problems, multivariate data visualization and explanation is possibly one of the most important treatments of multivariate data
An introduction to linear algebra
• Whilst it is theoretically possible to avoid discussing linear algebra when talking about multivariate techniques, it is practically impossible.
• The reason for this is that we need a common language in order to get some handle on what we're doing and what we're talking about
• Linear algebra provides that common language (and indeed underlies a majority of the statistics you have encountered already)
• We will not get hung up on computational techniques, as they are often abstracted away from the theory
Some definitions
• Dimensions in linear algebra – every object in linear algebra (scalar, vector or matrix) has a set of dimensions associated with it.
• These dimensions are reported as rows and columns. So, for example, a scalar (a real number) has 1 row and 1 column. An r × c matrix has r rows and c columns.
• An n-dimensional row vector is a list of n numbers (arranged in 1 row and n columns) which describes a point in n-dimensional space.
• An n-dimensional column vector is a list of n numbers (arranged in n rows and 1 column) which describes a point in n-dimensional space.
• In terms of using row or column vectors to describe a point in space, it makes no difference. However, they do behave differently when it comes to operations like multiplication (sketched below)
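A minimal sketch of the row/column distinction, using Python with numpy purely for illustration (the numbers are made up and are not from the lecture):

```python
import numpy as np

# The same three numbers as a row vector (1 x 3) and a column vector (3 x 1).
row = np.array([[1.0, 2.0, 3.0]])      # shape (1, 3): 1 row, 3 columns
col = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1): 3 rows, 1 column

print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)

# Both describe the same point in 3-dimensional space, but they behave
# differently under matrix multiplication:
print(row @ col)  # 1 x 1 result: the inner product, [[14.]]
print(col @ row)  # 3 x 3 result: the outer product
```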
• If a vector (of length n) is real valued then it represents a point in ℝⁿ – real n-dimensional space (ℝ² is the standard 2D plane, and ℝ³ is the space we live in)
• We write vectors as bold lower-case letters.
• In statistics, the default orientation of a vector (unless otherwise defined by the author) is a column vector. E.g. a is a column vector, not a row vector
• An n × p matrix is a collection of np elements arranged in n rows and p columns.
• Given that we tend to represent variables as column vectors, a matrix is a collection of p column vectors of length n.
• We write matrices as bold capital letters, e.g. X is a matrix
• The elements of a matrix are referred to by two indices, a row index and a column index, so x_ij refers to the jth element in the ith row of the matrix X.
• The transpose of a matrix reverses the roles of the rows and the columns. That is, an n × p matrix becomes a p × n matrix with ijth element equal to x_ji
• The transpose of a matrix X is denoted X^T, X^t or X′
• Usually we denote the jth column of X as x_j and the ith row as x_i^T (illustrated below)
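As a rough illustration of these indexing and transpose conventions (Python/numpy, not part of the lecture; note numpy uses 0-based indices, so x_ij corresponds to X[i-1, j-1]):

```python
import numpy as np

# A 2 x 3 matrix X: n = 2 rows, p = 3 columns.
X = np.array([[1, 2, 3],
              [4, 5, 6]])

print(X[0, 2])    # the element x_13 (row 1, column 3) -> 3
print(X.T.shape)  # the transpose X^T is 3 x 2
print(X.T[2, 0])  # element (3, 1) of X^T equals x_13 of X -> 3

# The jth column and the ith row of X (here j = 2, i = 1 in 1-based notation):
print(X[:, 1])    # column x_2 -> [2 5]
print(X[0, :])    # row 1      -> [1 2 3]
```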
• Just as with ordinary numbers, there is a matrix equivalent of division, but some special conditions apply.
• If X is an n × n matrix (then X is called a square matrix), and X is of full rank (the only solution b to the equation Xb = 0 is b = 0), then X has an inverse X⁻¹, such that XX⁻¹ = X⁻¹X = I_n, where I_n is the identity matrix (a matrix with ones down the diagonal and zeroes elsewhere)
• A small numerical example is sketched below
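A small numerical sketch (Python/numpy, with illustrative matrices of my own choosing) of a full-rank matrix and its inverse, and of a rank-deficient matrix that has no inverse:

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # a full-rank (invertible) 2 x 2 matrix

X_inv = np.linalg.inv(X)

# Both products recover the 2 x 2 identity matrix (up to rounding error).
print(np.allclose(X @ X_inv, np.eye(2)))  # True
print(np.allclose(X_inv @ X, np.eye(2)))  # True

# A rank-deficient matrix (second row is a multiple of the first) has no inverse:
Y = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.matrix_rank(Y))  # 1, so np.linalg.inv(Y) would raise LinAlgError
```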
• If A is an n × p matrix and B is an r × c matrix, then we can multiply B by A (to get AB) only if p = r
• Note this does not automatically mean we can form the product BA – that requires c = n
• In general AB ≠ BA
• If A is an n × p matrix, B is an r × c matrix and p = r, then the ijth element of the product (let C = AB) is given by $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$
• This is more easily demonstrated than seen from the formula (see the sketch below)
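A brief sketch of the conformability rule and of AB ≠ BA (Python/numpy, illustrative numbers only):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 x 3
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # 3 x 2

# A is 2 x 3 and B is 3 x 2, so AB is defined (p = r = 3) and is 2 x 2,
# while BA is also defined here but is 3 x 3 -- clearly AB != BA.
print(A @ B)   # 2 x 2
print(B @ A)   # 3 x 3

# Element c_ij is the sum over k of a_ik * b_kj, e.g. c_11:
c11 = sum(A[0, k] * B[k, 0] for k in range(A.shape[1]))
print(c11, (A @ B)[0, 0])   # both give 4
```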
Distance measures
• There are a variety of different ways that distance can be measured (between two multidimensional points), and their pros and cons are just about as varied
• If we have just two variables (p = 2), X and Y, with n observations on each, then the distance between the ith and the jth points is given by Pythagoras' theorem: $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
Euclidean distance
• The examples you have seen so far are called Euclidean distances and have a natural extension when there are more than three variables (p > 3).
• For any p the Euclidean distance between two points is given by $d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2$
• This should look familiar – this is the distance we minimize in regression.
• The second part of the equation is called the L2 norm (sketched below)
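A short sketch of the Euclidean (L2) distance between two observations (Python/numpy; the p = 4 measurements are made-up values, not lecture data):

```python
import numpy as np

# Two observations measured on p = 4 variables (hypothetical values).
x_i = np.array([170.0, 65.0, 120.0, 70.0])
x_j = np.array([180.0, 80.0, 130.0, 75.0])

# Euclidean distance: square root of the sum of squared differences.
d_manual = np.sqrt(np.sum((x_i - x_j) ** 2))

# Equivalently, the L2 norm of the difference vector.
d_norm = np.linalg.norm(x_i - x_j, ord=2)

print(d_manual, d_norm)  # both about 21.2
```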
• One direct downside of the Euclidean distance is that it is dominated by variables measured on a large scale (relative to the other variables).
• For example, if variable X1 measures height in mm and variable X2 measures weight in stone, then most of the distance will be dominated by X1
• One solution to this is to scale each variable before measuring the distance (sketched below)
• That is, we subtract the mean of each variable from every measurement for that variable and divide by the standard deviation
• This can work well, but has the disadvantage of removing information about separation
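The scaling idea above, sketched in Python/numpy with invented data (heights in mm, weights in stone, as in the example):

```python
import numpy as np

# n = 4 patients, two variables: height in mm (large scale) and weight in stone.
X = np.array([[1700.0,  9.5],
              [1810.0, 12.0],
              [1650.0, 10.0],
              [1760.0, 14.5]])

# Standardize each column: subtract its mean and divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Distances between patients 1 and 2 before and after scaling:
d_raw    = np.linalg.norm(X[0] - X[1])  # dominated by the height (mm) column
d_scaled = np.linalg.norm(Z[0] - Z[1])  # both variables now contribute comparably
print(d_raw, d_scaled)
```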
Alternative distance measures
• There is a whole family of distance measures based on norms
• The L2 norm is only one of a family of measures (called p-norms, or Lp norms) given by the general formula $\lVert \mathbf{x} \rVert_p = \left( \sum_{k=1}^{n} |x_k|^p \right)^{1/p}$, with the corresponding distance between two points being $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert_p$
• When p = 1, the L1 norm is sometimes called the Manhattan distance
• When p is infinite, the L∞ norm is called the infinity norm (or max or sup norm) and is defined by $\lVert \mathbf{x} \rVert_\infty = \max_k |x_k|$ (compared below)
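A sketch comparing the L1 (Manhattan), L2 (Euclidean) and L∞ (max) distances for the same pair of points (Python/numpy, made-up values):

```python
import numpy as np

x_i = np.array([1.0, 4.0, 2.0])
x_j = np.array([3.0, 1.0, 2.0])
diff = x_i - x_j                           # [-2, 3, 0]

d_L1   = np.linalg.norm(diff, ord=1)       # |-2| + |3| + |0|     = 5.0
d_L2   = np.linalg.norm(diff, ord=2)       # sqrt(4 + 9 + 0)      ~ 3.61
d_Linf = np.linalg.norm(diff, ord=np.inf)  # max(|-2|, |3|, |0|)  = 3.0

print(d_L1, d_L2, d_Linf)
```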
Mahalanobis distance
• The Mahalanobis distance is used to measure the distance of a single multivariate observation from the centre of the population that the observation comes from.
• If $x_1, x_2, \ldots, x_p$ are the values of $X_1, X_2, \ldots, X_p$ for the individual, with corresponding population mean values $\mu_1, \mu_2, \ldots, \mu_p$, then $D^2 = (\mathbf{x} - \boldsymbol{\mu})^T V^{-1} (\mathbf{x} - \boldsymbol{\mu})$, where V is the population covariance matrix.
• If we have the population means and covariance matrix (and the data are multivariate normal), then D² follows a chi-square distribution with p degrees of freedom
Mahalanobis distance
• The covariance matrix is the multivariate equivalent of the variance of a single variable, with the diagonal elements equal to the sample variances and the off-diagonal elements c_ij equal to the sample covariances between the ith and jth variables
• Unfortunately we almost never know the population means or the covariance matrix, and so we must estimate them from the data (sketched below)
• The covariance matrix V is estimated by taking a pooled average of the sample covariance matrices for each of the samples (groups)
• It is unclear how quickly the Mahalanobis distance converges to a chi-square distribution, but Manly suggests that with p = 100 (100 independent variables) there should be no problem in assuming this.
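A sketch of the Mahalanobis distance of one observation from an estimated centre, using the sample covariance matrix as the estimate of V and comparing D² with the chi-square distribution (Python with numpy and scipy; all data simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate n = 200 observations on p = 3 correlated variables.
n, p = 200, 3
true_cov = np.array([[1.0, 0.5, 0.2],
                     [0.5, 2.0, 0.3],
                     [0.2, 0.3, 1.5]])
data = rng.multivariate_normal(mean=[10.0, 5.0, 0.0], cov=true_cov, size=n)

# Estimate the mean vector and covariance matrix from the data.
mean_hat = data.mean(axis=0)
V_hat = np.cov(data, rowvar=False)

# Squared Mahalanobis distance of the first observation from the centre.
diff = data[0] - mean_hat
D2 = diff @ np.linalg.inv(V_hat) @ diff

# Under (approximate) multivariate normality, D2 ~ chi-square with p df,
# so an unusually large D2 flags an outlying observation.
print(D2, 1 - stats.chi2.cdf(D2, df=p))
```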