Ordinations Peter Shaw
Introduction to ordination • “Ordination” is the term used for arranging multivariate data in a rational order, usually in a 2D space (i.e. a diagram on paper). • The underlying concept is that of taking a set of community data (or chemical or physical…) and organising it into a picture that gives you insight into its organisation. [Diagram: sites ordered along a gradient – Wet: 1, 2; Muddy: 3; Dry: 4, 5]
A direct ordination, projecting communities onto time: the sand dune succession at the southern end of Lake Michigan (re-drawn from Olson 1958). [Figure: elevation (c. 177–193 m) plotted against years stable (0 to 10,000), showing bare sand stabilised by marram grass, then cottonwood Populus deltoides, pines Pinus spp., and black oak Quercus velutina]
A 2-D direct ordination, also called a mosaic diagram, in this case showing the distribution of vegetation types in relation to elevation and moisture in the Sequoia National Park. This is an example of a direct ordination, laying out communities in relation to two well-understood axes of variation. Redrawn from Vankat (1982) with kind permission of the California Botanical Society. [Figure: elevation (1000–4000 m) against a moist–dry gradient; vegetation types include alpine dwarf scrub, subalpine conifer, wet meadows, lodgepole pine, red fir, Jeffrey pine, pinyon, mixed conifers, ponderosa pine, montane hardwood, chaparral hardwoods, valley-foothill hardwood and annual grassland]
Bray-Curtis ordination • This technique is good for introducing the concept of ordination, but is almost never used nowadays. • It dates back to the late 1950s, when computers were unavailable, and had the advantage that it could be run by hand. • 3 steps: • Convert raw data to a matrix of distances between samples • Identify end points • Plot each sample on a graph in relation to the end points.
Sample data – a succession:

Year    A    B    C
 1    100    0    0
 2     90   10    0
 3     80   20    5
 4     60   35   10
 5     50   50   20
 6     40   60   30
 7     20   30   40
 8      5   20   60
 9      0   10   75
10      0    0   90

Choose a measure of distance between years. The usual one is the Bray-Curtis index (the Czekanowski index). Between years 1 and 2:

            A    B    C
Year 1    100    0    0
Year 2     90   10    0
Minimum    90    0    0   sum =  90
Sum       190   10    0   sum = 200

distance = 1 - 2*90/200 = 0.1
The matrix of Bray-Curtis distances between each of the 10 years. Notice that the matrix is symmetrical about the leading diagonal, so only the lower half is shown. PS I did this by hand. PPS It took about 30 minutes!

        1     2     3     4     5     6     7     8     9    10
 1   0.00
 2   0.10  0.00
 3   0.29  0.12  0.00
 4   0.41  0.28  0.15  0.00
 5   0.54  0.45  0.33  0.16  0.00
 6   0.65  0.50  0.45  0.28  0.12  0.00
 7   0.79  0.68  0.54  0.38  0.33  0.27  0.00
 8   0.95  0.84  0.68  0.63  0.56  0.49  0.26  0.00
 9   1.00  0.89  0.74  0.79  0.71  0.63  0.43  0.18  0.00
10   1.00  1.00  0.95  0.90  0.81  0.73  0.56  0.33  0.14  0.00
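The same matrix takes seconds rather than 30 minutes in code. A minimal numpy sketch using the table above (compare its output with the hand-worked matrix):

```python
import numpy as np

# Species abundances (columns A, B, C) for years 1-10, from the table above.
data = np.array([
    [100,  0,  0],
    [ 90, 10,  0],
    [ 80, 20,  5],
    [ 60, 35, 10],
    [ 50, 50, 20],
    [ 40, 60, 30],
    [ 20, 30, 40],
    [  5, 20, 60],
    [  0, 10, 75],
    [  0,  0, 90],
])

def bray_curtis(u, v):
    """Bray-Curtis distance: 1 - 2*sum(minimums) / sum(totals)."""
    return 1.0 - 2.0 * np.minimum(u, v).sum() / (u.sum() + v.sum())

n = len(data)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i):
        dist[i, j] = bray_curtis(data[i], data[j])

print(np.round(dist, 2))   # lower triangle, as in the hand-worked matrix
```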
Now establish the end points – this is a subjective choice; I chose years 1 and 10 as being extreme values. • Now draw a line with year 1 at one end and year 10 at the other. • The length of this line represents a distance of 1.0, based on the matrix above. [Diagram: 10 ———————— 1, distance = 1.0]
Now locate year 2, which lies 1.0 from year 10 and 0.1 from year 1. This can be done with circles: draw a circle of radius 1.0 around year 10 and a circle of radius 0.1 around year 1 – year 2 locates where they intersect. [Diagram: the two circles crossing just off the line, close to year 1]
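The circles construction is just triangulation, so it can also be done algebraically. A minimal sketch (the intersecting-circles formula is my addition, not from the slide; the two end points sit at x = 0 and x = L):

```python
import math

def place(d_end1, d_end2, L=1.0):
    """Position of a sample from its distances to the two end-point samples.

    End point 1 sits at x = 0, end point 2 at x = L.  The sample lies where the
    circle of radius d_end1 around end point 1 meets the circle of radius d_end2
    around end point 2; if the circles just fail to meet, y is clipped to zero.
    """
    x = (L**2 + d_end1**2 - d_end2**2) / (2 * L)
    y = math.sqrt(max(0.0, d_end1**2 - x**2))
    return x, y

# Year 2: 0.10 from year 1 and 1.00 from year 10 (the chosen end points).
print(place(0.10, 1.00))
```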
The final ordination of the points (approx.). [Diagram: scatter of years 1–10, with years 1 and 10 at opposite ends and the intermediate years 2–9 arranged between them]
Principal Components Analysis (PCA) • This is our first true multivariate technique, and is one of my favourites. • It is fairly close to multiple linear regression, with one big conceptual difference (and a hugely different output).
The difference lies in the fitting of residuals: • MLR fits them vertically to one special variable (Y, the dependent). • PCA fits them orthogonally – all variables are equally important; there is no dependent. [Diagrams: MLR residuals drop vertically onto the fitted plane in Y, X1, X2 space; PCA residuals meet the fitted line at right angles in V1, V2, V3 space]
Conceptually the intentions are very different: • MLR seeks to find the best hyper-plane through the data. • PCA [can be thought of as] starting off by fitting one best LINE through the cloud of datapoints, like shining a laser beam through an array of tethered balloons. This line passes through the middle of the dataset (defined as the mean of each variable), and runs along the axis which explains the greatest amount of variation within the data. • This first line is known as the first principal axis – it is the most useful single line that can be fitted through the data (formally the linear combination of variables with the greatest variance).
The 1st principal axis of a dataset – shown as a laser in a room of tethered balloons! [Figure: the beam passes through the overall mean of the dataset]
Often the first axis of a dataset will be an obvious simple source of variation: overall size (body/catchment size etc.) for allometric data, or experimental treatment in a well designed experiment. • A good indicator of the importance of a principal axis is the % variance it explains. This will tend to decrease as the number of variables (= the number of axes in the dataspace) increases. For half-decent data with 10–20 variables you expect c. 30% of variation on the 1st principal axis. • Having fitted one principal axis, you can fit a 2nd. It will explain less variance than the 1st axis. • This 2nd axis explains the maximum possible variation, subject to 2 constraints: 1: it is orthogonal to (at 90 degrees to) the 1st principal axis; 2: it runs through the mean of the dataset.
The 2nd principal axis of a dataset – shown in blue. This diagram shows a 3D dataspace (axes V1, V2 and V3), with both axes passing through the overall mean of the dataset. The eight data points:

V1   1.0  0.3  0.5  1.3  1.5  2.5  2.5  3.4
V2   0.1  1.0  2.0  0.8  1.5  2.2  2.8  3.4
V3   0.3  2.1  2.0  1.0  1.2  2.5  2.5  3.5
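A minimal numpy sketch of fitting the principal axes to these eight points and reading off the % variance on each. (This uses the covariance of the raw values; the later slides standardise to Z scores first, which amounts to using the correlation matrix instead.)

```python
import numpy as np

# The eight points from the 3D dataspace above (columns V1, V2, V3).
X = np.array([[1.0, 0.1, 0.3],
              [0.3, 1.0, 2.1],
              [0.5, 2.0, 2.0],
              [1.3, 0.8, 1.0],
              [1.5, 1.5, 1.2],
              [2.5, 2.2, 2.5],
              [2.5, 2.8, 2.5],
              [3.4, 3.4, 3.5]])

Xc = X - X.mean(axis=0)                    # centre on the overall mean of the dataset
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigval)[::-1]           # sort axes from largest to smallest variance
eigval, eigvec = eigval[order], eigvec[:, order]

print(eigvec[:, 0])                        # direction of the 1st principal axis
print(eigvec[:, 1])                        # 2nd axis, orthogonal to the 1st
print(np.round(100 * eigval / eigval.sum(), 1))   # % variance on each axis
```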
We can now cast a shadow… • The first 2 principal axes define a 2D plane, onto which the positions of all the datapoints can be projected. • Note that this is exactly the same as casting a shadow. • Thus PCA allows us to examine the shadow of a high-dimensional object, say a 30-dimensional dataspace (defined by 30 variables such as species densities). • Such a projection is a basic ordination diagram. • I always use such diagrams when exploring data – the scattergraph of the 1st 2 principal axes of a dataset is the most genuinely informative description of the entire data.
More axes: A PCA can generate as many principal axes as there are axes in the original dataspace, but each one is progressively less important than the one preceding it. In my experience the first axis is usually intelligible (often blindingly obvious), the second often useful, the third rarely useful, and the 4th and above always seem to be random noise.
To understand a PCA output… • Inspecting the scattergraph is useful, but you need to know the importance of each variable with respect to each axis. This is the gradient of the axis with respect to that variable. • This is given as standard in PCA output, but you need to know how the numbers are arrived at to know what the gibberish means!
How it’s done… • 1: Derive the matrix of all correlation coefficients – the correlation matrix. Note the similarity to Bray-Curtis ordination: we start with N columns of data, then derive an N*N matrix informing us about the relationship between each pair of columns. • 2: Derive the eigenvectors and eigenvalues of this matrix. It turns out that MOST multivariate techniques involve eigenvector analysis, so you may as well get used to the term!
Eigenvectors 1: • Involve you knowing a little about matrix multiplication. • Matrix multiplication is essentially the same as solving a shopping bill! • I have 2 apples, 1 banana and 3 oranges. • You have 1 apple, 2 bananas and 3 oranges. • Costs: 10p per apple, 15p per banana, 20p per orange. • I pay 2*10 + 1*15 + 3*20 = 95p • You pay 1*10 + 2*15 + 3*20 = 100p

my amounts   [2 1 3]     [10]     [ 95]  my total
your amounts [1 2 3]  x  [15]  =  [100]  your total
                         [20]

Such a calculation is called a linear combination.
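The same shopping bill as a matrix multiplication in numpy, as a tiny sketch:

```python
import numpy as np

amounts = np.array([[2, 1, 3],     # my basket: apples, bananas, oranges
                    [1, 2, 3]])    # your basket
prices = np.array([10, 15, 20])    # pence per apple, banana, orange

print(amounts @ prices)            # [ 95 100] - each bill is a linear combination
```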
Eigenvectors 2: • Are intimately linked with matrix multiplication. You don’t have to know this bit, but it would help! • Take an [N*N] matrix M, and use it to multiply a [1*N] vector of 1s. • This gives you a new [1*N] vector of different numbers. Call this V1. • Multiply V1 by M to get V2. • Multiply V2 by M to get V3, etc. • After infinite repetitions the elements of V will settle down to a steady pattern – this is the dominant eigenvector of the matrix M. • Each time the vector is multiplied by M it grows by a constant multiple, which is the first eigenvalue of M, E1.
[Diagram: M * (1, 1, 1, …) → V1;  M * V1 → V2;  M * V2 → V3; …] After a while, each successive multiplication preserves the shape (the eigenvector) while increasing the values by a constant multiple (the eigenvalue).
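A minimal power-iteration sketch of that repeated multiplication (the matrix M here is an arbitrary small symmetric example; the rescaling each pass just keeps the numbers finite and does not change the shape):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # any symmetric matrix, e.g. a correlation-like matrix

v = np.ones(2)                  # start from a vector of 1s
for _ in range(50):             # "infinite" repetitions converge quickly in practice
    w = M @ v
    eigenvalue = np.linalg.norm(w) / np.linalg.norm(v)   # growth factor each step
    v = w / np.linalg.norm(w)   # rescale so the numbers do not explode

print(eigenvalue, v)            # dominant eigenvalue and eigenvector of M
```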
The projections are made: • by multiplying the source data by the corresponding eigenvector elements, then adding these together. • Thus the projection of site A on the first principal axis is based on the calculation: • (spp1A*V11 + spp2A*V21 + spp3A*V31 …) • where • spp1A = the number of species 1 at site A, etc. • V21 = the 1st eigenvector element for species 2. • Luckily the PC does this for you. • There is one added complication: you do not usually use raw data in the above calculation. It is possible to do so, but the resulting scores are heavily dominated by the commonest species. • Instead all species data are first converted to Z scores, so that mean = 0.00 and sd = 1.00. • This means that the principal axes, which always run through the mean, are always centred on the origin 0,0,0,0,… • It also means that typically half the numbers in a PCA output are negative, and all are apparently unintelligible!
Formally: • 1: Derive the eigenvectors of the correlation matrix. Call the 1st one E1. • 2: Convert all raw data into Z scores (mean = 0, sd = 1.0). • 3: 1st axis scores = Z * E1 (where Z is the R*N matrix of Z scores – R sites by N species – and E1 is an N*1 vector). • 2nd axis scores = Z*E2, etc.
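Those three steps in a minimal numpy sketch (the data matrix X is invented purely for illustration):

```python
import numpy as np

# X: raw data, one row per site, one column per species (invented numbers).
X = np.array([[12.0, 3.0, 0.0],
              [10.0, 5.0, 1.0],
              [ 2.0, 8.0, 6.0],
              [ 0.0, 7.0, 9.0]])

R = np.corrcoef(X, rowvar=False)              # step 1: correlation matrix of the variables
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]              # E1 = eigenvector with the largest eigenvalue
E = eigvec[:, order]

Z = (X - X.mean(axis=0)) / X.std(axis=0)      # step 2: convert raw data to Z scores

axis1_scores = Z @ E[:, 0]                    # step 3: 1st-axis score for each site
axis2_scores = Z @ E[:, 1]                    # 2nd-axis scores, and so on
print(np.round(axis1_scores, 2))
```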
The important part of a PCA… • Is interpreting the output! • DON’T PANIC!! • Stage 1: look at the variance on the 1st axis. Values above 50% are hopeful; values in the 20s suggest a poor or noisy dataset. There is no significance test, but referral to a table of predicted values from the broken stick distribution is often helpful.
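The broken-stick expectations are easy to generate; a minimal sketch of the expected % variance on each axis when there are p variables, purely by chance:

```python
def broken_stick(p):
    """Expected proportion of variance on each of p axes under the broken-stick model."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]

# With 10 variables an axis is 'interesting' if it beats these chance expectations.
print([round(100 * b, 1) for b in broken_stick(10)])   # 1st axis expectation ~29%
```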
Inspecting PCA output, 2 • 2: Look at the eigenvector elements on the first axis – which species / variables have large loadings (positive or negative)? Do some variables have opposed loadings (one very +ve, one very –ve), suggesting a gradient between the two extremes? • The actual values of the eigenvector elements are meaningless – it is their pattern which matters. In fact the sign of the elements will sometimes differ between packages if they use different algorithms! It’s OK! The diagrams will convey the same information, it’s just that the pattern of the points will be mirror-reversed.
Inspect the diagrams • Stage 3: plot the scattergraph of the ordination scores and have a look. • This is the shadow of the dataset. • Look for clusters, outliers, gradients. Each point is one observation, so you can identify odd points and check the values in them. (Good way to pick up typing errors). • Overlay the graph with various markers – often this picks out important trends.
Data on Collembola of industrial waste sites. The 1st axis of a PCA ordination detected habitat type: woodland vs open lagoon, while the second detected succession within the Tilbury trial plots.
Biplots • Since there is an intimate connection between eigenvector loadings and axis scores, it is helpful to inspect them together. • There is an elegant solution here, known as a biplot. • You plot the site scores as points on a graph, and put eigenvector elements on the same graph (usually as arrows).
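A biplot is easy to sketch by hand in matplotlib. A minimal example (the function and argument names are mine; site_scores and loadings are assumed to come from an earlier PCA such as the one sketched above):

```python
import matplotlib.pyplot as plt

def biplot(site_scores, loadings, species_names):
    """site_scores: (n_sites, 2) axis scores; loadings: (n_species, 2) eigenvector elements."""
    fig, ax = plt.subplots()
    ax.scatter(site_scores[:, 0], site_scores[:, 1], s=15)        # sites as points
    for (x, y), name in zip(loadings, species_names):
        ax.arrow(0, 0, x, y, head_width=0.03,
                 length_includes_head=True)                        # species as arrows
        ax.annotate(name, (x, y))
    ax.set_xlabel("Axis 1")
    ax.set_ylabel("Axis 2")
    return fig
```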
Pond community handout data – site scores plotted by SPSS. These are the new variables which appear after PCA: FAC1 and FAC2. Note the aquatic sites at the left hand side.
Factor scores (= eigenvector elements) for each species in the pond dataset. Note the aquatic species at the left hand side. [Figure: species loadings on AX1 (−1.0 to 1.0) against AX2 (0.0 to 0.8); potamoge plots highest on AX2, with epilob, ranuscle, phragaus and potentil lower down]
Pond community handout data – the biplot. Note the aquatic sites and species at the left hand side. Note also that this is only a description of the data – no hypotheses are stated and no significance values can be calculated. [Biplot combines the site points with arrows for Potamogeton, Epilobium, Ranunculus, Potentilla and Phragmites]
Other ordination techniques: • Correspondence analysis (CA), also known as Reciprocal Averaging (RA). This technique is widely used, and relates to the notion of the weighted mean. 1: Take an N*R matrix of sites*species data, and calculate the weighted mean score for each site, giving each species an arbitrary weighting. Now use the site scores to get a weighted mean score for each species. 2: Repeat stage 1 until the scores stabilise.
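A crude sketch of that two-step iteration (assuming the abundance matrix A has no empty rows or columns; the re-standardisation of the species scores each pass is simply there to stop them collapsing to a constant):

```python
import numpy as np

def reciprocal_averaging(A, n_iter=100):
    """Rough reciprocal-averaging / CA sketch for a sites x species abundance matrix A.

    Species start with arbitrary scores; site scores are abundance-weighted means of
    the species scores, species scores are then weighted means of the site scores,
    and the two steps are repeated until the pattern stabilises.
    """
    sp = np.arange(A.shape[1], dtype=float)      # arbitrary initial species scores
    for _ in range(n_iter):
        site = (A @ sp) / A.sum(axis=1)          # weighted mean score per site
        sp = (A.T @ site) / A.sum(axis=0)        # weighted mean score per species
        sp = (sp - sp.mean()) / sp.std()         # rescale so the scores do not shrink away
    return site, sp
```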
CA has a useful feature: • Namely that it derives scores for species and sites at the same time. These can be displayed in a biplot, just as with PCA. This is conceptually simpler than a PCA biplot due to the simultaneous derivation of scores. • The formal algorithm involved here is almost the same as PCA – except that (in theory) you extract eigenvectors of the matrix of chi-squared distances between samples or species, instead of the correlation coefficients.
PCA and CA have an odd feature: • This concerns the ordination diagram produced when analysing a succession, or a community which changes steadily along a transect in space. You would expect this pattern – and would be wrong!! [Diagram: years 1–6 expected to lie on a straight line along Axis 1]
Idealised successional data (used in the Bray-Curtis ordination last week).
PCA ordination of idealised successional data – the Horseshoe effect (arch distortion).
This horseshoe effect was a puzzle when 1st discovered.. • Ideas were that it represented a bug in the techniques, or a hitherto undiscovered fundamental truth about ecological successions. • Neither is the case. The algorithm simply tells the truth! It’s just that humans are no good at visualising high-dimensional spaces. If they were they would know that successions must be arched in a space where distance = difference between sites. • Think about the difference between an early-successional site and a mid-successional one. It is likely to be absolute – no species in common. How about an early vs a late successional site? The same, not more.
So what happens in a dataspace where separation = difference between sites? The only way to ensure that the early–mid distance equals the early–late distance is for the succession to be portrayed as a curve. It is! [Diagram: sites 1–4 on an arc – adjacent sites are close together (“not far!”), while site 1 sits at maximum separation from both sites 3 and 4; early, mid and late stages trace a curve through the Axis 1 / Axis 2 plane]
This arch effect caught me out too… • In analysing the communities of fungi in pine forests (which roughly corresponded to a succession), I found that the 2nd axis of the PCA correlated significantly with diversity (Shannon index). • Getting my head round why the 2D projection of a 10D space should correlate with the information content of the community was not exactly easy! Well done to my supervisor for pointing out that the arch distortion and community diversity both peak in mid-succession.
[Diagrams: because of the arch, Ax2 scores rise then fall with site age (young and old sites score low, medium-aged sites score high); diversity also peaks at medium site age; HENCE Ax2 and diversity correlate]
DECORANA (or rather DCA) • This arch effect was a well-known irritant to ecologists by the mid 1970s, and was “sorted out” by a piece of software written by Mark Hill (then and still of ITE). • The program was called DECORANA, for DEtrended CORrespondence ANAlysis. The algorithm is properly called detrended correspondence analysis, or DCA – it is a common minor misnomer to confuse the algorithm with Mark Hill’s FORTRAN code which implements it.
DCA contd. • In fact a minor bug has recently been found in the source code – it could invalidate a few analyses, but in practice seems of minor importance (it concerns the way eigenvalues are sorted when values are tied). • Post-1997 releases should be safe. I still use an older version, with a reasonably clean conscience! • It is not supplied in any standard statistics package, but is widely available in ecological packages – we have a copy in the dept.
DCA basics • The algorithm uses CA rather than PCA (faster extraction, + simultaneous extraction of species and site scores). • It is iterative: the 1st pass is simply a 2D CA ordination. • The 1st axis is then chopped up into N subsegments (default = 26), and each one rescaled to have a common mean.
The axis is also stretched at each end • This is to ensure that the extremes get equal representation (the technical term is “a constant turnover rate”). Before Ax2 After 1 7 2 6 1 2 3 4 5 6 7 Ax1 3 5 4 1 2 3 4 5 6 7
DCA repeats this procedure.. • Until the iterations converge. This gives the 1st DCA axis. It has an eigenvalue, species and site scores just like CA - but don’t ask what the numbers actually mean!! • The procedure is then repeated to get a 2nd axis - same procedure, but the 1st axis is removed by subtraction at each stage. • DECORANA also gives a 3rd axis by the same algorithm, then stops. Note that the DCA algorithm could give N axes. As I said, 4th axes and above tend to be merely noise.
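The detrending step itself is simple to sketch. This is only the core “chop axis 1 into segments and re-centre axis 2 within each” idea; the real DECORANA also rescales within segments and iterates, as described above:

```python
import numpy as np

def detrend(ax1, ax2, n_segments=26):
    """Very rough sketch of one detrending pass: flatten the arch by re-centring
    the axis-2 scores within each segment of axis 1."""
    edges = np.linspace(ax1.min(), ax1.max(), n_segments + 1)
    seg = np.clip(np.digitize(ax1, edges) - 1, 0, n_segments - 1)
    ax2 = ax2.copy()
    for s in range(n_segments):
        inside = seg == s
        if inside.any():
            ax2[inside] -= ax2[inside].mean()   # centre this segment's axis-2 scores on zero
    return ax2
```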
DECORANA is presented… • As biplots, although the scores can be subjected to ANOVA etc. • There is no hypothesis inherent in DCA, so no significance test is inherent either. • It is excellent for dealing with successions etc., and is many ecologists’ ordination of choice. • I try to avoid it only because the input data format required (Cornell Condensed Format) is an unmitigated headache! (Unless you are happy with fixed columns and Fortran format statements.)
Cluster Analysis "The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other statistical innovation (with the possible exception of multiple regression techniques)." Cormack (1971) A review of classification. Journal of the Royal Statistical Society A 134, 321–367. Here the aim is to aggregate individuals into clusters, based on an objective estimate of the distance between them, and to display the result as a dendrogram. This involves 2 choices, both allowing many options (see the sketch below): 1: How to measure the distance? 2: What rules to build the dendrogram by?
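A minimal scipy sketch of those two choices (Bray-Curtis distance, average-linkage rules; the data matrix is invented for illustration). Swapping either choice can reshape the dendrogram, which is exactly the complaint that follows:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Sites x species abundance matrix (invented numbers, for illustration only).
X = np.array([[10, 0, 1],
              [ 8, 2, 0],
              [ 1, 9, 3],
              [ 0, 7, 5],
              [ 0, 1, 9]])

d = pdist(X, metric="braycurtis")       # choice 1: the distance measure
tree = linkage(d, method="average")     # choice 2: the rules for building the dendrogram
dendrogram(tree, labels=["s1", "s2", "s3", "s4", "s5"])
plt.show()
```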
In practice there are c. 3 standard distance measures and c. 5 standard sets of rules for building the dendrogram, giving c. 15 different algorithms for cluster analysis. In addition to these 15 standard ways of making a dendrogram, there are other options: one package (called CLUSTAN) offers a total of around 180 different algorithms. But each algorithm can generate a different shape of dendrogram, a different pattern of relationships – and you have no way of knowing which one is most useful. For a book I explored a small number of datasets in painful detail using various multivariate techniques, and they all gave the same basic story, identifying the same extreme values and clusters. Except cluster analysis, which told me several different stories, all garbage!!