610 likes | 758 Views
The Fourth BIOSAFENET Seminar – January 12-15, 2009. International Centre for Genetic Engineering and Biosafety (ICGEB) Ca’ Tron di Roncade, Italy. Permutation tests in multivariate analysis Michele Scardi Department of Biology ‘Tor Vergata’ University Rome, Italy.
E N D
The Fourth BIOSAFENET Seminar– January 12-15, 2009 International Centre for Genetic Engineering and Biosafety (ICGEB) Ca’ Tron di Roncade, Italy Permutation tests in multivariate analysisMichele ScardiDepartment of Biology‘Tor Vergata’ UniversityRome, Italy Email: mscardi@mclink.itURL: http://www.michele.scardi.name
Ecology and agriculture A few reasons why studyingGMO effects on NTO is relevant to ecologists
Agriculture and biodiveristy High Nature Value farmland Intensive farmland Intermediatedisturbancehypothesis! Modified from EEA Report 2/2006
CAP and community ecology • In the past, Common Agricultural Policy mainly promoted the expansion of agricultural production. • A complete re-arrangement of the funding scheme now puts the environment at the center of farming policy. • Preserving biodiversity (and therefore monitoring agroecosystems) is one of the CAP main goals. • Biodiversity can only be monitored through community ecology studies. • Similar targets have been set in other fields (e.g. Water Framework Directive).
EU and GMOs • In December 2008, the Council of the EU adopted the following conclusions on GMOs: “GMOs, … , give rise to discussion and questions, within the scientific community and society at large regarding their possible impact on health, environment and ecosystems.” • Therefore, there is a growing need for ecological research on agroecosystems.
Permutation tests Because bugs don’t know they’re supposed to be normally distributed…
A basic (univariate) permutation test • Sample A: 28, 32, 45 (mA=35.00) • Sample B: 22, 25, 29 (mB=25.33) • H0: no difference between means • Six data can be arranged in two groups of three in 20 different combinations:
A basic (univariate) permutation test • Difference between Sample A and Sample B means ismA-mB=9.67 • By permuting our data we obtained an empirical distribution of the values of this difference acting as ifH0: equal meanswere true • How large is this original difference relative to the empirical distribution? • Is it so large that H0 is probably false?
P-value cannot be less than: A basic (univariate) permutation test In 2+1casesout of 19+1:|mA-mB| 9.67 P(H0: mA=mB) = = (2+1)/(19+1) = = 0.15 > 0.05 We cannot rejectH0: mA=mB mA-mB
What permutation tests do ecologist use? • ANOSIM, MRPP, NPMANOVA, Mantel Test and Indicator Species Analysis are quite popular • Other permutation have been developed to suit specific needs (e.g. testing significance of relationships between envrionmental variables and species in CCA) • If you have to solve a particular ecological problem, chances are you can develop your own permutation test
Treatment A Treatment B Treatment C Treatment D 16 14 12 10 species 2 8 6 4 2 0 0 2 4 6 8 10 12 14 16 18 20 22 24 species 1 A toy multivariate problem (4 treatments and a community with only 2 species)…
…and a toy permutation test • Differences between samples are measured as Euclidean distances • The test statistics is the the average distance between group centroids (i.e. average responses to treatments) • If the test statistics is large enough with respect to those obtained by data permutation, then differences between groups are significant.
Species 2 Distances between average responses to treatments (between centroids) Species 1
Mantel test X matrixe.g. geographical distances Y matrixe.g. species dissimilarity Problem: are X and Y (in)dependent of each other?
Mantel statistics Range: [0,) Range: [-1,1] i.e. correlation between xij and yij
Mantel statistics X Y Z=8.867
Z=8.867 Mantel statistics X Y NB: Mantel statistics is maximum when large xij are multiplied by large yij, i.e. when the two matrices have the same structure
Mantel test X actual value Z = 8.867 ZP1 = 8.365 ZP2 = 7.834 ZP3 = 6.897 ZP4 = 8.531 ZP5= 8.885 … ZPn = 7.852 permutations Y Z=8.867 ZP1=8.365
Frequency Z Mantel test actual value actual value H0: X and Y are independent Z = 8.867 ZP1 = 8.365 ZP2 = 7.834 ZP3 = 6.897 ZP4 = 8.531 ZP5= 8.885 … ZPn = 7.852 permutations permutations 3% Z=8.867 p(Z)=0.03 Therefore, we reject H0
Mantel test • The test can be performed on any couple of distance or similarity matrices, for instance: • two different groups of organisms • environmental data and species composition • etc. • In case the matrices are large enough (D>30) the test statistics is approximately distributed like Student’s t
W and E basins are separated by two shallow sills Mantel test Years: 1969-1970Samples: 9 (W), 12 (E)Depth: 570-2550 m H0: structure ofdeep copepod assemblages is independent of basin W E But… More about this at the endof the presentation! Jaccard similarity Same basin? (1=yes, 0=no) W E W E W W H0 was rejected, concluding that (according to common opinion) Eastern and Western basins actually had differentdeep copepod assemblages R=0.37 p<0.001 E E
A few more facts about Mantel test • Mantel test assumes that relationships between data in the two matrices are linear • Therefore, it is sensitive to non-linearity (as well as to outliers) • A rank-based Mantel test can be also performed, although it is not very popular • Partial Mantel correlation (e.g. using 3 matrices) can be computed • Mantel test can be performed on non-symmetrical matrices
i.e. they can be used to arrange objects, samples, etc. in an Euclidean space Measuring ecological resemblance • Similarity coefficients • Range: [0,1] • S=1 when similarity between species lists is maximum • Dissimilarity coefficients • D=1-S • Range: [0,1] • D=0 when similarity between species lists is maximum • Some of them are metric • Distance coefficients • Range: [0,] or [0,Dmax] • D=0 when two species lists are identical (In some cases proportional) • Most of them are metric
Measuring ecological resemblance • Legendre & Legendre (1983,1998) list 25 similarity coefficients and 14 metric distances, but many others have been used. • Different measures of ecological resemblance may lead to different results. • Selection of an optimal way of measuring ecological resemblance is inherently subjective.
A smart pick: asymmetric similarity coefficients ignore absence data (they rely upon the lowest level of information, i.e. presence) A few examples… Jaccard Euclidean Sørensen Manhattan Steinhaus(Bray-Curtis) Canberra
quantitative Manhattan Canberra treatmentor impact control treatmentor impact qualitative similaritieswithin groups Jaccard similaritiesbetween groups control
ANOSIM(ANalysisOfSIMilarities) H0: mean rank of similarites between groups is equal to mean rank of similarities within groups, i.e. rb = rw Problem: are differences between groups large enough as to say that groups are different from each other?
ANOSIM(ANalysisOfSIMilarities) N=6 sort n=6 n=9
rw= 9.17 rw= 7.17 rw= 7.08 rw= 5.75 rb=9.50 rb=7.22 rb=8.56 rb=8.61 p(R)=0.10 R= 0.50 R= 0.20 R= 0.19 R= -0.26 ... n=6 n=9 n=6 n=9 n=6 n=9 n=6 n=9
Final remarks about ANOSIM • In case of a posteriori pairwise comparisons, don’t forget the Bonferroni correction:p(R) = p(R) * n. of pairwise comparisons • Two-way analyses can also be performed • The most particular feature of ANOSIM is that it takes into account ranks of similarities (not sensitive to outliers, but maybe too sensitive to small differences)
INSIDE OUTSIDE Problem: is the fish assemblage composition in a Marine Protected Area different from theneighbouring sites?
Other approaches • ANOSIM is loosely related to ANOVA-like tests, but it can be easily integrated into explorative analyses based on ordination or clustering • Other approaches are more similar to ANOVA and to ANOVA users’ way of thinking: MRPP and NPMANOVA • Exactly like ANOVA, they support more complex experimental designs
MRPP(MultiResponse Permutation Procedure) • It is based on actual distances (usually Euclidean), not ranks • The test statistic is Δ, a weighted average within group distance • Significance of Δ is assessed by the empirical distribution of permuted Δs, but an approximation to Student’s t is also available (quite handy for very large data sets) • There are different methods for weighting Δ according to group size
MRPP • PRO: not using ranks, it can detect differences in mean values as well as differences in data dispersion • CON: biases due to distance metric are more pronounced than in ANOSIM • Output • observed Δ • significance of observed Δ • expected Δ • a ratio between observed and expected Δ
MRPP treatment control DT=(8.2+5.4+5.7)/3=6.43 DC=(5.6+6.9+5.8)/3=6.10 Weight = ni/sum(ni) Expected D = overall average distance (i.e. between and within groups) = 11.32 D=6.43*3/6+6.10*3/6=6.26 distances within groups distances between groups Euclidean distances
MRPP output ------------------------------------------------- MRPP -- Multi-Response Permutation Procedures [Michele Scardi, 1999] ------------------------------------------------- Input data: 6 objects, 10 variables, 2 groupsWeighting option: C(I) = n(I)/sum(n(I))Distance measure: Euclidean Group # 1, n = 3, avg(d) = 6.43 Group # 2, n = 3, avg(d) = 6.10 Test statistic, T = -2.95 Observed delta = 6.26 Expected delta = 11.32 Variance of delta = 2.94 Skewness of delta = -2.52 Within group agreement, R = 0.447 P-value of a <= delta = 0.022 groups are significantly more homegenous than expected (therefore, differences between groups are significant) R1 when D0 i.e. R is large when within group distance is small
Arpaia et al. (2007). Composition of Arthropod Species Assemblages inBt-expressing and Near Isogenic Eggplants in Experimental Fields.Environ. Entomol. 36(1): 213-227
NPMANOVA • Non-Parametric Multivariate ANalysis Of VAriance • Test statistics • SStotal = sum of squared distances between all observations and the overall centroid • SSwithin = sum of squared differences between group observations and group centroid • SSamong = SStotal-SSwithin • Pseudo-F = ratio of SSamong to SSwithin
NPMANOVA SSamong SSwithin Species 2 Species 1
NPMANOVA • Test is based on permutation of a distance matrix and computation of new Pseudo-F • Original Pseudo-F is then compared to the empirical distribution of permuted Pseudo-F values • Traditional ANOVA output (Pseudo-F) • One-way, two-ways and higher level ANOVA designs
NPMANOVA • It can be applied any time ANOSIM or MRPP can be used, and it seems more robust • It allows higher level ANOVA designs, including designs with interactions • Any distance coefficient can be used • Like ANOVA in multiple regression, it can be used to evaluate model functioning: e.g. community structure=f(disturbance,time) • Software available from the Author’s web site for complex analysis designs
NPMANOVA Non-parametric Multivariate Analysis of Variance Source df SS MS F P---------------------------------------------------Treatment 1 288.0 288.0 14.4 0.0981Residuals 4 80.0 20.0Total 5 368.0---------------------------------------------------
NPMANOVA vs. MRPP SSamong SSwithin Species 2 Species 1
NPMANOVA vs. MRPP expected D Species 2 observed D Species 1
Indicator Species Analysis • Indicator Species Analysis allows identifiying species that are significantly more frequent and/or abundant in a group of samples • Each species is associated to a vector of Indicator Values (IVs), i.e. to an IV for each group of samples • Significance of IVs is tested by permutation of the raw data matrix
Indicator Species Analysis Counts, biomass, etc. Relative abundance of species j in group k Average frequency of occurence ofspecies j in group k Presence or absence (0,1) The Indicator value (IV) is obtained by combining relative abundances (RA) and average frequencies of occurrence (RF)