Multivariate Tests Based on Pairwise Distance or Similarity Measures

The 8th Tartu Conference on Multivariate Statistics and the 6th Conference on Multivariate Distributions with Fixed Marginals, Tartu, Estonia, 26-29 June 2007
Siegfried Kropf, Institute for Biometry and Medical Informatics, Otto von Guericke University Magdeburg, Germany

Contents
• Introduction
• Example data - microbial fingerprints
• “Usual” way as multivariate test based on spherically distributed scores (PC scores)
• Test based on pairwise similarity measures
• Comparison of results in example data
• Application to other data
• Extensions of permutation test
• Parametric “rotation” test for small n
• Simulation studies on robustness
• Summary

Introduction
Consider global multivariate comparisons between two or more populations of high-dimensional data:
• Gene expression data (all genes, known groups of genes)
• Neuroimaging
• Genetic fingerprints (e.g. microbial DNA in soil samples)
• …
Formal description: independent sample vectors $x_{kj} \sim N_p(\mu_k, \Sigma)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$, with $p \gg n = n_1 + \dots + n_K$, or more generally $x_{kj} \sim F_k(x)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$, with $p \gg n$.
Wanted: a test for $H_0\colon \mu_1 = \dots = \mu_K$ or $F_1(x) = \dots = F_K(x)$ for all $x$.

Example data - microbial fingerprints
• Question: What is the impact of different (natural or genetically modified) plant cultures on the soil microbial population?
• Extraction of bacterial samples, DNA parts amplified by PCR
• Several samples investigated together in electrophoresis gels (e.g. denaturing gradient gel electrophoresis, DGGE)
• Gels scanned and analyzed with GelCompar, giving a vector of hundreds or thousands of greyscale values per lane.

Figure: Denaturing gradient gel with fingerprints of bacterial communities from rhizosphere soil (lanes S1 to S3: strawberry; P1 to P4: potato; R1 to R4: oilseed rape) and unplanted soil (lanes B1 to B4). Lanes M1 to M3: standard bacterial mix. Lane order: M1 B1 B2 B3 B4 S1 S2 S3 M2 P1 P2 P3 P4 R1 R2 R3 R4 M3.

“Usual” way as multivariate test based on spherically distributed scores (PC scores)
Exact test based on spherically distributed scores: PCq test (Läuter et al., 1998)
1. Transformation of raw data vectors into q-dimensional score vectors: $x_{kj} \mapsto z_{kj} = D^{\top} x_{kj}$ ($k = 1, \dots, K$; $j = 1, \dots, n_k$) with a $(p \times q)$ matrix $D$ ($q \ll p$) obtained from the eigenvalue problem (EVP) or, better, from the dual EVP
2. Multivariate test (here Wilks’ $\Lambda$) with the q-dimensional scores $z_{kj}$
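The two steps above can be sketched in Python. This is a minimal illustration, not the published PCq procedure: it assumes that D consists of the leading eigenvectors of the total sums-of-products matrix (centred with the overall mean, without using group labels) and that the dual EVP refers to the n x n matrix of inner products between samples. The function names `pc_scores` and `wilks_lambda` are hypothetical, and the exact null distribution of Wilks' Λ is not evaluated here.

```python
import numpy as np

def pc_scores(X, q):
    """q principal-component scores per sample via the dual (n x n) eigenproblem."""
    Xc = X - X.mean(axis=0)              # centre with the overall mean only
    G = Xc @ Xc.T                        # n x n inner products, cheap when p >> n
    eigval, eigvec = np.linalg.eigh(G)   # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:q]   # indices of the q largest eigenvalues
    # The scores Xc @ D equal eigenvector * sqrt(eigenvalue) of G (up to sign)
    return eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0.0, None))

def wilks_lambda(Z, groups):
    """Wilks' Lambda = det(within-group SSP) / det(total SSP) of the scores."""
    groups = np.asarray(groups)
    Zc = Z - Z.mean(axis=0)
    total = Zc.T @ Zc
    within = np.zeros_like(total)
    for g in np.unique(groups):
        Zg = Z[groups == g]
        Zg = Zg - Zg.mean(axis=0)        # centre each group with its own mean
        within += Zg.T @ Zg
    return np.linalg.det(within) / np.linalg.det(total)
```

For an n x p data matrix `X` with p much larger than n, `wilks_lambda(pc_scores(X, q=2), groups)` returns a value between 0 and 1; small values indicate group differences in the leading score dimensions.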

Test based on pairwise similarity measures
1. Calculate pairwise similarity measures between sample elements, e.g., Pearson’s r

(2. Investigate similarities by cluster analyses, supported by GelCompar)

3. Calculate the test statistic $d$ and use it as the basis for a permutation test:
• In each new permutation step, the $n = n_1 + \dots + n_K$ sample elements are randomly allocated to the K groups of sizes $n_1, \dots, n_K$ (simultaneous exchanges of rows and columns in the correlation matrix).
• The test can be carried out with all groups or pairwise.
• We used random permutations.
• Problems occur with small samples because of the restricted number of permutations, e.g. two samples of size 4 allow only a small number of different permutations.
• The permutation test in its basic form is a special case of the Mantel test (Mantel, 1967); for a similar application to electrophoresis data see Aittokallio et al. (2000).
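The permutation scheme described above can be sketched as follows. The formula of the test statistic is not reproduced in these slides, so the sketch assumes one plausible form: the mean within-group similarity minus the mean between-group similarity. The names `d_statistic` and `permutation_test` are hypothetical, and rows of `X` are taken to be the sample elements.

```python
import numpy as np
from math import comb

def d_statistic(R, groups):
    """Assumed statistic: mean within-group minus mean between-group similarity."""
    same = groups[:, None] == groups[None, :]
    off_diag = ~np.eye(len(groups), dtype=bool)   # drop self-similarities
    return R[same & off_diag].mean() - R[~same].mean()

def permutation_test(X, groups, n_perm=999, seed=0):
    """One-sided random permutation test on the pairwise correlation matrix."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    R = np.corrcoef(X)                 # Pearson's r between rows of X
    d_obs = d_statistic(R, groups)
    count = 1                          # the observed allocation counts once
    for _ in range(n_perm):
        # Relabelling the elements is equivalent to exchanging rows and
        # columns of R simultaneously, so R is computed only once.
        relabelled = rng.permutation(groups)
        if d_statistic(R, relabelled) >= d_obs:
            count += 1
    return d_obs, count / (n_perm + 1)

# Two groups of size 4: number of ways to allocate 8 elements to labelled groups
n_allocations = comb(8, 4)   # 70
```

With two groups of equal size, swapping the two group labels leaves this assumed statistic unchanged, so the 70 allocations yield at most 35 distinct values and the smallest attainable p-value is 1/35. This is why very small samples limit the permutation test.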

Comparison of results in example data
p-values for global test and unadjusted pairwise tests
The d version performs quite well, but is at its limits here: no Bonferroni adjustment is possible.

The same data with transformation x:= ln(1+x) p-values for global test and unadjusted pairwise tests

Application to other data
• Other microbiological fingerprints (DGGE), from soil of four different regions, four samples each
• Gene expression analyses from microarrays
• Result: the permutation test based on pairwise correlation coefficients of sample elements performed very well and outperformed the PC test in the examples (Kropf et al., 2007).

Extensions of permutation test
The correlation-based test can be extended in different ways (Kropf et al., 2004):
• Inclusion of block designs (e.g., use of several gels, where lanes may not be compared across different gels).
• Comparison of dependent samples (e.g., the same soil samples analyzed with different types of gels).
• Use of other distance or similarity measures instead of r (e.g., Fisher’s z (“z-dot”) transformation of r, squared Euclidean distance, other distances for binary or ordinal data, …).
High flexibility for applications.
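A few of these extensions can be illustrated with small helper functions. This is only a sketch under assumptions: the slides do not specify how block designs are implemented, so the version here assumes that labels are permuted within each gel and that only lane pairs from the same gel enter the statistic. All function names are hypothetical.

```python
import numpy as np

def fisher_z(R):
    """Fisher's z transformation of correlations; clipping keeps r = 1 finite."""
    return np.arctanh(np.clip(R, -0.999999, 0.999999))

def squared_euclidean(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D, 0.0)          # remove tiny negative rounding errors

def permute_within_blocks(groups, blocks, rng):
    """Permute group labels separately inside each block (e.g. each gel)."""
    groups = np.asarray(groups).copy()
    blocks = np.asarray(blocks)
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        groups[idx] = rng.permutation(groups[idx])
    return groups

def comparable_pairs(blocks):
    """Mask of distinct element pairs that lie in the same block."""
    blocks = np.asarray(blocks)
    same_block = blocks[:, None] == blocks[None, :]
    return same_block & ~np.eye(len(blocks), dtype=bool)
```

When a distance replaces a similarity, large values indicate dissimilar elements, so the direction of the test statistic (or of the comparison in the permutation step) has to be reversed accordingly.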

Parametric “rotation” test for small n
Usual assumptions: multivariate normal sample vectors as in the formal description of the Introduction.
As the distribution of the test statistic might be too complicated for a “closed” solution, we are looking for a Monte Carlo version:
• The test statistic is traced back to a left-spherically distributed matrix (in particular, a matrix of iid multivariate normal rows with expectation zero), which under H0 is distributionally invariant to random orthogonal rotations. Use the unlimited number of random rotations instead of the restricted number of permutations.
• “Rotation” test (cf. Langsrud, 2005; Läuter et al., 2005)

Scheme of the rotation test (slide diagram, built up in three steps):
• Data matrix X: rows are the sample elements (1 … n1, …, 1 … nK), columns are the variables 1 … p.
• X is transformed into a reduced data matrix X* and then into a “decorrelated” matrix X+ with n-1 rows.
• Repeatedly apply random rotations: $X^{+} := \Gamma X^{+}$ with $\Gamma = \Gamma^{*} (\Gamma^{*\top} \Gamma^{*})^{-1/2}$, where $\Gamma^{*}$ is an $(n-1) \times (n-1)$ matrix of iid standard normal elements.
• From the data, compute the distance/similarity matrix $R = (r_{ij})$ with $r_{ij} = r(x_{(i)}, x_{(j)})$ and the test statistic $d = d(R)$.
• R has to be invariant with respect to a constant vector shift in its arguments, $r(x_{(i)}, x_{(j)}) = r(x_{(i)} + a, x_{(j)} + a)$; this holds e.g. for the squared Euclidean distance, but no longer for Pearson’s r!
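Two ingredients of this scheme can be checked numerically: generating a random orthogonal matrix from a matrix of iid standard normal elements, and the shift invariance required of the distance measure. The sketch below is illustrative only; the construction of X* and X+ is not specified in the slides and is therefore not attempted. The names `random_rotation` and `pairwise_sq_dist` are hypothetical.

```python
import numpy as np

def random_rotation(m, rng):
    """Random orthogonal m x m matrix G (G'G)^(-1/2) from iid N(0, 1) entries."""
    G = rng.standard_normal((m, m))
    eigval, eigvec = np.linalg.eigh(G.T @ G)        # G'G is symmetric positive definite
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return G @ inv_sqrt

def pairwise_sq_dist(X):
    """Squared Euclidean distances between all pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

rng = np.random.default_rng(0)
Q = random_rotation(5, rng)
X = rng.standard_normal((5, 20))
a = 3.0 * rng.standard_normal(20)                   # constant shift vector

# Q is orthogonal, so rotating the rows leaves X'X unchanged
rotation_ok = np.allclose(Q.T @ Q, np.eye(5)) and np.allclose((Q @ X).T @ (Q @ X), X.T @ X)
# Squared Euclidean distance ignores the shift; Pearson's r between rows does not
distance_invariant = np.allclose(pairwise_sq_dist(X + a), pairwise_sq_dist(X))
pearson_invariant = np.allclose(np.corrcoef(X + a), np.corrcoef(X))
```

The matrix produced this way is orthogonal but may have determinant -1, so "rotation" is meant in the loose sense of a random orthogonal transformation.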

Example data (4 groups: bulk soil, strawberry, potato, oilseed rape) p-values from global test and with unadjusted pairwise comparisons

The same data with transformation x:= ln(1+x) p-values for global test and unadjusted pairwise tests

Simulation studies on robustness
• e.g. p independent components from an exponential distribution
• Others: uniform distribution: slightly anticonservative; sum of a normal variable and one of the above: nearly exact

Summary
• Tests based on pairwise similarity or distance measures show high power in high-dimensional data.
• The permutation tests for the pairwise methods do not depend on normality assumptions and performed surprisingly well in many situations.
• The basic idea is not new (cf. Mantel, 1967), but might have lost attention, at least in the field of medical biometry.
• Extensions for other designs are possible to some degree.
• Similar (partly asymptotic) methods are available in the software “CANOCO” (Canonical Community Ordination) by ter Braak and Šmilauer (2002).
• The small number of possible permutations restricts application to very small samples. In this case the rotation test can help.
• It is, however, dependent on the parametric assumptions, so variables should be checked and, if necessary, transformed.

References
• Aittokallio, T., Ojala, P., Nevalainen, T.J., Nevalainen, O. (2000). Analysis of similarity of electrophoretic patterns in mRNA differential display. Electrophoresis 21, 2947-2956.
• Kropf, S., Heuer, H., Grüning, M., Smalla, K. (2004). Significance test for comparing complex microbial community fingerprints using pairwise similarity measures. Journal of Microbiological Methods 57/2, 187-195.
• Kropf, S., Lux, A., Eszlinger, M., Heuer, H., Smalla, K. (2007). Comparison of independent samples of high-dimensional data by pairwise distance measures. Biometrical Journal 49, 230-241.
• Langsrud, Ø. (2005). Rotation Tests. Statistics and Computing 15, 53-60.
• Läuter, J., Glimm, E., Kropf, S. (1998). Multivariate Tests Based on Left-Spherically Distributed Linear Scores. Annals of Statistics 26, 1972-1988. Erratum: Annals of Statistics 27, 1441.
• Läuter, J., Glimm, E., Eszlinger, M. (2005). Search for Relevant Sets of Variables in a High-Dimensional Setup Keeping the Familywise Error Rate. Submitted to Statistica Neerlandica.
• Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Approach. Cancer Res. 27, 209-220.
• ter Braak, C.J.F., Šmilauer, P. (2002). CANOCO Reference Manual and CanoDraw for Windows User’s Guide: Software for Canonical Community Ordination (Version 4.5). Microcomputer Power, Ithaca NY, USA.
