Introduction to Multivariate Analysis and Multivariate Distances Hal Whitehead BIOL4062/5062
Outline • Data matrices • Problems with data matrices (missing values, outliers) • Matrices used in multivariate analysis • Multivariate distances • Association matrices
The Data Matrix (figure: columns are variables, rows are units)
Problems with Data Matrix • Missing values • Outliers • Units not independent • Many zeros • Not multivariate normal
Missing Data: often present in ecological, or other biological, data • Delete columns of data matrix • Delete rows of data matrix • Just delete pairs of elements where one is missing • Interpolate
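The four options for missing data can be sketched in a few lines of Python. This is only an illustration: the data matrix is hypothetical, `None` stands for a missing value, and the function names are invented here. The slide does not say how to interpolate, so linear interpolation between neighbouring units is an assumption that only makes sense when the units are ordered (for instance in time).

```python
# Hypothetical data matrix: rows are units, columns are variables.
# None marks a missing value.
data = [
    [1.0, 2.0, 3.0],
    [2.0, None, 5.0],
    [3.0, 6.0, 7.0],
    [4.0, 8.0, 9.0],
]

def drop_columns(matrix):
    """Keep only the variables (columns) that have no missing values."""
    keep = [j for j in range(len(matrix[0]))
            if all(row[j] is not None for row in matrix)]
    return [[row[j] for j in keep] for row in matrix]

def drop_rows(matrix):
    """Keep only the units (rows) that have no missing values."""
    return [row for row in matrix if all(v is not None for v in row)]

def complete_pairs(matrix, j, k):
    """Pairwise deletion: values of variables j and k for units where both exist."""
    return [(row[j], row[k]) for row in matrix
            if row[j] is not None and row[k] is not None]

def interpolate_column(matrix, j):
    """Fill interior gaps in column j by linear interpolation.

    Assumption: rows are in a meaningful order. Gaps at either end are left as None.
    """
    col = [row[j] for row in matrix]
    known = [i for i, v in enumerate(col) if v is not None]
    filled = list(col)
    for i, v in enumerate(col):
        if v is None:
            before = [a for a in known if a < i]
            after = [b for b in known if b > i]
            if before and after:
                a, b = before[-1], after[0]
                filled[i] = col[a] + (col[b] - col[a]) * (i - a) / (b - a)
    return filled

print(drop_columns(data))
print(drop_rows(data))
print(complete_pairs(data, 0, 1))
print(interpolate_column(data, 1))
```

Deleting columns or rows discards observed values along with the missing ones, whereas pairwise deletion keeps more data at the cost of each covariance or correlation being based on a different subset of units.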
Outliers • Statistical packages often indicate “outliers”, e.g. "*** WARNING *** Case 86 has large leverage (Leverage = 0.252)" • Outliers may be discarded if they are plausibly (a) the result of biological, or other, processes outside the scope of the model being used, or (b) the result of measurement or coding error • Otherwise they should be retained (perhaps use a different model)
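To give a feel for what a leverage warning measures, here is a minimal sketch for the simplest case, a regression on a single predictor, where the leverage of unit i is 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2. The predictor values are hypothetical, and the cutoff of 2p/n is a common rule of thumb assumed here; it is not necessarily the criterion used by the package that produced the warning quoted on the slide.

```python
def leverages(x):
    """Leverage of each unit in a simple linear regression on predictor x."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in x]

# Hypothetical predictor values; the last unit lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverages(x)

# Assumed rule of thumb: flag leverage above 2p/n, with p = 2 fitted parameters
# (intercept and slope).
cutoff = 2 * 2 / len(x)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

print(h)
print(flagged)
```

A flagged unit is only a candidate for removal; whether to discard it still depends on the biological and measurement considerations listed on the slide.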
Problems with Data Matrix • Missing values • Outliers • Units not independent: not a problem unless doing tests • Many zeros: special methods (e.g. correspondence analysis) • Not multivariate normal: transform if possible
Uses of Multivariate Analysis • Large data sets: simplify, summarize, find patterns • Analyze groupings of units • Find groupings of units • Examine relationships between variables
Some Matrices Used in Multivariate Analysis
• Data matrix: rectangular; units $i = 1, \dots, n$; variables $j, k$
• Covariance matrix between variables: symmetric (square/triangular)
  $c_{jk} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) / (n - 1)$, where $\bar{x}_k = \mathrm{mean}(x_{ik})$
• Correlation matrix between variables: symmetric (square/triangular)
  $r_{jk} = c_{jk} / (S_j S_k)$, where $S_k = \mathrm{SD}(x_{ik})$
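The covariance and correlation formulas translate directly into code. The following is a minimal sketch in plain Python; the function names and the small data matrix `X` are hypothetical and chosen only for illustration.

```python
import math

def column_means(X):
    """Mean of each variable (column) over the n units (rows)."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]

def covariance_matrix(X):
    """c_jk = sum_i (x_ij - mean_j)(x_ik - mean_k) / (n - 1)."""
    n, p = len(X), len(X[0])
    means = column_means(X)
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in X) / (n - 1)
             for k in range(p)]
            for j in range(p)]

def correlation_matrix(X):
    """r_jk = c_jk / (S_j S_k), where S_j is the square root of c_jj."""
    C = covariance_matrix(X)
    sd = [math.sqrt(C[j][j]) for j in range(len(C))]
    return [[C[j][k] / (sd[j] * sd[k]) for k in range(len(C))]
            for j in range(len(C))]

# Hypothetical data matrix: 4 units (rows) by 2 variables (columns).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0], [4.0, 7.0]]
C = covariance_matrix(X)
R = correlation_matrix(X)
print(C)
print(R)
```

Both matrices are symmetric, so only the lower (or upper) triangle needs to be stored or displayed.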
Multivariate distances between units or groups of units: 1. Euclidean distance ($p$ variables)
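In standard notation, and assuming $x_{ik}$ denotes the value of variable $k$ for unit $i$ (as in the covariance formula), the Euclidean distance between units $i$ and $j$ can be written as:

$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$

Note that $j$ indexes a second unit here, not a variable. For groups of units, the $x$ values would be replaced by group means.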
Multivariate distances between units or groups of units: 2. Penrose distance ($p$ variables; $S_k^2$ = variance of $x_{ik}$) • Corrects for different units and different ranges of the variables
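One common form of the Penrose distance, assumed here and using the same notation ($p$ variables, $S_k^2$ the variance of variable $k$), is:

$$P_{ij} = \sum_{k=1}^{p} \frac{(x_{ik} - x_{jk})^2}{p \, S_k^2}$$

Dividing each squared difference by $S_k^2$ puts all variables on a comparable scale, which is how the measure corrects for different units and ranges.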
Multivariate distances between units or groups of units: 3. Mahalanobis distance ($p$ variables; $v^{rs}$ = elements of the inverse of the covariance matrix) • Corrects for correlations between variables
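In the usual formulation, assumed here, the squared Mahalanobis distance between units $i$ and $j$ is a quadratic form in the differences, with $v^{rs}$ the element in row $r$ and column $s$ of the inverse covariance matrix:

$$D_{ij}^2 = \sum_{r=1}^{p} \sum_{s=1}^{p} (x_{ir} - x_{jr}) \, v^{rs} \, (x_{is} - x_{js})$$

If the variables are uncorrelated, then $v^{rs} = 0$ for $r \neq s$ and $v^{rr} = 1 / S_r^2$, so $D_{ij}^2$ reduces to $p$ times the Penrose form $\sum_k (x_{ik} - x_{jk})^2 / (p \, S_k^2)$.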
3 species of iris; 4 measurements

Euclidean distances:
     A      B      C
A    0
B    3.2    0
C    4.8    1.6    0

Penrose distances:
     A      B      C
A    0
B    2.8    0
C    3.9    1.5    0

Mahalanobis distances:
     A      B      C
A    0
B    89.9   0
C    179.4  17.2   0
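The three distance measures compared above can be sketched in plain Python. The iris measurements themselves are not given on the slide, so the two units `a` and `b`, the variances, and the covariance matrices below are hypothetical. The Penrose form (squared differences divided by p times the variance) and the use of the squared Mahalanobis distance are assumptions about the definitions intended.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences over the p variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def penrose(a, b, variances):
    """Assumed form: sum_k (a_k - b_k)^2 / (p * S_k^2)."""
    p = len(a)
    return sum((x - y) ** 2 / (p * v) for x, y, v in zip(a, b, variances))

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        y[i] = (M[i][n] - sum(M[i][c] * y[c] for c in range(i + 1, n))) / M[i][i]
    return y

def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance d' V^-1 d, computed without forming V^-1."""
    d = [x - y for x, y in zip(a, b)]
    y = solve(cov, d)
    return sum(di * yi for di, yi in zip(d, y))

# Hypothetical units measured on p = 2 variables.
a, b = [1.0, 2.0], [4.0, 6.0]
cov_uncorrelated = [[4.0, 0.0], [0.0, 16.0]]
cov_correlated = [[4.0, 2.0], [2.0, 16.0]]

print(euclidean(a, b))
print(penrose(a, b, [4.0, 16.0]))
print(mahalanobis_sq(a, b, cov_uncorrelated))
print(mahalanobis_sq(a, b, cov_correlated))
```

With uncorrelated variables the squared Mahalanobis distance equals p times the Penrose value; introducing a covariance between the variables changes the Mahalanobis distance while leaving the other two untouched.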
The Standard Data Matrix (figure: columns are variables, rows are units)
The Association Matrix (figure: both columns and rows are units)
Association matrices (similarity or dissimilarity) • Social structure: association between individuals • Community ecology: similarity between species, sites; dissimilarity between species, sites • Genetic distances • Correlation matrices • Covariance matrices • Distance matrices: Euclidean, Penrose, Mahalanobis
Association matrices: Dissimilarity/Similarity
• Dissimilarity: Mahalanobis distances between iris species:
     A      B      C
A    0
B    89.9   0
C    179.4  17.2   0
• Similarity: genetic relatedness among bottlenose dolphins (Krutzen et al. 2003)
Association matrices: Symmetric/Asymmetric • Symmetric: genetic relatedness among bottlenose dolphins (Krutzen et al. 2003) • Asymmetric: grooming rates of capuchin monkeys (Perry 1996)
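The distinction between symmetric and asymmetric association matrices is easy to check in code. In this minimal sketch the iris matrix uses the Mahalanobis distances quoted earlier, while the `grooming` matrix is entirely hypothetical (entry [i][j] standing for the rate at which individual i grooms individual j) and is not taken from Perry (1996). The function names are invented for illustration.

```python
def from_lower_triangle(rows):
    """Expand a lower-triangular listing into a full symmetric square matrix."""
    n = len(rows)
    return [[rows[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]

def is_symmetric(A, tol=1e-12):
    """True if A is square and A[i][j] equals A[j][i] for every pair of units."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False
    return all(abs(A[i][j] - A[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))

# Mahalanobis distances between iris species A, B, C (lower triangle from the slide).
iris = from_lower_triangle([[0], [89.9, 0], [179.4, 17.2, 0]])

# Hypothetical directed grooming rates among three individuals.
grooming = [
    [0, 3, 1],
    [0, 0, 2],
    [5, 1, 0],
]

print(is_symmetric(iris))
print(is_symmetric(grooming))
```

A symmetric matrix can be stored as a triangle, as in the distance tables; an asymmetric one needs the full square because the value from i to j differs from the value from j to i.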