Tables, Figures, and Equations. From: McCune, B. & J. B. Grace. 2002. Analysis of Ecological Communities. MjM Software Design, Gleneden Beach, Oregon. http://www.pcord.com
Figure 14.1. Comparison of the line of best fit (first principal component) with regression lines. Point C is the centroid (from Pearson 1901).
Figure 14.2. The basis for evaluating “best fit” (Pearson 1901). In contrast, least-squares best fit is evaluated from vertical differences between points and the regression line.
Figure 14.3. Outliers can strongly influence correlation coefficients.
Step by step

1. From a data matrix A containing n sample units by p variables, calculate a cross-products matrix S whose elements are

s_jk = Σ_(i=1..n) (a_ij - ā_j)(a_ik - ā_k) / (n - 1)

The dimensions of S are p rows × p columns. The equation for a correlation matrix is the same as above, except that each difference is divided by the standard deviation, s_j.
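To make step 1 concrete, here is a minimal sketch (NumPy assumed; the data matrix A and its values are invented for illustration) of the cross-products and correlation matrices:

```python
import numpy as np

# Illustrative n x p data matrix: n = 4 sample units, p = 2 variables
A = np.array([[2.0, 4.0],
              [3.0, 5.0],
              [5.0, 9.0],
              [7.0, 8.0]])

n, p = A.shape
centered = A - A.mean(axis=0)               # subtract each column's mean
S_cov = centered.T @ centered / (n - 1)     # p x p cross-products (covariance) matrix
sd = A.std(axis=0, ddof=1)
S_cor = S_cov / np.outer(sd, sd)            # divide by standard deviations -> correlations

print(S_cov)
print(S_cor)                                # matches np.corrcoef(A, rowvar=False)
```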
2. Find the eigenvalues. Each eigenvalue (= latent root) is a λ that solves

│S - λI│ = 0

where I is the identity matrix (App. 1). This is the "characteristic equation."
The coefficients of the polynomial are derived by expanding the determinant; for a p × p matrix this gives a polynomial of degree p in λ whose p roots are the eigenvalues.
3. Then find the eigenvectors, Y.
• For every eigenvalue λ_i there is a vector y of length p, known as the eigenvector.
• Each eigenvector contains the coefficients of the linear equation for a given component (or axis). Collectively, these vectors form a p × p matrix, Y.
• To find the eigenvectors, we solve p equations with p unknowns:

[S - λI]y = 0
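As a minimal sketch of steps 2 and 3 (NumPy assumed; S is an illustrative 2 × 2 correlation matrix, the same one used in the worked example below):

```python
import numpy as np

S = np.array([[1.0, 0.4],
              [0.4, 1.0]])                 # illustrative correlation matrix

# eigh handles symmetric matrices; it returns eigenvalues in ascending order,
# so re-sort so that axis 1 carries the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, Y = eigvals[order], eigvecs[:, order]

print(eigvals)    # latent roots (here 1.4 and 0.6)
print(Y)          # columns of Y are the eigenvectors (coefficients per axis)
```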
4. Then find the scores for each case (or object) on each axis. Scores are the original data matrix post-multiplied by the matrix of eigenvectors:

X = B Y
(n × p) = (n × p)(p × p)

• B is the original data matrix
• Y is the matrix of eigenvectors
• X is the matrix of scores on each axis (component)
For eigenvector 1 and entity i, the score on axis 1 is

x_i1 = y_1 a_i1 + y_2 a_i2 + ... + y_p a_ip

This yields a linear equation for each dimension.
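A minimal sketch of step 4 (NumPy assumed; B and Y hold invented illustrative values), showing both the matrix form X = BY and the written-out linear equation for one entity:

```python
import numpy as np

B = np.array([[ 1.2,  0.8],
              [-0.5,  0.3],
              [-0.7, -1.1]])               # n = 3 entities, p = 2 variables
Y = np.array([[ 0.707,  0.707],
              [ 0.707, -0.707]])           # p x p matrix of eigenvectors

X = B @ Y                                  # n x p matrix of scores

i = 0
score_i1 = Y[0, 0] * B[i, 0] + Y[1, 0] * B[i, 1]   # y1*a_i1 + y2*a_i2
print(np.isclose(score_i1, X[i, 0]))               # True
```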
5. Calculate the loading matrix. The p × k matrix of correlations between each variable and each component is often called the principal components loading matrix. These correlations can be derived by rescaling the eigenvectors, or they can be calculated as correlation coefficients between each variable and the scores on the components.
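A minimal sketch of step 5 (NumPy assumed; the data are random numbers for illustration), confirming that the two routes to the loadings agree for a correlation-matrix PCA on standardized data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 3))                       # illustrative 30 x 3 data matrix
B = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)   # standardize the columns

S = np.corrcoef(B, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, Y = eigvals[order], eigvecs[:, order]

X = B @ Y                                          # component scores
loadings_rescaled = Y * np.sqrt(eigvals)           # eigenvectors rescaled by sqrt(eigenvalue)
loadings_corr = np.array([[np.corrcoef(B[:, j], X[:, k])[0, 1]
                           for k in range(3)] for j in range(3)])

print(np.allclose(loadings_corr, loadings_rescaled))   # True
```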
Geometric analog
1. Start with a cloud of n points in a p-dimensional space.
2. Center the axes in the point cloud (origin at the centroid).
3. Rotate the axes to maximize the variance along the axes. As the angle of rotation (θ) changes, the variance (s²) changes.

Variance along axis v: s²_v = y'Sy, where y' is 1 × p, S is p × p, and y is p × 1.
At the maximum variance, all partial derivatives will be zero (no slope in any dimension). This is another way of saying that we find the angle of rotation θ such that

∂s²/∂θ = 0

for each component (∂ indicates a partial derivative).
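A minimal numeric check of the geometric analog (NumPy assumed; p = 2 and the matrix S are illustrative): sweep the rotation angle θ, track the variance y'Sy along the rotated axis, and note that its maximum equals the largest eigenvalue:

```python
import numpy as np

S = np.array([[1.0, 0.4],
              [0.4, 1.0]])

thetas = np.linspace(0.0, np.pi, 2000)
variances = np.array([np.array([np.cos(t), np.sin(t)]) @ S @ np.array([np.cos(t), np.sin(t)])
                      for t in thetas])

print(variances.max())                      # ~1.4, reached where ds²/dθ = 0
print(np.linalg.eigvalsh(S).max())          # 1.4, the first eigenvalue
```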
Figure 14.4. PCA rotates the point cloud to maximize the variance along the axes.
Figure 14.5. Variance along an axis is maximized when the axis skewers the longest dimension of the point cloud. The axes are formed by the variables (attributes of the objects).
Example calculations, PCA

Start with a 2 × 2 correlation matrix, S, that we calculated from a data matrix of n × p items, where p = 2 in this case:

S = │ 1.0  0.4 │
    │ 0.4  1.0 │
We need to solve for the eigenvalues, λ, by solving the characteristic equation │S - λI│ = 0. Substituting our correlation matrix, S:

│ 1 - λ    0.4   │
│ 0.4     1 - λ  │ = 0
Expanding the determinant gives the polynomial λ² - 2λ + 0.84 = 0, and we solve this polynomial for the values of λ that satisfy the equation. Using the quadratic formula λ = (-b ± √(b² - 4ac)) / 2a with a = 1, b = -2, and c = 0.84, solving for the two roots gives us λ₁ = 1.4 and λ₂ = 0.6.
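The same roots can be checked with a few lines of plain Python (only the quadratic-formula arithmetic):

```python
import math

a, b, c = 1.0, -2.0, 0.84            # coefficients of lambda^2 - 2*lambda + 0.84 = 0
disc = math.sqrt(b**2 - 4*a*c)
lam1 = (-b + disc) / (2*a)           # 1.4
lam2 = (-b - disc) / (2*a)           # 0.6
print(lam1, lam2)
```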
Now find the eigenvectors, Y. For each λ there is a y that satisfies [S - λI]y = 0. For the first root we substitute λ₁ = 1.4, giving:

│ 1 - 1.4    0.4    │ │ y₁ │   │ 0 │
│ 0.4       1 - 1.4 │ │ y₂ │ = │ 0 │
Multiplying this out gives two equations with two unknowns:

-0.4 y₁ + 0.4 y₂ = 0
 0.4 y₁ - 0.4 y₂ = 0

Solving these simultaneous equations gives y₁ = 1 and y₂ = 1. Setting up and solving the equations for the second eigenvector yields y₁ = 1 and y₂ = -1.
We now normalize the eigenvectors, rescaling them so that the sum of squares = 1 for each eigenvector. In other words, the eigenvectors are scaled to unit length. The scaling factor k for each eigenvector i is

k_i = 1 / √(Σ y²)

So for the first eigenvector,

k₁ = 1 / √(1² + 1²) = 1/√2 ≈ 0.707
Then multiply this scaling factor by all of the items in the eigenvector:

0.707 × [1  1] = [0.707  0.707]

The same procedure is repeated for the second eigenvector, then the eigenvectors are multiplied by the original data matrix to yield the scores (X) for each of the entities on each of the axes (X = A Y).
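A minimal sketch of the normalization step for this worked example (NumPy assumed), confirming that the rescaled vectors still satisfy Sy = λy:

```python
import numpy as np

S = np.array([[1.0, 0.4],
              [0.4, 1.0]])
y1 = np.array([1.0,  1.0])                  # unscaled eigenvector for lambda_1 = 1.4
y2 = np.array([1.0, -1.0])                  # unscaled eigenvector for lambda_2 = 0.6

k1 = 1.0 / np.sqrt(np.sum(y1**2))           # = 1/sqrt(2), about 0.707
k2 = 1.0 / np.sqrt(np.sum(y2**2))
Y = np.column_stack([k1 * y1, k2 * y2])     # normalized eigenvectors as columns

print(Y[:, 0])                                      # [0.707, 0.707]
print(np.allclose(S @ Y[:, 0], 1.4 * Y[:, 0]))      # True
print(np.allclose(S @ Y[:, 1], 0.6 * Y[:, 1]))      # True
```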
The broken-stick eigenvalue for axis k is

b_k = Σ_(j=k..p) 1/j

where p is the number of columns and j indexes axes k through p.
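A minimal sketch of the broken-stick values as given above (NumPy assumed; the helper name broken_stick is invented):

```python
import numpy as np

def broken_stick(p):
    """Broken-stick eigenvalue b_k = sum of 1/j for j = k..p, for each axis k = 1..p."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) for k in range(1, p + 1)])

print(broken_stick(5))    # axis 1: 1 + 1/2 + 1/3 + 1/4 + 1/5 ≈ 2.28
```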
Addendum on randomization tests for PCA (not in McCune & Grace 2002, but in PC-ORD version 5; evaluation based on Peres-Neto et al. 2005).

The randomization: shuffle the values within variables (columns), then recompute the correlation matrix and eigenvalues. Repeat many times, and compare the actual eigenvalues in several ways with the eigenvalues from the randomizations (a code sketch of this procedure follows the three criteria below). Calculate the p-value as

p = (1 + n) / (1 + N)

where
• n = number of randomizations where the test statistic ≥ the observed value
• N = the total number of randomizations
Rnd-Lambda – Compare the randomized eigenvalue for an axis to the observed eigenvalue for that axis.
• a fairly conservative and generally effective criterion
• more effective than Avg-Rnd when uncorrelated variables are included in the data
• performs better than the other measures with strongly non-normal data

Rnd-F – Compare the randomized pseudo-F ratio for an axis to the observed pseudo-F for that axis. The pseudo-F ratio is the eigenvalue for an axis divided by the sum of the remaining (smaller) eigenvalues.
• particularly effective against uncorrelated variables
• performs poorly with grossly non-normal error structures

Avg-Rnd – Compare the observed eigenvalue for a given axis to the average eigenvalue obtained for that axis after randomization.
• good when the data do not contain uncorrelated variables
• less stringent; too liberal when the data contain uncorrelated variables
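A minimal sketch of a Rnd-Lambda-style test (NumPy assumed; this is a simplified illustration of the shuffling idea, not PC-ORD's exact implementation):

```python
import numpy as np

def rnd_lambda(A, n_rand=999, seed=0):
    """Shuffle values within each column, recompute correlation-matrix eigenvalues,
    and return one p-value per axis using p = (1 + n) / (1 + N)."""
    rng = np.random.default_rng(seed)
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(A, rowvar=False)))[::-1]
    count = np.zeros_like(obs)
    for _ in range(n_rand):
        shuffled = np.column_stack([rng.permutation(col) for col in A.T])
        rnd = np.sort(np.linalg.eigvalsh(np.corrcoef(shuffled, rowvar=False)))[::-1]
        count += (rnd >= obs)
    return (1 + count) / (1 + n_rand)
```

Shuffling within columns preserves each variable's marginal distribution while destroying the correlations among variables, which is the null hypothesis these criteria evaluate.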