Xuhua Xia. Slide 2. Typical Form of Data. A data set in an 8x3 matrix. The rows could be species and the columns sampling sites.
X =
100  97  99
 96  90  90
 80  75  60
 75  85  95
 62  40  28
 77  80  78
 92  91  80
 75  85 100
A matrix is often referred to as an nxp matrix (n for number of rows and p for number of col
1. Xuhua Xia Slide 1 Principal Components Analysis Objectives:
Understand the principles of principal components analysis (PCA)
Recognize conditions under which PCA may be useful
Use SAS procedure PRINCOMP to
perform a principal components analysis
interpret PRINCOMP output.
2. Xuhua Xia Slide 2 Typical Form of Data
3. Xuhua Xia Slide 3 What are Principal Components? Principal components are linear combinations of the observed variables. The coefficients of these principal components are chosen to meet three criteria
What are the three criteria?
4. Xuhua Xia Slide 4 What are Principal Components? The three criteria:
There are exactly p principal components (PCs), each being a linear combination of the observed variables;
The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated);
The components are extracted in order of decreasing variance.
5. Xuhua Xia Slide 5 A Simple Data Set
6. Xuhua Xia Slide 6 General Patterns The total variance is 3 (= 1 + 2)
The two variables, X and Y, are perfectly correlated, with all points falling on the regression line.
The spatial relationship among the 5 points can therefore be represented by a single dimension.
PCA is a dimension-reduction technique. What would happen if we apply PCA to the data?
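A short worked calculation suggests the answer. It assumes that the "1 + 2" above refers to Var(X) = 1 and Var(Y) = 2 and that the correlation is +1, so that Cov(X, Y) = sqrt(1 x 2); the symbols S and lambda are introduced here for illustration only.

```latex
S = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix}, \qquad
\det(S - \lambda I) = (1-\lambda)(2-\lambda) - 2 = \lambda^{2} - 3\lambda = 0
\;\Rightarrow\; \lambda_1 = 3,\quad \lambda_2 = 0 .
```

Under these assumptions the first principal component carries the whole variance of 3 and the second carries none, which is what "represented by a single dimension" means.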
7. Xuhua Xia Slide 7 Graphic PCA
8. Xuhua Xia Slide 8 SAS Program
9. Xuhua Xia Slide 9 A positive definite matrix When you run the SAS program, the log file will warn that “The Correlation Matrix is not positive definite.” What does that mean?
A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z’Mz > 0 for all non-zero vectors z with real entries, where z’ is the transpose of z.
Given our correlation matrix with all entries being 1, it is easy to find a non-zero z that leads to z’Mz = 0. So the matrix is not positive definite:
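One such vector is z = (1, -1). A minimal Python sketch of the check follows; the helper name quadratic_form is illustrative and has nothing to do with SAS.

```python
def quadratic_form(M, z):
    """Return z' M z for a square matrix M (a list of rows) and a vector z."""
    n = len(z)
    return sum(z[i] * M[i][j] * z[j] for i in range(n) for j in range(n))


# Correlation matrix of two perfectly correlated variables: every entry is 1.
M = [[1.0, 1.0],
     [1.0, 1.0]]

z = [1.0, -1.0]                 # a non-zero vector
print(quadratic_form(M, z))     # 1 - 1 - 1 + 1 = 0.0, so M is not positive definite
```

Because a non-zero z gives exactly 0 rather than a positive number, the matrix fails the definition above (it is only positive semi-definite).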
10. Xuhua Xia Slide 10 SAS Output
11. Xuhua Xia Slide 11 SAS Output
12. Xuhua Xia Slide 12 SAS Output
13. Xuhua Xia Slide 13 Steps in a PCA Have at least two variables
Generate a correlation or variance-covariance matrix
Obtain eigenvalues and eigenvectors (This is called an eigenvalue problem, and will be illustrated with a simple numerical example)
Generate principal component (PC) scores
Plot the PC scores in the space with reduced dimensions
All these can be automated by using SAS.
When to use a correlation matrix:
1. When different units are used for different variables
2. When data are from species of very different mean densities
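The steps listed on this slide can be sketched in a few lines of plain Python for the two-variable case. This is only an illustration under stated assumptions: the data are hypothetical, the function names are invented here, the covariance matrix is used rather than the correlation matrix, and the closed-form eigen-solution applies only to a 2x2 symmetric matrix with non-zero covariance. It is not what PROC PRINCOMP does internally.

```python
import math


def mean(v):
    return sum(v) / len(v)


def covariance(u, v):
    """Sample covariance with the n - 1 denominator."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def pca_two_variables(x, y):
    """PCA of two variables based on their variance-covariance matrix."""
    # Step 2: variance-covariance matrix [[a, b], [b, c]]
    a, b, c = covariance(x, x), covariance(x, y), covariance(y, y)

    # Step 3: eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + root, half_trace - root]

    # Unit-length eigenvectors; this form assumes b != 0
    eigenvectors = []
    for lam in eigenvalues:
        v1, v2 = b, lam - a
        norm = math.hypot(v1, v2)
        eigenvectors.append((v1 / norm, v2 / norm))

    # Step 4: PC scores = centred data projected onto each eigenvector
    mx, my = mean(x), mean(y)
    scores = [[(xi - mx) * e1 + (yi - my) * e2 for e1, e2 in eigenvectors]
              for xi, yi in zip(x, y)]
    return eigenvalues, eigenvectors, scores


# Step 1: at least two variables (hypothetical data, not taken from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

eigenvalues, eigenvectors, scores = pca_two_variables(x, y)
print("eigenvalues :", eigenvalues)
print("eigenvectors:", eigenvectors)
print("PC1 scores  :", [round(s[0], 3) for s in scores])
```

Step 5, plotting, is then a matter of drawing the PC1 scores (and PC2 scores, if kept) on new axes.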
14. Xuhua Xia Slide 14 Covariance or Correlation Matrix? We sample species along a sandy beach. If some species (e.g., Sp2) increase in abundance from lower to higher shore, while other species (e.g., Sp1) maintain their abundance, then clearly it is the variance in the abundance of Sp2 that reflects the tidal gradient. The variance of Sp1 is entirely uninformative with reference to the gradient.
We note that the variance in abundance for Sp2 is large and that for Sp1 is small. If we use a correlation matrix in PCA, then the two variances are given equal weight, which is not a good idea.
In such cases, we should use a covariance matrix.
15. Xuhua Xia Slide 15 Covariance or Correlation Matrix? Now we are sampling the same sandy beach. We note that the variance in abundance for Sp2 is greater than that for Sp3. However, the abundance of Sp3 is in fact a better predictor of the tidal gradient than that of Sp2. If we use a covariance matrix, then the variance in Sp3, which is smaller, is given less weight in PCA, which is clearly not a good idea.
In such cases, we should use a correlation matrix so that the variance of both variables is scaled to be 1, i.e., they will carry equal weight in PCA.
16. Xuhua Xia Slide 16 Covariance or Correlation Matrix? What would happen if we have all three types of species in our data? This situation will almost certainly happen when you sample a lot of species.
I recommend the use of a correlation matrix in such cases for the following reason. When you have many variables, it is the correlation structure among variables that matters. Sp2 and Sp3 are positively correlated and they should determine the extraction of principal components. Sp1 is not correlated with either and will have little effect on the extraction of the first few (most important) principal components.
SAS uses the correlation matrix by default.
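The weighting argument in the last three slides can be made concrete with a small Python sketch. The abundances below are hypothetical (they are not the data behind the slides), and the function names are invented for illustration. The sketch only shows how much of the total variance each species contributes under each matrix.

```python
import math


def sample_cov(u, v):
    """Sample covariance with the n - 1 denominator."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def cov_and_corr(x, y):
    """Return the 2x2 covariance and correlation matrices of x and y."""
    sxx, sxy, syy = sample_cov(x, x), sample_cov(x, y), sample_cov(y, y)
    r = sxy / math.sqrt(sxx * syy)
    return [[sxx, sxy], [sxy, syy]], [[1.0, r], [r, 1.0]]


# Hypothetical abundances at five sites, from lower to higher shore
sp2 = [10, 30, 50, 70, 90]   # large variance
sp3 = [1, 2, 3, 4, 6]        # small variance, but it also tracks the gradient

cov, corr = cov_and_corr(sp2, sp3)

# Share of the total variance (the matrix trace) contributed by each species
for name, m in (("covariance", cov), ("correlation", corr)):
    total = m[0][0] + m[1][1]
    print(name, "Sp2 share:", round(m[0][0] / total, 3),
          "Sp3 share:", round(m[1][1] / total, 3))
```

With the covariance matrix nearly all of the variance comes from Sp2, so Sp3 barely influences the first component; with the correlation matrix each species contributes exactly half.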
17. Xuhua Xia Slide 17 The Eigenvalue Problem
18. Xuhua Xia Slide 18 Get the Eigenvectors An eigenvector is a vector (x) that satisfies the following condition: Ax = λx
In our case A is a variance-covariance matrix of order 2, and x is a vector with two elements, x1 and x2.
19. Xuhua Xia Slide 19 Get the Eigenvectors We want to find an eigenvector of unit length, i.e., x1² + x2² = 1
We therefore have
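For a generic symmetric 2x2 matrix, the two conditions (Ax = λx and unit length) can be worked through as follows. The entries a, b and c are placeholder symbols introduced here, not the numbers used on the slides, and the last line assumes b ≠ 0.

```latex
A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}, \qquad
Ax = \lambda x \iff
\begin{cases}
(a-\lambda)\,x_1 + b\,x_2 = 0 \\
b\,x_1 + (c-\lambda)\,x_2 = 0
\end{cases}

\det(A - \lambda I) = (a-\lambda)(c-\lambda) - b^{2} = 0
\;\Rightarrow\;
\lambda = \frac{(a+c) \pm \sqrt{(a-c)^{2} + 4b^{2}}}{2}

x_2 = \frac{\lambda - a}{b}\,x_1, \qquad x_1^{2} + x_2^{2} = 1
\;\Rightarrow\;
x_1 = \frac{b}{\sqrt{b^{2} + (\lambda-a)^{2}}}, \quad
x_2 = \frac{\lambda - a}{\sqrt{b^{2} + (\lambda-a)^{2}}}
```

Each eigenvalue gives one eigenvector, determined only up to sign: multiplying both x1 and x2 by -1 satisfies the same two conditions.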
20. Xuhua Xia Slide 20 Get the PC Scores
21. Xuhua Xia Slide 21 What Are Principal Components? Principal components are a new set of variables, which are linear combinations of the observed ones, with these properties:
Because of the decreasing variance property, much of the variance (information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information. PCA is therefore considered a dimension-reduction technique.
Because PCs are orthogonal, they can be used instead of the original variables in situations where having orthogonal variables is desirable (e.g., regression).
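How many PCs can be dropped is usually judged from the proportion of variance each one carries. A minimal Python sketch, with hypothetical eigenvalues and an illustrative function name:

```python
def variance_explained(eigenvalues):
    """Return (proportion, cumulative proportion) for each PC, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    result, running = [], 0.0
    for lam in ordered:
        running += lam
        result.append((lam / total, running / total))
    return result


# Hypothetical eigenvalues of a 3-variable covariance matrix
table = variance_explained([3.0, 1.5, 0.5])
for i, (prop, cum) in enumerate(table, start=1):
    print(f"PC{i}: proportion = {prop:.2f}, cumulative = {cum:.2f}")
```

In this made-up case the first two PCs account for 90% of the total variance, so the third could be dropped with little loss.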
22. Xuhua Xia Slide 22 Index of hidden variables The ranking of Asian universities by the Asian Week
HKU is ranked second in financial resources, but seventh in academic research
How did HKU get ranked third?
Is there a more objective way of ranking?
An illustrative example:
23. Xuhua Xia Slide 23 A Simple Data Set School 5 is clearly the best school
School 1 is clearly the worst school
24. Xuhua Xia Slide 24 Graphic PCA
25. Xuhua Xia Slide 25 Crime Data in 50 States
28. Xuhua Xia Slide 28 Correlation Matrix
29. Xuhua Xia Slide 29 Eigenvalues
30. Xuhua Xia Slide 30 Eigenvectors Do these eigenvectors mean anything?
All crime variables have positive loadings on the first eigenvector, which is therefore interpreted as a measure of overall crime rate.
The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted as measuring the preponderance of property crime over violent crime.
31. Xuhua Xia Slide 31 PC Plot: Crime Data