Principal Component Analysis

Principal Component Analysis Adapted by Paul Anderson from Tutorial by Doug Raiford

The Problem with Apples and Oranges • High dimensionality • Can’t “see” • If had only one, two, or three features, could represent graphically • But 4 or more…

If Could Compress Into 2 Dimensions • Apples and oranges: feature vectors • Axis of greatest variance

How? • In MATLAB • evects = princomp(allFruit); • b1=evects(:,1); • b2=evects(:,2); • Z1=allFruit*b1; • Z2=allFruit*b2; • scatter(Z1,Z2); • (In newer MATLAB releases, pca replaces princomp)

Real World Example • 59 dimensions • 3500 genes • Very useful in exploratory data analysis • Sometimes useful as a direct tool (MCU)

But We’re Not Scared of the Details • Given • Data matrix M (feature vectors for all examples) • Generate • Covariance matrix for M (Σ) • Eigenvectors (principal components) from covariance matrix • Pipeline: M → Σ → Eigenvectors

Eigenvectors and Eigenvalues • Each Eigenvector is accompanied by an Eigenvalue • The Eigenvector with the greatest Eigenvalue points along the axis of greatest variance
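A short derivation may help show why the largest Eigenvalue marks the axis of greatest variance. It assumes the data in M have been mean-centered and that v is a unit vector; the notation Var(Mv) for the variance of the data projected onto v is introduced here and does not appear on the slides.

```latex
\begin{align*}
\operatorname{Var}(Mv) &= v^{\top} \Sigma v, \qquad \lVert v \rVert = 1 \\
\Sigma v = \lambda v \;&\Longrightarrow\; v^{\top} \Sigma v = \lambda\, v^{\top} v = \lambda
\end{align*}
```

So the variance along each Eigenvector equals its Eigenvalue, and the Eigenvector with the greatest Eigenvalue carries the most variance.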

Eigenvectors and Eigenvalues • If we use only the first principal component, there is very little degradation of the data • Have reduced dimensions from 2 to 1
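One way to put a number on "very little degradation" is the fraction of total variance carried by the first principal component. The sketch below assumes a small made-up 2-D data set (not the fruit data from the slides) and uses only cov, eig, and sort.

```matlab
% Hypothetical 2-D data: rows are examples, columns are features.
M = [1 1.1; 2 1.9; 3 3.2; 4 3.8; 5 5.1];

Sigma   = cov(M);                        % 2x2 covariance matrix
lambdas = sort(eig(Sigma), 'descend');   % Eigenvalues, largest first

% Share of the total variance kept when only the first component is used.
retained = lambdas(1) / sum(lambdas);
fprintf('Fraction of variance kept by the first component: %.4f\n', retained);
```

A value close to 1 suggests that dropping the second component loses little information for this particular data set.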

Project data onto new axes • Once we have the Eigenvectors, can project data onto the new axes • Eigenvectors are unit vectors, so a simple dot product produces the desired effect • Pipeline: M → Σ → Eigenvectors → Project Data
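The dot-product projection can be sketched as follows. The vector b1 and the data values are hypothetical and chosen only so the arithmetic is easy to follow; in practice b1 would come from the Eigenvector computation.

```matlab
% Hypothetical unit-length principal component.
b1 = [1; 1] / sqrt(2);

% One feature vector (a row): its coordinate along b1 is a dot product.
x  = [3 1];
z1 = x * b1;

% A whole data matrix (one example per row) projects the same way.
M  = [3 1; 0 2; -1 -1];
Z1 = M * b1;          % one projected coordinate per example
```

Because b1 has length 1, no division by its norm is needed; with a non-unit vector the result would be scaled by that norm.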

Covariance Matrix • Pipeline: M → Σ → Eigenvectors → Project Data

Covariance Matrix
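Since the formula on this slide did not survive, here is a minimal sketch of the usual sample covariance computation, assuming one example per row of M and the n − 1 normalization that MATLAB's cov uses by default. The data values are made up.

```matlab
% Hypothetical data matrix: 3 examples (rows), 2 features (columns).
M = [1 2; 3 6; 5 7];
n = size(M, 1);

% Subtract each column's mean so every feature is centered at zero.
Mc = M - repmat(mean(M, 1), n, 1);

% Sample covariance: variances on the diagonal, covariances off it.
Sigma = (Mc' * Mc) / (n - 1);

disp(Sigma);
disp(cov(M));   % built-in result for comparison
```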

Eigenvector • Eigenvector: a linear transform of the Eigenvector, using Σ as the transformation matrix, results in a parallel vector: Σv = λv • Pipeline: M → Σ → Eigenvectors → Project Data

Eigenvector • How to find • Σ is an n×n matrix • There will be n Eigenvectors • Eigenvectors ≠ 0 • Eigenvalues ≠ 0 (when Σ is nonsingular)

Eigenvector • A is invertible if and only if det(A) ≠ 0 • If (A − λI) is invertible then (A − λI)v = 0 gives: v = (A − λI)⁻¹ 0 = 0 • But it is given that v ≠ 0, so (A − λI) must not be invertible • Not invertible, so det(A − λI) = 0

Eigenvector • First, solve for the λ’s by expanding det(Σ − λI) = 0 into the characteristic polynomial • For a 2×2 Σ, solving for λ gives 2 roots, λ1 and λ2.
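For a 2×2 matrix the characteristic polynomial is λ² − trace(Σ)·λ + det(Σ), so the two roots can be found directly. The sketch below assumes a hypothetical Σ chosen to give whole-number Eigenvalues; it is not the example from the slides.

```matlab
% Hypothetical symmetric 2x2 covariance matrix.
Sigma = [2 1; 1 2];

% Coefficients of det(Sigma - lambda*I) = lambda^2 - trace*lambda + det.
p = [1, -trace(Sigma), det(Sigma)];

% The roots of the characteristic polynomial are the Eigenvalues.
lambdas = sort(roots(p), 'descend');

disp(lambdas);      % the two roots, lambda1 and lambda2
disp(eig(Sigma));   % built-in result for comparison (order may differ)
```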

Eigenvector • Now that the Eigenvalues have been acquired we can solve for the Eigenvector (v below): (Σ − λI)v = 0 • Know Σ, know λ, know I, so this becomes a homogeneous system of equations (equal to 0) with the entries of v as the variables • Already know that there is no unique solution • The only way there would be a unique solution is if the trivial solution (v = 0) were the only solution • If this were the case, (Σ − λI) would be invertible
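A minimal sketch of solving the homogeneous system for a 2×2 case, assuming a hypothetical Σ = [2 1; 1 2] and its Eigenvalue λ = 3. The first row of (Σ − λI)v = 0 reads (a − λ)v1 + b·v2 = 0, so v is proportional to [b; λ − a]; this shortcut assumes the off-diagonal entry b is nonzero.

```matlab
% Hypothetical covariance matrix and one of its Eigenvalues.
Sigma  = [2 1; 1 2];
lambda = 3;

% Sigma - lambda*I is singular, so the system has non-trivial solutions.
A = Sigma - lambda * eye(2);

% Any multiple of [b; lambda - a] satisfies the first row (needs b ~= 0).
v = [Sigma(1, 2); lambda - Sigma(1, 1)];

% Pick the unit-length representative from the family of solutions.
v = v / norm(v);

disp(A * v);   % should be (numerically) the zero vector
```

Normalizing is what makes the answer specific: every scalar multiple of v solves the system, which is the "no unique solution" noted above.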

Back to the example

Eigenvectors (Summary) • Find characteristic polynomial P(λ) using determinant • Solve for Eigenvalues (λ’s) • Solve for Eigenvectors • Pipeline: M → Σ → Eigenvectors → Project Data; within the Eigenvectors step: P(λ) → λ’s → Eigenvectors
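The whole pipeline can be sketched end to end with built-in functions. This is one possible arrangement, not necessarily what princomp does internally; the data set is made up, and eig stands in for the hand computation of the characteristic polynomial.

```matlab
% Hypothetical data: 10 examples (rows), 2 features (columns).
M = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; ...
     2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];

% Step 1: covariance matrix of the data.
Sigma = cov(M);

% Step 2: Eigenvectors (columns of V) and Eigenvalues (diagonal of D).
[V, D] = eig(Sigma);

% Sort explicitly, largest Eigenvalue first, rather than rely on eig's order.
[lambdas, order] = sort(diag(D), 'descend');
V = V(:, order);

% Step 3: center the data and project it onto the principal components.
Mc = M - repmat(mean(M, 1), size(M, 1), 1);
Z  = Mc * V;        % column k holds the coordinates along component k
```

Centering before projecting only shifts the scores by a constant, so a scatter plot of the uncentered projection has the same shape.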

Axis of Greatest Variance? • Equation for an ellipse: Ax² + Bxy + Cy² + Dx + Ey + F = 0 • D, E, and F have to do with translation • A and C related to the ellipse’s spread along the X and Y axes, respectively • B has to do with rotation

Axis of Greatest Variance • Mathematicians discovered that any ellipse can be exactly captured by a symmetric matrix • Covariance matrix is symmetric • The Eigenvectors of the said matrix point along the principal axes of the ellipse • Origin of the name (principal components analysis) • Matrix entries: first diagonal entry related to spread along x axis (variance of data along x axis); second diagonal entry related to spread along y axis; off-diagonal entries related to rotation (covariance)
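The link between the ellipse and a symmetric matrix can be written out for the quadratic part of the conic equation. This is the standard identity, shown here as a sketch since the slide's own figure did not survive; the B/2 split of the xy coefficient is what makes the matrix symmetric.

```latex
\[
Ax^2 + Bxy + Cy^2 =
\begin{pmatrix} x & y \end{pmatrix}
\begin{pmatrix} A & B/2 \\ B/2 & C \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
\]
```

The diagonal entries play the role of the variances and the off-diagonal entry the role of the covariance, which is why B = 0 corresponds to an ellipse aligned with the coordinate axes.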

Principal Axis Theorem • Principal axis theorem holds for quadratic forms (conic sections) in higher dimensional spaces

Project Data Onto Principal Components • Eigenvectors are unit vectors • Pipeline: M → Σ → Eigenvectors → Project Data

Review • In MATLAB • evects = princomp(allFruit); • b1=evects(:,1); • b2=evects(:,2); • Z1=allFruit*b1; • Z2=allFruit*b2; • scatter(Z1,Z2);

Practice • Covariance matrix

Practice • Pipeline: M → Σ → Eigenvectors → Project Data; within the Eigenvectors step: P(λ) → λ’s → Eigenvectors

Questions?

Why Invertible if row reducible to I?

Implication of Zero Determinants

Why Eigenvector Associated with Greatest λ Points Along Axis of Greatest Variance

Rotation • Good search terms: rotation of axes conic sections • Note that in the sections above dealing with the ellipse, hyperbola, and the parabola, the algebraic equations that appeared did not contain a term of the form xy. However, in our “Algebraic View of the Conic Sections,” we stated that every conic section is of the form • Ax² + Bxy + Cy² + Dx + Ey + F = 0 • where A, B, C, D, E, and F are constants. In essence, all of the equations that we have studied have had B=0. So the question arises: “What role, if any, does the xy term play in conic sections? If it were present, how would that change the geometric figure?” • First of all, the answer is NOT that the conic changes from one type to another. That is to say, if we introduce an xy term, the conic does NOT change from an ellipse to a hyperbola. If we start with the standard equation of an ellipse and insert an extra term, an xy term, we still have an ellipse, provided B² − 4AC remains negative. • So what does the xy term do? The xy term actually rotates the graph in the plane. For example, in the case of an ellipse, the major axis is no longer parallel to the x-axis or y-axis. Rather, depending on the constant in front of the xy term, we now have the major axis rotated. Let’s look at an example.

Rotation • A is related to elongation in x direction • C is related to elongation in y direction • B is related to rotation (B is not equal to zero if and only if there is rotation) • D, E, and F related to h and k (x and y shift, (x-h), (y-k)) • F is not affected by rotation; D and E are unchanged only when the conic is centered at the origin (D = E = 0) • A and C are affected

Standard equation • Standard equation of the ellipse is: x²/a² + y²/b² = 1 • a = 5 and b = 2. Hence: • The length of the major axis is: 2a = 10. • The length of the minor axis is: 2b = 4.

New rotated coordinate system • Coordinate Rotation Formulas • If a rectangular xy-coordinate system is rotated through an angle θ to form an x'y' coordinate system, then a point P(x, y) will have coordinates P(x', y') in the new system, where (x, y) and (x', y') are related by • x = x' cos θ − y' sin θ and y = x' sin θ + y' cos θ • and • x' = x cos θ + y sin θ and y' = −x sin θ + y cos θ
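The two pairs of rotation formulas are inverses of each other, which a quick round trip can confirm. The angle and the point below are arbitrary values picked for illustration.

```matlab
% Hypothetical rotation angle (30 degrees) and a point given in the
% rotated (x', y') coordinate system.
theta = 30 * pi / 180;
xp = 2;
yp = 1;

% From rotated coordinates back to the original xy system.
x = xp * cos(theta) - yp * sin(theta);
y = xp * sin(theta) + yp * cos(theta);

% From xy coordinates to the rotated system again.
xp_back =  x * cos(theta) + y * sin(theta);
yp_back = -x * sin(theta) + y * cos(theta);

fprintf('round trip: (%.4f, %.4f)\n', xp_back, yp_back);
```

The distance of the point from the origin is the same in both systems, as expected for a pure rotation.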

Rotation • The values of h and k give horizontal and vertical (resp.) translation distances, and t gives rotation angle (measured in degrees). Notice how changes in these transformation values affect the coefficients, and how changes in the coefficients affect the transformations. • The lines shown in green in the graph are the following key lines for the conic sections: the major and minor axes for ellipses (crossing at the center of the ellipse), the axis of symmetry and perpendicular line through the vertex for a parabola (crossing at the vertex), and the two perpendicular axes of symmetry (crossing through the center point) for a hyperbola. In all cases, the two lines cross at the point (h,k), and are rotated from the position parallel to the coordinate axes by t degrees. In graphs of hyperbolas, the asymptotes of the hyperbola are shown as orange lines. • If B² − 4AC < 0, then the graph is an ellipse (if B=0 and A=C in this case, then the graph is a circle) • One other important formula determines the relationship between the coefficients and the angle of rotation: tan(2t)=B/(A-C). Note that rotation has no effect on the value of F (D and E are unchanged only when the conic is centered at the origin, where D = E = 0), and that t=0 (no rotation) if and only if B=0. The values of the coordinates of the point (h,k) are best determined from the coefficients by first reversing the effect of the rotation (so that B=0), then completing the squares.
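The discriminant test and the angle formula can be checked on the a = 5, b = 2 ellipse from the earlier slide. The sketch assumes the ellipse is centered at the origin and rotated by a chosen 30 degrees; the coefficient expressions come from substituting the coordinate rotation formulas into x'²/a² + y'²/b² = 1. Because tan(2t) repeats every 90 degrees in t, atan recovers the angle only for rotations between −45 and 45 degrees.

```matlab
% Ellipse x'^2/a^2 + y'^2/b^2 = 1 in its own (rotated) coordinates.
a = 5;
b = 2;
A0 = 1 / a^2;
C0 = 1 / b^2;

% Hypothetical rotation angle of 30 degrees.
t = 30 * pi / 180;
c = cos(t);
s = sin(t);

% Coefficients of A*x^2 + B*x*y + C*y^2 - 1 = 0 in the xy system.
A = A0 * c^2 + C0 * s^2;
B = 2 * (A0 - C0) * s * c;
C = A0 * s^2 + C0 * c^2;

% Negative discriminant: the rotated curve is still an ellipse.
disc = B^2 - 4 * A * C;

% Recover the rotation angle from tan(2t) = B / (A - C).
t_deg = 0.5 * atan(B / (A - C)) * 180 / pi;

fprintf('discriminant = %.4f, recovered angle = %.2f degrees\n', disc, t_deg);
```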
