Feature Extraction Speaker: 虞台文
Content • Principal Component Analysis (PCA) • Factor Analysis • Fisher’s Linear Discriminant Analysis • Multiple Discriminant Analysis
Feature Extraction Principal Component Analysis (PCA)
Principal Component Analysis • It is a linear procedure that finds the direction in input space along which most of the energy (variance) of the input lies. • Feature Extraction • Dimension Reduction • It is also called the (discrete) Karhunen-Loève transform, or the Hotelling transform.
The Basic Concept Assume the data x (a random vector) has zero mean. PCA finds a unit vector w that captures the largest amount of variance of the data. That is, it maximizes E[(w^T x)^2] = w^T E[x x^T] w subject to ||w|| = 1.
The Method Covariance matrix: C = E[x x^T] (x is zero mean). Remark: C is symmetric and positive semidefinite.
The Method Maximize w^T C w subject to w^T w = 1. The method of Lagrange multipliers: define L(w, λ) = w^T C w - λ(w^T w - 1). The extreme point, say w*, satisfies ∂L/∂w = 0.
The Method Maximize w^T C w subject to w^T w = 1. Setting ∂L/∂w = 2Cw - 2λw = 0 gives Cw = λw.
Discussion At the extreme points Cw = λw, so w is an eigenvector of C and λ is its corresponding eigenvalue. • Let w1, w2, …, wd be the eigenvectors of C whose corresponding eigenvalues are λ1 ≥ λ2 ≥ … ≥ λd. • They are called the principal components of C. • Their significance can be ordered according to their eigenvalues.
Discussion (cont.) • Since C is symmetric and positive semidefinite, its eigenvectors are orthogonal. • They therefore form a basis of the feature space. • For dimensionality reduction, keep only a few of them (see the sketch below).
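The eigendecomposition procedure above can be sketched in a few lines of numpy. The toy data, sample size, and the choice k = 2 below are illustrative, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[5.0, 2.0, 1.0],
                            cov=[[4.0, 1.5, 0.5],
                                 [1.5, 2.0, 0.3],
                                 [0.5, 0.3, 0.5]],
                            size=500)          # 500 samples, d = 3

Xc = X - X.mean(axis=0)                        # zero-mean data
C = Xc.T @ Xc / len(Xc)                        # covariance matrix C = E[x x^T]

# Eigen-decomposition; eigh returns eigenvalues in ascending order,
# so reverse to get lambda_1 >= lambda_2 >= ... >= lambda_d.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                          # keep the k most significant components
W = eigvecs[:, :k]                             # d x k projection matrix
Z = Xc @ W                                     # reduced-dimension representation
print(eigvals)                                 # variance captured by each component
```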
Applications • Image Processing • Signal Processing • Compression • Feature Extraction • Pattern Recognition
Example Projecting the data onto the most significant axis will facilitate classification. This also achieves dimensionality reduction.
Issues • PCA is effective for identifying the directions of largest variance in the multivariate signal distribution, hence it is good for signal reconstruction. • But it may be inappropriate for pattern classification: the most significant component obtained using PCA need not coincide with the most significant component for classification.
Whitening • Whitening is a process that transforms a random vector, say x = (x1, x2, …, xn)T (assumed to have zero mean), into z = (z1, z2, …, zn)T with zero mean and identity covariance. • z is said to be white or sphered. • This implies that all of its elements are uncorrelated. • However, this doesn't imply that its elements are independent.
Whitening Transform Let V be a whitening transform; then Cov(Vx) = V Cx V^T = I. Decompose Cx as Cx = E D E^T. Clearly, D is a diagonal matrix and E is an orthonormal matrix. Set V = D^(-1/2) E^T.
Whitening Transform If V is a whitening transform and U is any orthonormal matrix, show that UV, i.e., a rotation of V, is also a whitening transform. Proof) (UV) Cx (UV)^T = U (V Cx V^T) U^T = U I U^T = U U^T = I.
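A small numerical check of the whitening transform just described: it builds V = D^(-1/2) E^T from the eigendecomposition of Cx, as on the slide, and also verifies that UV whitens for an arbitrary orthonormal U. The toy covariance and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.0, 0.5],
                             [1.0, 2.0, 0.2],
                             [0.5, 0.2, 1.0]], size=2000)
Xc = X - X.mean(axis=0)
Cx = np.cov(Xc, rowvar=False)

# Decompose Cx = E D E^T and set V = D^(-1/2) E^T.
d, E = np.linalg.eigh(Cx)
V = np.diag(d ** -0.5) @ E.T

Z = Xc @ V.T                                       # whitened data
print(np.round(np.cov(Z, rowvar=False), 2))        # approximately the identity

# Any rotation of a whitening transform is still a whitening transform.
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthonormal matrix
Z2 = Xc @ (U @ V).T
print(np.round(np.cov(Z2, rowvar=False), 2))       # also approximately the identity
```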
Why Whitening? • With PCA, we usually choose several major eigenvectors as the basis for representation. • This basis is efficient for reconstruction, but may be inappropriate for other applications, e.g., classification. • By whitening, we can rotate the basis to get more interesting features.
Feature Extraction Factor Analysis
What is a Factor? • If several variables correlate highly, they might measure aspects of a common underlying dimension. • These dimensions are called factors. • Factors are classification axes along which the measures can be plotted. • The greater the loading of variables on a factor, the more that factor explains the intercorrelations between those variables.
Graph Representation (axes: Quantitative Skill (F1) and Verbal Skill (F2), with variables plotted by their loadings between -1 and +1)
What is Factor Analysis? • A method for investigating whether a number of variables of interest Y1, Y2, …, Yn are linearly related to a smaller number of unobservable factors F1, F2, …, Fm. • For data reduction and summarization. • A statistical approach to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions (factors).
Example What factors influence students' grades? The grades are observable data; the factors, e.g., quantitative skill and verbal skill, are unobservable.
The Model y = B f + ε, where y is the observation vector, B the factor-loading matrix, f the factor vector, and ε the Gaussian-noise vector.
The Model Cy = E[y y^T] = B B^T + Q. The right-hand side B B^T + Q can be obtained from the model; the left-hand side Cy can be estimated from data.
The Model Var(yi) = Σk bik^2 + qi. The communality Σk bik^2 is the part of the variance explained by the common factors; the specific variance qi is the unexplained part.
Example A numerical Cy decomposed as Cy = B B^T + Q.
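To make Cy = BB^T + Q concrete, here is a minimal simulation sketch of the model y = Bf + ε; the loading matrix, specific variances, and sample size are made-up values for illustration, not the slide's numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
n_vars, n_factors, n_obs = 5, 2, 50_000

B = np.array([[0.9, 0.0],        # hypothetical factor-loading matrix (n x m)
              [0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.8],
              [0.0, 0.9]])
Q = np.diag([0.2, 0.3, 0.4, 0.3, 0.2])           # specific (noise) variances

f   = rng.standard_normal((n_obs, n_factors))    # factors, covariance = I
eps = rng.standard_normal((n_obs, n_vars)) * np.sqrt(np.diag(Q))
Y   = f @ B.T + eps                              # model: y = B f + eps

Cy_sample = np.cov(Y, rowvar=False)
Cy_model  = B @ B.T + Q
print(np.round(Cy_sample - Cy_model, 2))         # close to the zero matrix
```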
Goal Our goal is to minimize the discrepancy between the sample covariance Cy and the model covariance B B^T + Q. Hence, B and Q are chosen so that Cy ≈ B B^T + Q.
Uniqueness Is the solution unique? No: there are an infinite number of solutions, since if B* is a solution and T is an orthonormal transformation (rotation), then B*T is also a solution.
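The rotational indeterminacy is easy to check numerically; in this sketch the loading matrix and the rotation angle are arbitrary choices:

```python
import numpy as np

B = np.array([[0.9, 0.0],
              [0.8, 0.1],
              [0.1, 0.8]])
theta = 0.7
T = np.array([[np.cos(theta), -np.sin(theta)],   # any orthonormal (rotation) matrix
              [np.sin(theta),  np.cos(theta)]])

B_rot = B @ T
# Both loading matrices imply exactly the same common-variance part B B^T.
print(np.allclose(B @ B.T, B_rot @ B_rot.T))     # True
```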
Example Two different loading matrices B can reproduce the same Cy. Which one is better?
Example Left: each factor has nonzero loadings for all variables. Right: each factor controls different variables.
The Method • Determine the first set of loadings using the principal-component method (see the sketch below).
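A minimal sketch of that principal-component step, assuming the usual construction B = [sqrt(λ1) e1, …, sqrt(λm) em] from the leading eigenpairs of Cy; the function name and the toy covariance matrix are illustrative:

```python
import numpy as np

def pca_loadings(Cy: np.ndarray, m: int) -> np.ndarray:
    """First set of factor loadings via the principal-component method:
    columns are sqrt(lambda_k) * e_k for the m leading eigenpairs of Cy."""
    eigvals, eigvecs = np.linalg.eigh(Cy)
    order = np.argsort(eigvals)[::-1][:m]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

# Toy covariance matrix (illustrative numbers only).
Cy = np.array([[1.0, 0.6, 0.5, 0.1],
               [0.6, 1.0, 0.4, 0.2],
               [0.5, 0.4, 1.0, 0.1],
               [0.1, 0.2, 0.1, 1.0]])
B0 = pca_loadings(Cy, m=2)
Q0 = np.diag(np.diag(Cy - B0 @ B0.T))   # specific variances on the diagonal
print(np.round(B0, 2))
```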
Example Principal-component loadings computed from a sample covariance matrix Cy.
Factor Rotation Given a factor-loading matrix B and an orthonormal rotation matrix T, the rotated loadings are BT.
Factor Rotation Criteria: • Varimax • Quartimax • Equimax • Orthomax • Oblimin
Varimax Criterion: maximize V = Σk [ (1/n) Σj bjk^4 - ( (1/n) Σj bjk^2 )^2 ], the variance of the squared loadings within each column, subject to T^T T = I, where the bjk are the entries of the rotated loading matrix BT and n is the number of variables.
Varimax Criterion: maximize V subject to T^T T = I. Construct the Lagrangian by adjoining the orthonormality constraint on T with Lagrange multipliers.
Varimax (derivation) Working column by column, the derivation defines auxiliary quantities dk and cjk from the loadings bjk for the kth column of the rotated loading matrix; the criterion V reaches its maximum once the resulting stationarity conditions are satisfied.
Varimax (iterative procedure) Goal: maximize V. • Initially, obtain B0 by whatever method, e.g., PCA, and set T0 as the initial approximation to the rotation matrix, e.g., T0 = I. • Iteratively execute the following: evaluate the quantities needed for the update (this requires the current rotated loadings B1), find an updated rotation such that the criterion increases, pre-multiplying each side by its transpose to keep the rotation orthonormal, and repeat; stop when the criterion no longer improves.
Varimax The criterion V is then maximized with respect to the rotation, restated in terms of the current loadings for the iteration.
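The slides derive the rotation via a Lagrangian and an iterative update; the sketch below implements a widely used SVD-based iterative varimax (Kaiser-style), which follows the same pattern of starting from B0 and T0 = I and improving T until the criterion stops increasing, but may differ in detail from the derivation above. The function name and toy loadings are illustrative:

```python
import numpy as np

def varimax(B0: np.ndarray, gamma: float = 1.0, max_iter: int = 100, tol: float = 1e-8):
    """Iterative varimax rotation (SVD-based update).

    Starts from initial loadings B0 and T = I and repeatedly updates the
    orthonormal rotation T so that the varimax criterion of B0 @ T increases.
    Returns the rotated loadings and the rotation matrix T.
    """
    n, m = B0.shape
    T = np.eye(m)
    prev = 0.0
    for _ in range(max_iter):
        B = B0 @ T
        # Gradient-like term of the varimax criterion with respect to the rotation.
        G = B0.T @ (B ** 3 - (gamma / n) * B @ np.diag(np.sum(B ** 2, axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                        # nearest orthonormal matrix to G
        if s.sum() < prev * (1.0 + tol):  # stop once the criterion no longer improves
            break
        prev = s.sum()
    return B0 @ T, T

# Usage: rotate a small set of initial loadings (illustrative numbers).
B0 = np.array([[0.7, 0.5],
               [0.8, 0.4],
               [0.3, 0.8],
               [0.2, 0.9]])
B_rotated, T = varimax(B0)
print(np.round(B_rotated, 2))   # loadings concentrate on one factor per variable
print(np.round(T.T @ T, 2))     # T stays orthonormal
```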
Feature Extraction Fisher’s Linear Discriminant Analysis
Main Concept • PCA seeks directions that are efficient for representation. • Discriminant analysis seeks directions that are efficient for discrimination.
Criterion (Two-Category) Project the data onto a direction w with ||w|| = 1; the two classes have means m1 and m2.
Between-Class Scatter The between-class scatter of the projected data is (w^T m1 - w^T m2)^2 = w^T SB w, where SB = (m1 - m2)(m1 - m2)^T is the between-class scatter matrix and ||w|| = 1. The larger, the better.
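A small numerical sketch of the two-category between-class scatter; the toy classes and candidate directions are illustrative. It confirms that the projected scatter w^T SB w is largest when w points along the difference of the class means:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 0.5]], size=200)   # class 1
X2 = rng.multivariate_normal([3, 1], [[1.0, 0.3], [0.3, 0.5]], size=200)   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_B = np.outer(m1 - m2, m1 - m2)          # between-class scatter matrix

def projected_scatter(w):
    w = w / np.linalg.norm(w)             # enforce ||w|| = 1
    return float(w @ S_B @ w)             # (w^T m1 - w^T m2)^2

print(projected_scatter(np.array([1.0, 0.0])))   # scatter along the x-axis
print(projected_scatter(m1 - m2))                # largest: along the mean difference
```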