Feature extraction

Feature extraction Cluster Data synthesis Separability Feature extraction Guidelines A practical approach Feature selection Example Bibliography Alon Slapak

Each of the red circles represents an object pattern. The coordinates of the circle are the features (e.g. height and weight). 1 (152,51.5) The length of the principle axes of the ellipse is proportional to twice the square root of the eigenvalues of the covariance matrix of the patterns distribution (1, 2). 2 Females The principles axes of the ellipse coincide with the eigenvectors of the covariance matrix of the patterns distribution. The center of the ellipse  is the mean of the patterns distribution . Geometric structure of a cluster The ellipse which encircles the major part of the cluster represents the distribution of the patterns. Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

120 110 100 90 80 weight [kg] 70 60 50 40 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] Typical dimension: Mean  The center of the ellipse  is the mean of the patterns distribution. It can be estimated by: Cluster Data synthesis Separability Feature extraction Guidelines While xk - is the kth pattern in the cluster, N - is the number of patterns in the cluster. Feature selection Example The mean is sometimes referred to as the centroid of the cluster. Bibliography Alon Slapak

120 110 Larger scatter Smaller scatter 100 90 80 weight [kg] 70 60 50 40 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] Typical dimension: Scatter matrix S Scatter matrix S of a cluster is defined as: Cluster Data synthesis While xk - is the kth pattern in the cluster, N - is the number of patterns in the cluster,  - is the mean of all the patterns in the cluster Separability Feature extraction Guidelines Feature selection The scatter matrix may be interpreted as a biased estimation of the covariance matrix of the cluster. Example Bibliography Alon Slapak

120 110 100 90 80 weight [kg] 70 60 50 40 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] Data synthesis • In order to test and debug pattern recognition algorithms, it is customary to use synthesized data. • The synthesized data may be drawn from arbitrary distribution, but in most of the literature it is customary to assume a normal distribution. • But, it is not infrequent to come across applications involving other patterns distributions Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

120 110 Males 100 90 80 weight [kg] 70 60 Females 50 40 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] Example: Two clusters synthesis clear all N1 = 150; N2 = 150; E1 = [150 15; 120 20]; E2 = [100 10; 70 30]; M1 = [170,75]'; M2 = [160,50]'; [P1,A1] = eig(E1); [P2,A2] = eig(E2); y1=randn(2,N1); y2=randn(2,N2); for i=1:N1, x1(:,i) =P1*sqrt(A1)* y1(:,i)+M1; end; for i=1:N2, x2(:,i) =P2*sqrt(A2)* y2(:,i)+M2; end; figure; plot(x1(1,:),x1(2,:),'.',x2(1,:),x2(2,:),'or'); axis([120 220 30 120]); xlabel ('height [cm]'); ylabel ('weight [kg]'); Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

50 40 30 20 10 0 -10 -20 -30 -40 -50 -50 -40 -30 -20 -10 0 10 20 30 40 50 Exercise Try to synthesis two classes, which have a common centroid at (3,5). One class coincides with horizontal and is ~40 units in length and ~8 units in width. The other class is inclined at 45o from horizontal and is ~50 units in length and ~4 units in width. Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Hint: Use E=PAP-1 to build the covariance matrix. Alon Slapak

120 110 100 90 80 weight [kg] 70 60 50 40 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] height [cm] Separability Q: How does feature extraction method affect the pattern recognition performance? Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Classification algorithm Feature extraction method Separability A: More than the classification algorithm. If the patterns achieved by the feature extraction method creates nonseparate clusters, no classification algorithm can do the job. Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Separability In fact, the separability achieved by the feature extraction method is the upper limit of the pattern recognition performance. Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

120 120 120 110 110 110 100 100 100 90 90 90 80 80 80 weight [kg] weight [kg] weight [kg] 70 70 70 60 60 60 50 50 50 40 40 40 30 30 30 120 130 140 150 160 170 180 190 200 210 220 120 130 140 150 160 170 180 190 200 210 220 120 130 140 150 160 170 180 190 200 210 220 height [cm] height [cm] height [cm] Separability Q: How can one assess the separability achieved by the feature extraction method? Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Good? Hopeless? Bad? Alon Slapak

120 110 Larger scatter Smaller scatter 100 90 80 weight [kg] 70 60 50 40 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] Separability A: Several separability criterias are exist, most are involving scatters matrices Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Separability If we define the within-class scatter matrixas: Cluster Data synthesis While xki - is the kth pattern in the ith cluster, Ni - is the number of patterns in the ith cluster, i - is the mean of all the patterns in the ith cluster C - is the number of the clusters Pi- is the a priori probability of the ith cluster, and may be estimated by Separability Feature extraction Guidelines Feature selection and the between-class scatter matrixas: Example While  - is the mean of all the patterns in all the clusters Bibliography and the total-class scatter matrixas: Alon Slapak

Separability function J We can find in the literature several functions which represent the separability of clusters by a scalar: Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Pay attention to the fact thatJis a scalar. It is of utmost importance because scalar function is necessary for later optimization (we will try to maximize J of course) Bibliography Alon Slapak

Feature extraction guidelines Q: What will be regarded as a reasonable feature? Example: Features of a student: • Number of eyes • Hair color • Wear glasses or not • Hair length • Show size • Height • Weight Cluster Data synthesis Useless. Includes no information on gender. Separability Useless. Very poor correlation with gender. Feature extraction Guidelines Useless. Very poor correlation with gender. Feature selection Effective, but is hard to measure. Example Effective and simple. Bibliography Effective and simple. Effective and simple. Alon Slapak

Features extraction guidelines “When we have two or more classes, feature extraction consist of choosing those features which are most effective for preserving class separability” (Fukunaga p. 441) • Correlation with the classification feature • Easy to measure • Maximizing the separability function Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Features extraction guidelines Q: How can one determine what feature follows the guidelines? Cluster Data synthesis Separability A1: There is an extensive literature dealing with features to specific applications (e.g. symmetry for face recognition) Feature extraction Guidelines Feature selection Example A2: A widespread approach is to emulate human mechanism (e.g. treating all the pixels as features in face recognition) Bibliography Alon Slapak

Example - handwritten recognition The following 20 figures depict examples of handwritten digits. Since almost everyone can recognize each of the digits, it is reasonable to assume that treating all the pixels as features will assure a successful pattern recognition process. Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Feature selection • While dealing with 28X28 pixels handwriting digits, the feature space dimensions is 784. • While dealing with face recognition, we may have about 200X200 pixels. It means that the feature space dimensions is 40000. • While dealing with voice recognition, we may have about 1[Sec]X16[kHz]. It means that the feature space dimensions is 16000. Cluster Complexity problem !!! Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Feature selection Because of: • The complexity of most of the classification algorithms is O(n2). • Learning time • “Course of dimensionality” Cluster Omit features !!! Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Feature extraction Feature Selection Object High dimension Pattern Low dimension Pattern Feature selection “Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality.” (WIKIPEDIA) Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Feature selection Q: How can one identify the irrelevant and the redundant features? Cluster Data synthesis • A: Several options: • Heuristics • In most applications, the relevant information resides in the low frequency range (e.g. images), hence it is logical to reduce dimensionality by taking the first coefficients of the Fourier/DCT transform. • 2. Optimization approach (KLT, FLD) • We may select a subset of the features (or linear combinations of the features) which best contribute to the separability of the clusters. • 3. Grouping • By grouping features one can represent every set of features by a small set of features. Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Males Females Feature selection - Example The separability of these two class is pretty good. But, do we really need two features? Cluster Data synthesis Separability 120 110 Feature extraction Guidelines 100 90 Feature selection 80 weight [kg] 70 60 Example 50 40 Bibliography 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] Alon Slapak

120 110 100 Males 90 80 weight [kg] 70 60 50 40 Females 30 120 130 140 150 160 170 180 190 200 210 220 height [cm] Feature selection - Example No! The same separability can be achieved by projecting the patterns onto the blue axis i.e. only one-dimension feature-space is needed. Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Feature selection But how can we find this “blue axis” or the feature subspace on which we should project the patterns? Cluster Data synthesis Separability Please refer to: Lec7.pdf - PCA and the course of dimensionality Lec8.pdf in - Fisher Linear Discriminant http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html Of Prof. Olga Veksler http://www.csd.uwo.ca/faculty/olga/ Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Example – Handwriting recognition The following example describes the application of FLD to handwriting recognition. The characters were derived from the MNIST database (http://yann.lecun.com/exdb/mnist/). Cluster Data synthesis Separability Feature extraction Guidelines Feature selection The example demonstrates the separability of two classes while reducing the dimensionality from 784 (28X28 pixels) to 2. Example Bibliography Alon Slapak

The img0 and img1 files contain N 28X28 arrays of grayscale images of the handwriting characters. Reshaping the 28X28 arrays into a 784X1 features vector (pattern) DCT transform is recommended while dealing with mostly black&white images. The numerousity of identical numbers (0 and 255) damages the FLD computations, and DCT transform may alleviate this phenomena. Taking only the first 50 DCT coefficients decrease dimensionality (most of the energy resides in the low frequency range, but it is not essential for this example. Example – Handwriting recognition clear all; load img0 load img1 N1=length(img0); N2=length(img1); D = 50; % Low pass filtering Dd = 2; % Desired pattern dimension %----------------------------------------------------- % Test set synthesis %----------------------------------------------------- for i=1:N1, x = reshape(squeeze(img0(i,:,:)), 28*28,1); X=dct(x); l1(:,i) = X(1:D); end; for i=1:N2, x = reshape(squeeze(img1(i,:,:)), 28*28,1); X=dct(x); l2(:,i) = X(1:D); end; Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Compute the mean  of each cluster, and the mean  of all the patterns. Compute the within scatter matrices Example – Handwriting recognition %-------------------------------------------------------------------------- % Compute Sb amd Sw %-------------------------------------------------------------------------- Mu1 = mean(l1')'; % Mean of cluster 1 Mu2 = mean(l2')'; % Mean of cluster 2 Mu = (Mu1*N1 + Mu2*N2)/(N1+N2); % Total mean of all pattens Sw = zeros(D); for i=1:N1, Sw = Sw + (l1(:,i)-Mu1)*(l1(:,i)-Mu1)'; end; for i=1:N2, Sw = Sw + (l2(:,i)-Mu2)*(l2(:,i)-Mu2)'; end; Sb = N1*(Mu1-Mu)*(Mu1-Mu)' + N2*(Mu2-Mu)*(Mu2-Mu)'; Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Compute the between scatter matrix. Example Bibliography Alon Slapak

Construct the projection matrix V such that its column vectors are the eigenvectors corresponding to the largest eigenvalues. Example – Handwriting recognition %-------------------------------------------------------------------------- % Compute V % [W,D] = EIG(A,B) produces a diagonal matrix D of generalized % eigenvalues and a full matrix W whose columns are the % corresponding eigenvectors so that A*W = B*W*D. %-------------------------------------------------------------------------- [W,D]=eig(Sb,Sw); % Solve the generalized eigenvalue problem Lambda = diag(D); % Sort the eigenvalues. [Lam,p]=sort(Lambda); V=[]; % Build V form the eigen vectors for i=1:Dd, % which are corresponding to the V= [V W(:,p(end+1-i))]; % biggest eigenvalues end; Solve the generalized eigenvalues problem Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Project the initial patterns onto the reduced space. Plot 2-dim clusters Example – Handwriting recognition %-------------------------------------------------------------------------- % Project the initial patterns onto the reduced space %-------------------------------------------------------------------------- for i = 1:N1, r1(:,i) = V' * l1(:,i); end; for i = 1:N2, r2(:,i) = V' * l2(:,i); end; figure; plot(r1(1,:),r1(2,:),'.',r2(1,:),r2(2,:),'or'); xlabel ('feature1'); ylabel ('feature2'); Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

-6 x 10 3 2.5 2 1.5 feature2 1 0.5 0 -0.5 -1 -4 -2 0 2 4 6 8 10 feature1 -5 x 10 Example – Handwriting recognition The figure below depicts the 2-dim clusters. It is easy to see that the separability of the clusters is perfect even in one dimension, which is not surprising because there are only two clusters. Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Bibliography • Prof. Olga Veksler pattern recognition course: http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html • Handwriting database: http://yann.lecun.com/exdb/mnist/ Cluster Data synthesis Separability Feature extraction Guidelines Feature selection Example Bibliography Alon Slapak

Feature extraction