Feature space transformation methods



  1. Feature space transformation methods

  2. Basic idea • Instead of using the original features as they are, we transform them into a new feature space • This is similar to how the φ() function worked for SVMs: instead of the original d-dimensional feature vector x, we create a new k-dimensional feature vector y • We hope that this transformation • Reduces the number of features, so that k<d • This makes training faster and helps avoid the “curse of dimensionality” • But retains the important information from the original features while reducing the redundancy • So it also helps the subsequent classification algorithm • Here we will study only linear transformation methods

  3. Simple transformation methods • The simplest transformation methods transform each feature independently • Centering (or zero-centering): from each feature we subtract the mean of that feature • This simply shifts the mean of the data to the origin • Scaling: we multiply each feature by a constant: x → y = f(x) = ax • For example, min-max scaling guarantees that the value of x for each sample falls between given limits; for the range [0,1]: y = (x − min(x)) / (max(x) − min(x)) (see the sketch below)
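
A minimal NumPy sketch of these two per-feature operations (the toy data values below are made up purely for illustration):

```python
import numpy as np

# Toy data: 5 samples, 2 features (values invented for illustration)
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 240.0],
              [4.0, 210.0],
              [5.0, 260.0]])

# Centering: subtract each feature's mean, shifting the data mean to the origin
X_centered = X - X.mean(axis=0)

# Min-max scaling to [0, 1]: y = (x - min(x)) / (max(x) - min(x)), per feature
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_centered.mean(axis=0))                      # ~ [0, 0]
print(X_minmax.min(axis=0), X_minmax.max(axis=0))   # [0, 0] and [1, 1]
```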

  4. Simple transformation methods • Min-max scaling is very sensitive to outliers, i.e. data samples that take very large or very small values (perhaps only by mistake) • So it is more reliable to scale the variance of x instead of its min and max • Centering each feature’s mean to 0 and scaling its variance to 1 is also known as data standardization (sometimes called data normalization); a sketch follows below
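
A minimal sketch of standardization on synthetic data (the means and scales of the generated features are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two features with very different means and spreads
X = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.5], size=(1000, 2))

# Standardization: zero mean and unit variance for every feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # ~ [0, 0]
print(X_std.std(axis=0))    # ~ [1, 1]
```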

  5. Linear feature space transformation methods • While data normalization shifted and scaled each feature independently, we now move on to methods that apply a general linear transformation to the features, so that the new feature vector y is obtained from x as y = Ax for some transformation matrix A • A general linear transformation can also rotate the feature space • For example, we will see how to get decorrelated features • Whitening: decorrelation plus standardization (see the sketch below)
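
One possible sketch of decorrelation and whitening, assuming the eigendecomposition of the sample covariance matrix is used (PCA-style whitening; other variants such as ZCA whitening exist):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated synthetic 2D data
A = np.array([[2.0, 1.2], [0.0, 0.6]])
X = rng.normal(size=(1000, 2)) @ A.T

Xc = X - X.mean(axis=0)                  # centering
C = np.cov(Xc, rowvar=False)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # C is symmetric -> eigh

# Decorrelation: rotate the data onto the eigenvector basis
X_decorr = Xc @ eigvecs
# Whitening: additionally scale each new feature to unit variance
X_white = X_decorr / np.sqrt(eigvals)

print(np.cov(X_white, rowvar=False))     # ~ identity matrix
```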

  6. Principal components analysis (PCA) • Main idea: we want to reduce the number of dimensions (features) • But we also want to minimize the information loss caused by the dimension reduction • Example: let’s suppose we have two features and our data has a Gaussian distribution with a diagonal covariance matrix • The data has a much larger variance in x than in y (σx² is much larger than σy²) • We want to drop one of the features to reduce the dimension from 2 to 1 • We will keep x and drop y, as this choice results in the smaller information loss

  7. Principal components analysis (PCA) • Example with 3 dimensions (figure: original 3D data, its best 2D approximation, its best 1D approximation) • PCA will be similar, but instead of keeping or dropping features separately, it applies a linear transformation, so it is also able to rotate the data • Gaussian example: when the covariance matrix is not diagonal, PCA will create the new feature set by finding the directions with the largest variance • Although the above example is for Gaussian distributions, PCA does not require the data to have a Gaussian distribution

  8. Principal components analysis (PCA) • Let’s return to the example where we want to reduce the number of dimensions from 2 to 1 • But instead of discarding one of the features, now we look for a projection into 1D • The projection axis will be our new 1D feature y • The projection will be optimal if it preserves the direction along which the data has the largest variance • As we will see, it is equivalent to looking for the projection direction which minimizes the projection error

  9. Formalizing PCA • We start by centering the data • And we examine only projection directions that go through the origin • Any d-dimensional linear space can be represented by a set of d orthonormal basis vectors • Orthonormal: they are orthogonal and their length is 1 • If our basis vectors are denoted by v1, v2, …, vd • Then any point x in this space can be written as x1v1 + x2v2 + … + xdvd • The xi are scalars, the vi are vectors! • PCA seeks a new orthonormal basis (i.e. coordinate system) which has dimension k<d • So it will define a subspace of the original space

  10. Formalizing PCA • We want to find the most accurate representation of our training data D={x1,…,xn} in some subspace W of dimension k<d • Let {e1,…,ek} denote the orthonormal basis vectors of this subspace. Then any data point lying in this subspace can be represented as α1e1 + α2e2 + … + αkek • However, the original data points are in a higher dimensional space (remember that k<d), so in general they cannot be represented exactly in this subspace • Still, we are looking for the representation that has the smallest error • E.g. for data point x1 the error is ‖x1 − (α11e1 + … + α1kek)‖²

  11. Formalizing PCA • Any training sample xj can be approximated in the subspace as xj ≈ αj1e1 + … + αjkek • And we want to minimize the error of this representation in the new space over all the training samples • So the total error can be written as J(e1,…,ek) = Σj=1..n ‖xj − Σi=1..k αji ei‖² • We have to minimize J, and also take care that the resulting e1,…,ek vectors are orthonormal • The solution can be found by setting the derivative to zero and using Lagrange multipliers to satisfy the constraints • For the derivation see http://www.csd.uwo.ca/~olga/Courses/CS434a_541a/Lecture7.pdf

  12. PCA – the result • After a long derivation, J reduces to J = Σj ‖xj‖² − Σi=1..k eiᵀ S ei • Where S = Σj xj xjᵀ • Apart from a constant multiplier, S is equivalent to the covariance matrix of the data (remember that μ is zero, as we centered our data) • Minimizing J is equivalent to maximizing Σi=1..k eiᵀ S ei • This is maximized by defining the new basis vectors e1,…,ek as those eigenvectors of S which have the largest corresponding eigenvalues λ1,…,λk

  13. PCA – the result • The larger the eigenvalue λi is, the larger is the variance of the data in the direction of the corresponding eigenvector ei • So the solution is indeed what we intuitively expected: PCA projects the data into a subspace of dimension k along the directions for which the data shows the largest variance • So PCA can indeed be interpreted as finding a new orthogonal basis by rotating the coordinate system until the directions of maximum variance are found (the sketch below checks this numerically)
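
A small numerical check of this claim, using the covariance matrix (which equals the scatter matrix S up to a constant factor): the variance of the data projected onto an eigenvector should match the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated synthetic data
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)            # covariance = scatter matrix / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)

for lam, e in zip(eigvals, eigvecs.T):
    proj = Xc @ e                       # project the data onto eigenvector e
    # variance along e matches the eigenvalue (same n - 1 normalization)
    print(lam, proj.var(ddof=1))
```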

  14. How PCA approximates the data • Let e1,…,ed be the eigenvectors of S, sorted in decreasing order of the corresponding eigenvalues • In the original space, any point x can be written as x = α1e1 + α2e2 + … + αded • In the transformed space, we keep only the first k components and drop the remaining ones, so we obtain only an approximation of x: x ≈ α1e1 + … + αkek • The αj components are called the principal components of x • The larger the value of k, the better the approximation is • The components are arranged in order of importance, the more important components come first • PCA keeps the k most important principal components of x as the approximation of x

  15. The last step – transforming the data • Let’s see how to transform the data into the new coordinate system • Let’s define the linear transformation matrix E as E=[e1,…,ek]ᵀ, i.e. its rows are the selected eigenvectors • Then each data item x can be transformed as y=Ex • So instead of the original d-dimensional feature vector x we can now use the k-dimensional transformed feature vector y (see the sketch below)
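
A compact sketch of the whole PCA pipeline described on the previous slides (synthetic data, an arbitrary choice of k; the function name pca_transform is mine), including the approximate reconstruction back in the original space:

```python
import numpy as np

def pca_transform(X, k):
    """Project centered data onto the top-k eigenvectors of the covariance matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu                                 # centering
    C = np.cov(Xc, rowvar=False)                # covariance (scatter up to a constant)
    eigvals, eigvecs = np.linalg.eigh(C)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    E = eigvecs[:, order].T                     # rows of E are the selected eigenvectors
    Y = Xc @ E.T                                # y = E x for every sample
    return Y, E, mu

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic 5D data
Y, E, mu = pca_transform(X, k=2)                # 5D -> 2D
X_approx = Y @ E + mu                           # approximate reconstruction in 5D
print(Y.shape, np.mean((X - X_approx) ** 2))    # (200, 2) and the reconstruction error
```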

  16. PCA – summary

  17. Linear discriminant analysis (LDA) • PCA keeps the directions along which the variance of the data is maximal • However, it ignores the class labels • The directions with the maximal variance may be useless for separating the classes (see the example in the figure) • Linear discriminant analysis (also known as Fisher’s linear discriminant in the 2-class case) seeks to project the data onto directions which are useful for separating the classes • PCA is an unsupervised feature space transformation method (no class labels required) • LDA is a supervised feature space transformation method (it takes the class labels into consideration)

  18. LDA – main idea • The main idea is to find projection directions along which the samples that belong to the different classes are well separated • Example (2D, 2 classes): see the figure • How can we measure the separation between the projections of the two classes? • Let’s try to use the distance between the projections of the class means μ1 and μ2 • This will be denoted by |μ̃1 − μ̃2|

  19. LDA – example • How good is |μ̃1 − μ̃2| as a measure of class separation? • In the above example, projection onto the vertical axis separates the classes better than projection onto the horizontal axis, yet |μ̃1 − μ̃2| is larger for the latter • The problem is that we should also take the variances into consideration! • Both classes have a larger variance along the horizontal axis than along the vertical axis

  20. LDA – formalization for 2 classes • We will look for a projection direction that maximizes the distance of the class means while minimizing the class variances at the same time • Let yi denote the projected samples • Let’s define the scatter of the two projected classes as s̃1² = Σyi∈class 1 (yi − μ̃1)² and s̃2² = Σyi∈class 2 (yi − μ̃2)² (scatter is the same as variance, up to a constant multiplier) • Fisher’s linear discriminant seeks a projection direction v which maximizes J(v) = (μ̃1 − μ̃2)² / (s̃1² + s̃2²)
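
A small sketch that evaluates this criterion on toy 2-class data for two candidate directions (the data and the directions are chosen to mimic the earlier example, where the vertical axis separates the classes better than the horizontal one):

```python
import numpy as np

def fisher_criterion(X1, X2, v):
    """J(v) = (distance of projected means)^2 / (sum of projected scatters)."""
    v = v / np.linalg.norm(v)
    y1, y2 = X1 @ v, X2 @ v                                   # projected samples
    m1, m2 = y1.mean(), y2.mean()                             # projected class means
    s1, s2 = ((y1 - m1) ** 2).sum(), ((y2 - m2) ** 2).sum()   # projected scatters
    return (m1 - m2) ** 2 / (s1 + s2)

rng = np.random.default_rng(4)
X1 = rng.normal([0, 0], [3.0, 0.5], size=(100, 2))   # class 1
X2 = rng.normal([0, 2], [3.0, 0.5], size=(100, 2))   # class 2
print(fisher_criterion(X1, X2, np.array([1.0, 0.0])))  # horizontal direction: low J
print(fisher_criterion(X1, X2, np.array([0.0, 1.0])))  # vertical direction: high J
```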

  21. LDA – interpretation • A large J(v) value means that along the direction v the projected class means are far from each other, while each projected class has a small scatter around its own mean, i.e. the classes are well separated • To solve this maximization we have to express J(v) explicitly, and then maximize it

  22. LDA – derivation • We define the scatter matrices of class 1 and class 2 (before projection) as Si = Σx∈class i (x − μi)(x − μi)ᵀ • We define the “within class” scatter matrix as SW = S1 + S2 • And the “between class” scatter matrix as SB = (μ1 − μ2)(μ1 − μ2)ᵀ – it measures the separation between the means of the two classes (before projection) • It can be shown that s̃1² + s̃2² = vᵀ SW v • And (μ̃1 − μ̃2)² = vᵀ SB v

  23. LDA – derivation • Using the above notation, we obtain that J(v) = (vᵀ SB v) / (vᵀ SW v) • We can maximize it by taking the derivative and setting it to zero • Similar to PCA, it leads to an eigenvalue problem, with the solution v = SW⁻¹(μ1 − μ2) • This is the optimal direction when we have two classes • The derivation can be extended to multiple classes as well • In the case of c classes we will obtain c−1 directions • So instead of a projection vector v, we will obtain a projection matrix V (a 2-class sketch is given below)
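
A minimal sketch of the 2-class solution v = SW⁻¹(μ1 − μ2) on the same kind of toy data (the helper name fisher_direction is mine):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Optimal 2-class Fisher direction v = S_W^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # class 1 scatter matrix
    S2 = (X2 - mu2).T @ (X2 - mu2)        # class 2 scatter matrix
    Sw = S1 + S2                          # within-class scatter matrix
    v = np.linalg.solve(Sw, mu1 - mu2)    # S_W^{-1} (mu1 - mu2)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(5)
X1 = rng.normal([0, 0], [3.0, 0.5], size=(100, 2))
X2 = rng.normal([0, 2], [3.0, 0.5], size=(100, 2))
print(fisher_direction(X1, X2))           # close to the vertical direction [0, 1]
```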

  24. LDA – multi-class case • We are looking for the projection matrix V, so our data samples xi will be converted into new feature vectors yi by a linear transformation yi = Vᵀxi

  25. LDA – multi-class case • The function to be maximized is now given as the determinant ratio J(V) = |Vᵀ SB V| / |Vᵀ SW V| • Solving the maximization again requires an eigenvalue decomposition • The decomposition results in the eigenvectors v1,…,vc−1 • The optimal projection matrix into a subspace of k dimensions is given by the k eigenvectors that correspond to the largest k eigenvalues • As the rank of SB is (at most) c−1, LDA cannot produce more than c−1 directions, so the number of projected features obtained by LDA is limited to c−1 • (Remember that the number of projected features was at most d in the case of PCA, where d is the number of dimensions of the original feature space)
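
A sketch of the multi-class case, solving the eigenvalue problem for SW⁻¹SB on a toy 3-class dataset (weighting the between-class scatter by the class sizes is one common convention, not necessarily the one used on the slide; the helper name lda_directions is mine):

```python
import numpy as np

def lda_directions(X, labels, k):
    """Return the top-k LDA directions as columns of V (k <= c - 1)."""
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                  # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)     # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:k]
    return eigvecs[:, order].real                      # projection matrix V

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4))
               for m in ([0, 0, 0, 0], [3, 3, 3, 3], [0, 3, 0, 3])])
labels = np.repeat([0, 1, 2], 50)
V = lda_directions(X, labels, k=2)     # at most c - 1 = 2 useful directions
Y = X @ V                              # projected features y = V^T x, sample-wise
print(Y.shape)                         # (150, 2)
```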
