Feature extraction and selection
• Dimensionality reduction: using a large number of features yields a large number of statistical parameters to estimate and requires a lot of data for system training
• Feature selection: select the most effective features according to a given optimality criterion (ranking the features according to their contribution to recognition results)
• Feature extraction: a transformation is applied to the feature vectors to reduce dimensionality while preserving their information content; the transformation may be linear or non-linear
• linear: principal component analysis, linear discriminant analysis
• non-linear: vector quantization (but causes information loss), artificial neural networks
• new methods are under investigation: Karhunen-Loève transformation, support vector machines, etc.
Feature extraction: LDA
• A linear transformation is applied that maps the feature space into a lower-dimensional space such that a separability criterion is optimized.
• Let {x_n}, n = 1..N, be D-dimensional feature vectors, each assigned to one of L classes {ω_l}, l = 1..L.
• The mean vector and covariance matrix of class ω_l, containing N_l samples, are defined as m_l = (1/N_l) Σ_{x∈ω_l} x and S_l = (1/N_l) Σ_{x∈ω_l} (x − m_l)(x − m_l)^T; m = (1/N) Σ_n x_n is the total mean vector.
• The separability criterion is usually defined as a function of scatter matrices: the within-class scatter S_W = Σ_l (N_l/N) S_l represents the scatter of the samples around their class mean vectors, while the between-class scatter matrix S_b = Σ_l (N_l/N) (m_l − m)(m_l − m)^T describes the scatter of the class means around the total mean m.
• Two measures of separation among classes: J_1 = tr(S_W^{-1} S_b) and J_2 = det(S_W^{-1} S_b), where tr() is the trace and det the determinant of a matrix.
• Generalization: the Fisher ratio measure (between-class variance / average within-class variance), F_i = B_i / W_i, where B_i and W_i are the i-th diagonal elements of S_b and S_W.
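As a concrete illustration of the definitions above, here is a minimal NumPy sketch (not from the original slides; the array names features and labels are assumed) that accumulates the within- and between-class scatter matrices and derives the per-feature Fisher ratios F_i = B_i / W_i:

```python
# Minimal sketch: within-/between-class scatter matrices and per-feature Fisher ratios.
import numpy as np

def scatter_matrices(features, labels):
    """features: (N, D) array of feature vectors; labels: (N,) class indices."""
    N, D = features.shape
    m = features.mean(axis=0)                      # total mean vector
    Sw = np.zeros((D, D))                          # within-class scatter
    Sb = np.zeros((D, D))                          # between-class scatter
    for c in np.unique(labels):
        Xc = features[labels == c]
        Nc = len(Xc)
        mc = Xc.mean(axis=0)                       # class mean vector m_l
        Sc = (Xc - mc).T @ (Xc - mc) / Nc          # class covariance S_l
        Sw += (Nc / N) * Sc
        Sb += (Nc / N) * np.outer(mc - m, mc - m)
    return Sw, Sb

def fisher_ratios(Sw, Sb):
    """F_i = B_i / W_i: ratio of the i-th diagonal elements of Sb and Sw."""
    return np.diag(Sb) / np.diag(Sw)
```

The (N_l/N) weighting matches the scatter definitions used above; dropping it only rescales both matrices and leaves the Fisher ratios unchanged.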
LDA
• A linear transformation from the original D-dimensional feature space into a D' < D dimensional space can be expressed as y = U^T x, where U is a D x D' matrix of independent column vectors.
• Fukunaga showed that the separability measure can be expressed as J_1(D') = tr[(U^T S_W U)^{-1} (U^T S_b U)], and the matrix U has to be determined so as to maximize J_1(D').
• It can be shown that such a matrix is built of the D' eigenvectors corresponding to the D' largest eigenvalues of the matrix S_W^{-1} S_b.
• Therefore U^T represents a linear transformation that projects a D-dimensional feature vector into the D'-dimensional feature space spanned by the D' eigenvectors (principal discriminants) of S_W^{-1} S_b.
• The matrix U is determined so that the projection of the training data onto its first column vector provides a maximum Fisher ratio equal to the largest eigenvalue of S_W^{-1} S_b, the projection onto the second column the second largest, etc.
• The quality of feature reduction through projection along the reduced set of principal discriminants depends on how well the adopted class separability measure J reflects the structure of the data.
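A hedged sketch of the corresponding projection step, using SciPy's generalized symmetric eigensolver on the pair (S_b, S_W); the function name lda_transform and the choice of SciPy are assumptions, not part of the original material:

```python
# Minimal sketch: build the D x D' LDA projection matrix from the scatter matrices above.
import numpy as np
from scipy.linalg import eigh

def lda_transform(Sw, Sb, d_prime):
    # Solve the generalized eigenproblem Sb u = lambda Sw u (eigenvalues returned ascending).
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1][:d_prime]    # keep the D' largest eigenvalues
    U = eigvecs[:, order]                          # D x D' principal discriminants
    return U

# y = U.T @ x then projects a D-dimensional vector into the D'-dimensional space.
```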
Principal component analysis (PCA)
• PCA is also known as the Karhunen-Loève expansion.
• Let {x_n}, n = 1..N, be feature vectors of dimension D.
• In PCA the feature vectors are projected onto a feature space whose coordinate axes are oriented in the directions of maximum variance; it can be shown that the principal components correspond to the orthogonal eigenvectors of the sample covariance matrix S.
• A linear transformation y = U^T x is applied, with U having as column vectors the orthogonal eigenvectors of S. After the transformation the coordinate axes correspond to the eigenvectors of S, the covariance matrix becomes diagonal, and the variance along each axis is equal to the corresponding eigenvalue.
• D' < D components can be selected so as to preserve most of the variance of the original features.
• http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
• 2x2 rotations are applied there and the process of partial rotation of the data is shown; the PCA coefficients for the displayed eigenvectors are shown as small arrows.
• Look at the planets, countries and images examples.
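A minimal NumPy sketch of this procedure (illustrative only): eigendecomposition of the sample covariance matrix, keeping the D' eigenvectors with the largest eigenvalues:

```python
# Minimal sketch: PCA via eigendecomposition of the sample covariance matrix.
import numpy as np

def pca_transform(features, d_prime):
    """features: (N, D); returns the D x D' projection matrix U and the projected data."""
    m = features.mean(axis=0)
    X = features - m                          # center the data
    S = X.T @ X / len(X)                      # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # orthogonal eigenvectors, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d_prime]
    U = eigvecs[:, order]                     # directions of maximum variance
    return U, X @ U                           # y = U^T x for each centered vector x
```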
PCA & LDA summary
• Feature transformation can be used either to decorrelate features or to maximize class separability in terms of a separability criterion.
• PCA performs a coordinate rotation in order to decorrelate the features, while LDA aligns the directions of maximum separability with the axes so that a certain class separability criterion is maximized.
• The result is a new representation of the features and, provided that all of them are kept, there is no information loss.
• After the PCA transformation it is often assumed that the features with the highest variance in the transformed space carry the important discriminatory information, but this assumption may not always be true. In the LDA case, information is concentrated in the first features, while the last features are noisy and can be discarded.
• In both cases the features with eigenvalues equal to zero can be discarded, but in practice it is quite usual that no eigenvalue is exactly zero.
• If the researcher is not confident about the classification capabilities of the transformed features, a feature filter can always be used after the transformation.
Application of PCA and LDA to speech recognition
• Used to reduce dimensionality and decorrelate features.
• For LDA it is necessary to assign the training feature vectors to the acoustic classes to be discriminated: a time-aligned phonetic transcription is needed, and the speech recognizer itself can be used to determine the phonetic units or HMM states. After alignment, the within-class and between-class statistics are computed and the LDA transformation matrix is estimated.
• Usually one starts from wide feature vectors and then reduces them (see the sketch below).
• The scatter matrices are estimated using dynamic features as well.
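A rough sketch of such a pipeline, under the assumption that frame labels come from a time-aligned transcription; the helper name splice, the variables mfcc_frames and frame_labels, the context width and the target dimension 40 are all illustrative, and scatter_matrices / lda_transform refer to the sketches on the previous slides:

```python
# Illustrative pipeline sketch (hypothetical names): build wide vectors by splicing
# neighbouring frames, then estimate the LDA matrix from class-labelled frames.
import numpy as np

def splice(frames, context=3):
    """frames: (T, D) static features; returns (T, D*(2*context+1)) wide vectors."""
    T, D = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[t:t + T] for t in range(2 * context + 1)])

# wide = splice(mfcc_frames)                     # wide feature vectors
# Sw, Sb = scatter_matrices(wide, frame_labels)  # labels from the time-aligned transcription
# U = lda_transform(Sw, Sb, d_prime=40)          # reduce back to a compact dimension
```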
Feature selection
• The objective is to select the most effective subset of features.
• Effectiveness measure = recognition performance.
• Discriminative feature selection (Bocchieri and Wilpon, 1993; Paliwal, 1992).
• The feature set can be reduced even by half.
• A rather heuristic approach: the optimal number of features to include in the reduced set has to be determined empirically (a greedy sketch is given below).
• Rarely used.
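Purely as an illustration of this heuristic, wrapper-style search (a generic greedy forward selection, not the specific method of Bocchieri and Wilpon), here is a sketch in which the assumed callback evaluate(subset) would train and test the recognizer on the chosen feature subset:

```python
# Hedged sketch of greedy forward feature selection; `evaluate` is an assumed
# callback returning recognition performance for a given list of feature indices.
def greedy_select(n_features, target_size, evaluate):
    selected, remaining = [], list(range(n_features))
    while len(selected) < target_size:
        # pick the feature whose addition gives the best recognition performance
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```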
Vector quantization
• Used to obtain a discrete parametric representation: features are transformed into symbols belonging to a finite alphabet. This can be done in two ways:
• by separate discretization of the feature vector components (scalar quantization)
• by partitioning the whole feature space and assigning the same symbol to all vectors contained in a given element of the partition (vector quantization)
• VQ is used to produce a symbolic description of speech patterns to be used as observations for HMMs. This substantially reduces memory and computational power requirements, because in this case the probability densities of the HMMs are discrete and likelihood values can be evaluated simply with lookup tables; training is also more effective (maximum mutual information training, Normandin, 1991).
• VQ is used for speech coding, especially at very low bit rates.
• VQ is a mapping of a multidimensional feature vector x ∈ R^D onto a vector q belonging to a finite set Q = {q_k}, k = 1..K, where Q is called the codebook, the q_k are codewords, and the total number K of codewords is called the size of the codebook.
• The quantization map q = L(x), L: R^D -> Q, is estimated from a training data set and designed to meet a predefined criterion (e.g. to minimize a distortion or error measure).
• This mapping yields a subdivision of the feature space R^D into non-overlapping regions called cells, which contain the feature vectors mapped onto the same codeword. Thus a codebook of size K is specified both by the set of codewords {q_k}, k = 1..K, and by the set of cells C = {C_k}, k = 1..K.
VQ 2
• Cells can have different shapes. Two problems to solve:
• define a method for designing the codebook
• specify the transformation L
• First find a distortion measure over the pattern space to judge the similarity between two vectors. The goal is then to find the transformation that minimizes the distortion d(x, q) between the feature vector x and the codewords. Hence the codeword is selected as (nearest-neighbour rule): L(x) = argmin_{q_k ∈ Q} d(x, q_k).
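A minimal sketch of the nearest-neighbour rule with a squared-error distortion; the variable codebook (an assumed K x D array holding the codewords q_k) is not from the original slides:

```python
# Minimal sketch: vector quantization by the nearest-neighbour rule.
import numpy as np

def quantize(x, codebook):
    """Return the index k of the codeword q_k that minimizes d(x, q_k)."""
    distortions = np.sum((codebook - np.asarray(x)) ** 2, axis=1)   # squared-error distortion
    return int(np.argmin(distortions))
```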
VQ 3
• Minimizing the average distortion of the cells: we aim to minimize the quantity D_k = E[d(x, q_k) | x ∈ C_k], where E denotes expectation. The codeword q_k is called the centroid of cell C_k.
• Since only a finite set of vectors X is available for codebook design, the average cell distortion is estimated as D̂_k = (1/N_k) Σ_{x ∈ C_k} d(x, q̂_k), where N_k is the number of vectors contained in C_k and q̂_k is an estimate of the corresponding codeword.
• If the distance measure d corresponds to the mean squared error, the vector q̂_k which minimizes D_k is simply the sample mean of the vectors in C_k.
• What are suitable distance or distortion measures between two multidimensional vectors x, y ∈ R^D?
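A small illustrative sketch of these estimates under the MSE distortion (function and variable names are assumed):

```python
# Illustrative estimate of the average cell distortion D_k and of the MSE-optimal
# codeword (the sample mean) for the vectors assigned to cell C_k.
import numpy as np

def cell_centroid_and_distortion(cell_vectors):
    """cell_vectors: (N_k, D) training vectors mapped onto the same codeword."""
    q_hat = cell_vectors.mean(axis=0)                              # centroid of the cell
    D_k = float(np.mean(np.sum((cell_vectors - q_hat) ** 2, axis=1)))
    return q_hat, D_k
```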
Spectral distances
• L_p-type distances:
• for p = 1: absolute error
• for p = 2: mean squared error (MSE)
• for p = 2 a weighted mean squared error is also used, d_W(x, y) = (x − y)^T W^{-1} (x − y), where the weight matrix W (of dimension D) is the estimated covariance matrix of the pattern space (Mahalanobis distance)
• the log-spectral distance for comparing the log amplitudes of two LPC spectra is quite complicated to compute; usually only the mean-squared log-spectral distance is used, and it can be approximated by a truncated sum of squared differences of the cepstral coefficients c_1..c_M, where M is the order of the model
• the Itakura-Saito distance expresses the distance between a given signal spectrum S(e^{jω}) and the corresponding autoregressive spectrum S_A(e^{jω})
• in practice the simpler Itakura distance d_I = log[(b^T R b) / (a^T R a)] is used, where b is the vector of B(z) coefficients, a the vector of A(z) coefficients, and R is the autocorrelation matrix of x_a(n)
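A hedged sketch of two of these measures: the Mahalanobis form takes the inverse covariance matrix explicitly, and the cepstral function uses the standard truncated approximation of the mean-squared log-spectral distance (with the gain term c_0 omitted), which is an assumption about the exact formula the slide refers to:

```python
# Illustrative sketch only: weighted MSE (Mahalanobis) distance and a truncated
# cepstral approximation of the mean-squared log-spectral distance (c_0 omitted).
import numpy as np

def mahalanobis(x, y, W_inv):
    """W_inv: inverse of the estimated covariance matrix of the pattern space."""
    d = np.asarray(x) - np.asarray(y)
    return float(d @ W_inv @ d)

def log_spectral_distance_approx(c1, c2):
    """c1, c2: cepstral coefficients c_1..c_M of the two LPC models."""
    diff = np.asarray(c1) - np.asarray(c2)
    return 2.0 * float(np.sum(diff ** 2))
```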
Codebook design: the k-means algorithm
• An iterative clustering algorithm which, for a given set of training data X = {x_n}, n = 1..N, searches for a partition of the pattern space into K clusters whose centroids {q_k}, k = 1..K, are the codewords:
1. The codewords are initialized.
2. The clusters (cells C_k) associated with the codewords are evaluated using the nearest-neighbour rule.
3. A new set of codewords and cell distortions is computed by applying the MSE criterion to the clusters obtained in the previous step.
4. Repeat 2 & 3 until the overall distortion of the clusters is below a predefined threshold.
• Future trends and applications:
• use separate codebooks for selected groups of features (MFCC, Δ, ΔΔ)
• use neural networks for VQ
• supervised methods
• use for speaker identification: codebooks for all speakers and a distortion measure for identification
• PCA and LDA are used in LVCSR systems
• VQ rather in small systems
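A minimal NumPy sketch of steps 1-4 (an assumed implementation, with squared error as the distortion measure and random training vectors as the initial codewords):

```python
# Minimal sketch of the k-means codebook design loop described above.
import numpy as np

def kmeans_codebook(X, K, threshold, max_iter=100, seed=0):
    """X: (N, D) training vectors; returns a (K, D) codebook."""
    rng = np.random.default_rng(seed)
    # 1. initialize the codewords (here: K distinct training vectors chosen at random)
    codebook = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # 2. evaluate the clusters: nearest-neighbour assignment of each training vector
        d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # 3. new codewords from the MSE criterion (sample mean of each non-empty cell)
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
        overall_distortion = d[np.arange(len(X)), assign].mean()
        # 4. stop when the overall distortion falls below the predefined threshold
        if overall_distortion < threshold:
            break
    return codebook
```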