國立雲林科技大學 National Yunlin University of Science and Technology
Unsupervised Feature Selection Using Feature Similarity
• Advisor : Dr. Hsu
• Graduate : Ching-Lung Chen
• Author : Pabitra Mitra, Student Member, IEEE
Outline
• Motivation
• Objective
• Introduction
• Feature Similarity Measure
• Feature Selection Method
• Feature Evaluation Indices
• Experimental Results and Comparisons
• Conclusions
• Personal Opinion
• Review
Motivation
• Conventional feature selection methods suffer from high computational complexity when the data set is large in both dimension and size.
Objective
• Propose an unsupervised feature selection algorithm suitable for data sets that are large in both dimension and size.
Introduction 1/3
• The sequential floating searches provide better results, though at the cost of higher computational complexity.
• Existing methods can be broadly classified into two categories:
• Maximization of clustering performance: sequential unsupervised feature selection, maximum entropy, neuro-fuzzy approaches, ...
• Methods based on feature dependency and relevance: correlation coefficients, measures of statistical redundancy, linear dependence, ...
Introduction 2/3
• We propose an unsupervised algorithm that uses feature dependency/similarity for redundancy reduction but requires no search.
• A new similarity measure, called the maximal information compression index, is used in clustering. It is compared with the correlation coefficient and the least-square regression error.
Introduction 3/3
• The proposed algorithm is geared toward two goals:
• Minimizing the information loss.
• Minimizing the redundancy present in the reduced feature subset.
• Unlike most conventional algorithms, the proposed method does not search for the best subset, and its similarity measure can be computed in much less time than many indices used in other supervised and unsupervised feature selection methods.
Feature Similarity Measure
• There are two approaches for measuring similarity between two random variables:
• Nonparametrically test the closeness of the probability distributions of the variables.
• Measure the amount of functional dependency between the variables.
• We discuss two existing linear dependency measures:
• Correlation coefficient (ρ)
• Least square regression error (e)
Feature Similarity Measure
• Correlation coefficient: ρ(x,y) = cov(x,y) / √(var(x)·var(y)), where var(·) is the variance of a variable and cov(·,·) the covariance between two variables.
• 0 ≤ 1 − |ρ(x,y)| ≤ 1.
• 1 − |ρ(x,y)| = 0 if and only if x and y are linearly related.
• ρ(x,y) = ρ(y,x) (symmetric).
• If u = (x − a)/c and v = (y − b)/d for some constants a, b, c, d, then ρ(u,v) = ρ(x,y); the measure is invariant to scaling and translation of the variables.
• The measure is sensitive to rotation of the scatter diagram in the (x,y) plane.
Feature Similarity Measure
• Least square regression error (e): the error in predicting y from the linear model y = a + bx, where a and b are the regression coefficients obtained by minimizing the mean square error.
• The coefficients are given by b = cov(x,y) / var(x) and a = ȳ − b·x̄, and the mean square error is e(x,y) = var(y)·(1 − ρ(x,y)²).
Feature Similarity Measure
• Least square regression error (e) has the following properties:
• 0 ≤ e(x,y) ≤ var(y).
• e(x,y) = 0 if and only if x and y are linearly related.
• e(x,y) ≠ e(y,x) (unsymmetric).
• If u = x/c and v = y/d for some constants c, d, then e(x,y) = d²·e(u,v); the measure is sensitive to scaling of the variables (though invariant to translation).
• The measure is sensitive to rotation of the scatter diagram in the (x,y) plane.
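To make the two linear dependency measures concrete, here is a minimal NumPy sketch (the function names are illustrative, not from the paper) computing ρ(x,y) and e(x,y) = var(y)·(1 − ρ(x,y)²) for two feature vectors.

```python
import numpy as np

def correlation_coefficient(x, y):
    """rho(x, y) = cov(x, y) / sqrt(var(x) * var(y))."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())

def regression_error(x, y):
    """Mean square error of the best linear fit y = a + b*x,
    which reduces to var(y) * (1 - rho(x, y)**2)."""
    rho = correlation_coefficient(x, y)
    return y.var() * (1.0 - rho ** 2)

# Example: y is almost a linear function of x, so both
# 1 - |rho| and e(x, y) are close to zero.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 1.0 + 0.01 * rng.normal(size=1000)
print(correlation_coefficient(x, y))  # close to 1.0
print(regression_error(x, y))         # close to 0
```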
Feature Similarity Measure
• Maximal information compression index (λ2): let Σ be the covariance matrix of the random variables x and y. Define λ2(x,y) as the smallest eigenvalue of Σ, i.e.,
2·λ2(x,y) = (var(x) + var(y)) − √((var(x) + var(y))² − 4·var(x)·var(y)·(1 − ρ(x,y)²)).
• λ2 = 0 when the features are linearly dependent, and it increases as the amount of dependency decreases.
Feature Similarity Measure
• The corresponding loss of information in reconstructing the pattern is equal to the eigenvalue along the direction normal to the principal component.
• Hence, λ2 is the reconstruction error committed if the data is projected to a reduced dimension in the best possible way.
• Therefore, it is a measure of the minimum amount of information loss, or the maximum amount of information compression, for the feature pair.
Feature Similarity Measure
• The significance of λ2 can also be explained geometrically in terms of linear regression.
• The value of λ2 is equal to the sum of the squares of the perpendicular distances of the points (x,y) to the best-fit line y = a + bx.
• The coefficients of this best-fit line are b = tan θ and a = ȳ − b·x̄, where the angle θ satisfies tan 2θ = 2·cov(x,y) / (var(x) − var(y)).
Feature Similarity Measure
• λ2 has the following properties:
• 0 ≤ λ2(x,y) ≤ 0.5·(var(x) + var(y)).
• λ2(x,y) = 0 if and only if x and y are linearly related.
• λ2(x,y) = λ2(y,x) (symmetric).
• The measure is invariant to translation but sensitive to scaling of the variables.
• Unlike the correlation coefficient and the regression error, the measure is invariant to rotation of the scatter diagram in the (x,y) plane.
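As a rough sketch (names illustrative), λ2 can be computed as the smaller eigenvalue of the 2×2 covariance matrix of the feature pair; the example also illustrates the rotation invariance noted above.

```python
import numpy as np

def max_info_compression_index(x, y):
    """lambda_2(x, y): the smaller eigenvalue of the 2x2 covariance
    matrix of x and y (minimum information loss when the pair is
    projected onto its first principal component)."""
    cov = np.cov(np.stack([x, y]))      # 2x2 covariance matrix
    return np.linalg.eigvalsh(cov)[0]   # eigenvalues in ascending order

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + 5.0                        # linearly dependent on x
print(max_info_compression_index(x, y))  # ~ 0

# Unlike rho and e, lambda_2 is insensitive to rotation of the (x, y)
# scatter: rotating the pair by 45 degrees leaves it unchanged.
theta = np.pi / 4
xr = x * np.cos(theta) - y * np.sin(theta)
yr = x * np.sin(theta) + y * np.cos(theta)
print(max_info_compression_index(xr, yr))  # same value (~ 0)
```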
Feature Selection Method
• The task of feature selection involves two steps:
• Partition the original feature set into a number of homogeneous subsets (clusters).
• Select a representative feature from each such cluster.
• The partitioning of the features is based on the k-NN principle:
• Compute the k nearest features of each feature.
• Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
• The process is repeated for the remaining features until all of them are either selected or discarded.
Feature Selection Method
• While determining the k nearest neighbors of the features, we assign a constant error threshold (ε), set equal to the distance of the kth nearest neighbor of the feature selected in the first iteration.
• If, in a later iteration, this distance is greater than ε, the value of k is decreased.
Feature Selection Method
• Notation:
• D : the original number of features; the original feature set is O = {Fi, i = 1, ..., D}.
• S(Fi, Fj) : the dissimilarity between features Fi and Fj.
• r_i^k : the dissimilarity between feature Fi and its kth nearest-neighbor feature in R (the current set of remaining features).
• A minimal code sketch of this clustering procedure is given below.
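Below is a minimal sketch of the clustering procedure described above, assuming a pairwise dissimilarity function such as λ2 is supplied; it omits the ε-based adaptation of k, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def select_features(data, k, dissimilarity):
    """Greedy feature clustering (simplified): at each step pick the feature
    whose k-th nearest neighbour among the remaining features is closest
    (the most compact neighbourhood), keep it as a representative, and
    discard those k neighbours.  `data` has shape (n_samples, n_features)."""
    remaining = list(range(data.shape[1]))
    selected = []
    while remaining:
        k_eff = min(k, len(remaining) - 1)
        if k_eff == 0:                       # only one feature left: keep it
            selected.append(remaining.pop())
            break
        best_i, best_r, best_nbrs = None, np.inf, None
        for i in remaining:
            # distances from feature i to every other remaining feature
            d = sorted((dissimilarity(data[:, i], data[:, j]), j)
                       for j in remaining if j != i)[:k_eff]
            r_ik = d[-1][0]                  # distance to the k-th nearest feature
            if r_ik < best_r:
                best_i, best_r, best_nbrs = i, r_ik, [j for _, j in d]
        selected.append(best_i)
        remaining = [j for j in remaining
                     if j != best_i and j not in best_nbrs]
    return selected

# e.g. selected = select_features(X, k=5,
#                                 dissimilarity=max_info_compression_index)
```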
Feature Selection Method
• With respect to the dimension D, the method has complexity O(D²).
• Evaluation of the similarity measure for a feature pair has complexity O(l), where l is the number of samples; thus, the feature selection scheme has overall complexity O(D²·l).
• k acts as a scale parameter that controls the degree of detail in a more direct manner.
• The algorithm can work with similarity measures that are nonmetric in nature.
Feature Evaluation Indices
• We now describe several evaluation indices:
• Indices that need class information:
• Class separability
• k-NN classification accuracy
• Naïve Bayes classification accuracy
• Indices that do not need class information:
• Entropy
• Fuzzy feature evaluation index
• Representation entropy
Feature Evaluation Indices
Class Separability
• Sw is the within-class scatter matrix and Sb is the between-class scatter matrix:
Sw = Σj πj · E[(x − μj)(x − μj)ᵀ | ωj],  Sb = Σj πj · (μj − M0)(μj − M0)ᵀ, where M0 is the overall sample mean vector.
• πj is the a priori probability that a pattern belongs to class ωj.
• μj is the sample mean vector of class ωj.
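A hedged sketch of this index: the within- and between-class scatter matrices are computed in their standard form, and trace(Sw⁻¹·Sb) is used as the separability value; the exact trace-based combination used in the paper may differ.

```python
import numpy as np

def class_separability(X, labels):
    """Within-class (Sw) and between-class (Sb) scatter matrices and a
    common trace-based separability criterion, trace(inv(Sw) @ Sb).
    (The exact form of the criterion in the paper may differ.)"""
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / len(labels)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c, p in zip(classes, priors):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)   # within-class scatter
        diff = (mu - overall_mean).reshape(-1, 1)
        Sb += p * diff @ diff.T                         # between-class scatter
    return np.trace(np.linalg.inv(Sw) @ Sb)
```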
Feature Evaluation Indices
k-NN Classification Accuracy
• The k-NN rule is used to evaluate the effectiveness of the reduced feature set for classification.
• We randomly select 10% of the data as the training set and classify the remaining 90% of the points.
• Ten such independent runs are performed, and the average accuracy on the test set is reported.
Feature Evaluation Indices
Naïve Bayes Classification Accuracy
• A Bayes maximum likelihood classifier, assuming a normal distribution for each class, is used to evaluate classification performance.
• The mean and covariance of the classes are estimated from a randomly selected 10% training sample, and the remaining 90% of the data is used as the test set.
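The 10%/90% evaluation protocol for both classifiers can be sketched with scikit-learn (assuming it is available; the number of neighbors and the stratified split are choices of this sketch, not specified on the slides).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def evaluate_subset(X, y, feature_idx, runs=10):
    """Mean k-NN and naive Bayes test accuracy for a reduced feature set,
    averaged over `runs` random 10%/90% train/test splits."""
    Xr = X[:, feature_idx]
    knn_acc, nb_acc = [], []
    for seed in range(runs):
        Xtr, Xte, ytr, yte = train_test_split(
            Xr, y, train_size=0.1, random_state=seed, stratify=y)
        knn_acc.append(
            KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr).score(Xte, yte))
        nb_acc.append(GaussianNB().fit(Xtr, ytr).score(Xte, yte))
    return np.mean(knn_acc), np.mean(nb_acc)
```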
Feature Evaluation Indices
Entropy
• x_{p,j} denotes the feature value of pattern p along the jth direction; D_{pq} is the distance between patterns p and q.
• The similarity between p and q is given by sim(p,q) = e^(−α·D_{pq}), where α is a positive constant; a possible value is α = −ln(0.5)/D̄, with D̄ the average distance between data points computed over the entire data set.
• Entropy: E = −Σ_p Σ_q [ sim(p,q)·log sim(p,q) + (1 − sim(p,q))·log(1 − sim(p,q)) ].
• If the data is uniformly distributed in the feature space, the entropy is maximum.
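A sketch of this entropy index under the reconstruction above (the per-feature range normalization of the distances and the base-2 logarithm are assumptions of this sketch), using NumPy and SciPy.

```python
import numpy as np
from scipy.spatial.distance import pdist

def data_entropy(X):
    """Entropy index: high when points are uniformly spread, lower when the
    data forms well-defined clusters.  Distances are scaled per feature by
    the feature's range (an assumption of this sketch)."""
    rng = X.max(axis=0) - X.min(axis=0)
    D = pdist(X / np.where(rng > 0, rng, 1.0))   # pairwise distances (each pair once)
    alpha = -np.log(0.5) / D.mean()              # so sim = 0.5 at the mean distance
    sim = np.exp(-alpha * D)
    sim = np.clip(sim, 1e-12, 1 - 1e-12)         # avoid log(0)
    return -np.sum(sim * np.log2(sim) + (1 - sim) * np.log2(1 - sim))
```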
Feature Evaluation Indices
Fuzzy Feature Evaluation Index (FFEI)
• μ_pq^O and μ_pq^T are the degrees to which patterns p and q belong to the same cluster in the original and the transformed (reduced) feature space, respectively.
• The membership function may be defined as a decreasing function of the distance between p and q.
• The value of FFEI decreases as the intercluster distances increase.
Feature Evaluation Indices
Representation Entropy
• Let the eigenvalues of the d×d covariance matrix of a feature set of size d be λj, j = 1, ..., d, and define λ̃j = λj / Σj λj.
• λ̃j has properties similar to a probability: 0 ≤ λ̃j ≤ 1 and Σj λ̃j = 1.
• Representation entropy: H_R = −Σj λ̃j · log λ̃j.
• This is a measure of the amount of redundancy present in that particular representation of the data set.
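A small sketch computing H_R for a given feature subset (names illustrative):

```python
import numpy as np

def representation_entropy(X):
    """H_R of a feature set: entropy of the normalised eigenvalues of its
    covariance matrix.  A low H_R means the information is compressible
    into few dimensions, i.e. the representation is highly redundant."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    lam = np.clip(eigvals, 0.0, None)   # guard against tiny negative values
    lam = lam / lam.sum()
    nz = lam[lam > 0]
    return float(-np.sum(nz * np.log(nz)))
```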
Experimental Results and Comparisons
• Three categories of real-life public domain data sets are used:
• Low-dimensional (D ≤ 10)
• Medium-dimensional (10 < D ≤ 100)
• High-dimensional (D > 100)
• Nine UCI data sets are used: Isolet, Multiple Features, Arrhythmia, Spambase, Waveform, Ionosphere, Forest Cover Type, Wisconsin Cancer, and Iris.
Experimental Results and Comparisons
• The proposed method is compared with four feature selection schemes:
• Branch and Bound Algorithm (BB)
• Sequential Forward Search (SFS)
• Sequential Floating Forward Search (SFFS)
• Stepwise Clustering (SWC), using the correlation coefficient
• In our experiments, entropy is mainly used as the feature selection criterion with the first three search algorithms.
Conclusions
• An algorithm for unsupervised feature selection using feature similarity measures is described.
• Unlike other approaches, which are based on explicitly optimizing either classification or clustering performance, our algorithm is based on pairwise feature similarity measures, which are fast to compute.
• We have defined a feature similarity measure called the maximal information compression index.
• It is also demonstrated through extensive experiments that representation entropy can be used as an index for quantifying both redundancy reduction and information loss in a feature selection method.
Personal Opinion
• We can learn from this method to help with our own feature selection experiments.
• This similarity measure is valid only for numeric features; we can think about how to extend it to categorical features.
Review
• Compute the k nearest features of each feature.
• Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
• This process is repeated for the remaining features until all of them are either selected or discarded.