Summer Course: Data Mining Feature Selection, Dimensionality Reduction, and Clustering Presenter: Georgi Nalbantov August 2009
Structure • Dimensionality Reduction • Principal Components Analysis (PCA) • Nonlinear PCA (Kernel PCA, CatPCA) • Multi-Dimensional Scaling (MDS) • Homogeneity Analysis • Feature Selection • Filtering approach • Wrapper approach • Embedded methods • Clustering • Density estimation and clustering • K-means clustering • Hierarchical clustering • Clustering with Support Vector Machines (SVMs)
Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Feature Selection • In the presence of millions of features/attributes/inputs/variables, select the most relevant ones. Advantages: build better, faster, and easier-to-understand learning machines. [Figure: n × m data matrix X reduced to the m’ selected features]
Feature Selection • Goal: select the two best features individually • Any reasonable objective J will rank the features: J(x1) > J(x2) = J(x3) > J(x4) • Thus, the chosen features are [x1, x2] or [x1, x3] • However, x4 is the only feature that provides complementary information to x1
Feature Selection • Filtering approach: ranks features or feature subsets independently of the predictor (classifier) • …using univariate methods: consider one variable at a time • …using multivariate methods: consider more than one variable at a time • Wrapper approach: uses a classifier to assess (many) features or feature subsets • Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected
Feature Selection: univariate filtering approach • Issue: determine the relevance of a given single feature. [Figure: class-conditional densities P(Xi | Y) plotted along xi, with class means m and standard deviations s]
Feature Selection: univariate filtering approach • Issue: determine the relevance of a given single feature. • Under independence: P(X, Y) = P(X) P(Y) • Measure of dependence (Mutual Information): MI(X, Y) = ∫∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY = KL( P(X, Y) || P(X) P(Y) )
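A minimal sketch of this univariate MI filter, using scikit-learn's estimator of the mutual information between each feature and the class label; the synthetic dataset below is an illustrative assumption, not part of the slides.

```python
# Univariate filter: rank features by estimated mutual information with the class label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)    # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                    # highest MI first
for idx in ranking:
    print(f"feature {idx}: MI = {mi[idx]:.3f}")
```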
Feature Selection: univariate filtering approach • Correlation and MI • Note: Correlation is a measure of linear dependence
Feature Selection: univariate filtering approach • Correlation and MI under the Gaussian distribution
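The relation shown on this slide did not survive extraction; for a bivariate Gaussian pair (X, Y) with correlation coefficient ρ, the standard closed form (in nats) makes MI a monotone function of |ρ|:

```latex
\mathrm{MI}(X, Y) = -\tfrac{1}{2}\,\log\!\left(1 - \rho^{2}\right)
```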
Feature Selection: univariate filtering approach. Criteria for measuring dependence.
Feature Selection: univariate filtering approach [Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = 1) along xi, with means m-, m+ and standard deviations s-, s+; legend: Y = 1, Y = -1]
Feature Selection: univariate filtering approach • P(Xi | Y = 1) = P(Xi | Y = -1) vs. P(Xi | Y = 1) ≠ P(Xi | Y = -1) [Figure: identical vs. shifted class-conditional densities along xi; legend: Y = 1, Y = -1]
Feature Selection: univariate filtering approach Is the distance between the class means significant? T-test • Normally distributed classes, equal variance σ² unknown; estimated from data as σ²_within • Null hypothesis H0: μ+ = μ- • T statistic: if H0 is true, then t = (μ+ − μ-) / (σ_within √(1/m+ + 1/m-)) ~ Student(m+ + m- − 2 d.f.), where m+ and m- are the numbers of examples in the two classes
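A minimal sketch of the per-feature t-test filter with SciPy; the synthetic data and the 0/1 class coding (instead of ±1) are illustrative assumptions.

```python
# Univariate filter via a two-sample t-test per feature (equal-variance assumption,
# matching the slide); small p-values suggest the feature separates the classes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
X[y == 1, 0] += 1.0            # make feature 0 informative

for j in range(X.shape[1]):
    t, p = stats.ttest_ind(X[y == 1, j], X[y == 0, j], equal_var=True)
    print(f"feature {j}: t = {t:+.2f}, p = {p:.3g}")
```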
Feature Selection: multivariate filtering approach Guyon-Elisseeff, JMLR 2004; Springer 2006
Feature Selection: search strategies Kohavi-John, 1997 N features → 2^N possible feature subsets!
Feature Selection: search strategies • Forward selection or backward elimination • Beam search: keep the k best paths at each step • GSFS: generalized sequential forward selection – when (n−k) features are left, try all subsets of g features. More trainings at each step, but fewer steps • PTA(l,r): plus l, take away r – at each step, run SFS l times, then SBS r times • Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far
Feature Selection: filters vs. wrappers vs. embedding • Main goal: rank subsets of useful features
Feature Selection: feature subset assessment (wrapper) Split the data (M samples, N variables/features) into 3 sets of sizes m1, m2, m3: training, validation, and test set. • 1) For each feature subset, train the predictor on the training data. • 2) Select the feature subset which performs best on the validation data. • Repeat and average if you want to reduce variance (cross-validation). • 3) Test on the test data. • Danger of over-fitting with intensive search!
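A minimal sketch of this wrapper procedure as greedy forward selection scored by cross-validation; the synthetic dataset, logistic regression as the predictor, 5-fold CV, and the cap of five features are illustrative assumptions (a separate test set should still be kept aside, as the slide notes).

```python
# Wrapper approach: greedily add the feature whose inclusion gives the best
# cross-validated accuracy, stopping when no candidate improves the score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)
clf = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
for _ in range(5):                                  # pick at most 5 features
    scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:                # stop if no improvement
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```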
Feature Selection via Embedded Methods: L1-regularization [Figure: coefficient paths plotted against sum(|beta|)]
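A minimal sketch of embedded selection via the L1 penalty, using scikit-learn's logistic regression; the synthetic data and the regularization strength C=0.1 are illustrative assumptions.

```python
# Embedded selection: an L1 penalty on sum(|beta|) drives some coefficients
# exactly to zero, so the corresponding features are effectively discarded.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=3, random_state=0)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_[0])
print("non-zero coefficients (selected features):", kept)
```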
Feature Selection: summary • Linear, univariate: T-test, AUC, feature ranking • Linear, multivariate: RFE with linear SVM or LDA • Non-linear, univariate: Mutual information feature ranking • Non-linear, multivariate: Nearest Neighbors, Neural Nets, Trees, SVM
Dimensionality Reduction • In the presence of many features, select the most relevant subset of (weighted) combinations of features. [Figure: feature selection vs. dimensionality reduction]
Dimensionality Reduction: (Linear) Principal Components Analysis • PCA finds a linear mapping of dataset X to a dataset X’ of lower dimensionality, such that the variance of X that is retained in X’ is maximal. • Dataset X is mapped to dataset X’, here of the same dimensionality. The first dimension in X’ (= the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
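A minimal scikit-learn sketch of linear PCA; the Iris data and the choice of two components are illustrative assumptions.

```python
# Linear PCA: project the (standardized) data onto the directions of maximal variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)    # centre/scale first
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                         # n x 2 projection

print("explained variance ratio:", pca.explained_variance_ratio_)
```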
Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis • Original dataset X • Map X to a HIGHER-dimensional space and carry out LINEAR PCA in that space • (If necessary,) map the resulting principal components back to the original space
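A minimal scikit-learn sketch of kernel PCA following the three steps above; the RBF kernel, gamma=10, and the concentric-circles data are illustrative assumptions.

```python
# Kernel PCA: implicitly map X to a higher-dimensional feature space via a kernel
# and do linear PCA there; optionally map the components back (approximate pre-image).
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10,
                 fit_inverse_transform=True)             # enables mapping back
X_kpca = kpca.fit_transform(X)
X_back = kpca.inverse_transform(X_kpca)                  # back in the original space
```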
Dimensionality Reduction: Multi-Dimensional Scaling • MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower (for example, two-dimensional) space. • MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space. • The error of the fit is measured using a so-called “stress” function • Different choices for the stress function are possible
Dimensionality Reduction: Multi-Dimensional Scaling • Raw stress function (identical to PCA) • Sammon cost function (standard forms of both are given below)
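The formulas themselves did not survive extraction; in the commonly used notation, with d_ij the original pairwise distances and d̂_ij the distances between the mapped points in the low-dimensional space, the two cost functions are usually written as:

```latex
\text{Raw stress:}\quad
\sigma_{\text{raw}} = \sum_{i<j} \bigl(d_{ij} - \hat{d}_{ij}\bigr)^{2}
\qquad\qquad
\text{Sammon:}\quad
\sigma_{\text{Sammon}} = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\bigl(d_{ij} - \hat{d}_{ij}\bigr)^{2}}{d_{ij}}
```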
Dimensionality Reduction: Multi-Dimensional Scaling (Example) [Figure: example input distances and the resulting low-dimensional output configuration]
Dimensionality Reduction: Homogeneity analysis • Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.
Clustering vs. Classification vs. Regression • Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA • Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS • Regression: Classical Linear Regression, Ridge Regression, NN, CART [Figure: example scatter plots in (X1, X2) for each of the three tasks]
Clustering • Clustering is an unsupervised learning technique. • Task: organize objects into groups whose members are similar in some way • Clustering finds structures in a collection of unlabeled data • A cluster is a collection of objects which are similar to each other and dissimilar to the objects belonging to other clusters
Density estimation and clustering Bayesian separation curve (optimal)
Clustering: K-means clustering • Minimizes the sum of the squared distances to the cluster centers (reconstruction error) • Iterative process: • Given the current cluster centers, estimate the assignments (construct a Voronoi partition) • Given the new cluster assignments, set each cluster center to the center of mass of its points
Clustering: K-means clustering [Figure: K-means iterations, steps 1–4]
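As an illustration of the two alternating steps above, here is a minimal NumPy sketch of K-means; the random initialisation by sampling k points and the toy three-blob data are illustrative assumptions, not part of the slides.

```python
# K-means: alternate between (1) assigning each point to its nearest centre
# (a Voronoi partition) and (2) moving each centre to the mean of its points.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # init: sample k points
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # step 1: assign
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])               # step 2: update
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 0], [0, 3])])
labels, centers = kmeans(X, k=3)
print(centers.round(2))
```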
Clustering: Hierarchical clustering • Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster • Agglomerative HC: starts with n clusters and combines clusters iteratively • Divisive HC: starts with one cluster and divides iteratively • Disadvantage: a wrong division cannot be undone • Dendrogram
Clustering: Nearest Neighbor algorithm for hierarchical clustering 1. Nearest Neighbor, Level 2, k = 7 clusters. 2. Nearest Neighbor, Level 3, k = 6 clusters. 3. Nearest Neighbor, Level 4, k = 5 clusters.
Clustering: Nearest Neighbor algorithm for hierarchical clustering 4. Nearest Neighbor, Level 5, k = 4 clusters. 5. Nearest Neighbor, Level 6, k = 3 clusters. 6. Nearest Neighbor, Level 7, k = 2 clusters.
Clustering: Nearest Neighbor algorithm for hierarchical clustering 7. Nearest Neighbor, Level 8, k = 1 cluster.
Clustering: Similarity measures for hierarchical clustering • Pearson Correlation: Trend Similarity
Clustering: Similarity measures for hierarchical clustering • Euclidean Distance
Clustering: Similarity measures for hierarchical clustering • Cosine Correlation (ranges from +1 to −1)
Clustering: Similarity measures for hierarchical clustering • Cosine Correlation: Trend + Mean Distance
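The measures on the preceding slides can be computed directly with NumPy; the two profiles x and y below are made-up illustrative vectors.

```python
# Three (dis)similarity measures between two profiles: Pearson correlation (trend),
# Euclidean distance (trend + magnitude), and cosine similarity (angle).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])          # roughly 2*x: same trend, larger scale

pearson = np.corrcoef(x, y)[0, 1]
euclidean = np.linalg.norm(x - y)
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(f"Pearson r = {pearson:.3f}, Euclidean = {euclidean:.3f}, cosine = {cosine:.3f}")
```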
Clustering: Similarity measures for hierarchical clustering Similar?
Clustering: Grouping strategies for hierarchical clustering Merge which pair of clusters? [Figure: three clusters C1, C2, C3]
Clustering: Grouping strategies for hierarchical clustering Single Linkage: dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters. Tends to generate “long chains”. [Figure: clusters C1 and C2]
Clustering: Grouping strategies for hierarchical clustering Complete Linkage: dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters. Tends to generate “clumps”. [Figure: clusters C1 and C2]
Clustering: Grouping strategies for hierarchical clustering Average Linkage: dissimilarity between two clusters = averaged distance over all pairs of objects (one from each cluster). [Figure: clusters C1 and C2]
Clustering: Grouping strategies for hierarchical clustering Average Group Linkage: dissimilarity between two clusters = distance between the two cluster means. [Figure: clusters C1 and C2]
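A minimal SciPy sketch of agglomerative clustering with three of the grouping strategies above; the toy two-blob data and the cut into two clusters are illustrative assumptions.

```python
# Agglomerative hierarchical clustering with single, complete, and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(4, 0.5, size=(20, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                      # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree, if matplotlib is available.
```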
Clustering: Support Vector Machines for clustering • The not-noisy case • Objective function: given below in standard notation • Ben-Hur, Horn, Siegelmann and Vapnik, 2001
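The objective itself is not reproduced in the extracted text; in the hard-margin (“not-noisy”) support vector clustering formulation of Ben-Hur et al. (2001), it is commonly written as finding the smallest sphere (centre a, radius R) enclosing the mapped points Φ(x_j) in feature space:

```latex
\min_{R,\;\mathbf{a}}\; R^{2}
\quad \text{subject to} \quad
\bigl\lVert \Phi(\mathbf{x}_{j}) - \mathbf{a} \bigr\rVert^{2} \le R^{2}
\qquad \forall j
```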