Spectral Feature Selection for Handling Very Large Scale Problems Zheng (Alan) Zhao SAS Institute
Motivation
• Petabyte datasets are rapidly becoming the norm in data mining applications
  • Google: processing 20 PB of data per day (2008)
  • eBay's Greenplum data warehouse: 6.5 PB of user data containing 170 trillion records and growing by 150 billion new records per day (2009)
• The efficiency of existing feature selection algorithms degrades significantly, if the algorithms do not become inapplicable altogether, once data size exceeds hundreds of gigabytes
• Distributed computing techniques, such as MPI and MapReduce, can be applied to handle very large data, yet most existing feature selection algorithms are designed for a centralized computing architecture
Large Scale Spectral Feature Selection
• Spectral feature selection is a general framework for both supervised and unsupervised feature selection
  • Unifies many existing supervised and unsupervised feature selection algorithms: ReliefF, Fisher score, Laplacian score, trace ratio, etc.
  • Can be used to derive families of new algorithms
  • Can be extended to solve novel problems, such as semi-supervised feature selection and multi-source feature selection
• We study how to implement spectral feature selection in distributed computing environments, such as MapReduce and SAS Grid, to exploit the power of parallel processing for feature selection on very large scale data
• We focus on the MapReduce technique in this talk
MapReduce
• A technique for processing massive data on large-scale computer clusters
  • A programming model
  • An execution framework
  • The idea of bringing code to the data
  • Hides system-level details from the developer
• Two key components of MapReduce: the mapper and the reducer (a minimal sketch follows below)
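To make the two components concrete, here is a minimal sketch of the mapper/reducer contract. It is a plain-Python stand-in, not a real MapReduce runtime: the `run` driver, the chunking, and the single `"total"` key are illustrative assumptions, and a production framework would supply the shuffle, scheduling, and fault tolerance.

```python
from collections import defaultdict

def mapper(chunk):
    # Map phase: emit (key, value) pairs from one node's local chunk of data.
    # Here each mapper emits a single partial sum under the key "total".
    yield ("total", sum(chunk))

def reducer(key, values):
    # Reduce phase: aggregate all values emitted for one key.
    return (key, sum(values))

def run(chunks):
    # Toy driver that mimics the framework's shuffle: group mapper outputs
    # by key, then hand each group to the reducer.
    grouped = defaultdict(list)
    for chunk in chunks:                  # each chunk lives on one node
        for key, value in mapper(chunk):  # mappers run in parallel in reality
            grouped[key].append(value)
    return [reducer(k, vs) for k, vs in grouped.items()]

print(run([[1, 2], [3, 4], [5]]))  # [('total', 15)]
```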
The Key Ideas
• The training of many existing algorithms can be decomposed into the computation of a series of
  • sufficient statistics
  • gradient steps
• These summation forms can be grouped by the location of the samples and computed locally on each cluster node by the mappers; the local results are then aggregated by the reducer to obtain the final global result (see the gradient sketch below)
[Figure: mappers compute partial sums over their local data points; a reducer adds the partial sums into the global summation]
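As one hedged illustration of the summation form, the sketch below decomposes the least-squares gradient into per-partition sums. The partitioning and the function names are assumptions made for the example; the point is only that each node computes its partial gradient locally and the reducer adds the partials.

```python
import numpy as np

def local_gradient(X_part, y_part, w):
    # Mapper side: gradient of sum_i (w . x_i - y_i)^2 over one node's
    # samples. X_part is (d, n_local) with samples stored as columns,
    # matching the linear regression slides that follow.
    return 2.0 * X_part @ (X_part.T @ w - y_part)

def distributed_gradient(partitions, w):
    # Reducer side: the global gradient is simply the sum of the partials.
    return sum(local_gradient(Xp, yp, w) for Xp, yp in partitions)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(3, 6)), rng.normal(size=6)
partitions = [(X[:, :3], y[:3]), (X[:, 3:], y[3:])]  # two cluster nodes
w = np.zeros(3)
assert np.allclose(distributed_gradient(partitions, w),
                   2.0 * X @ (X.T @ w - y))  # matches the centralized gradient
```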
Linear Regression
• The objective: $\min_{w} \|X^{\top} w - y\|^{2}$, where each column of $X$ is a sample
• Solution: $w = (X X^{\top})^{-1} X y$
• Decomposition: $X X^{\top} = \sum_i x_i x_i^{\top}$ and $X y = \sum_i y_i x_i$, both summations over the samples
Linear Regression (cont.)
• Mapper & Reducer (a runnable sketch follows below)
[Figure: each mapper computes the partial sums of $x_i x_i^{\top}$ and $y_i x_i$ over its local samples; the reducer adds the partial results and solves for $w$]
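A minimal runnable sketch of this mapper/reducer pair, assuming samples are stored as columns of X and split column-wise across two nodes; the in-process driver and the partition sizes are illustrative stand-ins for a real cluster.

```python
import numpy as np

def mapper(X_part, y_part):
    # Local sufficient statistics over one node's samples (columns):
    # sum_i x_i x_i^T, a d x d matrix, and sum_i y_i x_i, a d-vector.
    return X_part @ X_part.T, X_part @ y_part

def reducer(partials):
    # Add the per-node statistics, then solve the small d x d system
    # w = (X X^T)^{-1} X y.
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(3, 10)), rng.normal(size=10)
parts = [(X[:, :5], y[:5]), (X[:, 5:], y[5:])]          # two cluster nodes
w = reducer([mapper(Xp, yp) for Xp, yp in parts])
assert np.allclose(w, np.linalg.solve(X @ X.T, X @ y))  # matches centralized solve
```

Only the d x d matrix and the d-vector travel over the network, so the communication cost is independent of the number of samples.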
Spectral Feature Selection
• The basic idea: a good feature should not assign values to the samples at random
• A motivating example: in feature selection, we want to select features that assign similar values to samples that are similar to each other
The Spectrum of the Similarity Matrix
• The eigenvectors of the similarity matrix carry information about how the samples are distributed; in particular, the leading eigenvectors capture the cluster structure of the data (see the sketch below)
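A small sketch connecting the last two slides, under illustrative assumptions (an RBF similarity, the normalized Laplacian, and f^T L f as the consistency score; none of these specific choices come from the slides). Because L = sum_k lambda_k xi_k xi_k^T, the score f^T L f equals sum_k lambda_k (f^T xi_k)^2, so scoring a feature against the graph is exactly comparing it to the eigenvectors.

```python
import numpy as np

def spectral_scores(X, gamma=1.0):
    # X: (n_samples, n_features). Build an RBF similarity matrix S, form the
    # normalized Laplacian L = I - D^{-1/2} S D^{-1/2}, and score each
    # centered, unit-norm feature f by f^T L f. A small score means the
    # feature assigns similar values to similar samples.
    sq = np.sum(X**2, axis=1)
    S = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    scores = []
    for j in range(X.shape[1]):
        f = X[:, j] - X[:, j].mean()
        f /= np.linalg.norm(f) + 1e-12
        scores.append(float(f @ L @ f))   # lower = more consistent feature
    return np.array(scores)
```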
Univariate vs. Multivariate Formulations
• Measuring a feature's consistency by comparing the feature to the eigenvectors
• Univariate formulation: score each feature independently
• Multivariate formulation: evaluate a set of features jointly (one standard pair of formulations is sketched below)
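The formulas on this slide did not survive extraction; the pair below is one standard formulation from the spectral feature selection literature, offered as an assumption rather than a reconstruction of the exact slide content. The univariate score ranks features one at a time, while the multivariate form selects a feature subset jointly via an L2,1-regularized multi-output regression onto scaled eigenvectors.

```latex
% Assumed formulations (the original equations were lost in extraction).
% \hat{f}_j: the j-th feature, centered and normalized to unit norm;
% (\lambda_k, \xi_k): eigenpairs of the normalized graph Laplacian \mathcal{L}.
\[
  \text{Univariate:}\quad
  \varphi(f_j) \;=\; \hat{f}_j^{\top} \mathcal{L}\, \hat{f}_j
  \;=\; \sum_{k} \lambda_k \bigl(\hat{f}_j^{\top} \xi_k\bigr)^{2}
\]
\[
  \text{Multivariate:}\quad
  \min_{W}\; \bigl\lVert X^{\top} W - Y \bigr\rVert_F^{2}
  \;+\; \gamma\, \lVert W \rVert_{2,1},
  \qquad
  Y \;=\; \bigl[\sqrt{\sigma_1}\,\xi_1,\;\dots,\;\sqrt{\sigma_K}\,\xi_K\bigr]
\]
% Here the \xi_k in Y are the leading eigenvectors of the similarity matrix
% with eigenvalues \sigma_k, and the \ell_{2,1} norm zeroes out whole rows
% of W, deselecting the corresponding features.
```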
Thank You! Any Questions? Questions are guaranteed in life; Answers aren't.