Graph-based Iterative Hybrid Feature Selection
Erheng Zhong† Sihong Xie† Wei Fan‡ Jiangtao Ren† Jing Peng# Kun Zhang$
†Sun Yat-sen University ‡IBM T. J. Watson Research Center #Montclair State University $Xavier University of Louisiana
Where we are • Supervised feature selection • Unsupervised feature selection • Semi-supervised feature selection • Hybrid: supervised selection to include key features, then improved with a semi-supervised approach
Supervised Feature Selection With biased labeled samples, only feature 2 will be selected even though feature 1 is also useful: the sample selection bias problem.
Toy example (1) Labeled data: A(1,1,1,1; red), B(1,-1,1,-1; blue). Unlabeled data: C(0,1,1,1; red), D(0,-1,1,1; red). Based on A and B alone, both features 2 and 4 are correlated with the class, so both are selected by supervised feature selection.
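The supervised step of this toy example can be reproduced with a simple per-feature correlation score computed on the two labeled points only. This is an illustrative sketch, not the paper's exact scoring function; the 0.5 selection threshold is an assumption for this example.

```python
# Toy example from the slides: score each feature on the labeled data
# A and B only. Feature indices are 1-based to match the slides;
# class "red" is encoded as +1 and "blue" as -1.
X_labeled = [(1, 1, 1, 1),    # A, red (+1)
             (1, -1, 1, -1)]  # B, blue (-1)
y = [1, -1]

def corr_score(values, labels):
    """Absolute Pearson correlation; 0 when the feature is constant."""
    n = len(values)
    mv = sum(values) / n
    ml = sum(labels) / n
    cov = sum((v - mv) * (l - ml) for v, l in zip(values, labels))
    var_v = sum((v - mv) ** 2 for v in values)
    var_l = sum((l - ml) ** 2 for l in labels)
    if var_v == 0 or var_l == 0:
        return 0.0
    return abs(cov / (var_v ** 0.5 * var_l ** 0.5))

scores = [corr_score([row[j] for row in X_labeled], y) for j in range(4)]
selected = [j + 1 for j, s in enumerate(scores) if s > 0.5]
print(selected)  # -> [2, 4]: features 2 and 4 correlate perfectly on A, B
```

Features 1 and 3 are constant on the labeled pair and score zero, which is exactly why a purely supervised criterion cannot see their value on the unlabeled points C and D.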
Toy example (2) A semi-supervised approach: spectral feature selection. Features are ranked by their smoothness between data points and their consistency with the label information. Feature 2 will be selected if only one feature is desired.
Solution: Hybrid • Labeled data insufficient → sample selection bias → supervised methods fail • Unlabeled data indistinct → data from different classes are not separated → semi-supervised methods fail
Properties of feature selection • In a high-dimensional feature space, the distance between any two examples is approximately the same. [Theorem 3.1] • Feature selection can obtain a more distinguishable distance measure, which leads to a better confidence estimate. [Theorem 3.2]
Theorems 3.1 and 3.2 • 3.1: As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor • 3.2: A more distinguishable similarity measure yields a better classification confidence matrix
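The distance-concentration effect behind Theorem 3.1 is easy to observe numerically. The sketch below, assuming i.i.d. uniform features (an illustrative assumption, not the theorem's exact conditions), shows the ratio of the farthest to the nearest pairwise distance shrinking toward 1 as dimensionality grows.

```python
# Numeric illustration of Theorem 3.1 (distance concentration):
# as the dimension grows, all pairwise distances become nearly equal.
import random

def ratio_max_min_dist(n_points, dim, seed=0):
    """Ratio of the largest to the smallest pairwise Euclidean distance
    among n_points random points in [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])) ** 0.5
            dists.append(d)
    return max(dists) / min(dists)

for dim in (2, 10, 100, 1000):
    # The ratio decreases toward 1 as dim increases.
    print(dim, round(ratio_max_min_dist(50, dim), 3))
```

This is why feature selection matters for the graph step: pruning to a discriminative subspace restores distance contrast, which Theorem 3.2 connects to better confidence estimates.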
Semi-supervised Feature Selection • Graph-based [label propagation] • Expand the labeled set by adding the unlabeled data whose predicted labels have high confidence (top s%). • Perform feature selection on the new labeled set.
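The expand-then-reselect loop above can be sketched on the toy data. This is a minimal, hypothetical version, not the paper's GBIHFS algorithm: the RBF graph weights, the clamped-label propagation, and promoting a single most-confident point (standing in for the top s%) are all illustrative assumptions.

```python
# Sketch of one iteration: propagate labels on a similarity graph,
# then promote the most confident unlabeled point into the labeled set.
import math

def rbf(a, b, sigma=1.0):
    """Gaussian similarity between two points."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def propagate(X, y_known, iters=50):
    """y_known: dict index -> +1/-1. Returns soft labels for all points."""
    n = len(X)
    W = [[rbf(X[i], X[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    f = [y_known.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        g = []
        for i in range(n):
            if i in y_known:                      # clamp labeled points
                g.append(y_known[i])
            else:                                 # weighted neighbor average
                g.append(sum(W[i][j] * f[j] for j in range(n)) / sum(W[i]))
        f = g
    return f

# Toy data from the slides: A, B labeled; C, D unlabeled.
X = [(1, 1, 1, 1), (1, -1, 1, -1), (0, 1, 1, 1), (0, -1, 1, 1)]
labeled = {0: 1, 1: -1}                           # A -> red, B -> blue
soft = propagate(X, labeled)
# Promote the unlabeled point with the largest |soft label|.
best = max((i for i in range(len(X)) if i not in labeled),
           key=lambda i: abs(soft[i]))
labeled[best] = 1 if soft[best] > 0 else -1
print(best, labeled[best])   # C (index 2) is promoted as red (+1)
```

After the promotion, supervised feature selection would be re-run on the enlarged labeled set, and the loop repeats until no confident points remain.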
Experiments setup • Data sets • Handwritten digit recognition • Biomedical and gene expression data • Text documents [Reuters-21578] • Compared approaches • Supervised feature selection: SFFS • Semi-supervised approach: sSelect [SDM07]
Conclusions • Labeled information → critical features and better confidence estimates • Unlabeled data → improves the chosen feature set • Flexible: can incorporate many feature selection methods that aim at revealing the relationship between data points.