Graph-based Iterative Hybrid Feature Selection
Erheng Zhong† Sihong Xie† Wei Fan‡ Jiangtao Ren† Jing Peng# Kun Zhang$
†Sun Yat-sen University ‡IBM T. J. Watson Research Center #Montclair State University $Xavier University of Louisiana
Where we are • Supervised feature selection • Unsupervised feature selection • Semi-supervised feature selection • Hybrid: supervised selection to include key features, then improved with a semi-supervised approach
Supervised Feature Selection With biased labeled samples, only feature 2 will be selected even though feature 1 is also useful: the sample selection bias problem.
Toy example (1) Labeled data: A(1,1,1,1; red), B(1,-1,1,-1; blue). Unlabeled data: C(0,1,1,1; red), D(0,-1,1,1; red). Based on A and B alone, both features 2 and 4 are correlated with the class, so both are selected by supervised feature selection.
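The supervised step of this toy example can be reproduced with a simple per-feature correlation score computed on the two labeled points only. This is an illustrative sketch, not the paper's exact scoring function; the 0.5 selection threshold is an assumption for this example.

```python
# Toy example from the slides: score each feature on the labeled data
# A and B only. Feature indices are 1-based to match the slides;
# class "red" is encoded as +1 and "blue" as -1.
X_labeled = [(1, 1, 1, 1),    # A, red (+1)
             (1, -1, 1, -1)]  # B, blue (-1)
y = [1, -1]

def corr_score(values, labels):
    """Absolute Pearson correlation; 0 when the feature is constant."""
    n = len(values)
    mv = sum(values) / n
    ml = sum(labels) / n
    cov = sum((v - mv) * (l - ml) for v, l in zip(values, labels))
    var_v = sum((v - mv) ** 2 for v in values)
    var_l = sum((l - ml) ** 2 for l in labels)
    if var_v == 0 or var_l == 0:
        return 0.0
    return abs(cov / (var_v ** 0.5 * var_l ** 0.5))

scores = [corr_score([row[j] for row in X_labeled], y) for j in range(4)]
selected = [j + 1 for j, s in enumerate(scores) if s > 0.5]
print(selected)  # -> [2, 4]: features 2 and 4 correlate perfectly on A, B
```

Features 1 and 3 are constant on the labeled pair and score zero, which is exactly why a purely supervised criterion cannot see their value on the unlabeled points C and D.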
Toy example (2) A semi-supervised approach: spectral feature selection. Features are ranked by their smoothness between data points and their consistency with the label information. Feature 2 will be selected if only one feature is desired.
Solution: Hybrid • Labeled data insufficient → sample selection bias → supervised methods fail • Unlabeled data indistinct → data from different classes are not separated → semi-supervised methods fail
Properties of feature selection • In a high-dimensional feature space, the distance between any two examples is approximately the same. [Theorem 3.1] • Feature selection can obtain a more distinguishable distance measure, which leads to a better confidence estimate. [Theorem 3.2]
Theorems 3.1 and 3.2 • 3.1: As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor • 3.2: A more distinguishable similarity measure yields a better classification confidence matrix
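The distance-concentration effect behind Theorem 3.1 is easy to observe numerically. The sketch below, assuming i.i.d. uniform features (an illustrative assumption, not the theorem's exact conditions), shows the ratio of the farthest to the nearest pairwise distance shrinking toward 1 as dimensionality grows.

```python
# Numeric illustration of Theorem 3.1 (distance concentration):
# as the dimension grows, all pairwise distances become nearly equal.
import random

def ratio_max_min_dist(n_points, dim, seed=0):
    """Ratio of the largest to the smallest pairwise Euclidean distance
    among n_points random points in [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])) ** 0.5
            dists.append(d)
    return max(dists) / min(dists)

for dim in (2, 10, 100, 1000):
    # The ratio decreases toward 1 as dim increases.
    print(dim, round(ratio_max_min_dist(50, dim), 3))
```

This is why feature selection matters for the graph step: pruning to a discriminative subspace restores distance contrast, which Theorem 3.2 connects to better confidence estimates.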
Semi-supervised Feature Selection • Graph-based [label propagation] • Expand the labeled set by adding the unlabeled data whose predicted labels have high confidence (top s%). • Perform feature selection on the new labeled set.
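The expand-then-reselect loop above can be sketched on the toy data. This is a minimal, hypothetical version, not the paper's GBIHFS algorithm: the RBF graph weights, the clamped-label propagation, and promoting a single most-confident point (standing in for the top s%) are all illustrative assumptions.

```python
# Sketch of one iteration: propagate labels on a similarity graph,
# then promote the most confident unlabeled point into the labeled set.
import math

def rbf(a, b, sigma=1.0):
    """Gaussian similarity between two points."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def propagate(X, y_known, iters=50):
    """y_known: dict index -> +1/-1. Returns soft labels for all points."""
    n = len(X)
    W = [[rbf(X[i], X[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    f = [y_known.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        g = []
        for i in range(n):
            if i in y_known:                      # clamp labeled points
                g.append(y_known[i])
            else:                                 # weighted neighbor average
                g.append(sum(W[i][j] * f[j] for j in range(n)) / sum(W[i]))
        f = g
    return f

# Toy data from the slides: A, B labeled; C, D unlabeled.
X = [(1, 1, 1, 1), (1, -1, 1, -1), (0, 1, 1, 1), (0, -1, 1, 1)]
labeled = {0: 1, 1: -1}                           # A -> red, B -> blue
soft = propagate(X, labeled)
# Promote the unlabeled point with the largest |soft label|.
best = max((i for i in range(len(X)) if i not in labeled),
           key=lambda i: abs(soft[i]))
labeled[best] = 1 if soft[best] > 0 else -1
print(best, labeled[best])   # C (index 2) is promoted as red (+1)
```

After the promotion, supervised feature selection would be re-run on the enlarged labeled set, and the loop repeats until no confident points remain.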
Experiments setup • Data sets • Handwritten digit recognition • Biomedical and gene expression data • Text documents [Reuters-21578] • Compared approaches • Supervised feature selection: SFFS • Semi-supervised approach: sSelect [SDM07]
Conclusions • Labeled information → critical features and better confidence estimates • Unlabeled data → improves the chosen feature set • Flexible: can incorporate many feature selection methods that aim at revealing the relationship between data points.