Dual Active Feature and Sample Selection for Graph Classification Xiangnan Kong1, Wei Fan2, Philip S. Yu1 1 Department of Computer Science University of Illinois at Chicago 2 IBM T. J. Watson Research KDD 2011
Graph Classification • Traditional classification: input = feature vector x, output = class label • Graph classification: input = graph object, output = class label
Cheminformatics: Drug Discovery • Graph object: chemical compound • +/- label: anti-cancer activity • Training data: compounds with known labels; testing data: unlabeled compounds (?)
Applications • System call graphs: label = normal software or virus? • Program flows: label = error? • XML documents: label = category
Graph Classification • Given a set of graph objects with class labels, predict the labels of unlabeled graphs • Challenges: complex structure, no explicit feature vectors • Approach: subgraph feature mining
Subgraph Features • How to extract a set of subgraph features for graph classification? • Mine subgraph patterns (F1, F2, F3, …) and represent each graph object Gi as a binary feature vector xi indicating which subgraphs it contains (e.g., x1 = [1, 0, 1], x2 = [0, 1, 1]) • A standard classifier is then trained on the feature vectors
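The binary indicator representation can be sketched as follows. This is a simplification for illustration: graphs and subgraph features are stored as sets of labeled edges, and edge-set containment stands in for a full subgraph-isomorphism test (which is NP-hard in general); all graph data here is hypothetical.

```python
# Sketch: binary subgraph-indicator feature vectors (simplified).
# Assumption: edge-set containment approximates subgraph isomorphism,
# which is sufficient for this toy illustration.

def edge_set(edges):
    """Normalize a list of (node_a, node_b) labeled edges into a frozenset."""
    return frozenset(frozenset(e) for e in edges)

def feature_vector(graph, features):
    """1 if all of the feature's edges occur in the graph, else 0."""
    return [1 if f <= graph else 0 for f in features]

# Toy graphs: chemical-compound-like labeled edges (hypothetical data).
g1 = edge_set([("C1", "C2"), ("C2", "O1"), ("C2", "N1")])
g2 = edge_set([("C1", "C2"), ("C1", "N1")])

# Two candidate subgraph features.
f1 = edge_set([("C2", "O1")])  # a C-O bond
f2 = edge_set([("C1", "N1")])  # a C-N bond

x1 = feature_vector(g1, [f1, f2])
x2 = feature_vector(g2, [f1, f2])
print(x1, x2)  # -> [1, 0] [0, 1]
```

Each graph becomes a fixed-length 0/1 vector, so any off-the-shelf classifier can consume it.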
Subgraph Feature Selection • Existing methods mine discriminative subgraph features (e.g., F1, F2) for a graph classification task • Focused on supervised settings
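One common supervised criterion for "discriminative" is information gain, as in the IG baselines later in this deck. A minimal sketch on hypothetical binary feature columns:

```python
# Sketch: ranking binary subgraph features by information gain.
# The feature matrix and labels below are hypothetical toy data.
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(feature_col, labels):
    """Entropy reduction from splitting the labels on a 0/1 feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in (0, 1):
        subset = [y for x, y in zip(feature_col, labels) if x == v]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

# Rows: graphs; columns: subgraph features F1, F2 (binary indicators).
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = ["+", "+", "-", "-"]

cols = list(zip(*X))
scores = [info_gain(list(c), y) for c in cols]
print(scores)  # -> [1.0, 0.0]: F1 perfectly separates +/-, F2 does not
```

Features are then ranked by score and the top-k kept. Note this requires labels, which is exactly the bottleneck the next slide raises.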
Labeling Cost • Supervised settings require a large number of labeled graphs, but labeling is expensive • In practice we can only afford to label a few graph objects, which hurts both feature selection and classification accuracy
Active Sample Selection • Given a pool of candidate graph samples, select the most important graph to query its label
Two Parts of the Problem • Active sample selection: select the most important graph in the pool to query its label • Subgraph feature selection: select the features relevant to the classification task • The two parts are correlated!
Active Sample Selection • Challenges: no feature representation is available beforehand, and subgraph enumeration is NP-hard • A good query should be both informative and representative
Active Sample Selection • The view of the graph samples depends on which subgraph features are used
Example • Under subgraph features F1 and F2, graphs G1 and G2 appear very similar
Example • Under a different set of subgraph features, the same graphs G1 and G2 appear very different
Subgraph Feature Selection • Feature selection view: over the subgraph features • Active sample selection view: over the graph objects
Dual Active Feature and Sample Selection • Perform active sample selection and feature selection simultaneously • Loop: select a graph from the unlabeled pool, query and label it, add it to the labeled graphs, then update the selected subgraph features
gActive Method • Max-min active sample selection: maximize the reward for querying a graph under the worst case, i.e., minimize over the unknown label (+/-) of the queried graph, then maximize over candidate graphs
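The max-min structure can be sketched independently of the exact reward. In gActive the reward is a dependence-maximization criterion; the `reward` function below is a hypothetical stand-in so that only the max-min selection logic is shown.

```python
# Sketch: max-min active sample selection. `reward` is a hypothetical
# placeholder, not the paper's actual dependence-maximization reward.

def max_min_query(candidates, reward):
    """Pick the candidate maximizing its worst-case reward over labels."""
    def worst_case(g):
        # The queried graph's label is unknown, so assume the worse outcome.
        return min(reward(g, +1), reward(g, -1))
    return max(candidates, key=worst_case)

# Hypothetical reward: agreement between a pool score and the assumed label.
pool_scores = {"g1": 0.9, "g2": 0.1, "g3": 0.5}

def reward(g, label):
    return label * (pool_scores[g] - 0.5)

best = max_min_query(["g1", "g2", "g3"], reward)
print(best)  # -> "g3": its worst-case reward is the highest
```

Graphs whose reward is high under both possible labels are preferred, which makes the selection robust to the label being unknown at query time.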
gActive Method • Max-min active sample selection: maximize the worst-case reward • Dependence maximization: graphs' features should match their labels • Informative: the queried graph is far away from the labeled graphs • Representative: the queried graph is close to the unlabeled graphs • Feature selection: maximize a utility function over subgraph features
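The informative and representative criteria can be combined into a single query score, assuming graphs are already embedded as subgraph-feature vectors. The weighting and Euclidean distance below are hypothetical illustration choices, not the paper's exact formulation.

```python
# Sketch: scoring a candidate query by informativeness (far from labeled
# graphs) plus representativeness (close to the unlabeled pool).
# All vectors and the weight alpha are hypothetical.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def query_score(x, labeled, unlabeled, alpha=0.5):
    # Informative: distance to the nearest labeled graph.
    informative = min(dist(x, l) for l in labeled)
    # Representative: negative average distance to the unlabeled pool.
    representative = -sum(dist(x, u) for u in unlabeled) / len(unlabeled)
    return alpha * informative + (1 - alpha) * representative

labeled = [[0, 0]]
pool = [[1, 0], [2, 0], [4, 0]]
scores = [query_score(x, labeled, pool) for x in pool]
print(scores)  # each score trades off the two criteria
```

Candidates near already-labeled graphs score low on informativeness, while outliers far from the rest of the pool score low on representativeness.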
More details in the paper: branch-and-bound subgraph mining to speed up the search
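The branch-and-bound idea generalizes beyond graphs: if an upper bound on the score of every extension of a pattern falls below the best score found so far, the whole branch can be skipped. A toy sketch on itemset-style patterns (real subgraph enumeration, e.g. gSpan-style, is more involved; the database, score, and bound here are hypothetical):

```python
# Sketch: branch-and-bound pruning in pattern search. Support can only
# shrink as a pattern grows, which yields a valid upper bound.

DB = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
ITEMS = ["a", "b", "c"]

def support(pattern):
    return sum(1 for t in DB if pattern <= t)

def score(pattern):
    # Hypothetical quality measure: support weighted by pattern size.
    return support(pattern) * len(pattern)

def search(pattern, start, best):
    """Depth-first pattern search with branch-and-bound pruning."""
    explored = 1  # count this node
    for i in range(start, len(ITEMS)):
        child = pattern | {ITEMS[i]}
        # Upper bound for any extension: support never grows, and
        # pattern size cannot exceed len(ITEMS).
        bound = support(child) * len(ITEMS)
        if bound <= best[1]:
            continue  # prune the entire branch below `child`
        if score(child) > best[1]:
            best[0], best[1] = child, score(child)
        explored += search(child, i + 1, best)
    return explored

best = [set(), 0]
nodes = search(set(), 0, best)
print(best, nodes)  # -> [{'a', 'b'}, 4] and 7 nodes (of 8) explored
```

The pruning is safe because the bound is anti-monotone: no extension of a pruned pattern can ever beat the incumbent best score.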
Experiments: Data Sets • Anti-cancer activity datasets (NCI & AIDS) • Graph: chemical compound; label: anti-cancer activity • Each dataset is balanced with 500 positive + 500 negative samples
Experiments: Compared Methods • Freq. + Random: unsupervised feature selection (frequent subgraphs) + random query • Freq. + Margin: frequent subgraphs + query closest to the margin • Freq. + TED: frequent subgraphs + transductive experimental design • IG + Random: supervised feature selection (information gain) + random query • IG + Margin: information gain + query closest to the margin • gActive: the dual active feature and sample selection method proposed in this paper
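The "Margin" baselines query the unlabeled graph closest to the classifier's decision boundary (uncertainty sampling). A minimal sketch, where the linear scorer and feature vectors are hypothetical stand-ins for a trained graph classifier:

```python
# Sketch: margin-based query selection -- pick the pool item whose
# decision value is closest to zero. Model weights are hypothetical.

def decision_value(x, w, b=0.0):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_query(pool, w):
    """Return the pool index with the smallest |decision value|."""
    return min(range(len(pool)), key=lambda i: abs(decision_value(pool[i], w)))

w = [1.0, -1.0]                      # hypothetical linear model
pool = [[1, 0], [0, 1], [0.6, 0.5]]  # subgraph-feature vectors
print(margin_query(pool, w))         # -> 2: |0.1| is the smallest margin
```

Unlike gActive, this heuristic treats the feature set as fixed and ignores how informative the query is for feature selection itself.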
Experiment Results (NCI-47, #features = 200) • Accuracy (higher is better) vs. number of queried graphs • Compared: gActive, IG + Random, IG + Margin, Freq. + TED, Freq. + Margin, Freq. + Random • With few queried graphs, supervised feature selection underperforms unsupervised selection; with more queries the ordering reverses
Experiment Results • gActive wins consistently
Conclusions • Dual active feature and sample selection for graph classification • Performs subgraph feature selection and active sample selection simultaneously • Future work: other data types and applications, e.g., itemset and sequence data Thank you!