Multi-Label Feature Selection for Graph Classification • Xiangnan Kong, Philip S. Yu • Department of Computer Science, University of Illinois at Chicago
Outline • Introduction • Multi-Label Feature Selection for Graph Classification • Experiments • Conclusion
Introduction: Graph Data • Conventional data mining and machine learning approaches assume that data are represented as feature vectors, e.g. (x1, x2, …, xd) with label y • In many real applications, data are not directly represented as feature vectors but as graphs with complex structures, e.g. G(V, E, l) with label y • Examples: chemical compounds, program flows, XML documents
Introduction: Graph Classification • Graph classification: construct a classification model for graph data • Example: drug activity prediction • Given a set of chemical compounds labeled with their activities against one type of disease or virus • Predict active / inactive for a testing compound • [Figure: training graphs labeled + or -, and a testing graph with unknown label ?]
Graph Classification using Subgraph Features • [Figure: graph objects G1, G2, … are mapped to binary feature vectors x1 = (1, 0, 1, …), x2 = (0, 1, 1, …) over subgraph patterns g1, g2, g3, …, and the feature vectors are passed to a classifier] • How to find a set of subgraph features in order to effectively perform graph classification?
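The mapping from graphs to feature vectors can be sketched as follows. This is a minimal illustration, assuming each graph has already been reduced to the set of subgraph patterns it contains (a real system needs subgraph-isomorphism tests for that step). The function name `to_feature_vectors` and the toy data are hypothetical.

```python
# Hypothetical toy data: each graph is summarized by the set of subgraph
# patterns it contains. Deciding containment is assumed to be done elsewhere.
def to_feature_vectors(graph_patterns, selected):
    """Map each graph to a 0/1 vector over the selected subgraph patterns."""
    return [[1 if g in patterns else 0 for g in selected]
            for patterns in graph_patterns]

graphs = [{"g1", "g3"}, {"g2", "g3"}]   # patterns contained in G1 and G2
features = ["g1", "g2", "g3"]           # selected subgraph features
X = to_feature_vectors(graphs, features)
print(X)
```

Each row of `X` can then be handed to any vector-based classifier, which is why the choice of `features` matters so much.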
Existing Methods for Subgraph Feature Selection • Feature selection for graph classification: find a set of useful subgraph features for classification • Existing methods • Select discriminative subgraph features • Focus on single-label settings • Assume one graph can only have one label • [Figure: graphs with a single label (e.g. + Lung Cancer) and the useful subgraphs selected for that label]
Multi-Label Graphs • In many real applications, one graph can have multiple labels • Example: anti-cancer drug prediction, where one compound (graph) carries the labels + Breast Cancer, - Lung Cancer, + Melanoma
Multi-Label Graphs • Other Applications: • XML Document Classification • (One document -> multiple tags) • Program flow error detection • (One program -> multiple types of errors) • Kinase Inhibitor Discovery • (One chemical -> multiple types of kinase) • …
Multi-Label Feature Selection for Graph Classification • Find useful subgraph features for graphs with multiple labels • [Figure: multi-label graphs, then subgraph features chosen by an evaluation criterion F(p), then multi-label classification]
Two Key Questions to Address • Evaluation: How to evaluate a set of subgraph features using multiple labels of the graphs? (effective) • Search Space Pruning: How to prune the subgraph search space using multiple labels of the graphs? (efficient)
What is a good feature? • Dependence maximization: maximize the dependence between the features and the multiple labels of graphs • Assumption: graphs with similar label sets should have similar features • [Figure: example graphs and two candidate features, f1 and f2]
Dependence Measure • Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al. 05] • Evaluates the dependence between input feature vectors and label vectors in kernel space • The empirical estimate is easy to calculate: $\mathrm{HSIC} = \operatorname{tr}(K_S H L H)$, up to a constant normalization factor • $K_S$: kernel matrix for graphs, computed using the common subgraph features in $S$ • $K_S[i, j]$: measures the similarity between graphs $i$ and $j$ on the common subgraph features they contain (in $S$) • $L$: kernel matrix for label vectors, computed using label vectors in $\{0,1\}^Q$ • $L[i, j]$: measures the similarity between the label sets of graph $i$ and graph $j$ • $H = I - \mathbf{1}\mathbf{1}^T/n$: centering matrix
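The empirical estimate can be sketched in a few lines. This is a minimal illustration, assuming linear kernels for both features and labels and dropping the constant normalization factor; the helper names and the toy data are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def linear_kernel(X):
    """Gram matrix of inner products between row vectors (assumed kernel)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in X] for u in X]

def hsic(K, L):
    """Unnormalized empirical HSIC: tr(K H L H) with H = I - 11^T/n."""
    n = len(K)
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    KHLH = matmul(matmul(matmul(K, H), L), H)
    return sum(KHLH[i][i] for i in range(n))

# Hypothetical toy data: 4 graphs, Q = 2 labels.
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
L = linear_kernel(Y)
aligned = linear_kernel([[1], [1], [0], [0]])    # feature follows the labels
constant = linear_kernel([[1], [1], [1], [1]])   # feature present in every graph
print(hsic(aligned, L), hsic(constant, L))
```

A feature that occurs in every graph carries no information about the labels, so centering drives its score to zero, while the label-aligned feature scores strictly higher.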
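Optimization -> gHSIC Criterion • Objective: maximize dependence (HSIC) • With $K_S = \sum_{g_i \in S} \mathbf{f}_i \mathbf{f}_i^T$, HSIC becomes the sum over all selected features: $\operatorname{tr}(K_S H L H) = \sum_{g_i \in S} \mathbf{f}_i^T H L H \mathbf{f}_i$ • gHSIC score: $q(g_i) = \mathbf{f}_i^T H L H \mathbf{f}_i$, where $\mathbf{f}_i \in \{0,1\}^n$ represents the i-th subgraph feature (entry $j$ is 1 if graph $j$ contains $g_i$) • [Figure: example subgraphs marked as good and bad according to their gHSIC scores]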
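Because the objective is a sum over selected features, maximizing it amounts to scoring each feature separately and keeping the best ones. The sketch below assumes the per-feature score takes the form f^T H L H f, which is what tr(K_S H L H) reduces to when K_S is a linear kernel over 0/1 subgraph indicators. The names `ghsic_score` and `select_top` and the toy data are hypothetical.

```python
def ghsic_score(f, L):
    """Assumed per-feature score f^T H L H f, computed as (Hf)^T L (Hf)."""
    n = len(f)
    mean = sum(f) / n
    c = [v - mean for v in f]   # c = H f, i.e. the centered indicator vector
    return sum(c[i] * L[i][j] * c[j] for i in range(n) for j in range(n))

def select_top(features, L, t):
    """Return the names of the t features with the highest scores."""
    ranked = sorted(features, key=lambda name: ghsic_score(features[name], L),
                    reverse=True)
    return ranked[:t]

# Hypothetical toy data: label kernel for 4 graphs with label sets
# {1}, {1}, {2}, {2}, and three candidate subgraph indicators.
L = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
features = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0], "c": [1, 1, 1, 0]}
top2 = select_top(features, L, 2)
print(top2)
```

Scoring features one at a time is what makes the later branch-and-bound search possible, since a bound is only needed for a single pattern's score.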
Two Key Questions to Address • How to evaluate a set of subgraph features with multiple labels of the graphs? (effective) • How to prune the subgraph search space using multiple labels of the graphs? (efficient)
Finding a Needle in a Haystack • gSpan [Yan & Han, ICDM'02] • An efficient algorithm to enumerate all frequent subgraph patterns (frequency ≥ min_support) • [Figure: pattern search tree rooted at the empty pattern ⊥, with levels for 0-edge, 1-edge, 2-edge, … patterns; branches that are not frequent are cut] • Too many frequent subgraph patterns • Find the most useful one(s) using multiple labels • How to find the best node(s) in this tree without searching all the nodes? (Branch and bound to prune the search space)
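The frequency threshold itself is simple to illustrate. The sketch below is not gSpan: it assumes the patterns contained in each graph are already known and only shows the `min_support` filter. The function name and the toy pattern labels are hypothetical.

```python
def frequent_patterns(graph_patterns, min_support):
    """Count in how many graphs each pattern occurs; keep those >= min_support."""
    counts = {}
    for patterns in graph_patterns:
        for p in patterns:
            counts[p] = counts.get(p, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_support}

# Hypothetical toy data: patterns (written as edge labels) found in 3 graphs.
graphs = [{"C-C", "C-N"}, {"C-C"}, {"C-C", "C-O"}]
frequent = frequent_patterns(graphs, min_support=2)
print(frequent)
```

Even with this filter the number of frequent patterns can be very large, which motivates pruning with a label-aware bound rather than frequency alone.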
gHSIC Upper Bound • gHSIC: the score of the i-th subgraph feature • gHSIC-UB: an upper bound on the gHSIC scores of all supergraphs of the current subgraph • Anti-monotonic with subgraph frequency • → Pruning
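The slide text does not spell out the bound, so the sketch below uses one bound with the stated properties as an assumption. Write M = H L H and let M_hat keep only the positive entries of M. A supergraph occurs in a subset of the graphs that contain the current subgraph, so its indicator f' satisfies f' ≤ f elementwise, and therefore f'^T M f' ≤ f'^T M_hat f' ≤ f^T M_hat f. All function names and the toy data are hypothetical.

```python
from itertools import product

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def centered_label_kernel(Y):
    """M = H L H with L = Y Y^T (assumed linear kernel) and H = I - 11^T/n."""
    n = len(Y)
    L = [[sum(a * b for a, b in zip(u, v)) for v in Y] for u in Y]
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    return matmul(matmul(H, L), H)

def quad(f, M):
    """Quadratic form f^T M f."""
    n = len(f)
    return sum(f[i] * M[i][j] * f[j] for i in range(n) for j in range(n))

def ghsic_ub(f, M):
    """Assumed bound: f^T M_hat f, where M_hat = max(0, M) entrywise."""
    M_hat = [[max(0.0, m) for m in row] for row in M]
    return quad(f, M_hat)

Y = [[1, 0], [1, 0], [0, 1], [0, 1]]   # hypothetical labels for 4 graphs
M = centered_label_kernel(Y)
f = [1, 1, 1, 0]                       # graphs containing the current subgraph
bound = ghsic_ub(f, M)
# Every supergraph occurs in a subset of these graphs, so check all subsets.
holds = all(quad([a * b for a, b in zip(f, mask)], M) <= bound + 1e-9
            for mask in product([0, 1], repeat=len(f)))
print(bound, holds)
```

The bound can only shrink as the support shrinks, because M_hat has no negative entries, which is the anti-monotonic behavior the pruning step relies on.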
Pruning Principle • [Figure: gHSIC pattern search tree showing the best subgraph so far with its best score, and the current node with its current score and the upper bound for its sub-tree] • If best score ≥ upper bound, we can prune the entire sub-tree
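The pruning rule can be sketched as a depth-first search over a small pattern tree. This is an illustration only: the tree, the node layout (`name`, `f`, `children`), and the matrix `M` (standing in for H L H) are hypothetical, and the bound f^T max(0, M) f is an assumed choice rather than a formula taken from the slides.

```python
def quad(f, M):
    """Quadratic form f^T M f."""
    n = len(f)
    return sum(f[i] * M[i][j] * f[j] for i in range(n) for j in range(n))

def branch_and_bound(root, M):
    """Return the best-scoring node and the list of nodes actually visited."""
    M_hat = [[max(0.0, m) for m in row] for row in M]   # assumed bound matrix
    best = {"name": None, "score": float("-inf")}
    visited = []

    def visit(node):
        visited.append(node["name"])
        score = quad(node["f"], M)
        if score > best["score"]:
            best.update(name=node["name"], score=score)
        if best["score"] >= quad(node["f"], M_hat):
            return   # no supergraph in this sub-tree can beat the best score
        for child in node["children"]:
            visit(child)

    visit(root)
    return best, visited

# Hypothetical M = H L H for 4 graphs with label sets {1}, {1}, {2}, {2}.
M = [[0.5, 0.5, -0.5, -0.5], [0.5, 0.5, -0.5, -0.5],
     [-0.5, -0.5, 0.5, 0.5], [-0.5, -0.5, 0.5, 0.5]]
# Toy search tree: each child occurs in a subset of its parent's graphs.
tree = {"name": "root", "f": [1, 1, 1, 1], "children": [
    {"name": "a", "f": [1, 1, 0, 0], "children": [
        {"name": "a-x", "f": [1, 0, 0, 0], "children": []}]},
    {"name": "b", "f": [1, 0, 1, 0], "children": [
        {"name": "b-x", "f": [0, 0, 1, 0], "children": []}]}]}
best, visited = branch_and_bound(tree, M)
print(best, visited)
```

In this toy run the children of `a` and `b` are never expanded, because the best score found so far already meets or exceeds the bound of each sub-tree.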
Experiment Setup • Four methods are compared: • Multi-label feature selection + multi-label classification: gMLC [this paper] + BoosTexter [Schapire & Singer 00] • Multi-label feature selection + binary classification: gMLC [this paper] + BR-SVM [Boutell et al. 04] (Binary Relevance) • Single-label feature selection + binary classification: BR (Binary Relevance) + Information Gain + SVM • Top-k frequent subgraphs + multi-label classification: gSpan [Yan & Han 02] + BoosTexter [Schapire & Singer 00]
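Binary Relevance, used by two of the baselines, turns one multi-label problem into one binary problem per label. The sketch below shows only that decomposition step, with a hypothetical function name and toy label matrix; training the per-label classifiers (SVMs in the setup above) is left out.

```python
def binary_relevance(Y):
    """Split an n x Q label matrix into Q binary target vectors, one per label."""
    return [list(column) for column in zip(*Y)]

# Hypothetical toy data: 2 graphs, Q = 3 labels.
Y = [[1, 0, 1],
     [0, 0, 1]]
tasks = binary_relevance(Y)
print(tasks)
```

Each entry of `tasks` is the target vector for an independent binary classifier, so any correlation between labels is ignored by this baseline.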
Data Sets • Three multi-label graph classification tasks: • Anti-cancer activity prediction • Toxicology prediction of chemical compounds • Kinase inhibitor prediction
Evaluation • Multi-label metrics [Elisseeff & Weston, NIPS'02] • Ranking Loss ↓: average fraction of label pairs that are ranked incorrectly; the smaller the better • Average Precision ↑: average fraction of correct labels among the top-ranked labels; the larger the better • 10 times 10-fold cross-validation
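Both metrics can be sketched for a single test graph as follows, using their common definitions. The sketch assumes the classifier outputs one real-valued score per label and that the graph has at least one relevant and one irrelevant label. In an evaluation, the per-graph values would be averaged over all test graphs. Function names and toy scores are hypothetical.

```python
def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs that are ordered wrongly."""
    rel = [s for s, r in zip(scores, relevant) if r]
    irr = [s for s, r in zip(scores, relevant) if not r]
    wrong = sum(1 for a in rel for b in irr if a <= b)
    return wrong / (len(rel) * len(irr))

def average_precision(scores, relevant):
    """Mean, over relevant labels, of the precision at that label's rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical scores for Q = 3 labels; labels 0 and 2 are the true ones.
scores = [0.9, 0.8, 0.1]
relevant = [1, 0, 1]
loss = ranking_loss(scores, relevant)
avg_prec = average_precision(scores, relevant)
print(loss, avg_prec)
```

Here one of the two (relevant, irrelevant) pairs is misordered, because the irrelevant label outranks a relevant one, and that same mistake lowers the average precision below 1.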
Experiment Results • [Figure: Ranking Loss and 1 - AvePrec on the Anti-Cancer, PTC, and Kinase Inhibition datasets]
Experiment Results: Anti-Cancer Dataset • Our approach with a multi-label classifier performed best on the NCI and PTC datasets • [Figure: Ranking Loss (lower is better) vs. # selected features for four methods: Single-Label FS + Single-label Classifiers; Multi-Label FS + Single-label Classifiers; Unsupervised FS + Multi-label Classifier; Multi-Label FS + Multi-label Classifier]
Pruning Results • [Figure: running time and # subgraphs explored]
Pruning Results • [Figure: running time in seconds (lower is better), with gHSIC pruning vs. without gHSIC pruning, anti-cancer dataset]
Pruning Results • [Figure: # subgraphs explored (lower is better), with gHSIC pruning vs. without gHSIC pruning, anti-cancer dataset]
Conclusions • Multi-label feature selection for graph classification • Evaluating subgraph features using multiple labels of the graphs (effective) • Branch-and-bound pruning of the search space using multiple labels of the graphs (efficient) • Thank you!