
Semi-Supervised Feature Selection for Graph Classification

Xiangnan Kong, Philip S. Yu. Semi-Supervised Feature Selection for Graph Classification. Department of Computer Science, University of Illinois at Chicago. KDD 2010.





Presentation Transcript


  1. Semi-Supervised Feature Selection for Graph Classification • Xiangnan Kong, Philip S. Yu • Department of Computer Science, University of Illinois at Chicago • KDD 2010

  2. Graph Classification - why should we care? • Conventional data mining and machine learning approaches assume that data are represented as feature vectors, e.g. (x1, x2, …, xd) → y. • In many real applications, data are not directly represented as feature vectors but as graphs with complex structures, e.g. G(V, E, l) → y. • Examples: chemical compounds, program flows, XML documents.

  3. Example: Graph Classification • Drug activity prediction problem • Given a set of chemical compounds labeled with activities • Predict the activities of testing molecules • The same setting also applies to program flows and XML documents [Figure: training graphs labeled + or -, and a testing graph labeled ?]

  4. Subgraph-based Graph Classification • How to mine a set of subgraph patterns in order to effectively perform graph classification? [Figure: graph objects (G1, G2, …) are mapped to binary feature vectors (x1, x2, …) indicating which subgraph patterns (g1, g2, g3, …) each graph contains; the feature vectors are then passed to a classifier]
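The mapping from graph objects to binary feature vectors on this slide can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the graph representation, the helper names (`make_graph`, `contains_subgraph`, `feature_vector`), and the toy molecules are not from the presentation, and the brute-force matching is only practical for very small graphs.

```python
from itertools import permutations

def make_graph(node_labels, edge_list):
    """Hypothetical representation: (labels dict, set of undirected edges)."""
    return dict(node_labels), {frozenset(e) for e in edge_list}

def contains_subgraph(graph, pattern):
    """Brute-force check that `pattern` occurs in `graph` as a labeled subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern nodes to graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[n] != g_labels[mapping[n]] for n in p_nodes):
            continue  # node labels do not match
        if all(frozenset(mapping[n] for n in e) in g_edges for e in p_edges):
            return True  # every pattern edge is present in the graph
    return False

def feature_vector(graph, patterns):
    """x[k] = 1 if the graph contains the k-th subgraph pattern, else 0."""
    return [1 if contains_subgraph(graph, p) else 0 for p in patterns]

# Toy graphs (illustrative only).
G1 = make_graph({0: "C", 1: "C", 2: "O", 3: "N"}, [(0, 1), (1, 2), (0, 3)])
G2 = make_graph({0: "C", 1: "N", 2: "O"}, [(0, 1), (1, 2)])

# Toy patterns: g1 = C-C, g2 = C-O, g3 = N-O.
patterns = [
    make_graph({0: "C", 1: "C"}, [(0, 1)]),
    make_graph({0: "C", 1: "O"}, [(0, 1)]),
    make_graph({0: "N", 1: "O"}, [(0, 1)]),
]

x1 = feature_vector(G1, patterns)
x2 = feature_vector(G2, patterns)
```

The resulting vectors `x1` and `x2` can be handed to any conventional classifier, which is why the choice of subgraph patterns matters so much in this setting.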

  5. Conventional Methods - Two Components • Evaluation (effective): is a subgraph feature relevant to graph classification? • Search space pruning (efficient): how to avoid enumerating all subgraph features? [Figure: labeled graphs (+/-) and the discriminative subgraphs mined from them]

  6. One Problem • Supervised settings require a large set of labeled training graphs • However, labeling a graph is hard! [Figure: labeled graphs (+/-) and the discriminative subgraphs mined from them]

  7. Lack of labels -> problems • Supervised methods: • Is evaluation effective? It requires a large amount of label information • Is search space pruning efficient? Pruning performance relies on a large amount of label information [Figure: labeled graphs (+/-) and the discriminative subgraphs mined from them]

  8. Semi-Supervised Feature Selection for Graph Classification • Mine useful subgraph patterns using labeled and unlabeled graphs [Figure: labeled graphs (+/-) and unlabeled graphs (?) feed an evaluation criterion F(p) that selects subgraph patterns, which are then used for classification]

  9. Two Key Questions to Address • Evaluation: How to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective) • Search Space Pruning: How to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

  10. What is a good feature? • Cannot-Link: graphs in different classes should be far away • Must-Link: graphs in the same class should be close • Separability: unlabeled graphs should be separable from each other [Figure: labeled (+/-) and unlabeled graphs illustrating the three properties]

  11. Optimization • Evaluation function: [equation not captured in the transcript] • Cannot-Link: graphs in different classes should be far away • Must-Link: graphs in the same class should be close • Separability: unlabeled graphs should be separable from each other
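The evaluation function itself did not survive in this transcript, so the following is only a sketch of one plausible way to combine the three properties, not the authors' formula. It assumes squared Euclidean distance over the selected features, a reward weighted by α for cannot-link pairs, a penalty weighted by β for must-link pairs, and an unweighted reward for pairs of unlabeled graphs. The function names and toy data are hypothetical.

```python
from itertools import combinations

def sq_dist(x, y, selected):
    """Squared distance between two feature vectors, restricted to `selected`."""
    return sum((x[k] - y[k]) ** 2 for k in selected)

def evaluate(labeled, unlabeled, selected, alpha=1.0, beta=0.1):
    """Assumed score: higher is better.

    labeled   -- list of (feature_vector, label) with label in {+1, -1}
    unlabeled -- list of feature vectors
    selected  -- indices of the subgraph features being evaluated
    """
    score = 0.0
    for (xi, yi), (xj, yj) in combinations(labeled, 2):
        d = sq_dist(xi, xj, selected)
        if yi != yj:
            score += alpha * d   # cannot-link: different classes far apart
        else:
            score -= beta * d    # must-link: same class close together
    for xi, xj in combinations(unlabeled, 2):
        score += sq_dist(xi, xj, selected)  # separability of unlabeled graphs
    return score

# Toy data (illustrative only): two candidate features, indices 0 and 1.
labeled = [([1, 0], +1), ([1, 1], +1), ([0, 1], -1)]
unlabeled = [[1, 0], [0, 0]]

score_f0 = evaluate(labeled, unlabeled, selected=[0])
score_f1 = evaluate(labeled, unlabeled, selected=[1])
```

On this toy data, feature 0 separates the two classes and the unlabeled graphs, while feature 1 splits the positive class, so feature 0 scores higher under the assumed weighting.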

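  12. Evaluation: gSemi Criterion • In matrix form: [equation not captured in the transcript] • gSemi score of a feature set: the sum over all selected features • gSemi score of the k-th subgraph feature: [equation not captured in the transcript] [Figure: example subgraph features marked good and bad]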
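Since the matrix-form equations are missing from this transcript, the sketch below shows only one assumed way such a per-feature score could be written: as a quadratic form f^T L f, where f is the feature's indicator vector over all graphs and L = D - W is the Laplacian of a pairwise weight matrix W. The weights (α for cannot-link pairs, -β for must-link pairs, 1 for pairs of unlabeled graphs) are an assumption, as are all function names. For a symmetric W, f^T L f equals the sum over pairs i < j of W[i][j] * (f[i] - f[j])^2, so the score of a feature set is the sum of its per-feature scores.

```python
def laplacian(labels, alpha=1.0, beta=0.1):
    """Build L = D - W from graph labels (+1, -1, or None for unlabeled)."""
    n = len(labels)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            li, lj = labels[i], labels[j]
            if li is None and lj is None:
                W[i][j] = 1.0                          # unlabeled pair
            elif li is not None and lj is not None:
                W[i][j] = alpha if li != lj else -beta  # cannot-link / must-link
    # Diagonal holds the row sums of W; off-diagonal entries are -W[i][j].
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def feature_score(f, L):
    """Quadratic form f^T L f for one feature's indicator vector f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def top_k(features, L, k):
    """Names of the k features with the highest individual scores."""
    ranked = sorted(features, key=lambda name: feature_score(features[name], L),
                    reverse=True)
    return ranked[:k]

# Toy data (illustrative only): three labeled graphs, two unlabeled graphs.
labels = [+1, +1, -1, None, None]
features = {
    "g1": [1, 1, 0, 1, 0],  # present in both positives and one unlabeled graph
    "g2": [0, 1, 1, 0, 0],  # splits the positive class
}
L = laplacian(labels)
best = top_k(features, L, 1)
```

Because the score decomposes over features, selecting the best set of k features reduces to picking the k highest-scoring individual features, which is what `top_k` does.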

  13. Experiment Results [Figure: results with #labeled graphs = 30, 50, and 70 on the MCF-3, NCI-H23, OVCAR-8, PTC-MM, and PTC-FM datasets]

  14. Experiment Results (#labeled graphs = 30, MCF-3) • Our approach performed best on the NCI and PTC datasets [Figure: accuracy (higher is better) vs. number of selected features for gSemi (semi-supervised), Information Gain (supervised), and Frequent (unsupervised)]

  15. Two Key Questions to Address • How to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective) • How to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

  16. Finding a Needle in a Haystack • gSpan [Yan et al., ICDM'02]: an efficient algorithm to enumerate all frequent subgraph patterns (frequency ≥ min_support) • Too many frequent subgraph patterns • Find the most useful one(s), not merely the frequent ones • How to find the best node(s) in the pattern search tree without searching all the nodes? (Branch and bound to prune the search space) [Figure: pattern search tree with levels of 0-edge, 1-edge, 2-edge, … patterns]

  17. Pruning Principle • Keep track of the best subgraph found so far and its gSemi score (best score so far) • At the current node of the pattern search tree, compute the current score and an upper bound on the gSemi score within its sub-tree • If best score ≥ upper bound, we can prune the entire sub-tree [Figure: pattern search tree showing the current node, its sub-tree, and the best subgraph so far]
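The pruning rule on this slide can be sketched as a generic branch-and-bound search over a tree. This is a toy illustration under stated assumptions: the `Node` class, the hand-assigned `bound` values, and the example tree are hypothetical, and the transcript does not say how the actual upper bound on the gSemi score is derived. The only requirement the sketch relies on is that `bound` is at least as large as every score in the node's sub-tree.

```python
class Node:
    """Hypothetical search-tree node: one subgraph pattern per node."""
    def __init__(self, name, score, bound, children=()):
        self.name = name          # pattern identifier
        self.score = score        # score of this pattern
        self.bound = bound        # upper bound on any score in this sub-tree
        self.children = list(children)

def search(root, prune=True):
    """Depth-first search for the best-scoring node.

    Returns ((best_score, best_name), number_of_nodes_explored).
    """
    best = None
    explored = 0
    stack = [root]
    while stack:
        node = stack.pop()
        explored += 1
        if best is None or node.score > best[0]:
            best = (node.score, node.name)
        if prune and best[0] >= node.bound:
            continue  # best score >= upper bound: skip the entire sub-tree
        stack.extend(reversed(node.children))  # visit children left to right
    return best, explored

# Toy tree (illustrative only); bounds are valid upper bounds by construction.
tree = Node("root", 0.0, 5.0, [
    Node("A", 3.0, 5.0, [Node("A1", 5.0, 5.0), Node("A2", 1.0, 1.0)]),
    Node("B", 2.0, 4.0, [Node("B1", 4.0, 4.0), Node("B2", 0.5, 0.5)]),
])

with_pruning = search(tree, prune=True)
without_pruning = search(tree, prune=False)
```

Both runs find the same best pattern, but the pruned search never descends below `B`, because the best score found under `A` already meets `B`'s upper bound. A tighter bound prunes more of the tree, while an invalid bound (one that underestimates a sub-tree) would make the search miss the optimum.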

  18. Pruning Results [Figure: running time in seconds (lower is better), with gSemi pruning vs. without gSemi pruning, MCF-3 dataset]

  19. Pruning Results [Figure: number of subgraphs explored (lower is better), with gSemi pruning vs. without gSemi pruning, MCF-3 dataset]

  20. Parameters • α weights the cannot-link constraints; β weights the must-link constraints • Best setting: α=1, β=0.1 [Figure: accuracy (higher is better) as a function of α and β on the MCF-3 dataset, #label=50, #feature=20; regions annotated as semi-supervised, close to supervised, and close to unsupervised]

  21. Conclusions • Semi-Supervised Feature Selection for Graph Classification • Evaluating subgraph features using both labeled and unlabeled graphs (effective) • Branch-and-bound pruning of the search space using labeled and unlabeled graphs (efficient) • Thank you!
