Graph Indexing: A Frequent Structure-based Approach

Graph Indexing: A Frequent Structure-based Approach • Advisor: Prof. 曾新穆 • Team members: 李彥寬, 洪世敏, 丁鏘巽, 黃冠霖, 詹博丞 • Date: 2013/11/14

Outline • Ch1 Introduction • Ch2 Preliminaries • Ch3 Frequent Fragment • Ch4 Discriminative Fragment • Ch5 gIndex • Ch6 Experimental Result • Improvement • Maintenance

Ch1 Introduction

Ch1 Introduction • The classical graph query problem • Given a graph database D = {g1, g2, …, gn} and a graph query q, find all the graphs gi in D of which q is a subgraph.

Ch1 Introduction • Build graph index • A path-based index is inefficient: there are too many paths.

Ch1 Introduction • Build graph index • A graph-based index is suitable: only one result.

Ch2 Preliminaries

Ch2 Preliminaries • The graph feature set is denoted by F. For any graph feature f ∈ F, Df is the set of graphs containing f, Df = {gi | f ⊆ gi, gi ∈ D}.

Ch2 Preliminaries • Query processing consists of two substeps: • (1) Search: compute the candidate query answer set, Cq = ∩ Df over all f with f ⊆ q and f ∈ F; each graph in Cq contains all of q's features in the feature set. Therefore, the query answer set Dq is a subset of Cq. • (2) Verification: check each graph g in Cq to verify whether q is really a subgraph of g.
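
The two substeps can be sketched in Python as follows. This is a minimal illustration, not the gIndex implementation: the graph encoding, the helper names (`make_graph`, `is_subgraph`, `search_and_verify`), the brute-force isomorphism test, and the toy database are all assumptions made for the example. Graph 4 is chosen so that it passes the search step (it contains both features) but fails verification.

```python
from itertools import permutations

def make_graph(vertex_labels, edges):
    """Toy labeled graph (assumed encoding): vertices 0..n-1, one shared edge label."""
    labels = dict(enumerate(vertex_labels))
    edge_labels = {frozenset(e): "-" for e in edges}
    return labels, edge_labels

def is_subgraph(q, g):
    """Brute-force subgraph isomorphism test; only usable for very small graphs."""
    q_labels, q_edges = q
    g_labels, g_edges = g
    q_nodes = list(q_labels)
    # Try every injective mapping of q's vertices onto g's vertices.
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(q_labels[u] != g_labels[m[u]] for u in q_nodes):
            continue
        if all(g_edges.get(frozenset(m[u] for u in e)) == lab
               for e, lab in q_edges.items()):
            return True
    return False

def search_and_verify(q, database, features):
    # Index: for each feature f, the id set Df of graphs containing f.
    index = [(f, {gid for gid, g in database.items() if is_subgraph(f, g)})
             for f in features]
    # (1) Search: intersect Df over all indexed features contained in q.
    candidates = set(database)
    for f, d_f in index:
        if is_subgraph(f, q):
            candidates &= d_f
    # (2) Verification: run the real subgraph test on each candidate.
    answers = {gid for gid in candidates if is_subgraph(q, database[gid])}
    return candidates, answers

database = {
    1: make_graph("CCO", [(0, 1), (1, 2)]),
    2: make_graph("CO", [(0, 1)]),
    3: make_graph("CCN", [(0, 1), (1, 2)]),
    4: make_graph("CCNCO", [(0, 1), (1, 2), (2, 3), (3, 4)]),
}
features = [make_graph("CC", [(0, 1)]), make_graph("CO", [(0, 1)])]
query = make_graph("CCO", [(0, 1), (1, 2)])
candidates, answers = search_and_verify(query, database, features)
```

Here the candidate set is larger than the answer set, which is why the verification step cannot be skipped.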

Ch2 Preliminaries • Cost Analysis • Query Response Time: Tsearch + |Cq| × (Tio + Tiso_test) • The index size is approximately proportional to the size of the feature set, |F|.

Ch3 Frequent Fragment

Ch3 Frequent Fragment • Example with minSup = 2: fragments contained in at least 2 graphs are indexed.

Ch3 Frequent Fragment • If query Q is frequent, Q itself is indexed, so we can easily find the graphs containing it.

Ch3 Frequent Fragment • What if query Q is not frequent?

Ch3 Frequent Fragment Find the frequent subgraphs of Q!

Ch3 Frequent Fragment • Find all of q's subgraphs. • Sort them in decreasing order of support: f1, f2, …, fn. • There is a boundary between fi and fi+1 such that |Dfi| ≥ minSup and |Dfi+1| < minSup.

Ch3 Frequent Fragment • Compute the candidate answer set by Cq = ∩ Dfj (1 ≤ j ≤ i), whose size is at most |Dfi|.
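
A minimal sketch of this computation, assuming the query's subgraphs and the id sets of the graphs containing them are already known. The function name `candidate_set`, the fragment names, and the id sets are hypothetical values chosen for illustration.

```python
def candidate_set(fragment_ids, min_sup, all_ids):
    """Intersect the id sets of the query's frequent subgraphs.

    fragment_ids maps a (hypothetical) fragment name to the set of graph ids
    containing it; only fragments with support >= min_sup are indexed.
    """
    # Sort the query's subgraphs in decreasing order of support.
    ranked = sorted(fragment_ids.items(), key=lambda kv: len(kv[1]), reverse=True)
    candidates = set(all_ids)
    for _name, ids in ranked:
        if len(ids) < min_sup:
            break  # boundary: everything after this point is not indexed
        candidates &= ids
    return candidates

all_ids = {1, 2, 3, 4, 5}
fragment_ids = {
    "f1": {1, 2, 3, 4},
    "f2": {1, 2, 4},
    "f3": {2, 4},
    "f4": {4},  # support 1 < minSup, so it is not available in the index
}
cq = candidate_set(fragment_ids, min_sup=2, all_ids=all_ids)
```

The result can be no larger than the id set of the least frequent indexed fragment, here f3.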

Ch3 Frequent Fragment • If minSup is high: fewer fragments are indexed, but the candidate set Cq may be too large! • If minSup is low: too many fragments are indexed.

Ch3 Frequent Fragment • Advantages of the size-increasing support function ψ(l): • (1) Far fewer frequent fragments than with the lowest uniform minSup. • (2) There are many small subgraphs, so a low-support large fragment may be indexed by them. • But the set of smaller subgraphs may be too large because ψ(l) is low for small sizes! We will design a distillation procedure in the next section.

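Ch3 Frequent Fragment • Typical setting: a size-increasing support function ψ(l) that is low for small fragments and rises with the fragment size l. • Mining continues until fragments up to size maxL are exhausted, with minimum support ψ(maxL).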
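
As an illustration of a size-increasing support constraint, the sketch below uses a made-up piecewise function. The shape of `psi`, the cutoff at size 4, and the values `max_l=10` and `theta=100` are assumptions for the example and are not taken from the slides.

```python
import math

def psi(size, max_l=10, theta=100):
    """Hypothetical size-increasing support function (monotonically nondecreasing).

    Small fragments (size < 4) only need support 1; larger ones need a support
    that grows linearly until it reaches theta at size max_l.
    """
    if size < 4:
        return 1
    return math.ceil(theta * size / max_l)

def is_frequent(support, size):
    """A fragment is frequent if its support reaches psi(size)."""
    return support >= psi(size)

thresholds = [psi(l) for l in range(0, 11)]
```

With these numbers a fragment with support 5 is kept at size 2 but dropped at size 6, which is how the constraint keeps small fragments while limiting large ones.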

Ch4 Discriminative Fragment

Ch4 Discriminative Fragment • Do we need to index every frequent fragment? • Suppose there are two frequent fragments f1 and f2, where f2 is a supergraph of f1. If f2 is not more discriminative than f1, then f2 is redundant.

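Ch4 Discriminative Fragment • We use the discriminative ratio γ to decide whether a fragment x is discriminative: γ = |∩ Dfi| / |Dx|, where the fi are the fragments already in the feature set that are subgraphs of x.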

Ch4 Discriminative Fragment • Properties of the discriminative ratio γ: • When γ = 1, x is completely redundant. • When γ ≫ 1, x is more discriminative. • γ depends on the fragments that are already in the feature set. So, we need to mine discriminative fragments.

Ch4 Discriminative Fragment • If we set the minimum discriminative ratio γmin, we get the discriminative fragments above. • Since fragment (b) is a subgraph of fragment (c), the discriminative ratio of fragment (c) is 2 / 1 = 2.0.
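
The ratio can be computed directly from id sets. In the sketch below the function name and the graph ids are hypothetical; they are chosen so that fragment (b) occurs in two graphs and fragment (c) in one, which reproduces the 2 / 1 = 2.0 figure.

```python
def discriminative_ratio(d_x, sub_feature_ids, all_ids):
    """Ratio of graphs covered by x's indexed subgraphs to graphs containing x.

    d_x: ids of graphs containing fragment x.
    sub_feature_ids: id sets of already indexed fragments that are subgraphs of x.
    all_ids: ids of the whole database (used when no subgraph is indexed yet).
    """
    covered = set(all_ids)
    for ids in sub_feature_ids:
        covered &= ids
    return len(covered) / len(d_x)

all_ids = {1, 2, 3}   # hypothetical three-graph database
d_b = {1, 2}          # graphs containing fragment (b)
d_c = {1}             # graphs containing fragment (c)
ratio_c = discriminative_ratio(d_c, [d_b], all_ids)
ratio_redundant = discriminative_ratio({1, 2}, [d_b], all_ids)
```

A ratio of 1.0, as in the second call, means the fragment filters out nothing that its indexed subgraphs do not already filter out.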

Ch5 gIndex

Ch5 gIndex 5.1 Discriminative fragment selection 5.2 Index construction 5.3 Search

5.1 Discriminative fragment selection

5.2 Index construction 5.2.1 Graph Sequentialization 5.2.2 gIndex Tree 5.2.3 Remark on gIndex Tree Size 5.2.4 gIndex Tree Implementation

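5.2.1 Graph Sequentialization • Adjacency matrices • DFS code • Discovery time: vertices are subscripted v0, v1, …, vn in the order a depth-first search discovers them • Forward edge: (vi, vj) with i < j • Backward edge: (vi, vj) with i > j • DFS code: the sequence of edges in DFS order • 5-tuple: each edge is written as (i, j, li, l(i,j), lj)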
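
To make the 5-tuple encoding concrete, the sketch below emits one DFS code for a small labeled graph. It is an illustration under assumptions: the graph encoding and the function name `dfs_code` are made up, neighbors are visited in sorted order, and the result is the code of one particular DFS traversal, not the minimum DFS code used as a canonical label.

```python
def dfs_code(vertex_labels, edge_labels, start):
    """Return one DFS code: a list of (i, j, l_i, l_ij, l_j) tuples.

    vertex_labels: dict vertex -> label.
    edge_labels: dict frozenset({u, v}) -> label (simple undirected graph).
    """
    adjacency = {v: [] for v in vertex_labels}
    for e in edge_labels:
        u, v = tuple(e)
        adjacency[u].append(v)
        adjacency[v].append(u)

    index = {}  # vertex -> discovery time
    code = []

    def visit(v, parent):
        index[v] = len(index)
        # Backward edges: from v to already discovered vertices other than its parent.
        back = sorted((index[w], w) for w in adjacency[v] if w in index and w != parent)
        for j, w in back:
            code.append((index[v], j, vertex_labels[v],
                         edge_labels[frozenset((v, w))], vertex_labels[w]))
        # Forward edges: discover new vertices in sorted neighbor order.
        for w in sorted(adjacency[v]):
            if w not in index:
                code.append((index[v], len(index), vertex_labels[v],
                             edge_labels[frozenset((v, w))], vertex_labels[w]))
                visit(w, v)

    visit(start, None)
    return code

# A labeled triangle: vertices a, b, c and edges x, y, z.
vertex_labels = {0: "a", 1: "b", 2: "c"}
edge_labels = {frozenset((0, 1)): "x", frozenset((1, 2)): "y", frozenset((0, 2)): "z"}
code = dfs_code(vertex_labels, edge_labels, start=0)
```

For the triangle, the first two tuples are forward edges (i < j) and the last one is the backward edge that closes the cycle (i > j).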

5.2.2 gIndex Tree Root : Level 0 : graphs with only one vertex and no edge …

5.2.3 Remark on gIndex Tree Size • Discriminative features on one path: f0, f1, f2, …, fK-1

5.2.4 gIndex Tree Implementation • Using a hash table • If two graphs G and G' are isomorphic, then they have the same canonical label, dfs(G) = dfs(G'), and therefore the same hash key.
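
The idea of keying a hash table by a canonical label can be sketched as follows. Note the substitution: instead of a minimum DFS code, this example uses a brute-force canonical form (the smallest relabeled description over all vertex permutations), which is only practical for tiny fragments. The function names and the sample fragments are hypothetical.

```python
from itertools import permutations

def canonical_label(vertex_labels, edges):
    """Brute-force canonical form: isomorphic graphs get the same label.

    vertex_labels: list of labels for vertices 0..n-1.
    edges: list of (u, v, edge_label) for an undirected simple graph.
    """
    n = len(vertex_labels)
    best = None
    for perm in permutations(range(n)):  # perm[old_vertex] = new_vertex
        new_vertices = tuple(
            lab for _, lab in sorted((perm[v], vertex_labels[v]) for v in range(n)))
        new_edges = tuple(sorted(
            (min(perm[u], perm[v]), max(perm[u], perm[v]), lab) for u, v, lab in edges))
        key = (new_vertices, new_edges)
        if best is None or key < best:
            best = key
    return best

index = {}  # hash table: canonical label -> ids of graphs containing the fragment

def add_fragment(vertex_labels, edges, graph_ids):
    index[canonical_label(vertex_labels, edges)] = set(graph_ids)

def lookup(vertex_labels, edges):
    return index.get(canonical_label(vertex_labels, edges))

# Index the path a -x- b -y- c, then look it up under a different vertex numbering.
add_fragment(["a", "b", "c"], [(0, 1, "x"), (1, 2, "y")], {1, 3})
same_fragment = lookup(["c", "b", "a"], [(1, 0, "y"), (2, 1, "x")])
other_fragment = lookup(["a", "b", "c"], [(0, 1, "y"), (1, 2, "x")])
```

Because the key does not depend on how the vertices are numbered, a lookup never needs an isomorphism test against the stored fragments.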

5.3 Search 5.3.1 Apriori Pruning 5.3.2 Maximum Discriminative Fragments

5.3.1 Apriori Pruning • If a fragment is not in the gIndex tree, we need not check its super-graphs any more. • A hash table H is used to facilitate the Apriori pruning.
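
A level-wise sketch of this pruning is shown below. It is a simplification under assumptions: fragments of the query are represented as sets of edge positions in the query (a real index would look them up by canonical label), and `in_index` is a hypothetical membership test standing in for the hash table H.

```python
def is_connected(edge_ids, edges):
    """True if the chosen query edges form one connected fragment."""
    ids = list(edge_ids)
    if not ids:
        return False
    reached = set(edges[ids[0]])
    remaining = set(ids[1:])
    changed = True
    while changed:
        changed = False
        for e in list(remaining):
            if reached & set(edges[e]):
                reached |= set(edges[e])
                remaining.discard(e)
                changed = True
    return not remaining

def indexed_fragments(edges, in_index):
    """Enumerate the query's indexed fragments, smallest first, with Apriori pruning."""
    lookups = 0
    level = set()
    for i in range(len(edges)):
        lookups += 1
        if in_index(frozenset([i])):
            level.add(frozenset([i]))
    found = set(level)
    while level:
        candidates = set()
        for frag in level:
            for i in range(len(edges)):
                cand = frag | {i}
                if i in frag or not is_connected(cand, edges):
                    continue
                # Apriori: skip cand if any connected sub-fragment is missing.
                subs = [cand - {e} for e in cand]
                if all(s in found for s in subs if is_connected(s, edges)):
                    candidates.add(cand)
        level = set()
        for cand in candidates:
            lookups += 1
            if in_index(cand):
                level.add(cand)
        found |= level
    return found, lookups

# Query: the path 0-1-2-3; only fragments built from its first two edges are indexed.
query_edges = [(0, 1), (1, 2), (2, 3)]
H = {frozenset([0]), frozenset([1]), frozenset([0, 1])}
found, lookups = indexed_fragments(query_edges, lambda frag: frag in H)
```

The third edge is missing from H, so the larger fragments that contain it are never looked up: four lookups are made instead of one per connected fragment.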

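5.3.2 Maximum Discriminative Fragments • If fx ⊂ fy, then Dfy ⊆ Dfx, so Dfx ∩ Dfy = Dfy and only the maximum discriminative fragments of the query need to be intersected.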
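
The saving can be checked on a small example. In this sketch the fragments of a query are modeled as sets of query edge positions, so the subgraph relation becomes the proper-subset relation; the function names and the id sets are hypothetical and chosen to respect the rule that a larger fragment occurs in a subset of the graphs of its sub-fragments.

```python
def maximal_fragments(fragments):
    """Keep only fragments that are not a proper subset of another fragment."""
    return {f: ids for f, ids in fragments.items()
            if not any(f < other for other in fragments)}

def intersect(id_sets, all_ids):
    """Intersect a collection of id sets, starting from the whole database."""
    result = set(all_ids)
    for ids in id_sets:
        result &= ids
    return result

all_ids = {1, 2, 3, 4, 5}
fragments = {
    frozenset([0]): {1, 2, 3, 4},
    frozenset([1]): {1, 2, 3},
    frozenset([0, 1]): {1, 2},   # contained in the graphs of both sub-fragments
    frozenset([2]): {2, 3, 4},
}
maximal = maximal_fragments(fragments)
cq_all = intersect(fragments.values(), all_ids)
cq_maximal = intersect(maximal.values(), all_ids)
```

Both intersections give the same candidate set, but the second one touches two id sets instead of four.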

Ch6 Experimental Result

Experimental Result • The performance of gIndex is compared with that of GraphGrep • GraphGrep is a path-based approach • Two kinds of datasets are used in the experiments: • one real dataset • a series of synthetic datasets

Dataset • The real dataset is an AIDS antiviral screen dataset containing chemical compounds • The dataset contains 43,905 classified chemical molecules • The synthetic data generator was provided by Kuramochi et al. • It allows the user to specify the number of graphs (D), their average size (T), the number of seed graphs (S), the average size of seed graphs (I), and the number of distinct labels (L)

Experiment Background • Experiments are performed on a 1.5 GHz Intel PC with 1 GB of memory running RedHat 8.0 • Both GraphGrep and gIndex are compiled with gcc/g++

AIDS Antiviral Screen Dataset

Experimental Result • The index size of gIndex is at least 10 times smaller than that of GraphGrep • Two salient properties of gIndex: its index size is small and stable

Experimental Result • the size of candidate answer set Cq : | Cq | • AVG(|Dq|) : the lower bound of AVG(|Cq|) • An algorithm achieving this lower bound actually matches the queries in the graph dataset precisely

Experimental Result • Q4 (smaller query answer sets) • Queries in Q4 are more likely to be path-structured

Experimental Result • (larger query answer sets)

Experimental Result

Experimental Result The scalability of gIndex
