Applying gSpan on Social Network Graphs

Abhik Ray, WSU ID: 11199134

Graph Mining • Extension of traditional data mining techniques to graph data. • Focus on extracting patterns from the relationships between entities rather than from the entities themselves. • Examples of graph data include social networks (Facebook), chemical compounds, and biological networks.

Frequent Subgraph Discovery • Extension of Frequent Pattern Discovery. • Unsupervised data mining approach. • Given a set of graphs GD, find all subgraphs whose frequency in this set meets a given threshold. • Two main paradigms: • Candidate Generation Approach • Pattern Growth Approach
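
To make the frequency threshold concrete, here is a minimal sketch that handles only the simplest patterns, single labelled edges. The function name `frequent_edges`, the `(labels, edges)` input layout, and the "at least the threshold" comparison are assumptions made for illustration; they do not come from the slides.

```python
from collections import Counter

def frequent_edges(graphs, min_support):
    """Return single-edge patterns found in at least min_support of the graphs.

    graphs: list of (labels, edges), where labels maps vertex id -> label
    and edges is an iterable of (u, v) pairs. Hypothetical layout.
    """
    counts = Counter()
    for labels, edges in graphs:
        # A pattern is an unordered pair of vertex labels.
        patterns = {tuple(sorted((labels[u], labels[v]))) for u, v in edges}
        # Each graph contributes at most once per pattern.
        counts.update(patterns)
    threshold = min_support * len(graphs)
    return {p: c for p, c in counts.items() if c >= threshold}

graphs = [
    ({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),
    ({0: "A", 1: "B"}, [(0, 1)]),
    ({0: "A", 1: "C"}, [(0, 1)]),
]
result = frequent_edges(graphs, 0.6)
```

Larger patterns need a subgraph isomorphism test to count occurrences, which is the expensive step that both paradigms try to limit.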

gSpan Terminology • gSpan stands for graph-based Substructure pattern mining. • Backward extension: An edge is added between two nodes already present in the subgraph being considered. • Forward extension: An edge is added between a node already in the subgraph and another node not yet in the subgraph. • General graph pattern growth proceeds by taking each discovered subgraph ‘g’ and performing extensions recursively until all frequent subgraphs that have ‘g’ embedded in them have been discovered.
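
The two kinds of extension can be sketched for an unlabelled, undirected host graph as follows. The function `extensions` and its arguments are hypothetical names chosen for this example, and the sketch shows general pattern growth, not gSpan's restricted version.

```python
def extensions(graph_edges, sub_vertices, sub_edges):
    """Split the host-graph edges that could extend a subgraph.

    Returns (backward, forward): backward edges join two vertices already
    in the subgraph, forward edges join one subgraph vertex to a new one.
    """
    def norm(u, v):
        # Undirected edge stored with the smaller endpoint first.
        return (min(u, v), max(u, v))

    have = {norm(u, v) for u, v in sub_edges}
    backward, forward = set(), set()
    for u, v in graph_edges:
        edge = norm(u, v)
        if edge in have:
            continue  # already part of the subgraph
        inside = (u in sub_vertices) + (v in sub_vertices)
        if inside == 2:
            backward.add(edge)
        elif inside == 1:
            forward.add(edge)
    return backward, forward

host = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
backward, forward = extensions(host, {0, 1, 2}, [(0, 1), (1, 2)])
```

Here the edge (0, 2) is a backward extension because it closes a triangle among existing vertices, (2, 3) is a forward extension, and (3, 4) touches no subgraph vertex and is ignored.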

gSpan • Creates a DFS tree for a frequent subgraph, starting from a seed vertex. • Builds a linear order among the visited vertices by using subscripts. • The starting vertex becomes v0 and the last vertex visited becomes vn (also called the rightmost vertex). The path from v0 to vn is the rightmost path. • A new edge is then added either between the rightmost vertex and another vertex on the rightmost path, or between a vertex on the rightmost path and a newly created vertex. • Duplicate generation is avoided by converting the DFS trees to DFS codes, choosing the minimum code, and performing extensions only on that code.
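
The rightmost-path rule can be illustrated with a small sketch. It assumes a DFS code is stored as a list of `(i, j)` pairs of DFS subscripts, where `i < j` marks a forward edge and `i > j` a backward edge; vertex and edge labels are left out, and the minimum-code test is not implemented. The function names are hypothetical.

```python
def rightmost_path(dfs_code):
    """Return the DFS subscripts on the path from v0 to the rightmost vertex."""
    # Forward edges (i < j) define the DFS tree: i is the parent of j.
    parent = {j: i for i, j in dfs_code if i < j}
    vertex = max(parent, default=0)  # rightmost vertex has the largest subscript
    path = [vertex]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def candidate_extensions(dfs_code):
    """List the backward and forward edges allowed by the rightmost-path rule."""
    path = rightmost_path(dfs_code)
    rightmost = path[-1]
    existing = {frozenset(edge) for edge in dfs_code}
    # Backward: from the rightmost vertex to another vertex on the path.
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in existing]
    # Forward: from any vertex on the path to a brand-new vertex.
    new_vertex = rightmost + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward

code = [(0, 1), (1, 2), (2, 0), (2, 3), (1, 4)]
path = rightmost_path(code)
backward, forward = candidate_extensions(code)
```

For this code the rightmost path is v0, v1, v4, so the only backward candidate is (4, 0) and the forward candidates attach a new vertex v5 to v4, v1, or v0. Vertices v2 and v3 are off the rightmost path and receive no new edges.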

Experiments • Wiki Vote: Wikipedia Request for Adminship who-votes-on-whom dataset. • Random edge sample of fraction 0.005 taken 100 times. • Vertices in each sample sorted and renumbered in sequential order. • Samples converted to gSpan input format. • gSpan run on the dataset with a 10% frequency threshold. • Top four subgraphs selected based on score, where score_d = order_d * frequency_d for each discovered subgraph d.
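
The preprocessing steps above might look roughly like the sketch below. It is not the author's script: the helper names, the single vertex and edge label `0`, and the reading of "order" as a plain number passed in by the caller are all assumptions. The output layout (`t # id`, `v id label`, `e from to label`) follows the text format commonly used by gSpan implementations.

```python
import random

def sample_edges(edges, fraction, rng):
    """Draw a uniform random sample containing about `fraction` of the edges."""
    k = max(1, round(fraction * len(edges)))
    return rng.sample(edges, k)

def renumber(edges):
    """Sort the vertex ids and map them to 0..n-1; return (n, new_edges)."""
    vertices = sorted({v for edge in edges for v in edge})
    index = {v: i for i, v in enumerate(vertices)}
    return len(vertices), [(index[u], index[v]) for u, v in edges]

def to_gspan(graph_id, n_vertices, edges, vlabel=0, elabel=0):
    """Format one graph as gSpan-style text (uniform labels are an assumption)."""
    lines = [f"t # {graph_id}"]
    lines += [f"v {i} {vlabel}" for i in range(n_vertices)]
    lines += [f"e {u} {v} {elabel}" for u, v in edges]
    return "\n".join(lines)

def score(order, frequency):
    """score_d = order_d * frequency_d, as stated on the slide."""
    return order * frequency

# Toy stand-in for the Wiki Vote edge list (hypothetical data).
all_edges = [(i, i + 1) for i in range(1000)]
rng = random.Random(0)
dataset = []
for graph_id in range(100):
    n, edges = renumber(sample_edges(all_edges, 0.005, rng))
    dataset.append(to_gspan(graph_id, n, edges))
```

Edge direction is dropped in this sketch, so each vote is written as an undirected edge.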

Results • Characteristics of social networks, such as triangle-closing edges, were not found.

Improvements • After each sample is taken, remove that sample's edges from the main graph. • Extend gSpan to handle directed edges. • Use more sophisticated sampling techniques such as Forest Fire sampling.
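
The first improvement amounts to sampling without replacement across samples, so no edge appears in two samples. One possible sketch, with the hypothetical helper `disjoint_edge_samples`, shuffles the edge list once and cuts it into consecutive slices, which gives the same result as repeatedly sampling and deleting.

```python
import random

def disjoint_edge_samples(edges, sample_size, n_samples, rng):
    """Return up to n_samples edge samples that share no edges."""
    pool = list(edges)
    rng.shuffle(pool)
    # The pool may run out before n_samples samples have been taken.
    n_samples = min(n_samples, len(pool) // sample_size)
    return [pool[i * sample_size:(i + 1) * sample_size]
            for i in range(n_samples)]

edges = [(i, i + 1) for i in range(20)]
samples = disjoint_edge_samples(edges, 5, 3, random.Random(1))
```

With a fixed sample size, the number of disjoint samples is capped by the size of the edge list, so a small sampling fraction leaves room for many samples.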

Thank you
