230 likes | 347 Views
Presented by Ozgur D. Sahin. Outline. Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions. Introduction & Motivation. Graph-based data is becoming more importatnt
E N D
Outline • Introduction • Neighborhood Functions • ANF Algorithm • Modifications • Experimental Results • Data Mining using ANF • Conclusions
Introduction & Motivation • Graph-based data is becoming more importatnt • Internet modeling, academic citations, phone records, movie databases, CAD circuits • Example Questions: • How robust is the Internet to failures? • What are the most influential database papers? • What is the best opening move in tic-tac-toe? • Are phone call patterns in Asia similar to those in the U.S.? • Goal: Quickly answer questions on graph- represented data
Answering Questions • We can answer these questions if we can compute following three properties related to connectivity and neighborhood structure: • Graph Similarity: Decide if two graphs have similar connectivity/neighborhood structure • Subgraph Similarity: Compare how two subgraphs of a given graph are connected • Vertex Importance: Assign an importance to each node based on its connectivity • This paper provides such a tool: ANF (Approximate Neighborhood Function)
Challenges • Following properties should be satisfied: • Error Guarantees:Accurate estimates • Fast: Scale linearly with n (# of nodes) and m (# of edges) • Low Storage • Adapts to available memory • Parallelizable • Sequential scan of the edge file • Estimates per node
Definitions - Neighborhood Functions dist(u,v): # of edges on the shortest path from u to v Define following neighborhood functions:
Definitions - Neighborhood Functions Generalize these two definitions to deal with subgraphs:
Basic ANF Algorithm • N(h) can be computed by a graph traversal • Graph traversal accesses edges in random order • Running time is O(nm) • Access edges in sequential order: • M(x,h) is the set of nodes within distance h of node x
Basic ANF Algorithm • How to compute the number of distinct elements in the set M(x,h): • A dictionary data structure: O(n2log n) time/space • Use bits to mark membership: O(n2) space • Use ‘probabilistic counting algorithm’ • Approximate set sizes using ‘log n+r’ bits
Probabilistic counting algorithm • Approximate set sizes using ‘log n+r’ bits • Instead of one bit per node, give half the nodes bit 0, a quarter of them bit 1, and so on (A node is given bit i with probability 1/2i+1) • The approximation of the size of a set is proportional to 2b, where b is the least bit that has not been set in the bit representation of this set • Use k parallel approximations • M(x,h) is represented by k(log n+r) bits
Basic ANF Algorithm • Consider a ring with 5 nodes • Example for k=3 and r=0 • Bit 0 is the leftmost bit in each 3-bit mask • M(2,1) is the union of M(2,0), M(1,0), and M(3,0): • M(2,1)=M(2,0) OR M(1,0) OR M(3,0) • IN(2,1) is computed from the average of the least zero bit positions: • Avg=(2+1+1)/3=4/3 IN(2,1) = (24/3)/0.77359 = 3.25
Modifications • M(x,h) uses M(y,h-1) but not M(y,h-2), so just keep the M(y,h-1) during iteration h. • Include a mark bit to handle generalized neighborhood functions • Break bit masks into smaller pieces if they are larger than the available memory
Leading Ones Compression • As ANF runs, most bit masks will have many leading 1’s • Compress bit masks by including a counter of the leading ones • Bit shuffling of k parallel bit masks enables further compression: • 11010,11100 1111011000 • Provides up to 23% speed-up
Experiments • Data Sets: 3 real (Router, Cornell, Cora) and 4 synthetic • Evaluation Metric:
Experiments - Accuracy k=64: - ANF achieves less than 7% error - ANF’s error is independent of the data set
Data Mining with ANF • ANF tool can be used to answer graph mining problems: • Best opening move for Tic-Tac-Toe game • Clustering movie classes • Measuring the robustness of the Internet • Use summarized statistics derived from neighborhood function: • Many real graphs follow a power law: • N(h) µ hH, where H is defined as the ‘hop exponent’ • Use ‘individual hop exponent’ as a measure of importance
Tic-Tac-Toe • Show: The best opening move is the center square • Each possible board configuration is a node and there is an edge from board x to board y if it is a possible move • Compute individual neighborhood functions for each of the 9 possible first moves
Clustering Movies • Consider IMDB (Internet Movie Data Base) where each movie is identified as being in one or more classes (such as documentaries, dramas, comedies, etc) • Construct a graph for each class and cluster similar ones
Internet Router Data • How robust the Internet is to router failures • Delete some number of routers and measure connectivity • Random failures do not disrupt the Internet • Targeted failures can dramatically disrupt it
Conclusions • ANF uses an efficient and accurate approximation algorithm • ANF tool provides several advantages including following: • Accurate • Fast • Low storage requirements • Parallelizable • ANF makes it possible to answer many interesting questions