1 / 23

Presented by Ozgur D. Sahin

This paper presents the ANF (Approximate Neighborhood Function) algorithm, a tool for answering questions on graph-represented data. It computes graph similarity, subgraph similarity, and vertex importance for efficient data mining. The algorithm is fast, has low storage requirements, and offers error guarantees. Experimental results demonstrate its accuracy and scalability. The ANF tool can be used for various applications such as measuring the robustness of the Internet and clustering movie classes.

hwhitehurst
Download Presentation

Presented by Ozgur D. Sahin

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Presented by Ozgur D. Sahin

  2. Outline • Introduction • Neighborhood Functions • ANF Algorithm • Modifications • Experimental Results • Data Mining using ANF • Conclusions

  3. Introduction & Motivation • Graph-based data is becoming more importatnt • Internet modeling, academic citations, phone records, movie databases, CAD circuits • Example Questions: • How robust is the Internet to failures? • What are the most influential database papers? • What is the best opening move in tic-tac-toe? • Are phone call patterns in Asia similar to those in the U.S.? • Goal: Quickly answer questions on graph- represented data

  4. Answering Questions • We can answer these questions if we can compute following three properties related to connectivity and neighborhood structure: • Graph Similarity: Decide if two graphs have similar connectivity/neighborhood structure • Subgraph Similarity: Compare how two subgraphs of a given graph are connected • Vertex Importance: Assign an importance to each node based on its connectivity • This paper provides such a tool: ANF (Approximate Neighborhood Function)

  5. Challenges • Following properties should be satisfied: • Error Guarantees:Accurate estimates • Fast: Scale linearly with n (# of nodes) and m (# of edges) • Low Storage • Adapts to available memory • Parallelizable • Sequential scan of the edge file • Estimates per node

  6. Definitions - Neighborhood Functions dist(u,v): # of edges on the shortest path from u to v Define following neighborhood functions:

  7. Definitions - Neighborhood Functions Generalize these two definitions to deal with subgraphs:

  8. Basic ANF Algorithm • N(h) can be computed by a graph traversal • Graph traversal accesses edges in random order • Running time is O(nm) • Access edges in sequential order: • M(x,h) is the set of nodes within distance h of node x

  9. Basic ANF Algorithm • How to compute the number of distinct elements in the set M(x,h): • A dictionary data structure: O(n2log n) time/space • Use bits to mark membership: O(n2) space • Use ‘probabilistic counting algorithm’ • Approximate set sizes using ‘log n+r’ bits

  10. Probabilistic counting algorithm • Approximate set sizes using ‘log n+r’ bits • Instead of one bit per node, give half the nodes bit 0, a quarter of them bit 1, and so on (A node is given bit i with probability 1/2i+1) • The approximation of the size of a set is proportional to 2b, where b is the least bit that has not been set in the bit representation of this set • Use k parallel approximations • M(x,h) is represented by k(log n+r) bits

  11. Basic ANF Algorithm • Consider a ring with 5 nodes • Example for k=3 and r=0 • Bit 0 is the leftmost bit in each 3-bit mask • M(2,1) is the union of M(2,0), M(1,0), and M(3,0): • M(2,1)=M(2,0) OR M(1,0) OR M(3,0) • IN(2,1) is computed from the average of the least zero bit positions: • Avg=(2+1+1)/3=4/3  IN(2,1) = (24/3)/0.77359 = 3.25

  12. Basic ANF Algorithm

  13. Modifications • M(x,h) uses M(y,h-1) but not M(y,h-2), so just keep the M(y,h-1) during iteration h. • Include a mark bit to handle generalized neighborhood functions • Break bit masks into smaller pieces if they are larger than the available memory

  14. Leading Ones Compression • As ANF runs, most bit masks will have many leading 1’s • Compress bit masks by including a counter of the leading ones • Bit shuffling of k parallel bit masks enables further compression: • 11010,11100  1111011000 • Provides up to 23% speed-up

  15. Experiments • Data Sets: 3 real (Router, Cornell, Cora) and 4 synthetic • Evaluation Metric:

  16. Experiments - Accuracy k=64: - ANF achieves less than 7% error - ANF’s error is independent of the data set

  17. Experiments - Time

  18. Experiments - Scalability

  19. Data Mining with ANF • ANF tool can be used to answer graph mining problems: • Best opening move for Tic-Tac-Toe game • Clustering movie classes • Measuring the robustness of the Internet • Use summarized statistics derived from neighborhood function: • Many real graphs follow a power law: • N(h) µ hH, where H is defined as the ‘hop exponent’ • Use ‘individual hop exponent’ as a measure of importance

  20. Tic-Tac-Toe • Show: The best opening move is the center square • Each possible board configuration is a node and there is an edge from board x to board y if it is a possible move • Compute individual neighborhood functions for each of the 9 possible first moves

  21. Clustering Movies • Consider IMDB (Internet Movie Data Base) where each movie is identified as being in one or more classes (such as documentaries, dramas, comedies, etc) • Construct a graph for each class and cluster similar ones

  22. Internet Router Data • How robust the Internet is to router failures • Delete some number of routers and measure connectivity • Random failures do not disrupt the Internet • Targeted failures can dramatically disrupt it

  23. Conclusions • ANF uses an efficient and accurate approximation algorithm • ANF tool provides several advantages including following: • Accurate • Fast • Low storage requirements • Parallelizable • ANF makes it possible to answer many interesting questions

More Related