560 likes | 1.06k Views
Techniques for Graph Analytics on Big Data. Usman Nisar , Arash Fard , John A. Miller Computer Science Department University of Georgia, Athens, GA. Outline. Introduction Subgraph Pattern Matching Types of Subgraph Pattern Matching Models of Computation Distributed Algorithms
E N D
Techniques for Graph Analytics on Big Data UsmanNisar, ArashFard, John A. Miller Computer Science Department University of Georgia, Athens, GA
Outline • Introduction • Subgraph Pattern Matching • Types of Subgraph Pattern Matching • Models of Computation • Distributed Algorithms • Performance Evaluation • Graph Partitioning • Results Analysis • Conclusion
Introduction • Directed graph G(V, E, l) • V is a set ofvertices • E ⊆ V x V is a set of directed edges • l: V →𝓝 is a function mapping vertices to labels • Versatility and expressivity • Social networks, web search engines, genome sequencing, etc. • Data sizes are growing rapidly • Facebook: 1 billion + users, average degree of 140 • Twitter: 200 million + users, 400 million tweets each day • Bigger datasets mean bigger graphs
Subgraph Pattern Matching • A graphG’(V’, E’, l’) is a subgraph of G(V, E, l) if: • V’ ⊆ V • E’ ⊆ V’ x V’ ⊆ E • ∀v ∈ V’ , l’(v) = l(v) • A pattern (or query) is a graph that we want to find in a bigger graph • Subgraph pattern matching: Given a query graph, find all subgraphs of another graph (the datagraph) that are similar to the query based on certain criteria
Types of Subgraph Pattern Matching • Exact • SubgraphIsomorphism • Heuristic • Graph Simulation • Dual Simulation • Strong Simulation
Subgraph Isomorphism Exact matching from the pattern to data graph Labels must be the same Ullmann’salgorithm, VF2 NP-hard problem in general case Current solutions not practical for large graphs
Example: SubgraphIsomorphism Matches: {4, 8, 6, 7}, {5, 8, 6, 7}
Graph Simulation [1] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, “Computing simulations on finite and infinite graphs,” in Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995, pp. 453–462. • A vertex in a data graph G matches a vertex in query graph Q via graph simulation iff • both have the same label • a subset of its children match all the children of its corresponding vertex in the query graph • HHK algorithm[1]
Dual Simulation • Adds a duality condition to graph simulation • A vertex in a data graph becomes a match with a vertex in the query graph iff • Both have the same label • A subset of its children matches all the children of its corresponding vertex in the data graph • A subset of its parents matches all the parents of its corresponding vertex in the data graph • O((|Vq| + |Eq|) (|V| + |E|)) • More restrictive than graph simulation
Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB
Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB
Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB • O(|V|(|V| + (|Vq| + |Eq|) (|V| + |E|)) • Query graph Q(Vq,Eq,l) matches data graph G(V, E, l) via strong simulation if there exists a vertex v ∈ Vs.t. • Q matches G[v, dq] via dual simulation with match relation Rb • v is contained in Rb
Other Distributed Graph Algorithms • Dual simulation • Similar to graph simulation • In addition to storing the match sets of its children, a vertex also stores the match sets of its parents • During match set evaluation, a vertex takes both of these into account • Strong simulation (optimized) • First run dual simulation to obtain match relation R • For each matching vertex v in R: • Create a ball centered at v with radius dqcontaining only vertices in R • Perform dual simulation on the ball
Performance Evaluation • Experimental setup • Cluster with 12 machines • Each with two 2Ghz Intel Xeon E5-2620 CPUs, each with six cores • Ethernet: 1 Gb/s • Set up HDFS/GPS on all of the machines
Runtime Evaluation Synthesized, |V|= 108
Runtime Evaluation uk-2005, |V|= 3.9 x 107
Runtime Evaluation enwiki-2013, |V|= 4.2 x 106
Speedup Evaluation Synthesized, |V|= 108
Speedup Evaluation uk-2005, |V|= 3.9 x 107
Speedup Evaluation enwiki-2013, |V|= 4.2 x 106
Conclusions • The three algorithms exhibit excellent scalable behavior in terms of speedup and efficiency as we increase the number of workers • Distributed implementations (GPS): • Graph simulation • Dual simulation • Strong simulation • Ongoing work: • Distributed implementations with message passing in Akka • Our own sequential/distributed isomorphism algorithm
Depth-First Ball Algorithm works in a depth-first fashion. Message is generated at the center which is then propagated through the system for ballSizesupersteps. The approach results in an exponential number of messages that slows down the whole system and renders the approach impractical
Breadth-First Ball Works on a simple ping-reply model. Center vertex starts of by sending a ping message to all of its adjacent nodes in the first superstep. In the second superstep, all the recipient nodes reply back with their label and the ids of their children and parents. Center vertex upon receiving this information in the third superstep, saves the returned labels and then ping the boundary nodes. This process is repeated till we have a ball of size dq.
Efficiency • Calculated as Efficiencyk = Speedupk/k Synthesized, |V|=108 uk-2005, |V|=3.9x107 enwiki-2013, |V|=4.2x106
Graph Partitioning 1 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview • By default, the data graph is partitioned in a round-robin fashion among workers • The goals of min-cut partitioning are two-fold: • to create well-balanced partitions • to reduce the inter-partition edges • Number of other algorithms have been shown to use min-cut graph partitioning successfully for speed-ups • METIS1 – graph partitioning tool • Written in C • Takes an undirected graph and outputs the partitions
Performance Evaluation • Experimental Setup • Cluster with 5 machines • Each with two 2Ghz Intel Xeon E5-2620 CPU, each with six cores • Ethernet: 1 Gb/sec • Setup HDFS/GPS on all the machines • Datasets: • Labels(l) = 200
Results - Runtime Graph Simulation Dual Simulation Strict Simulation Synthesized Dataset, |V|=107, = 1.2
Results - Runtime Graph Simulation Dual Simulation Strict Simulation uk-2002, |V|=1.8x107
Results – Network I/O Graph Simulation Dual Simulation Strict Simulation Synthesized Dataset, |V|=107,
Results – Network I/O Graph Simulation Dual Simulation Strict Simulation uk-2002, |V|=1.8x107