Techniques for Graph Analytics on Big Data

Techniques for Graph Analytics on Big Data UsmanNisar, ArashFard, John A. Miller Computer Science Department University of Georgia, Athens, GA

Outline • Introduction • Subgraph Pattern Matching • Types of Subgraph Pattern Matching • Models of Computation • Distributed Algorithms • Performance Evaluation • Graph Partitioning • Results Analysis • Conclusion

Introduction • Directed graph G(V, E, l) • V is a set ofvertices • E ⊆ V x V is a set of directed edges • l: V →𝓝 is a function mapping vertices to labels • Versatility and expressivity • Social networks, web search engines, genome sequencing, etc. • Data sizes are growing rapidly • Facebook: 1 billion + users, average degree of 140 • Twitter: 200 million + users, 400 million tweets each day • Bigger datasets mean bigger graphs

Subgraph Pattern Matching • A graphG’(V’, E’, l’) is a subgraph of G(V, E, l) if: • V’ ⊆ V • E’ ⊆ V’ x V’ ⊆ E • ∀v ∈ V’ , l’(v) = l(v) • A pattern (or query) is a graph that we want to find in a bigger graph • Subgraph pattern matching: Given a query graph, find all subgraphs of another graph (the datagraph) that are similar to the query based on certain criteria

Types of Subgraph Pattern Matching • Exact • SubgraphIsomorphism • Heuristic • Graph Simulation • Dual Simulation • Strong Simulation

Subgraph Isomorphism Exact matching from the pattern to data graph Labels must be the same Ullmann’salgorithm, VF2 NP-hard problem in general case Current solutions not practical for large graphs

Example: SubgraphIsomorphism Matches: {4, 8, 6, 7}, {5, 8, 6, 7}

Graph Simulation [1] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, “Computing simulations on finite and infinite graphs,” in Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995, pp. 453–462. • A vertex in a data graph G matches a vertex in query graph Q via graph simulation iff • both have the same label • a subset of its children match all the children of its corresponding vertex in the query graph • HHK algorithm[1]

Graph Simulation

Dual Simulation • Adds a duality condition to graph simulation • A vertex in a data graph becomes a match with a vertex in the query graph iff • Both have the same label • A subset of its children matches all the children of its corresponding vertex in the data graph • A subset of its parents matches all the parents of its corresponding vertex in the data graph • O((|Vq| + |Eq|) (|V| + |E|)) • More restrictive than graph simulation

Example: Graph  Dual Simulation

Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB

Strong Simulation • Extends dual simulation with a locality condition • A ball G[v, r]is a subset of a graph G containing: • all vertices VBwithin an undirected distance r of the vertex v • all the edges between the vertices in VB • O(|V|(|V| + (|Vq| + |Eq|) (|V| + |E|)) • Query graph Q(Vq,Eq,l) matches data graph G(V, E, l) via strong simulation if there exists a vertex v ∈ Vs.t. • Q matches G[v, dq] via dual simulation with match relation Rb • v is contained in Rb

Example: Dual  Strong Simulation

Models of Computation

Distributed Graph Simulation

Other Distributed Graph Algorithms • Dual simulation • Similar to graph simulation • In addition to storing the match sets of its children, a vertex also stores the match sets of its parents • During match set evaluation, a vertex takes both of these into account • Strong simulation (optimized) • First run dual simulation to obtain match relation R • For each matching vertex v in R: • Create a ball centered at v with radius dqcontaining only vertices in R • Perform dual simulation on the ball

Performance Evaluation • Experimental setup • Cluster with 12 machines • Each with two 2Ghz Intel Xeon E5-2620 CPUs, each with six cores • Ethernet: 1 Gb/s • Set up HDFS/GPS on all of the machines

Runtime Evaluation Synthesized, |V|= 108

Runtime Evaluation uk-2005, |V|= 3.9 x 107

Runtime Evaluation enwiki-2013, |V|= 4.2 x 106

Speedup Evaluation Synthesized, |V|= 108

Speedup Evaluation uk-2005, |V|= 3.9 x 107

Speedup Evaluation enwiki-2013, |V|= 4.2 x 106

Conclusions • The three algorithms exhibit excellent scalable behavior in terms of speedup and efficiency as we increase the number of workers • Distributed implementations (GPS): • Graph simulation • Dual simulation • Strong simulation • Ongoing work: • Distributed implementations with message passing in Akka • Our own sequential/distributed isomorphism algorithm

Questions?

Thanks

Depth-First Ball Algorithm works in a depth-first fashion. Message is generated at the center which is then propagated through the system for ballSizesupersteps. The approach results in an exponential number of messages that slows down the whole system and renders the approach impractical

Breadth-First Ball Works on a simple ping-reply model. Center vertex starts of by sending a ping message to all of its adjacent nodes in the first superstep. In the second superstep, all the recipient nodes reply back with their label and the ids of their children and parents. Center vertex upon receiving this information in the third superstep, saves the returned labels and then ping the boundary nodes. This process is repeated till we have a ball of size dq.

Breadth-First Ball

Efficiency • Calculated as Efficiencyk = Speedupk/k Synthesized, |V|=108 uk-2005, |V|=3.9x107 enwiki-2013, |V|=4.2x106

Graph Partitioning 1 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview • By default, the data graph is partitioned in a round-robin fashion among workers • The goals of min-cut partitioning are two-fold: • to create well-balanced partitions • to reduce the inter-partition edges • Number of other algorithms have been shown to use min-cut graph partitioning successfully for speed-ups • METIS1 – graph partitioning tool • Written in C • Takes an undirected graph and outputs the partitions

Performance Evaluation • Experimental Setup • Cluster with 5 machines • Each with two 2Ghz Intel Xeon E5-2620 CPU, each with six cores • Ethernet: 1 Gb/sec • Setup HDFS/GPS on all the machines • Datasets: • Labels(l) = 200

Results - Runtime Graph Simulation Dual Simulation Strict Simulation Synthesized Dataset, |V|=107, = 1.2

Results - Runtime Graph Simulation Dual Simulation Strict Simulation uk-2002, |V|=1.8x107

Results – Network I/O Graph Simulation Dual Simulation Strict Simulation Synthesized Dataset, |V|=107,

Results – Network I/O Graph Simulation Dual Simulation Strict Simulation uk-2002, |V|=1.8x107

Planned Pattern Matching Implementations

Techniques for Graph Analytics on Big Data

Techniques for Graph Analytics on Big Data

Presentation Transcript

Big Data Analytics

Big Data + Data Analytics

Pentaho Analytics for Big Data

Big Data analytics

Big data analytics for DEVELOPMENT

Big Data Analytics

Big data analytics

Big Data Appliance for Graph Analytics

Graph Data Analytics

Big data analytics for Development

Data Analytics for Big Data

Big Data Analytics

Big Data analytics

analytics platform for big data

Big Data Analytics

Big Data Analytics

Golang For Big Data Analytics

Big Data Analytics

Data Analytics for Big Data

Big (graph) data analytics

Big (graph) data analytics

Big Data Analytics Solutions for Businesses | Big Data