Distributed Graph Pattern Matching
Shuai Ma, Yang Cao, Jinpeng Huai, Tianyu Wo
Graphs are everywhere, and quite a few are huge graphs!
• Social networks
• File systems
• Databases
• World Wide Web
Graph search is a key to social search engines!
Graph Pattern Matching
• Given two graphs G1 (pattern graph) and G2 (data graph),
  • decide whether G1 “matches” G2 (Boolean queries);
  • identify the “subgraphs” of G2 that match G1.
• Applications
  • Web mirror detection / Web site classification
  • Complex object identification
  • Software plagiarism detection
  • Social network / biology analyses
  • …
• Matching semantics
  • Traditional: subgraph isomorphism
  • Emerging applications: graph simulation and its extensions, etc.
A variety of emerging real-life applications!
Distributed Graph Pattern Matching
• Real-life graphs are typically far too large:
  • Yahoo! web graph: 14 billion nodes
  • Facebook: over 0.8 billion users
• Real-life graphs are naturally distributed:
  • Google, Yahoo! and Facebook run large-scale data centers
It is NOT practical to handle large graphs on single machines
Distributed graph processing is inevitable
It is natural to study “distributed graph pattern matching”!
Distributed Graph Pattern Matching
• Given a pattern graph Q(Vq, Eq) and a fragmented data graph F = (F1, …, Fk) of G(V, E) distributed over k sites,
• the distributed graph pattern matching problem is to find the maximum match in G for Q, via graph simulation.
There exists a unique maximum match for graph simulation!
Graph Simulation
• Given a pattern graph Q(Vq, Eq) and a data graph G(V, E), a binary relation R ⊆ Vq × V is said to be a match if
  • (1) for each (u, v) ∈ R, u and v have the same label; and
  • (2) for each (u, v) ∈ R and each edge (u, u′) ∈ Eq, there exists an edge (v, v′) ∈ E such that (u′, v′) ∈ R.
• Graph G matches pattern Q via graph simulation if there exists a total match relation M:
  • for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
• Intuitively, simulation preserves the labels and the child relationship of a graph pattern in its match.
• Simulation was initially proposed for the analysis of programs; simulation and its extensions were recently introduced for social networks.
Subgraph isomorphism (NP-complete) vs. graph simulation (O(n^2))!
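To make the semantics concrete, here is a minimal sketch of the naive fixpoint computation of the maximum simulation match. The dict-of-sets graph encoding and the name max_simulation are illustrative choices, not the paper's; this version favors clarity over the best-known quadratic bound.

```python
# A minimal sketch of the naive fixpoint algorithm for graph simulation.
# Graphs are plain dicts: label[v] is v's label, adj[v] is the set of
# v's successors (an illustrative encoding, not the paper's).

def max_simulation(q_label, q_adj, g_label, g_adj):
    # Condition (1): start from all label-preserving candidates.
    sim = {u: {v for v in g_label if g_label[v] == q_label[u]}
           for u in q_label}
    changed = True
    while changed:
        changed = False
        for u, children in q_adj.items():
            for uc in children:
                # Condition (2): keep v in sim[u] only if some successor
                # of v can still match the child uc; otherwise drop v
                # and iterate until a fixpoint is reached.
                keep = {v for v in sim[u]
                        if g_adj.get(v, set()) & sim[uc]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    return sim
```

G matches Q via simulation iff sim[u] is non-empty for every pattern node u; the returned relation is then the unique maximum match.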
Graph Simulation
Example query: set up a team to develop a new software product.
Graph simulation returns F3, F4 and F5; subgraph isomorphism returns empty!
Subgraph isomorphism is too strict for emerging applications!
Properties of Graph Simulation
Impacts of connected components (CCs):
• Let pattern Q = {Q1, …, Qh} (h CCs). For any data graph G,
  • if Mi is the maximum match in G for Qi,
  • then M1 ∪ … ∪ Mh is the maximum match in G for Q.
• Let data graph G = {G1, …, Gh} (h CCs). For any pattern graph Q,
  • if Mi is the maximum match in Gi for Q,
  • then M1 ∪ … ∪ Mh is the maximum match in G for Q (see the sketch after this slide).
• Let R ⊆ Vq × V be any binary relation on pattern graph Q(Vq, Eq) and data graph G(V, E) that contains the maximum match M in G for Q, and let R(G) consist of h CCs R(G)1, …, R(G)h.
  • If Mi is the maximum match in R(G)i for Q,
  • then M1 ∪ … ∪ Mh is exactly the maximum match in G for Q.
Even if data graph G is connected, R(G), obtained by removing useless nodes and edges from G, may be highly disconnected.
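The second property translates directly into code: evaluate Q on each CC of G independently and union the per-CC maximum matches. A sketch reusing the max_simulation helper above; treating edges as undirected for CC purposes, and requiring every node of G to appear as a key in g_adj, are assumptions of this illustration.

```python
# A sketch of the CC property: the maximum match in G for Q is the
# union of the maximum matches in G's connected components.

def connected_components(adj):
    # Undirected reachability over the directed adjacency sets;
    # assumes every node appears as a key in adj.
    und = {v: set() for v in adj}
    for v, nbrs in adj.items():
        for w in nbrs:
            und[v].add(w)
            und.setdefault(w, set()).add(v)
    seen, comps = set(), []
    for v in und:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(und[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def match_by_components(q_label, q_adj, g_label, g_adj):
    result = {u: set() for u in q_label}
    for comp in connected_components(g_adj):
        sub_label = {v: g_label[v] for v in comp}
        sub_adj = {v: g_adj.get(v, set()) & comp for v in comp}
        local = max_simulation(q_label, q_adj, sub_label, sub_adj)
        for u in q_label:
            result[u] |= local[u]   # M = M1 ∪ … ∪ Mh
    return result
```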
Properties of Graph Simulation
What can be computed locally?
• The matched subgraph of Q1 and G1 is Gs = F3 ∪ F4 ∪ F5;
• Removing any node or edge from Gs makes Q1 NOT match Gs.
Graph simulation has poor data locality
Properties of Graph Simulation
We turn to the data locality of single nodes:
• Checking whether data node v in G matches pattern node u in Q can be determined locally iff the subgraph desc(Q, u) is a DAG (a DAG test is sketched below).
• desc(Q1, SA) is the subgraph in Q1 with nodes SA, SD and ST.
What have we learned from the static analysis?
• Treat each connected component in Q and G separately;
• Use data locality to check whether a node in G can be determined locally.
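The locality test hinges on whether desc(Q, u), the subgraph of Q reachable from u, is acyclic. A minimal sketch of that check via iterative three-color DFS; the function name and graph encoding follow the earlier sketches and are illustrative.

```python
# A sketch of the data-locality test: (u, v) can be decided locally
# iff desc(Q, u), the part of Q reachable from u, is a DAG.
# Iterative three-color DFS; a gray-to-gray edge signals a cycle.

def desc_is_dag(q_adj, u):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: GRAY}
    stack = [(u, iter(q_adj.get(u, ())))]
    while stack:
        node, children = stack[-1]
        advanced = False
        for child in children:
            c = color.get(child, WHITE)
            if c == GRAY:          # back edge: a cycle reachable from u
                return False
            if c == WHITE:
                color[child] = GRAY
                stack.append((child, iter(q_adj.get(child, ()))))
                advanced = True
                break
        if not advanced:           # all children of node explored
            color[node] = BLACK
            stack.pop()
    return True
```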
Complexity Analysis of Distributed Algorithms
Model of computation:
• A cluster of identical machines (with one acting as coordinator);
• Each machine can directly send an arbitrary number of messages to any other machine;
• All machines cooperate via local computations and message passing.
Complexity measures:
• 1. Visit times: the maximum number of times a machine is visited (interactions)
• 2. Makespan: the time to complete the evaluation (efficiency)
• 3. Data shipment: the total size of the messages shipped among distinct machines (network bandwidth consumption)
Complexity Analysis of Distributed Algorithms
Specifications for the distributed algorithms:
• For each machine Si (1 ≤ i ≤ k),
  • Local information: (1) pattern graph Q; (2) subgraph Gs,i of G; and (3) a marked binary relation Ri ⊆ Vq × V, where
    • each match (u, v) ∈ Ri is marked as true, false or unknown; and
    • Ri can be updated by either messages or local computations.
  • Messages: only local information is allowed to be exchanged.
  • Local computations: update Ri by exploiting the semantics of graph simulation;
    • local algorithms execute only local computations, without message passing during the computation,
    • and run in time polynomial in |Q| and |Gs,i|.
Complexity bounds:
1. The optimal data shipment is |G| − 1, and it is tight.
2. The optimal number of visit times is 1, and it is tight.
3. The minimum makespan problem is NP-complete.
Remarks:
1. Data shipment, visit times and makespan are in conflict with one another.
2. We seek a well-balanced strategy between makespan and the other two measures.
Distributed Evaluation of Graph Simulation
• Stage 1: Coordinator SQ broadcasts Q to all k sites;
• Stage 2: All sites, in parallel, partially evaluate Q on their local fragments – partial matches;
• Stage 3: Ship the CCs that span different machines to single machines, while minimizing data shipment and makespan;
• Stage 4: Compute, in parallel, the maximum matches in the CCs that originally spanned multiple machines;
• Stage 5: Collect and assemble the partial matches at the coordinator.
Performance guarantees:
1. The total computational complexity is the same as that of the best-known centralized algorithm, while it invokes only 4 rounds of message passing and local evaluation;
2. The total data shipment is bounded by |G| + 4|B| + |Q||G| + (k − 1)|Q|;
3. Each machine except coordinator SQ is visited g + 2 times, where g is the maximum number of machines at which a CC resides in Stage 2; SQ is visited 2(k − 1) times.
Sacrifice data shipment and visit times for makespan!
Scheduling Data Shipment - Stage 3
The scheduling problem: Given h connected components C1, …, Ch and an integer k, find an assignment of the connected components to k identical machines such that both the makespan and the total data shipment are minimized.
Approximation hardness (data shipment, makespan): The scheduling problem is not approximable within (ε, max(k − 1, 2)) for any ε > 1.
Performance guarantees of algorithm dSchedule: Algorithm dSchedule produces an assignment of the scheduling problem such that the makespan is within a factor (2 − 1/k) of the optimal one.
Remarks: A heuristic is used to minimize the data shipment, and a greedy approach is adopted to guarantee the makespan bound. The algorithm runs in O(kh) time and is very efficient; hence its evaluation does not become a bottleneck. (The greedy makespan step is sketched below.)
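For flavor, here is a minimal sketch of the greedy makespan step: assign each CC to the currently least-loaded machine. Greedy list scheduling of this kind is what yields the classic (2 − 1/k) bound; the data-shipment heuristic is omitted, and this heap-based variant is an illustrative stand-in rather than dSchedule itself.

```python
# A sketch of greedy list scheduling: give each CC to the machine with
# the smallest current load, achieving makespan within (2 - 1/k) of
# optimal. The data-shipment heuristic of dSchedule is omitted.
import heapq

def greedy_assign(cc_sizes, k):
    heap = [(0, m) for m in range(k)]      # (current load, machine id)
    heapq.heapify(heap)
    assignment = {m: [] for m in range(k)}
    for i, size in enumerate(cc_sizes):
        load, m = heapq.heappop(heap)      # least-loaded machine
        assignment[m].append(i)
        heapq.heappush(heap, (load + size, m))
    return assignment

# Example: five CCs on two machines.
print(greedy_assign([7, 4, 3, 3, 1], 2))   # {0: [0, 3], 1: [1, 2, 4]}
```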
Optimization Techniques
Using data locality: determine whether (u, v) belongs to the maximum match M in G for Q:
• Case 1: when there are no boundary nodes in desc(G, v) of fragment Gj;
• Case 2: when there are boundary nodes in fragment Gj, but the subgraph desc(Q, u) of Q is a DAG.
• (SA, SA2): Case 1; (BA, BA2): Case 2.
Minimization: minimizing pattern graphs (Q ≡ Qm). Given pattern graph Q, we compute a minimized equivalent pattern graph Qm such that for any data graph G, G matches Q iff G matches Qm, via graph simulation. (One minimization idea is sketched below.)
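One way such minimization can be realized is via the simulation quotient: pattern nodes that simulate each other can be merged without changing which data graphs match. A hedged sketch reusing max_simulation with Q as both pattern and data graph; the paper's Qm construction may involve further steps (e.g. removing redundant edges), so this shows the node-merging idea only.

```python
# A sketch of pattern minimization by merging simulation-equivalent
# nodes of Q (u and u' are equivalent iff each simulates the other).
# Reuses max_simulation with Q playing both roles; assumes every key
# of q_adj is also a key of q_label. The actual Qm may differ.

def minimize_pattern(q_label, q_adj):
    sim = max_simulation(q_label, q_adj, q_label, q_adj)
    reps, rep = [], {}        # representatives; node -> representative
    for u in q_label:
        for r in reps:
            if u in sim[r] and r in sim[u]:   # mutual simulation
                rep[u] = r
                break
        else:
            reps.append(u)
            rep[u] = u
    m_label = {r: q_label[r] for r in reps}
    m_adj = {r: set() for r in reps}
    for u, children in q_adj.items():
        for c in children:
            m_adj[rep[u]].add(rep[c])
    return m_label, m_adj
```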
Experimental Study
Real-life datasets:
• Google Web graph: 875,713 nodes and 5,105,039 edges
• Amazon product co-purchasing network: 548,552 nodes and 1,788,725 edges
Synthetic graph generator (up to 10^8 nodes and 3,981,071,706 edges), with three parameters:
1. the number n of nodes;
2. the number n^α of edges; and
3. the number l of node labels.
Algorithms: algorithm disHHK and its optimized version disHHK+; optimal algorithms naiveMatchds (data shipment) and naiveMatchvt (visit times).
Machines: the experiments were run on a cluster of 16 machines, each with 2 Intel Xeon E5620 CPUs and 64 GB memory.
Experimental Study
1. All algorithms scale well except naiveMatchds and naiveMatchvt;
2. disHHK+ consistently reduces the running time of disHHK by about [1/5, 1/4].
Experimental Study
1. All algorithms ship only about 1/10000 of the data graphs;
2. disHHK+ and disHHK even ship less data than naiveMatchds when data graphs are large and sparse.
Experimental Study
disHHK+ and disHHK have [30%, 53%] more visit times than naiveMatchds, as expected.
Conclusion
We have formulated and investigated the distributed graph pattern matching problem, via graph simulation.
• We have given a static analysis of graph simulation:
  • utility of connected components;
  • study of data locality.
• We have studied the complexity of a large class of distributed algorithms for graph simulation:
  • a message-passing computation model;
  • makespan, data shipment, and visit times (in conflict with one another).
• We have proposed a distributed algorithm for graph simulation:
  • the scheduling problem;
  • optimization techniques;
  • experimental verification.
A first step towards the big picture of distributed graph pattern matching