Explore the innovative Tri-Fly algorithm for distributed estimation of global and local triangle counts in graph streams. This powerful algorithm addresses the challenges of real-world dynamic graphs, leveraging distributed and streaming approaches for efficient and accurate results.
Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams
Kijung Shin (1), Mohammad Hammoud (1), Euiwoong Lee (1), Jinoh Oh (2), Christos Faloutsos (1)
(1) Carnegie Mellon University, (2) Adobe Systems
Introduction Problem Algorithm Analysis Experiments Conclusion
Triangles in a Graph
• Graphs are everywhere!
  • social networks, the web, citation networks
• Triangles are a fundamental primitive
  • 3 nodes connected to each other
• Counting triangles has many applications
  • community detection, anomaly detection, query optimization
Distributed Estimation of Global and Local Triangle Counts in Graph Streams (by Kijung Shin)
Application: Anomaly Detection [LJK18] [KMF11]
[Figure: two plots of # incident triangles vs. degree; one point is labeled "Telemarketer"]
Remaining Challenges
• Counting triangles in real-world graphs, such as
  • online social networks
  • the Web
  • citation networks
  • call networks
• Real-world graphs are
  • Large: not fitting in main memory
  • Dynamic: growing with new nodes and edges
Previous Approaches
• Distributed algorithms [SV11] [PC13] [PPK18]
  • pros: utilize multiple machines
  • cons: inapplicable to dynamic graphs
• Streaming algorithms [DERU16] [Shi17] [LJK18]
  • pros: applicable to dynamic graphs
  • cons: limited to a single machine
Our Approach and Goal
• Can we have the best of both worlds?
  • for dynamic graphs
  • on multiple machines
• We design a distributed streaming algorithm that is
  • Fast and Accurate: outperforming competitors
  • Scalable: with linear data scalability
  • Theoretically Sound: with unbiased estimates
Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses
• Experiments
• Conclusion
Problem Definition
• Given: a graph stream
  • a sequence of new edges in a dynamic graph
• Estimate: counts of global and local triangles
• Using: multiple machines with limited memory
  • only up to a fixed budget of edges can be stored in each machine
• To minimize: estimation error
Problem Definition (cont.)
• Global triangles: all triangles in the graph
• Local triangles: the triangles incident to each node
[Figure: example graph with numbered nodes, annotated with triangle counts]
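To make the two quantities concrete, here is a minimal Python sketch that counts global and local triangles exactly on a small static graph. It only illustrates the definitions; it is not the streaming estimator, and the function name `count_triangles` is hypothetical. It assumes an undirected simple graph whose node ids can be compared with `<`.

```python
from collections import defaultdict


def count_triangles(edges):
    """Return (global_count, local_counts) for an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    global_count = 0
    local_counts = defaultdict(int)
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Each common neighbor w closes the triangle {u, v, w};
                # requiring u < v < w counts every triangle exactly once.
                for w in adj[u] & adj[v]:
                    if v < w:
                        global_count += 1
                        for node in (u, v, w):
                            local_counts[node] += 1
    return global_count, dict(local_counts)


edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
print(count_triangles(edges))
```

In this example the graph contains the triangles {1, 2, 3} and {2, 3, 4}, so nodes 2 and 3 are each incident to two triangles and nodes 1 and 4 to one.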
Road Map
• Problem Definition
• Algorithm: Tri-Fly <<
• Theoretical Analyses
• Experiments
• Conclusion
Overview of Tri-Fly
• Inputs: new edges streamed from source(s)
• Outputs: estimated counts of global and local triangles
• Processes each new edge when it arrives
• Updates estimated counts for each edge
• Components: source(s), master(s), worker(s), aggregator(s)
  • worker(s): discover triangles with limited memory
  • aggregator(s): aggregate estimates
Overview of Tri-Fly (cont.)
• source(s) to master(s): each new edge is sent by unicast
• master(s) to worker(s): each new edge is broadcast
• worker(s): count new triangles using local memory (discover triangles with limited memory)
• worker(s) to aggregator(s): counts are shuffled by node
• aggregator(s): aggregate counts & update outputs (aggregate estimates)
Challenge: Limited Memory
• worker(s): count new triangles using local memory
• aggregator(s): aggregate counts & update outputs
• How should we ‘count’ and ‘aggregate’ for accurate estimation when each machine has limited memory?
• Our solution adapts Triest-IMPR [DERU16]
Workers in Detail
• Runs three steps for each received edge
  • (a) Edge arrival
  • (b) Discovering
  • (c) Sampling
Workers in Detail (cont.)
• (a) Edge arrival step
  • receives a new edge
Workers in Detail (cont.)
• (b) Discovering step
  • discovers new triangles in its local memory
  • sends updates to the aggregators
    • each update carries the value 1 / (discovering probability of the triangle)
Workers in Detail (cont.)
• (c) Sampling step
  • stores or discards the new edge
  • follows the standard reservoir sampling
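The three worker steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `Worker`, its methods, and the `(key, value)` update format with `"*"` marking the global count are hypothetical. The discovering probability is assumed to follow the Triest-IMPR-style formula min(1, k(k-1) / ((t-1)(t-2))) for the t-th edge and a budget of k stored edges, and the stream is assumed to contain each edge once.

```python
import random
from collections import defaultdict

GLOBAL = "*"  # hypothetical key for global-count updates


class Worker:
    """Hypothetical worker keeping at most k sampled edges in local memory."""

    def __init__(self, k, seed=None):
        self.k = k                    # memory budget (max stored edges)
        self.t = 0                    # number of edges received so far
        self.sample = []              # stored edges
        self.adj = defaultdict(set)   # adjacency sets of the stored edges
        self.rng = random.Random(seed)

    def process(self, u, v):
        """Run steps (a)-(c) for edge (u, v); return updates for aggregators."""
        self.t += 1                       # (a) edge arrival
        updates = self._discover(u, v)    # (b) discovering
        self._sample(u, v)                # (c) sampling
        return updates

    def _discover(self, u, v):
        # Triangles closed by (u, v) whose other two edges are both stored.
        common = self.adj.get(u, set()) & self.adj.get(v, set())
        if not common:
            return []
        # Assumed Triest-IMPR-style probability that both other edges
        # are in the sample when the t-th edge arrives.
        pairs_seen = (self.t - 1) * (self.t - 2)
        p = min(1.0, self.k * (self.k - 1) / pairs_seen)
        weight = 1.0 / p
        updates = []
        for w in common:
            updates.append((GLOBAL, weight))
            updates.extend((node, weight) for node in (u, v, w))
        return updates

    def _sample(self, u, v):
        if len(self.sample) < self.k:
            self.sample.append((u, v))    # memory not full: always store
        elif self.rng.random() < self.k / self.t:
            # Standard reservoir sampling: replace a uniformly random edge.
            idx = self.rng.randrange(self.k)
            a, b = self.sample[idx]
            self.adj[a].discard(b)
            self.adj[b].discard(a)
            self.sample[idx] = (u, v)
        else:
            return                        # discard the new edge
        self.adj[u].add(v)
        self.adj[v].add(u)
```

Discovery runs before sampling, so a new edge is checked against the edges already in memory and only then competes for a slot. When the budget is at least the number of edges in the stream, every probability is 1 and the summed update values equal the exact counts.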
Aggregators in Detail
• Maintain estimates
  • one estimate for the global triangle count
  • one estimate for the local triangle count of each node
• Update estimates
  • for each update to the global count, increase the global estimate
  • for each update to a node, increase the local estimate of that node
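The aggregator's bookkeeping can be sketched in Python as follows. The symbols for the estimates and the exact increment did not survive in the slide text, so this is an assumption-laden sketch rather than the authors' method: the class `Aggregator`, the `"*"` key for global updates, and the choice to divide each received value by the number of workers (so that the result is the average of the per-worker estimates, given that every worker receives every edge) are all assumptions.

```python
from collections import defaultdict

GLOBAL = "*"  # hypothetical key for global-count updates


class Aggregator:
    """Hypothetical aggregator averaging updates from num_workers workers."""

    def __init__(self, num_workers):
        self.n = num_workers
        self.global_estimate = 0.0
        self.local_estimate = defaultdict(float)

    def apply(self, key, value):
        # Assumption: each worker reports 1 / (discovering probability),
        # and dividing by n averages the n per-worker estimates.
        share = value / self.n
        if key == GLOBAL:
            self.global_estimate += share
        else:
            self.local_estimate[key] += share


agg = Aggregator(num_workers=2)
agg.apply(GLOBAL, 4.0)   # update from one worker
agg.apply(GLOBAL, 2.0)   # update from another worker
agg.apply("a", 3.0)      # local update for a node named "a"
print(agg.global_estimate, dict(agg.local_estimate))
```

With several aggregators, local updates could be routed by node id so that each node's estimate lives on one aggregator, which matches the "shuffle counts by node" step in the overview.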
Summary of Tri-Fly
• source(s) to master(s): each new edge is sent by unicast
• master(s) to worker(s): each new edge is broadcast
• worker(s): count new triangles in their local memory (discover triangles with limited memory)
• worker(s) to aggregator(s): counts are shuffled by node
• aggregator(s): aggregate counts & update outputs (aggregate estimates)
Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses <<
• Experiments
• Conclusion
THM1: Unbiasedness
• Tri-Fly maintains unbiased estimates:
  • the expected value of the global estimate equals the true global triangle count
  • for each node, the expected value of its local estimate equals its true local triangle count
[Figure: frequency distribution of the estimates, with the true count marked]
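A short derivation shows why weighting each discovered triangle by the inverse of its discovering probability gives an unbiased estimate. The notation is an assumption introduced for illustration and is not taken from the slides: $\mathcal{T}$ is the set of triangles, $p_\tau > 0$ is the probability that a worker discovers triangle $\tau$, $X_\tau$ is the indicator that it does, and $\hat{c}$ is that worker's global estimate.

```latex
\hat{c} = \sum_{\tau \in \mathcal{T}} \frac{X_\tau}{p_\tau},
\qquad
\mathbb{E}[\hat{c}]
  = \sum_{\tau \in \mathcal{T}} \frac{\mathbb{E}[X_\tau]}{p_\tau}
  = \sum_{\tau \in \mathcal{T}} \frac{p_\tau}{p_\tau}
  = |\mathcal{T}|.
```

Restricting the sum to the triangles incident to a given node yields the same argument for that node's local count. Averaging unbiased per-worker estimates keeps the result unbiased.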
THM2: Linear Drop of Variance
• Tri-Fly maintains estimates whose variance decreases in inverse proportion to the number of workers
  • this holds for the global estimate and for the local estimate of each node
[Figure: log(Variance) vs. log(#Workers)]
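The inverse-proportional drop can be worked out under simplifying assumptions that are not stated on the slide: the $n$ workers produce independent estimates $\hat{c}_1, \dots, \hat{c}_n$ with a common variance $\sigma^2$, and the final estimate $\bar{c}$ is their average.

```latex
\bar{c} = \frac{1}{n} \sum_{i=1}^{n} \hat{c}_i,
\qquad
\mathrm{Var}[\bar{c}]
  = \frac{1}{n^{2}} \sum_{i=1}^{n} \mathrm{Var}[\hat{c}_i]
  = \frac{\sigma^{2}}{n},
\qquad
\log \mathrm{Var}[\bar{c}] = \log \sigma^{2} - \log n.
```

Under these assumptions the variance falls along a straight line of slope $-1$ on log-log axes, which is what a "linear drop" looks like in a plot of log(Variance) against log(#Workers).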
THM3: Linear Scalability
• With a fixed per-worker memory budget, the running time of Tri-Fly grows linearly with the number of edges
[Figure: running time of Tri-Fly vs. # edges]
Properties of Tri-Fly
• Fast and accurate: outperforming competitors
• Scalable: with linear data scalability (THM 3)
• Theoretically sound: with unbiased estimates (THM 1)
Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses
• Experiments <<
• Conclusion
Experimental Settings
• Competitors: MASCOT [LJK18] & Triest-IMPR [DERU16]
  • state-of-the-art single-machine streaming algorithms
  • for both global and local triangle counts
• Implementations:
  • C++ & MPICH (asynchronous communication)
  • 1 master & 1 aggregator & up to 40 workers
• Datasets:
  • ER Synthetic (100B)
  • Social (1.8B+)
  • Social (22M+)
  • Patent citation (16M+)
  • Web (6M+)
EXP1. Bias Analysis
• “Does Tri-Fly give unbiased estimates?” (THM 1)
[Figure: estimates of Tri-Fly with 1, 5, and 10 workers, compared with the true count]
EXP2. Variance Analysis
• “How rapidly does the variance decrease w.r.t. the number of workers?” (THM 2)
[Figure: variance of the estimates of MASCOT, Triest-IMPR, and Tri-Fly]
EXP3. Speed and Accuracy
• “Does Tri-Fly outperform single-machine baselines?”
[Figure: Root Mean Square Error of Tri-Fly compared with the single-machine baselines]
EXP4. Scalability
• “Does Tri-Fly scale linearly with the size of the input stream?” (THM 3)
• Dataset: ER, up to 100B edges (800GB)
[Figure: running time of Tri-Fly, increasing linearly with the size of the input stream]
Properties of Tri-Fly
• Fast and accurate: outperforming competitors (EXP 3)
• Scalable: with linear data scalability (EXP 4)
• Theoretically sound: with unbiased estimates (EXP 1)
Road Map
• Problem Definition
• Algorithm: Tri-Fly
• Theoretical Analyses
• Experiments
• Conclusion <<
Conclusion
• We propose Tri-Fly
  • the first distributed streaming algorithm
  • for counting global and local triangles
• Tri-Fly is
  • Fast & Accurate
  • Scalable
  • Theoretically Sound
• Code and datasets:
  • https://github.com/kijungs/trifly
References
[SV11] Siddharth Suri, Sergei Vassilvitskii, “Counting Triangles and the Curse of the Last Reducer”, WWW 2011
[KMF11] U Kang, Brendan Meeder, Christos Faloutsos, “Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation”, PAKDD 2011
[PC13] Ha-Myung Park, Chin-Wan Chung, “An Efficient MapReduce Algorithm for Counting Triangles in a Very Large Graph”, CIKM 2013
[DERU16] Lorenzo De Stefani et al., “TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size”, KDD 2016
[Shi17] Kijung Shin, “WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams”, ICDM 2017
[LJK18] Yongsub Lim, Minsoo Jung, U Kang, “Memory-efficient and Accurate Sampling for Counting Local Triangles in Graph Streams: From Simple to Multigraphs”, TKDD 2018
[PPK18] Ha-Myung Park, Chiwan Park, U Kang, “PegasusN: A Scalable and Versatile Graph Mining System”, AAAI 2018