250 likes | 370 Views
Graph Analysis With High-Performance Computing. Jonathan Berry Scalable Algorithms Department Sandia National Laboratories April 18, 2008. Informatics Datasets Are Different. Informatics : The analysis of datasets arising from “information” sources such as the WWW (not physical simulation)
E N D
Graph Analysis With High-Performance Computing Jonathan Berry Scalable Algorithms Department Sandia National Laboratories April 18, 2008
Informatics Datasets Are Different • Informatics: The analysis of datasets arising from “information” sources such as the WWW (not physical simulation) • Motivating Applications: • Homeland security • Computer security (DOE emphasis) • Biological networks, etc. “One of the interesting ramifications of the fact that the PageRank calculation converges rapidly is that the web is an expander-like graph” Page, Brin, Motwani,Winograd 1999 From UCSD ‘08 Broder, et al. ‘00 Primary HPC Implication: Any partitioning is “bad”
Informatics Algorithms Are Different As Well “The single largest performance bottleneck in the distributed connected components algorithm is the effect of poor vertex distribution…Several methods…have been implemented but none has been successful as of yet.” D. Gregor, from Parallel Boost Graph Library documentation on connected components Connected Components:find groupings of vertices such that all vertices within a group can reach each other S-T Connectivity:find a short path from vertex S to vertex T Single-Source Shortest Paths (SSSP):from a given vertex, find the shortest paths to all other vertices “[in power law graphs] there is a giant component…of size O(n)” Aiello, Chung, Lu, 2000
Informatics Problems Demand New Architectures Multithreaded architectures show promise for informatics problems, but more work is necessary…
MTGL ADAPTER We Are Developing The MultiThreaded Graph Library • Enables multithreaded graph algorithms • Builds upon community standard (Boost Graph Library) • Abstracts data structures and other application specifics • Hide some shared memory issues • Preserves good multithreaded performance S-T connectivity scaling (MTA-2) SSSP scaling (MTA-2) Solve time (sec) Solve time (sec) MTGL C MTGL C MTA-2 Processors MTA-2 Processors
“MTGL-ized” Code for S-T Connectivity • “u” is the next vertex out of the queue • “numEdges” is an index • “endVertex” is an array of endpoints • “adjVertices” is a generic MTGL function • “myIter” is a thread-local random-access iterator The compiler inlines away most of the apparent extra work, and (perhaps) data copying relieves a hot spot (we trap less)
Initial Algorithmic Impacts of MTGL on XMT Are Promising • S-T Connectivity • Gathering evidence for 2005prediction • This plot show results for ≤ 2 billion edges • Even with Threadstorm 2.0 limitations, XMT predictions look good • Connected Components • Simple SV is fast, but hot-spots • Multilevel Kahan algorithm scales (but XMT data incomplete) • No current competitors for large power-law graphs 32000p Blue Gene\L MTGL/XMT 10p Time (s) MTGL/MTA 10p # Edges MTGLKahan’s algorithm MTGLShiloach-Vishkin algorithm Time (s) MGTL on XMT sets performance standard for informatics problems # XMT Processors
PBGL SSSP Time (s) MTA SSSP # Processors A Recent Comparison With PBGL Finds Efficiency Gap • Parallel Boost Graph Library (PBGL) • Run Boost GL on clusters • Some graph algorithms can scale on some inputs • PBGL - MTA Comparison on SSSP • PBGL SSSP can scale on non-power law graphs • We compared to a pre-MTGL C code on the MTA-2 • 1 order of magnitude raw speed • 1 order of magnitude processor efficiency [Parallel Processing Letters ’07] Even when distributed memory approaches scale, massively multithreaded approaches are currently faster and more efficient.
1 2 3 4 Multithreading Means Thinking Differently (e.g. MatVec) Attempt #1 Hot spot 1/4 No hot spot; Compiler handled 11/6 1/3 7/12
1 2 3 4 Multithreading Case Study: MatVec Attempt #2 1/4 11/6 = 1/3 7/12 1/4 1 1/2 1/3 1/3 1/3 1/4
MatVec Case Study MTA-2 Results • Simple example: compute the in-degree of each vertex • Dotted line: attempt 1 (hot spot) • Solid line: attempt 2 (more memory, but not hot) • Instance • 2^25 vertices • 2^28 directed edges • “R-MAT” graph
The MTGL And Integration • Algorithms/architectures/visualization integration • Sandia architects profiled early MTGL to predict performance on XMT • Titan open-source visualization framework uses MTGL • Qthreads/MTGL
Qthreads • Lightweight, scalable threading API • Efficient memory synchronization primitives • Explicit “loop parallelization” • Development platform for future multi-threaded architectures • POSIX Threads implementation • Sandia Structural Simulation Toolkit support, for emerging architecture analysis • MTA port planned • Develop code today, efficient execution tomorrow
MTGL Code Example template <typename T> T mt_incr(T& target, T inc) { #ifdef __MTA__ Tres = int_fetch_add(&(target), inc); #elif USING_QTHREADS T res = qthread_incr((aligned_t*)&target, inc) - inc; #else T res = target; target += inc; #endif return res; } T mt_readff(T& toread) { #ifdef __MTA__ returnreadff(&(toread)); #elif USING_QTHREADS T ret; qthread_readFF(qthread_self(), &ret, &toread); return ret; #else return toread; #endif }
Community Detection in the MTGL • Community detection in large networks is an active field • We have an algorithm that iteratively groups vertices with representatives • A subroutine of our algorithm is Unconstrained Facility Location • We’ve ported an open-source code for a Lagrangian-based subgradient algorithm (“Volume”) to the MTGL • In serial, it’s as fast as the current fastest algorithm for community detection – and we’re working on MTA-2 scaling
Unconstrained Facility Location • Servers S, Customers C • Cost to open server i is f(i) • Cost for server i to serve customer j is cij • Indicator variables: facility opening: yi, service xij
Lagrangian Relaxation for Facility Location Problems • What is the biggest challenge in solving this formulation well?: “Every customer must be served.”
Lagrangian Relaxation for UFL • Solution strategy: lift those tough constraints out of the constraint matrix into the objective. (e.g., Avella, Sassano, Vasil’ev, 2003 for p-median) Good news: remaining problem easy to solve! Bad news: some of the original constraints might not be satisfied.
Lifting the Service Constraints An example violated constraint (customer j not fully served): For jth service constraint (violation), j is a Lagrange multiplier: Multiplier j weights cost of violating jth service constraint: New objective:
Solving the Relaxed Problem New problem: Set yi = 1 for locations with negative values of (i). Set xij = 1 if yi=1 and cij-j < 0, xij=0 otherwise. • Gives valid lower bound on the best UFL cost • Linear space and time.
MTA/XMT Programming: Use the Compiler • Here, we sum the reduced costs of the neighbors of one vertex • The removal of the reduction of “sum” prevents a “hot spot” • This output is from “canal,” an MTA/XMT compiler analysis tool
An Early Attempt to Scale on the MTA-2 • Instance: 262K vertices, 2M edges • Pseudo inverse power law degree distribution • 1 node of degree 262K, 64 of degree ~4000, 256K of degree ~5 • Preliminary MTA-2 implementation w/o profiling/tuning • Linux serial time: 573s In serial, runtimes are comparable to “CNM,” the O(n log^2 n) community finder from Clauset, Newman, Moore – but good MTA scaling is expected
Current MTGL Algorithms • Connected components (psearch, visit_edges, visit_adj) • Strongly-connected components (psearch) • Maximal independent set (visit_edges) • Typed subgraph isomorphism (psearch, visit_edges) • S-t connectivity (bfs) • Single-source shortest paths (psearch) • Betweenness centrality (bfs-like) • Community detection (all kernels) • Connection subgraphs (bfs, sparse matrix, mt-quicksort) • Find triangles (psearch) • Find assortativity (psearch) • Find modularity (psearch) • PageRank (matvec) • Network Simplex for MaxFlow Under development: • Motif detection • more Berkeley Open-Source Licence pending
Acknowledgements MultiThreading Background Simon Kahan (Google (formerly Cray)) Petr Konecny (Google (formerly Cray)) MultiThreading/Distributed Memory Comparisons Bruce Hendrickson (1415) Douglas Gregor (Indiana U.) Andrew Lumsdaine (Indiana U.) MTGL Algorithm Design and Development Vitus Leung (1415) Kamesh Madduri (Georgia Tech.) William McLendon (1423) Cynthia Phillips (1412) MTGL Integration Brian Wylie (1424) Kyle Wheeler (Notre Dame) Brian Barrett (1422)