
Graph Analysis With High-Performance Computing


Presentation Transcript


  1. Graph Analysis With High-Performance Computing. Jonathan Berry, Scalable Algorithms Department, Sandia National Laboratories. April 18, 2008

  2. Informatics Datasets Are Different
• Informatics: the analysis of datasets arising from “information” sources such as the WWW (not physical simulation)
• Motivating applications: homeland security, computer security (DOE emphasis), biological networks, etc.
• “One of the interesting ramifications of the fact that the PageRank calculation converges rapidly is that the web is an expander-like graph” (Page, Brin, Motwani, Winograd, 1999)
[Figures: a web-graph visualization from UCSD ’08; the web structure of Broder et al. ’00]
• Primary HPC implication: any partitioning is “bad”

  3. Informatics Algorithms Are Different As Well
• “The single largest performance bottleneck in the distributed connected components algorithm is the effect of poor vertex distribution… Several methods… have been implemented but none has been successful as of yet.” D. Gregor, from Parallel Boost Graph Library documentation on connected components
• Connected components: find groupings of vertices such that all vertices within a group can reach each other
• S-T connectivity: find a short path from vertex S to vertex T
• Single-source shortest paths (SSSP): from a given vertex, find the shortest paths to all other vertices
• “[in power law graphs] there is a giant component… of size O(n)” Aiello, Chung, Lu, 2000

  4. Informatics Problems Demand New Architectures Multithreaded architectures show promise for informatics problems, but more work is necessary…

  5. We Are Developing The MultiThreaded Graph Library (MTGL)
• Enables multithreaded graph algorithms
• Builds upon a community standard (Boost Graph Library)
• Abstracts data structures and other application specifics
• Hides some shared-memory issues
• Preserves good multithreaded performance
[Figures: S-T connectivity and SSSP scaling on the MTA-2; solve time (sec) vs. MTA-2 processors, MTGL vs. C]

  6. “MTGL-ized” Code for S-T Connectivity
• “u” is the next vertex out of the queue
• “numEdges” is an index
• “endVertex” is an array of endpoints
• “adjVertices” is a generic MTGL function
• “myIter” is a thread-local random-access iterator
• The compiler inlines away most of the apparent extra work, and (perhaps) data copying relieves a hot spot (we trap less)

  7. Initial Algorithmic Impacts of MTGL on XMT Are Promising
• S-T connectivity: gathering evidence for a 2005 prediction; the plot shows results for ≤ 2 billion edges; even with Threadstorm 2.0 limitations, XMT predictions look good
• Connected components: simple Shiloach-Vishkin is fast, but hot-spots; the multilevel Kahan algorithm scales (but XMT data incomplete); no current competitors for large power-law graphs
[Figures: time (s) vs. # edges for MTGL/XMT (10p), MTGL/MTA (10p), and 32000p Blue Gene/L; time (s) vs. # XMT processors for MTGL Kahan’s algorithm vs. MTGL Shiloach-Vishkin]
• MTGL on XMT sets the performance standard for informatics problems

  8. A Recent Comparison With PBGL Finds an Efficiency Gap
• Parallel Boost Graph Library (PBGL): runs Boost GL on clusters; some graph algorithms can scale on some inputs
• PBGL-MTA comparison on SSSP: PBGL SSSP can scale on non-power-law graphs; we compared to a pre-MTGL C code on the MTA-2
• 1 order of magnitude raw speed; 1 order of magnitude processor efficiency [Parallel Processing Letters ’07]
[Figure: SSSP time (s) vs. # processors, PBGL vs. MTA]
• Even when distributed memory approaches scale, massively multithreaded approaches are currently faster and more efficient.

  9. Multithreading Means Thinking Differently (e.g. MatVec)
• Attempt #1
[Figure: a 4-vertex MatVec worked example; annotations contrast “hot spot” with “no hot spot; compiler handled”]

  10. Multithreading Case Study: MatVec
• Attempt #2
[Figure: the same 4-vertex MatVec example worked through as attempt #2, without the hot spot]

  11. MatVec Case Study: MTA-2 Results
• Simple example: compute the in-degree of each vertex
• Dotted line: attempt 1 (hot spot); solid line: attempt 2 (more memory, but no hot spot)
• Instance: 2^25 vertices, 2^28 directed edges, “R-MAT” graph

  12. The MTGL And Integration
• Algorithms/architectures/visualization integration
• Sandia architects profiled early MTGL to predict performance on XMT
• The Titan open-source visualization framework uses MTGL
• Qthreads/MTGL

  13. Qthreads
• Lightweight, scalable threading API
• Efficient memory-synchronization primitives
• Explicit “loop parallelization”
• Development platform for future multithreaded architectures
• POSIX Threads implementation
• Sandia Structural Simulation Toolkit support, for emerging-architecture analysis
• MTA port planned
• Develop code today, efficient execution tomorrow

  14. MTGL Code Example

template <typename T>
T mt_incr(T& target, T inc) {
#ifdef __MTA__
  T res = int_fetch_add(&target, inc);
#elif USING_QTHREADS
  T res = qthread_incr((aligned_t*)&target, inc) - inc;
#else
  T res = target;
  target += inc;
#endif
  return res;
}

template <typename T>
T mt_readff(T& toread) {
#ifdef __MTA__
  return readff(&toread);
#elif USING_QTHREADS
  T ret;
  qthread_readFF(qthread_self(), &ret, &toread);
  return ret;
#else
  return toread;
#endif
}

  15. Qthreads / MTGL Scalability

  16. Community Detection in the MTGL
• Community detection in large networks is an active field
• We have an algorithm that iteratively groups vertices with representatives
• A subroutine of our algorithm is unconstrained facility location
• We’ve ported an open-source code for a Lagrangian-based subgradient algorithm (“Volume”) to the MTGL
• In serial, it’s as fast as the current fastest algorithm for community detection, and we’re working on MTA-2 scaling

  17. Unconstrained Facility Location
• Servers S, customers C
• Cost to open server i is f(i)
• Cost for server i to serve customer j is c_ij
• Indicator variables: facility opening y_i, service x_ij
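With the notation above, the standard integer-programming formulation can be written out (a textbook-style reconstruction; the slide's original formula was an image):

```latex
\min \; \sum_{i \in S} f(i)\, y_i \;+\; \sum_{i \in S} \sum_{j \in C} c_{ij}\, x_{ij}
\quad \text{s.t.} \quad
\sum_{i \in S} x_{ij} = 1 \;\; \forall j \in C, \qquad
x_{ij} \le y_i \;\; \forall i, j, \qquad
x_{ij},\, y_i \in \{0, 1\}.
```

The equality constraints are the "every customer must be served" conditions discussed on the next slides.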

  18. Lagrangian Relaxation for Facility Location Problems
• What is the biggest challenge in solving this formulation well? “Every customer must be served.”

  19. Lagrangian Relaxation for UFL
• Solution strategy: lift those tough constraints out of the constraint matrix into the objective (e.g., Avella, Sassano, Vasil’ev, 2003 for p-median)
• Good news: the remaining problem is easy to solve!
• Bad news: some of the original constraints might not be satisfied.

  20. Lifting the Service Constraints
• An example violated constraint (customer j not fully served): Σ_i x_ij < 1
• For the jth service constraint (violation), λ_j is a Lagrange multiplier
• The multiplier λ_j weights the cost of violating the jth service constraint
• New objective: the original cost plus the λ-weighted violations
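The lifted objective described above takes the standard form below (a reconstruction consistent with the formulation and the solve on the next slide; the slide's own equation was an image):

```latex
L(\lambda) \;=\; \min_{x, y} \; \sum_{i} f(i)\, y_i
 \;+\; \sum_{i} \sum_{j} c_{ij}\, x_{ij}
 \;+\; \sum_{j} \lambda_j \Bigl( 1 - \sum_{i} x_{ij} \Bigr)
 \;=\; \sum_{j} \lambda_j \;+\; \min_{x, y} \sum_{i} \Bigl( f(i)\, y_i + \sum_{j} (c_{ij} - \lambda_j)\, x_{ij} \Bigr),
```

still subject to $x_{ij} \le y_i$ and integrality, but with the coupling service constraints gone.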

  21. Solving the Relaxed Problem
• New problem: set y_i = 1 for locations with negative reduced cost ρ(i); set x_ij = 1 if y_i = 1 and c_ij − λ_j < 0, x_ij = 0 otherwise
• Gives a valid lower bound on the best UFL cost
• Linear space and time.

  22. MTA/XMT Programming: Use the Compiler
• Here, we sum the reduced costs of the neighbors of one vertex
• Removing the reduction of “sum” prevents a “hot spot”
• This output is from “canal,” an MTA/XMT compiler-analysis tool

  23. An Early Attempt to Scale on the MTA-2
• Instance: 262K vertices, 2M edges
• Pseudo-inverse power-law degree distribution: 1 node of degree 262K, 64 of degree ~4000, 256K of degree ~5
• Preliminary MTA-2 implementation without profiling/tuning
• Linux serial time: 573s
• In serial, runtimes are comparable to “CNM,” the O(n log^2 n) community finder of Clauset, Newman, Moore, but good MTA scaling is expected

  24. Current MTGL Algorithms
• Connected components (psearch, visit_edges, visit_adj)
• Strongly-connected components (psearch)
• Maximal independent set (visit_edges)
• Typed subgraph isomorphism (psearch, visit_edges)
• S-T connectivity (bfs)
• Single-source shortest paths (psearch)
• Betweenness centrality (bfs-like)
• Community detection (all kernels)
• Connection subgraphs (bfs, sparse matrix, mt-quicksort)
• Find triangles (psearch)
• Find assortativity (psearch)
• Find modularity (psearch)
• PageRank (matvec)
• Network simplex for max-flow
Under development:
• Motif detection
• More
Berkeley open-source license pending

  25. Acknowledgements
• MultiThreading background: Simon Kahan (Google, formerly Cray), Petr Konecny (Google, formerly Cray)
• MultiThreading/distributed-memory comparisons: Bruce Hendrickson (1415), Douglas Gregor (Indiana U.), Andrew Lumsdaine (Indiana U.)
• MTGL algorithm design and development: Vitus Leung (1415), Kamesh Madduri (Georgia Tech.), William McLendon (1423), Cynthia Phillips (1412)
• MTGL integration: Brian Wylie (1424), Kyle Wheeler (Notre Dame), Brian Barrett (1422)
