Parallel Prim’s Algorithm on dense graphs with a novel extension

Ekaterina Gonina and Laxmikant Kale • Parallel Programming Lab, Department of Computer Science, UIUC

Problem Statement • Parallel implementation of Prim's algorithm on a 100% dense graph • > 10,000 vertices • All vertices connected • Find the minimum spanning tree • Goals: • Get speedup from running on multiple processors • Process large problems that don't fit on one processor

Graph Representation • Symmetric adjacency matrix • Distributed across processors • pseudo-random function to generate edge weights • Consistent on any number of processors
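One way to get edge weights that are consistent on any number of processors is to derive each weight from the vertex pair alone. The sketch below is an assumption, not the generator used in the talk: `edge_weight`, `owned_vertices`, the `seed` parameter, and the block distribution are all hypothetical names and choices made for illustration.

```python
import hashlib


def edge_weight(u: int, v: int, seed: int = 0, max_weight: int = 1_000_000) -> int:
    """Deterministic pseudo-random weight for the undirected edge {u, v}.

    The weight depends only on (seed, u, v), so any processor can compute
    it locally and every processor gets the same value.
    """
    if u == v:
        return 0  # diagonal of the adjacency matrix; no self-loops here
    a, b = (u, v) if u < v else (v, u)  # order the pair so the matrix is symmetric
    digest = hashlib.sha256(f"{seed}:{a}:{b}".encode()).digest()
    return 1 + int.from_bytes(digest[:8], "big") % max_weight


def owned_vertices(rank: int, num_procs: int, n: int) -> range:
    """Block of vertex indices owned by `rank` (hypothetical block distribution)."""
    base, extra = divmod(n, num_procs)
    start = rank * base + min(rank, extra)
    return range(start, start + base + (1 if rank < extra else 0))
```

Because no weight depends on which processor asks for it, the same graph is produced whether the matrix is split across 1 or 1,000 processors.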

Baseline Parallel Implementation • Data is generated across processors • Vertex 0 is taken as the initial “in tree” set • Each processor finds its closest vertex to the “in tree” set • MIN reduction • Broadcast of the new vertex to all processors • Repeat until all vertices have been processed
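The baseline steps above can be sketched as a single-process simulation in which each "processor" is just a block of vertex indices. This is an illustrative assumption, not the authors' MPI or Charm++ code: `baseline_prim` and its block layout are hypothetical, and the reduction and broadcast are stood in for by an ordinary `min` and a shared array.

```python
import math


def baseline_prim(weight, n, num_procs):
    """Simulate the baseline scheme; returns the total MST weight.

    weight(u, v) gives the edge weight; the graph is assumed fully connected.
    """
    blocks = [range(r * n // num_procs, (r + 1) * n // num_procs)
              for r in range(num_procs)]
    in_tree = [False] * n
    in_tree[0] = True                      # vertex 0 starts the tree
    dist = [math.inf] * n                  # cheapest known edge into the tree
    for v in range(1, n):
        dist[v] = weight(0, v)
    total = 0
    for _ in range(n - 1):
        # Local step: each processor proposes its closest out-of-tree vertex.
        local_best = []
        for block in blocks:
            cands = [(dist[v], v) for v in block if not in_tree[v]]
            local_best.append(min(cands) if cands else (math.inf, -1))
        # Stand-in for the MIN reduction followed by the broadcast.
        d, new_v = min(local_best)
        in_tree[new_v] = True
        total += d
        # On a real run each processor would update only its own block.
        for v in range(n):
            if not in_tree[v]:
                dist[v] = min(dist[v], weight(new_v, v))
    return total
```

Note that every one of the n - 1 iterations pays for one reduction and one broadcast, which is the communication cost the later slides try to amortize.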

Limitations of the Baseline Approach • There is some speedup when running the algorithm on small graphs • Communication time dominates computation, resulting in no speedup for larger graphs

New Approach • The cost of reducing one number is about the same as reducing a small collection of numbers • New approach to increase efficiency: add multiple vertices in each iteration

“Checks” • Guarantee that adding each vertex in the K array will yield an MST • Checks: 1. Check if any of the vertices in the array have a shorter edge to the current vertex in question 2. Check if any of the vertices in the array have a shorter edge to a vertex not in the tree than the current vertex’s edge • [Figure: example graph with vertices A through G]
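A minimal sketch of the two checks, under one reading of the slide: a candidate is compared against the candidates accepted before it in the same batch, and "shorter" means shorter than the candidate's own cheapest edge into the tree. The function name `passes_checks` and its parameters are hypothetical.

```python
def passes_checks(cand, cand_dist, earlier, outside, weight):
    """Return True if `cand` may join the tree in the same batch as `earlier`.

    cand_dist: weight of cand's cheapest edge into the current tree.
    earlier:   candidates already accepted in this batch.
    outside:   vertices not in the tree (on a real run, only the ones
               owned by this processor would be scanned).
    """
    for c in earlier:
        # Check 1: an earlier candidate reaches `cand` more cheaply than
        # the tree does, so cand's recorded edge is no longer the right one.
        if weight(c, cand) < cand_dist:
            return False
        # Check 2: an earlier candidate reaches some other out-of-tree
        # vertex more cheaply, so that vertex would have to be added first.
        for x in outside:
            if x != cand and x not in earlier and weight(c, x) < cand_dist:
                return False
    return True
```

If both checks pass, the candidate's edge is a lightest edge leaving the tree plus the earlier candidates, which is the standard condition for an edge to be safe to add to an MST.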

Enhanced Algorithm
function primsMST(graph G){
  add vertex 0 to VerticesInTree
  numVerticesInTree = 1
  while(numVerticesInTree < number of vertices in G){
    // on each processor
    array[K] = findKclosestVertices(VerticesInTree)
    globalArray[K] = AllReduce(array[K])
    // globalArray now has the K globally closest vertices to the tree
    K1 = determineValidVertices(globalArray) // checks
    // K1 = number of vertices valid for this processor
    K2 = AllReduce(K1)
    // K2 = number of globally valid vertices
    add globalArray[1..K2] to VerticesInTree
    numVerticesInTree += K2
  }
}
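The enhanced loop can be tried out as a sequential simulation. This sketch makes several assumptions: `batched_prim` is a hypothetical name, both AllReduce calls are replaced by plain Python operations over shared arrays, and the validity test is one interpretation of the checks (accept a prefix of the K candidates, stopping at the first one that fails).

```python
import math


def batched_prim(weight, n, k):
    """Simulate the K-at-a-time variant; returns (MST weight, reductions used)."""
    in_tree = [False] * n
    in_tree[0] = True
    dist = [math.inf] * n                  # cheapest known edge into the tree
    for v in range(1, n):
        dist[v] = weight(0, v)
    total, added, rounds = 0, 1, 0
    while added < n:
        rounds += 1
        outside = [v for v in range(n) if not in_tree[v]]
        # Stand-in for the local search plus the first AllReduce.
        cands = sorted(outside, key=lambda v: (dist[v], v))[:k]
        accepted = [cands[0]]              # the closest vertex is always safe
        for c in cands[1:]:
            rest = [x for x in outside if x not in accepted]
            # Both checks at once: no accepted vertex may have an edge
            # shorter than dist[c] to any remaining vertex (c included).
            if any(weight(a, x) < dist[c] for a in accepted for x in rest):
                break                      # stand-in for the second AllReduce (K2)
            accepted.append(c)
        for c in accepted:
            in_tree[c] = True
            total += dist[c]
        added += len(accepted)
        # Refresh distances against every vertex added in this round.
        for v in range(n):
            if not in_tree[v]:
                dist[v] = min([dist[v]] + [weight(c, v) for c in accepted])
    return total, rounds
```

With `k=1` this reduces to the baseline, one vertex per round; larger `k` can only lower the number of rounds, and the MST weight stays the same.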

Vertices added per reduction vs. iteration number • 50,000 vertices • “Ceilings out” at K=8: trying to add more vertices simply wastes work

Cray XT3 results • 100,000 vertices • 50,000 vertices

200,000 vertices run

Summary of Final Results

Additional Experiments • Do work during the reduction – update data structures as if all K vertices were added • Increased CPU utilization from 25% to 40%

Work during reduction - some improvement

MPI vs Charm

Summary • Parallel Prim’s algorithm allows solving large problems that do not fit on one processor • The baseline gives no speedup on larger graphs because communication time exceeds computation time • Significant improvement from adding multiple vertices per iteration • More interesting work can be done by overlapping communication with computation

Thank you! Questions?
