Galois Performance

Galois Performance
Mario Mendez-Lojo, Donald Nguyen

Overview
• Galois system is a test bed to explore optimizations
• Safe but not fast out of the box
• Important optimizations
  • Select least transactional overhead
  • Select right scheduling
  • Select appropriate data structure
• Quantify optimizations on applications

Algorithms
1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push

[Figure: classification of irregular algorithms]
• Topology: general graph, grid, tree
• Operator: morph, local computation, reader
• Ordering: unordered, ordered

Methodology
[Figure: execution timeline, time vs. threads, split into Serial, Idle, GC, and Compute]
• Abort ratio: aborted iterations / total iterations
• GC options
  • UseParallelGC
  • UseParallelOldGC
  • NewRatio=1

Terms
• Base
  • Default scheduling, default graph
• Serial
  • Galois classes replaced by classes with no concurrency control
• Speedup
  • Relative to the best mean performance of a serial variant
• Throughput
  • Number of serial iterations / time
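The ratios used in this deck (abort ratio from the Methodology slide, speedup and throughput from the Terms slide) can be sketched as plain arithmetic. This is only an illustration: the class name `RunMetrics` and its method names are hypothetical and are not part of Galois. The sample values in `main` are the serial and parallel times reported on the Barnes-Hut Results slide.

```java
import java.util.Locale;

// Hypothetical helper, not a Galois class: the deck's three ratios as static methods.
final class RunMetrics {
    private RunMetrics() {}

    /** Abort ratio: aborted iterations divided by total iterations. */
    public static double abortRatio(long abortedIterations, long totalIterations) {
        if (totalIterations <= 0 || abortedIterations < 0 || abortedIterations > totalIterations) {
            throw new IllegalArgumentException("need total > 0 and 0 <= aborted <= total");
        }
        return (double) abortedIterations / totalIterations;
    }

    /** Speedup: time of the best serial variant divided by the parallel time. */
    public static double speedup(double bestSerialMillis, double parallelMillis) {
        requirePositive(parallelMillis, "parallelMillis");
        return bestSerialMillis / parallelMillis;
    }

    /** Throughput: number of serial iterations per unit of elapsed time. */
    public static double throughput(long serialIterations, double elapsedMillis) {
        requirePositive(elapsedMillis, "elapsedMillis");
        return serialIterations / elapsedMillis;
    }

    private static void requirePositive(double value, String name) {
        if (!(value > 0)) {
            throw new IllegalArgumentException(name + " must be positive");
        }
    }

    public static void main(String[] args) {
        // Serial and best parallel time from the Barnes-Hut Results slide.
        double s = speedup(10271, 1553);
        System.out.println(String.format(Locale.ROOT, "Barnes-Hut speedup: %.1fX", s));
    }
}
```

Using the serial iteration count as the throughput numerator keeps the measure comparable across thread counts, since aborted iterations do not inflate it.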

Numbers
• Runtime
  • Last of 5 runs in same VM
  • Ignore time to read and construct initial graph
• Other statistics
  • Last of 5 runs
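A minimal sketch of the "last of 5 runs in same VM" rule, assuming the untimed part (reading and constructing the initial graph) can be separated from the timed part. The class `LastRunTimer` and its `timeLastRun` method are hypothetical names chosen for illustration, and the lambdas assume Java 8 or later rather than the Java 1.6 listed in the test environment.

```java
import java.util.Arrays;

// Hypothetical harness, not part of Galois: repeat a workload in one JVM
// and report only the last run, excluding setup time.
final class LastRunTimer {
    private LastRunTimer() {}

    /**
     * Runs setup and work `runs` times in this JVM and returns the elapsed
     * milliseconds of the work phase of the final run only.
     */
    public static long timeLastRun(int runs, Runnable setup, Runnable work) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long lastMillis = 0;
        for (int i = 0; i < runs; i++) {
            setup.run();                       // untimed, e.g. read and build the input graph
            long start = System.nanoTime();
            work.run();                        // timed
            lastMillis = (System.nanoTime() - start) / 1_000_000;
        }
        return lastMillis;                     // earlier runs act as warm-up
    }

    public static void main(String[] args) {
        final int[][] input = new int[1][];
        long millis = timeLastRun(5,
                () -> input[0] = new int[1_000_000],   // stand-in for graph construction
                () -> Arrays.sort(input[0]));          // stand-in for the algorithm
        System.out.println("last of 5 runs: " + millis + " ms");
    }
}
```

Keeping all runs in one VM lets the earlier runs absorb class loading and JIT compilation, so the reported run reflects steady-state behavior.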

Test Environment
• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20GB heap size

Barnes-Hut
[Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field]

Barnes-Hut
• N-body algorithm
• Oct-tree acceleration structure
• Serial
  • Tree build, center of mass, particle update
• Parallel
  • Force computation
• Structure
  • Reader on tree
• Variants
  • Splash2, Reader Galois

Reader Optimization
Before:
  child = octree.getNeighbor(nn, 1);
After:
  child = octree.getNeighbor(nn, 1, MethodFlag.NONE);

ParaMeter Profile

Barnes-Hut Results
• Best serial: base
• Serial time: 10271 ms
• Best parallel time: 1553 ms
• Best speedup: 6.6X
• Input: 100,000 points, 1 time step

Barnes-Hut Scalability

Delaunay Mesh Refinement

Delaunay Mesh Refinement
• Refine “bad” triangles
  • Maintained in worklist
• Structure
  • Cautious operator on graph
• Variants
  • Flag optimized, locallifo
• base: Priority.defaultOrder()
• local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

Cautious Optimization
• No need to save undo info
• Only check conflicts up to first write
Before:
  mesh.contains(item);
  ...
  mesh.remove(preNodes.get(i));
  ...
  mesh.add(node);
After:
  mesh.contains(item, MethodFlag.CHECK_CONFLICT);
  ...
  mesh.remove(preNodes.get(i), MethodFlag.NONE);
  ...
  mesh.add(node, MethodFlag.NONE);
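To make the saving concrete, here is a toy, single-threaded model of what per-call flags can skip. It is not the Galois implementation: `FlaggedSetSketch`, `FlaggedSet`, the `Flag` enum, and the counters are all hypothetical. Only the names `NONE` and `CHECK_CONFLICT` come from the slide; `ALL` is an assumed default meaning "check conflicts and save undo information". The sketch counts bookkeeping steps for an operator that reads its whole neighborhood before its first write.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only; none of these types are Galois classes.
final class FlaggedSetSketch {
    /** NONE and CHECK_CONFLICT appear on the slide; ALL is an assumed default. */
    public enum Flag { NONE, CHECK_CONFLICT, ALL }

    /** A set that counts the transactional bookkeeping each call performs. */
    public static final class FlaggedSet<T> {
        private final Set<T> items = new HashSet<>();
        private final Map<T, Integer> owners = new HashMap<>();   // element -> owning iteration
        private final Deque<Runnable> undoLog = new ArrayDeque<>();
        public int conflictChecks = 0;
        public int undoRecords = 0;

        private void acquire(T item, int iteration) {
            conflictChecks++;
            Integer owner = owners.putIfAbsent(item, iteration);
            if (owner != null && owner != iteration) {
                throw new IllegalStateException("conflict on " + item);
            }
        }

        private void saveUndo(Runnable inverse) {
            undoRecords++;
            undoLog.push(inverse);
        }

        public boolean contains(T item, int iteration, Flag flag) {
            if (flag != Flag.NONE) acquire(item, iteration);
            return items.contains(item);
        }

        public boolean add(T item, int iteration, Flag flag) {
            if (flag != Flag.NONE) acquire(item, iteration);
            boolean changed = items.add(item);
            if (changed && flag == Flag.ALL) saveUndo(() -> items.remove(item));
            return changed;
        }

        public boolean remove(T item, int iteration, Flag flag) {
            if (flag != Flag.NONE) acquire(item, iteration);
            boolean changed = items.remove(item);
            if (changed && flag == Flag.ALL) saveUndo(() -> items.add(item));
            return changed;
        }

        /** Replays saved inverse actions, newest first. */
        public void rollback() {
            while (!undoLog.isEmpty()) undoLog.pop().run();
        }
    }

    /** One cautious iteration: all reads first, then writes. Returns {conflictChecks, undoRecords}. */
    public static int[] refine(Flag readFlag, Flag writeFlag) {
        FlaggedSet<String> mesh = new FlaggedSet<>();
        mesh.add("t1", 0, Flag.NONE);            // untracked setup
        mesh.add("t2", 0, Flag.NONE);
        int iteration = 1;
        mesh.contains("t1", iteration, readFlag);   // read phase touches the neighborhood
        mesh.contains("t2", iteration, readFlag);
        mesh.remove("t1", iteration, writeFlag);    // write phase
        mesh.add("t3", iteration, writeFlag);
        return new int[] { mesh.conflictChecks, mesh.undoRecords };
    }

    public static void main(String[] args) {
        int[] all = refine(Flag.ALL, Flag.ALL);
        int[] opt = refine(Flag.CHECK_CONFLICT, Flag.NONE);
        System.out.println("default:   checks=" + all[0] + " undo=" + all[1]);
        System.out.println("optimized: checks=" + opt[0] + " undo=" + opt[1]);
    }
}
```

In this toy, the flagged variant performs conflict checks only during the read phase and records no undo actions, which mirrors the two bullets on the slide: a cautious operator can only abort before its first write, so nothing ever needs to be rolled back.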

LIFO Optimization
Before:
  GaloisRuntime.foreach( ..., Priority.defaultOrder());
After:
  GaloisRuntime.foreach( ..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
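The effect of a scheduling policy on processing order can be seen in a small single-worker simulation. This is one plausible reading of "chunked FIFO globally, LIFO locally" and not the Galois scheduler: the class `WorklistOrderSketch`, its methods, and the chunk size are hypothetical, and the toy operator `spawnOnce` simply creates one follow-up item per original item.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

// Single-worker simulation; not the Galois worklist implementation.
final class WorklistOrderSketch {
    private WorklistOrderSketch() {}

    /** Toy operator: "a" creates "a'" once; primed items create nothing. */
    public static List<String> spawnOnce(String item) {
        return item.endsWith("'")
                ? Collections.<String>emptyList()
                : Collections.singletonList(item + "'");
    }

    /** Plain FIFO: newly created work goes to the back of one queue. */
    public static List<String> runFifo(List<String> initial,
                                       Function<String, List<String>> operator) {
        Deque<String> queue = new ArrayDeque<>(initial);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            String item = queue.pollFirst();
            order.add(item);
            queue.addAll(operator.apply(item));    // appended at the tail
        }
        return order;
    }

    /** Chunks are taken from a global FIFO; the worker's local list is a LIFO stack. */
    public static List<String> runChunkedFifoLocalLifo(List<String> initial, int chunkSize,
                                                       Function<String, List<String>> operator) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        Deque<String> global = new ArrayDeque<>(initial);
        Deque<String> local = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        while (!global.isEmpty() || !local.isEmpty()) {
            if (local.isEmpty()) {
                // Refill the local stack with the next chunk from the global queue.
                for (int i = 0; i < chunkSize && !global.isEmpty(); i++) {
                    local.push(global.pollFirst());
                }
            }
            String item = local.pop();
            order.add(item);
            for (String created : operator.apply(item)) {
                local.push(created);               // new work is processed next
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> initial = Arrays.asList("a", "b", "c");
        System.out.println(runFifo(initial, WorklistOrderSketch::spawnOnce));
        System.out.println(runChunkedFifoLocalLifo(initial, 2, WorklistOrderSketch::spawnOnce));
    }
}
```

Under the local LIFO policy each newly created item is processed immediately after the item that created it, whereas under plain FIFO it waits behind all earlier work.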

DMR Results
• Best serial: locallifo.flagopt
• Serial time: 17002 ms
• Best parallel time: 3745 ms
• Best speedup: 4.5X
• Input: 0.5M triangles, 0.25M bad triangles

Preflow-Push

Preflow-push
• Max-flow algorithm
• Nodes push flow downhill
• Structure
  • Cautious, local computation
• Variants
  • Flag optimized, local computation graph
• base (discharge):
  Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
• base (relabel):
  Priority.first(ChunkedFIFO.class, 8)

Local Computation Optimization
Before:
  graph = ...
After:
  graph = ...
  b = new LocalComputationGraph.ObjectGraphBuilder();
  graph = b.from(graph).create()

Preflow-push Results
• C: 11450 ms
• Java: 30234 ms
• Best serial: lc.flagopt
• Serial time: 57121 ms
• Best parallel time: 18242 ms
• Best speedup: 3.1X
• Input: from challenge problem (genmf-wide), 14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
  http://avglab.com/andrew/CATS/maxflow_synthetic.htm

Preflow-push Scalability

What performance did we expect?
[Figure: execution timeline, time vs. threads, split into Serial, GC, parallel Compute, and Error]
• Error is measured indirectly
  • Idle
  • Misspeculation
  • Synchronization, …

What performance did we expect?
• Naïve: r(x) = t_1 / x
• Amdahl: r(x) = t_p / x + t_s
  • t_1 = t_p + t_s
  • t_s = t_idle + t_gc + t_serial
• Simple: r(x) = (t_p (i_x / i_1)) / x + t_s
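The three models can be compared side by side with a short calculation. This sketch makes two assumptions that the slide does not state: r(x) is read as the expected runtime on x threads, and i_x is read as the number of iterations executed with x threads (so i_x / i_1 grows when speculation causes extra work). The class name `RuntimeModels` and the input values in `main` are hypothetical and are not measurements from this deck.

```java
import java.util.Locale;

// Hypothetical helper evaluating the three runtime models from the slide.
final class RuntimeModels {
    private RuntimeModels() {}

    /** Naive model: r(x) = t1 / x. */
    public static double naive(double t1, int x) {
        checkThreads(x);
        return t1 / x;
    }

    /** Amdahl model: r(x) = tp / x + ts, where t1 = tp + ts. */
    public static double amdahl(double tp, double ts, int x) {
        checkThreads(x);
        return tp / x + ts;
    }

    /** Simple model: r(x) = (tp * (ix / i1)) / x + ts. */
    public static double simple(double tp, double ts, double i1, double ix, int x) {
        checkThreads(x);
        if (!(i1 > 0)) {
            throw new IllegalArgumentException("i1 must be positive");
        }
        return (tp * (ix / i1)) / x + ts;
    }

    private static void checkThreads(int x) {
        if (x < 1) {
            throw new IllegalArgumentException("thread count must be at least 1");
        }
    }

    public static void main(String[] args) {
        // Made-up inputs for illustration only.
        double tp = 8000, ts = 2000;       // parallelizable and serial time, ms
        double i1 = 1000, ix = 1250;       // iterations with 1 thread and with x threads
        int x = 8;
        System.out.println(String.format(Locale.ROOT,
                "naive=%.0f ms, amdahl=%.0f ms, simple=%.0f ms",
                naive(tp + ts, x), amdahl(tp, ts, x), simple(tp, ts, i1, ix, x)));
    }
}
```

With i_x equal to i_1 the simple model reduces to the Amdahl model, so the two differ only by the extra iterations executed in parallel runs.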

Barnes-Hut

Delaunay Mesh Refinement

Preflow-push

Summary
• Many profitable optimizations
  • Selecting among method flags, worklists, graph variants
• Open topics
  • Automation
  • Static, dynamic and performance analysis
  • Efficient ordered algorithms
