Galois Performance. Mario Mendez-Lojo, Donald Nguyen. Overview: the Galois system is a test bed to explore optimizations. It is safe but not fast out of the box. Important optimizations: select the least transactional overhead, the right scheduling, and an appropriate data structure.
Galois Performance Mario Mendez-Lojo Donald Nguyen
Overview
• The Galois system is a test bed to explore optimizations
• Safe but not fast out of the box
• Important optimizations
  • Select least transactional overhead
  • Select right scheduling
  • Select appropriate data structure
• Quantify optimizations on applications
Algorithms
• Applications
  1. Barnes-Hut
  2. Delaunay Mesh Refinement
  3. Preflow-push
• Classification of irregular algorithms (from the slide's diagram)
  • Topology: general graph, grid, tree
  • Operator: morph, local computation, reader
  • Ordering: unordered, ordered
Methodology
• Runtime plots: time vs. threads, broken down into serial, idle, GC, and compute
• Abort ratio: aborted iterations / total iterations
• GC options
  • UseParallelGC
  • UseParallelOldGC
  • NewRatio=1
Terms
• Base
  • Default scheduling, default graph
• Serial
  • Galois classes replaced by classes without concurrency control
• Speedup
  • Measured against the best mean performance of a serial variant
• Throughput
  • Number of serial iterations / time
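The terms above reduce to simple ratios. The following is a minimal sketch of how they could be computed; the class and method names are hypothetical and not part of Galois. The example inputs are the Barnes-Hut serial and parallel times reported on the Barnes-Hut Results slide.

```java
/** Hypothetical helper for the ratios defined on the Terms and Methodology slides. */
final class Metrics {
    /** Speedup: best serial time divided by the parallel time. */
    static double speedup(double bestSerialMs, double parallelMs) {
        return bestSerialMs / parallelMs;
    }

    /** Abort ratio: aborted iterations over total iterations (0 if nothing ran). */
    static double abortRatio(long abortedIterations, long totalIterations) {
        return totalIterations == 0 ? 0.0 : (double) abortedIterations / totalIterations;
    }

    /** Throughput: serial iteration count divided by elapsed time. */
    static double throughput(long serialIterations, double timeMs) {
        return serialIterations / timeMs;
    }

    public static void main(String[] args) {
        // Barnes-Hut: 10271 ms serial, 1553 ms best parallel time.
        System.out.printf("speedup = %.1fX%n", speedup(10271, 1553));
    }
}
```

Applying the same ratio to the serial and best parallel times on the other results slides gives values consistent with the speedups stated there.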
Numbers
• Runtime
  • Last of 5 runs in same VM
  • Ignore time to read and construct initial graph
• Other statistics
  • Last of 5 runs
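A "last of 5 runs in the same VM" measurement can be sketched as below. This is an illustrative harness with hypothetical names, not the one used for the talk. It assumes that input reading and graph construction happen outside the timed kernel, and that the earlier runs presumably serve to warm up the JIT compiler.

```java
/** Hypothetical timing harness: run a kernel several times, keep only the last timing. */
final class LastOfN {
    /** Runs the kernel the given number of times in this VM and returns the last run's time in ns. */
    static long timeLastRunNanos(Runnable kernel, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long last = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            kernel.run();                       // only the kernel is timed
            last = System.nanoTime() - start;   // overwritten until the final run
        }
        return last;
    }

    public static void main(String[] args) {
        // Stand-in workload; a real measurement would run the parallel loop here.
        Runnable kernel = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum < 0) throw new IllegalStateException("unreachable");
        };
        System.out.println("last run: " + timeLastRunNanos(kernel, 5) + " ns");
    }
}
```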
Test Environment
• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20GB heap size
Barnes-Hut
[Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field]
Barnes-Hut
• N-body algorithm
  • Oct-tree acceleration structure
• Serial
  • Tree build, center of mass, particle update
• Parallel
  • Force computation
• Structure
  • Reader on tree
• Variants
  • Splash2, Reader Galois
Reader Optimization
• Before: child = octree.getNeighbor(nn, 1);
• After: child = octree.getNeighbor(nn, 1, MethodFlag.NONE);
Barnes-Hut Results
• Input: 100,000 points, 1 time step
• Best serial: base
• Serial time: 10271 ms
• Best parallel (//) time: 1553 ms
• Best speedup: 6.6X
Delaunay Mesh Refinement
• Refine “bad” triangles
  • Maintained in worklist
• Structure
  • Cautious operator on graph
• Variants
  • Flag optimized, locallifo
  • base: Priority.defaultOrder()
  • local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)
Cautious Optimization
• No need to save undo info
• Only check conflicts up to first write
• Before:
  mesh.contains(item); ... mesh.remove(preNodes.get(i)); ... mesh.add(node);
• After:
  mesh.contains(item, MethodFlag.CHECK_CONFLICT); ... mesh.remove(preNodes.get(i), MethodFlag.NONE); ... mesh.add(node, MethodFlag.NONE);
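The effect of the flags can be illustrated with a toy model. The sketch below is not the Galois API: ToyMesh, the Flag enum (including its ALL value, assumed here to stand for the unflagged default), and the counters are all hypothetical. It only counts how much bookkeeping each call would trigger, under the slide's premise that a cautious operator needs conflict checks on its reads before the first write and no undo information.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of per-call method flags; every name here is hypothetical, not Galois code. */
final class FlagSketch {
    /** ALL = check conflicts and save undo info; CHECK_CONFLICT = check only; NONE = neither. */
    enum Flag { ALL, CHECK_CONFLICT, NONE }

    static final class ToyMesh {
        private final Set<Integer> items = new HashSet<>();
        int conflictChecks;   // number of calls that performed a conflict check
        int undoRecords;      // number of mutating calls that saved undo info

        boolean contains(int x, Flag f) { account(f, false); return items.contains(x); }
        boolean add(int x, Flag f)      { account(f, true);  return items.add(x); }
        boolean remove(int x, Flag f)   { account(f, true);  return items.remove(Integer.valueOf(x)); }

        private void account(Flag f, boolean mutates) {
            if (f != Flag.NONE) conflictChecks++;
            if (f == Flag.ALL && mutates) undoRecords++;
        }
    }

    /** One iteration shaped like the slide: a read, then a remove and an add. */
    static ToyMesh runIteration(Flag readFlag, Flag writeFlag) {
        ToyMesh mesh = new ToyMesh();
        mesh.add(1, Flag.NONE);        // setup only; NONE leaves the counters untouched
        mesh.contains(1, readFlag);
        mesh.remove(1, writeFlag);
        mesh.add(2, writeFlag);
        return mesh;
    }

    public static void main(String[] args) {
        ToyMesh before = runIteration(Flag.ALL, Flag.ALL);
        ToyMesh after = runIteration(Flag.CHECK_CONFLICT, Flag.NONE);
        System.out.println("before: checks=" + before.conflictChecks + " undo=" + before.undoRecords);
        System.out.println("after:  checks=" + after.conflictChecks + " undo=" + after.undoRecords);
    }
}
```

In this toy, the flagged iteration performs one conflict check and saves no undo records, while the unflagged one pays for both on every call.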
LIFO Optimization
• Before:
  GaloisRuntime.foreach( ..., Priority.defaultOrder());
• After:
  GaloisRuntime.foreach( ..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
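The difference between FIFO and LIFO processing order can be seen in a single-threaded toy worklist. This sketch uses only java.util collections and hypothetical names. It does not model ChunkedFIFO or the thread-local part of the Galois schedule. It only shows that under LIFO, newly generated work is processed right after the item that created it, which is a common locality argument and an assumption here rather than a claim from the slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy worklist: compares the order in which FIFO and LIFO policies process items. */
final class WorklistOrder {
    /** Processes items 1, 2, 3; each item below 10 generates one new item (item * 10). */
    static List<Integer> run(boolean lifo) {
        Deque<Integer> work = new ArrayDeque<>(Arrays.asList(1, 2, 3));
        List<Integer> order = new ArrayList<>();
        while (!work.isEmpty()) {
            int item = lifo ? work.pollLast() : work.pollFirst();
            order.add(item);
            if (item < 10) {
                work.addLast(item * 10);   // new work always goes to the tail
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println("FIFO: " + run(false));
        System.out.println("LIFO: " + run(true));
    }
}
```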
DMR Results
• Input: 0.5M triangles, 0.25M bad triangles
• Best serial: locallifo.flagopt
• Serial time: 17002 ms
• Best parallel (//) time: 3745 ms
• Best speedup: 4.5X
Preflow-push
• Max-flow algorithm
  • Nodes push flow downhill
• Structure
  • Cautious, local computation
• Variants
  • Flag optimized, local computation graph
  • base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
  • base (relabel): Priority.first(ChunkedFIFO.class, 8)
Local Computation Optimization
• Before:
  graph = ...
• After:
  graph = ...
  b = new LocalComputationGraph.ObjectGraphBuilder();
  graph = b.from(graph).create()
Preflow-push Results
• Input: from challenge problem (genmf-wide), 14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
  http://avglab.com/andrew/CATS/maxflow_synthetic.htm
• C: 11450 ms
• Java: 30234 ms
• Best serial: lc.flagopt
• Serial time: 57121 ms
• Best parallel (//) time: 18242 ms
• Best speedup: 3.1X
What performance did we expect?
[Figure: time vs. threads, split into error, parallel (//) compute, serial, and GC; annotated "Measured" and "Indirectly"]
• Idle
• Misspeculation
• Synchronization, …
What performance did we expect?
• Naïve: $r(x) = t_1 / x$
• Amdahl: $r(x) = t_p / x + t_s$, where $t_1 = t_p + t_s$ and $t_s = t_{idle} + t_{gc} + t_{serial}$
• Simple: $r(x) = \left(t_p \cdot (i_x / i_1)\right) / x + t_s$
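The three models can be compared numerically. The sketch below is a direct transcription of the formulas with hypothetical names. The slide does not define i_x and i_1; the code assumes they are the iteration counts with x threads and with one thread, so that i_x / i_1 scales the parallel work by the extra iterations speculation causes. The input values in main are made up for illustration and are not measurements from the talk.

```java
/** Hypothetical transcription of the three expected-runtime models r(x). */
final class PerfModel {
    /** Naive: the whole single-thread time t1 scales perfectly with x threads. */
    static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl: only the parallel part tp scales; ts (idle + GC + serial) does not. */
    static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /** Simple: like Amdahl, with tp inflated by ix / i1 (assumed to be iteration counts). */
    static double simple(double tp, double ts, double ix, double i1, int x) {
        return (tp * (ix / i1)) / x + ts;
    }

    public static void main(String[] args) {
        double tp = 800, ts = 200;   // illustrative values only, t1 = tp + ts = 1000
        int x = 8;
        System.out.println("naive:  " + naive(tp + ts, x));
        System.out.println("amdahl: " + amdahl(tp, ts, x));
        System.out.println("simple: " + simple(tp, ts, 1200, 1000, x));
    }
}
```

With these inputs the models grow progressively more pessimistic, which matches the order in which the slide introduces them.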
Summary
• Many profitable optimizations
  • Selecting among method flags, worklists, graph variants
• Open topics
  • Automation
  • Static, dynamic and performance analysis
  • Efficient ordered algorithms