Communication-Aware Processor Allocation for Supercomputers

Authors • Michael Bender, SUNY Stony Brook • David Bunde, University of Illinois Urbana • Erik Demaine, MIT • Sandor Fekete, Braunschweig University of Technology • Vitus Leung, Sandia National Laboratories • Henk Meijer, Queen’s University, Ontario • Cynthia Phillips, Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Computational Plant (Cplant) • Commodity-based supercomputers at Sandia National Laboratories (off-the-shelf components) • Up to 1500 processors • Production computing environment • Our Job: Improve parallel node allocation on Cplant to optimize performance.

The Cplant System • DEC Alpha processors • Myrinet interconnect (Sandia modified) • MPI • Different sizes/topologies: usually 2D or 3D grid with toroidal wraps • Ross = ~1500 proc, 3D mesh • Zermatt = 128 proc, 2D mesh • Alaska = ~600 proc, heavily augmented 2D mesh (cannibalized) • Modified Linux OS (now public domain) • Four processors/switch (compute, I/O, service nodes)

Scheduling Environment • Users submit jobs to queue (online) • Users specify number of processors and runtime estimate • If a job runs past this estimate by 5 min, it is killed • No preemption, no migration, no multitasking (security) • Actual runtime depends on set of processors allocated and placement of other jobs Goals: • User - minimum response time • Bureaucracy (GAO) - high utilization

Scheduler/Allocator Association • Scheduler and allocator affect each other’s performance (performance dependencies in both directions)

Scheduler/Allocator Dissociation • Job: user, executable, # processors, requested time • Scheduler enforces policy • Management sets priorities for access, utilization policy • Allocator can optimize performance [Figure: jobs enter the queue, pass through the PBS scheduler, then the node allocator, and run on Cplant]

What’s a Good Allocation? Objective: Allocate jobs to processors to minimize network contention ⇒ processor locality. • Especially important for commodity networks [Figure: a good allocation and a bad allocation for a 2D mesh]

Quantitative Effect of Processor Locality • But: speed-up anomaly [Figure legend: one marked allocation is 2× faster than the other; blank = empty processor]

Communication Hops on a 2D Grid • L1 distance = # hops (~ # switches) between 2 processors on grid

Allocation Problem • Given n available points on grid (some unavailable) • Find a set of k available points with minimum average (or total) L1 distance. • Example: green allocation: 3(2) + 3(1) = 9
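The objective on this slide can be checked with a few lines of code. The sketch below computes the total pairwise L1 distance of an allocation. The T-shaped layout is a hypothetical example chosen here because it reproduces the slide's count of 3 pairs at distance 2 and 3 pairs at distance 1; the slide's actual green allocation is not recoverable from the transcript.

```python
from itertools import combinations

def l1(p, q):
    """L1 (Manhattan) distance = number of hops between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_pairwise_l1(points):
    """Sum of L1 distances over all unordered pairs of allocated points."""
    return sum(l1(p, q) for p, q in combinations(points, 2))

# Hypothetical T-shaped allocation of k = 4 processors (an assumption,
# not the figure from the slide).
allocation = [(0, 0), (1, 0), (2, 0), (1, 1)]
print(total_pairwise_l1(allocation))  # 3(2) + 3(1) = 9
```

Dividing the total by the number of pairs, k(k-1)/2, gives the average distance; minimizing either quantity selects the same set for fixed k.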

Empirical Correlation Leung et al, 2002 Related support: Mache and Lo, 1996

Previous Work • Various work forcing a convex set • Insufficient processor utilization • Mache, Lo, Windisch: MC algorithm • Krumke et al.: 2-approximation, NP-hard with general metric • Complexity open for grids • Dispersion problem (max distance): linear time for fixed k (Fekete and Meijer)

Optimal Unconstrained Shape [Bender, Bender, Demaine, Fekete 2004] • Almost a circle but not quite • Only 0.05 percent difference in area • 0.650 245 952 951

Our Results • 7/4-approximation (2 − 1/(2d) in d dimensions) • PTAS ((1+ε)-approximation in time polynomial in n for any fixed ε) • MC is a 4-approximation • Linear-time exact dynamic program in 1D • O(n log n) time for k=3 • Simulations (performance on job streams)

An L1 Ball on a 2D Grid • Unit ball |x| + |y| ≤ 1: a diamond with vertices (0,1), (1,0), (0,-1), (-1,0) • Edges: x + y = 1, x - y = 1, x + y = -1, y - x = 1

Possible medians of selected set • A median will always share x coordinate with an available point and y coordinate with a (possibly different) available point.

Manhattan Median (MM) Algorithm • For each possible median p: pick the k free processors closest to p (in L1) and compute their total pairwise L1 distance • Return the set with the smallest total distance • Krumke et al. (1997) previously showed this is a 2-approximation in arbitrary metric spaces • We proved it is a 7/4-approximation for L1. This is tight.
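The MM steps above can be sketched as follows. This is a minimal illustration, not the Cplant implementation: the function name `mm_allocate`, the brute-force sort per candidate, and the lexicographic tie-breaking among equidistant processors are all assumptions made here. Candidate medians are formed by pairing every x coordinate of a free processor with every y coordinate of a free processor, following the preceding slide.

```python
from itertools import combinations

def l1(p, q):
    """L1 (Manhattan) distance between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_pairwise_l1(points):
    """Sum of L1 distances over all unordered pairs."""
    return sum(l1(p, q) for p, q in combinations(points, 2))

def mm_allocate(free, k):
    """Sketch of MM: try every candidate median, keep the best k-set.

    free: list of (x, y) coordinates of free processors.
    Returns (chosen_points, total_pairwise_distance).
    """
    if k > len(free):
        raise ValueError("not enough free processors")
    xs = sorted({x for x, _ in free})
    ys = sorted({y for _, y in free})
    best_set, best_cost = None, None
    for cx in xs:
        for cy in ys:
            center = (cx, cy)
            # k free processors closest to the candidate median; ties are
            # broken lexicographically (an assumption of this sketch).
            chosen = sorted(free, key=lambda p: (l1(p, center), p))[:k]
            cost = total_pairwise_l1(chosen)
            if best_cost is None or cost < best_cost:
                best_set, best_cost = chosen, cost
    return best_set, best_cost

free = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (9, 0)]
print(mm_allocate(free, 4))
```

With n free processors there are at most n² candidate medians, so this sketch is polynomial but far from tuned; a production allocator would avoid re-sorting the free list for every candidate.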

Lower Bound Instance (7/4)

Upper Bound Techniques • WLOG assume the origin is a median of OPT • Let M be the k points closest to the origin • Candidate point set for algorithm MM • Set returned by MM can only be better • Compare M to optimal • Assume M is the worst-case example

Upper Bound Techniques • Transform optimal and M to point placements that have the same performance ratio, but are easy to analyze • Transform in steps • Argue the ratio gets worse if we deviate from this form (impossible if M is the worst case) [Figure: all points of OPT and M placed at these 5 points]

Simulations: Performance on a Job Stream • We’ve analyzed a greedy algorithm for placing a single job. How well does it do for a stream of jobs? Consider two types of algorithms: • Situation algorithm: Places job stream prefix (system normal/default) • Decision algorithm: Places current job (can be a 1-time override)

Simulation Setup • Job stream from LLNL Cray T3D trace • 21323 jobs, 256 processors [Figure: the situation algorithm places the job stream to produce the current allocation; the decision algorithm then makes a 1-time placement decision]

Simulations: Alternative Placement Algorithm MC • Search in shell from minimum-size region of preferred shape. • Weight processors by shells • Return processor set with minimum weight.
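A rough sketch of the shell idea follows, for the 1x1 preferred shape only. The details are assumptions made for illustration rather than the published MC procedure: shell i around a center is taken to be the ring of processors at Chebyshev (L∞) distance i, only free processors are tried as centers, each processor's weight is its shell number, and ties within a shell are broken lexicographically. The function names are hypothetical.

```python
def shell_index(p, center):
    """Shell number of p around a 1x1 center: Chebyshev (L-infinity) distance.

    Assumption of this sketch: shell i is the square ring at distance i.
    """
    return max(abs(p[0] - center[0]), abs(p[1] - center[1]))

def mc1x1_allocate(free, k):
    """Shell-based selection with a 1x1 starting region.

    For each free processor used as a center, take the k free processors
    in the innermost shells and score the set by the sum of shell numbers.
    Returns (chosen_points, total_weight) for the minimum-weight set.
    """
    if k > len(free):
        raise ValueError("not enough free processors")
    best_set, best_weight = None, None
    for center in free:
        ranked = sorted(free, key=lambda p: (shell_index(p, center), p))[:k]
        weight = sum(shell_index(p, center) for p in ranked)
        if best_weight is None or weight < best_weight:
            best_set, best_weight = ranked, weight
    return best_set, best_weight

free = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(mc1x1_allocate(free, 4))
```

Because square shells grow as rectangles rather than L1 diamonds, this kind of selection tends to favor box-like allocations.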

Alternative: One-Dimensional Reduction • Order processors so that close in linear order ⇒ close in physical processor graph • Consider one-dimensional processor allocation • Pack jobs onto the line (or ring), allowing fragmentation
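Once processors have ranks along a line, allocation becomes interval packing. The sketch below shows one possible best-fit rule on a line (not a ring): use the smallest run of consecutive free ranks that holds the job, and if no run is large enough, fall back to the window of k free ranks with the smallest span. The fallback rule and the function names are assumptions of this sketch, not a description of the Cplant allocator.

```python
def free_runs(free_ranks):
    """Group free ranks into maximal runs of consecutive ranks."""
    runs, run = [], []
    for r in sorted(free_ranks):
        if run and r != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(r)
    if run:
        runs.append(run)
    return runs

def best_fit_1d(free_ranks, k):
    """Pick k ranks: smallest contiguous run that fits, else min-span window."""
    ranks = sorted(free_ranks)
    if k > len(ranks):
        raise ValueError("not enough free processors")
    fitting = [run for run in free_runs(ranks) if len(run) >= k]
    if fitting:
        # Best fit: the tightest run leaves the least leftover space.
        return min(fitting, key=len)[:k]
    # Fragmented fallback (assumed rule): k free ranks with minimum span.
    start = min(range(len(ranks) - k + 1),
                key=lambda i: ranks[i + k - 1] - ranks[i])
    return ranks[start:start + k]

free = [0, 1, 2, 3, 4, 8, 9, 10, 20, 21]
print(best_fit_1d(free, 3))  # [8, 9, 10]
print(best_fit_1d(free, 6))  # [0, 1, 2, 3, 4, 8]
```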

Hilbert (Space-Filling) Curves • For 2D and 3D grids • Previous applications • I/O efficient and cache-oblivious computation • Compression (images) • Domain decomposition
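To make the ordering concrete, here is the standard iterative conversion from 2D grid coordinates to a position along a Hilbert curve, for an n x n grid with n a power of two. The function name and the choice of curve orientation are assumptions of this sketch; the slides do not say which variant the allocator used.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two. Consecutive positions are always grid
    neighbors, which is what makes the 1D order locality-preserving.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Rank every processor of a 4 x 4 mesh along the curve.
n = 4
order = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda c: hilbert_index(n, c[0], c[1]))
print(order)
```

Sorting processors by this index yields the linear order that a one-dimensional allocator can then pack jobs onto.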

Four Algorithms for Simulation • MM • MM + Incremental improvement • Hilbert curve with best fit • MC

Results • Ordering in a row consistent with proven approximation performance: MM+Inc, MM, MC1x1, HilbertBF • Ordering on diagonal (normal operation): approximately opposite

Results • MM “paints into a corner on streams” • But good for single high-priority job • Thoughts: rectangles pack better than circles

New System: Red Storm • 10,368 AMD Opteron processors at 2 GHz • 31.2 TB memory, 240 TB disk • 41.47 TF peak performance • 3D mesh

Impact • Changed the node allocator on Cplant • 1D default allocator • Carried over to Red Storm system software • 1D algorithms current default • 2D algorithms implemented on Red Storm • Awaiting testing for use • R&D 100 submission (must win internal competition)

Questions • What’s the right allocation for a stream (online)? • Scheduling + Allocation • Simulation issues • Nondeterminism • Credit for good placement in timing
