User-Guided Symbiotic Space-Sharing on SDSC’s DataStar System
Jonathan Weinberg
University of California, San Diego
San Diego Supercomputer Center
Professor Snavely, University of California
Resource Sharing on DataStar
[Node diagram: each processor P0–P7 has a private L1 cache; pairs of processors share an L2 cache; all processors share the L3 caches, main memory (MEM), I/O, and other resources (e.g. WAN bandwidth).]
DataStar Scheduling
• Heavy use of shared resources among processes may lead to performance degradation
• DataStar avoids inter-application resource sharing by allocating by the node
• Processors are left idle for small jobs
• Users are encouraged to use fewer nodes
Symbiotic Space-Sharing
• Symbiosis: a term from biology for the graceful coexistence of organisms in close proximity
• Space-sharing: multiple jobs use a machine at the same time but do not share processors (vs. time-sharing)
• Symbiotic space-sharing: improve system throughput by executing applications in symbiotic combinations and configurations that alleviate pressure on shared resources (e.g. memory, I/O)
Can Symbiotic Space-Sharing Work?
• To what extent, and why, do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• Are these alternative job mixes feasible for parallel codes, and what is the net gain?
• How can a job scheduler create symbiotic schedules?
Effects of Resource Sharing (Memory)
• GUPS (Giga-Updates-Per-Second): measures the time to perform a fixed number of updates to random locations in main memory (stresses main-memory bandwidth)
• STREAM: performs a long series of short, regularly-strided accesses through memory (stresses cache performance)
• EP (Embarrassingly Parallel, one of the NAS Parallel Benchmarks): a compute-bound (CPU-bound) code
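To make the contrast concrete, here is a minimal C sketch of the two memory access patterns: a GUPS-style random-update loop and a STREAM-style unit-stride triad. The table size, the RNG, and the update count are illustrative assumptions, not the published benchmark parameters.

```c
#include <stdio.h>
#include <stdlib.h>

#define TABLE_WORDS (1L << 26)   /* 512 MB of 64-bit words: far larger than any cache */
#define N_UPDATES   (1L << 24)

/* GUPS-style kernel: updates to pseudo-random locations in a large table,
   stressing main-memory bandwidth and latency rather than cache. */
static void gups_kernel(unsigned long long *table) {
    unsigned long long ran = 1;
    for (long i = 0; i < N_UPDATES; i++) {
        /* Simple LCG stands in for the benchmark's RNG. */
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        table[ran % TABLE_WORDS] ^= ran;
    }
}

/* STREAM-style kernel: a long unit-stride sweep (the triad operation),
   whose regular access pattern makes heavy, effective use of the caches. */
static void stream_triad(double *a, const double *b, const double *c,
                         double scalar, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

int main(void) {
    long n = 1L << 24;
    unsigned long long *table = calloc(TABLE_WORDS, sizeof *table);
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b),
           *c = malloc(n * sizeof *c);
    if (!table || !a || !b || !c) return 1;
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }
    gups_kernel(table);
    stream_triad(a, b, c, 3.0, n);
    printf("done: a[0]=%f\n", a[0]);
    free(table); free(a); free(b); free(c);
    return 0;
}
```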
Effects of Resource Sharing (Memory)
• Slowdown while N independent instances of each benchmark run on a single node
• 15-30% slowdown
• EP shows no significant slowdown
• Slowdown is non-linear
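A minimal harness in the spirit of this measurement, assuming the benchmark binary path and instance count are passed on the command line: it forks N copies onto one node, waits for all of them, and reports the wall time; dividing by the one-instance time gives the slowdown.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv) {
    const char *bench = (argc > 1) ? argv[1] : "./gups"; /* placeholder path */
    int n = (argc > 2) ? atoi(argv[2]) : 1;  /* number of co-running copies */
    double t0 = now_sec();
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            execl(bench, bench, (char *)NULL);
            _exit(127);                       /* exec failed */
        }
    }
    while (wait(NULL) > 0)                    /* wait for all children */
        ;
    printf("%d instance(s) of %s: %.2f s\n", n, bench, now_sec() - t0);
    return 0;
}
```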
Effects of Resource Sharing (Memory)
• L2 cache miss rates for GUPS
• A fifth process forces two processes to share an L2 cache, and the slowdown appears
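Miss rates like these can be gathered with hardware performance counters. The sketch below uses the PAPI library as a portable stand-in (the study ran on IBM hardware with its native counter tools); the workload function is illustrative and error handling is omitted.

```c
#include <stdio.h>
#include <papi.h>

static volatile double sink;

/* Illustrative memory-touching kernel standing in for the benchmark. */
static void workload(void) {
    static double a[1 << 20];   /* zero-initialized static array */
    double s = 0;
    for (int r = 0; r < 10; r++)
        for (int i = 0; i < (1 << 20); i++)
            s += a[i];
    sink = s;
}

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_L2_TCM);   /* total L2 cache misses */
    PAPI_add_event(evset, PAPI_L2_TCA);   /* total L2 cache accesses */

    PAPI_start(evset);
    workload();
    PAPI_stop(evset, counts);

    printf("L2 miss rate: %.3f\n", (double)counts[0] / counts[1]);
    return 0;
}
```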
Effects of Resource Sharing (Memory)
• Repeat using the NAS Parallel Benchmarks
• Generalizable results:
• 10-60% slowdown
• Generally non-linear slowdown
• Slowdown across the first four instances is comparable
Effects of Resource Sharing (I/O)
• I/O Bench: a series of sequential, backward, and random read and write tests
• Super-linear slowdown!
• GPFS slowdown is erratic (possibly affected by other applications on the system?)
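A minimal sketch of the three access patterns named above, using plain POSIX I/O; the file name, block size, and file size are illustrative, not I/O Bench's parameters.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK   (64 * 1024)
#define NBLOCKS 1024            /* 64 MB test file */

int main(void) {
    static char buf[BLOCK];
    memset(buf, 'x', sizeof buf);
    int fd = open("iobench.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    /* Sequential write to populate the file. */
    for (int i = 0; i < NBLOCKS; i++)
        write(fd, buf, BLOCK);

    /* Sequential read. */
    lseek(fd, 0, SEEK_SET);
    for (int i = 0; i < NBLOCKS; i++)
        read(fd, buf, BLOCK);

    /* Backward read: step from the last block to the first. */
    for (int i = NBLOCKS - 1; i >= 0; i--) {
        lseek(fd, (off_t)i * BLOCK, SEEK_SET);
        read(fd, buf, BLOCK);
    }

    /* Random read: seek to a random block each time. */
    for (int i = 0; i < NBLOCKS; i++) {
        lseek(fd, (off_t)(rand() % NBLOCKS) * BLOCK, SEEK_SET);
        read(fd, buf, BLOCK);
    }
    close(fd);
    return 0;
}
```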
Resource Sharing: Conclusions
• To what extent, and why, do jobs interfere with themselves and each other?
• 10-60% for memory
• Super-linear for I/O
• Tail-heavy slowdown for memory
Can Symbiotic Space-Sharing Work?
• To what extent, and why, do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• Are these alternative job mixes feasible for parallel codes, and what is the net gain?
• How can a job scheduler create symbiotic schedules?
Alternate Job Mixes
• For this set, utilizing unused processors to execute other programs has little to no effect
• GUPS vs. STREAM is the only exception
• Both are memory intensive
• GUPS performs few memory operations, so it does not bother STREAM
• STREAM makes heavy use of cache, increasing GUPS's L2 and L3 miss rates by 0.2 each
Alternate Job Mixes
• Using the NAS Benchmarks, the results are generalizable
• EP and I/O Bench are symbiotic with all
• Some symbiosis exists even within the memory-intensive codes
• CG with IS, BT with others
• MG causes the most degradation to itself and others
• Its self-slowdown is among the highest observed
Alternate Job Mixes: Conclusions
• Proper job mixes can mitigate resource-contention slowdown
• Applications tend to slow themselves more heavily than they slow others
• Some symbiosis may exist even within one application category (e.g. memory-intensive)
Can Symbiotic Space-Sharing Work?
• To what extent, and why, do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• Are these alternative job mixes feasible for parallel codes, and what is the net gain?
• How can a job scheduler create symbiotic schedules?
Scheduling Parallel Codes
• Large parallel codes are generally condensed onto fewer nodes to minimize inter-node communication
• The processes of a parallel application tend to perform similar operations
• Should we instead spread the application across more nodes?
Scheduling Parallel Codes
• Use 16-processor runs of the NAS Benchmarks
• Add BTIO for parallel I/O:
• MPI IO FULL - the full MPI-2 I/O implementation, which uses collective buffering
• MPI IO SIMPLE - the simple MPI-2 I/O implementation, which does not use collective buffering
• EP IO - Embarrassingly Parallel I/O: every processor writes its own file, and the files are not combined into a single file
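The sketch below illustrates the difference between the FULL and SIMPLE variants with standard MPI-2 I/O calls: a collective write versus its independent counterpart, each rank writing its own contiguous slice of one shared file. The EP IO variant would instead open one file per rank. The file name and slice size are assumptions; this is not the BTIO source.

```c
#include <mpi.h>

#define NDOUBLES 1024

int main(int argc, char **argv) {
    int rank;
    double buf[NDOUBLES];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NDOUBLES; i++) buf[i] = (double)rank;

    /* Each rank owns one contiguous slice of the shared file. */
    MPI_Offset off = (MPI_Offset)rank * NDOUBLES * sizeof(double);
    MPI_File_open(MPI_COMM_WORLD, "btio.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* FULL: collective write; ranks cooperate, enabling collective
       buffering and fewer, larger file-system requests. */
    MPI_File_write_at_all(fh, off, buf, NDOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    /* SIMPLE: the independent equivalent; each rank issues its own
       request with no collective buffering:
       MPI_File_write_at(fh, off, buf, NDOUBLES, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);                       */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```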
Scheduling Parallel Codes
• Every benchmark shows significant speedup when spread
• Why don't users do this? They would be charged twice as much.
• Can symbiotic space-sharing help?
Scheduling Parallel Codes
• Choose some seemingly symbiotic combinations
• Speedup is maintained even with no idle processors
• CG slows down when run with BTIO(S)…
Scheduling Parallel Codes: Conclusions
• Spreading applications is beneficial (15% avg. speedup for the NAS benchmarks)
• Speedup can be maintained with symbiotic combinations while preserving full utilization
Can Symbiotic Space-Sharing Work?
• To what extent, and why, do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• Are these alternative job mixes feasible for parallel codes, and what is the net gain?
• How can a job scheduler create symbiotic schedules?
Towards a Scheduler: Identify Symbiosis
• Ask users to submit a guess at the limiting resource of their applications
• Use hardware counters to find good combinations
• Biased sampling: test combinations, but bias towards known good ones (see the sketch below)
• Collect statistics on applications as they run alone - e.g. for the NPB, memory operations per second correlates well with self-slowdown. Perhaps slower applications have more to gain from symbiotic scheduling?
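One way to realize the biased-sampling idea is an epsilon-greedy loop: usually co-schedule the pairing with the best observed slowdown, but occasionally try another pairing to keep learning. The sketch below is an illustrative interpretation under that assumption, not the paper's algorithm; the pair count, epsilon, and simulated measurements are invented.

```c
#include <stdio.h>
#include <stdlib.h>

#define NPAIRS  6        /* candidate job pairings */
#define EPSILON 0.2      /* fraction of explore decisions */

static double avg_slowdown[NPAIRS];   /* running mean per pairing */
static int    samples[NPAIRS];

/* Pick a pairing: usually the best seen so far, sometimes a random one. */
static int pick_pairing(void) {
    if ((double)rand() / RAND_MAX < EPSILON || samples[0] == 0)
        return rand() % NPAIRS;                   /* explore */
    int best = 0;
    for (int i = 1; i < NPAIRS; i++)
        if (samples[i] > 0 && avg_slowdown[i] < avg_slowdown[best])
            best = i;
    return best;                                  /* exploit */
}

/* Fold a newly measured slowdown into the running mean for a pairing. */
static void record(int pair, double slowdown) {
    samples[pair]++;
    avg_slowdown[pair] += (slowdown - avg_slowdown[pair]) / samples[pair];
}

int main(void) {
    /* Simulated measurements: pairing 2 happens to interfere least. */
    for (int step = 0; step < 100; step++) {
        int p = pick_pairing();
        double measured = (p == 2 ? 1.05 : 1.30)
                        + 0.1 * rand() / RAND_MAX;
        record(p, measured);
    }
    for (int i = 0; i < NPAIRS; i++)
        printf("pair %d: %d runs, mean slowdown %.2f\n",
               i, samples[i], avg_slowdown[i]);
    return 0;
}
```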
Towards a Scheduler: Prototype
• Symbiotic scheduler vs. DataStar
• 100 randomly selected 4p and 16p jobs from: {IOBench.4, EP.B.4, BT.B.4, MG.B.4, FT.B.4, DT.B.4, SP.B.4, LU.B.4, CG.B.4, IS.B.4, CG.C.16, IS.C.16, EP.C.16, BTIO FULL.C.16}
• Small jobs to large jobs: 4:3
• Memory-intensive to compute to I/O: 2:1:1
• Expected runtimes were supplied to allow backfilling
• The symbiotic scheduler used a simplistic heuristic: only schedule memory-bound apps with compute- and I/O-bound ones (sketched below)
• DataStar = 5355 s, Symbiotic = 4451 s, Speedup = 1.2
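A minimal sketch of that heuristic, with illustrative job structures and queue handling: classify each job's bottleneck and let memory-bound jobs share a node only with compute- or I/O-bound partners.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { MEM, CPU, IO } bottleneck_t;

typedef struct {
    const char  *name;
    bottleneck_t kind;   /* from user input or hardware counters */
} job_t;

/* The prototype's simple symbiosis test: memory-bound jobs may only
   share a node with compute- or I/O-bound jobs. */
static bool symbiotic(const job_t *a, const job_t *b) {
    return !(a->kind == MEM && b->kind == MEM);
}

/* Pick a queued job that can share a node with `running`, or NULL. */
static job_t *pick_partner(const job_t *running, job_t *queue, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (symbiotic(running, &queue[i]))
            return &queue[i];
    return NULL;
}

int main(void) {
    job_t running = { "CG.B.4", MEM };
    job_t queue[] = { { "MG.B.4", MEM }, { "EP.B.4", CPU } };
    job_t *p = pick_partner(&running, queue, 2);
    printf("partner for %s: %s\n", running.name, p ? p->name : "none");
    return 0;
}
```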
Identifying Resource Bottlenecks
• Ask the users?
• Historical information?
• Application profiling?
User Inputs: Why Ask Users?
• Accountability
  – Changing a job's runtime requires consent
  – The cause of runtime differences is transparent
• Familiarity
  – Submission flags from users are standard
• Simplicity
  – The least "technical" solution
  – Start simple and add complexity when needed
Application Workload
Applications deemed "of strategic importance to the United States federal government" by a $30M NSF procurement*
• WRF - Weather Research Forecasting System from the DoD's HPCMP program
• OOCORE - Out-of-Core solver from the DoD's HPCMP program
• MILC - MIMD Lattice Computation from the DoE's National Energy Research Scientific Computing (NERSC) program
• PARATEC - Parallel Total Energy Code from NERSC
• HOMME - High Order Methods Modeling Environment from the National Center for Atmospheric Research
* High Performance Computing Systems Acquisition: Towards a Petascale Computing Environment for Science and Engineering
User Inputs
• User inputs were collected independently from five expert users
• Users reported using MPI Trace, HPMCOUNT, etc.
• Are these inputs accurate enough to inform a scheduler?
User-Guided Symbiotic Schedules
[Table: 64-processor runs on 32-way p690 nodes; speedups are relative to runs condensed on 2 nodes; entries are marked Predicted Slowdown, Predicted Speedup, or No Prediction.]
• All applications speed up when spread (even those with communication bottlenecks)
• Users identified the non-symbiotic pairs
• User speedup predictions were 94% accurate
• Avg. speedup is 15% (min = 7%, max = 22%)
Conclusions
• To what extent, and why, do jobs interfere with themselves and each other? 10-60% for memory and 1000%+ for I/O (on DataStar)
• If this interference exists, how effectively can it be reduced by alternative job mixes? Almost completely, given the right job mix
• How can parallel codes leverage this, and what is the net gain? Spread across more nodes; up to 40% with our test set
• How can a job scheduler create symbiotic schedules? Ask users, use hardware counters, and do future work…
Future Work
• Workload study: are there enough I/O- and CPU-intensive codes?
• Automated symbiosis detection: what are the relationships between hardware-counter statistics and expected symbiosis?
• Heuristics: how should the scheduler actually operate? How will it affect fairness and other policy objectives?
• Symbiotic grid schedulers: can the larger job mix present more symbiotic opportunities?