Instant-access cycle stealing for parallel applications requiring interactive response
Paul Kelly (Imperial College), Susanna Pelagatti (University of Pisa), Mark Rossiter (ex-Imperial, now with Telcordia)
Application scenario…
• Workplace with fast LAN and many PCs
• Some users occasionally need high computing power to accelerate interactive tasks
• Example: CAD
  • Interactive design of components/structures
  • Analyse structural properties
  • Simulate fluid flow
  • Compute high-resolution rendering
• Most PCs are under-utilised most of the time
• Can we use spare CPU cycles to improve responsiveness?
The challenge…
• Cycle stealing the easy way…
  • Maintain a batch queue
  • Maximise throughput for multiple, long-running jobs
  • Wait until desktop users leave their desks
• This paper is about doing it the hard way:
  • Using spare cycles to accelerate short, parallel tasks (5-60 seconds)
  • In order to reduce interactive response time
  • While desktop users are at their desks
• This means:
  • No batch queue – execute immediately using the resources instantaneously available
  • No time to migrate or checkpoint tasks
  • No time to ship data across a wide-area network
A challenging environment…
• For our experiments, we used a group of 32 Linux PCs in a very busy CS student lab
• Graph shows hourly-average percentage utilisation (on a log scale) over a typical day
• Although not 100% busy, the machines are in continuous use
Scenario
• Host PCs service interactive desktop users
• Requests to execute parallel guest jobs arrive intermittently
• The system allocates a group of idle PCs to execute each guest job
• Objectives:
  • Minimise average response time for guest jobs
  • Keep the interference suffered by hosts within reasonable limits
• We show that this can really work, even in our extremely challenging environment
• Next: characterise patterns of idleness
• Then: design software to assign guest tasks
• Then: evaluate alternative strategies by simulation
Earlier work
• Batch queue, multiple long-running jobs
• Parallel jobs
  • "60-workstation cluster can handle job arrival trace taken from a dedicated 32-node CM-5"
• Wide-area networks
• Our goal: improve response time for individual tasks

• Litzkow, Livny, Mutka, "Condor – a hunter of idle workstations", ICDCS '88
• Atallah, Black, et al., "Models and algorithms for co-scheduling compute-intensive tasks on networks of workstations", JPDC 1992
• Arpaci, Dusseau, et al., "The interaction of parallel and sequential workloads on a network of workstations", SIGMETRICS '95
• Acharya, Edjlali, Saltz, "The utility of exploiting idle workstations for parallel computing", SIGMETRICS '97
• Petrini, Feng, "Buffered coscheduling: a new methodology for multitasking parallel jobs on distributed systems", IPDPS 2000
• United Devices, Seti@home, Entropia
• Subhlok, Lieu, Lowekamp, "Automatic node selection for high performance applications on networks", PPoPP 1999
Characterize patterns of idleness
• Idle periods occur frequently
  • 90% of idle periods occur within 5s
• Idle periods don't last long
  • Only 50% last more than 3.3s
• Idle = over a one-second period, less than 10% of CPU time is spent executing user processes, and at least 90% of CPU time could be devoted to a new process (a minimal sketch of this test follows below)
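The slides don't show how this per-second test is implemented; what follows is a minimal sketch, assuming a Linux /proc/stat layout, not the authors' actual mpidled code.

```python
# Sketch: classify a one-second interval as "idle" per the definition above.
# Assumes a modern Linux /proc/stat layout; not the authors' actual code.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()          # "cpu  user nice system idle ..."
    user, nice, system, idle = (int(x) for x in fields[1:5])
    return user + nice, system, idle

def interval_is_idle(user_max=0.10, free_min=0.90):
    u0, s0, i0 = cpu_times()
    time.sleep(1.0)
    u1, s1, i1 = cpu_times()
    du, ds, di = u1 - u0, s1 - s0, i1 - i0
    total = (du + ds + di) or 1
    # Idle: <10% of CPU time in user processes, and (approximating "could be
    # devoted to a new process" by unused time) >=90% of CPU time free
    return du / total < user_max and di / total >= free_min
```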
Distribution of idleness – 32 PCs in busy student lab
• It's very likely that we'll have up to 15 idle machines at any time
• It's unlikely that the same 15 machines will stay idle for long
So how much can we hope to get?
• With our 32-PC cluster, an idle group of 5 processors has about a 50% chance of remaining idle for more than 5 seconds
• This is our parallel computing resource!
The mpidled software
• mpidled is a Linux daemon process which runs on every participating PC:
  • Monitors system utilisation and determines whether the system is idle
  • Uses this and past measurements to predict short-term future utilisation
• mpidle is a client application which lists the participating PCs that are currently predicted to be idle
  • Produces a list of machine names, for use as an MPI machinefile (a client sketch follows below)
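The slides don't show mpidle's code or wire protocol; this is a minimal sketch assuming a hypothetical UDP request/reply exchange with the elected leader (described on the next slide). The port number and "IDLE?" message are invented for illustration.

```python
# Sketch of an mpidle-like client: ask the elected leader (reachable via LAN
# broadcast) which PCs are currently predicted idle, then write an MPI
# machinefile. Port number and message format are hypothetical.
import socket

MPIDLED_PORT = 9123                          # hypothetical

def request_idle_hosts(timeout=0.15):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.settimeout(timeout)
    s.sendto(b"IDLE?", ("<broadcast>", MPIDLED_PORT))
    try:
        reply, _ = s.recvfrom(65536)         # leader replies with hostnames
    except socket.timeout:
        return []                            # no leader reachable
    return reply.decode().split()

def write_machinefile(path="machinefile"):
    hosts = request_idle_hosts()
    with open(path, "w") as f:
        f.write("\n".join(hosts) + "\n")
    return len(hosts)

# Usage: n = write_machinefile(), then e.g.
#   mpirun -np <n> -machinefile machinefile ./guest_job
```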
Zero administration by leader election
• Participating PCs are regularly unplugged and rebooted
• It is vital to minimise systems-administration overheads…
• mpidled daemons autonomously elect a "leader" to handle client requests (the current implementation relies on LAN broadcast, confined to one subnet; a sketch follows below)
• mpidle usually responds in less than 0.15s
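The election protocol isn't detailed beyond "relies on LAN broadcast". Here is a minimal sketch of one plausible scheme, in which every daemon announces itself and the lexicographically smallest hostname becomes leader; the port, message format, and timing are all assumptions.

```python
# Sketch of a broadcast-based leader election: each daemon announces its
# hostname, listens briefly, and the smallest hostname seen wins.
# This is a plausible scheme, not the documented mpidled protocol.
import socket, time

ELECTION_PORT = 9124                         # hypothetical
MY_ID = socket.gethostname()

def elect_leader(listen_for=1.0):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.bind(("", ELECTION_PORT))
    s.settimeout(0.1)
    s.sendto(MY_ID.encode(), ("<broadcast>", ELECTION_PORT))  # announce
    seen = {MY_ID}
    deadline = time.time() + listen_for
    while time.time() < deadline:
        try:
            msg, _ = s.recvfrom(1024)
            seen.add(msg.decode())
        except socket.timeout:
            pass
    return min(seen)                         # smallest hostname wins

# A daemon that finds itself the leader answers mpidle client requests;
# because no state survives reboots, unplugged PCs simply drop out and a
# new leader emerges at the next election round.
```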
Load prediction
• We use recent measurements of idleness to predict how idle each PC will be in the future
• Good prediction leads to:
  • Shorter execution time for guest jobs
  • Less interference with host processes, i.e. the desktop user
• We're interested in short-running guest jobs – so we don't consider migrating tasks if the prediction turns out to be wrong
How good is load prediction?
• Previous studies (Dinda and O'Hallaron, Wolski et al.) have shown that taking a weighted mean of the last few samples works as well as anything (a minimal sketch follows below)
[Graph: prediction error vs. forecast length (seconds), for a 10-second prediction]
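A weighted mean of recent samples is essentially an exponentially weighted moving average. A minimal sketch, where the smoothing factor alpha is a tunable assumption rather than a value from the paper:

```python
# Sketch: exponentially weighted moving average (EWMA) over recent load
# samples, in the spirit of the weighted-mean predictors cited above.
class LoadPredictor:
    def __init__(self, alpha=0.5):           # alpha is an assumed tuning knob
        self.alpha = alpha
        self.estimate = None

    def observe(self, sample):
        """Feed one sample (fraction of CPU busy over the last second)."""
        if self.estimate is None:
            self.estimate = sample
        else:
            self.estimate = (self.alpha * sample
                             + (1 - self.alpha) * self.estimate)
        return self.estimate                 # predicted near-term load

predictor = LoadPredictor()
for load in [0.05, 0.02, 0.40, 0.10]:        # one-second load samples
    forecast = predictor.observe(load)
```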
How well does it work?
• Simulation, driven by traces from 32 machines gathered over one week, during busy working hours
• Uses the application's speedup curve to predict execution time given the number of processors available
• Also uses the trace load data to compute the CPU share available on each processor
• For this study, we simulated execution of a ray-tracing task:
  • Sequential execution takes 42 seconds
  • Speedup is more-or-less linear, with 50-60% efficiency
• Requests to execute a guest task arrive with exponentially distributed inter-arrival times, with a mean of 20 seconds (see the simulation sketch below)
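To make the setup concrete, here is a sketch of the arrival/allocation loop such a simulator might use. The 42-second sequential time and 50-60% efficiency come from the slides; the idle_at trace function and everything else are simplifying assumptions, not the authors' simulator.

```python
# Sketch of a trace-driven simulation loop: exponential inter-arrivals,
# a linear speedup model, and rejection when no processor is idle.
import random

SEQ_TIME = 42.0                              # ray tracer, sequential seconds

def speedup(p, efficiency=0.55):
    # Roughly linear speedup at 50-60% efficiency, per the slides
    return 1.0 if p <= 1 else efficiency * p

def simulate(idle_at, horizon=3600.0, mean_gap=20.0):
    """idle_at(t): number of PCs the trace shows idle at time t (assumed)."""
    t, responses = 0.0, []
    while t < horizon:
        t += random.expovariate(1.0 / mean_gap)  # exponential inter-arrivals
        p = idle_at(t)
        if p == 0:
            responses.append(None)               # guest task rejected
        else:
            responses.append(SEQ_TIME / speedup(p))
    return responses
```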
How well does it work? – baseline
• Disruption to desktop users is dramatically reduced compared to assigning work at random (but is not zero)
• Although many processors are used, speedup is low
• Quite often, a guest task is rejected because no processor is idle
  • Usually because an earlier guest task is still running
Allocation policy matters…
• The simplest policy is to allocate all available (idle) processors to each guest job
• This leads to a bimodal distribution: a substantial proportion of guest jobs get little or no benefit
A better strategy – holdback
• The problem: if a second guest task arrives before the first has finished, very few processors are available to run it
• Idea: "holdback"
  • Hold back a fraction r of processors in reserve
  • Each guest task is allocated (1-r) of the available (idle) processors (see the sketch below)
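A minimal sketch of the holdback rule. Granting at least one PC whenever any is idle is an assumption; the paper may round differently.

```python
# Sketch: allocate (1-r) of the currently idle PCs to a guest job,
# holding the remaining fraction r in reserve for later arrivals.
import math

def allocate(idle_hosts, r=0.25):            # r = holdback fraction (tunable)
    if not idle_hosts:
        return []                            # no idle PC: task is rejected
    n = math.floor((1 - r) * len(idle_hosts))
    return idle_hosts[:max(n, 1)]            # assumed: always grant >= 1 PC

# e.g. with 12 idle PCs and r = 0.25, a guest job gets 9 and 3 stay in reserve
```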
Holdback improves fairness
• By holding back some resources at each allocation, guest tasks get a more predictable and consistent share
• How much to hold back depends on the rate of arrival of guest tasks
[Histograms: frequency (%) vs. group size, for three holdback settings]
How much to hold back
• Mean speedup is maximised with the right holdback
• Parallel efficiency is lower than it would be on a dedicated parallel system, due to interference
• A larger group size doesn't imply higher speedup
• The details depend on the speedup characteristics of the guest application workload
Conclusions & Further work
• A simple, effective tool, to be made freely available
• Even extremely busy environments can host a substantial parallel workload
• Short interactive jobs can be accelerated, if:
  • Startup cost and data size are relatively small
  • Parallel execution time lies within the scope of load prediction – 10 seconds or so
  • Desktop users are prepared to tolerate some interference
• Plenty of scope for further study…
  • Memory contention
  • Adaptive holdback
  • Integrate with queuing to handle longer-running jobs
  • How to reduce startup delay?