
System Noise & OS Clock Ticks


Presentation Transcript


  1. System Noise & OS Clock Ticks
     Dan Tsafrir, Yoav Etsion, Dror G. Feitelson, Scott Kirkpatrick
     http://www.cs.technion.ac.il/~dan/papers/Noise05ICS.pdf

  2. Context cont.
     • Many HPC apps. are bulk-synchronous:
       • iterative computation phases, separated by barriers
       • typically run on a dedicated partition
       • the basic loop (an MPI-style sketch follows below):
           while(…) { compute; barrier; }
     • Problem: poor scalability if the granularity is fine
       • grain size can be ≤ 1 ms
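     To make the bulk-synchronous pattern concrete, here is a minimal MPI sketch of such an iteration loop. It is an illustration, not code from the paper; compute_step() and NUM_ITERS are hypothetical placeholders standing in for the application's real work.

        /* Minimal bulk-synchronous loop (illustrative sketch, not from the paper). */
        #include <mpi.h>

        static void compute_step(void)
        {
            /* stand-in for a fine-grained compute phase (can be <= 1 ms of work) */
            volatile double x = 0.0;
            for (int k = 0; k < 100000; k++)
                x += k * 1e-9;
        }

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            enum { NUM_ITERS = 100000 };              /* hypothetical iteration count */
            for (int i = 0; i < NUM_ITERS; i++) {
                compute_step();                       /* compute phase */
                MPI_Barrier(MPI_COMM_WORLD);          /* all tasks synchronize here */
            }
            MPI_Finalize();
            return 0;
        }

     Because every iteration ends with a barrier, each phase lasts as long as its slowest task; any per-node delay during the phase stretches the whole job.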

  3. Data from LLNL’s ASCI-White
     [Figure: avg. phase duration [ms] (0.5–4.5) vs. number of CPUs (0–6000); the observed performance curve diverges upward from the expected performance line as CPUs increase.]
     From Jones et al., “Improving the scalability of parallel jobs by adding parallel awareness to the OS”, SC’03 (with permission of the authors).

  4. Data from LANL’s ASCI-Q
     [Figure: compute-phase latency [ms] vs. nodes (quads), shown for 5 ms and 1 ms granularities.]
     From Petrini et al., “The case of the missing supercomputer performance”, SC’03 (with permission of the authors).

  5. The reason: system noise
     • Noise = sporadic per-node system activity
     • Unrelated to the app., yet it deprives the app. of the CPU
     • One late task holds up the thousands of peers with which it synchronizes
     • More nodes => higher probability that some task is delayed in a given phase

  6. FAST OS Issues
     [Diagram of FAST-OS research issues: Metrics, Usage Models, SSI, Virtualization, Adaptability, Fault Handling, Common API, Collective RT I/O, OS Noise]

  7. Sources of noise
     • Nodes typically run a general-purpose OS (see the Top500), in which there’s …
     • Process noise:
       • native OS dæmons (kswapd, ntpd, …)
       • cluster dæmons
     • Non-process noise (interrupts):
       • network interrupts
       • periodic clock interrupts = Ticks
       • …

  8. OS periodic clock interrupts = Ticks
     • Every few ms the OS wakes up
       • CPU accounting, preemption, pending signals
     • Ticks always happen
       • even if the system is otherwise idle (a user-space illustration follows below)
     • Ticks are widespread
       • used by ALL mainstream GPOSs (Windows*, UNIX*)
     • Tick resolution: until recently, 100 Hz
       • last 2–3 years: a shift to 1000 Hz (Linux, FreeBSD, DragonflyBSD) and 250 Hz
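     The following user-space sketch mimics the behaviour described above: a 1000 Hz periodic timer interrupts the process every millisecond even though it has nothing to do. It is only an analogy for kernel clock interrupts, not kernel code; the 1 ms period and the 5-second run are assumptions.

        /* User-space analogue of periodic ticking (illustration only; real ticks
         * are kernel clock interrupts, not signals). */
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/time.h>
        #include <unistd.h>

        static volatile sig_atomic_t ticks;

        static void on_tick(int sig)
        {
            (void)sig;
            ticks++;                /* bookkeeping on every tick, needed or not */
        }

        int main(void)
        {
            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_handler = on_tick;
            sigaction(SIGALRM, &sa, NULL);

            struct itimerval it;
            it.it_interval.tv_sec = 0;  it.it_interval.tv_usec = 1000;  /* every 1 ms: a 1000 Hz "tick" */
            it.it_value.tv_sec    = 0;  it.it_value.tv_usec    = 1000;
            setitimer(ITIMER_REAL, &it, NULL);

            for (int i = 0; i < 5000; i++)
                pause();            /* the process does nothing, yet it is woken ~1000 times/sec */

            printf("%d interruptions while idle\n", (int)ticks);
            return 0;
        }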

  9. What’s next
     • Explain the linearity
     • Quantify the impact of tick noise, as affected by:
       • platforms
       • grain sizes
       • tick frequencies
       • operating systems
     • Establish a general case against ticks
     • Suggest how to deal with ticks

  10. Why is perf. degradation linear? The “noise law”
     • n = number of nodes in the cluster
     • p = probability that a task is delayed:
       • on a single node
       • within a compute phase
     • Job’s per-phase no-delay probability: (1-p)^n
     • Job’s per-phase delay probability: D_p(n) = 1 - (1-p)^n
     • Claim: if p ≤ 1/(10n), then D_p(n) ≈ pn
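     A short justification of the claim, using only the definitions above (these bounds are standard and are not spelled out on the slide):

        % Bernoulli's inequality gives the upper bound; truncating the alternating
        % binomial expansion after two terms gives the lower bound.
        \[
          np\Bigl(1-\tfrac{np}{2}\Bigr) \;\le\; np-\binom{n}{2}p^{2}
          \;\le\; D_p(n) \;=\; 1-(1-p)^{n} \;\le\; np,
        \]
        \[
          \text{so if } p \le \tfrac{1}{10n}\ \text{(i.e. } np \le \tfrac{1}{10}\text{)}:\qquad
          0.95\,np \;\le\; D_p(n) \;\le\; np .
        \]

     Hence, for fixed small p, the per-phase delay probability, and with it the expected slowdown, grows essentially linearly in the number of nodes.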

  11. A desirable value for p
     • n ≤ 10,000:
       • need p ≤ 10⁻⁵ for D_p(n) ≤ 10%
       • need p ≤ 10⁻⁶ for D_p(n) ≈ 0
     • n > 10,000 (BlueGene/L):
       • need p ≤ 10⁻⁶
     [Figure: D_p(n) vs. n, one curve per p = 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶]
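     A quick numerical check of these thresholds (my arithmetic, not shown on the slide), using D_p(n) = 1-(1-p)^n:

        \[
          n=10^{4},\; p=10^{-5}:\quad D_p(n)=1-(1-10^{-5})^{10^{4}} \approx 1-e^{-0.1} \approx 0.095 \;(\le 10\%),
        \]
        \[
          n=10^{4},\; p=10^{-6}:\quad D_p(n) \approx 1-e^{-0.01} \approx 0.01 .
        \]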

  12. Measuring the noise
     • Micro-benchmark (a fuller C sketch follows below)
     • Pentium-IV, 2.8 GHz, Linux-2.6.9
     • Default settings (1000 Hz, daemons, …)
     • Analyze the time[] distribution
         for( i = 1 … 10⁶ ):
             start   = cycle_counter
             for(…) /* one ms work */ ;
             end     = cycle_counter
             time[i] = end – start
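     A self-contained reconstruction of such a micro-benchmark for x86/gcc (not the authors' code; the rdtsc helper and the SAMPLES and WORK constants are assumptions, and WORK must be tuned so one unit takes roughly 1 ms on the given CPU):

        /* Sketch of the noise micro-benchmark (reconstruction, not the original code). */
        #include <stdint.h>
        #include <stdio.h>

        #define SAMPLES 1000000          /* 10^6 timed work units, as on the slide */
        #define WORK    500000           /* tune so one unit takes roughly 1 ms */

        static inline uint64_t rdtsc(void)
        {
            uint32_t lo, hi;
            __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
            return ((uint64_t)hi << 32) | lo;
        }

        static uint64_t t[SAMPLES];

        int main(void)
        {
            volatile double x = 0.0;
            for (int i = 0; i < SAMPLES; i++) {
                uint64_t start = rdtsc();
                for (int k = 0; k < WORK; k++)   /* ~1 ms of pure user-level work */
                    x += k * 1e-9;
                t[i] = rdtsc() - start;          /* anything above the minimum is noise */
            }
            /* The distribution of t[] (e.g. its histogram) exposes ticks and daemons. */
            uint64_t min = t[0], max = t[0];
            for (int i = 1; i < SAMPLES; i++) {
                if (t[i] < min) min = t[i];
                if (t[i] > max) max = t[i];
            }
            printf("min=%llu cycles  max=%llu cycles\n",
                   (unsigned long long)min, (unsigned long long)max);
            return 0;
        }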

  13. Impact of interrupt-noise
     [Figure; annotated: n = 500, p = 1/100, D_p(n) = 99%]

  14. Reason: indirect overheads (cache)
     [Figure; annotated: “vs. a total of 8%”]

  15. Granularity & platform

  16. Platform & tick-frequency
     [Figure; y-axis scales labeled in millions / thousands]

  17. Cache misses: user vs. kernel

  18. Different operating systems

  19. Observation “Periodic ticks are a stupid idea, that’s why K42 doesn’t use them.” ~ Orran Krieger, Jan 2006

  20. The alternative: One-shot timers
     • Eliminate periodic ticks
     • Set the timer only for the next (earliest) event (see the sketch below)
     • Examples:
       • ~K42, Pebble, the Pegasus project, ~KURT, ~Firm Timers
     • Common in realtime contexts
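     As a concrete user-space illustration of “arm the timer only for the next event”, here is a minimal POSIX one-shot timer sketch. It is an analogy for the kernel-level mechanism, not the implementation of any of the systems named above; the 250 ms deadline is arbitrary (link with -lrt on older glibc).

        /* One-shot timer sketch: arm the timer only for the earliest pending event. */
        #include <signal.h>
        #include <string.h>
        #include <time.h>
        #include <unistd.h>

        static void on_timer(int sig) { (void)sig; write(1, "event fired\n", 12); }

        int main(void)
        {
            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_handler = on_timer;
            sigaction(SIGRTMIN, &sa, NULL);

            timer_t tid;
            struct sigevent ev;
            memset(&ev, 0, sizeof ev);
            ev.sigev_notify = SIGEV_SIGNAL;
            ev.sigev_signo  = SIGRTMIN;
            timer_create(CLOCK_MONOTONIC, &ev, &tid);

            /* One-shot: it_interval stays zero, so nothing fires periodically;
             * it_value is the delay until the next (earliest) event only. */
            struct itimerspec its;
            memset(&its, 0, sizeof its);
            its.it_value.tv_sec  = 0;
            its.it_value.tv_nsec = 250 * 1000 * 1000;   /* next event in 250 ms */
            timer_settime(tid, 0, &its, NULL);

            pause();                                    /* sleep until that single event */
            return 0;
        }

     In a tickless kernel the same idea is applied to the hardware timer: it is reprogrammed on every event to the next earliest deadline, so an idle CPU receives no clock interrupts at all.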

  21. So why do GPOSs use ticks? • There are a few reasons…

  22. Why use ticks? Historical reasons
     • [Dijkstra, 1968]: first known reference
       • “The Structure of the THE Multiprogramming System”, in CACM
       • specifies two reasons:
         • the only way to implement multitasking (back then...)
         • allows for “exhaustive testing”:
           “…[Had we made] the central processor react directly upon any weird time succession of interrupts, the number of relevant states would have exploded to such a height that exhaustive testing would have been an illusion.”
     • Older HW: expensive timer reprogramming
       • e.g., for Intel, until the Pentium-II (1997; APIC since)

  23. Why use ticks? Inertia
     • This is the way it has always been done…
       • a 40-year-old design decision
       • all mainstream GPOSs are affected
     • Legacy
       • kernel subsystems rely on it
       • user space too (e.g. top)
     • Simplicity
       • polling is simpler than being event-driven
     • It works!
       • if it ain’t broken…

  24. Why use ticks? Inertia – cont.
     • For a GPOS developer, the question is: “why NOT??”
       (HPC noise is an insufficient incentive)

  25. Why use ticks? Objective reasons
     • Bounded overhead
       • handling clock interrupts
       • context switch: to kernel & back
       • ticks aggregate nearby events
     • Robustness
       • with one-shot timers, any user can bring the system down (sketched below):
           for( int ns = 1 /*nanosec*/; 1; ns++ ) setitimer( shortly + ns );
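     A hedged expansion of that one-liner into compilable form, using POSIX timers (the slide's “shortly + ns” becomes an ever-changing sub-microsecond one-shot deadline). This only illustrates the argument: an unprivileged loop keeps requesting arbitrarily fine one-shot interrupts; a ticked kernel, or any kernel that coalesces or rate-limits timers, simply absorbs such requests.

        #include <signal.h>
        #include <time.h>

        int main(void)
        {
            signal(SIGALRM, SIG_IGN);                   /* ignore expirations; only the request rate matters */

            timer_t tid;
            timer_create(CLOCK_MONOTONIC, NULL, &tid);  /* NULL sigevent => SIGALRM on expiry */

            struct itimerspec its = { {0, 0}, {0, 0} };
            for (long ns = 1; ; ns++) {                 /* the slide's "shortly + ns" */
                its.it_value.tv_sec  = 0;
                its.it_value.tv_nsec = 1 + ns % 1000;   /* always a sub-microsecond one-shot deadline */
                timer_settime(tid, 0, &its, NULL);      /* re-arm as fast as possible */
            }
        }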

  26. To truly motivate a change in GPOSs, one should address…

  27. Drawback #1: HPC noise
     • Dramatic performance loss
       • the noise effect is amplified by the “noise law”
       • fine-grain noise (mostly ticks) => 2/3 perf. degradation
       • quad nodes => lose 25% of the machine to noise!

  28. Dealing with noise: Traditional approach
     [Figure: uncoordinated vs. coscheduled system activity across nodes]
     • Not always appropriate for ticks + only treating the symptom…

  29. Drawback #2: Desktop Slowdown

  30. Drawback #3: Power consumption

  31. Drawback #4: Security breach

  32. Implement #1: External process
     [Message-sequence diagram: a cheat client on the default 1000 Hz host talks to a cheat server running at 10,000 Hz. The client fork/execs the target program and, around every tick n, n+1, …, combines SIGSTOP / sleep-epsilon / SIGCONT with request messages and select-based replies so that the program runs for about 80% of each tick yet is stopped whenever the tick fires.]
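     A rough, single-machine sketch of the idea (my reconstruction for illustration; it uses local sleeps instead of the external 10,000 Hz cheat server, assumes HZ=1000, and makes no attempt at the precise synchronization the real mechanism needs): a controller stops its child just before each expected tick and resumes it right after, so tick-sampled CPU accounting rarely catches the child running.

        /* Sketch: stop the child around each tick so tick-sampled accounting misses it. */
        #include <signal.h>
        #include <sys/types.h>
        #include <time.h>
        #include <unistd.h>

        #define TICK_NS   1000000L        /* 1 ms tick (HZ = 1000), assumed */
        #define RUN_NS     800000L        /* let the child run ~80% of each tick */

        static void sleep_ns(long ns)
        {
            struct timespec ts = { 0, ns };
            nanosleep(&ts, NULL);
        }

        int main(int argc, char **argv)
        {
            if (argc < 2) return 1;
            pid_t pid = fork();
            if (pid == 0) {                       /* child: the program being "hidden" */
                execvp(argv[1], &argv[1]);
                _exit(127);
            }
            for (;;) {
                sleep_ns(RUN_NS);                 /* child runs for most of the tick */
                kill(pid, SIGSTOP);               /* stop it just before the tick fires */
                sleep_ns(TICK_NS - RUN_NS);       /* the tick samples someone else */
                kill(pid, SIGCONT);               /* resume right after the tick */
            }
        }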

  33. One low-end server can “steal” an entire cluster

  34. Implement #2: No external process
     • Assume the app. is an event-driven simulator (a compilable sketch follows below):
         start = cycle_counter;
         for( i = 0, 1, … ) {
             now = cycle_counter;
             if( i == 0 || (now - start > 0.9*tick) ) {
                 sleep epsilon;       // wake up at the start of the next tick
                 start = cycle_counter;
             }
             execute_event(i);
         }
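     The same loop, reconstructed as compilable C for x86/gcc (illustration only; CYCLES_PER_TICK assumes the 2.8 GHz / 1000 Hz setup mentioned earlier, and execute_event() is a placeholder for the simulator's real work):

        #include <stdint.h>
        #include <time.h>

        #define CYCLES_PER_TICK  2800000ULL      /* 2.8 GHz CPU, 1 ms tick: assumed */

        static inline uint64_t rdtsc(void)
        {
            uint32_t lo, hi;
            __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
            return ((uint64_t)hi << 32) | lo;
        }

        static void execute_event(long i)        /* placeholder for the simulator's work */
        {
            volatile double x = 0.0;
            for (int k = 0; k < 1000; k++) x += i * 1e-9;
        }

        int main(void)
        {
            uint64_t start = rdtsc();
            for (long i = 0; ; i++) {
                uint64_t now = rdtsc();
                if (i == 0 || now - start > (uint64_t)(0.9 * CYCLES_PER_TICK)) {
                    /* Sleep briefly: the tick is billed to whoever runs when it fires,
                     * and we are woken again near the start of the next tick. */
                    struct timespec eps = { 0, 1000 };   /* "epsilon" = 1 microsecond */
                    nanosleep(&eps, NULL);
                    start = rdtsc();
                }
                execute_event(i);
            }
        }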

  35. Drawback #5: VM overhead
     • IBM’s S/390:
       “We are facing the problem that we want to start many (> 1000) Linux images on a big S/390 machine. Every image has its own 100HZ timer. … You quickly end up with 100% CPU only for the timer interrupts of otherwise idle images”
       Martin Schwidefsky, Apr 2001, LKML

  36. Drawback #6: poor resolution
     • Other soft-realtime apps.:
       • crash-tests, rate-monotonic net, file-system benchmarking, various sensors, etc.
     • SIGMETRICS’03 [“Effects of Clock Resolution…”]:
       • increasing the tick rate is beneficial, but the overheads are high.

  37. Solution: Go tickless (event-based)
     • Implementation (a toy sketch follows below):
       • events: quantum expiration, process alarms, …
       • event times are aligned on Hz boundaries
       • use one-shot timers (as the Pentium’s APIC)
       • Hz is a settable parameter
         • Hz=0 (pure one-shot), Hz=100 (most GPOSs), Hz>1000 (soft RT)
       • accounting upon each kernel entry, etc.
     • Benefits:
       • reduced noise (1 runnable process => no ticks)
       • reduced overheads
       • faster computation (even for desktops)
       • reduced power consumption
       • improved security & robustness
       • improved resolution (high Hz without useless ticking)
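     A toy, user-space analogy of the core mechanism, keeping a list of pending events and always arming a one-shot timer for the earliest deadline only (Linux timerfd is used for the sketch; this is not the paper's kernel implementation, and the three deadlines are arbitrary):

        /* Toy tickless event loop: one-shot timer armed only for the earliest event. */
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/timerfd.h>
        #include <time.h>
        #include <unistd.h>

        #define NEVENTS 3

        int main(void)
        {
            /* Pending event deadlines, in seconds from now (kept trivially sorted). */
            long deadlines[NEVENTS] = { 1, 2, 4 };

            int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
            struct timespec now;
            clock_gettime(CLOCK_MONOTONIC, &now);

            for (int i = 0; i < NEVENTS; i++) {
                /* Arm a one-shot timer for the next earliest event only:
                 * it_interval stays zero, so no periodic ticks are ever generated. */
                struct itimerspec its = { {0, 0}, {0, 0} };
                its.it_value.tv_sec  = now.tv_sec + deadlines[i];
                its.it_value.tv_nsec = now.tv_nsec;
                timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL);

                uint64_t expirations;
                read(tfd, &expirations, sizeof expirations);   /* block until it fires */
                printf("event %d fired at t+%lds\n", i, deadlines[i]);
            }
            close(tfd);
            return 0;
        }

     Between events nothing runs and no interrupts arrive, which is exactly the property that removes the per-tick noise on otherwise idle or single-job nodes.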

  38. Bottom line
     • The general-purpose OS becomes more general
       • as more applications can enjoy it
