Operating System Attributes for High Performance Computing
Ken Rozendal
Distinguished Engineer
IBM Linux Technology Center
Operating System Attributes for HPC
• Reducing NUMA effects
• Exploiting larger page sizes
• Reducing operating system “jitter”
• Avoiding planned and unplanned downtime
• Other attributes
Reducing NUMA Effects
• Most systems have NUMA attributes due to their memory bus and cache designs.
• The degree of NUMA behavior differs substantially between systems.
• The default OS behavior in placing new memory pages makes a critical difference.
• Applications need to either code to the default NUMA behavior or explicitly place their memory.
• The OS needs to provide APIs for discovering the NUMA topology and for setting placement policies.
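As one illustration of a topology discovery interface, Linux exposes NUMA nodes under `/sys/devices/system/node`, with one `nodeN/cpulist` file per node. The sketch below reads that layout; it is a minimal example that assumes a Linux-style sysfs tree, and the function names `parse_cpulist` and `numa_topology` are hypothetical helpers introduced here, not part of any OS API.

```python
import os
import re


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from a sysfs-style directory.

    Assumption: `root` contains nodeN/cpulist files, as on Linux.
    Returns an empty dict when the directory does not exist.
    """
    topology = {}
    if not os.path.isdir(root):
        return topology
    for entry in os.listdir(root):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue
        try:
            with open(os.path.join(root, entry, "cpulist")) as f:
                topology[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # skip nodes whose cpulist cannot be read
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: cpus {cpus}")
```

On Linux, a process could then be restricted to one node's CPUs with `os.sched_setaffinity(0, cpus)`. Under a local-allocation default policy, pages the process touches afterwards would normally come from that node, which is one way of coding to the default behavior instead of placing memory explicitly.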
Exploiting Larger Page Sizes
• Larger page sizes reduce TLB reloads.
• Most of the benefit occurs with the first few doublings of the page size.
• Using both small and large page sizes requires very flexible allocation policies.
• The OS needs to adjust quickly to changing requirements for large pages.
• It must be possible to place large pages without changing application source code.
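The effect of page size on the TLB can be estimated with simple arithmetic: TLB reach is the number of entries times the page size. The sketch below uses a deliberately simplified model with a hypothetical 1024-entry TLB and a 64 MiB working set; both figures and the function names are illustrative assumptions, not measurements from the talk.

```python
KiB = 1024
MiB = 1024 * KiB


def tlb_reach(entries, page_size):
    """Bytes of memory the TLB can map at once."""
    return entries * page_size


def pages_needed(working_set, page_size):
    """Number of pages required to cover a working set (ceiling division)."""
    return -(-working_set // page_size)


def coverage(entries, page_size, working_set):
    """Fraction of the working set that the TLB can map, capped at 1.0."""
    return min(1.0, tlb_reach(entries, page_size) / working_set)


if __name__ == "__main__":
    entries = 1024            # hypothetical TLB size
    working_set = 64 * MiB    # hypothetical application working set
    page_size = 4 * KiB
    while page_size <= 1 * MiB:
        print(
            f"page {page_size // KiB:5d} KiB: "
            f"{pages_needed(working_set, page_size):6d} pages, "
            f"coverage {coverage(entries, page_size, working_set):.4f}"
        )
        page_size *= 2
```

In this model each doubling of the page size doubles the coverage until the TLB reach equals the working set, after which further doublings change nothing. That is one plausible reading of why the first few doublings matter most; real TLBs have multiple levels and per-size entry counts that this sketch ignores.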
Reducing Operating System “Jitter”
• OS “jitter”: interruptions to execution on one node that are amplified across a cluster
• Types of interruption: hardware and software interrupts, daemons
• Approaches:
  • Eliminate types of interrupts (e.g. timer ticks)
  • Simplify: eliminate unused subsystems
  • Daemon squashing
  • Synchronize interruptions across the CPUs on a node and across the nodes in a cluster
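The amplification can be illustrated with a small probability model. Assume, purely for illustration, a bulk-synchronous application in which every node must finish each step before any can proceed, and in which each node is independently interrupted during a step with probability `p_node`. Neither assumption comes from the slides, and the function name is hypothetical.

```python
def prob_step_delayed(p_node, nodes):
    """Probability that at least one of `nodes` independent nodes is
    interrupted during a step, which delays the whole step."""
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    p_node = 0.001  # hypothetical per-node, per-step interruption probability
    for nodes in (1, 10, 100, 1000, 10000):
        print(f"{nodes:6d} nodes: P(step delayed) = "
              f"{prob_step_delayed(p_node, nodes):.4f}")
```

Under these assumptions an interruption that hits a single node once in a thousand steps delays roughly 63% of steps on a 1000-node cluster. If interruptions were perfectly synchronized across nodes, the fraction of delayed steps would stay near the single-node value, which suggests why synchronization appears among the approaches.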
Avoiding Planned and Unplanned Downtime
• Avoid hardware failures causing downtime.
  • CPUs, memory, I/O
• Avoid downtime due to software updates.
  • Concurrent update to operating system components
• Avoid downtime due to hardware updates.
  • OS migration between systems
  • Application migration
• Recover from unplanned downtimes:
  • Checkpoint/restart
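The checkpoint/restart idea can be sketched at the application level, which is a simpler stand-in for the system-level mechanisms an OS might offer. In the example below the state is periodically written to disk with an atomic rename, and a restarted run resumes from the last saved state. The file layout, state format, and function names are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to a temporary file, then atomically rename it into place
    so a crash never leaves a half-written checkpoint at `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=10):
    """Toy computation (sum of step indices) that resumes from a checkpoint."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

If a run is killed partway through, calling `run` again with the same path repeats at most `checkpoint_every` steps of work. A system-level facility would aim for the same effect without the application having to serialize its own state.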
Other Operating System Attributes for HPC
• Support for standard programming models
• Support for high performance interconnects
• Parallel file systems
• Performance analysis and tuning tools
• Parallel application debugging tools
• Cluster system management tools