Kernel-level Measurement for Integrated Parallel Performance Views
KTAU: Kernel-TAU
Aroon Nataraj, Performance Research Lab, University of Oregon
KTAU: Outline • Introduction • Motivations • Objectives • Architecture / Implementation Choices • Experimentation – the performance views • Perturbation Study • Future work and directions • Acknowledgements
Introduction: ZeptoOS and TAU • DOE OS/RTS for Extreme Scale Scientific Computation (FastOS) • Conduct OS research to provide an effective OS/runtime for petascale systems • ZeptoOS (under FastOS) • Scalable components for petascale architectures • Joint project between Argonne National Lab and the University of Oregon • ANL: Putting a light-weight kernel (based on Linux) on BG/L and other platforms (XT3) • University of Oregon • Kernel performance monitoring and tuning • KTAU • Integration of TAU infrastructure with the Linux kernel • Integration with ZeptoOS, installation on BG/L • Port to 32-bit and 64-bit Linux platforms
KTAU: Motivation • Application performance = user-level execution performance + OS-level operations performance • Different domains: e.g., time, hardware performance metrics • PAPI (Performance Application Programming Interface) • Exposes virtualized hardware counters (see the sketch below) • TAU (Tuning and Analysis Utilities) • Measures many user-level entities of interest: parallel application, MPI, libraries … • Time domain • Uses PAPI to correlate counter information with source
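As a concrete illustration of the virtualized counters mentioned above, here is a minimal user-space PAPI sketch (the event choices are illustrative; TAU drives PAPI internally rather than through application code like this):

```c
/* Minimal PAPI usage sketch: count cycles and instructions for a code region. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);                                 /* version mismatch or init failure */
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);      /* total cycles */
    PAPI_add_event(eventset, PAPI_TOT_INS);      /* total instructions */

    PAPI_start(eventset);
    /* ... user-level code region to measure ... */
    PAPI_stop(eventset, counts);                 /* reads and stops the counters */

    printf("cycles=%lld instructions=%lld\n", counts[0], counts[1]);
    return 0;
}
```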
KTAU: Motivation – Effects of Scale • As HPC systems continue to scale to larger processor counts • Application performance becomes more sensitive • New OS factors become performance bottlenecks (e.g., [Petrini'03], [Jones'03]) • Isolating these system-level issues as bottlenecks is non-trivial • Comprehensive performance understanding requires • Observation of all performance factors • Their relative contributions and interrelationship: can we correlate?
KTAU: Motivation – Program-OS Interactions • Program-OS interactions – direct vs. indirect entry points • Direct – the application invokes the OS for certain services • Syscalls (and internal OS routines called directly from syscalls) • Indirect – the OS takes actions without explicit invocation by the application • Preemptive scheduling • (HW) interrupt handling • OS background activity (keeping track of time and timers, bottom-half handling, etc.) • Indirect interactions can occur at any OS entry (not just when entering through syscalls)
KTAU: Motivation – Program-OS Interactions • Direct interactions are easier to handle • Synchronous with user code and in process context • Indirect interactions are more difficult to handle • Usually asynchronous and in interrupt context: hard to measure, and harder to correlate/integrate with application measurements • Indirect interactions may be unrelated to the current task • E.g., kernel-level packet processing for another process • But they are related in terms of time to the current process (as sketched below)
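To see why indirect activity is "related in time" to the current process, consider a hedged kernel-side sketch: in interrupt context, the kernel's current pointer still refers to whichever task happened to be interrupted, so any time measured there is time stolen from that task. The ktau_start/ktau_stop probes and event id below are hypothetical names, not KTAU's actual API:

```c
/* Sketch: an interrupt handler runs in interrupt context, so 'current'
 * is simply the task that was executing when the IRQ arrived -- possibly
 * a task with no relation to the interrupt's cause. */
#include <linux/interrupt.h>
#include <linux/sched.h>

static irqreturn_t example_irq_handler(int irq, void *dev_id)
{
    ktau_start(current, KTAU_EVENT_IRQ);   /* hypothetical probe: charge...   */
    /* ... actual device handling, e.g. network packet processing ... */
    ktau_stop(current, KTAU_EVENT_IRQ);    /* ...the interval to 'current'    */
    return IRQ_HANDLED;
}
```

Even though 'current' never requested this work, attributing the interval to it is what lets a process-centric view explain where that task's wall-clock time went.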
KTAU: Motivation – Kernel-wide vs. Process-centric • Kernel-wide – aggregate kernel activity of all active processes in the system • Understand overall OS behavior; identify and remove kernel hot spots • Cannot show what parts of the application spend time in the OS, or why • Process-centric perspective – OS performance within the context of a specific application's execution • Virtualization and mapping of performance to processes • Interactions between programs, daemons, and system services • Tune the OS for a specific workload, or tune the application to better conform to the OS configuration • Expose the real source of performance problems (in the OS or the application)
KTAU: Motivation – Existing Approaches • User-space-only measurement tools • Many tools only work at user level and cannot observe system-level performance influences • Kernel-level-only measurement tools • Most only provide the kernel-wide perspective – lack proper mapping/virtualization • Some provide process-centric views but cannot integrate OS and user-level measurements • Combined or integrated user/kernel measurement tools • A few powerful tools allow fine-grained measurement and correlation of kernel- and user-level performance • Typically these focus only on direct OS interactions; indirect interactions are not merged • Using combinations of the above tools • Without better integration, does not allow fine-grained correlation between OS and application • Many kernel tools do not explicitly recognize parallel workloads (e.g., MPI ranks) • Need an integrated approach to parallel performance observation and analysis
KTAU: High-Level Objectives • Support low-overhead OS performance measurement at multiple levels of function and detail • Provide both kernel-wide and process-centric perspectives of OS performance • Integrate user- and kernel-level performance information across all program-OS interactions • Provide online information and the ability to function without a daemon where possible • Support both profiling and tracing for kernel-wide and process-centric views in parallel systems • Leverage existing parallel performance analysis/visualization tools • Support observing, collecting, and analyzing parallel data
KTAU: Outline • Introduction • Motivations • Objectives • Architecture / Implementation Choices • Experimentation – the performance views • Perturbation Study • ZeptoOS – KTAU on Blue Gene / L • Future work and directions • Acknowledgements
KTAU: Arch. / Impl. Choices • Instrumentation • Static source instrumentation • Macro Map-ID: maps a block of code and process context to a unique index (dense id-space) – easy array lookup • Macros Start, Stop – provide the mapping index; the process context is implicit (see the sketch below) • Measurement • Differentiate between 'local/self' and 'inter-context' access; HPC codes primarily use 'self' • Store performance data in the PCB (task_struct) • Integrating kernel/user performance state • Don't assume synchronous kernel entry or process context • Have to use memory mapping between kernel and application state • Pinning shared state in memory • Kernel call groups – program-OS interaction summary • Analyses and visualization – use TAU facilities
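A sketch of the Map-ID/Start/Stop scheme described above, assuming task_struct is extended with a dense per-event array; every name here (the ktau field, ktau_clock(), the macros) is a hypothetical stand-in rather than KTAU's actual source:

```c
/* Sketch: per-process profile storage in the PCB, indexed by a dense
 * compile-time event id, so each probe is an O(1) array access. */
#include <linux/sched.h>

#define KTAU_MAX_EVENTS 1024

struct ktau_prof {
    unsigned long long count;      /* times the code block was entered */
    unsigned long long incl_time;  /* accumulated inclusive time       */
    unsigned long long entry_ts;   /* timestamp recorded at START      */
};

/* Imagined extension: struct task_struct gains
 *     struct ktau_prof ktau[KTAU_MAX_EVENTS];
 * so the process context is implicit via 'current'. */

#define KTAU_START(id)                                              \
        do { current->ktau[(id)].entry_ts = ktau_clock(); } while (0)

#define KTAU_STOP(id)                                               \
        do {                                                        \
                struct ktau_prof *p = &current->ktau[(id)];         \
                p->incl_time += ktau_clock() - p->entry_ts;         \
                p->count++;                                         \
        } while (0)
```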
KTAU: Controlled Experiments • Controlled experiments • Exercise the kernel in a controlled fashion • Check whether KTAU produces the expected correct and meaningful views • Test machines • Neutron: 4-CPU Intel P3 Xeon, 550 MHz, 1 GB RAM, Linux 2.6.14.3 (ktau) • Neuronic: 16-node, 2-CPU Intel P4 Xeon, 2.8 GHz, 2 GB RAM/node, Red Hat Enterprise Linux, 2.4 (ktau) kernel • Benchmarks • NPB LU and SP applications [NPB] • Simulated computational fluid dynamics (CFD) applications; a regular-sparse, block lower and upper triangular system solution • LMBENCH [LMBENCH] • Suite of micro-benchmarks exercising the Linux kernel • A few others not shown (e.g., SKaMPI)
KTAU: Controlled Experiments continued… Profiling
KTAU: Controlled Experiments – Observing Interrupts
Benchmark: NPB-SP application on 16 nodes
[Figures: user-level inclusive time and user-level exclusive time]
Approx. 25 sec dilation in total inclusive time. Why? Approx. 16 sec dilation in MPSP() exclusive time. Why? The user-level profile does not tell the whole story!
KTAU: Controlled Experiments – Observing Interrupts
[Figures: user+OS inclusive time and user+OS exclusive time]
MPSP() exclusive-time difference is only 4 sec. The exclusive-time view clearly identifies the culprits: 1. schedule() 2. do_IRQ() 3. icmp_reply() 4. do_softirq()
Pings cause interrupts (do_IRQ), which are in turn handled after the interrupt by soft-interrupts (do_softirq); the actual routine is icmp_reply/icmp_rcv. The large number of softirqs causes ksoftirqd to be scheduled in, which causes SP to be scheduled out.
Kernel-space time taken by: 1. do_softirq() 2. schedule() 3. do_IRQ() 4. sys_poll() 5. icmp_rcv() 6. icmp_reply()
KTAU: Controlled Experiments – Observing Scheduling • NPB LU application on 8 CPUs (neuronic.nic.uoregon.edu) • Simulate daemon interference using a 'toy' daemon (a sketch follows this slide) • Daemon periodically wakes up and performs activity • What does KTAU show? Different views… A: Aggregated kernel-wide view (each row is a single host)
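A minimal user-space sketch of such a 'toy' daemon; the one-second period and the amount of busy work are illustrative, not the values used in the experiment:

```c
/* Toy interference daemon: sleep, then burn CPU, forever. */
#include <time.h>

int main(void)
{
    volatile double x = 0.0;
    struct timespec period = { .tv_sec = 1, .tv_nsec = 0 };

    for (;;) {
        nanosleep(&period, NULL);           /* idle: daemon blocks voluntarily */
        for (long i = 0; i < 50000000; i++) /* activity burst: steals CPU from */
            x += (double)i * 1e-9;          /* whatever application is running */
    }
    return 0;
}
```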
KTAU: Controlled Experiments – Observing Scheduling • Select node 8 and take a closer look… B: Process-level view (each row is a single process on host 8). Callouts: 'toy' daemon activity; 2 NPB LU processes
KTAU: Controlled Experiments – Observing Scheduling • Instrumentation to differentiate voluntary/involuntary scheduling (see the sketch after this slide) • Experiment re-run on a 4-processor SMP • Local slowdown – preemptive scheduling • Remote slowdown – voluntary scheduling (waiting!) C: Voluntary/involuntary scheduling (each row is a single MPI rank). One rank is preempted out by the 'toy' daemon; the other 3 yield the CPU voluntarily and wait!
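A hedged sketch of how this voluntary/involuntary distinction can be drawn inside a 2.6-era scheduler: a task that is still TASK_RUNNING when schedule() switches it out is being preempted (involuntary), whereas a task that blocked has already changed its state (voluntary). The ktau_count() probe and event ids are hypothetical placeholders:

```c
/* Sketch: instrumentation at the top of schedule() in a 2.6-era kernel. */
#include <linux/sched.h>

asmlinkage void __sched schedule(void)
{
    struct task_struct *prev = current;

    if (prev->state == TASK_RUNNING)
        ktau_count(prev, KTAU_SCHED_PREEMPT);    /* involuntary: still runnable */
    else
        ktau_count(prev, KTAU_SCHED_VOLUNTARY);  /* voluntary: blocked/waiting  */

    /* ... existing scheduler logic: pick next task, context switch ... */
}
```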
KTAU: Controlled Experiments – Observing Exceptions
LMBENCH page-fault benchmark
[Figures: call group relations; program-OS call graph]
KTAU: Controlled Experiments – Tracing
Merging application/OS traces (MPI_Send and OS routines). Fine-grained tracing shows detail inside interrupts and bottom halves. Trace visualization using VAMPIR [VAMPIR].
KTAU: Controlled Experiments continued… Tracing
Correlating CIOD and RPC-IOD activity
KTAU: Larger-Scale Runs • Run parallel benchmarks at larger scale (128 dual-CPU nodes) • Identify (and remove) system-level performance issues • Understand perturbation overheads introduced by KTAU • NPB benchmark: LU application [NPB] • Simulated computational fluid dynamics (CFD) application; a regular-sparse, block lower and upper triangular system solution • ASC benchmark: Sweep3D [Sweep3d] • Solves a 3-D, time-independent neutron particle transport equation on an orthogonal mesh • Test machine: Chiba-City Linux cluster (ANL) • 128 dual-CPU Pentium III nodes, 450 MHz, 512 MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet
KTAU: Larger-Scale Runs • Encountered problems on Chiba by chance • Initially ran the NPB-LU and Sweep3D codes in a 128x1 configuration • Then ran in a 64x2 configuration • Extreme performance hit (72% slower!) with the 64x2 runs • Used KTAU views to identify and solve issues iteratively • Eventually brought the performance gap down to 13% for LU and 9% for Sweep3D
KTAU: Larger-scale Runs
[Figures: user-level MPI_Recv; MPI_Recv OS interactions]
Two ranks show relatively very low MPI_Recv() time; the same two ranks differ from the mean in OS-SCHED within MPI_Recv().
KTAU: Larger-scale Runs
[Figures: voluntary scheduling; preemptive scheduling. Note: x-axis log scale]
Two ranks have very low voluntary scheduling durations; the same two ranks have very large preemptive scheduling durations.
KTAU: Larger-scale Runs
[Figure: ccn10 node-level view – interrupt activity]
NPB LU processes PID 4066 and PID 4068 are active, with no other significant activity. So why the preemption?
[Figure: 64x2 pinned – interrupt activity] Interrupt activity is bimodal across MPI ranks.
KTAU: Larger-scale Runs
Use 'merged' performance data to identify imbalance. Why does a purely compute-bound region have lots of I/O?
[Figures: TCP within compute – time; TCP within compute – calls]
Approx. 100% longer, with many more OS-TCP calls: 100% more background OS-TCP activity in the compute phase, and more imbalance!
KTAU: Larger-scale Runs
[Figure: cost per call of OS-level TCP – OS-TCP on SMP is costlier]
• IRQ balancing blindly distributes interrupts and bottom halves • E.g., handling a TCP-related bottom half on CPU 0 for an LU process running on CPU 1 • Cache issues! [COMSWARE] • One mitigation is to pin interrupt affinity, as sketched below
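One standard mitigation (consistent with the 'pinned' runs shown earlier) is to pin the NIC's interrupt to the CPU running the communicating process, so TCP bottom halves hit a warm cache. A minimal user-space sketch, assuming a hypothetical NIC IRQ number of 19; the slides do not specify the exact fix applied:

```c
/* Pin IRQ 19 (hypothetical NIC interrupt) to CPU 1 via procfs. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/19/smp_affinity", "w");
    if (!f) { perror("smp_affinity"); return 1; }
    fprintf(f, "2\n");   /* CPU bitmask: 0x2 selects CPU 1 only */
    fclose(f);
    return 0;
}
```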
KTAU Perturbation Study • Five different configurations • Base: vanilla kernel, un-instrumented benchmark • Ktau-Off: kernel patched with KTAU and instrumentation compiled in, but all instrumentation turned off (boot-time control; a sketch of such gating follows this slide) • Prof-All: all kernel instrumentation turned on • Prof-Sched: only the scheduler subsystem's instrumentation turned on • Prof-All+TAU: Prof-All, but also with user-level TAU instrumentation enabled • NPB LU application benchmark: • 16 nodes, 5 different configurations, mean over 5 runs each • ASC Sweep3D: • 128 nodes, Base and Prof-All+TAU, mean over 5 runs each • Test machine: Chiba-City (ANL)
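The low cost of the Ktau-Off configuration (the "disabled probe effect") is consistent with a compiled-in-but-gated design, where every probe tests a single flag settable from the kernel command line. A hedged sketch; the ktau= boot parameter and flag name are assumptions, not KTAU's actual mechanism:

```c
/* Sketch: boot-time gating of compiled-in instrumentation. */
#include <linux/init.h>

static int ktau_enabled;                 /* default: instrumentation off */

static int __init ktau_setup(char *str)
{
    ktau_enabled = (str && str[0] == '1');
    return 1;                            /* parameter consumed */
}
__setup("ktau=", ktau_setup);            /* boot with "ktau=1" to enable */

/* Each probe reduces to one predictable branch when disabled. */
#define KTAU_START(id) \
        do { if (ktau_enabled) __ktau_start(id); } while (0)
```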
KTAU Perturbation Study
Sweep3D on 128 nodes: elapsed time 368.25 (Base) vs. 369.9 (Prof-All+TAU), an average slowdown of 0.49%.
Complete integrated profiling cost is under 3% on average, and as low as 1.58%. The disabled-probe effect is small, and a single instrumentation point (e.g., scheduling) is very cheap.
KTAU: Outline • Introduction • Motivations • Objectives • Architecture / Implementation Choices • Experimentation – the performance views • Perturbation Study • Future work and directions • Acknowledgements
KTAU: Future Work • Dynamic measurement control – enable/disable events without recompilation or reboot • Improve the performance data sources KTAU can access, e.g., PAPI • Improve integration with TAU's user-space capabilities to provide even better correlation of user and kernel performance information • full callpaths • phase-based profiling • merged user/kernel traces • Integration of TAU and KTAU with Supermon (possibly MRNet?) and TAUg • Porting efforts: IA-64, PPC-64, and AMD Opteron • ZeptoOS: planned characterization efforts • BG/L I/O node • Dynamically adaptive kernels
Acknowledgements • Prof. Allen D. Malony • Dr. Sameer Shende, Senior Scientist • Alan Morris, Senior Software Engineer, PRL • Suravee Suthikulpanit, MS Student (graduated)
Support Acknowledgements • Department of Energy’s Office of Science (contract no. DE-FG02-05ER25663) and • National Science Foundation (grant no. NSF CCF 0444475)
References • [Petrini'03]: F. Petrini, D. J. Kerbyson, and S. Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q," in SC '03. • [Jones'03]: T. Jones et al., "Improving the Scalability of Parallel Jobs by Adding Parallel Awareness to the Operating System," in SC '03. • [PAPI]: S. Browne et al., "A Portable Programming Interface for Performance Evaluation on Modern Processors," The International Journal of High Performance Computing Applications, 14(3):189–204, Fall 2000. • [VAMPIR]: W. E. Nagel et al., "VAMPIR: Visualization and Analysis of MPI Resources," Supercomputer, vol. 12, no. 1, pp. 69–80, 1996. • [ZeptoOS]: "ZeptoOS: The Small Linux for Big Computers," http://www.mcs.anl.gov/zeptoos/ • [NPB]: D. H. Bailey et al., "The NAS Parallel Benchmarks," The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991.
References • [Sweep3d]: A. Hoisie et al., "A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs," in International Conference on Parallel Processing, 2000. • [LMBENCH]: L. W. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," in USENIX Annual Technical Conference, 1996, pp. 279–294. • [TAU]: "TAU: Tuning and Analysis Utilities," http://www.cs.uoregon.edu/research/paracomp/tau/ • [KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, "Early Experiences with KTAU on the IBM BG/L," in EuroPar '06, European Conference on Parallel Processing, 2006. • [KTAU]: A. Nataraj et al., "Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project" (under submission).