Kernel-level Measurement for Integrated Parallel Performance Views A. Nataraj, A. Malony, S. Shende, A. Morris {anataraj,malony,sameer,amorris}@cs.uoregon.edu Performance Research Lab University of Oregon
Motivation • Application performance is a consequence of • User-level execution • OS-level operation • Good tools exist for observing user-level performance • User-level events • Communication events • Execution time measures • Hardware performance • Fewer tools exist to observe OS-level aspects • OS-level influences on application performance • Ideally we would like to observe both simultaneously
Scale and Performance Sensitivity • HPC systems continue to scale to larger processor counts • Application performance becomes more sensitive to OS factors • OS factors can lead to performance bottlenecks [Petrini'03, Jones'03, …] • System/application performance effects are complex • Isolating system-level factors is non-trivial • Comprehensive performance understanding is required • Observation of all performance factors • Their relative contributions and interrelationships • Can we correlate OS and application performance?
Phase Performance Effects • Waiting time due to OS • Overhead accumulates
Program - OS Interactions • Direct • applications invoke the OS for certain services • syscalls and internal OS routines called from syscalls • Indirect • OS operations w/o explicit invocation by application • preemptive scheduling • (HW) interrupt handling • OS-background activity • can occur at any time
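To make the direct/indirect distinction concrete, here is a minimal user-space sketch (illustrative only, not part of KTAU) that times one direct program-OS interaction, a read() syscall, with clock_gettime(). Note what such a measurement cannot see: preemptive scheduling, interrupt handling, and OS-background activity leave no trace here, which is precisely why kernel-level probes are needed.

```c
/* Sketch: timing a direct program-OS interaction (a read() syscall)
 * from user space. Indirect interactions -- preemptive scheduling,
 * interrupt handling, background kernel activity -- are invisible
 * to this kind of measurement. Illustrative code, not KTAU. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    char buf[4096];
    struct timespec t0, t1;
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = read(fd, buf, sizeof buf);   /* the direct OS interaction */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("read() of %zd bytes took %.2f us\n", n, us);
    close(fd);
    return 0;
}
```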
Performance Perspectives • Kernel-wide • Aggregate kernel activity of all active processes • Understand overall OS behavior • Identify and remove kernel hot spots • Cannot show application-specific OS actions • Process-centric • OS performance in a specific application context • Virtualization and mapping of performance to processes • Interactions among programs, daemons, and system services • Expose sources of performance problems • Tune the OS for a specific workload, and the application for the OS
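For contrast with the process-centric perspective, standard user-space interfaces already give a coarse per-process user/kernel split. The sketch below (an illustration, not KTAU functionality) uses getrusage() to report user vs. system time; unlike kernel-level instrumentation it yields only aggregate totals and cannot attribute time to individual kernel routines.

```c
/* Sketch: a coarse process-centric split of user vs. kernel (system)
 * time via getrusage(). Aggregate totals only -- no per-routine
 * attribution as with in-kernel instrumentation. Illustrative code. */
#include <stdio.h>
#include <sys/resource.h>

static void report(const char *label)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: user %ld.%06lds  sys %ld.%06lds\n", label,
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
}

int main(void)
{
    report("start");
    /* ... application work would go here ... */
    report("end");
    return 0;
}
```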
Existing Approaches • User-space-only measurement tools • Cannot observe system-level performance influences • Kernel-level-only measurement tools • Cannot integrate OS and user-level measurements • A few combined/integrated user-kernel measurement tools • Can correlate kernel and user-level performance • Typically focus ONLY on direct OS interactions • Indirect interactions not normally merged • Do not explicitly recognize parallel workloads • MPI ranks, OpenMP threads, … • We need an integrated approach to parallel performance observation and analysis that supports both perspectives
High-Level Objectives • Low-overhead OS performance measurement • Both kernel-wide and process-centric perspectives • Merge user-level and kernel-level performance information across all program-OS interactions • Provide online information and the ability to function without a daemon where possible • Support both profiling and tracing for kernel-wide and process-centric views in parallel systems • Leverage existing parallel performance analysis and visualization tools • Support for observing, collecting, and analyzing parallel data
ZeptoOS and TAU/KTAU • Lots of fine-grained OS measurement is required for each component of the ZeptoOS work • How and why do the various OS source and configuration changes affect parallel applications? • How do we correlate performance data between • OS components • Parallel application and OS • Solution: TAU/KTAU • An integrated methodology and framework to measure performance of applications and OS kernel
KTAU Case Studies • Demonstrate KTAU capabilities • Controlled experiments • Larger-scale benchmarks • Investigate KTAU operating characteristics • Overhead • Test environment • Neutron: 4-CPU Intel P3 Xeon 550 MHz, 1 GB RAM, Linux 2.6.14.3 (ktau) • Neuronic: 16-node 2-CPU Intel P4 Xeon 2.8 GHz, 2 GB RAM/node, Redhat Enterprise Linux 2.4 (ktau) • Chiba City: 128 dual-CPU Pentium III 450 MHz, 512 MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet
Scenarios and Benchmarks • Controlled experiment scenarios • Interrupt load • Scheduling load • System calls • Benchmarks • NPB LU and SP applications [NPB] • Simulated computational fluid dynamics (CFD) applications solving a regular-sparse, block lower/upper triangular system • LMBENCH [LMBENCH] (not shown here) • Suite of micro-benchmarks exercising the Linux kernel
Observing Interrupts • Benchmark: NPB-SP application on 16 nodes • User-level inclusive time: approx. 25 secs dilation in total inclusive time. Why? • User-level exclusive time: approx. 16 secs dilation in MPSP() exclusive time. Why? • The user-level profile does not tell the whole story!
Observing Interrupts • User+OS inclusive and exclusive time views • MPSP() exclusive-time difference is only 4 secs • Kernel-space time is taken by: do_softirq(), schedule(), do_IRQ(), sys_poll(), icmp_rcv(), icmp_reply() • The exclusive-time view clearly identifies the culprits: schedule(), do_IRQ(), icmp_reply(), do_softirq()
Observing Scheduling • NPB LU application on 8 CPUs (neuronic.nic.uoregon.edu) • Simulate daemon interference using a "toy" daemon (sketched after the process-level view below) • The daemon periodically wakes up and performs activity • What does KTAU show? Different views… • A: Aggregated kernel-wide view (each row is a single host)
Observing Scheduling • Select node 8 and take a closer look… • B: Process-level view (each row is a single process on host 8), showing the "toy" daemon activity and the 2 NPB LU processes
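A minimal sketch of such a "toy" interference daemon, assuming a 100 ms wake-up period and a fixed busy-work burst (the parameters actually used in the study are not given here):

```c
/* Sketch of a "toy" interference daemon: wake periodically and burn
 * CPU, stealing cycles from benchmark processes via preemptive
 * scheduling. Period and burst length are assumed values. */
#include <unistd.h>

static volatile unsigned long sink;   /* volatile: keep the busy-work alive */

int main(void)
{
    for (;;) {
        usleep(100000);                       /* sleep ~100 ms (assumed) */
        for (unsigned long i = 0; i < 50000000UL; i++)
            sink += i;                        /* busy-work burst */
    }
    return 0;
}
```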
Merging App/OS Traces • Tracing system calls: MPI_Send and the OS routines beneath it • Fine-grained tracing shows detail inside interrupts and bottom halves • Using VAMPIR trace visualization [VAMPIR]
Larger-Scale Runs • Run parallel benchmarks at larger scale • 128 dual-CPU nodes • Identify (and remove) system-level performance issues • Understand the overheads introduced by KTAU • NPB benchmark: LU application [NPB] • ASC benchmark: Sweep3D [Sweep3d]
LU Experiment • Encountered problems on Chiba City by chance • Initially ran the NPB-LU and Sweep3D codes in a 128x1 configuration • Then ran in a 64x2 configuration • Extreme performance hit (72% slower!) with the 64x2 runs • Used KTAU views to identify and solve the issues iteratively • Eventually brought the performance gap down to 13% for LU and 9% for Sweep3D
LU Experiment • 64x2 configuration: MPI_Recv OS interactions • User-level view: two ranks show relatively very low MPI_Recv() time • The same two ranks' MPI_Recv() differs from the mean in OS-SCHED
LU Experiment • Voluntary vs. preemptive scheduling (note: x-axis log scale) • Two ranks have very low voluntary scheduling durations • The same two ranks have very large preemptive scheduling durations
LU Experiment • ccn10 node-level view of interrupt activity: only the NPB LU processes (PID 4066, PID 4068) are active, with no other significant activity! Why the preemption? • 64x2 pinned: interrupt activity is bimodal across MPI ranks
Sweep3D Experiment • Use merged performance data to identify overlap and imbalance • Why does a purely compute-bound region have lots of I/O? • TCP within compute (time and calls): approx. 100% longer, with many more OS-TCP calls • 100% more background OS-TCP activity in the compute phase, and more imbalance! • I/O-compute overlap: explicit MPI or OS buffering (TCP sends) • Imbalance in workload: TCP receives from on-time finishers
Sweep3D Experiment • Cost per call of OS-level TCP: OS-TCP on SMP is costlier • IRQ balancing blindly distributes interrupts and bottom halves • E.g., handling a TCP-related bottom half on CPU 0 for an LU process running on CPU 1 • Cache issues! [COMSWARE]
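One possible remedy (an assumption on our part, not something the slides prescribe) is to pin the NIC's IRQ to the CPU running the consuming process, so interrupt and bottom-half handling stay cache-local. The sketch below writes a CPU bitmask to the standard Linux /proc/irq/<N>/smp_affinity interface; the IRQ number 19 is a placeholder for the real NIC IRQ from /proc/interrupts, and root privileges are required.

```c
/* Sketch: pin an IRQ to one CPU by writing a CPU bitmask to
 * /proc/irq/<N>/smp_affinity, keeping interrupt and bottom-half
 * handling on the same CPU as the consuming process. IRQ 19 is a
 * placeholder; find the real NIC IRQ in /proc/interrupts. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/19/smp_affinity", "w");  /* needs root */
    if (!f) { perror("smp_affinity"); return 1; }
    fprintf(f, "2\n");   /* bitmask 0x2 = CPU 1 only */
    fclose(f);
    return 0;
}
```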
Perturbation Study • Five different configurations • Baseline: vanilla kernel, uninstrumented benchmark • Ktau-Off: kernel instrumentation compiled in but turned off (measures the disabled-probe effect) • Prof-All: all kernel instrumentation turned on • Prof-Sched: only the scheduler subsystem's instrumentation turned on • Prof-All+TAU: Prof-All plus user-level TAU instrumentation enabled • NPB LU application benchmark: 16 nodes, 5 different configurations, mean over 5 runs each • ASC Sweep3D: 128 nodes, Baseline and Prof-All+TAU, mean over 5 runs each • Test machine: Chiba City, ANL
Perturbation Study • Sweep3D on 128 nodes: Baseline elapsed time 368.25 s vs. Prof-All+TAU 369.9 s, an average slowdown of 0.49% • Complete integrated profiling cost is under 3% on average, and as low as 1.58% • The disabled-probe effect is negligible, and a single instrumentation point (e.g., scheduling) is very cheap
Future Work • Dynamic measurement control • Improve performance data sources • Improve integration with TAU’s user-space capabilities • Better correlation of user and kernel performance • Full callpaths and phase-based profiling • Merged user/kernel traces (already available) • Integration of TAU and KTAU with Supermon • Porting efforts to IA-64, PPC-64, and AMD Opteron • ZeptoOS characterization efforts • BGL I/O node • Dynamically adaptive kernels
Acknowledgements • Department of Energy’s Office of Science • National Science Foundation • University of Oregon (UO) Core Team • Aroon Nataraj, PhD Student • Prof. Allen D Malony • Dr. Sameer Shende, Senior Scientist • Alan Morris, Senior Software Engineer • Argonne National Lab (ANL) Contributors • Pete Beckman • Kamil Iskra • Kazutomo Yoshii • Thanks also to… • Suravee Suthikulpanit, UO MS (Graduated) • Rick Bradshaw, HPC Systems Support, ANL
Outline • Motivations • Objectives • ZeptoOS project • KTAU Architecture • Case Studies - the Performance Views • Perturbation Study • Future work • Acknowledgements
Recent Work on ZeptoOS Project • Accurate identification of "noise" sources • Modified Linux on BG/L should be efficient • Effect of OS "noise" on synchronization/collectives • Which OS aspects induce which types of interference • Code paths • Configurations • Attached devices • Requires user-level and OS measurement • If we can identify the noise sources, we can remove or alleviate the interference
Approach • ANL Selfish benchmark to identify "detours" • Noise events in user space • Shows durations and frequencies of events • Does NOT show cause or source • Runs a tight loop with an expected (ideal) duration • Logs times and durations of detours (see the sketch below) • Use KTAU OS-tracing to record OS activity • Correlate time of occurrence • Uses the same time source as the Selfish benchmark • Infer the type of OS activity (if any) causing the "detour"
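A hedged sketch of a Selfish-style detour detector as described above: calibrate the expected per-iteration cost in a warm-up pass, then spin in a tight loop, timestamp each iteration, and log any gap well beyond the expected cost as a "detour". The warm-up scheme and the 100x threshold are assumptions; ANL's actual benchmark differs in its details.

```c
/* Sketch: Selfish-style detour detection. Spin, timestamp each
 * iteration, and report gaps far above the calibrated ideal as
 * "detours". Threshold and calibration are assumed, not ANL's. */
#include <stdio.h>
#include <time.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    /* Warm-up: take the smallest positive gap as the ideal iteration time. */
    double ideal = 1e9, prev = now_us();
    for (int i = 0; i < 10000; i++) {
        double t = now_us(), gap = t - prev;
        if (gap > 0 && gap < ideal) ideal = gap;
        prev = t;
    }

    prev = now_us();
    for (long iter = 0; iter < 100000000L; iter++) {
        double t = now_us(), gap = t - prev;
        if (gap > 100.0 * ideal)   /* assumed detour threshold */
            printf("detour at t=%.3f us, duration %.3f us\n", t, gap);
        prev = t;
    }
    return 0;
}
```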
OS/User Performance View of Scheduling • Preemptive scheduling
Other Applications of KTAU • Need to fill this in … (is there time)