Early Experiences with KTAU on the Blue Gene/L A. Nataraj, A. Malony, A. Morris, S. Shende Performance Research Lab University of Oregon
Outline • Introduction • Motivations • Objectives • Architecture • KTAU on Blue Gene/L • Ongoing / Recent work • Future work and directions • Acknowledgements • References • Team
Introduction: ZeptoOS and TAU • DOE OS/RTS for Extreme Scale Scientific Computation (FastOS) • Conduct OS research to provide an effective OS/runtime for petascale systems • ZeptoOS (under FastOS) • Scalable components for petascale architectures • Joint project between Argonne National Lab and the University of Oregon • ANL: Putting a lightweight kernel (based on Linux) on BG/L and other platforms (XT3) • University of Oregon • Kernel performance monitoring, tuning • KTAU • Integration of TAU infrastructure in the Linux kernel • Integration with ZeptoOS, installation on BG/L • Port to 32-bit and 64-bit Linux platforms
ZeptoOS: The Small Linux for Big Computers • Research Exploration • What are the fundamental limits and advanced designs required for petascale operating system suites? • Behaviour at large scales • Management & optimization of OS suites • Collectives • Fault tolerance • Measurement, collection and analysis of OS performance data from a large number of nodes • Strategy • Modified Linux on BG/L I/O nodes • Measure and understand behavior • Modified Linux for BG/L compute nodes • Measure and understand behavior • Specialized I/O daemon on the I/O node (ZOID) • Measure and understand behavior (ZeptoOS BG/L Symposium presentation slide reused with permission from Pete Beckman [beckman06-bgl])
ZeptoOS and KTAU • A lot of fine-grained OS measurement is required for each component of the ZeptoOS work • Exactly what aspects of Linux need to be changed to achieve the ZeptoOS goals? • How and why do the various OS source and configuration changes affect parallel applications? • How do we correlate performance data between • the parallel application, • the compute node OS, • the I/O daemon, and • the I/O node OS? • Enter TAU/KTAU - an integrated methodology and framework to measure the performance of applications and the OS kernel across a system like BG/L.
Motivation • Application Performance • user-level execution performance + • OS-level operations performance • Domains: time and hardware performance metrics • PAPI (Performance Application Programming Interface) • Exposes virtualized hardware counters • TAU (Tuning and Analysis Utilities) • Measures most user-level entities: parallel application, MPI, libraries … • Time domain • Uses PAPI to correlate counter information to source • But how to correlate OS-level influences with application performance?
Motivation (continued) • As HPC systems continue to scale to larger processor counts • Application performance becomes more sensitive • New OS factors become performance bottlenecks (e.g. [Petrini’03, Jones’03, other works…]) • Isolating these system-level issues as bottlenecks is non-trivial • Requires comprehensive performance understanding • Observation of all performance factors • Relative contributions and interrelationships • Can we correlate?
Motivation (continued) Program-OS Interactions • Program-OS interactions - direct vs. indirect entry points • Direct - applications invoke the OS for certain services • Syscalls (and internal OS routines called directly from syscalls) • Indirect - the OS takes actions without explicit invocation by the application • Preemptive scheduling • (HW) interrupt handling • OS background activity (keeping track of time and timers, bottom-half handling, etc.) • Indirect interactions can occur at any OS entry (not just when entering through syscalls) • Direct interactions are easier to handle • Synchronous with user code and in process context • Indirect interactions are more difficult to handle • Usually asynchronous and in interrupt context: hard to measure and harder to correlate/integrate with application measurements
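A direct interaction of the kind listed above can be observed crudely from user space by timing the syscall itself. The sketch below is an illustration, not KTAU code: it times one read() of /dev/zero, and any indirect interaction (an interrupt, a preemption) landing inside the timed region is silently folded into the result, which is exactly the attribution problem KTAU addresses by measuring inside the kernel.

```python
# Illustrative sketch (not KTAU code): timing a *direct* program-OS
# interaction -- an explicit read() syscall -- from user space.
import os
import time

def time_syscall_ns(path="/dev/zero", nbytes=4096):
    """Time a single read() syscall; returns (bytes_read, elapsed_ns)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        t0 = time.perf_counter_ns()
        data = os.read(fd, nbytes)   # direct entry into the OS
        t1 = time.perf_counter_ns()
    finally:
        os.close(fd)
    # Caveat: an *indirect* interaction (interrupt, preemption) occurring
    # between t0 and t1 is indistinguishably folded into elapsed_ns.
    return len(data), t1 - t0
```

A user-space timer like this can say a syscall was slow, but not why; attributing the time to scheduling, interrupts, or the syscall path itself requires kernel-side measurement.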
Motivation (continued) Kernel-wide vs. Process-centric • Kernel-wide - aggregate kernel activity of all active processes in the system • Understand overall OS behavior, identify and remove kernel hot spots • Cannot show what parts of the application spend time in the OS and why • Process-centric perspective - OS performance within the context of a specific application’s execution • Virtualization and mapping of performance to processes • Interactions between programs, daemons, and system services • Tune the OS for a specific workload, or tune the application to better conform to the OS configuration • Expose the real source of performance problems (in the OS or the application)
Motivation (continued) Existing Approaches • User-space-only measurement tools • Many tools work only at user level and cannot observe system-level performance influences • Kernel-level-only measurement tools • Most provide only the kernel-wide perspective – lack proper mapping/virtualization • Some provide process-centric views but cannot integrate OS and user-level measurements • Combined or integrated user/kernel measurement tools • A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance • Typically these focus only on direct OS interactions; indirect interactions are not merged • Using combinations of the above tools • Without better integration, does not allow fine-grained correlation between the OS and the application • Many kernel tools do not explicitly recognize parallel workloads (e.g. MPI ranks) • Need an integrated approach to parallel performance observation and analysis
High-Level Objectives • Support low-overhead OS performance measurement at multiple levels of function and detail • Provide both kernel-wide and process-centric perspectives of OS performance • Merge user-level and kernel-level performance information across all program-OS interactions • Provide online information and the ability to function without a daemon where possible • Support both profiling and tracing for kernel-wide and process-centric views in parallel systems • Leverage existing parallel performance analysis tools • Support for observing, collecting and analyzing parallel data
KTAU: Outline • Introduction • Motivations • Objectives • Architecture • KTAU on Blue Gene/L • Recent/Ongoing Work (since publication) • Future work and directions • Acknowledgements • References • Team
KTAU on BG/L’s ZeptoOS • I/O Node • Open-source modified Linux kernel (2.4, 2.6) - ZeptoOS • Control I/O Daemon (CIOD) handles I/O syscalls from the compute nodes in its pset • Compute Node • IBM proprietary (closed-source) lightweight kernel • No scheduling or virtual memory support • Forwards I/O syscalls to the CIOD on the I/O node • KTAU on the I/O Node: • Integrated into the ZeptoOS config and build system • Requires KTAU-D (daemon), as CIOD is closed-source • KTAU-D periodically monitors system-wide or individual-process performance data • Visualization of traces/profiles of ZeptoOS and CIOD using ParaProf, Vampir/Jumpshot
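Since CIOD cannot be instrumented directly, KTAU-D's role reduces to a simple pattern: periodically read kernel-exposed performance data and log snapshots for offline profile/trace analysis. A minimal sketch of that polling loop, assuming a generic file path as the data source (the real KTAU-D interface, /proc layout, and record format differ):

```python
# Minimal sketch of a KTAU-D-style polling loop: periodically read
# kernel-side performance data exposed through a /proc-style file and
# append each snapshot to a log for later analysis. The source path
# and record format here are placeholders, not KTAU's actual interface.
import time

def poll_kernel_data(src_path, out_path, interval_s=1.0, snapshots=3):
    """Append `snapshots` timestamped copies of src_path to out_path."""
    with open(out_path, "a") as out:
        for i in range(snapshots):
            with open(src_path) as src:
                data = src.read()
            out.write(f"--- snapshot {i} @ {time.time():.3f} ---\n{data}")
            out.flush()
            if i + 1 < snapshots:
                time.sleep(interval_s)
```

In the real setup the source would be KTAU's kernel interface on the ZeptoOS I/O node, and the collected snapshots would be turned into ParaProf profiles or Vampir/Jumpshot traces.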
On BG/L (continued) Early Experiences • CIOD kernel trace, zoomed in (running the iotest benchmark)
On BG/L (continued) Early Experiences • Correlating CIOD and RPC-IOD activity
KTAU on BG/L Will Eventually Look Like … • Compute node kernel replaced with Linux + KTAU • CIOD replaced with ZOID + TAU
Ongoing/Recent Work (since publication) • Accurate Identification of “noise” sources • Modified Linux on BG/L should not take a performance loss • One area of concern - OS “noise” effects on Synchronization / Collectives • Requires identifying exactly what aspects (code paths, configurations, devices attached) of the OS induce what types of interference • This will require user-level as well as OS measurement • Our Approach • Use the Selfish benchmark [Beckman06] to identify “detours” (or noise events) in user-space • This shows durations and frequencies of events, but NOT cause/source. • Simultaneously use KTAU OS-tracing to record OS activity • Correlate time of occurrence (both use same time source - hw time counter) • Infer which type of OS-activity (if any) caused the “detour” • Remove or alleviate interference using above information (Work-in-progress)
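The Selfish-style detection step described above boils down to a tight loop over a high-resolution timer: any gap between consecutive reads well above the loop's normal cost is recorded as a "detour" (a likely noise event). A minimal sketch of the idea follows; the threshold and iteration count are illustrative choices, not the benchmark's actual parameters. KTAU's OS trace, taken against the same time source, would then be searched around each recorded timestamp to name the responsible kernel activity.

```python
# Sketch of a Selfish-style "detour" detector: spin on a high-resolution
# clock and record any gap between consecutive reads that exceeds a
# threshold. Parameters are illustrative, not the benchmark's actual ones.
import time

def detect_detours(iterations=200_000, threshold_ns=50_000):
    """Return a list of (timestamp_ns, duration_ns) noise events."""
    detours = []
    prev = time.perf_counter_ns()
    for _ in range(iterations):
        now = time.perf_counter_ns()
        gap = now - prev
        if gap > threshold_ns:       # the loop was interrupted this long
            detours.append((prev, gap))
        prev = now
    return detours
```

This gives durations and frequencies of interruptions but, as the slide notes, not their cause; correlating each (timestamp, duration) pair against a simultaneous KTAU OS trace is what attributes the detour to a specific kernel activity.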
Ongoing/Recent Work (continued) “Noise” Source Identification • BG/L I/O node: merged OS/user performance view of scheduling
Ongoing/Recent Work (continued) “Noise” Source Identification • Merged OS/user view of OS background activity
Ongoing/Recent Work (continued) “Noise” Source Identification • Zoomed in: merged OS/user view of OS background activity
Future Work • Dynamic measurement control - enable/disable events without recompilation or reboot • Improve the performance data sources that KTAU can access, e.g. PAPI • Improve integration with TAU’s user-space capabilities to provide even better correlation of user and kernel performance information • full callpaths, • phase-based profiling, • merged user/kernel traces (already available) • Integration of TAU and KTAU with Supermon • Porting efforts: IA-64, PPC-64 and AMD Opteron • ZeptoOS: planned characterization efforts • BG/L I/O node • Dynamically adaptive kernels
Support Acknowledgements • Department of Energy’s Office of Science (contract no. DE-FG02-05ER25663) and • National Science Foundation (grant no. NSF CCF 0444475)
References • [Petrini’03]: F. Petrini, D. J. Kerbyson, and S. Pakin, “The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q,” in SC ’03 • [Jones’03]: T. Jones et al., “Improving the scalability of parallel jobs by adding parallel awareness to the operating system,” in SC ’03 • [PAPI]: S. Browne et al., “A Portable Programming Interface for Performance Evaluation on Modern Processors,” The International Journal of High Performance Computing Applications, 14(3):189–204, Fall 2000 • [VAMPIR]: W. E. Nagel et al., “VAMPIR: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, no. 1, pp. 69–80, 1996 • [ZeptoOS]: “ZeptoOS: The Small Linux for Big Computers,” http://www.mcs.anl.gov/zeptoos/ • [NPB]: D. H. Bailey et al., “The NAS Parallel Benchmarks,” The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991
References • [Sweep3D]: A. Hoisie et al., “A general predictive performance model for wavefront algorithms on clusters of SMPs,” in International Conference on Parallel Processing, 2000 • [LMBENCH]: L. W. McVoy and C. Staelin, “lmbench: Portable tools for performance analysis,” in USENIX Annual Technical Conference, 1996, pp. 279–294 • [TAU]: “TAU: Tuning and Analysis Utilities,” http://www.cs.uoregon.edu/research/paracomp/tau/ • [KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, “Early Experiences with KTAU on the IBM BG/L,” in EuroPar ’06, European Conference on Parallel Processing, 2006 • [KTAU]: A. Nataraj et al., “Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project,” in IEEE Cluster 2006 (Best Paper)
Team University of Oregon (UO) Core Team • Aroon Nataraj, PhD Student • Prof. Allen D. Malony • Dr. Sameer Shende, Senior Scientist • Alan Morris, Senior Software Engineer Argonne National Lab (ANL) Contributors • Pete Beckman • Kamil Iskra • Kazutomo Yoshii Past Members • Suravee Suthikulpanit, MS Student, UO (Graduated)
Thank You • Questions? • Comments? • Feedback?