Scalable Dynamic Instrumentation for Bluegene/L by Gregory Lee
Acknowledgements • Lawrence Livermore National Laboratory employees (the brains behind the project): • Dong Ahn, Bronis de Supinski, and Martin Schulz • Fellow Student Scholars: • Steve Ko (University of Illinois) and Barry Rountree (University of Georgia) • Advisor: • Allan Snavely
Bluegene/L • 64K compute nodes, two 700 MHz processors per node • Compute nodes run a custom lightweight kernel – no multitasking, limited system calls • Dedicated I/O nodes handle the remaining system calls and external communication • 1024 I/O nodes with the same architecture as the compute nodes
Why Dynamic Instrumentation? • Parallel applications often have long compile times and longer run times • Typical method (printf or log file): • Modify code to output intermediate steps, recompile, rerun • Static code instrumentation? • Need to restart application • Where to put instrumentation? • Dynamic instrumentation • No need to recompile or rerun application • Instrumentation can be easily added and removed
DynInst • API that allows a tool to insert code “snippets” into a running application • Interfaces with the debugger facilities (e.g. ptrace) • Machine-independent interface via abstract syntax trees • Uses trampoline code to branch from instrumentation points to snippets
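For concreteness, below is a minimal mutator sketch using DynInst's BPatch classes. The class and method names follow the published BPatch API but vary somewhat across DynInst versions, and the target and callee function names ("compute_step", "log_entry") are purely illustrative.

```cpp
#include "BPatch.h"
#include "BPatch_process.h"
#include "BPatch_image.h"
#include "BPatch_function.h"
#include "BPatch_point.h"
#include "BPatch_snippet.h"
#include <cstdlib>
#include <vector>

int main(int argc, char *argv[]) {
    if (argc < 3) return 1;                         // usage: mutator <pid> <path>
    BPatch bpatch;                                  // library entry point
    int pid = std::atoi(argv[1]);

    // Attach to the running application (the "mutatee").
    BPatch_process *proc = bpatch.processAttach(argv[2], pid);
    BPatch_image *image = proc->getImage();

    // Locate the function to instrument and the function the snippet will call
    // ("compute_step" and "log_entry" are illustrative names).
    std::vector<BPatch_function *> targets, loggers;
    image->findFunction("compute_step", targets);
    image->findFunction("log_entry", loggers);

    // Build the snippet log_entry(42) as an abstract syntax tree ...
    std::vector<BPatch_snippet *> args;
    BPatch_constExpr val(42);
    args.push_back(&val);
    BPatch_funcCallExpr callLog(*loggers[0], args);

    // ... and splice it in at the target's entry point via a trampoline.
    std::vector<BPatch_point *> *entry = targets[0]->findPoint(BPatch_entry);
    proc->insertSnippet(callLog, *entry);

    proc->continueExecution();                      // let the mutatee keep running
    return 0;
}
```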
Dynamic Probe Class Library (DPCL) • API built on DynInst • Higher level of abstraction • Instrumentation expressed as probes built from C-like expressions • Adds the ability to instrument multiple processes in a parallel application • Allows data to be sent from the application back to the tool
Scaling Issues with DPCL • Tool requires one socket per super daemon and one per daemon • Strains system limits on open sockets • Huge bottleneck at the tool front end • BG/L compute nodes can’t host daemons • No multitasking • DPCL expects the application to share data with a co-located daemon via shared memory
Multicast Reduction Network (MRNet) • Developed at the University of Wisconsin–Madison • Creates a tree-topology network between the front-end tool and the compute-node application processes • Scalable multicast from the front end to the compute nodes • Upstream, downstream, and synchronization filters
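A rough front-end sketch of the usage pattern this implies: multicast a request down the tree and receive a sum-reduced reply. The Network/Communicator/Stream/Packet classes and the TFILTER_SUM / SFILTER_WAITFORALL filters are MRNet's, but the factory function differs between MRNet releases, and the topology file, back-end executable, and payload here are assumptions.

```cpp
// Front-end sketch (assumed file names and payload): broadcast one integer to
// all back-ends and receive the sum of their replies via MRNet's TFILTER_SUM.
// Note: older MRNet releases construct the Network directly instead of using
// the CreateNetworkFE factory shown here.
#include "mrnet/MRNet.h"
#include <cstdio>
using namespace MRN;

int main() {
    const char *topology = "topology.txt";   // tree layout file (assumed name)
    const char *backend  = "./tool_daemon";  // per-I/O-node daemon (assumed)

    Network *net = Network::CreateNetworkFE(topology, backend, NULL);

    // One stream over the broadcast communicator, summing upstream packets and
    // waiting for all children before forwarding.
    Communicator *comm = net->get_BroadcastCommunicator();
    Stream *stream = net->new_Stream(comm, TFILTER_SUM, SFILTER_WAITFORALL);

    int tag = FirstApplicationTag;           // first tag not reserved by MRNet
    if (stream->send(tag, "%d", 1) == -1 || stream->flush() == -1)
        return 1;

    PacketPtr pkt;
    int rtag = 0, sum = 0;
    if (stream->recv(&rtag, pkt) == -1)      // blocks until the reduced packet arrives
        return 1;
    pkt->unpack("%d", &sum);                 // sum over all back-ends
    std::printf("reduced value = %d\n", sum);

    delete net;                              // tears down the tree
    return 0;
}
```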
DynInst Modifications • Perform instrumentation of compute-node processes remotely, from BG/L’s I/O nodes • Instrument multiple processes per daemon • Interface with CIOD, the I/O node’s control and I/O daemon
DPCL Front End Modifications • Move from a process-oriented to an application-oriented view • Job launch via Launchmon • Starts the application and daemon processes • Gathers process information (e.g. PIDs, hostnames) • Application statically linked against the runtime library
DPCL Daemon Modifications • Removed super daemons • Commands processed through a DPCL filter • Ability to instrument multiple processes • A single message is de-multiplexed for all processes • Callbacks must cycle through the daemon’s ProcessD objects
Current State • DynInst not fully functional on BG/L • Can control application processes, but not instrument them • Ported to the Multiprogrammatic Capability Cluster (MCR) • 1,152 nodes with dual 2.4 GHz Pentium 4 Xeon processors • 11 Teraflops • Linux cluster • DynInst fully functional there
Performance Tests • uBG/L • 1,024 nodes (one rack) of BG/L • 8 compute nodes per I/O node • MCR • Latency test • Throughput test • DPCL performance
MRNet Latency Test • Created an MRNet Communicator containing a single compute node • Front end sends a packet to the compute node and awaits an ack • Average over 1,000 send-message/receive-ack pairs
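A sketch of the timing loop such a test implies, reusing the MRNet stream from the earlier sketch; the tag and packet format are assumptions.

```cpp
// Latency-test sketch: time 1,000 send/ack round trips to the single
// back-end over an existing MRNet stream and report the mean.
#include "mrnet/MRNet.h"
#include <sys/time.h>

static double now_sec() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

double mean_round_trip(MRN::Stream *stream, int tag, int iters = 1000) {
    double total = 0.0;
    for (int i = 0; i < iters; ++i) {
        int rtag = 0, ack = 0;
        MRN::PacketPtr pkt;
        double t0 = now_sec();
        stream->send(tag, "%d", i);      // ping to the compute node
        stream->flush();
        stream->recv(&rtag, pkt);        // block until the ack arrives
        pkt->unpack("%d", &ack);
        total += now_sec() - t0;
    }
    return total / iters;                // mean round-trip latency in seconds
}
```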
MRNet Throughput Test • Each compute node sends a fixed number of packets to the front end • 100 to 1,000 packets in increments of 100 • With and without the sum filter • Each data point is the best of at least 5 runs, to avoid system noise
MRNet Performance Conclusions • Moderate latency • Scalability • Scales well for most test cases • Some problems at extreme points • A smart DPCL tool would not place this much stress on communication • Filters very effective • For balanced tree topologies • For large numbers of nodes
MRNet DPCL Performance Tests • Tests on both uBG/L and MCR • 3 tests: • Simple DPCL command latency • Master daemon optimization results • Attach latency
Blocking Command Latency • Time to construct and send a simple command and receive all acks from the daemons • Measures the minimal overhead of sending a command
Master Daemon Optimization • Some data sent to the tool is redundant: • Executable information, e.g. module names, function names, and instrumentation points • Only one daemon needs to send this data
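A schematic of the optimization, with invented names (this is not the actual DPCL daemon code): only the designated master daemon ships the executable metadata, while every other daemon returns a bare acknowledgement.

```cpp
// Illustrative sketch of the master-daemon idea: every daemon would report
// identical executable metadata, so only the master includes it in its reply.
#include <string>
#include <vector>

struct ExecQueryReply {
    bool has_metadata;                     // false => plain acknowledgement
    std::vector<std::string> modules;      // module and function names
    std::vector<std::string> inst_points;  // instrumentation points
};

ExecQueryReply build_reply(int daemon_rank, int master_rank,
                           const std::vector<std::string> &modules,
                           const std::vector<std::string> &inst_points) {
    ExecQueryReply r;
    r.has_metadata = (daemon_rank == master_rank);
    if (r.has_metadata) {                  // only the master ships the data
        r.modules = modules;
        r.inst_points = inst_points;
    }
    return r;                              // non-masters return an empty ack
}
```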
DPCL Performance Conclusions • Scales very well • Optimization benefits most at larger node counts • Long pre-attach and attach times on uBG/L • Could be 8x worse on full BG/L (64 compute nodes per I/O node vs. 8 on uBG/L) • Room for optimization
Interface Extension: Contexts • More control over process selection than a single process or the entire application • Create MPI communicator-like “contexts” • Ability to take advantage of MRNet’s filters • A context can be specified for any DPCL command • Defaults to the “world” context
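Since the slides do not show the extended API, the following is a purely hypothetical sketch (all type and method names invented) of how a communicator-like context might be built from a subset of ranks and passed to a DPCL-style command.

```cpp
// Hypothetical stand-ins; nothing here is the real extended DPCL API.
#include <cstdio>
#include <vector>

struct ToolContext { std::vector<int> ranks; };

struct ToolApplication {
    int num_processes() const { return 1024; }                  // placeholder job size
    ToolContext create_context(const std::vector<int> &r) const { return ToolContext{r}; }
    void install_probe(const ToolContext &c, const char *expr) const {
        // A real implementation would multicast the command only to c.ranks via
        // MRNet and let filters aggregate the acks from just that subset.
        std::printf("installing \"%s\" on %zu ranks\n", expr, c.ranks.size());
    }
};

int main() {
    ToolApplication app;

    // Context over an arbitrary subset of ranks, e.g. every 16th MPI task.
    std::vector<int> subset;
    for (int r = 0; r < app.num_processes(); r += 16)
        subset.push_back(r);
    ToolContext ctx = app.create_context(subset);

    // Any command can then take a context; omitting it would default to "world".
    app.install_probe(ctx, "entry:compute_step");
    return 0;
}
```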
Interface Extension Results • Currently fully implemented and functional • Tests still needed to demonstrate its utility • Can utilize MRNet filters • Less intrusive on the application
Conclusion • More application-oriented view using MRNet • Scales well in the tests performed • Needs testing on larger machines • Contexts allow arbitrary placement of instrumentation
References • M. Schulz, D. Ahn, A. Bernat, B. R. de Supinski, S. Y. Ko, G. Lee, and B. Rountree. Scalable Dynamic Binary Instrumentation for Blue Gene/L. • L. DeRose, T. Hoover Jr., and J. K. Hollingsworth. The Dynamic Probe Class Library – An Infrastructure for Developing Instrumentation for Performance Tools. • P. Roth, D. Arnold, and B. Miller. MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools. • B. Buck and J. K. Hollingsworth. An API for Runtime Code Patching. • D. M. Pase. Dynamic Probe Class Library (DPCL): Tutorial and Reference Guide. IBM, 1998.
Questions? • Comments? • Ideas?