Scalable Dynamic Instrumentation for Bluegene/L


Presentation Transcript


  1. Scalable Dynamic Instrumentation for Bluegene/L by Gregory Lee

  2. Acknowledgements • Lawrence Livermore National Laboratory employees (the brains behind the project): • Dong Ahn, Bronis de Supinski, and Martin Schulz • Fellow Student Scholars: • Steve Ko (University of Illinois) and Barry Rountree (University of Georgia) • Advisor: • Allan Snavely

  3. Bluegene/L • 64K compute nodes, two 700 MHz processors per node • Compute nodes run a custom lightweight kernel – no multitasking, limited system calls • Dedicated I/O nodes provide additional system calls and external communication • 1024 I/O nodes with the same architecture as the compute nodes

  4. Why Dynamic Instrumentation? • Parallel applications often have long compile times and longer run times • Typical method (printf or log file): • Modify code to output intermediate steps, recompile, rerun • Static code instrumentation? • Need to restart application • Where to put instrumentation? • Dynamic instrumentation • No need to recompile or rerun application • Instrumentation can be easily added and removed

  5. DynInst • API that allows programs to insert code “snippets” into a running application • Built on the OS debugger interface (e.g., ptrace) • Machine-independent interface via abstract syntax trees • Uses trampoline code
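A minimal sketch of how a tool (the "mutator") uses the DynInst API to splice a snippet into a running process, modeled on the counter example from the DynInst documentation. The function name "compute" and the attach arguments are hypothetical, and class names follow recent DynInst releases (the 2005-era API used BPatch_thread where BPatch_process appears below).

```cpp
#include "BPatch.h"
#include "BPatch_process.h"
#include "BPatch_function.h"
#include "BPatch_point.h"
#include <vector>

BPatch bpatch;  // singleton entry point into the DynInst API

void countEntries(const char *path, int pid) {
    // Attach to the running process (the "mutatee")
    BPatch_process *proc = bpatch.processAttach(path, pid);
    BPatch_image *image = proc->getImage();

    // Locate the function to instrument and its entry point(s)
    std::vector<BPatch_function *> funcs;
    image->findFunction("compute", funcs);   // "compute" is hypothetical
    std::vector<BPatch_point *> *entry = funcs[0]->findPoint(BPatch_entry);

    // Allocate a counter in the mutatee and build the AST: counter = counter + 1
    BPatch_variableExpr *counter = proc->malloc(*image->findType("int"));
    BPatch_arithExpr incr(BPatch_assign, *counter,
        BPatch_arithExpr(BPatch_plus, *counter, BPatch_constExpr(1)));

    // insertSnippet() generates the trampoline that runs the snippet
    // before returning control to the original code
    proc->insertSnippet(incr, *entry);
    proc->continueExecution();
}
```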

  6. DynInst Trampoline Code

  7. Dynamic Probe Class Library (DPCL) • API built on DynInst • Higher level of abstraction • Instrumentation as probes, C-like expressions • Adds ability to instrument multiple processes in a parallel application • Allows data to be sent from application to tool
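For concreteness, a hedged sketch of the tool-side DPCL flow, with class and callback names taken from the DPCL tutorial cited in the references. The probe construction and installation steps are elided rather than guessed at, and the command-line arguments are illustrative.

```cpp
#include <dpcl.h>
#include <stdlib.h>

// Tool-side callback: invoked whenever an installed probe sends data
// up from the application with Ais_send()
void data_cb(GCBSysType sys, GCBTagType tag, GCBObjType obj, GCBMsgType msg) {
    // msg holds sys.msg_size bytes produced by the probe
}

int main(int argc, char *argv[]) {
    Ais_initialize();                        // initialize the DPCL client library

    Process proc(argv[1], atoi(argv[2]));    // target hostname and PID
    proc.bconnect();                         // blocking connect to the DPCL daemon

    // Elided: navigate the program's SourceObj tree to an instrumentation
    // point, build a C-like ProbeExp, then binstall_probe()/bactivate_probe()
    // with data_cb registered to receive the probe's output.

    Ais_main_loop();                         // process asynchronous callbacks
    return 0;
}
```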

  8. Original DPCL structure

  9. Scaling Issues with DPCL • Tool requires one socket each per super daemon and daemon • Strains system limits • Huge bottleneck at the tool • BG/L compute nodes can’t support daemons • No multitasking • Application sends data to its daemon via shared memory

  10. Multicast Reduction Network • Developed at the University of Wisconsin–Madison • Creates a tree-topology network between the front-end tool and the compute node application processes • Scalable multicast from the front end to the compute nodes • Upstream, downstream, and synchronization filters
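A front-end sketch of the MRNet pattern the slides describe: multicast down a stream to all back-ends and let an upstream filter reduce the replies inside the tree. API names follow current MRNet releases (the 2005-era front-end constructor differed); the topology file name and back-end executable are assumptions.

```cpp
#include "mrnet/MRNet.h"
using namespace MRN;

int main(int argc, char **argv) {
    // Instantiate the tree from a topology file; MRNet spawns the internal
    // processes and launches the back-end executable on the leaves
    Network *net = Network::CreateNetworkFE("topology.cfg", "tool_backend",
                                            (const char **)argv);

    // One stream over all back-ends, with an upstream summation filter
    Communicator *comm = net->get_BroadcastCommunicator();
    Stream *stream = net->new_Stream(comm, TFILTER_SUM, SFILTER_WAITFORALL);

    int tag = FirstApplicationTag;   // hypothetical protocol tag
    stream->send(tag, "%d", 1);      // multicast down to every back-end
    stream->flush();

    // Each back-end replies with one integer; the sum filter folds the
    // replies together inside the tree, so a single packet reaches us
    PacketPtr pkt;
    stream->recv(&tag, pkt);
    int sum = 0;
    pkt->unpack("%d", &sum);         // sum over all back-ends

    delete net;                      // tears down the tree
    return 0;
}
```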

  11. Typical Tool Topology

  12. Tool Topology with MRNet

  13. DynInst Modifications • Perform instrumentation of compute node processes remotely, from BG/L’s I/O nodes • Instrument multiple processes per daemon • Interface with CIOD, BG/L’s control and I/O daemon

  14. DPCL Front End Modifications • Move from a process-oriented to an application-oriented view • Job launch via LaunchMON • Starts the application and daemon processes • Gathers process information (e.g., PIDs, hostnames) • Statically link the application to the runtime library

  15. DPCL Daemon Modifications • Removed super daemons • Commands processed through a DPCL filter • Ability to instrument multiple processes • A single message is de-multiplexed for all processes, as sketched below • Callbacks must cycle through the daemon’s ProcessD objects
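A hypothetical sketch of that de-multiplexing step (ProcessD is named on the slide; everything else is illustrative, not the actual daemon code): one command packet arrives at the daemon and is fanned out to every local process it manages.

```cpp
#include <vector>

struct Command { /* a decoded DPCL command; fields illustrative */ };

class ProcessD {   // one per application process managed by this daemon
public:
    void execute(const Command &cmd) {
        // apply the command to this process: install/remove probes, etc.
    }
};

// One packet from the MRNet stream, applied to all local processes; the
// per-process acknowledgements are then aggregated on the way back up
void handle_command(const Command &cmd, std::vector<ProcessD *> &procs) {
    for (ProcessD *p : procs)
        p->execute(cmd);
}
```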

  16. MRNet-based DPCL on BG/L

  17. Current State • DynInst not fully functional on BG/L • Can control application processes, but not yet instrument them • Ported to the Multiprogrammatic Capability Cluster (MCR) • 1,152 nodes with dual 2.4 GHz Pentium 4 Xeon processors • 11 teraflops • Linux cluster • Fully functional DynInst

  18. Performance Tests • uBG/L • 1,024 nodes (one rack) of BG/L • 8 compute nodes per I/O node • MCR • Latency test • Throughput test • DPCL performance

  19. MRNet Latency Test • Created an MRNet communicator with a single compute node • The front end sends a packet to the compute node and awaits an ack • Average over 1,000 send-message/receive-ack pairs, as sketched below
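A hedged sketch of that measurement loop, assuming an MRNet stream bound to exactly one back-end; the timing helper and the use of FirstApplicationTag as a ping tag are illustrative, not the authors' test code.

```cpp
#include "mrnet/MRNet.h"
#include <sys/time.h>

static double now() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// 'stream' spans a single compute node; returns the mean round-trip time
double avg_latency(MRN::Stream *stream, int iters = 1000) {
    const int ping_tag = MRN::FirstApplicationTag;  // hypothetical tag
    int rtag;
    MRN::PacketPtr ack;
    double start = now();
    for (int i = 0; i < iters; ++i) {
        stream->send(ping_tag, "%d", i);   // one packet down ...
        stream->flush();
        stream->recv(&rtag, ack);          // ... one ack back
    }
    return (now() - start) / iters;
}
```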

  20. MRNet Latency Results

  21. MRNet Throughput Test • Each compute node sends a fixed number of packets to the front end • 100 to 1,000 packets, in 100-packet increments • With and without a sum filter • Each data point represents the best of at least 5 runs, to avoid system noise

  22. MCR – one process per CN MRNet topology

  23. MCR MRNet One Proc Throughput

  24. MCR – One Proc Filter Speedup

  25. MCR – two processes per CN MRNet topology

  26. MCR MRNet Two Procs Throughput

  27. MCR – Two Proc Filter Speedup

  28. uBG/L MRNet topology

  29. uBG/L MRNet Throughput Results

  30. uBG/L Filter Speedups

  31. MRNet Performance Conclusions • Moderate latency • Scalability • Scales well for most test cases • Some problems at extreme points • A smart DPCL tool would not place this much stress on communication • Filters very effective • For balanced tree topologies • For large numbers of nodes

  32. MRNet DPCL Performance Tests • Tests on both uBG/L and MCR • 3 tests: • Simple DPCL command latency • Master daemon optimization results • Attach latency

  33. Blocking Command Latency • Time to construct and send a simple command and receive all acks from the daemons • Measures the minimal overhead of sending a command

  34. Blocking Command Latency - MCR

  35. Blocking Command Latency – uBG/L

  36. Master Daemon Optimization • Some data sent to the tool is redundant: • Executable information, e.g., module names, function names, and instrumentation points • Only one daemon needs to send this data, as sketched below
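A hypothetical sketch of the optimization (names and the empty-ack convention are illustrative): only a designated master daemon ships the executable metadata, while the others reply with an empty ack so the front end still hears from every daemon.

```cpp
#include "mrnet/MRNet.h"
#include <string>

// Called on each daemon when the front end requests executable information
void send_module_info(MRN::Stream *stream, bool is_master,
                      const std::string &serialized_info) {
    const int tag = MRN::FirstApplicationTag + 1;   // hypothetical tag
    if (is_master)
        stream->send(tag, "%s", serialized_info.c_str());  // full metadata, once
    else
        stream->send(tag, "%s", "");   // empty ack; redundant copy avoided
    stream->flush();
}
```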

  37. Optimization Results - MCR

  38. Optimization Speedup - MCR

  39. Optimization Results – uBG/L

  40. Optimization Speedup – uBG/L

  41. Attach Latency (Optimized) – MCR

  42. Attach Latency (Optimized) –uBG/L

  43. DPCL Performance Conclusion • Scales very well • Optimization benefits most at larger numbers of nodes • uBG/L shows long pre-attach and attach times • Could be 8x worse on full BG/L, which has 64 compute nodes per I/O node versus uBG/L’s 8 • Room for optimization

  44. Interface Extension: Contexts • More control over process selection • vs. a single process or the entire application • Create MPI-communicator-like “contexts” • Ability to take advantage of MRNet’s filters • Can specify a context for any DPCL command • Defaults to a “world” context
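A hypothetical sketch of what such a context could look like (this is not the actual extension's interface): a context wraps an MRNet communicator over a subset of process ranks, so DPCL commands issued on it are multicast only to that subset and can use that subset's filters.

```cpp
#include "mrnet/MRNet.h"
#include <set>
using namespace MRN;

// Illustrative only: an MPI-communicator-like grouping of processes
class Context {
    Communicator *comm;   // the subset of back-ends in this context
    Stream *stream;       // stream (and filters) bound to that subset
public:
    Context(Network *net, std::set<Rank> ranks)
        : comm(net->new_Communicator(ranks)),
          stream(net->new_Stream(comm, TFILTER_SUM, SFILTER_WAITFORALL)) {}

    // Any DPCL command can be issued on a context; a "world" context
    // would simply contain every application process
    void send_command(int tag, int arg) {
        stream->send(tag, "%d", arg);
        stream->flush();
    }
};
```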

  45. Interface Extension Results • Fully implemented and functional • Tests still needed to demonstrate its utility • Can utilize MRNet filters • Less intrusive on the application

  46. Conclusion • More application-oriented view using MRNet • Scales well under the tests performed • Needs testing on larger machines • Contexts allow arbitrary placement of instrumentation

  47. References • M. Schulz, D. Ahn, A. Bernat, B. R. de Supinski, S. Y. Ko, G. Lee, and B. Rountree. Scalable Dynamic Binary Instrumentation for Blue Gene/L. • L. DeRose, T. Hoover Jr., and J. K. Hollingsworth. The Dynamic Probe Class Library – An Infrastructure for Developing Instrumentation for Performance Tools. • P. Roth, D. Arnold, and B. Miller. MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools. • B. Buck and J. K. Hollingsworth. An API for Runtime Code Patching. • D. M. Pase. Dynamic Probe Class Library (DPCL): Tutorial and Reference Guide. IBM, 1998.

  48. Questions? • Comments? • Ideas?
