PAPI Directions
Dan Terpstra
Innovative Computing Lab
University of Tennessee
PAPI Directions: Overview
• What's PAPI?
• What's New?
  • Features
  • Platforms
• What's Next?
  • Network PAPI
  • Thermal PAPI
• When?
  • PAPI release roadmap
• What's ICL? (a word from our sponsor)
What's PAPI?
• A software layer (library) that gives tool developers and application engineers a consistent interface and methodology for using the performance counter hardware found in most major microprocessors.
• Countable events are defined in two ways:
  • Platform-neutral Preset Events
  • Platform-dependent Native Events
• Preset Events can be derived from multiple Native Events
• All events are referenced by name and collected in EventSets for counting (see the sketch below)
• Events can be multiplexed if counters are limited
• Statistical sampling is implemented by:
  • Software overflow with timer-driven sampling
  • Hardware overflow if supported by the platform
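For concreteness, a minimal sketch of that flow using the standard low-level C API: initialize the library, build an EventSet from preset events, and count around a region of code. Error handling is abbreviated, and PAPI_FP_OPS stands in for any preset of interest.

    #include <stdio.h>
    #include "papi.h"

    int main(void)
    {
        int EventSet = PAPI_NULL;
        long long values[2];

        /* Initialize the library, checking version compatibility. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;

        /* Collect events into an EventSet; presets are platform-neutral. */
        PAPI_create_eventset(&EventSet);
        PAPI_add_event(EventSet, PAPI_TOT_CYC);
        PAPI_add_event(EventSet, PAPI_FP_OPS);

        PAPI_start(EventSet);
        /* ... region of code to be measured ... */
        PAPI_stop(EventSet, values);   /* counters read into values[] */

        printf("cycles: %lld  fp ops: %lld\n", values[0], values[1]);
        return 0;
    }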
Where's PAPI?
• PAPI runs on most modern processors and operating systems of interest to HPC:
  • IBM POWER3, 4, 5 / AIX
  • POWER4, 5 / Linux
  • PowerPC-32 and -64 / Linux
  • Blue Gene
  • Intel Pentium II, III, 4, M, EM64T, etc. / Linux
  • Intel Itanium
  • AMD Athlon, Opteron / Linux
  • Cray T3E, X1, XD1, XT3 Catamount
  • Altix, Sparc, …
• NOTE: All Linux implementations require the perfctr kernel patch, except Itanium, which uses the built-in perfmon interface.
• Perfmon2 development is underway to replace perfctr and be pre-installed in the kernel: NO PATCHES NEEDED!
Extending PAPI beyond the CPU
• PAPI has historically targeted on-processor performance counters
• Several categories of off-processor counters exist:
  • Network interfaces: Myrinet, Infiniband, GigE
  • Memory interfaces: Cray X1
  • Thermal and power interfaces: ACPI
• CHALLENGE:
  • Extend the PAPI interface to address multiple counter domains
  • Preserve the PAPI calling semantics, ease of use, and platform independence for existing applications
Multi-Substrate PAPI
• Goals:
  • Isolate hardware-dependent code in a separable 'substrate' module
  • Extend platform-independent code to support multiple simultaneous substrates
  • Add or modify API calls to support access to any of several substrates
  • Modify the build environment for easy selection and configuration of multiple available substrates
PAPI 3.0 Design
[Layer diagram: the PAPI High Level and Low Level APIs sit atop a portable, hardware-independent layer; below it, the PAPI Machine-Dependent Substrate (machine-specific layer) talks through a Kernel Extension to the Operating System and the Hardware Performance Counters.]
PAPI 4.0 Multiple Substrate Design
[Layer diagram: the same portable, hardware-independent High Level / Low Level layer now sits atop several PAPI Machine-Dependent Substrates at once, each with its own machine-specific layer, kernel extension, and operating system path, reaching both on-processor Hardware Performance Counters and Off-Processor Hardware Counters.]
API Changes
• 3 calls augmented with a substrate index (see the sketch below)
• Old syntax preserved in wrapper functions for backward compatibility
• Modified entry points:
  • PAPI_create_eventset → PAPI_create_sbstr_eventset
  • PAPI_get_opt → PAPI_get_sbstr_opt
  • PAPI_num_hwctrs → PAPI_num_sbstr_hwctrs
• New entry points for new functionality:
  • PAPI_num_substrates
  • PAPI_get_sbstr_info
• Old code can run with no source modifications
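A hedged sketch of how the renamed calls might look in use. The entry-point names come from the slide above, but the exact signatures (a substrate index as an extra argument) are an assumption, not the published PAPI 4.0 header:

    #include <stdio.h>
    #include "papi.h"

    /* ASSUMPTION: the substrate index is passed as an extra argument;
       check the released papi.h for the actual signatures. */
    void list_substrates(void)
    {
        int nsub = PAPI_num_substrates();
        for (int s = 0; s < nsub; s++) {
            int EventSet = PAPI_NULL;
            PAPI_create_sbstr_eventset(&EventSet, s);   /* bind EventSet to substrate s */
            printf("substrate %d: %d counters\n", s, PAPI_num_sbstr_hwctrs(s));
        }

        /* Old-style call still works: the backward-compatibility wrapper
           presumably defaults to the CPU substrate. */
        int CpuSet = PAPI_NULL;
        PAPI_create_eventset(&CpuSet);
    }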
PAPI 4.0 Status
• Multi-substrate development complete
  • Some CPU platforms not yet ported
• Substrates available for:
  • ACPI (Advanced Configuration and Power Interface)
  • Myrinet MX
• Substrates under development for:
  • Infiniband
  • GigE
• Friendly-user release available now for CVS checkout
Myrinet MX Counters
[Two slides: tables of the Myrinet MX counters exposed to PAPI]
Multiple Measurements
• The HPCC HPL benchmark with 3 performance metrics: FLOPS, temperature, and network sends/receives (measurement structure sketched below)
• [Per-node plots of the three metrics: Node 7 and Node 3]
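Conceptually, each metric comes from its own EventSet on its own substrate, all counting concurrently. A hedged sketch of that structure; "ACPI_TEMPERATURE" and "MX_SENDS" are hypothetical placeholder names, not actual native event names shipped with the substrates:

    /* HYPOTHETICAL event names below; query each substrate for the real ones. */
    int code;
    int FlopSet = PAPI_NULL, TempSet = PAPI_NULL, NetSet = PAPI_NULL;
    long long flops, temp, sends;

    PAPI_create_eventset(&FlopSet);                       /* CPU substrate */
    PAPI_add_event(FlopSet, PAPI_FP_OPS);

    PAPI_create_eventset(&TempSet);                       /* ACPI substrate */
    PAPI_event_name_to_code("ACPI_TEMPERATURE", &code);   /* hypothetical */
    PAPI_add_event(TempSet, code);

    PAPI_create_eventset(&NetSet);                        /* Myrinet MX substrate */
    PAPI_event_name_to_code("MX_SENDS", &code);           /* hypothetical */
    PAPI_add_event(NetSet, code);

    PAPI_start(FlopSet); PAPI_start(TempSet); PAPI_start(NetSet);
    /* ... one HPL timestep ... */
    PAPI_stop(FlopSet, &flops); PAPI_stop(TempSet, &temp); PAPI_stop(NetSet, &sends);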
Data Structure Addressing
• Goal: measure events related to specific data addresses (structures)
• Availability:
  • Itanium: 160 of 475 native events
  • Rumored on POWER4; POWER5?
• PAPI example:

    ...
    /* Request that counting be restricted to the address range of 'array'. */
    opt.addr.eventset = EventSet;
    opt.addr.start = (caddr_t)array;
    opt.addr.end = (caddr_t)(array + size_array);
    retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &opt);
    /* The hardware may grant a wider range than requested; the returned
       offsets give the bounds actually in effect. */
    actual.start = (caddr_t)array - opt.addr.start_off;
    actual.end = (caddr_t)(array + size_array) + opt.addr.end_off;
    ...
Rensselaer to Build and House $100 Million Supercomputer
NY Times, May 11, 2006
"Rensselaer Polytechnic Institute announced yesterday that it was combining forces with New York State and I.B.M. to build a $100 million supercomputer that will be among the 10 most powerful in the world. The computer, a type of I.B.M. system known as Blue Gene, will be on Rensselaer's campus in Troy, N.Y., and will have the power to perform more than 70 trillion calculations per second. It will mainly be used to help researchers make smaller, faster semiconductor devices and for nanotechnology research."
PAPI and BG/L
[Diagram: two CPU cores, each with 2 FPU PMCs, plus a UPC module with 48 shared counters]
• Performance counters:
  • 48 UPC counters
    • Shared by both CPUs
    • External to the CPU cores
    • 32 bits :(
  • 2 counters on each FPU
    • 1 counts loads/stores
    • 1 counts arithmetic operations
• Accessed via bgl_perfctr
PAPI and BG/L (2): Versions
• PAPI 2.3.4
  • Original release
  • Poor native event support
• PAPI 3.2.2 beta
  • Currently being beta tested
  • Full access to native events by name
• Limitations
  • Only events exposed by bgl_perfctr
  • No control over native event edges
  • Still no overflow/profile support
    • Is there a timer available?
  • No configure script (cross-compilation)
  • No scripted acceptance test suite (multiple queuing systems)
PAPI and BG/L (3): Presets

Test case avail.c: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : (1312)
Model string and code    : PVR=0x5202:0x1891 Serial=R00-M0-N1-C:J16-U01 (1375869073)
CPU Revision             : 20994.062500
CPU Megahertz            : 700.000000
CPU's in this Node       : 1
Nodes in this System     : 32
Total CPU's              : 32
Number Hardware Counters : 52
Max Multiplex Counters   : 32
-------------------------------------------------------------------------
Name              Derived  Description (Mgr. Note)
PAPI_L3_TCM       No       Level 3 cache misses ()
PAPI_L3_LDM       Yes      Level 3 load misses ()
PAPI_L3_STM       No       Level 3 store misses ()
PAPI_FMA_INS      No       FMA instructions completed ()
PAPI_TOT_CYC      No       Total cycles ()
PAPI_L2_DCH       Yes      Level 2 data cache hits ()
PAPI_L2_DCA       Yes      Level 2 data cache accesses ()
PAPI_L3_TCH       No       Level 3 total cache hits ()
PAPI_FML_INS      No       Floating point multiply instructions ()
PAPI_FAD_INS      No       Floating point add instructions ()
PAPI_BGL_OED      No       BGL special event: Oedipus operations ()
PAPI_BGL_TS_32B   Yes      BGL special event: Torus 32B chunks sent ()
PAPI_BGL_TS_FULL  Yes      BGL special event: Torus no token UPC cycles ()
PAPI_BGL_TR_DPKT  Yes      BGL special event: Tree 256 byte packets ()
PAPI_BGL_TR_FULL  Yes      BGL special event: UPC cycles (CLOCKx2) tree rcv is full ()
-------------------------------------------------------------------------
avail.c PASSED
PAPI and BG/L (4): Native Events
• 328 native events available (usage sketched below)
• Only events exposed by bgl_perfctr
  • 4 arithmetic events per FPU
  • 4 load/store events per FPU
  • 312 UPC events

BGL_FPU_ARITH_ADD_SUBTRACT 0x40000000
  |Add and subtract, fadd, fadds, fsub, fsubs (Book E add, substract)|
BGL_FPU_ARITH_MULT_DIV 0x40000001
  |Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)|
BGL_FPU_ARITH_OEDIPUS_OP 0x40000002
  |Oedipus operations, All symmetric, asymmetric, and complex Oedipus multiply-add instructions|
...
BGL_UPC_TS_ZP_VCD0_CHUNKS 0x40000145 |ZP vcd0 chunks|
BGL_UPC_TS_ZP_VCD1_CHUNKS 0x40000146 |ZP vcd1 chunks|
BGL_PAPI_TIMEBASE 0x40000148 |special event for getting the timebase reg|
-------------------------------------------------------------------------
Total events reported: 328
native_avail.c PASSED
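Native events are added to an EventSet by name like any preset; a minimal sketch counting the FPU add/subtract event listed above (error handling omitted):

    int code, EventSet = PAPI_NULL;
    long long adds;

    /* Translate the name reported by native_avail into an event code. */
    PAPI_event_name_to_code("BGL_FPU_ARITH_ADD_SUBTRACT", &code);
    PAPI_create_eventset(&EventSet);
    PAPI_add_event(EventSet, code);
    PAPI_start(EventSet);
    /* ... FPU-heavy kernel ... */
    PAPI_stop(EventSet, &adds);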
XT3 and Catamount
The Oak Ridger, February 21, 2006
"The Cray XT3 Jaguar, the flagship computing system in ORNL's Leadership Computing Facility, was ranked tenth in the world in a November 2005 survey of supercomputers, delivering 20.5 trillion operations per second (teraflops)."
PAPI and Catamount
• Opteron-based
• Catamount OS similar to CNK
• Driven by the Sandia-Cray version of perfctr
• No overflow / profiling
• Configure works because compile node == compute node
• Test-suite script works because there's only one queuing system
…and of course, Cell
• The PlayStation 3's CPU is based on a chip codenamed "Cell"
• Each Cell contains 8 APUs
  • An APU is a self-contained vector processor acting independently of the others
  • 4 floating-point units capable of a total of 32 Gflop/s (8 Gflop/s each)
• 256 Gflop/s peak! 32-bit floating point; 64-bit floating point at 25 Gflop/s
• But what about the performance counters!
When? PAPI Release Schedule
• PAPI 3.3.0: RealSoonNow™
  • BG/L in beta testing
  • Merging and deprecating PAPI 3.0.8.1
  • Regression testing on other platforms
• PAPI 4.0: Q2, 2006
  • Porting some substrates to the multi-substrate model
  • Developing additional non-CPU substrates
• Wanna help? Distributed testing…
Distributed Testing
• Problem: how do you develop, test, and verify on multiple systems with multiple OSes at multiple sites, automatically, transparently, and repetitively?
• Candidate tools:
  • Dart / CTest
  • Mozilla Tinderbox
  • DejaGnu
  • Homegrown
  • Others?
A Word from our Sponsor… Innovative Computing Laboratory
• Jack's research group in the CS Department
• Size: about 45-50 people
  • 16 students; 19 scientific staff; 10 support staff; 1 visitor
• Funding:
  • NSF: Supercomputer Centers (UCSD & NCSA), Next Generation Software (NGS), Info Tech Res. (ITR), Middleware Init. (NMI)
  • DOE: Scientific Discovery through Advanced Computing (SciDAC), Math in Comp Sci (MICS)
  • DARPA: High Productivity Computing Systems
  • DOD Modernization
• Work with companies: AMD, Cray, Dolphin, Microsoft, MathWorks, Intel, Sun, Myricom, SGI, HP, IBM, Northrop-Grumman
• PhD dissertations, MS projects
• Equipment: a number of clusters, desktop machines, office setup
• Summer internships: industry, ORNL, …
• Travel to meetings
• Participate in publications
ICL Class of 2005
[Group photo]
Speculative Performance Positions
• PostDoc positions probably available:
  • PAPI
    • New platforms (Cell?)
    • New substrates (Infiniband?)
  • KOJAK
    • Automated performance analysis
    • ECLIPSE PTP & TAU integration
• See me for brochures or more info