PAPI 3.0.8.1 on Blue Gene L

Using network performance counters to lay out tasks for improved performance

Presentation overview • Project objectives • PAPI explanation • Blue Gene L explanation • Current state of research

Project objectives • Upgrade PAPI on BG/L • Provide an interface for network counters • Give Lawrence Livermore National Lab users access to PAPI • Use network counters to place tasks optimally on BG/L

PAPI – Intro • [Figure courtesy of http://icl.cs.utk.edu/papi/]

PAPI – Intro • PAPI is useful for profiling your own programs • Many tools are based on PAPI: • PapiEx – command-line measurement tool • PerfSuite – aggregate measurement and statistical profiling package and API • HPCToolkit – statistical profiling package • Many more!

PAPI – Supported platforms • IBM – POWER3, 604, 604e, POWER4 • Cray T3E, Cray X1 • AMD – Athlon, Opteron • Intel – P1 to P4, Itanium I and II • UltraSparc I, II & III • MIPS R10K, R12K, R14K • Alpha

PAPI – Generic Interface • Call sequence for generic interface • PAPI_library_init – Initialize memory for PAPI’s data structures • PAPI_create_eventset – Create an empty list of events • PAPI_add_event – Add events to be counted • PAPI_start – Begin counting all events within the specified eventset • PAPI_stop – Stop all counters and read their current values

PAPI – Events: Presets • Presets – list of predefined events implemented on all systems where they can be supported • Not all presets available on every architecture (e.g. BG/L has no cache lower than L3 – thus L1 cache hit preset not applicable) • Native events form the basic building blocks for PAPI presets

PAPI – Events: Presets • [Figure courtesy of http://icl.cs.utk.edu/papi/]

PAPI – Events: Native • In addition to the predefined PAPI preset events, the PAPI library also exposes a majority of the events native to each platform • Can be added to eventsets in the same manner as presets

PAPI – Internals • The central data structure is an array of eventsets

PAPI – Other features • Multiplexing – used when there are more events to count than hardware counters • Thread safety – profiling is thread safe • Overflow detection – hardware counters have limited width
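
The multiplexing bullet above can be illustrated with a small sketch. It shows only the general idea of time-sharing a counter between events and scaling each observed count by the fraction of time the event was actually counted. It is not PAPI's implementation, and the function name, event names, and numbers are hypothetical.

```python
def estimate_multiplexed_counts(observed, active_time, total_time):
    """Estimate full-run counts for events that shared hardware counters.

    observed    -- event name -> count seen while the event was scheduled
    active_time -- event name -> time the event was actually being counted
    total_time  -- length of the whole measured interval (same unit)

    Assumption: the event rate is roughly uniform over the run, so the
    count scales linearly with the counted time.
    """
    estimates = {}
    for event, count in observed.items():
        active = active_time[event]
        if active <= 0:
            # The event was never scheduled, so nothing can be estimated.
            estimates[event] = 0.0
            continue
        estimates[event] = count * total_time / active
    return estimates


# Hypothetical events: A was counted for half the run, B for a quarter.
observed = {"EVENT_A": 1000, "EVENT_B": 300}
active_time = {"EVENT_A": 0.5, "EVENT_B": 0.25}
estimates = estimate_multiplexed_counts(observed, active_time, total_time=1.0)
```

The estimates are only as good as the uniform-rate assumption, which is why multiplexed counts are usually treated as approximate.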

PAPI – PAPI2 vs PAPI3 • PAPI3 significantly reduced the overhead of starting, stopping and reading the counters • [Figure courtesy of http://icl.cs.utk.edu/papi/]

PAPI – PAPI2 vs PAPI3 • Better native event support in PAPI3 • Better thread support in PAPI3 • Overflow and Profiling enhancements in PAPI3 • Myriad bug fixes and code cleanup in PAPI3

PAPI – PAPI2 vs PAPI3 • Overlapping eventsets supported in PAPI2 • Minor changes in the API – mostly dereferencing variables

Blue Gene L – Intro • 65,536 nodes connected in a 64 x 32 x 32 3D torus • Nodes are built from PowerPC 440 embedded processors • Smaller than most supercomputers • Consumes less power

Blue Gene L - Networks • 3D torus network (node to node) • Tree network (broadcasts)

Blue Gene L – HW counters • 48 universal performance counters • 4 floating point unit counters • Counters are 32-bit – virtual counters must be used to prevent overflow
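
One common way to build such a virtual counter is to keep a wider software total and add the wrap-corrected difference between successive raw readings. The sketch below illustrates that technique under the assumption that the raw counter is sampled at least once per 2^32 events. The class name and sample values are hypothetical, and the slides do not say how BG/L or PAPI implements virtual counters.

```python
WRAP = 1 << 32  # a 32-bit hardware counter wraps back to 0 after 2**32 - 1


class VirtualCounter:
    """Software counter that extends a wrapping 32-bit raw counter."""

    def __init__(self, initial_raw=0):
        self.last_raw = initial_raw
        self.total = 0  # Python integers do not overflow

    def update(self, raw):
        """Fold a new raw reading into the running total."""
        # Modular subtraction gives the correct delta even if the raw
        # counter wrapped once since the previous reading.
        delta = (raw - self.last_raw) % WRAP
        self.total += delta
        self.last_raw = raw
        return self.total


# Hypothetical readings: the second one arrives after the counter wrapped.
vc = VirtualCounter(initial_raw=4_294_967_000)
vc.update(4_294_967_290)  # 290 events since the initial reading
vc.update(10)             # wrapped: 6 events up to the wrap, then 10 more
```

If the counter can wrap more than once between readings, the deltas are undercounted, so the sampling interval has to be chosen with the maximum event rate in mind.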

Research – Overall goals • Network hardware counters are new • Use network counters to determine traffic between tasks • Try to optimize placement of tasks to minimize communication latency • Given message counts and distances between nodes: cost = counts * distance; minimize the total over all nodes
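
The cost function above can be sketched as follows. The sketch assumes that "distance" means the hop count on the 3D torus with wraparound links in each dimension, and that "counts" is the number of messages exchanged by a pair of tasks. The slides state neither detail, and the function names and sample data are hypothetical.

```python
DIMS = (64, 32, 32)  # torus dimensions given on the Blue Gene L intro slide


def torus_distance(a, b, dims=DIMS):
    """Hop count between nodes a and b, using wraparound in each dimension."""
    hops = 0
    for x, y, size in zip(a, b, dims):
        d = abs(x - y)
        hops += min(d, size - d)  # go whichever way around the ring is shorter
    return hops


def layout_cost(counts, placement, dims=DIMS):
    """Sum of counts * distance over all communicating task pairs.

    counts    -- (task, task) -> number of messages between the pair
    placement -- task -> (x, y, z) coordinates of the node hosting it
    """
    return sum(
        n * torus_distance(placement[src], placement[dst], dims)
        for (src, dst), n in counts.items()
    )


# Hypothetical traffic and placement for three tasks.
counts = {(0, 1): 100, (1, 2): 10, (0, 2): 1}
placement = {0: (0, 0, 0), 1: (63, 0, 0), 2: (0, 16, 0)}
cost = layout_cost(counts, placement)
```

Tasks 0 and 1 sit at opposite ends of the x axis but are one hop apart because of the wraparound link, so their heavy traffic contributes little to the total.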

Research – Counting • First goal: determine what is being counted

Research – Networks • For each MPI call – determine which network counters are being used • Tree is supposed to be for broadcasts • Torus is supposed to be for point to point communication • Ambiguities in the specification

Research – Future decisions • How to profile a target application • Manually insert PAPI instrumentation: a lot of work • Instrument binaries with counting code • What information to store • All counts on each node: a lot of data • Sample of all nodes: not as accurate (what if the tasks behave / communicate differently?)

Research – Future decisions • How to use collected information • Profile an application to obtain counter feedback to determine optimized static task layout • Dynamically migrate tasks in response to counters
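
For the static-layout option, one simple way to turn profiled counts into a placement is a local search that swaps pairs of tasks whenever the swap lowers the total counts * distance cost. The sketch below illustrates that generic heuristic. It is not the method proposed in the slides, it assumes hop distance on a torus with wraparound, and all names and data are hypothetical.

```python
from itertools import combinations


def torus_distance(a, b, dims):
    """Hop count between two nodes on a torus with wraparound links."""
    return sum(min(abs(x - y), s - abs(x - y)) for x, y, s in zip(a, b, dims))


def layout_cost(counts, placement, dims):
    """Total of counts * distance over all communicating task pairs."""
    return sum(
        n * torus_distance(placement[s], placement[d], dims)
        for (s, d), n in counts.items()
    )


def improve_by_swaps(counts, placement, dims):
    """Swap task pairs greedily until no single swap lowers the cost."""
    placement = dict(placement)  # leave the caller's mapping untouched
    best = layout_cost(counts, placement, dims)
    improved = True
    while improved:
        improved = False
        for t1, t2 in combinations(sorted(placement), 2):
            placement[t1], placement[t2] = placement[t2], placement[t1]
            cost = layout_cost(counts, placement, dims)
            if cost < best:
                best = cost
                improved = True
            else:
                # The swap did not help, so undo it.
                placement[t1], placement[t2] = placement[t2], placement[t1]
    return placement, best


# Hypothetical profile on a tiny 8 x 1 x 1 torus (a ring of 8 nodes).
dims = (8, 1, 1)
counts = {(0, 1): 100, (2, 3): 100, (0, 2): 1}
start = {0: (0, 0, 0), 1: (4, 0, 0), 2: (1, 0, 0), 3: (5, 0, 0)}
initial_cost = layout_cost(counts, start, dims)
final_placement, final_cost = improve_by_swaps(counts, start, dims)
```

The search only rearranges tasks among the nodes already in use and can stop at a local minimum, so it is a baseline rather than an optimal placement algorithm.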
