Application Performance Analysis and Modeling An Overview of Recent SNL Activities August 20, 2007 Sudip Dosanjh Computer and Software Systems Sandia National Laboratories sudip@sandia.gov Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Goals • Improve performance and scalability • Better utilization of an existing system • Determine which platform a given simulation should run on • Better utilization of a set of machines • Guide purchase of capacity and capability systems • Help design future supercomputers • Cost/benefit analysis • Focus R&D efforts
Science and Engineering Apps • Continuum • Computational fluid dynamics • Shock physics (CTH) • Arbitrary Lagrangian Eulerian (Alegra) • Structural mechanics • Combustion • Device simulations • E&M • Radiation • Enclosure radiation • DAE • Circuit Modeling • Particles • Molecular dynamics (LAMMPS) • Particle-in-cell
Informatics is an Emerging App Image source: T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis," CACM 47(3), March 2004, pp. 45-47
Red Storm • Before upgrade: 10,880 2.0 GHz single-core AMD Opteron CPUs, 43.52 TF/s peak, SeaStar 1.2, 2-4 GB per socket, #9 on June 2006 Top 500 list, Catamount LWK • After upgrade: 13,600 2.4 GHz dual-core AMD Opteron CPUs, 130.56 TF/s peak, SeaStar 2.1 network (doubled NIC bandwidth), 2-4 GB per socket, #3 on current Top 500 list, Catamount LWK with virtual node mode support
Upgrade Performance: Single-Core vs. Dual-Core • CTH - shaped charge: constant work/core; speedup of at least 1.4 out to 2048 sockets • Sage - timing_c: constant work/core; at scale, speedup is at least a factor of 1.6
Sandia Study: 2004 • Where should codes run? Capability or capacity? • Looked at the relative scaling efficiency between Red Storm and Cplant and then correlated efficiency to cost • Answer: capacity below 256 nodes, capability above 256 nodes • (Figure: chart regions labeled "Cluster more cost effective" and "MPP more cost effective," separated where efficiency ratio = cost ratio = 1.8; curves for Peak, Linpack, and the scientific & engineering codes Presto, CTH, and DSMC.)
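As an illustration of that decision rule, the sketch below compares a code's scaling-efficiency ratio between the two platforms against their cost ratio; the efficiency numbers in main() are invented placeholders, not figures from the 2004 study.

```cpp
#include <iostream>

// Illustrative only: decide whether a job is cheaper to run on the capability
// machine (MPP) or the capacity cluster.  The efficiency values and the cost
// ratio below are placeholders, not data from the Sandia study.
bool run_on_capability(double mpp_efficiency, double cluster_efficiency,
                       double cost_ratio /* capability cost per node / capacity cost per node */) {
    // The capability system wins when its efficiency advantage outweighs its
    // higher per-node cost.
    return (mpp_efficiency / cluster_efficiency) > cost_ratio;
}

int main() {
    // Hypothetical scaling efficiencies for one job size.
    double mpp = 0.85, cluster = 0.40, cost_ratio = 1.8;
    std::cout << (run_on_capability(mpp, cluster, cost_ratio)
                      ? "capability (MPP)" : "capacity (cluster)")
              << " is more cost effective\n";
    return 0;
}
```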
Red Storm & Thunderbird Performance Comparison (i.e., capability vs. capacity) • Turnover job size = 256
CTH Model with Load Imbalance • T = E(κ,φ)·N³ + C·(λ + τ·k·N²) + S·(γ·log(P)) + L_imbal • T is the execution time per time step • N is the size of an edge of a processor's subdomain • C and S are the numbers of exchanges and collectives • P is the number of processors • k is the number of variables in an exchange • λ and τ are the latency and transfer cost • γ is the cost of one stage of a collective • E(κ,φ) is the calculation time per cell • L_imbal represents the effects of load imbalance
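Written as code, the model is a direct transcription of the formula, which is convenient when fitting the parameters to measured timings. The sketch below takes every quantity as an input; no parameter values from the study are assumed.

```cpp
#include <cmath>

// Per-timestep execution time for the CTH load-imbalance model:
//   T = E(kappa,phi)*N^3 + C*(lambda + tau*k*N^2) + S*(gamma*log(P)) + L_imbal
// All inputs are measured or fitted quantities; nothing is hard-coded because
// the values depend on the machine and the problem.
double cth_timestep_time(double E,        // calculation time per cell, E(kappa, phi)
                         double N,        // edge length of a processor's subdomain
                         double C,        // number of boundary exchanges
                         double S,        // number of collectives
                         double P,        // number of processors
                         double k,        // variables per exchange
                         double lambda,   // message latency
                         double tau,      // per-unit transfer cost
                         double gamma,    // cost of one stage of a collective
                         double L_imbal)  // load-imbalance penalty
{
    return E * N * N * N
         + C * (lambda + tau * k * N * N)
         + S * (gamma * std::log(P))
         + L_imbal;
}
```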
Monte Carlo Particle Transport Code, ITS • Model built and verified on Red Storm, ASCI Red, Cplant, Vplant • Red Storm performance model used to predict 10K-processor performance • Algorithm modification (new code) results in much improved scaling • Execution time = Compute_time(num_hist) + Comm_time(msg_sizes, latency, BW, p) + I/O time • Master-worker, many-to-one: tally data sent to the master • T_comm = (2·T_48 + T_48000 + T_432 + T_16M + T_368) · num_batches · (p − 1), where each T_x is the transfer time for a message of the corresponding size • Inputs: latency, point-to-point bandwidth, number of processors (p), number of batches • Model with Rabenseifner's algorithm: T_comm = 2·ln(p)·latency + 2·(p−1)·n·(message transfer time) + (p−1)·n·(local reduction time)/p, where p is the number of processors and n is the message size in bytes
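Both communication expressions translate directly into small model functions. In the sketch below the per-message times (T_48, T_432, etc.) and the transfer/reduction costs are treated as inputs to be derived from measured latency and bandwidth, since no specific values are given above.

```cpp
#include <cmath>

// Master-worker tally communication, per the model above:
//   T_comm = (2*T_48 + T_48000 + T_432 + T_16M + T_368) * num_batches * (p - 1)
// Each T_x is the measured or modeled time to send a message of that size.
double its_master_worker_time(double T_48, double T_48000, double T_432,
                              double T_16M, double T_368,
                              int num_batches, int p) {
    return (2.0 * T_48 + T_48000 + T_432 + T_16M + T_368)
           * num_batches * (p - 1);
}

// Communication time with a Rabenseifner-style allreduce, as written above:
//   T_comm = 2*ln(p)*latency + 2*(p-1)*n*msg_transfer_time
//          + (p-1)*n*local_reduction_time / p
double its_rabenseifner_time(int p, double n_bytes, double latency,
                             double msg_transfer_time,
                             double local_reduction_time) {
    return 2.0 * std::log(p) * latency
         + 2.0 * (p - 1) * n_bytes * msg_transfer_time
         + (p - 1) * n_bytes * local_reduction_time / p;
}
```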
Structural Simulation Toolkit (SST) • Motivation: currently developing a simulation environment to provide a validated baseline for future exploration, answer "what if" questions to guide future design efforts, and understand complex system-level interactions • Goals: focus on parallel systems (HW & SW); quick turnaround; flexibility; multiple front-ends (execution driven, trace driven); multiple back-ends; explore novel architectures (e.g., multi-core, NIC, memory) and support conventional architectures (e.g., single core, DDR); reusable, extensible, and parallelizable • Customers: micro-architects, system architects, application performance analysis
SST: Structure • Front-Ends & Back-Ends Joined by Processor/Thread Interface • Enkidu “glues” back-end components
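SST's real interfaces are not reproduced here, but the "glue" idea of back-end components exchanging timestamped events can be illustrated with a minimal, hypothetical discrete-event sketch; all class and method names below are invented for illustration and are not SST's or Enkidu's actual API.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical illustration of component "glue": each back-end component
// (CPU, NIC, memory, ...) handles timestamped events, and a central queue
// advances simulated time.  Names are invented, not the toolkit's real API.
struct Event {
    uint64_t time;     // simulated cycle at which the event fires
    int      target;   // index of the destination component
    int      payload;  // opaque event data
};

struct EventLater {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

using EventQueue = std::priority_queue<Event, std::vector<Event>, EventLater>;

class Component {
public:
    virtual ~Component() = default;
    // Handle one event; may schedule follow-up events on the queue.
    virtual void handle(const Event& ev, EventQueue& q) = 0;
};

// Driver loop: pop the earliest event and deliver it to its component.
inline void run(std::vector<Component*>& comps, EventQueue& q) {
    while (!q.empty()) {
        Event ev = q.top();
        q.pop();
        comps[ev.target]->handle(ev, q);
    }
}
```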
Applying SST: Red Storm SeaStar NIC • Architectural features: embedded 500 MHz PPC440, local SRAM, DMA engines, NIC bus; high-speed network interface to the 3D mesh router; 800 MHz HyperTransport interface to the CPU; host and NIC communicate through memory • NIC modeling: PPC 440 modeled with SimpleScalar; local SRAM uses an existing SST component; Tx/Rx DMA engines use an existing component, respond to the same commands as the Red Storm DMA, and are flow controlled; the HT interface connects CPU and NIC; the NIC bus connects the internal NIC components (PPC, SRAM, etc.) • HyperTransport modeling: the HyperTransport connection is modeled as two components; HTLink models latency; HTLink_bw models link bandwidth and contention, tracks the backlog of requests with a simple bandwidth-counting scheme, and implements flow control with a finite request queue whose depth is set to cover round-trip times and allow full bandwidth
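The bandwidth-counting and flow-control behavior described for HTLink_bw can be sketched as a small queueing model. The sketch below is illustrative only; its class name, parameters, and units are assumptions rather than the component's real interface.

```cpp
#include <cstdint>
#include <deque>

// Illustrative bandwidth/contention model in the spirit of HTLink_bw:
// requests queue behind a link of fixed bytes-per-cycle capacity, and a
// finite queue depth provides flow control.  All parameters are placeholders.
class LinkBWModel {
public:
    LinkBWModel(double bytes_per_cycle, size_t max_queue)
        : bytes_per_cycle_(bytes_per_cycle), max_queue_(max_queue) {}

    // Returns false (back-pressure) when the request queue is full.
    bool submit(uint64_t now, uint64_t bytes) {
        if (pending_.size() >= max_queue_) return false;   // flow control
        // A request starts when the link frees up, which tracks the backlog.
        uint64_t start  = next_free_ > now ? next_free_ : now;
        uint64_t finish = start + static_cast<uint64_t>(bytes / bytes_per_cycle_);
        next_free_ = finish;
        pending_.push_back(finish);
        return true;
    }

    // Retire requests whose transfer has completed by 'now'.
    void retire(uint64_t now) {
        while (!pending_.empty() && pending_.front() <= now) pending_.pop_front();
    }

private:
    double   bytes_per_cycle_;
    size_t   max_queue_;
    uint64_t next_free_ = 0;          // cycle at which the link becomes idle
    std::deque<uint64_t> pending_;    // completion times of in-flight requests
};
```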
Validating SST: Latency & Bandwidth • Used MPI ping-pong and the OSU streaming bandwidth benchmark • Compared with real SeaStar 1.2 and 2.1 chips • Latency, message rate, and bandwidth agree within 5% across the range of message sizes
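For reference, the ping-pong side of that validation looks roughly like the generic MPI micro-benchmark below; this is a standard ping-pong sketch, not the exact harness used for the SeaStar measurements.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Generic MPI ping-pong: rank 0 sends 'size' bytes to rank 1 and waits for
// the echo; half the round-trip time approximates one-way latency.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    for (int size = 1; size <= (1 << 20); size *= 2) {
        std::vector<char> buf(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way_us = (MPI_Wtime() - t0) / (2.0 * iters) * 1e6;
        if (rank == 0)
            std::printf("%8d bytes: %8.2f us one-way\n", size, one_way_us);
    }
    MPI_Finalize();
    return 0;
}
```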
Validating SST: Primitives • Sources of Error • Small message optimization in Red Storm (<16 bytes) • Lack of cache-line invalidation instruction • Processor model?
Design Space Exploration • Effect of NIC clock rate on latency: 2X NIC clock -> 30% improvement in latency; 4X NIC clock -> 50% improvement • Effect of HT bus on bandwidth: no effect on small messages, but it does affect peak bandwidth • Effect of NIC bus on latency: halving the bus latency -> 8% latency reduction • Effect of HT bus on latency: MPI latency increases linearly with HT latency (4 HT transactions per MPI message) • (Figure panels: varying NIC clock frequency, varying HT bandwidth, varying NIC bus latency, varying HT latency; diagram labels: HyperTransport, processor speed, NIC bus.)
Finding the Bottleneck: Computation, Branches, or Memory • Within the node, memory performance is the key bottleneck • Even perfect branch prediction and infinite functional units would be less valuable than improving memory latency • Prefetching and caches don't help emerging applications
Latency/Bandwidth Sensitivity • Latency and bandwidth are both constraining performance • Emerging applications are more sensitive to latency and bandwidth • (Figure panels: informatics applications, scientific applications.)
Memory Operations Dominate • FP ops ("real work") are < 10% of the instructions in Sandia codes • Several integer calculations and loads for each FP load • Memory and integer ops dominate, and most integer ops are computing memory addresses • Theme: processing is now cheap, data movement is expensive • (Figure panels: instruction mix, integer instruction usage.)
A Hybrid MPI Simulator • Execution-driven network simulator • Uses Lamport's virtual time concept to run the application unperturbed • Uses the MPI profiling interface: easy to hook into existing applications, a relink is the only porting step required, and there is no need to instrument or analyze code • MPI wrapper library: keeps track of virtual time, sends and receives MPI data as before, and manages events to coordinate the network simulator with the application
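The relink-only property comes from the MPI profiling (PMPI) interface: the wrapper library defines the MPI entry points and forwards to the PMPI_ versions while doing its bookkeeping. A minimal sketch of one wrapper (using the MPI-3 signature) is shown below; the virtual-time handling here is a simplified placeholder, not the simulator's actual logic.

```cpp
#include <mpi.h>

// Minimal PMPI interposition sketch: the wrapper is linked ahead of the MPI
// library, so application calls land here first and are forwarded to the real
// implementation via the PMPI_ names.
static double g_virtual_time = 0.0;   // this rank's virtual clock (seconds)
static double g_last_return  = 0.0;   // wall time when the app last got control

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm) {
    // Charge the application's compute time since the previous MPI call 1:1.
    double now = MPI_Wtime();
    if (g_last_return > 0.0) g_virtual_time += now - g_last_return;

    // Forward to the real implementation.  In the simulator, the virtual send
    // time would also be attached to the message so the receiver can update
    // its own virtual clock; that part is omitted in this placeholder.
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);

    g_last_return = MPI_Wtime();
    return rc;
}
```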
MPI Simulator: Virtual Time • Lamport's idea of synchronizing virtual time between nodes using messages • Events carry the MPI envelope information and the virtual send time • The algorithm adjusts the local virtual time: if the virtual send time (Tx) plus the network delay is later than the receiver's current virtual time (T1), then Tv = Tx + delay; otherwise the message is assumed to have already arrived and Tv = T1
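Under that reading, the receive-side update reduces to a max(); the sketch below assumes Tx is the virtual timestamp carried with the event and delay is the simulated network delay.

```cpp
#include <algorithm>

// Receive-side virtual-time update (sketch).  Tx is the virtual send time
// carried in the event, delay is the simulated network delay, and T1 is the
// receiver's current virtual time.  If the message arrives in the receiver's
// virtual future, the clock jumps to the arrival time; otherwise the message
// is treated as already delivered and the clock is unchanged.
inline double update_virtual_time(double T1, double Tx, double delay) {
    return std::max(T1, Tx + delay);
}
```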
Hybrid MPI Simulator • (Figures: node architecture, collective example, model validation with NAS benchmarks, model tuning.)
Future challenges • Data locality on chip • Impact of programming models • Accelerators