Application Performance Analysis and Modeling An Overview of Recent SNL Activities August 20, 2007 Sudip Dosanjh Computer and Software Systems Sandia National Laboratories sudip@sandia.gov Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Goals • Improve performance and scalability • Better utilization of an existing system • Determine which platform a given simulation should run on • Better utilization of a set of machines • Guide purchase of capacity and capability systems • Help design future supercomputers • Cost/benefit analysis • Focus R&D efforts
Science and Engineering Apps • Continuum • Computational fluid dynamics • Shock physics (CTH) • Arbitrary Lagrangian Eulerian (Alegra) • Structural mechanics • Combustion • Device simulations • E&M • Radiation • Enclosure radiation • DAE • Circuit Modeling • Particles • Molecular dynamics (LAMMPS) • Particle-in-cell
Informatics is an Emerging App Image source: T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis," CACM 47(3), March 2004, pp. 45-47
Red Storm • Before upgrade: 10,880 2.0 GHz single-core AMD Opteron CPUs, 43.52 TF/s peak, SeaStar 1.2, 2-4 GB per socket, #9 on June 2006 Top 500 list, Catamount LWK • After upgrade: 13,600 2.4 GHz dual-core AMD Opteron CPUs, 130.56 TF/s peak, SeaStar 2.1 network (doubled NIC bandwidth), 2-4 GB per socket, #3 on current Top 500 list, Catamount LWK with virtual node mode support
Upgrade Performance: Single-Core vs. Dual-Core • CTH - shaped charge: constant work/core; speedup of at least 1.4 out to 2048 sockets • Sage - timing_c: constant work/core; at scale, speedup is at least a factor of 1.6
Sandia Study: 2004 • Where should codes run? Capability or capacity? • Looked at the relative scaling efficiency between Red Storm and Cplant and then correlated efficiency to cost • Answer: capacity below 256 nodes, capability above 256 nodes • (Figure: chart regions labeled "Cluster more cost effective" and "MPP more cost effective," separated where efficiency ratio = cost ratio = 1.8; curves for Peak, Linpack, and the scientific & engineering codes Presto, CTH, and DSMC.)
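As an illustration of that decision rule, the sketch below compares a code's scaling-efficiency ratio between the two platforms against their cost ratio; the efficiency numbers in main() are invented placeholders, not figures from the 2004 study.

```cpp
#include <iostream>

// Illustrative only: decide whether a job is cheaper to run on the capability
// machine (MPP) or the capacity cluster.  The efficiency values and the cost
// ratio below are placeholders, not data from the Sandia study.
bool run_on_capability(double mpp_efficiency, double cluster_efficiency,
                       double cost_ratio /* capability cost per node / capacity cost per node */) {
    // The capability system wins when its efficiency advantage outweighs its
    // higher per-node cost.
    return (mpp_efficiency / cluster_efficiency) > cost_ratio;
}

int main() {
    // Hypothetical scaling efficiencies for one job size.
    double mpp = 0.85, cluster = 0.40, cost_ratio = 1.8;
    std::cout << (run_on_capability(mpp, cluster, cost_ratio)
                      ? "capability (MPP)" : "capacity (cluster)")
              << " is more cost effective\n";
    return 0;
}
```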
Red Storm & Thunderbird Performance Comparison (i.e., capability vs. capacity) • Turnover job size = 256
CTH Model with Load Imbalance • T = E(κ,φ)·N³ + C·(λ + τ·k·N²) + S·(γ·log(P)) + L_imbal • T is the execution time per time step • N is the size of an edge of a processor's subdomain • C and S are the numbers of exchanges and collectives • P is the number of processors • k is the number of variables in an exchange • λ and τ are the latency and transfer cost • γ is the cost of one stage of a collective • E(κ,φ) is the calculation time per cell • L_imbal represents the effects of load imbalance
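Written as code, the model is a direct transcription of the formula, which is convenient when fitting the parameters to measured timings. The sketch below takes every quantity as an input; no parameter values from the study are assumed.

```cpp
#include <cmath>

// Per-timestep execution time for the CTH load-imbalance model:
//   T = E(kappa,phi)*N^3 + C*(lambda + tau*k*N^2) + S*(gamma*log(P)) + L_imbal
// All inputs are measured or fitted quantities; nothing is hard-coded because
// the values depend on the machine and the problem.
double cth_timestep_time(double E,        // calculation time per cell, E(kappa, phi)
                         double N,        // edge length of a processor's subdomain
                         double C,        // number of boundary exchanges
                         double S,        // number of collectives
                         double P,        // number of processors
                         double k,        // variables per exchange
                         double lambda,   // message latency
                         double tau,      // per-unit transfer cost
                         double gamma,    // cost of one stage of a collective
                         double L_imbal)  // load-imbalance penalty
{
    return E * N * N * N
         + C * (lambda + tau * k * N * N)
         + S * (gamma * std::log(P))
         + L_imbal;
}
```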
Monte Carlo Particle Transport Code, ITS • Model built and verified on Red Storm, ASCI Red, Cplant, Vplant • Red Storm performance model used to predict 10K-processor performance • Algorithm modification (new code) results in much improved scaling • Execution time = Compute_time(num_hist) + Comm_time(msg_sizes, latency, BW, p) + I/O time • Master-worker, many-to-one: tally data sent to the master • T_comm = (2·T_48 + T_48000 + T_432 + T_16M + T_368) · num_batches · (p − 1), where each T_x is the transfer time for a message of the corresponding size • Inputs: latency, point-to-point bandwidth, number of processors (p), number of batches • Model with Rabenseifner's algorithm: T_comm = 2·ln(p)·latency + 2·(p−1)·n·(message transfer time) + (p−1)·n·(local reduction time)/p, where p is the number of processors and n is the message size in bytes
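Both communication expressions translate directly into small model functions. In the sketch below the per-message times (T_48, T_432, etc.) and the transfer/reduction costs are treated as inputs to be derived from measured latency and bandwidth, since no specific values are given above.

```cpp
#include <cmath>

// Master-worker tally communication, per the model above:
//   T_comm = (2*T_48 + T_48000 + T_432 + T_16M + T_368) * num_batches * (p - 1)
// Each T_x is the measured or modeled time to send a message of that size.
double its_master_worker_time(double T_48, double T_48000, double T_432,
                              double T_16M, double T_368,
                              int num_batches, int p) {
    return (2.0 * T_48 + T_48000 + T_432 + T_16M + T_368)
           * num_batches * (p - 1);
}

// Communication time with a Rabenseifner-style allreduce, as written above:
//   T_comm = 2*ln(p)*latency + 2*(p-1)*n*msg_transfer_time
//          + (p-1)*n*local_reduction_time / p
double its_rabenseifner_time(int p, double n_bytes, double latency,
                             double msg_transfer_time,
                             double local_reduction_time) {
    return 2.0 * std::log(p) * latency
         + 2.0 * (p - 1) * n_bytes * msg_transfer_time
         + (p - 1) * n_bytes * local_reduction_time / p;
}
```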
Structural Simulation Toolkit (SST) • Motivation: currently developing a simulation environment to provide a validated baseline for future exploration, answer "what if" questions to guide future design efforts, and understand complex system-level interactions • Goals: focus on parallel systems (HW & SW); quick turnaround; flexibility; multiple front-ends (execution driven, trace driven); multiple back-ends; explore novel architectures (e.g., multi-core, NIC, memory) and support conventional architectures (e.g., single core, DDR); reusable, extensible, and parallelizable • Customers: micro-architects, system architects, application performance analysis
SST: Structure • Front-Ends & Back-Ends Joined by Processor/Thread Interface • Enkidu “glues” back-end components
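SST's real interfaces are not reproduced here, but the "glue" idea of back-end components exchanging timestamped events can be illustrated with a minimal, hypothetical discrete-event sketch; all class and method names below are invented for illustration and are not SST's or Enkidu's actual API.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical illustration of component "glue": each back-end component
// (CPU, NIC, memory, ...) handles timestamped events, and a central queue
// advances simulated time.  Names are invented, not the toolkit's real API.
struct Event {
    uint64_t time;     // simulated cycle at which the event fires
    int      target;   // index of the destination component
    int      payload;  // opaque event data
};

struct EventLater {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

using EventQueue = std::priority_queue<Event, std::vector<Event>, EventLater>;

class Component {
public:
    virtual ~Component() = default;
    // Handle one event; may schedule follow-up events on the queue.
    virtual void handle(const Event& ev, EventQueue& q) = 0;
};

// Driver loop: pop the earliest event and deliver it to its component.
inline void run(std::vector<Component*>& comps, EventQueue& q) {
    while (!q.empty()) {
        Event ev = q.top();
        q.pop();
        comps[ev.target]->handle(ev, q);
    }
}
```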
Applying SST: Red Storm SeaStar NIC • Architectural features: embedded 500 MHz PPC440, local SRAM, DMA engines, NIC bus; high-speed network interface to the 3D mesh router; 800 MHz HyperTransport interface to the CPU; host and NIC communicate through memory • NIC modeling: PPC 440 modeled with SimpleScalar; local SRAM uses an existing SST component; Tx/Rx DMA engines use an existing component, respond to the same commands as the Red Storm DMA, and are flow controlled; the HT interface connects CPU and NIC; the NIC bus connects the internal NIC components (PPC, SRAM, etc.) • HyperTransport modeling: the HyperTransport connection is modeled as two components; HTLink models latency; HTLink_bw models link bandwidth and contention, tracks the backlog of requests with a simple bandwidth-counting scheme, and implements flow control with a finite request queue whose depth is set to cover round-trip times and allow full bandwidth
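The bandwidth-counting and flow-control behavior described for HTLink_bw can be sketched as a small queueing model. The sketch below is illustrative only; its class name, parameters, and units are assumptions rather than the component's real interface.

```cpp
#include <cstdint>
#include <deque>

// Illustrative bandwidth/contention model in the spirit of HTLink_bw:
// requests queue behind a link of fixed bytes-per-cycle capacity, and a
// finite queue depth provides flow control.  All parameters are placeholders.
class LinkBWModel {
public:
    LinkBWModel(double bytes_per_cycle, size_t max_queue)
        : bytes_per_cycle_(bytes_per_cycle), max_queue_(max_queue) {}

    // Returns false (back-pressure) when the request queue is full.
    bool submit(uint64_t now, uint64_t bytes) {
        if (pending_.size() >= max_queue_) return false;   // flow control
        // A request starts when the link frees up, which tracks the backlog.
        uint64_t start  = next_free_ > now ? next_free_ : now;
        uint64_t finish = start + static_cast<uint64_t>(bytes / bytes_per_cycle_);
        next_free_ = finish;
        pending_.push_back(finish);
        return true;
    }

    // Retire requests whose transfer has completed by 'now'.
    void retire(uint64_t now) {
        while (!pending_.empty() && pending_.front() <= now) pending_.pop_front();
    }

private:
    double   bytes_per_cycle_;
    size_t   max_queue_;
    uint64_t next_free_ = 0;          // cycle at which the link becomes idle
    std::deque<uint64_t> pending_;    // completion times of in-flight requests
};
```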
Validating SST: Latency & Bandwidth • Used MPI ping-pong and the OSU streaming bandwidth benchmark • Compared with real SeaStar 1.2 and 2.1 chips • Latency, message rate, and bandwidth agree within 5% across the range of message sizes
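For reference, the ping-pong side of that validation looks roughly like the generic MPI micro-benchmark below; this is a standard ping-pong sketch, not the exact harness used for the SeaStar measurements.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Generic MPI ping-pong: rank 0 sends 'size' bytes to rank 1 and waits for
// the echo; half the round-trip time approximates one-way latency.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    for (int size = 1; size <= (1 << 20); size *= 2) {
        std::vector<char> buf(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way_us = (MPI_Wtime() - t0) / (2.0 * iters) * 1e6;
        if (rank == 0)
            std::printf("%8d bytes: %8.2f us one-way\n", size, one_way_us);
    }
    MPI_Finalize();
    return 0;
}
```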
Validating SST: Primitives • Sources of Error • Small message optimization in Red Storm (<16 bytes) • Lack of cache-line invalidation instruction • Processor model?
Design Space Exploration • Effect of NIC clock rate on latency: 2X NIC clock -> 30% improvement in latency; 4X NIC clock -> 50% improvement • Effect of HT bus on bandwidth: no effect on small messages, but it does affect peak bandwidth • Effect of NIC bus on latency: halving the bus latency -> 8% latency reduction • Effect of HT bus on latency: MPI latency increases linearly with HT latency (4 HT transactions per MPI message) • (Figure panels: varying NIC clock frequency, varying HT bandwidth, varying NIC bus latency, varying HT latency; diagram labels: HyperTransport, processor speed, NIC bus.)
Finding the Bottleneck: Computation, Branches, or Memory • Within the node, memory performance is the key bottleneck • Even perfect branch prediction and infinite functional units would be less valuable than improving memory latency • Prefetching and caches don't help emerging applications
Latency/Bandwidth Sensitivity • Latency and bandwidth are both constraining performance • Emerging applications are more sensitive to latency and bandwidth • (Figure panels: informatics applications, scientific applications.)
Memory Operations Dominate • FP ops ("real work") are < 10% of the instructions in Sandia codes • Several integer calculations and loads for each FP load • Memory and integer ops dominate, and most integer ops are computing memory addresses • Theme: processing is now cheap, data movement is expensive • (Figure panels: instruction mix, integer instruction usage.)
A Hybrid MPI Simulator • Execution-driven network simulator • Uses Lamport's virtual time concept to run the application unperturbed • Uses the MPI profiling interface: easy to hook into existing applications, a relink is the only porting step required, and there is no need to instrument or analyze code • MPI wrapper library: keeps track of virtual time, sends and receives MPI data as before, and manages events to coordinate the network simulator with the application
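The relink-only property comes from the MPI profiling (PMPI) interface: the wrapper library defines the MPI entry points and forwards to the PMPI_ versions while doing its bookkeeping. A minimal sketch of one wrapper (using the MPI-3 signature) is shown below; the virtual-time handling here is a simplified placeholder, not the simulator's actual logic.

```cpp
#include <mpi.h>

// Minimal PMPI interposition sketch: the wrapper is linked ahead of the MPI
// library, so application calls land here first and are forwarded to the real
// implementation via the PMPI_ names.
static double g_virtual_time = 0.0;   // this rank's virtual clock (seconds)
static double g_last_return  = 0.0;   // wall time when the app last got control

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm) {
    // Charge the application's compute time since the previous MPI call 1:1.
    double now = MPI_Wtime();
    if (g_last_return > 0.0) g_virtual_time += now - g_last_return;

    // Forward to the real implementation.  In the simulator, the virtual send
    // time would also be attached to the message so the receiver can update
    // its own virtual clock; that part is omitted in this placeholder.
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);

    g_last_return = MPI_Wtime();
    return rc;
}
```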
MPI Simulator: Virtual Time • Lamport's idea of synchronizing virtual time between nodes using messages • Events carry the MPI envelope information and the virtual send time • The algorithm adjusts the local virtual time: if the virtual send time (Tx) plus the network delay is later than the receiver's current virtual time (T1), then Tv = Tx + delay; otherwise the message is assumed to have already arrived and Tv = T1
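Under that reading, the receive-side update reduces to a max(); the sketch below assumes Tx is the virtual timestamp carried with the event and delay is the simulated network delay.

```cpp
#include <algorithm>

// Receive-side virtual-time update (sketch).  Tx is the virtual send time
// carried in the event, delay is the simulated network delay, and T1 is the
// receiver's current virtual time.  If the message arrives in the receiver's
// virtual future, the clock jumps to the arrival time; otherwise the message
// is treated as already delivered and the clock is unchanged.
inline double update_virtual_time(double T1, double Tx, double delay) {
    return std::max(T1, Tx + delay);
}
```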
Hybrid MPI Simulator • (Figures: node architecture, collective example, model validation with NAS benchmarks, model tuning.)
Future challenges • Data locality on chip • Impact of programming models • Accelerators