
Application Performance Analysis and Modeling


Presentation Transcript


  1. Application Performance Analysis and Modeling An Overview of Recent SNL Activities August 20, 2007 Sudip Dosanjh Computer and Software Systems Sandia National Laboratories sudip@sandia.gov Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Goals • Improve performance and scalability • Better utilization of an existing system • Determine which platform a given simulation should run on • Better utilization of a set of machines • Guide purchase of capacity and capability systems • Help design future supercomputers • Cost/benefit analysis • Focus R&D efforts

  3. Science and Engineering Apps • Continuum • Computational fluid dynamics • Shock physics (CTH) • Arbitrary Lagrangian Eulerian (Alegra) • Structural mechanics • Combustion • Device simulations • E&M • Radiation • Enclosure radiation • DAE • Circuit Modeling • Particles • Molecular dynamics (LAMMPS) • Particle-in-cell

  4. Informatics is an Emerging App Image Source: T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis," CACM 47(3), March 2004, pp. 45-47

  5. Red Storm • Before upgrade: 10,880 2.0 GHz single-core AMD Opteron CPUs • 43.52 TF/s peak • SeaStar 1.2 network • 2-4 GB per socket • #9 on June 2006 Top 500 list • Catamount LWK • After upgrade: 13,600 2.4 GHz dual-core AMD Opteron CPUs • 130.56 TF/s peak • SeaStar 2.1 network with doubled NIC bandwidth • 2-4 GB per socket • #3 on current Top 500 list • Catamount LWK with virtual node mode support
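
(The peak figures are consistent with each Opteron core retiring two floating-point results per clock, an assumed derivation rather than one stated on the slide: 10,880 × 2.0 GHz × 2 = 43.52 TF/s before the upgrade, and 13,600 sockets × 2 cores × 2.4 GHz × 2 = 130.56 TF/s after.)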

  6. Upgrade Performance: Single-Core vs Dual-Core • CTH - Shape Charge • Constant work/core • Speedup of at least 1.4 out to 2048 sockets • Sage - timing_c • Constant work/core • At scale, speedup is at least a factor of 1.6

  7. Catamount Virtual Node LWK Performs Well on 7X Applications

  8. Sandia Study: 2004 • Where should codes run? Capability or Capacity? • Looked at the relative scaling efficiency between Red Storm and Cplant and then correlated efficiency to cost. • Answer: Capacity < 256 nodes, Capability > 256 nodes • [Chart: efficiency ratio vs. job size for Presto, CTH, and DSMC, with Peak and Linpack references; the cluster is more cost effective where the efficiency ratio falls below the cost ratio of 1.8, the MPP where it is above]
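
The decision rule behind the study reduces to comparing the measured efficiency ratio against the cost ratio of 1.8. The sketch below is a minimal illustration with made-up efficiency values, not the measured Presto/CTH/DSMC data.

```c
/* Sketch of the capability-vs-capacity rule from slide 8. Variable names
 * and efficiency values are illustrative, not the study's measurements. */
#include <stdio.h>

int main(void)
{
    double cost_ratio = 1.8;   /* MPP cost per node / cluster cost per node */
    double eff_mpp    = 0.75;  /* hypothetical scaling efficiency on the MPP */
    double eff_clus   = 0.35;  /* hypothetical scaling efficiency on the cluster */

    /* The MPP pays off when its efficiency advantage exceeds its cost premium. */
    if (eff_mpp / eff_clus > cost_ratio)
        printf("capability (MPP) is more cost effective\n");
    else
        printf("capacity (cluster) is more cost effective\n");
    return 0;
}
```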

  9. Red Storm & Thunderbird Performance Comparison (I.e. Capability vs Capacity) Turnover job size = 256

  10. CTH Model with Load Imbalance T = E(κ,φ)·N³ + C·(λ + τ·k·N²) + S·γ·log(P) + L_imbal • T is the execution time per time step • N is size of an edge of a processor’s subdomain • C and S are number of exchanges and collectives • P is the number of processors • k is the number of variables in an exchange • λ and τ are latency and transfer cost • γ is the cost of one stage of collective • E(κ,φ) is the calculation time per cell • L_imbal represents the effects of load imbalance
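
As a minimal sketch, the model can be evaluated directly from its terms; all parameter values below are illustrative placeholders, not measured Red Storm constants.

```c
/* Sketch of the CTH per-timestep model on slide 10:
 *   T = E(kappa,phi)*N^3 + C*(lambda + tau*k*N^2) + S*gamma*log(P) + L_imbal
 * The inputs in main() are hypothetical, chosen only to exercise the formula. */
#include <math.h>
#include <stdio.h>

double cth_step_time(double E,       /* calculation time per cell, E(kappa,phi) */
                     double N,       /* cells along one edge of the subdomain */
                     double C,       /* exchanges per step */
                     double S,       /* collectives per step */
                     double P,       /* number of processors */
                     double k,       /* variables per exchange */
                     double lambda,  /* message latency */
                     double tau,     /* per-item transfer cost */
                     double gamma_c, /* cost of one collective stage */
                     double L_imbal) /* load-imbalance penalty */
{
    return E * N * N * N
         + C * (lambda + tau * k * N * N)
         + S * (gamma_c * log(P))
         + L_imbal;
}

int main(void)
{
    double t = cth_step_time(1e-7, 50.0, 6.0, 2.0, 4096.0, 10.0,
                             5e-6, 2e-9, 8e-6, 0.0);
    printf("modeled time per step: %g s\n", t);
    return 0;
}
```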

  11. Monte-Carlo Particle Transport Code, ITS • Model built and verified on Red Storm, ASCI Red, Cplant, Vplant • Red Storm performance model used to predict 10K-processor performance • Algorithm modification (new code) results in much improved scaling • Execution Time = Compute_time(num_hist) + Commu_time(msg_sizes, latency, BW, p) + I/O time • Master-worker; many-to-one; tally data sent to master • Tcomm = {2·T_48 + T_48000 + T_432 + T_16M + T_368} · num_batches · (p-1), where T_s is the time to send a message of s bytes • Input: latency, bandwidth (pt-to-pt), num_procs (p), num_batches • Model with Rabenseifner’s algorithm: Tcomm = 2·ln(p)·latency + 2·(p-1)·n·Msg_transfer_time + (p-1)·n·local_reduction_time / p • p = number of processors • n = message size, bytes
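
The two communication models can be written down directly. In the sketch below, interpreting T_s as latency + size/bandwidth is an assumption, and the batch counts, message sizes, and network parameters are placeholder values, not the inputs used in the SNL study.

```c
/* Sketch of the ITS communication-time models on slide 11. */
#include <math.h>
#include <stdio.h>

/* Point-to-point time for an s-byte message: latency + s / bandwidth. */
static double t_msg(double s_bytes, double latency, double bw)
{
    return latency + s_bytes / bw;
}

/* Original master-worker model: each worker ships its tally messages
 * to the master every batch. */
static double tcomm_master_worker(int p, int num_batches,
                                  double latency, double bw)
{
    double per_batch = 2.0 * t_msg(48, latency, bw)
                     + t_msg(48000, latency, bw)
                     + t_msg(432, latency, bw)
                     + t_msg(16e6, latency, bw)   /* the 16M message */
                     + t_msg(368, latency, bw);
    return per_batch * num_batches * (p - 1);
}

/* Modified model with Rabenseifner's algorithm, as written on the slide. */
static double tcomm_rabenseifner(int p, double n_bytes, double latency,
                                 double xfer_per_byte, double reduce_per_byte)
{
    return 2.0 * log(p) * latency
         + 2.0 * (p - 1) * n_bytes * xfer_per_byte
         + (p - 1) * n_bytes * reduce_per_byte / p;
}

int main(void)
{
    double latency = 5e-6, bw = 2e9;   /* assumed network parameters */
    printf("master-worker: %g s\n", tcomm_master_worker(10000, 10, latency, bw));
    printf("rabenseifner:  %g s\n",
           tcomm_rabenseifner(10000, 16e6, latency, 1.0 / bw, 1e-9));
    return 0;
}
```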

  12. Structural Simulation Toolkit (SST) • [Diagram labels: HW, Architectural Results, SW] • Motivation • Currently developing a simulation environment to ... • Provide validated baseline for future exploration • Answer “What If” questions to guide future design efforts • Understand complex system-level interactions • Goals • Focus on parallel systems: HW & SW • Quick turnaround • Flexibility • Multiple front-ends • Execution driven • Trace driven • Multiple back-ends • Explore novel architectures (e.g. multi-core, NIC, memory) • Support conventional architectures (e.g. single core, DDR) • Reusable, extensible, & parallelizable • Customers • Micro-architects • System architects • Application performance analysis

  13. SST: Structure • Front-Ends & Back-Ends Joined by Processor/Thread Interface • Enkidu “glues” back-end components

  14. SST: Capabilities and Components

  15. Applying SST: Red Storm SeaStar NIC • Architectural Features • Embedded 500 MHz PPC440, local SRAM, DMA engines, NIC bus • High speed network interface to 3D mesh router • 800 MHz HyperTransport interface to CPU • Host/NIC communicate through memory • NIC Modeling • PPC 440: used SimpleScalar • Local SRAM: existing SST component • Tx/Rx DMA engines • Existing component • Respond to same commands as the RS DMA • Flow controlled • HT Interface • Connects CPU/NIC • NIC Bus • Connects internal NIC components (PPC, SRAM, etc.) • HyperTransport Modeling • HyperTransport connection modeled as two components • HTLink models latency • HTLink_bw models link bandwidth • Models contention • Tracks backlog of requests with a simple BW counting scheme • Implements flow control with a finite request queue • Queue depth set to cover round-trip times and allow full bandwidth
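
The last bullet amounts to sizing the request queue to the link's bandwidth-delay product. The sketch below illustrates that rule with assumed numbers; they are not SeaStar or HyperTransport specifications, and the sizing rule itself is an interpretation of the slide.

```c
/* Sizing sketch for the HTLink_bw request queue mentioned on slide 15.
 * All values here are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double round_trip_s  = 500e-9;  /* assumed HT round-trip latency */
    double bandwidth_Bps = 3.2e9;   /* assumed HT link bandwidth */
    double request_bytes = 64.0;    /* assumed request payload size */

    /* Requests that must be outstanding to keep the link busy. */
    int queue_depth = (int)ceil(round_trip_s * bandwidth_Bps / request_bytes);
    printf("minimum queue depth for full bandwidth: %d\n", queue_depth);
    return 0;
}
```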

  16. Validating SST: Latency & Bandwidth • Used MPI “ping-pong” and OSU streaming BW • Compared with real Seastar 1.2 and 2.1 chips • Latency, message rate, and bandwidth • within 5% for range of sizes

  17. Validating SST: Primitives • Sources of Error • Small message optimization in Red Storm (<16 bytes) • Lack of cache-line invalidation instruction • Processor model?

  18. Design Space Exploration • Vary NIC clock rate, effect on latency: • 2X NIC clock -> 30% improvement in latency • 4X NIC clock -> 50% improvement • Vary HT bus, effect on bandwidth: • No effect on small messages • Does affect peak bandwidth • Vary NIC bus, effect on latency: • 1/2 bus latency -> 8% latency reduction • Vary HT bus, effect on latency: • MPI latency increases linearly with HT latency • 4 HT transactions per MPI message • [Charts: varying NIC clock frequency, HT bandwidth, NIC bus latency, and HT latency; diagram labels: HyperTransport, Processor Speed, NIC Bus]

  19. Finding the Bottleneck: Computation, Branches, or Memory • Within the node, memory performance is the key bottleneck • Even perfect branch prediction and infinite FUs would be less valuable than improving memory latency • Prefetching and caches don’t help emerging applications

  20. Latency/Bandwidth Sensitivity • Latency and bandwidth are both constraining performance • Emerging applications are more sensitive to latency and bandwidth • [Charts: latency/bandwidth sensitivity for scientific applications and for informatics applications]

  21. Memory Operations Dominate • FP ops (“real work”) < 10% of Sandia codes • Several integer calculations and loads for each FP load • Memory and integer ops dominate • ...and most integer ops are computing memory addresses • Theme: processing is now cheap, data movement is expensive • [Charts: instruction mix and integer instruction usage]

  22. A Hybrid MPI Simulator • Execution driven network simulator • Uses Lamport’s virtual time concept to run the application unperturbed • Uses the MPI profiling interface • Easy to “hook” into existing applications • A relink is the only porting step an application requires • No need to instrument or analyze code • MPI Wrapper Library • Keeps track of virtual time • Sends and receives MPI data as before • Manages events to coordinate the network simulator with the application
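
A minimal sketch of the profiling-interface hook is below; the virtual-time bookkeeping is an illustration of the idea, not the actual SNL wrapper library.

```c
/* Sketch of an MPI profiling-interface (PMPI) wrapper in the spirit of
 * slide 22. Linking this library ahead of the MPI library intercepts
 * MPI_Send without modifying the application, so a relink is all that
 * is needed. The virtual-time handling shown here is illustrative. */
#include <mpi.h>

static double virtual_time = 0.0;   /* this rank's virtual clock */
static double last_real    = 0.0;   /* wall clock at the last MPI event */

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    last_real = MPI_Wtime();
    return rc;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    /* Compute time since the last MPI call ran unperturbed, so it is
     * simply added to virtual time. */
    virtual_time += MPI_Wtime() - last_real;

    /* The real message still goes out as before; a full simulator would
     * also attach the MPI envelope and virtual send time as an event. */
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);

    last_real = MPI_Wtime();
    return rc;
}
```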

  23. MPI Simulator: Virtual Time • Lamport’s idea of synchronizing virtual time between nodes using messages • Events carry MPI envelope information and the virtual send time • The algorithm adjusts local virtual time: • if the send time (Tx) plus the network delay is ahead of the receiver’s time (T1) • Tv = Tx + delay • Otherwise the message is assumed to have already arrived • Tv = T1
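
The adjustment rule reduces to taking the later of the receiver's current time and the message's modeled arrival time. The sketch below illustrates it with a hypothetical event structure; the names are not from the simulator.

```c
/* Sketch of the virtual-time adjustment rule on slide 23:
 *   Tv = max(T1, Tx + delay), in Lamport's style. */
#include <stdio.h>

typedef struct {
    double send_vtime;   /* Tx: sender's virtual time when the message left */
    double net_delay;    /* simulated network delay for this message */
} sim_event;

/* Returns the receiver's new virtual time Tv given its current time T1. */
static double adjust_virtual_time(double T1, const sim_event *ev)
{
    double arrival = ev->send_vtime + ev->net_delay;   /* Tx + delay */
    if (arrival > T1)
        return arrival;   /* message arrives in the receiver's future */
    return T1;            /* message was already here; keep T1 */
}

int main(void)
{
    sim_event ev = { 10.0, 2.5 };                          /* hypothetical message */
    printf("Tv = %g\n", adjust_virtual_time(8.0, &ev));    /* prints 12.5 */
    return 0;
}
```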

  24. Hybrid MPI Simulator • [Figures: node architecture, collective example, model validation with NAS benchmarks, model tuning]

  25. Future challenges • Data locality on chip • Impact of programming models • Accelerators
