Panel Discussion Presentation
Sandia CSRI Workshop on Next-generation Scalable Applications: When MPI-only is not enough
June 4, 2008
Kevin Pedretti, Scalable System Software Dept., Sandia National Laboratories, ktpedre@sandia.gov

System Architecture: Near, Medium, and Long-term Scalable Architectures

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Near Term
• Odds are good, but goods are odd...
• Multi-core, many-core, mega-core
• Heterogeneous ISAs, cores, systems
  • Accelerators: GPU, Cell, ClearSpeed, FPGA, etc.
  • Embedded: Tilera, SPI, Ambric (336-core), Tensilica
• Scalable architectures
  • Peak FLOPS is not the bottleneck
  • Improving per-socket efficiency on real applications is the "low-hanging fruit"
  • Decreasing memory size & bandwidth per core
• Symbiosis of architecture and system software
Near Term (Cont.)
• Adapting MPI implementations for the architecture
  • Shared-memory copies vs. NIC
  • Cache pollution, injection
  • Leverage hierarchy / intra-node locality
• Adapting MPI applications for the architecture (see the hybrid sketch after this list)
  • MPI + shared memory: LIBSM
  • MPI + something else for intra-node
    • OpenMP, Threading Building Blocks, ALF streaming, CUDA, RapidMind, PeakStream/Google, etc.
    • All incompatible, though some share similar concepts
• Adapting the architecture for MPI?
  • Leveraging interconnect capabilities for PGAS
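As a concrete illustration of the "MPI + something else for intra-node" option above, here is a minimal hybrid MPI + OpenMP sketch. It is not from the talk; the loop body, problem size, and rank/thread mapping are placeholder assumptions meant only to show the pattern of MPI across nodes with shared-memory threads within a rank.

```c
/* Minimal hybrid MPI + OpenMP sketch (assumption: an MPI library and OpenMP
 * support are available; compile with something like "mpicc -fopenmp"). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request a threading level that allows OpenMP inside each MPI rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* Intra-node parallelism: OpenMP threads share the rank's memory. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += (double)i * 1e-6;   /* placeholder work */

    /* Inter-node communication stays in MPI. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d  threads/rank=%d  sum=%g\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

Launched with, say, one rank per node (or per socket), the OpenMP threads exploit intra-node shared memory while MPI handles communication between nodes.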
OS Scalability
At 8192 nodes, CNL (2.0.44) is 49% worse than Catamount on this Partisn problem. It does not appear to be a bandwidth issue.
Task and Memory Placement
• No standard mechanisms; most applications punt and hope for the best
• Explicit vs. implicit mechanisms (see the placement sketch after this list)
• More important than node placement?
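A hedged sketch of the explicit side of that trade-off, using Linux CPU affinity plus first-touch allocation. The core number and buffer size are arbitrary assumptions, not from the talk.

```c
/* Sketch of explicit placement on Linux: pin the calling process to one core
 * with sched_setaffinity(), then rely on first-touch so memory lands on that
 * core's local NUMA node.  Core 0 and 512 MB are arbitrary examples. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                       /* pin to core 0 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* First-touch: since we are now bound to core 0, touching the pages here
     * places them on core 0's local memory under the default Linux policy. */
    size_t n = 64 * 1024 * 1024;             /* 64M doubles = 512 MB */
    double *buf = malloc(n * sizeof(double));
    if (!buf) return 1;
    memset(buf, 0, n * sizeof(double));

    printf("pinned to core 0, %zu MB touched locally\n",
           n * sizeof(double) >> 20);
    free(buf);
    return 0;
}
```

The implicit alternative is to let the scheduler and default NUMA policy decide, which is simpler for the application but can scatter tasks and their memory.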
Virtual Memory: Nice, but It Gets in the Way
[Plot legend: dashed lines = small pages, solid lines = large pages (dual-core Opteron); open shapes = existing logarithmic algorithm (Gibson/Bruck), solid shapes = new constant-time algorithm (Slepoy, Thompson, Plimpton); unexpected behavior due to the TLB.]
TLB misses increased with large pages, but the time to service a miss decreased dramatically (10x): the page table fits in L1, versus roughly 2 MB of page tables per GB of memory with small pages.
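For illustration only (not from the talk): one way an application can request large pages explicitly on Linux is mmap with MAP_HUGETLB, falling back to small pages if no huge pages have been reserved by the administrator. The 256 MB size is an arbitrary example.

```c
/* Sketch: request large (huge) pages explicitly via mmap(MAP_HUGETLB).
 * Assumes huge pages are reserved (e.g. /proc/sys/vm/nr_hugepages > 0). */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 256UL << 20;                /* 256 MB, arbitrary */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");         /* fall back to small pages */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
    }

    /* Touch the pages; with 2 MB pages this region needs far fewer TLB
     * entries than with 4 KB pages. */
    memset(p, 0, len);
    printf("mapped %zu MB\n", len >> 20);
    munmap(p, len);
    return 0;
}
```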
So, Is the Answer Large Pages?
• DRAM bank conflicts can be considerable, depending on data alignment
• OS-level and hardware mitigation strategies exist
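One common application-level mitigation, sketched below as an assumption rather than anything from the talk, is to pad array leading dimensions so that strided accesses do not land repeatedly on the same DRAM bank or cache set; with large pages, page-level interleaving no longer breaks up such power-of-two strides. Sizes here are illustrative.

```c
/* Sketch: pad the leading dimension of a 2-D array so column strides are not
 * an exact power of two, staggering accesses across DRAM banks/cache sets. */
#include <stdlib.h>

#define N    1024          /* logical matrix dimension (power of two) */
#define PAD  8             /* extra doubles per row to stagger alignment */

int main(void)
{
    size_t ld = N + PAD;                       /* padded leading dimension */
    double *a = malloc((size_t)N * ld * sizeof(double));
    if (!a) return 1;

    /* Column walk: the stride is ld * sizeof(double) bytes, which is no
     * longer a power of two thanks to the pad. */
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i * ld + j] = (double)(i + j);

    free(a);
    return 0;
}
```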
Medium Term
• More accelerators, normalization
  • Attractive power and memory efficiency
  • Commodity processors will integrate GPUs on-chip
  • HPC-centric off-chip accelerators
• General-purpose cores not getting much faster
  • Leverage architecture for specific app domains
• Some common mechanism will/must emerge for dealing with data-parallel accelerators
• General-purpose cores become more light-weight, better match for light-weight system software
• Chip stacking
• Off-chip optics
Long Term
• MPP-on-a-chip
• On- and off-chip optics
• More intelligent memory systems
• Application-driven architectures