160 likes | 259 Views
HPC Software Development at LLNL. Presented to College of St. Rose, Albany. Feb. 11, 2013. Todd Gamblin Cente r for Applied Scientific Computing. LLNL has some of the world’s largest supercomputers. Sequoia #1 in the world, June 2012 IBM Blue Gene/Q 96 racks, 98,304 nodes
E N D
HPC Software Development at LLNL Presented to College of St. Rose, Albany • Feb. 11, 2013 • Todd Gamblin • Center for Applied Scientific Computing
LLNL has some of the world’s largest supercomputers • Sequoia • #1 in the world, June 2012 • IBM Blue Gene/Q • 96 racks, 98,304 nodes • 1.5 million cores • 5-D Torus network • Transactional Memory • Runs lightweight, Linux-like OS • Login nodes are Power7, but compute nodes are PowerPC A2 cores. Requires cross-compiling.
LLNL has some of the world’s largest supercomputers • Zin • Intel Sandy Bridge • 2,916 16-core nodes • 45,656 processors • Infiniband Fat Treeinterconnect • Commodity parts • Runs TOSS, LLNL’s Red Hat Linux Distro
LLNL has some of the world’s largest supercomputers • Others • Almost 30 clusters total • See http://computing.llnl.gov
Supercomputers run very large-scale simulations NIF Target • Multi-physics simulations • Material Strength • Laser-Plasma Interaction • Quantum Chromodynamics • Fluid Dynamics • Lots of complicated numericalmethods for solving equations: • Adaptive Mesh Refinement (AMR) • Adaptive Multigrid • Unstructured Mesh • Structured Mesh Supernova AMR Fluid Interface
Structure of the Lab • Code teams • Work on physics applications • Larger code teams are 20+ people • Software developers • Applied mathematicians • Physicists • Work to meet milestones for lab missions
Structure of the Lab • Livermore Computing (LC) • Run supercomputing center • Development Environment Group • Works with application teams toimprove code performance • Knows about compilers, debuggers,performance tools • Develops performance tools • Software Development Group • Develops
Structure of the Lab • Center For Applied ScientificComputing (CASC) • Most CS Researchers are in CASC • Large groups doing: • Performance Analysis Tools • Power optimization • Resilience • Source-to-source Compilers • FPGAs and new architectures • Applied Math and numerical analysis
Performance Tools Research • Write software to measure theperformance of other software • Profiling • Tracing • Debugging • Visualization • Tools themselves need toperform well: • Parallel Algorithms • Scalability and low overhead are important
Development Environment • Application codes are written in many languages • Fortran, C, C++, Python • Some applications have been around for 50+ years • Tools are typically written in C/C++ • Tools typically run as part of an application • Need to be able to link with application environment • Non-parallel parts of tools are often in Python. • GUI • front-end scripts • some data analysis
We’ve started using Atlassiantools for collaboration • http://www.atlassian.com • Confluence Wiki • JIRA Bug Tracker • Stash git repo hosting • Several advantages for our distributed environment: • Scale to lots of users • Fine-grained permissions allow us to stay within our security model
Simple example: Measuring MPI Parallel Application • Parallel Applications use the MPI Library for communication • We want to measure time spent in MPI calls • Also interested in other metrics • Semantics, parameters, etc. • We write a lot of interposer libraries . . . Single Process
Simple example: Measuring MPI • Parallel Applications use the MPI Library for communication • We want to measure time spent in MPI calls • Also interested in other metrics • Semantics, parameters, etc. • We write a lot of interposer libraries
Example Interposer Code intMPI_Bcast(void *buffer, int count, MPI_Datatypedtype, int root, MPI_Commcomm) { double start = get_time_ns(); PMPI_Bcast(buffer, count, dtype, root, comm); double duration = get_time_ns() – start; record_time(MPI_Bcast, duration); } • This call intercepts calls from the application • It does its own measurement • Then it calls the MPI library • Allows us to measure time spent in particular routines
Another type of problem: communication optimization • See other slide set.