
User Environment Enhancements in the DoD HPC Modernization Program

7 April 2011. Steve Scherr, DoD HPCMP. Topics: Background: HPCMP Storage Initiative; Enhanced User Environment; HPC EUE Infrastructure; HPC Portal.


Presentation Transcript


  1. User Environment Enhancements in the DoD HPC Modernization Program • 7 April 2011 • Steve Scherr, DoD HPCMP

  2. Topics • Background: HPCMP Storage Initiative • Enhanced User Environment • HPC EUE Infrastructure • HPC Portal MB Revised: 5/4/2009

  3. HPC Modernization Program • Vision: A pervasive culture existing among DoD’s scientists and engineers where they routinely use advanced computational environments to solve the most demanding problems, transforming the way DoD does business: finding better solutions faster. • Mission: Accelerate development and transition of advanced defense technologies into superior warfighting capabilities by exploiting and strengthening US leadership in supercomputing, communications and computational modeling. MB Revised: 12/11/2009

  4. HPCMP Serves a Large, Diverse DoD User Community • FY11 statistics: 501 active projects with 4,408 users at 250 sites; 5,098 Habus* batch requirements • FY10 statistics (as of 9/30/2010): 496 projects with 4,345 users; 2,866 Habus* non-real-time requirements • * Requirements and usage measured in Habus • Customer Focus (users per computational technology area): • Computational Electromagnetics & Acoustics – 323 Users • Computational Fluid Dynamics – 1,223 Users • Electronics, Networking, and Systems/C4I – 211 Users • Computational Structural Mechanics – 465 Users • Environmental Quality Modeling & Simulation – 163 Users • Forces Modeling & Simulation – 235 Users • Computational Chemistry, Biology & Materials Science – 690 Users • Climate/Weather/Ocean Modeling & Simulation – 315 Users • Signal/Image Processing – 586 Users • Integrated Modeling & Test Environments – 105 Users • 92 users are self-characterized as “Other” • New CTA: Space and Astrophysical Science (SAS) • Source: Portal to the Information Environment – July 2010 MB Revised: 1/26/2011

  5. DoD Supercomputing Resource Centers (DSRCs): Six Large HPC Centers • DSRC systems support classified, unclassified and open computing capabilities • 17 large HPC systems • 1 system: 44,000+ cores • 6 systems: 10,000 to 22,000+ cores • 10 systems: 2,000 to 9,000+ cores • 1.873 peak PetaFlops • 4,750 Habus • Three new FY10 HPC systems • 773 TeraFlops • 2,251 Habus • 14 Petabytes single copy data storage • 28 Petabytes including Disaster Recovery • Connections to Customers • 212 locations MB Revised: 12/22/2010

  6. HPCMP Data Storage Growth • 43% increase over FY 2008 • 34% increase over FY 2009 MB Revised: 12/22/2010

  7. User View of HPCMP Storage • Computational results used in many different ways • Source for additional computation • Interrogated for post-processing • Archived for scientific value • Users are mobile within HPCMP • [Diagram labels: HPC A; HPC B; HPC File System; $WORKDIR short-term storage; Archive Server; Center Archive; Cache; Tape; DR Cache; DR Tape]

  8. Storage Lifecycle Management (SLM) Rationale • HPCMP can provide enough storage for NEW data • Centers support 2+ generations of storage media • Older media unreadable after tech obsolescence • Users: we can live with constraints & manage data • Need tools to manage data • Need intermediate-length storage

  9. Enhanced User Environment

  10. Evolving Enterprise Service Model • Remote Job Management • Batch • Data Management Tools – Metadata • Computational Infrastructure for Software Development (Tools / Environment) • Interactive Grid Generation • [Diagram layers: Customers; Services; Infrastructure] MB Revised: 8/27/2010

  11. HPC Enhanced User Environment Architecture • [Diagram labels: Single Point of Access; HPC System A; HPC System B; Temporary Storage (10 days); Center-wide ILM-managed File System (30 days); Storage Lifecycle Management; Center-wide Job Management; Archive Server; Local Tape Archive; Remote Disaster Recovery Facility; Utility Server; Data Analysis Services; Software Development Environment; Grid-Generation Capabilities; SLM Metadata Catalog Service; Metadata Replication Between all DSRCs; DR&E Portal; legend: Services, Compute, Storage] MB Revised: 12/22/2010

  12. HPC Enhanced User Environment • Interactive Computing • Single point of access • Center-wide job management • Remote data analysis • Center-wide filesystem • Medium-term storage • User-specified metadata • Data Management Tools • Insight into file archives • Program-wide visibility • HPC Portal • Supercharge the engineering desktop MB Revised: 8/3/2010

  13. HPC EUE Infrastructure

  14. Hardware Components • Center-wide File System: Panasas PAS 8 • 340 blades, 4 TB unformatted • Arista 7508 switch • Utility Server: Appro 1U Tetra, 88 nodes • 44 compute: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 128 GB memory • 22 large memory: 4 AMD Opteron 2.3 GHz CPUs, 32 cores, 256 GB memory • 22 graphics: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 256 GB memory, NVIDIA Tesla M2050
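
The Utility Server node counts on the hardware slide can be tallied with a short sketch. It assumes the core and memory figures quoted for each node class are per-node values (the slide does not say so explicitly); the `NodeClass` type and `totals` function are illustrative names, not part of any HPCMP tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeClass:
    """One class of Utility Server node, as listed on the slide."""
    name: str
    count: int
    cores_per_node: int        # assumption: slide figure is per node
    memory_gb_per_node: int    # assumption: slide figure is per node


NODE_CLASSES = [
    NodeClass("compute", 44, 16, 128),
    NodeClass("large memory", 22, 32, 256),
    NodeClass("graphics", 22, 16, 256),
]


def totals(classes):
    """Return (nodes, cores, memory in GB) summed over all node classes."""
    nodes = sum(c.count for c in classes)
    cores = sum(c.count * c.cores_per_node for c in classes)
    memory_gb = sum(c.count * c.memory_gb_per_node for c in classes)
    return nodes, cores, memory_gb


if __name__ == "__main__":
    nodes, cores, memory_gb = totals(NODE_CLASSES)
    print(f"{nodes} nodes, {cores} cores, {memory_gb} GB memory")
```

Under that per-node reading, the three classes add up to the 88 nodes stated on the slide.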

  15. System Configuration • $HOME • 10 GB quota • $WORKDIR • 200 TB • 100 TB user quota • Standard scrubbing • $CENTER • 800 TB • Possible user quota (200 TB) • 30-day scrub policy • SLM compatible • $ARCHIVE • Managed by SLM • Accessed through SLM • Center-wide Job Management • qsub, qstat, qdel • Resource Requests • PBS Pro
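
A user who wants to see which files fall under the 30-day scrub policy on $CENTER could use something like the sketch below. It is not the centers' scrubber: it assumes the policy is judged by file modification time, which the slide does not state, and `scrub_candidates` is a hypothetical helper name.

```python
import os
import time

SECONDS_PER_DAY = 86_400


def scrub_candidates(root, max_age_days, now=None):
    """List files under root whose modification time is older than max_age_days.

    Assumption: age is measured from st_mtime; a real scrubber may use
    access time or other criteria.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    center = os.environ.get("CENTER")  # set by the site environment, if at all
    if center:
        for path in scrub_candidates(center, max_age_days=30):
            print(path)
```

Running the same function with `max_age_days=10` would approximate the shorter window that the architecture slide shows for temporary storage.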

  16. Storage Lifecycle Management • Based on Nirvana SRB and SAM-QFS • Manages $ARCHIVE • Set metadata to specify retention period • Can register files on $CENTER -- target to automate registration by end 2011 • HPC access to $ARCHIVE through transfer queue • Also working PBS parameter mechanism – future just-in-time • Customer Experience workgroup developing auxiliary commands (Sdata) for user-defined metadata • Global visibility
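
The idea of setting metadata to specify a retention period can be illustrated with a small record type. This is only a sketch: it does not use the SRB or SAM-QFS interfaces, and the `ArchiveRecord` class, its field names, and the example path are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ArchiveRecord:
    """Hypothetical catalog entry for one archived file."""
    path: str
    registered_on: date
    retention_days: int                                  # user-specified retention period
    user_metadata: dict = field(default_factory=dict)    # free-form user-defined tags

    def expires_on(self):
        """Date on which the retention period ends."""
        return self.registered_on + timedelta(days=self.retention_days)

    def is_expired(self, today):
        """True once today is past the end of the retention period."""
        return today > self.expires_on()


if __name__ == "__main__":
    record = ArchiveRecord(
        path="project/run01/output.dat",   # illustrative path only
        registered_on=date(2011, 4, 7),
        retention_days=30,
        user_metadata={"project": "example"},
    )
    print(record.expires_on(), record.is_expired(date(2011, 6, 1)))
```

Keeping the retention period next to the user-defined tags mirrors the slide's point that users, not only administrators, decide how long data stays in $ARCHIVE.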

  17. HPC Portal

  18. HPC Desktop Portal Initiative • Goals • Enable DoD scientists and engineers to apply the power of HPC without being HPC experts • Provide access to HPC resources using current web technology—attract and retain new technology experts to DoD • Methods • Provide HPC Software as a Service over web with zero or minimal footprint • Provide common analysis tools enabled for seamless HPC use (MATLAB) • Provide accessible optimized tools for technology domains (CREATE, institutes) • Extension of desktop; interactive response • Single sign-on through CAC

  19. HPC Portal • Engaging with DoD engineering organizations • Understand their requirements and how we can support • Examining Cloud Computing Concepts • Software as a Service • Infrastructure as a Service • Phase 1: Parallel MATLAB capability • ARL lead, deliver in June • Built on Microsoft HPC Server • Additional available applications, FMS, CFD, etc. • Phase 2: Present CREATE capability • Identifying API, middleware, design framework

  20. HPC Modernization Program MB Revised: 11/23/2009

  21. Backup

  22. Storage Lifecycle Management • Layered Software Capability • Information Lifecycle Management • Metadata – user and system defined • Policies – drive HSM • Reporting • Hierarchical Storage Management • Tiered Storage • Disaster Recovery • Multi-system, multi-center • Assign metadata attributes from all HPC systems • Work toward “shared” files between centers

  23. Storage Lifecycle Management • Information Lifecycle Management: provide capability to users and administrators; control costs • Hierarchical Storage Management: based on ILM information; includes disaster recovery • Common user interface • Work toward shared files

  24. ILM Requirements • Metadata attributes • User-assignable • System-assignable • Defaults • Tools and Reports • Enable management of data files • Policies • Based on attributes • Used to drive HSM
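
The requirement that policies are based on attributes and used to drive HSM can be sketched as an ordered rule list. Everything named here is an assumption made for illustration: the attribute names (`age_days`, `retention_days`, `dr_copy`), the action strings, the 30-day threshold, and the choice that user-assigned values override system-assigned values and defaults.

```python
# Hypothetical defaults applied when neither system nor user sets a value.
DEFAULT_ATTRIBUTES = {"retention_days": 30, "dr_copy": False}


def effective_attributes(system_attrs, user_attrs):
    """Merge defaults, then system-assigned, then user-assigned attributes.

    Assumption: later sources override earlier ones.
    """
    merged = dict(DEFAULT_ATTRIBUTES)
    merged.update(system_attrs)
    merged.update(user_attrs)
    return merged


# Ordered rules: the first predicate that matches decides the HSM action.
POLICIES = [
    (lambda a: a["age_days"] > a["retention_days"], "flag for expiry review"),
    (lambda a: a["age_days"] > 30, "migrate to tape"),
]


def hsm_action(attrs):
    """Return the action of the first matching policy, or a default."""
    for predicate, action in POLICIES:
        if predicate(attrs):
            return action
    return "keep on disk"


if __name__ == "__main__":
    attrs = effective_attributes({"age_days": 45}, {"retention_days": 365})
    print(hsm_action(attrs))
```

Expressing policies as data rather than code paths is one way to let administrators change HSM behaviour without touching the tools that set attributes.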

  25. ILM Attribute Requirements • Associated with all objects • Arbitrary number, size, type • Attribute permissions separate from underlying files • System read/write • Creator/Owner read/write • Collections of other users • Inheritance or default-setting at creation • Settable via templates or functions • ILM must scale to 1B files today • No impact on I/O performance for HSM • Attributes can be output textually

  26. ILM Tool Requirements • Tools for manipulating files under ILM control • Attribute-aware • Attribute-preserving • Operate on files, directories, or lists of objects • Create/modify attributes • Reports • Based on multiple criteria, attribute values • Status of pending operations • Consistent with attribute permissions

  27. HPCMP Storage Initiative • Computing power grows annually—so do stored files • Archived data is hard for users to use and manage • Costs: User time, labor, hardware, software and media • Storage Initiative • Objective: Refresh to manage data for next 10 years • Goals: 10-year architecture • Leverage advances in technology • Improve user productivity • Improve reliability & adaptability • Sustain within current storage budget MB Revised: 5/4/2009

  28. HPCMP Data Storage Growth: Single Copy Data Storage • Impact of 16x growth in eight years • Data Analysis • Data Locality and Movement • Data Duplication • Disaster Recovery • Network Loading • Storage Technologies • [Chart annotation: 22 x] MB Revised: 12/22/2010
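
The "16x growth in eight years" figure implies an average yearly rate, which a few lines of arithmetic make explicit. The sketch assumes steady compound growth, a simplification the slides do not claim; `annual_growth_rate` is an illustrative name.

```python
def annual_growth_rate(total_factor, years):
    """Average compound yearly growth rate implied by a total growth factor."""
    return total_factor ** (1.0 / years) - 1.0


if __name__ == "__main__":
    rate = annual_growth_rate(16, 8)   # 16 ** (1/8) is the square root of 2
    print(f"{rate:.1%} per year")      # prints "41.4% per year"
```

An average near 41% per year is in line with the yearly increases of 43% and 34% quoted on the earlier storage growth slide.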

  29. HPC Enhanced User Environment (HEUE) • Purpose • Provide computational scientists more tools and capabilities to perform research more efficiently and effectively • Benefit • Decrease time-to-solution, increase S&E productivity and analytical power, reduce future costs of data archive • Tasks • Storage lifecycle management implementation • Metadata for file management and identification • Program-wide datafile visibility and access • Center-wide filesystem: efficient storage for data analysis and extraction • Center-wide job management: single point-of-access, increase user productivity • Remote visualization for large datasets • Web-based access to HPC capability MB Revised: 12/22/2010

  30. Requested Software • System Software: • PBS Pro, OpenMPI • InfiniBand Software Stack • NVIDIA Linux x86_64 driver set • Compliance with BCT policies • Development Tools: • PGI Compiler Suite (C/C++/Fortran) • GNU Compiler Suite & debugger • TotalView debugger • NVIDIA GPGPU development environment (OpenCL and CUDA) • Common Set of Open Source Utilities • BC policy: PAPI, SCALASCA, TAU, PDT, Valgrind • DDT and DDT with CUDA debugger • Data Analysis Tools: • CEI – EnSight Suite • Intelligent Light – FieldView • RSI, Inc. – IDL • MathWorks – MATLAB • NCAR Graphics Library • Kitware – ParaView • Tecplot, Inc. – Tecplot • VisIt Visualization Tool • Computational Science Environment (CSE) • ezVIZ

  31. Requested Software • Pre/Post Processing Software: • ANSYS CFD • Abaqus • LS-PrePost • Parasolid Designer (pre) • Pointwise – Gridgen • Math Libraries: • ARPACK, FFTW, PETSc, SuperLU, LAPACK, ScaLAPACK, BLAS, ATLAS, GotoBLAS, SPRNG, GSL • New: • Pipeline Pilot (Accelrys product) – automation of the process of predicting compute intensity on the fly and submitting jobs to the US • Isight (DSS product) – design optimization & process integration (some portions are interactive & some are for batch processing) • Secure Remote Visualization: • PKI-VNC • Longhorn
