User Environment Enhancements in the DoD HPC Modernization Program
7 April 2011
Steve Scherr, DoD HPCMP
Topics
• Background: HPCMP Storage Initiative
• Enhanced User Environment
• HPC EUE Infrastructure
• HPC Portal
HPC Modernization Program
Vision: A pervasive culture among DoD's scientists and engineers in which they routinely use advanced computational environments to solve the most demanding problems, transforming the way DoD does business: finding better solutions faster.
Mission: Accelerate the development and transition of advanced defense technologies into superior warfighting capabilities by exploiting and strengthening US leadership in supercomputing, communications, and computational modeling.
HPCMP Serves a Large, Diverse DoD User Community
FY11 statistics
• 501 active projects with 4,408 users at 250 sites
• 5,098 Habus* batch requirements
FY10 statistics (as of 9/30/2010)
• 496 projects with 4,345 users
• 2,866 Habus* non-real-time requirements
* Requirements and usage measured in Habus
Customer focus (users per computational technology area, CTA)
• Computational Electromagnetics & Acoustics – 323
• Computational Fluid Dynamics – 1,223
• Electronics, Networking, and Systems/C4I – 211
• Computational Structural Mechanics – 465
• Environmental Quality Modeling & Simulation – 163
• Forces Modeling & Simulation – 235
• Computational Chemistry, Biology & Materials Science – 690
• Climate/Weather/Ocean Modeling & Simulation – 315
• Signal/Image Processing – 586
• Integrated Modeling & Test Environments – 105
• Other (self-characterized) – 92
New CTA: Space and Astrophysical Science (SAS)
Source: Portal to the Information Environment – July 2010
DoD Supercomputing Resource Centers (DSRCs): Six Large HPC Centers
• DSRC systems support classified, unclassified, and open computing capabilities
• 17 large HPC systems
  • 1 system – 44,000+ cores
  • 6 systems – 10,000 to 22,000+ cores
  • 10 systems – 2,000 to 9,000+ cores
• 1.873 peak PetaFlops
• 4,750 Habus
• Three new FY10 HPC systems
  • 773 TeraFlops
  • 2,251 Habus
• 14 Petabytes single-copy data storage (28 Petabytes including disaster recovery)
• Connections to customers at 212 locations
HPCMP Data Storage Growth
• 43% increase over FY 2008
• 34% increase over FY 2009
User View of HPCMP Storage
• Computational results are used in many different ways
  • Source for additional computation
  • Interrogated for post-processing
  • Archived for scientific value
• Users are mobile within the HPCMP
[Diagram: each HPC system's file system provides $WORKDIR short-term storage; a center archive server with disk cache and tape serves all systems and is replicated to a disaster-recovery (DR) cache and DR tape]
Storage Lifecycle Management (SLM) Rationale
• HPCMP can provide enough storage for NEW data
• Centers support 2+ generations of storage media; older media become unreadable as the technology obsolesces
• Users: we can live with constraints and manage our data, but we need tools to manage data and intermediate-length storage
Evolving Enterprise Service Model
[Diagram: services delivered to customers atop shared infrastructure]
• Remote job management – batch and interactive
• Data management tools – metadata
• Computational infrastructure for software development (tools/environment)
• Grid generation
HPC Enhanced User Environment Architecture
[Diagram: a single point of access fronting center-wide compute and storage services]
• HPC systems A and B, each with temporary storage (10 days)
• Center-wide ILM-managed file system (30 days) under Storage Lifecycle Management
• Archive server with local tape archive, backed by a remote disaster recovery facility
• SLM metadata catalog service, with metadata replication between all DSRCs
• Utility server providing center-wide job management, data analysis services, a software development environment, and grid-generation capabilities
• DR&E Portal
HPC Enhanced User Environment
• Interactive computing
• Single point of access
• Center-wide job management
• Remote data analysis
• Center-wide filesystem
• Medium-term storage
• User-specified metadata
• Data management tools: insight into file archives, program-wide visibility
• HPC Portal: supercharge the engineering desktop
Hardware Components
• Center-wide file system: Panasas PAS 8
  • 340 blades, 4 TB unformatted
  • Arista 7508 switch
• Utility server: Appro 1U Tetra, 88 nodes
  • 44 compute nodes: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 128 GB memory
  • 22 large-memory nodes: 4 AMD Opteron 2.3 GHz CPUs, 32 cores, 256 GB memory
  • 22 graphics nodes: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 256 GB memory, NVIDIA Tesla M2050
System Configuration
• $HOME: 10 GB quota
• $WORKDIR: 200 TB; 100 TB user quota; standard scrubbing
• $CENTER: 800 TB; possible user quota (200 TB); 30-day scrub policy; SLM compatible
• $ARCHIVE: managed by and accessed through SLM
• Center-wide job management: PBS Pro resource requests via qsub, qstat, qdel (see the job sketch below)
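A minimal sketch of a center-wide batch job under this configuration. The PBS directives are standard PBS Pro syntax, but the queue name, project ID, and application are placeholders, not values from the source:

    #!/bin/bash
    #PBS -q standard                       # queue name is a placeholder
    #PBS -A MY_PROJECT                     # project/account ID is a placeholder
    #PBS -l select=2:ncpus=16:mpiprocs=16  # request 2 nodes with 16 cores each
    #PBS -l walltime=04:00:00

    cd $WORKDIR/run42                      # run in short-term scratch (subject to scrubbing)
    mpirun ./my_solver input.dat           # solver name is a placeholder
    cp results.dat $CENTER/run42/          # stage output to the 30-day center-wide filesystem

The job is submitted with qsub, monitored with qstat, and cancelled with qdel, regardless of which system it lands on.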
Storage Lifecycle Management
• Based on Nirvana SRB and SAM-QFS
• Manages $ARCHIVE
• Set metadata to specify retention period
• Can register files on $CENTER; target is to automate registration by end of 2011
• HPC systems access $ARCHIVE through a transfer queue (sketched below)
• A PBS parameter mechanism is also in work, with just-in-time staging planned for the future
• Customer Experience workgroup is developing auxiliary commands (Sdata) for user-defined metadata
• Global visibility
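A hedged sketch of moving results into $ARCHIVE through a transfer queue. The queue name and paths are assumptions, and the Sdata invocation is hypothetical, since that command was still being developed:

    #!/bin/bash
    #PBS -q transfer                 # dedicated data-movement queue; name is an assumption
    #PBS -A MY_PROJECT               # placeholder project ID
    #PBS -l walltime=01:00:00

    # copy a completed run from scratch into the SLM-managed archive
    cp $WORKDIR/run42/results.tar $ARCHIVE/run42/

    # hypothetical Sdata usage: attach a user-defined retention attribute
    # Sdata set retention=5y $ARCHIVE/run42/results.tar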
HPC Desktop Portal Initiative
Goals
• Enable DoD scientists and engineers to apply the power of HPC without being HPC experts
• Provide access to HPC resources using current web technology, attracting and retaining new technology experts for DoD
Methods
• Provide HPC Software as a Service over the web with zero or minimal footprint
• Provide common analysis tools enabled for seamless HPC use (MATLAB)
• Provide accessible, optimized tools for technology domains (CREATE, institutes)
• Extension of the desktop; interactive response
• Single sign-on through CAC
HPC Portal
• Engaging with DoD engineering organizations to understand their requirements and how we can support them
• Examining cloud computing concepts: Software as a Service, Infrastructure as a Service
• Phase 1: parallel MATLAB capability
  • ARL lead; deliver in June
  • Built on Microsoft HPC Server
  • Additional available applications: FMS, CFD, etc.
• Phase 2: present CREATE capability
  • Identifying API, middleware, and design framework
Storage Lifecycle Management
• Layered software capability
  • Information Lifecycle Management (ILM)
    • Metadata – user and system defined
    • Policies – drive HSM
    • Reporting
  • Hierarchical Storage Management (HSM)
    • Tiered storage
    • Disaster recovery
• Multi-system, multi-center
  • Assign metadata attributes from all HPC systems
  • Work toward "shared" files between centers
Storage Lifecycle Management
• Information Lifecycle Management
  • Provide capability to users and administrators
  • Control costs
• Hierarchical Storage Management
  • Based on ILM information
  • Includes disaster recovery
• Common user interface
• Work toward shared files
ILM Requirements
• Metadata attributes: user-assignable, system-assignable, defaults
• Tools and reports: enable management of data files
• Policies: based on attributes; used to drive HSM
ILM Attribute Requirements
• Associated with all objects
• Arbitrary number, size, and type
• Attribute permissions separate from underlying files
  • System read/write
  • Creator/owner read/write
  • Collections of other users
• Inheritance or default-setting at creation
• Settable via templates or functions
• ILM must scale to 1 billion files today
• No impact on I/O performance for HSM
• Attributes can be output textually
ILM Tool Requirements
• Tools for manipulating files under ILM control (see the hypothetical sketch after this list)
  • Attribute-aware and attribute-preserving
  • Operate on files, directories, or lists of objects
  • Create/modify attributes
• Reports
  • Based on multiple criteria and attribute values
  • Status of pending operations
  • Consistent with attribute permissions
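A hypothetical command-line sketch of what these tool requirements might look like in practice; the command names (ilm-attr, ilm-report) and options are invented for illustration and do not appear in the source:

    # set user-assignable attributes, with owner read/write permission
    ilm-attr set --perm owner:rw project=CFD-042 retention=5y results.dat

    # output a file's attributes textually
    ilm-attr get results.dat

    # attribute-driven report across multiple criteria, including pending HSM operations
    ilm-report --where "retention < 1y" --status pending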
HPCMP Storage Initiative
• Computing power grows annually, and so do stored files
• Archived data is hard for users to use and manage
• Costs: user time, labor, hardware, software, and media
• Storage Initiative
  • Objective: refresh storage to manage data for the next 10 years
  • Goals for the 10-year architecture:
    • Leverage advances in technology
    • Improve user productivity
    • Improve reliability and adaptability
    • Sustain within the current storage budget
HPCMP Data Storage Growth: Single-Copy Data Storage
Impact of 16x growth in eight years (chart annotation: 22x):
• Data analysis
• Data locality and movement
• Data duplication
• Disaster recovery
• Network loading
• Storage technologies
HPC Enhanced User Environment (HEUE)
Purpose
• Provide computational scientists more tools and capabilities to perform research more efficiently and effectively
Benefit
• Decrease time-to-solution, increase S&E productivity and analytical power, and reduce future costs of the data archive
Tasks
• Storage lifecycle management implementation
• Metadata for file management and identification
• Program-wide datafile visibility and access
• Center-wide filesystem: efficient storage for data analysis and extraction
• Center-wide job management: single point of access to increase user productivity
• Remote visualization for large datasets
• Web-based access to HPC capability
Requested Software
System software
• PBS Pro, OpenMPI
• InfiniBand software stack
• NVIDIA Linux x86_64 driver set
• Compliance with BCT policies
Development tools
• PGI Compiler Suite (C/C++/Fortran)
• GNU compiler suite and debugger
• TotalView debugger
• NVIDIA GPGPU development environment (OpenCL and CUDA)
• Common set of open-source utilities
• BC policy: PAPI, SCALASCA, TAU, PDT, Valgrind
• DDT and DDT with CUDA debugger
Data analysis tools
• CEI – EnSight suite
• Intelligent Light – FieldView
• RSI, Inc. – IDL
• MathWorks – MATLAB
• NCAR Graphics Library
• Kitware – ParaView
• Tecplot, Inc. – Tecplot
• VisIt visualization tool
• Computational Science Environment (CSE)
• ezVIZ
Requested Software (continued)
Pre/post-processing software
• ANSYS CFD
• Abaqus
• LS-PrePost
• Parasolid Designer (pre)
• Pointwise – Gridgen
Math libraries
• ARPACK, FFTW, PETSc, SuperLU, LAPACK, ScaLAPACK, BLAS, ATLAS, GotoBLAS, SPRNG, GSL
New
• Pipeline Pilot (Accelrys product) – automates predicting compute intensity on the fly and submitting jobs to the utility server
• Isight (DSS product) – design optimization and process integration (some portions interactive, some batch)
Secure remote visualization
• PKI-VNC (a generic tunneling sketch follows)
• Longhorn
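For context, a generic illustration of tunneled remote visualization over SSH. This is not the PKI-VNC or Longhorn implementation itself; the host name, port, and viewer are placeholders:

    # forward a local port to a VNC server running on a remote visualization node
    ssh -L 5901:localhost:5901 user@us.dsrc.example.mil

    # in another terminal, connect the VNC client through the encrypted tunnel
    vncviewer localhost:5901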