NCAR’s Data-Centric Supercomputing Environment: Yellowstone
December 21, 2011
Anke Kamrath, OSD Director/CISL
anke@ucar.edu
Overview
• Strategy
  • Moving from process-centric to data-centric computing
• HPC/Data Architecture
  • What we have today at ML
  • What’s planned for NWSC
• Storage/Data/Networking – Data in Flight
  • WAN and high-performance LAN
  • Central filesystem plans
  • Archival plans
NWSC Now Open!
Evolving the Scientific Workflow
• Common data movement issues
  • Moving data between systems is time consuming
  • Bandwidth to the archive system is insufficient
  • Lack of sufficient disk space
• Need to evolve data management techniques
  • Workflow management systems
  • Standardize metadata information
  • Reduce/eliminate duplication of datasets (e.g., CMIP5) (see the sketch below)
• User education
  • Effective methods for understanding workflow
  • Effective methods for streamlining workflow
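To ground the metadata and de-duplication bullets, here is a minimal sketch of attaching a standardized metadata record (including a checksum) to a dataset so duplicate copies can be recognized. The field names and sidecar-file convention are hypothetical illustrations, not an NCAR or CMIP5 standard.

```python
import hashlib
import json
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def write_sidecar(path, project, description):
    """Write a small metadata sidecar next to a dataset (fields are illustrative)."""
    p = Path(path)
    meta = {
        "file": p.name,
        "size_bytes": p.stat().st_size,
        "sha256": sha256sum(p),
        "project": project,          # e.g. an allocation or experiment name
        "description": description,  # free-text provenance note
    }
    sidecar = p.with_name(p.name + ".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return meta

# Datasets whose sidecars share a checksum are candidates for
# de-duplication rather than a second copy on disk or in the archive.
```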
Traditional Workflow: Process-Centric Data Model
Evolving Scientific Workflow: Information-Centric Data Model
Current Resources @ Mesa Lab: GLADE, Bluefire, Lynx
NWSC: Yellowstone Environment
• Yellowstone – HPC resource, 1.55 PFLOPS peak
• GLADE – central disk resource, 11 PB (2012), 16.4 PB (2014)
• Geyser & Caldera – DAV clusters
• NCAR HPSS Archive – 100 PB capacity, ~15 PB/yr growth
• High-bandwidth, low-latency HPC and I/O networks – FDR InfiniBand and 10Gb Ethernet
• 1Gb/10Gb Ethernet (40Gb+ future) connecting Data Transfer Services, Science Gateways (RDA, ESG), Partner Sites, Remote Vis, and XSEDE Sites
NWSC-1 Resources in a Nutshell
• Centralized Filesystems and Data Storage Resource (GLADE)
  • >90 GB/sec aggregate I/O bandwidth, GPFS filesystems
  • 10.9 PetaBytes initially -> 16.4 PetaBytes in 1Q2014
• High-Performance Computing Resource (Yellowstone)
  • IBM iDataPlex cluster with Intel Sandy Bridge EP† processors with Advanced Vector Extensions (AVX)
  • 1.552 PetaFLOPs peak – 29.8 Bluefire-equivalents – 4,662 nodes – 74,592 cores (see the check below)
  • 149.2 TeraBytes total memory
  • Mellanox FDR InfiniBand full fat-tree interconnect
• Data Analysis and Visualization Resource (Geyser & Caldera)
  • Large-memory system with Intel Westmere EX processors
    • 16 nodes, 640 cores, 16 TeraBytes memory, 16 NVIDIA Kepler GPUs
  • GPU-computation/vis system with Intel Sandy Bridge EP processors with AVX
    • 16 nodes, 128 SB cores, 1 TeraByte memory, 32 NVIDIA Kepler GPUs
  • Knights Corner system with Intel Sandy Bridge EP processors with AVX
    • 16 nodes, 128 SB cores, 992 KC cores, 1 TeraByte memory – Nov ’12 delivery
†“Sandy Bridge EP” is the Intel® Xeon® E5-2670 processor
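The headline Yellowstone numbers above are internally consistent; a minimal sketch that reproduces them, assuming 8 double-precision FLOPs per clock per Sandy Bridge EP core with AVX (an assumption about the peak-rate accounting, not stated on the slide) and decimal units:

```python
# Consistency check for the Yellowstone figures quoted above.
nodes = 4662
cores_per_node = 16
clock_hz = 2.6e9
flops_per_cycle = 8          # assumed: 8 DP FLOPs/cycle/core with AVX

cores = nodes * cores_per_node                             # 74,592
peak_pflops = cores * clock_hz * flops_per_cycle / 1e15    # ~1.552
memory_tb = nodes * 32 / 1000                              # 32 GB/node -> ~149.2 TB

print(cores, round(peak_pflops, 3), round(memory_tb, 1))
# -> 74592 1.552 149.2
```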
GLADE
• 10.94 PB usable capacity initially; 16.42 PB usable (1Q2014) (see the capacity sketch below)
• Estimated initial file system sizes
  • collections ≈ 2 PB – RDA, CMIP5 data
  • scratch ≈ 5 PB – shared, temporary space
  • projects ≈ 3 PB – long-term, allocated space
  • users ≈ 1 PB – medium-term work space
• Disk storage subsystem
  • 76 IBM DCS3700 controllers & expansion drawers
  • 90 2-TB NL-SAS drives per controller
  • add 30 3-TB NL-SAS drives per controller (1Q2014)
• GPFS NSD servers
  • 91.8 GB/s aggregate I/O bandwidth; 19 IBM x3650 M4 nodes
• I/O aggregator servers (GPFS, GLADE–HPSS connectivity)
  • 10-GbE & FDR interfaces; 4 IBM x3650 M4 nodes
• High-performance I/O interconnect to HPC & DAV
  • Mellanox FDR InfiniBand full fat-tree
  • 13.6 GB/s bidirectional bandwidth per node
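The usable-capacity figures follow from the drive counts if roughly 80% of raw capacity survives parity overhead; that 0.8 factor is an assumption here (e.g. an 8+2 RAID-6 layout), not something stated on the slide:

```python
# GLADE capacity check (decimal TB/PB; the 0.8 raw->usable factor is an
# assumption, e.g. 8+2 RAID-6 parity overhead).
controllers = 76
raw_2012 = controllers * 90 * 2          # 90 x 2 TB drives/controller -> 13,680 TB raw
raw_2014_add = controllers * 30 * 3      # 30 x 3 TB drives/controller ->  6,840 TB raw

usable_2012 = raw_2012 * 0.8 / 1000                      # ~10.94 PB
usable_2014 = (raw_2012 + raw_2014_add) * 0.8 / 1000     # ~16.42 PB
print(round(usable_2012, 2), round(usable_2014, 2))
# -> 10.94 16.42
```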
Yellowstone – NWSC High-Performance Computing Resource
• Batch computation
  • 4,662 IBM dx360 M4 nodes – 16 cores, 32 GB memory per node
  • Intel Sandy Bridge EP processors with AVX – 2.6 GHz clock
  • 74,592 cores total – 1.552 PFLOPs peak
  • 149.2 TB total DDR3-1600 memory
  • 29.8 Bluefire equivalents
• High-performance interconnect (see the bisection-bandwidth check below)
  • Mellanox FDR InfiniBand full fat-tree
  • 13.6 GB/s bidirectional bandwidth per node
  • <2.5 µs latency (worst case)
  • 31.7 TB/s bisection bandwidth
• Login/interactive
  • 6 IBM x3650 M4 nodes; Intel Sandy Bridge EP processors with AVX
  • 16 cores & 128 GB memory per node
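For a full (non-blocking) fat-tree, bisection bandwidth is roughly half the node count times the per-node link bandwidth, which reproduces the 31.7 TB/s figure quoted above:

```python
# Bisection bandwidth of a non-blocking fat-tree: half the nodes can
# stream at full rate across the bisection to the other half.
nodes = 4662
per_node_gbs = 13.6                      # bidirectional FDR bandwidth per node
bisection_tbs = (nodes / 2) * per_node_gbs / 1000
print(round(bisection_tbs, 1))           # -> 31.7
```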
NCAR HPC Profile (chart): ~30x Bluefire performance
Geyser and Caldera – NWSC Data Analysis & Visualization Resource
• Geyser: large-memory system
  • 16 IBM x3850 nodes – Intel Westmere-EX processors
  • 40 cores, 1 TB memory, 1 NVIDIA Kepler Q13H-3 GPU per node
  • Mellanox FDR full fat-tree interconnect
• Caldera: GPU computation/visualization system
  • 16 IBM x360 M4 nodes – Intel Sandy Bridge EP/AVX
  • 16 cores, 64 GB memory, 2 NVIDIA Kepler Q13H-3 GPUs per node
  • Mellanox FDR full fat-tree interconnect
• Knights Corner system (November 2012 delivery)
  • Intel Many Integrated Core (MIC) architecture
  • 16 IBM Knights Corner nodes
  • 16 Sandy Bridge EP/AVX cores, 64 GB memory, 1 Knights Corner adapter per node
  • Mellanox FDR full fat-tree interconnect
Erebus – Antarctic Mesoscale Prediction System (AMPS)
• IBM iDataPlex compute cluster
  • 84 IBM dx360 M4 nodes; 16 cores, 32 GB memory per node
  • Intel Sandy Bridge EP; 2.6 GHz clock
  • 1,344 cores total – 28 TFLOPs peak
  • Mellanox FDR InfiniBand full fat-tree
  • 0.54 Bluefire equivalents
• Login nodes
  • 2 IBM x3650 M4 nodes
  • 16 cores & 128 GB memory per node
• Dedicated GPFS filesystem
  • 57.6 TB usable disk storage
  • 9.6 GB/sec aggregate I/O bandwidth
Erebus, on Ross Island, is Antarctica’s most famous volcanic peak and is one of the largest volcanoes in the world – within the top 20 in total size and reaching a height of 12,450 feet.
Yellowstone Software
• Compilers, libraries, debuggers & performance tools
  • Intel Cluster Studio (Fortran, C++, performance & MPI libraries, trace collector & analyzer) – 50 concurrent users
  • Intel VTune Amplifier XE performance optimizer – 2 concurrent users
  • PGI CDK (Fortran, C, C++, pgdbg debugger, pgprof) – 50 concurrent users
  • PGI CDK GPU version (Fortran, C, C++, pgdbg debugger, pgprof) – DAV systems only, 2 concurrent users
  • PathScale EKOPath (Fortran, C, C++, PathDB debugger) – 20 concurrent users
  • Rogue Wave TotalView debugger – 8,192 floating tokens
  • IBM Parallel Environment (POE), including IBM HPC Toolkit
• System software
  • LSF-HPC batch subsystem / resource manager (a minimal submission sketch follows below)
    • IBM has purchased Platform Computing, Inc. (developers of LSF-HPC)
  • Red Hat Enterprise Linux (RHEL) Version 6
  • IBM General Parallel Filesystem (GPFS)
  • Mellanox Universal Fabric Manager
  • IBM xCAT cluster administration toolkit
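To make the batch subsystem concrete, here is a minimal sketch of an LSF submission driven from Python. The queue name, resource string, launcher, and executable name are illustrative placeholders, not Yellowstone’s actual settings.

```python
# Minimal sketch of an LSF job submission; queue/launcher/executable names
# are placeholders and per-site policies will differ.
import subprocess

job_script = """#!/bin/bash
#BSUB -J demo_job              # job name
#BSUB -n 32                    # total MPI ranks
#BSUB -R "span[ptile=16]"      # 16 ranks per node (one per core)
#BSUB -W 00:30                 # wall-clock limit (hh:mm)
#BSUB -q regular               # queue name (placeholder)
#BSUB -o demo_job.%J.out       # stdout, %J expands to the job ID
#BSUB -e demo_job.%J.err       # stderr

mpirun.lsf ./my_model          # hypothetical executable under an LSF MPI launcher
"""

# bsub reads the job script (and its #BSUB directives) from standard input.
subprocess.run(["bsub"], input=job_script, text=True, check=True)
```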
NCAR HPSS Archive Resource
• NWSC
  • Two SL8500 robotic libraries (20K-cartridge capacity)
  • 26 T10000C tape drives (240 MB/sec I/O rate each) and T10000C media (5 TB/cartridge, uncompressed) initially; +20 T10000C drives ~Nov 2012
  • >100 PB capacity (see the check below)
  • Current growth rate ~3.8 PB/year
  • Anticipated NWSC growth rate ~15 PB/year
• Mesa Lab
  • Two SL8500 robotic libraries (15K-cartridge capacity)
  • Existing data (14.5 PB): 1st & 2nd copies will be ‘oozed’ to new media at NWSC, beginning in 2012
  • New data at Mesa: disaster-recovery data only
  • T10000B drives & media to be retired
  • No plans to move the Mesa Lab SL8500 libraries (more costly to move than to buy new under the AMSTAR subcontract)
• Plan to release an “AMSTAR-2” RFP in 1Q2013, targeting first equipment delivery in 1Q2014 to further augment the NCAR HPSS Archive
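The >100 PB figure follows directly from the library and media specs above; a quick check using uncompressed capacity and decimal units:

```python
# NWSC archive capacity and aggregate drive bandwidth from the specs above.
cartridges = 20_000                 # two SL8500 libraries
tb_per_cartridge = 5                # T10000C, uncompressed
capacity_pb = cartridges * tb_per_cartridge / 1000
print(capacity_pb)                  # -> 100.0 PB

drives_initial, drives_late_2012 = 26, 26 + 20
mb_per_sec = 240
print(drives_initial * mb_per_sec / 1000,      # ~6.2 GB/s aggregate initially
      drives_late_2012 * mb_per_sec / 1000)    # ~11.0 GB/s after ~Nov 2012
```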
Yellowstone allocations (% of resource)
NCAR’s 29% represents 170 million core-hours per year for Yellowstone alone (compared to less than 10 million per year on Bluefire), plus a similar fraction of the DAV and GLADE resources.
Data in Flight to NWSC
• NWSC networking
• Central filesystem
  • Migrating from GLADE-ML to GLADE-NWSC
• Archive
  • Migrating HPSS data from ML to NWSC
BiSON to NWSC
• Initially, three 10G circuits active (see the capacity sketch below)
  • Two 10G connections back to Mesa Lab for internal traffic
  • One 10G direct to FRGP for general Internet2 / NLR / research-and-education traffic
• Options for dedicated 10G connections for high-performance computing to other BiSON members
• The system is engineered for 40 individual lambdas
  • Each lambda can be a 10G, 40G, or 100G connection
  • Independent lambdas can be sent in each direction around the ring (two ADVA shelves at NWSC – one for each direction)
• With a major upgrade, the system could support 80 lambdas
  • 100 Gbps × 80 channels × 2 paths = 16 Tbps
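A quick comparison of what is lit initially versus the engineered ceiling quoted above (the initial figure is simply the three 10G circuits):

```python
# BiSON capacity: active circuits initially vs. the engineered ceiling.
active_gbps = 3 * 10                       # three 10G circuits at turn-up
engineered_tbps = 100 * 80 * 2 / 1000      # 100 Gbps x 80 lambdas x 2 ring paths
print(active_gbps, engineered_tbps)        # -> 30 (Gbps), 16.0 (Tbps)
```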
High-Performance LAN
• Data Center Networking (DCN)
  • High-speed data center networking with 1G and 10G client-facing ports for the supercomputer, mass storage, and other data center components
  • Redundant design (e.g., multiple chassis and separate module connections)
  • Future option for 100G interfaces
• Juniper awarded the RFP after NSF approval
  • Access switches: QFX3500 series
  • Network core and WAN edge: EX8200 series switch/routers
  • Includes spares for lab/testing purposes
• Juniper training for NETS staff in early January 2012
• Deploy Juniper equipment in late January 2012
Migrating GLADE Data
• Temporary work spaces (/glade/scratch, /glade/user)
  • No data will be moved automatically to NWSC
• Allocated project spaces (/glade/projxx)
  • New allocations will be made for the NWSC
  • No data will be moved automatically to NWSC
  • A data transfer option will let users move the data they need
• Data collections (/glade/data01, /glade/data02)
  • CISL will move these data from ML to NWSC
  • Full production capability must be maintained during the transition
• Network impact (see the estimate below)
  • Current storage maximum performance is 5 GB/s
  • Can sustain ~2 GB/s for reads while under a production load
  • Moving 400 TB will take a couple of days and will saturate a 20 Gb/s network link
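The "couple of days" estimate follows from the sustained read rate; a rough calculation in decimal units, ignoring protocol overhead:

```python
# Transfer-time estimate for moving the GLADE data collections.
data_tb = 400
sustained_gbytes_per_s = 2                  # sustainable read rate under production load

seconds = data_tb * 1e12 / (sustained_gbytes_per_s * 1e9)
print(round(seconds / 86400, 1))            # -> ~2.3 days

link_gbits = 20
used_gbits = sustained_gbytes_per_s * 8     # 2 GB/s = 16 Gb/s
print(f"{used_gbits} of {link_gbits} Gb/s consumed on the WAN link")
```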
Migrating the Archive
• What migrates?
  • Data: 15 PB and counting
  • Format: MSS to HPSS
  • Tape: T10000B media (1 TB cartridges) to T10000C media (5 TB cartridges)
  • Location: ML to NWSC
• HPSS at NWSC becomes the primary site in Spring 2012
  • 1-day outage when the metadata servers are moved
  • ML HPSS will remain as the disaster-recovery site
• Data migration will take until early 2014 (see the rate estimate below)
  • Migration will be throttled so as not to overload the network
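For a sense of the sustained rate this implies, assume the ~15 PB moves over roughly two years (2012 through early 2014); this is a back-of-the-envelope assumption, not a stated schedule:

```python
# Rough average throughput needed to migrate the archive by early 2014,
# assuming a ~2-year window (an assumption, not a stated schedule).
data_pb = 15
years = 2
seconds = years * 365 * 86400

avg_mb_per_s = data_pb * 1e15 / seconds / 1e6
avg_gbit_per_s = avg_mb_per_s * 8 / 1000
print(round(avg_mb_per_s), round(avg_gbit_per_s, 1))
# -> ~238 MB/s on average (~1.9 Gb/s), about one T10000C drive's streaming rate
```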