Extreme I/O. Phil Andrews, Director of High End Computing Technologies, San Diego Supercomputer Center, University of California, San Diego. andrews@sdsc.edu
Applications are becoming more complex: so are their results • “Every code should have just one output: yes or no!” (Hans Bruijnes, NMFECC, 1983) • Google search result counts, 9 p.m., July 24, 2005: “Results 1 - 100 of about 61,000 English pages for "importance of data". (0.69 seconds)” “Results 1 - 100 of about 978 English pages for "importance of computing". (0.38 seconds)”
Some data numbers • Enzo (Mike Norman) can output >25 TB in a single run • Earthquake simulations can produce >50 TB • “The entire NVO archive will contain about 100 terabytes of data to start, and grow to more than 10 petabytes by 2008.” (Brian Krebs)
Computing: data is extremely important! [Chart: applications plotted by data capability (increasing I/O and storage) vs. compute capability (increasing FLOPS). The traditional HEC environment (CPMD, QCD, CFD, protein folding) is compute-intensive; the SDSC data science environment (SCEC simulation and visualization, ENZO simulation and visualization, NVO, EOL, CIPRes, climate, out-of-core 3D + time simulation) is data-intensive, with a distributed-I/O-capable band in between; campus, departmental, and desktop computing sits at the low end of both axes. Extreme I/O and data storage/preservation can’t be done on the Grid: the I/O exceeds WAN capacity.]
Data has a life cycle! • A computation is an event; data is a living thing that is conceived, created, curated, consumed, and eventually deleted, archived, or forgotten. • For success in data handling, all aspects of that life cycle must be considered and facilitated from the beginning within an integrated infrastructure
SDSC TeraGrid Data Architecture • 1 PB disk • 6 PB archive • 1 GB/s disk-to-tape • Optimized support for DB2/Oracle • Philosophy: enable the SDSC configuration to serve the grid as a data center; design leveraged at other TG sites [Diagram: DataStar (Power4 DB nodes, 50 TB local disk, 100 TB FC GPFS disk), a 4 TF Linux cluster, BG/L, a Sun F15K, and database/data-mining/visualization engines connected by a LAN (multiple GbE, TCP/IP) and a WAN (30 Gb/s) to a SAN (2 Gb/s, SCSI, SCSI/IP or FC/IP) serving 500 TB of GPFS disk, a 400 TB FC disk cache, and HPSS tape silos: 6 PB, 52 tape drives, 30 MB/s per drive, 200 MB/s per controller, 1 GB/s disk to tape.]
Real Data Needs: • Need balanced data transfers, all the way from memory bandwidth to archival storage • Need memory bandwidths of better than 1 GB/s per Gflop/s of processor performance; ideally two 64-bit words per flop, i.e., 16 bytes per flop (see the sketch below)
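To make the second bullet concrete, here is a small back-of-the-envelope calculation (not from the original slides; the 8 Gflop/s peak rate is an illustrative assumption):

```python
# Hedged sketch: turn the bytes-per-flop targets above into per-processor
# memory bandwidth. The 8 Gflop/s peak rate is an assumed example value.
def required_memory_bandwidth(gflops_per_proc, bytes_per_flop):
    """Required memory bandwidth in GB/s for one processor."""
    return gflops_per_proc * bytes_per_flop

peak_gflops = 8.0                                        # assumed peak rate
minimum = required_memory_bandwidth(peak_gflops, 1.0)    # >1 GB/s per Gflop/s
ideal = required_memory_bandwidth(peak_gflops, 16.0)     # two 64-bit words/flop

print(f"minimum: {minimum:.0f} GB/s, ideal: {ideal:.0f} GB/s per processor")
```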
File System Needs: • File systems must allow ~10 GB/s transfer rates (see the sketch below for why) • File systems must work across arbitrary networks (LAN or WAN) • File systems must be closely integrated with archival systems for parallel backups and/or automatic archiving
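A quick illustration of why ~10 GB/s matters (the dataset sizes echo the earlier slide; the 1 GB/s comparison rate is an illustrative assumption):

```python
# Hedged sketch: time to move the multi-TB outputs mentioned earlier
# (Enzo >25 TB, earthquake runs >50 TB) at different file system rates.
def hours_to_move(terabytes, gb_per_s):
    return terabytes * 1000.0 / gb_per_s / 3600.0

for tb in (25, 50):
    for rate in (1.0, 10.0):                 # GB/s
        print(f"{tb} TB at {rate:4.1f} GB/s: {hours_to_move(tb, rate):5.2f} hours")
```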
I/O subsystem needs: • No system degradation when a disk fails (there will always be some down) • Tolerance of multiple disk failures per RAID set, likely with many thousands of disks per file system; there are over 7,000 spindles at SDSC (see the sketch below) • Rapid transfers to tape systems • Multi-TB tape cartridges, ~GB/s transfers
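A rough estimate of what 7,000 spindles implies (the 3% annual failure rate and 12-hour rebuild window are illustrative assumptions, not SDSC figures):

```python
# Hedged sketch: with thousands of spindles, some disks are essentially always
# failed or rebuilding, so RAID sets must tolerate multiple failures.
spindles = 7000
annual_failure_rate = 0.03        # assumed fraction of disks failing per year
rebuild_hours = 12.0              # assumed time to rebuild onto a spare

failures_per_year = spindles * annual_failure_rate
failures_per_hour = failures_per_year / (365 * 24)
avg_degraded = failures_per_hour * rebuild_hours

print(f"~{failures_per_year:.0f} disk failures per year, "
      f"~{avg_degraded:.2f} disks degraded at any given moment")
```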
Parallel File Systems Across TeraGrid • General Parallel File System (GPFS) • High-performance parallel I/O, over 10 GB/s at SDSC • SAN capability • Many redundancy features • Shared between AIX and Linux • SDSC, NCSA, ANL • Parallel Virtual File System (PVFS) • Open source • Caltech, ANL, SDSC • HP Parallel File System (HP PFS) • Proprietary parallel file system for the TeraScale Computing System (Lemieux) at PSC • MPI-IO (Message Passing Interface I/O) • High-performance, portable, parallel I/O interface for MPI programs (see the sketch below)
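MPI-IO is the item in this list that application codes call directly; a minimal sketch of a collective parallel write using the mpi4py bindings (the file name and buffer size are illustrative, and this is not code from the talk):

```python
# Hedged sketch: each MPI rank writes its own contiguous block of a shared
# file collectively via MPI-IO. File name and sizes are arbitrary examples.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

fh = MPI.File.Open(comm, "output.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)

buf = np.full(1 << 20, rank, dtype=np.float64)   # 8 MB of data per rank
offset = rank * buf.nbytes                       # non-overlapping file regions
fh.Write_at_all(offset, buf)                     # collective write
fh.Close()
```

Run with, e.g., `mpiexec -n 4 python write_demo.py` (the script name is hypothetical).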
Local Data Only Part of the Story • SDSC users are not “captive”; they move around between sites • SDSC is the designated data lead for TeraGrid • Many SDSC users are part of multi-site collaborations whose major intersection is via common data sets • Must extend the data reach across the USA
Working on… • Global file system via GPFS • GSI authentication for GPFS using UID/GID mapping to the Globus mapfile (see the sketch below) • Dedicated disks/servers for Grid data, using GPFS to serve data across the Grid • Automatic migration to tape archives • Online DB2 database servers to provide remote DB services to Grid users
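For the UID/GID mapping bullet, a hedged sketch of reading Globus grid-mapfile entries (the usual one-entry-per-line "DN" local_user format) and resolving them to local UID/GID pairs; the file path is the conventional default, and wiring this into GPFS is an assumption about the approach, not the actual SDSC implementation:

```python
# Hedged sketch: resolve grid-mapfile DNs to local UID/GID pairs. A real GSI
# deployment would consume this inside the file system's identity-mapping
# layer; this stand-alone version is only illustrative.
import pwd

def load_grid_mapfile(path="/etc/grid-security/grid-mapfile"):
    mapping = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # The DN is quoted; local account name(s) follow the closing quote.
            dn, _, users = line.lstrip('"').partition('"')
            local_user = users.strip().split(",")[0]
            try:
                pw = pwd.getpwnam(local_user)
            except KeyError:
                continue                      # skip DNs mapped to unknown accounts
            mapping[dn] = (pw.pw_uid, pw.pw_gid)
    return mapping

if __name__ == "__main__":
    for dn, (uid, gid) in load_grid_mapfile().items():
        print(f"{dn} -> uid={uid} gid={gid}")
```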
Combine Resources Across TG [Table: home directory, node-local storage, scratch/staging storage, parallel file system, and archival system capacity at Caltech, ANL, NCSA, SDSC, and PSC. Home directories are NFS volumes ranging from 140 GB to 4 TB; node-local storage ranges from 35 GB to 132 GB per node across IA-32, IA-64, and TCS nodes; scratch and parallel file systems include PVFS (16-80 TB), GPFS (39-64 TB), QFS (100 TB), SLASH (24 TB), TTS (8 TB), and PFS (30 TB); archives include 1.2 PB HPSS, 1.5 PB UniTree, 6 PB HPSS plus SAM-FS at SDSC, and 4 PB DMF at PSC.]
TeraGrid Data Management Server • Sun Microsystems F15K: 72 900 MHz processors, 288 GB shared memory, 48 Fibre Channel SAN interfaces, sixteen 1 Gb Ethernet interfaces • SAM-QFS: high-performance parallel file system linking directly to archival systems for transparent usage; SAM-QFS and SLASH/DMF running now • Storage Resource Broker • Archival storage (SAM-FS): pool of storage with migration policies (like DMF); 100 TB disk cache; 828 MB/s transfers to archive (using 23 9940B tape drives) • Parallel file system (QFS): concurrent R/W; metadata traffic over GE; data transferred directly over the SAN (GPFS does this too); demonstrated 3.2 GB/s reads from a 30 TB QFS file system
What Users Want in Grid Data • Unlimited data capacity. We can almost do this. • Transparent, High Speed access anywhere on the Grid. We can do this. • Automatic Archiving and Retrieval (yep) • No Latency. We can’t do this. (Measuring 60 ms roundtrip SDSC-NCSA)
How do we do this? One way: • Large, centralized tape archive at SDSC (6 PB, capable of 1 GB/s) • Large, centralized disk cache at SDSC (400 TB, capable of 10+ GB/s) • Local disk cache at remote sites for low-latency, high-performance file access • Connect all three in a multi-level HSM across TeraGrid with transparent archiving (reads and writes) across all three levels (see the sketch below)
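A hedged sketch of the three-level read path this implies: local site cache first, then the central SDSC disk cache, then tape. All paths and helper functions are hypothetical; the real system performs this transparently inside the file system/HSM rather than in application code:

```python
# Hedged sketch of a three-level HSM read path: remote-site disk cache,
# central disk cache, then tape. Paths and the recall helper are made up.
import os, shutil

LOCAL_CACHE = "/scratch/hsm-cache"      # remote-site cache (assumed path)
CENTRAL_CACHE = "/sdsc/disk-cache"      # 400 TB central cache (assumed path)
TAPE_STORE = "/archive/tape"            # stand-in for the 6 PB tape archive

def recall_from_tape(name, dest):
    """Stand-in for staging a file from the tape archive (~1 GB/s)."""
    shutil.copy(os.path.join(TAPE_STORE, name), dest)

def open_grid_file(name):
    local = os.path.join(LOCAL_CACHE, name)
    if not os.path.exists(local):                   # miss at the remote site
        central = os.path.join(CENTRAL_CACHE, name)
        if not os.path.exists(central):             # miss in the central cache
            recall_from_tape(name, central)         # stage from tape
        shutil.copy(central, local)                 # populate the site cache
    return open(local, "rb")
```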
Infinite Grid Storage? • Infinite (SDSC) storage available over the grid • Looks like local disk to grid sites • Use automatic migration with a large cache to keep files always “online” and accessible • Data automatically archived without user intervention • Want one pool of storage for all systems and functions • Combining the General Parallel File System (GPFS) on both AIX and Linux with transparent archival migration would allow mounting unlimited archival storage as a local file system • Users could have a local parallel file system (highest performance, not backed up) and a global parallel file system (integrated into the HSM) both mounted for use • Need Linux and AIX clients
Global File Systems over WAN • Basis for some new Grids (DEISA) • User transparency (TeraGrid roaming) • On-demand access to scientific data sets • Share scientific data sets and results • Access scientific results from geographically distributed instruments and sensors in real time • No copying of files to here and there and there… • What about UID/GID mapping? • Authentication: initially use world-readable data sets and common UIDs for some users; GSI coming • On-demand data: instantly accessible and searchable; no need for local storage space; need network bandwidth
SC’02: Export of SDSC SAN across 10 Gb/s WAN to PSC booth • Fibre Channel (FC) over IP boxes, FC over SONET encoding • Encapsulate FC frames within IP • Akara and Nishan equipment; 8 Gb/s gear by Nishan Systems • 728 MB/s reads from disk to memory over the SAN; writes slightly slower • 13 TB disk • 8 x 1 Gb/s links; a single 1 Gb/s link gave 95 MB/s • Latency approx. 80 ms round trip (see the sketch below) [Diagram: the SDSC and PSC booths in Baltimore connected to San Diego over a 10 Gb IP WAN via FC/IP and FC/SONET, with 8 Gb of FC and a fibre connection between the booths.]
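Two quick numbers behind that result (the 95 MB/s and 80 ms figures come from the slide; the rest is straightforward arithmetic):

```python
# Hedged sketch: compare the measured single-link rate to line rate, and
# compute the bandwidth-delay product that must be kept in flight.
link_gbps = 1.0                    # one 1 Gb/s link
rtt_s = 0.080                      # ~80 ms round trip, as measured

line_rate_MBps = link_gbps * 1e9 / 8 / 1e6     # 125 MB/s theoretical maximum
measured_MBps = 95.0                           # observed at SC'02
bdp_MB = line_rate_MBps * rtt_s                # bandwidth-delay product

print(f"line rate {line_rate_MBps:.0f} MB/s, measured {measured_MBps:.0f} MB/s "
      f"({measured_MBps / line_rate_MBps:.0%} of line rate)")
print(f"~{bdp_MB:.0f} MB must be in flight to fill one link at this RTT")
```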
High-Performance Grid-Enabled Data Movement with GridFTP (Bandwidth Challenge 2003) [Diagram: Southern California Earthquake Center data moved with GridFTP between SDSC (128 dual 1.3 GHz Madison processor nodes, 77 TB General Parallel File System (GPFS) on SAN, Myrinet, Gigabit Ethernet) and the SDSC SC booth (40 dual 1.5 GHz Madison processor nodes, 40 TB GPFS on SAN, Scalable Visualization Toolkit) over the TeraGrid network through the L.A. hub, SCinet, and the booth hub.]
Access to GPFS File Systems over the WAN • Goal: share GPFS file systems over the WAN • WAN adds 10-60 ms of latency… but under load, storage latency is much higher than this anyway! • Typical supercomputing I/O patterns are latency tolerant (large sequential reads/writes; see the sketch below) • New GPFS feature: GPFS NSD now allows both SAN and IP access to storage; SAN-attached nodes go direct, non-SAN nodes use NSD over IP • Work in progress: technology demo at SC03, work toward possible product release (Roger Haskin, IBM)
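Why large sequential transfers tolerate WAN latency, in numbers (the 60 ms round trip echoes the SDSC-NCSA measurement earlier; the request sizes and 1 GB/s streaming rate are illustrative assumptions):

```python
# Hedged sketch: effective client throughput for request sizes ranging from
# small blocks to large sequential reads, with one request in flight at a time.
def effective_rate(block_mb, stream_MBps, rtt_s):
    """MB/s seen by a client paying one round trip per request."""
    per_block_s = block_mb / stream_MBps + rtt_s
    return block_mb / per_block_s

for block_mb in (0.064, 16, 256):              # 64 KB, 16 MB, 256 MB requests
    rate = effective_rate(block_mb, 1000.0, 0.060)
    print(f"{block_mb:7.3f} MB requests -> {rate:6.1f} MB/s effective")
```

Pipelining several outstanding requests hides the latency further; the point is that 60 ms matters far less when each request carries many megabytes.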
On-Demand File Access over the Wide Area with GPFS (Bandwidth Challenge 2003) [Diagram: Southern California Earthquake Center data on SDSC’s 77 TB GPFS (128 dual 1.3 GHz Madison processor nodes, 16 Network Shared Disk servers) mounted over the WAN by the SDSC SC booth (40 dual 1.5 GHz Madison processor nodes, Scalable Visualization Toolkit) via the TeraGrid network, L.A. hub, SCinet, and booth hub; no duplication of data.]
Global TG GPFS over 10 Gb/s WAN (SC’03 Bandwidth Challenge Winner)
GridFTP Across 10 Gb/s WAN (SC ’03 Bandwidth Challenge Winner)
PSC’s DMF HSM is interfaced to SDSC’s tape archival system • Used FC/IP encoding via WAN-SAN to attach 6 SDSC tape drives to PSC’s DMF Archival System • Approximately 19 MB/s aggregate to tapes at first try
StorCloud Demo SC’04 • StorCloud 2004: a major initiative at SC2004 to highlight the use of storage area networking in high-performance computing; ~1 PB of storage from major vendors for use by SC04 exhibitors; StorCloud Challenge competition to award entrants that best demonstrate the use of storage (similar to the Bandwidth Challenge) • SDSC-IBM StorCloud Challenge: a workflow demo that highlights multiple computation sites on a grid sharing storage at a storage site, using IBM computing and storage hardware and the 30 Gb/s communications backbone of the TeraGrid [Photos: 40-node GPFS server cluster; installing the DS4300 storage; IBM DS4300 storage in the StorCloud booth]
SC ‘04 Demo: IBM-SDSC-NCSA [Diagram: the SC ‘04 SDSC booth houses 40 dual 1.3 GHz Itanium2 processor Linux nodes (4 racks of nodes, 1 rack of networking; 1 GE and 3 FC per node) acting as GPFS NSD servers exporting /gpfs-sc04. The StorCloud booth holds 181 TB of raw FastT600 disk in 15 racks (4 controllers per rack, 4 FC per controller, 240 FC links and 120 Ethernet ports in total) plus 1.5 racks of SAN switching (3 Brocade 24000 switches, 128 ports each, 360 ports used). At SDSC, DataStar's 176 8-way Power4+ p655 AIX nodes and 11 32-way Power4+ p690 AIX nodes (possibly 7 with 10 GE adapters) on the Federation SP switch mount /gpfs-sc04 over Gigabit Ethernet through SCinet, the TeraGrid network, and L.A.]
SC ‘04 Demo: IBM-SDSC-NCSA • Nodes scheduled using GUR • ENZO computation on DataStar, output written to StorCloud GPFS served by nodes in SDSC’s SC ‘04 booth • Visualization performed at NCSA using StorCloud GPFS and displayed to the showroom floor [Diagram: DataStar at SDSC (176 8-way Power4+ p655 AIX nodes and 7 32-way Power4+ p690 AIX nodes with 10 GE adapters on the Federation SP switch, /gpfs-sc04 mounted) and NCSA visualization nodes (40 dual 1.5 GHz Itanium2 processor nodes, /gpfs-sc04 mounted) reach the SC ‘04 SDSC booth GPFS NSD servers (40 dual 1.3 GHz Itanium2 processor Linux nodes, Brocade SAN switches) and the StorCloud booth disk (160 TB FastT600, 15 racks, 2 controllers per rack) over the TeraGrid network via L.A. and Chicago, with Gigabit and 10 Gigabit Ethernet through SCinet.]
SDSC now serving 0.5 PB of GFS disk • Initially served across TeraGrid and mounted by ANL and NCSA • Plan to start hosting large datasets for the scientific community • One of the first will be NVO, ~50 TB of night-sky information: a read-only dataset available for computation across TeraGrid • Extend rapidly with other datasets • Hoping for 1 PB soon
Global File System for the TeraGrid [Diagram: a parallel file system exported from San Diego and mounted at all TeraGrid sites (NCSA, PSC, ANL) over the TeraGrid network, 30 Gb/s to Los Angeles; at SDSC, a Juniper T640 and a Force10 12000 connect the IA-64 GPFS servers, TeraGrid Linux (3 TeraFlops), BG/L (6 TeraFlops, 128 I/O nodes), DataStar (11 TeraFlops), and 0.5 PetaByte of FastT100 storage.]