Lustre use cases in the TSUBAME2.0 supercomputer Tokyo Institute of Technology Dr. Hitoshi Sato
Outline • Introduction of the TSUBAME2.0 Supercomputer • Overview of TSUBAME2.0 Storage Architecture • Lustre FS use cases
TSUBAME2.0 (Nov. 2010, w/ NEC-HP) • A green, cloud-based SC at Tokyo Tech (Tokyo, Japan) • ~2.4 Pflops (Peak), 1.192 Pflops (Linpack) • 4th in TOP500 (Nov. 2010) • Next-gen multi-core x86 CPUs + GPUs • 1432 nodes, Intel Westmere/Nehalem-EX CPUs • 4244 NVIDIA Tesla (Fermi) M2050 GPUs • ~95 TB of MEM w/ 0.7 PB/s aggregate BW • Optical dual-rail QDR IB w/ full-bisection BW (fat tree) • 1.2 MW power, PUE = 1.28 • 2nd in Green500 (Nov. 2010) • Greenest Production Supercomputer • VM operation (KVM), Linux + Windows HPC
TSUBAME2.0 Overview
HDD-based storage systems: 7.13 PB total (parallel FS + home)
• Parallel FS, 5.93 PB: MDS/OSS servers HP DL360 G6 ×30 (OSS ×20, MDS ×10), storage DDN SFA10000 ×5 (10 enclosures each)
• Home, 1.2 PB: storage servers HP DL380 G6 ×4 (×2 for NFS/CIFS/iSCSI, ×4 for NFS/CIFS), BlueArc Mercury 100 ×2, storage DDN SFA10000 ×1 (10 enclosures), high-speed data transfer servers
• StorageTek SL8500 tape library: ~4 PB
Interconnect: full-bisection optical QDR InfiniBand (external connections: SuperTitanet, SuperSINET3; management servers)
• Core switches: Voltaire Grid Director 4700 ×12 (324 IB QDR ports each)
• Edge switches: Voltaire Grid Director 4036 ×179 (36 IB QDR ports each)
• Edge switches w/ 10GbE ports: Voltaire Grid Director 4036E ×6 (34 IB QDR ports + 2 × 10GbE each)
Computing nodes: 2.4 PFlops (CPU+GPU), 224.69 TFlops (CPU), ~100 TB MEM, ~200 TB SSD
• Thin nodes: HP ProLiant SL390s G7 ×1408 (32 nodes × 44 racks); CPU: Intel Westmere-EP 2.93 GHz, 6 cores × 2 = 12 cores/node; GPU: NVIDIA Tesla M2050 ×3/node (PCI-E gen2 x16, 2 slots/node); Mem: 54 GB (96 GB); SSD: 60 GB × 2 = 120 GB (120 GB × 2 = 240 GB)
• Medium nodes: HP ProLiant DL580 G7 ×24; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 2 = 32 cores/node; Mem: 128 GB; SSD: 120 GB × 4 = 480 GB
• Fat nodes: HP ProLiant DL580 G7 ×10; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 2 = 32 cores/node; Mem: 256 GB (512 GB); SSD: 120 GB × 4 = 480 GB
(GSIC: NVIDIA Tesla S1070 GPUs)
Usage of TSUBAME2.0 storage • Simulation, traditional HPC • Outputs for intermediate and final results • 200 outputs of 52 GB of memory × 128 nodes → 1.3 PB • Computation in 4 (several) patterns → 5.2 PB (see the estimate below) • Data-intensive computing • Data analysis • e.g. bioinformatics, text, and video analyses • Web text processing for acquiring knowledge from 1 bn. pages (2 TB) of HTML → 20 TB of intermediate outputs → 60 TB of final results • Processing by using commodity tools • MapReduce (Hadoop), workflow systems, RDBs, FUSE, etc.
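A back-of-envelope check of the figures above (the 52 GB per node, 128 nodes, 200 outputs, and 4 patterns are taken from the slide; the arithmetic itself is only a sanity check):

```latex
% 200 outputs of 52 GB of memory on each of 128 nodes
200 \times 52\,\mathrm{GB} \times 128 \approx 1.3\,\mathrm{PB}
% repeated over 4 computation patterns
4 \times 1.3\,\mathrm{PB} \approx 5.2\,\mathrm{PB}
```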
Storage problems in cloud-based SCs • Support for various I/O workloads • Storage usage • Usability
I/O workloads • Various HPC apps. run on TSUBAME2.0 • Concurrent parallel R/W I/O (see the sketch below) • MPI (MPI-IO), MPI with CUDA, OpenMP, etc. • Fine-grained R/W I/O • Checkpoints, temporary files • Gaussian, etc. • Read-mostly I/O • Data-intensive apps, parallel workflows, parameter surveys • Array jobs, Hadoop, etc. • Shared storage • I/O concentration
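A minimal sketch of the concurrent parallel R/W pattern listed above, assuming mpi4py is available on the compute nodes and using /work0 (one of the Lustre work volumes) with a hypothetical file name; this is an illustration, not code from the slides:

```python
# Each MPI rank writes its own contiguous 4 MiB block of one shared file
# with a collective MPI-IO call -- the kind of concurrent parallel write
# a Lustre work volume such as /work0 is expected to absorb.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

block = np.full(4 * 1024 * 1024, rank % 256, dtype=np.uint8)  # 4 MiB payload
fh = MPI.File.Open(comm, "/work0/demo/shared.dat",            # hypothetical path
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * block.nbytes, block)                   # collective write
fh.Close()
```

Launched with mpirun across many nodes, every rank lands at offset rank × 4 MiB, so the OSSes see exactly the shared-file concurrent write load described in the bullet.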
Storage Usage • Data life-cycle management • A few users occupy most of the storage volume on TSUBAME1.0 • Only 0.02% of users use more than 1 TB of storage • Storage resource characteristics • HDD: ~150 MB/s, 0.16 $/GB, 10 W/disk • SSD: 100~1000 MB/s, 4.0 $/GB, 0.2 W/disk • Tape: ~100 MB/s, 1.0 $/GB, low power consumption
Usability • Seamless data access to SCs • Federated storage between private PCs, clusters in labs, and SCs • Storage service for the campus, like cloud storage services • How to deal with large data sets • Transfer of big data between SCs • e.g. Web data mining on TSUBAME1 • NICT (Osaka) → Tokyo Tech (Tokyo): stage-in of 2 TB of initial data • Tokyo Tech → NICT: stage-out of 60 TB of results • Transfer via the Internet: 8 days • FedEx?
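If the 8-day Internet transfer above refers to the 60 TB stage-out (the slide does not say which data set it measures), the sustained rate works out to roughly:

```latex
\frac{60 \times 10^{12}\,\mathrm{B}}{8 \times 86\,400\,\mathrm{s}} \approx 87\,\mathrm{MB/s}
```

which is why shipping disks ("FedEx?") is raised as an alternative.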
TSUBAME2.0 Storage Overview
TSUBAME2.0 storage: 11 PB (7 PB HDD, 4 PB tape), attached to the InfiniBand QDR network for LNET and other services (links: QDR IB (×4) ×20, QDR IB (×4) ×8, 10GbE ×2)
• Parallel File System Volumes (DDN SFA10k #1–#5)
• "Global Work Space" #1–#3: Lustre, 3.6 PB (/work0, /work9, /work19)
• GPFS #1–#4 with HSM ("cNFS/Clustered Samba w/ GPFS", system applications): 2.4 PB HDD + ~4 PB tape
• Home Volumes (SFA10k #6): 1.2 PB, "NFS/CIFS/iSCSI by BlueArc" (HOME ×2)
• Scratch: /gscr0 (iSCSI), "Thin node SSD" (190 TB scratch), "Fat/Medium node SSD"
• Grid Storage: 130 TB~
TSUBAME2.0 Storage Overview (workload mapping)
The same 11 PB (7 PB HDD, 4 PB tape) storage, annotated with the workloads each tier serves:
• Home volumes ("NFS/CIFS/iSCSI by BlueArc", "cNFS/Clustered Samba w/ GPFS"): home storage for computing nodes; cloud-based campus storage services
• Parallel file system volumes ("Global Work Space" #1–#3 Lustre /work0, /work9, /work19; GPFS #1–#4): concurrent parallel I/O (e.g. MPI-IO); read-mostly I/O (data-intensive apps, parallel workflow, parameter survey)
• Node-local SSDs ("Thin node SSD", "Fat/Medium node SSD"): fine-grained R/W I/O (checkpoints, temporary files)
• GPFS with HSM (2.4 PB HDD + ~4 PB tape): backup
• Grid storage (130 TB~): data transfer service between SCs/CCs
Lustre Configuration
• Lustre FS #1: 785.05 TB, 10 bn. inodes, MDS ×2, OSS ×4 — general purpose (MPI)
• Lustre FS #2: 785.05 TB, 10 bn. inodes, MDS ×2, OSS ×4 — scratch (Gaussian)
• Lustre FS #3: 785.05 TB, 10 bn. inodes, MDS ×2, OSS ×4 — reserved / backup
Lustre FS Performance
• Sequential write/read measured with IOR-2.10.3 (checksum=1), 1–8 clients, 1–7 procs/client, over the IB QDR network
• I/O throughput: ~11 GB/s per FS (10.7–11 GB/s), ~33 GB/s with 3 FSs
• Setup: clients #1–#8 → 1 Lustre FS, 4 OSS, 56 OSTs on a DDN SFA10k (600 slots, 560 × 2 TB SATA 7200 rpm data drives)
Failure cases in production operation • OSS hang-ups due to heavy I/O load • A user issues small read() I/O operations via many Java processes • I/O slows down on the OSS • The OSS dispatches iterative failovers and rejects connections from clients • MDS hang-ups • A client holds unnecessary RPC connections during eviction processing • The MDS keeps holding RPC locks from clients
Related Activities • Lustre FS Monitoring • MapReduce
TSUBAME2.0 Lustre FS Monitoring w/ DDN Japan • Purposes: • Real-time visualization for TSUBAME2.0 users • Analysis for production operation and FS research
Lustre Monitoring in detail
• Monitoring targets: 2 × MDS and 4 × OSS (56 OSTs, 14 OSTs per OSS), each running the Ganglia lustre module (gmond) and the LMT server agent (cerebro)
• OST statistics: write/read bandwidth, OST usage, etc.
• MDT statistics: open, close, getattr, setattr, link, unlink, mkdir, rmdir, statfs, rename, getxattr, inode usage, etc.
• Management node: gmetad, LMT server, MySQL, and a web server hosting the monitoring web application
• Front ends (web browser): real-time cluster visualization for users, LMT GUI for research and analysis
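A minimal sketch of the per-OST counters such an agent samples on an OSS, assuming a Lustre release of that era where obdfilter statistics are exposed under /proc/fs/lustre (the path, field layout, and function name below are assumptions for illustration, not taken from the slide):

```python
# Collect cumulative read/write byte counters for every OST on an OSS --
# roughly the raw data an LMT/Ganglia agent ships to the management node.
import glob
import os

def ost_io_bytes():
    stats = {}
    for path in glob.glob("/proc/fs/lustre/obdfilter/*/stats"):
        ost = os.path.basename(os.path.dirname(path))
        counters = {"read_bytes": 0, "write_bytes": 0}
        with open(path) as f:
            for line in f:
                fields = line.split()
                # stats lines look like: "write_bytes N samples [bytes] min max sum"
                if fields and fields[0] in counters:
                    counters[fields[0]] = int(fields[-1])  # running byte sum
        stats[ost] = counters
    return stats

if __name__ == "__main__":
    for ost, c in sorted(ost_io_bytes().items()):
        print(f"{ost}: read={c['read_bytes']} write={c['write_bytes']}")
```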
MapReduce on the TSUBAME supercomputer • MapReduce • A programming model for large-scale data processing • Hadoop is a common OSS-based implementation • Supercomputers • A candidate execution environment • Demand from various application users • Text processing, machine learning, bioinformatics, etc. • Problems: • Cooperation with the existing batch-job scheduler (PBS Pro) • All jobs, including MapReduce tasks, should run under the scheduler's control • TSUBAME2.0 provides various storage tiers for data-intensive computation • Local SSD storage, parallel FSs (Lustre, GPFS) • Cooperation with GPU accelerators • Not supported in Hadoop
Hadoop on TSUBAME (Tsudoop) • Script-based invocation • Acquire computing nodes via PBS Pro • Deploy a Hadoop environment on the fly (incl. HDFS) • Execute the user's MapReduce jobs • Various FS support • HDFS by aggregating local SSDs • Lustre, GPFS (to appear) • Customized Hadoop for executing CUDA programs (experimental) • Hybrid map task scheduling • Automatically detects map task characteristics by monitoring • Schedules map tasks to minimize overall MapReduce job execution time • Extension of Hadoop Pipes features
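As an illustration of the kind of job Tsudoop hands to Hadoop, here is a minimal word-count mapper/reducer written for Hadoop Streaming (the use of Streaming and the file name are assumptions for illustration; the slides do not show Tsudoop's actual job scripts):

```python
# wordcount_streaming.py -- usable as both mapper and reducer under
# Hadoop Streaming, e.g. -mapper "wordcount_streaming.py map"
#                        -reducer "wordcount_streaming.py reduce".
import sys
from itertools import groupby

def mapper():
    # Emit "<word>\t1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key before the reduce phase,
    # so equal words arrive as consecutive lines and can be grouped.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

On TSUBAME, such a job would run inside the PBS Pro allocation that Tsudoop acquires, against HDFS built from the local SSDs or directly on the Lustre/GPFS work volumes.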