GUPFS Project Progress Report
May 28-29, 2003
GUPFS Team: Greg Butler, Rei Lee, Michael Welcome
NERSC
Outline
• GUPFS Project Overview
• Summary of FY 2002 Activities
• Testbed Technology Update
• Current Activities Plan
• Benchmark Methodologies and Results
• What about Lustre?
• Near Term Activities and Future Plan
GUPFS Project Overview
• Five-year project introduced in the NERSC Strategic Proposal
• Purpose: make it easier to conduct advanced scientific research using NERSC systems
• Simplifies end-user data management by providing a shared-disk file system in the NERSC production environment
• An evaluation, selection, and deployment project
• May conduct or support development activities to accelerate functionality or supply missing functionality
Current NERSC Storage Configuration
• Each system has its own separate direct-attached storage
• Each system has its own separate user file system and name space
• Data transfer between systems is over the LAN
Global Unified Parallel File System (GUPFS)
• Global/Unified
  • A file system shared by major NERSC production systems
  • Uses consolidated storage and provides a unified name space
  • Automatically shares user files between systems without replication
  • Integration with HPSS and Grid is highly desired
• Parallel
  • File system performance close to that of native system-specific file systems
NERSC Storage Vision
• Single storage pool, decoupled from NERSC computational systems
  • Flexible management of storage resources
  • All systems have access to all storage
  • Buy new storage (faster and cheaper) only as we need it
  • High-performance, large-capacity storage
• Users see the same file from all systems
  • No need for replication
  • Visualization server has access to data as soon as it is created
• Integration with mass storage
  • Provides direct HSM and backups through HPSS without impacting computational systems
• (Potential) geographical distribution
  • Reduces need for file replication
  • Facilitates interactive collaboration
Where We Are Now
• Midway into the 2nd year of a three-year technology evaluation of: shared file systems, SAN fabric, and storage
• Constructed a complex testbed environment simulating the envisioned NERSC environment
• Developed testing methodologies for evaluation
• Identifying and testing appropriate technologies in all three areas
• Collaborating with vendors to fix problems and influence their development in directions beneficial to HPC
Summary of FY02 Activities
• Completed acquisition and configuration of the initial testbed system
• Developed the GUPFS project as part of the NERSC Strategic Proposal
• Developed relationships with technology vendors and other Labs
• Identified and tracked existing and emerging technologies and trends
• Initiated an informal NERSC user I/O survey
• Developed testing methodologies and benchmarks for evaluation
• Evaluated baseline storage characteristics
• Evaluated two versions of the Sistina GFS file system
• Conducted a technology update of the testbed system
• Prepared the GUPFS FY02 Technical Report
Testbed Technology Update
• Rapid changes in all three technology areas during FY02; all three remain extremely volatile
• Initial testbed inadequate for the broader scope of the GUPFS project
• Initial testbed unable to accommodate advanced technologies:
  • Too few nodes to install all the advanced fabric elements
  • Fabric too small and not heterogeneous
  • Too few nodes to absorb/stress new storage and fabric elements
  • Too few nodes to test emerging file systems
  • Too few nodes to explore scalability and scalability projections
Updated Testbed Design Goals
• Designed to be flexible and accommodate future technologies
• Designed to support testing of new fabric technologies:
  • iSCSI
  • 2Gb/s Fibre Channel
  • InfiniBand storage traffic
• Designed to support testing of emerging (interconnect-based) shared file systems:
  • Lustre
  • InfinARRAY
Recent and Ongoing Activities
• Refining testing and benchmark methodologies
• Further developing testing and benchmark tools
• Testing GFS 5.1 and the ADIC StorNext file system
• Attempting initial tests of the current Lustre release
• Continuing beta tests of advanced equipment:
  • InfiniCon 4x InfiniBand, Yotta Yotta NetStorager
• Testing advanced fabric technologies:
  • iSCSI, InfiniBand, fabric bridges, 2Gb/s FC
• Continuing development of vendor relationships
  • Arranging tests of multiple vendors' new technologies
• GUPFS FY02 Technical Report posted on the web:
  • http://www.nersc.gov/aboutnersc/pubs/GUPFS_02.pdf
Technologies Evaluated
• File systems
  • Sistina GFS 5.1
  • ADIC StorNext (CVFS) File System 2.0
  • Lustre 0.6 (1.0 Beta 1)
• Fabric
  • FC: Brocade SilkWorm and QLogic SANbox2
  • iSCSI: Cisco SN 5428, Intel iSCSI HBA, iSCSI over IB
  • InfiniBand: InfiniCon InfinIO fabric bridge (IB to FC and GigE)
  • Interconnect: Myrinet, GigE
• Storage
  • 1Gb/s FC: Dot Hill, Silicon Gear, Chaparral
  • 2Gb/s FC: Chaparral, Yotta Yotta NetStorager, EMC CX 600
Benchmark Methodologies
• NERSC background:
  • Large number of user-written codes with varying file and I/O characteristics
  • Unlike industry, we can't optimize for a few applications
• A more general approach is needed:
  • Determine the strengths and weaknesses of each file system
  • Profile the performance and scalability of fabric and storage as a baseline for file system performance
  • Profile file system parallel I/O and metadata performance over the various fabric and storage devices
Our Approach to Benchmark Methodologies
• Profile the performance and scalability of fabric and storage
• Profile file system performance over the various fabric and storage devices
• We have developed two flexible benchmarks:
  • The mptio benchmark, for parallel file I/O testing
    • Cache (in-cache) read/write
    • Disk (out-of-cache) read/write
  • The metabench benchmark, for parallel file system operations (metadata) testing
Parallel File I/O Benchmark
• MPTIO – parallel I/O performance benchmark
• Uses MPI for synchronization and gathering results
• I/O to files or raw devices
• NOT an MPI-I/O code – uses POSIX I/O
• Emulates a variety of user applications
• Emulates how an MPI-I/O library would perform I/O
• Run options:
  • Multiple I/O threads per MPI process
  • All processes perform I/O to different files/devices
  • All processes perform I/O to disjoint regions of the same file/device
  • Direct I/O, synchronous I/O, byte-range locking, etc.
  • Rotate file/offset info among processes/threads between I/O tests
• Reports aggregate I/O rates; timestamps data for individual I/O operations
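The "disjoint regions of the same file" mode can be sketched in miniature. This is not MPTIO itself (MPTIO is an MPI code); it is a small Python sketch, under assumed sizes and names, in which several worker processes each `pwrite` their own disjoint region of one shared file and an aggregate write rate is printed.

```python
# Minimal sketch of one MPTIO test mode (illustrative, not the real tool):
# WORKERS processes write disjoint REGION-sized spans of a single shared
# file via POSIX I/O; synchronization uses multiprocessing instead of MPI.
import multiprocessing as mp
import os
import tempfile
import time

REGION = 1 << 20   # 1 MiB per worker (assumed size; MPTIO sizes are configurable)
WORKERS = 4

def writer(path, rank):
    """Each worker writes only its own region: offset = rank * REGION."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, b"x" * REGION, rank * REGION)
        os.fsync(fd)  # force the data to disk before reporting completion
    finally:
        os.close(fd)

def run():
    path = os.path.join(tempfile.mkdtemp(), "shared.dat")
    # Pre-size the file so the per-rank regions are truly disjoint.
    with open(path, "wb") as f:
        f.truncate(WORKERS * REGION)
    start = time.time()
    procs = [mp.Process(target=writer, args=(path, r)) for r in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    elapsed = max(time.time() - start, 1e-9)
    total_mb = WORKERS * REGION / (1 << 20)
    print(f"wrote {total_mb:.0f} MiB in {elapsed:.3f}s "
          f"({total_mb / elapsed:.1f} MiB/s aggregate)")
    return os.path.getsize(path)

if __name__ == "__main__":
    run()
```

On a shared file system, the interesting variable is which node each worker runs on; this single-host sketch only shows the shape of the test.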
Metadata I/O Benchmark
• Metabench: measures the metadata performance of a shared file system
• Uses MPI for synchronization and to gather results
• Eliminates unfair metadata caching effects by using a separate process (rank 0) to perform test setup but not participate in the performance test
• Tests performed:
  • File creation, stat, utime, append, delete
  • Processes operating on files in distinct directories
  • Processes operating on files in the same directory
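A single-process sketch of what metabench measures follows. The real metabench runs these operations in parallel across MPI ranks; this Python sketch (with illustrative names, not metabench's actual options) simply times the create / stat / utime / delete rate for a batch of files in one directory.

```python
# Illustrative sketch of metabench-style measurements: ops/sec for basic
# metadata operations on many small files in a single directory.
import os
import tempfile
import time

def time_op(label, fn, paths):
    """Apply fn to every path and report the operation rate."""
    start = time.time()
    for p in paths:
        fn(p)
    elapsed = max(time.time() - start, 1e-9)
    print(f"{label:8s}: {len(paths) / elapsed:12.0f} ops/sec")

def run(nfiles=200):
    d = tempfile.mkdtemp()
    paths = [os.path.join(d, f"f{i}") for i in range(nfiles)]
    time_op("create", lambda p: open(p, "w").close(), paths)
    time_op("stat",   os.stat, paths)
    time_op("utime",  lambda p: os.utime(p, None), paths)
    time_op("delete", os.unlink, paths)
    return nfiles

if __name__ == "__main__":
    run()
```

On a local file system these rates are dominated by caching; the point of running the parallel version on a shared file system is to expose lock and metadata-server contention instead.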
Performance Analysis – An Example
• File system: GFS 5.1; storage system: Silicon Gear RAID5
• Test 1: write to distinct, per-process files = 29.6 MB/sec
• Test 2: write to a shared file = 50.8 MB/sec
• We expected Test 2 to be slower due to high write-lock contention
• The real reason:
  • Per-process I/O has no lock contention. Each process writes to its node's buffer cache, then flushes to disk; the bursty I/O overloads the RAID controller.
  • Shared-file I/O has write-lock contention. Each node must flush its I/O to disk before releasing the lock to another node. The result is a more even I/O rate to the RAID controller and better aggregate performance (in this case).
Summary of Benchmarking Results
• We have developed a set of benchmarking tools and a methodology for evaluation
• Not all storage devices are scalable
• Fabric technologies are still evolving:
  • Fibre Channel (FC) delivers good I/O rates with low CPU overhead – but is still pricey (~$1000 per port)
  • iSCSI can sustain good I/O rates, but with high CPU overhead – much cheaper (~$100 per port, plus a hidden performance cost)
  • SRP (SCSI RDMA Protocol over InfiniBand) is promising, with good I/O rates and low CPU overhead – but pricey (~$1000 per port)
Summary of Benchmarking Results (Cont.)
• Currently, no file system is a clear winner
• Scalable storage is important to shared file system performance
• Both ADIC StorNext and GFS show similar file I/O performance that matches the underlying storage performance
• Both show good metadata performance, in different areas
• Lustre has not been stable enough to run many meaningful benchmarks
• More file systems need to be tested
• End-to-end performance analysis is difficult but important
What About Lustre?
• The GUPFS project is very interested in Lustre (SGSFS)
• Lustre is early in its development cycle and very volatile:
  • Difficult to install and set up
  • Buggy
  • Changing very quickly
  • Currently performing poorly
Lustre File System CPU Overhead
• 1:46:25 – cache flush starts
• 2:07:57 – RUN_1
• 2:09:34 – cache flush
• 2:31:12 – RUN_2
• 2:32:07 – cache flush
• 2:55:10 – RUN_3
• 2:55:57 – cache flush
• 3:19:30 – RUN_4
• 3:20:10 – Gather Log Messages
• RUN_n – parallel I/O with n clients
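One plausible reading of the timeline above (an assumption, since the slide does not say whether each label marks a start or an end) is that each event lasts until the next timestamp. Under that assumption, the interval lengths can be computed directly from the HH:MM:SS stamps:

```python
# Compute the gap between consecutive timeline events; event names are
# copied from the slide, and "duration" assumes each event runs until
# the next timestamp.
from datetime import datetime

events = [
    ("1:46:25", "cache flush starts"),
    ("2:07:57", "RUN_1"),
    ("2:09:34", "cache flush"),
    ("2:31:12", "RUN_2"),
    ("2:32:07", "cache flush"),
    ("2:55:10", "RUN_3"),
    ("2:55:57", "cache flush"),
    ("3:19:30", "RUN_4"),
    ("3:20:10", "Gather Log Messages"),
]

def intervals(evts):
    """Return (name, seconds until the next event) for each event."""
    out = []
    for (t0s, name), (t1s, _) in zip(evts, evts[1:]):
        t0 = datetime.strptime(t0s, "%H:%M:%S")
        t1 = datetime.strptime(t1s, "%H:%M:%S")
        out.append((name, int((t1 - t0).total_seconds())))
    return out

for name, secs in intervals(events):
    print(f"{name}: {secs} s")
```

Under that reading the RUN_n intervals are short (tens of seconds) while the cache flushes between them dominate the wall-clock time.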
Near Term Activities Plan
• Continue the GFS and ADIC StorNext File System evaluations
• Conduct further tests of newer Lustre releases
• Conduct initial evaluations of GPFS, Panasas, and Ibrix
• Possibly test additional file system, storage, and fabric technologies:
  • USI's SSI File System, QFS, CXFS, StorageTank, InfinArray, SANique, SANbolic
  • DataDirect, 3PAR
  • Cisco MD95xx, QLogic iSCSI HBA, 4x HCA, TOE cards
• Continue beta tests:
  • GFS 5.2, ADIC StorNext 3.0, Yotta Yotta, InfiniCon, TopSpin
• Complete test analysis and document results on the web
• Plan future evaluation activities
Summary
• Five-year project under way to deploy a shared-disk file system in the NERSC production environment
• Established benchmarking methodologies for evaluation
• Technology is still evolving
• Key technologies are being identified and evaluated
• Currently, most products are buggy and do not scale well, but we have the vendors' attention!