
GUPFS Project Progress Report


Presentation Transcript


  1. GUPFS Project Progress Report May 28-29, 2003 GUPFS Team Greg Butler, Rei Lee, Michael Welcome NERSC

  2. Outline • GUPFS Project Overview • Summary of FY 2002 Activities • Testbed Technology Update • Current Activities Plan • Benchmark Methodologies and Results • What about Lustre? • Near Term Activities and Future Plan

  3. GUPFS Project Overview • Five-year project introduced in the NERSC Strategic Proposal • Purpose: to make it easier to conduct advanced scientific research using NERSC systems • Simplify end-user data management by providing a shared-disk file system in the NERSC production environment • An evaluation, selection, and deployment project • May conduct or support development activities to accelerate functionality or supply missing functionality

  4. Current NERSC Storage Configuration • Each system has its own separate direct-attached storage • Each system has its own separate user file system and name space • Data transfer between systems is over the LAN

  5. Global Unified Parallel File System (GUPFS) • Global/Unified • A file system shared by major NERSC production systems • Using consolidated storage and providing a unified name space • Automatically sharing user files between systems without replication • Integration with HPSS and Grid is highly desired • Parallel • File system providing performance close to that of native system-specific file systems

  6. NERSC Storage Vision • Single storage pool, decoupled from NERSC computational systems • Flexible management of storage resource • All systems have access to all storage • Buy new storage (faster and cheaper) only as we need it • High performance large capacity storage • Users see same file from all systems • No need for replication • Visualization server has access to data as soon as it is created • Integration with mass storage • Provide direct HSM and backups through HPSS without impacting computational systems • (Potential) Geographical distribution • Reduce need for file replication • Facilitate interactive collaboration

  7. Envisioned NERSC Storage Configuration (GUPFS)

  8. Where We Are Now • Midway into the 2nd year of a three-year technology evaluation of: Shared File System, SAN Fabric, and Storage • Constructed a complex testbed environment simulating the envisioned NERSC environment • Developed testing methodologies for evaluation • Identifying and testing appropriate technologies in all three areas • Collaborating with vendors to fix problems and influence their development in directions beneficial to HPC

  9. Summary FY02 Activities • Completed acquisition and configuration of initial testbed system • Developed GUPFS project as part of NERSC Strategic Proposal • Developed relationships with technology vendors and other Labs • Identified and tracked existing and emerging technologies and trends • Initiated informal NERSC user I/O survey • Developed testing methodologies and benchmarks for evaluation • Evaluated baseline storage characteristics • Evaluated two versions of Sistina GFS file system • Conducted technology update of testbed system • Prepared GUPFS FY02 Technical Report

  10. Testbed Technology Update • Rapid changes in all three technology areas during FY02 • Technology areas remain extremely volatile • Initial testbed inadequate for broader scope of GUPFS project • Initial testbed unable to accommodate advanced technologies: • Too few nodes to install all the advanced fabric elements • Fabric too small and not heterogeneous • Too few nodes to absorb/stress new storage and fabric elements • Too few nodes to test emerging file system • Too few nodes to explore scalability and scalability projections

  11. Initial Testbed

  12. Updated Testbed Design Goals • Designed to be flexible and accommodate future technologies • Designed to support testing of new fabric technologies • iSCSI • 2Gb/s Fibre Channel • Infiniband storage traffic • Designed to support testing of emerging (inter-connect based) shared file systems • Lustre • InfinARRAY

  13. Updated Testbed

  14. Recent and Ongoing Activities • Refining testing and benchmark methodologies • Further developing testing and benchmark tools • Testing GFS 5.1 and the ADIC StorNext file system • Trying to conduct initial tests of current Lustre • Continuing beta tests of advanced equipment • InfiniCon 4x Infiniband, Yotta Yotta NetStorager • Testing advanced fabric technologies • iSCSI, Infiniband, fabric bridges, 2Gb/s FC • Continuing development of vendor relationships • Arranging tests of multiple vendors’ new technologies • GUPFS FY02 Technical Report published on the web • http://www.nersc.gov/aboutnersc/pubs/GUPFS_02.pdf

  15. Technologies Evaluated • File Systems • Sistina GFS 5.1 • ADIC StorNext (CVFS) File System 2.0 • Lustre 0.6 (1.0 Beta 1) • Fabric • FC: Brocade SilkWorm and Qlogic SANbox2 • iSCSI: Cisco SN 5428, Intel iSCSI HBA, iSCSI over IB • Infiniband: InfiniCon InfinIO fabric bridge (IB to FC and GigE) • Inter-connect: Myrinet, GigE • Storage • 1Gb/s FC: Dot Hill, Silicon Gear, Chaparral • 2Gb/s FC: Chaparral, Yotta Yotta NetStorager, EMC CX 600

  16. File System Feature Matrix

  17. Benchmark Methodologies • NERSC Background: • Large number of user-written codes with varying file and I/O characteristics • Unlike industry, can’t optimize for a few applications • More general approach needed: • Determine the strengths and weaknesses of each file system • Profile the performance and scalability of fabric and storage as a baseline for file system performance • Profile file system parallel I/O and metadata performance over the various fabric and storage devices

  18. Our Approach to Benchmark Methodologies • Profile the performance and scalability of fabric and storage • Profile file system performance over the various fabric and storage devices • We have developed two flexible benchmarks: • The mptio benchmark, for parallel file I/O testing • Cache (in-cache) read/write • Disk (out-of-cache) read/write • The metabench benchmark, for parallel file system operations (metadata) testing

  19. Parallel File I/O Benchmark • MPTIO - parallel I/O performance benchmark • Uses MPI for synchronization and gathering results • I/O to files or raw devices • NOT an MPI-I/O code – uses POSIX I/O • Emulates a variety of user applications • Emulates how an MPI-I/O library would perform I/O • Run options: • Multiple I/O threads per MPI process • All processes perform I/O to different files/devices • All processes perform I/O to disjoint regions of the same file/device • Direct I/O, synchronous I/O, byte-range locking, etc. • Rotate file/offset assignments among processes/threads between I/O tests • Reports aggregate I/O rates; timestamps data for individual I/O ops
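  The slide above outlines MPTIO's design without source code. The sketch below is not the actual MPTIO implementation; it is a minimal illustration of the pattern described: MPI ranks performing plain POSIX I/O to disjoint regions of one shared file, timed collectively, with aggregate bandwidth derived from the slowest rank. The file path, transfer size, and iteration count are invented for illustration; the real benchmark adds I/O threads, direct/synchronous I/O, byte-range locking, and file/offset rotation.

```c
/* Minimal sketch (not MPTIO itself): each MPI rank writes a disjoint
 * region of a shared file with POSIX calls and the aggregate rate is
 * computed from the slowest rank. Path and sizes are illustrative. */
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const size_t chunk = 1 << 20;        /* 1 MiB per write() call   */
    const int    iters = 256;            /* 256 MiB written per rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(chunk);
    memset(buf, rank, chunk);

    /* Every rank opens the same file and writes its own disjoint region. */
    int fd = open("/gupfs/testfile", O_CREAT | O_WRONLY, 0644);
    off_t base = (off_t)rank * chunk * iters;

    MPI_Barrier(MPI_COMM_WORLD);         /* start all ranks together */
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++)
        pwrite(fd, buf, chunk, base + (off_t)i * chunk);
    fsync(fd);                           /* include the flush in the timing */

    double elapsed = MPI_Wtime() - t0, slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregate write: %.1f MB/s\n",
               (double)nprocs * iters * chunk / slowest / 1e6);

    close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}
```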

  20. Metadata I/O Benchmark • Metabench: measures the metadata performance of a shared file system • Uses MPI for synchronization and to gather results • Eliminates unfair metadata caching effects by using a separate process (rank 0) to perform test setup without participating in the timed test • Tests performed: • File creation, stat, utime, append, delete • Processes operating on files in distinct directories • Processes operating on files in the same directory
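  Metabench's source is likewise not part of the slides; the fragment below is only a minimal sketch of the kind of parallel metadata test described, in which every rank creates, stats, and deletes files in a shared directory and rank 0 reports an aggregate operation rate. The directory path and file count are made up, and the real benchmark's separate setup-only rank and utime/append phases are omitted for brevity.

```c
/* Minimal sketch (not metabench itself): timed create/stat/delete of
 * per-rank files in one shared directory, aggregate ops/sec reported. */
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nfiles = 1000;             /* files per rank (illustrative) */
    char path[256];
    struct stat st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* Creation phase: all ranks create files in the same directory. */
    for (int i = 0; i < nfiles; i++) {
        snprintf(path, sizeof(path), "/gupfs/metatest/f.%d.%d", rank, i);
        close(open(path, O_CREAT | O_WRONLY, 0644));
    }
    /* Stat and delete phases, timed together here for brevity. */
    for (int i = 0; i < nfiles; i++) {
        snprintf(path, sizeof(path), "/gupfs/metatest/f.%d.%d", rank, i);
        stat(path, &st);
        unlink(path);
    }

    double elapsed = MPI_Wtime() - t0, slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)                       /* create + stat + unlink per file */
        printf("%.0f metadata ops/sec aggregate\n",
               3.0 * nfiles * nprocs / slowest);

    MPI_Finalize();
    return 0;
}
```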

  21. Storage Performance

  22. Storage Scalability

  23. Fabric Performance & CPU Overhead (Write)

  24. Fabric Performance & CPU Overhead (Read)

  25. File System - File I/O Performance

  26. File System - Metadata Performance (1)

  27. File System - Metadata Performance (2)

  28. Performance Analysis – An Example • File system: GFS 5.1; storage system: Silicon Gear RAID5 • Test 1: write to a distinct, per-process file = 29.6 MB/sec • Test 2: write to a shared file = 50.8 MB/sec • Expected Test 2 to be slower due to high write-lock contention • Real reason: • Per-process I/O has no lock contention: each process writes to its node’s buffer cache, then flushes to disk, and the bursty I/O overloads the RAID controller • Shared-file I/O has write-lock contention: each node must flush its I/O to disk before releasing the lock to another node, giving a more even I/O rate to the RAID controller and better aggregate performance (in this case)
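  A rough illustration of the two write patterns being contrasted is sketched below. This is not GFS code: fcntl() byte-range locks merely stand in for the file system's internal write locks, and paths and sizes are invented. The point is that the buffered per-process-file case lets dirty pages pile up in the node's page cache and hit the RAID controller in one burst at fsync() time, while flushing before each lock release in the shared-file case spreads the same traffic more evenly.

```c
/* Illustration only (not GFS code): two ways the same data can reach
 * the storage array. fcntl() locks stand in for internal write locks. */
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

#define CHUNK  (1 << 20)
#define NCHUNK 64

static void lock_range(int fd, off_t off, off_t len, short type)
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = off, .l_len = len };
    fcntl(fd, F_SETLKW, &fl);
}

int main(void)
{
    char *buf = malloc(CHUNK);
    memset(buf, 0xab, CHUNK);

    /* Case 1: private file, fully buffered, one big flush at the end. */
    int fd1 = open("/gupfs/private.dat", O_CREAT | O_WRONLY, 0644);
    for (int i = 0; i < NCHUNK; i++)
        write(fd1, buf, CHUNK);
    fsync(fd1);                     /* burst of dirty pages hits the array */
    close(fd1);

    /* Case 2: shared file, flush before every lock release. */
    int fd2 = open("/gupfs/shared.dat", O_CREAT | O_WRONLY, 0644);
    off_t base = 0;                 /* each process would use its own base */
    for (int i = 0; i < NCHUNK; i++) {
        off_t off = base + (off_t)i * CHUNK;
        lock_range(fd2, off, CHUNK, F_WRLCK);
        pwrite(fd2, buf, CHUNK, off);
        fsync(fd2);                 /* forced out before the lock is dropped */
        lock_range(fd2, off, CHUNK, F_UNLCK);
    }
    close(fd2);

    free(buf);
    return 0;
}
```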

  29. MPTIO: GFS/SG Example

  30. GFS fsync() Bug

  31. Performance Regression Test

  32. Summary of Benchmarking Results • We have developed a set of benchmarking tools and a methodology for evaluation. • Not all storage devices are scalable. • Fabric technologies are still evolving • Fibre Channel (FC) delivers good I/O rate and low CPU overhead – but still pricey (~$1000 per port) • iSCSI is capable of sustaining good I/O rate, but with high CPU overhead – but much cheaper (~$100 per port + hidden performance cost) • SRP (SCSI RDMA Protocol over InfiniBand) is promising, with good I/O rate and low CPU overhead – but pricey (~$1000 per port)

  33. Summary of Benchmarking Results (Cont.) • Currently, no file system is a clear winner • Scalable storage is important to shared file system performance • Both ADIC and GFS show similar file I/O performance that matches the underlying storage performance. • Both ADIC and GFS show good metadata performance in different areas. • Lustre has not been stable enough to run many meaningful benchmarks. • More file systems need to be tested. • End-to-end performance analysis is difficult but important

  34. What About Lustre? • The GUPFS project is very interested in Lustre (SGSFS) • Lustre is early in its development cycle and very volatile • Difficult to install and set up • Buggy • Changing very quickly • Currently performing poorly

  35. Lustre File System CPU Overhead • Timeline of the monitored runs (RUN_n = parallel I/O with n clients): • 1:46:25 – cache flush starts • 2:07:57 – RUN_1 • 2:09:34 – cache flush • 2:31:12 – RUN_2 • 2:32:07 – cache flush • 2:55:10 – RUN_3 • 2:55:57 – cache flush • 3:19:30 – RUN_4 • 3:20:10 – gather log messages

  36. Near Term Activities Plan • Continue GFS and ADIC StorNext File System evaluation • Conduct further tests of newer Lustre releases • Conduct initial evaluation of GPFS, Panasas, Ibrix • Possibly conduct tests of additional file system, storage, and fabric technologies • USI’s SSI File System, QFS, CXFS, StorageTank, InfinArray, SANique, SANbolic • DataDirect, 3PAR • Cisco MD95xx, Qlogic iSCSI HBA, 4x HCA, TOE cards • Continue beta tests • GFS 5.2, ADIC StorNext 3.0, Yotta Yotta, InfiniCon, TopSpin • Complete test analysis and document results on the web • Plan future evaluation activities

  37. Summary • Five-year project under way to deploy a shared-disk file system in the NERSC production environment • Established benchmarking methodologies for evaluation • Technology is still evolving • Key technologies are being identified and evaluated • Currently, most products are buggy and do not scale well, but we have the vendors’ attention!
