
COTS Parallel Archive System: Integration and Performance Studies

This paper presents the integration and performance studies of a new parallel archive storage system concept, using commercial-off-the-shelf (COTS) components and innovative software technology. The system addresses challenges in scalability, speed, and flexibility for archival storage in the HPC community. The authors share their experience in integrating a global parallel file system and a standard backup/archive product with a parallel software code, demonstrating its capability to meet the requirements of future archival storage systems.


Presentation Transcript


  1. The University of California operates Los Alamos National Laboratory for the National Nuclear Security Administration of the United States Department of Energy. LANL Document Number LA-UR-10-06115. Integration Experiences and Performance Studies of a COTS Parallel Archive System: A New Parallel Archive Storage System Concept and Implementation. Hsing-bung (HB) Chen, Gary Grider, Cody Scott, Milton Turley, Aaron Torres, Kathy Sanchez, John Bremer. Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA. September 22nd, 2010. IEEE International Conference on Cluster Computing 2010, Heraklion, Crete, Greece.

  2. Abstract. Present and future archive storage systems are challenged to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management, (d) scale to support the changing needs of very large data sets, (e) support standard interfaces, and (f) use commercial-off-the-shelf (COTS) hardware. Parallel file systems face the same demands, but at performance levels one or more orders of magnitude higher. Archive systems continue to move closer to file systems in their design because of the need for speed and bandwidth, especially metadata search speed, adopting techniques such as more caching and less robust semantics. Currently, the number of highly scalable parallel archive solutions is very limited, especially for moving a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community. This approach is much faster than creating and maintaining a complete, unique, end-to-end parallel archive software solution. We relay our experience of integrating a global parallel file system and a standard backup/archive product with innovative parallel software code to construct a scalable, parallel archive storage system. Our solution has a high degree of overlap with current parallel archive products, including (a) parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high-volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS Parallel Archive System to the world’s first petaflop/s computing system, LANL’s Roadrunner machine, and demonstrated its capability to address the requirements of future archival storage systems. This new Parallel Archive System is now used on LANL’s Turquoise Network.

  3. Agenda • Background • Issues, Motivation, and Leverage of Using a COTS Parallel Archive System • Proposed COTS Parallel Archive System • Performance Studies on LANL’s Roadrunner Open Science Projects • Experience and Observed Issues of Our COTS Parallel Archive System • Summary and Future Work

  4. The DOE Advanced Strategic Computing Initiative Program published this Kiviat diagram, which shows parallel file systems scaling to performance an order of magnitude higher than parallel archives.

  5. Background • Parallel File Systems & Parallel I/O • HSM – Hierarchical Storage Management • ILM – Information Lifecycle Management • Non-Parallel vs. Parallel Archive Systems • Parallel Archives That Do Not Leverage Parallel File Systems as Their First Tier of Storage • Parallel Archives That Do Leverage Parallel File Systems as Their First Tier of Storage

  6. Archives That Do Not Leverage Parallel File Systems [Diagram labels: Cluster A; Cluster B; Scalable Storage Area Network; PFS; Global Parallel File System (scratch file system); FTA & similar (non-parallel data movement); Global Archive Storage System (disks + tapes)]

  7. Parallel Archives That Leverage Parallel File Systems [Diagram labels: Cluster A; Cluster B; Scalable Storage Area Network; PFS; NFS; Global Parallel File System (scratch file system); File Transfer Agent; Archive path: read PFS, write NFS; Migration path; Global Parallel File System + Parallel Tape Archive System (disks + tapes, HSM)]

  8. Motivation - 1 • Making more use of parallel file systems to provide a parallel archive is possible and makes sense • Can we combine a parallel file system with widely used, non-parallel COTS archive solutions to build a parallel archive, supplying the creative and unique code needed to provide the parallel archive service? • If this can be realized, the cost savings in providing this kind of parallel data movement service could be huge

  9. Motivation - 2 • Disk is becoming more competitive with tape over time for a larger portion of archival data • Moderate and growing volume in the Global Parallel File Systems market • Scalable bandwidth and metadata • Growing use of Global Parallel File Systems for moderate-scale HPC • HSM and ILM features in file systems and archives • High-volume (non-single parallel file) archives for backup/archive/content management • Leverage all free file movement/management tools in Linux: copy, move, ls, tar, etc. • A well-known file management environment • Get scp, sftp, and web/GUI file management for free

  10. Challenges for a Parallel Archive System: (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management, (d) scale to support the changing needs of very large data sets, (e) support standard interfaces, and (f) use commercial-off-the-shelf (COTS) hardware.

  11. Proposed COTS Parallel Archive System • Build a user-space parallel tree walker and copy utility • Add storage pool (stgpool) support (using the file system API) • Create an efficient ordered file retrieval utility (using DMAPI and a back-end tape system query) • Add support for ILM stgpool features • Add support for ILM stgpool and co-location features in the archive back end • Use FUSE to break up enormous files into pieces that can be migrated and recalled in parallel to/from the back-end tape system
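
The last item above, breaking an enormous file into pieces that can move in parallel, comes down to simple offset arithmetic. The following is a minimal sketch of that arithmetic only, not PFTOOL's or ArchiveFUSE's actual code; the names `chunk_t`, `chunk_count`, and `chunk_range` are hypothetical and introduced here for illustration.

```c
#include <stdint.h>

/* One piece of a large file: starting byte offset and length.
 * Hypothetical type, not taken from PFTOOL. */
typedef struct {
    uint64_t offset;
    uint64_t length;
} chunk_t;

/* Number of pieces of at most chunk_size bytes needed to cover
 * file_size bytes. Returns 0 if either argument is 0. */
uint64_t chunk_count(uint64_t file_size, uint64_t chunk_size)
{
    if (file_size == 0 || chunk_size == 0)
        return 0;
    return file_size / chunk_size + (file_size % chunk_size != 0);
}

/* Fill *out with the byte range of piece number `index` (0-based).
 * Returns 0 on success, -1 if index is past the last piece. */
int chunk_range(uint64_t file_size, uint64_t chunk_size,
                uint64_t index, chunk_t *out)
{
    if (index >= chunk_count(file_size, chunk_size))
        return -1;
    out->offset = index * chunk_size;
    uint64_t remaining = file_size - out->offset;
    /* The final piece may be shorter than chunk_size. */
    out->length = remaining < chunk_size ? remaining : chunk_size;
    return 0;
}
```

Each piece could then be handed to a different data mover, so a 10-byte file with a 4-byte chunk size would yield three pieces, the last one 2 bytes long.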

  12. Proposed Parallel Archive System - PFTOOL • Scalable FTA (File Transfer Agent) Cluster: • Mounts the site Global File System and other site shared file systems • Runs a commercial ILM-enabled parallel file system • Runs one or multiple copies of a commercial backup/archive product • Runs HSM • Jobs are submitted to the FTA cluster for optimized data movement to/from the archive [Diagram labels: Parallel & Scalable I/O Networking System; Cluster A; Cluster B; PFS (Parallel File System) I/O; PFS Scratch Global Parallel File System; Scalable FTA Cluster; Parallel Data Movers (PFTOOL); Storage Area Network; Global Parallel File System/ILM; Parallel Tape Archive System]

  13. PFTOOL’s Software Architecture • Manager – the conductor: • Coordinates the parallel tree walk • Balances file tree walking against parallel data moving • Manages the various queue operations • Assigns copy jobs to workers • Issues output/display requests • Generates the final statistics report [Diagram labels: message queues DirQ, NameQ, TapeQ, CopyQ, TapeCQ; MPI message passing; Workers (file stat, file copy, tape file restore); WatchDog; OutPutProc; ReadDirProcs; TapeProcs]

  14. PFTOOL’s MPI processes • Manager process: Conductor • OutPutProc process: Display process • WatchDog process: System status monitor • ReadDir process: Explore directory and sub-directory • TapeProc process: Tape data mover • Worker process: Parallel data mover to and from File systems
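
To make the Manager's role more concrete, the sketch below shows one way a conductor process could route a file-tree entry to a queue once the entry has been stat'ed. The queue names DirQ, CopyQ, and TapeQ come from the architecture slide, but the routing rule itself is an assumption made for illustration (the slides do not spell out each queue's role, and NameQ and TapeCQ are left out). The types `entry_kind_t` and `queue_id_t` and both functions are hypothetical.

```c
/* Hypothetical classification of a file-tree entry after a stat. */
typedef enum {
    ENTRY_DIRECTORY,
    ENTRY_FILE_ON_DISK,
    ENTRY_FILE_ON_TAPE   /* assumed: file data currently migrated to tape */
} entry_kind_t;

/* Queue names from the slide; the roles given to them here are assumed. */
typedef enum { QUEUE_DIR, QUEUE_COPY, QUEUE_TAPE } queue_id_t;

/* Assumed rule: directories go back to the tree walkers, tape-resident
 * files go to the tape movers, everything else goes to the copy workers. */
queue_id_t route_entry(entry_kind_t kind)
{
    switch (kind) {
    case ENTRY_DIRECTORY:
        return QUEUE_DIR;
    case ENTRY_FILE_ON_TAPE:
        return QUEUE_TAPE;
    case ENTRY_FILE_ON_DISK:
    default:
        return QUEUE_COPY;
    }
}

/* Printable queue name, matching the labels on the slide. */
const char *queue_name(queue_id_t q)
{
    switch (q) {
    case QUEUE_DIR:  return "DirQ";
    case QUEUE_TAPE: return "TapeQ";
    default:         return "CopyQ";
    }
}
```

Keeping the routing decision in one function is only a design convenience for this sketch; in an MPI program the returned queue would decide which process type receives the next message.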

  15. Parallel File System Tree Walker

  16. Parallel File System Tree Walker (continued)

  17. PFTOOL’s runtime environment
      • Runtime tuning parameters – NumProcs, NumTapeProcs, ChunkSize, StoragePool info, FUSE ChunkSize, CopySize
      • ArchiveFUSE file system – converts a very large file “N-to-1” copy into N-to-N copies for scaling and performance improvement
      • File Transfer Agent Cluster – GPFS client/FUSE client
      • PFTOOL runtime environment:
        • One Manager MPI process
        • One OutPutProc MPI process
        • One or more ReadDirProc MPI processes
        • One or more Worker MPI processes
        • Zero or more TapeProc MPI processes
        • One WatchDog MPI process
        • NumProc (MPI machine list) = Sum(all MPI processes)
        • Note: the number of TapeProcs is set to 0 during archiving, leaving more Workers for copying data
      • PFTOOL utilities – pfls, pfcp, pfcm
      • LoadManager – periodically generates the runtime MPI machine list
      • FTA Cluster runtime status: On/Off, Upgrade, Testing
      • GPFS/HSM/ILM/MySQL query service – runtime data migration and restore status
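
The process accounting on this slide (NumProc is the sum of all MPI processes, with three single-instance roles) can be checked with a few lines of arithmetic. This is a sketch under the assumption that every process not claimed by another role becomes a Worker; `worker_count` and `FIXED_PROCS` are hypothetical names, not PFTOOL parameters.

```c
/* Single-instance roles listed on the slide:
 * one Manager, one OutPutProc, one WatchDog. */
enum { FIXED_PROCS = 3 };

/* Given the total process count and the number of ReadDirProc and
 * TapeProc processes, return how many processes remain for Workers.
 * Returns -1 if the configuration breaks the slide's constraints
 * (at least one ReadDirProc, zero or more TapeProcs, at least one Worker). */
int worker_count(int num_procs, int num_readdir, int num_tape)
{
    if (num_readdir < 1 || num_tape < 0)
        return -1;
    int workers = num_procs - FIXED_PROCS - num_readdir - num_tape;
    return workers >= 1 ? workers : -1;
}
```

With 16 processes and 2 ReadDirProcs, setting the TapeProc count to 0 leaves 11 Workers, while 4 TapeProcs leaves 7. This matches the slide's note that archiving runs without TapeProcs to free more Workers for copying.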

  18. PFTOOL’s runtime activities • LoadManager – selects the machines on which processes run, based on each machine’s current CPU workload • Tape optimization – reduces tape-thrashing overhead (repeated tape mounting and unmounting) by lining up data for tape-optimized sequential archiving • Single large file parallel copy – parallel I/O data movement on a single large file • Very large file parallel copies – FUSE-enhanced implementation (conversion of an N-to-1 copy into N-to-N copies) • Runtime tunable parameters for adjusting the runtime performance of PFTOOL commands – size of the data chunk for copying, number of MPI processes, size of FUSE file selection, number of tape drives used
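
As an illustration of the LoadManager idea, the sketch below picks the least-loaded machines from a list. The slides state only that machines are selected by current CPU workload, and that the real LoadManager is written as PERL scripts; the sort-and-threshold rule, the `machine_t` type, and the `select_machines` function are all assumptions made for this example.

```c
#include <stdlib.h>

/* Hypothetical record for one FTA node and its current CPU load. */
typedef struct {
    const char *host;
    double cpu_load;
} machine_t;

/* qsort comparator: ascending CPU load. */
static int by_load(const void *a, const void *b)
{
    double la = ((const machine_t *)a)->cpu_load;
    double lb = ((const machine_t *)b)->cpu_load;
    return (la > lb) - (la < lb);
}

/* Sort machines in place by ascending load and return how many of the
 * least-loaded ones (at most `wanted`) have a load no greater than
 * max_load. The selected machines are machines[0 .. result-1]. */
size_t select_machines(machine_t *machines, size_t n,
                       size_t wanted, double max_load)
{
    qsort(machines, n, sizeof machines[0], by_load);
    size_t picked = 0;
    while (picked < n && picked < wanted &&
           machines[picked].cpu_load <= max_load)
        picked++;
    return picked;
}
```

The selected prefix of the array could then be written out as the MPI machine list for the next job.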

  19. PFTOOL Software System • Pftool – 7000+ lines of C/MPI code • GPFS DSM API code + MySQL database • Pftool commands – PERL scripts, Python scripts • Pftool loadmanager – PERL scripts • Trashcan – open-source Python scripts + modifications • Reusing/modifying GNU Coreutils software code – rm, copy, ...

  20. Less Aggressive MPI Polling Implementation in PFTOOL
      while (1) {                          // main receiving loop
          MPI_Recv( message fromProc );    // blocks until a message arrives
          ... process message ...
      }
      Figure 8-1: A typical aggressive polling (AP) MPI main receiving loop
      int msgready;
      while (1) {                          // main receiving loop
          msgready = 0;                    // reset before waiting for the next message
          while (msgready == 0) {          // polling control: message is not ready yet
              MPI_Iprobe(fromProc, tag, comm, &msgready, &mpistatus);
              if (msgready == 0)
                  usleep(n);               // sleep n microseconds before probing again
          }
          MPI_Recv( message fromProc );    // a message is waiting, so this returns at once
          ... process message ...
      }
      Figure 8-2: An enhanced less aggressive polling (LAP) loop with MPI_Iprobe checking

  21. Commands supported in PFTOOL • pfls – lists files in parallel using the parallel file tree walker • pfcp – copies files in parallel using the parallel file tree walker • pfcm – compares source and destination files byte by byte using the parallel file tree walker; users run it to verify the data integrity of files after a copy
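
The byte-content comparison that pfcm performs can be illustrated for a single pair of files. This is a serial sketch of the idea only, not the pfcm source (which runs the comparison across MPI workers); the function name `files_identical` and the 64 KiB buffer size are assumptions.

```c
#include <stdio.h>
#include <string.h>

/* Compare two files byte by byte.
 * Returns 1 if the contents are identical, 0 if they differ,
 * and -1 if either file cannot be opened or read. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = 1;

    if (!fa || !fb) {
        result = -1;
    } else {
        unsigned char buf_a[65536], buf_b[65536];
        size_t na, nb;
        do {
            na = fread(buf_a, 1, sizeof buf_a, fa);
            nb = fread(buf_b, 1, sizeof buf_b, fb);
            /* Different block lengths or different bytes: not identical. */
            if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
                result = 0;
                break;
            }
        } while (na > 0);
        /* A read error overrides any comparison result. */
        if (ferror(fa) || ferror(fb))
            result = -1;
    }

    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return result;
}
```

A parallel version could apply the same comparison to separate byte ranges of a large file, one range per worker.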

  22. Top Level view of PFTOOL’s System

  23. Parallel Archive Setup for Roadrunner’s Open Science Project
      [Diagram labels:]
      • Roadrunner cluster: one petaflop/s
      • Scratch file system /panfs: 4 petabytes capacity
      • Multiple 10GigE switches; two 10GigE links
      • 10 GPFS nodes (parallel data movers) run PFTOOL, mounting /panfs and /gpfs
      • FC switch (FC-4)
      • Six DS4800 fast disk pool: 200 TB
      • Five NSD nodes with slow disk pool: 200 TB
      • One 10GigE switch
      • One TSM server
      • LTO4 x 24 tape archive: over 4 petabytes

  24. Number of files per archive copy job

  25. Number of megabytes copied per job

  26. Data bandwidth (MB/sec) per copy job

  27. Average file size per copy job

  28. MPI Polling comparison studies – CPU occupancy

  29. MPI Polling comparison studies – data rate

  30. Experience and observed issues of our COTS Parallel Archive System
      • Small-file tape performance
        • Aggregate small files: bundle them into larger aggregates better suited to getting the tape drive up to full speed, then write the aggregate to tape
      • Tape optimization/smart recall
        • Ensure that all files in a tape-recall request are handled by the same machine (the tape thrashing problem)
      • Limitations of the synchronous deleter
        • Built-in synchronous delete function between GPFS and TSM
      • Single TSM server
        • Considering fail-over using multiple TSM servers
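
The tape thrashing problem can be illustrated by counting how often the tape changes when recall requests are processed in arrival order compared with tape order. The sketch below assumes each request carries a tape identifier and a position on that tape (the kind of information a back-end tape system query might return). The record layout and the functions `order_recalls` and `count_mounts` are hypothetical and do not represent the TSM or PFTOOL interfaces.

```c
#include <stdlib.h>

/* Hypothetical recall request: which tape holds the file and where. */
typedef struct {
    int tape_id;
    long position;
    const char *path;
} recall_t;

/* qsort comparator: by tape first, then by position on the tape. */
static int by_tape_then_position(const void *a, const void *b)
{
    const recall_t *ra = a;
    const recall_t *rb = b;
    if (ra->tape_id != rb->tape_id)
        return (ra->tape_id > rb->tape_id) - (ra->tape_id < rb->tape_id);
    return (ra->position > rb->position) - (ra->position < rb->position);
}

/* Reorder requests so each tape is read once, front to back. */
void order_recalls(recall_t *reqs, size_t n)
{
    qsort(reqs, n, sizeof reqs[0], by_tape_then_position);
}

/* Count tape mounts needed if requests are served in the given order
 * by a single drive: one mount for each change of tape. */
size_t count_mounts(const recall_t *reqs, size_t n)
{
    size_t mounts = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || reqs[i].tape_id != reqs[i - 1].tape_id)
            mounts++;
    return mounts;
}
```

For four requests that alternate between two tapes, arrival order needs four mounts while the sorted order needs two, which is the kind of saving that tape-ordered recall aims for.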

  31. Summary & Future Work • Parallel movement to/from tape for a single large parallel file • Hierarchical storage management • ILM features • High-volume (non-single parallel file) archives for backup/archive/content management • Leveraging all free file movement and management tools in Linux, such as copy, move, compare, ls, etc.

  32. Summary & Future Work (continued) • We are currently generalizing the PFTOOL software so that it accommodates most parallel file systems, such as PVFSv2, GFS, Ceph, Lustre, pNFS, etc. • We plan to add further parallel data movement commands to PFTOOL, such as parallel versions of chown, chmod, chgrp, find, touch, and grep.

  33. Q & A Thanks
