Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing

Osamu Tatebe 1, Noriyuki Soda 2, Youhei Morita 3, Satoshi Matsuoka 4, Satoshi Sekiguchi 1

CHEP 04, Sep 27, 2004, Interlaken, Switzerland


Presentation Transcript


  1. CHEP 04, Sep 27, 2004, Interlaken, Switzerland. Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing. Osamu Tatebe 1, Noriyuki Soda 2, Youhei Morita 3, Satoshi Matsuoka 4, Satoshi Sekiguchi 1. 1 Grid Technology Research Center, AIST; 2 SRA, Inc.; 3 KEK; 4 Tokyo Institute of Technology / NII

  2. [Background] Petascale Data-Intensive Computing • High Energy Physics • CERN LHC, KEK-B Belle • ~MB/collision, 100 collisions/sec • ~PB/year • 2,000 physicists, 35 countries [Photos: detector for the LHCb experiment; detector for the ALICE experiment] • Astronomical Data Analysis • analysis of the whole data set • TB~PB/year/telescope • Subaru telescope • 10 GB/night, 3 TB/year

  3. Petascale Data-Intensive Computing: Requirements • Peta/Exabyte-scale files, millions of millions of files • Scalable computational power • > 1 TFLOPS, hopefully > 10 TFLOPS • Scalable parallel I/O throughput • > 100 GB/s, hopefully > 1 TB/s within a system and between systems • Efficient global sharing with group-oriented authentication and access control • Fault tolerance / dynamic re-configuration • Resource management and scheduling • System monitoring and administration • Global computing environment

  4. Goal and features of Grid Datafarm • Goal • Dependable data sharing among multiple organizations • High-speed data access, high-performance data computing • Grid Datafarm • Gfarm File System – global dependable virtual file system • Federates scratch disks in PCs • Parallel & distributed data computing • Associates the Computational Grid with the Data Grid • Features • Secured by the Grid Security Infrastructure • Scalable with respect to data size and usage scenarios • Location-transparent data access • Automatic and transparent replica selection for fault tolerance • High-performance data access and computing by accessing multiple dispersed storages in parallel (file affinity scheduling)

  5. Grid Datafarm (1): Gfarm file system – a world-wide virtual file system [CCGrid 2002] • Transparent access to dispersed file data in a Grid • POSIX I/O APIs • Applications can access the Gfarm file system without any modification, as if it were mounted at /gfarm (see the sketch below) • Automatic and transparent replica selection for fault tolerance and access-concentration avoidance [Figure: a virtual directory tree rooted at /gfarm (e.g. /gfarm/ggf/jp/file1, file2 and /gfarm/aist/gtrc/file1 . . . file4), mapped by file system metadata onto the Gfarm file system, which creates file replicas across nodes]
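The slide above states that unmodified applications use ordinary POSIX I/O against the /gfarm mount point. The following minimal sketch illustrates that idea; the path /gfarm/ggf/jp/file1 is taken from the slide's example directory tree and is illustrative only.

    /* Minimal sketch: an unmodified POSIX program reading a file through
     * the Gfarm mount point; no Gfarm-specific calls are needed. */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;

        /* Ordinary open(); replica selection happens transparently. */
        int fd = open("/gfarm/ggf/jp/file1", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            write(STDOUT_FILENO, buf, (size_t)n);
        close(fd);
        return 0;
    }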

  6. Grid Datafarm (2): High-performance data access and computing support [CCGrid 2002] • Parallel and distributed file I/O • Do not separate storage and CPU

  7. Scientific Applications • ATLAS Data Production • Distribution kit (binary) • Atlfast – fast simulation • Input data stored in the Gfarm file system, not NFS • G4sim – full simulation (collaboration with ICEPP, KEK) • Belle Monte-Carlo Production • 30 TB of data needs to be generated • 3 M events (60 GB) / day are being generated using a 50-node PC cluster • Simulation data will be generated in a distributed manner at tens of universities and KEK (collaboration with KEK, U-Tokyo)

  8. Gfarm™ v1 • Open source development • Gfarm™ version 1.0.3.1 released on July 5, 2004 (http://datafarm.apgrid.org/) • scp, GridFTP server, Samba server, . . . • Existing applications can access the Gfarm file system without any modification by loading the Gfarm library via LD_PRELOAD (see the sketch below) [Figure: a metadata server running gfmd and slapd, an application linked with the Gfarm library, and compute/file system nodes (CPU + gfsd) . . .]
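The LD_PRELOAD note on this slide refers to the usual library interposition mechanism. The sketch below is not the actual Gfarm library; it only illustrates how a preloaded shared object can intercept open() and forward the call to the real libc symbol, which is the hook point a preloaded client library can use to redirect Gfarm paths.

    /* Illustrative LD_PRELOAD interposer (not the actual Gfarm library):
     * intercepts open() and forwards it to the real libc implementation.
     * Build: gcc -shared -fPIC -o interpose.so interpose.c -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/types.h>

    int open(const char *path, int flags, ...)
    {
        static int (*real_open)(const char *, int, ...);
        mode_t mode = 0;

        if (real_open == NULL)
            real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

        if (flags & O_CREAT) {      /* the mode argument exists only with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        /* A real client library would route /gfarm/... paths to its own
         * I/O code here instead of just passing the call through. */
        fprintf(stderr, "intercepted open(%s)\n", path);
        return real_open(path, flags, mode);
    }

Running an unmodified tool under such a preload, e.g. LD_PRELOAD=./interpose.so cat /gfarm/ggf/jp/file1, lets its file accesses be redirected without recompilation.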

  9. Problems of Gfarm™ v1 • Functionality of file access • File open in read-write mode*, file locking (* supported in version 1.0.4) • Robustness • Consistency between metadata and physical files • at unexpected application crash • at unexpected modification of physical files • Security • Access control of file system metadata • Access control of files by group • File model of a Gfarm file - group of files (collection, container) • Flexibility of file grouping

  10. Design of Gfarm™ v2 • Supports more than ten thousand clients and file server nodes • Provides scalable file I/O performance • Gfarm v2 – towards a *true* global virtual file system • POSIX compliant - supports read-write mode, advisory file locking, . . . • Robust, dependable, and secure • Can be substituted for NFS, AFS, . . .

  11. Related work (1) • Lustre • >1,000 clients • Object (file) based management, placed in any OST • No replica management, writeback cache, collaborative read cache (planned) • GSSAPI, ACL, StorageTek SFS • Kernel module http://www.lustre.org/docs/ols2003.pdf

  12. Related work (2) • Google File System • >1,000 storage nodes • Fixed-size chunk, placed in any chunkserver • by default, three replicas • User client library, no client and server cache • not POSIX API, support for Google’s data processing needs [SOSP’03]

  13. Opening files in read-write mode (1) • Semantics (the same as AFS) • [without advisory file locking] Updated content is available only when opening the file after a writing process closes it (see the sketch below) • [with advisory file locking] Among processes that lock a file, up-to-date content is available in the locked region. This is not ensured when a process writes the same file without file locking.
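A minimal sketch of the close-to-open semantics stated above, written against a hypothetical POSIX-visible Gfarm mount; the path and the two roles are illustrative only, and no locking is used, so only an open issued after the writer's close is guaranteed to see the update.

    /* Sketch of close-to-open semantics without advisory locking,
     * against a hypothetical /gfarm mount; path is illustrative only. */
    #include <fcntl.h>
    #include <unistd.h>

    #define PATH "/gfarm/ggf/jp/file2"

    static void writer(void)
    {
        int fd = open(PATH, O_WRONLY);
        if (fd < 0) return;
        (void)write(fd, "updated", 7);
        /* Until this close() returns, readers that already opened the
         * file may still see the previous content. */
        close(fd);
    }

    static void reader(void)
    {
        char buf[64];
        /* An open() issued after the writer's close() observes the
         * updated content; an open() issued before it need not. */
        int fd = open(PATH, O_RDONLY);
        if (fd < 0) return;
        (void)read(fd, buf, sizeof(buf));
        close(fd);
    }

    int main(void)
    {
        writer();   /* write and close first ...             */
        reader();   /* ... then a later open sees the update */
        return 0;
    }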

  14. Opening files in read-write mode (2) [Sequence diagram: process 1 issues fopen(“/grid/jp/file2”, “rw”) and process 2 issues fopen(“/grid/jp/file2”, “r”); the metadata server maps /grid/jp/file2 to file copies on file system nodes FSN1 and FSN2; before the writer closes the file, either copy can be accessed; at fclose(), the invalidated copy is deleted from the metadata, but file access already in progress on it continues]

  15. Advisory file locking [Sequence diagram: process 1 issues fopen(“/grid/jp/file2”, “rw”) and process 2 issues fopen(“/grid/jp/file2”, “r”) via the metadata server; on a read lock request, caches are flushed and caching is disabled, and subsequent file accesses from both processes go to the same copy on FSN1] (see the sketch below)
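Since Gfarm v2 aims to support POSIX advisory file locking (slide 10), the client-visible side of the protocol sketched above would look like an ordinary fcntl() lock. The sketch below assumes such a POSIX-visible mount; the path is illustrative only.

    /* Sketch of acquiring a shared (read) advisory lock, assuming the
     * Gfarm v2 mount exposes POSIX fcntl() locks; path is illustrative. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        struct flock fl = {
            .l_type   = F_RDLCK,    /* shared read lock                 */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,          /* 0 means "to the end of the file" */
        };

        int fd = open("/gfarm/ggf/jp/file2", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* While the lock is held, the locked region is kept up to date
         * (the cache flush / cache disable shown on this slide). */
        if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl"); return 1; }

        /* ... read the locked region here ... */

        fl.l_type = F_UNLCK;        /* release the lock */
        (void)fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }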

  16. Consistent update of metadata (1) • Gfarm v1 – the Gfarm library updates the metadata [Diagram: the application opens and accesses a file on file system node FSN1; at close, the Gfarm library updates the metadata on the metadata server] • The metadata is not updated at an unexpected application crash

  17. Consistent update of metadata (2) • Gfarm v2 – the file system node updates the metadata [Diagram: the application opens and accesses a file on file system node FSN1; on close or a broken pipe, FSN1 itself updates the metadata on the metadata server] • The metadata is updated by the file system node even at an unexpected application crash (see the sketch below)
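A hypothetical sketch of the v2 idea above: the file system node, not the client library, commits the metadata update when the client connection ends, whether by an orderly close or by a broken connection after a crash. The names serve_client and update_metadata_for are invented for illustration and are not from the actual gfsd source.

    /* Hypothetical sketch: a file system node commits the metadata
     * update once the client connection ends, covering both normal
     * close and application crash (broken pipe / reset). */
    #include <unistd.h>

    static void update_metadata_for(int client_fd)
    {
        /* Placeholder: in a real node this would contact the metadata
         * server and record the file's final state. */
        (void)client_fd;
    }

    void serve_client(int client_fd)
    {
        char buf[4096];
        ssize_t n;

        /* Handle file I/O requests until the peer goes away. */
        while ((n = read(client_fd, buf, sizeof(buf))) > 0) {
            /* ... decode the request and perform local file I/O ... */
        }

        /* n == 0: orderly close; n < 0: e.g. connection reset.  Either
         * way the node updates the metadata, so a crashed application
         * cannot leave the metadata inconsistent with the file. */
        update_metadata_for(client_fd);
        close(client_fd);
    }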

  18. Generalization of the file grouping model [Figure: image files taken by the Subaru telescope, organized as N sets of 10 files each] • 10 files executed in parallel • N files executed in parallel • 10 x N files executed in parallel

  19. File grouping by directory [Figure: a directory night1 containing shot1 . . . shotN, each shot containing sections ccd0 . . . ccd9; a directory night1-ccd1 holds symlinks/hardlinks such as one to night1/shot2/ccd1] • gfs_pio_open(“night1/shot2”, &gf) – open a Gfarm file that concatenates ccd0, . . ., ccd9 • gfs_pio_set_view_section(gf, “ccd1”) – set the file view to the ccd1 section • gfs_pio_open(“night1”, &gf) – open a Gfarm file that concatenates shot1/ccd0, . . ., and shotN/ccd9 (see the sketch below)
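A sketch of the call sequence shown on this slide. Only the gfs_pio_open and gfs_pio_set_view_section call forms come from the slide; the declarations below are assumptions standing in for the real Gfarm v2 headers, whose prototypes may take extra flag and error-reporting arguments.

    /* Sketch of the file-grouping calls shown on this slide; all
     * declarations are assumed, not copied from the Gfarm headers. */
    typedef struct gfs_file *GFS_File;                                /* assumed handle type */
    void gfs_pio_open(const char *path, GFS_File *gfp);               /* assumed prototype   */
    void gfs_pio_set_view_section(GFS_File gf, const char *section);  /* assumed prototype   */
    void gfs_pio_close(GFS_File gf);                                  /* assumed prototype   */

    void read_ccd1_of_shot2(void)
    {
        GFS_File gf;

        /* Open the Gfarm file that concatenates night1/shot2/ccd0 . . . ccd9. */
        gfs_pio_open("night1/shot2", &gf);

        /* Restrict the file view to the ccd1 section. */
        gfs_pio_set_view_section(gf, "ccd1");

        /* ... read the section data here ... */

        gfs_pio_close(gf);
    }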

  20. Summary and future work • Gfarm™ v2 aims at a global virtual file system with • Scalability up to more than ten thousand clients and file system nodes • Scalable file I/O performance • POSIX compliance (read-write mode, file locking, . . .) • Fault tolerance, robustness, and dependability • The design and implementation were discussed • Future work • Implementation and performance evaluation • Evaluation of scalability up to more than ten thousand nodes • Data preservation, automatic replica creation
