CHEP 04, Sep 27, 2004, Interlaken, Switzerland
Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing
Osamu Tatebe1, Noriyuki Soda2, Youhei Morita3, Satoshi Matsuoka4, Satoshi Sekiguchi1
1Grid Technology Research Center, AIST; 2SRA, Inc.; 3KEK; 4Tokyo Institute of Technology / NII
[Background] Petascale Data-Intensive Computing
• High Energy Physics
  • CERN LHC, KEK-B Belle
  • ~MB/collision, 100 collisions/sec
  • ~PB/year
  • 2,000 physicists, 35 countries
  [Figures: detector for the LHCb experiment; detector for the ALICE experiment]
• Astronomical Data Analysis
  • analysis of the entire observed data set
  • TB~PB/year/telescope
  • Subaru telescope: 10 GB/night, 3 TB/year
Petascale Data-Intensive Computing Requirements
• Peta/exabyte-scale files, millions of millions of files
• Scalable computational power
  • > 1 TFLOPS, hopefully > 10 TFLOPS
• Scalable parallel I/O throughput
  • > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
• Efficient global sharing with group-oriented authentication and access control
• Fault tolerance / dynamic re-configuration
• Resource management and scheduling
• System monitoring and administration
• Global computing environment
Goal and Features of Grid Datafarm
• Goal
  • Dependable data sharing among multiple organizations
  • High-speed data access, high-performance data computing
• Grid Datafarm
  • Gfarm File System – a global, dependable virtual file system
    • Federates scratch disks in PCs
  • Parallel and distributed data computing
    • Associates the Computational Grid with the Data Grid
• Features
  • Security based on the Grid Security Infrastructure
  • Scalable with data size and usage scenarios
  • Location-transparent data access
  • Automatic and transparent replica selection for fault tolerance
  • High-performance data access and computing by accessing multiple dispersed storage nodes in parallel (file affinity scheduling)
Grid Datafarm (1): Gfarm File System – a world-wide virtual file system [CCGrid 2002]
• Transparent access to dispersed file data in a Grid
  • POSIX I/O APIs (see the sketch below)
  • Applications can access the Gfarm file system without any modification, as if it were mounted at /gfarm
• Automatic and transparent replica selection for fault tolerance and access-concentration avoidance
[Figure: a virtual directory tree (e.g. /gfarm/ggf/jp, /gfarm/aist/gtrc) is mapped by file system metadata onto the Gfarm File System, with file replicas created across storage nodes]
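Because the interface is plain POSIX I/O, a client program needs nothing Gfarm-specific. The following minimal sketch is illustrative only; the path /gfarm/ggf/jp/file1 is simply the example entry from the directory tree above, and the Gfarm layer is assumed to be mounted or interposed.

```c
/* Illustrative only: reading a file through the virtual /gfarm tree
 * with nothing but standard POSIX I/O. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    /* Looks like any local open(); the Gfarm layer resolves the path,
     * picks a replica, and reads from a remote file system node. */
    int fd = open("/gfarm/ggf/jp/file1", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    close(fd);
    return 0;
}
```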
Grid Datafarm (2): High-performance data access and computing support [CCGrid 2002]
• Parallel and distributed file I/O
• Do not separate storage and CPU
Scientific Applications
• ATLAS Data Production
  • Distribution kit (binary)
  • Atlfast – fast simulation
    • Input data stored in the Gfarm file system, not NFS
  • G4sim – full simulation
  (Collaboration with ICEPP, KEK)
• Belle Monte Carlo Production
  • 30 TB of data needs to be generated
  • 3 M events (60 GB) per day are being generated using a 50-node PC cluster
  • Simulation data will be generated in a distributed manner at tens of universities and KEK
  (Collaboration with KEK, U-Tokyo)
Gfarm™ v1
• Open-source development
  • Gfarm™ version 1.0.3.1 released on July 5, 2004 (http://datafarm.apgrid.org/)
  • scp, GridFTP server, Samba server, . . .
• *Existing applications can access the Gfarm file system without any modification, using LD_PRELOAD of the Gfarm library (a minimal interposition sketch follows)
[Figure: architecture – a metadata server running gfmd and slapd, an application linked against the Gfarm library, and compute/file system nodes each running gfsd]
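The LD_PRELOAD approach works by interposing a shared library that overrides libc entry points and forwards non-Gfarm paths to the real implementation. The sketch below only illustrates that interposition mechanism for open(); it is not the actual Gfarm library code, and gfarm_open_hook() is a hypothetical placeholder for the library's own handling.

```c
/* Illustrative sketch of LD_PRELOAD interposition (not Gfarm source code).
 * Build as a shared object and run an unmodified program with
 *   LD_PRELOAD=./libhook.so ./app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>

/* Hypothetical placeholder for the preloaded library's own path handling. */
static int gfarm_open_hook(const char *path, int flags, mode_t mode)
{
    /* A real library would contact the metadata server, select a replica,
     * and return a descriptor connected to a file system node. */
    (void)path; (void)flags; (void)mode;
    return -1; /* stub */
}

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    if (strncmp(path, "/gfarm/", 7) == 0)
        return gfarm_open_hook(path, flags, mode);

    /* Everything else goes to the real libc open(). */
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    return real_open(path, flags, mode);
}
```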
Problems of Gfarm™ v1
• Functionality of file access
  • File open in read-write mode*, file locking (* supported in version 1.0.4)
• Robustness
  • Consistency between metadata and physical files
    • at unexpected application crash
    • at unexpected modification of physical files
• Security
  • Access control of file system metadata
  • Access control of files by group
• File model – a Gfarm file is a group of files (collection, container)
  • Flexibility of file grouping
Design of Gfarm™ v2
• Supports more than ten thousand clients and file server nodes
• Provides scalable file I/O performance
• Gfarm v2 – towards a *true* global virtual file system
  • POSIX compliant – supports read-write mode, advisory file locking, . . .
  • Robust, dependable, and secure
  • Can substitute for NFS, AFS, . . .
Related Work (1)
• Lustre
  • > 1,000 clients
  • Object (file) based management; objects placed on any OST
  • No replica management; writeback cache; collaborative read cache (planned)
  • GSSAPI, ACL, StorageTek SFS
  • Kernel module
  http://www.lustre.org/docs/ols2003.pdf
Related Work (2)
• Google File System
  • > 1,000 storage nodes
  • Fixed-size chunks, placed on any chunkserver
    • three replicas by default
  • User-level client library; no client or server cache
  • Not a POSIX API; designed for Google's data-processing needs [SOSP '03]
Opening Files in Read-Write Mode (1)
• Semantics (the same as AFS)
  • [without advisory file locking] Updated content becomes visible only when the file is opened after a writing process has closed it
  • [with advisory file locking] Among processes that lock the file, up-to-date content is visible within the locked region. This is not guaranteed when a process writes to the same file without file locking. (See the sketch below.)
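Since Gfarm v2 targets POSIX semantics for advisory locking, intended usage looks like ordinary fcntl() record locking. The sketch below is a generic POSIX illustration of the "with advisory file locking" case, not Gfarm-specific code; the path under /gfarm is hypothetical.

```c
/* Generic POSIX illustration: a reader that takes an advisory lock on a
 * region sees the latest data written by other lock holders of that region. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/gfarm/ggf/jp/file2", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Take a blocking read lock on the first 4 KB of the file. */
    struct flock lk = {
        .l_type = F_RDLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 4096,
    };
    if (fcntl(fd, F_SETLKW, &lk) < 0) {
        perror("fcntl(F_SETLKW)");
        close(fd);
        return 1;
    }

    /* Within the locked region, reads return up-to-date content. */
    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof(buf), 0);
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    /* Release the lock and close. */
    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);
    close(fd);
    return 0;
}
```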
Opening Files in Read-Write Mode (2)
[Figure: sequence diagram – Process 1 opens /grid/jp/file2 in "rw" mode and Process 2 opens it in "r" mode via the metadata server; each accesses a file copy on a file system node (FSN1, FSN2). Before the writer closes, any file copy can be accessed. When the writer closes, the now-invalid copy is deleted from the metadata, but file accesses already in progress continue.]
Advisory File Locking
[Figure: sequence diagram – Process 1 opens /grid/jp/file2 in "rw" mode and Process 2 opens it in "r" mode. On a read-lock request to the metadata server, caches are flushed and caching is disabled, and file access proceeds against the same file copy (FSN1) so that up-to-date content is seen within the locked region.]
Consistent Update of Metadata (1)
• Gfarm v1 – the Gfarm library updates the metadata
[Figure: the application's Gfarm library opens and closes the file on a file system node (FSN1) and then updates the metadata server itself. If the application crashes unexpectedly, the metadata is not updated.]
Consistent Update of Metadata (2)
• Gfarm v2 – the file system node updates the metadata (see the sketch below)
[Figure: the application opens the file through the Gfarm library; on close, or on a broken pipe, the file system node (FSN1) itself updates the metadata server. The metadata is therefore updated even when the application crashes unexpectedly.]
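One way to read this design: the connection handler on the file system node treats loss of the client connection exactly like an explicit close, so the metadata update happens in either case. The sketch below only illustrates that idea under assumed names (handle_client, read_request, update_metadata are hypothetical stubs); it is not the actual gfsd implementation.

```c
/* Illustrative sketch (hypothetical names, not gfsd source code):
 * the file system node updates metadata on an explicit close request
 * OR when the client connection is lost (EOF / broken pipe), so a
 * crashed application still leaves consistent metadata behind. */
#include <stdio.h>
#include <unistd.h>

enum request { REQ_CLOSE, REQ_IO /* ... */ };

/* Hypothetical stubs standing in for the real protocol and metadata RPC. */
static int read_request(int conn, enum request *req) { (void)conn; (void)req; return 0; }
static void serve_io(int conn) { (void)conn; }
static void update_metadata(const char *path, long size)
{
    printf("metadata update: %s size=%ld\n", path, size);
}

static void handle_client(int conn, const char *path, long current_size)
{
    for (;;) {
        enum request req;
        int n = read_request(conn, &req);   /* <= 0 means EOF / broken pipe */

        if (n <= 0 || req == REQ_CLOSE) {
            /* Either an explicit close or a lost connection (the
             * application crashed): update the metadata in both cases. */
            update_metadata(path, current_size);
            break;
        }
        serve_io(conn);                     /* ordinary read/write requests */
    }
    close(conn);
}

int main(void)
{
    handle_client(-1, "/gfarm/ggf/jp/file2", 0);  /* demo call with a dummy fd */
    return 0;
}
```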
Generalization of the File Grouping Model
[Figure: image files taken by the Subaru telescope – each shot consists of 10 files (one per CCD), and there are N such sets]
• 10 files executed in parallel
• N files executed in parallel
• 10 x N files executed in parallel
File Grouping by Directory
[Figure: directory tree night1/shot1 . . . night1/shotN, each shot containing ccd0, . . ., ccd9; a directory night1-ccd1 holds symlinks/hardlinks to night1/shotX/ccd1]
• gfs_pio_open("night1/shot2", &gf) – open a Gfarm file that concatenates ccd0, . . ., ccd9
• gfs_pio_set_view_section(gf, "ccd1") – set the file view to the ccd1 section
• gfs_pio_open("night1", &gf) – open a Gfarm file that concatenates shot1/ccd0, . . ., and shotN/ccd9
(A sketch combining these calls follows.)
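Putting the calls from this slide together, client code that processes one CCD section might look like the sketch below. The call forms follow the simplified versions written on the slide rather than the exact library prototypes (the real Gfarm API carries additional arguments such as open flags and error returns, so consult the Gfarm headers); the read loop around gfs_pio_read() is an assumed usage pattern.

```c
/* Sketch of reading one CCD section of a grouped Gfarm file.
 * Simplified interface as shown on the slide (assumed, not exact). */
#include <stdio.h>

typedef struct gfs_file *GFS_File;   /* opaque handle, as in the Gfarm API */

int gfs_pio_open(const char *path, GFS_File *gfp);
int gfs_pio_set_view_section(GFS_File gf, const char *section);
int gfs_pio_read(GFS_File gf, void *buf, int size, int *np);
int gfs_pio_close(GFS_File gf);

int process_ccd1_of_shot2(void)
{
    GFS_File gf;
    char buf[8192];
    int n;

    /* Open the Gfarm file that concatenates ccd0, ..., ccd9 of shot2. */
    if (gfs_pio_open("night1/shot2", &gf) != 0)
        return -1;

    /* Restrict the file view to the ccd1 section only. */
    if (gfs_pio_set_view_section(gf, "ccd1") != 0) {
        gfs_pio_close(gf);
        return -1;
    }

    /* Read and process the ccd1 image data. */
    while (gfs_pio_read(gf, buf, (int)sizeof(buf), &n) == 0 && n > 0) {
        /* ... process n bytes of image data ... */
    }

    gfs_pio_close(gf);
    return 0;
}
```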
Summary and Future Work
• Gfarm™ v2 aims at a global virtual file system with
  • Scalability up to more than ten thousand clients and file system nodes
  • Scalable file I/O performance
  • POSIX compliance (read-write mode, file locking, . . .)
  • Fault tolerance, robustness, and dependability
• The design and implementation were discussed
Future work
• Implementation and performance evaluation
• Evaluation of scalability up to more than ten thousand nodes
• Data preservation, automatic replica creation