Grid Datafarm and Bandwidth Challenge

Presentation Transcript


  1. Grid Datafarm and Bandwidth Challenge
  Osamu Tatebe, Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
  3rd PRAGMA, Fukuoka, January 2003

  2. Petascale Data-intensive Computing Requirements
  • Peta/Exabyte-scale files
  • Scalable parallel I/O throughput (see the back-of-envelope sketch after this slide)
    • > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
  • Scalable computational power
    • > 1 TFLOPS, hopefully > 10 TFLOPS
  • Efficient global sharing with group-oriented authentication and access control
  • Resource management and scheduling
  • System monitoring and administration
  • Fault tolerance / dynamic re-configuration
  • Global computing environment
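  For a sense of scale, here is a back-of-envelope sketch of why these throughput targets imply thousands of filesystem nodes. It uses only the per-node disk figure of roughly 115 MB/s quoted later on slide 11; the linear-scaling assumption (aggregate bandwidth grows with node count) is the Grid Datafarm model itself, and the program is illustrative, not from the original talk.

  #include <stdio.h>

  int main(void)
  {
      const double per_node_mb_s = 115.0;              /* slide 11: > 115 MB/s local disk I/O per node */
      const double targets_gb_s[] = { 100.0, 1000.0 }; /* slide 2 targets: > 100 GB/s, > 1 TB/s        */

      for (int i = 0; i < 2; i++) {
          /* nodes needed = target aggregate bandwidth / per-node bandwidth */
          double nodes = targets_gb_s[i] * 1000.0 / per_node_mb_s;
          printf("%.0f GB/s aggregate needs roughly %.0f nodes at %.0f MB/s each\n",
                 targets_gb_s[i], nodes, per_node_mb_s);
      }
      return 0;
  }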

  3. Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
  • Cluster-of-cluster filesystem on the Grid
    • File replicas among clusters for fault tolerance and load balancing
  • Extension of a striping cluster filesystem
    • Arbitrary file block length
    • Filesystem node = compute node + I/O node; each node has large, fast local disks
  • Parallel I/O, parallel file transfer, and more
  • Extreme I/O bandwidth, > TB/s
    • Exploit data access locality
    • File affinity scheduling and local file view
  • Fault tolerance: file recovery
    • Write-once files can be re-generated using a command history and re-computation
  [1] O. Tatebe et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002. Available at http://datafarm.apgrid.org/

  4. Distributed disks across the clusters form a single Gfarm file system
  • Each cluster generates the corresponding part of the data
  • The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
  • The analysis process is executed on the node that has the data (see the sketch below)
  [Map: Gfarm filesystem nodes in Indiana, San Diego, Tokyo, Baltimore, and Tsukuba]
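  As a rough illustration of this slide (an illustrative model only, not Gfarm's actual metadata schema or API), the sketch below treats a logical Gfarm file as a list of fragments, each held by one or more replica nodes, and schedules analysis onto a node that already stores the fragment. All host and fragment names are hypothetical.

  #include <stdio.h>

  #define MAX_REPLICAS 3

  struct fragment {
      const char *name;                    /* fragment of the logical file, e.g. "input.2" */
      const char *replicas[MAX_REPLICAS];  /* filesystem nodes holding a copy              */
      int n_replicas;
  };

  /* A logical file "gfarm:input" made of per-cluster fragments with replicas
   * (hypothetical hosts named after the sites on the slide). */
  static const struct fragment gfarm_input[] = {
      { "input.1", { "node01.tsukuba",  "node07.indiana"   }, 2 },
      { "input.2", { "node03.sandiego", "node02.tsukuba"   }, 2 },
      { "input.3", { "node05.tokyo",    "node01.baltimore" }, 2 },
  };

  /* File-affinity choice: prefer a node that already stores the fragment. */
  static const char *schedule_on(const struct fragment *f)
  {
      return f->n_replicas > 0 ? f->replicas[0] : NULL;
  }

  int main(void)
  {
      for (size_t i = 0; i < sizeof gfarm_input / sizeof gfarm_input[0]; i++)
          printf("%s -> run analysis on %s\n",
                 gfarm_input[i].name, schedule_on(&gfarm_input[i]));
      return 0;
  }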

  5. Extreme I/O bandwidth support example: gfgrep, a parallel grep
  % gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input
  • File affinity scheduling: the metadata server gfmd records where the fragments of gfarm:input live, and gfrun dispatches one gfgrep process to each of those nodes (Host1-3 at CERN.CH, Host4-5 at KEK.JP)
  • Each process opens gfarm:input, creates gfarm:output, sets the local file view on both (set_view_local), greps its local fragment input.N into output.N, and closes both files
  [Diagram: five gfgrep processes, one per filesystem node, each writing its own output fragment; a C sketch of the per-node step follows]
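  To make the per-node step concrete, here is a minimal, self-contained C sketch of what one gfgrep process does once file affinity scheduling has placed it on the node holding its input fragment. Plain C file I/O on local fragment paths stands in for the Gfarm open/create/set_view_local calls shown on the slide, and substring matching stands in for a real regexp engine; paths and names are illustrative, not the actual gfgrep source.

  #include <stdio.h>
  #include <string.h>

  /* Grep the local input fragment into the co-located output fragment. */
  static int grep_local_fragment(const char *in_path, const char *out_path,
                                 const char *pattern)
  {
      FILE *in = fopen(in_path, "r");    /* slide: open("gfarm:input") + set_view_local    */
      if (in == NULL)
          return -1;
      FILE *out = fopen(out_path, "w");  /* slide: create("gfarm:output") + set_view_local */
      if (out == NULL) {
          fclose(in);
          return -1;
      }

      char line[4096];
      while (fgets(line, sizeof line, in) != NULL)
          if (strstr(line, pattern) != NULL)  /* substring match instead of a regexp */
              fputs(line, out);

      fclose(in);
      fclose(out);
      return 0;
  }

  int main(int argc, char **argv)
  {
      if (argc != 4) {
          fprintf(stderr, "usage: %s input-fragment output-fragment pattern\n", argv[0]);
          return 1;
      }
      return grep_local_fragment(argv[1], argv[2], argv[3]) == 0 ? 0 : 1;
  }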

  6. Grid Datafarm US-OZ-Japan Testbed
  [Network diagram: AIST, Titech, KEK, and ICEPP in Japan, Indiana Univ. (Indianapolis GigaPoP) and SDSC in the US, and Melbourne, Australia, connected via Tsukuba WAN (1 Gbps), SuperSINET, APAN/TransPAC (OC-12 POS and OC-12 ATM via PNWG and StarLight), the NII-ESnet HEP PVC, and ESnet, with links ranging from 20 Mbps to GbE]
  Total disk capacity: 18 TB, disk I/O bandwidth: 6 GB/s

  7. SC2002 High-Performance Bandwidth Challenge: Grid Datafarm for a HEP application
  Osamu Tatebe (Grid Technology Research Center, AIST), Satoshi Sekiguchi (AIST), Youhei Morita (KEK), Satoshi Matsuoka (Titech & NII), Kento Aida (Titech), Donald F. (Rick) McMullen (Indiana), Philip Papadopoulos (SDSC)

  8. Target Application at SC2002: FADS/Goofy
  • Monte Carlo simulation framework with Geant4 (C++)
  • FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly
    http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/domains/simulation/
  • Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of the Gfarm filesystem, with good scalability
  • CPU-intensive event simulation with high-speed file replication and/or distribution

  9. Network and cluster configuration for the SC2002 Bandwidth Challenge
  [Network diagram: the Grid Cluster Federation booth at SC2002 (Baltimore) connects to SCinet over 10 GE through an E1200 switch; wide-area paths reach AIST, Titech, KEK, and ICEPP in Japan and Indiana Univ. and SDSC in the US over OC-12 POS via PNWG, APAN/TransPAC OC-12 ATM (271 Mbps) via StarLight, SuperSINET, the NII-ESnet HEP PVC, and ESnet, with edge links from 20 Mbps to GbE and Tsukuba WAN at 1 Gbps]
  Total bandwidth from/to the SC2002 booth: 2.137 Gbps
  Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s; peak CPU performance: 962 GFlops

  10. SC2002 Bandwidth Challenge: 2.286 Gbps using 12 nodes!
  • Transpacific, multiple routes in a single application!
  • Parallel file replication at 741 Mbps, a record speed
  [Route map: the SC2002 booth in Baltimore reaches AIST, Titech, KEK, and U Tokyo via Seattle and Chicago over the US backbone, MAFFIN, and Tsukuba WAN, plus Indiana Univ. and SDSC in San Diego; link speeds range from 20 Mbps to 10 Gbps, including 622 Mbps and 271 Mbps transpacific circuits and 1 Gbps campus links]

  11. Challenging points of TCP-based file transfer
  • Large latency, high bandwidth (aka a long fat network, LFN)
    • Big socket size for a large congestion window (see the socket-buffer sketch after this slide)
    • Fast window-size recovery after packet loss
      • HighSpeed TCP (Internet-Draft by Sally Floyd)
    • Network striping
  • Packet loss due to real congestion
    • Transfer control
  • Poor disk I/O performance
    • 3ware RAID with four 3.5" HDDs on each node
    • Over 115 MB/s (~1 Gbps network bandwidth)
  • Network striping vs. disk striping access
    • # streams, stripe size
  • Limited number of nodes
    • Need to achieve maximum file transfer performance
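  The "big socket size" bullet comes down to a couple of setsockopt calls. Below is a minimal sketch, assuming a POSIX/Linux TCP socket; the 8 MB buffer is an illustrative bandwidth-delay product (roughly 622 Mbps x 100 ms ≈ 7.8 MB), not a value taken from the slides.

  #include <stdio.h>
  #include <sys/socket.h>

  /* Create a TCP socket with large send/receive buffers so the congestion
   * window can grow enough to fill a long fat network. */
  int make_lfn_socket(void)
  {
      int sock = socket(AF_INET, SOCK_STREAM, 0);
      if (sock < 0) {
          perror("socket");
          return -1;
      }

      int bufsize = 8 * 1024 * 1024;   /* illustrative: ~bandwidth-delay product */

      /* Request large buffers before connect() so TCP window scaling is
       * negotiated on the SYN; the kernel may clamp the value to its
       * configured maximum (e.g. net.core.rmem_max / wmem_max on Linux). */
      if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof bufsize) < 0)
          perror("setsockopt(SO_SNDBUF)");
      if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof bufsize) < 0)
          perror("setsockopt(SO_RCVBUF)");

      return sock;
  }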

  12. File replication between the US and Japan
  [Chart: 10-sec average bandwidth over time]
  Using 4 nodes each in the US and Japan, we achieved 741 Mbps for file transfer (out of 893 Mbps, 10-sec average bandwidth)!

  13. SC02 Bandwidth Challenge Result
  [Charts: 0.1-sec, 1-sec, and 10-sec average bandwidth over time]
  We achieved 2.286 Gbps using 12 nodes! (outgoing 1.691 Gbps, incoming 0.595 Gbps)

  14. Demonstration

  15. Summary
  http://datafarm.apgrid.org/ datafarm@apgrid.org
  • Petascale data-intensive computing wave
    • Key technologies: Grid and cluster
  • Grid Datafarm is an architecture for
    • Online > 10 PB storage, > TB/s I/O bandwidth
    • Efficient sharing on the Grid
    • Fault tolerance
  • Initial performance evaluation shows scalable performance
    • 1742 MB/s and 1974 MB/s on writes and reads on 64 cluster nodes of Presto III
    • 443 MB/s using 23 parallel streams on Presto III
    • 1063 MB/s and 1436 MB/s on writes and reads on 12 cluster nodes of AIST Gfarm I
    • 410 MB/s using 6 parallel streams on AIST Gfarm I
  • Gfarm file replication achieved 2.286 Gbps at the SC2002 bandwidth challenge, and 741 Mbps out of 893 Mbps between the US and Japan!
  • Special thanks to: Rick McMullen, John Hicks (Indiana Univ., PRAGMA), Philip Papadopoulos (SDSC, PRAGMA), Hisashi Eguchi (MAFFIN), Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN), Chris Robb (Indiana Univ., Abilene), and Force10 Networks, Inc.
