Explore the architecture and requirements of a cutting-edge Grid Datafarm project designed for petascale data-intensive computing, facilitating extreme I/O bandwidth, fault tolerance, and global sharing. Collaborative research involving physicists from 35 countries.
Worldwide File Replication on Grid Datafarm
Osamu Tatebe and Satoshi Sekiguchi
Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
APAN 2003 Conference, Fukuoka, January 2003
ATLAS/Grid Datafarm project: CERN LHC Experiment
• ~2000 physicists from 35 countries
• ATLAS detector: 40 m x 20 m, 7000 tons
• LHC perimeter: 26.7 km
• Collaboration between KEK, AIST, Titech, and ICEPP, U Tokyo
[Figure labels: detector for LHCb experiment; detector for ALICE experiment; ATLAS detector; truck]
Petascale Data-intensive Computing Requirements
• Peta/Exabyte scale files
• Scalable parallel I/O throughput
  • > 100GB/s, hopefully > 1TB/s within a system and between systems
• Scalable computational power
  • > 1TFLOPS, hopefully > 10TFLOPS
• Efficient global sharing with group-oriented authentication and access control
• Resource management and scheduling
• System monitoring and administration
• Fault tolerance / dynamic re-configuration
• Global computing environment
Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
• Cluster-of-cluster filesystem on the Grid
  • File replicas among clusters for fault tolerance and load balancing
  • Extension of striping cluster filesystem
  • Arbitrary file block length
  • Filesystem node = compute node + I/O node. Each node has large fast local disks.
  • Parallel I/O, parallel file transfer, and more
• Extreme I/O bandwidth, >TB/s
  • Exploit data access locality
  • File affinity scheduling and local file view
• Fault tolerance: file recovery
  • Write-once files can be re-generated using a command history and re-computation
[1] O. Tatebe, et al., Grid Datafarm Architecture for Petascale Data Intensive Computing, Proc. of CCGrid 2002, Berlin, May 2002. Available at http://datafarm.apgrid.org/
Distributed disks across the clusters form a single Gfarm file system
• Each cluster generates the corresponding part of data
• The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
• Analysis process is executed on the node that has the data
[Figure labels: Indiana, San Diego, Tokyo, Baltimore, Tsukuba]
Extreme I/O bandwidth support example: gfgrep - parallel grep
% gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input
• File affinity scheduling: gfmd, with gfarm:input stored on Host1.ch, Host2.ch, Host3.ch (CERN.CH) and Host4.jp, Host5.jp (KEK.JP)
• Each gfgrep process runs the following steps on its own host:
  open("gfarm:input", &f1)
  create("gfarm:output", &f2)
  set_view_local(f1)
  set_view_local(f2)
  grep regexp
  close(f1); close(f2)
• Host1.ch through Host5.jp read input.1 through input.5 and write output.1 through output.5
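The file affinity idea on this slide (run each process on a host that already stores the fragment it reads) can be sketched in a few lines of Python. This is only an illustration: `schedule_by_affinity`, the `fragment_hosts` mapping, and the tie-break by lowest load are hypothetical names and choices, not the Gfarm scheduler or its API, and the fragment layout is assumed from the slide's labels.

```python
from collections import defaultdict

def schedule_by_affinity(fragment_hosts):
    """Assign each file fragment to one host that stores a copy of it.

    fragment_hosts maps a fragment name to the hosts holding a replica
    (hypothetical metadata, standing in for what a metadata server reports).
    Among the replica holders, the host with the fewest fragments assigned
    so far is chosen, so replicas also help balance load.
    """
    load = defaultdict(int)
    plan = {}
    for fragment in sorted(fragment_hosts):
        hosts = fragment_hosts[fragment]
        if not hosts:
            raise ValueError(f"no replica available for {fragment}")
        chosen = min(hosts, key=lambda h: (load[h], h))
        plan[fragment] = chosen
        load[chosen] += 1
    return plan

if __name__ == "__main__":
    # Layout assumed from the slide: input.N is stored on HostN.
    layout = {
        "input.1": ["Host1.ch"],
        "input.2": ["Host2.ch"],
        "input.3": ["Host3.ch"],
        "input.4": ["Host4.jp"],
        "input.5": ["Host5.jp"],
    }
    for fragment, host in schedule_by_affinity(layout).items():
        print(f"{host}: grep regexp {fragment}")
```

With one replica per fragment the plan is forced; the load-based choice only matters once fragments have been replicated to several hosts.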
Design of AIST Gfarm Cluster I
• Cluster node (high density and high performance)
  • 1U, dual 2.4GHz Xeon, GbE
  • 480GB RAID with four 3.5" 120GB HDDs + 3ware RAID
  • 136MB/s on writes, 125MB/s on reads
• 12-node experimental cluster (operational from Oct 2002)
  • 12U + GbE switch (2U) + KVM switch (2U) + keyboard + LCD
  • 6TB RAID in total with 48 disks
  • 1063MB/s on writes, 1437MB/s on reads
  • 410 MB/s for file replication with 6 streams
  • WAN emulation with NIST Net
[Figure labels: 120MB/s, GbE switch, 10GFlops, 480GB]
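As a rough reading of these numbers, the aggregate throughput can be compared with 12 times the single-node RAID figure. The sketch below assumes the per-node and 12-node measurements are directly comparable, which the slide does not state; `scaling_efficiency` is an illustrative helper, not part of Gfarm.

```python
def scaling_efficiency(aggregate_mb_s, per_node_mb_s, nodes):
    """Ratio of measured aggregate throughput to nodes * per-node throughput."""
    ideal = per_node_mb_s * nodes
    return aggregate_mb_s / ideal

if __name__ == "__main__":
    NODES = 12
    # Figures from the slide: per-node RAID and 12-node aggregate, in MB/s.
    write_eff = scaling_efficiency(1063, 136, NODES)
    read_eff = scaling_efficiency(1437, 125, NODES)
    print(f"writes: {write_eff:.1%} of {136 * NODES} MB/s")
    print(f"reads:  {read_eff:.1%} of {125 * NODES} MB/s")
```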
Grid Datafarm US-OZ-Japan Testbed
[Network diagram. Sites: KEK, Titech, AIST, ICEPP (Japan); SDSC, Indiana Univ. (US); Melbourne (Australia). Link and exchange-point labels: Tsukuba WAN, 1 Gbps, OC-12 POS, PNWG, Tokyo NOC, OC-12, APAN/TransPAC, Indianapolis GigaPoP, OC-12 ATM, StarLight, GbE, SuperSINET, NII-ESnet HEP PVC, NOC, ESnet, 20 Mbps]
Total disk capacity: 18 TB, disk I/O bandwidth: 6 GB/s
SC2002 High-Performance Bandwidth Challenge
Grid Datafarm for a HEP application
Osamu Tatebe (Grid Technology Research Center, AIST), Satoshi Sekiguchi (AIST), Youhei Morita (KEK), Satoshi Matsuoka (Titech & NII), Kento Aida (Titech), Donald F. (Rick) McMullen (Indiana), Philip Papadopoulos (SDSC)
Target Application at SC2002: FADS/Goofy
• Monte Carlo Simulation Framework with Geant4 (C++)
• FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly
  http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/domains/simulation/
• Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of Gfarm filesystem with good scalability
• CPU intensive event simulation with high speed file replication and/or distribution
Network and cluster configuration for SC2002 Bandwidth Challenge
[Network diagram. Sites: KEK, Titech, AIST, ICEPP (Japan); SDSC, Indiana Univ., SC2002 Grid Cluster Federation Booth in Baltimore (US). Link and exchange-point labels: Tsukuba WAN, 1 Gbps, OC-12 POS, PNWG, Tokyo NOC, APAN/TransPAC, OC-12 ATM (271Mbps), StarLight, Indianapolis GigaPoP, OC-12, GbE, SuperSINET, NII-ESnet HEP PVC, ESnet, NOC, 20 Mbps, SCinet, 10 GE, E1200]
Total bandwidth from/to SC2002 booth: 2.137 Gbps
Total disk capacity: 18 TB, disk I/O bandwidth: 6 GB/s
Peak CPU performance: 962 GFlops
SC2002 Bandwidth Challenge
• 2.286 Gbps using 12 nodes!
• Transpacific: multiple routes in a single application!
• 741 Mbps: a record speed
• Parallel file replication
[Network diagram labels: Indiana Univ, AIST, Tsukuba WAN, MAFFIN, Tokyo, Seattle, US Backbone, SC2002 Booth (Baltimore), KEK, Chicago, San Diego, SDSC, Titech, U Tokyo; link rates shown: 10Gbps, 1Gbps, 622Mbps, 271Mbps, 20Mbps]
Network and cluster configuration
• SC2002 booth
  • 12-node AIST Gfarm cluster connected with GbE connects to the SCinet with 10 GE using Force10 E1200
  • Performance in LAN: network bandwidth 930Mbps, file transfer bandwidth 75MB/s (=629Mbps)
• GTRC, AIST
  • The same 7-node AIST Gfarm cluster connects to Tokyo XP with GbE via Tsukuba WAN and Maffin
• Indiana Univ
  • 15-node PC cluster connected with Fast Ethernet connects to Indianapolis GigaPoP with OC12
• SDSC
  • 8-node PC cluster connected with GbE connects to outside with OC12
• TransPAC north and south routes
  • Default north route
  • South route is used by static routing for 3 nodes each at the SC booth and AIST
  • RTT between AIST and SC booth: north 199ms, south 222ms
  • South route is shaped to 271Mbps
[Figure: Network configuration at SC2002 booth. Labels: 12 nodes (PC), GbE, E1200, 10GE, OC192, SCinet]
Challenging points of TCP-based file transfer
• Large latency, high bandwidth (long fat network, aka LFN)
  • Big socket size for large congestion window
  • Fast window-size recovery after packet loss
    • High Speed TCP (internet draft by Sally Floyd)
  • Network striping
• Packet loss due to real congestion
  • Transfer control
• Poor disk I/O performance
  • 3ware RAID with four 3.5" HDDs on each node
  • Over 115 MB/s (~ 1 Gbps network bandwidth)
  • Network striping vs. disk striping access
    • # streams, stripe size
• Limited number of nodes
  • Need to achieve maximum file transfer performance
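The "big socket size" point follows from the bandwidth-delay product: to keep a path full, TCP needs a window of roughly bandwidth times round-trip time. The sketch below applies that standard formula to the TransPAC figures given elsewhere in this talk (north route 622 Mbps with 199 ms RTT, south route shaped to 271 Mbps with 222 ms RTT). Splitting the window evenly across parallel streams is a simplifying assumption about network striping, and the function names are illustrative.

```python
import math

def bdp_bytes(bandwidth_mbps, rtt_ms):
    """Bandwidth-delay product in bytes for a path of the given rate and RTT."""
    return bandwidth_mbps * 1e6 * rtt_ms / 1e3 / 8

def per_stream_buffer(bdp, streams):
    """Socket buffer each stream needs if the window is split evenly."""
    return math.ceil(bdp / streams)

if __name__ == "__main__":
    routes = {"north": (622, 199), "south": (271, 222)}  # (Mbps, RTT in ms)
    for name, (mbps, rtt) in routes.items():
        bdp = bdp_bytes(mbps, rtt)
        print(f"{name}: BDP = {bdp / 1e6:.2f} MB")
        for streams in (1, 4, 8):
            buf = per_stream_buffer(bdp, streams)
            print(f"  {streams} stream(s): {buf / 1e6:.2f} MB per socket")
```

A single stream on the north route would need a window of about 15 MB, which is why both large socket buffers and striping over several streams appear on the list above.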
Bandwidth measurement result of TransPAC by Iperf
• Northern route: 2 node pairs
• Southern route: 3 node pairs (100Mbps streams)
• 753Mbps in total (10-sec avg) (peak bandwidth: 622+271=893Mbps)
[Plots: 10-sec average bandwidth and 5-min average bandwidth, northern and southern routes]
Bandwidth measurement result between SC-booth and other sites by Iperf (1-min average)
[Plot series: SDSC, Indiana Univ, TransPAC North, TransPAC South]
Plot annotations:
• 10GE not available
• Due to evaluation of different configurations of southern route
• Due to a packet loss problem of Abilene between Denver and Kansas City
• TransPAC Northern route has very high deviation
File replication between US and Japan
Using 4 nodes each in the US and Japan, we achieved 741 Mbps for file transfer! (out of 893 Mbps, 10-sec average bandwidth)
[Plot: 10-sec average bandwidth]
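To put the 741 Mbps figure in context, it can be compared with the 893 Mbps peak (622 Mbps north plus 271 Mbps south) and with the 753 Mbps measured by Iperf on the same routes. A short sketch of that arithmetic; the helper name is illustrative:

```python
NORTH_MBPS = 622   # OC-12 northern route
SOUTH_MBPS = 271   # shaped southern route
PEAK_MBPS = NORTH_MBPS + SOUTH_MBPS

def utilization(achieved_mbps, available_mbps):
    """Fraction of the available bandwidth that was actually used."""
    return achieved_mbps / available_mbps

if __name__ == "__main__":
    print(f"peak: {PEAK_MBPS} Mbps")
    print(f"Iperf (753 Mbps): {utilization(753, PEAK_MBPS):.1%} of peak")
    print(f"file replication (741 Mbps): {utilization(741, PEAK_MBPS):.1%} of peak")
    print(f"file replication vs. Iperf: {utilization(741, 753):.1%}")
```

File replication therefore reached about 83% of the nominal peak and came within about 2% of the raw Iperf measurement.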
File replication performance between SC-booth and other US sites
SC02 Bandwidth Challenge Result
We achieved 2.286 Gbps using 12 nodes! (outgoing 1.691 Gbps, incoming 0.595 Gbps)
[Plots: 0.1-sec, 1-sec, and 10-sec average bandwidth]
Summary
http://datafarm.apgrid.org/ datafarm@apgrid.org
• Petascale Data Intensive Computing Wave
• Key technology: Grid and cluster
• Grid Datafarm is an architecture for
  • Online >10PB storage, >TB/s I/O bandwidth
  • Efficient sharing on the Grid
  • Fault tolerance
• Initial performance evaluation shows scalable performance
  • 1742 MB/s, 1974 MB/s on writes and reads on 64 cluster nodes of Presto III
  • 443 MB/s using 23 parallel streams on Presto III
  • 1063 MB/s, 1436 MB/s on writes and reads on 12 cluster nodes of AIST Gfarm I
  • 410 MB/s using 6 parallel streams on AIST Gfarm I
  • Metaserver overhead is negligible
• Gfarm file replication achieves 2.286 Gbps at SC2002 bandwidth challenge, and 741 Mbps out of 893 Mbps between US and Japan!
• Smart resource broker is needed!!
Special thanks to
• Rick McMullen, John Hicks (Indiana Univ, PRAGMA)
• Philip Papadopoulos (SDSC, PRAGMA)
• Hisashi Eguchi (Maffin)
• Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN)
• Chris Robb (Indiana Univ, Abilene)
• Force10 Networks, Inc.