Optimizing USATLAS Data Transfers. Shawn McKee, USATLAS Tier-2/Tier-3 Meeting, SLAC, November 29th, 2007.
Optimizing USATLAS Data Transfers Shawn McKee USATLAS Tier-2/Tier-3 Meeting SLAC November 29th, 2007
Working Group on Data Transfers
• One of the central activities in our computing infrastructure is moving data between our sites.
• A small working group was tasked with exploring the current data transfer capabilities between our Tier-2 and Tier-1 sites:
  • Shawn McKee
  • Jay Packard
  • Hiro Ito
  • Dantong Yu
  • Wenjing Wu
USATLAS Tier-2/3 Mtg - SLAC
Roadmap for Data Transfer Tuning
• Our first goal is to achieve ~200 MB/s (megabytes per second) from the Tier-1 to each Tier-2. This can be achieved in either of two ways:
• A small number of high-performance systems (1-5) at each site transferring data via GridFTP, FDT or similar applications.
  • Single systems benchmarked with I/O > 200 MB/sec
  • 10 GE NICs (“back-to-back” tests achieved 9981 Mbps)
• A large number of “average” systems (10-50), each transferring to a corresponding system at the other site. This is a good match to an SRM/dCache transfer of many files between sites.
  • Typical current disks are 40-70 MB/sec
  • Gigabit network is 125 MB/sec
  • Doesn’t require great individual performance per host: 5-25 MB/sec
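The per-host figure in the second option follows from dividing the aggregate goal across the host pairs. The sketch below assumes the load is spread evenly, which real transfers will only approximate; the function name is illustrative and not taken from the slides.

```python
# Per-host rate needed to reach an aggregate site-to-site target,
# assuming the load is spread evenly across host pairs (an idealization).
TARGET_MB_PER_S = 200  # first goal from the roadmap: ~200 MB/s Tier-1 -> Tier-2


def per_host_rate(target_mb_s: float, host_pairs: int) -> float:
    """Rate each host pair must sustain to reach the aggregate target."""
    if host_pairs <= 0:
        raise ValueError("host_pairs must be positive")
    return target_mb_s / host_pairs


if __name__ == "__main__":
    for pairs in (1, 5, 10, 20, 50):
        rate = per_host_rate(TARGET_MB_PER_S, pairs)
        print(f"{pairs:3d} host pairs -> {rate:6.1f} MB/s each")
```

For 10 to 50 host pairs this gives 20 down to 4 MB/s each, roughly the 5-25 MB/sec range quoted above and well below both the disk and gigabit limits.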
Data Transfers for ATLAS
• Typical rate is 0.5 – 3 MB/s!!!
Our Goal is Reasonable (Modest?)…
• Nebraska (CMS Tier-2) achieved almost double this at “turn-on” of their 10GE link.
• After a recent dCache failure they recovered their entire cache in ~30 hours by averaging ~700 MB/sec.
Primary Issues
• Network
  • We first documented the existing network capabilities of each site
  • Typically POOR. Dantong showed results this morning.
  • After tuning, we could achieve “wire-speed” in memory-to-memory tests
• Storage
  • Highly variable. Individual disks can vary from 5 – 90 MB/sec depending upon type of disk interface, RPMs, cache, etc.
  • RAID controllers can utilize disks in parallel, in principle leading to linear increases in I/O performance as a function of the number of spindles (disks)…
  • Drivers/OS/Kernel can impact performance, as can various tunable parameters
• End-to-end Paths
  • What bottlenecks exist between sites? Is competing traffic a problem?
• Overall operations
  • How well does the system function when data has to move from disk-driver-memory-driver-network-wan-network-driver-memory-driver-disk?!
Achieving Good Networking Results
• Test system pairs with Iperf (TCP) to determine achievable bandwidth
• Check ‘ifconfig <ethx>’ to see if errors or packet loss exist
• Examine driver info with ‘ethtool -i <ethx>’
• Set TCP stack parameters to allow full use of the bottleneck bandwidth, typically 1 gigabit. Use the maximum expected round-trip time (RTT) to estimate the amount of memory needed for data “in flight”, and set this up in the TCP stack parameters. NOTE: set the maximums large enough…leave the default and “pressure” values low.
• Retest with Iperf to determine the effect of the change.
• Debug with Iperf (UDP), ethtool, NDT, wireshark if there are problems
• Remember to check both directions…
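The memory needed for data “in flight” is the bandwidth-delay product: bottleneck bandwidth times RTT. A minimal sketch of that estimate follows; the RTT values are illustrative assumptions, not measurements from any of the sites.

```python
# Bandwidth-delay product (BDP): bytes that must be "in flight" to keep
# a path of the given bandwidth full for one round trip.
GIGABIT_BITS_PER_S = 1_000_000_000


def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes in flight needed to fill the path: bandwidth * RTT / 8."""
    # bits/s * ms / 1000 gives bits; dividing by 8 gives bytes.
    return bandwidth_bits_per_s * rtt_ms // 8000


if __name__ == "__main__":
    # Illustrative RTTs only; use the maximum RTT you actually expect.
    for rtt_ms in (10, 40, 80, 160):
        size = bdp_bytes(GIGABIT_BITS_PER_S, rtt_ms)
        print(f"RTT {rtt_ms:3d} ms at 1 Gbps -> {size:>10,d} bytes in flight")
```

The maximum TCP buffer sizes should be at least this large for the longest path a host is expected to serve.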
Initial Findings
• Network for properly tuned hosts is not the bottleneck
• Memory-to-disk tests are interesting in that they can expose problematic I/O systems (or give confidence in them)
• Disk-to-disk tests do poorly. Still a lot of work required in this area. Possible issues:
  • Wrongly tuned parameters for this task (driver, kernel, OS)
  • Competing I/O interfering
  • Conflicts/inefficiency in the Linux “data path” (bus-driver-memory)
  • Badly organized hardware, e.g., network and storage cards sharing the same bus
  • Underpowered hardware or bad applications for driving gigabit links
Network Tools and Tunings
• The network stack is the 1st candidate for optimization
• Amount of memory allocated for data “in-flight” determines maximum achievable bandwidth for a given src-destination
• Parameters (example settings):
  • net.core.rmem_max = 20000000
  • net.core.wmem_max = 20000000
  • net.ipv4.tcp_rmem = 4096 87380 20000000
  • net.ipv4.tcp_wmem = 4096 87380 20000000
• Other useful tools: Iperf, NDT, wireshark, tracepath, ethtool, ifconfig, sysctl, netperf, FDT.
• Lots more info/results in this area available online…
  • http://www.usatlas.bnl.gov/twiki/bin/view/Admins/NetworkPerformanceP2.html
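To see what the example settings buy, the sketch below parses them and computes the longest RTT that each maximum buffer could cover at a given bandwidth. It treats the whole buffer as usable window, which is an optimistic assumption: the kernel reserves part of each socket buffer for overhead, so the real figure is lower. The helper names are illustrative.

```python
# Parse sysctl-style "key = value ..." lines and estimate the longest RTT
# the largest configured buffer can keep full at a given bandwidth.
EXAMPLE_SETTINGS = """\
net.core.rmem_max = 20000000
net.core.wmem_max = 20000000
net.ipv4.tcp_rmem = 4096 87380 20000000
net.ipv4.tcp_wmem = 4096 87380 20000000
"""


def parse_sysctl(text: str) -> dict:
    """Return {key: [int, ...]} for each non-empty, non-comment line."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = [int(v) for v in value.split()]
    return settings


def max_rtt_ms(buffer_bytes: int, bandwidth_bits_per_s: int) -> float:
    """Longest RTT (ms) for which buffer_bytes covers the bandwidth-delay
    product, assuming the entire buffer is usable as TCP window."""
    return buffer_bytes * 8 * 1000 / bandwidth_bits_per_s


if __name__ == "__main__":
    for key, values in parse_sysctl(EXAMPLE_SETTINGS).items():
        largest = values[-1]  # for tcp_rmem/tcp_wmem the last field is the max
        print(f"{key}: max {largest} bytes covers "
              f"{max_rtt_ms(largest, 1_000_000_000):.0f} ms at 1 Gbps, "
              f"{max_rtt_ms(largest, 10_000_000_000):.0f} ms at 10 Gbps")
```

Under that assumption a 20,000,000-byte maximum covers about 160 ms at 1 Gbps but only about 16 ms at 10 Gbps, so 10GE hosts on long paths may need larger maximums.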
Tunable Parameters Impacting I/O
• There are potentially MANY places in a Linux OS that can have an impact on I/O for WAN transfers…
• We need to explore the impact of various tunings and options.
• The purple areas in the figure have been at least initially explored by Kyu Park (UFL).
• Wenjing Wu (UM) will be continuing work in this area.
FDT Examples: High Performance Possible
http://monalisa.cern.ch/FDT/
• CERN -> CALTECH, 4U Disk Server with 24 HDs: reads and writes on 2 RAID controllers in parallel on each server. Mean traffic ~545 MB/s (~2 TB per hour).
• New York -> Geneva, 1U Server with 4 HDs: reads and writes on 4 SATA disks in parallel on each server. Mean traffic ~210 MB/s (~0.75 TB per hour).
• Working on integrating FDT with dCache
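The per-hour volumes quoted for FDT are a unit conversion of the mean rates. A short check, assuming decimal units (1 TB = 1,000,000 MB):

```python
# Convert a sustained rate in MB/s to volume moved per hour in TB,
# using decimal units (1 TB = 1,000,000 MB).
SECONDS_PER_HOUR = 3600
MB_PER_TB = 1_000_000


def tb_per_hour(mb_per_s: float) -> float:
    """Data volume in TB moved in one hour at the given MB/s rate."""
    return mb_per_s * SECONDS_PER_HOUR / MB_PER_TB


if __name__ == "__main__":
    for label, rate in (("CERN -> CALTECH", 545), ("New York -> Geneva", 210)):
        print(f"{label}: {rate} MB/s = {tb_per_hour(rate):.2f} TB per hour")
```

This gives about 1.96 and 0.76 TB per hour, matching the rounded ~2 and ~0.75 figures on the slide.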
SuperComputing 2007 Results
• See http://pcbunn.cacr.caltech.edu/sc2007/default.htm
• Achieved ~88 Gbps disk-to-disk across the WAN (many sites)
I/O Benchmark Tools & Notes
• At AGLT2 we wanted to understand the best I/O configuration for our new storage systems, so we have been running some I/O tests:
  • Benchmark tool: iozone (iozone-3.279-1.el4.rf.x86_64)
  • RAID configuration tool: omconfig (srvadmin-omacore-5.2.0-460.i386)
  • Soft RAID: mdadm (mdadm-2.6.1-4.x86_64)
• Test file size was 32GB (twice the physical memory size)
• Tested both write and read performance with different record sizes (32 64 128 256 512 1024 2048 4096 8192 kB).
• Average write or read performance means the rate averaged over all the record sizes.
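The slides do not give the exact iozone invocation, so the following is only a sketch of how such a sweep could be scripted. It assumes the standard iozone options `-i` (test selection, 0 = write/rewrite, 1 = read/re-read), `-s` (file size), `-r` (record size) and `-f` (test file); the test file path is hypothetical. The script only builds and prints the command lines, it does not run them.

```python
# Build one iozone command line per record size, with the test file sized
# at twice physical memory so the page cache cannot hold it.
import shlex

RECORD_SIZES_KB = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
PHYSICAL_MEMORY_GB = 16  # from the AGLT2 hardware description


def iozone_commands(test_file: str, memory_gb: int = PHYSICAL_MEMORY_GB) -> list:
    """Return a list of argv lists, one per record size (assumed flag set)."""
    file_size_gb = 2 * memory_gb  # twice physical memory, as in the tests
    commands = []
    for record_kb in RECORD_SIZES_KB:
        commands.append([
            "iozone",
            "-i", "0",                  # write/rewrite test
            "-i", "1",                  # read/re-read test
            "-s", f"{file_size_gb}g",   # file size
            "-r", f"{record_kb}k",      # record size
            "-f", test_file,            # file to test against
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical path on the file system under test.
    for argv in iozone_commands("/mnt/test/iozone.tmp"):
        print(shlex.join(argv))
```

Averaging the reported write (or read) rates over the nine runs would then give the “average” figure described above.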
I/O Testing at AGLT2: Hardware Details
• Chassis Model: PowerEdge 2950
• 2 CPUs: Intel Xeon CPU E5335@2.00GHz Model 15 Stepping 11
• Memory: 16GB DDR II SDRAM, Memory Speed: 667 MHz
• OS: Scientific Linux CERN SLC release 4.5 (Beryllium)
• Kernel version: 2.6.20-20UL3smp
• Version Report
  • BIOS Version: 1.5.1
  • BMC Version: 1.33
  • DRAC 5 Version: 1.14
• RAID controllers
  • PERC 5/E Adapter Version 5.1.1-0040 (Slot 1 PCI-e 8x)
  • PERC 5/E Adapter Version 5.1.1-0040 (Slot 2 PCI-e 4x)
• Storage enclosures: 4 MD1000 (each with 15 SATA-II 750GB disks)
Optimal Number of Disks for Reads?
• Read speed vs # disks in controller configuration
• NOTE: the peak number is 15; this corresponds to about 45 MB/s/disk
[Figure: Reads (KB/sec) vs Disks / RAID Configuration]
Optimal Number of Disks for Writes?
• Write speed vs # disks in controller configuration
• NOTE: the peak number is 10; this corresponds to about 45 MB/s/disk
[Figure: Writes (KB/sec) vs Disks / RAID Configuration]
Diagram of 1 Storage Configuration
• One visible partition: SR0-2R50
Multi-Process IOZone Testing Summary
• Summary of “best” settings
  • Ctrl FWB (forced write-back), RA (Read-ahead)
  • Array stripe size 128K
  • XFS file-system
  • PCI-e x8 slot allows 5% faster reads than PCI-e x4
• Configuration:
  • Best read is two RAID50 (30 disk) partitions
  • Best write is a single soft-RAID0 over two RAID5 partitions
  • The best read configuration gives “bad” writes and vice-versa!
• HOWEVER: All results are wrong! Summing rates achieved has problems…
IOZone Threaded Performance Test
• We had run initial IOZone tests with 1, 2, 4 and 8 IOZone processes and added the results…not correct if total test time “expands” (serializes access) and processes are in IOwait states…
• We reran using the threads feature of IOZone
• Results are very different but reasonable
• We are able to get read speeds in excess of 10GE speeds.
• Writes are more problematic…is there a CPU binding issue?
[Figure: Soft RAID0 over 2 RAID5 (30 disks each); I/O Rate (MB/s) vs Number of IOZone Threads]
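One way to see why adding per-process rates can mislead: each process divides its bytes by its own elapsed time, so if the processes do not fully overlap, the sum counts time in which a process had the storage to itself. The sketch below compares the summed rate with the aggregate rate over the full wall-clock window. The timings and volumes are made-up illustrations, not AGLT2 measurements.

```python
# Compare "sum of per-process rates" with the true aggregate rate
# (total data moved divided by the wall-clock span of the whole test).
from typing import List, Tuple

# Each run is (start_s, end_s, megabytes_moved) for one process.
Run = Tuple[float, float, float]


def summed_rate(runs: List[Run]) -> float:
    """Add up each process's own MB/s, as in the initial tests."""
    return sum(mb / (end - start) for start, end, mb in runs)


def aggregate_rate(runs: List[Run]) -> float:
    """Total MB moved divided by the span from first start to last end."""
    total_mb = sum(mb for _, _, mb in runs)
    wall_clock = max(end for _, end, _ in runs) - min(start for start, _, _ in runs)
    return total_mb / wall_clock


if __name__ == "__main__":
    # Hypothetical: two processes, 10,000 MB each, only partly overlapping.
    staggered = [(0.0, 100.0, 10_000.0), (50.0, 150.0, 10_000.0)]
    print(f"summed:    {summed_rate(staggered):.1f} MB/s")
    print(f"aggregate: {aggregate_rate(staggered):.1f} MB/s")
```

In this example the summed figure is 200 MB/s while the system actually delivered about 133 MB/s. When all processes run over exactly the same interval the two numbers agree, which is why a single threaded run measured over one window is easier to interpret.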
Future Directions at AGLT2
• We intend to test 64-bit Solaris on our Dell storage nodes
  • Hardware is “supported” by Sun (our college may have a license)
  • Possibility of 60 disks for ZFS is intriguing given our current tests
• Will explore testing methodologies for IOZone
  • Verify IOZone threads use separate processors (use -P option)
• In-depth exploration of “tunable” parameters for I/O
• Testing of higher-level storage options:
  • dCache + NFSv4(.1) (must upgrade to dCache 1.8)
  • Xrootd (take part of our existing resilient dCache -> Xrootd)
  • GlusterFS (small testing install on 3-6 nodes)
Next Steps for Working Group
• Lots of work is needed for our Tier-2s in this area.
• Example from AGLT2 implies each site will want to explore options for system optimization specific to their hardware
• Most reasonably powerful storage systems should be able to exceed 200MB/s. Getting this across the WAN is the challenge!
• For each Tier-2 we need to tune GSIFTP (or FDT) transfers between “powerful” storage systems to achieve “bottleneck” limited performance
  • Implies we document the bottleneck: I/O subsystem, network, processor?
  • Include 10GE connected pairs where possible…target 400MB/s/host-pair
• For those sites supporting SRM, try many host pair transfers to “go wide” in achieving high-bandwidth transfers.
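Documenting the bottleneck amounts to recording the sustained rate of each stage in the path and taking the minimum. The sketch below does that and estimates how many host pairs are needed to “go wide” to a target. The stage names and the disk rates are illustrative values picked from the ranges quoted earlier in this talk, not measurements of any particular site.

```python
# Identify the slowest stage of a transfer path and estimate how many
# parallel host pairs are needed to reach an aggregate target.
import math
from typing import Dict, Tuple


def bottleneck(stage_rates_mb_s: Dict[str, float]) -> Tuple[str, float]:
    """Return (stage name, rate) of the slowest stage; it caps the path."""
    name = min(stage_rates_mb_s, key=stage_rates_mb_s.get)
    return name, stage_rates_mb_s[name]


def host_pairs_needed(target_mb_s: float, per_pair_mb_s: float) -> int:
    """Host pairs required if each pair sustains per_pair_mb_s."""
    return math.ceil(target_mb_s / per_pair_mb_s)


if __name__ == "__main__":
    # Illustrative single-host path: disk figures are assumptions within
    # the 40-70 MB/sec range; 125 MB/sec is the gigabit limit.
    stages = {
        "source disk read": 70.0,
        "gigabit network": 125.0,
        "destination disk write": 45.0,
    }
    name, rate = bottleneck(stages)
    print(f"bottleneck: {name} at {rate:.0f} MB/s")
    print(f"host pairs for 200 MB/s: {host_pairs_needed(200, rate)}")
```

With these example numbers the destination disk limits each pair to 45 MB/s, so five pairs would be needed to reach the 200 MB/s goal.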
Write Compare of different parallel
• Writes vs # IOZone processes vs RAID config
[Figure: Writes (MB/sec) vs Number of Parallel IOZone Processes]
Reads vs Number of Parallel IOZones
• Reads vs # IOZone processes vs RAID config
[Figure: Reads (MB/sec) vs Number of Parallel IOZone Processes]