Grid-Ireland Storage Update • Geoff Quigley, Stephen Childs and Brian Coghlan, Trinity College Dublin
Overview • e-INIS Regional Datastore @TCD • Recent storage procurement • Physical infrastructure • 10Gb networking • Simple lessons learned • STEP09 Experiences • Monitoring • Network (STEP09) • Storage
e-INIS • The Irish National e-Infrastructure • Funds Grid-Ireland Operations Centre • Creating a National Datastore • Multiple Regional Datastores • Ops Centre runs TCD regional datastore • For all disciplines • Not just science & technology • Projects with (inter)national dimension • Central allocation process • Grid and non-grid use
Procurement • Grid-Ireland @ TCD already had some • Dell PowerEdge 2950 (2x quad-core Xeon) • Dell MD1000 (SAS JBOD) • After procurement the datastore totals: • 8x Dell PE2950 (6x 1TB disks, 10GbE) • 30x MD1000, each with 15x 1TB disks • ~11.6 TiB each after RAID6 and XFS format (~350 TiB total) • 2x Dell blade chassis with 8x M600 blades each • Dell tape library (24x Ultrium 4 tapes) • HP ExDS9100 with 4 capacity blocks of 82x 1TB disks each and 4 blades • ~233 TiB total available for NFS/http export
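(As a sanity check on the per-shelf figure: RAID6 over 15x 1TB disks leaves 13 data disks, i.e. 13 × 10^12 bytes ≈ 11.8 TiB, and XFS metadata overhead brings this down to roughly the quoted 11.6 TiB; 30 shelves × 11.6 TiB ≈ 349 TiB, matching the ~350 TiB total.)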
Division of storage • DPM installed on Dell hardware • ~100TB for Ops Centre to allocate • Rest for Irish users via allocation process • May also try to combine with iRODS • HP ExDS high-availability store • iRODS primarily • vNFS exports • Not for conventional grid use • Bridge services on blades for community-specific access patterns
Infrastructure • Machine room needed an upgrade • Another cooler • UPS was maxed out • New high-current AC circuits added • 2x 3kVA UPS per rack acquired for Dell equipment • ExDS has 4x 16A three-phase feeds: 2 on room UPS, 2 raw mains • 10GbE to move data!
10GbE Optimisations • Benchmarked with netperf • http://www.netperf.org • Initially 1-2Gb/s… not good • Had machines that produced figures of 4Gb/s+ • What’s the difference? • Looked at a couple of documents on this: • http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf • http://docs.sun.com/source/819-0938-13/D_linux.html • Tested several of these optimisations • Initially little improvement (~100Mb/s) • Then identified the most important changes (example netperf run below)
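A minimal netperf invocation of the kind used for these tests (the hostname is a placeholder; netserver must already be running on the receiving machine):

    # On the receiving host: start the netperf daemon (listens on port 12865)
    netserver

    # On the sending host: 30-second TCP stream test towards the receiver
    netperf -H storage01.example.org -l 30 -t TCP_STREAM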
Best optimisations • Cards fitted to wrong PCI-E slot • Were x4 instead of x8 • New kernel version • New kernel supports MSI-X (multiqueue) • Was saturating one core, now distributes across cores • Increased MTU (from 1500 to 9216) • Large difference to netperf • Smaller difference to real loads • Then compared two switches with a direct connection
ATLAS STEP ’09 • Storage was mostly in place • 10GbE was there but being tested • Brought into production early in STEP09 • Useful exercise for us • See bulk data transfer in conjunction with user access to stored data • The first large 'real' load on the new equipment • Grid-Ireland Ops Centre at TCD involved as a Tier-2 site • Associated with the NL Tier-1
Peak traffic observed during STEP ‘09
What did we see? • Data transfers into TCD from NL • Peaked at 440 Mbit/s (capped at 500) • Recently upgraded firewall coped well
Internet to storage • HEAnet view of GEANT link • TCD view of Grid-Ireland link
What else did we see? • Lots of analysis jobs • Running on cluster nodes • Accessing large datasets directly from storage • Caused heavy load on network and disk servers • Caused problems for other jobs accessing storage • Now known that the access patterns were pathological • Also production jobs (ATLAS production, ATLAS analysis, LHCb production)
Storage to cluster • Almost all data stored on this server • 3x 1Gbit bonded links set up (sketch below)
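A minimal sketch of this style of link aggregation on an SL5-era system (interface names, bonding mode and addresses are assumptions, not our exact configuration; SL4 takes the bonding options in /etc/modprobe.conf instead):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.168.10.5
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=802.3ad miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (likewise for eth1 and eth2)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none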
MonAMI DPM work • Fix to distinguish FS with identical names on different servers • Fixed display of long labels • Display space token stats in TB • New code for pool stats
MonAMI status • Pool stats first to use the DPM C API • Previously everything was done via MySQL • Was able to merge some of these fixes • Time-consuming to contribute patches • Single “maintainer” with no dedicated effort … • MonAMI useful but future uncertain • Should UKI contribute effort to plugin development? • Or should similar functionality be created for “native” Ganglia? (sketch below)
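If the native-Ganglia route were taken, the crudest sketch would be a cron job pushing a DPM statistic in via gmetric; how the number is extracted here (dpm-qryconf plus an awk field position) is an illustrative assumption, not MonAMI’s method, which uses the DPM C API:

    #!/bin/bash
    # Hypothetical sketch: publish one DPM pool's free space to Ganglia.
    # The pool name and the awk field position are assumptions.
    FREE_GB=$(dpm-qryconf | awk '/^POOL atlas /{print $4}')
    gmetric --name dpm_pool_atlas_free --value "${FREE_GB:-0}" --type uint32 --units GB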
Conclusions • Recent procurement gave us a huge increase in capacity • STEP09 great test of data paths into and within our new infrastructure • Identified bottlenecks and tuned configuration • Back-ported SL5 kernel to support 10GbE on SL4 • Spread data across disk servers for load-balancing • Increased capacity of cluster-storage link • Have since upgraded switches • Monitoring crucial to understanding what’s going on • Weathermap for quick visual check • Cacti for detailed information on network traffic • LEMON and Ganglia for host load, cluster usage, etc.
Monitoring Links • Ganglia monitoring system • http://ganglia.info/ • Cacti • http://www.cacti.net/ • Network weathermap • http://www.network-weathermap.com/ • MonAMI • http://monami.sourceforge.net/
DPM wishes • Quotas are close to becoming essential for us • 10GbE problems have highlighted that releases on new platforms are needed far more quickly
10Gb summary • Firewall: 1Gb outbound, 10Gb internally • M8024 switch in ‘bridge’ blade chassis • 24-port (16 to blades) layer 3 switch • Force10 switch is the main ‘backbone’ • 10GbE cards in DPM servers • 10GbE uplink from ‘National Servers’ 6224 switch • 10GbE copper (CX4) from ExDS to M6220 in 2nd blade chassis • Link between the 2 blade chassis: M6220 - M8024 • 4-way LAG: Force10 - M8024
Force10 S2410 switch • 24-port 10Gb switch • XFP modules • Dell supplied our XFPs, so cost per port was reduced • 10Gb/s only • Layer 2 switch • Same Fulcrum ASIC as the Arista switch we tested • Uses a standard reference implementation
Arista demo switch • Arista Networks 7124S 24-port switch • SFP+ modules • Low cost per port (switches relatively cheap too) • ‘Open’ software - Linux • Even has bash available • Potential for customisation (e.g. iptables being ported) • Can run 1Gb/s and 10Gb/s simultaneously • Just plug in the different SFPs • Layer 2/3 • Some docs refer to layer 3 as a software upgrade
PCI-E • Our 10GbE cards are Intel PCI-E 10GBASE-SR • Dell had plugged most into an x4 PCI-E slot • An error was coming up in dmesg • Trivial solution: • I moved the cards to x8 slots • Can now get >5Gb/s on some machines (link-width check below)
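A quick way to confirm the negotiated link width (the PCI address is an example; find yours with lspci | grep -i ethernet):

    # LnkCap shows what the card supports, LnkSta what was negotiated;
    # "Width x4" in LnkSta on an x8 card means it is in the wrong slot
    lspci -vv -s 0a:00.0 | grep -i width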
MTU • Maximum Transmission Unit • Ethernet spec says 1500 • Most hardware/software can support jumbo frames • ixgbe driver allowed MTU=9216 • Must be set through the whole path • Different switches have different max values • Makes a big difference to netperf • Example on SL5 machines, 30s tests: • MTU=1500: TCP stream at 5399 Mb/s • MTU=9216: TCP stream at 8009 Mb/s
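Raising the MTU on the fly (interface name as on our servers; the switch ports along the path must be configured to match, and an MTU= line in the ifcfg file is needed to survive reboots):

    # Raise the MTU on the 10GbE interface
    ip link set dev eth2 mtu 9216

    # Verify
    ip link show eth2 | grep -o 'mtu [0-9]*'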
MSI-X and Multiqueue • Machines on SL4 kernels had very poor receive performance (50Mb/s) • One core was at 0% idle • Use mpstat -P ALL • Sys/softirq time used up the whole core • /proc/interrupts showed PCI-MSI in use • All RX interrupts went to one core • New kernel had MSI-X and multiqueue • Interrupts distributed, full RX performance
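The diagnosis commands, roughly as run (the IRQ number comes from the /proc/interrupts listing on the next slide; the affinity mask is an illustrative example):

    # Per-core utilisation at 1s intervals; look for a core at 0% idle
    mpstat -P ALL 1

    # Which vectors and cores are servicing the NIC's interrupts?
    grep eth2 /proc/interrupts

    # With a single-queue driver, an IRQ can at least be pinned by hand
    # (hex bitmask of allowed cores; 0f = CPUs 0-3)
    echo 0f > /proc/irq/114/smp_affinity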
Multiqueues

-bash-3.1$ grep eth2 /proc/interrupts
114: 247 694613 5597495 1264609 1103 15322 426508 2089709 PCI-MSI-X eth2:v0-Rx
122: 657 2401390 462620 499858 644629 234 1660625 1098900 PCI-MSI-X eth2:v1-Rx
130: 220 600108 453070 560354 1937777 128178 468223 3059723 PCI-MSI-X eth2:v2-Rx
138: 27 764411 1621884 1226975 839601 473 497416 2110542 PCI-MSI-X eth2:v3-Rx
146: 37 171163 418685 349575 1809175 17262 574859 2744006 PCI-MSI-X eth2:v4-Rx
154: 27 251647 210168 1889 795228 137892 2018363 2834302 PCI-MSI-X eth2:v5-Rx
162: 27 85615 2221420 286245 779341 363 415259 1628786 PCI-MSI-X eth2:v6-Rx
170: 27 1119768 1060578 892101 1312734 813 495187 2266459 PCI-MSI-X eth2:v7-Rx
178: 1834310 371384 149915 104323 27463 16021786 461 2405659 PCI-MSI-X eth2:v8-Tx
186: 45 0 158 0 0 1 23 0 PCI-MSI-X eth2:lsc