
Grid-Ireland Storage Update


Presentation Transcript


  1. Grid-Ireland Storage Update Geoff Quigley, Stephen Childs and Brian Coghlan Trinity College Dublin

  2. Overview • e-INIS Regional Datastore @TCD • Recent storage procurement • Physical infrastructure • 10Gb networking • Simple lessons learned • STEP09 Experiences • Monitoring • Network (STEP09) • Storage

  3. e-INIS • The Irish National e-Infrastructure • Funds Grid-Ireland Operations Centre • Creating a National Datastore • Multiple Regional Datastores • Ops Centre runs TCD regional datastore • For all disciplines • Not just science & technology • Projects with (inter)national dimension • Central allocation process • Grid and non-grid use

  4. Procurement • Grid-Ireland @ TCD already had some hardware: • Dell PowerEdge 2950 (2x quad-core Xeon) • Dell MD1000 (SAS JBOD) • After procurement the data store has in total: • 8x Dell PE2950 (6x 1TB disks, 10GbE) • 30x MD1000, each with 15x 1TB disks • ~11.6 TiB each after RAID6 and XFS format (~350TiB total) • 2x Dell blade chassis with 8x M600 blades each • Dell tape library (24x Ultrium 4 tapes) • HP ExDS9100 with 4 capacity blocks of 82x 1TB disks each and 4 blades • ~233 TiB total available for NFS/http export
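
  As a rough sanity check on the per-shelf figure (assuming two parity disks per RAID6 set and a small XFS formatting overhead):

      15x 1 TB disks in RAID6        =>  13 data disks
      13 x 10^12 bytes / 2^40        ≈   11.8 TiB raw
      less XFS metadata/journal      ≈   11.6 TiB usable per shelf
      30 shelves x 11.6 TiB          ≈   350 TiB (the total quoted above)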

  5. Division of storage • DPM installed on Dell hardware • ~100TB for Ops Centre to allocate • Rest for Irish users via allocation process • May also try to combine with iRODS • HP-ExDS high availability store • iRODS primarily • vNFS exports • Not for conventional grid use • Bridge services on blades for community specific access patterns
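
  For illustration only, a DPM allocation would typically be set up with the standard DPM admin tools along these lines (the pool name, server name and filesystem path here are made-up examples, not our real configuration):

      # create a pool to back a storage allocation
      dpm-addpool --poolname einis_alloc
      # register a RAID6/XFS filesystem on one of the PE2950 disk servers into the pool
      dpm-addfs --poolname einis_alloc --server disk01.example.tcd.ie --fs /storage01
      # check the resulting configuration
      dpm-qryconf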

  6. Infrastructure • Room needed an upgrade • Another cooler • UPS maxed out • New high-current AC circuits added • 2x 3kVA UPS per rack acquired for the Dell equipment • ExDS has 4x 16A 3-phase feeds: 2 on the room UPS, 2 on raw mains • 10 GbE to move data!

  7. 10GbE Optimisations • Benchmarked with netperf • http://www.netperf.org • Initially 1-2Gb/s… not good • Had other machines producing figures of 4Gb/s+ • What’s the difference? • Looked at a couple of documents on this: • http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf • http://docs.sun.com/source/819-0938-13/D_linux.html • Tried several of these optimisations • Initially little improvement (~100Mb/s) • Then identified the most important changes
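
  For reference, the benchmark runs behind these figures look roughly like the following (the target address is a placeholder; 30 s tests, as in the MTU comparison later):

      # on the receiving host
      netserver
      # on the sending host: 30-second TCP throughput test over the 10GbE link
      netperf -H 10.10.0.2 -l 30 -t TCP_STREAM
      # and a request/response test for the latency comparison
      netperf -H 10.10.0.2 -l 30 -t TCP_RR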

  8. Best optimisations • Cards fitted to wrong PCI-E port • Were x4 instead of x8 • New kernel version • New kernel supports MSI-X (multiqueue) • Was saturating one core, now distributes • Increase MTU (from 1500 to 9216) • Large difference to netperf • Smaller difference to real loads • Then compared two switches with direct connection

  9. Throughput test

  10. Latency test

  11. ATLAS STEP ’09 • Storage was mostly in place • 10GbE was there but being tested • Brought into production early in STEP09 • Useful exercise for us • See bulk data transfer in conjunction with user access to stored data • The first large 'real' load on the new equipment • Grid-Ireland OpsCentre at TCD involved as Tier-2 site • Associated with NL Tier-1

  12. Peak traffic observed during STEP ‘09

  13. What did we see? • Data transfers into TCD from NL • Peaked at 440 Mbit/s (capped at 500 Mbit/s) • Recently upgraded firewall box coped well

  14. Internet to storage • HEAnet view of GEANT link • TCD view of Grid-Ireland link

  15. What else did we see? • Lots of analysis jobs • Running on cluster nodes • Accessing large datasets directly from storage • Caused heavy load on network and disk servers • Caused problems for other jobs accessing storage • Now known that access patterns were pathological • Also production jobs (graph legend: ATLAS production, ATLAS analysis, LHCb production)

  16. Storage to cluster • Almost all data was stored on a single disk server • 3x 1Gbit bonded links set up between storage and the cluster
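
  A minimal sketch of that kind of bonded link on SL5-style systems (the bonding mode, interface names and address are assumptions for illustration, not our exact settings):

      # /etc/sysconfig/network-scripts/ifcfg-bond0
      DEVICE=bond0
      IPADDR=10.10.1.10
      NETMASK=255.255.255.0
      BOOTPROTO=none
      ONBOOT=yes
      BONDING_OPTS="mode=802.3ad miimon=100"

      # /etc/sysconfig/network-scripts/ifcfg-eth0  (and likewise for eth1, eth2)
      DEVICE=eth0
      MASTER=bond0
      SLAVE=yes
      BOOTPROTO=none
      ONBOOT=yes

  Note that 802.3ad needs a matching LAG configured on the switch side; balance-alb is the usual fallback when the switch cannot be configured.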

  17. Network update

  18. MonAMI DPM work • Fix to distinguish FS with identical names on different servers • Fixed display of long labels • Display space-token stats in TB • New code for pool stats

  19. MonAMI status • Pool stats first to use DPM C API • Previously everything was done via MySQL • Was able to merge some of these fixes • Time-consuming to contribute patches • Single “maintainer” with no dedicated effort … • MonAMI useful but future uncertain • Should UKI contribute effort to plugin development? • Or should similar functionality be created for “native” Ganglia?

  20. Conclusions • Recent procurement gave us a huge increase in capacity • STEP09 great test of data paths into and within our new infrastructure • Identified bottlenecks and tuned configuration • Back-ported SL5 kernel to support 10GbE on SL4 • Spread data across disk servers for load-balancing • Increased capacity of cluster-storage link • Have since upgraded switches • Monitoring crucial to understanding what’s going on • Weathermap for quick visual check • Cacti for detailed information on network traffic • LEMON and Ganglia for host load, cluster usage, etc.

  21. Thanks for your attention!

  22. Monitoring Links • Ganglia monitoring system • http://ganglia.info/ • Cacti • http://www.cacti.net/ • Network weathermap • http://www.network-weathermap.com/ • MonAMI • http://monami.sourceforge.net/

  23. DPM wishes • Quotas are close to becoming essential for us • 10GbE problems have highlighted that releases on new platforms are needed far more quickly

  24. 10Gb summary • Firewall: 1Gb outbound, 10Gb internally • M8024 switch in ‘bridge’ blade chassis • 24 port (16 to blades) layer 3 switch • Force10 switch is the main ‘backbone’ • 10GbE cards in DPM servers • 10GbE uplink from ‘National Servers’ 6224 switch • 10GbE copper (CX4) from ExDS to M6220 in 2nd blade chassis • Link between the 2 blade chassis: M6220 - M8024 • 4-way LAG Force10 - M8024

  25. Force10 S2410 switch • 24 port 10Gb switch • XFP modules • Dell supplied our XFPs so cost per port reduced • 10Gb/s only • Layer 2 switch • Same Fulcrum ASIC as Arista switch tested • Uses a standard reference implementation

  26. Arista demo switch • Arista networks 7124S 24 port switch • SFP+ modules • Low cost per port (switches relatively cheap too) • ‘Open’ software - Linux • Even has bash available • Potential for customisation (e.g. iptables being ported) • Can run 1Gb/s and 10Gb/s simultaneously • Just plug in the different SFPs • Layer 2/3 • Some docs refer to layer 3 as a software upgrade

  27. PCI-E • Our 10GbE cards are Intel PCI-E 10GBASE-SR • Dell had plugged most of them into a x4 PCI-E slot • An error was reported in dmesg • Trivial solution: • Moved the cards to x8 slots • Now get >5Gb/s on some machines
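
  The slot problem shows up without opening the case; a quick check looks like this (the PCI address and exact message wording vary by host and driver version):

      # ixgbe warns at load time if the slot is too narrow
      dmesg | grep -i ixgbe
      # confirm what link width the card actually negotiated (LnkCap = capability, LnkSta = current)
      lspci -vv -s 0b:00.0 | grep -E 'LnkCap|LnkSta'
      #   "Width x4" under LnkSta  -> card is sitting in a x4 slot
      #   "Width x8"               -> full bandwidth available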

  28. MTU • Maximum Transmission Unit • Ethernet spec says 1500 • Most hardware/software can support jumbo frames • ixgbe driver allows MTU=9216 • Must be set through the whole path • Different switches have different maximum values • Makes a big difference to netperf • Example on SL5 machines, 30s tests: • MTU=1500: TCP stream at 5399 Mb/s • MTU=9216: TCP stream at 8009 Mb/s
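
  In practice that means something like the following on each host (interface name and peer address are examples), plus the matching MTU on every switch port in the path:

      # change immediately
      ip link set dev eth2 mtu 9216      # or: ifconfig eth2 mtu 9216
      # make it persistent: add to /etc/sysconfig/network-scripts/ifcfg-eth2
      MTU=9216
      # verify end-to-end with a non-fragmenting ping (9216 minus 28 bytes of IP/ICMP header)
      ping -M do -s 9188 10.10.0.2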

  29. MSI-X and Multiqueue • Machines on SL4 kernels had very poor receive performance (50Mb/s) • One core was 0% idle • Use mpstat -P ALL • Sys/soft used up the whole core • /proc/interrupts showed PCI-MSI used • All RX interrupts to one core • New kernel had MSI-X and multiqueue • Interrupts distributed, full RX performance
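
  The two checks behind that diagnosis:

      # per-core utilisation: one core at 0% idle, all in %sys/%soft => RX bottleneck
      mpstat -P ALL 5
      # with plain PCI-MSI every eth2 RX interrupt lands on a single CPU;
      # with MSI-X/multiqueue (next slide) they spread across the cores
      grep eth2 /proc/interrupts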

  30. Multiqueues

      -bash-3.1$ grep eth2 /proc/interrupts
       114:      247   694613  5597495  1264609     1103    15322   426508  2089709   PCI-MSI-X  eth2:v0-Rx
       122:      657  2401390   462620   499858   644629      234  1660625  1098900   PCI-MSI-X  eth2:v1-Rx
       130:      220   600108   453070   560354  1937777   128178   468223  3059723   PCI-MSI-X  eth2:v2-Rx
       138:       27   764411  1621884  1226975   839601      473   497416  2110542   PCI-MSI-X  eth2:v3-Rx
       146:       37   171163   418685   349575  1809175    17262   574859  2744006   PCI-MSI-X  eth2:v4-Rx
       154:       27   251647   210168     1889   795228   137892  2018363  2834302   PCI-MSI-X  eth2:v5-Rx
       162:       27    85615  2221420   286245   779341      363   415259  1628786   PCI-MSI-X  eth2:v6-Rx
       170:       27  1119768  1060578   892101  1312734      813   495187  2266459   PCI-MSI-X  eth2:v7-Rx
       178:  1834310   371384   149915   104323    27463 16021786      461  2405659   PCI-MSI-X  eth2:v8-Tx
       186:       45        0      158        0        0        1       23        0   PCI-MSI-X  eth2:lsc
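
  Each row above is one MSI-X vector (eight Rx queues, one Tx queue and the link-status interrupt), and the count columns are per-CPU interrupt totals: the receive load is spread across all eight cores rather than saturating a single one.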
