1 / 9

Oxford STEP09 Report

Oxford STEP09 Report. Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009. Storage access. Our main lesson from STEP09 was the big change in ATLAS’ use of storage from lots of small files to fewer large ones.

kelda
Download Presentation

Oxford STEP09 Report

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

  2. Storage access • Our main lesson from STEP09 was the big change in ATLAS’ use of storage from lots of small files to fewer large ones. • This moved the SE bottleneck from authentication on head node to bandwidth on the disk pools. • We implemented network channel bonding on the pools to immediate effect. • We had five disk pools taking most of the load; with more servers we should have higher overall bandwidth. Oxford STEP09 Report

  3. Some pool nodes not stressed • Some smaller pool nodes were marked read only and hence had no STEP09 data Oxford STEP09 Report

  4. Earlier tests stressed the head nodes network more. • STEP09 had little load on head node, and only sporadic network activity Oxford STEP09 Report

  5. Data Transfer – small but successful • Oxford are on an 18% share of AODs. • MCDISK had over 10TB of old reconTotal: 337240 files in 548 datasets, size 10902.36GBc.f. Glasgow where we have 8 recon datasets. Oxford STEP09 Report

  6. Observations - ATLAS • Apparently ATLAS were keeping an eLog, we weren’t aware of it (or forgot about it) until the test was essentially over. • ATLAS pilot jobs were coming in with Role=Production and Role=Pilot. It makes no sense and it breaks things; please stop. • ATLAS’ pilot factory got wedged by a dead CE. We have a dual redundant pair of CEs feeding the main SL4 system, t2ce02 and t2ce04, one of which died. Initially the factory was only using t2ce02, then, when reconfigured, refused to use t2ce04 because of the backlog of (dead) pilots that had been sent to t2ce02. • The direct (non rfcp file stager) access kills us; we can’t run many jobs like that at once. Oxford STEP09 Report

  7. Observations - LHCb • We were sent very few jobs, as far as we know they ran fine. • Er. That’s it. Oxford STEP09 Report

  8. What next? • It would be useful to run more tests: • on the individual access methods, one at a time, • Then a repeat of the mixture, but with different DNs for each method. • Likely upgrades: • More pools online. We’ve been waiting for DPM 1.7 to do some necessary maintenance. Once that’s done we should have about twelve active pool nodes rather than five. • 10Gb Ethernet? (Not soon, and probably only on new kit) Oxford STEP09 Report

  9. Conclusions • Oxford has no local ATLAS expert so we suffered from a lack of awareness of the three different submission methods. • Have been working with Peter Love to understand why the pilot job submission method was failing. • We would like to have a post mortem meeting with Brian to make sure we understand what caused our other various problems. Oxford STEP09 Report

More Related