Testing the UK Tier 2 Data Storage and Transfer Infrastructure

  1. Testing the UK Tier 2 Data Storage and Transfer Infrastructure C. Brew (RAL) Y. Coppens (Birmingham), G. Cowen (Edinburgh) & J. Ferguson (Glasgow) 9-13 October 2006

  2. Outline • What are we testing and why? • What is the setup? • Hardware and Software Infrastructure • Test Procedures • Lessons and Successes • RAL Castor • Conclusions and Future

  3. What and Why • What: • Set up systems and people to test the rates at which the UK Tier 2 sites can import and export data • Why: • Once the LHC experiments are up and running, Tier 2 sites will need to absorb data from and upload data to the Tier 1s at quite alarming rates: ~1 Gb/s for a medium-sized Tier 2 (a rough sense of what that means in daily volume is sketched below) • The UK has a number of “experts” in tuning DPM/dCache; these tests should spread some of that knowledge • Get local admins at the sites to learn a bit more about their upstream networks
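
  A rough, back-of-envelope sense of scale for that ~1 Gb/s figure (my own arithmetic, not a number from the talk): sustained for a full day it is roughly 10 TB of data.

    # Back-of-envelope check of what ~1 Gb/s sustained means per day (illustrative only).
    rate_bits_per_s = 1e9                 # ~1 Gb/s target for a medium-sized Tier 2
    seconds_per_day = 24 * 3600
    bytes_per_day = rate_bits_per_s * seconds_per_day / 8
    print(f"{bytes_per_day / 1e12:.1f} TB/day")   # ~10.8 TB/day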

  4. Why T2 → T2 • CERN is driving the Tier 0 to Tier 1 and the Tier 1 to Tier 1 transfers, but the Tier 2s need to get ready • But no experiment has a use case that calls for transfers between Tier 2 sites? • T2 → T2 transfers still test the network/storage infrastructure at each Tier 2 site • Too many sites to test each against the T1 • The T1 is busy with T0 → T1 and T1 → T1 transfers • T1 → T2 tests were run at the end of last year

  5. Physical Infrastructure • Each UK Tier 2 site has an SRM Storage Element, either dCache or DPM • Generally the network path is: • Departmental Network • Site/University Network • Metropolitan Area Network • JANET (the UK’s educational/research backbone) • Connection speeds vary from a share of 100 Mb/s to 10 Gb/s, generally 1 or 2 Gb/s

  6. Network Infrastructure (diagram: Departmental network → Site/University network → MAN → UK backbone)

  8. Software Used • This is a test of the Grid software stack as well as of the T2 hardware, so we try to use that stack throughout: • Data sink/source is the SRM-compliant SE • Transfers are done using the File Transfer Service (FTS) • The filetransfer script is used to submit and monitor the FTS transfers: http://www.physics.gla.ac.uk/~graeme/scripts/filetransfer • Transfers are generally done over the production network, using the production software, without special short-term tweaks (a sketch of what a single FTS submission looks like follows below)
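
  Purely as an illustration, the kind of operation the filetransfer script automates can be sketched as a thin wrapper around the gLite FTS command-line clients. The endpoint URL, SURLs and the terminal state names below are assumptions made for the sketch, not details taken from the talk.

    import subprocess
    import time

    # All of these values are hypothetical placeholders.
    FTS_ENDPOINT = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer"  # hypothetical
    SRC = "srm://se.ref-site.ac.uk/dpm/ref-site.ac.uk/home/dteam/canned/file0001"                  # hypothetical SURL
    DST = "srm://se.t2-site.ac.uk/dpm/t2-site.ac.uk/home/dteam/test/file0001"                      # hypothetical SURL

    # Submit one SRM-to-SRM copy; glite-transfer-submit prints a job identifier on stdout.
    job_id = subprocess.run(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, SRC, DST],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

    # Poll the job until FTS reports a terminal state (state names here are indicative only).
    while True:
        state = subprocess.run(
            ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        print(job_id, state)
        if state in ("Done", "Finished", "FinishedDirty", "Failed", "Canceled"):
            break
        time.sleep(60)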

  9. File Transfer Service • A fairly recent addition to the LCG middleware • Manages the transfer of SRM files from one SRM server to another; manages bandwidth and queues, and retries failed transfers • Defines “Channels” for transferring files between sites • Generally each T2 has three channels defined: • T1 → Site • T1 ← Site • Elsewhere → Site • Each channel sets connection parameters, limits on the number of parallel transfers, VO shares, etc. (an illustrative layout is sketched below)
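
  Purely as a conceptual sketch of that per-site channel layout: the channel names, field names and numbers below are made up for illustration and are not the actual FTS channel schema.

    # Hypothetical picture of the three channels a Tier 2 site might have defined.
    channels = {
        "RAL-SITEX":  {"direction": "T1 -> Site",        "max_concurrent_files": 10, "streams_per_file": 5, "vo_share": {"dteam": 100}},
        "SITEX-RAL":  {"direction": "Site -> T1",        "max_concurrent_files": 10, "streams_per_file": 5, "vo_share": {"dteam": 100}},
        "STAR-SITEX": {"direction": "Elsewhere -> Site", "max_concurrent_files": 10, "streams_per_file": 5, "vo_share": {"dteam": 100}},
    }

    for name, cfg in channels.items():
        print(name, cfg["direction"], cfg["max_concurrent_files"], "files,", cfg["streams_per_file"], "streams/file")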

  10. Setup • GridPP had done some T1 ↔ T2 transfer tests last year • Three sites which had already demonstrated >300 Mb/s transfer rates in those previous tests were chosen as reference sites • Each site to be tested nominated a named individual to “own” the tests for their site

  11. Procedure • Three weeks before the official start of the tests, the reference sites started testing against each other: • Confirmed that they could still achieve the necessary rates • Tested the software to be used in the test • Each T2 site was assigned a reference site to be its surrogate T1, and a time slot to perform 24 hr read and write tests • Basic site test was: • Beforehand, copy 100 1 GB “canned” files to the source SRM • Repeatedly transfer these files to the sink for 24 hrs • Reverse the flow and copy data from the reference site for 24 hrs • Rate is simply (number of files successfully transferred × file size) / elapsed time (a worked example follows below)
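
  A worked example of that rate formula, with a made-up file count purely for illustration:

    # Hypothetical outcome: 3000 successful 1 GB transfers inside the 24-hour window.
    files_transferred = 3000              # illustrative number, not a real test result
    file_size_bits = 1e9 * 8              # 1 GB expressed in bits
    elapsed_s = 24 * 3600                 # 24-hour test window in seconds

    rate_mb_per_s = files_transferred * file_size_bits / elapsed_s / 1e6
    print(f"{rate_mb_per_s:.0f} Mb/s")    # ~278 Mb/s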

  12. Issues / Lessons • Loss of a reference site before we even started: • Despite achieving very good rates in the previous tests, no substantive changes at the site, and heroic efforts, it could not sustain >250 Mb/s • Tight timescale: • Tests using each reference site were scheduled for each working day, so if a site missed its slot or had a problem during the test there was no room to catch up

  13. Issues / Lessons • Lack of pre-test tests: • Sites only had a 48 hr slot for two 24 hour tests, and reference sites were normally busy with other tests, so there was little opportunity for sites to tune their storage/channel before the main tests • Network variability: • Especially prevalent during the reference site tests • Performance could vary hour by hour by as much as 50% for no reason apparent on the LANs at either end • In the long term, changes upstream (a new firewall, or rate limiting by your MAN) can reduce previously good rates to a trickle

  14. Issues / Lessons • Needed a better recipe: • With limited opportunities for site admins to try out the software, a better recipe for preparing and running the test would have helped • Email communication wasn’t always ideal: • Would have been better to get phone numbers for all the site contacts • Ganglia bandwidth plots seem to underestimate the rate

  15. What worked • The tests themselves, despite the above • Community support: • Reference sites got early experience running the tests and could help the early sites, who in turn could help the next wave, and so on • Service reliability: • The FTS was much more reliable than in previous tests • Some problems with the myproxy service stopping, which caused transfers to stop • Sites owning the tests

  16. Where are we now? • 14 out of 19 sites have participated, and have successfully completed 21 out of 38 tests • >60 TB of data has been transferred between sites • Max recorded transfer rate: 330 Mb/s • Min recorded transfer rate: 27 Mb/s

  17. Now

  18. RAL Castor • During the latter part of the tests the new CASTOR at RAL was ready for testing • We had a large pool of sites which had already been tested, and admins familiar with the test software, who could quickly run the same tests with the new CASTOR as the endpoint • This enabled us to run tests against CASTOR and get good results whilst still running the main tests • This in turn helped the CASTOR team in their superhuman efforts to get CASTOR ready for CMS’s CSA06 tests

  19. Conclusions • UK Tier Twos have started to prepare for the data challenges that LHC running will bring • Network “weather” is variable and can have a big effect • As can any one of the upstream network providers

  20. Future • Work with sites with low rates to understand and correct them • Keep running tests like this regularly: • Sites that can do 250 Mb/s now should be doing 500 Mb/s by next spring and 1 Gb/s by this time next year

  21. Thanks… • Most of the actual work for this was done by Jamie, who co-ordinated everything; the sysadmins Grieg, Mark, Yves, Pete, Winnie, Graham, Olivier, Alessandra and Santanu, who ran the tests; and Matt, who kept the central services running.
