Testing the UK Tier 2 Data Storage and Transfer Infrastructure C. Brew (RAL) Y. Coppens (Birmingham), G. Cowen (Edinburgh) & J. Ferguson (Glasgow) 9-13 October 2006
Outline • What are we testing and why? • What is the setup? • Hardware and Software Infrastructure • Test Procedures • Lessons and Successes • RAL Castor • Conclusions and Future
What and Why • What: • Set up systems and people to test the rates at which the UK Tier 2 sites can import and export data • Why: • Once the LHC experiments are up and running, Tier 2 sites will need to absorb data from, and upload data to, the Tier 1s at quite alarming rates: • ~1 Gb/s for a medium-sized Tier 2 • The UK has a number of "experts" in tuning DPM/dCache; these tests should spread some of that knowledge • Get local admins at the sites to learn a bit more about their upstream networks
Why T2 → T2 • CERN is driving the Tier 0 → Tier 1 and Tier 1 → Tier 1 transfers, but the Tier 2s need to get ready too • No experiment has a use case that calls for transfers between Tier 2 sites, so why do it this way? • It still tests the network/storage infrastructure at each Tier 2 site • There are too many sites to test each against the T1 • The T1 is busy with T0 → T1 and T1 → T1 • T1 → T2 tests were run at the end of last year
Physical Infrastructure • Each UK Tier 2 site has an SRM Storage Element, either dCache or DPM • Generally the network path is: • Departmental network • Site/University network • Metropolitan Area Network (MAN) • JANET (the UK's educational/research backbone) • Connection speeds vary from a share of 100 Mb/s to 10 Gb/s, generally 1 or 2 Gb/s
Network Infrastructure • [Diagram: Departmental network → Site/University network → MAN → UK backbone]
Software Used • This is a test of the Grid software stack as well as the T2 hardware, so we use the production stack: • The data sink/source is the SRM-compliant SE • Transfers are done using the File Transfer Service (FTS) • The filetransfer script is used to submit and monitor the FTS transfers: http://www.physics.gla.ac.uk/~graeme/scripts/filetransfer • Transfers are generally done over the production network with the production software, without special short-term tweaks
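As a concrete illustration, here is a minimal sketch of the submit-and-poll loop that a wrapper like filetransfer performs, assuming the standard gLite FTS command-line tools (glite-transfer-submit and glite-transfer-status) are available; the endpoint URL and SURLs are hypothetical placeholders, and this is not the actual filetransfer script.

```python
#!/usr/bin/env python
# Minimal sketch (not the real filetransfer script): submit one FTS job
# and poll it to completion via the gLite CLI tools. The endpoint and
# SURLs below are hypothetical placeholders.
import subprocess
import time

FTS = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer"

def submit(src, dst):
    """Submit a transfer; glite-transfer-submit prints the job ID."""
    out = subprocess.check_output(["glite-transfer-submit", "-s", FTS, src, dst])
    return out.decode().strip()

def wait(job_id, poll_seconds=30):
    """Poll the job state until FTS reports a terminal state."""
    terminal = ("Done", "Finished", "FinishedDirty", "Failed", "Canceled")
    while True:
        state = subprocess.check_output(
            ["glite-transfer-status", "-s", FTS, job_id]).decode().strip()
        if state in terminal:
            return state
        time.sleep(poll_seconds)

if __name__ == "__main__":
    jid = submit("srm://se.site-a.ac.uk/dpm/site-a.ac.uk/home/dteam/file0001",
                 "srm://se.site-b.ac.uk/dpm/site-b.ac.uk/home/dteam/file0001")
    print(jid, wait(jid))
```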
File Transfer Service • A fairly recent addition to the LCG middleware • Manages the transfer of SRM files from one SRM server to another; manages bandwidth and queues, and retries failed transfers • Defines "channels" for transferring files between sites • Generally each T2 has three channels defined: • T1 → Site • T1 ← Site • Elsewhere → Site • Each channel sets connection parameters, limits on the number of parallel transfers, VO shares, etc. (sketched below)
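To make the channel concept concrete, here is a toy model of the per-channel settings just listed; the field names and values are invented for illustration and are not the real FTS schema.

```python
# Toy model of an FTS channel's settings (invented names, not the FTS schema).
from dataclasses import dataclass

@dataclass
class Channel:
    name: str          # e.g. "T1-SITE"
    source: str        # "*" acts as a catch-all for "Elsewhere"
    dest: str
    max_parallel: int  # limit on concurrent file transfers
    vo_shares: dict    # VO name -> percentage share of the channel

# The three channels a typical T2 would have (values are illustrative):
channels = [
    Channel("T1-SITE",   "RAL",  "SITE", 10, {"atlas": 50, "cms": 50}),
    Channel("SITE-T1",   "SITE", "RAL",  10, {"atlas": 50, "cms": 50}),
    Channel("STAR-SITE", "*",    "SITE",  5, {"dteam": 100}),
]

for ch in channels:
    print(ch.name, "max parallel:", ch.max_parallel, "shares:", ch.vo_shares)
```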
Setup • GridPP had done some T1 ↔ T2 transfer tests last year • Three sites which had already demonstrated >300 Mb/s transfer rates in the previous tests were chosen as reference sites • Each site to be tested nominated a named individual to "own" the tests for their site
Procedure • Three weeks before the official start of the tests, the reference sites started testing against each other: • Confirmed that they could still achieve the necessary rates • Tested the software to be used in the tests • Each T2 site was assigned a reference site to act as its surrogate T1, and a time slot in which to perform 24-hour read and write tests • The basic site test was: • Beforehand, copy 100 × 1 GB "canned" files to the source SRM • Repeatedly transfer these files to the sink for 24 hours • Reverse the flow and copy data from the reference site for 24 hours • The rate is simply (number of files successfully transferred × file size) / time (worked example below)
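A quick worked example of that rate formula (our own arithmetic; the transfer count is hypothetical):

```python
# Rate = (files successfully transferred * file size) / elapsed time.
def rate_mbps(files_ok, file_size_gb=1.0, hours=24.0):
    bits = files_ok * file_size_gb * 1e9 * 8      # decimal GB -> bits
    return bits / (hours * 3600.0) / 1e6          # -> megabits per second

# e.g. 3500 successful 1 GB transfers in 24 h is roughly 324 Mb/s
print("%.0f Mb/s" % rate_mbps(3500))
```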
Issues / Lessons • Loss of a reference site before we even started • Despite having achieved very good rates in the previous tests, with no substantive changes at the site since, and despite heroic efforts, it could not sustain >250 Mb/s • Tight timescale • Tests using each reference site were scheduled for every working day, so if a site missed its slot or had a problem during the test there was no room to catch up
Issues / Lessons • Lack of pre-test tests • Sites only had a 48-hour slot for two 24-hour tests, and the reference sites were normally busy with other tests, so there was little opportunity for sites to tune their storage/channel before the main tests • Network variability • Especially prevalent during the reference site tests • Performance could vary hour by hour by as much as 50%, for no reason apparent on the LANs at either end • In the long term, changes upstream (a new firewall, or rate limiting by your MAN) can reduce previously good rates to a trickle
Issues / Lessons • Needed a better recipe • With limited opportunities for site admins to try out the software, a better recipe for preparing and running the test would have helped • Email communication wasn't always ideal • Would have been better to get phone numbers for all the site contacts • Ganglia bandwidth plots seem to underestimate the rate
What worked • Despite the above, the tests themselves • Community support • Reference sites got early experience running the tests and could help the early sites, who in turn could help the next wave, and so on • Service reliability • The FTS was much more reliable than in previous tests • Some problems with the MyProxy service stopping, which caused transfers to stop • Sites owning the tests
Where are we now? • 14 out of 19 sites have participated, and have successfully completed 21 out of 38 tests • >60 TB of data has been transferred between sites • Max recorded transfer rate: 330Mb/s • Min recorded transfer rate: 27Mb/s
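As a back-of-the-envelope consistency check (our arithmetic, not from the tests themselves), a 24-hour test at the fastest recorded rate moves about 3.6 TB, so 21 completed tests are comfortably consistent with the >60 TB total:

```python
# TB moved in one 24-hour test at a given sustained rate.
for mbps in (330, 27):
    tb_per_day = mbps * 1e6 / 8 * 86400 / 1e12   # Mb/s -> bytes/s -> TB/day
    print("%3d Mb/s -> %.2f TB per 24 h" % (mbps, tb_per_day))
# 330 Mb/s -> 3.56 TB per 24 h
#  27 Mb/s -> 0.29 TB per 24 h
```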
RAL Castor • During the latter part of the tests the new CASTOR at RAL was ready for testing • We had a large pool of sites which had already been tested, with admins familiar with the test software, who could quickly run the same tests with the new CASTOR as the endpoint • This enabled us to run tests against CASTOR and get good results whilst still running the main tests • This in turn helped the CASTOR team in their superhuman efforts to get CASTOR ready for CMS's CSA06 tests
Conclusions • The UK Tier 2s have started to prepare for the data challenges that LHC running will bring • Network "weather" is variable and can have a big effect • As can a change by any one of the upstream network providers
Future • Work with the sites with low rates to understand and correct them • Keep running tests like this regularly: • Sites that can do 250 Mb/s now should be doing 500 Mb/s by next spring and 1 Gb/s by this time next year
Thanks… • Most of the actual work for this was done by Jamie, who co-ordinated everything; by the sysadmins, Grieg, Mark, Yves, Pete, Winnie, Graham, Olivier, Alessandra and Santanu, who ran the tests; and by Matt, who kept the central services running.