My Eponymous Tests
Steve Lloyd
The Steve Lloyd Tests
• Started as an attempt to run ATLAS jobs at all UK sites
• Has grown into a major enterprise:
  • ATLAS tests
  • UK tests
  • SAM test monitoring
  • LHCb test monitoring
  • RB/WMS tests
  • BDII tests
  • LFC tests
  • SE tests
  • FCR monitoring
  • Network tests
http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
ATLAS Tests
• Three tests - one every 20 minutes to each site
  • Hello World - Just runs Athena HelloWorld from the installed ATLAS software – no compilation
  • New Package – Builds a ‘New Package’ from scratch – essentially a Hello World, but compiled and linked on the WN
  • User Analysis – Builds a physics analysis job on the WN, copies a file of Z0 → e+e- events from the local SE, analyses them and calculates the Z0 mass
[Plot legend: Latest software version, Previous software version, Old software version]
UK Tests
• These are a variation on the ATLAS Tests, but instead of being sent to each site individually they are sent to “.ac.uk”
• They are the User Analysis ATLAS test (analyse Z0 data from the local SE)
RB/WMS Tests
• Hello World jobs are sent to any UK CE using each RB/WMS in turn every 15 mins
• The results are used to decide on an ‘Auto RB’ and ‘Auto WMS’, which are updated dynamically and used as the RB/WMS for the other tests
• These are available to download as config files
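The slide does not say how the ‘Auto RB’ or ‘Auto WMS’ is chosen from the results. As a rough sketch only, assuming the choice is simply the broker with the highest fraction of successful recent Hello World jobs, the selection could look like the following. The function name `choose_auto_broker`, the result format and the broker names are all hypothetical.

```python
from typing import Dict, List, Optional


def choose_auto_broker(results: Dict[str, List[bool]]) -> Optional[str]:
    """Pick the broker with the best recent success fraction.

    `results` maps a broker name to the outcomes of its recent
    Hello World jobs (True = job succeeded). This criterion is an
    assumption; the real tests may use a different rule.
    """
    best_name: Optional[str] = None
    best_rate = -1.0
    # Sort names so that ties are broken deterministically (alphabetically).
    for name in sorted(results):
        outcomes = results[name]
        if not outcomes:
            continue  # no recent jobs, so nothing to judge this broker on
        rate = sum(outcomes) / len(outcomes)
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name


# Hypothetical broker names, for illustration only.
recent = {
    "rb-a": [True, True, False],
    "rb-b": [True, True, True],
    "rb-c": [],
}
print(choose_auto_broker(recent))
```

Returning `None` when no broker has any results leaves it to the caller to decide whether to keep the previous choice.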
SE Tests
• An attempt is made to copy a small file to each UK SE, read it back and delete it, once an hour
[Plot annotation: CASTOR problems]
SAM Tests
• These are just downloaded from the SAM web pages every hour and a record kept
• Two versions now:
  • What we call the SAM Tests - Ops critical tests – used for funding allocations
  • LHCb SAM Tests
• Also Summary Table and History Plots
FCR
• This polls the FCR database every 10 mins and keeps a record of the percentage of time a site is excluded
• Only seems to be used by ATLAS and CMS (+ some minor VOs)
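The ‘percentage of time excluded’ figure can be illustrated with a minimal sketch, assuming each 10-minute poll counts as one equal slice of time and that the record is just a list of booleans per site. The function name `percent_excluded` and the data layout are hypothetical, not taken from the actual scripts.

```python
from typing import Iterable


def percent_excluded(polls: Iterable[bool]) -> float:
    """Return the percentage of polls in which the site was excluded.

    Each entry is one poll (True = site was excluded at that time).
    Assumes polls are evenly spaced, so each carries equal weight.
    """
    samples = list(polls)
    if not samples:
        return 0.0  # no polls recorded yet
    return 100.0 * sum(samples) / len(samples)


# One hour of 10-minute polls in which the site was excluded twice.
print(percent_excluded([False, True, True, False, False, False]))
```

A missed poll would simply be absent from the list under this scheme, which slightly biases the percentage if outages of the poller coincide with exclusions.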
LFC and BDII
• BDII Tests: "lcg-info" queries are submitted to each UK top level BDII every 15 mins
• LFC Tests: "lcg-lr" queries are submitted to each UK top level LFC every 15 mins
• This isn’t really a sufficient test: recently the LFC was ‘down’ but this test was still OK
Network Tests
• The most recent test – and the trickiest!
• Originally submitted a job to each site, then copied a file from each SE to the WN the job was on and timed the transfer
• Now changed to run in three parts (the first runs independently):
  • Copy a file from the Tier-1 to each SE (done from the UI – no job involved)
  • Run a job on each site and:
    • Copy the file from the local SE to the WN
    • Copy the file from the WN to every SE
[Plot annotation: Lots of empty entries!]
Problems - Timing
• For each transfer get two times:
  • T1 - The time from “transfer took x msec”
  • T2 - The elapsed time between issuing the command and it completing (longer)
• At present 3 files are copied – 100MB, 500MB and 1GB
• If at least 2 out of 3 are successful:
  • For each of T1 and T2 calculate the time using the difference between that for the largest and smallest file
  • If answers from T1 and T2 agree to within 30%, average them
• There can hence be several causes of failure:
  • Job fails
  • 2 or more transfers fail
  • The larger file takes less time than the smaller (yes it does happen)
  • T1 and T2 don’t agree to within 30%
• Still looking at better algorithms/robustness
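The timing procedure above can be sketched roughly as follows. This is one interpretation, not the actual script: it assumes the result is reported as a rate in MB/s, that 1GB is counted as 1000 MB, and that ‘agree to within 30%’ means the two rates differ by no more than 30% of their mean. The name `estimate_rate` and the input layout are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical input: file size in MB -> (T1, T2) in seconds,
# or None if that transfer failed.
Timings = Dict[int, Optional[Tuple[float, float]]]


def estimate_rate(timings: Timings, tolerance: float = 0.30) -> Optional[float]:
    """Return an estimated transfer rate in MB/s, or None if the test fails."""
    ok = {size: t for size, t in timings.items() if t is not None}
    if len(ok) < 2:
        return None  # 2 or more of the 3 transfers failed

    small, large = min(ok), max(ok)
    size_diff = large - small

    rates = []
    for i in (0, 1):  # index 0 is T1, index 1 is T2
        # Using the difference between largest and smallest file
        # cancels any fixed per-transfer overhead.
        dt = ok[large][i] - ok[small][i]
        if dt <= 0:
            return None  # the larger file was not slower than the smaller
        rates.append(size_diff / dt)

    r1, r2 = rates
    mean = (r1 + r2) / 2
    # Assumed definition of "agree to within 30%": relative to the mean.
    if abs(r1 - r2) > tolerance * mean:
        return None
    return mean


sample = {100: (10.0, 12.0), 500: (50.0, 54.0), 1000: (100.0, 106.0)}
print(estimate_rate(sample))
```

Each `return None` corresponds to one of the failure causes listed on the slide, apart from ‘Job fails’, which would mean no timings are produced at all.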
Problems – Zombie Files
• Many files are apparently left over after they are supposed to have been deleted
• lcg-del returns nothing for these files
• There don’t appear to be any replicas, so they seem to be entries left in the catalogue
Problems - Complexity
[Diagram labels: Local UI; My PC; Webserver; File Server; Scripts; Result DBs; RAID Server; Outputs, Logfiles etc; ATLAS/UK tests; RB Tests; SE/Network Tests; BDII/LFC Tests; SAM Results; FCR Status; GOC Status]
• Solution (soon) – one dedicated machine
• If isolated would also allow remote access by Jeremy et al
Conclusions
• A pretty comprehensive and complete set of ‘User experience’ end-to-end tests
• Main problems – network test resilience and overall complexity
• Missing features –
  • CMS specific tests
  • POSIX IO rather than SE to WN copy (tried a few times without success)
• Any other tests?