TESTING FAX USING SSS and FDR datasets 2nd April 2013
DETAILS
• Dataset: user.flegger.*.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00 (500 GB)
• WNs: UC3 and UCT3
• Discovery: Global redirector
• Running against: fax.mwt2.org
• Ramp-up: 4 jobs a minute
• Full data copy – split in 138 jobs for each site
• Average input size: 3.62 GB
• Duration does not include time for job to start
• Duration does not include dq2-put time
Ilija Vukotic ivukotic@uchicago.edu
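As a quick consistency check on the figures above, the sketch below multiplies the job count by the average input size and estimates how long the ramp-up takes at 4 jobs a minute. The function and constant names are illustrative and do not come from the original test scripts.

```python
# Figures quoted on the DETAILS slide.
JOBS_PER_SITE = 138        # full data copy split into 138 jobs per site
AVG_INPUT_GB = 3.62        # average input size per job
RAMP_UP_JOBS_PER_MIN = 4   # submission ramp-up


def total_input_gb(jobs, avg_input_gb):
    """Total data read by all jobs at one site, in GB."""
    return jobs * avg_input_gb


def ramp_up_minutes(jobs, jobs_per_min):
    """Minutes needed until the last job has been submitted."""
    return jobs / jobs_per_min


if __name__ == "__main__":
    total = total_input_gb(JOBS_PER_SITE, AVG_INPUT_GB)
    ramp = ramp_up_minutes(JOBS_PER_SITE, RAMP_UP_JOBS_PER_MIN)
    print(f"total input: {total:.1f} GB")   # close to the quoted 500 GB
    print(f"ramp-up: {ramp:.1f} min")
```

The product comes out just under 500 GB, which agrees with the quoted dataset size.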
Jobs
MWT2
• 2 jobs hanging – they finish with no error, but only the next day
• UCT3 shows the same efficiency as UC3
• Avg. CPU efficiency: 76.5%
• Avg. duration: 5 h 59 min
• Avg. rate: 290 kB/s
• Total rate: 39 MB/s
AGLT2
• 4 jobs hanging – they finish with no error, but only the next day
• Avg. CPU efficiency: 70.5%
• Avg. duration: 6 h 14 min
• Avg. rate: 165 kB/s
• Total rate: 22 MB/s
BU
• 18 jobs hanging
• Avg. CPU efficiency: 35%
• Avg. duration: 11 h 2 min
• Avg. rate: 108 kB/s
• Total rate: 14 MB/s
MWT2 – 300 branches
• 48 jobs in parallel
• Avg. CPU efficiency: 17%
• Avg. duration: 3 h 20 min
• Avg. rate: 926 kB/s
• Total rate: 44 MB/s
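One plausible reading of the "Avg. rate" and "Total rate" figures (an assumption, since the slides do not define them) is that the total rate is roughly the per-job average multiplied by the number of jobs running concurrently. The sketch below checks that reading against the quoted numbers, taking 1 MB = 1000 kB; the function names are illustrative.

```python
def aggregate_rate_mb_s(parallel_jobs, avg_rate_kb_s):
    """Aggregate rate in MB/s if every concurrent job reads at the average rate."""
    return parallel_jobs * avg_rate_kb_s / 1000.0


def implied_parallel_jobs(total_rate_mb_s, avg_rate_kb_s):
    """Number of concurrent jobs implied by a total and a per-job rate."""
    return total_rate_mb_s * 1000.0 / avg_rate_kb_s


if __name__ == "__main__":
    # 300-branch test at MWT2: 48 jobs at 926 kB/s each.
    print(f"{aggregate_rate_mb_s(48, 926):.1f} MB/s")  # quoted total: 44 MB/s

    # Full-copy tests: how many of the 138 jobs would have to overlap?
    for site, total, avg in [("MWT2", 39, 290), ("AGLT2", 22, 165), ("BU", 14, 108)]:
        print(site, round(implied_parallel_jobs(total, avg)))
```

Under this assumption the 300-branch figures are self-consistent, and the full-copy totals correspond to roughly 130 to 134 jobs overlapping out of 138.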
Conclusion 1
• Rechecked that dq2-put times were not included.
• Times seem to be properly measured.
• Need to solve the mystery of the huge CPU times.
• May have to move to the C++ version.
SSS doing XRDCP
• The same dataset.
• But doing a simple xrdcp to /dev/null.
• Up to 290 jobs in parallel (UC3 and UCT3)
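A copy test of this kind can be sketched as below: build an `xrdcp` command that writes to /dev/null, time it, and convert to a rate. This is a minimal sketch and not the script used for the test. The redirector host is the one queried in this test; the file path in the usage example is a hypothetical placeholder, and the file size has to be supplied by the caller.

```python
import subprocess
import time

REDIRECTOR = "glrd.usatlas.org"  # global redirector queried in this test


def build_xrdcp_command(redirector, path):
    """Return the argv list for copying one file to /dev/null.

    The double slash after the host marks an absolute path in a root:// URL.
    -f (force) lets xrdcp write to an existing destination such as /dev/null.
    """
    url = f"root://{redirector}//{path.lstrip('/')}"
    return ["xrdcp", "-f", url, "/dev/null"]


def rate_kb_s(size_bytes, seconds):
    """Transfer rate in kB/s (1 kB = 1000 bytes)."""
    return size_bytes / 1000.0 / seconds


def timed_copy(redirector, path, size_bytes):
    """Run the copy and return the observed rate in kB/s.

    Raises subprocess.CalledProcessError if xrdcp exits with an error.
    """
    start = time.monotonic()
    subprocess.run(build_xrdcp_command(redirector, path), check=True)
    return rate_kb_s(size_bytes, time.monotonic() - start)


if __name__ == "__main__":
    # Hypothetical placeholder path; substitute a real file from the dataset.
    print(" ".join(build_xrdcp_command(REDIRECTOR, "/atlas/placeholder/file.root")))
```

Running the copy requires the XRootD client and a valid grid proxy; the `__main__` block only prints the command.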
SSS doing XRDCP
• Wanted to test all sites that are in FAX and have the FDR dataset.
• Most did not work:
  • When asked through glrd.usatlas.org.
  • Some of them even when asked directly.
  • Some work for 5-10 files but then give up.
  • Some work on repeated queries.
• ML monitor not adequate anymore.
• CERN, some UK sites sending all traffic
• Something strange with AGLT2 numbers
• Something wrong with ML
SSS doing XRDCP
Errors mostly:
  Last server error 10000 ('')
  Error accessing path/file for … (BNL)
Very strange error in setting up environment. Not FAX related:
  Created //.asetup. Please look and (optional) edit it.
  AtlasSetup(WARNING): Unable to write ${HOME} save file
  mkdir: cannot create directory `//workarea': Permission denied
  /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/utilities/createUserASetup.sh: line 40: //.asetup: Permission denied
Results
Conclusion 2
• Automatic tests for SSB are not enough.
• In the absence of users who would report problems, additional manual checks will be needed from time to time.
• Monitoring needs to be validated from beginning to end.
• Huge difference in rates – need cost matrix ASAP
• Rates observed sound reasonable.
• Our understanding would hugely benefit from perfSonar tests over the same links.
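The slides do not define the cost matrix, so the following is only a sketch under stated assumptions: a table of observed transfer rates keyed by (source site, destination), from which the fastest source for a destination can be picked. The `CostMatrix` class and the "UC" destination label are hypothetical. The sample values are the per-job average rates quoted earlier for jobs on UC3/UCT3, assuming each slide title names the site serving the data.

```python
from collections import defaultdict
from statistics import mean


class CostMatrix:
    """Observed transfer rates between a source site and a destination."""

    def __init__(self):
        # (source, destination) -> list of observed rates in kB/s
        self._samples = defaultdict(list)

    def record(self, source, destination, rate_kb_s):
        self._samples[(source, destination)].append(rate_kb_s)

    def mean_rate(self, source, destination):
        """Mean observed rate, or None if the link was never measured."""
        samples = self._samples.get((source, destination))
        return mean(samples) if samples else None

    def best_source(self, destination, candidates):
        """Candidate source with the highest mean rate to the destination."""
        measured = [
            (self.mean_rate(src, destination), src)
            for src in candidates
            if self.mean_rate(src, destination) is not None
        ]
        return max(measured)[1] if measured else None


if __name__ == "__main__":
    matrix = CostMatrix()
    # Per-job average rates from the earlier slides (kB/s).
    matrix.record("MWT2", "UC", 290)
    matrix.record("AGLT2", "UC", 165)
    matrix.record("BU", "UC", 108)
    print(matrix.best_source("UC", ["MWT2", "AGLT2", "BU"]))
```

Keeping raw samples rather than a single number per link leaves room to add spread or recency weighting later.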
TESTING FAX USING HC and FDR datasets 2nd April 2013
20019750
• RC pilot
• Data from SLAC only
20019750
• 7 worked
• 3 did not start
• 4 failed
20019750
• SWT2_CPB: Log put error: Error copying the file: 256, cp: cannot create regular file /xrd/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SWT2_CPB.25/user.gangarbt.32893735._
• SLAC: Put error: Error copying the file: 256, cp: accessing `/xrootd/atlas/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SLAC.43/user.gangarbt.32887595.EXT0._00418.HWWSkimmedNTUP.root?oss.cgroup=ATLASUSERDISK': Transport endpoint is not connected
• QMUL: Get error: Staging input file failed
• MWT2: Download: 2444 seconds
• ROMA1: Finished: 44, Timed out: 12
• FZK: Finished: 4, Timed out: 46. Get error: Staging input file failed
• ECDF: Finished: 36, Failed: 11. pilotErrorDiag: Too little space left on local disk to run job
• CERN: Get error: Staging input file failed
• BU: Finished: 23, Failed: 12. Not enough local space for staging input files and run the job
• AGLT: Finished: 17
• BNL: Finished: 231, Failed: 8 – lost heartbeat or unspecified
• OU_OCHEP_SWT2, JINR, FZU: did not start
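To compare sites on a common footing, the per-site counts above can be turned into success fractions. The sketch below does this for the sites that list both finished and unsuccessful jobs; the dictionary layout and function names are illustrative, while the counts are those quoted above.

```python
# Job counts quoted for HC test 20019750 (sites listing both outcomes).
SITE_COUNTS = {
    "ROMA1": {"finished": 44, "timed_out": 12},
    "FZK": {"finished": 4, "timed_out": 46},
    "ECDF": {"finished": 36, "failed": 11},
    "BU": {"finished": 23, "failed": 12},
    "BNL": {"finished": 231, "failed": 8},
}


def success_fraction(counts):
    """Finished jobs divided by all jobs with a recorded outcome."""
    total = sum(counts.values())
    return counts["finished"] / total


def rank_sites(site_counts):
    """Site names sorted from lowest to highest success fraction."""
    return sorted(site_counts, key=lambda s: success_fraction(site_counts[s]))


if __name__ == "__main__":
    for site in rank_sites(SITE_COUNTS):
        print(f"{site:6s} {success_fraction(SITE_COUNTS[site]):.0%}")
```

Timed-out and failed jobs are both counted as unsuccessful here; separating them would need the per-job records.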
20019750
20019749
• RC pilot
• Data from anywhere
20019749
The same idea as 20019750 but with many more sites and random files: user.flegger.*…
Did not work as I expected: each site always ran against the same randomly chosen dataset.
20019749
Conclusion 3
• While there are many failures, some seem easy to fix (not enough space on disk, etc.)
• Some are the same ones observed in the SSS-based tests.
• We need to look at performance. Often it is better to fail than to have very low performance. How low is unacceptably low?
• Need to start looking at sites that are not part of FAX.
Direct FDR HC jobs
Conclusion
• Testing:
  • Need faster turnaround.
  • Would it help to have:
    • Every 6 hours, one HC-submitted job at each ANALY queue
    • Against a very stable door
  • With the tools we have now there is no way to precisely stress-test sites.
  • Fill in the table on slide 21 and make it green.
• Monitoring:
  • ML almost useless now.
  • Need full validation, especially of the CERN FAX dashboard.
Systematic FDR load tests in progress
US cloud results
• 10 jobs * 10 SMWZ files ~ 50 GB
• CPU limited
• Factors affecting spreads: pair-wise network latency, throughput, storage “busyness”
Systematic FDR load tests in progress US cloud results
Systematic FDR load tests in progress EU cloud results
Systematic FDR load tests in progress EU cloud results