FAX status report Ilija Vukotic on behalf of the atlas-adc-federated-xrootd working group S&C week Jun 2, 2014
Content • Status • Coverage • Traffic • Failover • Overflow • Changes in localSetupFAX • Monitoring changes • Changes in GLED collector, dashboard • Failover & overflow monitoring • FaxStatusBoard • Meetings • Tutorial – 23-27 June – dedicated to teaching the xAOD and the new analysis model • ROOT I/O – 25-27 June
FAX topology • Topology change in North America • Added East and West redirectors • Will serve the CA cloud • All hosted at BNL • Will need an NL cloud redirector
FAX in Europe • To come: • SARA • NIKHEF • IL cloud: IL-TAU, Technion, Weizmann
FAX in North America • To come: • TRIUMF (June?) • McGill (end of June) • SCINET (end of June) • Victoria (~August)
FAX in Asia • To come: • Beijing (~two weeks) • Tokyo • Australia (a few weeks)
Status • Most sites running stably • Glitches do happen, but are usually fixed within a few hours • SSB issues solved • New sites added: • IFAE • PIC • IN2P3-LPC • In need of restart: • UNIBE-LHEP
Coverage • Now an auto-updated Twiki page • https://twiki.cern.ch/twiki/bin/view/AtlasComputing/FaxCoverage • Coverage is good (~85%), but we should aim for >95%! • Info fetched from http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary
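How the ~85% number is obtained, as a minimal sketch (not the actual script behind the Twiki page): take the per-site daily job summary and compute the fraction of jobs that ran at FAX-enabled sites. The record layout and the FAX_SITES set below are illustrative assumptions.

```python
# Illustrative subset of FAX-enabled sites; the real list comes from SSB/AGIS.
FAX_SITES = {"MWT2", "AGLT2", "BNL-ATLAS"}

def coverage(summary, fax_sites=FAX_SITES):
    """Fraction of jobs that ran at FAX-enabled sites (hypothetical record layout)."""
    total = sum(rec["jobs"] for rec in summary)
    at_fax = sum(rec["jobs"] for rec in summary if rec["site"] in fax_sites)
    return float(at_fax) / total if total else 0.0

# A made-up daily summary, standing in for the dashboard dailysummary output.
summary = [
    {"site": "MWT2",      "jobs": 5000},
    {"site": "AGLT2",     "jobs": 3000},
    {"site": "NoFAXSite", "jobs": 1500},
]
print("coverage: %.0f%%" % (100 * coverage(summary)))   # -> 84%
```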
Traffic • Slowly increasing • Record for maximum peak output broken • Still small compared to what we expect will come
Failover • Running stably
Overflow status • The whole chain is ready • I have set all the US queues to allow 3 Gbps both to and from each site. • Test tasks submitted to sites that don't have the data, so that transfertype=FAX is invoked. • This does not test the JEDI decision making (the one based on the cost matrix) • Waiting for actual jobs to check the full chain • Users not yet instructed to use the JEDI client • Waiting for the JEDI monitor
Overflow tests • The test is the hardest IO case – 100% of events, all branches read, standard TTreeCache, no AsyncPrefetch (see the sketch below). • Site-specific FDR datasets (10 datasets, 744 files, 2.7 TB) • All the source/destination combinations of US sites • All of it submitted in 3 batches, but not all started simultaneously; affected by priority degradation. • Three input files per job. • If a site is copy2scratch, the pilot does an xrdcp to scratch; otherwise jobs access the files remotely.
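For concreteness, a minimal PyROOT sketch of that read pattern, assuming a FAX-readable file; the path and tree name below are placeholders, not an actual FDR dataset: every event is read, all branches go through a standard TTreeCache, and asynchronous prefetching is left off.

```python
import ROOT

# Keep asynchronous prefetching off (it is off by default; shown for clarity).
ROOT.gEnv.SetValue("TFile.AsyncPrefetching", 0)

# Placeholder FAX path and tree name, not an actual FDR dataset.
f = ROOT.TFile.Open("root://redirector.example//atlas/rucio/some.dataset/AOD.pool.root")
t = f.Get("CollectionTree")

t.SetCacheSize(30 * 1024 * 1024)     # standard 30 MB TTreeCache
t.AddBranchToCache("*", True)        # cache all branches

for i in range(t.GetEntries()):      # 100% of the events
    t.GetEntry(i)                    # reads every branch of the event
```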
Overflow tests • Error rate • Total: 9188 jobs • Finished: 9052 • Failed: 117 – 1.3% • 24 – OU reading from OU (no FAX involved) • 66 – reading from WT2 (files are corrupted) • 27 – 0.29% – actual FAX errors where SWT2 did not deliver the files; will be investigated • The rest are "Payload ran out of memory"
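For reference, a quick check that the quoted percentages follow from the job counts above:

```python
# Back-of-the-envelope check of the quoted failure rates (numbers from the slide above).
total_jobs = 9188
failed = 117
fax_errors = 27   # files not delivered by SWT2

print("overall failure rate: %.1f%%" % (100.0 * failed / total_jobs))      # -> 1.3%
print("FAX-related failures: %.2f%%" % (100.0 * fax_errors / total_jobs))  # -> 0.29%
```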
Overflow tests • Jobs reading from local scratch – for comparison (scout jobs are marked in the plots) • Direct access site, reading locally – per job: 7.2 MB/s, 67% CPU efficiency, 71 ev/s • Copy2scratch site – per job: 11.0 MB/s, 97% CPU efficiency, 109 ev/s
Overflow tests • Jobs reading from remote sources • Direct access site, reading remotely (no saturation) – per job: 4.2 MB/s, 43% CPU efficiency, 42 ev/s • Direct access site, reading remotely (possibly a start of saturation) – per job: 3.5 MB/s, 29% CPU efficiency, 34 ev/s
Overflow tests • MWT2 reading from OU and SWT2 simultaneously • In aggregate reached 850 MB/s – the limit for MWT2 at that time.
Cost matrix • Plot of transfer cost by source and destination site • http://1-dot-waniotest.appspot.com/
localSetupFAX • Added command fax-ls – made by Shuwei Ye • Will finally replace isDSinFAX • He will move all the other tools to Rucio • Change in fax-get-best-redirector • Previously each invocation did three queries: • SSB to get the endpoints and their status • AGIS to get the sites hosting the endpoints • AGIS to get the site coordinates • Each call returns hundreds of kB • Can't scale to a large number of requests • Solution (sketched below): • Made a GoogleAppEngine servlet that every 30 minutes fetches the info from SSB and AGIS and serves it from memory • Information slimmed to what is actually needed: ~several kB • Requests now served in a few tens of ms • "Infinitely" scalable
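A minimal sketch of that serving side, assuming a Python GAE app with a cron-triggered refresh; the URLs, handler names and cached layout below are placeholders, not the actual waniotest code:

```python
import json
import urllib2

import webapp2
from google.appengine.api import memcache

SSB_URL = "https://example.cern.ch/ssb/fax-endpoints"   # placeholder, not the real SSB query
AGIS_URL = "https://example.cern.ch/agis/fax-sites"     # placeholder, not the real AGIS query

def refresh_cache():
    """Fetch endpoint status and site info, slim it down, and keep it in memcache."""
    endpoints = json.load(urllib2.urlopen(SSB_URL))
    sites = json.load(urllib2.urlopen(AGIS_URL))
    slimmed = {"endpoints": endpoints, "sites": sites}   # in reality slimmed to a few kB
    memcache.set("fax_info", json.dumps(slimmed))

class RefreshHandler(webapp2.RequestHandler):
    def get(self):                                       # hit by a 30-minute cron entry
        refresh_cache()

class InfoHandler(webapp2.RequestHandler):
    def get(self):                                       # what fax-get-best-redirector would query
        data = memcache.get("fax_info")
        if data is None:                                 # first request after a cold start
            refresh_cache()
            data = memcache.get("fax_info")
        self.response.headers["Content-Type"] = "application/json"
        self.response.write(data)

app = webapp2.WSGIApplication([
    ("/cron/refresh", RefreshHandler),
    ("/faxinfo", InfoHandler),
])
```

The design point is simply that the expensive SSB/AGIS queries happen once per refresh interval, so client latency and load no longer grow with the number of fax-get-best-redirector calls.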
Monitoring – collector, dashboard • Problem: support of multi-VO sites • Meeting: Alex, Matevz, me • Issues: • Site name: • ATLAS reports it • CMS does not, or reports it badly – will be fixed • Requesting user's VO: • ATLAS reports it • CMS is not strict about it; US-CMS uses GUMS – will be fixed • Proposal: • During the summer Matevz develops an XrdMon that can handle multi-VO messages • Multi-VO sites send their messages to a special "mixed" AMQ; the dashboard splits the traffic according to the user's VO (see the sketch below) • Details: https://docs.google.com/document/d/1Syx3_vkwCfc5lj2lQzbUUrKT0Je238w6lcwVL7IY1GY/edit#
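A minimal sketch of the splitting idea on the dashboard side, not the actual XrdMon or dashboard code; the `user_vo` field and the destination names are assumptions:

```python
# Route each record arriving on the "mixed" AMQ destination to a per-VO feed,
# based on the VO reported for the requesting user.
PER_VO_FEED = {
    "atlas": "/topic/xrdmon.atlas",   # hypothetical destination names
    "cms":   "/topic/xrdmon.cms",
}
UNKNOWN_FEED = "/topic/xrdmon.unclassified"   # e.g. records with no VO reported

def route_record(record):
    """Return the destination feed for one parsed monitoring record (a dict)."""
    vo = (record.get("user_vo") or "").lower()   # hypothetical field name
    return PER_VO_FEED.get(vo, UNKNOWN_FEED)

# A record from a multi-VO site with a CMS user goes to the CMS feed;
# a record without a VO ends up in the unclassified feed.
print(route_record({"site": "T2_Example", "user_vo": "cms"}))   # /topic/xrdmon.cms
print(route_record({"site": "T2_Example"}))                     # /topic/xrdmon.unclassified
```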
Monitoring • Failover • Not flexible enough • Overflow • No monitoring yet • Need to compare jobs grouped by transfer type
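As an illustration of the comparison the overflow monitoring should provide, a minimal sketch (not existing monitoring code) that groups jobs by transfer type and compares average CPU efficiency and event rate; the column names are assumptions, and the values are the per-job figures quoted on the overflow-test slides:

```python
import pandas as pd

# Hypothetical per-job records: column names are assumptions, values are the
# per-job figures from the overflow tests above.
jobs = pd.DataFrame([
    {"transfertype": "local", "cpu_eff": 0.97, "events_per_s": 109},
    {"transfertype": "local", "cpu_eff": 0.67, "events_per_s": 71},
    {"transfertype": "FAX",   "cpu_eff": 0.43, "events_per_s": 42},
    {"transfertype": "FAX",   "cpu_eff": 0.29, "events_per_s": 34},
])

# Average per transfer type: the side-by-side view the overflow monitoring needs.
print(jobs.groupby("transfertype")[["cpu_eff", "events_per_s"]].mean())
```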