
Analysis support issues, virtual analysis facilities, and Tier 3s


Presentation Transcript


  1. Analysis support issues, virtual analysis facilities, and Tier 3s Doug Benjamin (Duke University) With inputs from Hiro Ito, Val Hendrix,

  2. How do US ATLAS physicists work today? • As Rik mentioned, surveys are being conducted (US ATLAS: 128 responses and increasing; ATLAS-wide: 266)

  3. US ATLAS analysis grid frequency

  4. US ATLAS local usage

  5. Difference in cultures: ATLAS vs US ATLAS (local usage) • US analyzers rely more heavily on local resources than their colleagues

  6. What data do people use now? • US and ATLAS analyzers are not that different

  7. What types of D3PDs are used? • Provides an indication of what types of D3PDs we should have in the US

  8. How do people get their data? • dq2-get is very popular among analyzers • Seen in surveys and dq2 traces
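
A minimal sketch (not from the slides) of the fetch pattern described above: a plain dq2-get call wrapped in Python. The dataset name is a hypothetical placeholder.

    import subprocess

    # Hypothetical dataset name; dq2-get pulls the dataset's files into the
    # current directory, which is the "instant feedback" usage the survey reports.
    dataset = "user.example.2012.SMWZ.ntuple/"  # placeholder, not a real dataset
    subprocess.check_call(["dq2-get", dataset])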

  9. Dq2-get frequency and importance

  10. Interim conclusions • US ATLAS physicists use a variety of resources at their disposal • Most use the grid; a significant fraction (~25%) do not (similar number for ATLAS) • ARRA funding was a success: local resources are used • US physicists make frequent and heavy use of local resources • US physicists share with others in ATLAS (47%) • dq2-get is an extremely popular way to fetch data • Easy to use, instant feedback • Easy to misuse and cause trouble upstream • Need a solution that provides "instant" gratification, yet protects the system as a whole. Will Rucio provide this?

  11. Helping the users • (plots: US ATLAS survey and ATLAS survey) • Better communication is needed: we need to reach ~25% of the analyzers • All ATLAS analyzers should know about DAST; it is a resource for them

  12. Moving toward the future • 2014/2015 will be an important time for ATLAS and US ATLAS • The machine will be at full energy • ATLAS wants a trigger rate to tape of ~1 kHz (vs ~400 Hz now) • We need to produce results as fast as before (maybe faster) • ARRA-funded computing will be 5 years old and beginning to age out • US funding of LHC computing is likely flat at best • Need to evolve the ATLAS computing model • Need at least a factor of 5 improvement (more data and a need to go faster) • Need to evolve how people do their work • Fewer local resources will be available in the future • US ATLAS users need to move towards central resources • But… need to avoid a loss in functionality and performance

  13. Additional functionalities • What about giving users space on the grid? • What about EOS?

  14. EOS (or another similar technology) • If US ATLAS were to provide you with EOS space in the US, how useful would it be to you?

  15. US ATLAS local group disk: 3 primary uses • Output of common datasets for US ATLAS analyzers • Should try to have sufficient space in the ATLAS group space tokens • For example, a large amount of space in the Top group space tokens (NET2, SWT2) allows the Top group responsible (me) to direct all of the current data ntuples to the US • Output of datasets used by a few individuals • Some users are currently using a lot of space • The number of users who know that they can write into this space is limited (need to better educate users) • Lack of quotas is an issue • "Cache space" • Used effectively, it should reduce the need for post-processing DATRI requests • By having user jobs send output there directly from analysis jobs, it becomes easier for the user to fetch the output • Can we separate this space out? Should it be separate? • Rucio (next-gen data management) will provide "cloud"-wide quotas for users.

  16. Future of Tier 3 sites • Why do research groups like local computing? • Groups like the clusters that they have because they are now easy to maintain (we succeeded!!!) • They give them instant access to their resources • They give unlimited quota to use what they want for whatever they want • We need a multi-faceted approach to the Tier 3 evolution • Need to allow and encourage sites with the funding to continue to refresh their existing clusters • We do not want to give the impression in any way that we would not welcome additional DOE/NSF funding being used in this way • What about cast-off compute nodes from Tier 1/Tier 2 sites? • Need to use beyond-pledge resources to the benefit of all US ATLAS users and groups • If we cannot get sufficient funding for the refresh of the existing Tier 3s, beyond-pledge resources are a key component of the solution.

  17. New technologies on the horizon • Federated storage • With caching, Tier 3s can greatly reduce data management • WAN data access • User code must be improved to reduce latencies • Decouples storage and CPU • Analysis queues • Can be used to make use of beyond-pledge resources for data analysis, not just MC production • Cloud computing • Users are becoming more comfortable with cloud computing (Gmail, iCloud) • Virtual Tier 3s • All are coupled • Done right, they should allow us to do more with the resources that we will have
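
A minimal sketch, under assumptions, of the WAN data access idea above: reading an ntuple directly through an xrootd federation redirector with PyROOT instead of copying it locally. The redirector host, file path, and tree name are hypothetical.

    import ROOT

    # Open the file over the network through a (hypothetical) federation redirector.
    f = ROOT.TFile.Open("root://fax.redirector.example//atlas/user/example_ntuple.root")
    tree = f.Get("physics")                # hypothetical tree name
    tree.SetCacheSize(100 * 1024 * 1024)   # TTreeCache is one way user code can hide WAN latency
    print(tree.GetEntries())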

  18. Analysis queues • Test at BNL of a "beta" unofficial analysis queue • As a test of MapR storage, the BNL team set up a special queue • Doug B ran several instances of his standard 2011 test job: SMWZ ntuple, ~250 files, 10 subjobs of 25 files each, simple cut and count • Each subjob runs in less than 10 minutes (1300 Hz) • Initial results show additional tuning is needed • Some subjobs waited 45 minutes to start • System started from a cold start, with plenty of job slots • Will repeat the same tests at Duke • First need to rename files in xrootd (get rid of _sub*, etc.)
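
For concreteness, a sketch of how a test job like the one above might be submitted with prun (the PanDA client). The dataset names, analysis script, and queue name are hypothetical placeholders.

    import subprocess

    subprocess.check_call([
        "prun",
        "--exec", "python cut_and_count.py %IN",     # hypothetical cut-and-count script; %IN = input file list
        "--inDS", "user.example.2011.SMWZ.ntuple/",  # placeholder for the ~250-file input dataset
        "--outDS", "user.example.analyq.test1/",     # placeholder output dataset
        "--nFilesPerJob", "25",                      # 25 files per subjob -> ~10 subjobs
        "--site", "ANALY_BNL_TEST",                  # hypothetical name for the unofficial queue
    ])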

  19. Latency analysis (by Hiro) • "AutoPyFactory has a cycle period of 360 s." • "I am guessing this is what happened (I copied the times from various logs)." Details for one subjob:
      1. Job created at 2012-10-24 16:17
      2. AutoPyFactory sleep cycle also starts at 2012-10-24 16:17 (nothing happens for the next 6 minutes)
      3. AutoPyFactory submits to Condor at 2012-10-24 16:23:58
      4. Pilot wrapper on the worker node starts at 2012-10-24 16:26:21 UTC
      5. Pilot starts at 24 Oct 16:27:07
      6. prun starts at 24 Oct 16:27:26
      7. prun ends at 24 Oct 16:28:24
      8. Data output written to BNL dCache at 24 Oct 16:28:36
      9. Data output registered in the LFC at 24 Oct 16:28:40
      10. Job pilot ends at 24 Oct 16:29:25
      11. Log output written to BNL dCache at 24 Oct 16:31:00
      12. Log output registered in the BNL LFC at 24 Oct 16:31:02
      13. Job registered as a success at 24 Oct 16:31:03
      • Need to tune the various sources of latency; that process is beginning.
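
A small worked example that recomputes the stage-to-stage latencies from the timestamps listed above: a roughly one-minute payload took about 14 minutes of wall time, dominated by the factory cycle and output/log registration.

    from datetime import datetime

    # Timestamps copied from the timeline above (seconds assumed :00 where not given).
    stages = [
        ("job created",               "2012-10-24 16:17:00"),
        ("factory submits to Condor", "2012-10-24 16:23:58"),
        ("pilot wrapper starts",      "2012-10-24 16:26:21"),
        ("pilot starts",              "2012-10-24 16:27:07"),
        ("prun starts",               "2012-10-24 16:27:26"),
        ("prun ends",                 "2012-10-24 16:28:24"),
        ("output written to dCache",  "2012-10-24 16:28:36"),
        ("job registered as success", "2012-10-24 16:31:03"),
    ]

    prev_name, prev_t = stages[0][0], datetime.strptime(stages[0][1], "%Y-%m-%d %H:%M:%S")
    for name, ts in stages[1:]:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        print("%-28s +%4d s after %s" % (name, (t - prev_t).total_seconds(), prev_name))
        prev_name, prev_t = name, t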

  20. Virtual analysis clusters (Tier 3s) • R&D effort by Val Hendrix, Sergey Panitkin, and DB, plus a qualification task for student Henrik Ohman • Part-time effort for all of us is an issue for progress • We have been working on a variety of clouds (by necessity and not necessarily by choice) • Public clouds: EC2 (micro spot instances), Google (while still free) • Research clouds: FutureGrid (we have an approved project) • We are scrounging resources where we can (not always the most efficient) • Val has made progress on configuration and contextualization • Sergey has done a lot of work with PROOF • Sergey and I have both been working independently on data access/handling using xrootd and the federation • We could use a stable private cloud testbed.
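
As an illustration of the "public cloud" branch above, a minimal sketch, assuming the classic boto library, of requesting EC2 micro spot instances for a virtual analysis cluster. The AMI id, bid price, and key pair are hypothetical placeholders, not the project's actual configuration.

    import boto.ec2

    # Credentials are taken from the environment or the boto config file.
    conn = boto.ec2.connect_to_region("us-east-1")
    requests = conn.request_spot_instances(
        price="0.01",                 # hypothetical maximum bid (USD/hour)
        image_id="ami-00000000",      # placeholder AMI with a contextualized worker image
        count=4,
        instance_type="t1.micro",     # the micro spot instances mentioned above
        key_name="analysis-cluster",  # hypothetical key pair name
    )
    for req in requests:
        print(req.id, req.state)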

  21. Virtual analysis clusters (cont.) • PanDA is used for workload management in most cases (PROOF is the only exception) • Perhaps we can use PROOF on Demand • Open issues: • Configuration and contextualization: we are collaborating with CERN and BNL on Puppet; the outlook looks promising • What is the scale of the virtual clusters? • Personal analysis cluster • Research group analysis cluster (institution level) • Multi-institution analysis cluster (many people, several groups, up to country-wide) • Proxies for the PanDA pilots (personal or robotic) • What are the latencies incurred by this dynamic system? Spin-up and tear-down time? • What level of data reuse do we want (i.e. caching)? • How is data handled in and out of the clusters? • Initial activity used federated storage as the source • What is the performance penalty of virtualization?

  22. Virtual analysis clusters (future) • Present a snapshot of results at the upcoming T1/T2/T3 meeting at CERN • Got an extension of the Google Compute Engine trial period; Sergey will continue to work on analysis clusters • Get Henrik to ramp up (he owes us 80 work days, full time) • I have been asked by Fernando to follow his work • Henrik will focus on the Google cloud and FutureGrid (OpenStack) • LBL (Beate et al.) received a grant from Amazon to run on EC2 • Using Val's scripts, I will work with them to get them started right away • Data will come from US ATLAS federated storage; we might have to transfer files to local group disk, and we need a good solution for output in the long term • Need to help them migrate to private clouds once the grant ends • Work with John and Jose on APF

  23. Conclusions • The US ATLAS Tier 3 program has been an unqualified success • We must start the process for the future • New technologies are maturing that can be used and have the potential to be transformative • Additional resources from the facilities would go a long way toward helping • Virtual Tier 3s are making slow progress • New labor has been added
