GridPP use- interoper- communic- ability

GridPP use-interoper-communic-ability Tony Doyle

Introduction • Is the system usable? • How will GridPP and NGS interoperate? • Communication and discussion introduction “Gridability”

A. “Usability” (Prequel) • GridPP runs a major part of the EGEE/LCG Grid, which supports ~3000 users • The Grid is not (yet) as transparent as end-users want it to be • The underlying overall failure rate is ~10% • User (interface)s, middleware and operational procedures (need to) adapt • (see talk by Jeremy for more info. on performance and operations) • Procedures to manage the underlying problems, such that the system is usable, are highlighted
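As a rough illustration of why procedures such as automatic resubmission can make a system with a ~10% underlying failure rate usable, the sketch below computes the chance that a job still fails after a number of retries. It assumes each attempt fails independently with the same probability, which is an assumption made here for illustration and not a claim from the talk; the function name is hypothetical.

```python
def residual_failure(p_fail: float, resubmissions: int) -> float:
    """Probability that a job fails on the first attempt and on every resubmission.

    Assumes independent attempts with identical failure probability
    (an illustrative simplification, not a measured property of the Grid).
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if resubmissions < 0:
        raise ValueError("resubmissions must be non-negative")
    # One initial attempt plus `resubmissions` retries must all fail.
    return p_fail ** (resubmissions + 1)


if __name__ == "__main__":
    for retries in range(4):
        print(retries, residual_failure(0.10, retries))
```

Under that independence assumption, a single resubmission would already reduce the user-visible failure rate by an order of magnitude; correlated failures (for example a misconfigured site) would weaken this.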

“Active” User requires thousands of CPU hours • EGEE CPU hours (1 April 2006 to 31 July 2006): 5 million hours

Virtual Organisations • Users are grouped into Virtual Organisations • Users/VO varies from 1 to 806 members (and growing..) • Broadly four classes of VO • LHC experiments • EGEE supported • Worldwide (mainly non-LHC particle physics) • Local/regional e.g. UK PhenoGrid • Sites can choose which VOs to support, subject to MOU/funding commitments • Most GridPP sites support ~20 VOs • GridPP nominally allocates 1% of resources to EGEE non-HEP VOs • GridPP currently contributes 30% of the EGEE CPU resources “Gridability”

User View? • Perspective matters • This is not • a usability survey • unbiased • representative • Straw poll • users overcame initial registration hurdles within ~two weeks • users adapt to Grid in (un-)coordinated ways • The Grid was sufficiently flexible for many analysis applications “Gridability”

[Figure: analysis data-flow diagram. Labels: Raw Data; ESD: Data or Monte Carlo; AOD (Analysis Object Data); Event Tags; Calibration; Event Selection; Data Analysis, Skims; Physics Objects; Physics Analysis. Levels: Collaboration-wide Tasks; Analysis Groups; Individual Physicists. Arrow label: Increasing data flow.]

User evolution
• Number of UK Grid users (exc. Deployment Team); many EGEE VOs supported; c.f. 3000 EGEE target
  Quarter    05Q4    06Q2    06Q3
  Users      1342    1831    2777
• Number of active users (> 10 jobs per month)
  Quarter    05Q4    06Q1    06Q2
  Active     83      166     201
  Fraction   6.2%    -       11.0%
• Viewpoint: growing fairly rapidly, but not as active as they could be? Depends on the “active” definition
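The active-user fractions quoted on this slide follow from dividing the number of active users by the number of registered users in the same quarter. A minimal sketch of that arithmetic, using only the figures given on the slide (the variable and function names are illustrative):

```python
# Registered UK Grid users and active users (> 10 jobs per month), per quarter,
# as quoted on the slide. 06Q1 has no registered-user figure and 06Q3 has no
# active-user figure, so no fraction can be computed for those quarters.
registered = {"05Q4": 1342, "06Q2": 1831, "06Q3": 2777}
active = {"05Q4": 83, "06Q1": 166, "06Q2": 201}


def active_fraction(registered: dict, active: dict) -> dict:
    """Return active/registered for every quarter present in both tables."""
    common = sorted(registered.keys() & active.keys())
    return {quarter: active[quarter] / registered[quarter] for quarter in common}


if __name__ == "__main__":
    for quarter, fraction in active_fraction(registered, active).items():
        print(f"{quarter}: {fraction:.1%}")
```

Only 05Q4 and 06Q2 appear in both series, which is why the slide lists two fractions for three quarters.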

Know your users? • UK-enabled VOs (members, VO name): 806 atlas, 763 dzero, 577 cms, 566 dteam, 150 lhcb, 131 alice, 75 bio, 65 dteamsgm, 41 esr, 31 ilc, 27 atlassgm, 27 alicesgm, 21 cmsprg, 18 atlasprg, 17 fusn, 15 zeus, 13 dteamprg, 13 cmssgm, 11 hone, 9 pheno, 9 geant, 7 babar, 6 aliceprg, 5 lhcbsgm, 5 biosgm, 3 babarsgm, 2 zeussgm, 2 t2k, 2 geantsgm, 2 cedar, 1 phenosgm, 1 minossgm, 1 lhcbprg, 1 ilcsgm, 1 honesgm, 1 cdf

User Interface • The GUI is relatively low-level (jobs, file collections) • Dynamic panels for higher-level functions • Dockable windows • [Figure: screenshot of the Ganga GUI, with panels labelled Scriptor, Job details, Logical Folders, Job Monitoring, Job builder, Log window]

Complex Applications • ATLAS: GANGA software framework (jointly with LHCb); data challenges; producing Monte Carlo data; 10 million CPU hours per year • LHCb: DIRAC software to submit analysis jobs using the Grid; 2006 analysis job completion efficiency improved to 91% • CMS: Monte Carlo production, data transfer, job submission; CMS transfers have topped a petabyte a month for the last three months

WLCG MoU • Particle physicists collaborate, play roles and delegate, e.g. “prg” production group, “sgm” software group managers • Underpinned by Memoranda of Understanding • Current MoU signatories: China, France, Germany, Italy, India, Japan, Netherlands, Pakistan, Portugal, Romania, Taiwan, UK, USA • Pending signatures: Australia, Belgium, Canada, Czech Republic, Nordic, Poland, Russia, Spain, Switzerland, Ukraine • Negotiation w.r.t. resource and service level

Resource allocation • Need to assign quotas and priorities to VOs and measure delivery • VOMS provides group/role information in the proxy • Tools to control quotas and priorities in site services being developed • So far only at whole-VO level • Maui batch scheduler is flexible, easy to map to groups/roles • Sites set the target shares • Can publish VO/group-specific values in GLUE schema, hence the RB can use them for scheduling • Accounting tool (APEL) measures CPU use at global level (UK task) • Storage accounting currently being added • GridPP monitors storage across UK • Privacy issues around user-level accounting, being solved by encryption “Gridability”
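To make "sites set the target shares" and "measure delivery" concrete, the sketch below compares each VO's target share with the fraction of CPU time it actually received. The VO names reuse those on earlier slides, but the shares and CPU-hour figures are invented for illustration, and the function is hypothetical; it is not part of Maui, APEL or any GridPP tool.

```python
def share_deviation(target_shares: dict, cpu_hours: dict) -> dict:
    """Return delivered share minus target share for each VO.

    target_shares: VO name -> target fraction of the site (should sum to ~1).
    cpu_hours:     VO name -> CPU hours actually used in the accounting period.
    A positive value means the VO received more than its target share.
    """
    total = sum(cpu_hours.values())
    if total <= 0:
        raise ValueError("no CPU usage recorded in this period")
    return {
        vo: cpu_hours.get(vo, 0.0) / total - share
        for vo, share in target_shares.items()
    }


if __name__ == "__main__":
    # Illustrative numbers only; not taken from any accounting record.
    targets = {"atlas": 0.5, "cms": 0.3, "lhcb": 0.2}
    usage = {"atlas": 600.0, "cms": 300.0, "lhcb": 100.0}
    for vo, delta in share_deviation(targets, usage).items():
        print(f"{vo}: {delta:+.2f}")
```

The same comparison could in principle be made per group or role once quotas are available below whole-VO level, which the slide notes is not yet the case.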

User Support • Becoming vital as the number of users grows • But modest effort available in the various projects • Global Grid User Support (GGUS) portal at Karlsruhe provides a central ticket interface • Problems are categorised • Tickets are classified by an on-duty Ticket Process Manager, and assigned to an appropriate support unit • UK (GridPP) contributes support effort • GGUS has a web-service interface to ticketing systems at each ROC • Other support units are local mailing lists • Mostly best-effort support, working hours only • Currently ~tens of tickets/week • Manageable, but may not scale much further • Some tickets slip through the net “Gridability”

Documentation & Training • Need documentation and training for both system managers and users • Mostly expert users up to now, but user community is expanding • Induction of new VOs is a particular problem – no peer support • EGEE is running User Fora for users to share experience • Next in Manchester in May ’07 (with OGF) • EGEE has a dedicated training activity run by NeSC/Edinburgh • Documentation is often a low priority, little dedicated effort • The rapid pace of change means that material requires constant review • Effort on documentation is now increasing • GridPP has appointed a documentation officer • GridPP web site, wiki • Installation manual for admins is good • There is also a wiki for admins to share experience • Focus is now on user documentation • New EGEE web site – coming soon “Gridability”

Alternative view? • The number of users in the Grid School for the Gifted is ~manageable now • The system may be too complex, requiring too much work by the “average user”? • Or the (virtual) help desk may not be enough? • Or the documentation may be misleading? • Or.. • Having smart users helps (the current ones are) “Gridability”

B. “Interoperability” • GridPP/NGS meeting - Nottingham EMCC, September 2006 • Present: Tony Doyle, David Britton, Paul Jeffreys, David Wallom, Robin Middleton, Andy Richards, Stephen Pickles, Steven Young, Dave Colling, Peter Clarke, Neil Geddes • Agenda: • Ultimate goals and the model for achieving them and any constraints • Timetables • Required software (in both directions) “Gridability”

B. “Interoperability” • Goals: A general discussion on what we might hope to achieve and why. • Several key points made... • Open question whether we ever need to actually have any closer partnership • GridPP is focused on a relatively immediate goal and will always be constrained in some way by the broader LCG requirements • NGS should be further from the bleeding edge in grid developments • NGS affiliation and partnership model exists • GridPP T2's all have MoUs which will need revamping under GridPP3. This will be an ideal opportunity to formalise any relationship between GridPP (T2's) and the NGS. • It is unclear who is using EGEE (in the UK) and who could or would want to use it • EGEE-UKI needs to do a better PR job within the UK • Phenogrid are registering with EGEE “Gridability”

B. “Interoperability” • The current "minimal software stack" approach of NGS is being reviewed as a greater variety of partner resources are considered (data centres and research facilities) • Different "stacks" will be relevant to different sorts of partners, i.e. there is likely to be a range of "NGS Profiles" • For the foreseeable future, NGS is likely to exist in a world with multiple parallel software stacks and it will not be possible to merge them • Installing parallel stacks or profiles is not a problem if they are easy to install and do not interfere • One possibility is that the different NGS profiles would reflect different stacks such as GT4 or gLite • Operations: can we present accounting information consistently?

B. “Interoperability” • What benefit is there in a GridPP site joining NGS? • Much less relevant for sites where the resources are essentially dedicated to HEP. Where there are facilities shared with other fields, the generic and shared nature of the NGS can provide ready-made interfaces for the broader communities. We are clearly a long way from being able to merge both activities completely, e.g. GridPP requirements on monitoring and accounting could not currently be met by NGS nodes, and NGS would not require all partners to report à la GridPP. (Of course this does not preclude project-specific layers, such as this accounting, on top of the basic NGS profiles for relevant partners.) • There is a concern that "joining" the NGS would put an additional load on the GridPP sites. Looking further ahead, the intention is that this is not the case, and that supporting the standard NGS profiles is exactly the same work as required to meet (a subset of) the GridPP requirements. This can only be guaranteed if there is sufficient representation of GridPP sites within the NGS.

B. “Interoperability” • Next steps/timetable • GridPP3 MoUs - No action required. Can wait until next year and should be informed by lessons learned over the next 6-12 months. GridPP sites currently meet the minimal requirements for NGS through the standard GridPP installations. • If Sites enable the NGS VO then this effectively gives NGS affiliation if they wish. • Formal Affiliation would, however, require that the interface be monitored by NGS. Agreed that the next step should be to understand in detail what is actually required for NGS partnership. “Gridability”

B. “Interoperability” • Next steps/timetable • Agreed to focus on two sites, Glasgow and LeSC. Aim to be ready to achieve NGS “partnership” by Christmas 2006. • The decision as to whether or not to actually apply for formal partnership can be left to later in the year. • The principal goal is to understand the steps and requirements etc. • It was agreed that NGS should provide a gLite CE for core NGS nodes, which would allow the nodes to be a part of the EGEE/LCG SAM infrastructure. • Accounting and monitoring are areas which are still developing and where it is not clear what the best solution is (for NGS) • Meet once more before Christmas.

=> Implementation… • GU should concentrate on delivering: 1. A job submission mechanism 2. A method to prepare the job's environment (which input files, etc.) • This means we can offer: 1. gsissh login to the head node, with access to some shared space (e.g. the home directory for the NGS pool accounts) 2. job submission from the head node to the gatekeeper, which can use either GRAM (globus-job-submit) or EGEE methods (edg-job-submit) • This would seem to qualify us as an NGS partner site, comparing with http://www.grid-support.ac.uk/index.php?option=content&task=view&id=143 • The SLAs on offer seem none too onerous
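The two offerings above (gsissh login, then submission from the head node via GRAM or EGEE tools) can be sketched as the command lines a user would run. The sketch only builds argument lists and does not execute anything; the host names, gatekeeper contact string and JDL file name are hypothetical placeholders, and only the simplest invocation of each command is shown.

```python
def login_command(head_node: str) -> list:
    """Command line for a GSI-authenticated login to the head node."""
    return ["gsissh", head_node]


def gram_submit_command(gatekeeper: str, executable: str, args=()) -> list:
    """Command line for a GRAM batch submission to the given gatekeeper contact."""
    return ["globus-job-submit", gatekeeper, executable, *args]


def egee_submit_command(jdl_path: str) -> list:
    """Command line for an EGEE submission described by a JDL file."""
    return ["edg-job-submit", jdl_path]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    head_node = "headnode.example.ac.uk"
    gatekeeper = "ce.example.ac.uk/jobmanager-pbs"
    print(" ".join(login_command(head_node)))
    print(" ".join(gram_submit_command(gatekeeper, "/bin/hostname")))
    print(" ".join(egee_submit_command("myjob.jdl")))
```

Keeping the two submission paths behind separate helpers mirrors the slide's point that the gatekeeper can be reached by either route from the same head node.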

"T0-T1-T2 Service Challenges" Panel Members: Tony Cass, Jeremy Coles, Dave Colling, John Gordon, Dave Kant, Mark Leese, Jamie Shiers. [notes recorded by: Neasan O'Neill] "Analysis on the Grid" Panel Members: Roger Barlow, Giuliano Castelli, David Grellscheid, Mike Kenyon, Gennady Kuznetsov, Steve Lloyd, Andrew McNab, Caitriana Nicholson, James Werner. [notes recorded by: Giuseppe Mazza] "How is/will data be managed at the T1/T2s?" Panel Members: Phil Clark, Greig Cowan, Brian Davies, Alessandra Forti, David Martin, Paul Millar, Jens Jensen, Sam Skipsey, Gianfranco Sciacca, Robin Tasker, Paul Trepka. [notes recorded by: Tom Doherty] "Experiment Service Challenges" Panel Members: Dave Colling, Catalin Condurache, Peter Hobson, Roger Jones, Raja Nandakumar, Glenn Patrick. [notes recorded by: Caitriana Nicholson] "Beyond GridPP2 and e-Infrastructure" Panel Members: Pete Clarke, Dave Britton, Tony Doyle, Neil Geddes, John Gordon, Neasan O'Neill, Joanna Schmidt, John Walsh, Pete Watkins. [notes recorded by: Duncan Rand] "Site Installation and Management" Panel Members: Tony Cass, Pete Gronbech, Dave Kelsey, Winnie Lacesso, Colin Morey, Mark Nelson, Derek Ross, Graeme Stewart, Steve Thorn, John Walsh. [notes recorded by: Mark Leese] "What is a workable Tier-2 Deployment Model?" Panel Members: Olivier van der Aa, Jeremy Coles, Santanu Das, Alessandra Forti, Pete Gronbech, Peter Love, Giuseppe Mazza, Duncan Rand, Graeme Stewart, Pete Watkins. [notes recorded by: Gianfranco Sciacca] "What is Middleware Support?" Panel Members: Mona Aggarwal, Tom Doherty, Barney Garrett, Jens Jensen, Andrew McNab, Robin Middleton, Paul Millar, Robin Tasker. [notes recorded by: Catalin Condurache] C. “Communicability” “Gridability”

1. "LCG Service Challenges" • This was a session which brought out the detailed planning of Service Challenges. 1. SC is a great idea which is a kind of reality check: “reality” is imminent data, increasing complexity of experiment-led initiatives, and more users 2. Need more documentation and support: still true(!) despite effort 3. Time scales and deadlines are needed for deployment: well known and widely communicated via Jamie – Jeremy… 4. Storage model is an important issue, especially for the storage group: increasingly large issue – dedicated discussion 5. Communication on experience: to be discussed at forthcoming DTeam and PMB meetings 6. Networks will play an important part in SC4: underpins file transfer tests, but needs to be embedded within these - disk performance (being understood) v network performance (many [hidden] variables)

There was a list of specific actions • Implement a better user support model ONGOING • Support the deployment of an SRM at every Tier-2 site DONE • Revisit site plans for implementing promised resources DONE • Support the installation of any required local catalogues at sites GENERALLY LIMITED TO TIER-1. DONE • Investigate the experiment VO box requests. Make a recommendation to Tier-2s. Revisit as GridPP. NOT REQD. (CURRENTLY) • Better understand network links to sites (we do not want to saturate links) ONGOING • Schedule transfer tests from Tier-1 to Tier-2 test rates and stability DONE AND ONGOING • Work closer with experiments? CAN IMPROVE “Gridability”

There was a list of specific actions • user support (mail lists, web form, TPMs, GGUS integration) NEED TO ENSURE USERS “KNOW” (AND KEEP REMINDING THEM) • SRM at T2 (almost done) DONE • site plans revised (SRIF3, FEC) ONGOING • local catalogues (wiki, SC3, plan for rest) • VO boxes (review group) DISAPPEARING.. • network links (10 easy questions, wiki) FIREWALL+GRID http://www.ggf.org/documents/GFD.83.pdf • T1-T2 tests (plan, stalled, dcache/dpm) DONE • Experiment links (some progress) MORE REQD. “Gridability”

2. "Running Applications on the Grid" (Why won't my jobs run?) Summary • A number of people say things are working well - pleasant surprise - easier than LSF! A SUBSET OF USERS ATTEND GRIDPP MEETINGS • VO setup and requirements: don't want each VO to have to talk to each site. VO should provide a list of requirements for a site to support the VO. THERE ARE A LARGE NUMBER OF RESPONSIBILITIES TO BE HANDLED BY EACH EXPT. • Certificates: need to improve situation. Once over this hurdle, using the grid is plainer sailing. INTRINSIC TIME DEPENDENCE OF CA-RA-USER TRUST ESTABLISHMENT (NECESSARY) • Data management issues more of a problem than job or RB problems. How to get information to user re failures and support channels. INCREASINGLY TRUE – MANY AD-HOC DELETIONS FOLLOWING E.G. FTS FAILURES • Monitoring real file transfers would be an interesting addition. USER MECHANISMS TO TRACE OVERALL PROGRESS, BUT NOT MANY INDIVIDUAL USER TOOLS/SCRIPTS APPEARING, E.G. TNT (Tag Navigator Tool) PLUG-IN TO GANGA FOR ATLAS FILE COLLECTIONS WOULD NEED TO COMMUNICATE WITH THE MonAMI FTS PLUG-IN

3. "Grid Documentation" (What documentation is needed/missing? Is it a question of organisation?) • Could updates to documents be raised at meetings? • A mailing list specifically for document updates may be useful. • Competition between different solutions to one problem. • For all experiments - link in all documentation and give responsibility to a line manager (for example) to oversee its maintenance. • What are the mechanisms, or how do we find out what is inadequate within a document? A document should be checked every few months to point out its inadequacies => should a review process be set up by SB? • Roles and responsibilities should be established. • Important documents should be highlighted - an index of useful docs and of what sources of documents are available may be useful. • Much progress made by Stephen Burke in many of these areas. Steve attends PMB

5. "Beyond GridPP2 and e-Infrastructure" • (What is the current status of planning?) • EGEE II may be superseded by European infrastructure – EGEE III NOW BEING PLANNED • DTI planning a UK infrastructure • Integrate better with NGS - SEE EARLIER SLIDES • More things developed by GridPP will be supported centrally – NEED TO CONVINCE UK COMMUNITY OF THE USEFULNESS AND ADAPTABILITY OF GLITE AS A COMPONENT PART OF PERVASIVE INFRASTRUCTURE “Gridability”

6. "Managing Large Facilities in the LHC era" • (What works? What doesn't? What won't?) • Sys admins seem happy with their package managers. • We should share common knowledge (about software tools) more. ONGOING • Extra costs (over and above the price of the hardware) involved in having large clusters. ONGOING • IMPROVED, BUT CAN IMPROVE FURTHER • METRIC: DT (INSTALL – USER AVAILABILITY) + AVAILABILITY

7. "What is a workable Tier-2 Deployment Model?" • Conclusion: Deployment is under control • testing has made good progress • operations still an issue • METRIC: DT (INSTALL – USER AVAILABILITY) + OVERALL AVAILABILITY * # SYSTEM MANAGER(S) • “EXCELLENT” T2 SUPPORT STRUCTURE REQD.

8. "What is Middleware Support?" • (really all about) • gLite test bed • EGEE2 - dedicated testing/certification system • using a wiki was a good idea. Consolidate into documents. • need some structure to make sure the wiki doesn't get out of control. • need some moderators for the wiki. • developers not getting correct requirements for s/w; sysadmin questions are not the same questions that were in the minds of the developers. • bad if the wiki is incorrect. • need someone to move what is in the wiki to some sort of more formal docs (LaTeX or DocBook) which have been properly checked and signed off by the developers. • ONGOING, LIMITED PROGRESS – INTRINSIC LIMITATION? (THERE WILL ALWAYS BE OUT OF DATE/LIMITED DOCUMENTATION?) • NEED A DOCUMENTATION REVIEW CHALLENGE?

Conclusion • All sessions were felt to be worthwhile • Some produced hard actions • Some areas have made progress since • Positive correlation between subjects which made progress and where GridPP had existing structures in place (Deployment, Documentation) • Counter-examples: middleware, experiments • Let’s do this again, but next time take more care to task people with subsequent progress and look for new structures to deliver results. • “MAKE IT SO” • The logical end of a talk on “Gridability” (or the emperor's new clothes?)
