Evaluation of the EDG Testbed and Middleware

HEP Applications Evaluation of the EDG Testbed and Middleware
Stephen Burke (EDG HEP Applications – WP8)
s.burke@rl.ac.uk
UCL workshop – 4-5 March 2004 – HEP Assessment of EDG

Introduction Updated from the CHEP talk ~ 1 year ago  Some things have changed, some not! Based on D8.4 report (EDG only here, 2.0/2.1 releases) Achievements of WP8 Updated use case analysis mapping HEPCAL to EDG Lessons learnt UCL workshop – 4-5 March 2004 – HEP Assessment of EDG – n° 2

Objectives and Achievements
- Objective: Evaluate EDG Application Testbed, and integrate into experiment tests as appropriate.
  - Achievement: Further successful evaluation of 1.4.n throughout the summer.
  - Achievement: Evaluation of EDG 2.0 on the EDG Application Testbed since October, and of EDG 2.1 since December.
  - Achievement: EIPs (Loose Cannons) helped testing of EDG components on the LCG Cert TB prior to LCG-1 start in September.
- Objective: Liaise with LCG regarding EDG/LCG integration and the development of the LCG service.
  - Achievement: Performed stress tests on LCG-1.
- Objective: Continue work with experiments on data challenges throughout the year.
  - Achievement: All 6 experiments have conducted data challenges of different scales throughout 2003 on EDG App TB or LCG/Grid.it.

Objectives and Achievements
- Objective: Continued work in Architectural Task Force (ATF)
  - Achievement: Walkthroughs of HEP use cases helped to clarify interfacing problems.
  - Achievement: Extension of HEPCAL use cases covering key areas in Biomedicine and Earth Sciences.
- Objective: Reactivation of the Application Working Group (AWG)
  - Achievement: Basis of first proposal for common application work in EGEE.
- Objective: Work with LCG/GAG (Grid Applications Group) in further refinement of HEP requirements
  - Achievement: HEPCAL-2 requirements document for the use of the grid by thousands of individual users.
  - Achievement: In addition, further refined the original HEPCAL document.
- Objective: Development of tutorials and documentation for the user community
  - Achievement: WP8 has played a substantial role in course design, implementation and delivery.

Use Case Analysis
- EDG release 2.0 has been evaluated against the HEPCAL Use Cases
- Of the 43 Use Cases:
  - 13 (was 10) are fully implemented
  - 4 (was 8) are largely satisfied, but with some restrictions or complications
  - 11 (was 8) are partially implemented, but have significant missing features
  - 15 (was 17) are not implemented
- Missing functionality is mainly in:
  - Virtual data (not considered by EDG)
  - Metadata catalogues and file collections (still needs more work)
  - Authorisation, job control and optimisation (partly delivered but not integrated)
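The use case counts above can be cross-checked with a short tally. This is a minimal sketch: the numbers are the ones quoted on the slide, while the dictionary names and the `share_percent` helper are illustrative and not part of the D8.4 report.

```python
# Use case counts quoted on the slide; "previous" holds the "(was N)" figures.
TOTAL_USE_CASES = 43

current = {
    "fully implemented": 13,
    "largely satisfied": 4,
    "partially implemented": 11,
    "not implemented": 15,
}
previous = {
    "fully implemented": 10,
    "largely satisfied": 8,
    "partially implemented": 8,
    "not implemented": 17,
}


def share_percent(count: int, total: int = TOTAL_USE_CASES) -> float:
    """Return the share of use cases in one category, in percent (1 decimal)."""
    return round(100.0 * count / total, 1)


# Both evaluations should account for every one of the 43 use cases.
for label, counts in (("current", current), ("previous", previous)):
    assert sum(counts.values()) == TOTAL_USE_CASES, label

for category, count in current.items():
    change = count - previous[category]
    print(f"{category}: {count} ({share_percent(count)}%), change {change:+d}")
```

Both sets of figures add up to 43, so no use case is counted twice or left out in either evaluation.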

Lessons Learnt - General
- Having real users on an operating testbed on a fairly large scale is vital – many problems emerged which had not been seen in local testing.
- Problems with configuration are at least as important as bugs - integrating the middleware into a working system takes as long as writing it!
- Grids need different ways of thinking by users and system managers.
- A job must run anywhere it lands.
- Sites are not uniform so jobs should make as few demands as possible.

Job Submission
- Limitations seen in 1.4 are largely gone
  - Efficiency over 90% in stress tests (1600 jobs)
  - Failures are ~ 1% in normal use (after resubmission)
  - Most failures now at globus/site level, not broker
- Can still be sensitive to poor or incorrect information from Information Providers
  - Info providers have improved, configuration generally better
  - No “black hole” sites lately (but still possible)
- Still hard to diagnose errors (“invalid script response”???)
- Advanced features (checkpointing, DAGMAN, interactivity, accounting, …) largely untested, some not integrated
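The effect of resubmission on the failure rate can be illustrated with a simple model. This is a sketch under strong assumptions: attempts are treated as independent, and the 10% per-attempt failure rate is only an illustrative bound taken from "efficiency over 90%". It is not a statement of how the ~ 1% figure was measured, and the function name is hypothetical.

```python
def residual_failure_rate(per_attempt_failure: float, attempts: int) -> float:
    """Probability that a job fails on every one of `attempts` tries.

    Assumes each attempt fails independently with the same probability,
    which is a simplification: a misconfigured site can fail repeatedly.
    """
    if not 0.0 <= per_attempt_failure <= 1.0:
        raise ValueError("per_attempt_failure must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return per_attempt_failure ** attempts


# Illustrative only: 10% failure per attempt, one resubmission (two attempts).
print(f"{residual_failure_rate(0.10, 2):.2%}")
```

Under these assumptions a single resubmission takes a 10% per-attempt failure rate down to about 1%. Correlated failures, such as a "black hole" site attracting the resubmitted job again, would make the real figure worse.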

Information Systems
- R-GMA is a big improvement on MDS
  - Tables, SQL queries, much easier to publish, …
  - Largely a personal view, experiments have mostly not used it yet
- Took a very long time to become stable – during the D8.4 evaluation R-GMA availability was O(75%)
- Latest version installed for the EU review looks much better – total end-to-end efficiency now > 95%, R-GMA is ~100% (but testbed is now lightly loaded)
- NO SECURITY!
  - And no Registry/schema replication
- Need to check published information for accuracy (or at least sanity!)
- GLUE schema is not in EDG/LCG control, and has proved very hard to change

Replica Management
- Now mostly “just works”
  - Command line tools are fairly intuitive
  - Sometimes processes can hang
  - Orphan processes sometimes left behind when job ends
  - Some inconsistencies found when used with POOL
- Interaction with SE schema is still unclear
  - Works, but gives artificial restrictions on NFS access
- Bulk operations, mirroring and client-server architecture lost with GDMP
- Java command-line tools are very slow (tens of seconds)
- Fault tolerance is important: error conditions should leave things in a consistent state, failures should be re-tried where possible
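The fault-tolerance point above (retry where possible, and leave things in a consistent state on error) can be sketched as a generic retry wrapper. This is an illustration only, not EDG code: `retry_with_cleanup` and its parameters are hypothetical names, and the operation and cleanup callables stand in for whatever replica operation and rollback a real tool would use.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_cleanup(
    operation: Callable[[], T],
    cleanup: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Run `operation`, retrying on failure and undoing partial work each time.

    `cleanup` is called after every failed attempt so that no half-finished
    state (for example a copied file with no catalogue entry) is left behind.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # broad on purpose: any failure triggers rollback
            last_error = error
            cleanup()
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

A caller would pass the copy-and-register step as `operation` and the corresponding removal step as `cleanup`. Whether a given failure is actually safe to retry depends on the operation, which this sketch does not attempt to decide.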

Replica Catalogues
- Oracle/MySQL catalogues are much better than LDAP in 1.4
- Tested up to O(100k) entries, no degradation seen
  - But need to cope with millions
  - At 10 seconds per file it would take ~ 4 months to register a million files!
- Queries can be very slow due to inefficient transport of data
  - 30 minutes to return 45k entries
  - Java runs out of memory on bigger queries
- Distributed LRC + RLI not deployed
- NO SECURITY! (Integrated but not deployed)
- Still no consistency checking against SE content
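The timing figures above follow from simple arithmetic, sketched here. The inputs (10 seconds per file, one million files, 45k entries in 30 minutes) are from the slide. The function names are illustrative, and registration is assumed to be strictly sequential.

```python
SECONDS_PER_DAY = 86_400


def registration_days(n_files: int, seconds_per_file: float) -> float:
    """Days needed to register n_files one after another."""
    return n_files * seconds_per_file / SECONDS_PER_DAY


def query_rate(entries: int, minutes: float) -> float:
    """Average number of catalogue entries returned per second."""
    return entries / (minutes * 60)


days = registration_days(1_000_000, 10)  # 10^7 seconds in total
rate = query_rate(45_000, 30)

print(f"{days:.1f} days to register a million files")  # about 115.7 days
print(f"{rate:.0f} entries per second returned")        # 25 entries per second
```

About 116 days is a little under four months, consistent with the "~ 4 months" estimate, and the query figure corresponds to only 25 entries per second.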

Mass Storage
- Always the most problematic area, and still not solved
- LCG2 still using “classic SE”, but only a stop-gap
- SRM should be the solution (?), WP5 SE is the EDG version
- Works, but many rough edges, really still a prototype
  - No disk space management
  - Error reporting is poor, not fault-tolerant
  - Too much logging, not helpful for a system manager
  - Configuration is complex and fragile
  - …
- Also dCache, CASTOR SRM, Enstore SRM …
  - But still not production-quality?
- What is the way forward?

VO Management
- Current LDAP-based system works fairly well, but has many limitations
  - VO servers are a single point of failure
- VOMS looks good, but not yet deployed or fully integrated
  - Or documented!
- Middleware groups seem to have a different security model to VOMS designers
  - E.g. they usually assume one and only one VO
  - VO defines service (Replica Catalogue, SE namespace) and not authorisation
- Experiments will need to gain experience about how a VO should be run

User View of the Testbed
- Site configuration is very complex, there is usually one way to get it right and many ways to be wrong
  - LCFG is a big help in ensuring uniform configuration
  - Middleware should be self-configuring (and self-checking) as far as possible
- Need well-defined certification procedures, checked on an ongoing basis (sites decay with a half-life of ~ a few weeks)
- Services should fail gracefully when they hit resource limits
  - The grid must be robust against failures and misconfiguration.
- Large grids will ~ always be broken, so errors are not exceptional!
- Many HEP experiments require outbound IP connectivity from worker nodes
  - Still no solution, discussion is needed
- Scalability? Still only ~ 20 sites – 1 job/minute!
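The "half-life" remark about site decay can be made concrete with the standard exponential-decay formula. This is a sketch with assumed numbers: the slide only says "a few weeks", so the 3-week half-life below is a placeholder, and the function name is hypothetical.

```python
def fraction_still_working(weeks_elapsed: float, half_life_weeks: float) -> float:
    """Fraction of sites still correctly configured after weeks_elapsed,
    if working sites decay exponentially with the given half-life."""
    if half_life_weeks <= 0:
        raise ValueError("half_life_weeks must be positive")
    if weeks_elapsed < 0:
        raise ValueError("weeks_elapsed must not be negative")
    return 0.5 ** (weeks_elapsed / half_life_weeks)


SITES = 20                   # "~ 20 sites" from the slide
ASSUMED_HALF_LIFE_WEEKS = 3  # assumption: the slide only says "a few weeks"

for weeks in (0, 3, 6):
    working = SITES * fraction_still_working(weeks, ASSUMED_HALF_LIFE_WEEKS)
    print(f"after {weeks} weeks: {working:.0f} of {SITES} sites still working")
```

With these assumed numbers only a quarter of the sites would still be working after six weeks without re-certification, which is the motivation for checking certification on an ongoing basis.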

Gaps Disk space management on worker nodes  Some discussion, nothing appeared Analysis of scheduling algorithms  EstimatedResponseTime is not optimal Pre-replication by the broker Information about networking at the LAN level  Where are the network bottlenecks? Distribution of experiment software (now being tackled in LCG) Enforcement of quotas (whose job is this?) Documentation UCL workshop – 4-5 March 2004 – HEP Assessment of EDG – n° 14
