
Evaluation of the EDG Testbed and Middleware

This document provides an assessment of the European Data Grid (EDG) testbed and middleware, including achievements, objectives, use case analysis, lessons learned, job submission, information systems, and replica management.


Presentation Transcript


  1. HEP Applications: Evaluation of the EDG Testbed and Middleware
     Stephen Burke (EDG HEP Applications – WP8), s.burke@rl.ac.uk
     UCL workshop, 4-5 March 2004

  2. Introduction
  - Updated from the CHEP talk ~1 year ago: some things have changed, some not!
  - Based on the D8.4 report (EDG only here; 2.0/2.1 releases)
  - Achievements of WP8
  - Updated use case analysis mapping HEPCAL to EDG
  - Lessons learnt

  3. Objectives and Achievements
  - Objective: Evaluate the EDG Application Testbed, and integrate it into experiment tests as appropriate.
    - Further successful evaluation of 1.4.n throughout the summer.
    - Evaluation of EDG 2.0 on the EDG Application Testbed since October, and of EDG 2.1 since December.
    - EIPs (Loose Cannons) helped test EDG components on the LCG Certification Testbed prior to the LCG-1 start in September.
  - Objective: Liaise with LCG regarding EDG/LCG integration and the development of the LCG service.
    - Performed stress tests on LCG-1.
  - Objective: Continue work with the experiments on data challenges throughout the year.
    - All six experiments conducted data challenges of different scales throughout 2003, on the EDG Application Testbed or LCG/Grid.it.

  4. Objectives and Achievements (continued)
  - Objective: Continued work in the Architectural Task Force (ATF).
    - Walkthroughs of HEP use cases helped to clarify interfacing problems.
    - Extension of HEPCAL use cases covering key areas in biomedicine and Earth sciences.
  - Objective: Reactivation of the Application Working Group (AWG).
    - Basis of the first proposal for common application work in EGEE.
  - Objective: Work with the LCG Grid Applications Group (GAG) on further refinement of HEP requirements.
    - HEPCAL-2 requirements document for the use of the grid by thousands of individual users.
    - In addition, further refined the original HEPCAL document.
  - Objective: Development of tutorials and documentation for the user community.
    - WP8 has played a substantial role in course design, implementation and delivery.

  5. Use Case Analysis
  - EDG release 2.0 has been evaluated against the HEPCAL use cases. Of the 43 use cases:
    - 13 (was 10) are fully implemented
    - 4 (was 8) are largely satisfied, but with some restrictions or complications
    - 11 (was 8) are partially implemented, but have significant missing features
    - 15 (was 17) are not implemented
  - Missing functionality is mainly in:
    - Virtual data (not considered by EDG)
    - Metadata catalogues and file collections (still need more work)
    - Authorisation, job control and optimisation (partly delivered but not integrated)

  6. Lessons Learnt – General
  - Having real users on an operating testbed at a fairly large scale is vital: many problems emerged which had not been seen in local testing.
  - Problems with configuration are at least as important as bugs; integrating the middleware into a working system takes as long as writing it!
  - Grids need different ways of thinking by users and system managers: a job must run anywhere it lands. Sites are not uniform, so jobs should make as few demands as possible.

  7. Job Submission
  - Limitations seen in 1.4 are largely gone:
    - Efficiency over 90% in stress tests (1600 jobs)
    - Failures are ~1% in normal use (after resubmission)
    - Most failures are now at the Globus/site level, not in the broker
  - Can still be sensitive to poor or incorrect information from information providers:
    - Information providers have improved, and configuration is generally better
    - No "black hole" sites lately (but still possible)
  - Still hard to diagnose errors ("invalid script response"???)
  - Advanced features (checkpointing, DAGMan, interactivity, accounting, ...) are largely untested, and some are not integrated
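The resubmission behaviour above can be sketched as a small simulation. The 90% per-attempt efficiency and ~1% residual failure rate come from the slide; everything else (the function names, the two-attempt limit) is purely illustrative and is not the actual broker logic.

```python
import random

def submit_job(rng):
    """Hypothetical single submission attempt: succeeds ~90% of the
    time, roughly matching the stress-test efficiency quoted above."""
    return rng.random() < 0.90

def submit_with_resubmission(rng, max_attempts=2):
    """Retry a failed job, as the broker's automatic resubmission does.
    Returns the attempt number on success, or None if all attempts fail."""
    for attempt in range(1, max_attempts + 1):
        if submit_job(rng):
            return attempt
    return None

rng = random.Random(42)
results = [submit_with_resubmission(rng) for _ in range(100_000)]
failure_rate = results.count(None) / len(results)
# With a 10% per-attempt failure rate, two attempts leave roughly a
# 1% residual failure rate, consistent with the figure above.
print(f"residual failure rate: {failure_rate:.3%}")
```

The point of the sketch is just the arithmetic: independent ~10% failures compound to ~1% after one resubmission, so most of the remaining failures are the persistent site-level ones the slide mentions.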

  8. Information Systems
  - R-GMA is a big improvement on MDS:
    - Tables, SQL queries, much easier to publish, ...
    - Largely a personal view; the experiments have mostly not used it yet
  - Took a very long time to become stable: during the D8.4 evaluation, R-GMA availability was O(75%)
  - The latest version, installed for the EU review, looks much better: total end-to-end efficiency is now >95%, and R-GMA itself is ~100% (but the testbed is now lightly loaded)
  - NO SECURITY! And no registry/schema replication
  - Need to check published information for accuracy (or at least sanity!)
  - The GLUE schema is not under EDG/LCG control, and has proved very hard to change
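The "sanity check" point can be illustrated with a minimal sketch: rows such as an R-GMA consumer might retrieve from a GLUE-like computing-element table are filtered with basic consistency checks. The field names, host names and values here are hypothetical, not the real GLUE attribute names.

```python
# Hypothetical rows, as a consumer might get them from a GLUE-like CE
# table via an SQL query. Field names are illustrative only.
ce_rows = [
    {"ce_id": "ce01.example.org", "total_cpus": 64, "free_cpus": 12, "waiting_jobs": 3},
    {"ce_id": "ce02.example.org", "total_cpus": 32, "free_cpus": 40, "waiting_jobs": 0},
    {"ce_id": "ce03.example.org", "total_cpus": 0,  "free_cpus": 0,  "waiting_jobs": -1},
]

def is_sane(row):
    """Basic sanity checks on published CE information: positive CPU
    counts, free <= total, and a non-negative queue length."""
    return (
        row["total_cpus"] > 0
        and 0 <= row["free_cpus"] <= row["total_cpus"]
        and row["waiting_jobs"] >= 0
    )

sane = [r["ce_id"] for r in ce_rows if is_sane(r)]
print(sane)  # only ce01 passes: ce02 claims more free CPUs than it has,
             # ce03 publishes a negative queue length
```

Checks like these matter because a broker that trusts insane values (e.g. a site advertising huge free capacity) is exactly how the "black hole" sites mentioned on the previous slide arise.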

  9. Replica Management
  - Now mostly "just works":
    - Command-line tools are fairly intuitive
    - Processes can sometimes hang
    - Orphan processes are sometimes left behind when a job ends
    - Some inconsistencies found when used with POOL
  - Interaction with the SE schema is still unclear: it works, but imposes artificial restrictions on NFS access
  - Bulk operations, mirroring and the client-server architecture were lost with GDMP
  - Java command-line tools are very slow (tens of seconds)
  - Fault tolerance is important: error conditions should leave things in a consistent state, and failures should be retried where possible

  10. Replica Catalogues
  - The Oracle/MySQL catalogues are much better than the LDAP catalogue in 1.4
  - Tested up to O(100k) entries with no degradation seen
    - But they need to cope with millions: at 10 seconds per file it would take ~4 months to register a million files!
  - Queries can be very slow due to inefficient transport of data:
    - 30 minutes to return 45k entries
    - Java runs out of memory on bigger queries
  - The distributed LRC + RLI is not deployed
  - NO SECURITY! (Integrated but not deployed)
  - Still no consistency checking against SE content
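The "~4 months" estimate follows directly from the quoted 10 seconds per registration; a quick check (plain arithmetic, not EDG code):

```python
SECONDS_PER_FILE = 10        # per-registration cost quoted above
FILES = 1_000_000

total_seconds = SECONDS_PER_FILE * FILES
days = total_seconds / 86_400      # seconds per day
months = days / 30                 # using ~30-day months
print(f"{days:.0f} days, i.e. about {months:.1f} months")
# → 116 days, i.e. about 3.9 months
```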

  11. Mass Storage
  - Always the most problematic area, and still not solved
  - LCG-2 is still using the "classic SE", but that is only a stop-gap
  - SRM should be the solution (?); the WP5 SE is the EDG version
    - It works, but has many rough edges and is really still a prototype:
    - No disk space management
    - Error reporting is poor, and it is not fault-tolerant
    - Too much logging, not helpful for a system manager
    - Configuration is complex and fragile
    - ...
  - There are also dCache, the CASTOR SRM, the Enstore SRM, ... but are they production-quality yet?
  - What is the way forward?

  12. VO Management
  - The current LDAP-based system works fairly well, but has many limitations:
    - VO servers are a single point of failure
  - VOMS looks good, but is not yet deployed or fully integrated (or documented!)
  - The middleware groups seem to have a different security model from the VOMS designers:
    - e.g. they usually assume one and only one VO
    - The VO defines services (Replica Catalogue, SE namespace), not authorisation
  - The experiments will need to gain experience of how a VO should be run

  13. User View of the Testbed
  - Site configuration is very complex: there is usually one way to get it right and many ways to get it wrong
    - LCFG is a big help in ensuring uniform configuration
    - Middleware should be self-configuring (and self-checking) as far as possible
  - Need well-defined certification procedures, checked on an ongoing basis (sites decay with a half-life of ~a few weeks)
  - Services should fail gracefully when they hit resource limits
    - The grid must be robust against failures and misconfiguration; large grids will ~always be broken somewhere, so errors are not exceptional!
  - Many HEP experiments require outbound IP connectivity from worker nodes: still no solution, and discussion is needed
  - Scalability? Still only ~20 sites, ~1 job/minute!
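The "half-life" remark implies exponential decay of correctly configured sites, which makes the case for ongoing (not one-off) certification. A minimal sketch, where the three-week half-life and 20-site count are just the rough numbers quoted above, not measurements:

```python
def fraction_still_working(weeks_elapsed, half_life_weeks=3.0):
    """Exponential decay: after each half-life, half of the remaining
    correctly configured sites have drifted into misconfiguration."""
    return 0.5 ** (weeks_elapsed / half_life_weeks)

SITES = 20  # rough testbed size quoted above
for weeks in (0, 3, 6, 12):
    working = SITES * fraction_still_working(weeks)
    print(f"after {weeks:2d} weeks: ~{working:.1f} sites still correctly configured")
```

Under these assumptions, without recertification only a handful of sites would still be healthy after a couple of months, which is why periodic automated checks beat a one-time certification.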

  14. Gaps
  - Disk space management on worker nodes: some discussion, but nothing has appeared
  - Analysis of scheduling algorithms: EstimatedResponseTime is not optimal
  - Pre-replication by the broker
  - Information about networking at the LAN level: where are the network bottlenecks?
  - Distribution of experiment software (now being tackled in LCG)
  - Enforcement of quotas (whose job is this?)
  - Documentation
