This document presents the HEP applications assessment of the European DataGrid (EDG) testbed and middleware, covering achievements against objectives, use case analysis, lessons learnt, job submission, information systems, and replica management.
HEP Applications Evaluation of the EDG Testbed and Middleware
Stephen Burke (EDG HEP Applications – WP8)
s.burke@rl.ac.uk
UCL workshop – 4-5 March 2004 – HEP Assessment of EDG

Introduction
- Updated from the CHEP talk of about a year ago
  - Some things have changed, some not!
- Based on the D8.4 report (EDG only here, 2.0/2.1 releases)
- Achievements of WP8
- Updated use case analysis mapping HEPCAL to EDG
- Lessons learnt

Objectives and Achievements
- Objective: Evaluate the EDG Application Testbed, and integrate into experiment tests as appropriate.
  - Achievement: Further successful evaluation of 1.4.n throughout the summer.
  - Achievement: Evaluation of EDG 2.0 on the EDG Application Testbed since October, and of EDG 2.1 since December.
  - Achievement: EIPs (Loose Cannons) helped testing of EDG components on the LCG Cert TB prior to the LCG-1 start in September.
- Objective: Liaise with LCG regarding EDG/LCG integration and the development of the LCG service.
  - Achievement: Performed stress tests on LCG-1.
- Objective: Continue work with experiments on data challenges throughout the year.
  - Achievement: All 6 experiments have conducted data challenges of different scales throughout 2003 on the EDG App TB or LCG/Grid.it.

Objectives and Achievements (continued)
- Objective: Continued work in the Architectural Task Force (ATF)
  - Achievement: Walkthroughs of HEP use cases helped to clarify interfacing problems.
  - Achievement: Extension of HEPCAL use cases covering key areas in Biomedicine and Earth Sciences.
- Objective: Reactivation of the Application Working Group (AWG)
  - Achievement: Basis of first proposal for common application work in EGEE
- Objective: Work with LCG/GAG (Grid Applications Group) in further refinement of HEP requirements
  - Achievement: HEPCAL-2 requirements document for the use of grid by thousands of individual users. In addition, further refined the original HEPCAL document.
- Objective: Development of tutorials and documentation for the user community
  - Achievement: WP8 has played a substantial role in course design, implementation and delivery

Use Case Analysis
- EDG release 2.0 has been evaluated against the HEPCAL Use Cases
- Of the 43 Use Cases:
  - 13 (was 10) are fully implemented
  - 4 (was 8) are largely satisfied, but with some restrictions or complications
  - 11 (was 8) are partially implemented, but have significant missing features
  - 15 (was 17) are not implemented
- Missing functionality is mainly in:
  - Virtual data (not considered by EDG)
  - Metadata catalogues and file collections (still needs more work)
  - Authorisation, job control and optimisation (partly delivered but not integrated)

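As a quick consistency check on the tallies above, the four categories can be summed for both the current and the earlier evaluation. This is only a sketch: the counts are the ones quoted on the slide, while the short dictionary keys and function names are labels chosen here for illustration.

```python
# Use-case tallies quoted on the slide: (current, previous) counts per category.
# The short category keys are labels chosen for this sketch.
TALLIES = {
    "fully implemented": (13, 10),
    "largely satisfied": (4, 8),
    "partially implemented": (11, 8),
    "not implemented": (15, 17),
}
TOTAL_USE_CASES = 43


def totals(tallies):
    """Return (current_total, previous_total) summed over all categories."""
    current = sum(now for now, _ in tallies.values())
    previous = sum(before for _, before in tallies.values())
    return current, previous


def fraction_fully_implemented(tallies, total):
    """Fraction of all use cases that are fully implemented now."""
    return tallies["fully implemented"][0] / total


if __name__ == "__main__":
    print(totals(TALLIES))
    print(f"{fraction_fully_implemented(TALLIES, TOTAL_USE_CASES):.0%}")
```

Both columns add up to the 43 HEPCAL use cases, so the "was" figures describe the same set of use cases rather than an earlier, smaller list.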
Lessons Learnt - General
- Having real users on an operating testbed on a fairly large scale is vital: many problems emerged which had not been seen in local testing.
- Problems with configuration are at least as important as bugs; integrating the middleware into a working system takes as long as writing it!
- Grids need different ways of thinking by users and system managers.
  - A job must run anywhere it lands.
  - Sites are not uniform, so jobs should make as few demands as possible.

Job Submission
- Limitations seen in 1.4 are largely gone
  - Efficiency over 90% in stress tests (1600 jobs)
  - Failures are ~ 1% in normal use (after resubmission)
  - Most failures now at Globus/site level, not broker
- Can still be sensitive to poor or incorrect information from Information Providers
  - Info providers have improved, configuration generally better
  - No “black hole” sites lately (but still possible)
- Still hard to diagnose errors (“invalid script response”???)
- Advanced features (checkpointing, DAGMAN, interactivity, accounting, …) largely untested, some not integrated

Information Systems
- R-GMA is a big improvement on MDS
  - Tables, SQL queries, much easier to publish, …
  - Largely a personal view, experiments have mostly not used it yet
- Took a very long time to become stable: during the D8.4 evaluation R-GMA availability was O(75%)
- Latest version installed for the EU review looks much better: total end-to-end efficiency now > 95%, R-GMA is ~100% (but testbed is now lightly loaded)
- NO SECURITY! And no Registry/schema replication
- Need to check published information for accuracy (or at least sanity!)
- GLUE schema is not in EDG/LCG control, and has proved very hard to change

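The point about checking published information "at least for sanity" could look something like the sketch below. It is purely illustrative: the record layout and the key names (`total_cpus`, `free_cpus`, `estimated_response_time_s`) are invented here and are not GLUE schema attributes or R-GMA table columns, and numeric values are assumed.

```python
# Hypothetical record layout for one site's published values; these key
# names are invented for this sketch and are not GLUE schema attributes.
FIELDS = ("total_cpus", "free_cpus", "estimated_response_time_s")


def sanity_problems(record):
    """Return a list of human-readable problems found in one published record."""
    problems = []
    for name in FIELDS:
        value = record.get(name)
        if value is None:
            problems.append(f"{name} is missing")
        elif value < 0:
            problems.append(f"{name} is negative")
    total = record.get("total_cpus")
    free = record.get("free_cpus")
    # A site cannot sensibly advertise more free CPUs than it has in total.
    if total is not None and free is not None and free > total:
        problems.append("free_cpus exceeds total_cpus")
    return problems


if __name__ == "__main__":
    suspicious = {"total_cpus": 10, "free_cpus": 12, "estimated_response_time_s": 0}
    print(sanity_problems(suspicious))
```

Checks of this kind catch only internally inconsistent values; they cannot tell whether a plausible-looking number is actually accurate.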
Replica Management
- Now mostly “just works”
  - Command line tools are fairly intuitive
- Sometimes processes can hang
  - Orphan processes sometimes left behind when job ends
- Some inconsistencies found when used with POOL
- Interaction with SE schema is still unclear
  - Works, but gives artificial restrictions on NFS access
- Bulk operations, mirroring and client-server architecture lost with GDMP
- Java command-line tools are very slow (tens of seconds)
- Fault tolerance is important: error conditions should leave things in a consistent state, failures should be re-tried where possible

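The fault-tolerance requirement (leave things in a consistent state, retry where possible) can be sketched generically as follows. This is not EDG replica manager code: `with_retries`, `operation` and `cleanup` are hypothetical names, and the caller is assumed to supply a `cleanup` that undoes any partial work, such as a half-copied file or a dangling catalogue entry.

```python
import time


def with_retries(operation, cleanup, attempts=3, delay_s=0.0):
    """Run operation(); on failure call cleanup() and retry.

    operation and cleanup are caller-supplied callables (hypothetical here).
    cleanup() is expected to undo partial work so that every failed attempt
    leaves the system in a consistent state before the next try.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            last_error = err
            cleanup()
            if attempt < attempts:
                time.sleep(delay_s)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_copy():
        # Stand-in for a transfer that fails twice and then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise IOError("transient failure")
        return "copied"

    print(with_retries(flaky_copy, cleanup=lambda: None))
```

Running cleanup after every failure, including the last one, means that giving up still leaves no half-finished state behind.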
Replica Catalogues
- Oracle/MySQL catalogues are much better than LDAP in 1.4
  - Tested up to O(100k) entries, no degradation seen
- But need to cope with millions
  - At 10 seconds per file it would take ~ 4 months to register a million files!
- Queries can be very slow due to inefficient transport of data
  - 30 minutes to return 45k entries
  - Java runs out of memory on bigger queries
- Distributed LRC + RLI not deployed
- NO SECURITY! (Integrated but not deployed)
- Still no consistency checking against SE content

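The registration and query figures above follow from simple arithmetic, sketched here. The sketch assumes files are registered strictly one after another and uses a nominal 30-day month; both are assumptions of this example, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def registration_months(n_files, seconds_per_file):
    """Months needed to register n_files one at a time (no parallelism assumed)."""
    return n_files * seconds_per_file / SECONDS_PER_DAY / DAYS_PER_MONTH


def query_rate(entries, minutes):
    """Average number of catalogue entries returned per second."""
    return entries / (minutes * 60)


if __name__ == "__main__":
    # One million files at 10 s each is 1e7 s, i.e. roughly 116 days.
    print(f"{registration_months(1_000_000, 10):.1f} months")
    # 45k entries in 30 minutes.
    print(f"{query_rate(45_000, 30):.0f} entries/s")
```

At 10 seconds per file a million files come to just under four months, matching the slide's estimate, and the quoted query corresponds to about 25 entries per second.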
Mass Storage
- Always the most problematic area, and still not solved
- LCG2 still using “classic SE”, but only a stop-gap
- SRM should be the solution (?), WP5 SE is the EDG version
  - Works, but many rough edges, really still a prototype
  - No disk space management
  - Error reporting is poor, not fault-tolerant
  - Too much logging, not helpful for a system manager
  - Configuration is complex and fragile
  - …
- Also dCache, CASTOR SRM, Enstore SRM …
  - But still not production-quality?
- What is the way forward?

VO Management
- Current LDAP-based system works fairly well, but has many limitations
  - VO servers are a single point of failure
- VOMS looks good, but not yet deployed or fully integrated
  - Or documented!
- Middleware groups seem to have a different security model to VOMS designers
  - E.g. they usually assume one and only one VO
  - VO defines service (Replica Catalogue, SE namespace) and not authorisation
- Experiments will need to gain experience about how a VO should be run

User View of the Testbed
- Site configuration is very complex; there is usually one way to get it right and many ways to be wrong
  - LCFG is a big help in ensuring uniform configuration
  - Middleware should be self-configuring (and self-checking) as far as possible
  - Need well-defined certification procedures, checked on an ongoing basis (sites decay with a half-life of ~ a few weeks)
- Services should fail gracefully when they hit resource limits
  - The grid must be robust against failures and misconfiguration. Large grids will ~ always be broken, so errors are not exceptional!
- Many HEP experiments require outbound IP connectivity from worker nodes
  - Still no solution, discussion is needed
- Scalability?
  - Still only ~ 20 sites, 1 job/minute!

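The remark that sites "decay with a half-life" can be turned into a toy model that also illustrates why a large grid is almost always partly broken. Everything beyond the half-life idea is an assumption of this sketch: decay is taken to be exponential, sites are treated as independent, and the 3-week half-life is just one illustrative reading of "a few weeks".

```python
def fraction_still_ok(weeks_elapsed, half_life_weeks):
    """Chance that one site is still correctly configured after weeks_elapsed,
    assuming exponential decay with the given half-life."""
    return 0.5 ** (weeks_elapsed / half_life_weeks)


def chance_all_sites_ok(n_sites, weeks_elapsed, half_life_weeks):
    """Chance that every one of n_sites is still OK, assuming independence."""
    return fraction_still_ok(weeks_elapsed, half_life_weeks) ** n_sites


if __name__ == "__main__":
    # Illustrative numbers only: 20 sites, one week after certification,
    # with an assumed 3-week half-life.
    print(f"{fraction_still_ok(1, 3):.2f}")
    print(f"{chance_all_sites_ok(20, 1, 3):.3f}")
```

Under these assumptions a single site is still fine after a week with about 79% probability, but the chance that all 20 sites are simultaneously fine is only about 1%, which is why certification has to be checked on an ongoing basis rather than once.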
Gaps
- Disk space management on worker nodes
  - Some discussion, nothing appeared
- Analysis of scheduling algorithms
  - EstimatedResponseTime is not optimal
  - Pre-replication by the broker
- Information about networking at the LAN level
  - Where are the network bottlenecks?
- Distribution of experiment software (now being tackled in LCG)
- Enforcement of quotas (whose job is this?)
- Documentation