Status of the European DataGrid Project
Charles Loomis (LAL/CNRS)
LAL, December 12, 2002
European DataGrid (EDG)
• EU-funded, 3-year project (2001-2003)
• Goals:
  • develop grid middleware
  • deploy it onto a working testbed
  • demonstrate grid technology with working applications
• Strong application component, unique among grid projects!
• 6 partners; 21 associates
EDG Goals
Transparent Access
• Allow users transparent access to authorized resources with a single authentication.
• Allow users to delegate authorization to services.
• High-level selection of resources, including datasets.
Virtual Organizations
• Allow groups of people to acquire resources from sites.
• Allow an organization to manage resource use among its members.
Optimization
• Allow optimal use of resources at both site and grid levels.
EDG Architecture
A global batch system built from: User Interface, Resource Broker, Information Systems (MDS, Replica Catalogs), and, at each site, a Computing Element and Storage Element.
• Centralized architecture; heavy infrastructure.
• Job flow: the user submits from the User Interface to the Resource Broker; the broker queries MDS and the Replica Catalogs, chooses the optimal site for the job, and submits it to that site's Computing Element; the user then retrieves the output.
• Sites publish their state to the information system.
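The broker's match-making step can be sketched as a filter-then-rank over published site state. This is a hedged toy model only; the class and function names are invented for illustration and are not the real EDG Resource Broker API.

```python
# Toy model of Resource Broker match-making (illustrative; invented names).
from dataclasses import dataclass

@dataclass
class SiteState:
    """State a site publishes to the information system (MDS)."""
    name: str
    free_cpus: int
    datasets: frozenset

@dataclass
class Job:
    """A job request arriving from the User Interface."""
    owner: str
    needs_dataset: str

def choose_site(job, sites):
    # filter: a candidate site needs free CPUs and a replica of the input data
    candidates = [s for s in sites
                  if s.free_cpus > 0 and job.needs_dataset in s.datasets]
    # rank: prefer the site with the most free CPUs (a stand-in for the
    # real broker's configurable ranking expression)
    return max(candidates, key=lambda s: s.free_cpus, default=None)
```

The two-phase structure (hard requirements, then a ranking expression) is the part that matches the slide; everything else is simplified away.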
Comments: Optimization of Resources
Centralized Architecture
• Resource Broker:
  • must know the state of the grid and schedule effectively;
  • requires knowledge of site policies and user/job details.
• Information System (MDS & Replica Catalogs):
  • must respond quickly to high-volume, high-rate queries.
Central Points of Failure
• Resource Broker (redundancy possible at the VO level).
• MDS (unique hierarchy; some redundancy possible).
With high-rate submissions:
• the RB requires large amounts of memory, CPU, and disk space;
• MDS requires many file descriptors and much CPU.
Authentication & Authorization
• The user requests a certificate from one of ~15 national Certification Authorities (France, INFN, …); the certificate carries a distinguished name such as
  /C=FR/O=CNRS/OU=LAL/CN=Charles Loomis/Email=loomis@lal.in2p3.fr
• The user registers with a Virtual Organization (~10 different VOs: ATLAS, CMS, …).
• A proxy is sent to the site's Computing Element or Storage Element for authentication; the site accepts or rejects the request.
• Sites regularly update Certificate Revocation Lists (CRLs) from the CAs and retrieve VO membership lists.
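The site-side accept/reject decision combines two lists the site caches: the CA's revocation list and the VO's membership list. A minimal sketch, with invented function names and revocation tracked by a plain serial number rather than a real X.509 CRL:

```python
# Toy model of the site-side authorization check (illustrative only).
def parse_dn(dn):
    """Split an X.509-style DN such as /C=FR/O=CNRS/... into its fields."""
    return dict(part.split("=", 1) for part in dn.strip("/").split("/"))

def authorize(dn, serial, vo_members, revoked_serials):
    """Accept only if the certificate is not revoked (per the cached CRL)
    and the DN appears in a cached VO membership list."""
    if serial in revoked_serials:
        return False
    return dn in vo_members
```

Because both lists are cached and updated infrequently, a revocation or de-registration takes effect only at the next update, which is exactly the security compromise noted on the next slide.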
Comments
Infrastructure
• ~15 national CAs running as a production service.
• ~10 Virtual Organizations:
  • High-Energy Physics: ALICE, BaBar, ATLAS, CMS, DZero, LHCb
  • Earth Observation
  • Biomedical Applications
  • Miscellaneous: WP6, ITeam, Guidelines
Limited Central Points of Failure
• VO membership server (affects that VO's members).
• Certification Authority (affects that CA's members).
• Caching and infrequent updates minimize these problems, but at some cost in security.
Deployment & Use
Development Testbed (1.4.0)
• To facilitate testing and integration of new middleware.
• 3 sites (3 countries).
Production Testbed (1.4.0)
• For applications to use and stress the software in a "semi-production" environment.
• 8 sites (5 countries).
Application Use
• CMS event simulation
• ATLAS event simulation
• Regular tutorial use
Stability
• Filled the grid this week!
Globus Experience
GSI Security (OK)
• Some limitations with the size of proxies.
GridFTP (OK)
• Recent protocol change because of a security fix.
Replica Catalog (OK, limited)
• Unannounced, unnecessary schema change.
GateKeeper/JobManager (Poor)
• Race conditions under load, leading to failures.
• High resource use; poor response to errors.
Information System, MDS (Poor)
• Serious stability problems.
• Query times increase dramatically under load.
Globus Experience (cont.)
Interaction
• Generally responsive to identified problems.
• Little advance warning of major changes:
  • schema changes;
  • rewrite of the JobManager/batch-system interface.
Testing
• Essentially non-existent on the Globus side.
• Major delays in EDG because of MDS and the Gatekeeper.
• Finding, testing, and fixing of major problems done outside Globus.
Globus "high-level" services are inappropriate for a production environment.
Condor Experience
CondorG
• Used for reliable job submission from the Resource Broker.
• Responsive to problems; provides quick fixes.
• Few problems encountered in our testing.
Condor
• A supported "batch" system for EDG.
• Largely untested, but we expect to use it with the next major release.
Typical Failure Modes
Operations
• CRL generation (CAs); CRL updates (sites).
• Network accessibility (VO LDAP servers).
• Misconfiguration of services (typically the SE).
Poor Implementation (bugs)
• Most catastrophic ones eliminated.
Resource Exhaustion
• File descriptors, ports, disk space.
Design Limitations
• Central points of failure (RB, MDS).
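The file-descriptor exhaustion mode is easy to reproduce in miniature: a long-lived service that forgets to close files or sockets drifts toward the per-process limit and then starts failing requests. A hedged sketch (the service and helper here are invented, and the `/proc/self/fd` count is Linux-specific):

```python
# Toy illustration of file-descriptor exhaustion in a long-lived service.
import os
import resource
import tempfile

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

leaked = []
def handle_request():
    # BUG (deliberate): the file is never closed, so each "request" leaks one fd
    leaked.append(tempfile.TemporaryFile())

def fd_headroom():
    """Descriptors remaining before the soft limit is hit (Linux: count
    /proc/self/fd; elsewhere fall back to counting our known leaks)."""
    if os.path.exists("/proc/self/fd"):
        used = len(os.listdir("/proc/self/fd"))
    else:
        used = len(leaked)
    return soft - used
```

A defensive service would monitor its own headroom and alarm well before reaching zero; the RB and MDS comments earlier in the talk suggest neither did.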
Future Developments
EDG Plans
• Advanced data management:
  • a real "Storage Element";
  • Replica Location Service (a distributed Replica Catalog);
  • Replica Manager (a higher-level user interface).
• Job management:
  • job splitting, checkpointing;
  • interactive jobs.
• Replace MDS with R-GMA.
• A more robust, consistent security model:
  • local resources better tied to grid credentials.
OGSA (Open Grid Services Architecture)
• New services written as web services.
• Probably no complete conversion within the EDG lifetime.
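The Replica Location Service / Replica Manager split can be pictured as a catalog mapping logical file names (LFNs) to physical replicas (PFNs), plus a higher-level chooser. The sketch below is illustrative only; the LFNs, hostnames, and function names are invented, not the real EDG interfaces.

```python
# Toy replica catalog: logical file name -> list of physical replicas.
replica_catalog = {
    "lfn:cms/events-0042.root": [
        "gsiftp://se.lal.in2p3.fr/data/events-0042.root",
        "gsiftp://se.cern.ch/store/events-0042.root",
    ],
}

def list_replicas(lfn):
    """Replica-Location-Service role: enumerate known physical copies."""
    return replica_catalog.get(lfn, [])

def best_replica(lfn, site_domain):
    """Replica-Manager role: prefer a replica hosted in the same domain
    as the Computing Element where the job will run."""
    replicas = list_replicas(lfn)
    local = [pfn for pfn in replicas if site_domain in pfn]
    return (local or replicas or [None])[0]
```

Distributing the catalog (rather than the single central Replica Catalog criticized earlier) removes one of the central points of failure.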
SlashGrid
Grid File System
• Uses grid credentials for access to local files.
• Frees the grid user from needing a local Unix account.
• Simplifies the mapping of users to accounts.
• Allows true account recycling.
More Uses
• Could hide remote access to data.
• Could provide compatibility with the Globus security model.
• …
Implementation
• A user-space daemon on top of the Coda kernel module.
• A plug-in interface allows easy extension.
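To see the account problem SlashGrid addresses, consider the usual alternative: each grid DN is leased a local Unix pool account, and recycling an account is only safe after all its files are cleaned up. A toy model (all names invented):

```python
# Toy pool-account leasing, the bookkeeping SlashGrid makes unnecessary.
pool = ["grid001", "grid002"]   # free local pool accounts
leases = {}                     # DN -> pool account currently assigned

def map_user(dn):
    """Lease a local account to a grid user, reusing an existing lease."""
    if dn in leases:
        return leases[dn]
    if not pool:
        raise RuntimeError("pool exhausted")
    account = pool.pop(0)
    leases[dn] = account
    return account

def recycle(dn):
    # Safe only if every file owned by the account is removed first.
    # With grid credentials checked on the files themselves (the
    # SlashGrid approach), this cleanup step disappears.
    pool.append(leases.pop(dn))
```

When file access is governed by the grid credential directly, the lease table and its cleanup hazard both go away, which is what "true account recycling" refers to.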
Authentication & Authorization with VOMS
• The user requests a certificate from one of ~15 national CAs (France, INFN, …), as before.
• The user requests a "ticket" from the VOMS server for their Virtual Organization.
• A proxy carrying the VOMS attributes is sent to the site's Computing Element or Storage Element for authentication and authorization; the site accepts or rejects the request.
• Sites still update CRLs, but the authorization decision is now made locally at the site!
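Because the proxy now carries VO attributes (VO, group, role), the site can decide locally by matching them against its own policy, instead of fetching membership lists. A hedged sketch; the policy table and attribute names are invented for illustration:

```python
# Toy local authorization decision using VOMS-style attributes.
site_policy = {
    ("cms", "production"): "allow",   # CMS production role may run here
    ("atlas", "*"): "allow",          # any ATLAS role may run here
}

def local_decision(vo, role):
    """Accept/reject from the site's own policy; no remote lookup needed."""
    if site_policy.get((vo, role)) == "allow":
        return True
    return site_policy.get((vo, "*")) == "allow"  # wildcard-role entry
```

The key contrast with the earlier flow is that the membership information travels inside the (signed) proxy, so the single VO membership server drops out of the critical path.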
Conclusions
Software & Testbed
• A production-quality security infrastructure is in place.
• Production and development testbeds:
  • deployed;
  • starting to see heavy use by end-users;
  • reasonable stability for the first time.
• Failure modes are moving from bugs and operations problems to design and resource limitations.
Unanswered Questions
• Can optimization be achieved? At what level?
• How can resources be limited, reserved, and shared?
• Can efficient scheduling be done with inhomogeneous site policies?