EGEE Middleware Robin Middleton, with much (most) material from Bob Jones, Frederic Hemmer and Erwin Laure GridPP EB-TB Meeting, 13th May 2004 www.eu-egee.org EGEE is a project funded by the European Union under contract IST-2003-508833
Contents • Introduction • EGEE structure, activities, … • Middleware = JRA1 • Organisation • Design Team • Service Oriented Architecture • Initial Components • EGEE Middleware Prototype • Integration, Testing & SCM • JRA1 & External Links
Overview • 70 partners (funded) + many unfunded contributions • ~11 federations • ~€32M EU funds, ~€60M in total • 2 years (initially), started 1st April 2004 • The EGEE Vision • To deliver production-level Grid services, the essential elements of which are manageability, robustness, resilience to failure, and a consistent security model, as well as the scalability needed to rapidly absorb new resources as these become available, while ensuring the long-term viability of the infrastructure. • To carry out a professional Grid middleware re-engineering activity in support of the production services. This will support and continuously upgrade a suite of software tools capable of providing production-level Grid services to a base of users which is anticipated to grow and diversify rapidly. • To ensure an outreach and training effort which can proactively market Grid services to new research communities in academia and industry, capture new e-Science requirements for the middleware and service activities, and provide the necessary education to enable new users to benefit from the Grid infrastructure.
EGEE Implementation [Diagram: timeline from Globus 2-based middleware (VDT, EDG, AliEn; LCG-1, LCG-2) to Web-services-based middleware (EGEE-1, EGEE-2)] • From day 1 (1st April 2004): production grid service based on the LCG infrastructure, running LCG-2 grid middleware (SA) • LCG-2 will be maintained until the new generation has proven itself (fallback solution) • In parallel, develop a "next generation" grid facility (JRA) • Produce a new set of grid services according to evolving standards (Web Services) • Run a development service providing early access for evaluation purposes • Will replace LCG-2 on the production facility in 2005
Orientation [Diagram: mapping of EGEE activities to the equivalent EDG work packages/groups (WP1-12, QAG, Security Group)] • EGEE includes 11 activities • Services • SA1: Grid Operations, Support and Management • SA2: Network Resource Provision • Joint Research • JRA1: Middleware Engineering and Integration • JRA2: Quality Assurance • JRA3: Security • JRA4: Network Services Development • Networking • NA1: Management • NA2: Dissemination and Outreach • NA3: User Training and Education • NA4: Application Identification and Support • NA5: Policy and International Cooperation
Services Activities • SA1: Grid Operations & Support • Objectives: create & operate a production-quality infrastructure • 48 partners, approx 45% of total project budget • Regional structure • Builds on the existing LCG infrastructure to provide an expanded grid facility for many application domains • SA2: Network Resource Provision • Objectives: ensure EGEE access to network services provided by GEANT and the NRENs to link users, resources and operational management • 3 partners, approx 1.5% of total project budget • Most work will be associated with defining service level requirements/specifications/agreements (SLRs/SLSs/SLAs)
Joint Research Activities • JRA1: Middleware Engineering and Integration • Objectives Provide robust, supportable middleware components Integrate grid services to provide a consistent functional basis for the EGEE grid infrastructure Verify the middleware forms a dependable and scalable infrastructure that meets the needs of a large, diverse eScience user community • 5 partners, approx 16% of total project budget • Middleware design team active • Core software team has been working quickly to produce the design of an initial prototype • Taking input from HEP ARDA project as well as final requirements/assessments from EDG project • Initial prototype foreseen at end of April • Not all services implemented, not for general distribution • EDG testbed infrastructure being reused for JRA1 clusters
Joint Research Activities (II) • JRA4: Network Services Development • Objectives: network-oriented joint research to provide end-to-end services • Network reservation, performance monitoring and diagnostics tools • Explore links to how Grid resources are organised/allocated • Investigation of the potential impact of IPv6 on grids • 5 partners, approx 2.5% of total project budget • Tight collaboration with DANTE and the NRENs, especially through the future GN2 project and potential network-oriented FP6 projects • JRA3: Security • Objectives: enable secure European Grid infrastructure operation • Overall security architecture and framework • Policies to be adopted by other EGEE activities (middleware, operations etc.) • 5 partners, approx 3% of total project budget • JRA2: Quality Assurance • Objectives: foster production & delivery of quality Grid software & operations • 2 partners, approx 2% of total project budget • Many procedures and guidelines already defined
Networking Activities • NA4: Application Identification and Support • Objectives: identify and support a broad range of applications from diverse domains, starting with the pilot domains: HEP and biomedical • 20 partners, approx 12.5% of total project budget • ARDA project is the interface with HEP applications • Initial BMI applications identified • Industrial forum set up in a self-financing mode • NA3: User Training and Induction • Objectives: develop a training programme addressing beginners and advanced users; internal EGEE induction courses • 22 partners, approx 4% of total project budget • Plans for initial training courses well advanced • Will be able to offer training in the summer on dedicated infrastructure • NA2: Dissemination and Outreach • Objectives: disseminate the benefits of the EGEE infrastructure to new user communities • 20 partners, approx 5% of total project budget
Who's who [Organigram of project and activity leaders: F. Gagliardi, A. Blatecky, E. Jessen, T. Priol, D. Snelling, B. Jones, F. Hemmer, E. Laure, V. Breton, F. Harris, A. Edlund, ?, I. Bird, C. Vistoli, M. Atkinson, J. Dyer, G. Zaquine, A. Aimar, J-P. Gautier, J. Orellana]
JRA1 Middleware (Re-)Engineering & Integration
Software Clusters • Tools, Testing & Integration clusters (CERN) • Development clusters: • UK • CERN • IT/CZ • Nordic • Clusters have a reasonably sized (distributed) development testbed • Taken over from EDG • Nordic cluster to be finalized • Link with Integration & Tools clusters established • Clusters up and running! • Nordic (security) cluster → JRA3
Design Team • Formed in December 2003 • Current members: • UK: Steve Fisher • IT/CZ: Francesco Prelz • Nordic: David Groep • VDT: Miron Livny • CERN: Predrag Buncic, Peter Kunszt, Frederic Hemmer, Erwin Laure • Started service design based on the component breakdown defined by the LCG ARDA RTAG • Leverage experiences and existing components from AliEn, VDT, and EDG • A working document • Overall design & APIs • https://edms.cern.ch/document/458972 • Basis for the architecture (DJRA1.1) and design (DJRA1.2) documents
Guiding Principles • Lightweight (existing) services • Easily and quickly deployable • Interoperability • Allow for multiple implementations • Resilience and Fault Tolerance • Co-existence with deployed infrastructure • Run as an application • Service oriented approach • Follow WSRF standardization • No mature WSRF implementations exist to date, hence: start with plain WS – WSRF compliance is not an immediate goal • Review situation end 2004
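To make the "plain WS first, WSRF later" decision concrete, here is a minimal sketch (all names hypothetical, not an actual EGEE interface): with plain Web Services the service stays stateless and the client carries an explicit handle between calls, whereas WSRF would model the job as an addressable stateful resource.

```python
# Minimal sketch of the "plain WS" style: no implied resource state between
# operations, so the client keeps and resends an explicit handle. All class
# and method names here are invented for illustration.

class PlainWSJobService:
    """Stateless facade: state lives behind the service, keyed by a handle."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}    # server-side store behind the facade

    def submit(self, description: str) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = "QUEUED"
        return job_id                      # client must keep and resend this

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]          # handle passed explicitly each call


service = PlainWSJobService()
handle = service.submit("run analysis")
print(handle, service.status(handle))      # job-1 QUEUED
```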
High Level Service Decomposition [Diagram: ARDA-style service decomposition with services attributed to the UK, IT/CZ, CERN and Nordic clusters] • Taken from the ARDA blueprint • Some services have no clear attribution to a cluster (according to the TA) • Some services involve collaboration of multiple clusters
Initial Focus • Data management • Storage Element • SRM based; allow POSIX-like access • Workload management • Computing Element • Allow pull and push mode • Information and monitoring • Security • Need to integrate components with quite different security models • Start with a minimalist approach based on VOMS and myProxy
Storage Element [Diagram: SE flavours plotted on axes of QoS vs. portability, 'strategic' at the high-QoS end and 'tactical' at the portable end] • 'Strategic' SE • High QoS: reliable, safe... • Usually has an MSS • Place to keep important data • Needs people to keep it running • Heavyweight • 'Tactical' SE • Volatile, 'lightweight' space • Enables sites to participate in an opportunistic manner • Lower QoS
Storage Element Interfaces • SRM interface • Management and control • SRM (with possible evolution) • POSIX-like File I/O • File access • Open, read, write • Not real POSIX (like rfio) [Diagram: the user reaches the SE through the SRM management interface and a POSIX-like file I/O API; back-ends include rfio, dcap, chirp and aio over dCache, NeST, Castor and plain disk]
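As an illustration of "POSIX-like but not real POSIX", the following is a minimal sketch assuming a hypothetical GFAL/rfio-style wrapper; the function names and the in-memory back-end are invented for this example and are not the actual EGEE API.

```python
# A minimal sketch: access is POSIX-shaped (open/read/close) but operates on
# grid SURLs rather than local paths, as the rfio/dcap/aio back-ends on the
# slide do. The back-end is faked in memory to keep the example runnable.

class GridFile:
    """Handle returned by the wrapper; mimics a subset of a POSIX file."""

    def __init__(self, content: bytes):
        self._content = content
        self._pos = 0

    def read(self, size: int) -> bytes:
        chunk = self._content[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk

    def close(self) -> None:
        self._content = b""


def grid_open(surl: str, mode: str = "r") -> GridFile:
    """Resolve the SURL and contact the SE's I/O daemon (faked here)."""
    if not surl.startswith("srm://"):
        raise ValueError("expected an SRM SURL")
    return GridFile(b"payload stored on the SE")


fh = grid_open("srm://se.example.org/data/file1")
print(fh.read(7))   # b'payload'
fh.close()
```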
Catalogs • File Catalog • Filesystem-like view on logical file names • Replica Catalog • Keep track of replicas of the same file • (Meta Data Catalog) • Attributes of files on the logical level • Boundary between generic middleware and application layer
Files and Catalogs Scenario [Diagram: an LFN resolves through the File Catalog to a GUID; the Replica Catalog maps the GUID to a master SURL plus further replica SURLs; the Metadata Catalog holds metadata attached at the logical level]
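A minimal sketch of this resolution chain, with plain dictionaries standing in for the catalog services; all names, GUIDs and SURLs are invented for illustration.

```python
# The File Catalog maps a logical file name (LFN) to a GUID; the Replica
# Catalog maps the GUID to one or more SURLs (master copy first); the
# Metadata Catalog attaches application-level attributes at the logical
# level. Real EGEE catalogs are services; dictionaries stand in for them.

file_catalog = {                      # LFN -> GUID
    "/grid/atlas/run42/events.root": "guid-1f2e-3d4c",
}
replica_catalog = {                   # GUID -> list of SURLs (master first)
    "guid-1f2e-3d4c": [
        "srm://se1.example.org/data/events.root",   # master SURL
        "srm://se2.example.org/cache/events.root",  # replica
    ],
}
metadata_catalog = {                  # GUID -> application-level attributes
    "guid-1f2e-3d4c": {"run": 42, "type": "events"},
}


def resolve(lfn: str) -> list[str]:
    """LFN -> GUID -> all known SURLs."""
    guid = file_catalog[lfn]
    return replica_catalog[guid]


print(resolve("/grid/atlas/run42/events.root"))
```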
Computing Element [Diagram: the EDG broker and the task queue feed the CE (Globus gatekeeper, AliEn CE, CondorG, change-UID module), which dispatches to the local batch queue; underlying systems are GT2, GT3 and Unicore] • Layered service interfacing • various batch systems (LSF, PBS, Condor) • Grid systems like GT2, GT3 and Unicore • CondorG as queuing system on the CE • Allows the CE to be used in push and pull mode (see the sketch below) • Call-out module to change job ownership (security) • Lightweight service • should be possible to install dynamically, e.g. within an existing Globus gatekeeper
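The push/pull distinction can be sketched as follows (hypothetical classes; only the control flow is taken from the slide): in push mode a broker delivers a job directly to the CE's queue, while in pull mode a JobFetcher on the CE takes work from the central task queue when it has free slots.

```python
# Minimal sketch of the two dispatch modes the CE design allows. The local
# queue stands in for the CondorG queue on the CE; all names are invented.

import queue

task_queue = queue.Queue()            # central workload-management task queue


class ComputingElement:
    def __init__(self, slots: int):
        self.slots = slots
        self.local_queue = []         # stands in for the CondorG queue

    def accept(self, job: str) -> None:
        """Push mode: a broker delivers a job directly to the CE."""
        self.local_queue.append(job)

    def fetch(self) -> None:
        """Pull mode: a JobFetcher takes work while free slots remain."""
        while len(self.local_queue) < self.slots and not task_queue.empty():
            self.local_queue.append(task_queue.get())


ce = ComputingElement(slots=2)
ce.accept("job-pushed-by-broker")     # push
task_queue.put("job-waiting-in-tq")
ce.fetch()                            # pull
print(ce.local_queue)
```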
Information Service • Adopt a common approach to information and monitoring infrastructure. • There may be a need for specialised information services • e.g. accounting, package management, grid information, monitoring, provenance, logging • these may be built on an underlying information service • A range of visualisation tools may be used • Using R-GMA
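Since R-GMA presents information as a relational virtual database (producers insert tuples, consumers pose SQL queries), the pattern can be sketched as below; sqlite3 merely stands in for the distributed R-GMA infrastructure, and the table name and columns are invented for illustration.

```python
# Minimal sketch of the relational publish/query model R-GMA follows:
# producers insert tuples into a named table, consumers query with SQL.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ServiceStatus (site TEXT, service TEXT, up INTEGER)")

# Producer side: publish monitoring tuples.
db.executemany(
    "INSERT INTO ServiceStatus VALUES (?, ?, ?)",
    [("RAL", "CE", 1), ("CERN", "SE", 0)],
)

# Consumer side: an SQL query over the published information.
for row in db.execute("SELECT site, service FROM ServiceStatus WHERE up = 0"):
    print("down:", row)
```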
Authentication/Authorization • Different models and mechanisms • Authentication based on Globus/GSI, AFS, SSH, X509, tokens • Authorization • AliEn: exploits mechanism of RDBMS backend • EDG: gridmap file; VOMS credentials and LCAS/LCMAPS • VDT: gridmap file; CAS, VOMS (client) • Security and protection at a level acceptable to fabric managers and end users needs to be discussed and "blessed" in advance.
A minimalist approach to security • Need to integrate components with quite different security models • Start with a minimalist approach • Based on VOMS (proxy issuing) and myProxy (proxy store) • User stores a proxy in myProxy, from where it can be retrieved by access services and sent to other services • Credential chain needs to be preserved • Allows a service to authenticate the client • Local authorization could be done via LCAS if required • User is mapped to group accounts, or components like LCMAPS are used to assign a local user identity
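A minimal sketch of this credential flow, with hypothetical stand-ins for the VOMS and myProxy services: the user obtains a VOMS proxy, deposits it in myProxy, and an access service later retrieves it on the user's behalf, preserving the credential chain so that downstream services can still authenticate the original client.

```python
# All classes below are illustrative stand-ins, not real VOMS/myProxy APIs.

class VOMSServer:
    def issue_proxy(self, user_dn: str, vo: str) -> dict:
        """Issue a proxy carrying the user's DN and VO membership."""
        return {"dn": user_dn, "vo": vo, "chain": [user_dn]}


class MyProxyStore:
    def __init__(self):
        self._store = {}

    def put(self, user_dn: str, proxy: dict) -> None:
        self._store[user_dn] = proxy

    def get(self, user_dn: str) -> dict:
        return self._store[user_dn]


def access_service_call(store: MyProxyStore, user_dn: str) -> dict:
    """The access service retrieves the stored proxy and extends the chain
    before calling downstream services, so they can still authenticate the
    original client."""
    proxy = store.get(user_dn)
    proxy["chain"].append("access-service")
    return proxy


voms, myproxy = VOMSServer(), MyProxyStore()
user = "/C=UK/O=eScience/CN=Some User"
myproxy.put(user, voms.issue_proxy(user, "atlas"))
print(access_service_call(myproxy, user)["chain"])
```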
Towards a prototype • Focus on the key services discussed; exploit existing components • Initially an ad-hoc installation at CERN and Wisconsin • Aim to have the first instance ready by end of April • Open only to a small user community • Expect frequent changes (also API changes) based on user feedback and integration of further services • Enter a rapid feedback cycle • Continue with the design of the remaining services • Enrich/harden existing services based on early user feedback • Initial prototype components for April '04 (to be extended/changed, e.g. WMS): • Access service: AliEn shell, APIs • Information & Monitoring: R-GMA • CE: AliEn CE, Globus gatekeeper, CondorG • Security: VOMS, myProxy • Workload mgmt: AliEn task queue • SE: SRM (Castor), GridFTP, GFAL, aiod • File Transfer Service: AliEn FTD • File and Replica Catalog: AliEn File Catalog, RLS • This is not a release! It's purely an ad-hoc installation
Planning • Evolution of the prototype • Envisaged status at end of 2004: • Key services need to fulfil all requirements (application, operation, quality, security, …) and form a deployable release • Remaining services available as prototypes • Need to develop a roadmap • Incremental changes to the prototype (where possible) • Early user feedback through ARDA and early deployment on the SA1 pre-production service • Detailed release plan being drawn up • Converge prototype work with integration & testing activities • Need to get rolling now! • First components will start using SCM in May
Integration • A master Software Configuration Management (SCM) plan is being finalized now • It contains basic principles and rules for the various areas of SCM and integration (version control, release management, build systems, bug tracking, etc.) • Compliant with internationally agreed standards (ISO 10007:2003(E), IEEE SCM guidelines series) • Most EGEE stakeholders have already been involved in the process, to make sure everybody is aware of, contributes to and uses the plan • An EGEE JRA1 Developer's Guide will follow shortly, in collaboration with JRA2 (Quality Assurance), based on the SCM plan • It is of paramount importance to deliver the plan and guide as early as possible in the project lifetime
Testing • The 3 initial testing sites are CERN, NIKHEF and RAL • More sites can join the testing activity at a later stage ! • Must fulfil site requirements • Testing activities will be driven by the test plan document • Test plan being developed based on user requirements documents: • Application requirements from NA4: HEPCAL I&II, AWG documents, Bio-informatics requirements documents from EDG • Deployment requirements being discussed with SA1 • ARDA working document for core Grid services • Security: work with JRA3 to design and plan security testing • The test plan is a living document: it will evolve to remain consistent with the evolution of the software • Coordination with NA4 testing and external groups (e.g. Globus) established • Solid steps towards MJRA1.3 (PM5)
Convergence with Integr & Tstg • Development clusters need to get used to SCM • During May, initial components of the prototype need to follow SCM • Proposed components: • R-GMA • VOMS • RLS • GFAL (is this 3rd party?) • SRM (will there be an EGEE implementation or just 3rd party?) • New developments need to follow SCM from the beginning • ISSUE: perl modules seem not to fit well
Convergence with Integr & Tstg II • IT/CZ • Put EDG code under SCM for training purposes and be prepared to move components to EGEE when needed • VOMS in May • UK • Full R-GMA under SCM in May • CERN/DM • RLS in May • GFAL ?
Development Roadmap • Prototype work as the starting point • Priorities need to be adjusted based on user feedback • Incremental, frequent releases • All discussions and decisions take place in the design team • Project-wide body being formed to oversee this activity • PTF (Project Technical Forum) • Boundary conditions: • Architecture document due end of month 3 (June) • Design document due end of month 5 (August)
JRA1/SA1 - Process description • No official delivery of requirements from SA1 to JRA1 is stated in the TA • The definition, discussion and agreement of the requirements has already started, through dedicated meetings • This is an ongoing process: • Not all requirements are defined yet • A set of requirements has been agreed: a basic agreement is needed to start working! But it can be reviewed whenever there is a valid reason
JRA1/SA1 - Requirements • Middleware delivery to SA1 • Release management • Deployment scenarios • Middleware configuration • JRA1 will provide a standard set of configuration files and documentation with examples that SA1 can use to design tools; the format is to be agreed between SA1 and JRA1 • It is the responsibility of SA1 to provide configuration tools to the sites • Enforcement of the procedures • Platforms to support • Primary platform: Red Hat Enterprise 3.0, gcc 3.2.3 and icc8 compilers (both 32- and 64-bit) • Secondary platform: Windows (XP/2003), vc++ 7.1 compiler (both 32- and 64-bit) • Versions for compilers, libraries, third-party software • Programming languages • Packaging and software distribution • Others • Sites must be allowed to organize the network as they wish: internal or external connectivity, NAT, firewall, etc. must all be possible, with no special constraints. WNs must not require outgoing IP connectivity, nor inbound connectivity.
JRA1/JRA3 • A lot of progress has been achieved here • Security Group formed, JRA1 members identified • First meeting scheduled on May 5-6, 2004 • GAP analysis planned by then • VOMS Administration support clarified • Handled by JRA3 • Issue: VOMS effort reporting
JRA1/JRA4 • SCM plan presented and discussed • More discussions on which components of JRA4 will be required in the overall architecture/design need to take place
American Involvement in JRA1 • UWisc • Miron Livny part of the design team • Condor team actively involved in re-engineering resource access • In collaboration with the Italian cluster • ISI • Identification of potential contributions started (e.g. RLS) • Focused discussions being planned • Argonne • Collaboration on testing started • Support for key Globus component enhancements being discussed
JRA1 and other activities • NA4 • HEP: ARDA project started; ensures close relations between HEP and middleware • Bio: activities with similar spirit needed – focused meeting tentatively being planned for May • SA1 • Revision of requirements (platforms) • JRA2 • QAG started • Monthly meeting established • JRA3 • Necessary structures established • Focused Meeting in May • JRA4 • Architectural components required need to be clarified • Other projects • Potential drain of resources for dissemination activities
UK Cluster • R-GMA • Interface to various graphical tools • Monitoring largely driven by application and infrastructure needs • Information system: clarify role in job-submission/data mgmt cycles (e.g. role of GLUE) • Interface to other monitoring systems (e.g. Grid3) • Understand R-GMA role in • Accounting • Job provenance • Logging & bookkeeping • …
IT/CZ Cluster • Resource Access (aka ‘CE’) • Interface to various batch systems • Starting with the integration of CondorG • JobFetcher (implementing ‘pull’ model) • Site policy mgmt, enforcement, and advertisement • WMS • High level optimizer components at TaskQueue • Matchmaking • Job adjustment • VO policy management and enforcement • TaskQueue interactions
IT/CZ Cluster II • Accounting • LCG accounting system (usage records) has to be considered • Role of DGAS needs to be understood • L&B • Assessment of its role in • Accounting • Job provenance • … • Relationship to R-GMA • VOMS • Relationship to JRA3 • Integration into Access Service
CERN/DM Cluster • SE • Posix-like file I/O • GFAL/aio relationship • SRM interface • Will EGEE provide an implementation; will we ship an implementation; will we just make it a requirement? • Space reservation not in v1.1 – migration path to v2.1? • File Catalog • Schema evolution/customization to different user-groups • Server implementation • Metadata catalog interaction
CERN/DM Cluster II • Replica Catalog • Deployment model (wrt File Catalog) • Schema evolution • Distributed catalog • Metadata Catalog • Mostly in application domain • File Transfer Service • Overlaps with WMS/CE • Local and global (VO) policy enforcement • Error handling and recovery; transaction handling and boundaries; load-balancing and fail-over modes • Upgrade resilience • Data subscription service • GDMP functionality • How does it relate to the FTS? • Integration into Access Service