360 likes | 503 Views
The EU DataGrid. The European DataGrid Project Team http://www.eu-datagrid.org/. Erwin.Laure@cern.ch. Tutorial Roadmap. Project Introduction Security Architecture The EDG Testbed Coffee Break Specific Middleware Issues Job Management Data Management
E N D
The EU DataGrid The European DataGrid Project Team http://www.eu-datagrid.org/ Erwin.Laure@cern.ch
Tutorial Roadmap • Project Introduction • Security Architecture • The EDG Testbed Coffee Break • Specific Middleware Issues • Job Management • Data Management • Monitoring & Fabric Management • Application Examples
The EU DataGrid Project Introduction The European DataGrid Project Team http://www.eu-datagrid.org/
Contents • The EDG Project scope • Achievements • EDG structure • Middleware Workpackages: Goals, Achievements • DataGrid in Numbers • Relation to Sister Projects
The Grid Vision • Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resource • From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals -- assuming the absence of… • central location, • central control, • omniscience, • existing trust relationships.
Grids: Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic
Goals • DataGrid is a project funded by European Union whose objective is to exploit and build the next generation computing infrastructure providing intensive computation and analysis of shared large-scale databases. • Enable data intensive sciences by providing world wide Grid test beds to large distributed scientific organizations ( “Virtual Organizations, Vos”) • Start ( Kick off ) : Jan 1, 2001 End : Dec 31, 2003 • Applications/End Users Communities : HEP, Earth Observation, Biology • Specific Project Objectives: • Middleware for fabric & grid management • Large scale testbed • Production quality demonstrations • Collaborate and coordinate with other projects (Globus, Condor, CrossGrid, DataTAG, etc) • Contribute to Open Standards and international bodies ( GGF, Industry&Research forum)
DataGrid Main Partners • CERN – International (Switzerland/France) • CNRS - France • ESA/ESRIN – International (Italy) • INFN - Italy • NIKHEF – The Netherlands • PPARC - UK
Assistant Partners Industrial Partners • Datamat (Italy) • IBM-UK (UK) • CS-SI (France) Research and Academic Institutes • CESNET (Czech Republic) • Commissariat à l'énergie atomique (CEA) – France • Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI) • Consiglio Nazionale delle Ricerche (Italy) • Helsinki Institute of Physics – Finland • Institut de Fisica d'Altes Energies (IFAE) - Spain • Istituto Trentino di Cultura (IRST) – Italy • Konrad-Zuse-Zentrum für Informationstechnik Berlin - Germany • Royal Netherlands Meteorological Institute (KNMI) • Ruprecht-Karls-Universität Heidelberg - Germany • Stichting Academisch Rekencentrum Amsterdam (SARA) – Netherlands • Swedish Research Council - Sweden
Project Schedule • Project started on 1/Jan/2001 • Testbed 0 (early 2001) • International test bed 0 infrastructure deployed • Globus 1 only - no EDG middleware • Testbed 1 ( 2002 ) • First release of EU DataGrid software to defined users within the project: • HEP experiments (WP 8), Earth Observation (WP 9), Biomedical applications (WP 10) • Testbed 2 (End 2002) • Builds on Testbed 1 to extend facilities of DataGrid • Focus on production quality • Testbed 3 (2003) • Advanced functionality; currently being deployed. • Project stops on 31/Dec/2003
DataGrid Work Packages • The EDG collaboration is structured in 12 Work Packages • WP1: Work Load Management System • WP2: Data Management • WP3: Grid Monitoring / Grid Information Systems • WP4: Fabric Management • WP5: Storage Element • WP6: Testbed and demonstrators – Production quality International Infrastructure • WP7: Network Monitoring • WP8: High Energy Physics Applications • WP9: Earth Observation • WP10: Biology • WP11: Dissemination • WP12: Management
Apps Mware Globus DataGrid Architecture Local Application Local Database Local Computing Grid Grid Application Layer Data Management Metadata Management Object to File Mapping Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Book-keeping SQL Database Services Grid Fabric services Fabric Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management
Local Application Local Database Application Developers Grid Application Layer Operating Systems System Managers Scientists Data Management Metadata Management Object to File Mapping CertificateAuthorities Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler FileSystems Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Book-keeping SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management UserAccounts BatchSystems PBS, LSF Storage Elements EDG Interfaces MassStorage Systems HPSS, Castor Computing Elements
WP1: Work Load Management Local Application Local Database Grid Application Layer Data Management Metadata Management Object to File Mapping Job Management • Goals • Maximize use of resources by efficient scheduling of user jobs • Achievements • Definition of architecture for scheduling & res. mgmt. and accounting & reservation • Development of "super scheduling" component using application data and computing elements requirements • Support for MPI jobs • Logical job check pointing • Interactive jobs Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Bookkeeping SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management
EDG middleware architecture: The Workload Management System (WP1) • WP1 is responsible for the Workload Management System (WMS). The WMS is currently composed by the following parts: • User Interface (UI) : access point for the user to the GRID ( using JDL) • Resource Broker (RB) : the broker of GRID resources, matchmaking • Job Submission System (JSS) : Condor-G; interfacing batch systems • Information Index (II) : an LDAP server used as a filter to select resources • Logging and Bookkeeping services (LB) : MySQL databases to store Job Info
WP1: Work Load Management Local Application Local Database Grid Application Layer Job Managem. Data Managem. Metadata Managem. Object to File Mapping Components Job Description Language Resource Broker Job Submission Service Information Index User Interface Logging & Bookkeeping Service Collective Services Grid Scheduler ReplicaManager Info & Monitor Underlying Grid Services SQL Database Services Authorization Authentication Accounting Computing Element Services Storage Element Services Replica Catalog Logging & Book-keeping Fabric services Monitoring Fault Tolerance Node Installation Management Fabric Storage Management Resource Managem. Config Management • Implementation: • UI : python (LB client : C++) • RB : C++ • JSS : C++, python • II : LDAP server • LB: MySQL, C++ • Input/Output Sandboxes: GridFTP • WMS main interfaces: • Globus Gatekeeper • WP2 Replica Catalog APIs • WP3 Information Systems • WP7 network monitoring info providers • End User (using JDL files, on the UI)
WP2: Data Management Local Application Local Database • Goals • Coherently manage and share petabyte-scale information volumes in high-throughput production-quality grid environments • Achievements • Survey of existing tools and technologies for data access and mass storage systems • Definition of architecture for data management • Deployment of Grid Data Mirroring Package (GDMP) in Testbed 1 • Deployment of EDG Replica Manager in Testbed 2 • Close collaboration with Globus, PPDG/GriPhyN & Condor • Common design of RLS • Working with GGF on standards Grid Application Layer Data Management Metadata Management Object to File Mapping Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Bookkeeping SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management
EDG middleware architecture: WP2 (Data Management ) • WP2 is responsible for Data Management, which includes file and replica management, metadata access and data security. WP2 components: • Replica Manager: the main manager for triggering replica execution all over the GRID, including replica optimization and interfacing the replica catalog service • Replica Catalog: a GRID service used to resolve Logical File Names into a set of corresponding Physical File Names – Globus Replica Catalog and Replica Location Service (RLS) • GDMP: the GRID Data Mirroring Package, used to create replicas of any filetype all over the GRID Storage Elements in a synchronized way, by automatic updating the replica catalog • Spitfire: provides a Grid enabled middleware service for access to relational databases : it consists of the Spitfire Server module and the Spitfire Client libraries and command line executables.
WP2: Data Management Local Application Local Database Grid Application Layer Metadata Managem. Object to File Mapping Job Managem. Data Managem. Deployed Components GridFTP Replica Manager - edg-replica-manager and Reptor Replica Catalog - globus-replica-catalog GDMP Spitfire Collective Services Grid Scheduler Replica Manager Info & Monitor Underlying Grid Services SQL Database Services Computing Element Services Replica Catalog Storage Element Services Authorization Authentication Accounting Logging & Book-keeping Fabric services Monitoring Fault Tolerance Node Installation Management Fabric Storage Management Resource Managem. Config Management • Implementation: • RM: C++ • Reptor: Java based Web Services • RC : Globus Replica Catalog wrapper • GDMP : C++ • Spitfire : Java, Web Services • WP2 main interfaces: • The GRID Storage Element • WP1 Resource Broker APIs • WP3 GRID Info services • WP7 network monitoring info providers • End User (using GDMP)
WP3: Grid Monitoring Services Local Application Local Database • Goals • Provide information system for discovering resources and monitoring status • Achievements • Survey of current technologies • Coordination of schemas in testbed 1 • Development of Ftree caching backend based on OpenLDAP (Light Weight Directory Access Protocol) to address shortcoming in MDS v1 • Relational Grid Monitoring Architecture (R-GMA) • GRM and PROVE adapted to grid environments to support end-user application monitoring Grid Application Layer Data Management Metadata Management Object to File Mapping Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorizat ion Authentication and Accounting Logging & Book-keeping SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management
WP3 : GRID Monitoring and Info Providers • WP3’s task is to provide information aboutThe Grid itself This includes information about resources (ComputingElements, StorageElements and the Network), for which the Globus MDS is a common solution; and job status information (as implemented by WP1's Logging and Bookkeeping). Grid applications This is information published by user jobs. This is used for performance monitoring. • R-GMA • relational implementation of the GGF GMA • interoperable with MDS
WP3: GRID Monitoring Local Application Local Database Grid Application Layer Job Managem. Data Managem. Metadata Managem. Object to File Mapping Components MDS / FTree R-GMA GRM/Prove Collective Services ReplicaManager Info & Monitor Grid Scheduler Underlying Grid Services SQL Database Services Authorization Authentication Accounting Computing Element Services Storage Element Services Replica Catalog Logging & Book-keeping Fabric services • Implementation: • MDS : LDAP, Globus GRIS, GIIS • FTree : OpenLDAP, caching • R-GMA : Java , C++, MySQL, TomCat • GRM / PROVE : P-GRADE Monitoring Fault Tolerance Node Installation Management Fabric Storage Management Resource Managem. Config Management • WP3 main interfaces: • WP1 Resource Broker ( InfoIndex) • WP2 RM optimizer • all GRID services producing info (SE,CE..) • WP7 network monitoring
WP4: Fabric Management Local Application Local Database Grid Application Layer Data Management Metadata Management Object to File Mapping • Goals • manage clusters (~thousands) of nodes • Achievements • Survey of existing tools, techniques and protocols • Defined an agreed architecture for fabric management • Initial implementations deployed at several sites in testbed 1 & 2 Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Book-keeping SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management
EDG middleware architecture: WP4 : Fabric Management Components • WP4 is responsible to deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes. The computing fabric is called the Computing Element in EDG. • User Job Control and Management(Grid and local jobs) on fabric batch and/or interactive CPU services • Gridification – Grid interface to fabric resources • Resource Management – manage underlying batch services • Automated System Administration for Computing Fabric Elements. These subsystems are reserved for system administrators and operators for performing system maintenance • Configuration Management • Installation Management • Fabric Monitoring
WP4: Fabric Management Local Application Local Database Grid Application Layer Job Managem. Data Managem. Metadata Managem. Object to File Mapping Components LCFG Fabric Monitoring PBS & LSF info providers Image installation Config. Cache Mgr Collective Services ReplicaManager Info & Monitor Grid Scheduler Underlying Grid Services SQL Database Services Authorization Authentication Accounting Computing Element Services Storage Element Services Replica Catalog Logging & Book-keeping Fabric services Node Installation Management Fabric Storage Management Monitoring Fault Tolerance Resource Managem. Config Management • Implementation: • LCFG : C++, XML, HTTP • WP4 main interfaces: • WP1 Resource Broker ( InfoIndex) • WP2 Data management • WP5 Storage Element • WP3 GRID Info Services
Local Application Local Database Grid Application Layer Data Management Metadata Management Object to File Mapping Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Bookkeeping SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management WP5: Mass Storage Management • Goals • Provide common user and data export/import interfaces to existing local mass storage systems • Achievements • Review of Grid data systems, tape and disk storage systems and local file systems • Definition of Architecture and Design for DataGrid Storage Element • Collaboration with Globus on GridFTP/RFIO • Collaboration with PPDG on control API • First attempt at exchanging Hierarchical Storage Manager (HSM) tapes • SRM compliant interface to MSS
WP5 : Mass Storage Management • WP5 delivers the Grid interface to Storage. • Its service, the Storage Element (SE) is interfacing to underlying Mass Storage Systems or simple storage services. • Main interfaces: • Data, gridftp will be used to transfer files over the WAN and the files will optionally be available to local nodes by NFS. • Information, Existing MDS information providers will be extended to provide the extra information in the GLUE storage schema. • Control, functions such as reservation, pinning, deletion, and transfer time estimation. Will provide an SRM 2 interface.
Local Application Local Database WP5: Mass Storage Management Grid Application Layer Job Managem. Data Managem. Metadata Managem. Object to File Mapping • Achievements • Definition of Architecture and Design for DataGrid storage Element • Collaboration with Globus on GridFTP/RFIO • Collaboration with PPDG on control API • Staging from/to CASTOR at CERN succesfully implemented and tested • Succesfully Interfaced to GDMP • Supported Storage Systems: • UNIX disk systems • HPSS (High Performance Storage System) • CASTOR (through RFIO) • GridFTP servers • DMF • Enstore Collective Services ReplicaManager Info & Monitor Grid Scheduler Underlying Grid Services SQL Database Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication Accounting Logging & Book-keeping Fabric services Node Installation Management Fabric Storage Management Monitoring Fault Tolerance Resource Managem. Config Management • WP5 (SE) main interfaces: • WP1 Resource Broker & JSS • WP2 RM, RC • WP7 for GRIDftp monitoring • WP3 GRID Info Services
WP6 additions to Globus Globus EDG release WP6: TestBed Integration Local Application Local Database Grid Application Layer Data Management Metadata Management Object to File Mapping • Goals • Deploy testbeds for the end-to-end application experiments & demos • Integrate successive releases of the software components • Achievements • Integration of EDG sw and deployment • Working implementation of multiple Virtual Organizations (VOs) s & basic security infrastructure • Definition of acceptable usage contracts and creation of Certification Authorities group • Definition of test plan • User’s, administrator’s, and developer’s guides Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Bookkeeping SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management Components Globus packaging & EDG config Build tools End-user documents
Tasks for the WP6 integration team • Testing and integration of the Globus package • Exact definition of RPM lists (components) for the various testbed machine profiles (CE service , RB, UI, SE service , NE, WN, ) – check dependencies • Perform preliminary centrally (CERN) managed tests on EDG m/w before green light for spread EDG testbed sites deployment • Provide, update end user documentation for installers/site managers, developers and end users • Define EDG release policies, coordinate the integration team staff with the various WorkPackage managers – keep high inter-coordination. • Assign the reported bugs to the corresponding developers/site managers (BugZilla) • Complete support for the iTeam testing VO
WP6: TestBed Integration and demonstrators Local Application Local Database • WP6 goals: the EDG testbed • Integration of EDG sw releases and deployment all over the EDG testbed : the integration team • Working implementation of multiple VOs & basic security infrastructure • Definition of acceptable usage contracts and creation of Certification Authorities group • Set up of the Authorization Working Group to manage authorization policies on the testbed • 2 Testbeds: • Dev. TB for integration • Application TB for application usage • Certification TB planned Grid Application Layer Job Managem. Data Managem. Metadata Managem. Object to File Mapping Collective Services ReplicaManager Info & Monitor Grid Scheduler Underlying Grid Services SQL Database Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication Accounting Logging & Book-keeping Fabric services Node Installation Management Fabric Storage Management Monitoring Fault Tolerance Resource Managem. Config Management Components Support for test-VO, mkgridmap tools Globus packaging & EDG config Build tools, CVS central s/w repository End-user documents
Local Application Local Database Grid Application Layer Data Management Metadata Management Object to File Mapping Job Management Collective Services Information & Monitoring Replica Manager Grid Scheduler Underlying Grid Services Computing Element Services Storage Element Services Replica Catalog Authorization Authentication and Accounting Logging & Bookkeepgin SQL Database Services Fabric services Monitoring and Fault Tolerance Node Installation & Management Fabric Storage Management Resource Management Configuration Management WP7: Network Services • Goals • Review the network service requirements for DataGrid • Establish and manage the DataGrid network facilities • Monitor the traffic and performance of the network • Deal with the distributed security aspects • Achievements • Analysis of network requirements for testbed 1 & study of available network physical infrastructure • Use of European backbone GEANT since Dec. 2001 • Initial network monitoring architecture defined and first tools deployed • Collaboration with Dante & DataTAG • Working with GGF (Grid High Performance Networks) & Globus (monitoring/MDS) • Network cost estimation for workload and data management Components network monitoring tools: PingER Udpmon Iperf
Applications (WP8-10) High Energy Physics Earth Observation Science Applications Biomedical Applications
DataGrid in Numbers People >350 registered users 12 Virtual Organisations 16 Certificate Authorities >200 people trained 278 man-years of effort 100 years funded Testbeds >15 regular sites >10’000s jobs submitted >1000 CPUs >5 TeraBytes disk 3 Mass Storage Systems Software 50 use cases 18 software releases >300K lines of code Scientific applications 5 Earth Obs institutes 9 bio-informatics apps 6 HEP experiments
GriPhyN PPDG iVDGL Related Grid Projects Through links with sister projects, there is the potential for a truely global scientific applications grid Demonstrated at IST2002 and SC2002 in November