European DataGrid Project status and plans Peter Kunszt, CERN DataGrid, WP2 Manager Peter.Kunszt@cern.ch
Outline EU DataGrid Project • EDG overview • Project Organisation • Objectives • Current Status overall and by WP • Plans for next releases and testbed 2 • Conclusions
The Grid vision • Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources • From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, assuming the absence of… • central location, • central control, • omniscience, • existing trust relationships.
Grids: Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic
EU DataGrid Project Objectives • DataGrid is a project funded by the European Union whose objective is to exploit and build the next-generation computing infrastructure, providing intensive computation and analysis of shared large-scale databases. • Enable data-intensive sciences by providing worldwide Grid testbeds to large distributed scientific organisations (“Virtual Organisations”, VOs) • Start (kick-off): Jan 1, 2001; End: Dec 31, 2003 • Applications/end-user communities: HEP, Earth Observation, Biology • Specific project objectives: • Middleware for fabric & grid management • Large-scale testbed • Production-quality demonstrations • To collaborate with and complement other European and US projects • Contribute to open standards and international bodies (GGF, Industry & Research Forum)
DataGrid Main Partners • CERN – International (Switzerland/France) • CNRS - France • ESA/ESRIN – International (Italy) • INFN - Italy • NIKHEF – The Netherlands • PPARC - UK
Assistant Partners • Industrial Partners • Datamat (Italy) • IBM-UK (UK) • CS-SI (France) • Research and Academic Institutes • CESNET (Czech Republic) • Commissariat à l'énergie atomique (CEA) – France • Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI) • Consiglio Nazionale delle Ricerche (Italy) • Helsinki Institute of Physics – Finland • Institut de Fisica d'Altes Energies (IFAE) - Spain • Istituto Trentino di Cultura (IRST) – Italy • Konrad-Zuse-Zentrum für Informationstechnik Berlin - Germany • Royal Netherlands Meteorological Institute (KNMI) • Ruprecht-Karls-Universität Heidelberg - Germany • Stichting Academisch Rekencentrum Amsterdam (SARA) – Netherlands • Swedish Research Council - Sweden
Project Schedule • Project started on 1/Jan/2001 • TestBed 0 (early 2001): international testbed 0 infrastructure deployed, Globus 1 only, no EDG middleware • TestBed 1 (now): first release of EU DataGrid software to defined users within the project: HEP experiments (WP8), Earth Observation (WP9), biomedical applications (WP10) • Successful project review by the EU: March 1st 2002 • TestBed 2 (October 2002): builds on TestBed 1 to extend the facilities of DataGrid • TestBed 3 (March 2003) & 4 (September 2003) • Project stops on 31/Dec/2003
EDG Highlights • The project is up and running! • All 21 partners are now contributing at contractual level • total of ~60 man years for first year • All EU deliverables (40, >2000 pages) submitted • in time for the review according to the contract technical annex • First test bed delivered with real production demos • All deliverables (code & documents) available via www.edg.org • http://eu-datagrid.web.cern.ch/eu-datagrid/Deliverables/default.htm • requirements, surveys, architecture, design, procedures, testbed analysis etc.
Working Areas • The DataGrid project is divided into 12 Work Packages distributed across four Working Areas: Applications, Middleware, Testbed Infrastructure and Management
Work Packages • WP1: Work Load Management System • WP2: Data Management • WP3: Grid Monitoring / Grid Information Systems • WP4: Fabric Management • WP5: Storage Element • WP6: Testbed and demonstrators • WP7: Network Monitoring • WP8: High Energy Physics Applications • WP9: Earth Observation • WP10: Biology • WP11: Dissemination • WP12: Management
Objectives for the first year of the project • Collect requirements for middleware: take into account requirements from application groups • Survey current technology for all middleware • Core services testbed, Testbed 0: Globus (no EDG middleware) • First Grid testbed release, Testbed 1: first release of EDG middleware • WP1 (workload): job resource specification & scheduling • WP2 (data management): data access, migration & replication • WP3 (grid monitoring services): monitoring infrastructure, directories & presentation tools • WP4 (fabric management): framework for fabric configuration management & automatic software installation • WP5 (mass storage management): common interface for mass storage systems • WP7 (network services): network services and monitoring
DataGrid Architecture • Local Computing: Local Application, Local Database • Grid Application Layer (Apps): Data Management, Metadata Management, Object to File Mapping, Job Management • Collective Services (Mware): Information & Monitoring, Replica Manager, Grid Scheduler • Underlying Grid Services (Globus): Computing Element Services, Storage Element Services, Replica Catalog, Authorization Authentication and Accounting, Service Index, SQL Database Services • Fabric services (Fabric): Fabric Monitoring and Fault Tolerance, Node Installation & Management, Fabric Storage Management, Resource Management, Configuration Management
EDG Interfaces • The DataGrid architecture layers and the external parties and systems they interface with: Scientists, Application Developers, System Managers, Certificate Authorities, Operating Systems, File Systems, User Accounts, Batch Systems (PBS, LSF), Mass Storage Systems (HPSS, Castor), Computing Elements, Storage Elements
WP1: Work Load Management • Goals • Maximise use of resources by efficient scheduling of user jobs • Achievements • Analysis of work-load management system requirements & survey of existing mature implementations, Globus & Condor (D1.1) • Definition of architecture for scheduling & resource management (D1.2) • Development of a "super scheduling" component using application data and computing element requirements • Issues • Integration with software from other WPs • Advanced job submission facilities • Components: Job Description Language, Resource Broker, Job Submission Service, Information Index, User Interface, Logging & Bookkeeping Service
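The Resource Broker's task is to match a job's requirements against the computing elements (CEs) known to the information service and pick the best one, taking the location of the application's data into account. The sketch below illustrates that filter-then-rank idea in the spirit of Condor-style matchmaking. It is not the EDG Resource Broker or its JDL; the CE attributes, host names and ranking rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class ComputingElement:
    # Hypothetical attributes a CE might publish via the information service.
    name: str
    os: str
    free_cpus: int
    close_se: str  # storage element considered "close" to this CE

def match(requirements: Callable[[ComputingElement], bool],
          rank: Callable[[ComputingElement], Tuple],
          ces: List[ComputingElement]) -> Optional[ComputingElement]:
    """Filter CEs by the job's requirements, then return the best-ranked one (or None)."""
    candidates = [ce for ce in ces if requirements(ce)]
    if not candidates:
        return None
    # max() returns the first of several equally ranked CEs, so ties follow list order.
    return max(candidates, key=rank)

ces = [
    ComputingElement("ce-a.example.org", "RH 6.2", free_cpus=4, close_se="se-a.example.org"),
    ComputingElement("ce-b.example.org", "RH 6.2", free_cpus=12, close_se="se-b.example.org"),
    ComputingElement("ce-c.example.org", "RH 7.2", free_cpus=30, close_se="se-c.example.org"),
]

# Hypothetical job: needs RH 6.2 and >= 2 free CPUs; its input data has a replica on se-a.
sites_with_input = {"se-a.example.org"}
chosen = match(
    requirements=lambda ce: ce.os == "RH 6.2" and ce.free_cpus >= 2,
    # Rank data locality first, then free CPUs (tuples compare element by element).
    rank=lambda ce: (ce.close_se in sites_with_input, ce.free_cpus),
    ces=ces,
)
print(chosen.name)  # ce-a.example.org
```

Ranking on a tuple makes the priority explicit: a CE close to the data wins even though another eligible CE has more free CPUs.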
WP2: Data Management • Goals • Coherently manage and share petabyte-scale information volumes in high-throughput, production-quality grid environments • Achievements • Survey of existing tools and technologies for data access and mass storage systems (D2.1) • Definition of architecture for data management (D2.2) • Deployment of Grid Data Mirroring Package (GDMP) in testbed 1 • Close collaboration with Globus, PPDG/GriPhyN & Condor • Working with GGF on standards • Issues • Security: clear methods for handling authentication and authorization • Data replication: how to maintain consistent, up-to-date catalogues of application data and its replicas • Components: GDMP, Replica Catalog, SpitFire
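At its core, a replica catalogue maps a logical file name (LFN) to the physical file names (PFNs) of its copies on different storage elements, so a job can ask for a file by name and be handed a nearby copy. A minimal sketch of that mapping, assuming PFNs are URLs whose host identifies the storage element; the class, its methods and the host names are hypothetical and do not reproduce the Replica Catalog or GDMP interfaces. It deliberately ignores the consistency issue listed above, which is the hard part in a distributed setting.

```python
from typing import Dict, List, Optional, Set
from urllib.parse import urlparse

class ReplicaCatalog:
    """Toy replica catalogue mapping one logical file name (LFN) to many physical
    file names (PFNs). Illustrative only; this is not the EDG/Globus Replica Catalog API."""

    def __init__(self) -> None:
        self._replicas: Dict[str, Set[str]] = {}

    def add_replica(self, lfn: str, pfn: str) -> None:
        self._replicas.setdefault(lfn, set()).add(pfn)

    def remove_replica(self, lfn: str, pfn: str) -> None:
        pfns = self._replicas.get(lfn)
        if pfns is None:
            return
        pfns.discard(pfn)
        if not pfns:  # forget the LFN once its last replica is gone
            del self._replicas[lfn]

    def list_replicas(self, lfn: str) -> List[str]:
        return sorted(self._replicas.get(lfn, ()))

    def best_replica(self, lfn: str, local_host: str) -> Optional[str]:
        """Prefer a replica on local_host; otherwise return the first PFN in sorted order."""
        pfns = self.list_replicas(lfn)
        for pfn in pfns:
            if urlparse(pfn).hostname == local_host:
                return pfn
        return pfns[0] if pfns else None

rc = ReplicaCatalog()
rc.add_replica("run001.raw", "gsiftp://se.site-a.example/data/run001.raw")
rc.add_replica("run001.raw", "gsiftp://se.site-b.example/data/run001.raw")
print(rc.best_replica("run001.raw", "se.site-b.example"))
# gsiftp://se.site-b.example/data/run001.raw
```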
WP3: Grid Monitoring Services • Goals • Provide an information system for discovering resources and monitoring their status • Achievements • Survey of current technologies (D3.1) • Coordination of schemas in testbed 1 • Development of Ftree, a caching backend based on OpenLDAP (Lightweight Directory Access Protocol), to address shortcomings in MDS v1 • Design of the Relational Grid Monitoring Architecture (R-GMA) (D3.2), to be further developed with GGF • GRM and PROVE adapted to grid environments to support end-user application monitoring • Components: MDS/Ftree, R-GMA, GRM/PROVE
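R-GMA presents monitoring information through a relational model: producers publish rows into what looks like a table, and consumers retrieve them with ordinary SQL queries. The sketch below only imitates that producer/consumer split with an in-memory SQLite database from Python's standard library. It is not the R-GMA API, and the table, columns and host names are hypothetical.

```python
import sqlite3
from typing import List, Tuple

# One in-memory SQLite table stands in for a virtual "monitoring table".
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu_load (site TEXT, host TEXT, load_avg REAL, ts INTEGER)")

def publish(site: str, host: str, load_avg: float, ts: int) -> None:
    """Producer side: a sensor on a host inserts one measurement."""
    conn.execute("INSERT INTO cpu_load VALUES (?, ?, ?, ?)", (site, host, load_avg, ts))

def busy_hosts(threshold: float) -> List[Tuple[str, str, float]]:
    """Consumer side: an ordinary SQL query over everything producers have published."""
    return conn.execute(
        "SELECT site, host, load_avg FROM cpu_load WHERE load_avg > ? ORDER BY load_avg DESC",
        (threshold,),
    ).fetchall()

def mean_load_per_site() -> List[Tuple[str, float]]:
    """Consumer side: aggregate across all hosts of each site."""
    return conn.execute(
        "SELECT site, AVG(load_avg) FROM cpu_load GROUP BY site ORDER BY site"
    ).fetchall()

publish("CERN", "node01", 0.5, 1000)
publish("CERN", "node02", 2.5, 1000)
publish("CNAF", "node03", 1.5, 1000)
print(busy_hosts(1.0))       # [('CERN', 'node02', 2.5), ('CNAF', 'node03', 1.5)]
print(mean_load_per_site())  # [('CERN', 1.5), ('CNAF', 1.5)]
```

The appeal of the relational view is that consumers need no knowledge of which producer supplied a row; they only state the query.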
WP4: Fabric Management • Goals • Manage clusters of (~thousands of) nodes • Achievements • Survey of existing tools, techniques and protocols (D4.1) • Definition of an agreed architecture for fabric management (D4.2) • Initial implementations deployed at several sites in testbed 1 • Issues • How to install the reference platform and EDG software on large numbers of hosts with minimal human intervention per node • How to ensure node configurations are consistent and how to handle updates to the software suites • Components: LCFG, PBS & LSF info providers, Image installation, Config. Cache Mgr
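One way to read the consistency issue above is as drift detection: each node has a declared target profile (packages, versions, settings), and the fabric tools compute what must change to make the real node match. A minimal sketch under that assumption; it is not LCFG, and the package names, versions and settings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeProfile:
    packages: Dict[str, str]                                # package name -> version
    settings: Dict[str, str] = field(default_factory=dict)  # config key -> value

def drift(desired: NodeProfile, actual: NodeProfile) -> Dict[str, List[str]]:
    """List what must change on a node so that it matches its declared profile."""
    return {
        "install": sorted(p for p in desired.packages if p not in actual.packages),
        "remove": sorted(p for p in actual.packages if p not in desired.packages),
        "change_version": sorted(
            p for p, v in desired.packages.items()
            if p in actual.packages and actual.packages[p] != v
        ),
        "reconfigure": sorted(
            k for k, v in desired.settings.items() if actual.settings.get(k) != v
        ),
    }

# Hypothetical profile for a worker node and what was found on one host.
desired = NodeProfile({"globus": "2.0", "gdmp": "3.0", "openssh": "3.1"}, {"batch": "pbs"})
actual = NodeProfile({"globus": "2.0", "gdmp": "2.2", "telnet-server": "0.17"}, {"batch": "lsf"})
print(drift(desired, actual))
# {'install': ['openssh'], 'remove': ['telnet-server'], 'change_version': ['gdmp'], 'reconfigure': ['batch']}
```

An all-empty result means the node is consistent with its profile, which is the property that scales to thousands of nodes: the same comparison runs unattended on each one.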
WP5: Mass Storage Management • Goals • Provide common user and data export/import interfaces to existing local mass storage systems • Achievements • Review of Grid data systems, tape and disk storage systems and local file systems (D5.1) • Definition of architecture and design for the DataGrid Storage Element (D5.2) • Collaboration with Globus on GridFTP/RFIO • Collaboration with PPDG on control API • First attempt at exchanging Hierarchical Storage Manager (HSM) tapes • Issues • Scope and requirements for storage element • Inter-working with other Grids • Components: Storage Element info. providers, RFIO, MSS staging
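MSS staging generally means that files live on tape and are copied (staged) to a disk pool before they can be read. The sketch below assumes a fixed-size disk pool with least-recently-used eviction, which is one common policy and our choice here, not necessarily what the EDG Storage Element does; file names and sizes are hypothetical.

```python
from collections import OrderedDict
from typing import Dict

class StagingDiskPool:
    """Toy disk pool in front of a tape-based mass storage system (MSS).
    Files are staged from tape on first access; least recently used files are evicted."""

    def __init__(self, capacity_bytes: int, tape: Dict[str, int]) -> None:
        self.capacity = capacity_bytes
        self.tape = tape            # file name -> size in bytes, as known to the MSS
        self.disk = OrderedDict()   # staged files in LRU order (oldest first)
        self.used = 0
        self.tape_recalls = 0

    def open(self, name: str) -> str:
        """Return 'disk' if the file was already staged, 'tape' if it had to be recalled."""
        if name in self.disk:
            self.disk.move_to_end(name)  # now the most recently used file
            return "disk"
        size = self.tape[name]           # KeyError: file unknown to the MSS
        if size > self.capacity:
            raise ValueError(f"{name} does not fit in the disk pool")
        while self.used + size > self.capacity:
            _, evicted_size = self.disk.popitem(last=False)  # evict least recently used
            self.used -= evicted_size
        self.disk[name] = size
        self.used += size
        self.tape_recalls += 1
        return "tape"

pool = StagingDiskPool(100, {"a.raw": 40, "b.raw": 40, "c.raw": 40})
print([pool.open(f) for f in ["a.raw", "b.raw", "a.raw", "c.raw", "b.raw"]])
# ['tape', 'tape', 'disk', 'tape', 'tape']
```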
WP7: Network Services • Goals • Review the network service requirements for DataGrid • Establish and manage the DataGrid network facilities • Monitor the traffic and performance of the network • Deal with the distributed security aspects • Achievements • Analysis of network requirements for testbed 1 & study of available network physical infrastructure (D7.1) • Use of the European backbone GEANT since Dec. 2001 • Initial network monitoring architecture defined (D7.2) and first tools deployed in testbed 1 • Collaboration with Dante & DataTAG • Working with GGF (Grid High Performance Networks) & Globus (monitoring/MDS) • Issues • Resources for study of security issues • End-to-end performance for applications depends on a complex combination of components • Components: network monitoring tools PingER, Udpmon, Iperf
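PingER-style monitoring reports round-trip time and packet loss from periodic pings, while tools such as Iperf measure the throughput achieved by bulk transfers. The sketch below only shows how those headline numbers are derived from raw samples; it does not reproduce any of the listed tools, and the sample values are made up.

```python
import math
from statistics import mean, median
from typing import Dict, Optional, Sequence

def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> Dict[str, float]:
    """Summarise one batch of ping probes; None marks a probe that got no reply."""
    if not rtts_ms:
        raise ValueError("no probes")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    if not replies:  # everything lost: RTT statistics are undefined
        return {"loss_pct": loss_pct, "min_ms": math.nan, "avg_ms": math.nan, "median_ms": math.nan}
    return {"loss_pct": loss_pct, "min_ms": min(replies),
            "avg_ms": mean(replies), "median_ms": median(replies)}

def throughput_mbit_s(bytes_transferred: int, seconds: float) -> float:
    """Achieved throughput of a bulk transfer in Mbit/s (decimal megabits)."""
    return bytes_transferred * 8 / seconds / 1e6

print(summarize_pings([20.0, 22.0, None, 21.0, 25.0]))
# {'loss_pct': 20.0, 'min_ms': 20.0, 'avg_ms': 22.0, 'median_ms': 21.5}
print(throughput_mbit_s(125_000_000, 10.0))  # 100.0
```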
WP6: TestBed Integration • Goals • Deploy testbeds for the end-to-end application experiments & demos • Integrate successive releases of the software components • Achievements • Integration and deployment of EDG software release 1.0 • Working implementation of multiple Virtual Organisations (VOs) & basic security infrastructure • Definition of acceptable usage contracts and creation of the Certification Authorities group • Issues • Procedures for software integration • Test plan for software release • Support for production-style usage of the testbed • Components: Globus packaging & EDG config, Build tools, End-user documents
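The basic security infrastructure of this period was built on Globus GSI, where a site's grid-mapfile maps a user's certificate subject (DN) to one or more local accounts, which is how members of several VOs can be admitted to the same site. A minimal parser, assuming the usual one-entry-per-line layout of a quoted DN followed by a comma-separated account list; the DNs and accounts are invented, and using shlex for the quoting is a convenience that would mishandle DNs containing apostrophes.

```python
import shlex
from typing import Dict, List

def parse_grid_mapfile(text: str) -> Dict[str, List[str]]:
    """Map certificate subject DNs to local account names.
    Each non-comment line is expected to be: "<subject DN>" account[,account...]"""
    mapping: Dict[str, List[str]] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # honours the double quotes around the DN
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected a quoted DN followed by account(s)")
        dn, accounts = parts
        mapping[dn] = accounts.split(",")
    return mapping

# Entirely made-up DNs and accounts for two hypothetical VOs.
sample = '''
# VO A
"/O=Grid/O=ExampleVO-A/CN=Jane Doe" voa001
# VO B
"/O=Grid/O=ExampleVO-B/CN=John Smith" vob001,vob002
'''
print(parse_grid_mapfile(sample)["/O=Grid/O=ExampleVO-B/CN=John Smith"])  # ['vob001', 'vob002']
```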
Software Release Procedure • Coordination meeting: gather feedback on the previous release; review the plan for the next release • WP meetings: take the basic plan and clarify effort/people/dependencies • Software development: performed by WPs in dispersed institutes, which run unit tests • Software integration: performed by WP6 on frozen software; integration tests run (http://edms.cern.ch/document/341943) • Acceptance tests: performed by Loose Cannons et al. • Roll-out: present software to application groups; deploy on testbed • Testbed 1 roll-out meeting: Dec 11 2001, ~100 participants • Software release plan: http://edms.cern.ch/document/333297 • (Figure: release cycle from the release plan through WP meetings, themed technical meetings and component development (Component 1 … Component n: WP1, WP3, WP7, Globus) to the distributed EDG release, with release feedback into the next plan)
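The procedure is a sequence of gates: software only moves to the next stage once the current stage's tests pass. A small sketch of that gating logic, assuming each stage can be reduced to a pass/fail check; the stage checks are placeholders, not the project's actual test tooling.

```python
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[], bool]]

def run_release(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run release stages in order and stop at the first failure.
    Returns (stages that passed, name of the failing stage or None)."""
    passed: List[str] = []
    for name, check in stages:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None

# The checks are placeholders; in practice each would run real test suites.
stages: List[Stage] = [
    ("unit tests (WPs)", lambda: True),
    ("integration tests on frozen software (WP6)", lambda: True),
    ("acceptance tests", lambda: False),
    ("roll-out to testbed", lambda: True),
]
print(run_release(stages))
# (['unit tests (WPs)', 'integration tests on frozen software (WP6)'], 'acceptance tests')
```

Because a failing stage stops the run, roll-out can never happen on a release that has not passed acceptance tests.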
TestBed 1 Sites Status • Web interface showing the status of servers at testbed 1 sites
DataGrid Testbed Sites (>40; HEP sites and ESA sites) • Dubna, Moscow, Lund, Estec, KNMI, RAL, Berlin, IPSL, Prague, Paris, Brno, CERN, Lyon, Santander, Milano, Grenoble, PD-LNL, Torino, Madrid, Marseille, BO-CNAF, Pisa, Lisboa, Barcelona, ESRIN, Roma, Valencia, Catania • Contacts: Francois.Etienne@in2p3.fr, Antonia.Ghiselli@cnaf.infn.it
Initial testbed usage • Physicists from LHC experiments submit jobs with their application software that uses: • User interface (job submission language etc.) • Resource Broker & Job Submission Service • Information Service & Monitoring • Data Replication
Generic HEP application flowchart • Job arguments: Data Type: raw/dst; Run Number: xxxxxx; Number of evts: yyyyyy; Number of wds/evt: zzzzzz; Rep Catalog flag: 0/1; Mass Storage flag: 0/1 • raw: generate raw events on local disk (raw_xxxxxx_dat.log) • dst: get pfn from Rep Catalog; if the pfn is not local, copy raw data from SE to local disk; read raw events; write dst events (dst_xxxxxx_dat.log) • Move output to SE / MS; add lfn/pfn to Rep Catalog; write logbook on client node
First simulated ALICE event generated by using the DataGrid Job Submission Service:
    [reale@testbed006 JDL]$ dg-job-submit gridpawCNAF.jdl
    Connecting to host testbed011.cern.ch, port 7771
    Transferring InputSandbox files...done
    Logging to host testbed011.cern.ch, port 15830
    =========dg-job-submit Success ============
    The job has been successfully submitted to the Resource Broker.
    Use dg-job-status command to check job current status.
    Your job identifier (dg_jobId) is:
    https://testbed011.cern.ch:7846/137.138.181.253/185337169921026?testbed011.cern.ch:7771
    ========================================
    [reale@testbed006 JDL]$ dg-job-get-output https://testbed011.cern.ch:7846/137.138.181.253/185337169921026?testbed011.cern.ch:7771
    Retrieving OutputSandbox files...done
    ============ dg-get-job-output Success ============
    Output sandbox files for the job:
    - https://testbed011.cern.ch:7846/137.138.181.253/185337169921026?testbed011.cern.ch:7771
    have been successfully retrieved and stored in the directory:
    /sandbox/185337169921026
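The generic HEP application flowchart can be read as a short program. The sketch below follows its boxes with in-memory stand-ins for the local disk, SE, MS, replica catalogue and logbook. Everything here is hypothetical: the file naming, the "SE:"/"MS:" PFN prefixes, and the reading of the Mass Storage flag as choosing MS rather than SE as the destination are our assumptions, and the number of words per event is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Site:
    # Hypothetical in-memory stand-ins for the boxes of the flowchart.
    local_disk: Set[str] = field(default_factory=set)
    se: Set[str] = field(default_factory=set)                   # storage element
    ms: Set[str] = field(default_factory=set)                   # mass storage
    rep_catalog: Dict[str, str] = field(default_factory=dict)   # lfn -> pfn
    logbook: List[str] = field(default_factory=list)            # on the client node

def run_job(site: Site, data_type: str, run: int, n_evts: int,
            rep_catalog_flag: bool, mass_storage_flag: bool) -> str:
    """One pass through the generic HEP flowchart; returns the LFN of the output file."""
    raw_lfn = f"raw_{run:06d}"
    if data_type == "raw":
        out = raw_lfn
        site.local_disk.add(out)               # generate raw events on local disk
    elif data_type == "dst":
        pfn = site.rep_catalog[raw_lfn]        # get pfn from replica catalogue
        if raw_lfn not in site.local_disk:     # pfn local? no: copy from SE
            if raw_lfn not in site.se:
                raise FileNotFoundError(pfn)
            site.local_disk.add(raw_lfn)
        out = f"dst_{run:06d}"                 # read raw events, write dst events
        site.local_disk.add(out)
    else:
        raise ValueError(f"data type must be 'raw' or 'dst', got {data_type!r}")

    # Assumption: the Mass Storage flag chooses MS over SE as the destination.
    dest, dest_name = (site.ms, "MS") if mass_storage_flag else (site.se, "SE")
    site.local_disk.discard(out)               # a move, not a copy
    dest.add(out)
    if rep_catalog_flag:
        site.rep_catalog[out] = f"{dest_name}:{out}"  # add lfn/pfn to replica catalogue
    site.logbook.append(f"{data_type} run={run} evts={n_evts} -> {dest_name}")
    return out

site = Site()
run_job(site, "raw", 42, 1000, rep_catalog_flag=True, mass_storage_flag=False)
run_job(site, "dst", 42, 1000, rep_catalog_flag=True, mass_storage_flag=True)
print(site.rep_catalog)  # {'raw_000042': 'SE:raw_000042', 'dst_000042': 'MS:dst_000042'}
```

Note that the dst pass depends on the raw pass having registered its output, which is exactly why the flowchart routes data through the replica catalogue.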
Biomedical applications • Data mining on genomic databases (exponential growth) • Indexing of medical databases (Tb/hospital/year) • Collaborative framework for large-scale experiments (e.g. epidemiological studies) • Parallel processing for database analysis • Complex 3D modelling
Earth Observation • ESA missions: • about 100 Gbytes of data per day (ERS 1/2) • 500 Gbytes for the next ENVISAT mission (launched March 1st, 2002) • EO requirements for the Grid: • enhance the ability to access high-level products • allow reprocessing of large historical archives • improve complex Earth science applications (data fusion, data mining, modelling …)
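To put the ERS figure in perspective, a constant 100 Gbytes per day corresponds to a sustained rate of a little over 1 MB/s and roughly 36.5 TB per year. A short calculation, assuming decimal units and continuous acquisition all year:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400
GB = 10**9                      # decimal gigabyte, in bytes

def sustained_rate_mb_s(gb_per_day: float) -> float:
    """Average rate in MB/s (decimal) needed to absorb gb_per_day continuously."""
    return gb_per_day * GB / SECONDS_PER_DAY / 10**6

def yearly_volume_tb(gb_per_day: float, days: int = 365) -> float:
    """Accumulated volume in TB (decimal) after the given number of days."""
    return gb_per_day * days / 1000

# ERS 1/2: about 100 Gbytes per day.
print(round(sustained_rate_mb_s(100), 2))  # 1.16
print(yearly_volume_tb(100))               # 36.5
```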
Development & Production testbeds • Development • An initial set of 5 sites will keep small clusters of PCs for development purposes, to test new versions of the software, configurations etc. • Production • More stable environment for use by application groups • more sites • more nodes per site (grow to meaningful size at major centres) • more users per VO • Usage already foreseen in Data Challenge schedules for LHC experiments • harmonize release schedules
Plans for 2002 • Planned intermediate release schedule: TestBed 1: November 2001; Release 1.1: January 2002; Release 1.2: July 2002; Release 1.3: internal release only; Release 1.4: August 2002; TestBed 2: October 2002. A similar schedule will be made for 2003. • Each release includes: feedback from use of the previous release by application groups; planned improvements/extensions by middleware WPs; more use of WP6 software infrastructure; feeds into the architecture group • Extension of testbed: more users, sites & nodes-per-site; split testbed into development and production sites; investigate inter-operability with US grids • Iterative releases up to testbed 2: incrementally extend functionality provided via each Work Package; better integrate the components; improve stability • Testbed 2 (autumn 2002) extra requirements: interactive jobs; job partitioning for parallel execution; advance reservation; accounting & query optimization; security design (D7.6) . . .
Release Plan details • Current release EDG 1.1.4: deployed on testbed under RedHat 6.2 • Finalising build of EDG 1.2: GDMP 3.0; GSI-enabled RFIO client and server • EDG 1.3 (internal): build using autobuild tools, to ease future porting; support for MPI on a single site • EDG 1.4 (August): support for RH 6.2 & 7.2; basic support for interactive jobs; integration of Condor DAGMan; use of MDS 2.2 with first GLUE schema • EDG 2.0 (Oct): still based on Globus 2.x (pre-OGSA); use of updated GLUE schema; job partitioning & check-pointing; advanced reservation/co-allocation • See http://edms.cern.ch/document/333297 for further details
Issues • Support for production testbed • Effort for testing • Software Release Procedure: Integrated testing • CA explosion, CAS introduction and policy support • Packaging & distribution • S/W licensing • Convergence on Architecture • Impact of OGSA
Issues - Actions • Support for production testbed – support team and dedicated site • Effort for testing – test team • Software Release Procedure: Integrated testing – expand procedure • CA explosion, CAS introduction and policy support – security group’s security design • Packaging & distribution – ongoing • S/W licensing – has been addressed, see http://www.edg.org/license • Convergence on Architecture – architecture group • Impact of OGSA – design of OGSA services in WP2, WP3
Future Plans • Expand and consolidate testbed operations • Improve the distribution, maintenance and support process • Understand and refine Grid operations • Evolve architecture and software on the basis of TestBed usage and feedback from users • GLUE • Converging on common documents with PPDG/GriPhyN • OGSA interfaces and components • Prepare for the second testbed in autumn 2002 in close collaboration with LCG • Enhance synergy with the US via DataTAG-iVDGL and InterGrid • Promote early standards adoption through participation in GGF and other international bodies • Explore a possible Integrated Project within FP6
Learn more about EU-DataGrid • For more information, see the EDG website • http://www.edg.org/ • EDG Tutorials at ACAT: • Tuesday 15.00-17.00 • Wednesday 17.30-19.30 • EDG Tutorials at GGF5 in Edinburgh, 25.7.2002, see http://www.gridforum.org/ • CERN School of Computing, Vico Equense, Italy, 15-28 September 2002 • Programme includes Grid lectures by Ian Foster and Carl Kesselman and a hands-on tutorial on DataGrid, http://cern.ch/CSC/