Status and Prospective of EU Data Grid Project
Alessandra Fanfani (University of Bologna)
On behalf of the EU DataGrid project
http://www.eu-datagrid.org
• Outline:
  • EU DataGrid project
  • HEP application experience
  • Future perspective
The EU DataGrid Project
• 9.8 M Euros of EU funding over 3 years
  • 90% for middleware and applications (HEP, Earth Observation, Biomedical)
  • 3-year phased developments & demos
• Total of 21 partners
  • Research and academic institutes as well as industrial companies
• Extensions (time and funds) on the basis of first successful results:
  • DataTAG (2002-2003) www.datatag.org
  • CrossGrid (2002-2004) www.crossgrid.org
  • GridStart (2002-2004) www.gridstart.org
• Project started in Jan. 2001
• Testbed 0 (early 2001): international testbed 0 infrastructure deployed; Globus 1 only, no EDG middleware
• Testbed 1 (early 2002): first release of EU DataGrid software to defined users within the project
• Testbed 2 (end 2002): builds on Testbed 1 to extend the facilities of DataGrid; focus on stability; passed the 2nd annual EU review in Feb. 2003
• Testbed 3 (2003): advanced functionality & scalability; currently being deployed
• Project ends in Dec. 2003
Related Grid Projects
• Through links with sister projects, there is the potential for a truly global scientific applications grid
• Main components of the EDG 2.0 release form the basis of the LCG middleware
  • LHC Computing Grid: www.cern.ch/lcg
• Related projects: GriPhyN, PPDG, iVDGL
EDG Middleware Architecture (layered view)
• Local computing: Local Application, Local Database
• APPLICATIONS - Grid Application Layer: Data Management, Metadata Management, Job Management
• Collective Services: Grid Scheduler, Replica Manager, Information & Monitoring
• Underlying Grid Services: Computing Element Services, Storage Element Services, Replica Catalog, Authorization Authentication and Accounting, Service Index, SQL Database Services
• Grid Fabric services: Fabric Monitoring and Fault Tolerance, Node Installation & Management, Fabric Storage Management, Resource Management, Configuration Management
• The middleware (M/W) builds on GLOBUS and CondorG (via VDT)
Workload Management System
• The user interacts with the Grid via the Workload Management System (WMS)
• The goal of the WMS is distributed scheduling and resource management in a Grid environment
• The Resource Broker tries to match user requirements with the available resources (see the sketch below):
  • Software installed at potential sites
  • Ensure data locality
  • Efficient usage of resources
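As an illustration of the matchmaking idea (not the actual Resource Broker code), the sketch below filters candidate Computing Elements on a job's requirements and ranks them by data locality and free CPUs; the attribute names, CE hostnames and the `pick_ce` helper are hypothetical.

```python
# Illustrative sketch of Resource Broker-style matchmaking (not EDG code).
# A job asks for a software tag and an input file; candidate CEs are filtered
# on requirements and ranked by data locality, then by free CPUs.

def pick_ce(job, computing_elements):
    """Return the best-matching CE for a job, or None if nothing matches."""
    candidates = [
        ce for ce in computing_elements
        if job["software"] in ce["installed_software"]   # software installed at the site
        and ce["free_cpus"] > 0                          # resource actually available
    ]
    if not candidates:
        return None
    # Prefer CEs whose close Storage Element already holds the input data,
    # then break ties by the number of free CPUs.
    return max(
        candidates,
        key=lambda ce: (job["input_lfn"] in ce["close_se_files"], ce["free_cpus"]),
    )

if __name__ == "__main__":
    ces = [
        {"name": "ce01.example.it", "installed_software": {"CMSIM"},
         "free_cpus": 12, "close_se_files": {"lfn:eg02_BigJets_run17"}},
        {"name": "ce02.example.nl", "installed_software": {"CMSIM"},
         "free_cpus": 40, "close_se_files": set()},
    ]
    job = {"software": "CMSIM", "input_lfn": "lfn:eg02_BigJets_run17"}
    print(pick_ce(job, ces)["name"])   # data locality wins: ce01.example.it
```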
Data Management
• High-level data management on the Grid; a minimal sketch of the replica-catalogue idea follows below
  • Location of data
  • Replication of data
  • Efficient access to data
• Provide a basic, consistent interface to disk and mass storage systems (hides the underlying Storage Resource Manager)
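The sketch below illustrates only the concept behind replica management (a logical file name mapped to several physical copies, with a replica chosen close to where the job runs); it is not the EDG Replica Manager API, and the file names and site strings are made up.

```python
# Minimal sketch of the replica-catalogue idea behind EDG data management:
# a logical file name (LFN) maps to one or more physical file names (PFNs).

replica_catalog = {
    "lfn:eg02_BigJets_run17.fz": [
        "srm://se01.example.it/cms/eg02_BigJets_run17.fz",
        "srm://se02.cern.ch/cms/eg02_BigJets_run17.fz",
    ],
}

def register_replica(lfn, pfn):
    """Record a new physical copy of a logical file."""
    replica_catalog.setdefault(lfn, []).append(pfn)

def best_replica(lfn, local_site):
    """Prefer a replica stored at the local site; fall back to any copy."""
    replicas = replica_catalog.get(lfn, [])
    local = [pfn for pfn in replicas if local_site in pfn]
    return (local or replicas or [None])[0]

print(best_replica("lfn:eg02_BigJets_run17.fz", "cern.ch"))
```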
Information & Monitoring
• R-GMA: relational implementation of the Grid Monitoring Architecture (GMA) from the GGF
  • Makes use of the GLUE schema (inter-operability with US grids)
  • Interoperable with MDS
• Deals with information on
  • The Grid itself
  • Resources and services
  • Job status information
  • Grid applications
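To make the "relational" point concrete, the conceptual sketch below has a producer publish monitoring tuples and a consumer query them with SQL; it uses sqlite3 purely as a stand-in, and the table and column names are hypothetical, not the real R-GMA schema or API.

```python
# Conceptual sketch of R-GMA's relational model: producers publish tuples,
# consumers ask relational questions with SQL. Illustrative only.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ServiceStatus (site TEXT, service TEXT, up INTEGER)")

# A "producer" at each site publishes monitoring tuples.
db.executemany(
    "INSERT INTO ServiceStatus VALUES (?, ?, ?)",
    [("SiteA", "ComputingElement", 1),
     ("SiteA", "StorageElement", 1),
     ("SiteB", "ComputingElement", 0)],
)

# A "consumer" (e.g. a Resource Broker) queries the published information.
for row in db.execute("SELECT site, service FROM ServiceStatus WHERE up = 1"):
    print(row)
```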
DataGrid in Numbers
• People: >350 registered users, 12 Virtual Organisations, 16 Certificate Authorities, >300 people trained, 278 man-years of effort (100 man-years funded)
• Testbeds: >15 regular sites, >40 sites using EDG software (e.g. Taiwan, Korea), >10,000 jobs submitted, >1000 CPUs, >15 TeraBytes of disk, 3 Mass Storage Systems
• Software: 50 use cases, 18 software releases, current release 1.4, release 2.0 being tested, >300K lines of code
• Scientific applications: 5 Earth Observation institutes, 9 bio-informatics applications, 6 HEP experiments
DataGrid Scientific Applications
• Developing grid middleware to enable large-scale usage by scientific applications
• Development on the computing side, but also a focus on real use by the applications!
• Bio-informatics
  • Data mining on genomic databases (exponential growth)
  • Indexing of medical databases (TB/hospital/year)
• Earth Observation
  • About 100 GBytes of data per day (ERS 1/2)
  • 500 GBytes for the ENVISAT mission
• Particle Physics
  • Simulate and reconstruct complex physics phenomena millions of times
  • LHC experiments will generate 6-8 PetaBytes/year
Application Usage of Release 1.4
• EDG 1.4 evaluated for the review in Feb. 2003
• Positive signs: large increase in users, many sites interested in joining, real jobs being pushed through the system
• [Plots: CPU usage per CE for HEP simulation; disk usage per SE (including CERN), with individual sites storing roughly 19-200 GB of event data and a total of >1.5 TB]
• Successful 2nd annual EU review: the funding agencies were happy about the real use by the applications
HEP Applications
• Intense usage of the application testbed in 2002 and early 2003, in particular by HEP experiments:
  • ATLAS, CMS, ALICE, LHCb, BaBar and D0 activities within DataGrid are documented in detail in deliverable D8.3, https://edms.cern.ch/document/375586/1.2
• ATLAS and CMS task forces were very active and successful
  • Several hundred ATLAS simulation jobs of length 4-24 hours were executed, and data was replicated using grid tools
  • CMS generated ~250K events for physics studies with ~10,000 jobs in a 3-week period
• Since the project review, ALICE and LHCb have been generating physics events
• BaBar and D0 performed more basic tests with analysis and Monte-Carlo production jobs
Joint evaluation from ATLAS/CMS work on Release 1.4
• Results were obtained from focused task forces of experiment and EDG people
• Good interaction with the EDG middleware providers
  • Fast turnaround in bug fixing and installing new software
• Tests were labour-intensive, since the software was still developing and the overall system was fragile
• Essential developments are needed in
  • Data Management (robustness and functionality)
  • Information Systems (robustness and scalability)
  • Workload Management (scalability for high rates, batch submissions, stability)
  • Mass Storage Support (gridified support due in EDG 2.0)
• Release 2.0 should fix the major problems
Release 2.0
• Major new developments in all middleware areas
• Addressing the key shortcomings identified:
  • WMS stability and scalability → WMS re-factored
  • Replica catalog stability and scalability → Replica Location Service
  • Data management usability → DM re-factored
  • Information system stability and scalability → R-GMA
  • Unified access to MSS → new SE service
  • Fabric monitoring infrastructure
• Providing new functionalities
• Upgrade of the underlying software
HEP experience: the CMS example (a joint effort involving CMS, EDG, EDT and LCG people)
• CMS/EDG Stress Test goals:
  • Verification of the portability of the CMS production environment into a grid environment
  • Verification of the robustness of the European DataGrid middleware in a production environment
  • Production of data for the physics studies of CMS
• Use as much as possible the high-level Grid functionalities provided by EDG:
  • Workload Management System (Resource Broker)
  • Data Management (Replica Manager and Replica Catalog)
  • MDS (Information Indexes)
  • Virtual Organization management, etc.
• Interface (modify) the CMS production tools to the access methods provided by the Grid
• Measure performance, efficiencies and the reasons for job failures, to provide feedback to both CMS and EDG
CMS/EDG Middleware and Software
• Middleware: EDG from version 1.3.4 to version 1.4.3
  • Resource Broker server
  • Replica Manager and Replica Catalog servers
  • MDS and Information Index servers
  • Computing Elements (CEs) and Storage Elements (SEs)
  • User Interfaces (UIs)
  • Virtual Organization Management servers (VO) and clients
  • EDG monitoring, etc.
• CMS software distributed as RPMs and installed on the CEs
• CMS production tools (IMPALA, BOSS) installed on the User Interface
• Monitoring was done through:
  • Job monitoring and bookkeeping: BOSS database, EDG Logging & Bookkeeping service
  • Resource monitoring: Nagios, a web-based tool developed by the DataTAG project
  • EDG monitoring system (MDS based): collected regularly online by scripts running as cron jobs and stored for offline analysis
  • BOSS database: permanently stored in the MySQL database
• Both sources are processed offline by a tool (boss2root) that puts the information into a ROOT tree for analysis
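As a flavour of the kind of offline bookkeeping analysis described above, the sketch below summarises job outcomes per Computing Element from a list of BOSS-like job records; the record fields, CE names and status values are illustrative assumptions, not the real BOSS schema.

```python
# Hedged sketch of offline bookkeeping analysis: success/failure per CE
# from BOSS-like job records (fields and values are illustrative only).
from collections import Counter

job_records = [
    {"ce": "ce01.example.it", "status": "done"},
    {"ce": "ce01.example.it", "status": "aborted"},
    {"ce": "ce02.example.uk", "status": "done"},
]

by_ce = Counter((rec["ce"], rec["status"]) for rec in job_records)
for ce in sorted({rec["ce"] for rec in job_records}):
    done, failed = by_ce[(ce, "done")], by_ce[(ce, "aborted")]
    total = done + failed
    print(f"{ce}: {done}/{total} successful ({100 * done / total:.0f}%)")
```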
CMS jobs description
• CMS official jobs for "Production" of results used in physics studies: real-life testing
• Production in 2 steps (dataset eg02_BigJets; timings on a PIII 1 GHz, 512 MB, 46.8 SI95 node):
  • CMKIN ("short" jobs): MC generation of the proton-proton interaction for a physics channel (dataset); 125 events in ~1 minute, producing ~6 MB of ntuples written to a Grid Storage Element
  • CMSIM ("long" jobs): detailed simulation of the CMS detector, reading the CMKIN ntuples from a Grid Storage Element; 125 events in ~12 hours, producing ~230 MB of FZ files written back to a Grid Storage Element
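A back-of-envelope sketch of the scale of the CMSIM step, using only the per-job figures quoted above and the ~260K-event total reported later in the Stress Test results; all values are rough estimates, not measured numbers.

```python
# Rough scale of the CMSIM step from the per-job figures quoted above
# (125 events, ~12 hours, ~230 MB per job) and the ~260K events produced.
events_total = 260_000
events_per_job = 125
hours_per_job = 12
mb_per_job = 230

jobs = events_total / events_per_job       # ~2,100 "long" jobs
cpu_hours = jobs * hours_per_job           # ~25,000 CPU-hours
data_gb = jobs * mb_per_job / 1024         # ~470 GB of FZ files

print(f"~{jobs:.0f} CMSIM jobs, ~{cpu_hours:.0f} CPU-hours, ~{data_gb:.0f} GB")
```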
CMS production components interfaced to EDG
• [Diagram: on the CMS side, RefDB provides production parameters to IMPALA/BOSS on a User Interface, with job bookkeeping in the BOSS DB; jobs are submitted as JDL to the EDG Workload Management System, which locates input data via the Replica Manager and dispatches jobs to Computing Elements with the CMS software pre-installed; worker nodes (WN) write output data to Storage Elements and register it, with runtime monitoring and job output filtering feeding information back]
• Four submitting UIs: Bologna/CNAF (IT), Ecole Polytechnique (FR), Imperial College (UK), Padova/INFN (IT)
• Several Resource Brokers (WMS), CMS-dedicated and shared with other applications: one RB for each CMS UI + a "backup" RB
• Replica Catalog at CNAF, MDS (and Information Indexes) at CERN and CNAF, VO server at NIKHEF
EDG hardware resources
• Sites: CERN, CNAF Bologna, RAL, NIKHEF, Lyon, Legnaro & Padova, Imperial College, Ecole Polytechnique (some dedicated to the CMS Stress Test)
• New (CMS) sites were added to provide extra resources
Statistics of the CMS/EDG Stress Test
• [Plot: distribution of the number of jobs over the executing Computing Elements]
• Total EDG Stress Test jobs = 10676; successful = 7196; failed = 3480 (≈67% success rate)
CMS/EDG Production
• [Plot: number of CMSIM "long"-job events vs. time for jobs submitted from the UIs, 30 Nov - 20 Dec, spanning the CMS Week; annotations mark where limits of the implementation (RC, MDS) were hit and where the middleware was upgraded]
• ~260K events produced
• ~7 sec/event average rate, ~2.5 sec/event at the peak (12-14 Dec)
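A hedged cross-check of the rates above: assuming "~7 sec/event" is the grid-wide wall-clock production rate, and using the per-job figure of 125 events in ~12 hours quoted earlier, the rates translate into an effective number of concurrently running jobs; that interpretation of the rates is an assumption.

```python
# Hedged cross-check: convert grid-wide event rates into an effective number
# of concurrent CMSIM jobs, assuming ~12 h / 125 events per single job.
sec_per_event_per_job = 12 * 3600 / 125        # ~345 s/event for one job

avg_concurrent = sec_per_event_per_job / 7     # ~50 jobs running on average
peak_concurrent = sec_per_event_per_job / 2.5  # ~140 jobs at the 12-14 Dec peak

print(f"~{avg_concurrent:.0f} concurrent jobs on average, "
      f"~{peak_concurrent:.0f} at the peak")
```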
Main results and observations
• RESULTS
  • Could distribute and run the CMS software in the EDG environment
  • Generated ~250K events for physics with ~10,000 jobs in a 3-week period
• OBSERVATIONS
  • Were able to quickly add new sites to provide extra resources
  • Fast turnaround in bug fixing and installing new software
  • The test was labour-intensive (since the software was still developing and the overall system was fragile)
  • WMS: at the start there were serious problems with long jobs; recently improved
  • Data Management: the replication tools were difficult to use and not reliable, and the performance of the Replica Catalogue was unsatisfactory
  • Information system: the Information System based on MDS performed poorly with increasing query rate
  • The system is sensitive to hardware faults and site/system mis-configuration
  • The user tools for fault diagnosis are limited
• EDG 2.0 should fix the major problems, providing a system suitable for full integration in distributed production
EU DataGrid Summary and Outlook
• The focus of the project on stability has improved the manner in which the software is built and supported
• The application testbed has reached the highest level of maturity that can be achieved with the available grid middleware and supporting manpower
  • Steady increase in the size of the testbed, up to a peak of approx. 1000 CPUs at 15 sites
• Intense usage of the application testbed (releases 1.3 and 1.4) in the past year, with significant achievements in the use of EDG middleware by the experiments:
  • Real use is possible but labour-intensive
  • Results were obtained by task forces, which pointed to areas in the middleware that required development and reconfiguration
  • The performance problems encountered by the experiments are addressed in release EDG 2.0
• There is a strong connection with the LHC Computing Grid: LCG has a new grid service modelled on the EDG testbed, and it includes EDG 2.0 components
• Outlook: a production-quality infrastructure is needed → EGEE
  • Continuous, stable Grid operation is the most ambitious objective of EGEE and requires the largest effort
EGEE vision: Enabling Grids for E-science in Europe http://www.cern.ch/egee
• Goal
  • Create a wide European Grid production-quality infrastructure on top of present and future EU research network infrastructure
• Build on
  • Major investments in Grid technology by the EU and EU member states
  • Exploit international connections (US and AP)
  • Several pioneering prototype results
  • Large Grid development team (>60 people)
  • Requires a major EU funding effort
• Approach
  • Leverage current and planned national and regional Grid programmes (e.g. LCG)
  • Work closely with relevant industrial Grid developers, NRENs and US-AP projects
• [Diagram: applications layered on the EGEE Grid infrastructure, which builds on the Geant network]
EGEE Proposal
• Proposal submitted to the EU IST 6th framework call on 6th May 2003
• Executive summary: 10 pages (full proposal: 276 pages) http://agenda.cern.ch/askArchive.php?base=agenda&categ=a03816&id=a03816s5%2Fdocuments%2FEGEE-executive-summary.pdf
• Two-year project conceived as part of a four-year programme
• 9 regional federations covering 70 partners in 26 countries
EGEE Activities
• [Diagram: operations structure with overall operation management, Core Infrastructure Centres (managing the overall Grid infrastructure) and Regional Operations Centres (regional deployment and support of services)]
• Service Activities: deliver a production-level Grid infrastructure (52% of funding)
  • Integration of national and international Grid infrastructures
  • Essential elements: manageability, robustness, resilience to failure, consistent security model, scalability to rapidly absorb new resources
• Joint Research Activity: engineering development (24% of funding)
  • Re-engineering of grid middleware (OGSA environment) to improve the services provided by the Grid infrastructure
• Networking Activities: management, dissemination, training and applications (24% of funding)
  • The Applications Interface Activity will start with two pilot applications, in high-energy physics and bio/medical
EGEE Status
• EGEE proposal passed the thresholds at the first EU review (June 2003)
• Follow-up hearing held in Brussels on 1st July 2003 to answer written questions from the EU reviewers on details of the project
• Evaluation Summary Report received from Brussels (17th July 2003)
  • A number of detailed recommendations were made
  • EU budget estimated at 31.5M€
  • Negotiate budget details during the summer and produce the Technical Annex (details of negotiated tasks and budgets)
• Informal EGEE/EU meeting held in Brussels on 24th July 2003
• Foreseen project start date: 1st April 2004, a good match with the expected completion of the existing EU DataGrid and related projects
• All partners are requested to assign resources already during summer 2003 to start engineering investigations and architecture design work, so that the project can start on time
EGEE Summary
• EGEE is a project to develop and establish a reliable infrastructure that provides a high-quality grid service to a wide range of users
• HEP is one of the two pilot application areas selected to guide the implementation and certify the performance and functionality of this evolving European Grid infrastructure
• International connection: participation of and collaboration with non-EU countries (Russia, US, AP) is desirable and will be pursued