The EU INFN Grid Marco Verlato (INFN-Padova) Workshop on Scientific Instruments and Sensors on the Grid Trieste 23 April 2007
Outline • A bit of history • The INFN GRID project • The EGEE project • The EGEE middleware • The INFN Production Grid: Grid.it • Long-term sustainability: EGI and IGI • Summary
The INFN GRID project • The 1st national project (Feb. 2000) aiming to develop the grid technology and the new e-infrastructure needed to meet LHC (and e-Science) computing requirements • e-Infrastructure = Internet + new Web and Grid services on top of a physical layer composed of network, computing, supercomputing and storage resources, made properly available in a shared fashion by the new Grid services • Since then, many Italian and EU projects have made this a reality • Many scientific sectors in Italy, Europe and the rest of the world now base their research activities on the Grid • INFN Grid continues to be the national container used by INFN to reach its goals, coordinating all activities: • In national, European and international Grid projects • In the standardization processes of the Open Grid Forum (OGF) • In the definition of EU policies in the ICT sector of Research Infrastructures • Through its managerial structure: Executive Board, Technical Board…
The INFN GRID portal http://grid.infn.it
The strategy • Clear and stable objectives: development of the technology and of the infrastructure needed for LHC computing, but of general value • Variable instruments: use of projects and external funds (from EU, MIUR...) to reach the goal • Coordination among all the projects (Executive Board) • Grid middleware & infrastructure needed by INFN and LHC developed within a number of core European and international projects, often coordinated by CERN • DataGrid, DataTAG, EGEE, EGEE-II, WLCG • Often fostered by INFN itself • International collaboration with US Globus and Condor for the middleware, and with Grid projects like Open Science Grid and the Open Grid Forum, in order to reach global interoperability among the developed services and the adoption of international standards • Pioneering national developments of the middleware and of the national infrastructure in the areas not covered by EU projects, via national projects like Grid.it, LIBI, EGG… • Strong contribution to political committees: e-Infrastructure Reflection Group (e-IRG → ESFRI), EU concertation meetings, and the involved Units of the Commission (F2 and F3) to establish activity programmes (Calls)
Some history … • 1999 – MONARC project • Early discussions on how to organise distributed computing for LHC • 2000 – growing interest in grid technology • HEP community was the driver in launching the DataGrid project • 2001-2004 – EU DataGrid project / EU DataTAG project • middleware & testbed for an operational grid • 2002-2005 – LHC Computing Grid (LCG) • deploying the results of DataGrid to provide a production facility for the LHC experiments • 2004-2006 – EU EGEE project phase 1 • starts from the LCG grid • shared production infrastructure • expanding to other communities and sciences • 2006-2008 – EU EGEE-II • building on phase 1 • expanding applications and communities … • … and in the future – worldwide grid infrastructure?? • Interoperating and co-operating infrastructures?
The EGEE project • EGEE • 1 April 2004 – 31 March 2006 • 71 partners in 27 countries, federated in regional Grids • EGEE-II • 1 April 2006 – 31 March 2008 • 91 partners in 32 countries • Objectives • Large-scale, production-quality grid infrastructure for e-Science • Attracting new resources and users from industry as well as science • Maintain and further improve the “gLite” Grid middleware
EGEE activities • Service Activities • SA1 – Grid Operations, Support and Management (CERN) • SA2 – Networking Support (CNRS) • SA3 – Integration, Testing and Certification (CERN) • Joint Research Activities • JRA1 – Middleware Re-engineering (INFN) • JRA2 – Quality Assurance (CS-SI) • Networking Activities • NA1 – Management (CERN) • NA2 – Dissemination, Outreach and Communication (CERN) • NA3 – Training and Induction (UEdin) • NA4 – Application Identification and Support (CNRS) • NA5 – Policy and International Cooperation (GRNET)
EGEE Applications • Multitude of applications from a growing number of domains • Astrophysics • Computational Chemistry • Earth Sciences • Financial Simulation • Fusion • Geophysics • High Energy Physics • Life Sciences • Multimedia • Material Sciences • ….. Book of abstracts: http://doc.cern.ch//archive/electronic/egee/tr/egee-tr-2006-005.pdf App. Deployment Plan: https://edms.cern.ch/document/722131/2
High Energy Physics [Photo: the LHC ring between Mont Blanc (4810 m) and downtown Geneva] Large Hadron Collider (LHC): • One of the most powerful instruments ever built to investigate matter • 40 million particle collisions per second • 4 experiments: ALICE, ATLAS, CMS, LHCb • ~15 PetaBytes/year from the 4 experiments • First beams in 2007
In silico drug discovery • Diseases such as HIV/AIDS, SARS, bird flu etc. are a threat to public health due to worldwide exchanges and the circulation of persons • Grids open new perspectives to in silico drug discovery • Reduced cost and an accelerating factor in the search for new drugs • International collaboration is required for: • Early detection • Epidemiological watch • Prevention • Search for new drugs • Search for vaccines [Figure: avian influenza bird casualties]
WISDOM http://wisdom.healthgrid.org/
Medical image processing: analysing tumours • Pharmacokinetics: contrast agent diffusion study • Co-registration of a time series of volumetric medical images to analyse the evolution of the diffusion of contrast agents • Computational costs (co-registration + parametric image, 20 patients): • Sequential: 2623 hours • 20-processor computing farm (HPC): 146 hours • Grid: <20 hours, if you have enough resources (20×12 = 240 computers; EGEE has >30,000)
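As a rough cross-check of those figures, a back-of-the-envelope estimate assuming ideal, embarrassingly parallel scaling (and ignoring scheduling and data-transfer overhead) gives:

```latex
t_{\mathrm{farm}} \approx \frac{2623\ \mathrm{h}}{20} \approx 131\ \mathrm{h},
\qquad
t_{\mathrm{grid}} \approx \frac{2623\ \mathrm{h}}{240} \approx 11\ \mathrm{h}
```

The measured values quoted on the slide (146 h on the farm, <20 h on the Grid) are consistent with these ideal figures once scheduling and data-staging overheads are added.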
EGEE Operations Structure • Operations Coord. Centre (OCC) • Regional Operations Centres (ROC) • Front-line support for user and operations issues • Provide local knowledge and adaptations • One in each region – many distributed • Manage daily grid operations – oversight, troubleshooting • “Operator on Duty” • Run infrastructure services
Grid monitoring tools • Tools used by the Grid Operator on Duty team to detect problems • Distributed responsibility • CIC portal • single entry point • Integrated view of monitoring tools • Site Functional Tests (SFT) -> Service Availability Monitoring (SAM) • Grid Operations Centre Core Database (GOCDB) • GIIS monitor (Gstat) • GOC certificate lifetime • GOC job monitor • Others
Site Functional Tests • Site Functional Tests (SFT) • Framework to test (sample) services at all sites • Shows results matrix • Detailed test log available for troubleshooting and debugging • History of individual tests is kept • Can include VO-specific tests (e.g. sw environment) • Normally >80% of sites pass SFTs • Very important in stabilising sites: • Apps use only good sites • Bad sites are automatically excluded • Sites work hard to fix problems
The EGEE support infrastructure ROC C ROC B RC A ROC N VO Support C RC A VO Support B RC A VO Support A RC B RC B RC B RC C RC C RC C VO TPM C ROC C ROC B ROC N VO TPM B VO TPM A CIC Portal GGUS Central System COD Deployment support Middleware support Deployment support Network Support TPM Middleware support Middleware support Network Support Middleware support Other Grids Other Grids Other Grids Middleware support Middleware support Middleware support Other Grids Other Grids Other Grids
Status • ~17.5 million jobs run (6450 CPU-years) in 2006 • Workloads of the non-HEP VOs are starting to be significant – approaching 8-10K jobs per day and 1000 CPU-months/month • One year ago this was the overall scale of work for all VOs
Grid Virtual Organizations • Routine and large-scale use of EGEE infrastructure. • Virtual Organizations: • 200+ visible on the grid • 100+ registered with EGEE http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_vo.php
Virtual Organizations Usage History (Dec. ’05 – Nov. ’06) http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_vo.php
EGEE related projects & other grids • Potential for linking ~80 countries
Other FP6 activities of INFN Grid in Europe/1 • To guarantee Open Source Grid middleware evolution towards international standards • OMII-Europe • …and its availability through an effective repository • ETICS • To contribute to R&D informatics activities • CoreGRID • To coordinate EGEE extension in the world • EUMedGrid • EU-IndiaGrid • EUChinaGrid • EELA
Other FP6 activities of INFN Grid in Europe/2 • To promote EGEE for new scientific communities • GRIDCC (real-time applications and instruments control) • BioInfoGrid (Bioinformatics; coordinated by CNR) • LIBI (MIUR, Bioinformatics in Italy) • Cyclops (Civil Protection) • To contribute to e-IRG, the e-Infrastructure Reflection Group born in Rome in December 2003 • Initiative of the Italian Presidency on “eInfrastructures (Internet and Grids) – The new foundation for knowledge-based Societies”, an event organised by MIUR, INFN and the EU Commission • Representatives in e-IRG appointed by the EU Science Ministers • Policies and roadmap for e-Infrastructure development in the EU • To coordinate participation in the Open Grid Forum (formerly GGF)
EGEE Middleware Distribution • gLite • Exploit experience and existing components from VDT (Condor, Globus), EDG/LCG, and others • Develop a lightweight stack of generic middleware useful to EGEE applications (HEP and Life Sciences are the pilot applications) • Pluggable components – cater for different implementations • Follow SOA approach, WS-I compliant where possible • Focus is on re-engineering and hardening • Business-friendly open source license • Moving to Apache 2
Parts of the Grid “ecosystem” [Diagram: Condor, Globus, MyProxy, VDT (USA) and EDG, LCG, DataTAG, CrossGrid (EU) around 2001; OSG, EGEE, DEISA, GridCC (interactive), NextGrid and SRM around 2004; components used in and feeding future grids]
gLite approach • Service-oriented approach: • Lightweight services • Allow for multiple interoperable implementations • Easily and quickly deployable • Use existing services where possible • Layered view (after R. Jones, 24/10/05): Applications on top of Higher-Level Grid Services, which sit on Foundation Grid Middleware • Higher-Level Grid Services (Workload Management, Replica Management, Visualization, Workflows, Grid economies, etc.): • Provide specific solutions for supported applications • Host services from other projects • More rapid changes than Foundation Grid Middleware • Deployed as application software using procedures provided by grid operations • Foundation Grid Middleware (security model and infrastructure, Computing (CE) & Storage Elements (SE), Accounting, Information providers and monitoring): • Application independent • Evaluate/adhere to new standards • Emphasis on robustness/stability over new functionality • Deployed as a software distribution by grid operations
gLite Service Decomposition Overview paper: http://doc.cern.ch//archive/electronic/egee/tr/egee-tr-2006-001.pdf • Access: CLI, API • Security Services: Authentication, Authorization, Auditing • Information & Monitoring Services: Information & Monitoring, Service Discovery, Network Monitoring • Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement • Job Management Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management
Workload Management [Same gLite service decomposition diagram, with the Job Management Services (Workload Management, Computing Element, Accounting, Job Provenance, Package Manager) highlighted]
Computing Element • Three flavours available now: • LCG-CE (GT2 GRAM) • In production now but will be phased out soon • gLite-CE (GSI-enabled Condor-C) • Already deployed but still needs thorough testing and tuning, being done now • CREAM (WS-I based interface) • Deployed on the JRA1 preview test-bed; after a first testing phase it will be certified and deployed together with the gLite-CE • Our contribution to the OGF-BES group for a standard WS-I based CE interface • CREAM and WMProxy demo at SC06! • BLAH is the interface to the local resource manager (via plug-ins) for CREAM and gLite-CE • Information pass-through: pass parameters to the LRMS to help job scheduling [Diagram: the WMS and clients submit to the site Computing Element (CEMon, BLAH, glexec + LCAS/LCMAPS, LRMS, worker nodes), which publishes to the information system (bdII, R-GMA)]
gLite CE is evolving towards ICE-CREAM • CREAM: web-service Computing Element • The CREAM WSDL allows defining custom user interfaces • A C++ CLI interface allows direct submission • Lightweight • Fast notification of job status changes • via CEMon • Improved security • no “fork-scheduler” • Will support bulk jobs on the CE • optimization of staging of input sandboxes for jobs with shared files • ICE: Interface to CREAM Environment • being integrated in the WMS for submissions to CREAM
Workload Management System (a.k.a. WMS, a.k.a. RB) • Job management requests (submission, cancellation) are expressed via a Job Description Language (JDL) • Keeps submission requests; requests are kept for a while if no matching resources are available • Maintains a repository of resource information available to the matchmaker, updated via notifications and/or active polling of resources • Finds an appropriate CE for each submission request, taking into account job requirements and preferences, Grid status, and utilization policies on resources • Performs the actual job submission and monitoring
Workload Management System (a.k.a. WMS, a.k.a. RB) • The WMS helps the user access computing resources • Resource brokering, management of job input/output, ... • LCG-RB: GT2 + Condor-G • To be replaced once the gLite WMS proves reliable • gLite WMS: web service (WMProxy) + Condor-G • Management of complex workflows (DAGs) and compound jobs • bulk submission and shared input sandboxes • support for input files on different servers (scattered sandboxes) • Support for shallow resubmission of jobs • Job File Perusal: file peeking during job execution • Supports collection of information from CEMon, BDII, R-GMA and from the DLI and StorageIndex data management interfaces • Support for parallel jobs (MPI) when the home directory is not shared
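To make the JDL mentioned above concrete, a minimal single-job description is sketched below. The attribute names follow standard gLite JDL usage; the script, file names and the requirements/rank expressions are illustrative assumptions, not taken from the presentation:

```
[
  Type          = "Job";
  Executable    = "/bin/sh";
  Arguments     = "analyse.sh input.dat";        // hypothetical user script and data file
  StdOutput     = "analyse.out";
  StdError      = "analyse.err";
  InputSandbox  = {"analyse.sh", "input.dat"};   // small files shipped with the job
  OutputSandbox = {"analyse.out", "analyse.err"};
  // Example matchmaking hints: require a minimum wall-clock limit on the CE
  // and prefer CEs with the shortest estimated response time.
  Requirements  = other.GlueCEPolicyMaxWallClockTime > 120;
  Rank          = -other.GlueCEStateEstimatedResponseTime;
]
```

Such a file would typically be handed to the WMS with the gLite command-line tools (glite-wms-job-submit, then glite-wms-job-status and glite-wms-job-output), which talk to WMProxy on the user's behalf.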
Direct Acyclic Graph (DAG) is a set of jobs where the input, output, or execution of one or more jobs depends on one or more other jobs A Collection is a group of jobs with no dependencies basically a collection of JDL’s A Parametric job is a job having one or more attributes in the JDL that vary their values according to parameters Using compound jobs it is possible to have one shot submission of a (possibly very large, up to thousands) group of jobs Submission time reduction Single call to WMProxy server Single Authentication and Authorization process Sharing of files between jobs Availability of both a single Job Id to manage the group as a whole and an Id for each single job in the group nodeA nodeB nodeC nodeE nodeD Workflows
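As an illustration, a small job collection might be described as in the hedged sketch below (the node script, arguments and file names are invented for the example):

```
[
  Type  = "Collection";
  Nodes = {
    [ Executable = "/bin/sh"; Arguments = "step.sh 1"; InputSandbox = {"step.sh"};
      StdOutput = "step1.out"; OutputSandbox = {"step1.out"}; ],
    [ Executable = "/bin/sh"; Arguments = "step.sh 2"; InputSandbox = {"step.sh"};
      StdOutput = "step2.out"; OutputSandbox = {"step2.out"}; ]
  };
]
```

Parametric jobs follow the same pattern but use JobType = "Parametric" together with the Parameters, ParameterStart and ParameterStep attributes and a _PARAM_ placeholder in the varying attributes, while DAGs additionally declare dependencies between named nodes; a single submission then yields one identifier for the whole group plus one per node, as described above.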
Accounting - DGAS • DGAS accumulates Grid accounting information • user, JobId, user VO, VOMS FQAN (role, capabilities), SI2K, SF2K, system usage (cpuTime, wallTime…), … • allows billing and scheduling policies • levels of granularity: from single jobs to VO or grid aggregations • Privacy: only the user or the VO manager can access the information • site managers can keep accounting information available just for site-internal analysis • Sites can substitute the DGAS metering system with their own [Diagram: site-level usage metering produces usage records that flow through the accounting data interface to VO usage accounting and usage analysis, supporting account balancing, resource pricing and (billing)]
The Italian e-Infrastructure: Grid.IT • Grid.IT, the 3+1-year national project funded by MIUR with 12 M€ (2002-05), ended in 2006 • Grid.IT has been crucial to develop and complete the R&D phase for operating an Italian Grid Infrastructure (IGI): • developed the tools and services required for a national Grid Operation Centre (GOC) integrated with the EU Regional Operation Centres (ROCs) • The Italian GOC is run by INFN-CNAF in Bologna, but the effort is distributed among other Italian centres • Grid.IT made it possible to extend e-Infrastructure support from INFN to other sciences, following the model successfully implemented with the research network (GARR): • from INFNet to GARR
The Italian Production Grid • 39 RCs (31 in EGEE) • >3400 CPUs >360 TB • 33 VOs • gLite 3.0 +DGAS http://grid-it.cnaf.infn.it
INFN-GRID Customisations Release INFN-GRID 3.0 is a customisation of gLite 3.0: • Support for regional VOs • DGAS deployment (sensor + HLR server) • Network Monitor Element (NME), interfaced with GridICE for data presentation • Customised tools to install and use the grid: • installation by a customised version of glite-yaim (ig-yaim) • support to interface ig-yaim with a Quattor installation • UIPnP: a plug-and-play User Interface to access the grid as a user of any Linux system, without RPMs • AFS (read-only) on the WNs
Operation Structure and Organisation The National Grid Central Management Team (CMT): • Activities: • ‘integration’ and testing of the InfnGrid middleware release (based on gLite m/w release) • deployment procedures and configuration tools • Monitoring and control of the status of the grid services and resources • Responsibilities: • site registration procedure • middleware deployment • certification procedure for all InfnGrid sites • Operation of the GRID services
Operations Support • The Italian ROC provides local front-line support to Virtual Organizations, users and Resource Centres • The Italian ROC team is organised in daily shifts: • 2 people per shift, 2 shifts per day, from Monday to Friday • Activities planned during the shift: • Log trouble tickets created, updated and closed; track problems on grid services and sites; monitor successful site certification • check the actions of the previous shift and the downtime page • check the status of production grid services and the GRIS status of production CEs and SEs • check the status of the production sites using the Site Functional Tests report • Periodic (every 15 days) phone conferences with ROC teams and site managers • Provide and write the ROC report for the weekly EGEE operations meeting
Grid Monitoring • The status of the Italian grid infrastructure is monitored using GridICE • It is one of the monitoring tools used by EGEE • It is used to control: • the status of the submission queues • process/daemon status in the services (RB, BDII) • VO view: list of CEs and SEs available for the VOs, with their status and capacity • job monitoring
User, Operations and VO support • The user support system provides ticket exchange between: • the ROC on duty and site managers • site managers and the central management team, and vice versa • site managers and the certification team during installation/upgrade • GGUS to ROC and ROC to GGUS • The Italian ROC ticketing system is built upon a suite of web-based tools written in PHP: Xhelp • The support system components are accessible from the main interface of the deployment portal (grid-it.cnaf.infn.it), providing a certificate-based single sign-on point of registration/identification • The end user can open a request, and view and follow his own tickets and related replies • A supporter can view tickets assigned to his own groups, add responses and solutions, and change status/priority • While handling tickets, side content is always available for all classes of users (according to their access level): • Site Functional Tests • site downtime calendaring system • file archive • net query tools • IRC applet, contextual questions and answers • reports from daily shifts
Sustainability: Beyond EGEE-II • Need to prepare for permanent Grid infrastructure • Ensure a reliable and adaptive support for all sciences • Independent of short project funding cycles • Infrastructure managed in collaboration with national grid initiatives
Grids in Europe • Examples of National Grid projects: • Austrian Grid Initiative • Belgium: BEgrid • DutchGrid • France: Grid’5000 • Germany: D-Grid; Unicore • Greece: HellasGrid • Grid Ireland • Italy: INFNGrid; GRID.IT • NDGF • Portuguese Grid • Swiss Grid • UK e-Science: National Grid Service; OMII; GridPP • …