Marco Verlato (INFN-Padova) The INFN GRID EELA WP2 E-infrastructure Workshop Rio de Janeiro, 20-23 August 2007
Outline • A bit of history • INFNGRID Overview • INFNGRID Release • INFNGRID Services • From developers to production… • Monitoring and Accounting • User and Site Support • Managing procedures
The INFN GRID project • The 1st national project (Feb. 2000) aiming to develop grid technology and the new e-infrastructure needed to meet the computing requirements of LHC (and e-Science) • e-Infrastructure = Internet + new Web and Grid services on top of a physical layer composed of network, computing, supercomputing and storage resources, made properly available in a shared fashion by the new Grid services • Since then, many Italian and EU projects have made this a reality • Many scientific sectors in Italy, Europe and the entire world now base their research activities on the Grid • INFN Grid continues to be the national container used by INFN to reach its goals, coordinating all the activities: • In the national, European and international Grid projects • In the standardization processes of the Open Grid Forum (OGF) • In the definition of EU policies in the ICT sector of Research Infrastructures • Through its managerial structure: Executive Board, Technical Board…
The INFN GRID portal http://grid.infn.it
The strategy • Clear and stable objectives: development of the technology and the infrastructure needed for LHC computing, but of general value • Variable instruments: use of projects and external funds (from EU, MIUR...) to reach the goal • Coordination among all the projects (Executive Board) • Grid middleware & infrastructure needed by INFN and LHC developed within a number of core European and international projects, often coordinated by CERN • DataGrid, DataTAG, EGEE, EGEE-II, WLCG • Often fostered by INFN itself • International collaboration with US Globus and Condor for the middleware, and with Grid projects like Open Science Grid and the Open Grid Forum, in order to reach global interoperability among the developed services and the adoption of international standards • National pioneering development of the m/w and the national infrastructure in the areas not covered by EU projects, via national projects like Grid.it, LIBI, EGG… • Strong contribution to political committees: e-Infrastructure Reflection Group (eIRG -> ESFRI), EU Concertation meetings and the involved Units of the Commission (F2 and F3) to establish activity programmes (Calls)
Some history… • 1999 – MONARC project: early discussions on how to organise distributed computing for LHC • 2000 – growing interest in grid technology: the HEP community was the driver in launching the DataGrid project • 2001-2004 – EU DataGrid / EU DataTAG projects: middleware & testbed for an operational grid • 2002-2005 – LHC Computing Grid (LCG): deploying the results of DataGrid to provide a production facility for the LHC experiments • 2004-2006 – EU EGEE project phase 1: starts from the LCG grid; shared production infrastructure; expanding to other communities and sciences • 2006-2008 – EU EGEE-II: building on phase 1; expanding applications and communities… • …and in the future – a worldwide grid infrastructure?? Interoperating and co-operating infrastructures?
Other FP6 activities of INFN GRID in Europe/1 • To guarantee Open Source Grid Middleware evolution towards international standards • OMII Europe • …and its availability through an effective repository • ETICS • To contribute to R&D informatics activities • CoreGRID • To coordinate the EGEE extension in the world • EUMedGrid • EU-IndiaGrid • EUChinaGrid • EELA
Other FP6 activities of INFN GRID in Europe/2 • To promote EGEE to new scientific communities • GRIDCC (real-time applications and instrument control) • BioInfoGrid (Bioinformatics; coordinated by CNR) • LIBI (MIUR, Bioinformatics in Italy) • Cyclops (Civil Protection) • To contribute to e-IRG, the e-Infrastructure Reflection Group, born in Rome in December 2003 • Initiative of the Italian Presidency on “eInfrastructures (Internet and Grids) – The new foundation for knowledge-based Societies”, an event organised by MIUR, INFN and the EU Commission • Representatives in eIRG appointed by the EU Science Ministers • Policies and roadmap for e-Infrastructure development in the EU • To coordinate participation in the Open Grid Forum (ex GGF)
FP7: guaranteeing sustainability • The future of Grids in FP7 after 2008 • EGEE proposed to the European Parliament to set up a European Grid Initiative (EGI) in order to: • Guarantee long-term support & development of the European e-Infrastructure based on EGEE, DEISA and the national Grid projects funded by the National Grid Initiatives (NGIs) • Provide a coordination framework at EU level, as done for the research networks by Géant, DANTE and the national networks like GARR • The Commission asked that a plan for a long-term sustainable Grid infrastructure (EGI + EGEE-III, …) be included among the goals of EGEE-II (as for DANTE + Géant 1-2) • The building of EGI at EU level and of a National Grid Initiative at national level is among the main goals of FP7
The future of INFNGRID: IGI • Grid.IT, the 3+1-year national project funded by MIUR with 12 M€ (2002-05), ended in 2006 • The future: the Italian Grid Infrastructure (IGI) Association • The EU (eIRG, ESFRI) requires the fusion of the different pieces of national Grids into a single national organisation (NGI) acting as the unique interface to the EU --> IGI for Italy • Substantial consensus for the creation of IGI, for a common governance of the Italian e-Infrastructure, from all the public bodies involved: INFN Grid, S-PACI, ENEA Grid, CNR, INAF, the national supercomputing centres CINECA, CILEA, CASPUR, and the new “nuovi PON” consortia • Under evaluation with MIUR: the evolution of GARR towards a more general body managing all the components of the infrastructure: network, Grid, digital libraries… • Crucial for INFN in 2007-2008 will be to manage the transition from INFN Grid to IGI in such a way as to preserve, and if possible enhance, the organisational level which allowed Italy to reach world leadership and become a leading partner of EGI
INFNGRID Overview
Supported Sites 40 sites supported: • 31 INFN sites • 9 non-INFN sites Total resources: • About 4600 CPUs • About 1000 TB of disk storage (+ about 700 TB of tape)
Supported VOs 40 VOs supported: • 4 LHC (ALICE, ATLAS, CMS, LHCb) • 3 cert (DTEAM, OPS, INFNGRID) • 8 regional (BIO, COMPCHEM, ENEA, INAF, INGV, THEOPHYS, VIRGO) • 1 catch-all VO: GRIDIT • 23 other VOs Recently a new regional VO was enabled: COMPASSIT
Components of the production Grid The Grid is not only CPUs and storage. Other elements are just as fundamental for running, managing and monitoring the grid: • Middleware • Grid services • Monitoring tools • Accounting tools • Management and control infrastructure • Users
GRID Management Grid management is performed by the Italian Regional Operation Center (ROC). Its main activities are: • Production and testing of the INFNGRID release • Deployment of the release to the sites, support to local administrators and site certification • Deployment of the release onto the central grid services • Maintenance of grid services • Periodic checks of the status of resources and services • Accounting of resource usage • Support to site managers and users at the Italian level • Support to site managers and users at the European level • Introduction of new Italian sites • Introduction of new regional VOs The IT-ROC is involved in many other activities not directly related to the production infrastructure, e.g. the PreProduction, Preview and Certification testbeds
The Italian Regional Operation Center (ROC) • Operations Coordination Centre (OCC): management and oversight of all operational and support activities • Regional Operations Centres (ROC): providing the core of the support infrastructure, each supporting a number of resource centres within its region • Grid Operator on Duty • Grid User Support (GGUS): at FZK, coordination and management of user support, single point of contact for users The IT-ROC is one of 10 existing ROCs in EGEE
Middleware: INFNGRID Release
INFNGRID Release [Timeline: LCG 1.0 (2003) and LCG 2.0 (2004) under LCG, then gLite 3.0 (2006) under EGEE/EGEE-II, tracked by the INFN-GRID releases 1.0, 2.0 and 3.0 over 2003-2008] The m/w installed on INFNGRID nodes is a customization of the gLite m/w used in the LCG/EGEE community. The customized INFNGRID release is packaged by the INFN release team (grid-release<at>infn.it). The ROC is responsible for the deployment of the release. At the moment INFNGRID-3.0-Update28 (based on gLite 3.0 Update 28) is deployed.
INFNGRID customizations: why? • VOs not supported by EGEE: define configuration parameters once (e.g. VO servers, pool accounts, VOMS certificates, ...) to reduce misconfiguration risks • MPI (requested by non-HEP sciences), additional GridICE configuration (monitoring of WNs), AFS read-only (CDF requirement), ... • Deploy additional middleware in a non-intrusive way: since Nov. 2004 VOMS, now in EGEE; DGAS (DataGrid Accounting System); NetworkMonitor (monitoring of network connection metrics)
INFNGRID customizations • Additional VOs (~20) • GridICE on almost all profiles (including WN) • Preconfigured support for MPI: • WNs without a shared home directory, with home synchronization via scp using host-based authentication (see the sketch below) • DGAS accounting: • New profile (HLR server) + additional packages on the CE • NME (Network Monitor Element) • Collaboration with CNAF-T1 on Quattor • UI “PnP”: • UI installable without administrator privileges • NTP • AFS (read-only) on WNs (needed by the CDF VO)
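To make the MPI customization concrete, here is a minimal sketch of how a job directory can be staged to the other allocated WNs with scp when no shared home is available. The hostnames, paths and use of the PBS machinefile are illustrative assumptions, not taken from the actual INFNGRID scripts; only the mechanism (scp under host-based ssh authentication) comes from the slide.

```bash
#!/bin/bash
# Hypothetical sketch: stage the MPI job directory from the first WN to the
# other WNs allocated by the batch system, when no shared home is available.
# Assumes host-based ssh authentication is configured between WNs, so no
# passwords or per-user keys are needed (as in the INFNGRID MPI setup).

JOBDIR="$PWD"                                  # working dir created by the batch system
MACHINEFILE="${PBS_NODEFILE:?no machinefile}"  # list of allocated WNs (PBS/Torque assumed)

# Copy the job directory to every other WN in the allocation.
for wn in $(sort -u "$MACHINEFILE" | grep -v "$(hostname)"); do
    scp -r "$JOBDIR" "$wn:$(dirname "$JOBDIR")/"
done

# After the MPI run, output files produced on the remote WNs could be
# gathered back the same way (scp from each WN) before the job ends.
```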
Packages and metapackages • The packages are distributed in repositories available via HTTP • For each EGEE release there are 2 repositories collecting different types of packages: • Middleware: http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/rhel30/ • Security: http://linuxsoft.cern.ch/LCG-CAs/current/ • INFNGRID customizations => a third repository: • http://grid-it.cnaf.infn.it/apt/ig_sl3-i386 (see the example configuration below)
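As an illustration, this is roughly what an apt-rpm configuration pointing at the three repositories could look like on a Scientific Linux 3 node. The base URLs are those listed above; the suite/component fields after each URL are assumptions patterned on apt-rpm conventions, not copied from the official installation guide.

```bash
# Hypothetical /etc/apt/sources.list entries for an INFNGRID node.
# Format (apt-rpm): rpm <base-url> <distribution> <components...>
# The distribution/component names below are illustrative assumptions.

# EGEE gLite 3.0 middleware
rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0 rhel30 externals Release3.0 updates

# Certification Authorities (security)
rpm http://linuxsoft.cern.ch/ LCG-CAs/current production

# INFNGRID customizations
rpm http://grid-it.cnaf.infn.it/apt ig_sl3-i386 ig_sl3 ig_sl3_externals ig_sl3_updates
```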
Metapackages management process • 1: starting from the EGEE lists, update the INFNGRID lists (maintained in an SVN repository) • 2: once the lists are OK, generate a first version of the INFNGRID metapackages to test them • 3: install and/or upgrade the metapackages on the release testbed • 4: if there are errors, correct them and go back to step 2 • 5: publish the new metapackages on the official repositories so they are available to everybody
Metapackages management • Our metapackages are supersets of the EGEE ones: • INFNGRID metapackage = EGEE metapackage + INFNGRID additional rpms • EGEE distributed metapackages: • http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/rhel30 • Flat rpm lists are available: • http://glite.web.cern.ch/glite/packages/R3.0/deployment • We maintain a customized copy of the lists and resync them easily: • https://forge.cnaf.infn.it/plugins/scmsvn/viewcvs.php/trunk/ig-metapackages/tools/getglists?rev=1888&root=igrelease&view=log • Using another tool (bmpl) we can generate all the artifacts starting from the lists • “Our” (INFNGRID) customized metapackages: • http://grid-it.cnaf.infn.it/apt/ig_sl3-i386 • HTML files with the lists of packages (one list per profile): • http://grid-it.cnaf.infn.it/?packages • Quattor template lists: • http://grid-it.cnaf.infn.it/?quattor • (a sketch of how to compare the two metapackage sets follows below)
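A quick way to verify the superset relation is to diff the dependency lists of the two metapackages with apt. The metapackage names used here (ig_WN, glite-WN) follow the naming visible in the repositories above but should be taken as illustrative.

```bash
# Refresh the package indexes from the repositories configured earlier.
apt-get update

# List the rpms pulled in by the INFNGRID worker-node metapackage and by
# the corresponding EGEE/gLite one, then compare: the INFNGRID list should
# contain everything in the gLite list plus the INFN additions.
apt-cache depends ig_WN    | sort > ig_wn.deps
apt-cache depends glite-WN | sort > glite_wn.deps

# Show only the packages present in the INFNGRID metapackage and not in
# the gLite one, i.e. the INFN customizations.
comm -13 glite_wn.deps ig_wn.deps
```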
ig-yaim • The package ig-yaim is an extension of glite-yaim. It provides: • Additional functions, or functions that override existing ones; both are stored in functions/local instead of functions/ • e.g. to configure NTP, AFS, the LCMAPS gridmapfile/groupmapfile, ... • More pool accounts => ig-users.def instead of users.def • More configuration parameters => ig-site-info.def instead of site-info.def • Both packages (glite-yaim, ig-yaim) are needed!! (see the configuration sketch below)
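For orientation, a minimal sketch of what configuring a node with ig-yaim could look like. The script path and node-type name are assumptions patterned on glite-yaim conventions of the gLite 3.0 era; only the roles of ig-site-info.def, ig-users.def and functions/local come from the slide.

```bash
# Hypothetical configuration of a worker node with ig-yaim.
# Paths and the node-type name "ig_WN" are illustrative assumptions.

# 1. Fill in the INFNGRID configuration files (instead of the plain gLite
#    ones): ig-site-info.def for site parameters, ig-users.def for the
#    enlarged pool-account list.
vi /opt/ig_yaim/examples/ig-site-info.def
vi /opt/ig_yaim/examples/ig-users.def

# 2. Run the configuration: the ig-yaim functions in functions/local
#    extend or override the glite-yaim ones in functions/.
/opt/ig_yaim/scripts/ig_configure_node \
    /opt/ig_yaim/examples/ig-site-info.def ig_WN
```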
Documentation • Documentation is published at each release • Release notes, upgrade and installation guides: • http://grid-it.cnaf.infn.it/?siteinstall • http://grid-it.cnaf.infn.it/?siteupgrade • http://grid-it.cnaf.infn.it/?releasenotes • Written in LaTeX and published in HTML, PDF and TXT • Additional information about updates and various notes is also published on wiki pages: • https://grid-it.cnaf.infn.it/checklist/modules/dokuwiki/doku.php?id=rel:updates • https://grid-it.cnaf.infn.it/checklist/modules/dokuwiki/doku.php?id=rel:hlr_server_installation_and_configuration • Everything is available to site managers on a central repository
Updates • Since the introduction of gLite 3.0, EGEE no longer produces big release changes but a series of smaller, frequent updates (about weekly); the INFNGRID release is updated accordingly • Steps: gLite Update announcement → INFNGRID release alignment to the announced update (ig-metapackages, ig-yaim) → local testing → IT-ROC deployment (a minimal update sketch follows below) gLite Updates: 17/10/2006 - gLite Update 06, 20/10/2006 - Update 07, 24/10/2006 - Update 08, 14/11/2006 - Update 09, 11/12/2006 - Update 10, 19/12/2006 - Update 11, 22/01/2007 - Update 12, 05/02/2007 - Update 13, 19/02/2007 - Update 14, 26/02/2007 - Update 15, … INFNGRID Updates: 27/10/2006 - INFNGRID Update 06/07/08 (+ new dgas, gridice packages), 15/11/2006 - Update 09, 19/12/2006 - Update 10/11, 29/01/2007 - Update 12, 14/02/2007 - Update 13, 20/02/2007 - Update 14, 27/02/2007 - Update 15, …
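In practice, applying an update on an already-configured site node amounts to an apt upgrade followed by reconfiguration. A minimal sketch, assuming the repositories shown earlier are configured and reusing the hypothetical ig-yaim invocation from the previous sketch; the re-run of yaim after every update is an assumption about the procedure, not a statement from the slide.

```bash
# Hypothetical application of an INFNGRID update on a configured node.

# Pull the updated metapackages and rpms from the configured repositories.
apt-get update
apt-get -y dist-upgrade

# Re-run the yaim configuration so that any new or changed parameters
# shipped with ig-yaim take effect (same illustrative path as before).
/opt/ig_yaim/scripts/ig_configure_node \
    /opt/ig_yaim/examples/ig-site-info.def ig_WN
```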
INFNGRID Services Overview
VOMSes Stats Number of users per VO: argo 17, bio 44, compchem 31, enea 8, eumed 56, euchina 35, gridit 89, inaf 25, infngrid 178, ingv 12, libi 10, pamela 16, planck 16, theophys 20, virgo 9, cdf 1133, egrid 28 TOP USERS (about 85% of total proxies): CDF (~50k proxies/month), EUMED (~500 proxies/month), PAMELA (~500 proxies/month), EUCHINA (~400 proxies/month), INFNGRID (test purposes, ~200 proxies/month)
General purpose Services - HLRs Accounting: Home Location Register • DGAS (Distributed Grid Accounting System) is used to account for the jobs running on the farms (grid and non-grid jobs) • 12 distributed 1st-level HLRs • 1 experimental 2nd-level HLR to aggregate data from the 1st level • DGAS2Apel is used to send job records to the GOC for all sites
VO-Dedicated Services • VO-specific services previously run by the INFNGRID Certification Testbed have now moved to production • DEVEL RELEASE: new DEVEL-INFNGRID-3.1 WMS and LB are coming soon into production as VO-dedicated services (ATLAS, CMS, CDF, LHCb) • A total of 18 VO-dedicated services, which will become 25 with the introduction of the 3.1 WMS and LB
FTS channels and VOs • Installed and fully managed via Quattor/YAIM • 3 hosts as frontends, 1 Oracle cluster as backend • Not only LHC VOs: • PAMELA • VIRGO • Full standard T1-T1 + T1-T2 + STAR channels • 51 channel agents • 7 VO agents • (A prototype of) a monitoring tool is available: • Agent and Tomcat log files are parsed and saved into a MySQL DB (see the sketch below) • Web interface: http://argus.cnaf.infn.it/fts/index-FTS.php • Support: • Dedicated department team for tickets • Mailing list: fts-support<at>cnaf.infn.it
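As a rough illustration of the log-parsing idea behind the monitoring prototype, a sketch that scans an FTS agent log for transfer outcomes and loads them into MySQL. The log location, message format and table schema are all assumptions, since the slide does not detail them; only "parse agent/Tomcat logs into a MySQL DB" comes from the source.

```bash
#!/bin/bash
# Hypothetical sketch of the FTS monitoring prototype: parse an agent log
# for completed/failed transfers and insert them into a MySQL table.
# Log path, line format and schema are illustrative assumptions.

LOG=/var/log/glite/fts-channel-agent.log   # assumed log location
DB=ftsmon                                  # assumed database name

# Assumed line format: "<date> <time> ... channel=<NAME> ... Transfer Done|Failed"
grep -E 'Transfer (Done|Failed)' "$LOG" |
while read -r date time rest; do
    ch=$(echo "$rest" | sed -n 's/.*channel=\([A-Za-z0-9-]*\).*/\1/p')
    state=$(echo "$rest" | grep -o 'Done\|Failed')
    echo "INSERT INTO transfers(ts, channel, state)
          VALUES ('$date $time', '$ch', '$state');"
done | mysql "$DB"
```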
Testbeds: M/W Flow from Developers to Production in EGEE and INFNGRID
Testbeds [Diagram: m/w flow from developers to production. JRA1 Developers → SA3 (Certification, CERN) → SA1 PPS (Pre-Production) → SA1 EGEE PS (Production), used by the VOs; in parallel, the JRA1/SA1 Preview TB and the INFN Certification TB feed the INFNGRID Release Team, which serves the SA1 INFNGRID PS (Production) and the SA1 INFNGRID PS (DEVEL Production)] • TESTBEDS • Preview • Certification CERN • Certification INFN • Pre-Production Service (PPS)
Pre-Production Service (PPS) in EGEE • AIM: the last step of m/w testing before deployment at production scale • INPUT: CERN Certification (SA3) • SCOPE: EGEE SA1, about 30 sites spread all over Europe (plus 1 in Taiwan) • COORDINATION: CERN • USERS ALLOWED: all the LHC VOs, diligent, switch and 2 PPS fake VOs • CONTACTS: project-eu-egee-pre-production-service<at>cern.ch http://egee-pre-production-service.web.cern.ch/egee-pre-production-service/ • ACTIVITIES: the main activity is the testing of the installation procedures and basic functionalities of releases/patches, done by site managers. There is limited m/w testing done by users: this is the main PPS issue!
Pre-Production Service (PPS) in EGEE • PPS is run as the Production Service: • SAM TESTs • Tickets from COD • GOCDB registration • Etc…
Italian Participation to PPS [Diagram: PPS hosts (prep-ce-01, prep-ce-02, prep-se-01, cert-ce-01, cert-ce-03, cert-se-01, cert-mon, cert-mon-01, cert-rb-01, cert-bdii-03, cert-voms-01, cert-ui-01, pps-fts, pps-lfc, pps-apt-repo, pccms2, vgridba5) and central services, with the CEs attached to production farms of 150, 68 and 150 slots] • 3 INFN sites: CNAF, PADOVA, BARI • 2 Diligent sites: CNR, ESRIN • All other PPS sites are outside INFN CNAF: 2 CEs with access to the production farm, 1 SE, 1 mon box + central services (VOMS, UI, BDII, WMS, FTS, LFC, APT repo); people: D.Cesini, M.Selmi, D.Dongiovanni PADOVA: 2 CEs with access to the production farm, 1 SE, 1 mon box; people: M.Verlato, S.Bertocco BARI: 1 CE with access to the production farm, 1 SE; people: G.Donvito
Preview Testbed • It is now an official EGEE activity, requested by JRA1 to expose to users those components not yet considered by the CERN (SA3) certification. The aim is to get feedback from end users and site managers. • It is a distributed testbed deployed in a few European sites. • A joint SA1-JRA1 effort is needed in order not to dedicate people at 100% of their time to this activity, as acknowledged by the TCG and PMB • COORDINATOR: JRA1 (Claudio Grandi) • USERS ALLOWED: JRA1/Preview people and all interested users • CURRENT ACTIVITIES: CREAM, gLexec, gPBox • CONTACTS: project-eu-egee-middleware-preview<at>cern.ch • https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEgLitePreviewNowTesting
Italian Participation to the Preview Testbed [Diagram: physical nodes running virtual services (pre-ui-01, pre-ce-01, cert-pbox-01, cert-pbox-02, cert-bdii-02, cert-ce-04/05/06, cert-04, cert-wn-03/04/05/06, cert-se-01, egee-rb-05, egee-rb-08, cream-01…06, rm1-ce, rm1-wn, pad-wn-02) at CNAF, PADOVA and ROMA1; all other Preview sites are outside INFN] • 3 INFN sites: CNAF (D.Cesini, D.Dongiovanni), PADOVA (M.Sgaravatto, M.Verlato, S.Bertocco), ROMA1 (A.Barchiesi) • H/W resources are partly taken from the INFN certification testbed and partly from the JRA1 testbed • Preview services deployed in Italy: PADOVA: 1 CREAM CE + 5 WNs CNAF: 1 WMS 3.1, 1 BDII, 1 gLite CE + 1 WN, 1 UI, 1 DPM SE (for gPBox), 1 WMS 3.1 + 2 gLite CEs + 1 LCG CE + 3 WNs + 2 gPBox servers ROMA1: 1 CE + 1 WN for gPBox tests (to be installed) • Virtual machines are used at CNAF to optimize h/w resources
CERN Certification (SA3) – INFN participation [Diagram: INFN nodes in the SA3 certification testbed at CNAF: wmstest-ce-02 … wmstest-ce-08] • EGEE activity run by SA3 – it is the official EGEE certification testbed, which releases gLite m/w to the PPS and to Production • ACTIVITY: test and certify all gLite components, release packaging • COORDINATION: CERN • INFN involved sites: CNAF (A.Italiano), MILANO (E.Molinari), PADOVA (A.Gianelle) • Italian activities: testing of information providers, DGAS, WMS • Services provided: 1 LSF CE + 1 batch system server on a dedicated machine + 1 DGAS HLR + 1 site BDII + 2 WNs; all services are located at CNAF • Recently the responsibility for WMS testing passed from CERN to INFN – the main focus of SA3-Italia
INFNGRID Certification Testbed • A distributed testbed deployed in a few Italian sites, where EGEE m/w with INFNGRID customizations and INFNGRID grid products are installed for testing purposes by a selected number of end users and grid managers before being released. • It is NOT an official EGEE activity and should not be confused with the CERN certification testbed run by the SA3 EGEE activity. • Most of the servers have migrated to the PREVIEW TESTBED • SITES and PEOPLE: CNAF (D.Cesini, D.Dongiovanni), PADOVA (S.DallaFina, C.Aiftimiei, M.Verlato), TORINO (R.Brunetti, G.Patania, F.Nebiolo), ROMA1 (A.Barchiesi) • CONTACTS: cert-release<at>infn.it • http://grid-it.cnaf.infn.it/certification
INFNGRID Certification Testbed ACTIVITIES / 1 • WMS (CNAF) • There is no longer time to perform detailed tests as in the first phase of the certification TB (https://grid-it.cnaf.infn.it/certification/?INFN_Grid_Certification_Testbed:WMS%2BLB_TEST) • Provide resources to VOs or developers and maintain patched and experimental WMSes. Experimental WMS 3.0: 1 ATLAS WMS, 1 ATLAS LB, 1 CMS WMS + LB, 1 CDF WMS + LB, 1 LHCb WMS + LB. WMS for developers: 2 WMS + LB • The experimental WMSes were heavily used in the last period because they were more stable than the officially released ones, owing to the long time needed for patches to reach the PS (poor support from certification; production usage statistics altered); these PRODUCTION services were recently tagged as INFNGRID DEVEL (see next slide) • Support to JRA1 for the installation of WMS 3.1 in the development TB
INFNGRID Certification Testbed ACTIVITIES / 2 • DEVEL RELEASE (PADOVA/CNAF): • To speed up the flow of patches into the services used by VOs, it does not follow the normal m/w certification process • Based on the official INFNGRID release (3.0) • Wiki page on how to transform a normal INFNGRID release into a DEVEL one: • http://agenda.cnaf.infn.it/materialDisplay.py?contribId=4&materialId=0&confId=18 • An apt repository to maintain control over what goes into the DEVEL release • 1 WMS server at CNAF • Announced via mail after testing at CNAF • Cannot come with all the guarantees of normally certified m/w • DGAS CERTIFICATION (TORINO): • 4 physical servers, virtualized in a very dynamic way
INFNGRID Certification Testbed ACTIVITIES / 3 • RELEASE INFNGRID CERTIFICATION (PADOVA): • 20 virtual machines on 5 physical servers • http://igrelease.forge.cnaf.infn.it • STORM – some resources provided: • 3 physical servers • SERVER VIRTUALIZATION (all sites)
INFNGRID Certification Testbed – Testbed snapshot [Diagram: INFN certification services at CNAF, PADOVA, TORINO and ROMA1: experimental patched WMSes (cert-rb-02…07, egee-rb-04, egee-rb-06, ibm139) passed to DEVEL PROD or used by JRA1; CEs at TORINO and cert-ce-02, cert-wn-01/02/03, cert-bdii-01 plus a server at ROMA1 for DGAS and virtualization tests; resources provided to STORM tests; release VMs Release1…Release5 for Release DEVEL and Release INFNGRID] 5 physical servers × 4 VMs = 20 VMs