GDB, 13th June 2012
Information System Status and Evolution
Maria Alandes Pradillo, CERN
CERN IT Department, Grid Technology Group
Overview
• Current status
  • Caching BDII
  • Failover
  • glue-validator
  • Service Info Provider
  • Glue 2.0 status
• Evolution
  • TEGs recommendations
  • Ongoing work
  • Future work
• Conclusions
Current status of the Information System
Service Quality
Caching BDII
• Middleware: caching mode is now available
  • EMI 1 (SL5)
    • v1.2.0 -> EMI 1 Update 13 on 17.02.2012.
      • Not released in UMD.
      • Caching mode is not the default; it has to be configured.
    • v1.3.0 -> EMI 1 Update 14 on 16.03.2012.
      • Released in UMD-1.6.0 repository on 02.04.2012.
      • Caching mode is the default.
  • EMI 2 (SL5, SL6)
    • v1.3.0 -> first EMI 2 release on 21.05.2012.
      • Not yet released in UMD.
• What about gLite 3.2?
  • BDII has not been supported since 04.06.2012!
Caching BDII
• Current BDII deployment in WLCG:
  • ~80 top level BDIIs
    • No. of EMI 1 & 2 top level BDIIs: 19
    • No. of gLite top level BDIIs: 17
    • 29 top level BDIIs publishing wrong version information!
  • It is difficult to know whether caching has been enabled or not
    • We know the bdii rpm version but we can't look into the bdii.conf file
Caching BDII
• CERN instance operational experience
  • Caching BDII has been installed for several months.
  • No noticeable change in service operation.
  • Reduction in the number of tickets!
    • Caching BDII hides issues caused by short-lived network glitches.
• Experiments' experience
  • CMS
    • No reported issues related to missing information.
    • Before, one problem per month.
  • ATLAS
    • "Much better than before"
  • LHCb and ALICE not affected
Thanks to Ricardo (CERN IT-PES) and Andrea, Alessandro, Stefan and Maarten (CERN IT-ES)
Caching BDII
• How to configure caching mode?
  • Use variable BDII_DELETE_DELAY in /etc/bdii/bdii.conf
    • Default is caching on (12h)
      • BDII_DELETE_DELAY=43200
    • For caching off
      • BDII_DELETE_DELAY=0
  • If you are using YAIM
    • In /opt/glite/yaim/defaults/glite-bdii_top.pre
      • BDII_DELETE_DELAY=43200
    • Redefine BDII_DELETE_DELAY in your site-info.def
• Documentation improved
  • https://tomtools.cern.ch/confluence/display/IS
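The BDII_DELETE_DELAY setting above can be checked with a few lines of Python. This is only a minimal sketch: it assumes bdii.conf uses plain KEY=VALUE lines with optional "#" comments, and it assumes that a missing variable means the 12h default applies (as in v1.3.0). The function names are hypothetical and are not part of the BDII packages.

```python
def parse_delete_delay(conf_text, default=43200):
    """Return BDII_DELETE_DELAY (seconds) from bdii.conf-style text.

    Assumption: KEY=VALUE lines, '#' starts a comment, and the 12h
    default (43200 s) applies when the variable is not set.
    """
    delay = default
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "BDII_DELETE_DELAY":
            # tolerate optional quotes around the value
            delay = int(value.strip().strip("\"'"))
    return delay


def caching_enabled(conf_text):
    """Caching is on for any positive delay and off for 0."""
    return parse_delete_delay(conf_text) > 0
```

A site manager could feed it the contents of /etc/bdii/bdii.conf, for example `caching_enabled(open("/etc/bdii/bdii.conf").read())`.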
Failover
• Mechanism for failover:
  • GFAL, lcg_util, lcg-info and lcg-infosites clients.
  • LCG_GFAL_INFOSYS variable
    • List of BDIIs
    • Syntax: LCG_GFAL_INFOSYS=my-bdii1.$MY_DOMAIN:port1[,my-bdii2.$MY_DOMAIN:port2[...]]
    • YAIM variable is BDII_LIST.
  • To be configured with 3 top level BDIIs:
    • 1st and 2nd in the list: Regional BDIIs
    • 3rd in the list: CERN BDII
• Currently there are no means to automatically check whether this variable is properly defined.
• It would be good to reduce the number of top level BDIIs -> ~80 top level BDIIs deployed now!
• https://tomtools.cern.ch/confluence/display/IS/WLCG_Support_Proposal
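Since no automatic check of LCG_GFAL_INFOSYS exists yet, the sketch below shows what a basic syntax check could look like. It only verifies the comma-separated host:port form and the recommended count of three entries; it cannot tell whether the first two entries really are regional BDIIs. The function names, the example.org hosts and port 2170 used in the usage note are illustrative assumptions.

```python
def parse_infosys(value):
    """Split 'host1:port1,host2:port2' into a list of (host, port)."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            raise ValueError("empty entry in BDII list")
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("expected host:port, got %r" % item)
        endpoints.append((host, int(port)))
    return endpoints


def check_infosys(value, expected=3):
    """Return a list of problems; an empty list means the value looks fine."""
    try:
        endpoints = parse_infosys(value)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    if len(endpoints) != expected:
        problems.append("expected %d BDIIs, found %d" % (expected, len(endpoints)))
    if len(set(endpoints)) != len(endpoints):
        problems.append("duplicate BDII entries")
    return problems
```

For instance, `check_infosys("bdii1.example.org:2170,bdii2.example.org:2170,bdii3.example.org:2170")` returns an empty list, while a single-entry value is reported as having too few BDIIs.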
Data Quality
glue-validator
• glue-validator
  • Implemented and distributed in EMI 1 and 2.
    • It is part of BDII core.
  • Clear documentation for site managers.
    • https://tomtools.cern.ch/confluence/display/IS
• To be done:
  • Services to set a dependency on glue-validator.
    • Via BDII core dependency
  • Cron job to store glue-validator output in log file.
  • Nagios probes to monitor log file.
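The planned Nagios probe on the glue-validator log could follow the usual plugin pattern of a status code plus a one-line message. The sketch below is hypothetical: the slides do not describe the log format, so it simply assumes that problem lines contain the words ERROR or WARNING. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin codes


def classify_log(lines):
    """Map log lines to a (status, message) pair.

    Assumption: the cron job writes lines containing 'ERROR' or
    'WARNING' when validation problems are found.
    """
    lines = list(lines)  # allow file handles, which can be read only once
    errors = sum(1 for line in lines if "ERROR" in line)
    warnings = sum(1 for line in lines if "WARNING" in line)
    if errors:
        return CRITICAL, "CRITICAL: %d error(s), %d warning(s)" % (errors, warnings)
    if warnings:
        return WARNING, "WARNING: %d warning(s)" % warnings
    return OK, "OK: no problems reported"


def probe(path):
    """Read the log file at `path`; report UNKNOWN if it cannot be read."""
    try:
        with open(path) as handle:
            return classify_log(handle)
    except OSError as exc:
        return UNKNOWN, "UNKNOWN: cannot read %s (%s)" % (path, exc)
```

A wrapper script would print the message and exit with the returned status so that Nagios can pick it up.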
Service Information Provider
• glite-info-provider-service is an information provider that publishes information about the service itself.
  • It is used by most services.
• Official support from STFC recently ended.
  • Maintenance and release into EPEL need to be taken care of.
  • Many services that want to move to EPEL depend on it.
• Future support is not clear
  • First exchanges with John Gordon
  • Need to come to a conclusion soon
Glue 2.0
Glue 2.0 status
• REMINDER: benefits of Glue 2.0
  • Interoperability
  • Simplified model
    • i.e. service-centric.
  • Additional use cases
    • e.g. middleware version information.
    • e.g. support for multi-core processors. https://tomtools.cern.ch/confluence/display/IS/GLUE2UseCaseMultiCoreJobs
• Deployment status
  • In October 2011: 38.58% (147 sites out of 381)
  • In June 2012: 53.52% (205 sites out of 383)
• EGI is following up Glue 2.0 deployment
  • By the end of September 2012, "any site-BDII instance earlier than gLite 3.2.10-1 (not GLUE 2.0 compatible)" has to be decommissioned; otherwise the site will be suspended.
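The deployment percentages quoted above follow directly from the site counts. The short sketch below reproduces the arithmetic; the function name is hypothetical.

```python
def deployment_share(sites_with_glue2, total_sites):
    """Percentage of sites publishing Glue 2.0, rounded to two decimals."""
    return round(100.0 * sites_with_glue2 / total_sites, 2)


october_2011 = deployment_share(147, 381)  # 38.58
june_2012 = deployment_share(205, 383)     # 53.52
# Growth between the two snapshots, in percentage points
growth = round(june_2012 - october_2011, 2)  # 14.94
```

So Glue 2.0 coverage grew by roughly 15 percentage points between the two snapshots.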
Glue 2.0 status
• As far as middleware is concerned…
  • Information system is GLUE 2.0 compliant since EMI 1 -> Some services MAY publish themselves.
  • From EMI 2, all services MUST publish themselves.
• Glue 2.0 is now a rollout issue
  • Until bugs in the middleware are reported
• Note that clients (GFAL, etc.) can't be migrated until Glue 2.0 is fully deployed.
Evolution of the Information System
Input
• IS Evolution based on TEGs recommendations
  • Input from former WLCG Information Officers used by TEGs.
  • Operations and Workload Management TEGs with recommendations for IS.
Ops TEG recommendations
• Improve Stability
  • Short term: caching BDII
  • Long term: refactor information system into "Three pillars":
    • Service data (static) for Service Discovery
    • State data (dynamic) for Service Monitoring
    • Metadata (quasi-static) for Metadata Catalogue
• Improve Validity
  • Short term: validation tools
  • Long term: removing unnecessary data
• Improve Accuracy
  • Information certified and audited -> better client tools
WM TEG recommendations
• Simplify Service Discovery
  • Most important use case for experiments
  • This is addressed by the "Three pillars".
• Define a way to specify:
  • Max number of cores supported by a site
  • Multi-core jobs support available in the site
  • Glue 2.0 to the rescue!
Ongoing work
IS Panorama
[Diagram: the current information system arranged under three pillars: Service Data (static), State Data (dynamic) and MetaData (quasi-static). Labelled elements: grid information, Glue schema, Grid services, info providers, Resource BDII, Site BDII, Top BDII, glue-validator, FCR, GOCDB (filled manually by site managers), EMIR, SAM, VO configuration databases, and the clients ginfo, lcg-info and lcg-infosites.]
EMI Registry
• EMIR is developed at JUELICH (Germany) and NIIF (Hungary) within the EMI project.
• Candidate for:
  • Service catalogue.
  • Unique entry point for Service Discovery.
• Main features:
  • Federated service registry.
  • It stores service registration records for all EMI services.
  • Glue 2.0.
  • Robust, scalable and secure.
• Status:
  • Distributed in EMI 2: http://www.eu-emi.eu/emi-2-matterhorn-products/-/asset_publisher/B4Rk/content/emir
  • Ready for EGI evaluation.
ginfo
• ginfo is developed at CERN within the CRISP project.
• Candidate for:
  • Service discovery client.
  • Replacement of lcg-info and lcg-infosites.
• Main features:
  • Based on Glue 2.0.
  • Now it queries the top level BDII.
  • Very easy to use EMI registry.
• Status
  • Under internal review.
• Why not SAGA?
  • Complex OGF standard
  • API only
  • Hasn't been picked up by the community
Future work
IS Future Panorama
[Diagram: the future information system arranged under three pillars: Service Data (static), State Data (dynamic) and MetaData (quasi-static). Labelled elements: grid information, Grid services, Resource Information Service, service record, EMIR, GOCDB (filled manually by site managers), probes for Glue 2.0, messaging system, Messaging Based Monitoring System, SAM, MetaData Catalogue, VO configuration databases, and the client ginfo.]
Future work
• Information Service Monitoring
  • Based on Messaging technology
  • What metrics are needed?
• Information Service Metadata
  • Use cases
    • Who needs it? For what?
  • Central catalogue?
Conclusions
• Current status
  • Short term recommendations from TEGs implemented:
    • Caching BDII.
    • Glue validator.
  • Glue 2.0 still to be widely deployed.
  • Service Provider future support.
• Future work (Three Pillars)
  • EMIR under review.
  • ginfo under review.
  • IS monitoring and metadata still to be understood.