COD-11 Summary COD-11 – Athens – November 2006
GOCDB - Summary
• New features (presented by Andy Newton):
  • Improved availability with failovers
  • Support for multiple grids (not just EGEE)
  • Allows services to be associated with VOs
  • Allows resources (sites, nodes, services) to be logically grouped
  • More flexible roles / permission system allowing fine-grained access from a whole grid down to a service level
  • Major improvements to scheduled downtimes
    • Can now be scheduled or unscheduled, have varying degrees of severity, and be extended / shortened in an intelligent manner
  • Improved storage of historical data for metrics purposes
  • Simplified integration with third-party applications through a PL/SQL function library
  • Improved globalisation support
Summary COD-11, Athens, Nov 2006

CIC Operations Portal - Summary
• Presented by Gilles Mathew
• Change of the host machine
• Migration from a web server to a web cluster
• Database migration from MySQL to Oracle
• Source code completely restructured
• EGEE BROADCAST evolving
• VO registration procedure & related
• COD Dashboard
• Monitoring entry point based on SAM alarms
• Changes in “templates” for mails
• Integration of SAM admin’s page

Central EU - Last Shift Experience
• Presented by Marcin Radecki & Małgorzata Krakowian
• Amount of work increased significantly
  • High number of alarms
  • More tickets created and to be handled (suggestion: involve the backup team in the work)
  • Difficult to assess which alarms are important
• Some problems with SAM (under investigation)
  • “No results returned”
  • Results are sometimes out of date
  • No GSTAT monitored in SAM
• Alarms Related
  • Alarm history could be sorted by date
  • Alarms could be switched off automatically if there are no failures during a given period
  • Possible to mask tickets?

Failover Procedures - Summary
• Presented by Alessandro Cavalli & Alfredo Pagano
• Working on CIC Portal Replication (deadline Dec 15th)
• Starting discussions on failover schema
• Next candidate for replication and failover: SAM

Organization for 2007
• Presented by Helene Cordier
• 10 teams in total; SWE is to join in early November.
• Proposals from COD-10 for sharing the workload between lead and backup teams:
  1. Divide a given duty week: 2 days of 100% work for the lead team plus 3 days of 100% work for the backup team, reversed at the following rotation.
  2. Divide by type of test (Resources for the lead team vs. Services for the backup team).
  3. Geographical split, by federation or by alphabetical order of sites (first half handled by the lead team, second half by the backup team).
  4. Creation of new tickets vs. treatment of existing tickets.
• The first round of discussion led to the following preferences:
  • Proposition 3: RU, DE, CE, SE, FR
  • Proposition 4: UKI, IT, CERN, SWE
  • Proposition 1: ASGC
• Proposition 3 was validated at COD-11.

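As an illustration only, the alphabetical variant of the geographical split (first half of the sites to the lead team, second half to the backup team) could be sketched as below. The function name, the site names and the rule for an odd number of sites are assumptions; the slides do not specify them.

```python
def split_sites(sites):
    """Split sites alphabetically between lead and backup teams.

    Returns (lead_sites, backup_sites).
    """
    # Case-insensitive alphabetical order of the site names.
    ordered = sorted(sites, key=str.casefold)
    # ASSUMPTION: with an odd number of sites, the lead team takes the extra one.
    middle = (len(ordered) + 1) // 2
    return ordered[:middle], ordered[middle:]


# Hypothetical site names, for illustration only.
lead, backup = split_sites(["site-c", "Site-A", "site-b"])
print(lead)    # ['Site-A', 'site-b']
print(backup)  # ['site-c']
```
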
SAM Related 1
• Malgorzata: Check why CE/SE host certificate tests are not enabled in SAM.
• Rafal (use of SAM Admin Pages):
  • Check how to pass the desired RB to same-exec.
  • Do not exclude submissions from the SAM Admin Portal from metrics calculations, but provide several RBs.
  • More RBs are needed, and they must be usable for Admin Portal submissions. Find more machines (GNAF, CERN and backup at GNAF?). Find a solution before COD-12.
  • What is the need for the RGMA Mon box for the SAM client? Rafal thinks it’s used, but Alexandre and I think it’s not required.
• Clemens: Check why CE tests for OPS are only executed every 2h.

SAM Related 2
• Kai:
  • Automatically set alarms to ‘off’
  • Alarms not synchronized with the CIC dashboard. Temporary problem yesterday?
• Problem with OSG sites:
  • Gilles will implement a mapping in the dashboard to send these tickets to the US support unit (until those sites are defined in GGUS, by the end of the month)
• Fix SAM submissions for BDII and start creating alarms for it.
• Alessandro:
  • Asked for the requirements for SAM failover (minimal hardware, number of machines, disk, memory, DB requirements, kind of load on the machine for each component, etc.)

Alarms Prioritization
• Several criteria can be used to define alarm priority:
  • Central core service or site-level service
  • How many alarms are masked by the one under consideration
  • Current status of the alarm (if OK -> lower priority)
  • Site size for site services (number of CPUs)
  • Whether the site/service is in Production or in PPS
• Four-level classification:
  • A - Node type
  • B - Number of related/masked alarms
  • C - Current status of the test
  • D - Size of the site

A - Prioritisation following node type
• GROUP 1 (Central Services - Very High Priority)
  • BDII, RB, gRB, VOMS, LFC
• GROUP 2 (Central Services - High Priority)
  • FTS, SRM, MyProxy
• GROUP 3 (Site Level Services - Normal Priority)
  • sBDII, RGMA
• GROUP 4 (Site Level Services - Lower Priority)
  • CE, gCE, SE

Alarms Weight Mechanism
• A Level
  • Group 1: + 40 000 points
  • Group 2: + 30 000 points
  • Group 3: + 20 000 points
  • Group 4: + 10 000 points
• B Level
  • + 1 000 points per masked alarm, capped at 9 000 points when the number of masked alarms is >= 9

Alarms Weight Mechanism
• C Level
  • Number of points according to the following mapping:
    • NA/MAINT/OK: + 0 points
    • INFO: + 100 points
    • NOTE: + 200 points
    • WARNING: + 300 points
    • ERROR: + 400 points
    • CRIT: + 500 points
• D Level
  • Calculation of the "relative size" of the site, a value between 0 and 99 (biggest site = 99)
  • This relative size gives the number of points (a kind of percentage of the biggest site's size)

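The four levels above can be combined into a single number by summing the points. The following is a minimal sketch of that sum, not the implementation used by SAM or the dashboard. The point values and node groups are taken from the slides; the function names, the linear scaling of the relative size by CPU count, and the choice to apply level D to every alarm are assumptions.

```python
# Level A: points per node-type group (values from the slides).
GROUP_POINTS = {1: 40_000, 2: 30_000, 3: 20_000, 4: 10_000}

# Node type -> group, as listed in the node-type prioritisation.
NODE_GROUP = {
    "BDII": 1, "RB": 1, "gRB": 1, "VOMS": 1, "LFC": 1,
    "FTS": 2, "SRM": 2, "MyProxy": 2,
    "sBDII": 3, "RGMA": 3,
    "CE": 4, "gCE": 4, "SE": 4,
}

# Level C: points per current test status (values from the slides).
STATUS_POINTS = {
    "NA": 0, "MAINT": 0, "OK": 0,
    "INFO": 100, "NOTE": 200, "WARNING": 300, "ERROR": 400, "CRIT": 500,
}


def relative_size(site_cpus, biggest_site_cpus):
    """Level D: relative site size between 0 and 99 (biggest site = 99).

    ASSUMPTION: linear scaling by CPU count, rounded down; the slides only
    describe it as a 'kind of percentage of the biggest size'.
    """
    if biggest_site_cpus <= 0:
        return 0
    return max(0, min(99, (99 * site_cpus) // biggest_site_cpus))


def alarm_weight(node_type, masked_alarms, status, site_cpus, biggest_site_cpus):
    """Sum the points of levels A, B, C and D for one alarm."""
    a = GROUP_POINTS[NODE_GROUP[node_type]]
    b = 1_000 * min(masked_alarms, 9)  # capped at 9 000 points
    c = STATUS_POINTS[status]
    # ASSUMPTION: D is added for every alarm; pass site_cpus=0 to ignore it.
    d = relative_size(site_cpus, biggest_site_cpus)
    return a + b + c + d


print(alarm_weight("BDII", 12, "CRIT", 500, 1000))     # 49549
print(alarm_weight("sBDII", 3, "WARNING", 250, 1000))  # 23324
print(alarm_weight("CE", 0, "OK", 1000, 1000))         # 10099
```

Because B never exceeds 9 000, C never exceeds 500 and D never exceeds 99, each level occupies its own digits of the total, so sorting alarms by weight orders them by node group first, then by masked alarms, then by status, then by site size.
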
SAM Related 3
• Gilles:
  • Agreed to give weights to alarms
  • Agreed to define a classification based on service type, number of related problems, status of the test and size of the site
  • The rules shown previously can be tuned later.
  • The alarm priority can be passed to the GGUS priority field, which is currently unused.
• Next meeting: COD-12 @ Karlsruhe in February.
• Before then, monthly phone meetings are to be established.

SAM List of Actions 1
• Actions on Judit:
  • Check SAM results with N.A.
  • Check bug when JS fails and others show OK
  • Check bug when all shows OK but result is CRIT or ERROR
  • Check why CE submissions show every 2h instead of 1h
• Actions on Piotr:
  • Provide a description of SAM to replace the SFT section in the OPS manual

SAM List of Actions 2
• Actions on David:
  • Add Alarm Masking Rules to the Operations Manual
  • Check why GSTAT tests are in the framework but results are not displayed in SAM
  • Implement alarm weights and hourly recalculation of weights. Deadline: December 1st
  • Provide Alessandro with the requirements for SAM failover
  • Provide Osman with all available services for each node in the OPS VO. Deadline: ASAP