
COD-11 Summary



Presentation Transcript


  1. COD-11 Summary COD-11 – Athens – November 2006

  2. GOCDB - Summary
     • New features (presented by Andy Newton):
       • Improved availability with failovers
       • Support for multiple grids (not just EGEE)
       • Allows services to be associated with VOs
       • Allows resources (sites, nodes, services) to be logically grouped
       • More flexible roles / permission system allowing fine-grained access from a whole grid down to a service level
       • Major improvements to scheduled downtimes
         • Can now be scheduled or unscheduled, have varying degrees of severity and be extended / shortened in an intelligent manner
       • Improved storage of historical data for metrics purposes
       • Simplified integration with third-party applications through PL/SQL function library
       • Improved globalisation support
     Summary COD-11, Athens, Nov 2006

  3. CIC Operations Portal - Summary
     • Presented by Gilles Mathew
     • Change of the host machine
     • Migration from a web server to a web cluster
     • Database migration from MySQL to Oracle
     • Source code completely restructured
     • EGEE BROADCAST evolving
     • VO registration procedure & related
     • COD Dashboard
     • Monitoring entry point based on SAM alarms
     • Changes in “templates” for mails
     • Integration of SAM admin’s page

  4. Central EU - Last Shift Experience
     • Presented by Marcin Radecki & Małgorzata Krakowian
     • Amount of work increased significantly
       • High number of alarms
       • More tickets created to handle (suggestion: involve the backup team in the work)
       • Difficulty assessing which alarms are important
     • Some problems with SAM (under investigation)
       • “No results returned”
       • Results sometimes out of date
       • GSTAT is not monitored in SAM
     • Alarm-related requests
       • Alarm history could be sortable by date
       • Alarms could be switched off automatically if no failures occur during a given period
       • Is it possible to mask tickets?

  5. Failover Procedures - Summary
     • Presented by Alessandro Cavalli & Alfredo Pagano
     • Working on CIC Portal replication (deadline Dec 15th)
     • Starting discussions on failover schema
     • Next candidate for replication and failover: SAM

  6. Organization for 2007
     • Presented by Helene Cordier
     • 10 teams in total: SWE is to join in early November.
     • Proposals from COD-10 for sharing the workload between lead and backup teams:
       • Proposition 1: Divide a given duty week (2 days of 100% work for the lead team plus 3 days of 100% work for the backup team), reversed at the following rotation.
       • Proposition 2: Divide by the type of tests (Resources for the lead team vs. Services for the backup team).
       • Proposition 3: Geographical split, by federations or by sites in alphabetical order (first half handled by the lead team, second half by the backup team).
       • Proposition 4: Creation of new tickets vs. treatment of existing tickets.
     • First round of discussion led to the following preferences:
       • Proposition 3: RU, DE, CE, SE, FR
       • Proposition 4: UKI, IT, CERN, SWE
       • Proposition 1: ASGC
     • Proposition 3 was validated at COD-11.

  7. SAM Related 1
     • Malgorzata: Check why CE/SE host certificate tests are not enabled in SAM.
     • Rafal (use of SAM Admin Pages):
       • Check how to pass the desired RB to same-exec.
       • Do not exclude submissions from the SAM Admin Portal from metrics calculations; provide several RBs instead.
       • More RBs are needed, and they must be usable for Admin Portal submissions. Find more machines (GNAF, CERN and backup at GNAF?). Find a solution before COD-12.
       • What is the need for the RGMA Mon box for the SAM client? Rafal thinks it’s used but Alexandre and I think it’s not required.
     • Clemens: Check why CE tests for OPS are only executed every 2h.

  8. SAM Related 2
     • Kai:
       • Automatic setting of alarms to ‘off’
       • Alarms not synchronized with the CIC dashboard. Temporary problem yesterday?
       • Problem with OSG sites:
         • Gilles will implement a mapping in the dashboard to send these tickets to the US support unit (until those sites are defined in GGUS, by the end of the month)
       • Fix SAM submissions for BDII and start creating alarms for it.
     • Alessandro:
       • Asked for the requirements for SAM failover (minimal hardware, number of machines, disk, memory, DB requirements, kind of load on the machine for each component, etc.)

  9. Alarms Prioritization
     • Different criteria can be used to define alarm priority:
       • Central Core Service or Site level Service
       • How many alarms are masked by the one being considered
       • What is the current status of the alarm (if OK -> Lower priority)
       • Site size for site services (number of CPUs)
       • Whether the site/service is in Production or in PPS
     • Four-level classification:
       • A - Node type
       • B - Number of related/masked alarms
       • C - Current status of the test
       • D - Size of the site

  10. A - Prioritisation following node type
     • GROUP 1 (Central Services - Very High Priority)
       • BDII, RB, gRB, VOMS, LFC
     • GROUP 2 (Central Services - High Priority)
       • FTS, SRM, MyProxy
     • GROUP 3 (Site Level Services - Normal Priority)
       • sBDII, RGMA
     • GROUP 4 (Site Level Services - Lower Priority)
       • CE, gCE, SE

  11. Alarms Weight Mechanism
     • A Level
       • Group 1: + 40 000 points
       • Group 2: + 30 000 points
       • Group 3: + 20 000 points
       • Group 4: + 10 000 points
     • B Level
       • + 1 000 points per masked alarm, with a maximum of 9 000 points for 9 or more masked alarms

  12. Alarms Weight Mechanism
     • C Level
       • Number of points according to the following mapping:
         • NA/MAINT/OK: + 0 points
         • INFO: + 100 points
         • NOTE: + 200 points
         • WARNING: + 300 points
         • ERROR: + 400 points
         • CRIT: + 500 points
     • D Level
       • Calculation of the "relative size" of the site, a value between 0 and 99 (biggest site = 99)
       • This relative size gives the number of points (a kind of percentage of the biggest size)
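The four levels on slides 10 to 12 can be combined into a single score. The sketch below is illustrative only and is not the SAM or CIC dashboard implementation. The function and dictionary names are hypothetical. It assumes that the points from levels A to D are simply added, and that "relative size" scales linearly with CPU count and is truncated to an integer; the slides do not give the exact formula for level D.

```python
# Node type -> group, as listed on slide 10.
NODE_GROUP = {
    "BDII": 1, "RB": 1, "gRB": 1, "VOMS": 1, "LFC": 1,
    "FTS": 2, "SRM": 2, "MyProxy": 2,
    "sBDII": 3, "RGMA": 3,
    "CE": 4, "gCE": 4, "SE": 4,
}

# A level: points per group (slide 11).
GROUP_POINTS = {1: 40_000, 2: 30_000, 3: 20_000, 4: 10_000}

# C level: points per current test status (slide 12).
STATUS_POINTS = {
    "NA": 0, "MAINT": 0, "OK": 0,
    "INFO": 100, "NOTE": 200, "WARNING": 300,
    "ERROR": 400, "CRIT": 500,
}


def relative_size(site_cpus, biggest_site_cpus):
    """D level: value between 0 and 99, where the biggest site gets 99.

    Assumption: linear scaling on CPU count with integer truncation.
    """
    if biggest_site_cpus <= 0:
        return 0
    scaled = (99 * site_cpus) // biggest_site_cpus
    return max(0, min(99, scaled))


def alarm_weight(node_type, masked_alarms, status, site_cpus, biggest_site_cpus):
    """Sum the A, B, C and D level points for one alarm (assumed additive)."""
    a = GROUP_POINTS[NODE_GROUP[node_type]]
    b = 1_000 * min(masked_alarms, 9)  # capped at 9 000 points
    c = STATUS_POINTS[status]
    d = relative_size(site_cpus, biggest_site_cpus)
    return a + b + c + d
```

Under these assumptions, a CRIT alarm on a BDII that masks 12 other alarms at the largest site scores 40 000 + 9 000 + 500 + 99 = 49 599. Because each level occupies its own decimal range, sorting by the total orders alarms first by node group, then by number of masked alarms, then by status, then by site size.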

  13. SAM Related 3
     • Gilles:
       • Agreed to give weights to alarms
       • Agreed to define a classification based on service type, number of related problems, status of the test and size of the site
       • The rules shown previously can be tuned later.
       • The alarm priority can be passed to the GGUS priority field, which is currently unused.
     • Next meeting: COD-12 in Karlsruhe in February.
     • Before then, monthly phone meetings will be established.

  14. SAM List of Actions 1
     • Actions on Judit:
       • Check SAM results with N.A.
       • Check bug when JS fails and others show OK
       • Check bug when all show OK but result is CRIT or ERROR
       • Check why CE submissions show every 2h instead of every 1h
     • Actions on Piotr:
       • Provide a description of SAM to replace the SFT section in the OPS manual

  15. SAM List of Actions 2
     • Actions on David:
       • Add Alarm Masking Rules to the Operations Manual
       • Check why GSTAT tests are in the framework but results are not displayed in SAM
       • Implement alarm weights and hourly recalculation of weights. Deadline: December 1st
       • Provide Alessandro with the requirements for SAM failover
       • Provide Osman with all available services for each node in the OPS VO. Deadline: ASAP
