SAM Develop. & Integration

Explore how site availability metrics, SAM alarms, FCR Portal, and SAM Portal help track and manage service availability. Learn the procedures, data storage, and automation of alarms for efficient monitoring.


Presentation Transcript


  1. SAM Develop. & Integration David Collados, CERN IT/GD COD-11 - Athens

  2. Outline
     • Site Availability Metrics
     • SAM Alarms
     • FCR Portal
     • SAM Portal
     SAM Status, COD-11, Athens, 2006-11-08

  4. Site Availability Metrics
     • How availability is calculated:
       • Possible service status: 10 - OK, 20 - Down, 30 - Degraded
       • Per-site service status: the OR of the individual services (Site BDII, CE, SE)
       • Per site: the AND of each service status
     • Daily and hourly availability for the T1s and T0:
       http://lcg-sam.cern.ch:8080/sqldb/site_avail.xsql
     • TODO:
       • Display availability at service level.
       • The same for any site.
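
The OR/AND rule on this slide can be sketched in Python as follows. This is a minimal illustration, not the SAM implementation: the function names and the input layout are hypothetical, and treating Degraded (30) as "not up" is an assumption, since the slide does not say how Degraded enters the OR. Only the status codes and the OR/AND structure come from the slide.

```python
# Status codes as listed on the slide.
OK, DOWN, DEGRADED = 10, 20, 30


def service_is_up(instance_statuses):
    """OR step: a service type counts as up if any of its instances is OK.

    Assumption: only OK (10) counts as up; Degraded (30) is treated like Down.
    """
    return any(status == OK for status in instance_statuses)


def site_is_available(statuses_by_service):
    """AND step: the site is available only if every service type is up.

    A site with no reported services is treated as unavailable (assumption).
    """
    if not statuses_by_service:
        return False
    return all(service_is_up(s) for s in statuses_by_service.values())


# Hypothetical site: two CEs, one of which is down.
site = {"Site BDII": [OK], "CE": [DOWN, OK], "SE": [OK]}
print(site_is_available(site))
```

In this example the site still counts as available because one CE is OK; if its only SE were Down, the AND step would make the whole site unavailable.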

  6. SAM Alarms 1/3
     • Procedure to trigger an alarm:
       • the site is not in maintenance, AND
       • the node belongs to a certified site, AND
       • the node is not in maintenance, AND
       • the VO is 'OPS', AND
       • the service is not one of 'SE', 'SRM' or 'LFC', AND
       • the test status is > 40 (ERROR=50 or CRIT=60), AND
       • the test is critical, AND
       • there is no existing alarm for that test, VO and node.
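
The trigger procedure above is a conjunction of eight checks, which can be written as a single predicate. The sketch below assumes a hypothetical `SamResult` record and a set of `(test, vo, node)` keys for existing alarms; neither the record layout nor the names are taken from SAM itself, only the conditions are.

```python
from dataclasses import dataclass, replace

EXCLUDED_SERVICES = {"SE", "SRM", "LFC"}
ALARM_THRESHOLD = 40  # statuses above this are ERROR (50) or CRIT (60)


@dataclass(frozen=True)
class SamResult:
    """One test result plus the context the checks need (hypothetical layout)."""
    vo: str
    service: str
    node: str
    test: str
    status: int
    test_is_critical: bool
    site_is_certified: bool
    site_in_maintenance: bool
    node_in_maintenance: bool


def should_trigger_alarm(result, existing_alarms):
    """Return True only if every condition from the slide holds.

    existing_alarms is a set of (test, vo, node) tuples for alarms already raised.
    """
    return (
        not result.site_in_maintenance
        and result.site_is_certified
        and not result.node_in_maintenance
        and result.vo == "OPS"
        and result.service not in EXCLUDED_SERVICES
        and result.status > ALARM_THRESHOLD
        and result.test_is_critical
        and (result.test, result.vo, result.node) not in existing_alarms
    )


# Hypothetical failing critical test on a CE, then the same result on an SRM.
failing = SamResult("OPS", "CE", "ce01.example.org", "example-test", 50,
                    True, True, False, False)
print(should_trigger_alarm(failing, set()))
print(should_trigger_alarm(replace(failing, service="SRM"), set()))
```

The first call satisfies every condition; the second fails only the service check, which is enough to suppress the alarm.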

  7. SAM Alarms 2/3
     • Data stored in each alarm:
       • alarmid
       • vo
       • test
       • node
       • test exec time
       • alarm status (new, assigned, masked, off)
       • update time
       • ticket id (GGUS)
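
As a rough picture of the alarm record listed above, the fields map naturally onto a small data class. The field types, the use of an enum for the status, and the optional ticket id are assumptions made for illustration; the slide gives only the field names and the four status values.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class AlarmStatus(Enum):
    """The four alarm states named on the slide."""
    NEW = "new"
    ASSIGNED = "assigned"
    MASKED = "masked"
    OFF = "off"


@dataclass
class Alarm:
    """One alarm record; types are assumed, field names follow the slide."""
    alarmid: int
    vo: str
    test: str
    node: str
    test_exec_time: datetime
    status: AlarmStatus
    update_time: datetime
    ticket_id: Optional[str] = None  # GGUS ticket, assumed absent until assigned


# Hypothetical alarm for a made-up node and test name.
alarm = Alarm(
    alarmid=1,
    vo="OPS",
    test="example-test",
    node="ce01.example.org",
    test_exec_time=datetime(2006, 11, 8, 10, 0),
    status=AlarmStatus.NEW,
    update_time=datetime(2006, 11, 8, 10, 5),
)
print(alarm.status.value, alarm.ticket_id)
```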

  8. SAM Alarms 3/3
     • Automatic alarm masking:
       • If there are one or more alarms with status='new' for this VO, node and test, the new alarm is triggered as masked.
     • Rules defining test relationships among alarms:
       http://lcg-sam.cern.ch:8080/alarms/mask_alarm.xsql
     • More work is expected to improve this area.
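
The masking rule stated above can be sketched as a function that picks the initial status of a newly triggered alarm. This is a simplified illustration under stated assumptions: existing alarms are represented as plain `(vo, node, test, status)` tuples, the function name is hypothetical, and the test-relationship rules published at the URL above are not modelled, so only an identical VO, node and test are matched.

```python
def initial_alarm_status(vo, node, test, existing_alarms):
    """Return 'masked' if a 'new' alarm already exists for this VO, node and test.

    existing_alarms is an iterable of (vo, node, test, status) tuples.
    Related-test rules are not modelled here (assumption: exact match only).
    """
    for a_vo, a_node, a_test, a_status in existing_alarms:
        if (a_vo, a_node, a_test) == (vo, node, test) and a_status == "new":
            return "masked"
    return "new"


# Hypothetical existing alarms for a made-up node and test name.
existing = [
    ("OPS", "ce01.example.org", "example-test", "new"),
    ("OPS", "ce02.example.org", "example-test", "assigned"),
]
print(initial_alarm_status("OPS", "ce01.example.org", "example-test", existing))
print(initial_alarm_status("OPS", "ce02.example.org", "example-test", existing))
```

The first call finds a matching alarm in state 'new', so the incoming alarm starts as masked; the second finds only an 'assigned' alarm, so the incoming alarm starts as new.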

  10. FCR Portal
      • For VO managers to:
        • Manipulate top-level BDIIs.
        • Set critical tests for all services.
        • Display the site resources (CE & SE) to be used by the VO (or remove ones that are not very stable).
        • Do the same for central services (RBs, BDIIs, etc.).
      • Changes generate an LDIF file that the BDII picks up every 2 minutes.
      • Service availability plots will depend on the new set of critical tests.
      • Available at https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi

  12. SAM Portal
      • Judit is working on the display of the latest results per site and service; this is expected around Christmas.
      • Available at https://lcg-sam.cern.ch:8443/sam/sam.py

  13. The End. Comments? Questions?
