Explore how site availability metrics, SAM alarms, FCR Portal, and SAM Portal help track and manage service availability. Learn the procedures, data storage, and automation of alarms for efficient monitoring.
SAM Development & Integration
David Collados, CERN IT/GD
COD-11 - Athens
Outline
• Site Availability Metrics
• SAM Alarms
• FCR Portal
• SAM Portal
SAM Status, COD-11, Athens, 2006-11-08
Site Availability Metrics
• How availability is calculated:
  • Possible service status values:
    • 10 - OK
    • 20 - Down
    • 30 - Degraded
  • Per-site service status: the OR of the individual services (Site BDII, CE, SE).
  • Per-site status: the AND of each service status.
• Daily and hourly availability for the T0 and T1s:
  • http://lcg-sam.cern.ch:8080/sqldb/site_avail.xsql
• TODO:
  • Display availability at service level.
  • Provide the same for any site.
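
The OR/AND combination on this slide can be sketched in a few lines of Python. This is only an illustration, not the SAM implementation: the function names are hypothetical, the OR is read as "at least one instance of a service type is OK", and a Degraded instance is assumed not to count as OK, which the slide does not specify.

```python
from typing import Dict, List

# Status codes as listed on the slide.
OK, DOWN, DEGRADED = 10, 20, 30


def service_is_ok(instance_statuses: List[int]) -> bool:
    """OR over the individual instances: one OK instance is enough.

    Assumption: only status 10 (OK) counts; Degraded is treated as not OK.
    """
    return any(status == OK for status in instance_statuses)


def site_is_available(services: Dict[str, List[int]]) -> bool:
    """AND over the service types of the site."""
    return all(service_is_ok(statuses) for statuses in services.values())


def availability(samples: List[bool]) -> float:
    """Fraction of samples (e.g. one per hour) in which the site was available."""
    return sum(samples) / len(samples) if samples else 0.0


# Hypothetical site: one of two CEs is up, but the only SE is degraded.
site = {"Site BDII": [OK], "CE": [DOWN, OK], "SE": [DEGRADED]}
print(site_is_available(site))  # False
```

Daily availability would then be `availability()` applied to the 24 hourly results, assuming one sample per hour.
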
SAM Alarms 1/3
• Procedure to trigger an alarm:
  • the site is not in maintenance, AND
  • the node belongs to a certified site, AND
  • the node is not in maintenance, AND
  • the VO is 'OPS', AND
  • the service is not in ('SE', 'SRM' or 'LFC'), AND
  • the test status is > 40 (ERROR=50 and CRIT=60), AND
  • the test is critical, AND
  • there is no alarm already for that test, VO and node.
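
The trigger conditions above amount to a single boolean predicate. A minimal sketch follows, assuming a flat record per test result; the class `SamResult`, its field names, the case-insensitive VO comparison, and the example node and test names are all hypothetical and not taken from the SAM code or schema.

```python
from dataclasses import dataclass, replace
from typing import Set, Tuple

EXCLUDED_SERVICES = {"SE", "SRM", "LFC"}  # services that never raise alarms
ALARM_THRESHOLD = 40                      # ERROR=50 and CRIT=60 exceed it


@dataclass(frozen=True)
class SamResult:
    """Hypothetical flat view of one test result plus its site/node state."""
    vo: str
    node: str
    test: str
    service: str
    status: int
    is_critical: bool
    site_certified: bool
    site_in_maintenance: bool
    node_in_maintenance: bool


def should_trigger_alarm(r: SamResult,
                         existing: Set[Tuple[str, str, str]]) -> bool:
    """True only if every condition on the slide holds.

    `existing` holds (test, vo, node) keys of alarms that already exist.
    """
    return (
        not r.site_in_maintenance
        and r.site_certified
        and not r.node_in_maintenance
        and r.vo.upper() == "OPS"  # assumption: VO name compared ignoring case
        and r.service not in EXCLUDED_SERVICES
        and r.status > ALARM_THRESHOLD
        and r.is_critical
        and (r.test, r.vo, r.node) not in existing
    )


# Hypothetical example: a failing critical test on a CE.
result = SamResult(vo="OPS", node="ce01.example.org", test="job-submit",
                   service="CE", status=50, is_critical=True,
                   site_certified=True, site_in_maintenance=False,
                   node_in_maintenance=False)
print(should_trigger_alarm(result, set()))                          # True
print(should_trigger_alarm(replace(result, service="SRM"), set()))  # False
```
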
SAM Alarms 2/3
• Data stored in each alarm:
  • alarmid
  • vo
  • test
  • node
  • test execution time
  • alarm status (new, assigned, masked, off)
  • update time
  • ticket id (GGUS)
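
For illustration, the alarm record listed above could be modelled as follows. The field types, the `Optional` ticket id, and the example values are assumptions; the slide only names the fields and the four status values.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class AlarmStatus(Enum):
    """The four alarm states named on the slide."""
    NEW = "new"
    ASSIGNED = "assigned"
    MASKED = "masked"
    OFF = "off"


@dataclass
class Alarm:
    alarmid: int
    vo: str
    test: str
    node: str
    test_exec_time: datetime
    status: AlarmStatus
    update_time: datetime
    ticket_id: Optional[int] = None  # assumption: unset until a GGUS ticket exists


# Hypothetical record; node and test names are made up.
alarm = Alarm(alarmid=1, vo="OPS", test="job-submit", node="ce01.example.org",
              test_exec_time=datetime(2006, 11, 8, 9, 0),
              status=AlarmStatus.NEW,
              update_time=datetime(2006, 11, 8, 9, 5))
```
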
SAM Alarms 3/3
• Automatic alarm masking:
  • If one or more alarms with status='new' already exist for this VO, node and test, the new alarm is triggered as masked.
  • Rules defining test relationships among alarms:
    • http://lcg-sam.cern.ch:8080/alarms/mask_alarm.xsql
  • More work is expected to improve this area.
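
The masking rule can be read as a choice of initial status for a newly triggered alarm. The sketch below follows that literal reading only. It does not reproduce the test-relationship rules behind the URL above, and the tuple layout and function name are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical row layout: (vo, node, test, status)
AlarmRow = Tuple[str, str, str, str]


def initial_alarm_status(existing: Iterable[AlarmRow],
                         vo: str, node: str, test: str) -> str:
    """Return 'masked' if a 'new' alarm already exists for (vo, node, test)."""
    for a_vo, a_node, a_test, a_status in existing:
        if (a_vo, a_node, a_test) == (vo, node, test) and a_status == "new":
            return "masked"
    return "new"


# Hypothetical existing alarms.
alarms = [
    ("OPS", "ce01.example.org", "job-submit", "new"),
    ("OPS", "ce02.example.org", "job-submit", "assigned"),
]
print(initial_alarm_status(alarms, "OPS", "ce01.example.org", "job-submit"))  # masked
print(initial_alarm_status(alarms, "OPS", "ce02.example.org", "job-submit"))  # new
```
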
FCR Portal
• For VO managers to:
  • Manipulate top-level BDIIs.
  • Set critical tests for all services.
  • Display the site resources (CE & SE) to be used by the VO (or remove the less stable ones).
  • Do the same for central services (RBs, BDIIs, etc.).
• Changes generate an LDIF file that the BDII picks up every 2 minutes.
• Service availability plots will depend on the new set of critical tests (C.T.).
• Available at https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi
SAM Portal
• Judit is working on the display of the latest results per site and service, expected around Christmas.
• Available at https://lcg-sam.cern.ch:8443/sam/sam.py
The End
Comments? Questions?