GridICE: Monitoring for Grid Operation Center

GridICE The eyes of the grid A monitoring tool for aGrid Operation Center by DataTAG WP4 Sergio Fantinel, INFN LNL/PD

Sergio Fantinel’s Outline • Discovery: resources, components • Server side processes layout • DB schema • Graphs/data presentation • Next steps • Packaging/installation

Discovery entities • RESOURCES: are the entities discovered from the GIS, ex: • Cluster Head Nodes • Storage Services • Cluster Worker Nodes • COMPONENTS: are the entities belonging to resources and discovered directly from resource itself, ex: • Computing Elements • Storage Space • Network Adapters

Discovery purposes • track the life of the entities: they are characterized by a status (new, available, disappeared, dead). • Configure the monitoring system accordingly the status of these entities to collect metrics, status and other info.

Discovery entities list • This is the list of entities currently tracked by • the monitoring system: • Clusters • Storage Services • Worker Nodes (CL) • Computing Elements (CL) • Run Time Environments (CL) • Virtual Organizations (CE) • Storage Extents (WN) • Network Adapters (WN) • Storage Space (SE) • Storage Protocols (SE) CL = Cluster WN = Worker Node/host SE = Storage Service

Discovery entities (2) • Every entity (resource or component) is described by a number of characterizing information. • Entities may be linked together:Ex. Network Adapter -> Worker Node -> Cluster • To track the life of the entities it is used a SQL database where are stored also all the information related to every single entity.

MonitoringDB Discovery entities (3) GIIS Server 1 GIIS LDAP SQL MonitorigServer 2 LDAP 3 4 1: LDAP Query 2: available CE/SE 3: LDAP Query 4: CEIDs, WNs, Steps 3,4 repeated for every CE/SE GRIS Computing Element/Storage Element

Discovery config/check processes • All the info collected are used to generate a number of Nagios configuration files (configuration process). • Nagios schedule, according to some other DB stored parameters, at different interval times, the execution of a number of scripts (Nagios plug-ins wrote by the DataTAG WP4) that collect the info associated to every entity (check process) and put those in the DB.

Server Side processes layout GRID WEB Discovery 1A 5B 1C GIIS 1B Gfx/Presentation Config 2A 2B GRIS 5A MonitoringDB Nagios/scheduler 3 4A 4B Check 1: entities discovery 2: generation of config files 3: check scheduling 4: entities info collection 5: DB info rendering Grid Information System LDAP Interface Developed by DataTAG WP4

Scheduling • Discovery and config generation run as cron jobs; although the two processes can be scheduled independently at different time intervals, a discovery is just followed by a config generation. • Check plug-ins are scheduled by Nagios; the interval for each one is set by a corresponding parameter in the DB.

DataBase stored info • Three types of information are stored in the database (~50 tables): • Entities: actual status, historical status (fed by discovery process ) • Info about entities (fed by check process) • Monitoring configuration parameters (fed manually by monitoring administrator)

DataBase date/time attributes • Many record types has associated two date/time attributes (Date, LastCheck) that determines the time interval where the record is valid. • Checking the time distance between the Date attribute of a tuple and the LastCheck of the tuple that immediately precede, we can figure out problems of the resource or the monitoring system (VETOES generation, see later).

DataBase (2) • For many entities the related info are grouped into two sets: slow, fast parameters; in the first group are those params that change very slowly in the time (ex. SO version, RAM size of a node, bogoMIPS,…); in the second are all the other parameters.

DataBase check process • For a set of parameters: the check process get the info from the GRIS, then compare the values against the last record in the DB and, if no one has changed, it update only the LastCheck attribute of the tuple; otherwise it insert a new tuple with the current date/time in Date/LastCheck attributes and the new info.

DataBase thresholds • We are introducing a threshold mechanism: • For a selected set of attributes, nearly all the fast parameters, we define a threshold • If the new values for a tuple doesn’t exceed the related threshold over the old values, we change only the LastCheck attribute

DataBase thresholds (2) • This threshold mechanism become very useful to save DB space resource allocation. • For example • We doesn’t need to appreciate a variation of a CPU load less the 5-10% (think to the fluctuations around the 100% idle when no job is running on a CPU) • The same concept may be applied to a Storage Area with a 5-10MB threshold

RESOURCE RL_GE_CPU_SLOWPARAM Date LastCheck Vendor Model Version ClockSpeed TimeInterval Date LastCheck Load1Min Load5Min Load15Min date date string(32) string(64) string(32) int int date date int int int Date LastCheck IP ResName ID_GeType Group Cluster Date Data string(16) string(64) Int string(64) int RL_GE_CPU_FASTPARAM int int int int User Nice System Idle DB Example: GE simple info NB: this is a simplified structure

RL_CE_CEID Date LastCheck CeidName date date string(256) RL_GE_CPU_FASTPARAM Date LastCheck FreeCpus EstimatedResponseTime RunningJobs WaitingJobs WorstResponseTime date date int int int int int RL_GE_CPU_SLOWPARAM MaxTotalJobs MaxWallClockTime Priority LRMSType LRMSVersion date date int int Int int Date LastCheck CeidStatus TotalCpus MaxCpuTime MaxRunningJobs int int int string(16) string(16) DB Example: queues RESOURCE NB: this is a simplified structure

RL_SA_CHANGE_STATUS RL_RES_CHANGE_STATUS ID_ChgStatus Date ID_ChgStatus Date int date int date DB Example: status history RESOURCE RL_SA_FASTPARAM RL_SE_SA RL_SA_SLOWPARAM NB: this is a simplified structure

SERIVCE ServiceName CheckInterval Template Command string(64) Int string(32) string(32) DB Example: configuration Service List: CheckCeidFP CheckCeidSP CheckSeFP CheckSeSP CheckWnCpu CheckWnCtxc CheckWnInfo CheckWnIntc CheckWnMem CheckWnMpc CheckWnNet CheckWnPrcc CheckWnSckc CheckWnSpc CheckWnStg CheckWnVirtual GETYPE RL_GETYPE_SERVICE NB: this is a simplified structure

Data presentation • We based our work on the JpGraph tool (PHPv4), a very powerful set of API to generate a number of graph types that use directly the GDLib API (v2 is needed for an accurate colour management) • To overcame some limitation in the managing of date/time labels, merging and plotting of large set of date of JpGraph, and to have in general a fine control, we developed a decoupling layer between the DB and the render engine.

MonitoringDB Data presentation (2) Main Analysis Process JpGraph Data Load Data MergingGraph generation GDLib Resample Developed by DataTAG WP4

MonitoringDB Data presentation (3) Veto threshold Array of homogeneous tuples with their time interval validity (ALV) Data Load Time interval Object descriptors (Table, fields,…) Array of VETOES (if option enabled) (AVT)

Data presentation (4) Array with xSize number of elements; Every element is a record of metrics values; each value is the average of the metric function in the time interval slot represented by the xpoint considered (ALV) (AVT) Resample Time interval Graph Labels X-axis plot (dot) size (xSize) Graph VETO bars (if enabled)

Data presentation (5) • The presentation of the date was made addressing different user types (see later on the live demo session): • Vo views, for a VO manager • Site views, system manager • Single entity views

Next steps, integration • the next HIGH PRIORITYstep is the integration with the EDG 2.0 information system. Due to the structure of our monitoring system, the migration should be very smoothand fast.

Next steps, long term • Glue Network Service (NE) • Job monitoring • Collective services (RB, LB, ROS, RLS,…) • We are thinking to evaluate different technologies that can help us to improve our tool; we have in mind, for example, OGSA and JINI. • Major issue: distribution of the monitoring DB to get more scalability • Authentication/authorization • Remote system management • Better analysis of monitoring date and their presentation • Personal monitoring

Packaging/installation • Server side • Linux Red Hat 7.3 • PostgreSQL 7.2.3 • PHP 4.3.1 (GDLibv2) • JpGraph 1.2 • Nagios

Packaging/installation (2) • Server side: • Guide to installation and configuration • Client side: • Packaged rpms to install on the Head Node and on the Worker Nodes (fmon server/agents) • Replacing the Glue-CE.schema file on the Head Node (GLUE+)

GridICE: Monitoring for Grid Operation Center

GridICE: Monitoring for Grid Operation Center

Presentation Transcript

GridICE: a monitoring service for Grid Systems

GridICE: Grid and Fabric Monitoring Integrated for gLite-based Sites

GridICE: overview and current status

Grid monitoring System (GridICE)

GridICE: a monitoring service for Grid Systems

GridICE: overview and current status