Monitoring, Configuration and Control of the LHCb Trigger Farm

Gianluca Peco, on behalf of the Bologna Group. Trigger Meeting, 21/9/04.


  1. Monitoring, Configuration and Control of the LHCb Trigger Farm. Gianluca Peco, on behalf of the Bologna Group. Trigger Meeting, 21/9/04

  2. Monitoring, Configuration and Control
     • Monitoring
        • Display of relevant parameters concerning the status of the farm elements
        • Induce an FSM transition to an alarm state when the monitored parameters indicate error/warning conditions
     • Configuration
        • Define the farm running conditions
        • Farm elements and kernel version to be used
     • Control
        • Action execution (reboot, ready, start, stop) triggered by a manual command or by an FSM transition
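The two kinds of transition above are monitoring-driven (alarm) and command-driven (reboot, ready, start, stop). A small state machine can illustrate both. This is a minimal C sketch, not the farm's actual FSM. The state names (`ST_NOT_READY`, `ST_READY`, `ST_RUNNING`, `ST_ALARM`), the temperature limit used as the alarm condition, and the function names are hypothetical. The slide names only the commands and the alarm transition.

```c
/* Hypothetical node states: the slide names the alarm transition and the
   commands, but not the states themselves. */
typedef enum { ST_NOT_READY, ST_READY, ST_RUNNING, ST_ALARM } node_state;

/* Commands listed on the slide. */
typedef enum { CMD_REBOOT, CMD_READY, CMD_START, CMD_STOP } node_cmd;

/* Hypothetical error/warning condition: a temperature above a limit. */
static int is_alarm(double temp_c, double limit_c)
{
    return temp_c > limit_c;
}

/* Monitoring-driven transition: any state moves to ST_ALARM when the
   monitored value indicates an error/warning condition. */
node_state on_monitor(node_state s, double temp_c, double limit_c)
{
    return is_alarm(temp_c, limit_c) ? ST_ALARM : s;
}

/* Command-driven transitions; a command that is not legal in the current
   state leaves the state unchanged. */
node_state on_command(node_state s, node_cmd c)
{
    switch (c) {
    case CMD_REBOOT: return ST_NOT_READY;                 /* always allowed */
    case CMD_READY:  return s == ST_NOT_READY ? ST_READY : s;
    case CMD_START:  return s == ST_READY ? ST_RUNNING : s;
    case CMD_STOP:   return s == ST_RUNNING ? ST_READY : s;
    }
    return s;
}
```

In this sketch, an illegal command is silently ignored. A real controller might report it instead.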

  3. Monitoring
     • Each node runs a few lightweight processes:
        • monitor sensors;
        • command actuators.
     • DIM is the network communication layer between the control units and the farm elements.
        • It allows bi-directional communication.
     • PVSS is interfaced to the farm nodes:
        • to receive monitoring data;
        • to issue commands to the nodes;
        • to set the node configuration.

  4. PVSS and DIM
     DIM is based on the client/server paradigm:
     • Servers "publish" their services by registering them with the name server (normally once, at startup).
     • Clients "subscribe" to services by asking the name server which server provides the service and then contacting that server directly, passing the type of service and the type of update as parameters.
     • The name server keeps an up-to-date directory of all the servers and services available in the system.
     [Figure: a DIM server running on a farm element, with its sensor and actuator, linked to PVSS]
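The split between publishing and lookup can be made concrete with a toy in-memory model of the name server's directory. This is deliberately not the DIM library API. The names `name_server`, `dns_register` and `dns_lookup`, and the service-name format, are hypothetical. The step in which the client then contacts the server directly is not modelled.

```c
#include <string.h>

/* Toy model of the DIM name server directory; NOT the DIM library API.
   Sizes and names are hypothetical. */
#define MAX_SERVICES 64
#define NAME_LEN     64

typedef struct {
    char service[NAME_LEN];   /* service name, e.g. "Node_001_01/Temperature" */
    char server[NAME_LEN];    /* server (farm element) that publishes it */
} dns_entry;

typedef struct {
    dns_entry entries[MAX_SERVICES];
    int count;
} name_server;

/* Client side: ask which server provides a service.
   Returns the server name, or NULL if the service is unknown. */
const char *dns_lookup(const name_server *ns, const char *service)
{
    for (int i = 0; i < ns->count; i++)
        if (strcmp(ns->entries[i].service, service) == 0)
            return ns->entries[i].server;
    return NULL;
}

/* Server side: "publish" a service by registering it (normally once, at
   startup). Returns 0 on success, -1 if the directory is full or the
   service name is already registered. */
int dns_register(name_server *ns, const char *service, const char *server)
{
    if (ns->count >= MAX_SERVICES || dns_lookup(ns, service) != NULL)
        return -1;
    dns_entry *e = &ns->entries[ns->count++];
    strncpy(e->service, service, NAME_LEN - 1);
    e->service[NAME_LEN - 1] = '\0';
    strncpy(e->server, server, NAME_LEN - 1);
    e->server[NAME_LEN - 1] = '\0';
    return 0;
}
```

A client would call `dns_lookup` once and then talk to the returned server. After that first step, the name server is no longer on the data path.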

  5. PVSS and DIM
     • PVSS provides a runtime database, alarm generation and graphical panels.
     • A key PVSS concept is the data point. A data point type is somewhat analogous to an object-oriented class (a collection of attributes that supports inheritance).
     • PVSS communicates with DIM through a configurable PVSS-DIM API Manager.
     • PVSS can behave as a DIM client (i.e. receive information from or send commands to DIM servers) or as a DIM server (i.e. send information to or receive commands from DIM clients).
     [Figure: PVSS database data points linked to DIM]

  6. Sensors
     • Built as C programs, they collect relevant information from the /proc and /sys kernel filesystems and publish it through DIM calls.
     • The following sensors are ready and tested:
        • Temperature and fan speeds
        • CPU states, including irq and softirq
        • Hardware interrupt rates
        • Memory usage
        • Network interface card
        • TCP/IP stack
        • Process status
     • The process list is obtained through calls to the libproc-3.2.3.so library (to cope with changes in kernel version).
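As an illustration of what a CPU-state sensor reads, the sketch below parses the aggregate `cpu` line of `/proc/stat`. It uses the Linux 2.6 field order: user, nice, system, idle, iowait, irq, softirq. From two samples it derives the share of time spent in irq and softirq. The function names are hypothetical. Only the first seven counters are used; later kernels append more fields, which are ignored here. The DIM publishing step is omitted.

```c
#include <stdio.h>
#include <string.h>

/* Counters from the aggregate "cpu" line of /proc/stat, in clock ticks.
   Field order as in Linux 2.6: user nice system idle iowait irq softirq. */
typedef struct {
    unsigned long long user, nice, system, idle, iowait, irq, softirq;
} cpu_times;

/* Parse the aggregate "cpu" line; per-CPU "cpuN" lines are rejected.
   Any fields after softirq (added by later kernels) are ignored.
   Returns 0 on success, -1 otherwise. */
int parse_cpu_line(const char *line, cpu_times *t)
{
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    int n = sscanf(line + 4, "%llu %llu %llu %llu %llu %llu %llu",
                   &t->user, &t->nice, &t->system, &t->idle,
                   &t->iowait, &t->irq, &t->softirq);
    return n == 7 ? 0 : -1;
}

/* Read the first line of /proc/stat, which is the aggregate "cpu" line. */
int read_cpu_times(cpu_times *t)
{
    char line[512];
    FILE *f = fopen("/proc/stat", "r");
    if (f == NULL)
        return -1;
    int rc = fgets(line, sizeof line, f) ? parse_cpu_line(line, t) : -1;
    fclose(f);
    return rc;
}

/* Sum of the seven counters above. */
static unsigned long long cpu_total(const cpu_times *t)
{
    return t->user + t->nice + t->system + t->idle +
           t->iowait + t->irq + t->softirq;
}

/* Percentage of CPU time spent in irq + softirq between samples a and b. */
double irq_percent(const cpu_times *a, const cpu_times *b)
{
    unsigned long long ta = cpu_total(a), tb = cpu_total(b);
    if (tb <= ta)
        return 0.0;
    unsigned long long d = (b->irq + b->softirq) - (a->irq + a->softirq);
    return 100.0 * (double)d / (double)(tb - ta);
}
```

A sensor of this kind would call `read_cpu_times` periodically and publish the derived percentages, rather than the raw counters.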

  7. Data Point Structure
     [Figure: one DPT per sensor (Sensor-1 DPT to Sensor-n DPT), each with a data point for every node from Node_001_01 to Node_100_20]
     • Each sensor corresponds to a DPE in PVSS (each service is mapped to a DPT).
     • A sensor registering with the DNS is automatically detected by a PVSS control script and subscribed into a corresponding DP structure.
     • A missing sensor is detected, and its absence is shown in the corresponding control panel.
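The per-node naming and the missing-sensor detection can be sketched as follows. The name format `Node_%03d_%02d` is inferred from the figure labels (Node_001_01 to Node_100_20). `mark_connected` is a hypothetical stand-in for the PVSS control script. It compares the expected data point names with the names currently published and flags the absent ones.

```c
#include <stdio.h>
#include <string.h>

/* Build a node data point name such as "Node_001_01" (format inferred
   from the figure labels). */
void node_dp_name(char *buf, size_t len, int subfarm, int node)
{
    snprintf(buf, len, "Node_%03d_%02d", subfarm, node);
}

/* Hypothetical stand-in for the PVSS control script: for each expected
   data point name, set connected[i] to 1 if a matching service is
   currently published and 0 otherwise. Returns the number missing. */
int mark_connected(const char *expected[], int n_expected,
                   const char *published[], int n_published,
                   int connected[])
{
    int missing = 0;
    for (int i = 0; i < n_expected; i++) {
        connected[i] = 0;
        for (int j = 0; j < n_published; j++) {
            if (strcmp(expected[i], published[j]) == 0) {
                connected[i] = 1;
                break;
            }
        }
        if (!connected[i])
            missing++;
    }
    return missing;
}
```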

  8. DIMConfig: Client Services

  9. Data Point Structure (II): DpType Structure
     • SFNode DpT (name: SFN_xxx_yy): references the sensor DpTs
     • Sensor DpT (name: Stxxxxx): settings, readings, info, connected (bool)
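A rough C analogue of the two data point types shows the reference relationship: the node type points at its sensor data points instead of copying them. Element names follow the slide. The field types, the array sizes and the helper `sfnode_missing` are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Analogue of the Sensor DpT (name pattern "Stxxxxx"). Element names
   follow the slide; their types and sizes are assumptions. */
typedef struct {
    char   name[16];      /* e.g. "St00001" (hypothetical value) */
    double settings[4];   /* e.g. alarm thresholds (assumed) */
    double readings[8];   /* latest values published by the sensor */
    char   info[64];      /* free-text description */
    bool   connected;     /* true while the sensor's service is available */
} sensor_dpt;

/* Analogue of the SFNode DpT (name pattern "SFN_xxx_yy"): it holds
   references to the node's sensor data points, not copies. */
typedef struct {
    char        name[16];     /* e.g. "SFN_001_01" */
    sensor_dpt *sensors[16];  /* references to sensor data points */
    size_t      n_sensors;
} sfnode_dpt;

/* Count the referenced sensors that are currently disconnected. */
size_t sfnode_missing(const sfnode_dpt *node)
{
    size_t missing = 0;
    for (size_t i = 0; i < node->n_sensors; i++)
        if (!node->sensors[i]->connected)
            missing++;
    return missing;
}
```

Because the node only holds references, a change to a sensor's `connected` flag is seen immediately through every node that refers to it.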

  10. Development Testbed
     [Figure: testbed layout showing LHCBPLUS, PC1, PC2, a 3Com network device, the DNS, sensors/actuators, and PVSS Dist1 to PVSS Dist7]
     • 14 Linux boxes
     • 2 Windows boxes running a PVSS distributed system
     • 1 Linux box as development platform running DIM (sensors, actuators)

  11. Display Architecture
     [Figure: clicking a subfarm in the Farm Display Panel opens its SubFarm Display Panel (e.g. SubFarm_001); clicking a node opens the Node Display Panel (e.g. Node_001_12), from which the Sensor Display Panel or an ssh terminal can be reached; labels also mark "Missing service" and "DP doesn't exist"]
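Navigating from a node back to its subfarm panel only needs the naming convention visible in the figure. This minimal sketch assumes names of the form `Node_SSS_NN` and `SubFarm_SSS`. The function `subfarm_of_node` is hypothetical and is not part of PVSS.

```c
#include <stdio.h>

/* Map a node name such as "Node_001_12" to the name of the SubFarm panel
   that contains it ("SubFarm_001"). Returns 0 on success, -1 if the name
   does not follow the assumed pattern. */
int subfarm_of_node(const char *node, char *out, size_t len)
{
    int subfarm, node_no;
    char extra;
    /* Exactly two conversions must succeed: trailing characters are rejected. */
    if (sscanf(node, "Node_%3d_%2d%c", &subfarm, &node_no, &extra) != 2)
        return -1;
    snprintf(out, len, "SubFarm_%03d", subfarm);
    return 0;
}
```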

  12. Display
     [Figure: main display panel listing the nodes; clicking a node shows its process list]

  13. Display (II)

  14. Process Control
     • A basic mechanism to start/stop a process is ready (the DIM server publishes DIMCMD).
     • When a process is started through DIMCMD, an arbitrary Unique Thread Group Identifier (UTGID) is assigned to it. (No more than one process can be started with the same UTGID.)
     • The process can then be traced and killed using its UTGID.
     • The UTGID mechanism is implemented by setting an additional environment variable.
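The environment-variable approach can be sketched in C with `fork`, `setenv` and `execvp`. The sketch assumes the variable is literally named `UTGID`; the slide only says that an additional environment variable is set. `start_with_utgid` and `wait_exit_status` are hypothetical helpers, not the DIMCMD server's code. Setting the variable in the child, between `fork` and `exec`, leaves the parent's environment untouched. Anything the new process starts inherits the tag.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] (searched in PATH) with UTGID=<utgid> in its environment.
   The variable is set only in the child, so the parent is unaffected.
   Returns the child PID, or -1 if fork() fails. */
pid_t start_with_utgid(const char *utgid, char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        if (setenv("UTGID", utgid, 1) == 0)
            execvp(argv[0], argv);
        _exit(127);                  /* setenv or exec failed */
    }
    return pid;
}

/* Wait for a child and return its exit status (-1 if it did not exit normally). */
int wait_exit_status(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, starting `sh -c 'echo $UTGID'` through `start_with_utgid("run_42", argv)` should print `run_42`.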

  15. UTGID: uStart
     • uStart: start a UTGID process
     • uStart: can't start two processes with the same UTGID

  16. UTGID: uLs, uKill
     • uLs: show UTGID processes
     • uKill: stop a process by UTGID
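On Linux, the three tools can be approximated by scanning `/proc/<pid>/environ`. That file holds each process's initial environment as NUL-separated `NAME=value` strings. The sketch below is a hypothetical reimplementation, not the actual uStart/uLs/uKill code. It assumes the variable is named `UTGID` and reads only the first 8 KiB of each environment. It also silently skips processes whose environment cannot be read, such as those owned by another user.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Return 1 if the NUL-separated environment block contains "UTGID=<utgid>". */
static int env_has_utgid(const char *env, size_t len, const char *utgid)
{
    char want[256];
    int n = snprintf(want, sizeof want, "UTGID=%s", utgid);
    if (n < 0 || (size_t)n >= sizeof want)
        return 0;
    size_t i = 0;
    while (i < len) {
        const char *entry = env + i;
        const char *end = memchr(entry, '\0', len - i);
        size_t elen = end ? (size_t)(end - entry) : len - i;
        if (elen == (size_t)n && memcmp(entry, want, elen) == 0)
            return 1;
        i += elen + 1;               /* skip the entry and its NUL */
    }
    return 0;
}

/* uLs-like scan: store up to `max` PIDs whose environment carries the
   UTGID. Returns the number found, or -1 if /proc cannot be opened. */
int utgid_find(const char *utgid, pid_t *pids, int max)
{
    DIR *proc = opendir("/proc");
    if (proc == NULL)
        return -1;
    int found = 0;
    struct dirent *de;
    while (found < max && (de = readdir(proc)) != NULL) {
        char *endp;
        long pid = strtol(de->d_name, &endp, 10);
        if (*endp != '\0' || pid <= 0)
            continue;                /* not a PID directory */
        char path[64], buf[8192];
        snprintf(path, sizeof path, "/proc/%ld/environ", pid);
        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;                /* process gone or not readable */
        size_t len = fread(buf, 1, sizeof buf, f);
        fclose(f);
        if (env_has_utgid(buf, len, utgid))
            pids[found++] = (pid_t)pid;
    }
    closedir(proc);
    return found;
}

/* uStart-like guard: a UTGID may be used only if no process carries it. */
int utgid_available(const char *utgid)
{
    pid_t pid;
    return utgid_find(utgid, &pid, 1) == 0;
}

/* uKill-like action: send `sig` to every process tagged with the UTGID.
   Returns the number of processes successfully signalled. */
int utgid_kill(const char *utgid, int sig)
{
    pid_t pids[64];
    int n = utgid_find(utgid, pids, 64);
    int killed = 0;
    for (int i = 0; i < n; i++)
        if (kill(pids[i], sig) == 0)
            killed++;
    return killed;
}
```

In this sketch, `utgid_available` plays the role of uStart's uniqueness check, `utgid_find` that of uLs, and `utgid_kill(utgid, SIGTERM)` that of uKill.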