Online Monitoring with MonALISA Dan Protopopescu Glasgow, UK
MonALISA is a distributed service able to: • collect any type of information from different systems • analyze this information in real time • take automated decisions and perform actions based on it • optimize workflows in complex environments Read more at http://monalisa.caltech.edu
Uses • Monitoring distributed computing, e.g. Grids • Optimizing flow in complex systems (VRVS, optical cable networks) • ALICE also uses MonALISA (ML) for monitoring online reconstruction • Some benchmark figures for the service: • ~800k monitored parameters at 50k updates/second • >10k running (AliEn) jobs monitored simultaneously • >100 WAN links We are proposing ML as a high-level monitoring and possibly control system alongside (or on top of) existing slow-control systems such as EPICS, PVSS, etc.
Advantages • MonALISA is simple to install, configure and use • ApMon APIs are available in C, C++, Java, Python and Perl • A ROOT plugin allows macros to send data directly to MonALISA • Can easily interface with (or sit on top of) any existing or future slow-control subsystem (EPICS, PVSS) • Data is stored in a standard PostgreSQL (or MySQL) database that can be accessed by other applications, independently of ML • Automatic data summarizing • Several data repositories (and hence DBs) can exist (local and remote) • Easy access via WebService (WS) from the service and/or repository • Fully supported by the development team; work is being done in this direction
Capabilities Based on monitored information, actions can be taken in: • ML Service • ML Repository Actions can be triggered by: • Values above/below given thresholds • Absence/presence of values • Correlations between several values Possible action types: • External command • Plain event logging • Annotation of repository charts; RSS feeds • Email • Instant messaging
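The three trigger kinds listed above (threshold, absence/presence, correlation) can be sketched in a few lines of Python. This is only an illustration of the idea, not MonALISA's actual configuration or API: the `Trigger` class, the helper names and the parameter names (`rate`, `heartbeat`, `temp`, `load`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Latest value per monitored parameter; a missing key or None means "no data".
Snapshot = Dict[str, Optional[float]]
Condition = Callable[[Snapshot], bool]


@dataclass
class Trigger:
    """Hypothetical pairing of a condition with an action (not an ML class)."""
    name: str
    condition: Condition
    action: Callable[[str], None]


def above(param: str, limit: float) -> Condition:
    """Threshold trigger: fires when the value exists and exceeds the limit."""
    def check(snapshot: Snapshot) -> bool:
        value = snapshot.get(param)
        return value is not None and value > limit
    return check


def absent(param: str) -> Condition:
    """Absence trigger: fires when no value was reported for the parameter."""
    return lambda snapshot: snapshot.get(param) is None


def both(first: Condition, second: Condition) -> Condition:
    """Correlation trigger: fires only when two conditions hold together."""
    return lambda snapshot: first(snapshot) and second(snapshot)


def evaluate(triggers: List[Trigger], snapshot: Snapshot) -> List[str]:
    """Run every trigger against one snapshot and return the names that fired."""
    fired = []
    for trigger in triggers:
        if trigger.condition(snapshot):
            trigger.action(trigger.name)
            fired.append(trigger.name)
    return fired


log: List[str] = []  # stands in for "plain event logging"
triggers = [
    Trigger("high_rate", above("rate", 100.0), log.append),
    Trigger("no_heartbeat", absent("heartbeat"), log.append),
    Trigger("hot_and_loaded", both(above("temp", 60.0), above("load", 8.0)), log.append),
]
fired = evaluate(triggers, {"rate": 150.0, "temp": 70.0, "load": 2.0})
```

Here the action is a plain log append; an external command, email or instant message would be a different callable passed in the same slot.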
Components [Architecture diagram; labels: GUI, LUS/Proxies, Web Server, Service, ApMon, Repository, "Actions based on local information", "Actions based on aggregated information", "Quick actions"]
Service setup
ML Service setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/MonaLisa.tar.gz
    tar -zxvf MonaLisa.tar.gz
    cd MonaLisa/
    ./install.sh
    cd ../MonaLisa/Service/CMD/
    ./MLD start
Repository setup
ML Repository setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/MLrepository.tgz
    tar -zxvf MLrepository.tgz
    [configure it]
    cd MLrepository
    ./start.sh
ApMon setup
ApMon setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/ApMon_perl.tar.gz
    tar -xzvf ApMon_perl.tar.gz
    cd ApMon_perl
    [create your script, say mysend.pl]
    perl mysend.pl
Simple monitoring script
[monalisa@glasgow]$ cat mysend.pl
    use ApMon;
    # Send to the ML service at glasgow.jlab.org:8884; built-in system monitoring is switched off
    my $apm = new ApMon({"glasgow.jlab.org:8884" => {"sys_monitoring" => 0, "general_info" => 0}});
    my @pair;
    while (1) {  # loop forever
        # get values from somewhere; getmypar() is your own routine, not part of ApMon
        @pair = getmypar("pspec_logic_ai_0");
        $apm->sendParameters("Detector", "MOR", @pair);
        sleep(20);
    }
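Since ApMon is also available for Python, the same read-send-sleep loop can be sketched in that language. To keep the sketch runnable without ApMon installed, the sender is passed in as a plain callable; `run_sender` and its arguments are hypothetical names, and in a real script the `send` callable would wrap the ApMon call that ships the value to the ML service.

```python
import time
from typing import Callable, List, Tuple

Sample = Tuple[str, str, str, float]  # (cluster, node, parameter, value)


def run_sender(
    read_value: Callable[[], float],
    send: Callable[[str, str, str, float], None],
    cluster: str,
    node: str,
    param: str,
    interval_s: float,
    iterations: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Read a value, send it, wait; repeated a fixed number of times.

    The Perl script loops forever; a bounded loop is used here so the
    sketch terminates when run as an example.
    """
    for _ in range(iterations):
        send(cluster, node, param, read_value())
        sleep(interval_s)


# Stand-ins for the real data source and the real ApMon sender.
readings = iter([1.0, 2.0, 3.0])
sent: List[Sample] = []

run_sender(
    read_value=lambda: next(readings),
    send=lambda c, n, p, v: sent.append((c, n, p, v)),
    cluster="Detector",
    node="MOR",
    param="pspec_logic_ai_0",
    interval_s=20,
    iterations=3,
    sleep=lambda seconds: None,  # skip the real 20 s wait in this demo
)
```

Injecting `send` and `sleep` keeps the loop testable offline; the cluster, node and parameter names are the ones used in the Perl script.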
Time history
Time history example:
[monalisa@glasgow]$ cat mor.properties
    page=hist
    Farms=JlabML
    Clusters=Detector
    Nodes=MOR
    Functions=pspec_logic_ai_0
    ylabel=Tagger rate
    title=MOR
    annotation.groups=2
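The chart definition is a plain key=value properties file, so other tools can read it easily. As a rough sketch (not the repository's own loader), the following Python parses the file shown above; it handles only `=` separators and `#` comments, which is an assumption that covers this example but not the full Java properties syntax.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:  # ignore lines without '='
            props[key.strip()] = value.strip()
    return props


# Contents of mor.properties as shown on the slide.
MOR_PROPERTIES = """\
page=hist
Farms=JlabML
Clusters=Detector
Nodes=MOR
Functions=pspec_logic_ai_0
ylabel=Tagger rate
title=MOR
annotation.groups=2
"""

props = parse_properties(MOR_PROPERTIES)
# The Farms/Clusters/Nodes/Functions keys name the monitored series,
# matching the "Detector"/"MOR"/"pspec_logic_ai_0" values sent by mysend.pl.
series = (props["Farms"], props["Clusters"], props["Nodes"], props["Functions"])
```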
Application control • ML Clients: TCP-based subscribe mechanism; serialized, compressed objects with optional encryption • ML Proxies: application commands are encrypted • ML Services: standard and/or user's sensors and/or application modules [Diagram; labels: Your custom Java client, GUI client, ML Repository, Your custom view, Key, LUS, Keystore, ML Service, Your mon module, Your app module, App MonC, ApMon, Your application, bash]
Alert-based Actions • The MySQL daemon is automatically restarted when it runs out of memory. Trigger: threshold on VSZ memory usage • The ALICE production job queue is automatically kept full by automatic resubmission. Trigger: threshold on the number of waiting aliprod jobs • Administrators are kept up to date on the services' status via instant messaging, RSS feeds, toolbar alerts, etc. Trigger: presence/absence of monitored information
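The queue-refill action can be pictured as a small threshold rule. The sketch below is an assumption about how such a rule might look, not the ALICE implementation: the function name, the `threshold` and `target` parameters and the numbers in the example are all hypothetical.

```python
def jobs_to_resubmit(waiting: int, threshold: int, target: int) -> int:
    """Return how many jobs to submit so the waiting queue reaches `target`.

    Nothing is submitted while the number of waiting jobs is at or above
    `threshold`; below it, the queue is topped up to `target`.
    """
    if target < threshold:
        raise ValueError("target must be at least as large as threshold")
    if waiting >= threshold:
        return 0
    return target - waiting


# Hypothetical numbers: refill to 500 whenever fewer than 100 jobs are waiting.
enough = jobs_to_resubmit(waiting=120, threshold=100, target=500)
refill = jobs_to_resubmit(waiting=40, threshold=100, target=500)
```

Keeping the refill target above the trigger threshold avoids resubmitting on every small dip in the queue.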
Summary • MonALISA is a very promising tool for online experiment monitoring and for interfacing with a variety of slow-control subsystems; GlueX is seriously considering ML for this task • Easy to configure, understand and use • Experience from Grid monitoring and more • Support from the developers' group for the implementation of new modules/features • Online experiment monitoring tests at CLAS@JLab were recently carried out; a demo repository is at http://mlr1.gla.ac.uk:7002
AliEn Services Monitoring • AliEn services • Periodically checked • PID check + SOAP call • Simple functional tests • SE space usage • Efficiency
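The PID-check half of the periodic service test can be illustrated with the standard POSIX idiom of sending signal 0 to a process. This is a generic sketch, assuming a POSIX system; it is not taken from the AliEn code, and the SOAP call is left out.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only).

    Signal 0 performs the existence and permission checks without
    actually delivering a signal to the process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


# The current process is certainly running, so this evaluates to True.
self_check = pid_alive(os.getpid())
```

A PID check only shows that the process exists, which is presumably why the slide pairs it with a SOAP call and simple functional tests.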
Job Network Traffic Monitoring • Based on the xrootd transfers from every job • Aggregated statistics for • Sites (incoming, outgoing, site to site, internal) • Storage Elements (incoming, outgoing) • Statistics cover • Read and written files • Transferred MB/s
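The per-site aggregation described above can be sketched as a single pass over per-job transfer records. The `Transfer` record, the site names and the volumes below are hypothetical; the real system derives its input from xrootd transfer reports and also tracks rates, which this sketch omits.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Transfer(NamedTuple):
    """Hypothetical record of one job's transfer between two sites."""
    src_site: str
    dst_site: str
    megabytes: float


def aggregate(
    transfers: Iterable[Transfer],
) -> Tuple[Dict[str, Dict[str, float]], Dict[Tuple[str, str], float]]:
    """Sum transferred volume per site and per (source, destination) pair."""
    per_site: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"incoming": 0.0, "outgoing": 0.0, "internal": 0.0}
    )
    site_to_site: Dict[Tuple[str, str], float] = defaultdict(float)
    for t in transfers:
        if t.src_site == t.dst_site:
            per_site[t.src_site]["internal"] += t.megabytes
        else:
            per_site[t.src_site]["outgoing"] += t.megabytes
            per_site[t.dst_site]["incoming"] += t.megabytes
            site_to_site[(t.src_site, t.dst_site)] += t.megabytes
    return dict(per_site), dict(site_to_site)


sample = [
    Transfer("CERN", "GLA", 100.0),
    Transfer("GLA", "CERN", 40.0),
    Transfer("GLA", "GLA", 10.0),
]
per_site, site_to_site = aggregate(sample)
```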
Individual Job Tracking • Based on AliEn shell commands • top, ps, spy, jobinfo, masterjob • Using the GUI ML Client • Status and resource usage, per job
Head Node Monitoring • Machine parameters, real-time & history, load, memory & swap usage, processes, sockets
MonALISA in AliEn • The MonALISA framework has been used as a primary monitoring tool for the ALICE Grid since 2004 • Presently the system is used for monitoring all (identified) services, jobs and network parameters necessary for Grid operation and debugging • The number of concurrently monitored and stored parameters today is ~300,000 in 75 ML Services • The add-on tools for automatic event notification allow for more efficient reaction to problems • The framework's design and flexibility answer all requirements for a monitoring system • The accumulated information makes it possible to construct and implement automated decision-making algorithms, thus further increasing the efficiency of Grid operations