A lightweight Monitoring and Accounting system for LHCb DC'04 production V. Garonne R. Graciani Díaz J. J. Saborido Silva M. Sánchez García R. Vizcaya Carrillo
Outline • Manifesto • Monitoring • Web interface • Internals • Accounting • Web interface • Internals • Outlook • URLs
Manifesto • Monitoring and Accounting are tasks in DIRAC • DIRAC is a production grid for LHCb • The Monitoring reports the status of jobs while they are in the WMS (Workload Management System) • Instantaneous snapshot of the system • No historic records • The Accounting records the status of jobs after they leave the WMS • Provides historic record, accumulated statistics and evolution of recorded variables with time • Main users: production and site managers
Design choices • Monitoring • Job information stored centrally in the WMS • Information provided directly by the job and the WMS • Passive services: no pushing of information • No need for a common consumer API • Job and application state stored together • Accounting • Separate infrastructure from the Monitoring • A job is never in both the Accounting and the Monitoring at the same time • Domain specific: LHCb production jobs
Information Flow • [Diagram with read and write paths for Monitoring and Accounting. Labelled elements: Users, Web interface (one each for Monitoring and Accounting), DIRAC, Job, Job Heart-beat, Services & Agents, Cleaner Agent, WMS Job Database, Accounting Database Backend]
Monitoring Web Interface 1 • Interface to query the monitoring service • Clicking a JobId pops up a window with the job details
Monitoring Web Interface 2 • Running jobs by site • The overview shows predefined plots on the production • Generated every few minutes • PyChart used as graphics engine • 100% Python • Supports SVG
Monitoring Web Interface 3 • Job status by site and production id
Monitoring Internals • It consists of an XML-RPC service exposing whatever parameters are known to DIRAC • Job parameters stored internally by DIRAC • Primary parameters • Execution site, job status, job owner, etc. • Fixed, centrally defined: fast access • Can be used in queries • Secondary parameters • Number of steps, internal job state, etc. • Defined by the production job itself • Stored as key-value pairs • Slower access; cannot be used in queries
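The split between primary and secondary parameters can be pictured with a small relational sketch. This is only an illustration under stated assumptions: the slides do not show how DIRAC stores the parameters, so the table names, column names, the second site name and the sample values below are hypothetical. Primary parameters live in fixed, indexed columns and can appear in a selection; secondary parameters live in a key-value table and are fetched one job at a time.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not taken from DIRAC.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (              -- primary parameters: fixed columns
        job_id INTEGER PRIMARY KEY,
        site   TEXT,
        status TEXT,
        owner  TEXT
    );
    CREATE INDEX idx_jobs_site_status ON jobs (site, status);
    CREATE TABLE job_parameters (    -- secondary parameters: key-value pairs
        job_id INTEGER,
        name   TEXT,
        value  TEXT,
        PRIMARY KEY (job_id, name)
    );
""")

# Made-up sample rows.
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)", [
    (1, "DIRAC.CERN.ch", "running", "prod"),
    (2, "DIRAC.CERN.ch", "done", "prod"),
    (3, "Site.Example.org", "running", "prod"),
])
conn.executemany("INSERT INTO job_parameters VALUES (?, ?, ?)", [
    (1, "LocalBatchId", "batch-0001"),
    (1, "NumberOfSteps", "3"),
])


def select_jobs(conn, site, status):
    """Select job ids by primary parameters (indexed, so queryable)."""
    rows = conn.execute(
        "SELECT job_id FROM jobs WHERE site = ? AND status = ? ORDER BY job_id",
        (site, status),
    )
    return [row[0] for row in rows]


def get_parameter(conn, job_id, name):
    """Look up one secondary parameter of one job; None if it is not set."""
    row = conn.execute(
        "SELECT value FROM job_parameters WHERE job_id = ? AND name = ?",
        (job_id, name),
    ).fetchone()
    return row[0] if row else None


running_at_cern = select_jobs(conn, "DIRAC.CERN.ch", "running")
batch_id = get_parameter(conn, 1, "LocalBatchId")
```

In this layout a selection never touches the key-value table, which is one plausible reason why primary parameters are fast to query while secondary ones are only read per job.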
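JMS basic API example

```python
from xmlrpc.client import ServerProxy

# monitoring_url must hold the URL of the monitoring XML-RPC service
server = ServerProxy(monitoring_url)

# Retrieve the list of jobs satisfying some conditions
conditions = {'Status': 'running', 'Site': 'DIRAC.CERN.ch'}
jobreq = server.getJobs(conditions)

if jobreq['Status']:
    # Print some parameters for each job (one call per job and parameter)
    for jobid in jobreq['Value']:
        print(server.getJobSite(jobid))
        print(server.getJobParameter(jobid, 'LocalBatchId'))

    # Bulk operations (one call for all selected jobs)
    summary = server.getJobsPrimarySummary(jobreq['Value'])
```

Timings quoted on the slide, in order: ~3 s to select 95 out of 50k jobs, ~40 s, ~0.7 s (presumably for the selection, the per-job loop and the bulk call, respectively).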
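To try the calling pattern above without access to the real service, one option is a local stub. The sketch below is not the DIRAC monitoring service: it reuses three method names from the slide (`getJobs`, `getJobSite`, `getJobsPrimarySummary`) and the `Status`/`Value` reply convention, but the job table, the shape of the values returned by `getJobSite` and `getJobsPrimarySummary`, and the helper `start_stub_server` are assumptions made for illustration.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Toy in-memory job table; all values are made up for illustration.
JOBS = {
    101: {"Status": "running", "Site": "DIRAC.CERN.ch"},
    102: {"Status": "done", "Site": "DIRAC.CERN.ch"},
    103: {"Status": "running", "Site": "DIRAC.CERN.ch"},
}


def getJobs(conditions):
    """Return the ids of jobs whose parameters match every condition."""
    matched = [jid for jid, params in sorted(JOBS.items())
               if all(params.get(k) == v for k, v in conditions.items())]
    return {"Status": True, "Value": matched}


def getJobSite(jobid):
    """Return the site of one job (assumed here to be a plain string)."""
    return JOBS[jobid]["Site"]


def getJobsPrimarySummary(jobids):
    """Return the parameters of many jobs in a single reply."""
    # XML-RPC struct keys must be strings, hence str(jid).
    return {str(jid): JOBS[jid] for jid in jobids}


def start_stub_server():
    """Start the stub on a free local port in a background thread."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    for func in (getJobs, getJobSite, getJobsPrimarySummary):
        server.register_function(func)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_stub_server()
host, port = server.server_address
client = ServerProxy("http://%s:%d" % (host, port))

jobreq = client.getJobs({"Status": "running", "Site": "DIRAC.CERN.ch"})
# Per-job access: one network round trip for every job.
sites = [client.getJobSite(jobid) for jobid in jobreq["Value"]]
# Bulk access: a single round trip for all selected jobs.
summary = client.getJobsPrimarySummary(jobreq["Value"])
server.shutdown()
```

The per-job list comprehension issues one request per job, while the bulk call issues one request in total, which is the kind of difference bulk operations are meant to exploit.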
Accounting Web Interface 1 • GUI for querying the Accounting • Shows results • As graphics • As table • As Excel sheet • Several types of report • Only a few shown here
Accounting Web Interface 2 • Used resources by site
Accounting Web Interface 3 • Used resources by event type • MB/job • CPU/job • Failed jobs • CPU vs. execution time • Input and output data vs. CPU
Accounting Web Interface 4 • Produced data by production ID • Rates • Cumulative • Number of events • GB of output
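The "Rates" and "Cumulative" views are two presentations of the same per-period counts. A minimal sketch, assuming daily event counts for one production; the function names and the numbers are made up for illustration and are not taken from the accounting system:

```python
from itertools import accumulate


def cumulative(per_period):
    """Running total of a per-period series."""
    return list(accumulate(per_period))


def rates(per_period, period_days=1.0):
    """Per-day rate for each period of the given length in days."""
    return [count / period_days for count in per_period]


# Made-up daily event counts for one production.
events_per_day = [1000, 2500, 0, 4000]
total_so_far = cumulative(events_per_day)
daily_rate = rates(events_per_day)
```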
Accounting Web Interface 5 • WMS statistics on DIRAC's performance • Plots • Job execution time vs. WMS waiting time • Job execution time vs. WMS matching time • Granularity • Per site • Per production • Integral • Allows assessment of DIRAC's performance
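The three granularities listed above (per site, per production, integral) amount to grouping the same job records by a different key. The following sketch assumes each record carries an execution time and a WMS waiting time in seconds; the field names, the second site name and the values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarise(records, key=None):
    """Group job records by `key` ('site', 'production') or not at all."""
    groups = defaultdict(list)
    for rec in records:
        # With key=None every record lands in one group: the integral view.
        groups[rec[key] if key else "all"].append(rec)
    return {
        name: {
            "jobs": len(recs),
            "mean_exec_s": mean(r["exec_s"] for r in recs),
            "mean_wait_s": mean(r["wait_s"] for r in recs),
        }
        for name, recs in groups.items()
    }


# Made-up job records.
records = [
    {"site": "DIRAC.CERN.ch", "production": 1, "exec_s": 3600, "wait_s": 60},
    {"site": "DIRAC.CERN.ch", "production": 2, "exec_s": 1800, "wait_s": 120},
    {"site": "Site.Example.org", "production": 1, "exec_s": 7200, "wait_s": 30},
]

per_site = summarise(records, key="site")
per_production = summarise(records, key="production")
integral = summarise(records)
```

Comparing `mean_exec_s` with `mean_wait_s` in each group gives a rough numerical counterpart to the "execution time vs. waiting time" plots.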
Accounting Internals • Job and DIRAC statistics kept in a database • Site contribution • Data produced and used by jobs and steps • Timing for jobs, steps and DIRAC internals • Separate XML-RPC interfaces to populate and query the accounting tables • Both interfaces have restricted access • Jobs are moved to the accounting system by a cleaner agent after being validated
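The hand-over performed by the cleaner agent can be sketched as a move from one store to the other, so that a job is never present in both. The slides do not describe the validation criteria or the storage layout, so the set of final states, the `is_valid` callback and the dictionary-based stores below are assumptions for illustration only.

```python
# Hypothetical final states; the real state names are not given in the slides.
FINAL_STATES = {"done", "failed"}


def run_cleaner(monitoring, accounting, is_valid):
    """Move finished, validated jobs from monitoring to accounting.

    Both stores are dicts keyed by job id. Returns the ids that were moved.
    """
    moved = []
    # sorted() makes a list copy, so popping while looping is safe.
    for job_id in sorted(monitoring):
        job = monitoring[job_id]
        if job["Status"] in FINAL_STATES and is_valid(job):
            # pop() removes the job from monitoring in the same step.
            accounting[job_id] = monitoring.pop(job_id)
            moved.append(job_id)
    return moved


# Made-up jobs: one finished and complete, one finished but incomplete,
# one still running.
monitoring = {
    1: {"Status": "done", "OutputRegistered": True},
    2: {"Status": "done", "OutputRegistered": False},
    3: {"Status": "running", "OutputRegistered": False},
}
accounting = {}

moved = run_cleaner(monitoring, accounting,
                    is_valid=lambda job: job["OutputRegistered"])
```

Jobs that are finished but fail validation stay in the monitoring store, where they remain visible until someone deals with them.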
Accounting Usage • About 10 hits per day • Time to generate daily static reports: 8 min • 60-70% of the time querying the database • 30-40% of the time in the drawing package • Total: 169 kjobs • Server load < 0.2
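In absolute terms, the quoted shares of the 8-minute report generation work out as follows. The helper name is made up; the inputs are the figures from the slide, and the two shares are treated as complementary.

```python
def split_report_time(total_min, db_share_range):
    """Split a total time into database and drawing time ranges (minutes)."""
    low, high = db_share_range
    database = (round(total_min * low, 1), round(total_min * high, 1))
    # Drawing takes whatever the database queries do not.
    drawing = (round(total_min * (1 - high), 1), round(total_min * (1 - low), 1))
    return database, drawing


database_min, drawing_min = split_report_time(8, (0.60, 0.70))
```

With these inputs the database queries account for roughly 4.8 to 5.6 minutes and the drawing package for roughly 2.4 to 3.2 minutes.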
Outlook • Monitoring page • Transactions in monitoring updates • Further optimisation (bulk operations...) • Search for a faster rendering package • Make the web page dynamic: fewer reloads • Accounting • New report types • Normalized CPU • Contribution by country • Rate by site, country, etc.
URLs • Monitoring page • http://fpegaes1.usc.es/dmon/DC04/joblist.html • Mirror on: • http://lhcb02.usc.cesga.es/dmon/DC04/joblist.html • Direct link to overview pages • http://lhcb.ecm.ub.es/DC04/Monitoring • Accounting page • http://lhcb.ecm.ub.es/DC04/Accounting/