A proposal for improving Job Reliability Monitoring

GDB, 2nd April 2008

Problem Statement
• We would like to be able to gather job state transitions from all jobs submitted to WLCG resources
  • EGEE
    • RB + WMS submitted jobs
    • Jobs submitted directly to a CE (via condor_g)
  • OSG
    • Jobs on an OSG CE for WLCG VOs
  • Nordugrid
    • ARC CE
• Use this to calculate resource reliability
• And use it for debugging, e.g. Dashboards
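To make "use this to calculate resource reliability" concrete, here is a minimal sketch of one possible calculation. It assumes reliability means the fraction of finished jobs per site that ended successfully; that definition, the tuple layout, the state names ("Done", "Aborted") and the site names are all illustrative assumptions, not something fixed by this proposal.

```python
from collections import defaultdict

# Hypothetical terminal states; the real set depends on the source system.
SUCCESS_STATES = {"Done"}
FAILURE_STATES = {"Aborted"}
TERMINAL_STATES = SUCCESS_STATES | FAILURE_STATES


def site_reliability(events):
    """Compute per-site reliability from job state transitions.

    events: iterable of (job_id, site, state, timestamp) tuples.
    Returns {site: fraction of finished jobs that ended successfully}.
    Jobs with no terminal state yet are ignored rather than guessed at.
    """
    last_terminal = {}  # job_id -> (timestamp, site, state)
    for job_id, site, state, ts in events:
        if state not in TERMINAL_STATES:
            continue
        prev = last_terminal.get(job_id)
        # Keep only the latest terminal state seen for each job.
        if prev is None or ts > prev[0]:
            last_terminal[job_id] = (ts, site, state)

    ok = defaultdict(int)
    total = defaultdict(int)
    for _ts, site, state in last_terminal.values():
        total[site] += 1
        if state in SUCCESS_STATES:
            ok[site] += 1
    return {site: ok[site] / total[site] for site in total}


events = [
    ("job-1", "SITE-A", "Submitted", 1),
    ("job-1", "SITE-A", "Running", 2),
    ("job-1", "SITE-A", "Done", 3),
    ("job-2", "SITE-A", "Submitted", 1),
    ("job-2", "SITE-A", "Aborted", 2),
    ("job-3", "SITE-B", "Submitted", 1),
    ("job-3", "SITE-B", "Done", 4),
    ("job-4", "SITE-B", "Running", 5),  # still running: not counted
]
reliability = site_reliability(events)
```

Leaving unfinished jobs out of the denominator is a deliberate choice in this sketch: it avoids having to invent a final state for jobs whose outcome is not yet known.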

Principles
• Only gather this information once
  • Propagate to interested parties
• Use existing systems and expertise where possible
  • Don’t try to deploy components on every WMS/RB/L&B/CE/…
  • Get ‘cooked’ data from the systems
• Hook up with Pilot Jobs
  • Linkage between pilot and experiment jobs as a ‘state change’
• ‘Job Wrapper tests’ fit in here too
  • They’re just another state change

Current situation - Gridview
• Currently mines L&B log files and sends the records via R-GMA
  • Loses many records
  • Hacks to ‘finish’ unfinished jobs after 24 hours
• Inaccurate results

Current situation - Dashboard
• Jobs reported via experiment frameworks
• Gathers from many sources: Imperial College XML files, job submission tools, MonALISA reporting from jobs, R-GMA
• But some information is missing for condor_g jobs
  • Info between submission and user job starting on WN
  • Job aborted
• Some work done (Sergey Belov/Dubna) on reporting state changes inside condor_g

Proposal
• Use WLCG Monitoring infrastructure (MSG) for collecting and transporting the data
  • Messaging system
  • Standard message formats
• Work with expert groups to instrument the job submission systems
• Visualization by Gridview + Dashboards
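As an illustration of what a "standard message format" for a job state change might carry, here is a minimal sketch. The field names, the choice of JSON as the wire encoding, and the example values are all assumptions made for illustration; this is not the MSG format, which is defined by the monitoring group.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class JobStateChange:
    """One job state transition. All field names here are hypothetical."""
    job_id: str      # identifier assigned by the submission system
    source: str      # which system reported the change, e.g. "condor_g"
    site: str        # resource the job was sent to
    state: str       # new state of the job
    timestamp: str   # time of the transition, ISO 8601 in UTC
    detail: dict = field(default_factory=dict)  # optional extra key/values


def encode(msg: JobStateChange) -> str:
    """Serialise a message to text for transport (JSON is an assumption)."""
    return json.dumps(asdict(msg), sort_keys=True)


def decode(text: str) -> JobStateChange:
    """Rebuild a message from its transported text form."""
    return JobStateChange(**json.loads(text))


msg = JobStateChange(
    job_id="job-123",
    source="condor_g",
    site="SITE-A",
    state="Running",
    timestamp="2008-04-02T09:30:00Z",
)
wire = encode(msg)
round_tripped = decode(wire)
```

The point of a single shared record shape is that every producer (WMS, condor_g, an OSG bridge) can emit it and every consumer (Gridview, Dashboards) can read it without source-specific parsing.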

Effort
• Provide some effort to do the instrumentation
  • Coordination: WLCG Monitoring (D Rodrigues)
  • Messaging system integration: Gridview
  • EGEE WMS: L&B, Gridview
  • Condor-G: Condor team, Dashboard team
  • OSG CE: through OSG participation in the Joint Monitoring Group (OSG Operations (Rob Quick), measurements & metrics (Brian Bockelman))

EGEE
• L&B notifications mean we don’t have to run components mining L&B log files
  • Consumer of notifications can be remote
• L&B is stated to scale for our needs
  • Tested at >1m records/day
• Testing of integration with notifications is underway by the Gridview team
  • Message formats already defined
• The old log mining approach will be moved entirely to the messaging system to free Gridview from its R-GMA dependency

condor_g
• condor_g submitter instrumented to create L&B messages
  • Done by a separate listener process that is started by condor_g
  • Limited subset of condor_g state changes will be sent
• Listener/reporter can use different transports for reporting
  • Currently MonALISA as a transport layer
  • Will migrate to the WLCG messaging system

EGEE Architecture

OSG
• Gratia is used to transport messages inside OSG
• A Gratia-MSG bridge could be implemented
  • Similar to the RSV bridge used for OSG availability
• Plan to include discussion in the upcoming EGEE-OSG-WLCG design meeting in Madison at the end of May
• Hope to collaborate further with OSG on the infrastructure for the analysis of the collected data, dashboards, etc.

Nordugrid
• Currently the only Nordugrid job info is via the ATLAS production DB
• How do we get information from the CE?
  • Will look to implement a similar bridge if needed
  • Need to work through the technical details with the experts
• Discussion yet to start…

Pilot Jobs
• L&B client resides on every worker node
• Can be used to submit additional messages to L&B for a job
  • Timestamps + environment for Job Wrapper start/end
  • Timestamp of handover to user job
  • Linkage of pilot job to experiment job ID
  • …
• Benefit is that it’s all in one coherent data structure for a given job
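The "one coherent data structure for a given job" idea can be sketched as follows: every event, whether it comes from the submission system or from the worker node, is treated as a state change and folded into a single per-job record. The event layout, the state names ("WrapperStart", "PayloadStarted", "WrapperEnd") and the `experiment_job_id` key are hypothetical and chosen only for illustration; this does not use the L&B client API.

```python
from collections import defaultdict


def build_job_records(events):
    """Fold state-change events into one time-ordered record per job.

    events: iterable of dicts with at least "job_id", "state" and
    "timestamp" (hypothetical layout; timestamps are epoch seconds here).
    The handover event may also carry "experiment_job_id", which links
    the pilot job to the experiment job it ran.
    """
    records = defaultdict(lambda: {"timeline": [], "experiment_job_id": None})
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        rec = records[ev["job_id"]]
        rec["timeline"].append((ev["timestamp"], ev["state"]))
        # The pilot-to-experiment linkage is just another state change.
        if ev["state"] == "PayloadStarted":
            rec["experiment_job_id"] = ev.get("experiment_job_id")
    return dict(records)


events = [
    {"job_id": "pilot-42", "state": "WrapperEnd", "timestamp": 400},
    {"job_id": "pilot-42", "state": "Submitted", "timestamp": 100},
    {"job_id": "pilot-42", "state": "Running", "timestamp": 160},
    {"job_id": "pilot-42", "state": "WrapperStart", "timestamp": 161},
    {"job_id": "pilot-42", "state": "PayloadStarted", "timestamp": 170,
     "experiment_job_id": "exp-7"},
    {"job_id": "pilot-42", "state": "Done", "timestamp": 405},
]
records = build_job_records(events)
```

Because events are sorted by timestamp before being folded in, messages that arrive out of order from different reporters still end up in a consistent timeline.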

Summary
• Propose a more coherent approach to mining job state changes
  • Uses expert knowledge where possible to ‘cook’ the data into a useful structure
• Fits the principles of the WLCG monitoring activity
  • Use common message system components
  • Split the work across the relevant teams
• What have we missed?
  • Your feedback essential…
