OGSA-based Grid Workload Monitoring • R. Zhang (1), S. Heisig (2), S. Moyle (1) and S. McKeever (1) • (1) Oxford University Computing Laboratory • (2) IBM T.J. Watson Research Centre
Complicated Systems • The Open Grid Services Architecture (OGSA) is, in a nutshell: the Grid + Web Services • While OGSA brings computational power and interoperability, it also inevitably yields dynamics and complexity
Complicated Problems • For instance, the system has been slow (i.e. an SLA violation) in the past hour • What is causing the problem? • How can it be fixed and prevented? • We must find out: • Grid services (and underlying platforms) touched • Time spent on services (and underlying platforms) • End-to-end response time composition
Monitoring: The First Step • We need to trace work across Grid services from end to end, monitoring workload and reporting data. • “If you don’t measure it, you can’t control it.” – TQM • Workload monitoring is the first step towards achieving self-managing and self-optimising systems.
Instrumentation • Monitoring points inserted into common (OGSA-based Grid) middleware. • Requests given a unique ID and traced through the system.
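The two bullets above can be illustrated with a minimal Python sketch. The `monitoring_point` wrapper, the in-memory `TRACE` list and the two-layer `container`/`engine` chain are hypothetical names chosen for illustration; the infrastructure described here instruments Grid middleware, which this sketch does not model.

```python
import uuid

TRACE = []  # hypothetical in-memory record of (request_id, point_name)


def new_request_id():
    """Assign a unique ID when a request enters the system."""
    return uuid.uuid4().hex


def monitoring_point(name, handler):
    """Wrap a handler so each call records which request passed through."""
    def wrapped(request_id, payload):
        TRACE.append((request_id, name))
        return handler(request_id, payload)
    return wrapped


# Hypothetical two-layer chain: a container that forwards to a service engine.
engine = monitoring_point("engine", lambda rid, p: p.upper())
container = monitoring_point("container", lambda rid, p: engine(rid, p))

rid = new_request_id()
result = container(rid, "query")

# Services touched by this request, in the order they were entered.
touched = [name for r, name in TRACE if r == rid]
```

Because the ID travels with the request, filtering the trace by ID recovers the services a single request touched.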
Measurement • A timer at every monitoring point measures the local response time. • Subtraction of local timings gives elapsed time, so no clock synchronisation is needed. • [Figure: start/stop timer pairs at each monitoring point: 0 (Client), 1 (Tomcat@eD), 2 (Axis@eD), 3 (Tomcat@Ogsa-Dai), 4 (Axis@Ogsa-Dai)]
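One way to read the subtraction step is sketched below in Python. It assumes a serial, nested call chain in which each monitoring point encloses the next one, so the time spent in a layer alone is its local elapsed time minus that of the layer it calls. The function names and the sample figures are illustrative assumptions, not measurements from the system.

```python
import time


def timed(handler):
    """Run handler and return (result, elapsed_seconds) from the local clock only."""
    start = time.monotonic()
    result = handler()
    stop = time.monotonic()
    return result, stop - start


def exclusive_times(elapsed):
    """Split nested elapsed times (outermost first) into per-layer time.

    Assumes each layer makes exactly one serial call to the next layer.
    """
    out = []
    for i, total in enumerate(elapsed):
        inner = elapsed[i + 1] if i + 1 < len(elapsed) else 0
        out.append(total - inner)
    return out


# Made-up elapsed times in ms for five monitoring points, outermost first.
layers = exclusive_times([100, 90, 70, 40, 25])
```

Each value is a difference of two readings taken on the same machine, which is why no cross-machine clock synchronisation is required. The per-layer values sum back to the end-to-end time seen at the outermost point.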
Reporting • Data is batched and aggregated at agents to reduce reporting overhead. • Data is reported with the Java Message Service (JMS) to provide reliability and scalability.
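The batching and aggregation idea can be sketched as follows in Python. The `Agent` class, its `batch_size` parameter and the `send` callable are hypothetical; `send` merely stands in for publishing a message (for example over JMS) and no real messaging API is used here.

```python
from collections import defaultdict


class Agent:
    """Hypothetical agent that aggregates samples and reports them in batches."""

    def __init__(self, send, batch_size=100):
        self.send = send              # stand-in for a message publish call
        self.batch_size = batch_size
        self._stats = defaultdict(lambda: [0, 0.0])  # service -> [count, total_ms]
        self._pending = 0

    def record(self, service, elapsed_ms):
        entry = self._stats[service]
        entry[0] += 1
        entry[1] += elapsed_ms
        self._pending += 1
        if self._pending >= self.batch_size:
            self.flush()

    def flush(self):
        """Send one aggregated report instead of one message per sample."""
        if not self._pending:
            return
        report = {
            service: {"count": count, "mean_ms": total / count}
            for service, (count, total) in self._stats.items()
        }
        self.send(report)
        self._stats.clear()
        self._pending = 0


sent = []
agent = Agent(sent.append, batch_size=3)
agent.record("A", 10)
agent.record("A", 20)
agent.record("B", 30)  # third sample triggers a single aggregated report
```

Three samples produce one message rather than three, which is the overhead reduction the slide refers to.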
Concurrency Issue • Parallel invocation is common in practice. For example, Grid service A calls B and D in parallel, and then calls C after B and D return. • Concurrency is modelled by a response time service Petri-Net (RTSPN), which is constructed automatically from the data collected.
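To show why concurrency matters for response time composition, the Python sketch below evaluates the example above with a simplified series-parallel model: sequential steps add, parallel branches contribute their maximum. This is a stand-in for illustration and not the RTSPN formalism itself; the tuple encoding and the timing values are assumptions.

```python
def response_time(node, service_ms):
    """Evaluate a series-parallel workflow given per-service times in ms."""
    kind = node[0]
    if kind == "svc":
        return service_ms[node[1]]
    parts = [response_time(child, service_ms) for child in node[1:]]
    if kind == "seq":
        return sum(parts)   # steps run one after another
    if kind == "par":
        return max(parts)   # the slowest parallel branch dominates
    raise ValueError(f"unknown node kind: {kind}")


# A does its own work, calls B and D in parallel, then calls C.
workflow = ("seq", ("svc", "A"), ("par", ("svc", "B"), ("svc", "D")), ("svc", "C"))

# Made-up per-service times in ms.
total = response_time(workflow, {"A": 5, "B": 30, "D": 20, "C": 10})
```

Under these figures D is off the critical path, so speeding up D alone would not change the end-to-end time, whereas a purely sequential model would suggest otherwise.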
Conclusions • We have developed a monitoring infrastructure for OGSA-based Grids that: • discovers services touched; • monitors workload in an end-to-end manner; • captures concurrency in workload; • provides automated visualisation; • is portable (thanks to OGSA), scalable and lightweight (5 ms per request per service).
Future Work • The current infrastructure has enabled research on: • Performance problem determination; • End-to-end performance tuning/service differentiation; • Real eDiaMoND workload data collection; • Instrumentation with finer granularity
We are grateful to • DTI for project grant • IBM for software/research support • eDiaMoND for experiment environment • all of you for coming along • Questions?
RTSPN Construction • Automatic construction from data • Each service receives the ID of the service invoking it. • Each service receives IDs from the services it depends on: • workflow description • temporal relation
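How temporal relations might be recovered from collected data can be sketched in Python as below. The sketch assumes each caller records, on its own clock, the start and stop of every call it makes; calls whose intervals overlap are treated as parallel, and the resulting groups as sequential. The function name and record layout are hypothetical, and this is not the authors' construction algorithm.

```python
def parallel_groups(calls):
    """Group one caller's outgoing calls into sequential stages.

    calls: list of (callee, start, stop) observed on the caller's clock.
    Returns a list of stages; callees within a stage overlapped in time.
    """
    groups = []
    group_end = None
    for callee, start, stop in sorted(calls, key=lambda c: c[1]):
        if groups and start < group_end:
            groups[-1].append(callee)      # overlaps the current stage
            group_end = max(group_end, stop)
        else:
            groups.append([callee])        # starts a new sequential stage
            group_end = stop
    return groups


# Made-up observations for caller A: B and D overlap, C starts after both return.
stages = parallel_groups([("C", 31, 41), ("B", 0, 30), ("D", 1, 20)])
```

With the caller ID giving the edges of the call graph and the stages giving the ordering among siblings, a workflow description for each service can be assembled without manual specification.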