Supervision & Monitoring
Organization and work plan
Olof Bärring
Supervision & Monitoring section

Mandate
• Develop and deploy a monitoring solution that addresses LHC-era needs in areas such as data rates, data volumes and scalability, and that provides appropriate information for users, administrators, operators and management, both for individual component services and in logical service groupings.
• Develop and deploy an automated fault tolerance solution that is compatible with the deployed monitoring solution.
• Develop and maintain infrastructure for remote console access and system reset.
• Fulfil CERN’s commitments to the monitoring and fault tolerance tasks within EDG/WP4 + WP4 management & integration

LCG-1 monitoring: criteria
• All measurement data in Oracle
  • for service and computer center managers
  • powerful reporting tools
  • complex correlation queries
• Physics users must be given access to measurement data
  • API for query/subscription
  • web based query interface?
• Alarm display
  • for operators and service managers

LCG-1 monitoring: client
• WP4 Monitoring Sensor Agent (MSA) deployed on all CPU, disk and tape servers.
• Sensors:
  • FioSensor.pl: exception metrics
  • LinuxSensorProc: performance metrics
  • Castor performance/exception metrics would be desirable? e.g.
    • Tape queue length per device group
    • Tape pools (% free)
    • Drive status (physical and VDQM)
  • Network switch performance metrics

LCG-1 monitoring: Server (1)
• Measurement Repository, deploy:
  • WP4 MR server, TCP or UDP transport
  • PVSS, UDP transport
• Both need to be evaluated w.r.t.
  • Performance in a large deployment
  • Operational & maintenance burden
  • Physics user interface requirements
• Evaluation period: 1–2 months

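To make the transport choice concrete, here is a minimal sketch of shipping one measurement as a single UDP datagram. The slides do not describe the WP4 or PVSS wire format, so the space-separated text layout, the function names and the example node and metric names are all assumptions made for illustration.

```python
import socket
from typing import Tuple


def encode_sample(node: str, metric: str, timestamp: int, value: float) -> bytes:
    # Assumed format: one sample per datagram, fields separated by single spaces.
    # Node and metric names must therefore not contain spaces.
    return f"{node} {metric} {timestamp} {value!r}".encode("ascii")


def decode_sample(datagram: bytes) -> Tuple[str, str, int, float]:
    # Inverse of encode_sample, as a repository server would apply it.
    node, metric, timestamp, value = datagram.decode("ascii").split(" ")
    return node, metric, int(timestamp), float(value)


def send_sample_udp(address: Tuple[str, int], datagram: bytes) -> None:
    # UDP is fire-and-forget: no connection setup per sample,
    # but also no delivery guarantee and no back-pressure.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, address)


# Hypothetical sample; nothing is sent until send_sample_udp is called.
datagram = encode_sample("lxbatch001", "load_avg", 1046700000, 0.75)
```

With UDP a lost datagram is simply a missing sample, which may be acceptable for periodic performance metrics but less so for exception metrics. That is one of the trade-offs a large-deployment evaluation would need to measure.
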
LCG-1 monitoring: Server (2)
• Oracle DB
  • Use PVSS info server to regularly export to Oracle
  • Use WP4 MR server with Oracle backend from David Front (LCG/Israel)
• User interfaces
  • Service mgrs: Oracle tools
  • Users: WP4 repository API + web based query interface
  • Operators: alarm display

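The two access modes asked of the repository API, query and subscription, can be sketched as follows. This is not the WP4 repository API: the class, its method names, and the node and metric names are hypothetical, and an in-memory list stands in for the Oracle backend.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    node: str         # host that produced the measurement
    metric: str       # metric name, e.g. "tape_pool_free_pct"
    timestamp: float  # seconds since the epoch
    value: float


class InMemoryRepository:
    """Hypothetical stand-in for a measurement repository."""

    def __init__(self) -> None:
        self._samples: List[Sample] = []
        self._subscribers: Dict[str, List[Callable[[Sample], None]]] = defaultdict(list)

    def store(self, sample: Sample) -> None:
        # Keep the sample for later queries, then push it to subscribers.
        self._samples.append(sample)
        for callback in self._subscribers[sample.metric]:
            callback(sample)

    def query(self, metric: str, start: float, end: float) -> List[Sample]:
        # Pull model: stored samples of one metric with start <= timestamp <= end.
        return [s for s in self._samples
                if s.metric == metric and start <= s.timestamp <= end]

    def subscribe(self, metric: str, callback: Callable[[Sample], None]) -> None:
        # Push model: callback is invoked for every new sample of the metric.
        self._subscribers[metric].append(callback)


repo = InMemoryRepository()
received: List[Sample] = []
repo.subscribe("tape_pool_free_pct", received.append)
repo.store(Sample("tpsrv001", "tape_pool_free_pct", 100.0, 42.0))
repo.store(Sample("lxbatch001", "load_avg", 101.0, 0.7))
history = repo.query("tape_pool_free_pct", 0.0, 200.0)
```

Queries suit reporting and after-the-fact analysis, while subscriptions suit consumers that must react as data arrives, such as an alarm display.
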
Evaluation phase architecture
[Diagram. Components shown: six MSA clients; oracleMonServer with an Oracle DB; PVSS (on W2K) with a PVSS Info Server exporting to an Oracle DB; three API boxes. Annotation on the diagram: "Can this be given to users?"]

Monitoring deployment: Issues
• WP4 alarm display: needs to be finalized and deployed
• Externalized repository API for PVSS: Andreu’s library requires PVSS client to be installed
• Continue to duplicate efforts for another 2 months knowing that ~half of the work will be thrown away afterwards

LCG-1 monitoring: Scenarios
• Test both solutions in parallel ~2 months
• Document the evaluation and decide:
  • WP4 solution selected
  • PVSS solution selected
  • Would need both because requirement scope too wide, e.g.
    • PVSS alarm display is best for the operators
    • WP4 implementation of the repository API is best for the users

Fault tolerance (FT): plans
• Model the escalation procedures: ~May
  • Tracing of recovery actions
  • Exception escalation hierarchies
• Evaluate WP4 FT framework: ~September
  • Adaptable to the modeled escalation procedure?
  • If not: survey other frameworks (e.g. Pete’s Oracle based correlation engine)
• Adapt the LCG-1 monitoring to the FT recovery action tracing if necessary: ~October
• Deploy: ~November

FT: modeling (~May)
• Model the escalation procedure
[Flowchart. Box labels: Exception raised; Try local repairs; Local recoveries; Escalation?; Try global repairs; Global recoveries; Trace of actions?; Exception reset; Problem fixed!; Alarm raised; Try manual repairs]

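One possible reading of the escalation flowchart is sketched below: local repairs are tried first, then global repairs, and an alarm is raised for manual repair only when no automatic level succeeds, with every step recorded in a trace. The ordering is inferred from the box labels, and the function and repair names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A repair is a callable that returns True when the problem is fixed.
Repair = Callable[[], bool]


def handle_exception(levels: Sequence[Tuple[str, Sequence[Repair]]]) -> Tuple[str, List[str]]:
    """Try each escalation level in order; return (outcome, trace of actions)."""
    trace: List[str] = ["exception raised"]
    for level_name, repairs in levels:
        for repair in repairs:
            fixed = repair()
            trace.append(f"{level_name}: {repair.__name__} -> {'ok' if fixed else 'failed'}")
            if fixed:
                trace.append("exception reset")
                return "fixed", trace
        trace.append(f"{level_name}: escalating")
    # No automatic level succeeded: hand over to the operators.
    trace.append("alarm raised")
    return "manual", trace


# Hypothetical repair actions, for illustration only.
def restart_daemon() -> bool:
    return False  # pretend the local repair did not help


def drain_node_from_batch() -> bool:
    return True   # pretend the global repair worked


outcome, trace = handle_exception([
    ("local", [restart_daemon]),
    ("global", [drain_node_from_batch]),
])
```

Keeping the trace as plain data is what would allow recovery actions to be fed back into the monitoring system and reported alongside ordinary metrics.
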
FT: evaluation (~September)
• Evaluate WP4 FT framework
  • Does it scale to global correlations?
  • Is the rule syntax rich enough?
• Check other frameworks
  • Pete’s Oracle based solution

FT: deployment (~November)
• Make sure the framework works together with the LCG-1 monitoring
  • FT related metrics
  • Correlation engines need:
    • API for data consumption (subscription/queries)
    • API for action tracing (feedback to monitoring)
• Deploy the system and ...
  • Develop correlation engine and exception escalation hierarchies
  • Check that it works in production

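The two interfaces a correlation engine needs can be sketched as a pair of call paths: one that consumes monitoring samples, and one that reports the engine's own actions back as FT-related metrics. The class, the rule (trigger when enough nodes report a dead daemon) and all metric names are assumptions chosen for illustration, not part of the WP4 FT framework.

```python
from typing import Callable, Dict, List, Tuple

# (node, metric, value) triples stand in for monitoring samples.
Sample = Tuple[str, str, float]


class CorrelationEngine:
    """Hypothetical engine: consumes samples, emits traced actions."""

    def __init__(self, threshold: int, trace_sink: Callable[[Sample], None]) -> None:
        self._threshold = threshold      # failing nodes needed to trigger
        self._trace_sink = trace_sink    # feedback path into monitoring
        self._failing: Dict[str, bool] = {}

    def consume(self, sample: Sample) -> None:
        # Data consumption API: called once per incoming sample.
        node, metric, value = sample
        if metric != "daemon_dead":
            return
        self._failing[node] = value > 0
        count = sum(self._failing.values())
        if count >= self._threshold:
            # Action tracing API: report the decision as an FT metric.
            self._trace_sink(("ft-engine", "global_recovery_triggered", float(count)))
            self._failing.clear()


traced: List[Sample] = []
engine = CorrelationEngine(threshold=2, trace_sink=traced.append)
for s in [("node1", "daemon_dead", 1.0),
          ("node2", "load_avg", 3.5),
          ("node2", "daemon_dead", 1.0)]:
    engine.consume(s)
```

Because the trace sink takes the same sample shape the engine consumes, recovery actions can be stored and queried like any other measurement.
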
Timelines
[Gantt chart, time axis Feb to Dec.
Monitoring tasks: Deploy WP4 server; Deploy PVSS; Gather input from selected set of LHC users; Run both systems in parallel; Evaluation report; Selection; Maintenance.
Fault tolerance tasks: Model escalation and tracing; Evaluate WP4 FT framework; Adapt and deploy.]

Other tasks
• Develop and maintain infrastructure for remote console access and system reset
  • Strategy, man-power ??
• WP4 management
  • WP4 manager
  • WP4 monitoring task leader

Who does what?