
Supervision & Monitoring

Organization and work plan. Olof Bärring.


Presentation Transcript


  1. Supervision & Monitoring
     Organization and work plan
     Olof Bärring, Supervision & Monitoring section

  2. Mandate
     • Develop and deploy a monitoring solution that addresses LHC-era needs in areas such as data rates, data volumes and scalability, and that provides appropriate information for users, administrators, operators and management, both for individual component services and in logical service groupings.
     • Develop and deploy an automated fault-tolerance solution that is compatible with the deployed monitoring solution.
     • Develop and maintain infrastructure for remote console access and system reset.
     • Fulfil CERN's commitments to the monitoring and fault-tolerance tasks within EDG/WP4.
     • WP4 management & integration.

  3. LCG-1 monitoring: criteria
     • All measurement data in Oracle
       • for service and computer centre managers
       • powerful reporting tools
       • complex correlation queries
     • Physics users must be given access to measurement data
       • API for query/subscription
       • web-based query interface?
     • Alarm display
       • for operators and service managers
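To make the "complex correlation queries" criterion concrete, here is a minimal sketch of the kind of query a service manager might run against the measurement repository, joining exception records against performance metrics for the same node and time window. It uses sqlite3 purely as a stand-in for Oracle, and the table and column names are hypothetical, not the actual WP4 repository schema.

```python
# Sketch only: sqlite3 stands in for Oracle; schema names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE perf_metrics (node TEXT, ts INTEGER, metric TEXT, value REAL);
CREATE TABLE exceptions   (node TEXT, ts INTEGER, name TEXT);
INSERT INTO perf_metrics VALUES
  ('lxb001', 100, 'load1', 7.5),
  ('lxb001', 100, 'swap_used_pct', 92.0),
  ('lxb002', 100, 'load1', 0.3);
INSERT INTO exceptions VALUES ('lxb001', 105, 'high_load');
""")

# Correlate: which exceptions occurred on nodes whose 1-minute load
# exceeded 5 within the preceding 60 seconds?
rows = conn.execute("""
SELECT e.node, e.name, p.value
FROM exceptions e
JOIN perf_metrics p
  ON p.node = e.node
 AND p.metric = 'load1'
 AND p.ts BETWEEN e.ts - 60 AND e.ts
WHERE p.value > 5
""").fetchall()
print(rows)  # [('lxb001', 'high_load', 7.5)]
```

The same join would be expressed directly in Oracle's reporting tools; the point is only that a relational backend makes such cross-metric correlation a single query.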

  4. LCG-1 monitoring: client
     • WP4 Monitoring Sensor Agent (MSA) deployed on all CPU, disk and tape servers.
     • Sensors:
       • FioSensor.pl: exception metrics
       • LinuxSensorProc: performance metrics
     • Castor performance/exception metrics would be desirable, e.g.:
       • tape queue length per device group
       • tape pools (% free)
       • drive status (physical and VDQM)
     • Network switch performance metrics
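The sensor/agent split on this slide can be sketched as follows: each sensor produces (metric, value) samples, and the agent collects them and stamps them with node and time before forwarding to the repository. All class and metric names below are illustrative stand-ins, not the actual WP4 MSA interfaces (the real sensors are FioSensor.pl and LinuxSensorProc).

```python
# Illustrative sketch of a sensor agent; not the real WP4 MSA code.
import time

class PerfSensor:
    """Performance metrics, in the spirit of LinuxSensorProc."""
    def samples(self):
        # A real sensor would parse /proc; fixed values used here.
        return [("load1", 0.42), ("mem_free_kb", 123456)]

class ExceptionSensor:
    """Exception metrics, in the spirit of FioSensor.pl."""
    def samples(self):
        return [("daemon_dead", 0)]

class MonitoringSensorAgent:
    def __init__(self, node, sensors):
        self.node, self.sensors = node, sensors

    def collect(self):
        """Gather one timestamped record per sensor sample."""
        ts = int(time.time())
        return [(self.node, ts, metric, value)
                for s in self.sensors
                for metric, value in s.samples()]

agent = MonitoringSensorAgent("lxb001", [PerfSensor(), ExceptionSensor()])
for record in agent.collect():
    print(record)  # each record: (node, timestamp, metric, value)
```

A real agent would then ship these records to the measurement repository over TCP or UDP, as discussed on the server slides.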

  5. LCG-1 monitoring: server (1)
     • Measurement Repository, deploy:
       • WP4 MR server, TCP or UDP transport
       • PVSS, UDP transport
     • Both need to be evaluated w.r.t.:
       • performance in a large deployment
       • operational & maintenance burden
       • physics user interface requirements
     • Evaluation period: 1-2 months

  6. LCG-1 monitoring: server (2)
     • Oracle DB
       • use the PVSS info server to regularly export to Oracle
       • use the WP4 MR server with the Oracle backend from David Front (LCG/Israel)
     • User interfaces
       • service managers: Oracle tools
       • users: WP4 repository API + web-based query interface
       • operators: alarm display

  7. Evaluation phase architecture
     [Architecture diagram: multiple MSAs feed measurements through an API into the two repositories under evaluation in parallel: the WP4 oracleMonServer backed by an Oracle DB, and PVSS (on W2K) with a PVSS Info Server exporting to an Oracle DB. Annotation on the repository API: "Can this be given to users?"]

  8. Monitoring deployment: issues
     • WP4 alarm display: needs to be finalized and deployed
     • Externalized repository API for PVSS: Andreu's library requires a PVSS client to be installed
     • We continue to duplicate efforts for another ~2 months, knowing that roughly half of the work will be thrown away afterwards

  9. LCG-1 monitoring: scenarios
     • Test both solutions in parallel for ~2 months
     • Document the evaluation and decide:
       • WP4 solution selected
       • PVSS solution selected
       • both needed, because the requirement scope is too wide, e.g.:
         • the PVSS alarm display is best for the operators
         • the WP4 implementation of the repository API is best for the users

  10. Fault tolerance (FT): plans
     • Model the escalation procedures: ~May
       • tracing of recovery actions
       • exception escalation hierarchies
     • Evaluate the WP4 FT framework: ~September
       • adaptable to the modeled escalation procedure?
       • if not: survey other frameworks (e.g. Pete's Oracle-based correlation engine)
     • Adapt the LCG-1 monitoring to the FT recovery-action tracing, if necessary: ~October
     • Deploy: ~November

  11. FT: modeling (~May)
     • Model the escalation procedure
     [Flow diagram: an exception is raised; local recoveries try local repairs; on escalation, global recoveries try global repairs; if these also fail, an alarm is raised and manual repairs are tried; once the problem is fixed, the exception is reset. Open questions in the diagram: "Escalation?" and "Trace of actions?"]
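The escalation procedure on this slide can be sketched as a small state machine, under the assumption that recovery proceeds local, then global, then manual (with an alarm raised for the operators), and that every attempted action is traced. Function names and trace labels are illustrative, not the modeled WP4 procedure itself.

```python
# Sketch of the escalation procedure, assuming local -> global -> manual
# ordering; names are illustrative.
def handle_exception(exc, repairs, trace):
    """Try each repair level in order, tracing every action taken."""
    for level in ("local", "global", "manual"):
        if level == "manual":
            trace.append((exc, "alarm_raised"))  # hand over to operators
        trace.append((exc, "try_" + level))
        action = repairs.get(level)
        if action and action(exc):
            trace.append((exc, "exception_reset"))  # problem fixed
            return True
    return False  # exhausted all repair levels

trace = []
# Example: the local repair fails, the global repair succeeds.
repairs = {"local": lambda e: False, "global": lambda e: True}
fixed = handle_exception("disk_full", repairs, trace)
print(fixed, trace)
```

The accumulated trace is exactly the "trace of actions" the diagram asks about: a record of which recovery levels were attempted and how the exception was resolved, fed back to monitoring.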

  12. FT: evaluation (~September)
     • Evaluate the WP4 FT framework
       • does it scale to global correlations?
       • is the rule syntax rich enough?
     • Check other frameworks
       • Pete's Oracle-based solution

  13. FT: deployment (~November)
     • Make sure the framework works together with the LCG-1 monitoring
       • FT-related metrics
       • correlation engines need:
         • an API for data consumption (subscription/queries)
         • an API for action tracing (feedback to monitoring)
     • Deploy the system and ...
       • develop the correlation engine and exception escalation hierarchies
       • check that it works in production
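The two APIs the slide says a correlation engine needs, subscription to measurement data and action tracing fed back to monitoring, can be sketched together. The repository below is an in-memory stand-in, and the rule, metric and action names are invented for illustration; none of this is the real WP4 repository API.

```python
# Sketch of the subscription and action-tracing APIs; all names invented.
class Repository:
    """In-memory stand-in for the measurement repository."""
    def __init__(self):
        self.subscribers, self.traces = [], []

    def subscribe(self, metric, callback):
        """Data-consumption API: deliver matching measurements."""
        self.subscribers.append((metric, callback))

    def publish(self, node, metric, value):
        for m, cb in self.subscribers:
            if m == metric:
                cb(node, metric, value)

    def trace_action(self, node, action):
        """Action-tracing API: feedback to monitoring."""
        self.traces.append((node, action))

class CorrelationEngine:
    def __init__(self, repo):
        self.repo = repo
        repo.subscribe("load1", self.on_load)

    def on_load(self, node, metric, value):
        if value > 5:  # illustrative rule, not WP4 rule syntax
            self.repo.trace_action(node, "restart_batch_daemon")

repo = Repository()
engine = CorrelationEngine(repo)
repo.publish("lxb001", "load1", 7.5)  # triggers a traced recovery action
repo.publish("lxb002", "load1", 0.3)  # below threshold, no action
print(repo.traces)  # [('lxb001', 'restart_batch_daemon')]
```

Because the engine's recovery actions flow back through the same repository, the monitoring side can display and correlate them alongside the raw metrics, which is the integration point the deployment step has to verify.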

  14. Timelines
     [Gantt-style chart, Feb to Dec: deploy WP4 server and deploy PVSS; gather input from a selected set of LHC users; run both systems in parallel; evaluation report, selection, then maintenance. Fault-tolerance track: model escalation and tracing; evaluate the WP4 FT framework; adapt and deploy.]

  15. Other tasks
     • Develop and maintain infrastructure for remote console access and system reset
       • strategy, man-power??
     • WP4 management
       • WP4 manager
       • WP4 monitoring task leader

  16. Who does what?
