COSMOS
Controls Open Source MOnitoring System
Project Status Report
Frank on behalf of the COSMOS core team
BE-CO Technical Meeting: 11-04-2019
Agenda
• Reminders and final architecture overview
• Completion of Phase-1, progress of Phase-2
• Some considerations and conclusions
• DEMO with BE-CO/IN use cases (Yann)
  • Grafana dashboards
  • Icingaweb tool
Some general reminders
• Objectives
  • Renovate the CO infrastructure monitoring system
  • Reduce the quantity and complexity of tools
  • Optimize sys-admin maintenance & support
  • Provide integrated and efficient tools for user diagnosis
• Proposal
  • Monitoring of the infrastructure data (not accel. variables)
  • New paradigm to delegate specific parts to final users: low-level data collection (host/services checks) and dashboards
  • Align our monitoring system to IT solutions
  • Rely on modern, industry-standard OSS technologies
COSMOS phase-1 (target: Q4/2018) 1/2
• Setup of the monitoring infrastructure is completed, over the TN and GPN
  • Aligned with IT services for the OS metrics (collectd) and relying on their services for storage (DBOD) and authentication (LDAP)
  • No IT support for Grafana/OpenShift over the TN, so we set up our own Grafana instance
  • No SSO and no access from outside CERN, but availability, reliability and performance!
• Monitoring of non-critical hosts & services
  • >6600 hosts monitored (CC7, SLC6 and Windows systems)
  • All consoles, servers (95%), virtual computers and FECs (98%): ping, ssh
  • >21300 services monitored (HW/SW)
  • CO “golden use-cases”: Quad enclosure, ELMA/Wiener crates, redundant power systems, WRs, timing distribution, GPS receivers and uTCA platform
  • Specific checks: LBDS kickers crates (RDA), Post-Mortem disks state
  • >138k collectd metrics: disk, cpu, memory, network, …
COSMOS phase-1 2/2
• Delegate specific parts to the final users
  • GitLab service for storage, build and deployment of the Icinga plugins (checks)
  • Grafana/VML for development + read-only access to the Grafana/TN instance for production
  • Grafana training has been delivered for CO-IN experts
  • Additional training for CO and Eqp. groups to be scheduled
• ~40 custom checks
  • bash, C/C++, Python
  • CO-IN, CO-APS, CO-SRC, BE-ICS, TE-MPE, TE-ABT
• ~50 Grafana dashboards
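To illustrate what a user-delegated custom check might look like, here is a minimal Python sketch of an Icinga-style plugin. It is not taken from the COSMOS repository: the function names, the disk-usage example and the default thresholds are assumptions chosen for illustration. It relies only on the standard Nagios/Icinga plugin convention (exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN, and optional performance data after a `|`).

```python
import shutil
from typing import Tuple

# Exit codes defined by the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct: float, warn: float, crit: float) -> Tuple[int, str]:
    """Map a disk usage percentage to a plugin state and a one-line output."""
    if used_pct >= crit:
        state = CRITICAL
    elif used_pct >= warn:
        state = WARNING
    else:
        state = OK
    # Text before '|' is the human-readable status; text after it is
    # performance data in the form label=value[UOM];warn;crit;min;max.
    message = (
        f"DISK {STATE_NAMES[state]} - {used_pct:.1f}% used"
        f" | used_pct={used_pct:.1f}%;{warn:g};{crit:g};0;100"
    )
    return state, message


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0) -> Tuple[int, str]:
    """Hypothetical check: usage of the filesystem holding `path` (thresholds assumed)."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    return evaluate(100.0 * usage.used / usage.total, warn, crit)
```

A deployed plugin would print the message and exit with the state code, for example `code, msg = check_disk("/"); print(msg); sys.exit(code)`, so that Icinga can pick up both the status line and the metric.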
[Figure: COSMOS architecture diagram. Labels recoverable from the slide, by layer. VISUALISATION: Grafana dashboard, icingaWeb, DIAMON console, others (MOON, …). STORAGE: Elasticsearch, influx, with Grafana, collectd and Icinga servers (MON1) and REST interfaces. ACQUISITION: Icinga active checks (SNMP, RDA, IPMI, PING, SSH, HTTP; service and host), Icinga passive checks, collectd, prometheus, JMX, SHM, CMX. Sources: HW resources (crates, sensors, etc.), Java, RDA service, C/C++ (FESA, timing, etc.), OS, C/C++ (wfip, mfip). Legend: Icinga POLL, Icinga PUSH.]
COSMOS phase-2 timeline
[Timeline figure with a month axis spanning 2019 and 2020; milestone labels:]
• COSMOS fully operational for Run3
• DIAMON console migrated to COSMOS
• PLC monitoring BE-CO, BE-ICS
• C/C++ process metrics (Timing, FESA, …)
• White-rabbit check including WR-Nodes
• All CO specific infrast. checks (RAID, SSD/ECC mem, HW config. check, ether. speed, …)
• WorldFIP monitoring (MFIP, WFIP)
• Metrics from all ACC hosts including Cryo FECs, PVSS and TIM servers
• Java metrics (from prometheus)
• Process alive (up/down check)
• Gradual phase-out of DIAMON (dmnclic) after LS2 in sync with LUMENS project (process manag.)
• DIAMON console is migrated to COSMOS for the operators (gradual introduction to Grafana)
• Ansible deploy., infra. config and users data backup, recovery procedure, … (Q4/19)
Further points to consider
• COSMOS config. is CCDB driven but also uses LANDB information: we need comprehensive, consistent and up-to-date information from both; several actions are in progress with CO-DS!
• We need to gain experience in real operation and find suitable solutions:
  • To expose hosts/services status in a fast and scalable way to our clients, e.g. DIAMON console and MOON (Icinga API gateway?)
  • To provide integrated tools and automation for users to configure checks and notifications and to deploy their dashboards
  • To assess which type and level of service we can fully delegate to the users without risking loss of overall control
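One way to picture the "Icinga API gateway?" idea is a small caching layer between many clients and the monitoring backend. The Python sketch below only illustrates that pattern and is not a design from the COSMOS team: `StatusCache`, the `fetch` callable and the 30-second TTL are all hypothetical, and the real backend call is deliberately left abstract.

```python
import time
from typing import Callable, Dict, Tuple


class StatusCache:
    """Serve host statuses from memory, refreshing each entry after a TTL.

    `fetch` stands in for whatever call retrieves a status from the
    monitoring backend; it is a placeholder, not a real Icinga API.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 30.0,  # assumed value, for illustration only
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        # Reuse the cached status while it is younger than the TTL.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Otherwise query the backend once and remember the answer.
        status = self._fetch(host)
        self._entries[host] = (now, status)
        return status
```

With this shape, many clients polling the same host within the TTL cost the backend a single query; the trade-off is that a status can be stale by up to the TTL.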
Conclusions
• At the beginning of this year we gained a lot of experience deploying COSMOS to over 90% of hosts and testing it on “critical” systems (by the end of April, 100% of hosts will be covered)
• We are now convinced we have made the right choices regarding design and technologies
• The system is already widely used by the CO-IN experts, but there is still a lot of room to make it more exploitable on a larger scale
• Many thanks to the COSMOS core team and all contributors from CO!