Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO December 2005
Aim of the Scaling Up Project • Investigate functionality and performance of large PVSS systems • In Phase 1 we reassured ourselves that PVSS scales to support large systems • Provided detail rather than bland reassurances
Phase 2: WYSIWYAF • Began with a questionnaire to you to establish your concerns • Eclectic list of “hot topics of the moment” • Oracle Archiving • Alerts • Regular reconfiguration of channels (alerts and setpoints) • Backup and restore • Configuring all channels at startup
Your requests (cont.) • OPC performance • Local DB cache • Central Panel Repository • Windows/Linux lurking limits • System startup time (DPT distribution) • Task Allocation
Menu • From these requests, we initially picked out four for investigation: • Task Allocation • Backup of a running system • Alerts • Panel Repository
Task Allocation • [Diagram: a PVSS system as a set of managers: User Interface (UI) editors and runtimes, API managers, Control (CTRL) managers, Database managers (DB), Event managers (EV) and Drivers (D)] • Recall that PVSS is manager based and any manager can be scattered to another machine (not just UIs).
Task Allocation • More than 20 different tests conducted to investigate the effect of moving managers around. • Results have been available on the web for some time (URLs at the end) • Results were surprising and went against our (& ETM’s!) assumptions of what would be “better”…
What we measured… • A task allocation was deemed “better” if it supported a higher number of datapoint changes per second (“throughput”) than a system running entirely on a single processor. • We observed the number of changes per second that the system could support before one of the following became overloaded: • CPU usage • Memory usage • Network traffic • Disk traffic
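To make the throughput measurements concrete: the kind of load such a test needs can be produced by a short CTRL script that fires dpSet calls at a chosen rate. The following is a minimal sketch, not our actual test harness; the datapoint element "ScalingTest.value" is a hypothetical example.

  main()
  {
    int changesPerSecond = 1000; // target rate to push the system to
    int i;

    while (true)
    {
      for (i = 0; i < changesPerSecond; i++)
      {
        // each dpSet is one datapoint change for the EM/DM to process
        dpSet("ScalingTest.value", rand());
      }
      delay(1); // crude pacing: one burst of changes per second
    }
  }

The rate is raised until one of the four resources listed above saturates.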
What we saw… • As throughput increases on a typical PVSS system, the machine first becomes CPU bound. • The Event Manager (EM) is the task most in need of CPU. • We expected that scattering the EM away from the Data Manager (DM) would cause slow-down because of the high traffic between these tasks. WRONG!
Scattering the EM • Despite the overhead of sending the traffic between the EM and the DM over the external network, scattering the EM increased throughput significantly (+75%).
AES • The Alert-Event Screen (AES) is CPU-hungry. • Runs in a UI task which can be scattered. • Beware: Each additional AES not only increases the load on its own machine, but also increases the load on the EM to which it is connected.
Recommendation • Execute as few AESs as possible outside the main control room. • When you are not actually looking at the AES, leave it in “stopped” mode. (Screen is not updated.)
Scattering other managers • Can improve throughput, but not as spectacularly as scattering the EM. • Moving the DM is useful, but more delicate (e.g. where there are many Value Archive (VA) connections).
Absolute Performance • The average number of “changes per second” that can be supported depends on the nature of the traffic. • A steady data flow is easier to cope with. • Irregular bursts of rapid traffic tend to overflow the queues between the managers. (Queue lengths are configurable.)
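Note that the sketch shown earlier delivers its changes as one burst per second, exactly the shape that stresses the queues. Spreading the same number of changes evenly across the second gives the steady flow that is easier to cope with. A minimal variation (same hypothetical datapoint as before):

  main()
  {
    int changesPerSecond = 1000;
    int i;

    while (true)
    {
      for (i = 0; i < changesPerSecond; i++)
      {
        dpSet("ScalingTest.value", rand());
        delay(0, 1000 / changesPerSecond); // pause in ms between changes: steady flow, no bursts
      }
    }
  }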
Load Management • PVSS implements several Load Management schemes, e.g. • Alert screen update pauses during a brief avalanche • Alert screen switches into Stopped mode if the sustained number of alerts arriving is crazy
Load Management - II • Load Shedding, where the EM will cut the umbilical to rogue managers rather than be brought down itself. • I recommend that shift operators be taught to recognise the symptoms when they occur.
Multiple CPUs • An alternative to scattering: buy a dual processor! • 2 CPUs are generally enough to satisfy even the hungry Event Manager. • Our dual-CPU machines became disk bound when we pushed them. (A tribute to the well-balanced design of modern PCs!)
RAM • Check how much memory you are using. • Buy enough of it. • If you are worried about performance, paging is wasted effort!
Task summary • Give plenty of CPU capacity to the EM by: • Buying a fast machine • Scattering the EM • Buying a dual-CPU machine
Menu • Task Allocation • Backup of a running system • Alerts • Panel Repository
Backup • In the development systems, nobody did backups. • PVSS backup is somewhat intricate. • Hence the need for a set of backup recipes.
18-page Report • What needs backing up • What this means in PVSS • How to back it up • How to restore (rather important!) • Handout
Four Parts • 1) Executive Summary • 2) Recipes • 3) Detailed Background Description • 4) Frequently Asked Questions about Backup. (I’m not going to go through them, just let you know that they exist.)
Menu • Task Allocation • Backup of a running system • Alerts • Panel Repository
Alerts • PVSS 3.5 (due in 200x) will contain new functionality for summary alerts and alert provocation during ramping. • I did not do in-depth performance measurements on the existing system, beyond those I described to you in Phase 1 of S.U.P.
At the request of one experiment, though, we did investigate “What is the load of an alert definition on a PVSS system?” • Results on the web (Test 38).
Loads of Alert Definitions • We showed that it is safe to declare any number of alerts and even to activate them provided that the data values stay in range. • It is provocation of the warnings and alerts that incurs a significant CPU load.
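In other words, the cost lies in the transitions, not in the definitions. As an illustration (the DPE "Tank.level" and its configured alert range are hypothetical, not from our tests):

  main()
  {
    // assume "Tank.level" has an alert handler configured for values above 100
    dpSet("Tank.level", 50.0);  // in range: no alert provoked, negligible CPU cost
    dpSet("Tank.level", 150.0); // out of range: alert provoked, CPU load incurred
    dpSet("Tank.level", 50.0);  // back in range: the alert clears, again costing CPU
  }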
Memory load • Test 39 looked at the memory usage of alerts. • Each DPE alert requires about 2.5 KB of memory.
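To put that figure in perspective: a system declaring, say, 100 000 DPE alerts (an arbitrary illustrative number, not one of our test configurations) would need roughly 100 000 × 2.5 KB ≈ 250 MB of RAM for alert handling alone.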
Menu • Task Allocation • Backup of a running system • Alerts • Panel Repository
Panel Repository • Owing to staffing changes in the section, it was not possible to address this topic.
On the subject of panels… • During the tests I would have found it helpful to have a ready display of the interconnection status of the distributed systems. • I recommend showing this on the top-level display panel. (Even just a grid of red/green pixels showing connection status.) Lost connections should raise an alert.
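A minimal sketch of how such a display could be fed, assuming the PVSS internal datapoint element "_Connections.Dist.ManNums" (a dyn_int listing the system numbers currently connected; do check the exact element name in your PVSS version):

  main()
  {
    // invoke the callback whenever the set of connected systems changes
    dpConnect("distConnChanged", "_Connections.Dist.ManNums");
  }

  distConnChanged(string dpe, dyn_int connectedSystems)
  {
    DebugN("Connected distributed systems:", connectedSystems);
    // a top-level panel would colour its red/green grid from this list;
    // a system number disappearing from it should raise an alert
  }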
Other questions • During the tests, I was approached by different experiments with other issues! • We agreed to investigate the following…
PVSS Disturbance • Together with Alice we looked at the effect of heavy external (unrelated) network traffic on PVSS. • Results written up as Tests 28 & 29. • Use 100 Mbit Ethernet with switches, not hubs. • The conclusion was that external traffic is not a problem.
Traffic Pattern • For Atlas we compared the CPU load demanded by: • Changing 1 item N times vs • Changing N items once each • Result: the CPU load was the same in both cases.
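For clarity, the two patterns compared were of this form (the DPEs "Test1.value" … "TestN.value" are hypothetical):

  main()
  {
    int N = 1000;
    int i;

    // pattern (a): one element changed N times
    for (i = 0; i < N; i++)
      dpSet("Test1.value", i);

    // pattern (b): N elements changed once each
    for (i = 1; i <= N; i++)
      dpSet("Test" + i + ".value", 0);
  }

In both cases the total number of dpSet calls, and hence the load on the EM, is the same N.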
Long Term Test (LTT) • With CMS’ machines (for the use of which we are very grateful!) we ran a long term test: • Generated random data • Recorded it and displayed it continuously on a trend • Distributed system • Results
LTT Results • The electricity supply at CERN is unreliable. You really do need a UPS. • The CERN campus AFS servers are relatively unreliable and should never be used in a production system! • The CERN network infrastructure is very reliable, but can break.
Network Problem • One network break revealed that the CERN default Linux OS settings actually prevent PVSS’s automatic recovery feature from accomplishing its goal. • A caching problem. It is written up in two pages covering background, symptoms, explanation, how to fix it if it does happen to you, and how to avoid it happening in the first place.
“Side Effects” of SUP Project • Accumulated a large body of practical experience wrestling with PVSS. • Systematically recorded for your benefit. • Where?
FAQs • FAQ pages on http://cern.ch/itcobe • Not restricted to today’s frequent questions, but also ones that we foresee will become frequent in the near future, e.g. • My disk is nearly full! What can I do? • My archive file is corrupt. What can I do? • Please spread the word, tell your friends…
FAQ Categories • PVSS - Linux specific • PVSS - Messages • PVSS - Miscellaneous • PVSS - Printing • PVSS – Programming • PVSS - Production Systems • PVSS - Run-time problems • PVSS - Scattered Systems • General Support Issues • Framework • PVSS - Installation • PVSS - Project Creation • PVSS - Alerts (Alarms) • PVSS - Import/Export • PVSS - Archiving • PVSS - Access Control • PVSS - Backup-Restore • PVSS - Cross Platform • PVSS - Distributed Systems • PVSS - Drivers • PVSS - Excel Report • PVSS - Folklore • PVSS - Graphics
Folklore • What the FAQs don’t really address is the folklore that is built up in a close-knit team. • Often this information is unknown (or inaccessible) to outsiders.
Folklore • Enter the Wiki… • Web pages editable from inside a browser. • The Controls Wiki: • Only CERN users can add (or change existing) content. • Readable worldwide. (It is already used as a reference by non-HEP organisations!) • Folklore often embodies recommended ways of doing things. Do read it, and keep reading it… • …and edit it. It belongs to you!
Example Recommendations in the Folklore • Assume one PVSS system per machine (service restriction in Windows) • Place the EM/DM on a different CPU from the OPC clients/servers (protects the EM against CPU overload from OPC; gives freedom to move the EM to Linux) • In a Summary (Group) alert, hang the summary alert on a CHAR-type (not STRING-type) DPE. It is more efficient.
Support Issues • Final remark: SUP has generated a fair number of support issues that have been followed up with ETM (“bugs you didn’t know you nearly had”). This is a significant contribution to the robustness of PVSS systems.
Summary • I do not claim to have answered all questions about building large systems. • New questions come up frequently anyway. • We have shown that PVSS will scale to build large systems • We have investigated the “hot topics of the moment” as defined by you.
To read a summary of the salient points of the most recent tests, including a discussion of the observed “Emergent Behaviour” in large systems, see my ICALEPCS paper, “Scaling Up PVSS”. • We are now bringing this project to a close. • Thank you! • Any (more) questions?
Reference Links • Scaling Up Home Page: http://cern.ch/itcobe/Projects/ScalingUpPVSS/welcome.html • IT-CO-BE FAQs: http://itcobe.web.cern.ch/itcobe/Services/Pvss/FAQ/ • (T)Wiki: https://uimon.cern.ch/twiki/bin/view/Controls/PVSSFolkLore#PVSS_Folklore • ICALEPCS paper “Scaling Up PVSS”: http://elise.epfl.ch/pdf/P1_056.pdf