DCS Test campaign Peter Chochula
Two main activity areas of DCS tests: • Performance and stability tests • Resource consumption (implications for system scale and hardware requirements) • The tests also cover the DCS core computing (database, domain controllers, RAS…), system management and security • This talk focuses only on tests related to the final system scale
Performance tests with impact on DCS scale planning • PVSS II is a system which can be distributed and/or scattered across many CPUs • Two extreme approaches are possible: • Group all processes on one machine • Even if this configuration runs stably for some systems, problems could appear under peak load • Dedicate one machine per task (LV, HV…) • Computer resources would surely be wasted • Defining the optimal balance between performance and system size requires tests with realistic hardware
The PVSS II Architecture • PVSS is based on “managers” which communicate via TCP/IP • These managers can be scattered across the network • Managers communicate with drivers providing access to the hardware (e.g. via OPC, DIM…) • [Diagram: UI, EM, CM and VA managers connected via OPC clients and OPC servers to the hardware] • A minimal illustration of the manager idea follows below
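Below is a minimal, self-contained Python sketch of the manager idea only: two independent processes exchange datapoint updates over TCP/IP, so they could equally run on one machine or be scattered across several. This is not the actual PVSS protocol or API; the port, the DPE name and the JSON message format are illustrative assumptions.

```python
# Toy illustration of loosely coupled "managers" talking over TCP/IP.
# Not the PVSS protocol; all names and the message format are hypothetical.
import json
import socket
import threading
import time

def event_manager(host="127.0.0.1", port=5678):
    """Toy stand-in for a manager that receives datapoint updates."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, _ = srv.accept()
    with srv, conn:
        for line in conn.makefile():
            update = json.loads(line)
            print("manager received:", update["dpe"], "=", update["value"])

def driver(host="127.0.0.1", port=5678):
    """Toy stand-in for a driver publishing datapoint element (DPE) updates."""
    with socket.create_connection((host, port)) as sock:
        for value in (1480.0, 1481.5, 1479.8):
            msg = {"dpe": "hv/channel001/vMon", "value": value}
            sock.sendall((json.dumps(msg) + "\n").encode())

if __name__ == "__main__":
    t = threading.Thread(target=event_manager, daemon=True)
    t.start()
    time.sleep(0.2)          # give the listener a moment to start
    driver()
    t.join(timeout=2.0)
```

Because the two sides only share a TCP connection, the same pattern works whether both processes run on one node or on separate machines, which is exactly the scaling question the tests address.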
Who is doing what? • ACC follows tests performed by other groups and provides feedback to sub-detectors • ACC performs tests which complement the work of JCOP and other groups. This includes: • Tests not performed by external groups • Tests for which the external planning is incompatible with our schedule (e.g. OS management) • ALICE-specific tests (FERO)
DCS Performance Tests • Hardware is selected by sub-detectors; we can provide feedback from JCOP for supported devices (and this is happening, e.g. CAEN, Wiener…) • HW access is supported via OPC and DIM • Performance tests depend on hardware availability • Some results for CAEN will be shown later in this talk • [Diagram: ConfDB, FSM, PVSS, HW access, ArchDB, hardware]
DCS Performance Tests • PVSS II performance has been studied by the JCOP “Scaling Up Project” (SUP) • See some results later in this talk • The FSM mechanism is based on JCOP developments • The ALICE-specific implementation is being tested by HMPID and TPC • Implementation guidelines available to the collaboration for both HV and LV (Giacinto) • Tests with OPC simulators scheduled (a comparison between real hardware access and the simulator is now available)
DCS Performance Tests • Configuration DB performance will be reviewed by the FWWG on 13 January 2005 • This will cover framework devices • FERO configuration • SPD prototype indicates that the full SPD can be configured within 10-15 s (including HW) • TPC/TRD prototyping started; ACC is testing the performance of BLOBs (see the sketch below) • Archival • New version replacing private PVSS archives with ORACLE to be released soon • Svetozar Kapusta starts to work with us as a doctoral student as of mid-January
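As an illustration of what the BLOB performance test involves, here is a minimal Python sketch of timing a configuration BLOB insert into Oracle with cx_Oracle. It is not the ACC test code; the table name fero_config, the connection details taken from environment variables and the 1 MB payload size are assumptions.

```python
# Minimal sketch: time how long one configuration BLOB insert takes.
# Table name, credentials and payload size are illustrative assumptions.
import os
import time
import cx_Oracle

def time_blob_insert(payload: bytes) -> float:
    conn = cx_Oracle.connect(os.environ["ORA_USER"],
                             os.environ["ORA_PASS"],
                             os.environ["ORA_DSN"])
    cur = conn.cursor()
    cur.setinputsizes(None, cx_Oracle.BLOB)   # bind the payload as a BLOB
    start = time.monotonic()
    cur.execute("INSERT INTO fero_config (id, payload) VALUES (:1, :2)",
                (1, payload))
    conn.commit()
    elapsed = time.monotonic() - start
    conn.close()
    return elapsed

if __name__ == "__main__":
    cfg = os.urandom(1024 * 1024)             # 1 MB dummy configuration blob
    print(f"insert took {time_blob_insert(cfg):.2f} s")
```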
PVSS Performance Tests • Tests performed by SUP: • Communication between 130 PVSS systems • Influence of heavy load on PVSS has been studied: • Alarm absorption, display, cancellation and acknowledgment • Data archival • Trending performance • Influence of heavy network traffic • [Diagram: SUP test hierarchy, where data generated by leaf nodes was transported to the top-level machines]
(Some) SUP results • 130 systems with ~5 million DPEs defined were successfully interconnected • Connection of 100 UIs to a project generating 1000 changes/s has been demonstrated • These tests were later repeated by our team in order to understand the remote access mechanism • Performance tests on realistic systems: • 40000 DPEs/machine, equivalent to 5 CAEN crates • 1000 alerts generated on a leaf machine in a burst lasting 0.18 s, repeated after a 1 s delay (see the sketch below)
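The burst pattern quoted above (1000 alerts in roughly 0.18 s, repeated after a 1 s delay) can be sketched as follows. This only illustrates the load profile, not the SUP test scripts; generate_alert() is a hypothetical stand-in for whatever actually raises the PVSS alerts.

```python
# Sketch of the SUP-style alert burst pattern: 1000 alerts in ~0.18 s,
# then a 1 s pause, repeated. generate_alert() is a placeholder.
import time

BURST_SIZE = 1000        # alerts per burst
BURST_DURATION = 0.18    # seconds per burst
PAUSE = 1.0              # delay between bursts

def generate_alert(index: int) -> None:
    # Placeholder: in the real tests this would toggle an alert-handling DPE.
    pass

def run_bursts(n_bursts: int) -> None:
    spacing = BURST_DURATION / BURST_SIZE    # ~0.18 ms between alerts
    for _ in range(n_bursts):
        start = time.monotonic()
        for i in range(BURST_SIZE):
            generate_alert(i)
            # Pace the burst so it spans roughly BURST_DURATION seconds.
            target = start + (i + 1) * spacing
            time.sleep(max(0.0, target - time.monotonic()))
        time.sleep(PAUSE)

if __name__ == "__main__":
    run_bursts(3)    # e.g. three bursts, i.e. 3000 alerts over a few seconds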
Alert absorption by PVSS system • The PVSS system was able to absorb all alarms generated on a leaf node • Display of 5000 incoming (CAME) alerts: 26 s • Cancellation of 5000 alerts: 45 s • Acknowledgment of 10000 alerts: 2 min 20 s • ETM is implementing a new alarm handling mechanism which includes alarm filtering and summary alarms and will provide higher performance
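To put the quoted times in perspective, the implied processing rates can be computed directly from the figures above (a back-of-the-envelope calculation, not an additional measurement):

```python
# Rates implied by the quoted alert-handling times.
display_rate = 5000 / 26      # ~192 alert displays per second
cancel_rate = 5000 / 45       # ~111 cancellations per second
ack_rate = 10000 / 140        # ~71 acknowledgments per second (2 min 20 s = 140 s)
print(f"display: {display_rate:.0f}/s, cancel: {cancel_rate:.0f}/s, ack: {ack_rate:.0f}/s")
```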
The archiving was fully efficient during these tests (no data loss) • Trending performance depends on queue settings. Evasive PVSS action (protecting the PVSS system from overload) can disturb the trend, but data can be recovered from the archive once the avalanche is gone. • Alert avalanches are memory hungry (>200 B per simple DP) • The ACC participated in additional tests (December 2004), in which the network was flooded by a ping command • No performance drop was observed in these tests • Report to be published
Summary of SUP tests • No showstoppers have been discovered • We will prepare (together with the SUP representatives) a summary showing all test results along with estimated resource utilization • Additional tests with more components (OPC server, real hardware) will be performed by our group in collaboration with SUP • See planning later in this talk
DCS Test Setup • Test setup installed in the DCS lab • Prototypes of DCS core computers • Worker nodes (rental) • [Diagram: CERN network, terminal server, router, domain controller, two database servers and ten worker nodes]
DCS core computers • In order to operate the DCS, the following core computers will be needed: • Windows DC • Application gateway for remote access • Database server(s) with mass storage • DCS infrastructure node (prototype available) • Central operator’s computer • Prototypes for all components are available. Database servers need further testing
Complementary tests performed by the ACC – Remote access • Tests performed by SUP indicated that a large number of UIs can be connected to a running project • No interference observed up to 100 UIs (the tests did not go further) • Our tests tried to simulate external access to a heavily loaded system using W2k3 Terminal Services and observe the effects on: • the terminal server • the running project • Remark: Terminal Server is the technology recommended by CERN security. Tests performed by ALICE were presented to JCOP
Computer Infrastructure for Remote Access Tests • [Diagram: remote users (Windows XP Pro) connect via the CERN network to a Windows Server 2003 terminal server; a router separates the DCS private network 192.168.39.0, which hosts the PVSS master project (Windows Server 2003)]
Computer loads for a large number of remote clients • The master project generated 50000 datapoints and updated 3000/s • Each remote client displayed 50 values at a time
Conclusions on remote access tests • The prototype performed well • Memory consumption ~35 MB per “heavy” session • CPU usage reasonable: one Xeon CPU running at 3 GHz can handle the load • Stability tested over weeks
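A rough capacity estimate follows from the ~35 MB per heavy session figure. The 2 GB of server RAM and the 512 MB of headroom for the OS and other processes assumed below are illustrative values, not part of the measurements:

```python
# Rough estimate of how many "heavy" terminal sessions fit in memory.
MB_PER_SESSION = 35          # measured figure quoted above
ASSUMED_RAM_MB = 2048        # assumption: 2 GB terminal server
OS_AND_PROJECT_MB = 512      # assumption: headroom for OS and other processes

sessions = (ASSUMED_RAM_MB - OS_AND_PROJECT_MB) // MB_PER_SESSION
print(f"~{sessions} concurrent heavy sessions fit in memory")   # ~43 sessions
```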
Additional possible bottlenecks to be tested • SUP tests were focused only on the performance of the PVSS system • What is different in a real DCS configuration? • [Diagram: UI, EM, CM and VA managers connected via OPC clients and OPC servers to the hardware, with possible peak load and data queuing at the OPC layer marked with question marks]
OPC tests • Results from the CAEN OPC tests performed by JCOP were made available last week • The test setup covered the full controls hierarchy • Time to switch 200 channels: 8 s • Recommendations: • maximum of 4 fully equipped crates (~800 channels) per computer • OPC polling time setting: minimum 500 ms • The tests provided a very useful comparison between real hardware and software simulators
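The recommendations translate into rough load figures as follows. The one-parameter-per-channel assumption is illustrative only, since real CAEN HV channels expose several monitored parameters each:

```python
# Rough load implied by the CAEN OPC recommendations quoted above.
channels_per_pc = 800        # 4 fully equipped crates
polling_time_s = 0.5         # recommended minimum OPC polling time
params_per_channel = 1       # assumption; scale up for vMon, iMon, status, ...

reads_per_second = channels_per_pc * params_per_channel / polling_time_s
switch_rate = 200 / 8        # channels switched per second in the JCOP test
print(f"~{reads_per_second:.0f} OPC item reads/s, ~{switch_rate:.0f} channels switched/s")
```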
What happens next • We are preparing an inventory of the hardware (number of channels, crates, etc.) • Data available on the DCS page; detectors are regularly requested to update the information • More tests (not only performance tests) scheduled for early 2005 • A full list of currently performed tests can be found on the DCS web
Test schedule (December 2004 to June 2005) • ConfDB tests • ArchDB tests • Full system configuration • OPC stability • Alarms • ArchDB connection to IT • FERO (SPD prototype) • Mixed system environment • Patch deployment • Network security tests • We should be able to provide more detailed information on resource utilization in the middle of February 2005
Input from sub-detectors is essential • Most unexpected problems are typically discovered only during operation • This experience cannot be obtained in the lab • Pre-installation is a very important period for the DCS • Efficient tests can be performed only with realistic hardware • Components are missing • We do not have enough manpower to perform tests for all possible combinations of hardware at CERN