Overall Performance of the DCS
PVSSII operation • Limiting factors in distributed environment • Cold configuration of the DCS
Data Generator Used in the Tests • The data generator executed a set of dpSet commands to a single datapoint • According to SUP tests there is no difference between setting a single datapoint many times and setting many datapoints once • A dpSetWait command was executed at the end of each burst to mark the end of the processing • Burst size (number of dpSets) was altered • A delay was inserted between two consecutive bursts
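A minimal CTRL-script sketch of such a generator is given below; the datapoint element name TestDP.value, the burst size, the number of bursts and the pause length are illustrative assumptions and not the values used in the tests.

  // Hedged sketch of the burst generator; all names and numbers
  // (TestDP.value, burstSize, nBursts, pauseSec) are illustrative only
  main()
  {
    int burstSize = 100000;  // number of dpSet calls per burst
    int nBursts   = 5;       // number of bursts to generate
    int pauseSec  = 10;      // delay between two consecutive bursts
    int b, i;

    for (b = 0; b < nBursts; b++)
    {
      // the burst: many dpSet commands on a single datapoint element
      for (i = 0; i < burstSize; i++)
        dpSet("TestDP.value", i);

      // synchronous call marking the end of the burst processing
      dpSetWait("TestDP.value", -1);

      delay(pauseSec);       // pause before the next burst
    }
  }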
PVSSII Operation During a Burst (no Archival) • The CPU is loaded during the burst generation (execution of the dpSet command) • Network activity indicates the communication between the managers • PVSSII confirms the end of the burst (dpSetWait), but the network activity continues until the queues are emptied • Measured values: (1) the CPU (PIV 2 GHz) needs 7.7 s to execute 100k DP changes; (2) PVSSII needs 49 s to exchange the dpSet-related messages between the involved managers; (3) PVSSII needs 52 s until all queues are empty
PVSSII Operation During a Burst (with Archival) • If PVSSII archival is active, the CPU is involved as well; the data generation is ~10% slower • The time needed for full queue processing increases by ~20% • Measured values: (1) the CPU (PIV 2 GHz) needs 9.1 s to execute 100k DP changes; (2) PVSSII needs 51 s to exchange the dpSet-related messages between the involved managers; (3) PVSSII needs 55 s until all queues are empty
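A back-of-the-envelope check with the numbers quoted above illustrates why the inter-manager messaging, rather than the CPU, is the limiting factor: the CPU alone generates 100 000 / 7.7 s ≈ 13 000 DP changes/s, whereas the complete emptying of the queues corresponds to only 100 000 / 52 s ≈ 1 900 DP changes/s without archival and 100 000 / 55 s ≈ 1 800 DP changes/s with archival.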
Limits for Burst Generation (no Archival) • Tests with gradually increased burst size were performed • The PVSSII behaves stably – all counters increase linearly • At a burst size of ~176 000 DP changes the PVSSII takes evasive action as the input queues become full
Limits for Burst Generation (with Archival) • Results with archival confirm the expected trend; the overall time needed for queue processing is longer by ~20% • The evasive action appeared at bursts inducing ~175 000 DP changes
Intermediate results: • Burst processing depends on the CPU, but the main contributing factor is the messaging between the PVSSII managers • The PVSSII load scales linearly with increasing burst size • The test systems can cope with a sustained rate of ~3500 DP changes/s with active archival • Evasive actions can be expected if the burst size exceeds ~175 000 DP changes • The system behavior can be further tuned by manipulating the manager buffers
Performance in the distributed environment • Typical questions: • Can a really big distributed system be created and operated? • Is it possible to retrieve data from other (remote) systems in a distributed system? • What load can be digested by the distributed system?
Distributed System Size • 130 systems were created • 40 000 DPEs defined in each system • Equivalent of ~5 fully equipped CAEN crates • 5 200 000 DPEs defined in total • The systems were successfully interconnected
Data Retrieval in a Distributed System - Trends • PVSSII behavior was studied in a setup with 16 computers organized in a 3-level tree architecture • Each system had 40 000 DPEs defined • Each leaf node was generating bursts of 754 DP changes with a 1000 ms delay between bursts • The top-level node was able to display trends from the remote systems • The tests were interrupted for practical reasons (no space left) when 48 trend windows, each showing 16 remote channels, were opened • In case of overload the PVSSII takes evasive actions to protect the system
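As an illustration of how a panel or script on the top-level node reaches a channel of a remote system, a hedged CTRL sketch follows; the remote system name dist_16 and the DPE gen/channel001.value are invented for the example, only the system-name prefix mechanism is taken from PVSSII.

  // Hedged sketch: subscribing to a DPE of a remote system by
  // prefixing the DPE name with the remote system name;
  // "dist_16" and "gen/channel001.value" are illustrative only
  main()
  {
    // the DIST managers route the request to the remote system;
    // its Data Manager serves the current and all subsequent values
    dpConnect("valueChanged", "dist_16:gen/channel001.value");
  }

  valueChanged(string dpe, float value)
  {
    DebugN("Update from remote system:", dpe, value);
  }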
Local and Remote UI • A local UI needs ~27 MB of memory per UI; PVSSII can take evasive actions if it runs out of resources; tests were performed with 27 local UIs • A remote UI does not put additional load on the worker node; it takes ~35 MB per UI on the terminal server (TS); tests were performed with 60 remote UIs • [Diagram: remote clients connect via RDP to UIs running on a terminal server; the terminal-server UIs and the local UIs connect to the event manager (EM) on the project worker node]
Remark Concerning the Functionality • No indirect connections are possible in PVSS • Only systems connected via the DIST manager can communicate • The DIST manager does not allow request routing • The DM of the remote system is always involved if its data is requested • At present this is also true for historical data retrieval (see the talk about AMANDA)
Intermediate Conclusions • A big distributed PVSSII system with 130 connected systems was successfully created • Data retrieval from remote nodes works correctly • Many local or remote UIs can connect to the system without disturbing its operation
Overall System Performance • A big distributed system can be seen as a collection of individual PVSS systems • A data overload on one system does not, in principle, affect the other PVSS systems • Alert filtering and group alerts can significantly reduce the communication between the individual PVSSII systems • Possible bottlenecks could be the network and the database • It is important to understand the real detector needs (data flow and access patterns) • The present design is based entirely on your information and our "educated guess"
System Cold Start • One of the most interesting numbers is the time needed to perform a cold start of the ALICE detector • The slowest detector will determine the result • Most of the operations will be performed in parallel by the detector PVSS systems • In the following slides I will try to summarize the factors contributing to the startup time • Only the DCS "overhead" will be discussed (the DCS cannot influence, for example, the speed of FERO loading)
Factors contributing to the startup time • Time 1: propagation of the startup command from ECS to the DCS • Time 2: download of the configuration data from database to PVSS (FED Servers) • Time 3: download of the data to the devices • Time 4: ramping-up • Time 5: propagation of the resulting state to ECS
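Assuming, as stated on the previous slide, that the detector systems perform these steps in parallel and that the slowest detector defines the result, the cold-start time can be sketched as T_cold-start ≈ max over detectors d of ( Time1(d) + Time2(d) + Time3(d) + Time4(d) + Time5(d) ).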
FSM Overhead • Performed tests (Giacinto, Sasha et al.) showed that the FSM overhead does not significantly affect the overall configuration time • Changing the states in a system with 1 CU and 180 DUs takes ~1.5 s • The estimates for Time 1 and Time 5 are therefore of the order of seconds
Time 2 – loading of the configuration • In previous presentations we discussed the DB access performance • One CAEN crate could be loaded in ~10 s • If several CAEN systems are connected to the same PVSSII, they will be loaded sequentially • The FERO configuration can in most cases be loaded within seconds • Big BLOBs could be locally cached • BLOBs should load at ~10 MB/s
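As a hedged numerical example (the crate count and BLOB size are invented for illustration): a detector system with 5 CAEN crates connected to one PVSSII would need roughly 5 × 10 s = 50 s for the sequential CAEN loading, and a 50 MB FERO BLOB loaded at ~10 MB/s would add about 5 s.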
Loading of the configuration parameters to the devices • CAEN loading was studied • It takes 5 s to load an SY1527 crate with 180 channels • This time includes the ~1.5 s FSM overhead • The rest of the time is spent in the OPC client, the OPC server and the CAEN wrapper • The ramping time was not accounted for • Other FW devices interfaced via OPC should perform at a comparable speed • FERO loading typically depends on the detector architecture, so this time cannot be attributed to the DCS overhead • We need your inputs
Conclusions • The PVSSII operation in a big distributed environment was studied and understood • A big effort was put into understanding the individual contributions of the DCS components to the overall performance • Detector inputs are now essential; we need to replace the educated guess with realistic numbers • The information provided during this workshop will be compiled and used for the DCS performance report