First operational experience with the CMS Run Control System
Hannes Sakulin, CERN/PH, on behalf of the CMS DAQ group
17th IEEE Real Time Conference, 24-28 May 2010, Lisbon, Portugal
The Compact Muon Solenoid Experiment
LHC
• p-p collisions, E_CM = 14 TeV (2010: 7 TeV), heavy ions
• Bunch crossing frequency 40 MHz
CMS
• Multi-purpose detector, broad physics programme
• 55 million readout channels
• Trackers: Silicon Strip, Silicon Pixel
[Detector diagram labels: Drift-Tube chambers, Iron Yoke, Resistive Plate Chambers, 4 T Superconducting Coil, Cathode Strip Chambers, Hadronic Calorimeter, Electromagnetic Calorimeter]
CMS Trigger and DAQ design
• First Level Trigger (hardware): up to 100 kHz
• Central DAQ builds events at 100 kHz, 100 GB/s, in 2 stages
• 8 independent event builder / filter slices
• High Level Trigger running on filter farm: ~700 PCs, ~6000 cores
• In total, around 10000 applications to control
[Diagram labels: Frontend Readout Links, Filter farm]
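These figures are mutually consistent: at the 100 kHz Level-1 accept rate, a builder throughput of 100 GB/s implies an average event size of about 1 MB, and with 8 independent slices each slice carries 12.5 GB/s:

$$
\frac{100\ \mathrm{GB/s}}{100\ \mathrm{kHz}} \approx 1\ \mathrm{MB/event},
\qquad
\frac{100\ \mathrm{GB/s}}{8\ \mathrm{slices}} = 12.5\ \mathrm{GB/s\ per\ slice}.
$$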
CMS Control Systems
[Architecture diagram] Two complementary control trees sit on top of the online system:
• Run Control System: Java, web technologies. It controls sub-system nodes (Trigger, Tracker, ECAL, DAQ, ...); below it, the Trigger Supervisor and XDAQ (C++) applications control the front-end electronics, front-end drivers, the First Level Trigger, and the central DAQ & High Level Trigger farm, organized in slices through which the data flow.
• Detector Control System (DCS): PVSS (Siemens ETM) and SMI (State Management Interface). It controls low voltage, high voltage, gas and magnet for the sub-detectors (Tracker, ECAL, ...).
CMS Run Control System
Run Control World: Java, web technologies
• Defines the control structure
• GUI in a web browser: HTML, CSS, JavaScript, AJAX
• Run Control web application in the Apache Tomcat servlet container: Java Server Pages, tag libraries, web services (WSDL, Axis, SOAP)
• Function Manager: node in the Run Control tree; defines a state machine & parameters
• User function managers dynamically loaded into the web application
XDAQ World: C++, XML, SOAP
• XDAQ applications control hardware and data flow
• XDAQ is the framework of the CMS online software; it provides hardware access, transport protocols, services etc.
• ~10000 applications to control
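Since user function managers are loaded dynamically into the web application, the mechanism can be pictured with standard Java class loading. This is a hedged sketch: class and method names are invented, and the real framework resolves the jar URL and class name from its configuration database rather than taking them as arguments.

```java
// Sketch of dynamically loading a user function manager into the web
// application. URLs and class names are hypothetical placeholders.
import java.net.URL;
import java.net.URLClassLoader;

public class FunctionManagerLoader {
    public static Object load(URL jarUrl, String className) throws Exception {
        // Each function manager ships as a jar referenced by URL in the
        // configuration; load it in its own class loader and instantiate.
        URLClassLoader loader = new URLClassLoader(new URL[] { jarUrl },
                FunctionManagerLoader.class.getClassLoader());
        Class<?> fmClass = loader.loadClass(className);
        return fmClass.getDeclaredConstructor().newInstance();
    }
}
```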
Function Manager Framework
[Architecture diagram] A function manager runs inside a servlet container (servlet lifecycle + configuration) and talks to its parent function manager / GUI through a web service: lifecycle, commands, parameters, state, errors, and asynchronous notifications; parameters flow from/to the parent.
• Framework code: Event Processor, State Machine Engine, State Machine Definition, Parameter Set
• Custom code: Event Handlers and State Machine Callbacks
• Child resource proxies connect to: child function managers (Run Control; command, parameter, monitor), XDAQ applications, Job Control, and the Detector Control System (via PSX)
• Supporting services: Resource Service DB and DAQ Structure DB (configuration of FM + XDAQ), Run Info DB (conditions), Log Collector (logs), XDAQ Monitoring & Alarming System (monitoring, errors)
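To make the handler plug-in model concrete, here is a minimal self-contained sketch in the spirit of the event processor / event handler split described above; all type names are illustrative, not the RCMS API.

```java
// Illustrative sketch: custom event handlers register callbacks that
// the framework's event processor dispatches on state machine
// transitions. Not the actual framework classes.
import java.util.*;
import java.util.function.Consumer;

final class EventProcessor {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();

    /** Custom event handlers register a callback per transition. */
    void onTransition(String transition, Consumer<String> callback) {
        handlers.computeIfAbsent(transition, t -> new ArrayList<>()).add(callback);
    }

    /** The state machine engine fires events; handlers run custom code. */
    void fire(String transition) {
        for (Consumer<String> cb : handlers.getOrDefault(transition, List.of())) {
            cb.accept(transition);
        }
    }
}

// Usage: a sub-system handler reacting to "Configure".
// new EventProcessor().onTransition("Configure",
//         t -> System.out.println("configuring XDAQ applications..."));
```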
Entire DAQ system structure is configurable
• High-level tools generate configurations, which are stored and versioned in a database and accessed through the Resource Service API
• Flow of configuration data: database → Resource Service → function managers → Job Control service (SOAP, XML)
• Control structure: function managers to load (URL), parameters, child nodes
• Configuration of XDAQ executives (XML): libraries to be loaded, applications (e.g. builder unit, filter unit) & parameters, network connections, collaborating applications
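As a rough illustration of the data such a configuration carries, one versioned entry per control node could look like the following; all field names are invented for the sketch and do not reflect the Resource Service schema.

```java
import java.util.*;

/** Hypothetical shape of one versioned configuration entry per node. */
final class ControlNodeConfig {
    final String functionManagerUrl;      // jar URL + class of the FM to load
    final Map<String, String> parameters; // e.g. run key, trigger key
    final List<ControlNodeConfig> children;
    final String executiveXml;            // XML handed to the XDAQ executive:
                                          // libraries, applications & parameters,
                                          // network connections

    ControlNodeConfig(String functionManagerUrl, Map<String, String> parameters,
                      List<ControlNodeConfig> children, String executiveXml) {
        this.functionManagerUrl = functionManagerUrl;
        this.parameters = parameters;
        this.children = children;
        this.executiveXml = executiveXml;
    }
}
```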
CMS Control Tree
• GUI (web browser) → Level-0: control and parameterization of the run
• Level-1: common state machine and parameters; one node per sub-system (Trigger, Tracker, DT, RPC, ECAL, DAQ, ...)
• Level-2: sub-system specific nodes, e.g. FEC (frontend controller), FED (frontend driver), TTS (Trigger Throttling System), FB (FED Builder); DAQ slices 0-7
• Level-n: FB (FED Builder), RB (Readout Builder), HLT (High Level Trigger)
• Framework and top-level Run Control developed by the central team; sub-system Run Control developed by the sub-system teams
RCMS Level-1 State Machine (simplified)
• Creation: load & start Level-1 function managers → Created
• Initialization: start further levels of function managers, start all XDAQ processes on the cluster → Halted
• New: Pre-Configuration (trigger only, a few seconds): sets up the clock and periodic timing signals → Pre-Configured
• Configuration: load configuration from database, configure hardware and applications → Configured
• Start run → Running; Stop run → Configured; Halt → Halted
• Pause / Resume: pauses / resumes the trigger (and trackers, which may need to change settings) → Paused
• Error state on failure
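The transition table above maps naturally onto an enum-driven state machine. The following self-contained Java sketch encodes the states and transitions as read off this slide; it illustrates the representation, not the framework's actual engine.

```java
// Self-contained sketch of the simplified Level-1 state machine.
// States and inputs follow the slide; the transition set is
// reconstructed from it.
import java.util.*;

enum State { CREATED, HALTED, PRE_CONFIGURED, CONFIGURED, RUNNING, PAUSED, ERROR }
enum Input { INITIALIZE, PRE_CONFIGURE, CONFIGURE, START, STOP, PAUSE, RESUME, HALT }

final class Level1StateMachine {
    private static final Map<State, Map<Input, State>> TRANSITIONS =
            new EnumMap<>(State.class);
    static {
        add(State.CREATED,        Input.INITIALIZE,    State.HALTED);
        add(State.HALTED,         Input.PRE_CONFIGURE, State.PRE_CONFIGURED);
        add(State.HALTED,         Input.CONFIGURE,     State.CONFIGURED);
        add(State.PRE_CONFIGURED, Input.CONFIGURE,     State.CONFIGURED);
        add(State.CONFIGURED,     Input.START,         State.RUNNING);
        add(State.RUNNING,        Input.STOP,          State.CONFIGURED);
        add(State.RUNNING,        Input.PAUSE,         State.PAUSED);
        add(State.PAUSED,         Input.RESUME,        State.RUNNING);
        add(State.CONFIGURED,     Input.HALT,          State.HALTED);
    }
    private static void add(State from, Input in, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Input.class)).put(in, to);
    }

    private State state = State.CREATED;

    State fire(Input in) {
        State next = TRANSITIONS.getOrDefault(state, Map.of()).get(in);
        state = (next != null) ? next : State.ERROR; // illegal input -> Error
        return state;
    }
}
```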
Top-Level Run Control (Level-0)
• Central point of control
• Global state machine
• Level-0 allows the configuration to be parameterized:
• Sub-system run key (e.g. level of zero suppression)
• First Level Trigger key / High Level Trigger key
• Clock source (LHC / local)
Masking of components
• Level-0 allows components to be masked out (see the sketch below):
• Remove/add sub-systems from control and readout
• Remove/add detector partitions
• Remove/add individual frontend drivers (masking): connection to readout (SLINK), connection to the Trigger Throttling System
• Mask out DAQ slices (1 slice = 1/8 of central DAQ)
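A minimal sketch of what such a mask set amounts to, with hypothetical names: a masked FED is excluded both from SLINK readout and from the Trigger Throttling System when the configuration is generated.

```java
// Illustrative sketch of component masking (not the production code).
import java.util.*;

final class MaskSet {
    private final Set<Integer> maskedFeds = new HashSet<>();
    private final Set<String>  maskedSubsystems = new HashSet<>();
    private final Set<Integer> maskedDaqSlices = new HashSet<>(); // slices 0..7

    void maskFed(int fedId)      { maskedFeds.add(fedId); }
    void maskSubsystem(String s) { maskedSubsystems.add(s); }
    void maskDaqSlice(int slice) { maskedDaqSlices.add(slice); }

    /** True if this FED stays in SLINK readout and the TTS. */
    boolean isActive(int fedId, String subsystem) {
        return !maskedFeds.contains(fedId)
            && !maskedSubsystems.contains(subsystem);
    }
}
```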
Commissioning and First Operation • Independent parallel commissioning of sub-detectors • Mini DAQ setups allow for standalone operation
Mini DAQ (“partitioning”)
• Dedicated small DAQ setups for most sub-systems, each with its own Level-0 and a Local Trigger Controller (or the Global Trigger)
• Low bandwidth, but sufficient for most tests
• Mini DAQ runs (heavily used in the commissioning phase) may take place in parallel to global runs, which use the Global Trigger and the full central DAQ (slices 0-7)
Commissioning and First Operation
• Run start time, end of 2008: globally 8.5 minutes, central DAQ 5 minutes (cold start)
Optimization of run startup time
• Globally:
• Optimized the global state model (pre-configuration)
• Provided tools for parallelization of user code (parameter handling)
• Sub-system specific performance improvements
• Central DAQ:
• Developed a tool to analyze log files and plot timelines of all operations
• Distributed central DAQ control over 5 Apache Tomcat servers (previously 1)
• Reduced message traffic between Run Control and XDAQ applications: commands and parameters combined into a single message (sketched below)
• New startup method for High Level Trigger processes on multi-core machines: initialize and configure a mother process, then fork child processes; reduced memory footprint due to copy-on-write
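The message-coalescing optimization can be illustrated with standard SAAJ calls: the command and its parameters travel in one SOAP envelope instead of two round trips per application. Element and namespace names below are illustrative, not the exact XDAQ schema.

```java
// Sketch: one SOAP message carrying both the command and its
// parameters, saving a round trip per application at run start.
import javax.xml.soap.*;

public class CombinedCommand {
    public static SOAPMessage build(String command, String runNumber)
            throws SOAPException {
        SOAPMessage msg = MessageFactory.newInstance().createMessage();
        SOAPElement cmd = msg.getSOAPBody()
                .addChildElement(command, "xdaq", "urn:xdaq-soap:3.0");
        // Parameters ride along inside the same envelope.
        cmd.addChildElement("runNumber").addTextNode(runNumber);
        return msg;
    }
}
```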
Run start timing (May 2010)
• Globally 4¼ minutes; central DAQ 1¼ minutes (Initialize, Configure, Start)
• Configuration time now dominated by frontend configuration (Tracker)
• Pause/Resume 7× faster than Stop/Start
[Chart: configuration time per sub-system, in seconds]
Commissioning and First Operation
• Run start time now: globally under 4¼ minutes, central DAQ 1¼ minutes (down from 8.5 / 5 minutes at the end of 2008)
• Initially some stability issues; solved by debugging user code (thread leaks)
• Recovery from sub-system faults:
• Control of individual sub-systems from the top-level control node
• Fast masking / unmasking of components (partial re-configuration only)
• Operator efficiency:
• Operation is complex: sub-system inter-dependencies when configuring partially, dependencies on internal & external parameters, procedures to follow (e.g. clock change)
• Operators are no longer DAQ experts but colleagues from the entire collaboration
• Built-in cross-checks guide the operator
Built-in cross-checks
• Built-in cross-checks guide the shifter; they indicate which sub-systems to re-configure when:
• a parameter is changed in the GUI
• a sub-system / FED is added or removed
• external parameters change
• They enforce the correct order of re-configuration
• They enforce re-configuration of CMS if the clock source changed or the LHC clock has been unstable
→ Improved operator efficiency
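A sketch of the idea behind these cross-checks, illustrative rather than the production code: any change taints the affected sub-systems, a clock change taints everything, and the GUI refuses to start a run until the taint set has been cleared by re-configuration.

```java
// Illustrative cross-check bookkeeping guiding the shifter.
import java.util.*;

final class CrossChecks {
    private final Set<String> needsReconfigure = new LinkedHashSet<>();

    void onParameterChanged(String subsystem) { needsReconfigure.add(subsystem); }
    void onFedMaskChanged(String subsystem)   { needsReconfigure.add(subsystem); }

    void onClockSourceChanged(Collection<String> allSubsystems) {
        // A clock change (or unstable LHC clock) invalidates everything.
        needsReconfigure.addAll(allSubsystems);
    }

    /** Cleared only when the sub-system has actually been re-configured. */
    void onConfigured(String subsystem) { needsReconfigure.remove(subsystem); }

    /** The GUI refuses "Start" while this set is non-empty. */
    boolean startAllowed() { return needsReconfigure.isEmpty(); }
}
```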
Operation with the LHC
• Cosmic run: bring the detector into the desired state (Detector Control System), then start data acquisition (Run Control System)
• LHC: detector state and DAQ state depend on the LHC
• Want to keep the DAQ going before beams are stable, to ensure that we are ready
• Tracking detector high voltage is only ramped up when beams are stable (detector safety)
• During the ramp, clock variations may unlock some links in the trigger
[Plot: LHC dipole current vs. time; LHC clock stable outside the ramp]
Integration with DCS & automatic actions
• To keep the DAQ going, Run Control needs to be aware of the LHC and detector states
• The top-level control node is notified about changes (via the PSX, the PVSS SOAP eXchange XDAQ service) and propagates them to the concerned systems (Trigger + Trackers)
• The trigger masks channels while the LHC is ramping
• The Silicon-Strip Tracker masks its payload when running with HV off (noise)
• The Silicon-Pixel Tracker reduces gains when running with HV off (high currents)
• The top-level control node triggers an automatic pause/resume when relevant DCS / LHC states change during a run
Automatic actions
[Timeline of an LHC fill: automatic actions along the LHC dipole current curve]
• Ramp start: mask sensitive trigger channels
• Ramp done (LHC clock stable): unmask sensitive trigger channels
• Tracker HV ramped up (beams stable): enable payload, lower thresholds, log HV state in data
• Tracker HV ramped down: disable payload, raise thresholds, log HV state in data
• CMS run: start / stop
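Schematically, with method and state names invented for the sketch, the top-level node maps each relevant DCS/LHC transition to a pause, a per-subsystem action, and a resume:

```java
// Hedged sketch of the automatic actions above; not the actual
// top-level control node code.
public class AutomaticActions {
    interface RunController {
        void pause();
        void resume();
        void sendToSubsystem(String subsystem, String command);
    }

    private final RunController rc;
    AutomaticActions(RunController rc) { this.rc = rc; }

    /** Called when the PSX service reports a relevant state change. */
    void onLhcOrDcsStateChange(String newState) {
        rc.pause(); // keep the run going, just gate the trigger briefly
        switch (newState) {
            case "RAMP_START"     -> rc.sendToSubsystem("Trigger", "MaskSensitiveChannels");
            case "CLOCK_STABLE"   -> rc.sendToSubsystem("Trigger", "UnmaskSensitiveChannels");
            case "TRACKER_HV_ON"  -> rc.sendToSubsystem("Tracker", "EnablePayload");
            case "TRACKER_HV_OFF" -> rc.sendToSubsystem("Tracker", "DisablePayload");
        }
        rc.resume();
    }
}
```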
Observations
• Standardizing the experiment's software is important for long-term maintenance
• Largely successful, considering the size of the collaboration
• The Run Control framework was available early in the development of the experiment's software (2003)
• Adopted by all sub-systems, though some sub-systems built their own framework underneath
• Ease of use becomes more and more important: Run Control / DAQ is now operated by members of the entire CMS collaboration
• Running with high live-time: > 95% so far for stable-beam periods in 2010
Observations – Web Technology
• Operations:
• Typical advantages of a web application: multiple clients, remote login
• Stability of the server (Apache Tomcat + Run Control web application) very good: runs for weeks
• Stability of the GUI depends on third-party products (browsers); behavior changes from one release to the next
• Not a big problem: the GUI can be restarted without affecting the run
• Development:
• Knowledge of Java and the Run Control framework is sufficient for basic function managers; the web-based GUI & web technologies are handled by the framework
• Development of complex GUIs such as the top-level control node is more difficult: many technologies need to be mastered
• Modern web toolkits not yet used by Run Control
Summary & Outlook
• The CMS Run Control System is based on Java & web technologies
• Good stability
• Top-level control node optimized for efficiency: flexible operation of individual sub-systems, built-in cross-checks to guide the operator, automatic actions triggered by detector and LHC state
• High CMS data-taking efficiency: live-time > 95%
• Next developments: further improve fault tolerance, automatic recovery procedures, auto-pilot
[Event display: candidate event]