First operational experience with the CMS Run Control System

Hannes Sakulin, CERN/PH, on behalf of the CMS DAQ group. 17th IEEE Real Time Conference, 24-28 May 2010, Lisbon, Portugal.

Presentation Transcript


  1. First operational experience with the CMS Run Control System
     Hannes Sakulin, CERN/PH, on behalf of the CMS DAQ group
     17th IEEE Real Time Conference, 24-28 May 2010, Lisbon, Portugal

  2. The Compact Muon Solenoid Experiment
     LHC
       • p-p collisions, E_CM = 14 TeV (2010: 7 TeV), heavy ion
       • Bunch crossing frequency 40 MHz
     CMS
       • Multi-purpose detector, broad physics programme
       • 55 million readout channels
     Figure labels (detector components): Drift-Tube chambers; Iron Yoke; Resistive Plate Chambers; 4 T Superconducting Coil; Cathode Strip Chambers; Hadronic Calorimeter; Electromagnetic Calorimeter; Trackers (Silicon Strip, Silicon Pixel)

  3. CMS Trigger and DAQ design
     • First Level Trigger (hardware)
       • up to 100 kHz
     • Central DAQ builds events at 100 kHz, 100 GB/s
       • 2 stages
       • 8 independent event builder / filter slices
     • High level trigger running on filter farm
       • ~700 PCs
       • ~6000 cores
     • In total around 10000 applications to control
     Figure labels: Frontend Readout Links; Filter farm
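The design figures on this slide imply an average event size and a per-slice load. A minimal Java sketch of that arithmetic is given here; the class and method names are hypothetical, and the even split of events across the 8 slices is an assumption, not something the slide states.

```java
/** Back-of-the-envelope figures derived from the design numbers on the slide. */
final class DaqDesignNumbers {
    static final double LEVEL1_RATE_HZ = 100_000.0;           // up to 100 kHz
    static final double THROUGHPUT_BYTES_PER_S = 100e9;       // 100 GB/s
    static final int SLICES = 8;                              // independent slices

    /** Average event size implied by throughput divided by event rate, in bytes. */
    static double averageEventSizeBytes() {
        return THROUGHPUT_BYTES_PER_S / LEVEL1_RATE_HZ;
    }

    /** Event rate handled by one slice, assuming an even split (assumption). */
    static double ratePerSliceHz() {
        return LEVEL1_RATE_HZ / SLICES;
    }

    /** Throughput handled by one slice, assuming an even split (assumption). */
    static double throughputPerSliceBytesPerS() {
        return THROUGHPUT_BYTES_PER_S / SLICES;
    }
}
```

With the slide's figures this works out to an average of 1 MB per event, and 12.5 kHz and 12.5 GB/s per slice.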

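  4. CMS Control Systems
     Figure: control systems above the hardware and data-flow layer.
     • Run Control System (Java, Web Technologies); nodes: DCS, Trigger, Tracker, ECAL, DAQ, …
     • Trigger Supervisor; XDAQ (C++); Slice, Slice, …
     • Hardware layer (connected by data links): Front-end Electronics; Front-end Drivers, First Level Trigger; Central DAQ & High Level Trigger Farm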

  5. CMS Control Systems
     Figure: control systems above the hardware and data-flow layer.
     • Detector Control System: PVSS (Siemens ETM), SMI (State Management Interface); nodes: Tracker, ECAL, …; controls Low voltage, High voltage, Gas, Magnet
     • Run Control System (Java, Web Technologies); nodes: DCS, Trigger, Tracker, ECAL, DAQ, …
     • Trigger Supervisor; XDAQ (C++); Slice, Slice, …
     • Hardware layer (connected by data links): Front-end Electronics; Front-end Drivers, First Level Trigger; Central DAQ & High Level Trigger Farm

  6. CMS Run Control System
     Run Control World – Java, Web Technologies
       • Defines the control structure
       • GUI in a web browser: HTML, CSS, JavaScript, AJAX
       • Run Control Web Application: Apache Tomcat Servlet Container; Java Server Pages, Tag Libraries, Web Services (WSDL, Axis, SOAP)
       • Function Manager: node in the Run Control Tree; defines a State Machine & parameters
       • User function managers dynamically loaded into the web application
     XDAQ World – C++, XML, SOAP
       • XDAQ applications control hardware and data flow
       • XDAQ is the framework of CMS online software. It provides Hardware Access, Transport Protocols, Services etc.
       • ~10000 applications to control
     Figure labels: XDAQ Application; data

  7. Function Manager Framework
     Figure: block diagram of a Function Manager. Legend: framework code; custom code; web service.
     • From/to Parent Function Manager / GUI (Servlet / Web Service): Lifecycle, Command, Parameter; State, Errors, Parameters; Asynchronous Notifications
     • Servlet: Lifecycle + Configuration web service
     • Internal components: FunctionManager, EventProcessor, StateMachineEngine, StateMachineDefinition, ParameterSet, Event Handlers, State Machine Callback, Monitor
     • Child resource proxies: PSX, Run Control, XDAQ
     • Outgoing connections: to / from Child Function Manager; JobControl; to / from Detector Control System

  8. Function Manager Framework
     Figure: block diagram of a Function Manager with its external services. Legend: framework code; custom code; web service.
     • From/to Parent Function Manager / GUI (Servlet / Web Service): Lifecycle, Command, Parameter; State, Errors, Parameters; Asynchronous Notifications
     • Servlet: Lifecycle + Configuration web service
     • Internal components: FunctionManager, EventProcessor, StateMachineEngine, StateMachineDefinition, ParameterSet, Event Handlers, State Machine Callback, Monitor
     • Child resource proxies: PSX, Run Control, XDAQ
     • Outgoing connections: to / from Child Function Manager; JobControl; to / from Detector Control System
     • External services and data: Run Info DB; Conditions; Configuration (FM + XDAQ); LogCollector; Logs; Resource Service DB; Monitoring; XDAQ Monitoring & Alarming System; Errors; DAQ Structure DB

  9. Entire DAQ System Structure is Configurable
     • Control structure
       • Function Managers to load (URL)
       • Parameters
       • Child nodes
     • Configuration of XDAQ Executives (XML)
       • libraries to be loaded
       • applications (e.g. builder unit, filter unit) & parameters
       • network connections
       • collaborating applications
     Figure labels (flow of configuration data): High-level tools to generate configurations; ResourceService API; Database; versioning; SOAP; XML; Job Control Service

  10. CMS Control Tree
      • GUI (Web browser)
      • Level-0: Control and parameterization of Run
      • Level-1: Common state machine and parameters (nodes: Trigger, Tracker, DT, RPC, ECAL, DAQ, …)
      • Level-2: Sub-system specific (nodes: FEC, FED, TTS, …; FB, Slice 0 … Slice 7)
      • Level-n: (nodes: FB, RB, HLT)
      Abbreviations: FEC = Frontend controller; FED = Frontend driver; TTS = Trigger Throttling System; HLT = High Level Trigger; RB = Readout Builder; FB = FED Builder
      • Framework and Top-Level Run Control developed by central team
      • Sub-system Run Control developed by sub-system teams

  11. RCMS Level-1 State Machine (simplified)
      States: Created, Halted, Pre-Configured, Configured, Running, Paused, Error
      Transitions:
      • Creation: Load & start Level-1 Function Managers
      • Initialization: Start further levels of function managers; start all XDAQ processes on the cluster
      • Pre-Configuration (new; trigger only, a few seconds): Sets up the clock and periodic timing signals
      • Configuration: Load configuration from database; configures hardware and applications
      • Start run / Stop run
      • Pause / Resume: Pauses / resumes the trigger (and trackers, which may need to change settings)
      • Halt
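The simplified state machine on this slide can be sketched as a transition table. The following Java sketch is illustrative only: the enum and class names are hypothetical and are not the RCMS framework API, the exact source and target states of each transition are a reading of the flattened diagram, and accepting Halt from every post-initialization state is an assumption.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

enum RunState { CREATED, HALTED, PRE_CONFIGURED, CONFIGURED, RUNNING, PAUSED, ERROR }

enum RunCommand { INITIALIZE, PRE_CONFIGURE, CONFIGURE, START, STOP, PAUSE, RESUME, HALT }

/** Hypothetical table-driven sketch of the simplified Level-1 state machine. */
final class Level1StateMachine {
    private static final Map<RunState, Map<RunCommand, RunState>> TABLE =
            new EnumMap<>(RunState.class);

    private static void allow(RunState from, RunCommand command, RunState to) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<>(RunCommand.class)).put(command, to);
    }

    static {
        allow(RunState.CREATED, RunCommand.INITIALIZE, RunState.HALTED);
        allow(RunState.HALTED, RunCommand.PRE_CONFIGURE, RunState.PRE_CONFIGURED);
        allow(RunState.PRE_CONFIGURED, RunCommand.CONFIGURE, RunState.CONFIGURED);
        allow(RunState.CONFIGURED, RunCommand.START, RunState.RUNNING);
        allow(RunState.RUNNING, RunCommand.STOP, RunState.CONFIGURED);
        allow(RunState.RUNNING, RunCommand.PAUSE, RunState.PAUSED);
        allow(RunState.PAUSED, RunCommand.RESUME, RunState.RUNNING);
        // Assumption: Halt returns to HALTED from any state after initialization.
        for (RunState s : EnumSet.of(RunState.PRE_CONFIGURED, RunState.CONFIGURED,
                RunState.RUNNING, RunState.PAUSED, RunState.ERROR)) {
            allow(s, RunCommand.HALT, RunState.HALTED);
        }
    }

    private RunState state = RunState.CREATED;

    RunState state() {
        return state;
    }

    /** Applies a command, or throws if the command is not allowed in the current state. */
    RunState handle(RunCommand command) {
        Map<RunCommand, RunState> row = TABLE.get(state);
        RunState next = (row == null) ? null : row.get(command);
        if (next == null) {
            throw new IllegalStateException(command + " not allowed in state " + state);
        }
        state = next;
        return state;
    }

    /** Moves to ERROR, e.g. after a failed transition in a child (assumption). */
    void fail() {
        state = RunState.ERROR;
    }
}
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to add a step such as Pre-Configuration without touching the dispatch logic.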

  12. Top-Level Run Control (Level-0)
      • Central point of control
      • Global State Machine
      • Level-0 allows the configuration to be parameterized
        • Sub-system Run Key (e.g. level of zero suppression)
        • First Level Trigger Key / High Level Trigger Key
        • Clock source (LHC / local)

  13. Masking of components
      • Level-0 allows components to be masked out
        • Remove/add sub-systems from control and readout
        • Remove/add detector partitions
        • Remove/add individual Frontend-Drivers (masking)
          • Connection to readout (SLINK)
          • Connection to Trigger Throttling System
        • Mask out DAQ slices (each slice = 1/8 of central DAQ)
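Slice masking can be pictured as a small bit mask over the 8 DAQ slices. The Java sketch below is a hypothetical illustration (the `SliceMask` class is not part of the Run Control framework) of how masking a slice reduces the active fraction of the central DAQ in steps of 1/8.

```java
import java.util.BitSet;

/** Hypothetical bookkeeping of masked DAQ slices; each slice is 1/8 of central DAQ. */
final class SliceMask {
    static final int SLICES = 8;
    private final BitSet masked = new BitSet(SLICES);

    void mask(int slice) {
        check(slice);
        masked.set(slice);
    }

    void unmask(int slice) {
        check(slice);
        masked.clear(slice);
    }

    boolean isActive(int slice) {
        check(slice);
        return !masked.get(slice);
    }

    int activeSlices() {
        return SLICES - masked.cardinality();
    }

    /** Fraction of the central DAQ that remains in the run. */
    double activeFraction() {
        return activeSlices() / (double) SLICES;
    }

    private static void check(int slice) {
        if (slice < 0 || slice >= SLICES) {
            throw new IllegalArgumentException("slice must be in 0.." + (SLICES - 1));
        }
    }
}
```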

  14. Commissioning and First Operation with the LHC

  15. Commissioning and First Operation
      • Independent parallel commissioning of sub-detectors
        • Mini DAQ setups allow for standalone operation

  16. Mini DAQ (“partitioning”)
      • Dedicated small DAQ setups for most sub-systems
      • Low bandwidth but sufficient for most tests
      • Mini DAQ may be used in parallel to the Global Runs
      Figure: two control trees side by side.
      • MiniDAQ Run (heavily used in commissioning phase): Level-0 with nodes LTC, ECAL, MiniDAQ; LTC = Local Trigger Controller (or Global Trigger)
      • Global Run: Level-0 with nodes Global Trigger, Tracker, DT, …, Global DAQ (Slice 0 … Slice 7)

  17. Commissioning and First Operation
      • Independent parallel commissioning of sub-detectors
        • Mini DAQ setups allow for standalone operation
      • Run start time
        • End of 2008: Globally 8.5 minutes, Central DAQ: 5 minutes (Cold start)

  18. Optimization of run startup time
      • Globally
        • Optimized the global state model (pre-configuration)
        • Provided tools for parallelization of user code (Parameter handling)
        • Sub-system specific performance improvements

  19. Optimization of run startup time
      • Globally
        • Optimized the global state model (pre-configuration)
        • Provided tools for parallelization of user code (Parameter handling)
        • Sub-system specific performance improvements
      • Central DAQ
        • Developed tool to analyze log files and plot timelines of all operations
        • Distributed central DAQ control over 5 Apache Tomcat servers (previously 1)
        • Reduced message traffic between Run Control and XDAQ applications
          • combine commands and parameters into single message
        • New startup method for High Level Trigger processes on multi-core machines
          • Initialize and Configure mother process, then fork child processes
          • Reduced memory footprint due to copy-on-write
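One of the optimizations listed is parallelization of user code. As a generic illustration of the idea (not the tools actually provided by the Run Control framework), the Java sketch below sends a request to all children concurrently and collects the replies in child order. The `ParallelFanOut` class and its `send` callback are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper that contacts all children in parallel instead of one by one. */
final class ParallelFanOut {
    /** Applies {@code send} to every child concurrently; replies are returned in child order. */
    static <C, R> List<R> sendToAll(List<C> children, Function<C, R> send, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<R>> pending = new ArrayList<>();
            for (C child : children) {
                pending.add(CompletableFuture.supplyAsync(() -> send.apply(child), pool));
            }
            List<R> replies = new ArrayList<>();
            for (CompletableFuture<R> future : pending) {
                // join() rethrows a child's failure wrapped in a CompletionException.
                replies.add(future.join());
            }
            return replies;
        } finally {
            pool.shutdown();
        }
    }
}
```

With sequential calls the total time is the sum of the per-child times; with a fan-out like this it approaches the time of the slowest child, provided the pool has enough threads.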

  21. Run Start timing (May 2010)
      • Globally 4 ¼ minutes, Central DAQ: 1 ¼ minutes (Initialize, Configure, Start)
      • Configuration time now dominated by frontend configuration (Tracker)
      • Pause/Resume 7x faster than Stop/Start
      Figure: time per sub-system (seconds)
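Comparing these May 2010 timings with the end-of-2008 cold-start figures quoted earlier in the talk (8.5 minutes globally, 5 minutes for central DAQ) gives the overall speed-up. The Java sketch below only does that division; the class name is hypothetical.

```java
/** Speed-up factors computed from the run start times quoted in the talk. */
final class StartupSpeedup {
    /** Ratio of the old duration to the new duration. */
    static double speedup(double beforeMinutes, double afterMinutes) {
        if (afterMinutes <= 0) {
            throw new IllegalArgumentException("afterMinutes must be positive");
        }
        return beforeMinutes / afterMinutes;
    }

    static double globalSpeedup() {
        return speedup(8.5, 4.25);      // end of 2008 vs. May 2010, global
    }

    static double centralDaqSpeedup() {
        return speedup(5.0, 1.25);      // end of 2008 vs. May 2010, central DAQ
    }
}
```

The global run start is thus about 2 times faster and the central DAQ start about 4 times faster.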

  22. Commissioning and First Operation
      • Independent parallel commissioning of sub-detectors
        • Mini DAQ setups allow for standalone operation
      • Run start time
        • End of 2008: Globally 8.5 minutes, Central DAQ: 5 minutes (Cold Start)
        • Now: Globally < 4 ¼ minutes, Central DAQ: 1 ¼ minutes
      • Initially some stability issues
        • Problems solved by debugging user code (thread leaks)

  23. Commissioning and First Operation
      • Independent parallel commissioning of sub-detectors
        • Mini DAQ setups allow for standalone operation
      • Run start time
        • End of 2008: Globally 8.5 minutes, Central DAQ: 5 minutes (Cold Start)
        • Now: Globally < 4 ¼ minutes, Central DAQ: 1 ¼ minutes
      • Initially some stability issues
        • Problems solved by debugging user code (thread leaks)
      • Recovery from sub-system faults
        • Control of individual sub-systems from top-level control node
        • Fast masking / unmasking of components (partial re-configuration only)

  24. Commissioning and First Operation
      • Independent parallel commissioning of sub-detectors
        • Mini DAQ setups allow for standalone operation
      • Run start time
        • End of 2008: Globally 8.5 minutes, Central DAQ: 5 minutes (Cold Start)
        • Now: Globally < 4 ¼ minutes, Central DAQ: 1 ¼ minutes
      • Initially some stability issues
        • Problems solved by debugging user code (thread leaks)
      • Recovery from sub-system faults
        • Control of individual sub-systems from top-level control node
        • Fast masking / unmasking of components (partial re-configuration only)
      • Operator efficiency
        • Operation is complex
          • Sub-system inter-dependencies when configuring partially
          • Dependencies on internal & external parameters
          • Procedures to follow (Clock change)
        • Operators are no longer DAQ experts but colleagues from the entire collaboration
        • Built-in cross-checks to guide the operator

  25. Built-in cross-checks
      • Built-in cross-checks guide the shifter
        • Indicate sub-systems to re-configure if
          • A parameter is changed in the GUI
          • A sub-system / FED is added/removed
          • External parameters change
        • Enforce correct order of re-configuration
        • Enforce re-configuration of CMS if clock source changed or LHC has been unstable
      • Result: improved operator efficiency
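The cross-check logic can be pictured as bookkeeping of which sub-systems still need re-configuration. The Java sketch below is a hypothetical illustration, not the actual Level-0 implementation: the `SubSystem` list is abridged, the rule that a membership change also flags the DAQ is an assumption, and ordering constraints are not modelled.

```java
import java.util.EnumSet;
import java.util.Set;

enum SubSystem { TRIGGER, TRACKER, ECAL, DT, RPC, DAQ }

/** Hypothetical tracker of sub-systems that must be re-configured before the next run. */
final class ReconfigurationAdvisor {
    private final EnumSet<SubSystem> pending = EnumSet.noneOf(SubSystem.class);

    /** A parameter owned by one sub-system was changed in the GUI. */
    void parameterChanged(SubSystem owner) {
        pending.add(owner);
    }

    /** A sub-system or one of its FEDs was added or removed. */
    void membershipChanged(SubSystem changed) {
        pending.add(changed);
        pending.add(SubSystem.DAQ);   // assumption: readout must follow the new membership
    }

    /** The clock source changed, so all of CMS must be re-configured. */
    void clockSourceChanged() {
        pending.addAll(EnumSet.allOf(SubSystem.class));
    }

    void reconfigured(SubSystem done) {
        pending.remove(done);
    }

    /** Sub-systems the GUI would flag to the shifter. */
    Set<SubSystem> needReconfiguration() {
        return EnumSet.copyOf(pending);
    }

    boolean mayStartRun() {
        return pending.isEmpty();
    }
}
```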

  26. Operation with the LHC
      • Cosmic run
        • Bring the detector into the desired state (Detector Control System)
        • Start Data Acquisition (Run Control System)
      • LHC
        • Detector state and DAQ state depend on the LHC
        • Want to keep DAQ going before beams are stable to ensure that we are ready
      Figure annotations:
      • Tracking detector high voltage only ramped up when beams are stable (detector safety)
      • Ramp: clock variations may unlock some links in the trigger
      Figure labels: LHC clock stable; LHC dipole current

  27. Integration with DCS & automatic actions
      • In order to keep DAQ going, Run Control needs to be aware of the LHC and detector states
      • Top-level control node is notified about changes and propagates them to the concerned systems (Trigger + Trackers)
        • Trigger masks channels while LHC is ramping
        • Silicon-Strip Tracker masks payload when running with HV off (noise)
        • Silicon-Pixel Tracker reduces gains when running with HV off (high currents)
      • Top-level control node triggers automatic pause/resume when relevant DCS / LHC states change during a run
      Figure labels: Detector Control System (DCS, LHC, Tracker, ECAL, …); PSX = PVSS SOAP eXchange (XDAQ service); Run Control System (Level-0; DCS, Tracker, DAQ, …)

  28. Automatic actions
      Figure: timeline of LHC dipole current, LHC clock stable, tracker HV ramp up / ramp down, and CMS run start / stop, with the automatic actions taken.
      • ramp start: Mask sensitive trigger channels
      • ramp done: Unmask sensitive trigger channels
      • Tracker HV on: Enable payload, lower thresholds, log HV state in data
      • Tracker HV off: Disable payload, raise thresholds, log HV state in data
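The automatic actions amount to a lookup from an external state change to a list of steps, wrapped in pause/resume when a run is ongoing. The Java sketch below illustrates that idea only; the `ExternalEvent` and `AutomaticActions` names are hypothetical, the step strings are taken from the slide, and applying the steps directly when no run is ongoing is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

enum ExternalEvent { RAMP_START, RAMP_DONE, TRACKER_HV_ON, TRACKER_HV_OFF }

/** Hypothetical mapping of DCS / LHC state changes to automatic actions. */
final class AutomaticActions {
    private static final Map<ExternalEvent, List<String>> ACTIONS =
            new EnumMap<>(ExternalEvent.class);

    static {
        ACTIONS.put(ExternalEvent.RAMP_START,
                Arrays.asList("mask sensitive trigger channels"));
        ACTIONS.put(ExternalEvent.RAMP_DONE,
                Arrays.asList("unmask sensitive trigger channels"));
        ACTIONS.put(ExternalEvent.TRACKER_HV_ON,
                Arrays.asList("enable payload", "lower thresholds", "log HV state in data"));
        ACTIONS.put(ExternalEvent.TRACKER_HV_OFF,
                Arrays.asList("disable payload", "raise thresholds", "log HV state in data"));
    }

    /** Steps for one event; during a run they are wrapped in an automatic pause/resume. */
    static List<String> plan(ExternalEvent event, boolean runOngoing) {
        List<String> steps = new ArrayList<>();
        if (runOngoing) {
            steps.add("pause");
        }
        steps.addAll(ACTIONS.get(event));
        if (runOngoing) {
            steps.add("resume");
        }
        return steps;
    }
}
```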

  29. Observations
      • Standardizing the experiment’s software is important for long-term maintenance
        • Almost successful considering the size of the collaboration
        • Run Control Framework was available early in the development of the experiment’s software (2003)
        • Adopted by all sub-systems
        • But some sub-systems built their own framework underneath
      • Ease-of-use becomes more and more important
        • Run Control / DAQ is now operated by members of the entire CMS collaboration
      • Running with high life-time: > 95 % so far for stable-beam periods in 2010

  30. Observations – Web Technology
      • Operations
        • Typical advantages of a web application: multiple clients, remote login
        • Stability of the server (Apache Tomcat + Run Control Web Application) very good: running for weeks
        • Stability of the GUI depends on third-party products (browser)
          • Behavior changes from one release to the next
          • Not a big problem: GUI can be restarted without affecting the run
      • Development
        • Knowledge of Java and the Run Control Framework sufficient for basic function managers
          • Web-based GUI & web technologies handled by framework
        • Development of complex GUIs such as the top-level control node more difficult
          • Many technologies need to be mastered
          • Modern web toolkits not yet used by Run Control

  31. Summary & Outlook
      • CMS Run Control System is based on Java & Web Technologies
        • Good stability
      • Top-Level Control node optimized for efficiency
        • Flexible operation of individual sub-systems
        • Built-in cross-checks to guide the operator
        • Automatic actions triggered by detector and LHC state
      • High CMS data-taking efficiency
        • life-time > 95%
      • Next developments
        • Further improve fault tolerance
        • Automatic recovery procedures
        • Auto Pilot
      Figure caption: candidate event
