
DAQ Status and Prospects




Presentation Transcript


  1. DAQ Status and Prospects • Frans Meijers • Wed 4 November 2009 • Readiness for Data-Taking review • Outline: Intro • Global Running • DAQ • Readiness for Data Taking

  2. CMS DAQ

  3. DAQ Phase I (“50 kHz DAQ”) – Today • Full R/O of all sub-detectors (no TOTEM) • 8 DAQ slices (12.5 kHz each, 100 kHz total), 100 GByte/s event builder • Event Filter: 720 8-core PCs (2.6 GHz, 16 GByte): ~5000 instances of CMSSW-HLT • @50 kHz: ~100 ms/evt of 2.6 GHz CPU, 2 GByte/process • @100 kHz: ~50 ms/evt • Storage Manager: 2 GByte/s, 300 TB buffer • Phase II ~2011 (“100 kHz DAQ”): add another ~1000 PCs; to be decided after experience with initial real LHC data
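The per-event CPU budgets quoted on the slide follow directly from the number of HLT instances and the L1 accept rate. A minimal check, using only the slide's own numbers:

```python
# Back-of-envelope check of the HLT CPU budget from the slide.
def hlt_time_budget_ms(n_instances, l1_rate_hz):
    """Average CPU time (ms) available per event when n_instances
    HLT processes share the incoming L1 accept rate."""
    return n_instances / l1_rate_hz * 1000.0

instances = 5000  # ~5000 CMSSW-HLT instances on 720 8-core PCs
print(hlt_time_budget_ms(instances, 50_000))   # ~100 ms/evt at 50 kHz
print(hlt_time_budget_ms(instances, 100_000))  # ~50 ms/evt at 100 kHz
```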

  4. GLOBAL RUNNING

  5. DAQ in recent global runs • ~1 kHz cosmics + 100 Hz calibration + 90 kHz randoms

  6. R/O of sub-detector FEDs • All sub-dets routinely in • For most: ~empty events @ 90 kHz • ES (ECAL preshower): problem with ‘held-back’ events preventing the DAQ from stopping • fixed yesterday? • Little global running with CASTOR so far

  7. Event Sizes • Size through EVB: total ~500 kByte (compare 1 MB nominal) • ~same for randoms and cosmics • ECAL and HCAL produce ‘nominal’ 2 kB/FED • For most sub-detectors the high-rate test was not at the nominal working point • First LHC running expected to be dominated by noise and backgrounds • A single FED with a high fragment size can limit the full system!
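Why a single oversized FED can throttle everything: each FED feeds the event builder over its own link, so the sustainable rate is set by the largest fragment, not the average. A sketch of that effect; the 200 MB/s per-link bandwidth here is an illustrative assumption, not a number from the slide:

```python
# Simplified model: per-link readout rate limit is link bandwidth
# divided by the largest single FED fragment (assumed 200 MB/s links).
def max_rate_khz(fed_sizes_kb, link_bw_mb_s=200.0):
    worst = max(fed_sizes_kb)  # the one oversized FED dominates
    return link_bw_mb_s * 1000.0 / worst / 1000.0  # kHz

nominal = [2.0] * 600                 # all FEDs at 'nominal' ~2 kB
one_hot = [2.0] * 599 + [20.0]        # one noisy FED at 20 kB
print(max_rate_khz(nominal))          # full system keeps up at 100 kHz
print(max_rate_khz(one_hot))          # one FED drags it down to 10 kHz
```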

  8. Run Control Operation (Sep 09) • Full cycle ~5 minutes (Init cDAQ 50 s (now 25 s), Configure TK 200 s) • Short cycle (start–stop) ~2 min (but happens rarely due to crashes) • Shortcuts possible with commander, but require attention by the DAQ operator • At high event rate still some back-pressure from HLT on the first event

  9. Run Control Operation (Oct 09) • At present (26 Oct) little improvement w.r.t. Sep 09 • Init down from 57 to 26 s

  10. Global Operation • Observations: • Now: empty events ~500 kByte • Start/stop often tedious • Various problems (subdet config, FEDs not sending, cDAQ not stopping, …) • Often requires full destroy–init • Once running: in general OK • Occasional subdet TTC out-of-sync, usually cured by a TTC reset (by operator) • Occasional FED with a problem, forcing a restart

  11. CENTRAL DAQ

  12. Network • GPN and private network (dot-cms) • Status: OK • Support by IT • Note: • during Xmas break: no configuration changes possible • in Green Barrack: • connection to CMS private network possible

  13. Cluster + Network • Operates independently of the CERN campus network • Network structure: private network + headnodes with access also to campus • ~2000 server PCs, ~100 desktops • Linux SLC4 • Cluster services: ntp, dns, DHCP, Kerberos, LDAP • NAS filer for home directories, projects • Cluster monitoring with Nagios • Software packaged in RPMs • Quattor for system installation and configuration • Can install a PC from scratch in ~2 min, the whole cluster in 1 hour • Windows cluster of ~100 nodes for DCS

  14. Cluster – Progress and Prospects • Purchased a few ‘spare’ PE1950, PE2950 • Can no longer buy Dell models with PCI-X (vmepc, Myrinet) • On-going: • Re-arrangement of server nodes, headnodes • Extending SCX5 control room (~continuously) • Consolidation of sub-det computing: storage, PC models • Use of central filer instead of local disks (e.g. CSC local DAQ farm) • Security • SLC5

  15. Quattor and subdet SW • Quattor: • Requires SW packaged as RPMs • Installs PCs in a known state according to the database • Can roll back to any point in time; demonstrated several times • However, this is an expert operation • Problem today: many PCs are in ‘permissive’ mode • Need a tool for sub-detectors to submit updated RPMs and initiate installation without a sysadmin • In progress, production quality before Jan 2010

  16. Quattor Profiles • 61 different profiles • Obsolete (XDAQ 6) • Stable, Prod, Pre-Prod (flavours of XDAQ 7) • Central DAQ: 2122 nodes, 110 nodes in permissive mode • Cannot guarantee proper installation! • Observation / recommendation: • Too much proliferation of profiles • All move to XDAQ 10 in Jan 2010

  17. Cluster Nodes • Failures of PC nodes • Essential nodes (connected to H/W): subdet vmepc, controllers of FRL, FMM • Models SC1425, PE2850 • Very few failures • When broken, need replacement with a spare • LENGTHY procedure! • Central DAQ EVB+HLT nodes: PE2950, PE1950 • Failures: ~1/week • Slice masking, blacklisting • Cluster and general services

  18. Online database • Production database: Oracle 6-node RAC in SCX5 • Administration: DBA by IT • Development & test database: 2-node RAC in SCX5, also on GPN • Portal Web service • Services: Elog, Shifttool, …

  19. Run Control • Progress: • Ability to mask DAQ slices • DAQ shifter can disable a DAQ slice for problems from the RU onwards; requires destroy and re-init • Does NOT cover a faulty FRL, etc. • Improvements in handling FED masking • Prospects: • Some more fault tolerance

  20. Event Filter • Progress: • Adoption of CMSSW_3 • Streamlining of patching procedure and deployment • CMSSW HLT patches via tag-collector • Rework of EventProcessor • More efficient in memory, control, … • Can restart a crashed HLT instance on the fly • In progress: • HLT counters to RunInfo database • To be finalized: signal on end-of-LumiSection from the EVM

  21. Storage Manager Performance • ½ SM system serving all 8 DAQ slices • Total capacity: 300 TB • Allows peak recording of ~2 GByte/s • The event size at the SM is typically a factor 2 smaller than the EVB size (due to compression) • Expected need in ‘steady state’ LHC running: ~400 MB/s
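The 300 TB buffer and the two quoted rates imply how long the SM can absorb data without draining to Tier-0; a quick check from the slide's numbers:

```python
# Buffer headroom: 300 TB of SM disk at a given recording rate.
def buffer_hours(capacity_tb, rate_mb_s):
    """Hours of recording the buffer can absorb at a constant rate."""
    return capacity_tb * 1e6 / rate_mb_s / 3600.0

print(buffer_hours(300, 2000))  # ~42 h at the 2 GByte/s peak
print(buffer_hours(300, 400))   # ~208 h at the ~400 MB/s steady state
```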

  22. Issues – Short term (2009) • Synchronisation of L1 configuration and HLT • Updates to the online database • Monitoring of L1 and HLT scalers • L1 scalers (with deadtime) collected by the EVB event managers • Monitoring at LumiSection boundaries

  23. Issues – Medium term (2010) • Overall control: automation and communication between DCS, Run Control and LHC status • In place: • Basic communication mechanism • DCS is partitioned according to RC TTC partitions • Missing: • Higher-level control • Different implementation options • “the devil is in the details”

  24. SLC5/gcc4.3 • Motivation: • Follow ‘current’ versions: functionality, HW support, security fixes • TSG prefers HLT on the same version of SLC5/gcc as offline • SLC5 ships gcc4.1; LCG uses gcc4.3.x; CMSSW uses gcc4.3.4+mod • Migration to SLC5/gcc4.3.x: • SLC5 (64-bit OS, 32-bit applications) • Private gcc4.3.x (distributed with CMSSW) • Only migrate Event Filter (and SM) nodes • Involves: system, XDAQ port (coretools and powerpack) and EVB, EvF, CMSSW

  25. SLC5/gcc4.3 status • XDAQ release pre-10 • Subset (components required by EvF, SM) • Built with the CMSSW gcc434 • NOT tested • CMSSW online release: build environment operational, build in progress • My (FM) personal feeling: • Window of opportunity to migrate in Jan 2010 • Needs extensive testing (hybrid environment of EVB, monitoring)

  26. XDAQ • Release and deployment: • Online SW migrated from CVS+SourceForge to SVN+TRAC, from build 8 onwards • Global (umbrella) release number, e.g. 9.4 • Release schedule: • Build 9 (21-Sep-09): improvements for central DAQ, in particular the monitoring infrastructure (collecting many large flashlists) • Build 10: both SLC4 and SLC5/gcc4.3.4 (partial) • Sub-detectors: now on build 6 or 7 • “Suggest” migrating to build 10 (SLC4) in Jan 2010

  27. READINESS FOR DATA-TAKING

  28. Splashes • Low L1 trigger rate, ~1 Hz • HLT accepts all, with flagging • No TK, PX • Maybe use a small config to minimise risks • Note: 2008 splashes: CSC FED problem, since repaired • Conclusion: DAQ is ready

  29. Early Collisions • 900 GeV and possibly 2.2 TeV • Go up to 8×8 bunches • L1 can accept all events (88 kHz) • HLT menu to bring the rate down (budget 50 ms/evt) • Put in all 16 Storage Managers • 2 GByte/s write (~2 kHz) instantaneous • Conclusion: • DAQ ready (also efficient?) • Hope to have robust L1, HLT scalers
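A plausible origin of the ~88 kHz figure (my reading, not stated on the slide): with 8 colliding bunch pairs and every crossing accepted, the maximum L1 rate is the number of bunches times the LHC revolution frequency:

```python
# Max L1 rate with 8 colliding bunch pairs, accepting every crossing.
# The LHC revolution frequency (~11.245 kHz) is a known machine
# parameter, not a number taken from the slide.
LHC_REV_FREQ_KHZ = 11.245
n_bunches = 8
print(n_bunches * LHC_REV_FREQ_KHZ)  # ~90 kHz, consistent with 88 kHz
```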

  30. Plan for Jan 2010 • Implement and test: • Tolerate missing FEDs • Migrate to XDAQ 10 • Also all sub-detectors (SLC4) • HLT+SM on SLC5/gcc434 if ready and tested in time • Design and progressively advance: • Overall DCS/DAQ control

  31. Shifts and On-Call (I) • Sysadmin on-call: currently 3, increase to 6 • Network: IT/CS support • Database: DBA from IT • Need a CMS-specific first line / filter

  32. Shifts and On-Call (II) • DAQ shifter • From the collaboration. Enough pledges; still need to fill the roster. • Prefer fewer people doing more shifts • DAQ on-call: 6 persons • Heavy load (~full-time job, high availability) • DCS shifter: from the collaboration • DCS on-call: not in place; from the pool of DCS subdet TF • DCS expert: now 3 persons
