DAQ Status and Prospects
Frans Meijers, Wed 4 November 2009
Readiness for Data-Taking review

Outline:
• Intro
• Global Running
• DAQ Readiness for Data Taking
DAQ Phase I ("50 kHz DAQ") – Today
• Full R/O of all sub-detectors (no TOTEM)
• 8 DAQ slices, 100 GByte/s event builder
• Event Filter: 720 8-core PCs (2.6 GHz, 16 GByte RAM), running ~5000 instances of CMSSW-HLT
  • @50 kHz: ~100 ms/evt of 2.6 GHz CPU, 2 GByte/process
  • @100 kHz: ~50 ms/evt
• Storage Manager: 2 GByte/s, 300 TB buffer
• Phase II (~2011, "100 kHz DAQ"): add another ~1000 PCs; to be decided after experience with initial real LHC data
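The per-event HLT budgets quoted above follow directly from the farm size; a back-of-the-envelope sketch (all numbers taken from this slide):

```python
# Sketch: per-event HLT CPU budget implied by the filter-farm size.
# 720 eight-core PCs run ~5000 concurrent CMSSW-HLT instances, so the
# average time available per event is (instances / L1 accept rate).
INSTANCES = 5000  # ~5000 CMSSW-HLT processes on the 720 8-core PCs

def hlt_budget_ms(l1_rate_hz, n_instances=INSTANCES):
    """Average per-event processing budget in milliseconds."""
    return n_instances / l1_rate_hz * 1000.0

print(hlt_budget_ms(50_000))   # 100.0 ms/evt at 50 kHz
print(hlt_budget_ms(100_000))  # 50.0 ms/evt at 100 kHz
```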
GLOBAL RUNNING
DAQ in recent global runs
• ~1 kHz cosmics + 100 Hz calibration + 90 kHz randoms
R/O of sub-detector FEDs
• All sub-detectors routinely in; for most: ~empty events @ 90 kHz
• ES (ECAL preshower): problem with 'held-back' events preventing DAQ from stopping; fixed yesterday?
• Little global running with CASTOR
Event Sizes
• Size through EVB: total ~500 kByte (compare 1 MByte nominal)
• ~Same for randoms and cosmics
• ECAL and HCAL produce the 'nominal' 2 kByte/FED
• For most sub-detectors the high-rate test was not at the nominal working point
• First LHC data expected to be dominated by noise and backgrounds
• A single FED with a large event size can limit the full system!
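The last bullet can be made concrete: the readout is synchronous with the L1 accept rate, so every FED link must sustain fragment size times rate, and the slowest link caps the whole system. A small sketch; the 200 MB/s sustained per-link bandwidth is an illustrative assumption, not a number from the slide:

```python
# Sketch: why one over-sized FED can throttle the whole DAQ.
# Each FED link must sustain (fragment size) x (L1 rate); the most
# loaded link sets the maximum rate for everyone.
LINK_LIMIT_BPS = 200e6  # assumed sustained per-link bandwidth (illustrative)

def max_l1_rate_khz(fragment_bytes, link_limit_bps=LINK_LIMIT_BPS):
    """Highest L1 rate (kHz) one FED link can sustain for a fragment size."""
    return link_limit_bps / fragment_bytes / 1000.0

print(max_l1_rate_khz(2_000))  # nominal 2 kB/FED -> 100.0 kHz
print(max_l1_rate_khz(4_000))  # one noisy 4 kB FED -> 50.0 kHz for all
```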
Run Control Operation (Sep09)
• Full cycle ~5 minutes (Init cDAQ 50 s, NOW 25 s; Configure TK 200 s)
• Short cycle (start – stop) ~2 min (but happens rarely due to crashes)
• Shortcuts possible with commander, but they require attention by the DAQ operator
• Still some back-pressure from HLT on the first event at high event rate
Run Control Operation (Oct09)
• At present (26 Oct) little improvement wrt Sep09
• Init down from 57 s to 26 s
Global Operation
Observations:
• Now: empty events ~500 kByte
• Start/stop often tedious
  • Various problems (sub-det config, FEDs not sending, cDAQ not stopping, ...)
  • Often requires a full destroy – init
• Once running, in general OK
  • Occasional sub-det TTC out-of-sync, usually cured by a TTC reset (by the operator)
  • Occasional FED with a problem, forcing a restart
CENTRAL DAQ
Network
• GPN and private network (dot-cms)
• Status: OK; supported by IT
• Note: during the Xmas break no configuration changes are possible
• In the Green Barrack: connection to the CMS private network is possible
Cluster + Network
• Operates independently of the CERN campus network
• Network structure: private network, plus headnodes with access also to the campus
• PCs: ~2000 servers, ~100 desktops; Linux SLC4
• Cluster services: ntp, dns, DHCP, Kerberos, LDAP
• NAS filer for home directories and projects
• Cluster monitoring with Nagios
• Software packaged in RPMs; Quattor for system installation and configuration
  • Can install a PC from scratch in ~2 min, the whole cluster in 1 hour
• Windows cluster of ~100 nodes for DCS
Cluster – Progress and Prospects
• Purchased a few 'spare' PE1950, PE2950
• Can no longer buy Dell models with PCI-X (vmepc, Myrinet)
On-going:
• Re-arrangement of server nodes and headnodes
• Extending the SCX5 control room (~continuously)
• Consolidation of sub-detector computing: storage, PC models
• Use of the central filer instead of local disks (e.g. CSC local DAQ farm)
• Security
• SLC5
Quattor and sub-detector SW
• Quattor:
  • Requires SW packaged in RPMs
  • Installs PCs in a known state according to the database
  • Can roll back to any point in time; demonstrated several times
  • However, it is an expert operation
• Problem today: many PCs are in 'permissive' mode
• Need a tool for sub-detectors to submit updated RPMs and initiate installation without a sysadmin
  • In progress; production quality before Jan 2010
Quattor Profiles
• 61 different profiles:
  • Obsolete (XDAQ 6)
  • Stable, Prod, Pre-Prod (flavours of XDAQ 7)
  • Central DAQ
• 2122 nodes, 110 of them in permissive mode
  • Cannot guarantee proper installation!
Observation / recommendation:
• Too much proliferation of profiles
• All move to XDAQ 10 in Jan 2010
Cluster Nodes
• Failures of PC nodes
• Essential nodes (connected to H/W): sub-det vmepc, controllers of FRL, FMM
  • Models SC1425, PE2850; very few failures
  • When broken, need replacement with a spare: a LENGTHY procedure!
• Central DAQ EVB+HLT nodes: PE2950, PE1950
  • Failures: ~1/week
  • Handled by slice masking / black-listing
• Cluster and general services
Online Database
• Production database: Oracle 6-node RAC in SCX5; DB administration by IT
• Development & test database: 2-node RAC in SCX5, also on the GPN
• Portal web service
• Services: Elog, Shifttool, ..
Run Control
Progress:
• Ability to mask DAQ slices
  • DAQ shifter can disable a DAQ slice for problems from the RU onwards; requires destroy and re-init
  • Does NOT cover a faulty FRL, etc.
• Improvements in handling FED masking
Prospects:
• Some more fault tolerance
Event Filter
Progress:
• Adoption of CMSSW_3
• Streamlining of the patching procedure and deployment; CMSSW HLT patches via tag-collector
• Rework of the EventProcessor
  • More efficient in memory, control, ..
  • Can restart a crashed HLT instance on the fly
In progress:
• HLT counters to the RunInfo database
  • To be finalized: signal on end-of-LumiSection from the EVM
Storage Manager Performance
• 1/2 of the SM system serves all 8 DAQ slices
• Total capacity: 300 TB
• Allows peak recording of ~2 GByte/s
• The event size at the SM is typically a factor 2 smaller than the EVB size (due to compression)
• Expected need in 'steady state' LHC running: ~400 MByte/s
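For orientation, the quoted capacity translates into buffer depth as follows (a sketch; rates from the slide, decimal units assumed):

```python
# Sketch: how long the 300 TB Storage Manager buffer lasts at the quoted
# write rates (decimal units: 1 TB = 1e12 bytes).
CAPACITY_B = 300e12  # total SM buffer capacity

def buffer_hours(write_bps, capacity_b=CAPACITY_B):
    """Hours until the SM buffer fills at a constant write rate."""
    return capacity_b / write_bps / 3600.0

print(round(buffer_hours(2e9), 1))    # ~41.7 h at the 2 GB/s peak
print(round(buffer_hours(400e6), 1))  # ~208.3 h (~8.7 days) at 400 MB/s
```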
Issues – Short term (2009)
• Synchronisation of L1 configuration and HLT
  • Updates to the online database
• Monitoring of L1 and HLT scalers
  • L1 scalers (with deadtime) collected by the EVB event managers
  • Monitoring at LumiSection boundaries
Issues – Medium term (2010)
• Overall control: automation and communication between DCS, Run Control and LHC status
• In place:
  • Basic communication mechanism
  • DCS is partitioned according to RC TTC partitions
• Missing:
  • Higher-level control
  • Different implementation options; "the devil is in the details"
SLC5/gcc4.3
• Motivation:
  • Follow 'current' versions: functionality, HW support, security fixes
  • TSG prefers HLT on the same SLC5/gcc versions as offline
  • SLC5 ships gcc4.1; LCG uses gcc4.3.x; CMSSW uses gcc4.3.4+mod
• Migration to SLC5/gcc4.3.x:
  • SLC5 (64-bit OS, 32-bit applications)
  • Private gcc4.3.x (distributed with CMSSW)
  • Only migrate Event Filter (and SM) nodes
• Involves: system, XDAQ port (coretools and powerpack) plus EVB and EvF, CMSSW
SLC5/gcc4.3 Status
• XDAQ release pre-10
  • Subset (components required by EvF, SM)
  • Built with CMSSW gcc434
  • NOT tested
• CMSSW online release
  • Build environment operational; build in progress
• My (FM) personal feeling:
  • Window of opportunity to migrate in Jan 2010
  • Needs extensive testing (hybrid environment of EVB, monitoring)
XDAQ
• Release and deployment
  • Online SW migrated from CVS+SourceForge to SVN+TRAC, from build 8 onwards
  • Global (umbrella) release number, e.g. 9.4
• Release schedule
  • Build 9 (21-Sep-09): improvements for central DAQ, in particular the monitoring infrastructure (collecting many large FLs)
  • Build 10: both SLC4 and SLC5/gcc4.3.4 (partial)
• Sub-detectors
  • Now on build 6 or 7
  • "Suggest" migrating to build 10 (SLC4) in Jan 2010
READINESS FOR DATA-TAKING
Splashes
• Low L1 trigger rate, ~1 Hz
• HLT accepts all events, flagging only
• No TK, PX
• Maybe use a small configuration to minimise risks
• Note: the 2008 splashes exposed a CSC FED problem, since repaired
• Conclusion: DAQ is ready
Early Collisions
• 900 GeV and possibly 2.2 TeV
• Go up to 8x8 bunches
• L1 can accept all events (88 kHz)
• HLT menu to bring the rate down (budget 50 ms/evt)
• Put in all 16 Storage Managers: 2 GByte/s write (~2 kHz) instantaneous
• Conclusion:
  • DAQ ready (also efficient?)
  • Hope to have robust L1 and HLT scalers
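The numbers on this slide hang together; a quick consistency sketch (values from the slide):

```python
# Sketch: consistency of the early-collision numbers on this slide.
l1_rate_hz = 88_000      # L1 accepts everything at 8x8 bunches
sm_write_bps = 2e9       # all 16 Storage Managers, peak write
record_rate_hz = 2_000   # quoted instantaneous recording rate

# Implied per-event size at the Storage Manager, and the rejection
# factor the HLT menu must deliver within its 50 ms/evt budget.
implied_event_mb = sm_write_bps / record_rate_hz / 1e6
hlt_rejection = l1_rate_hz / record_rate_hz

print(implied_event_mb)  # 1.0 MB/evt implied at the SM
print(hlt_rejection)     # HLT must reject ~44x
```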
Plan for Jan 2010
• Implement and test:
  • Tolerate missing FEDs
  • Migrate to XDAQ 10, also all sub-detectors (SLC4)
  • HLT+SM on SLC5/gcc434 if ready and tested in time
• Design and progressively advance:
  • Overall DCS/DAQ control
Shifts and On-Call (I)
• Sysadmin on-call: currently 3, increase to 6
• Network: IT/CS support
• Database: DBA from IT; need a CMS-specific first line / filter
Shifts and On-Call (II)
• DAQ shifter
  • From the collaboration; enough pledges, but still need to fill the roster
  • Prefer fewer people doing more shifts
• DAQ on-call
  • 6 persons; heavy load (~full-time job, high availability)
• DCS shifter: from the collaboration
• DCS on-call: not yet in place; from the pool of DCS sub-det TF
• DCS expert: now 3 persons