LHCb Computing. Umberto Marconi, on behalf of the Bologna Group. Castiadas, May 27, 2004.
Overview
• Introduction
• The LHCb experiment
• Computing Activities
  • Online
  • Offline
U. Marconi INFN, Bologna
Introduction
• Among the Italian LHCb groups, Bologna is the only one involved in computing activities, both online and offline:
  • Design of the online computing farm for the L1 & HLT trigger
  • Representative of LHCb at the INFN Tier-1 computing centre at CNAF
  • Italian representative on the LHCb National Computing Board
• We are also working to develop and test the analysis and simulation tools
Beauty Physics at LHCb
• The aim of LHCb is to study the dynamics of the beauty quark, with the main goal of measuring CP symmetry violation in this sector.
• The proton-proton collisions of the LHC, at 14 TeV and at a rate of 40 MHz, can be exploited as an intense source of beauty quarks
• The expected rate of beauty production is about 0.1 MHz
• The signal-to-noise ratio at the source is expected to be of the order of 1/160
• The useful processes are those related to the oscillations and decays of the Bd and Bs neutral mesons
• A "rare decay process" is a decay mode whose probability is of the order of, or below, 10^-4
• The rate of the interesting processes is therefore expected to be at or below 0.1 MHz × 10^-4 = 10 Hz
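The rate estimate in the last bullet is a single multiplication. As a minimal sketch (the function and constant names are illustrative and do not come from the talk), it can be checked as follows:

```python
# Figures quoted on the slide; names are illustrative only.
BEAUTY_PRODUCTION_RATE_HZ = 0.1e6   # ~0.1 MHz of beauty production
RARE_DECAY_PROBABILITY = 1e-4       # upper bound for a "rare" decay mode


def rare_decay_rate_hz(production_rate_hz, decay_probability):
    """Rate of a given decay mode = production rate x decay probability."""
    return production_rate_hz * decay_probability


rate = rare_decay_rate_hz(BEAUTY_PRODUCTION_RATE_HZ, RARE_DECAY_PROBABILITY)
print(f"Upper bound on the rate of a rare decay mode: {rate:.0f} Hz")
```

Compared with the 40 MHz collision rate, this is the gap that the trigger has to bridge.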
CP Symmetry in (very) few words
[Figure: diagram of the P, C and CP transformations]
CP symmetry works rather well, but it is violated
The LHCb Experiment
• The key feature used to select the Bd and Bs meson decays is their relatively long lifetime (1.5 × 10^-12 s)
[Figure: LHCb tracking detectors, with labels:]
• VELO: 21 stations (Rmin = 8 mm), Si 220 μm, R and φ strips
• TT: ~1.4 × 1.2 m^2, Si microstrips
• 3 tracking stations
  • IT: Si strips
  • OT: straw tubes
• ~65 m^2
Trigger Architecture
• Level-0 Trigger
  • Fixed latency, 4 μs
  • Reduces the ~10 MHz visible interaction rate to 1.1 MHz
  • Selects:
    • the highest-ET hadron, electron and photon
    • the two highest-pT muons
• Level-1 Trigger
  • Variable latency, 58 ms max
  • Output rate fixed at 40 kHz
  • Decisions are delivered in chronological order
  • Event selection: B vertex
• HLT
  • Variable latency
  • Output rate set at 200 Hz
  • Event selection: algorithms for specific decay modes
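To make the size of each step concrete, the rates quoted on this slide can be turned into reduction factors. This is only a back-of-the-envelope sketch; the stage labels and function name are illustrative.

```python
# Rates quoted on the Trigger Architecture slide, in Hz.
TRIGGER_CHAIN = [
    ("visible interactions", 10e6),
    ("Level-0 output", 1.1e6),
    ("Level-1 output", 40e3),
    ("HLT output", 200.0),
]


def reduction_factors(chain):
    """Return (stage name, input rate / output rate) for each trigger level."""
    factors = []
    for (_, rate_in), (name, rate_out) in zip(chain, chain[1:]):
        factors.append((name, rate_in / rate_out))
    return factors


for name, factor in reduction_factors(TRIGGER_CHAIN):
    print(f"{name}: rate reduced by a factor of {factor:.1f}")

overall = TRIGGER_CHAIN[0][1] / TRIGGER_CHAIN[-1][1]
print(f"Overall reduction from visible interactions to storage: {overall:.0f}")
```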
L1 & HLT Implementation
• System design ingredients:
  • data rates per front-end board (at L1 and HLT) and protocol overheads, which determine the size of the network
  • required CPU power (number of CPUs), which determines the size of the farm
• ~300 readout front-end boards have to be connected to ~2000 CPUs
• The system needs to be affordable and scalable
• We want to use mostly commercial, commodity components
• Solution: use a large Ethernet Local Area Network (LAN) and PCs
  • cheap, reliable, commercial and (mostly) commodity
Key Features
• Push-through protocol
  • simple, scalable
• Distributed global flow control (throttle) via the Timing and Fast Control (TFC) system
  • disables the trigger temporarily to avoid buffer overflows
• Data are sent as IP packets
  • can be used with any standard network equipment
• Data for several consecutive triggers are packed into Multi Event Packets (MEPs) and sent as a single packet over the network
  • reduces packet rate and transport overheads
• The CPU farm is partitioned into sub-farms
  • reduces the connectivity problem
• Sub-farms are assigned centrally by the TFC system
  • central static load balancing
• Event building and dynamic load balancing are done by the Subfarm Controller (SFC)
• A single CPU farm is used for both L1 and HLT; L1 runs as a priority task, HLT in the background on each node
  • minimises latency for L1 and overall idle time, with seamless redistribution of computing power
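The Multi Event Packet idea can be sketched in a few lines. The class below is a hypothetical illustration, not the front-end firmware. The packing factor of 25 used in the rate estimate is an assumption: it is chosen because 1.1 MHz / 25 reproduces the 44 kHz figure that appears in the readout diagrams, but the talk does not state the factor explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class MepPacker:
    """Accumulate fragments of consecutive triggers into one Multi Event Packet."""
    packing_factor: int
    _pending: list = field(default_factory=list)

    def add_fragment(self, event_id, payload):
        """Buffer one fragment; return the full MEP when it is complete, else None."""
        self._pending.append((event_id, payload))
        if len(self._pending) < self.packing_factor:
            return None
        mep, self._pending = self._pending, []
        return mep


def packet_rate_hz(trigger_rate_hz, packing_factor):
    """Packet rate seen by the network when events are packed into MEPs."""
    return trigger_rate_hz / packing_factor


# Assumed packing factor (see the note above); 1.1 MHz is the L0 output rate.
print(f"L1 packet rate per board: {packet_rate_hz(1.1e6, 25) / 1e3:.0f} kHz")
```

Packing trades a small amount of extra latency (a board waits until its packet is full) for a much lower packet rate, which matters because per-packet overheads dominate at small payload sizes.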
L1 Event-Building
[Figure: L1 and HLT data flow over Gb Ethernet. Front-end electronics (FE boards, TRM) feed a multiplexing layer of switches (HLT traffic: 323 links, 4 kHz, 1.6 GB/s; Level-1 traffic: 126 links, 44 kHz, 5.5 GB/s; 29 and 62 switches), then the readout network (64 links, 88 kHz; 32 links), then 94 SFCs (94 links, 7.1 GB/s) and the CPU farm (~1800 CPUs). Also shown: L1-Decision Sorter, TFC system, storage system. Legend: Level-1 traffic, HLT traffic, mixed traffic.]
• Upon reception of an L0 yes, data are stored in the L1 buffers. VELO, TT, L0DU and the Calorimeter Selection Crate pack data into a MEP
• When a MEP is full, the TFC sends the destination address for this MEP. The boards send the MEP as IP packets
• The packet is routed through the readout network to the Subfarm Controller (SFC)
• The SFC collects all MEPs, assembles the events and sends individual events to a free CPU
• The CPU reports back to the SFC with an L1 decision and discards the data
• The SFC sends the decision to the L1 trigger decision sorter
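The SFC's part in these steps (collect one MEP per source, then split it into individual events) can be sketched as follows. This is a simplified illustration with hypothetical names; it assumes every source packs the same events, in the same order, into a MEP with a given identifier, and it ignores timeouts and lost packets.

```python
class EventBuilder:
    """Toy model of event building in a Subfarm Controller."""

    def __init__(self, n_sources):
        self.n_sources = n_sources
        self._partial = {}  # mep_id -> {source_id: [fragment, ...]}

    def add_mep(self, mep_id, source_id, fragments):
        """Store one MEP; return the assembled events once all sources reported."""
        by_source = self._partial.setdefault(mep_id, {})
        by_source[source_id] = list(fragments)
        if len(by_source) < self.n_sources:
            return []
        del self._partial[mep_id]
        n_events = len(fragments)
        # Event i is the collection of the i-th fragment from every source.
        return [
            {src: frags[i] for src, frags in by_source.items()}
            for i in range(n_events)
        ]


builder = EventBuilder(n_sources=2)
builder.add_mep(0, "VELO", ["velo-evt0", "velo-evt1"])
for event in builder.add_mep(0, "TT", ["tt-evt0", "tt-evt1"]):
    print(event)
```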
HLT Event-Building
[Figure: readout network data-flow diagram, repeated from the previous slide]
• The Timing and Fast Control system receives the L1 trigger decisions from the L1 Decision Sorter. It then broadcasts its decision to the readout boards.
• All readout boards send their events to the SFC
• The SFC collects all fragments, assembles events and distributes them to a CPU
• The CPU runs the HLT algorithm and reports back with either a negative decision or the event with reconstructed and raw data
• The SFC forwards the event to permanent storage
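Because a single farm serves both traffic classes, L1 work has to win over HLT work whenever both are waiting. The sketch below models that rule with two queues. It is a deliberate simplification with hypothetical names: in the design described in the talk the prioritisation happens on each farm node (L1 as a priority task, HLT in the background), not in a central dispatcher.

```python
from collections import deque


class PriorityDispatcher:
    """Serve L1 events first; HLT events only use otherwise idle capacity."""

    def __init__(self):
        self._queues = {"L1": deque(), "HLT": deque()}

    def submit(self, kind, event):
        """Queue an event of kind 'L1' or 'HLT'."""
        self._queues[kind].append(event)

    def next_task(self):
        """Return (kind, event) for the next task, or None if nothing is pending."""
        for kind in ("L1", "HLT"):  # fixed priority order
            if self._queues[kind]:
                return kind, self._queues[kind].popleft()
        return None


dispatcher = PriorityDispatcher()
dispatcher.submit("HLT", "event-A")
dispatcher.submit("L1", "event-B")
print(dispatcher.next_task())  # the L1 event is served before the older HLT one
```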
How the system works
[Figure: animated version of the data-flow diagram (front-end electronics, TRM, readout network, 94 SFCs over 94 links at 7.1 GB/s, CPU farm with ~1800 CPUs, L1-Decision Sorter, TFC system, storage system), showing the sequence L0 Yes, L1 trigger, L1 Yes, HLT Yes for an example B → φ Ks event.]
Main Readout Network
• A possible solution for the readout network, based on 24-port Gigabit Ethernet switches
• Its behaviour is being simulated
• Fully connected network with 69 input ports and 85 output ports
Testbed Farm Hardware
• 2 Gigabit Ethernet switches
  • 2 × 3Com 2824, 2 × 24 ports
• 16 1U rack-mounted PCs
  • Dual Intel Xeon 2.4 GHz with Hyper-Threading
  • 2 GB of RAM
  • 160 GB IDE disk (but the machines operate diskless)
  • 1 Fast Ethernet and 3 Gigabit Ethernet adapters
  • 64-bit/133 MHz PCI-X bus
• 1 TB RAID5 disk array with Adaptec RAID controller and Ultra320 SCSI disks
  • Could be used to feed the SFC with input data at Gigabit rate, to perform simulations of the event processing
Farm Configuration
• 16 nodes running Red Hat 9b, with kernel 2.6.5
  • 1 gateway, acting as bastion host and NAT to the external network
  • 1 service PC, providing network boot services, central syslog, time synchronization, NFS exports, etc.
  • 1 diskless SFC, with 3 Gigabit links (2 for data and 1 for control traffic)
  • 13 diskless SFNs (26 physical, 52 logical processors with HT), each with 2 Gigabit links (1 for data and 1 for control traffic)
• Root fs mounted on a 150 MB RAM disk (kernel and compressed RAM disk image are downloaded from the network at boot time)
  • The RAM disk is created automatically by a set of scripts, on the admin's demand, whenever a change is made to a development root fs area on the service PC, and is put online for subsequent reboots
• /usr and /home mounted via NFS from the service PC
  • NFS mount points can provide access to the online application binaries
Monitoring, configuration and control
• One critical issue in administering the event filter farm is how to monitor each node, keep it configured and up to date, and control it
• A stringent requirement for such a control system is that it has to be interfaced to the general DAQ framework
• PVSS provides a runtime DB, automatic archiving of data to permanent storage, alarm generation, easy creation of graphical panels, and various protocols for communicating over the network
PVSS-DIM integration
[Figure: a PVSS project (DIM client) communicating with an agent on a farm node]
• The DIM network communication layer, already integrated with PVSS, is very suitable for our needs
  • It is simple and efficient
  • It allows bi-directional communication
• The idea is to run light agents on the farm nodes. They provide information to a PVSS project, which publishes it through GUIs and can also send arbitrarily complex commands to be executed on the farm nodes, with the output passed back
Monitoring
• All the relevant quantities useful to diagnose hardware or configuration problems should be traced
  • CPU fans and temperatures
  • Memory occupancy
  • RAM disk filesystem occupancy
  • CPU load
  • Network interface statistics, counters, errors
  • TCP/IP stack counters
  • Status of relevant processes
  • Network switch statistics (via the SNMP-PVSS interface)
  • … plus many other things to be learnt by experience
• Information should be viewable as current values and/or historical trends
• Alarms should be issued whenever relevant quantities fall outside their allowed ranges
  • PVSS naturally allows this, and can even start feedback procedures
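The alarm rule (raise an alarm when a quantity leaves its allowed range) is handled by PVSS itself in the system described here. Purely to illustrate the logic, and without using any PVSS or DIM API, a range check could look like the sketch below; the quantity names and limits are hypothetical.

```python
# Hypothetical quantities and limits, for illustration only.
ALLOWED_RANGES = {
    "cpu_temperature_c": (10.0, 70.0),
    "ramdisk_used_fraction": (0.0, 0.9),
    "cpu_load": (0.0, 4.0),
}


def out_of_range(readings, allowed=ALLOWED_RANGES):
    """Return one alarm message per reading that falls outside its range."""
    alarms = []
    for name, value in sorted(readings.items()):
        low, high = allowed[name]
        if not low <= value <= high:
            alarms.append(f"{name}={value} outside [{low}, {high}]")
    return alarms


sample = {"cpu_temperature_c": 82.5, "ramdisk_used_fraction": 0.4, "cpu_load": 1.2}
for alarm in out_of_range(sample):
    print("ALARM:", alarm)
```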
Configuration and control
• The idea is to embed in the framework every common operation usually needed by the system administrator, to be performed by means of GUIs
• On the service PC side
  • Upgrade of operating systems
    • RAM disk creation, kernel upgrades, etc.
  • Upgrade of application software
    • Putting online new versions of online programs and utilities, upgrading buggy packages, …
  • Automatic setup of configuration files
    • dhcpd table, NFS exports table, etc.
• On the farm node side
  • Inspection and modification of files
  • Broadcast of commands to the entire farm (e.g., reboot)
  • Fast logon by means of a shell-like environment embedded inside a PVSS GUI (e.g., commands, stdout and stderr passed back and forth by DIM)
  • (Re)start of online processes
  • …
Datagram Loss over Gigabit Ethernet
• An IP datagram loss implies an unpleasant Multi Event Packet loss
• But we cannot use reliable protocols, since re-transmission of data from the read-out trigger boards to the filter farm would introduce unpredictable delays
• LHCb measured a very good bit error rate (BER) of 10^-14 on copper cables; for comparison, the BER level for a 100 m copper cable according to IEEE 802.3 is about 10^-10
• We also measured
  • the datagram loss in the OS IP stack
  • the Ethernet frame loss in the level 2 switches
• We obtained the best system performance in a point-to-point transmission using IP datagrams of 4096 B:
  • Data flux: 999.90 Mb/s
  • Percentage of datagram loss: 7.1 × 10^-10
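To get a feeling for these numbers, the sketch below converts them into a datagram rate and a mean time between lost datagrams on one link. It rests on two assumptions that the slide does not settle: the quoted 7.1 × 10^-10 is read as a fraction of datagrams (if it is literally a percentage, the interval is 100 times longer), and the 999.90 Mb/s figure is treated as payload rate, ignoring header overheads. The last function gives the probability that a 4096 B datagram contains at least one bit error for the measured cable BER.

```python
import math

DATAGRAM_BYTES = 4096
THROUGHPUT_BPS = 999.90e6   # quoted data flux, treated here as payload rate
LOSS_FRACTION = 7.1e-10     # assumption: quoted value read as a fraction
CABLE_BER = 1e-14           # BER measured by LHCb on copper cables


def datagrams_per_second(throughput_bps, datagram_bytes):
    """Datagram rate on a saturated link, ignoring header overheads."""
    return throughput_bps / (datagram_bytes * 8)


def mean_seconds_between_losses(rate_hz, loss_fraction):
    """Average time between two lost datagrams."""
    return 1.0 / (rate_hz * loss_fraction)


def corruption_probability(ber, datagram_bytes):
    """Probability of at least one bit error in a datagram, for independent errors."""
    bits = datagram_bytes * 8
    return -math.expm1(bits * math.log1p(-ber))


rate = datagrams_per_second(THROUGHPUT_BPS, DATAGRAM_BYTES)
hours = mean_seconds_between_losses(rate, LOSS_FRACTION) / 3600.0
print(f"Datagram rate: {rate:.0f} per second")
print(f"Mean time between losses on one link: {hours:.1f} hours")
print(f"P(bit error in one datagram): {corruption_probability(CABLE_BER, DATAGRAM_BYTES):.2e}")
```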
Offline Computing
• While waiting for 2007, offline computing activities mainly consist of the mass production of Monte Carlo events and of MC data analysis
• In this period LHCb is carrying out the 2004 Data Challenge
  • It has just started
  • The target is to produce about 180M events in a run period of ~2-3 months, to be used for HLT and physics background studies
  • The events will subsequently be analysed on-site, where they were produced (provided that the site has stored them on local storage)
• A Computing TDR is going to be written at the beginning of 2005, based on this year's results
• ~20 computing centres (including CERN) in various European countries participate in the Data Challenge
• 2500 processors in total are expected to be used this year (~1600 at the moment)
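The production target implies an average per-processor throughput, which the sketch below derives from the quoted figures. It assumes, optimistically, that all 2500 processors are used full time for the whole period; the function names are illustrative.

```python
TOTAL_EVENTS = 180e6   # target of the 2004 Data Challenge
N_PROCESSORS = 2500    # processors expected to be used this year


def events_per_cpu_day(total_events, n_cpus, days):
    """Average number of events each processor must produce per day."""
    return total_events / (n_cpus * days)


def seconds_per_event(total_events, n_cpus, days):
    """Average processing time available per event on one processor."""
    return 86400.0 / events_per_cpu_day(total_events, n_cpus, days)


for days in (60, 90):  # the quoted 2-3 month run period
    per_day = events_per_cpu_day(TOTAL_EVENTS, N_PROCESSORS, days)
    budget = seconds_per_event(TOTAL_EVENTS, N_PROCESSORS, days)
    print(f"{days} days: {per_day:.0f} events/CPU/day, {budget:.0f} s per event")
```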
LHCb DC’04 (I)
• LHCb adopts two ways to produce MC data:
  • LHCb has developed its own production system (DIRAC), which does not use the LCG Grid
    • DIRAC uses a pull mechanism to fetch jobs from a central server, with agents running on the Computing Elements of the various production centres
    • DIRAC agents perform unattended, automatic installation of specific software and libraries when needed by a job, submit the jobs, send the output data and logs to the Storage Elements, update the bookkeeping databases and replica file catalogues, and perform job monitoring and accounting of the resources used
  • LHCb can also submit jobs through LCG-2
• In the LHCb DC’04 a first round of production is being performed with DIRAC; then, after a testing phase, LCG will smoothly grow and replace DIRAC for the second part of the data challenge
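The pull mechanism can be illustrated with a toy model: the central server only holds a queue, and each site agent asks for work when it has free capacity. The classes and method names below are hypothetical and are not the DIRAC API.

```python
from collections import deque


class CentralTaskQueue:
    """Central server: holds waiting jobs and hands them out on request."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)

    def request_job(self, site):
        """Called by a site agent; returns a job, or None if the queue is empty."""
        return self._jobs.popleft() if self._jobs else None


class SiteAgent:
    """Agent running at a production centre; pulls jobs while slots are free."""

    def __init__(self, site, free_slots):
        self.site = site
        self.free_slots = free_slots

    def poll(self, queue):
        """Fetch as many jobs as there are free slots; return the started jobs."""
        started = []
        while self.free_slots > 0:
            job = queue.request_job(self.site)
            if job is None:
                break
            self.free_slots -= 1
            started.append(job)
        return started


queue = CentralTaskQueue(f"mc-job-{i}" for i in range(5))
print(SiteAgent("site-A", free_slots=2).poll(queue))
print(SiteAgent("site-B", free_slots=4).poll(queue))
```

With a pull model the server never needs to know the state of each site, so a busy or unreachable site simply stops asking for work.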
LHCb DC’04 (II)
• DST files (the last step of the Monte Carlo production chain, to be used for data analysis) are produced in Bologna at a rate of 150 GB/day
• After job completion, DST data are stored by DIRAC to
  • local disk servers (NFS)
  • Tier-1 Castor tape Mass Storage (RFIO)
  • CERN Mass Storage (BBFTP or GRIDFTP)
• Data produced at the Tier-1 and stored on Castor are made available for external usage by a BBFTP server and a GRIDFTP server
LHCb DC’04 (III)
• LHCb Italy is participating in the DC with of the order of 400 processors (200k SPECint) at the INFN Tier-1
• At this moment it is the most important regional centre, with an amount of resources comparable to CERN
[Figure: chart with labels ITALY and CERN]
Some ideas for high throughput analysis
• Classic solution: Network Attached Storage
  [Figure: clients WN 1 … WN m connected through an Ethernet switch to a single NAS, marked as the bottleneck]
• A more effective solution: Parallel File System
  [Figure: clients WN 1 … WN m connected through an Ethernet switch to I/O nodes ION 1 … ION n and a management node (MGR)]
Parallel Virtual File System (PVFS) Performance
• Using 12 I/O nodes connected through 100Base-T to 100 clients simultaneously reading data, we measured an aggregate I/O rate of ~100 MB/s
• This can be compared to:
  • 20-40 MB/s (local disk)
  • 5-10 MB/s (NAS, 100Base-T)
  • 20-30 MB/s, very optimistically (NAS, 1000Base-T)
• We successfully used such a system during 2003 for LHCb massive data analysis at Bologna, with outstanding results
• We plan to work in close collaboration with the Tier-1 staff to set up a testbed to compare various parallel file system implementations
• We believe this is a very promising approach to massive data analysis at the LHC scale
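Two small calculations help put the measured figure in context. A parallel file system spreads each file over the I/O nodes in fixed-size stripes, so that many clients hit different servers at the same time, and the aggregate rate is bounded by the sum of the server links. The sketch below shows a generic round-robin stripe mapping and that upper bound. The 64 KiB stripe size is an assumed value for illustration, not a setting reported in the talk.

```python
FAST_ETHERNET_MBPS = 100  # 100Base-T line rate, in Mb/s


def aggregate_ceiling_mb_per_s(n_io_nodes, link_mbps=FAST_ETHERNET_MBPS):
    """Upper bound on aggregate throughput: the sum of the I/O node links, in MB/s."""
    return n_io_nodes * link_mbps / 8.0


def locate(offset, n_io_nodes, stripe_bytes):
    """Map a file offset to (I/O node index, offset in that node's local file)."""
    stripe_index, within = divmod(offset, stripe_bytes)
    node = stripe_index % n_io_nodes
    local_offset = (stripe_index // n_io_nodes) * stripe_bytes + within
    return node, local_offset


ceiling = aggregate_ceiling_mb_per_s(12)
print(f"Wire-speed ceiling for 12 I/O nodes: {ceiling:.0f} MB/s")
print(f"Measured ~100 MB/s is about {100.0 / ceiling:.0%} of that ceiling")
print(locate(offset=13 * 65536 + 10, n_io_nodes=12, stripe_bytes=65536))
```

On this estimate the measured ~100 MB/s is roughly two thirds of what the twelve 100Base-T links could carry in total, whereas a single NAS on the same kind of link is limited to one link's worth.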
Acknowledgments
• We want to thank the Computing Staff at INFN Bologna for their support in building the L1 & HLT event filter farm
• We want to express our sincere thanks to the Tier-1 Management and Staff at CNAF for their precious efforts in providing a high-quality infrastructure and support. They deal every day with plenty of technical issues connected with the construction and maintenance of such a large computing centre, which is emerging as one of the most important HEP-dedicated computing centres in Europe