LHCb on-line / off-line computing INFN CSN1 Lecce, 24.9.2003 Domenico Galli, Bologna
Off-line computing • We plan the LHCb-Italy off-line computing resources to be as centralized as possible. • Put as much computing power as possible in the CNAF Tier-1. • To minimize system administration manpower. • To optimize resource exploitation. • "Distributed" for us means distributed among CNAF and the other European Regional Centres. • Possible drawback: strong dependence on CNAF resource sharing. • The improvement following the setup of Tier-3s in major INFN sites for parallel ntuple analysis should be evaluated later. LHCb on-line / off-line computing.2 D. Galli
2003 Activities • In 2003 LHCb-Italy contributed to DC03 (production of MC samples for TDR). • 47 Mevt / 60 d. • 32 Mevt minimum bias; • 10 Mevt inclusive b; • 50 signal samples, whose size is 50 to 100 kevt. • 18 computing centres involved. • Italian contribution: 11.5% (should be 15%). LHCb on-line / off-line computing.3 D. Galli
2003 Activities (II) • The Italian contribution to DC03 has been obtained using limited resources (40 kSI2000, i.e. 100 1 GHz PIII CPUs). • Larger contributions (Karlsruhe, D; Imperial College, UK) come from the huge, dynamically allocated resources of these centres. • DIRAC, the LHCb distributed MC production system, has been used to run 36600 jobs; 85% of them ran outside CERN with 92% mean efficiency. LHCb on-line / off-line computing.4 D. Galli
2003 Activities (III) • DC03 has also been used to validate the LHCb distributed analysis model. • Distribution to Tier-1 centres of the signal and background MC samples stored at CERN during production. • Samples have been pre-reduced based on kinematic or trigger criteria. • Selection algorithms for specific decay channels (~30) have been executed. • Events have been classified by means of tagging algorithms. • LHCb-Italy contributed to the implementation of the selection algorithms for B decays into 2 charged pions/kaons. LHCb on-line / off-line computing.5 D. Galli
2003 Activities (IV) • To perform the analysis of high-statistics data samples, the PVFS distributed file system has been used. • 110 MB/s aggregate I/O using 100Base-T Ethernet connections (to be compared with the 50 MB/s of a typical 1000Base-T NAS). LHCb on-line / off-line computing.6 D. Galli
2003 Activities (V) • The analysis work by LHCb-Italy has been included in the “Reoptimized Detector Design and Performance” TDR (2-hadron channels + tagging). • 3 LHCb internal notes have been written: • CERN-LHCb/2003-123: Bologna group, “Selection of B/Bs → h+h- decays at LHCb”; • CERN-LHCb/2003-124: Bologna group, “CP sensitivity with B/Bs → h+h- decays at LHCb”; • CERN-LHCb/2003-115: Milano group, “LHCb flavour tagging performance”. LHCb on-line / off-line computing.7 D. Galli
Software Roadmap LHCb on-line / off-line computing.8 D. Galli
DC04 (April-June 2004) – Physics Goals • Demonstrate performance of HLTs (needed for computing TDR) • Large minimum bias sample + signal • Improve B/S estimates of optimisation TDR • Large bb sample + signal • Physics improvements to generators LHCb on-line / off-line computing.9 D. Galli
DC04 – Computing Goals • Main goal: gather information to be used for writing LHCb computing TDR • Robustness test of the LHCb software and production system • Using software as realistic as possible in terms of performance • Test of the LHCb distributed computing model • Including distributed analyses • Incorporation of the LCG application area software into the LHCb production environment • Use of LCG resources as a substantial fraction of the production capacity LHCb on-line / off-line computing.10 D. Galli
DC04 – Production Scenario • Generate (Gauss, “SIM” output): • 150 Million events minimum bias • 50 Million events inclusive b decays • 20 Million exclusive b decays in the channels of interest • Digitize (Boole, “DIGI” output): • All events, apply L0+L1 trigger decision • Reconstruct (Brunel, “DST” output): • Minimum bias and inclusive b decays passing L0 and L1 trigger • Entire exclusive b-decay sample • Store: • SIM+DIGI+DST of all reconstructed events LHCb on-line / off-line computing.11 D. Galli
Goal: Robustness Test of the LHCb Software and Production System • First use of the simulation program Gauss based on Geant4 • Introduction of the new digitisation program, Boole • With HLTEvent as output • Robustness of the reconstruction program, Brunel • Including any new tuning or other available improvements • Not including mis-alignment/calibration • Pre-selection of events based on physics criteria (DaVinci) • AKA “stripping” • Performed by production system after the reconstruction • Producing multiple DST output streams • Further development of production tools (Dirac etc.) • e.g. integration of stripping • e.g. Book-keeping improvements • e.g. Monitoring improvements LHCb on-line / off-line computing.12 D. Galli
Goal: Test of the LHCb Computing Model • Distributed data production • As in 2003, will be run on all available production sites • Including LCG1 • Controlled by the production manager at CERN • In close collaboration with the LHCb production site managers • Distributed data sets • CERN: • Complete DST (copied from production centres) • Master copies of pre-selections (stripped DST) • Tier1: • Complete replica of pre-selections • Master copy of DST produced at associated sites • Master (unique!) copy of SIM+DIGI produced at associated sites • Distributed analysis LHCb on-line / off-line computing.13 D. Galli
Goal: Incorporation of the LCG Software • Gaudi will be updated to: • Use POOL (persistency hybrid implementation) mechanism • Use certain SEAL (general framework services) services • e.g. Plug-in manager • All the applications will use the new Gaudi • Should be ~transparent but must be commissioned • N.B.: • POOL provides existing functionality of ROOT I/O • And more: e.g. location independent event collections • But incompatible with existing TDR data • May need to convert it if we want just one data format LHCb on-line / off-line computing.14 D. Galli
Needed Resources for DC04 • The CPU requirement is 10 times what was needed for DC03 • Current resource estimates indicate DC04 will last 3 months • Assumes that Gauss is twice as slow as SICBMC • Currently planned for April-June • GOAL: use of LCG resources as a substantial fraction of the production capacity • We can hope for up to 50% • Storage requirement: • 6 TB at CERN for the complete DST • 19 TB distributed among the Tier-1s for locally produced SIM+DIGI+DST • up to 1 TB per Tier-1 for pre-selected DSTs LHCb on-line / off-line computing.15 D. Galli
Resource request to Bologna Tier-1 for DC04 • CPU power: 200 kSI2000 (500 1 GHz PIII CPUs). • Disk: 5 TB • Tape: 5 TB LHCb on-line / off-line computing.16 D. Galli
Tier-1 Growth in the Next Years LHCb on-line / off-line computing.17 D. Galli
Online Computing • LHCb-Italy has been involved in online group to design the L1/HLT trigger farm. • Sezione di Bologna • G. Avoni, A. Carbone , D. Galli, U. Marconi, G. Peco, M. Piccinini, V. Vagnoni • Sezione di Milano • T. Bellunato, L. Carbone, P. Dini • Sezione di Ferrara • A. Gianoli LHCb on-line / off-line computing.18 D. Galli
Online Computing (II) • Lots of changes since the Online TDR • abandoned Network Processors • included Level-1 DAQ • have now Ethernet from the readout boards • destination assignment by TFC (Timing and Fast Control) • Main ideas the same • large gigabit Ethernet Local Area Network to connect detector sources to CPU destinations • simple (push) protocol, no event-manager • commodity components wherever possible • everything controlled, configured and monitored by ECS (Experimental Control System) LHCb on-line / off-line computing.19 D. Galli
DAQ Architecture [diagram]. Front-end electronics (FE boards, a TRM and the L1-Decision Sorter) feed a large Gb Ethernet readout network through a multiplexing layer (29 + 62-87 switches), with destinations assigned by the TFC system; sub-farm switches connect the SFCs to the farm CPUs. Figures from the diagram: HLT traffic 323 links, 4 kHz, 1.6 GB/s; Level-1 traffic 126-224 links, 44 kHz, 5.5-11.0 GB/s; 64-137 links at 88 kHz and 32 links after multiplexing; mixed traffic into the farm 94-175 links, 7.1-12.6 GB/s, serving 94-175 SFCs and a CPU farm of ~1800 CPUs, with output to the storage system. LHCb on-line / off-line computing.20 D. Galli
Following the Data-Flow [diagram]. An event accepted by L0 is sent from the front-end electronics as Level-1 traffic to the readout network; on an L1 Yes (distributed via the L1-Decision Sorter and the TFC system) the full event is read out over 94 links (7.1 GB/s) to one of the 94 SFCs and processed on the ~1800-CPU farm; on an HLT Yes (e.g. a B → ΦKs candidate) the event is sent to the storage system. LHCb on-line / off-line computing.21 D. Galli
Design Studies • Items under study: • Physical farm implementation (choice of cases, cooling, etc.) • Farm management (bootstrap procedure, monitoring) • Subfarm Controllers (event-builders, load-balancing queue) • Ethernet Switches • Integration with TFC and ECS • System Simulation • LHCb-Italy is involved in Farm management, Subfarm Controllers and their communication with Subfarm Nodes. LHCb on-line / off-line computing.22 D. Galli
Tests in Bologna • To begin the activity in Bologna we started from scratch (August 2003) by transferring data over 1000Base-T (gigabit Ethernet on copper cables) from PC to PC and measuring the performance. • Since we plan to use an unreliable protocol (raw Ethernet, raw IP or UDP), because reliable ones (like TCP, which retransmits unacknowledged segments) introduce unpredictable latency, we need to benchmark data loss together with throughput and latency. LHCb on-line / off-line computing.23 D. Galli
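To make the loss measurement concrete, here is a minimal sketch (an illustration, not the actual test code used in Bologna) of a sender that stamps each UDP datagram with a sequence number, so that the receiver can count gaps as lost datagrams; the destination address, the port and the 4096-byte payload are assumptions for the example. A matching receiver is sketched after the benchmark-environment slide below.

/* sketch: sequence-numbered UDP sender (illustrative, not the original code) */
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PAYLOAD 4096                       /* datagram size quoted in the results */

int main(int argc, char **argv)
{
    const char *dest = (argc > 1) ? argv[1] : "10.10.0.7";   /* assumed receiver address */
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in to;
    unsigned char buf[PAYLOAD];
    uint32_t seq = 0;

    memset(&to, 0, sizeof(to));
    to.sin_family = AF_INET;
    to.sin_port = htons(9000);             /* arbitrary test port */
    inet_pton(AF_INET, dest, &to.sin_addr);

    memset(buf, 0, sizeof(buf));
    for (;;) {                             /* runs until interrupted */
        /* the first 4 bytes of every datagram carry a sequence number */
        uint32_t n = htonl(seq++);
        memcpy(buf, &n, sizeof(n));
        sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&to, sizeof(to));
    }
    return 0;
}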
Tests in Bologna (II) – Previous results • The IEEE 802.3 standard specifies, for 100 m long cat5e cables, a BER (Bit Error Rate) < 10^-10. • Previous measurements, performed by A. Barczyc, B. Jost, N. Neufeld using Network Processors (not real PCs) and 100 m long cat5e cables, showed a BER < 10^-14. • Recent measurements (presented by A. Barczyc at Zürich, 18.09.2003), performed using PCs, gave a frame drop rate of O(10^-6). • Much data (too much for L1!) gets lost inside the kernel network stack implementation in the PCs. LHCb on-line / off-line computing.24 D. Galli
Tests in Bologna (III) • Transferring data over 1000Base-T Ethernet is not as trivial as it was for 100Base-TX Ethernet. • A new bus (PCI-X) and new chipsets (e.g. Intel E7501, 875P) have been designed to sustain the gigabit NIC data flow (the PCI bus and old chipsets do not have enough bandwidth to drive a gigabit NIC at gigabit rate). • The Linux kernel implementation of the network stack has been rewritten twice since kernel 2.4 to support gigabit data flow (networking code is 20% of the kernel source). The last modification implies a change of the kernel-to-driver interface (network drivers must be rewritten). • The standard Linux RedHat 9A setup uses the back-compatibility code path and loses packets. • Not many people are interested in achieving very low packet loss (except for video streaming). • A DATATAG group is also working on packet losses (M. Rio, T. Kelly, M. Goutelle, R. Hughes-Jones, J.P. Martin-Flatin, “A map of the networking code in Linux Kernel 2.4.20”, draft 8, 18 August 2003). LHCb on-line / off-line computing.25 D. Galli
Tests in Bologna. Results Summary • Throughput was always higher than expected (957 Mb/s of IP payload measured), while data loss was our main concern. • We have understood, first (at least) within the LHCb collaboration, how to send IP datagrams at gigabit/second rate from Linux to Linux over 1000Base-T Ethernet without datagram loss (4 datagrams lost per 2.0x10^10 datagrams sent). • This required: • the appropriate software: • a NAPI kernel (≥ 2.4.20); • NAPI-enabled drivers (for the Intel e1000 driver, recompilation with a special flag set was needed); • kernel parameter tuning (buffer & queue lengths); • 1000Base-T flow control enabled on the NICs. LHCb on-line / off-line computing.26 D. Galli
Test-bed 0 • 2 x PC with 3 x 1000Base-T interfaces each • Motherboard: SuperMicro X5DPL-iGM • Dual Pentium IV Xeon 2.4 GHz, 1 GB ECC RAM • Chipset Intel E7501 • 400/533 MHz FSB (front side bus) • Bus Controller Hub Intel P64H2 (2 x PCI-X, 64 bit, 66/100/133 MHz) • Ethernet controller Intel 82545EM: 1 x 1000Base-T interface (supports Jumbo Frames) • Plugged-in PCI-X Ethernet Card: Intel Pro/1000 MT Dual Port Server Adapter • Ethernet controller Intel 82546EB: 2 x 1000Base-T interfaces (supports Jumbo Frames) • 1000Base-T 8-port switch: HP ProCurve 6108 • 16 Gbps backplane: non-blocking architecture • latency: < 12.5 µs (LIFO, 64-byte packets) • throughput: 11.9 million pps (64-byte packets) • switching capacity: 16 Gbps • Cat. 6e cables • max 500 MHz (cf. 125 MHz for 1000Base-T) LHCb on-line / off-line computing.27 D. Galli
Test-bed 0 (II) [diagram]: the two PCs (lhcbcn1 and lhcbcn2) are connected through the 1000Base-T switch on three networks, 131.154.10.x (hosts .2 and .7, with uplink), 10.10.0.x and 10.10.1.x. • echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter • to use only one interface to receive packets belonging to a given network (131.154.10, 10.10.0 and 10.10.1). LHCb on-line / off-line computing.28 D. Galli
Test-bed 0 (III) LHCb on-line / off-line computing.29 D. Galli
SuperMicro X5DPL-iGM Motherboard (Chipset Intel E7501) • The chipset internal bandwidth is guaranteed: 6.4 Gb/s minimum. LHCb on-line / off-line computing.30 D. Galli
Benchmark Software • We used 2 benchmark programs: • Netperf 2.2p14 UDP_STREAM • Self-made basic sender & receiver programs using UDP & RAW IP • We discovered a bug in netperf on the Linux platform: • on Linux, setsockopt(SO_SNDBUF) & setsockopt(SO_RCVBUF) set the buffer size to twice the requested size, while getsockopt(SO_SNDBUF) & getsockopt(SO_RCVBUF) return the actual buffer size; since netperf uses the same variable for both system calls, when it iterates to reach the requested precision in the results it doubles the buffer size at each iteration. LHCb on-line / off-line computing.31 D. Galli
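The doubling behaviour behind this bug is documented Linux behaviour (see socket(7)) and can be seen with a few lines of C; this is only an illustrative sketch, not part of the benchmark suite:

/* sketch: Linux doubles the value passed to SO_RCVBUF (socket(7)) */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int requested = 524288;                 /* buffer value used in the tests */
    int actual = 0;
    socklen_t len = sizeof(actual);

    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    getsockopt(s, SOL_SOCKET, SO_RCVBUF, &actual, &len);

    /* 'actual' comes back as about 2 x 'requested', capped by net.core.rmem_max */
    printf("requested %d, kernel reports %d\n", requested, actual);
    return 0;
}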
Benchmark Environment • Kernel 2.4.20-18.9smp • GigaEthernet driver: e1000 • version 5.0.43-k1 (RedHat 9A) • version 5.2.16 recompiled with the NAPI flag enabled • System disconnected from the public network • Runlevel 3 (X11 stopped) • Daemons stopped (crond, atd, sendmail, etc.) • Flow control on (on both NICs and switch) • Number of descriptors allocated by the driver rings: 256, 4096 • IP send buffer size: 524288 (x2) Bytes • IP receive buffer size: 524288 (x2), 1048576 (x2) Bytes • Tx queue length: 100, 1600 LHCb on-line / off-line computing.32 D. Galli
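As a companion to the sender sketched earlier, a minimal receiver that applies the enlarged receive buffer listed above and counts sequence-number gaps as lost datagrams; the port and buffer value are the same assumptions as before, and out-of-order delivery is ignored for simplicity:

/* sketch: sequence-counting UDP receiver (illustrative, not the original code) */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PAYLOAD 4096

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    int rcvbuf = 524288;                    /* the kernel roughly doubles this */
    struct sockaddr_in addr;
    unsigned char buf[PAYLOAD];
    uint32_t expected = 0;
    unsigned long long received = 0, lost = 0;

    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);            /* same test port as the sender */
    bind(s, (struct sockaddr *)&addr, sizeof(addr));

    for (;;) {
        if (recv(s, buf, sizeof(buf), 0) < (ssize_t)sizeof(uint32_t))
            continue;
        uint32_t nseq;
        memcpy(&nseq, buf, sizeof(nseq));
        uint32_t seq = ntohl(nseq);
        if (seq != expected)                /* gap = datagrams dropped somewhere */
            lost += seq - expected;         /* reordering ignored in this sketch */
        expected = seq + 1;
        if (++received % 10000000ULL == 0)
            printf("received %llu, lost %llu\n", received, lost);
    }
}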
First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning. • The first benchmark results on datagram loss showed big fluctuations which, in principle, can be due to packet queue resets, other CPU processes, interrupts, soft_irqs, broadcast network traffic, etc. • The resulting distribution is multi-modal. • Mean loss: 1 datagram lost every 20000 datagrams sent. Too much for the LHCb L1!!! LHCb on-line / off-line computing.33 D. Galli
First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning (II) • We think that the peak behavior is due to kernel queue resets (all queued packets are silently dropped when the queue is full). LHCb on-line / off-line computing.34 D. Galli
Changes in the Linux Network Stack Implementation • 2.1 → 2.2: netlink, bottom halves, HFC (hardware flow control) • As little computation as possible while in interrupt context (interrupts disabled). • Part of the processing is deferred from the interrupt handler to bottom halves, to be executed at a later time (with interrupts enabled). • HFC (to prevent interrupt livelock): when the backlog queue is completely full, interrupts are disabled until the backlog queue is emptied. • Bottom-half execution is strictly serialized among CPUs; only one packet at a time can enter the system. • 2.3.43 → 2.4: softnet, softirq • softirqs are software threads that replace bottom halves • possible parallelism on SMP machines • 2.5.53 → 2.4.20 (N.B.: back-port): NAPI (new application program interface) • interrupt mitigation technology (a mixture of interrupt and polling mechanisms) LHCb on-line / off-line computing.35 D. Galli
Interrupt livelock • Given the interrupt rate coming in, the IP processing thread never gets a chance to remove any packets from the system. • So many interrupts come into the system that no useful work is done. • Packets go all the way to being queued, but are dropped because the backlog queue is full. • System resources are abused extensively but no useful work is accomplished. LHCb on-line / off-line computing.36 D. Galli
NAPI (New API) • NAPI is an interrupt mitigation mechanism constituted by a mixture of interrupt and polling mechanisms: • Polling: • useful under heavy load; • introduces more latency under light load; • abuses the CPU by polling devices that have no packet to offer. • Interrupts: • improve latency under light load; • make the system vulnerable to livelock as the interrupt load exceeds the MLFFR (Maximum Loss Free Forwarding Rate). LHCb on-line / off-line computing.37 D. Galli
Packet Reception in Linux Kernel 2.4.19 (softnet) and 2.4.20 (NAPI) [diagrams: softnet (kernel 2.4.19) vs NAPI (kernel 2.4.20)] LHCb on-line / off-line computing.38 D. Galli
NAPI (II) • Under low load, before the MLFFR is reached, the system converges toward an interrupt-driven system: the packets/interrupt ratio is lower and latency is reduced. • Under heavy load, the system takes its time to poll the registered devices; interrupts are taken only as fast as the system can process them: the packets/interrupt ratio is higher and latency is increased. LHCb on-line / off-line computing.39 D. Galli
NAPI (III) • NAPI changes driver-to-kernel interfaces. • all network drivers should be rewritten. • In order to accommodate devices not NAPI-aware, the old interface (backlog queue) is still available for the old drivers (back-compatibility). • Backlog queues, when used in back-compatibility mode, are polled just like other devices. LHCb on-line / off-line computing.40 D. Galli
True NAPI vs Back-Compatibility Mode [diagrams: NAPI kernel with a NAPI driver vs NAPI kernel with an old (not NAPI-aware) driver] LHCb on-line / off-line computing.41 D. Galli
The Intel e1000 Driver • Even in the latest version of the e1000 driver (5.2.16), NAPI is turned off by default (to allow the driver to be used also with pre-NAPI kernels, e.g. 2.4.19). • To enable NAPI, the e1000 5.2.16 driver must be recompiled with the option: make CFLAGS_EXTRA=-DCONFIG_E1000_NAPI LHCb on-line / off-line computing.42 D. Galli
Best Results • Maximum transfer rate (UDP, 4096-byte datagrams): 957 Mb/s. • Mean datagram loss fraction (@ 957 Mb/s): 2.0x10^-10 (4 datagrams lost per 2.0x10^10 4k-datagrams sent) • corresponding to a BER of 6.2x10^-15 (using 1 m cat6e cables) if the data loss is entirely due to hardware CRC errors. LHCb on-line / off-line computing.43 D. Galli
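A rough cross-check of the quoted BER, assuming each lost datagram corresponds to a single corrupted bit and counting only the payload bits actually transferred: BER ≈ 4 / (2.0x10^10 datagrams x 4096 bytes x 8 bits/byte) ≈ 4 / 6.6x10^14 ≈ 6x10^-15, consistent with the quoted 6.2x10^-15 (the exact figure depends on whether header and framing bits are counted).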
To be Tested to Improve Further • Kernel 2.5 • fully preemptive (real time); • sysenter & sysexit (instead of int 0x80) for the context switch following system calls (3-4 times faster). • Asynchronous datagram receiving • Jumbo frames • Ethernet frames whose MTU (Maximum Transmission Unit) is 9000 instead of 1500: less fragmentation of IP datagrams into packets. • Kernel Mode Linux (http://web.yl.is.s.u-tokyo.ac.jp/~tosh/kml/) • KML is a technology that enables the execution of ordinary user-space programs inside kernel space. • Protection-by-software (as in Java bytecode) instead of protection-by-hardware. • System calls become function calls (132 times faster than int 0x80, 36 times faster than sysenter/sysexit). LHCb on-line / off-line computing.44 D. Galli
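For the jumbo-frame item above, a small sketch of how the MTU could be raised to 9000 on Linux (the interface name is an assumption; root privileges and jumbo-frame support in the NIC, driver and switch are required):

/* sketch: enable jumbo frames by raising the interface MTU to 9000 */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth1", IFNAMSIZ - 1);   /* assumed interface name */
    ifr.ifr_mtu = 9000;                            /* jumbo frame MTU */

    if (ioctl(s, SIOCSIFMTU, &ifr) < 0)            /* needs root */
        perror("SIOCSIFMTU");
    return 0;
}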
Milestones • 8.2004 – Streaming benchmarks: • Maximum streaming throughput and packet loss using UDP, RAW IP and RAW Ethernet with a loopback cable. • Test of switch performance (streaming throughput, latency and packet loss, using standard frames and jumbo frames). • Maximum streaming throughput and packet loss using UDP, RAW IP and RAW Ethernet for 2 or 3 simultaneous connections on the same PC. • Test of event building (receive 2 message streams and send 1 joined message stream). • 12.2004 – SFC (Sub-Farm Controller) to nodes communication: • Definition of the SFC-to-nodes communication protocol. • Definition of the SFC queueing and scheduling mechanism. • First implementation of the queueing/scheduling procedures (possibly zero-copy). LHCb on-line / off-line computing.45 D. Galli
Milestones (II) • OS tests (if performance needs to be improved): • kernel Linux 2.5.53; • KML (Kernel Mode Linux). • Design and test of the bootstrap procedures: • Measurement of the failure rate of the simultaneous boot of a cluster of PCs, using PXE/DHCP and TFTP. • Test of node switch on/off and power cycle using ASF. • Design of the bootstrap system (ratio of nodes/proxy servers/servers, software alignment among servers). • Definition of the requirements for the trigger software: • error trapping; • timeouts. LHCb on-line / off-line computing.46 D. Galli