The CMS Event Builder Demonstrator based on Myrinet
Frans Meijers, CERN/EP, on behalf of the CMS DAQ group
CHEP2000, Padova, Italy, February 2000
Outline: Introduction • Myrinet Overview • Tests of the Switching Fabric • Event Building Studies • Future Work and Conclusions
Introduction • DAQ architecture and EVB parameters • Event building by switches: crossbar • EVB traffic shaping: barrel shifter • Banyan network • A multistage 1024-port switch • The CMS DAQ system
DAQ architecture and EVB parameters
[Diagram: Detector Front-end → Level-1 Trigger → Readout Systems → Event Builder Networks → Builder and Filter Systems → Computing Services, supervised by the Run Control Manager]
• Level-1 maximum trigger rate: 100 kHz
• High Level Trigger acceptance: 1 - 10 %
• Average event size: 1 Mbyte
• Number of Readout Units: 512
• Average event fragment size: 2 kbyte
• Builder network (512x512 port) aggregate throughput: 1 Tbps
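As a consistency check (my arithmetic, not part of the original slide), the quoted parameters fix each other: 512 readout units times roughly 2 kbyte per fragment give the 1 Mbyte event, and 100 kHz times 1 Mbyte gives of order the 1 Tbps aggregate builder throughput.

```python
# Back-of-the-envelope check of the quoted EVB parameters.
level1_rate_hz = 100e3          # maximum Level-1 trigger rate
event_size_bytes = 1e6          # average event size (1 Mbyte)
n_readout_units = 512

fragment_size_bytes = event_size_bytes / n_readout_units
aggregate_tbps = level1_rate_hz * event_size_bytes * 8 / 1e12

print(f"fragment size  ~ {fragment_size_bytes/1e3:.1f} kbyte")   # ~2 kbyte
print(f"aggregate load ~ {aggregate_tbps:.2f} Tbps")             # ~0.8 Tbps, quoted as 1 Tbps
```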
Event building by switches: crossbar
NxN matrix, N² crosspoints
• The maximum switch load for random traffic is about 63% (large-N limit) due to head-of-line blocking
• Higher efficiency requires:
• queues at input and/or output ports
• traffic shaping (example: barrel shifter, 100%)
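A short illustration (mine, not from the talk) of where the ~63% figure comes from: if every input independently targets a uniformly random output each cycle and a contended output serves only one packet, the expected fraction of busy outputs is 1 - (1 - 1/N)^N, which tends to 1 - 1/e ≈ 63% for large N and gives ~68% for a 4x4, consistent with the random-traffic measurements shown later.

```python
# Expected fraction of outputs that receive a packet when each of N inputs
# targets a uniformly random output and each output serves one packet per cycle.
def crossbar_load(n: int) -> float:
    return 1.0 - (1.0 - 1.0 / n) ** n

for n in (4, 16, 64, 1024):
    print(f"{n:>4}x{n:<4} random-traffic load ~ {crossbar_load(n):.1%}")
# 4x4 ~ 68%; large N approaches 1 - 1/e ~ 63.2%
```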
EVB traffic shaping: barrel shifter
[Diagram: events 1-5 striped across sources and destinations over steps 1-4]
• sources emit to mutually exclusive destinations in a cycle
• works only for fixed-size chunks
• needs synchronisation
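One common way to realise the "mutually exclusive destinations in a cycle" rule is to shift the destination by the step number; this is a sketch under that assumption, not the demonstrator's NIC code.

```python
# Barrel-shifter schedule: at step t, source s sends its fixed-size block
# to destination (s + t) mod N, so no two sources ever share a destination.
N = 16

def destination(source: int, step: int) -> int:
    return (source + step) % N

for step in range(N):
    targets = [destination(s, step) for s in range(N)]
    assert len(set(targets)) == N   # mutually exclusive in every step
```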
Banyan network
[Diagram: 8x8 Banyan network, inputs s0-s7, outputs d0-d7]
• single path per connection
• suffers from internal blocking
• number of crosspoints: N log₂ N
• example: 8x8 made of 3 stages of 2x2 elements (8 = 2³)
• For random traffic (no input queues at intermediate stages and no output queues):
• efficiency drops with the number of stages and with N; for "infinite" N, efficiency ~20%
• There exists a non-blocking barrel-shifting pattern
A multistage 1024-port switch
• Banyan topology: NxN built out of nxn crossbars, N = n^s
• basic unit: 8x8 crossbars
• 3 stages: 512x512
• need 192 crossbars in total
Important to study multistage switches
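The crossbar count follows directly from the Banyan construction: with n-port elements and s stages, each stage needs N/n elements. A quick check of the numbers quoted above (my arithmetic, matching the slide):

```python
# Size of a Banyan-style multistage switch built from n x n crossbars.
n, stages = 8, 3
N = n ** stages                                   # ports per side: 8^3 = 512
crossbars_per_stage = N // n                      # 64
total_crossbars = stages * crossbars_per_stage    # 192, as quoted on the slide
crosspoints = total_crossbars * n * n             # 12288
single_stage_crosspoints = N * N                  # 262144 for one monolithic crossbar
print(N, total_crossbars, crosspoints, single_stage_crosspoints)
```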
The CMS DAQ system
[Diagram: detector front-end readout, Level-1 trigger (LV1), Readout Units (RU), Event Manager (EVM), controls (Ctrl), Filter Units (FU), computing and communication services]
Myrinet overview • Myrinet features • Myrinet switches • Network Interface Card
Myrinet features
[Diagram: packet structure with routing header, payload and CRC; STOP/GO flow control]
• Myrinet is a System Area Network (SAN)
• point-to-point links, byte wide, full duplex, 1.3 Gbps per direction, very low error rate
• packet structure: routing header, payload and CRC tail; each crossbar switch strips the leading byte from the routing header
• wormhole routing (versus store-and-forward): no buffering, low latency, arbitrary-length packets
• byte-based flow control (STOP/GO)
• no packet loss inside the switching fabric
• 3Q 2000: link speed from 1.3 Gbps to 2.6 Gbps
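An illustrative model (not Myrinet driver code) of the strip-per-stage source routing described above: the packet carries one routing byte per crossbar it traverses, and each stage consumes the leading byte before forwarding.

```python
# Model of a source-routed packet: one routing byte per crossbar stage,
# followed by the payload (the real packet also carries a CRC tail).
def build_packet(route: list, payload: bytes) -> bytes:
    return bytes(route) + payload

def switch_forward(packet: bytes):
    out_port, rest = packet[0], packet[1:]   # crossbar strips the leading routing byte
    return out_port, rest

pkt = build_packet([3, 5], b"event fragment")   # two-stage fabric
port1, pkt = switch_forward(pkt)                # first stage routes on byte 3
port2, pkt = switch_forward(pkt)                # second stage routes on byte 5
assert pkt == b"event fragment"
```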
Myrinet switches
• M2M-OCT-SW8: 32 ports, built from 8 4x4 crossbars
• Large switch fabrics are built out of 4x4 crossbar elements
• an 8x8 crossbar is now available as the basic element
Network interface card
[Diagram: M2M-PCI64 NIC with LANai7 RISC, 2 Mbyte memory, PCI bridge (32/64 bit, 33/66 MHz), host DMA engine, send/receive packet DMA engines, byte-wide SAN link interface (80 MHz, NRZ)]
• Developed a custom Myrinet Control Program (MCP)
• controls the DMA engines
• implements the low-level communication protocol
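The control program itself is not shown in the talk; as a rough sketch of the kind of loop such firmware runs (all names hypothetical, the real MCP runs on the LANai RISC), a send path might poll a host-written descriptor, stage the data in NIC memory, and push it onto the link.

```python
# Hypothetical sketch of an NIC control-program send loop (names invented):
# poll a host-written descriptor, DMA the data into NIC memory, then DMA it
# onto the SAN link together with its routing header.
def mcp_send_loop(host_queue, host_dma, link_dma):
    while True:
        desc = host_queue.next_descriptor()      # written by the host driver
        if desc is None:
            continue                             # nothing to send yet
        buf = host_dma.copy_to_nic(desc.host_addr, desc.length)
        link_dma.send(desc.route, buf)           # routing header + payload
        host_queue.mark_done(desc)
```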
Switch tests • Set-up for switch test • Traffic conditions tested • Point-to-point 1x1 • Parameters point-to-point 1x1 • Point-to-Point NxN - Mutually exclusive paths • Block on output port • Block on internal switch • Random Traffic
Demonstrator set-up for switch tests
[Diagram: sources and destinations connected through the switch fabric]
• 32 Linux PC nodes
• PC: 450 MHz PII, BX chipset, PCI 33 MHz/32 bit
• Myrinet switch: M2M-OCT-SW8, NIC: M2M-PCI64[A]
• two-stage Banyan network out of 4x4 crossbars
Traffic conditions tested
[Diagram: 32 nodes arranged as 16 sources and 16 destinations]
• Point-to-point traffic (fixed destinations)
• Random traffic
Point-to-point 1x1
• full host-NIC DMA: limited by the PCI bus (33 MHz/32 bit)
• partial host-NIC DMA:
• NIC memory - link: full packet
• host - NIC: only the headers
• limited by the SAN link; allows the switch to be loaded to its maximum
Parameters point-to-point 1x1
time per packet = overhead + size / speed
• above 1 kbyte: linear behaviour
• below 1 kbyte: plateau of 5 µs (NIC-host communication)
• full host-NIC DMA: speed 128 Mbyte/s -> PCI speed
• partial host-NIC DMA: speed 141 Mbyte/s -> 92% link efficiency
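Plugging the measured parameters into the model above gives the expected packet rate for a given size; a quick evaluation (my arithmetic, using the partial-DMA numbers quoted on the slide):

```python
# time_per_packet = overhead + size / speed, with the measured parameters
overhead_s = 5e-6            # ~5 us NIC-host communication plateau
speed_Bps = 141e6            # partial host-NIC DMA, ~92% of the link

def packet_time(size_bytes: float) -> float:
    return overhead_s + size_bytes / speed_Bps

for size in (256, 1024, 2048, 4096):
    t = packet_time(size)
    print(f"{size:5d} B: {t*1e6:5.1f} us -> {1/t/1e3:5.1f} kHz, {size/t/1e6:6.1f} Mbyte/s")
```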
Point-to-point NxN - mutually exclusive paths
[Plot: throughput vs packet size for 1x1, 4x4, 8x8, 16x16]
• fixed destination pattern d = 4*(s%4) + s/4, s = 0-15, so the paths through the fabric are disjoint
• as expected: aggregate throughput through the switch is linear in N
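The quoted mapping gives every source its own internal path through the two-stage 4x4 fabric; a two-line check (illustrative) that it is a permutation of the 16 destinations:

```python
# Destination pattern for the mutually-exclusive-paths test: d = 4*(s % 4) + s // 4
dests = [4 * (s % 4) + s // 4 for s in range(16)]
assert sorted(dests) == list(range(16))   # every destination used exactly once
print(dests)   # [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```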
Block on output port
[Plot: throughput measured at source #0 vs packet size for m = 1, 2, 3, 4]
• force m (= 1, 2, 3, 4) sources onto the same destination
• each source gets 1/m of Vmax (measured at source #0)
Block on internal switch
• force 2 sources onto different destinations, but through the same intermediate path
• as expected: plateau at Vmax/2 (measured at source #0)
Random traffic
[Diagram: 16 sources and 16 destinations; plot of throughput for 1x1, 4x4, 16x16, measured at the destinations]
• sources send, independently, to a random destination chosen from a uniform distribution
• Efficiency: 4x4: 69% (expect 68%); 16x16: 51%
• limited by head-of-line blocking
Event building studies • EVB demonstrator set-up • Event building protocol • Variable size event fragments • Event building performance • Event building: scaling behaviour • Traffic shaping • EVB performance with traffic shaping • performance for variable size event fragments • EVB with traffic shaping: scaling behaviour • Traffic shaping: time evolution
EVB demonstrator set-up
[Diagram: PCs emulating RUs (sources) and BUs (destinations) around the switch, plus one PC as EVM]
• 32+1 Linux PCs [450 MHz PII, BX chipset, PCI 33 MHz/32 bit]
• Myrinet switch: M2M-OCT-SW8, NIC: M2M-PCI64[A]
• 16x16 two-stage Banyan network out of 4x4 crossbars
• Myrinet between RUs and BUs (full duplex): N-to-N traffic
• Fast Ethernet between BUs and EVM: N-to-1 traffic
• No emulation of the Level-1 trigger
Event building protocol
[Message-sequence diagram between EVM, RUs and BUs; EvtIds originate at Level-1]
• a BU sends Request EvtId to the EVM and receives an EvtId
• the BU sends Request Data to the RUs; the RUs Send Data over Myrinet
• the BU sends Clear EvtId to the EVM once the event is built
• several EvtId messages are grouped in a single Ethernet packet
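A compact way to read the sequence off the diagram (the message names are taken from the slide; the ordering is my reconstruction of the flow):

```python
# Message flow of the event-building protocol for one event, from the BU's view.
protocol = [
    ("BU",  "EVM", "Request EvtId"),   # ask the Event Manager for an event
    ("EVM", "BU",  "EvtId"),           # identifier assigned at Level-1 time
    ("BU",  "RU",  "Request Data"),    # sent to every Readout Unit
    ("RU",  "BU",  "Send Data"),       # fragments travel over Myrinet
    ("BU",  "EVM", "Clear EvtId"),     # event built, identifier can be reused
]
for src, dst, msg in protocol:
    print(f"{src:>3} -> {dst:<3}: {msg}")
```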
Variable size event fragments
[Plot: log-normal fragment-size distribution at the Readout Units; events assembled in the Builder Units]
• log-normal distribution, example: average = 2 kbyte, RMS = 2 kbyte
• mimics CMS data readout
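To emulate such fragments, the desired average and RMS have to be mapped onto the parameters of the underlying Gaussian of the log-normal; a short sketch (my parameterisation, not necessarily the one used in the demonstrator):

```python
import numpy as np

# Log-normal fragment sizes with a given mean and RMS (here 2 kbyte each).
mean, rms = 2048.0, 2048.0
sigma2 = np.log(1.0 + (rms / mean) ** 2)   # variance of the underlying Gaussian
mu = np.log(mean) - 0.5 * sigma2

sizes = np.random.lognormal(mu, np.sqrt(sigma2), size=100_000)
print(sizes.mean(), sizes.std())           # ~2048, ~2048
```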
Event building performance
[Plot: fragment rate per node vs fragment size for 1x1, 4x4, 8x8, 16x16; one configuration noted as unstable]
• no traffic shaping, fixed size event fragments
• results:
• 1x1 is close to point-to-point
• performance decreases from 4x4 to 8x8 to 16x16, as expected
• from small sizes: overhead 7 µs
• 16x16, 2 kbyte fragments: 30 kHz fragment rate per node†
† fragment rate per node = Level-1 rate
Event building - scaling behaviour
• average fragment size of 2 kbyte; both fixed and variable size fragments
• results:
• for variable size fragments, reduced performance, as expected
• no scaling in N
• need simulation for large N?
Traffic shaping
[Diagram: RU0-RU3 cycling fixed-size blocks to BU0-BU3]
• Sources divide fragments into fixed size packets (blocks) and cycle through all destinations
• Inspired by ATM rate division (ATM block size is 53 bytes)
• Should work for a large-N multistage switch as well
• Implementation (see the sketch below):
• performed by the NIC control program
• block size set to 4 kbyte (30 µs cycle)
• barrel shifter without external synchronisation (Myrinet back pressure via hardware flow control)
• packets can be (partially) empty
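A minimal sketch of the block-level shaping described above, assuming the (source + step) mod N schedule from the barrel-shifter slide; the real logic runs in the NIC control program, not in host code.

```python
# Chop a source's fragment queue into fixed 4-kbyte blocks and emit one block
# per barrel-shifter step; a block may be only partially filled.
BLOCK = 4096   # bytes per step (~29 us at the measured ~141 Mbyte/s link speed)
N = 16

def emit_blocks(source: int, queue: bytearray, steps: int):
    """Yield (destination, block) pairs, one fixed-size block per step."""
    for step in range(steps):
        payload = bytes(queue[:BLOCK])           # may be shorter than BLOCK
        del queue[:BLOCK]
        dest = (source + step) % N
        yield dest, payload.ljust(BLOCK, b"\0")  # pad: (partially) empty blocks are sent anyway
```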
EVB performance with traffic shaping
[Plot: fragment rate per node vs fragment size; markers at 2 kbyte and 4 kbyte]
• fixed size event fragments
• results: close to point-to-point
• 16x16, 2 kbyte fragments: 65 kHz fragment rate per node
Performance for variable size event fragments
[Plot: fragment rate per node vs RMS of the fragment-size distribution, average 2 kbyte]
• efficiency decreases with larger RMS of the fragment size distribution (in agreement with Monte Carlo)
• for the nominal average of 2 kbyte and RMS of 2 kbyte†: 60 kHz fragment rate per node
† with full host-NIC DMA: about 80 Mbyte/s or 40 kHz
EVB traffic shaping - scaling behaviour
[Plot: throughput per node vs N]
• with traffic shaping: approximate scaling
Traffic shaping - time evolution (I)
[Plot: barrel-shifter cycling rate x block size over 2 hours (= 2·10⁸ cycles, 10 Tbyte moved)]
• around 23:00 the throughput dropped - cause?
• did the traffic-shaping barrel shifter stay in sync?
Traffic shaping - time evolution (II)
[Plot: barrel-shifter cycling rate x block size over 1 hour (= 10⁸ cycles), with two perturbations marked]
• perturb the system:
• 1: slow down RU1: all BUs run at reduced rate
• 2: slow down BU1: only BU1 runs at reduced rate
• the traffic-shaping barrel shifter stays in sync
Future work and conclusions • Future work • Conclusions
Future work • Evaluate Myrinet 2000 • available 3Q 2000 • link speed from 1.3 Gbps to 2.6 Gbps • switches based on 8x8 crossbars as elementary units • Further study of traffic shaping • Simulation • Extrapolate to large systems
Conclusions
• A 16x16 event builder demonstrator based on a Myrinet multistage switch and Linux PCs has been established.
• Performed systematic switch studies; results as expected.
• Measured event building performance:
• without traffic shaping: no scaling, as expected
• with traffic shaping: approximate scaling
• For nominal event fragment sizes with average and RMS of 2 kbyte, achieved about 60 kHz trigger rate or 120 Mbyte/s per node (almost 2 Gbyte/s aggregate)
• That is, today, a factor of two off from the CMS needs, assuming scaling.
• Measurements provide parameters for simulation of large-scale (512x512) systems
Multi-step Event Building
• Step 1, at 100 kHz: rejection factor 10 by the High Level Trigger using 0.25 of the data
• Step 2, at 10 kHz: collect the remaining 0.75 of the data
• Throughput reduced by 0.25 + 0.1 x 0.75 = 0.33, i.e. a factor of 3
• At the cost of control complexity and increased latency
• With a link speed of 1 Gbps, a factor of 2 from multi-step event building is needed for a 100 kHz Level-1 rate (assuming a 100% efficient switch)
• If higher-speed links are available in 2003-2004, then a single-step event builder
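The factor of three follows directly from the two steps; a one-line check of the slide's arithmetic (mine):

```python
# Fraction of the single-step bandwidth needed with two-step event building:
# step 1 reads 25% of the data at the full Level-1 rate, step 2 reads the
# remaining 75% only for the 10% of events accepted by the High Level Trigger.
step1, step2 = 0.25, 0.10 * 0.75
print(step1 + step2)   # 0.325 -> roughly 1/3 of the single-step throughput
```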