
The CMS Event Builder Demonstrator based on Myrinet

Outline: Introduction; Myrinet Overview; Tests of the Switching Fabric; Event Building Studies; Future Work and Conclusions. Frans Meijers, CERN/EP, on behalf of the CMS DAQ group. CHEP2000, Padova, Italy, Feb 2000.


Presentation Transcript


  1. The CMS Event Builder Demonstrator based on Myrinet • Introduction • Myrinet Overview • Tests of the Switching Fabric • Event Building Studies • Future Work and Conclusions. Frans Meijers, CERN/EP, on behalf of the CMS DAQ group. CHEP2000, Padova, Italy, Feb 2000

  2. Introduction • DAQ architecture and EVB parameters • Event building by switches. Crossbar • EVB traffic shaping: barrel shifter • Banyan network • A multistage 1024 port switch • The CMS DAQ system

  3. DAQ architecture and EVB parameters • Level-1 maximum trigger rate: 100 kHz • High Level Trigger acceptance: 1 - 10 % • Average event size: 1 Mbyte • Number of Readout Units: 512 • Average event fragment size: 2 kbyte • Builder network (512x512 port) aggregate throughput: 1 Tbps [Figure: DAQ architecture with detector front-end, Level 1 trigger, readout systems, event manager, builder networks, run control, builder and filter systems, computing services]
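The parameters on this slide are mutually consistent, which a short calculation can confirm. The sketch below is illustrative only: the constant and function names are made up for this example, and it assumes 1 kbyte = 1024 bytes.

```python
# Figures quoted on the slide; names are illustrative, not from the original.
LEVEL1_RATE_HZ = 100_000          # maximum Level-1 trigger rate
NUM_READOUT_UNITS = 512           # number of Readout Units
FRAGMENT_SIZE_BYTES = 2 * 1024    # average fragment size (assumes 1 kbyte = 1024 bytes)


def event_size_bytes(num_rus=NUM_READOUT_UNITS, fragment=FRAGMENT_SIZE_BYTES):
    """Average event size: one fragment from every Readout Unit."""
    return num_rus * fragment


def aggregate_throughput_bps(rate_hz=LEVEL1_RATE_HZ):
    """Bits per second the builder network must carry at the given event rate."""
    return rate_hz * event_size_bytes() * 8


if __name__ == "__main__":
    print(f"event size: {event_size_bytes() / 2**20:.2f} Mbyte")
    print(f"aggregate throughput: {aggregate_throughput_bps() / 1e12:.2f} Tbps")
```

With these inputs the event size comes out at 1 Mbyte and the aggregate throughput at a little under 1 Tbps, in line with the rounded figures on the slide.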

  4. Event building by switches. Crossbar • NxN matrix, N^2 crosspoints • The maximum switch load for random traffic is about 63% (large N limit) due to head-of-line blocking • Higher efficiency: • queues at input and/or output ports • traffic shaping (example: barrel shifter, 100%)
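The 63% figure matches the large-N limit of a simple contention model. The sketch below assumes an unbuffered NxN crossbar in which every input offers one packet per time slot to a uniformly random output and each output accepts only one of them; this model is an assumption made here for illustration, not necessarily the derivation behind the slide.

```python
import math


def max_crossbar_load(n):
    """Probability that a given output is addressed by at least one of n inputs.

    Assumed model: each input independently picks one of n outputs uniformly,
    and an output can accept a single packet per slot.
    """
    return 1.0 - (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 512):
        print(f"N = {n:3d}: load = {max_crossbar_load(n):.3f}")
    print(f"large-N limit 1 - 1/e = {1.0 - math.exp(-1.0):.3f}")
```

For N = 4 the same expression gives about 68%, and it approaches 1 - 1/e (about 63%) as N grows.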

  5. EVB traffic shaping: barrel shifter • sources emit to mutually exclusive destinations in a cycle • works only for fixed size chunks • needs synchronisation [Figure: barrel shifter steps 1 to 4 for events 1 to 5]
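A barrel shifter schedule can be written down in one line. The sketch below uses the common rotation rule, destination = (source + step) mod N; the function name and the choice of rotation direction are assumptions for illustration.

```python
def barrel_shifter_schedule(n):
    """Return schedule[step][source] = destination for an n x n barrel shifter.

    At every step the sources address mutually exclusive destinations, and
    over n steps each source visits every destination exactly once.
    """
    return [[(source + step) % n for source in range(n)] for step in range(n)]


if __name__ == "__main__":
    for step, destinations in enumerate(barrel_shifter_schedule(4), start=1):
        print(f"step {step}: source -> destination {destinations}")
```

Because every step is a permutation of the destinations, no two sources ever compete for the same output, which is why the pattern can reach 100% load as long as all chunks have the same size.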

  6. Banyan network • single path per connection • suffers from internal blocking • number of crosspoints: N log2 N. Example: 8x8 made of 3 stages of 2x2 (8 = 2^3) • For random traffic (no intermediate IQ and no OQ): • efficiency drops with s, N; for “infinite” N, eff. 20% • There exists a non-blocking barrel-shifting pattern [Figure: 8x8 Banyan network with sources s0 to s7 and destinations d0 to d7]

  7. A multistage 1024 port switch • Banyan topology: NxN out of nxn, N = n^s • basic unit: 8x8 crossbars • 3 stages: 512x512 • need 192 crossbars in total • Important to study multistage switches
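The crossbar count follows directly from the Banyan relation N = n^s: each of the s stages needs N/n crossbars. A minimal sketch (the function name is illustrative):

```python
def banyan_size(n, stages):
    """Return (ports, crossbars) for a Banyan network of n x n crossbars.

    ports = n ** stages, and every stage holds ports / n crossbars.
    """
    ports = n ** stages
    crossbars = stages * (ports // n)
    return ports, crossbars


if __name__ == "__main__":
    for n, s in ((8, 3), (4, 2), (2, 3)):
        ports, crossbars = banyan_size(n, s)
        print(f"{s} stages of {n}x{n}: {ports}x{ports} network, {crossbars} crossbars")
```

For 8x8 crossbars in 3 stages this gives a 512x512 network built from 192 crossbars, as on the slide. The same rule gives 8 crossbars for a 16x16 network made of two stages of 4x4 elements.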

  8. The CMS DAQ system [Figure labels: Detector front-end readout, LV1, RU, EVM, Ctrl, FU, Computing and Communication Services]

  9. Myrinet overview • Myrinet features • Myrinet switches • Network Interface Card

  10. Myrinet features • Myrinet is a System Area Network (SAN) • point-to-point links, byte wide, full-duplex, 1.3 Gbps per direction, very low error rate • packet structure: routing header, payload and tail; each crossbar switch strips the leading byte from the routing header • wormhole routing (versus store-and-forward): no buffering, low latency, arbitrary length packets • byte based flow control (STOP/GO) • no packet loss inside switching fabric • 3Q 2000: link speed from 1.3 Gbps to 2.6 Gbps [Figure: packet layout with routing header, payload and CRC; GO/STOP flow control symbols]

  11. Myrinet switches • M2M-OCT-SW8 • 32 ports • 8 times 4x4 crossbars • Large switch fabric built out of 4x4 crossbar elements • now 8x8 crossbar available as basic element

  12. Network interface card • Developed a custom Myrinet Control Program • controls DMA engines • implements low-level communication protocol [Figure: M2M-PCI64 block diagram. Labels: LANai7 RISC, 2 MByte memory, address/data bus 64 (66 MHz), host DMA, send DMA, recv DMA, PCI bridge 32 or 64 (33 or 66 MHz), packet interface, SAN link 8 (80 MHz, NRZ)]

  13. Switch tests • Set-up for switch test • Traffic conditions tested • Point-to-point 1x1 • Parameters point-to-point 1x1 • Point-to-Point NxN - Mutually exclusive paths • Block on output port • Block on internal switch • Random Traffic

  14. Demonstrator set-up for switch tests • 32 Linux PC nodes • PC: 450 MHz PII BX PCI 33 MHz/32bit • Myrinet switch: M2M-OCT-SW8, NIC: M2M-PCI64[A] • two-stage Banyan network out of 4x4 crossbars [Figure: sources and destinations connected through the switch]

  15. Traffic conditions tested • Point-to-point traffic (fixed destinations) • Random traffic [Figure: source-to-destination connection diagrams for both traffic patterns]

  16. Point-to-point 1x1 • full host - NIC DMA: limited by PCI (33 MHz/32bit) • partial host - NIC DMA: • NIC memory - link: full packet • host - NIC: only headers • limited by SAN link • Partial DMA allows the switch to be loaded to its maximum [Figure: host, PCI, NIC and link]

  17. Parameters point-to-point 1x1 • time per packet = overhead + size / speed • above 1 kbyte: linear behaviour • below 1 kbyte: plateau at 5 µs (NIC-host communication) • Partial host - NIC DMA: speed 141 Mbyte/s -> 92% link eff. • Full host - NIC DMA: speed 128 Mbyte/s -> PCI speed
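The linear model on this slide, time per packet = overhead + size / speed, is easy to evaluate. The sketch below is only an illustration: it treats the 5 µs small-packet plateau as the overhead term and takes 1 Mbyte/s as 1 byte/µs, both of which are assumptions made here.

```python
def time_per_packet_us(size_bytes, overhead_us, speed_mbyte_per_s):
    """Linear model from the slide: overhead + size / speed.

    1 Mbyte/s is taken as 1 byte per microsecond (decimal megabytes).
    """
    return overhead_us + size_bytes / speed_mbyte_per_s


def packet_rate_khz(size_bytes, overhead_us, speed_mbyte_per_s):
    """Packets per millisecond (kHz) implied by the linear model."""
    return 1000.0 / time_per_packet_us(size_bytes, overhead_us, speed_mbyte_per_s)


if __name__ == "__main__":
    for size in (256, 1024, 2048, 4096):
        t = time_per_packet_us(size, overhead_us=5.0, speed_mbyte_per_s=141.0)
        print(f"{size:5d} bytes: {t:6.2f} us per packet")
```

For small packets the fixed overhead dominates, and for large packets the size / speed term takes over, which is the qualitative behaviour the slide describes on either side of 1 kbyte.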

  18. Point-to-point NxN - Mutually exclusive paths [d = 4*(s%4)+s/4, s=0-15] • 1x1, 4x4, 8x8, 16x16 • As expected: aggregate throughput through the switch is linear in N
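The mapping d = 4*(s%4) + s/4 can be checked to be a permutation of the 16 ports. The sketch below reads the division as integer division, and it assumes that sources 4k to 4k+3 share one first-stage 4x4 crossbar and destinations 4k to 4k+3 share one second-stage crossbar; that port grouping is an assumption about the wiring, not something stated on the slide.

```python
def destination(s):
    """Destination for source s under d = 4*(s % 4) + s // 4 (integer division)."""
    return 4 * (s % 4) + s // 4


def second_stage_groups(first_stage_group):
    """Second-stage crossbars reached by the four sources of one first-stage crossbar.

    Assumes ports 4k..4k+3 share a crossbar on both sides of the network.
    """
    sources = range(4 * first_stage_group, 4 * first_stage_group + 4)
    return sorted(destination(s) // 4 for s in sources)


if __name__ == "__main__":
    print([destination(s) for s in range(16)])
    print(second_stage_groups(0))
```

Under that grouping, the four sources on any one first-stage crossbar each go to a different second-stage crossbar, so no internal link is shared.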

  19. Block on output port • Force m (= 1, 2, 3, 4) sources on the same destination • Each source gets 1/m of Vmax • measured at source #0

  20. Block on internal switch • Force 2 sources on different destinations, but through the same intermediate path • As expected: plateau at Vmax/2 • measured at source #0

  21. Random traffic • sources send, independently, to a random destination according to a uniform distribution • 1x1, 4x4, 16x16 • Efficiency: 4x4: 69% (expect 68%); 16x16: 51% • limited by head-of-line blocking • measured at destinations
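The expected 68% for 4x4, and a value near the measured 51% for 16x16, can be reproduced with a textbook approximation for unbuffered multistage networks under uniform random traffic: the load leaving one stage of n x n crossbars is 1 - (1 - p/n)^n when the load entering it is p. Applying that recursion here is an assumption for illustration; the slide does not say how the expectation was obtained.

```python
def stage_output_load(p, n):
    """Load on an output of an n x n crossbar whose inputs each carry load p.

    Assumes uniform random destinations and no buffering inside the fabric.
    """
    return 1.0 - (1.0 - p / n) ** n


def banyan_efficiency(n, stages, input_load=1.0):
    """Apply the single-stage formula once per stage of the Banyan network."""
    p = input_load
    for _ in range(stages):
        p = stage_output_load(p, n)
    return p


if __name__ == "__main__":
    print(f"4x4 single crossbar:   {banyan_efficiency(4, 1):.3f}")
    print(f"16x16 two-stage (4x4): {banyan_efficiency(4, 2):.3f}")
```

The single crossbar gives about 0.68 and the two-stage 16x16 network about 0.53, close to the measured 69% and 51%.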

  22. Event building studies • EVB demonstrator set-up • Event building protocol • Variable size event fragments • Event building performance • Event building: scaling behaviour • Traffic shaping • EVB performance with traffic shaping • performance for variable size event fragments • EVB with traffic shaping: scaling behaviour • Traffic shaping: time evolution

  23. EVB demonstrator set-up • 32+1 Linux PCs [450 MHz PII BX PCI 33 MHz/32bit] • Myrinet switch: M2M-OCT-SW8, NIC: M2M-PCI64[A] • 16x16 two-stage Banyan network out of 4x4 crossbars • Myrinet between RUs and BUs (full duplex). N-to-N traffic • Fast Ethernet between BUs and EVM. N-to-1 traffic • No emulation of Level-1 trigger [Figure labels: PC: emulate RU; EVM; PC: emulate BU]

  24. Event building protocol • Several EvtId messages are grouped in a single Ethernet packet [Figure: message sequence between BU, EVM and RU (level1): Request EvtId, EvtId, Request Data, Send Data (Myrinet), Clear EvtId]

  25. Variable size event fragments • Log-normal distribution • example: Average = 2 kbyte, RMS = 2 kbyte • mimics CMS data readout [Figure labels: Readout Units, EVB, Builder Units]
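A log-normal distribution with a given average and RMS can be generated from the standard moment relations. The sketch below assumes that "RMS" on the slide means the standard deviation of the fragment size and that 2 kbyte = 2048 bytes; the function names and the seed are illustrative.

```python
import math
import random


def lognormal_params(mean, rms):
    """Return (mu, sigma) of the underlying normal for a given mean and std deviation."""
    sigma_sq = math.log(1.0 + (rms / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


def sample_fragment_sizes(count, mean=2048.0, rms=2048.0, seed=0):
    """Draw `count` fragment sizes in bytes from the log-normal distribution."""
    mu, sigma = lognormal_params(mean, rms)
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(count)]


if __name__ == "__main__":
    sizes = sample_fragment_sizes(100_000)
    print(f"sample mean: {sum(sizes) / len(sizes):.0f} bytes")
```

With average equal to RMS, the variance of the underlying normal is ln 2, so the distribution has a long tail of fragments well above the 2 kbyte average.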

  26. Event building performance • No traffic shaping • Fixed size event fragments • results: • 1x1 is close to point-to-point • Performance decrease from 4x4 to 8x8 to 16x16, as expected • from small sizes: overhead 7 µs • 16x16: for 2 kbyte fragments: 30 kHz fragment rate per node † • † Fragment rate per node = level-1 rate [Plot: curves for 1x1, 4x4 (unstable), 8x8 and 16x16, with the 2k fragment size marked]

  27. Event building - scaling behaviour • take average fragment size of 2 kbyte • also variable size fragments • results: • For variable size reduced performance, as expected • No scaling in N • Need simulation for large N?

  28. Traffic shaping • Sources divide fragments into fixed size packets (blocks) and cycle through all destinations • Inspired by ATM rate division (block size is 53 bytes) • Should work for large N multistage switch as well • Implementation: • Performed by NIC control program • Block size set to 4 kbyte (30 µs cycle) • Barrel shifter without external synchronisation (Myrinet back pressure by HW flow control) • Packets can be (partially) empty [Figure: RU0, RU1, RU2, RU3, ... connected to BU0, BU1, BU2, BU3, ...]
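The block mechanism can be illustrated with a small sketch: a fragment is cut into fixed-size blocks, the last of which may be only partially filled, and the time to send one block sets the barrel shifter cycle. How the actual NIC control program packs fragments into blocks is not described on the slide, so the packing rule and all names below are assumptions.

```python
BLOCK_SIZE_BYTES = 4096  # block size quoted on the slide (taking 4 kbyte = 4096 bytes)


def split_into_blocks(fragment_size, block_size=BLOCK_SIZE_BYTES):
    """Payload bytes carried by each block needed for one fragment.

    Hypothetical packing rule: every block except possibly the last is full.
    """
    full_blocks, remainder = divmod(fragment_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def block_time_us(block_size=BLOCK_SIZE_BYTES, speed_mbyte_per_s=141.0):
    """Time to send one block, taking 1 Mbyte/s as 1 byte per microsecond."""
    return block_size / speed_mbyte_per_s


if __name__ == "__main__":
    print(split_into_blocks(2048))
    print(split_into_blocks(10000))
    print(f"block time: {block_time_us():.1f} us")
```

At the measured link speed of about 141 Mbyte/s a 4 kbyte block takes roughly 29 µs, in line with the 30 µs cycle quoted on the slide.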

  29. EVB performance with traffic shaping • fixed size event fragments • results: • close to point-to-point • 16x16: for 2 kbyte fragments: 65 kHz fragment rate per node [Plot: fragment sizes 2k and 4k marked]

  30. Performance for variable size event fragments • decrease of efficiency with larger RMS of fragment size distribution (in agreement with Monte Carlo) • Fragment rate per node for nominal average of 2k and RMS 2k †: 60 kHz • † with full host-NIC DMA about 80 Mbyte/s or 40 kHz

  31. EVB traffic shaping - scaling behaviour • with traffic shaping: approximate scaling

  32. Traffic shaping - time evolution (I) • BS cycling rate * block size • 23:00 ? • throughput dropped • traffic shaping barrel shifter stayed in sync ? 2 hours (= 2 x 10^8 cycles, 10 Tbyte moved)

  33. Traffic shaping - time evolution (II) • BS cycling rate * block size • perturb system: • 1: slow down RU1: all BUs reduced rate • 2: slow down BU1: only BU1 reduced rate • traffic shaping barrel shifter stays in sync • 1 hour (= 10^8 cycles) [Figure labels: RU, EVM, BU]

  34. Future work and conclusions • Future work • Conclusions

  35. Future work • Evaluate Myrinet 2000 • available 3Q 2000 • link speed from 1.3 Gbps to 2.6 Gbps • switches based on 8x8 crossbars as elementary units • Further study of traffic shaping • Simulation • Extrapolate to large systems

  36. Conclusions • Event builder demonstrator 16x16 based on Myrinet multistage switch and Linux PCs established. • Performed systematic switch studies; results as expected. • Measured event building performance • without traffic shaping: no scaling, as expected • with traffic shaping: approximate scaling • For nominal event fragment sizes with average and RMS of 2 kbyte achieved about 60 kHz trigger rate or 120 Mbyte/s per node (almost 2 Gbyte/s aggregate) • That is, today, a factor two off from CMS needs, assuming scaling. • Measurements provide parameters for simulation of large scale (512x512) systems
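The throughput figures in the conclusions follow from the fragment rate and size. A minimal sketch of the arithmetic, using decimal units (1 kbyte = 1000 bytes) and illustrative function names:

```python
def per_node_throughput_mbyte_s(rate_khz, fragment_kbyte):
    """kHz times kbyte gives Mbyte/s when decimal units are used."""
    return rate_khz * fragment_kbyte


def aggregate_throughput_gbyte_s(rate_khz, fragment_kbyte, nodes):
    """Sum over all nodes of the per-node throughput, in Gbyte/s."""
    return per_node_throughput_mbyte_s(rate_khz, fragment_kbyte) * nodes / 1000.0


if __name__ == "__main__":
    print(per_node_throughput_mbyte_s(60, 2), "Mbyte/s per node")
    print(aggregate_throughput_gbyte_s(60, 2, 16), "Gbyte/s aggregate for 16 nodes")
    print(f"ratio of required to achieved rate: {100 / 60:.2f}")
```

This gives 120 Mbyte/s per node and 1.92 Gbyte/s for the 16x16 demonstrator. The ratio of the 100 kHz Level-1 rate to the achieved 60 kHz is about 1.7, which the slide rounds to a factor two.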

  37. Extra Material

  38. Multi-step Event Building • Step 1: at 100 kHz, High Level Trigger rejection factor 10 with 0.25 of the data • Step 2: at 10 kHz, remaining 0.75 of the data • Throughput reduced by 0.25 + 0.1 x 0.75 = 0.33, i.e. factor 3 • At the cost of control complexity and increased latency • With link speed of 1 Gbps need factor 2 from multi-step event building for 100 kHz level-1 rate (assuming 100% efficient switch) • If higher speed links in 2003-2004, then single-step event builder
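The throughput reduction from two-step event building is a one-line calculation: the first step moves a fraction of every event, and the second step moves the rest only for events that survive the rejection. A minimal sketch with an illustrative function name:

```python
def multistep_throughput_fraction(first_step_fraction, rejection_factor):
    """Fraction of the single-step throughput needed for two-step event building."""
    second_step_fraction = 1.0 - first_step_fraction
    return first_step_fraction + second_step_fraction / rejection_factor


if __name__ == "__main__":
    fraction = multistep_throughput_fraction(0.25, 10)
    print(f"fraction: {fraction:.3f}, reduction factor: {1.0 / fraction:.2f}")
```

With 0.25 of the data in the first step and a rejection factor of 10, the fraction is 0.325, which the slide rounds to 0.33, or a reduction of about a factor 3.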
