Overview of Upgrade DAQ Niko Neufeld, CERN 1st Upgrade DAQ Mini-workshop, May 27th 2013, CERN
Introduction – the task • Assuming there are 11448 GBT links for the DAQ to be read out (true in a Universe containing a pixel VeLo, a reduced OT and a SciFi) • The total amount of data is thus about 12000 x 3.2 Gbit/s = 38400 Gbit/s = 4.8 TB/s (see the short calculation below) • GBT wide-mode is disregarded for this talk • These data need to be collected and filtered • The proposed system should be reliable, operable by a small crew for 10 years, scalable and as cost-effective as possible
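A quick back-of-the-envelope check of the aggregate bandwidth quoted above; the 12000-link figure is simply the 11448 GBT links rounded up, and Python is used purely for illustration:

```python
# Aggregate DAQ bandwidth for the upgrade, using the numbers from the slide above.
GBT_LINKS = 11448            # links to be read out (pixel VeLo + reduced OT + SciFi)
LINKS_ROUNDED = 12000        # rounded figure used on the slide
GBT_USER_BW_GBIT = 3.2       # usable GBT bandwidth per link in Gbit/s (wide mode ignored)

total_gbit_s = LINKS_ROUNDED * GBT_USER_BW_GBIT    # 38400 Gbit/s
total_tbyte_s = total_gbit_s / 8 / 1000            # 4.8 TB/s

print(f"{total_gbit_s:.0f} Gbit/s = {total_tbyte_s:.1f} TB/s")
```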
Some constraints • Input links are bundled in units of 24 • Not a magic number, but a good match for current state-of-the-art FPGA technology (a simple count of the resulting readout units follows below) • With foreseeable technology the required system (network + filter-farm) does not fit into the existing space, cooling and power available • CERN has agreed to provide a new dedicated modular data-centre – this has the additional advantage that, in principle, installation (where it makes economic sense) can start before LS2 • Also the upgraded LHCb events will be small – MEP will be needed more than ever (the packing factor will increase)
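For orientation, bundling the links in groups of 24 fixes the number of readout units; a trivial count under that assumption:

```python
import math

GBT_LINKS = 11448
LINKS_PER_READOUT_UNIT = 24   # bundling chosen to match current FPGA technology

readout_units = math.ceil(GBT_LINKS / LINKS_PER_READOUT_UNIT)
print(readout_units)          # 477 readout units needed to host all DAQ links
```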
The topic of this talk • The technologies and technology trends which we can use • I will limit myself here to the baseline: a farm of x86 servers interconnected with an industry-standard local area network, i.e. I do not consider the potential effect of micro-servers (Atom, ARM), GPUs or other co-processors • This does not mean that we exclude the use of these, but the study of their implications on the DAQ architecture requires dedicated effort and needs (first) to be warranted by demonstrated potential gains (also we need topics for further workshops) • Which architectures are possible with these technologies, and what their relative merits are
The good things about our current DAQ – what we want to keep • Simple single-stage data-flow • Smallest possible number of hardware components • It is not easy to imagine taking something away and replacing it with software and/or more work in other elements of the system (a kind of “Nash equilibrium”) • Scalability
What we probably need to change in the Upgrade DAQ • Currently we buffer mostly in the network • It looks like buffering in the servers will be significantly cheaper, and large buffers leave more options for the event-building protocols used • Currently the packing factor in MEP is determined by the Ethernet MTU (9000 bytes) • With hardware assists in network ASICs the MTU becomes less relevant; however, reducing the message rate becomes even more important at 10, 20, 30 MHz (see the example below)
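To make the message-rate argument concrete, a small illustrative calculation of the MEP rate as a function of the packing factor (the packing factors are example values, not design choices):

```python
# Message (MEP) rate per source as a function of the MEP packing factor.
EVENT_RATES_MHZ = [10, 20, 30]      # possible upgrade read-out rates mentioned above
PACKING_FACTORS = [1, 10, 100]      # illustrative packing factors, not design values

for rate in EVENT_RATES_MHZ:
    for pf in PACKING_FACTORS:
        mep_rate_mhz = rate / pf    # one MEP carries 'pf' events
        print(f"{rate} MHz events, packing factor {pf:3d} -> {mep_rate_mhz:6.2f} MHz MEP rate")
```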
Technology – Links, PCs and networks
The evolution of PCs - 1 • PCs used to be relatively modest I/O performers (compared to FPGAs); this has radically changed with PCIe Gen3 • The Xeon processor line now has 40 PCIe Gen3 lanes per socket • A dual-socket system has a theoretical throughput of 1.2 Tbit/s(!) (see the arithmetic below) • Tests suggest that we can get quite close to the theoretical limit (using RDMA at least) • This is driven by the need for fast interfaces for co-processors (GPGPUs, Xeon Phi) • The CPU will be the bottleneck in the server – not the LAN interconnect – 10 Gb/s is by far sufficient
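One way to arrive at the quoted ~1.2 Tbit/s is to count the raw PCIe Gen3 lane rate in both directions over both sockets; a rough estimate that ignores encoding and protocol overhead:

```python
# Rough theoretical I/O throughput of a dual-socket Xeon system
# (raw lane rate, both directions, ignoring 128b/130b encoding and protocol overhead).
LANES_PER_SOCKET = 40
SOCKETS = 2
GEN3_LANE_GBIT = 8      # PCIe Gen3 raw signalling rate per lane, Gbit/s
DIRECTIONS = 2          # PCIe lanes are full duplex

total_tbit = LANES_PER_SOCKET * SOCKETS * GEN3_LANE_GBIT * DIRECTIONS / 1000
print(f"{total_tbit:.2f} Tbit/s")   # ~1.28 Tbit/s, i.e. the ~1.2 Tbit/s quoted above
```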
The evolution of PCs - 2 • More integration (Intel roadmap) • Integrate all important parts of the server into the CPU: • memory controller, PCIe controller • NIC (2 – 3 years?) • physical LAN interface (4 – 5 years?) • Aim for high density, because of • short distances (!) • efficient cooling • less real-estate needed (rack-space)
Ethernet • The champion of all classes in networking • 10 Gbit/s, 40 Gbit/s and 100 Gbit/s are out, and 400 Gbit/s is in preparation • As speed goes up, so does power dissipation – high-density switches come with high integration • The market seems to separate more and more into two camps: • carrier-class: deep-buffer, high-density, flexible firmware (FPGA, network processors) (Cisco, Juniper, Brocade, Huawei), $$$/port • data-centre: shallow-buffer, ASIC-based, ultra-high density, focused on layer 2 and simple layer-3 features, very low latency, $/port (these are often also called Top Of the Rack, TOR) • These trends get more pronounced as speed goes up – i.e. from 10 Gb/s to 40 Gb/s
The “two” Ethernets (prices exclude optics) • If possible, fewer core ports and more TOR ports • Buffering then needs to be done elsewhere
InfiniBand • Driven by a relatively small, agile company: Mellanox • Essentially HPC + some DB and storage applications • Only competitor: Intel (ex-QLogic) • Extremely cost-effective in terms of Gbit/s per $ • Open standard, but almost single-vendor – unlikely that a small startup will enter • Software stack (OFED including RDMA) is also supported by Ethernet (NIC) vendors • Many recent Mellanox products (as of FDR) are compatible with Ethernet
InfiniBand prices – worth some R&D at least
The devil’s advocate: why not InfiniBand*? • InfiniBand is (almost) a single-vendor technology • Can the InfiniBand flow control cope with the very bursty traffic typical of a DAQ system, while preserving high bandwidth utilization? • InfiniBand is mostly used in HPC, where hardware is regularly and radically refreshed (unlike IT systems like ours, which are refreshed and grown over time in an evolutionary way). Is there not a big risk that the HPC community jumps onto a new technology and InfiniBand is gone? • The InfiniBand software stack seems awfully complicated compared to the Berkeley socket API. What about tools, documentation and tutorial material to train engineers, students, etc.? • * Enter the name of your favorite non-favorite here!
The evolution of NICs – timeline: PCIe Gen3 available; Chelsio T5 (40 GbE), Mellanox FDR and Mellanox 40 GbE NICs shipping; Intel 40 GbE, 100 GbE NICs, EDR (100 Gb/s) HCAs and PCIe Gen4 expected
The evolution of lane-speed • All modern interconnects (&gt; 10 Gb/s) use multiple serial lanes • Another aspect of “Moore’s” law is the increase in serialiser speed • Higher speed reduces the number of lanes (fibres) • Cheap interconnects also require the availability of cheap optics (VCSEL, silicon photonics) • VCSELs currently run better over MMF (OM3, OM4 for 10 Gb/s and above); per metre these fibres are more expensive than SMF • Current lane-speed: 10 Gb/s (the same serialiser generation as 8 Gb/s and 14 Gb/s) • Next lane-speed (coming soon and already available on high-end FPGAs) is 25 Gb/s (same generation as 16 Gb/s) • should be safely established by 2017 (a hint for GBT v3?) • It is widely believed that the volume of 100 Gb/s deployments and beyond will use at least 25 Gb/s lanes (not 10 Gb/s) – example: InfiniBand EDR (this year!). PCIe may soon be the only volume market for 10 Gb/s VCSELs.
The evolution of switches – BBOT (Big Beasts Out There) • Brocade MLX: 768 x 10 GbE • Juniper QFabric: up to 6144 x 10 GbE (not a single-chassis solution) • Mellanox SX6536: 648 x 56 Gb/s (IB) / 40 GbE ports • Huawei CE12800: 288 x 40 GbE / 1152 x 10 GbE
Inside the beasts • Having many ports does not mean that the interiors are similar • Some are a high-density combination of TOR elements (Mellanox, Juniper QFabric) • Usually priced just so that you cannot do it more cheaply by cabling up the pizza-boxes yourself • Some are classical fat cross-bars (Brocade) • Some are in between (Huawei: Clos, but with lots of buffering) • Upshot: suitable devices exist for both InfiniBand and Ethernet – cost will decide
The eternal copper • Surprisingly (for some), copper cables remain the cheapest option, iff distances are short • A copper interconnect is even planned for InfiniBand EDR (100 Gbit/s) • For 10 Gigabit Ethernet the verdict is not yet in, but I am convinced that we will get 10GBase-T on the main-board “for free” • It is not yet clear whether there will be (a need for) truly high-density 10GBase-T line-cards (100+ ports)
Keep distances short • Multi-lane optics (Ethernet SR4, SR10, InfiniBand QDR) over multi-mode fibres are limited to 100 m (OM3) to 150 m (OM4) • Cable assemblies (“direct-attach” cables) are either • passive (“copper”, “twinax”): very cheap and rather short (max. 4 to 5 m), or • active: still cheaper than discrete optics, but as they use the same components internally they have similar range limitations • For comparison: the price ratio of a 40G QSFP+ copper cable assembly : 40G QSFP+ active cable : 2 x QSFP+ SR4 optics + fibre (30 m) is 1 : 8 : 10 (see the sketch below) • On-chip transceivers (Si-photonics) are expected to be limited to about 70 m (on MMF)
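A minimal sketch of a distance-driven cable choice, using only the reach limits and the 1 : 8 : 10 relative-cost ratio quoted above (the cut-off distances are the slide's numbers, not a procurement rule):

```python
def cheapest_40g_link(distance_m: float) -> tuple[str, int]:
    """Pick the cheapest 40G interconnect for a given distance.

    Relative costs follow the 1 : 8 : 10 ratio from the slide
    (copper assembly : active cable : discrete SR4 optics + fibre).
    """
    if distance_m <= 5:      # passive copper/twinax assemblies: max. 4-5 m
        return "passive copper assembly", 1
    if distance_m <= 100:    # active cables use the optics' internals, so a similar reach
        return "active cable", 8
    if distance_m <= 150:    # SR4 optics reach ~100 m on OM3, ~150 m on OM4
        return "SR4 optics + MMF", 10
    return "single-mode optics needed", -1   # outside the options compared on the slide

for d in (3, 30, 120, 300):
    print(f"{d:3d} m -> {cheapest_40g_link(d)}")
```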
Architecture – Make the best of available technologies
Principles • Minimize the number of expensive “core” network ports • Use the most efficient technology for a given connection • different technologies should be able to co-exist (e.g. fast for event-building, slow for the end-node) • Try to remain open regarding the interconnect technology • keep distances short • Exploit the economy of scale – try to do what everybody does (but smarter)
Complete read-out on/to the surface • Long distance covered by versatile links (GBT) • Maximum of 303 m from the FE on the detector to the AMC40 input • Measurements with prototype FE and AMC40 show that this works well with excellent margins • Can use the cheapest commercial links • Easiest operation (accessibility) • [Diagram: Detector → Readout Units → DAQ network → Compute Units, with 100 m of rock between the detector and the surface]
Split read-out system • In this scenario the event-building is done in UX85A • Part of the HLT can also be run there • Pros: • fewer links to the surface • Cons: • need to refurbish D1 and D2 completely • the remaining links to the surface will be very expensive – adding bandwidth (scaling!) becomes expensive • since the bulk of the cost is in the connectors, break-points and testing of the fibres, and not in the material itself, the cost savings due to the reduced number of links are not as big as might be hoped • [Diagram: Detector → Event-building / HLT0 → DAQ network → Compute Units, with 100 m of rock between the detector and the surface]
Protocols &amp; topologies • Pull, push and barrel-shifting can of course be done on any topology (more in Daniel &amp; Guoming’s talk) • It is also not an ideological discussion between “pushing” and “pulling” – after all, the data always flows from the detector to the farm; the question is which management of the flow is the most efficient (a toy illustration follows below) • It is rather the switch hardware (in particular the amount of buffer space) and the properties of the low-level (layer-2) protocol which will make the difference
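Purely as an illustration of the two flow-management styles (not LHCb's actual protocol), a toy model of one readout unit feeding one event-builder node; the class and parameter names are invented for this example:

```python
from collections import deque

class ToyReadoutUnit:
    """Toy model of one readout unit feeding one event-builder node."""

    def __init__(self, builder_buffer_slots: int):
        self.builder_buffer = deque(maxlen=builder_buffer_slots)  # receiver-side buffer
        self.credits = builder_buffer_slots   # pull mode: free slots granted by the builder
        self.sender_queue = deque()           # pull mode: large, cheap buffer in the server
        self.dropped = 0                      # push mode: MEPs lost when the buffer is full

    def push(self, mep):
        # Push: send immediately; a full receiver buffer means loss (or back-pressure
        # that the network fabric itself has to absorb).
        if len(self.builder_buffer) == self.builder_buffer.maxlen:
            self.dropped += 1
        else:
            self.builder_buffer.append(mep)

    def pull(self, mep):
        # Pull: send only against a credit from the builder; otherwise keep the MEP
        # in the sender's buffer, moving the buffering need from the switch to the PC.
        if self.credits > 0:
            self.credits -= 1
            self.builder_buffer.append(mep)
        else:
            self.sender_queue.append(mep)

    def builder_consumes(self):
        # The event-builder processes one MEP, freeing a slot and returning a credit.
        if self.builder_buffer:
            self.builder_buffer.popleft()
            if self.sender_queue:              # use the freed slot straight away
                self.builder_buffer.append(self.sender_queue.popleft())
            else:
                self.credits += 1
```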
Classical fat-core event-builder – about 1600 core ports
Uniform event-builder network • [Diagram: 100 Gb/s bi-directional event-building links and 10GBase-T links towards the farm nodes – these two are not necessarily the same technology]
Uniform Network • Advantages • Fewer overall network ports (a rough port-count estimate follows below) • Technology independence of the event-building network • Allows a simple push architecture with minimum buffering in the AMC40 • Allows the most light-weight event-building, based on Remote DMA • Challenges • Lots of I/O in the event-building servers • Need to find the most efficient way to couple the AMC40 to a PC: Ethernet, InfiniBand, PCIe • Need careful task organization on the event-building PCs (task priorities, IRQ handling, etc.)
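A rough sizing estimate for the uniform architecture, combining the 4.8 TB/s aggregate from the introduction with the 100 Gb/s bi-directional event-building links of the previous slide; an order-of-magnitude sketch, not a design number:

```python
TOTAL_GBIT_S = 38400       # aggregate DAQ bandwidth from the introduction
EB_LINK_GBIT_S = 100       # bi-directional event-building link per node (previous slide)

eb_nodes = TOTAL_GBIT_S / EB_LINK_GBIT_S
print(f"~{eb_nodes:.0f} event-builder nodes at full link utilisation")
# ~384 nodes; real systems need headroom for protocol overhead and traffic bursts,
# but this stays well below the ~1600 core ports of the classical fat-core layout.
```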
Planning • Under the assumption that we need to be ready to take data in 2019 • If LS2 shifts, so do we • A system for SD tests will be ready whenever needed – minimal investment • 2013: DAQPIPE ready, tests of 40 GigE and IB, PC I/O technology, PCIe performance (see talks by Umberto and Beat) • 2013 – 16: technology tracking (PC and network) • 2014 – 15: large-scale IB and Ethernet tests • 2016: TDR &amp; tender preparations • 2017: acquisition of a minimal system able to read out every GBT (at some speed) • acquisition of the modular data-centre • 2018: acquisition and commissioning of the full system • starting with the network • farm as needed
Conclusions • Technology is on our side • Aim for a simple data-flow (push wherever possible), with as few components as possible • Stay flexible: decide on the core network technology as late as possible • A uniform network currently looks like the best architecture to achieve the above goals