Using FPGAs to Generate Gigabit Ethernet Data Transfers & The Network Performance of DAQ Protocols
Dave Bailey, Richard Hughes-Jones, Marc Kelly
The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks”
IEEE Real Time 2007, Fermilab, 29 April – 4 May
R. Hughes-Jones, Manchester
Collecting Data over the Network
• Aim for a general purpose DAQ solution for CALICE
  • CAlorimeter for the LInear Collider Experiment
• Take ECAL as an example
• At the end of the beam spill the planks send all their data to the concentrators
• Concentrators ‘pack’ data & send to one processing node
• Classic bottleneck problem for the switch
[Diagram labels: Detector elements, e.g. Calorimeter Planks; Custom Links; Concentrators; Ethernet Switches; Output link; Bottleneck Queue; Processing Nodes; 1 Burst / Node]
XpressFX Virtex4 Network Test Board
• XpressFX Development Card from PLD Applications
• 8 lane PCI-e card
• Xilinx Virtex4FX60 FPGA
• DDR2 memory
• 2 SFP cages – 1GigE
• 2 HSSDC connectors
Overview of the Firmware Design
• Virtex4FX60 has:
  • 16 RocketIO Multi-Gigabit Transceivers
  • Large internal memory
  • 2 PPC CPUs
• Ethernet Interface
  • Embedded MAC
  • RocketIO
• Packet Buffers & logic
  • Allows routing of input
  • Prioritising of output
• Packet State Machine
  • Packet Generator
  • State Machines
• VHDL model HC11 CPU
  • Control of MAC State Machines (Green Mountain Computer Systems)
• Reserve the PPC for data processing
The State Machine Blocks
• Packet Generator
  • CSRs (set by HC11) for
    • Packet length
    • Packet count
    • Inter-packet delay
    • Destination Address
• Request – Response
  • RX State Machine
    • Decode Request Packet
    • Checksum (RFC 768)
    • Action memory writes
    • Queue other requests
  • FIFO
  • TX State Machine
    • Process Request
    • Construct reply
    • Fragment if needed
    • Checksum
• Packet Analyser
[Diagram label: Packet Analyser State Machine]
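The checksum step cited as RFC 768 can be illustrated in software. The sketch below is a minimal Python version of the 16-bit one's-complement sum that UDP uses; it is not the firmware logic from the talk, it omits the UDP pseudo-header, and the function name `internet_checksum` and the sample payload are assumptions made for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical sample payload, used only to exercise the function.
payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
print(hex(checksum))

# A receiver that sums the payload together with its checksum gets zero
# when no bits were corrupted in transit.
print(internet_checksum(payload + checksum.to_bytes(2, "big")))
```

In hardware the same sum is normally accumulated word by word as the packet streams through, which is why it suits a state machine that sees each byte once.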
The Receive State Machine
[State diagram]
• States: Idle; Read Header; Fill Fifo; Read Cmd; Check Cmd; Do Cmd; Write Mem
• Transition labels: Packet in Queue; Empty Packet; End of packet; Wrong packet type; Correct packet type; Fifo written; Fifo has: Address, cmd; Bad cmd; Good cmd; Is a memory write; Not a memory write; All bytes received; Write finished
The Transmit State Machine
[State diagram]
• States: Idle; Send Header & cmd; Check Cmd; Send Memory; All Sent?; Send Xsum; Update Counter; End Pkt
• Transition labels: cmd in fifo; Header & cmd sent; cmd needs no data; cmd requires data; More data to send; All bytes have been sent; Max packet size or byte count done; Xsum sent; End of packet
The Test Network
• Use for testing Raw Ethernet Frame generation by the FPGA
• Test Data collection with Request-Response protocols
[Diagram labels: Responding nodes; FPGA Concentrator; Cisco 7609 with 1 GE and 10 GE blades; Requesting Node]
Request-Response Latency 1 GE
• Request sent from PC
  • Linux Kernel 2.6.20-web100_pktd-plus
  • Intel e1000 NIC
  • Interrupt Coalescence OFF on PC
  • MTU 1500 bytes
• Response Frames generated by FPGA code
• Latency 19.7 µs, well behaved
• Latency slope 0.018 µs/byte
• B2B expect: 0.0182 µs/byte
  • Mem 0.0004
  • PCI-e 0.0018
  • 1GigE 0.008
  • FPGA 0.008
• Smooth to 35,000 bytes
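The expected back-to-back slope is the sum of the per-byte costs of each stage the data crosses. A minimal sketch of that arithmetic, using the four figures quoted on the slide; the function names are assumptions, and the 1GigE term is cross-checked against the time to serialise one byte at 1 Gbit/s.

```python
# Per-byte cost of each stage in microseconds per byte, as quoted on the slide.
PER_BYTE_US = {"Mem": 0.0004, "PCI-e": 0.0018, "1GigE": 0.008, "FPGA": 0.008}


def expected_slope_us_per_byte(stages=PER_BYTE_US) -> float:
    """Sum the per-byte costs of all stages on the data path."""
    return sum(stages.values())


def wire_slope_us_per_byte(rate_gbit_s: float) -> float:
    """Time to serialise one byte on a link running at rate_gbit_s."""
    bits_per_us = rate_gbit_s * 1000
    return 8 / bits_per_us


print(f"expected slope: {expected_slope_us_per_byte():.4f} us/byte")
print(f"1 Gbit/s wire:  {wire_slope_us_per_byte(1.0):.4f} us/byte")
```

The summed value sits close to the measured slope of 0.018 µs/byte, which is the comparison the slide is drawing.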
FPGA to PC, ethCal_recv: Frame jitter
• 12 µs frame spacing (line speed)
• 25 µs frame spacing
• Peak separation 4-5 µs, no coalescence
[Plot label: Packet loss]
Test the Frame Spacing from the FPGA
• Frames generated by FPGA code
• Interrupt Coalescence OFF on PC
• Frame size 1472 bytes
• 1M packets sent
• Plot mean of observed frame spacing vs requested spacing
  • Appears to have an offset of -1 µs?
  • Slope close to 1, as expected
• Packet loss decreases with packet rate
  • Packets lost in receiving host
  • Larger effect than for UDP/IP packets
  • UDP/IP losses linked to scheduling
The Test Network
• Use for testing Raw Ethernet Frame generation by the FPGA
• Test Data collection with Request-Response protocols
• This time use 10GE hosts
• But does 10GE work on a PC?
[Diagram labels: Responding nodes; FPGA Concentrator; Cisco 7609 with 1 GE and 10 GE blades; Requesting Node]
10 GigE Back2Back: UDP Throughput
• Motherboard: Supermicro X7DBE
• Kernel: 2.6.20-web100_pktd-plus
• NIC: Myricom 10G-PCIE-8A-R Fibre
• rx-usecs=25 Coalescence ON
• MTU 9000 bytes
• Max throughput 9.4 Gbit/s
• Notice rate for 8972 byte packet
• ~0.002% packet loss in 10M packets, in receiving host
• Sending host: 3 CPUs idle
  • For <8 µs packets, 1 CPU is >90% in kernel mode, inc. ~10% soft int
• Receiving host: 3 CPUs idle
  • For <8 µs packets, 1 CPU is 70-80% in kernel mode, inc. ~15% soft int
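The 8972 byte packet size singled out above is the largest UDP payload that fits in one frame at MTU 9000 without IP fragmentation. A minimal sketch of that calculation, assuming IPv4 with no header options; the function name is an assumption.

```python
IPV4_HEADER_BYTES = 20  # assumes IPv4 with no options
UDP_HEADER_BYTES = 8


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in a single unfragmented IP packet."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


for mtu in (1500, 9000):
    print(f"MTU {mtu}: max UDP payload {max_udp_payload(mtu)} bytes")
```

The same subtraction at MTU 1500 gives 1472 bytes, the frame size used in the 1 GE tests.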
Scaling of Request-Response Messages
• Requests from 10GE system
• Interrupt Coalescence OFF on PC
• Frame size 1472 bytes
• 1M packets sent
• Request 10,000 bytes of data
• Host does fragment collection, like the IP layer
• Sequential Requests:
  • Time to receive all responses scales with round trip time
  • As expected from sequential requests
• Grouped Requests:
  • Collection time increases by 24.6 µs per node
  • From network alone expect 1 + 12.3 = 13.3 µs
[Two timing diagrams, each with a Time axis]
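The two collection strategies scale differently with the number of responding nodes. The sketch below is a simple linear model of that behaviour, not the analysis used in the talk: the function names and the `round_trip_us` parameter are hypothetical, and only the 24.6 µs and 1 + 12.3 µs figures come from the slide.

```python
MEASURED_PER_NODE_US = 24.6      # measured increase per node, grouped requests
EXPECTED_PER_NODE_US = 1 + 12.3  # network-only expectation quoted on the slide


def sequential_time_us(n_nodes: int, round_trip_us: float) -> float:
    """Sequential requests: each node costs one full round trip."""
    return n_nodes * round_trip_us


def grouped_extra_time_us(n_nodes: int,
                          per_node_us: float = MEASURED_PER_NODE_US) -> float:
    """Grouped requests: extra collection time relative to a single node."""
    return (n_nodes - 1) * per_node_us


excess = MEASURED_PER_NODE_US - EXPECTED_PER_NODE_US
print(f"per-node time not explained by the network: {excess:.1f} us")
for n in (1, 2, 4):
    print(n, f"{grouped_extra_time_us(n):.1f}")
```

The gap between the measured and network-only increments is the part of the per-node cost that has to be attributed to something other than the wire.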
Sequential Request-Response
• Interrupt Coalescence OFF on PCs
• MTU 1500 bytes
• 10,000 packets sent
• Histograms similar
  • Strong 1st peak
  • Second peak 5 µs later
  • Small group ~25 µs later
• Ethernet occupancy for 1500 bytes:
  • 1Gig 12.3 µs
  • 10Gig 1.2 µs
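The Ethernet occupancy figures can be reproduced from the frame size and link rate. A minimal sketch, assuming the 1500 bytes is the frame payload and that the standard 38 bytes of per-frame overhead (preamble and start delimiter, MAC header, frame check sequence, inter-frame gap) are counted; the function name is an assumption.

```python
# Per-frame overhead on the wire, in bytes (assumed to be included in the
# occupancy figures): preamble + SFD, MAC header, FCS, inter-frame gap.
OVERHEAD_BYTES = 8 + 14 + 4 + 12


def occupancy_us(payload_bytes: int, rate_gbit_s: float) -> float:
    """Time one frame occupies the link, in microseconds."""
    bits_on_wire = (payload_bytes + OVERHEAD_BYTES) * 8
    bits_per_us = rate_gbit_s * 1000
    return bits_on_wire / bits_per_us


for rate in (1.0, 10.0):
    print(f"{rate:g} Gbit/s: {occupancy_us(1500, rate):.1f} us")
```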
Grouped Request-Response
• Interrupt Coalescence OFF on PCs
• MTU 1500 bytes
• 10,000 packets sent
• Histograms multi-modal
  • Second peak ~7 µs later
  • Small group ~25 µs later
Conclusions
• Implemented MAC and PHY layers inside Xilinx Virtex4 FPGA
• Learning curve steep; had to overcome issues with
  • Xilinx “CoreGen” design
  • Clock generation & stability on PCB
• FPGA easily drives 1 Gigabit Ethernet at line rate
• Packet dynamics on the wire as expected
• Loss of Raw Ethernet frames in end host being investigated
• Request-Response style data collection promising
• Developing a simple Network test system
• Planned upgrade to operate at 10 Gbit/s
• Work performed in collaboration with ESLEA UK e-Science & EU EXPReS projects
Any Questions?
10 GigE UDP Throughput vs packet size
• Motherboard: Supermicro X7DBE
• Linux Kernel 2.6.20-web100_pktd-plus
• Myricom NIC 10G-PCIE-8A-R Fibre
• myri10ge v1.2.0 + firmware v1.4.10
  • rx-usecs=0 Coalescence ON
  • MSI=1
  • Checksums ON
  • tx_boundary=4096
• Steps at 4060 and 8160 bytes, within 36 bytes of 2^n boundaries
• Model data transfer time as t = C + m*Bytes
  • C includes the time to set up transfers
  • Fit reasonable: C = 1.67 µs, m = 5.4e-4 µs/byte
  • Steps consistent with C increasing by 0.6 µs
• The Myricom driver segments the transfers, limiting the DMA to 4096 bytes (PCI-e chipset dependent!)
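The linear model with steps can be written out directly. This is a sketch under stated assumptions rather than the fit used in the talk: the slope is read as 5.4e-4 µs/byte, C is assumed to grow by 0.6 µs at each of the two quoted step sizes, and a packet of exactly the step size is assumed to pay the extra cost. The function and constant names are hypothetical.

```python
C_US = 1.67              # fixed cost per transfer, microseconds
M_US_PER_BYTE = 5.4e-4   # assumption: slope read as 5.4e-4 us/byte
STEP_US = 0.6            # increase in C at each step
STEP_SIZES = (4060, 8160)  # packet sizes (bytes) where the steps appear


def transfer_time_us(nbytes: int) -> float:
    """t = C + m * bytes, with C raised by STEP_US for each step crossed."""
    steps_crossed = sum(1 for size in STEP_SIZES if nbytes >= size)
    return C_US + STEP_US * steps_crossed + M_US_PER_BYTE * nbytes


for nbytes in (1000, 4059, 4060, 8972):
    print(nbytes, f"{transfer_time_us(nbytes):.3f} us")
```

Modelling the steps as jumps in C rather than changes in m matches the reading that each extra DMA segment adds a fixed setup cost.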