280 likes | 350 Views
ATCA based Compute Node as Backend DAQ for sBelle DEPFET Pixel Detector. Andreas Kopp, Wolfgang Kühn, Johannes Lang, Jens Sören Lange, Ming Liu, David Münchow, Johannes Roskoss, Qiang Wang (Tiago Perez, Daniel Kirschner) II. Physikalisches Institut, Justus-Liebig-Universität Giessen
E N D
ATCA based Compute Nodeas Backend DAQfor sBelle DEPFET Pixel Detector Andreas Kopp, Wolfgang Kühn, Johannes Lang, Jens Sören Lange, Ming Liu, David Münchow, Johannes Roskoss, Qiang Wang(Tiago Perez, Daniel Kirschner) II. Physikalisches Institut, Justus-Liebig-Universität Giessen Colleagues involved in project,but not (s)Belle members Dapeng Jin, Lu Li, Zhen'An Liu, Yunpeng Lu, Shujun Wei, Hao Xu, Dixin Zhao (IHEP Beijing, Beijing)
Compute Node (CN) Concept • 5 x VIRTEX4 FX-60 FPGAs • each FPGA has 2 x 300 MHz PowerPC • Linux 2.6.27 (open source version), stored in FLASH memory • algorithm programming in VHDL (XILINX ISE 10.1) • ATCA (Advanced Telecommunications Computing Architecture)with full mesh backplane(point-to-point connections on backplane from each CN to each other CN, i.e. no bus arbitration) • optical links (connected to RocketIO at FPGA) • Gigabit Ethernet • ATCA management (IPMI) by add-on card
Compute Node Data Transfer • All 5 FPGAs are connected pairwise (on the board) by • one 32-bit general purpose bus (GPIO) • one full duplex RocketIO link • 4 of 5 FPGAs have two RocketIO links routed to front panelusing Multi-Gigabit Receivers (MGT)! optical links • 1 of 5 FPGAs serves as a switch! has 13 RocketIO links to all the other compute nodes in the same ATCA shelf • All 5 FPGA have a Gigabit Ethernet Link routed to front panelbandwidth tested with a Virtex-4 FX12 test board· 0.3 Gbit/s TCP· 0.4 Gbit/s UDP total integrated bandwidth · 32 Gbit/s (all channels summed, theoretical limit)
ATCA Shelf 1 kW
Size of the DAQ System • Assuming a requirement of 100 Gbit/s (Estimate H. Moser) for whole pixel detectorATTENTION! Estimate was changed on Valencia meeting,see remarks later (p. 25,26) • >> 1 ATCA Shelf with 14 Compute Nodes (+2 spares) • DATA IN:per 1 compute node 8 optical links @ 1.6 Gbit/sx 14 compute nodes per 1 ATCA shelf= 180 Gbit/si.e. factor 1.8 safety marginDATA OUT:5 x GB Ethernet per 1 compute node @ 0.4 GBit/s • 150k Euro investment in the BMBF application • This is identical size to system size for the HADES Upgrade (test beamtime at GSI, parallel to existing DAQ system, planned for end of 2009) • Note: the compute note is DAQ prototype system for PANDA(>2016)Panda bandwidth requirement is ~10-20% higher than ATLAS<3 x 107 interactions/s
Status of the DAQ System • 1 CN (1st generation) tested in Giessen since ~1/2 year • 2nd iteration (schematics, layout, fabrication)of CN in spring-summer 2009note: PCB has 14 layers • 1 full ATCA Shelf (identical size to sBelle DEPFET system) planned until end of 2009 for HADES experiment at GSI(testbeam maybe spring 2010)
Compute Node Boot Sequence • 5 x VIRTEX-4 FX-60 FPGAs • booting by slave sequential configuration chain(not shown in the block schematics) • @ power up, CPLD powers up • bitstream data (1 file, but contains data for all 5 FPGAs) are copied from FLASH #0 to all 5 FPGAs • bitstream contains a small boot loader for each FPGA, which loads Linux 2.6.27 (open source version), stored in local FLASH memory • then Linux is copied to local DDR2 memory(volatile)
Current Status of Test System • The 1st version CN PCB has been tested. • optical links @ 1.6 Gbps to TRB2 (HADES TPC board,CERN HPTPC and ETRAX 1-chip PC), 0 bit error for 150-hour test • Gigabit Ethernet • JTAG chain • CPLD+Flash system start-up mechanism and remote reconfigurability • DDR2 SDRAM • other peripherals
IPMI • Intelligent Platform Management Interface • I2C2-line serial interface (clock, data) • ATCA Power Management~180 W / compute node neededbut only 10 W management power provided @ power-up! request to shelf manager • CN piggy-back add-on card 75x35 mmdesign/layout in GiessenMicrocontroller AVR Atmega 12802 x 60 pin connector to CN • Additional tasks:read temperature,read voltages (0.9/1.2/1.8/3.3/5.0 V, ADC via I2C), allows for remote reset/reboot, hot swap (i.e. communicate to shelf manager „I will be disconnected from the backplane now“) Johannes Lang
IPMI Add-On Board Johannes Lang
IPMI Add-On Board Johannes Lang
Algorithms to run on the CN ? • pixel subevent building • data reduction (?)if rate estimate is correct, we must achieve a data reduction of factor ~20 on the CN. Preliminary idea: • receive CDC data (from COPPER) • receive SVD data (from COPPER) • track finding and track fitting • extrapolation to pixel detector • matching to pixel hits • identify synchrotron radiation hits (i.e. no track match) and discard • data compression (?)
Example Algorithm #1: HADES Event Selector 1. reads HADES raw data from DDR2 memory(HADES binary data format) 2. uses PLB bus for memory access(LocalLink multiport memory controller only supported with newest core from XILINX) 3. copy data from DDR-2 to Block-RAM (on FPGA) 4. then read FIFO / write FIFOfrom/to Block-RAM 5. event buffer 6. small event decoder 7. issue event yes/no decisiondiscard event or write back Shuo Yang
Example Algorithm #1: HADES Event Selector • Event selection rates of 100% & 25% • Different FIFO sizes (DMA sizes) • Processing throughput of <150 & <100 MB/s (expected to be higher if DMA size increased) Shuo Yang
Example Algorithm #2: HADES Track Finder here max. 12 drift chamber wires fired per 1 track Nucleus+Nucleus collisions, large track density expected Ming Liu
Example Algorithm #2: HADES Track Finder • PLB slave interface (PLB IPIF) for system control • LocalLink master interface for data read/write from/to DDR2 memory • algorithm processor (track finding) Ming Liu
Example Algorithm #2: HADES Track Finder Results: • FPGA resource utilization of Virtex-4 FX60 (<1/5) – acceptable! • Timing limitation: 125 MHz without optimization • Clock frequency fixed at 100 MHz, to match the PLB speed • Processing: • C program running on the Xeon 2.4 GHz computer as software reference • different wire multiplicities (10, 30, 50, 200, 400 fired wires out of 2110) • speedup of 10.8 – 24.3 times per module (compared to reference) • tried integration of multiple cores on 1 FPGA for parallel processing (even higher performance speedup of ~2 orders of magnitude) Ming Liu
A remark on a data compression algorithm • In 2003 for the DAQ Workshop in Nara, we tried MPEG-2 compression on SVD1.5 data (test runs taken by Itoh-san) • DST (not MDST)i.e. incl. SVD raw data panther banks • L4 switched offi.e. incl. some background(but L3 on) • Compression factor ~1.83 achieved • MPEG encoding is working on frames (data chunks)might be easily parallelized on FPGA • C source code for MPEG-2 is available free
Additional Algorithm Developmentall incl. FPGA implementation(work ongoing) • Johannes Roskoss – HADES • RICH ring finder • match ring to drift chamber track(straight line, <12 wires per track) • David Münchow – PANDA • Track Finder for PANDA Straw Tube Tracker conformal mapping and Hough transform helix, 30 hits per 1 track, ~10 tracks per event • Andreas Kopp – HADES • drift chamber track incl. momentum kick in dipole field • match to TOF (2-sided read out scintillator paddles) • match to EM Shower detector
Open Questions on Data Reduction • Can the compute nodes get the CDC and SVD data from COPPER?and then run a track finder algorithm? • by GB Ethernet • if yes, what is the latency? data size? rate? protocol? • At the input of the event builder orat the output of the event builder (i.e. input of L3)? • Is it acceptable for DAQ group? etc. etc.
Plan: Increase of input bandwidth • At the Valencia meeting, it was decided to a.) increase # of pixel rows to increase resolution in z directionb.) increase readout time (accordingly) to 20 s> factor 2 higher data volume • Modification of CN required • If we increase # of links from 44 to 88:no problem (in 1 ATCA shelf there are 112 links) • If we keep # of optical links to 44optical links are tested for 1.6 Gbit/swe need <5 Gbit/s • change FPGAVIRTEX-4 FX60-10 $ 904,-VIRTEX-4 FX60-11 $1131,- • change optical link transceiverFTLF8528P2BCK $140,-FTLF8519P2BNL $ 45,-
Plan: Increase of input bandwidth • Price per 1 compute nodeincreases by ~20% (from $ 8100,- to $ 9995,-)> must reduce # of CN from 14 to 11 to keep budget • Note: bandwidth >1.6 Gbit/s per 1 linkwas never tested • Plan:for the next CN iteration (expected May/June 2009)1 FPGA on 1 of the new CN plus transceiverswill be replaced (in any case) • Testing will be done soon afterwards.
BMBF Application, Details Manpower (applied for)1 postdoc2 Ph.D. students Manpower (other funding sources)Wolfgang Kühn 20%Sören Lange 35%1 Ph.D. student (funded by EU FP7) 50% Travel budget2 trips to KEK per year (2 persons)2 trips inside Germany per year (3 persons)in 2011 6 months at KEK for one personin 2012 3 months at KEK for one person Our share for the workshops(electronics and fine mechanics)is 1:1:1 for Panda:Hades:Super-Bellebut electronic workshop is not involved in compute node