LHCb upgrade Workshop, Oxford, 07.12.2010
Xavier Gremaud (EPFL, Switzerland)
TELL40 Data processing
TELL40 Data processing
• Data flow
• Input data format
• Time reordering
• Clusterization
• Output format
• Conclusion
DATA FLOW
• Data from two column processors
• Split the GBT data in 2x40b
• Reconstruct the Super Pixel Packet (SPP) 80b wide
• Linker 0: assemble data from 2 SPP data streams
• Time reordering
• Clusterization + ToT correction (subtraction) (maybe lookup-table-based calibration)
DATA FLOW
• Linker 1: assemble data from 3 GBT, 64b -> 128b
• Linker 2: assemble data from 2x3 GBT, 128b -> 256b
• Linker 3: assemble data from 2x2x3 GBT, 256b -> 512b
• Linker 4: assemble data from 2x2x2x3 GBT, 512b
• MEP assembly (note: an average event is only 2..4 512-bit words long)
• External memory 2x256b
• Ethernet framer 512b
Input data format
• For 1 link: 80b / 25ns = 3.2 Gb/s
• For 24 links: 77 Gb/s
• The 80b wide GBT word is divided into two 40b data streams, which are filled by the column processors (fixed position in the 80b data word).
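The link rates quoted above follow directly from the word width and the 25ns bunch-crossing period. A minimal sketch of the arithmetic (the function names are illustrative and not part of the TELL40 design):

```python
def link_rate_gbps(word_bits=80, period_ns=25.0):
    """Rate of one GBT link; bits per nanosecond is numerically equal to Gb/s."""
    return word_bits / period_ns


def total_rate_gbps(n_links=24, word_bits=80, period_ns=25.0):
    """Aggregate input rate for all links feeding one board."""
    return n_links * word_bits / period_ns


print(link_rate_gbps())   # 3.2
print(total_rate_gbps())  # 76.8, quoted as 77 Gb/s on the slide
```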
Time reordering
• The RAM space is divided into 512 equally sized memory blocks (space reserved for data arriving in random order).
• The RAM location is defined by the LSBs of the BxID (BCNT).
Note: the total memory space required is the maximum time delay allowed × the maximum event size allowed (space has to be reserved for every event!).
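The addressing scheme can be illustrated with a small software model. This is only a sketch under stated assumptions, not the FPGA implementation: the class and method names are hypothetical, the slot depth of 8 words is taken from the sizing discussed for the reorder buffer, and dropping words on overflow is an assumed policy.

```python
N_SLOTS = 512        # 512 equally sized memory blocks (from the slide)
WORDS_PER_SLOT = 8   # assumed maximum event size per slot, in words


class ReorderBuffer:
    """Software model of a time-reordering RAM addressed by the BxID LSBs."""

    def __init__(self, n_slots=N_SLOTS, words_per_slot=WORDS_PER_SLOT):
        # n_slots must be a power of two so the LSB mask below is valid.
        assert n_slots & (n_slots - 1) == 0
        self.mask = n_slots - 1
        self.words_per_slot = words_per_slot
        self.slots = [[] for _ in range(n_slots)]

    def write(self, bxid, word):
        """Store a word in the slot selected by the low bits of bxid."""
        slot = self.slots[bxid & self.mask]
        if len(slot) >= self.words_per_slot:
            return False  # slot full: word is dropped (assumed policy)
        slot.append(word)
        return True

    def read(self, bxid):
        """Return and clear all words collected for this bunch crossing."""
        idx = bxid & self.mask
        words, self.slots[idx] = self.slots[idx], []
        return words


rb = ReorderBuffer()
rb.write(5, "a")
rb.write(3, "b")   # arrives after BxID 5 but belongs to an earlier crossing
rb.write(5, "c")
print(rb.read(3))  # ['b']
print(rb.read(5))  # ['a', 'c']
```

Two bunch crossings whose BxIDs differ by a multiple of 512 map to the same slot, which is why the allowed time delay is bounded by the buffer depth.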
Time reordering
• In the current FPGA, the EP4SGX530 (largest Altera Stratix IV device), «only» 64 memory blocks of 144 Kbit each are available.
• A time reorder buffer 512 events deep with an event size of 8 words occupies 48 memory blocks (maximum size reached!).
Note: no other large memories are required for the other processing steps.
Conclusion:
• Each GBT link is restricted to 8 SPPs (Super Pixel Packets) smaller than 64 bit.
• For the total pixel chip, the maximum number of SPPs is 5x8 = 40 per event.
• Time reordering is possible for up to 512 - 16 = 496 events.
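The figure of 48 memory blocks can be reproduced with a short calculation. The sketch below assumes one reorder buffer per GBT link (24 links), 64-bit words, and each large memory block configured 2K words deep at that width; these configuration details are assumptions for illustration and are not stated on the slide.

```python
import math

BLOCK_DEPTH_AT_64B = 2048  # assumption: one block holds 2K words of 64 bit


def blocks_needed(n_links=24, events=512, words_per_event=8,
                  block_depth=BLOCK_DEPTH_AT_64B):
    """Memory blocks needed when every link gets its own reorder buffer."""
    words_per_link = events * words_per_event
    blocks_per_link = math.ceil(words_per_link / block_depth)
    return n_links * blocks_per_link


print(blocks_needed())                    # 48
print(blocks_needed(words_per_event=16))  # 96, more than the 64 available
```

Under these assumptions, doubling the event size to 16 words would already exceed the 64 blocks on the device.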
Clusterization
• Clusterization requires splitting up the SPP format (for example, two isolated pixels can be in the same SPP)!
• The most obvious approach to clusterization is to use one seeding pixel and search for possible neighbours.
• It is very difficult to form “perfect” clusters: the average time per cluster is limited to 25ns if done in a pipeline, otherwise 25ns for the complete event!
• The 16b seeding hit address is reconstructed from the 12b address, the 4b row header and the 4b hitmap.
• An additional link source ID is required to identify data from the 24 different GBT links (+5 bit).
Clusterization
The principal goal of the clusterization is data reduction; “perfect” clustering as in TELL1 is no longer possible. Additional processing in a CPU is required to finish:
• Forming clusters over boundaries of GBT links
• Combining separated clusters
• Forming clusters for events with too high a pixel count (see illustration next slide)
The cluster shape depends on the seeding hit, which is the first hit. One “normal” cluster can be split into two clusters.
Clusterization pipelined
• To pipeline the cluster search, only one cluster is formed per pipeline step.
• One pipeline step takes 25ns (200-300 MHz processing frequency).
• On average, the hottest region has “only” 2..4 pixels per event and per GBT (10..20 pixels per chip)!
• The cluster search is performed by searching for neighbours of the first hit in the data.
• Each consecutive pipeline stage has the identical function.
• The total number of clusters that can be formed is limited by the number of pipeline stages.
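The seed-and-neighbour search can be modelled in a few lines of software. This is a behavioural sketch only, not the VHDL design: the function names are hypothetical, hits are given as (row, column) pairs in arrival order, and a "neighbour" is assumed to be any hit in the 3x3 area centred on the seed.

```python
def is_neighbour(seed, hit):
    """True if hit lies in the 3x3 area centred on seed (assumed definition)."""
    return abs(seed[0] - hit[0]) <= 1 and abs(seed[1] - hit[1]) <= 1


def cluster_pipeline(hits, n_stages):
    """Form at most one cluster per stage, seeded by the first remaining hit.

    Returns (clusters, leftover); leftover hits did not fit in the pipeline
    and would need further processing elsewhere.
    """
    remaining = list(hits)
    clusters = []
    for _ in range(n_stages):
        if not remaining:
            break
        seed = remaining[0]
        clusters.append([h for h in remaining if is_neighbour(seed, h)])
        remaining = [h for h in remaining if not is_neighbour(seed, h)]
    return clusters, remaining


# Three hits in a row: the result depends on which hit arrives first.
print(cluster_pipeline([(0, 0), (0, 1), (0, 2)], n_stages=4))
# ([[(0, 0), (0, 1)], [(0, 2)]], [])
print(cluster_pipeline([(0, 1), (0, 0), (0, 2)], n_stages=4))
# ([[(0, 1), (0, 0), (0, 2)]], [])
```

The first call shows one physical cluster being split in two because the seed sits at its edge; the second call, with the middle hit arriving first, collects all three hits in a single stage.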
Clusterization data reduction performance
• The cluster size is restricted to a multiple of bytes! (Data processing on the FPGA, and also on the CPU, becomes very difficult otherwise.)
• The expected data reduction from clustering, assuming 50% 1-hit and 50% 2-hit clusters, is of the order of 14%.
Q: Is it worthwhile doing “not perfect” clustering for a 14% data reduction?
Q: Does the CPU benefit from such clusters?
Q: Does anybody know another feasible clustering approach?
Output format
• After the 24 links are linked together, the data are put in an MEP format to reduce the data before the DDR3 SDRAM.
• The BCNT appears only once per event (a small data reduction can be expected) (-12 bit).
Conclusions (I)
The real challenge of the data processing is not to spend more than 25ns per event! Pipelining is required everywhere!
• Time reordering for 512 events reaches the limit of the FPGA internal memory.
• ToT calculation from BCNT and timestamp is no problem. Calibration per pixel is impossible!
• No more real data reduction (zero suppression) like in TELL1.
• Small reduction from removing the BCNT (-12 bit / SPP)
• Small increase from the source ID (+5 bit / cluster)
• Small decrease from clustering (-14%)
• Largest reduction due to not fully loaded GBT links from the pixel chips furthest from the beam.
• Long-term average reduction due to empty bunch crossings.
Conclusions (II)
• Very wide buses require large multiplexers for padding (e.g. byte padding on a 512-bit bus requires a 512x64 multiplexer, i.e. 32K connections).
• Maybe at some stage in the processing the padding has to be reduced to a 32-bit minimal size.
• Is clusterization useful and fast enough? Tests with real data and a distribution of the cluster sizes are needed.
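The multiplexer size scales with the padding granularity. The sketch below assumes that every output bit must be selectable from each aligned source position on the bus, which reproduces the 512x64 figure for byte padding; the function name is illustrative.

```python
def mux_connections(bus_bits=512, granularity_bits=8):
    """Connections needed if each output bit can come from any aligned position."""
    positions = bus_bits // granularity_bits  # possible alignments on the bus
    return bus_bits * positions


print(mux_connections(512, 8))   # 32768 (byte padding, 512x64)
print(mux_connections(512, 32))  # 8192  (32-bit padding, 512x16)
```

Under this assumption, moving from byte to 32-bit padding cuts the number of connections by a factor of four.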
Outlook
• Implementation of the processing, including clustering, in VHDL
• Simulation of the processing with MC data
• Place and route of the design to get a better idea of the possible processing frequency and resource management