Communication & Data Flow

Marlon Barbero, Bonn University. FE-I4 Review, CERN, Nov. 3rd - 4th 2009.

Contents
10:00 Communication and data flow (1h00')
• Input clock and command
• Clock multiplier
• Data output
• Readout architecture overview, simulations & calculations

Talk Overview
• Pixel array: 80×336 digital pixels, read out through 40 Double-Columns (DC) with End of Column (EoC) logic. What drove the choice of region architecture? Efficiency of the region? Extra features?
• Data flow in DC: coding, format.
• Data formatting block: how? Why the chosen format?
• Storage in FIFO: how? Why?
• Periphery: asynchronous FIFO (Hamming code), PLL (40 MHz → 160 MHz). Input clock, MUX, clock multiplier, high-speed serializer…
• Command decoder: configuration, reset & L1T. Protocol chosen.
• 8b10b encoder unit: why? Specs / protocol?
[Figure: chip block diagram. Pixel array with pixel config, L1T / token / read lines and 28 b × 40 DC data buses into the EoC blocks; data formatting / compression; trigger FIFO; digital control block; global config and global register bank; config. monitoring; PLL (40 MHz in, 160 MHz out) with clk select; 'LVDS' out at 160 Mb/s; powering; aux.]

A- Inputs
Plan:
• A- Inputs.
• B- Readout Architecture & Data Flow.
• C- Other means of communication -I/O.
A- Inputs:
• A1- Clocks & clock multiplication.
• A2- Command decoder.

A1- Clock Input
• Main clock input: LVDS 40_MHz_clock_in. This clock is:
  • sent to the PLL → higher-frequency clock generation.
  • sent to and used by 'all' blocks: Command Decoder (CMD), End Of Chip Logic (EOCHL), End Of Double-Column Logic (EODCL), Pixel Digital Region (PDR)…
  • Note that the Data Output Block (DOB) uses a higher-frequency clock to stream out data at 160 Mb/s. We'll come back to that.
• Auxiliary clock input: LVDS AUX_clock_in. This clock:
  • goes to the PLL block. Bypassing the PLL and using AUX for data streaming is possible.
  • might be used somewhere else? (stop mode?)
• See also: Andre (clock & timing distribution), Abder, Tomek. Implementation in progress.

CLKGEN: Clock Multiplier (Andre)
Source: I/O choices for ATLAS IBL, ATLAS Pixel System Design Task Force.
• For IBL, data must be transmitted out at a bandwidth of 160 Mb/s.
• 2 options:
  • Send an 80 MHz CLK to the FE and use both edges to transmit:
    • Needs modification of BOC / ROD to produce higher-speed TTC.
    • Needs a synchronization protocol on the FE between the 80 MHz clock & beam crossing.
    • A new DORIC is needed to decode CLK at twice the frequency.
  • Send a 40 MHz CLK to the FE and multiply the clock on the FE:
    • Needs a clock multiplier on chip.
    • Note: synergy with what the strip MCC needs.
• In FE-I4, we have the FE clock multiplier + AUX clock input:
  • Clock multiplier from the 40 MHz input clock.
  • AUX: possibility to send "your choice of clock" to the FE.

CLKGEN specs
• I/O:
  • CLKGEN input: REFCLK 40 MHz input clock → PLL: 640 MHz, divided down to 320 / 160 / 80 / 40 MHz.
  • CLKGEN output: 2 single-ended clocks selected from the internal clocks (not 640 MHz), 40 MHz in or AUX in.
• Why 640 MHz:
  • Good duty cycle for the divided-down clocks (dual-edge serializing initially intended).
  • Higher-frequency VCO → smaller loop filter (LF) capacitor, reduced area.
  • Synergy with other projects.
  • Drawbacks: power, switching noise.
• Area 236×281 μm², Iaverage_nominal ~ 3-4 mA, settling time ~1.2 μs, loss-of-lock detect.
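A minimal sketch of the clock arithmetic quoted above. It assumes the 640 MHz VCO output is divided by successive factors of two (the slide only lists the resulting frequencies), and the helper names are hypothetical.

```python
VCO_MHZ = 640   # PLL output frequency quoted on the slide
REF_MHZ = 40    # REFCLK input frequency quoted on the slide


def divided_clocks(vco_mhz=VCO_MHZ, stages=4):
    """Clocks available after 1..stages divide-by-2 steps (assumed divider chain)."""
    return [vco_mhz // (2 ** n) for n in range(1, stages + 1)]


def serializer_clock_mhz(bit_rate_mbps, dual_edge=False):
    """Clock needed to serialize at bit_rate_mbps, using one or both clock edges."""
    return bit_rate_mbps / 2 if dual_edge else bit_rate_mbps


if __name__ == "__main__":
    print(divided_clocks())                            # [320, 160, 80, 40]
    print(VCO_MHZ // REF_MHZ)                          # multiplication factor: 16
    print(serializer_clock_mhz(160))                   # single edge: 160.0
    print(serializer_clock_mhz(160, dual_edge=True))   # dual edge: 80.0
```

The last two lines show why both a 160 MHz single-edge clock and an 80 MHz dual-edge clock reach the same 160 Mb/s output rate.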

PLL Overview
[Figure: PLL block diagram. Phase Frequency Detector → Charge Pump → Loop Filter → Voltage Controlled Oscillator. IN: 40 MHz; OUT: 640 MHz. The clock is fed back through a Frequency Divider, which makes 40, 80, 160 and 320 MHz available, followed by conversion, enabling and buffering.]

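CLKGEN Overview
[Figure: CLKGEN block diagram. The PLL (40 MHz in, 640 MHz internal) provides 320 / 160 / 80 / 40 MHz clocks, enabled by EN_PLL, EN_320M, EN_160M, EN_80M and EN_40M (enables set from configuration registers). Two MUXes, each selecting among 320 / 160 / 80 / 40 MHz, the 40 MHz reference input and AUX (selection set from configuration registers), drive CLK0_out and CLK1_out; one output is used for data stream-out in the DOB / serializer, the other in the 4-to-1 MUX. The ICP and Ibias currents each come from a DAC controlled by 8-bit registers. Other labels: Ref2Fast, Fb2Fast.]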

CLKGEN
• MUX scheme allows:
  1- serializing data with various clocks.
  2- tests of 4-chip modules for sLHC in star configuration: 3 FEs each send data at 80 Mb/s; the remaining FE has a special role: it accepts the 80 Mb/s streams from the 3 FEs and streams out at 320 Mb/s.

A2- Command Decoder (Roberto)
• There is a dual path for configuration:
  • Test mode: CMOS pins + shift register à la FE-I4_proto1.
  • We focus here on the "standard input": the command decoder.
• Select between Command Decoder & Bypass from bond (InMUX_select=1).
• CmdDec inputs: LVDS command in, LVDS clock in.
• Similar to the FE-I3 command decoder.
• 3 classes of commands: trigger, fast, slow.
• Issuing commands during running? There is no longer an automatic exit from RunMode; exiting is the user's choice (a slow control command is needed to exit).
• If RunMode is off, fast commands and triggers are NOT accepted.
• Maurice gives a test overview tomorrow.

Main features
• Robust vs. SEU:
  • All logic triplicated + majority vote (Address / WrRegData corrected each clock cycle).
  • State machine returns to idle quickly by construction (no need of reset -FE-I3 like-).
  • Error detection provided (XOR of all triplicated outputs); increments a counter, stored in the CmdBitFlip[4:0] configuration register.
  • Trigger 11101 is single-bit-flip safe (bit flip flagged, but trigger issued).
  • Various error counters (e.g. invalid field 1, 2, 3).
• Fully scan-able: 3 ports, TST_SE (Enable), TST_SI (Scan In), TST_Out (Scan Out).
• More details provided tomorrow.
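The triplication + majority vote idea can be sketched in software as follows. This is an illustration only, not the chip's logic: the class name, the exact form of the disagreement flag and the saturating behaviour of the 5-bit counter are assumptions; the slide only states that an XOR of the triplicated outputs increments a counter stored in CmdBitFlip[4:0].

```python
def majority(a, b, c):
    """Bitwise majority vote of three copies of a register."""
    return (a & b) | (a & c) | (b & c)


def disagreement(a, b, c):
    """Non-zero if any bit differs between the three copies (XOR-based flag)."""
    return (a ^ b) | (b ^ c)


class TriplicatedRegister:
    """Hypothetical model of a triplicated register refreshed on every clock."""

    COUNTER_MAX = 0b11111  # 5-bit counter; saturation is an assumption

    def __init__(self, value=0):
        self.copies = [value, value, value]
        self.bit_flip_count = 0

    def upset(self, copy_index, bit):
        """Emulate a single event upset: flip one bit in one copy."""
        self.copies[copy_index] ^= 1 << bit

    def clock(self):
        """Vote, flag any disagreement, and rewrite all copies with the vote."""
        a, b, c = self.copies
        if disagreement(a, b, c):
            self.bit_flip_count = min(self.bit_flip_count + 1, self.COUNTER_MAX)
        voted = majority(a, b, c)
        self.copies = [voted, voted, voted]
        return voted


if __name__ == "__main__":
    reg = TriplicatedRegister(0b101101)
    reg.upset(copy_index=1, bit=3)
    print(bin(reg.clock()), reg.bit_flip_count)
```

A single upset in one copy is outvoted and repaired on the next clock, while the counter records that it happened.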

Commands
• 3 classes: trigger, fast, slow:
  • Trigger: only the LV1 command.
  • Fast: 3 commands.
  • Slow: allows 6 commands (16 possible).
[Figure: command decoding tree. First decision: Trigger OR Fast / Slow? Second decision: which of the 3 Fast commands OR Slow?]

Commands
• Trigger: LV1: in RunMode only. Triggers acquisition of an event. Only Field 1 (11101) needed → OK with the ATLAS requirement (1 trigger per 5 clock cycles).
• Fast: in RunMode only. Field 2 (not 1000):
  • BCR: Bunch Counter set to 0.
  • ECR: Event Counter reset. Clears the data path (all memory pointers, data structures, pending events). Interrupts data transmission if in progress.
  • CAL: calibration pulse sent in response. Delay (bx granularity) up to 64 bx. Width (1-256). Digital & analog hits. (Cal. injection: Abder; digital injection: Tomek.)
• Slow: accepted at all times. No automatic change of RunMode. Field 2 is 1000.

Slow Commands
[Figure: bit fields of a slow command.]
• One field identifies the slow command.
• Chip ID: 3+1 bits; 1xxx = broadcast.
• Address: 6 bits, for WrReg & RdReg.
• Data field: used by all write operations, into the RegisterBank OR the FE configuration (40 DC shift registers, 672 bits).
• Note: a DisableRunMode command will be added.

Notes on slow commands -1-
• List of global registers (8-16b): doc FEI4_Global_Register_vX.
• These are SEU-hard latches:
  • Analog pixel tuning, e.g. PrmpVbpf, DisVbnA, Amp2Vbn… (~20).
  • FE config.: PxStrobes (13 bits / 13 latches), PxSRSetup (write to which DC, S0, S1, … → global communication to/from analog DC).
  • LVDS (bias) / PLL (bias, clk config.) / VCAL (internal calibration, 10 bits + delay setting 6 bits, LSB ~1 ns).
  • DIG mode (DC clock source in stop mode, 8b/10b disable, …).
  • ColMask / ErrorMask / Trigger (latency setting, self trigger, # consecutive…).
  • Empty Record (empty pattern when 8b10b is disabled): see data out protocol.
  • ANAsel 1/2/3: MUX for test analog buffer.
  • CMOSout 1/2: select for InMUX: see InMUX.
• See also: Abder, Michael, Andre, Tomek / J-D.

Notes on slow commands -2-
• Global Register:
  • Also EFuse shadow registers: for redundant SR of Double-Columns, trimming of references (CREF, VREF), chip serial number.
  • Grand total: ~50 Global Registers, each either 8b or 16b wide.

Notes on slow commands -3-
• WrFrontEnd: writes configuration to the FE, 672 bits at a time (1 Double-Column granularity). Which DC is addressed is set by the PxSRSetup register (0-40 possible). 13b / pixel → configuration of the complete FE takes ~9 ms.
• Note on WrFrontEnd: writing the shift register also shifts its previous content out.
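The ~9 ms figure can be cross-checked with a few lines of arithmetic. This sketch assumes one configuration bit is shifted per 40 MHz clock cycle and ignores any command overhead; both are assumptions, not statements from the slide.

```python
COLUMNS, ROWS = 80, 336    # pixel array size from the talk overview
BITS_PER_PIXEL = 13        # configuration bits per pixel
CLOCK_HZ = 40e6            # assumed shift rate: one bit per 40 MHz clock

pixels = COLUMNS * ROWS                  # 26880 pixels
bits_per_double_column = 2 * ROWS        # 672 bits, as quoted for WrFrontEnd
total_bits = pixels * BITS_PER_PIXEL     # 349440 bits for the whole FE
config_time_ms = total_bits / CLOCK_HZ * 1e3

if __name__ == "__main__":
    print(bits_per_double_column)        # 672
    print(total_bits)                    # 349440
    print(round(config_time_ms, 3))      # 8.736
```

Under these assumptions the raw shift time is about 8.7 ms, consistent with the ~9 ms quoted.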

Notes on slow commands -4- • GlobalReset: Reset the whole FE to initial state. • GlobalPulse: Reset command of various length. Selective reset based on length (à la FE-I3). Also used to latch / read data, inject dig hit, ctrl stop mode. • EnDataTake: Sets the FE in RunMode. Can then decode L1T and Fast Commands. No automatic exit.

B- Readout Architecture, Data Flow • B1- 4-pixel digital region. • B2-Data transfer through the Double-Column. • B3- Compression / formatting at End of Column. • B4- Storage in FIFO. • B5- 8b10b coder and protocol out.

B1- 4-pixel digital region (Tomek)
• Choice based on 3 ways of checking the performance of the chosen architecture:
  • C++ description of the chip: flexible framework with time-based description of pixel region / DC / chip / communication protocol → all 1st studies coupling pixels in phi / z / z&phi, various region sizes… Based on physics hits (see backup). Identified the 4-pixel region architecture.
  • Verilog model and test bench: towards implementation. Other sources of inefficiency + power.
  • Analytical model: mathematical crosscheck of inefficiency (not time-based, no protocol).
• → Coherent picture → for 3×LHC full lumi & the 3.7 cm layer, the 4-pixel region tied in phi & z is the winner!
• See also: David Arutinov, Tomek.

4-pixel region specs
• Storage of up to five 4-pixel + neighbor events.
• Small / big hit discrimination, 3 programmable modes (running with no discrimination is of course also possible). 2 BX association for small hits.
• Analog info = 4b ToT.
• Neighbor logic (small hits in adjacent pixels -phi-): 4 bits.
• Records up to 16 consecutive triggers. Programmable latency up to 255 BX.

Digital Pixel: Regional Architecture
[Figure: 4-pixel digital region. Four discriminator inputs (top left, top right, bottom left, bottom right) feed the hit processing (TS / small / big / ToT); 5 ToT memories per pixel; 5 latency counters per region; Read & Trigger, Token, L1T, Read and Neighbor signals; local storage, low traffic on the DC bus.]
• Store hits locally in the region until L1T: 0.25% of pixel hits shipped to EoC → DC bus traffic "low".
• Consequences of regional architecture:
  • Each pixel is tied to its neighbors -time info- (clustered nature of real hits). Small hits are close to large hits! To record small hits, use position instead of time. Handle on TW. Spatial association of digital hits to recover lower analog performance.
  • Lowers digital power consumption (below 10 μW / pixel at IBL occupancy).
  • Physics simulation → efficient architecture.

Performance / Efficiency
[Figure: inefficiency plots at η=0, mean ToT = 4; labels: regional buffer overflow, 0.6% @ IBL rate.]
• IBL: charge sharing in z comparable to phi.
• At the IBL rate, pile-up inefficiency is the dominant source of inefficiency.
• Inefficiency:
  • Pile-up inefficiency (related to pixel cross-section and return-to-baseline behavior of the analog pixel) → ~0.5%.
  • Regional buffer overflow → ~0.05%.
• Inefficiency under control for IBL occupancy.

Digital Power / Drop on Vdd
• Digital power at IBL occupancy: < 10 μW/pixel (4-pixel region).
• Drop on Vdd for 21 regions: < 7 mV.
• Tomek gives a more recent estimate in his PDR talk.

B2- Digital DC / Data transfer (Tomek, Jan-David)
• Made of 168 4-pixel digital regions.
• In the DC, token-based readout (dual token scheme DC / EoC with triple redundancy + majority voting).
• A group of 21 4-pixel digital regions is the base structure for clock / buffering:
  • Skew-compensated clock routing: ~0.8 ns skew for all pixels of the array?
  • Buffering of read / L1T.
• Data transferred to the FIFO asap. All controlled by the EOCHL.
• Address transfer with a minimal number of gates for yield enhancement (thermal encoder scheme). Data + address are Hamming coded, then decoded and corrected before the data compression block.

B3- Formatting at EoC (Tomek)
• Reducing bandwidth is an issue, both at IBL & sLHC.
• Data rates estimated with the same physics-based tools as previously described (MC data from Vadim Kostyukhin), extrapolated to various radii, with various possibilities for reducing the rates.
• Studied clustering possibilities, proximity algorithms, formatting. See backup formatting section.
• Formatting also chosen to fit FIFO / 8b10b coding needs.

e.g.: 10×LHC (50 ns bx) / sLHC
[Figure: hit-rate map in the r-z plane (r [mm] from 0 to about 210, z [mm] from 0 to 600, lines of constant η from 0 to 3.5) for FE-I4 with 50 μm × 250 μm pixels (FE-I4 simul., FE-I4 Nigel, FE-I4 sdtf 220908). Rates given in pixel hits·bx⁻¹·cm⁻²; quoted layer means: 3.9, 8.4, 13.4, 34 and 60.]

Pixel occupancy → Data bandwidth
• Pixel hit rate → FE output bandwidth:
  • # bits / pixel transmitted?
    • address 7+9 bits, analog info 4+2 bits → 22b?
  • data output protocol?
• Reduce data output by taking into account the clustered nature of physics hits / geometry.
[Figure: histograms of the number of pixels, panels labelled 3×LHC and 10×LHC, for FE-I4 central modules at the 3.7 cm and 21 cm layers.]

Formatting considered
• Clustering: z-clustering. Logic in the EoC could calculate the z-cluster size (above a certain # of pixels adjacent in z), discard the analog info (for long clusters in z the info is not useful) and ship out pixel ID + size of cluster → at η~2.0, 0.6 BW? BUT very dependent on FE location, and throws away analog info. NO.
• Proximity algorithm: send out relative pixel addresses: 7+9b → 1+3b address (8 "next pixels" coded this way). 0.8 BW? BUT variable data format, error prone.
• Fixed-format clustered data transfer.
[Figure: histogram of distances and numbering scheme.]

Fixed format clustered data (preliminary)
• Compression factor, all at 3×LHC, η=0, for the 3.7 cm layer (21 cm layer in parentheses). Each line reads: rate × bits per word = relative bandwidth in A.U.
  • indiv. pixels: 4.09 (0.25) × (7+9+4+2) = 1.00 (1.00) A.U.
  • static 1×2: 3.45 (0.18) × (7+8+2×4+2) = 0.96 (0.83) A.U.
  • dynamic 1×2: 3.02 (0.15) × (7+9+2×4+2) = 0.87 (0.74) A.U.
  • static 1×4: 2.86 (0.17) × (6+8+4×4+4) = 1.08 (1.08) A.U.
  • dyn. in-DC 1×4: 2.43 (0.15) × (6+9+4×4+4) = 0.95 (0.95) A.U.
  • dynamic 1×4: 2.13 (0.14) × (7+9+4×4+4) = 0.85 (0.94) A.U.
• Choice: dynamic phi-pairing (dynamic 1×2), merging neighbours and small hits in the process. Compression OK, simple to do and a good format: 24 bits (nice for FIFO and 8b10b). Note that Hamming decoding is needed before the formatter.
[Figure: word layout with column, row, NL and ToT fields (row ×336, DC ×40); unit label: 10⁶ count·FE⁻¹·s⁻¹.]
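The relative bandwidths in the list above follow from rate × word length, normalized to the individual-pixel scheme. The sketch below reproduces the 3.7 cm column only; the 21 cm rates are quoted with too few digits to reproduce the second column exactly. The dictionary keys and function name are illustrative.

```python
# (rate at 3.7 cm, bits per word) for each scheme, values taken from the slide
SCHEMES = {
    "indiv. pixels":   (4.09, 7 + 9 + 4 + 2),
    "static 1x2":      (3.45, 7 + 8 + 2 * 4 + 2),
    "dynamic 1x2":     (3.02, 7 + 9 + 2 * 4 + 2),
    "static 1x4":      (2.86, 6 + 8 + 4 * 4 + 4),
    "dyn. in-DC 1x4":  (2.43, 6 + 9 + 4 * 4 + 4),
    "dynamic 1x4":     (2.13, 7 + 9 + 4 * 4 + 4),
}


def relative_bandwidth(schemes, reference="indiv. pixels"):
    """Return rate * bits for each scheme, normalized to the reference scheme."""
    ref_rate, ref_bits = schemes[reference]
    ref = ref_rate * ref_bits
    return {name: round(rate * bits / ref, 2)
            for name, (rate, bits) in schemes.items()}


if __name__ == "__main__":
    for name, value in relative_bandwidth(SCHEMES).items():
        print(f"{name:15s} {value:.2f} A.U.")
```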

B4- Data storage FIFO (Jan-David)
• In the FIFO, record words are stored as 3×8b words.
• At the beginning of an event, the EOCHL stores a 24-b Header in the FIFO.
• Then data words are stored: address (16-b) + ToTs (8-b) for 2 pixels.
• Also stored in the FIFO:
  • Read back from configuration.
  • Service messages.
• More in the summary data format below.

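From DC to FIFO
[Figure: data path from the columns to the FIFO. Data from the columns (DC 6b, region address 8b, data 20b) pass a Hamming decoder and the event builder; a data switch selects between event data, header, read-back and service words; Word 0, Word 1 and Word 2 each go through a Hamming encoder (8 bits → 12 bits) into a FIFO of 8 places × 3 × 12 bits (36 bits). Control signals: Busy, Read, Write, Full, Read out Control.]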
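The 8 bits → 12 bits step is the size of a single-error-correcting Hamming code on one byte. The sketch below uses the textbook Hamming(12,8) layout with parity bits at positions 1, 2, 4 and 8; the bit ordering actually used on chip is not given in the slides, so this layout and the function names are assumptions.

```python
PARITY_POS = (1, 2, 4, 8)                                    # 1-based positions
DATA_POS = tuple(p for p in range(1, 13) if p not in PARITY_POS)


def hamming_encode(byte):
    """Encode 8 data bits into a 12-bit codeword (textbook layout)."""
    bits = {pos: (byte >> i) & 1 for i, pos in enumerate(DATA_POS)}
    for p in PARITY_POS:
        parity = 0
        for pos in DATA_POS:
            if pos & p:               # parity bit p covers positions with bit p set
                parity ^= bits[pos]
        bits[p] = parity
    return sum(bits[pos] << (pos - 1) for pos in range(1, 13))


def hamming_decode(code):
    """Return (byte, error_seen); a single flipped bit is corrected."""
    syndrome = 0
    for pos in range(1, 13):
        if (code >> (pos - 1)) & 1:
            syndrome ^= pos           # XOR of set positions is 0 for a valid word
    if 1 <= syndrome <= 12:
        code ^= 1 << (syndrome - 1)   # syndrome points at the flipped position
    byte = 0
    for i, pos in enumerate(DATA_POS):
        byte |= ((code >> (pos - 1)) & 1) << i
    return byte, syndrome != 0


if __name__ == "__main__":
    word = hamming_encode(0xA7)
    corrupted = word ^ (1 << 5)       # flip one stored bit
    print(hamming_decode(corrupted))  # (167, True)
```

Three such codewords give the 3 × 12 = 36 bits per FIFO place shown in the figure.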

B5- 8b10b encoder and protocol • Normal mode is 8b10b coded. • Test mode is 8b10b off. Good thing for testing the link (requirement of off-detector group).

8b10b
Source: I/O choices for ATLAS IBL, ATLAS Pixel System Design Task Force.
• For IBL, data must be transmitted out at a bandwidth of 160 Mb/s.
• At BOC/ROD:
  • Data rate is 4 times the clock rate.
  • Phase adjustment.
  • Use a Clock Data Recovery (CDR) mechanism.
• CDR requires an output data stream with good engineering properties.
• 8b10b:
  • adequate for this purpose, enough transitions for reliable CDR.
  • widely used → easy to implement.
  • provides some level of error detection.
  • provides commas for frame identification & synchronization.

Control symbols & Commas • Control symbols: a set of 12 extra valid 10-bit sequences. • Can be used as command. • K28.1, K28.5, K28.7: commas. 11111 or 00000 can not be found anywhere else in data stream

K.28.7 & bit flip • Single bit flip in K.28.7 cannot transform the stream into another meaningful stream (only into K.28.1 & K.28.5). • K.28.1, K.28.5, K.28.7 can be used for re-synchronization of the 10-bit streams in case of loss of synchronization (only streams having a running disparity of +/-5). • In sync. state, flip in regular data can not generate K.28.7. • Note: Restriction. Not 2 K.28.7 in a row  use K.28.5/1 for fillers.
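A small check of the comma properties described on the last two slides, using the standard 8b10b code groups for K.28.1, K.28.5 and K.28.7 (written in transmission order, negative running disparity; the positive-disparity form is the bitwise complement). The helper names are illustrative.

```python
# Standard 8b10b code groups for negative running disparity
K28_RD_MINUS = {
    "K.28.1": "0011111001",
    "K.28.5": "0011111010",
    "K.28.7": "0011111000",
}


def complement(code):
    """Code group for the opposite running disparity."""
    return code.translate(str.maketrans("01", "10"))


def has_five_run(code):
    """True if the code group contains 11111 or 00000."""
    return "11111" in code or "00000" in code


def hamming_distance(a, b):
    """Number of bit positions in which two code groups differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    for name, code in K28_RD_MINUS.items():
        print(name, code, complement(code), has_five_run(code))
    sof = K28_RD_MINUS["K.28.7"]
    for name in ("K.28.1", "K.28.5"):
        print("K.28.7 ->", name, hamming_distance(sof, K28_RD_MINUS[name]))
```

All three symbols contain the five-bit run in both disparities, and K.28.1 and K.28.5 each sit one bit flip away from K.28.7, which matches the statement that a single flip in K.28.7 can only lead to one of the other two commas.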

Frame
• Frame built up (meaningful frames + empty records) such that it:
  • detects single bit flips in the synchronized state.
  • is tolerant to loss of sync (data slipping) → re-sync on the next commas.
• State machine in the receiving part:
  • can do CDR.
  • can search for the 11111 or 00000 stream and re-synchronize to the commas.
• State machine in the receiving part needs to:
  • check for violation of the 8b10b protocol.

Format implemented -1
• Three 8b10b commas used:
  • SOF: K.28.7.
  • EOF: K.28.5.
  • Idle state: K.28.1.
• Record words (all that follows is shown before 8b10b encoding, for clarity):
  • 24 bits long.
  • all start with 11101 (except: data record DR & empty record ER).
• Data Header (DH): | 11101 | 001 | xxxx | [3:0]trigID | [7:0]bcID |
  • Header for transmission of regular data.
  • 001: a 1-b flip gives an invalid code.
  • xxxx: for later uses.
  • trigID is the trigger ID as received by the ROD; bcID is the bunch crossing ID, needed for internal consecutive triggers (up to 16 triggers depending on RunMode) and stop mode (up to 255 triggers).

Format implemented -2
• Data Record (DR): | [6:0]Column | [8:0]Row | [3:0]ToTtop | [3:0]ToTbot |
  • Column numbering goes from 0000001 to 1010000.
• Address Record (AR): | 11101 | 010 | Type | [14:0]Address |
  • Address of a global register, or the position of the shift register.
  • 010: flags the Address Record; a 1-b flip gives an invalid code.
  • Type: 1 bit of information. 0 = Global Register; 1 = Shift Register.
  • [14:0]Address: if Type is 0, the address gives the Global Register ID. If Type is 1, the address gives the Shift Register position.
  • Note that the transmission of an Address Record always requires the transmission of an associated Value Record.
• Value Record (VR): | 11101 | 100 | [15:0]Value |
  • Value of a global register, or value contained in the shift register.
  • Note that using 11101 followed by 100 for a Value Record also allows sending Value Records with no Address Record before them.

Format implemented -3
• Service Record (SR): | 11101 | 111 | [15:0]Message |
  • A service message (e.g. error message).
  • 111: flags the Service Record; a 1-b flip gives an invalid code.
  • [15:0]Message: service message.
  • Note that an SR can belong to a data stream → only a single SR is then allowed.
• Empty Record (ER): | 3×ER[xxxx.xxxx] |
  • When 8b10b coding is turned off, to fit the 24-bit record word requirement and to ease recognition of the end of a data stream, Empty Records are simply made of as many 24-bit programmable words as needed. These can be 0-frames, but also 11001100 for example (that is the 40 MHz clk sent back).
  • When 8b10b is on and no data / SR / config. read back is pending, ERs are transmitted, made of as many K.28.1 commas as needed.
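The record layouts of the three "Format implemented" slides can be sketched as 24-bit packers. Field widths are taken from the slides; the MSB-first packing, the function names and the zero default for the reserved xxxx field are assumptions made for illustration.

```python
PREFIX = 0b11101                      # common 5-bit start of DH / AR / VR / SR
CODES = {"DH": 0b001, "AR": 0b010, "VR": 0b100, "SR": 0b111}


def _tagged(code, payload16):
    """| 11101 | 3-bit code | 16-bit payload | packed MSB first (assumed order)."""
    return (PREFIX << 19) | (code << 16) | (payload16 & 0xFFFF)


def data_header(trig_id, bc_id, reserved=0):
    payload = ((reserved & 0xF) << 12) | ((trig_id & 0xF) << 8) | (bc_id & 0xFF)
    return _tagged(CODES["DH"], payload)


def data_record(column, row, tot_top, tot_bot):
    """| 7b column | 9b row | 4b ToT top | 4b ToT bottom |, no 11101 prefix."""
    return (((column & 0x7F) << 17) | ((row & 0x1FF) << 8)
            | ((tot_top & 0xF) << 4) | (tot_bot & 0xF))


def address_record(is_shift_register, address):
    return _tagged(CODES["AR"], (int(is_shift_register) << 15) | (address & 0x7FFF))


def value_record(value):
    return _tagged(CODES["VR"], value)


def service_record(message):
    return _tagged(CODES["SR"], message)


def min_code_distance(codes):
    """Smallest Hamming distance between any two 3-bit record codes."""
    values = list(codes.values())
    return min(bin(a ^ b).count("1")
               for i, a in enumerate(values) for b in values[i + 1:])


if __name__ == "__main__":
    print(hex(data_header(trig_id=3, bc_id=0x5A)))       # 0xe9035a
    print(data_record(1, 1, 4, 2).to_bytes(3, "big"))    # three 8b words
    print(min_code_distance(CODES))                      # 2
```

Two properties fall out of the numbers on the slides: the four record codes differ pairwise in at least two bits, so a single bit flip never turns one valid code into another, and since columns run from 1 to 80 the top five bits of a Data Record never equal 11101.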

Summary data format
e.g.: SOF | DH | DR | DR | DR | SR | EOF | Idle | Idle | AR | VR | AR | VR | Idle…

Output rates (sLHC)
• Requirements for SLHC pixel electrical system (system design task force). Now dynamic phi-pairing, 24 bits / 2 pixels.
• I/O recommendations for IBL, Dec. 08 (system design task force): the simulations for an IBL at 3.7 cm radius and a luminosity of ×3 LHC indicate that a data rate of at least 86 Mbps per FE is needed.
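A rough cross-check of the output budget, as a sketch only. It assumes the 160 Mb/s figure is the line rate after 8b10b encoding and ignores headers, SOF/EOF symbols and idle words; whether the 86 Mbps requirement already includes coding overhead is not stated in the slides.

```python
LINE_RATE_MBPS = 160      # output link rate quoted in the talk
REQUIRED_MBPS = 86        # IBL requirement quoted above
RECORD_BITS = 24          # one data record
PIXELS_PER_RECORD = 2     # dynamic phi-pairing: 24 bits / 2 pixels


def payload_rate_mbps(line_rate_mbps, data_bits=8, coded_bits=10):
    """Useful bit rate left after 8b10b encoding (8 data bits per 10 line bits)."""
    return line_rate_mbps * data_bits / coded_bits


def max_pixel_hits_per_second(line_rate_mbps):
    """Upper bound on pixel hits per second, ignoring all framing overhead."""
    payload_bps = payload_rate_mbps(line_rate_mbps) * 1e6
    return payload_bps / RECORD_BITS * PIXELS_PER_RECORD


if __name__ == "__main__":
    payload = payload_rate_mbps(LINE_RATE_MBPS)
    print(payload)                                      # 128.0
    print(round(payload / REQUIRED_MBPS, 2))            # margin over 86 Mbps
    print(max_pixel_hits_per_second(LINE_RATE_MBPS))    # hits per second bound
```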

C- Other means of communication -I/O • C1- Stop mode. • C2- External control of DC. • C3- No 8b10b. • C4- InMUX.

C1- Stop mode
• Stop standard data acquisition and read all hits from the chip.
• How:
  • clock gating.
  • L1T and clock controlled externally by the user (or logic).
• e.g. procedure:
  • Set latency to the proper value (max?).
  • StopMode on. Clock control gated.
  • Send one trigger, send one clock, read all FE.
• Implementation details are still being worked on.
• This will be an important test mode: test the PDR!

C2- External control of DC
• A feature implemented in the test submission of the digital region (3D Tezzaron-Chartered).
• Control of the DC externally with a few signals:
  • L1T, read, peripheral trigger_counter value (and clock) to be provided from outside.
  • Off-chip, sense the token (sent out) to know if data is available.
• Might turn out to be an interesting test feature too.
• Needed? Still debatable. Implementation not yet done.

C3- No 8b10b
• Link test (needed?): to get a clock-like signal sent back from the FE → turn off 8b10b and send empty frames with an appropriate Empty Record value to mimic a clock.
• Convenient for checking the output data more directly.
• Speed

C4- InMUX • Multipurpose slow control access: • 4 configurable CMOS inputs and 4 outputs (their function depends on the 3 configuration pins InMUX_select). • InMUX 1: direct control of Global registers and pixel SR. • InMUX 2: manual control of EODCL. • InMUX 3: control of end of chip logic. • InMUX 4: scan chain for CMD. • InMUX 5: scan chain for EOCHL.

THE END • MORE INFO IN BACK-UP SLIDES (organized by topic) IF NEEDED.

BACKUP • BACKUP CLOCK
