ALICE Week 17.11.99 Technical Board TPC Intelligent Readout Architecture Volker Lindenstruth Universität Heidelberg
What's new? • TPC occupancy is much higher than originally assumed • New trigger detector TRD • TPC selective readout becomes relevant for the first time • New readout/L3 architecture • No intermediate buses and buffer memories - use PCI and local memory instead • New dead-time and throttling architecture
TRD/TPC Overall Timeline (figure: time in µs after the event, 0-5 µs). Pretrigger at the event; TRD data sampling and linear fit during the TEC drift; trigger at the TPC (gate opens) at the end of the TEC drift; track segments shipped off the detector; track processing and matching; TRD trigger delivered at L1.
TPC L3 trigger and processing (figure). The global trigger issues L0, L1 and L2, fed by the TRD trigger and the other trigger detectors. The TPC front-end/trigger ships zero-suppressed TPC data, sector-parallel, at ~2 kHz over 144 links (~60 MB/event). The TRD ships e+/e- track candidates, which seed the selection of regions of interest and a conical zero-suppressed readout. The TPC intelligent readout performs on-line data reduction (tracking, reconstruction, partial readout, data compression); tracking of the e+/e- candidates inside the TPC verifies the e+/e- hypothesis, yielding e+/e- tracks plus RoIs and track segments and space points, or rejecting the event; the result goes to DAQ.
Architecture from TP (figure). Detector front-end electronics (TPC, ITS, PID, PHOS, TRIG) are read out over the DDL into FEDCs and LDCs. Event rates: 10^4 Hz Pb-Pb, 10^5 Hz p-p. L0 trigger; L1 trigger (1.5-2 µs, BUSY back to the front-end) at 10^3 Hz Pb-Pb, 10^4 Hz p-p. Data into the event-building switch: 2500 MB/s Pb+Pb, 20 MB/s p+p. L2 trigger via the EDM (10-100 µs): 50 Hz central + 1 kHz dimuon Pb-Pb, 550 Hz p-p. The GDCs write 1250 MB/s Pb+Pb, 20 MB/s p+p through a switch to permanent data storage (PDS).
Some Technology Trends
Capacity vs. speed (latency):
• Logic: capacity 2x in 3 years, speed 2x in 3 years
• DRAM: capacity 4x in 3 years, speed 2x in 15 years
• Disk: capacity 4x in 3 years, speed 2x in 10 years
DRAM development:
Year   Size     Cycle time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
Capacity grew 1000:1, cycle time improved only 2:1!
Processor-DRAM Memory Gap (figure: performance vs. time, 1980-2000; Dave Patterson, UC Berkeley). "Moore's Law": microprocessor performance grows 60%/yr (2x/1.5 yr), DRAM performance only 6%/yr (2x/15 yrs); the processor-memory performance gap grows about 50% per year.
Testing the uniformity of memory

#include <stdio.h>
#include <time.h>

#define SIZE_MIN    1024             /* assumed smallest array size (elements) */
#define SIZE_MAX    (4*1024*1024)    /* assumed largest array size (elements)  */
#define ITERATIONS  1                /* initial repeat count, doubled below    */

static int array[SIZE_MAX];          /* the array we stride through            */

/* CPU time in seconds */
static double get_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

int main(void)
{
    long size, stride, limit, index, i, iter, iterations;
    double sec, sec0;

    // Vary the size of the array, to determine the size of the cache or the
    // amount of memory covered by TLB entries.
    for (size = SIZE_MIN; size <= SIZE_MAX; size *= 2) {
        // Vary the stride at which we access elements,
        // to determine the line size and the associativity.
        for (stride = 1; stride <= size; stride *= 2) {
            // Do the following test multiple times so that the granularity of
            // the timer is better and the start-up effects are reduced.
            sec = 0;
            iter = 0;
            limit = size - stride + 1;
            iterations = ITERATIONS;
            do {
                sec0 = get_seconds();
                for (i = iterations; i; i--)
                    // The main loop: does a read and a write
                    // at various memory locations.
                    for (index = 0; index < limit; index += stride)
                        *(array + index) += 1;
                sec += get_seconds() - sec0;
                iter += iterations;
                iterations *= 2;
            } while (sec < 1);
            printf("size %8ld  stride %8ld  %.2f ns/access\n",
                   size, stride,
                   sec * 1e9 / ((double)iter * (double)(size / stride)));
        }
    }
    return 0;
}

(Figure: the array is accessed at address intervals of one stride over a region of length size.)
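A possible way to run this sketch (the file name and compiler flags are just examples; optimization should stay off so the access loop is not removed):
    gcc -O0 -o memtest memtest.c && ./memtest
Steps in the measured ns/access as size crosses a cache or TLB boundary indicate the cache size; the stride at which the latency saturates indicates the line size, as the next slide illustrates.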
360 MHz Pentium MMX (measured memory-latency plot; annotated plateaus at 4096 bytes and 32 bytes; access latencies 2.7 ns, 95 ns, 190 ns). L1 instruction cache: 16 kB; L1 data cache: 16 kB (4-way associative, 16-byte line); L2 cache: 512 kB (unified); MMU: 32 I / 64 D TLB entries (4-way associative).
360 MHz Pentium MMX (figure): the same measurement repeated with the L2 cache disabled and with all caches disabled.
Comparison of two supercomputers
HP V-Class (PA-8x00):
• L1 instruction cache: 512 kB
• L1 data cache: 1024 kB (4-way associative, 16-byte line)
• MMU: 160-entry fully associative TLB
SUN E10k (UltraSparc II):
• L1 instruction cache: 16 kB
• L1 data cache: 16 kB (write-through, non-allocating, direct-mapped, 32-byte line)
• L2 cache: 512 kB (unified)
• MMU: 2x64-entry fully associative TLB
LogP (figure: P processors, each with memory M and a NIC, connected by an interconnection network; per-message costs o, g and L)
• L: latency - time a packet travels through the network from sender to receiver
• o: overhead - CPU time to send or receive a message
• g: gap - shortest time between two consecutive sends or receives at one processor
• P: number of processors
• NIC: network interface card
The number of messages in flight, and hence the aggregate throughput, is limited by L/g (see the sketch below).
Culler et al., "LogP: Towards a Realistic Model of Parallel Computation", PPoPP, May 1993.
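As an illustration (not from the slide), a minimal sketch of the LogP cost model in C: the time until a receiver has processed k short back-to-back messages, assuming the sender can inject one message every max(o, g). The parameter values are arbitrary assumptions.

#include <stdio.h>

/* LogP parameters; times in microseconds. */
typedef struct {
    double L;   /* network latency                    */
    double o;   /* per-message send/receive overhead  */
    double g;   /* minimum gap between messages       */
    int    P;   /* number of processors               */
} logp_t;

/* First message is injected at t = 0; the i-th at (i-1)*max(o, g).
 * The last message still needs o (send) + L (wire) + o (receive). */
static double logp_k_messages(logp_t m, int k)
{
    double gap = (m.g > m.o) ? m.g : m.o;
    return (k - 1) * gap + m.o + m.L + m.o;
}

int main(void)
{
    logp_t m = { .L = 5.0, .o = 2.0, .g = 3.0, .P = 2 };  /* assumed values */
    printf("1 message   : %.1f us\n", logp_k_messages(m, 1));
    printf("100 messages: %.1f us\n", logp_k_messages(m, 100));
    /* At most about L/g messages can be in flight between two nodes,
     * which bounds the aggregate throughput. */
    return 0;
}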
2-Node Ethernet Cluster (figure: measured throughput for Gigabit Ethernet, Gigabit Ethernet with carrier extension, and Fast Ethernet, 100 Mb/s; source: Intel)
Test setup:
• SUN Gigabit Ethernet PCI card, IP 2.0
• 2 SUN Ultra 450 servers, 1 CPU each
• The sender produces a TCP data stream with large data buffers; the receiver simply throws the data away (a sketch of such a test follows below)
• Processor utilization: sender 40%, receiver 60%!
• Throughput approx. 160 Mbit/s!
• Net throughput increases if the receiver is a dual-processor machine
Why is the TCP/IP Gigabit Ethernet performance so much worse than what is theoretically possible?
Note: CMS implemented their own proprietary network API for Gigabit Ethernet and Myrinet.
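A minimal sketch of such a two-node TCP throughput test (illustrative only; the port number, buffer size, stream length and combined sender/receiver layout are assumptions, not the actual setup behind the measurement above):

/* tcp_stream.c - run "./tcp_stream recv" on one node,
 * then "./tcp_stream send <host>" on the other. */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PORT     5001               /* assumed test port                   */
#define BUFSIZE  (1024 * 1024)      /* large data buffers, as on the slide */
#define NBUFS    1000               /* stream ~1 GB, then stop             */
static char buf[BUFSIZE];

static double now(void)             /* wall-clock time in seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    if (argc >= 3 && strcmp(argv[1], "send") == 0) {
        /* Sender: connect and push large buffers as fast as possible. */
        struct hostent *h = gethostbyname(argv[2]);
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_port = htons(PORT) };
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (!h || s < 0) { perror("setup"); return 1; }
        memcpy(&a.sin_addr, h->h_addr_list[0], h->h_length);
        if (connect(s, (struct sockaddr *)&a, sizeof a) < 0) {
            perror("connect"); return 1;
        }
        for (int i = 0; i < NBUFS; i++)   /* partial writes ignored for brevity */
            if (write(s, buf, sizeof buf) < 0) { perror("write"); return 1; }
        close(s);
    } else if (argc >= 2 && strcmp(argv[1], "recv") == 0) {
        /* Receiver: accept one connection, throw the data away, report rate. */
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_addr.s_addr = INADDR_ANY,
                                 .sin_port = htons(PORT) };
        int s = socket(AF_INET, SOCK_STREAM, 0), c;
        if (s < 0 || bind(s, (struct sockaddr *)&a, sizeof a) < 0 ||
            listen(s, 1) < 0 || (c = accept(s, NULL, NULL)) < 0) {
            perror("setup"); return 1;
        }
        double t0 = now(), total = 0;
        ssize_t n;
        while ((n = read(c, buf, sizeof buf)) > 0)
            total += n;                   /* discard the data */
        printf("%.0f MB in %.1f s = %.0f Mbit/s\n",
               total / 1e6, now() - t0, total * 8 / (now() - t0) / 1e6);
    } else {
        fprintf(stderr, "usage: %s send <host> | recv\n", argv[0]);
        return 1;
    }
    return 0;
}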
First Conclusions - Outlook
• Memory bandwidth is the limiting and determining factor. Moving data requires significant memory bandwidth.
• The number of TPC data links dropped from 528 to 180
• Aggregate data rate per link ~34 MB/s @ 100 Hz (see the estimate below)
• The TPC has the highest processing requirements - the majority of the TPC computation can be done on a per-sector basis.
• Keep the number of CPUs that process one sector in parallel to a minimum. Today this number is 5 due to the TPC granularity.
• Try to get sector data directly into one processor
• Selective readout of TPC sectors can reduce the data rate requirement by a factor of at least 2-5
• The overall complexity of the L3 processor can be reduced by using PCI-based receiver modules delivering the data straight into the host memory, thus eliminating the need for VME crates combining the data from multiple TPC links.
• DATE already uses a GSM paradigm as memory pool - no software changes
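A back-of-the-envelope check of the per-link figure, using the ~60 MB/event number from the trigger slide (a rough estimate, not an official calculation):
    60 MB/event x 100 Hz / 180 links ≈ 33 MB/s per link
which is consistent with the ~34 MB/s quoted above.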
PCI Receiver Card Architecture (figure): data-push readout. The optical receiver feeds an FPGA with a multi-event buffer; event pointers are written into a FIFO; the data are pushed over PCI 66/64 through the PCI host bridge into host memory.
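A minimal host-side sketch of such a pointer-FIFO data-push scheme (purely illustrative; the descriptor layout, names and polling loop below are assumptions, not the actual RORC interface):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One entry of the pointer FIFO: the receiver card fills it after pushing a
 * complete (sub)event into the multi-event buffer in host memory. */
typedef struct {
    uint64_t offset;     /* byte offset of the event in the host buffer */
    uint32_t length;     /* event length in bytes                       */
    uint32_t flags;      /* e.g. end-of-event or error bits (assumed)   */
} event_desc_t;

#define FIFO_DEPTH 256                    /* assumed FIFO depth */

typedef struct {
    event_desc_t      fifo[FIFO_DEPTH];   /* written by the card via DMA */
    volatile uint32_t write_idx;          /* advanced by the card        */
    uint32_t          read_idx;           /* advanced by the host        */
    uint8_t          *event_buffer;       /* the multi-event buffer      */
} rorc_readout_t;

/* Poll the pointer FIFO; return the next complete event in host memory,
 * or NULL if nothing new has arrived. (A real driver would also need
 * memory barriers and a way to return free buffer space to the card.) */
static const uint8_t *rorc_next_event(rorc_readout_t *r, uint32_t *len)
{
    if (r->read_idx == r->write_idx)
        return NULL;                      /* FIFO empty */
    event_desc_t *d = &r->fifo[r->read_idx % FIFO_DEPTH];
    *len = d->length;
    r->read_idx++;                        /* free the descriptor slot */
    return r->event_buffer + d->offset;
}

int main(void)
{
    /* Stand-alone demo: simulate the card pushing one 128-byte event. */
    static uint8_t buffer[4096];
    static rorc_readout_t r;
    r.event_buffer = buffer;
    memset(buffer, 0xAB, 128);
    r.fifo[0] = (event_desc_t){ .offset = 0, .length = 128, .flags = 0 };
    r.write_idx = 1;

    uint32_t len;
    const uint8_t *ev = rorc_next_event(&r, &len);
    printf("got event of %u bytes at %p\n", len, (const void *)ev);
    return 0;
}

In this scheme the LDC/L3 process simply polls the FIFO and hands the returned pointer to hit finding or lossless compression; no intermediate bus or buffer memory is involved.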
PCI Readout of one TPC sector
• Each TPC sector is read out by four optical links, which are fed by a small derandomizing buffer in the TPC front-end.
• The optical PCI receiver modules mount directly in a commercial off-the-shelf (COTS) receiver computer in the counting house.
• The COTS receiver processor performs any necessary hit-level functionality on the data in case of L3 processing.
• The receiver processor can also perform lossless compression and simply forward the data to DAQ, implementing the TP baseline functionality.
• The receiver processor is much less expensive than any crate-based solution.
Overall TPC Intelligent Readout Architecture (figure: the FEE of the 36 TPC sectors feeds RORC receiver cards on the PCI buses of LDC/L3CPU and LDC/FEDC nodes (PCI, NIC, MEM, CPU); the sector clusters connect through their NICs and the L3 matrix to GDC/L3CPU nodes and on to a switch and permanent data storage (PDS) in the computer centre; the trigger detectors - micro channel plate, zero-degree calorimeter, particle identification, inner tracking, muon trigger chambers, muon identification, tracking chambers, transition radiation detector, photon spectrometer - deliver the L0, L1 and L2 trigger decisions via the EDM; detector busy and trigger data travel over the DDL)
• Each TPC sector forms an independent sector cluster
• The sector clusters merge through a cluster interconnect/network into a global processing cluster.
• The aggregate throughput of this network can be scaled up beyond 5 GB/s at any point in time, allowing a fallback to simple lossless binary readout
• All nodes in the cluster are generic COTS processors, which are acquired at the latest possible time
• All processing elements can be replaced and upgraded at any point in time
• The network is commercial
• The resulting multiprocessor cluster is generic and can be used as an off-line farm
Dead Time / Flow Control (figure: TPC FEE buffer (8 black events) -> optical link -> TPC receiver buffer (> 100 events) on the receiver board with PCI and NIC; high-water mark sends XOFF, low-water mark sends XON; event-receipt daisy chain)
Scenario I (see the sketch below)
• TPC dead time is determined centrally
• For every TPC trigger a counter is incremented
• For every completely received event the last receiver module produces a message (single-bit pulse), which is forwarded through all nodes after they have also received the event
• The event-receipt pulse decrements the counter
• When the counter reaches 7, TPC dead time is asserted (there could be another event already in the queue)
Scenario II
• TPC dead time is determined centrally based on rates, assuming worst-case event sizes
• Overflow protection for FEE buffers: assert TPC BUSY if 7 events arrive within 50 ms (assuming 120 MB/event, 1 Gbit/s links)
• Overflow protection for receiver buffers: ~100 events in 1 second - OR a high-water mark in any receiver buffer (preferred way)
No need for reverse flow control on the optical link
No need for dead-time signalling at the TPC front-end
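A minimal sketch of the Scenario I bookkeeping (illustrative only; the function names and polling structure are assumptions, the constants come from the slide):

#include <stdbool.h>

#define FEE_BUFFER_EVENTS 8   /* the TPC front-end buffers 8 black events  */
#define BUSY_THRESHOLD    7   /* assert dead time one event early, since   */
                              /* another event may already be in the queue */

static int events_in_flight = 0;   /* triggered but not yet fully received */

/* Called for every TPC trigger. */
static void on_tpc_trigger(void)
{
    events_in_flight++;
}

/* Called when the event-receipt pulse arrives from the last receiver
 * module in the daisy chain, i.e. all nodes have received the event. */
static void on_event_receipt(void)
{
    if (events_in_flight > 0)
        events_in_flight--;
}

/* TPC dead time (BUSY) as presented to the central trigger. */
static bool tpc_busy(void)
{
    return events_in_flight >= BUSY_THRESHOLD;
}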
Summary
• Memory bandwidth is a very important factor in designing high-performance multiprocessor systems; it needs to be studied in detail
• Do not move data if not required - moving data costs money (except for some granularity effects)
• Overall complexity can be reduced by using PCI-based receiver modules delivering the data straight into the host memory, thus eliminating the need for VME
• General-purpose COTS processors are less expensive than any crate solution
• An FPGA-based PCI receiver card prototype has been built; the NT driver is completed, the Linux driver almost completed
• The DDL is already planned as a PCI version
• No reverse flow control required for the DDL
• The DDL URD is to be revised by the collaboration ASAP
• No dead time or throttling needs to be implemented at the front-end
• Two scenarios for how to implement it for the TPC at the back-end without additional cost