Status of the L1 STS Tracking
I. Kisel, GSI / KIP
CBM Collaboration Meeting, GSI, March 12, 2009
L1 CA Track Finder Efficiency
• Fluctuating magnetic field?
• Too large STS acceptance?
• Too large distance between STS stations?
I. Rostovtseva
Many-core HPC
Which platform? CPU Intel: XX-cores | GP CPU Intel: Larrabee | GP GPU Nvidia: Tesla | CPU/GPU AMD: Fusion | Gaming STI: Cell | FPGA Xilinx
• High performance computing (HPC)
• The highest clock rate has been reached
• Performance/power optimization
• Heterogeneous systems of many (>8) cores
• Similar programming languages (OpenCL, Ct and CUDA)
• We need a uniform approach to all CPU/GPU families
• On-line event selection
• Mathematical and computational optimization
• SIMDization of the algorithm (from scalars to vectors) – see the sketch below
• MIMDization (multi-threads, many-cores)
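The "from scalars to vectors" idea can be illustrated with a small operator-overloaded SIMD type, so the same arithmetic expression is written once and mapped to whatever vector width the hardware offers. This is only a minimal sketch: the name fvec, the SSE backend and the extrapolate function are assumptions for the example, not the actual L1/KF classes.

```cpp
// Illustrative sketch of "SIMDization: from scalars to vectors".
// fvec wraps 4 packed single-precision values; the same template code
// then runs either per track (float) or for 4 tracks at once (fvec).
#include <xmmintrin.h>
#include <cstdio>

struct fvec {
    __m128 v;
    fvec() : v(_mm_setzero_ps()) {}
    fvec(float a) : v(_mm_set1_ps(a)) {}
    fvec(__m128 a) : v(a) {}
    friend fvec operator+(fvec a, fvec b) { return fvec(_mm_add_ps(a.v, b.v)); }
    friend fvec operator*(fvec a, fvec b) { return fvec(_mm_mul_ps(a.v, b.v)); }
};

// The same expression works for a scalar and for a SIMD vector type.
template <typename T>
T extrapolate(T x, T tx, T dz) { return x + tx * dz; }

int main() {
    float xs = extrapolate(1.0f, 0.1f, 10.0f);                     // scalar: one track
    fvec  xv = extrapolate(fvec(1.0f), fvec(0.1f), fvec(10.0f));   // SIMD: 4 tracks at once

    alignas(16) float out[4];
    _mm_store_ps(out, xv.v);
    std::printf("scalar: %g   vector lane 0: %g\n", xs, out[0]);
    return 0;
}
```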
Current and Expected Eras of Intel Processor Architectures
From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.
• Future programming is 3-dimensional: cores × threads × SIMD width
• The amount of data is doubling every 18-24 months
• Massive data streams
• The RMS (Recognition, Mining, Synthesis) workload in real time
• Supercomputer-level performance in ordinary servers and PCs
• Applications such as real-time decision-making analysis
Cores and Threads
• CPU architecture in 19XX: 1 process per CPU
• CPU architecture in 2000: 2 threads per process per CPU (a process with Thread1 and Thread2, interleaving execute and read/write phases – see the sketch below)
• CPU architecture in 2009
• CPU of your laptop in 2015
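As a minimal illustration of "2 threads per process", the sketch below starts two threads inside one process; it uses C++11 std::thread purely for illustration (not code from the presentation).

```cpp
// One process, two threads running concurrently on a (hyper-threaded) CPU.
#include <thread>
#include <cstdio>

void work(int id) {
    std::printf("thread %d running\n", id);
}

int main() {
    std::thread t1(work, 1);   // Thread1
    std::thread t2(work, 2);   // Thread2
    t1.join();
    t2.join();
    return 0;
}
```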
SIMD Width
• SIMD = Single Instruction, Multiple Data
• SIMD uses vector registers
• SIMD exploits data-level parallelism (a minimal SSE example follows below)
Faster or slower than scalar?
• Scalar double precision (64 bits): reference
• Vector (SIMD) double precision (128 bits): 2 or 1/2
• Vector (SIMD) single precision (128 bits): 4 or 1/4
• Intel AVX (2010), vector single precision (256 bits): 8 or 1/8
• Intel LRB (2010), vector single precision (512 bits): 16 or 1/16
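A minimal sketch of the 128-bit single-precision case, assuming an SSE-capable x86 CPU: the scalar loop does four additions, the vector version does the same work with one packed instruction.

```cpp
// Scalar vs. SSE single-precision addition: the SIMD version processes
// four floats per instruction using one 128-bit vector register.
#include <xmmintrin.h>   // SSE intrinsics
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1, 2, 3, 4};
    alignas(16) float b[4] = {10, 20, 30, 40};
    alignas(16) float c[4];

    // Scalar: four separate additions.
    for (int i = 0; i < 4; ++i) c[i] = a[i] + b[i];

    // SIMD: one instruction adds all four lanes at once.
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_store_ps(c, vc);

    std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```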
SIMD KF Track Fit on Intel Multicore Systems: Scalability
• Real-time performance on the quad-core Xeon 5345 (Clovertown) at 2.4 GHz: speed-up of 30 with 16 threads (see the threading sketch below)
• Real-time performance on different Intel CPU platforms
• Speed-up of 3.7 on the Xeon 5140 (Woodcrest) at 2.4 GHz using icc 9.1
H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering
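The thread-level (MIMD) part of such a scalability test can be sketched as below: independent tracks are distributed over threads with OpenMP. The Track type and fit_track function are placeholders, not the actual SIMD KF code.

```cpp
// Hypothetical sketch: fitting many independent tracks in parallel with OpenMP.
#include <omp.h>
#include <vector>
#include <cstdio>

struct Track { float params[6]; };            // placeholder track state
void fit_track(Track& t) { /* Kalman filter fit would go here */ }

int main() {
    std::vector<Track> tracks(100000);

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (int)tracks.size(); ++i)
        fit_track(tracks[i]);                 // tracks are independent: ideal for threading
    double t1 = omp_get_wtime();

    std::printf("fitted %zu tracks with up to %d threads in %.3f s\n",
                tracks.size(), omp_get_max_threads(), t1 - t0);
    return 0;
}
```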
Intel Larrabee: 32 Cores
LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
• it will use the x86 instruction set with Larrabee-specific extensions;
• it will feature cache coherency across all its cores;
• it will include very little specialized graphics hardware.
LRB vs. CPU: The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's 32 x86 cores will be based on the much simpler Pentium design;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
• each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
• LRB includes explicit cache control instructions;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes one fixed-function graphics hardware unit.
L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
General Purpose Graphics Processing Units (GPGPU)
• Substantial evolution of graphics hardware over the past years
• Remarkable programmability and flexibility
• Reasonably cheap
• New branch of research: GPGPU
NVIDIA Hardware
• Streaming multiprocessors
• No thread-switching overhead
• FPUs instead of cache/control logic
• Complex memory hierarchy
• SIMT – Single Instruction, Multiple Threads
GT200:
• 30 multiprocessors
• 8 SP FPUs per multiprocessor, 240 SP units in total
• 30 DP units
• 16 000 registers per multiprocessor
• 16 kB shared memory per multiprocessor
• ≥ 1 GB main memory
• 1.4 GHz clock
• 933 GFlops SP
S. Kalcher, M. Bach
SIMD/SIMT Kalman Filter on the CSC-Scout Cluster
Cluster configuration: 18 × (2 × (Quad-Xeon, 3.0 GHz, 2 × 6 MB L2), 16 GB) + 27 × Tesla S1070 (4 × (GT200, 4 GB))
GPU: 9100 vs. CPU: 1600
M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth
CPU/GPU Programming Frameworks
• Cg, OpenGL Shading Language, Direct X
   • Designed to write shaders
   • Require the problem to be expressed graphically
• AMD Brook
   • Pure stream computing
   • Not hardware specific
• AMD CAL (Compute Abstraction Layer)
   • Generic usage of the hardware at assembler level
• NVIDIA CUDA (Compute Unified Device Architecture)
   • Defines the hardware platform
   • Generic programming
   • Extension to the C language
   • Explicit memory management
   • Programming at thread level
• Intel Ct (C for throughput)
   • Extension to the C language
   • Intel CPU/GPU specific
   • SIMD exploitation for automatic parallelism
• OpenCL (Open Computing Language)
   • Open standard for generic programming
   • Extension to the C language
   • Supposed to work on any hardware (a minimal host-side sketch follows below)
   • Usage of specific hardware capabilities by extensions
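To give a feel for the "extension to the C language" style, the sketch below runs a toy saxpy kernel through the OpenCL host API. It assumes an OpenCL 1.x runtime and a default device are available; error handling and resource releases are omitted for brevity, and the kernel is an illustration, not CBM code.

```cpp
// Minimal OpenCL host-side sketch: one work-item per array element.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSource =
    "__kernel void saxpy(float a, __global const float* x, __global float* y) {\n"
    "    int i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main() {
    const size_t n = 1024;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);                        // error handling omitted
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "saxpy", &err);

    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), x.data(), &err);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), y.data(), &err);

    float a = 3.0f;
    clSetKernelArg(kernel, 0, sizeof(float), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dx);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dy);

    size_t global = n;
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, n * sizeof(float), y.data(), 0, nullptr, nullptr);

    std::printf("y[0] = %f (expect 5.0)\n", y[0]);
    return 0;
}
```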
On-line = Off-line Reconstruction?
• Off-line and on-line reconstruction will and should be parallelized
• Both versions will run on similar many-core systems, or even on the same PC farm
• Both versions will (probably) use the same parallel language(s), such as OpenCL
• Can we use the same code, with some additional physics cuts applied when running on-line, like the L1 CA?
• If the final code is fast, can we think about global on-line event reconstruction and selection?
Summary
• Think parallel!
• Parallel programming is the key to the full potential of Tera-scale platforms
• Data parallelism vs. parallelism of the algorithm
• Stream processing – no branches (see the branch-free sketch below)
• Avoid direct access to main memory; no maps, no look-up tables
• Use the SIMD unit; in the nearest future: many cores, TF/s, …
• Use single-precision floating point where possible
• In critical parts use double precision if necessary
• Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, …)
• New parallel languages appear: OpenCL, Ct, CUDA
• A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!
• Should we start buying them for testing?
CPU Intel: XXX-cores | Gaming STI: Cell | GP GPU Nvidia: Tesla | GP CPU Intel: Larrabee | CPU/GPU AMD: Fusion | FPGA Xilinx — OpenCL?
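A hedged illustration of the "stream processing – no branches" point, assuming an SSE-capable x86 CPU: a per-element if/else is replaced by a SIMD comparison mask, so all four lanes follow the same instruction stream.

```cpp
// Branch-free selection with SSE: out[i] = (x[i] > 0) ? x[i] : 0, without an if.
#include <xmmintrin.h>
#include <cstdio>

int main() {
    alignas(16) float x[4] = {-1.0f, 2.0f, -3.0f, 4.0f};
    alignas(16) float out[4];

    __m128 v    = _mm_load_ps(x);
    __m128 zero = _mm_setzero_ps();

    __m128 mask = _mm_cmpgt_ps(v, zero);                 // all-ones where x > 0
    __m128 res  = _mm_or_ps(_mm_and_ps(mask, v),         // keep x where mask is set
                            _mm_andnot_ps(mask, zero));  // else take 0
    _mm_store_ps(out, res);

    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```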