Event Reconstruction in STS
I. Kisel, GSI
CBM-RF-JINR Meeting, Dubna, May 21, 2009
Many-core HPC
Hardware families: CPU (Intel XX-cores), Gaming (STI Cell), GP CPU (Intel Larrabee?), GP GPU (Nvidia Tesla), CPU/GPU (AMD Fusion?), FPGA (Xilinx Virtex?); OpenCL as a common language?
• On-line event selection
• Mathematical and computational optimization
• Optimization of the detector
• Heterogeneous systems of many cores
• Uniform approach to all CPU/GPU families
• Similar programming languages (CUDA, Ct, OpenCL)
• Parallelization of the algorithm (vectors, multi-threads, many-cores)
Ivan Kisel, GSI
Current and Expected Eras of Intel Processor Architectures
From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.
• Future programming is 3-dimensional: cores, HW threads, SIMD width
• The amount of data is doubling every 18-24 months
• Massive data streams
• The RMS (Recognition, Mining, Synthesis) workload in real time
• Supercomputer-level performance in ordinary servers and PCs
• Applications such as real-time decision-making analysis
Ivan Kisel, GSI
Cores and HW Threads
Figure: CPU architecture in 19XX (1 process per CPU), in 2000 (2 HW threads per process per CPU, interleaving execute and read/write phases), in 2009, and the CPU of your laptop in 2015.
• Cores and HW threads are seen by an operating system as CPUs: > cat /proc/cpuinfo
• At most half of the HW threads are executing at any given time
Ivan Kisel, GSI
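As a concrete illustration of the point above, a minimal C++ sketch (not part of the original slides; it assumes a C++11 compiler) that asks the operating system for the number of logical CPUs, i.e. cores times HW threads:

    #include <iostream>
    #include <thread>

    int main() {
        // Logical CPUs (cores x HW threads) visible to the operating system,
        // i.e. the same entries that /proc/cpuinfo lists.
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "Logical CPUs seen by the OS: " << n << std::endl;
        return 0;
    }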
SIMD Width
Figure: scalar vs. vector (SIMD) registers, with single-precision (S) and double-precision (D) lanes.
• SIMD = Single Instruction Multiple Data
• SIMD uses vector registers
• SIMD exploits data-level parallelism
Faster or slower (relative to scalar double precision)?
  Scalar double precision (64 bits): 1
  Vector (SIMD) double precision (128 bits): 2 or 1/2
  Vector (SIMD) single precision (128 bits): 4 or 1/4
  Intel AVX (2010) vector single precision (256 bits): 8 or 1/8
  Intel LRB (2010) vector single precision (512 bits): 16 or 1/16
Ivan Kisel, GSI
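A minimal sketch of data-level parallelism with 128-bit SSE intrinsics (not from the slides), adding four single-precision values with one vector instruction:

    #include <xmmintrin.h>   // SSE: 128-bit registers, 4 single-precision lanes
    #include <cstdio>

    int main() {
        float a[4] = {1.f, 2.f, 3.f, 4.f};
        float b[4] = {10.f, 20.f, 30.f, 40.f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     // load 4 floats into one vector register
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  // one instruction adds all 4 lanes
        _mm_storeu_ps(c, vc);

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }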
SIMD KF Track Fit on Intel Multicore Systems: Scalability
Figure: fit time (ms/track) for the scalar, double- and single-precision versions and vs. number of threads (2 to 32) on Woodcrest (2 cores), Clovertown (4), Dunnington (6) and 2x Cell (16 SPEs).
• Real-time performance on different Intel CPU platforms: speed-up of 3.7 on the Xeon 5140 (Woodcrest)
• Real-time performance on different CPU architectures: speed-up of 100 with 32 threads
H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering
Ivan Kisel, GSI
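The thread-scalability measurement above can be pictured with a simple skeleton like the following C++ sketch (hypothetical; fitTrack is only a placeholder for the real SIMD Kalman-filter fit), which distributes tracks over a given number of threads and reports the time per track:

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Track { float params[6]; };          // placeholder track structure

    // Placeholder for the SIMD Kalman-filter fit of a single track.
    void fitTrack(Track& t) { t.params[0] += 1.f; }

    // Fit the tracks in [begin, end) inside one thread.
    void fitRange(std::vector<Track>& tracks, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) fitTrack(tracks[i]);
    }

    // Distribute the tracks over nThreads threads; return the time per track in ms.
    double fitAll(std::vector<Track>& tracks, unsigned nThreads) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        std::size_t chunk = tracks.size() / nThreads + 1;
        for (unsigned t = 0; t < nThreads; ++t) {
            std::size_t begin = std::min(tracks.size(), t * chunk);
            std::size_t end   = std::min(tracks.size(), begin + chunk);
            pool.emplace_back(fitRange, std::ref(tracks), begin, end);
        }
        for (std::size_t i = 0; i < pool.size(); ++i) pool[i].join();
        std::chrono::duration<double, std::milli> dt =
            std::chrono::steady_clock::now() - t0;
        return dt.count() / tracks.size();
    }

    int main() {
        std::vector<Track> tracks(100000);
        unsigned counts[] = {1, 2, 4, 8};
        for (unsigned i = 0; i < 4; ++i)
            std::cout << counts[i] << " threads: " << fitAll(tracks, counts[i])
                      << " ms/track" << std::endl;
        return 0;
    }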
Intel Larrabee: 32 Cores
• LRB vs. GPU:
• Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 Series and the Radeon 4000 series, in three major ways. It will:
• use the x86 instruction set with Larrabee-specific extensions;
• feature cache coherency across all its cores;
• include very little specialized graphics hardware.
• LRB vs. CPU:
• The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's 32 x86 cores will be based on the much simpler Pentium design;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
• each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
• LRB includes explicit cache control instructions;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes one fixed-function graphics hardware unit.
L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
Ivan Kisel, GSI
General Purpose Graphics Processing Units (GPGPU) • Substantial evolution of graphics hardware over the past years • Remarkable programmability and flexibility • Reasonably cheap • New branch of research – GPGPU Ivan Kisel, GSI
NVIDIA Hardware
• Streaming multiprocessors
• Zero-overhead thread switching
• FPUs instead of cache/control logic
• Complex memory hierarchy
• SIMT – Single Instruction Multiple Threads
GT200:
• 30 multiprocessors, 30 DP units
• 8 SP FPUs per MP, 240 SP units in total
• 16 000 registers per MP
• 16 kB shared memory per MP
• >= 1 GB main memory
• 1.4 GHz clock
• 933 GFlops SP
S. Kalcher, M. Bach
Ivan Kisel, GSI
SIMD/SIMT Kalman Filter on the CSC-Scout Cluster
Cluster: 18x(2x(Quad-Xeon, 3.0 GHz, 2x6 MB L2), 16 GB) + 27x Tesla S1070 (4x(GT200, 4 GB))
Figure: measured performance, GPU 9100 vs. CPU 1600
M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth
Ivan Kisel, GSI
CPU/GPU Programming Frameworks
• Cg, OpenGL Shading Language, DirectX
  • Designed to write shaders
  • Require the problem to be expressed graphically
• AMD Brook
  • Pure stream computing
  • Not hardware specific
• AMD CAL (Compute Abstraction Layer)
  • Generic use of the hardware at assembler level
• NVIDIA CUDA (Compute Unified Device Architecture)
  • Defines the hardware platform
  • Generic programming
  • Extension to the C language
  • Explicit memory management
  • Programming on the thread level
• Intel Ct (C for throughput)
  • Extension to the C language
  • Intel CPU/GPU specific
  • SIMD exploitation for automatic parallelism
• OpenCL (Open Computing Language)
  • Open standard for generic programming
  • Extension to the C language
  • Supposed to work on any hardware
  • Usage of specific hardware capabilities via extensions
Ivan Kisel, GSI
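To make the OpenCL entry concrete, a minimal host-side C++ sketch (not from the slides; it assumes an installed OpenCL SDK and linking against the OpenCL library, e.g. -lOpenCL) that only enumerates the available platforms and devices:

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platforms[8];
        cl_uint nPlatforms = 0;
        clGetPlatformIDs(8, platforms, &nPlatforms);
        if (nPlatforms > 8) nPlatforms = 8;

        for (cl_uint p = 0; p < nPlatforms; ++p) {
            cl_device_id devices[8];
            cl_uint nDevices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &nDevices);
            if (nDevices > 8) nDevices = 8;

            for (cl_uint d = 0; d < nDevices; ++d) {
                char name[256];
                cl_uint units = 0;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, 0);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(units), &units, 0);
                std::printf("platform %u, device %u: %s (%u compute units)\n",
                            (unsigned)p, (unsigned)d, name, (unsigned)units);
            }
        }
        return 0;
    }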
Cellular Automaton Track Finder
Figure: event displays labelled 10, 200 and 500.
Ivan Kisel, GSI
L1 CA Track Finder: Efficiency
Possible reasons for efficiency losses:
• Fluctuating magnetic field?
• Too large STS acceptance?
• Too large distance between STS stations?
I. Rostovtseva
Ivan Kisel, GSI
L1 CA Track Finder: Changes I. Kulakov Ivan Kisel, GSI
L1 CA Track Finder: Timing
old – old version (from CBMRoot DEC08)
new – new parallelized version
Statistics: 100 central events
Processor: Pentium D, 3.0 GHz, 2 MB
I. Kulakov
Ivan Kisel, GSI
On-line = Off-line Reconstruction?
• Off-line and on-line reconstruction will and should be parallelized
• Both versions will run on similar many-core systems or even on the same PC farm
• Both versions will (probably) use the same parallel language(s), such as OpenCL
• Can we use the same code, with some physics cuts applied only when running on-line, as in L1 CA? (see the sketch below)
• If the final code is fast, can we think about a global on-line event reconstruction and selection?
Ivan Kisel, GSI
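A minimal sketch of the "same code, different cuts" idea (all names and cut values are hypothetical, not actual CBM code):

    #include <cstdio>
    #include <vector>

    // Hypothetical cut configuration; only the values differ between passes.
    struct RecoCuts {
        float minMomentum;   // GeV/c
        int   minHits;       // minimum number of STS hits on a track
    };

    struct Track { float momentum; int nHits; };

    // One reconstruction/selection routine for both off-line and on-line use.
    std::vector<Track> selectTracks(const std::vector<Track>& candidates,
                                    const RecoCuts& cuts) {
        std::vector<Track> selected;
        for (std::size_t i = 0; i < candidates.size(); ++i)
            if (candidates[i].momentum > cuts.minMomentum &&
                candidates[i].nHits   >= cuts.minHits)
                selected.push_back(candidates[i]);
        return selected;
    }

    int main() {
        std::vector<Track> candidates(3);
        candidates[0].momentum = 0.2f; candidates[0].nHits = 5;
        candidates[1].momentum = 1.0f; candidates[1].nHits = 7;
        candidates[2].momentum = 0.8f; candidates[2].nHits = 4;

        RecoCuts offlineCuts = {0.1f, 4};   // loose: keep as much as possible
        RecoCuts onlineCuts  = {0.5f, 6};   // tight: fast on-line event selection

        std::printf("off-line: %zu tracks, on-line: %zu tracks\n",
                    selectTracks(candidates, offlineCuts).size(),
                    selectTracks(candidates, onlineCuts).size());
        return 0;
    }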
Summary
• Think parallel!
• Parallel programming is the key to the full potential of Tera-scale platforms
• Data parallelism vs. parallelism of the algorithm
• Stream processing – no branches
• Avoid directly accessing main memory: no maps, no look-up tables
• Use the SIMD unit in the near future (many-cores, TF/s, …)
• Use single-precision floating point where possible
• In critical parts use double precision if necessary (see the sketch below)
• Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, …)
• New parallel languages are appearing: OpenCL, Ct, CUDA
• A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR !!!
• Should we start buying them for testing the algorithms now?
Ivan Kisel, GSI
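As an illustration of the precision advice above, a small hypothetical sketch that keeps bulk data in single precision and uses double precision only for a critical accumulation:

    #include <cstdio>
    #include <vector>

    // Bulk per-hit arithmetic stays in single precision (4-16 values per SIMD
    // register); only the long accumulation is done in double precision to
    // limit round-off error.
    double accumulateChi2(const std::vector<float>& contributions) {
        double sum = 0.0;                          // double only where it matters
        for (std::size_t i = 0; i < contributions.size(); ++i)
            sum += contributions[i];
        return sum;
    }

    int main() {
        std::vector<float> chi2(1000, 0.5f);       // single-precision inputs
        std::printf("total chi2 = %f\n", accumulateChi2(chi2));
        return 0;
    }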
Back-up Slides (1-5)
Ivan Kisel, GSI
Back-up Slides (1/5)
Ivan Kisel, GSI
Back-up Slides (2/5)
Ivan Kisel, GSI
Back-up Slides (3/5)
Ivan Kisel, GSI
Back-up Slides (4/5)
SIMD is out of consideration (I.K.)
Ivan Kisel, GSI
Back-up Slides (5/5)
Ivan Kisel, GSI
Tracking Workshop
You are cordially invited to the Tracking Workshop, 15-17 June 2009, at GSI.
Ivan Kisel, GSI