Event Reconstruction in STS
I. Kisel, GSI
CBM-RF-JINR Meeting, Dubna, May 21, 2009
Many-core HPC
Hardware families: CPU (Intel XX-cores), Gaming (STI Cell), GP CPU (Intel Larrabee?), GP GPU (Nvidia Tesla), CPU/GPU (AMD Fusion?), FPGA (Xilinx Virtex?); OpenCL as a common language?
• On-line event selection
• Mathematical and computational optimization
• Optimization of the detector
• Heterogeneous systems of many cores
• Uniform approach to all CPU/GPU families
• Similar programming languages (CUDA, Ct, OpenCL)
• Parallelization of the algorithm (vectors, multi-threads, many-cores)
Ivan Kisel, GSI
Current and Expected Eras of Intel Processor Architectures
From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.
• Future programming is 3-dimensional: cores, HW threads, SIMD width
• The amount of data is doubling every 18-24 months
• Massive data streams
• The RMS (Recognition, Mining, Synthesis) workload in real time
• Supercomputer-level performance in ordinary servers and PCs
• Applications such as real-time decision-making analysis
Ivan Kisel, GSI
Cores and HW Threads
Figure: CPU architecture in 19XX (1 process per CPU), in 2000 (2 HW threads per process per CPU, interleaving execute and read/write phases), in 2009, and the CPU of your laptop in 2015.
• Cores and HW threads are seen by an operating system as CPUs: > cat /proc/cpuinfo
• At most half of the HW threads are executing at any given time
Ivan Kisel, GSI
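As a concrete illustration of the point above, a minimal C++ sketch (not part of the original slides; it assumes a C++11 compiler) that asks the operating system for the number of logical CPUs, i.e. cores times HW threads:

    #include <iostream>
    #include <thread>

    int main() {
        // Logical CPUs (cores x HW threads) visible to the operating system,
        // i.e. the same entries that /proc/cpuinfo lists.
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "Logical CPUs seen by the OS: " << n << std::endl;
        return 0;
    }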
SIMD Width
Figure: scalar vs. vector (SIMD) registers, with single-precision (S) and double-precision (D) lanes.
• SIMD = Single Instruction Multiple Data
• SIMD uses vector registers
• SIMD exploits data-level parallelism
Faster or slower (relative to scalar double precision)?
  Scalar double precision (64 bits): 1
  Vector (SIMD) double precision (128 bits): 2 or 1/2
  Vector (SIMD) single precision (128 bits): 4 or 1/4
  Intel AVX (2010) vector single precision (256 bits): 8 or 1/8
  Intel LRB (2010) vector single precision (512 bits): 16 or 1/16
Ivan Kisel, GSI
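A minimal sketch of data-level parallelism with 128-bit SSE intrinsics (not from the slides), adding four single-precision values with one vector instruction:

    #include <xmmintrin.h>   // SSE: 128-bit registers, 4 single-precision lanes
    #include <cstdio>

    int main() {
        float a[4] = {1.f, 2.f, 3.f, 4.f};
        float b[4] = {10.f, 20.f, 30.f, 40.f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     // load 4 floats into one vector register
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  // one instruction adds all 4 lanes
        _mm_storeu_ps(c, vc);

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }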
SIMD KF Track Fit on Intel Multicore Systems: Scalability
Figure: fit time (ms/track) for the scalar, double- and single-precision versions and vs. number of threads (2 to 32) on Woodcrest (2 cores), Clovertown (4), Dunnington (6) and 2x Cell (16 SPEs).
• Real-time performance on different Intel CPU platforms: speed-up of 3.7 on the Xeon 5140 (Woodcrest)
• Real-time performance on different CPU architectures: speed-up of 100 with 32 threads
H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering
Ivan Kisel, GSI
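The thread-scalability measurement above can be pictured with a simple skeleton like the following C++ sketch (hypothetical; fitTrack is only a placeholder for the real SIMD Kalman-filter fit), which distributes tracks over a given number of threads and reports the time per track:

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Track { float params[6]; };          // placeholder track structure

    // Placeholder for the SIMD Kalman-filter fit of a single track.
    void fitTrack(Track& t) { t.params[0] += 1.f; }

    // Fit the tracks in [begin, end) inside one thread.
    void fitRange(std::vector<Track>& tracks, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) fitTrack(tracks[i]);
    }

    // Distribute the tracks over nThreads threads; return the time per track in ms.
    double fitAll(std::vector<Track>& tracks, unsigned nThreads) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        std::size_t chunk = tracks.size() / nThreads + 1;
        for (unsigned t = 0; t < nThreads; ++t) {
            std::size_t begin = std::min(tracks.size(), t * chunk);
            std::size_t end   = std::min(tracks.size(), begin + chunk);
            pool.emplace_back(fitRange, std::ref(tracks), begin, end);
        }
        for (std::size_t i = 0; i < pool.size(); ++i) pool[i].join();
        std::chrono::duration<double, std::milli> dt =
            std::chrono::steady_clock::now() - t0;
        return dt.count() / tracks.size();
    }

    int main() {
        std::vector<Track> tracks(100000);
        unsigned counts[] = {1, 2, 4, 8};
        for (unsigned i = 0; i < 4; ++i)
            std::cout << counts[i] << " threads: " << fitAll(tracks, counts[i])
                      << " ms/track" << std::endl;
        return 0;
    }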
Intel Larrabee: 32 Cores
• LRB vs. GPU:
• Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 Series and the Radeon 4000 series, in three major ways. It will:
• use the x86 instruction set with Larrabee-specific extensions;
• feature cache coherency across all its cores;
• include very little specialized graphics hardware.
• LRB vs. CPU:
• The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's 32 x86 cores will be based on the much simpler Pentium design;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
• each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
• LRB includes explicit cache control instructions;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes one fixed-function graphics hardware unit.
L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
Ivan Kisel, GSI
General Purpose Graphics Processing Units (GPGPU) • Substantial evolution of graphics hardware over the past years • Remarkable programmability and flexibility • Reasonably cheap • New branch of research – GPGPU Ivan Kisel, GSI
NVIDIA Hardware
• Streaming multiprocessors
• Zero-overhead thread switching
• FPUs instead of cache/control logic
• Complex memory hierarchy
• SIMT – Single Instruction Multiple Threads
GT200:
• 30 multiprocessors, 30 DP units
• 8 SP FPUs per MP, 240 SP units in total
• 16 000 registers per MP
• 16 kB shared memory per MP
• >= 1 GB main memory
• 1.4 GHz clock
• 933 GFlops SP
S. Kalcher, M. Bach
Ivan Kisel, GSI
SIMD/SIMT Kalman Filter on the CSC-Scout Cluster
Cluster: 18x(2x(Quad-Xeon, 3.0 GHz, 2x6 MB L2), 16 GB) + 27x Tesla S1070 (4x(GT200, 4 GB))
Figure: measured performance, GPU 9100 vs. CPU 1600
M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth
Ivan Kisel, GSI
CPU/GPU Programming Frameworks
• Cg, OpenGL Shading Language, DirectX
  • Designed to write shaders
  • Require the problem to be expressed graphically
• AMD Brook
  • Pure stream computing
  • Not hardware specific
• AMD CAL (Compute Abstraction Layer)
  • Generic use of the hardware at assembler level
• NVIDIA CUDA (Compute Unified Device Architecture)
  • Defines the hardware platform
  • Generic programming
  • Extension to the C language
  • Explicit memory management
  • Programming on the thread level
• Intel Ct (C for throughput)
  • Extension to the C language
  • Intel CPU/GPU specific
  • SIMD exploitation for automatic parallelism
• OpenCL (Open Computing Language)
  • Open standard for generic programming
  • Extension to the C language
  • Supposed to work on any hardware
  • Usage of specific hardware capabilities via extensions
Ivan Kisel, GSI
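To make the OpenCL entry concrete, a minimal host-side C++ sketch (not from the slides; it assumes an installed OpenCL SDK and linking against the OpenCL library, e.g. -lOpenCL) that only enumerates the available platforms and devices:

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platforms[8];
        cl_uint nPlatforms = 0;
        clGetPlatformIDs(8, platforms, &nPlatforms);
        if (nPlatforms > 8) nPlatforms = 8;

        for (cl_uint p = 0; p < nPlatforms; ++p) {
            cl_device_id devices[8];
            cl_uint nDevices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &nDevices);
            if (nDevices > 8) nDevices = 8;

            for (cl_uint d = 0; d < nDevices; ++d) {
                char name[256];
                cl_uint units = 0;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, 0);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(units), &units, 0);
                std::printf("platform %u, device %u: %s (%u compute units)\n",
                            (unsigned)p, (unsigned)d, name, (unsigned)units);
            }
        }
        return 0;
    }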
Cellular Automaton Track Finder
Figure: event displays labelled 10, 200 and 500.
Ivan Kisel, GSI
L1 CA Track Finder: Efficiency
Possible reasons for efficiency losses:
• Fluctuating magnetic field?
• Too large STS acceptance?
• Too large distance between STS stations?
I. Rostovtseva
Ivan Kisel, GSI
L1 CA Track Finder: Changes I. Kulakov Ivan Kisel, GSI
L1 CA Track Finder: Timing
old – old version (from CBMRoot DEC08)
new – new parallelized version
Statistics: 100 central events
Processor: Pentium D, 3.0 GHz, 2 MB
I. Kulakov
Ivan Kisel, GSI
On-line = Off-line Reconstruction?
• Off-line and on-line reconstruction will and should be parallelized
• Both versions will run on similar many-core systems or even on the same PC farm
• Both versions will (probably) use the same parallel language(s), such as OpenCL
• Can we use the same code, with some physics cuts applied only when running on-line, as in L1 CA? (see the sketch below)
• If the final code is fast, can we think about a global on-line event reconstruction and selection?
Ivan Kisel, GSI
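A minimal sketch of the "same code, different cuts" idea (all names and cut values are hypothetical, not actual CBM code):

    #include <cstdio>
    #include <vector>

    // Hypothetical cut configuration; only the values differ between passes.
    struct RecoCuts {
        float minMomentum;   // GeV/c
        int   minHits;       // minimum number of STS hits on a track
    };

    struct Track { float momentum; int nHits; };

    // One reconstruction/selection routine for both off-line and on-line use.
    std::vector<Track> selectTracks(const std::vector<Track>& candidates,
                                    const RecoCuts& cuts) {
        std::vector<Track> selected;
        for (std::size_t i = 0; i < candidates.size(); ++i)
            if (candidates[i].momentum > cuts.minMomentum &&
                candidates[i].nHits   >= cuts.minHits)
                selected.push_back(candidates[i]);
        return selected;
    }

    int main() {
        std::vector<Track> candidates(3);
        candidates[0].momentum = 0.2f; candidates[0].nHits = 5;
        candidates[1].momentum = 1.0f; candidates[1].nHits = 7;
        candidates[2].momentum = 0.8f; candidates[2].nHits = 4;

        RecoCuts offlineCuts = {0.1f, 4};   // loose: keep as much as possible
        RecoCuts onlineCuts  = {0.5f, 6};   // tight: fast on-line event selection

        std::printf("off-line: %zu tracks, on-line: %zu tracks\n",
                    selectTracks(candidates, offlineCuts).size(),
                    selectTracks(candidates, onlineCuts).size());
        return 0;
    }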
Summary
• Think parallel!
• Parallel programming is the key to the full potential of Tera-scale platforms
• Data parallelism vs. parallelism of the algorithm
• Stream processing – no branches
• Avoid directly accessing main memory: no maps, no look-up tables
• Use the SIMD unit in the near future (many-cores, TF/s, …)
• Use single-precision floating point where possible
• In critical parts use double precision if necessary (see the sketch below)
• Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, …)
• New parallel languages are appearing: OpenCL, Ct, CUDA
• A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR !!!
• Should we start buying them for testing the algorithms now?
Ivan Kisel, GSI
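As an illustration of the precision advice above, a small hypothetical sketch that keeps bulk data in single precision and uses double precision only for a critical accumulation:

    #include <cstdio>
    #include <vector>

    // Bulk per-hit arithmetic stays in single precision (4-16 values per SIMD
    // register); only the long accumulation is done in double precision to
    // limit round-off error.
    double accumulateChi2(const std::vector<float>& contributions) {
        double sum = 0.0;                          // double only where it matters
        for (std::size_t i = 0; i < contributions.size(); ++i)
            sum += contributions[i];
        return sum;
    }

    int main() {
        std::vector<float> chi2(1000, 0.5f);       // single-precision inputs
        std::printf("total chi2 = %f\n", accumulateChi2(chi2));
        return 0;
    }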
Back-up Slides (1-5)
Ivan Kisel, GSI
Back-up Slides (1/5)
Ivan Kisel, GSI
Back-up Slides (2/5)
Ivan Kisel, GSI
Back-up Slides (3/5)
Ivan Kisel, GSI
Back-up Slides (4/5)
SIMD is out of consideration (I.K.)
Ivan Kisel, GSI
Back-up Slides (5/5)
Ivan Kisel, GSI
Tracking Workshop
You are cordially invited to the Tracking Workshop, 15-17 June 2009, at GSI.
Ivan Kisel, GSI