Status of the L1 STS Tracking
I. Kisel, GSI / KIP
CBM Collaboration Meeting, GSI, March 12, 2009
L1 CA Track Finder Efficiency
• Fluctuating magnetic field?
• Too large STS acceptance?
• Too large distance between STS stations?
I. Rostovtseva
Many-core HPC
Which platform? CPU Intel: XX-cores | GP CPU Intel: Larrabee | GP GPU Nvidia: Tesla | CPU/GPU AMD: Fusion | Gaming STI: Cell | FPGA Xilinx
• High performance computing (HPC)
• The highest clock rate has been reached
• Performance/power optimization
• Heterogeneous systems of many (>8) cores
• Similar programming languages (OpenCL, Ct and CUDA)
• We need a uniform approach to all CPU/GPU families
• On-line event selection
• Mathematical and computational optimization
• SIMDization of the algorithm (from scalars to vectors) – see the sketch below
• MIMDization (multi-threads, many-cores)
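The "from scalars to vectors" idea can be illustrated with a small operator-overloaded SIMD type, so the same arithmetic expression is written once and mapped to whatever vector width the hardware offers. This is only a minimal sketch: the name fvec, the SSE backend and the extrapolate function are assumptions for the example, not the actual L1/KF classes.

```cpp
// Illustrative sketch of "SIMDization: from scalars to vectors".
// fvec wraps 4 packed single-precision values; the same template code
// then runs either per track (float) or for 4 tracks at once (fvec).
#include <xmmintrin.h>
#include <cstdio>

struct fvec {
    __m128 v;
    fvec() : v(_mm_setzero_ps()) {}
    fvec(float a) : v(_mm_set1_ps(a)) {}
    fvec(__m128 a) : v(a) {}
    friend fvec operator+(fvec a, fvec b) { return fvec(_mm_add_ps(a.v, b.v)); }
    friend fvec operator*(fvec a, fvec b) { return fvec(_mm_mul_ps(a.v, b.v)); }
};

// The same expression works for a scalar and for a SIMD vector type.
template <typename T>
T extrapolate(T x, T tx, T dz) { return x + tx * dz; }

int main() {
    float xs = extrapolate(1.0f, 0.1f, 10.0f);                     // scalar: one track
    fvec  xv = extrapolate(fvec(1.0f), fvec(0.1f), fvec(10.0f));   // SIMD: 4 tracks at once

    alignas(16) float out[4];
    _mm_store_ps(out, xv.v);
    std::printf("scalar: %g   vector lane 0: %g\n", xs, out[0]);
    return 0;
}
```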
Current and Expected Eras of Intel Processor Architectures
From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.
• Future programming is 3-dimensional: cores × threads × SIMD width
• The amount of data is doubling every 18-24 months
• Massive data streams
• The RMS (Recognition, Mining, Synthesis) workload in real time
• Supercomputer-level performance in ordinary servers and PCs
• Applications such as real-time decision-making analysis
Cores and Threads
• CPU architecture in 19XX: 1 process per CPU
• CPU architecture in 2000: 2 threads per process per CPU (a process with Thread1 and Thread2, interleaving execute and read/write phases – see the sketch below)
• CPU architecture in 2009
• CPU of your laptop in 2015
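As a minimal illustration of "2 threads per process", the sketch below starts two threads inside one process; it uses C++11 std::thread purely for illustration (not code from the presentation).

```cpp
// One process, two threads running concurrently on a (hyper-threaded) CPU.
#include <thread>
#include <cstdio>

void work(int id) {
    std::printf("thread %d running\n", id);
}

int main() {
    std::thread t1(work, 1);   // Thread1
    std::thread t2(work, 2);   // Thread2
    t1.join();
    t2.join();
    return 0;
}
```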
SIMD Width
• SIMD = Single Instruction, Multiple Data
• SIMD uses vector registers
• SIMD exploits data-level parallelism (a minimal SSE example follows below)
Faster or slower than scalar?
• Scalar double precision (64 bits): reference
• Vector (SIMD) double precision (128 bits): 2 or 1/2
• Vector (SIMD) single precision (128 bits): 4 or 1/4
• Intel AVX (2010), vector single precision (256 bits): 8 or 1/8
• Intel LRB (2010), vector single precision (512 bits): 16 or 1/16
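A minimal sketch of the 128-bit single-precision case, assuming an SSE-capable x86 CPU: the scalar loop does four additions, the vector version does the same work with one packed instruction.

```cpp
// Scalar vs. SSE single-precision addition: the SIMD version processes
// four floats per instruction using one 128-bit vector register.
#include <xmmintrin.h>   // SSE intrinsics
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1, 2, 3, 4};
    alignas(16) float b[4] = {10, 20, 30, 40};
    alignas(16) float c[4];

    // Scalar: four separate additions.
    for (int i = 0; i < 4; ++i) c[i] = a[i] + b[i];

    // SIMD: one instruction adds all four lanes at once.
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_store_ps(c, vc);

    std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```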
SIMD KF Track Fit on Intel Multicore Systems: Scalability
• Real-time performance on the quad-core Xeon 5345 (Clovertown) at 2.4 GHz: speed-up of 30 with 16 threads (see the threading sketch below)
• Real-time performance on different Intel CPU platforms
• Speed-up of 3.7 on the Xeon 5140 (Woodcrest) at 2.4 GHz using icc 9.1
H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering
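The thread-level (MIMD) part of such a scalability test can be sketched as below: independent tracks are distributed over threads with OpenMP. The Track type and fit_track function are placeholders, not the actual SIMD KF code.

```cpp
// Hypothetical sketch: fitting many independent tracks in parallel with OpenMP.
#include <omp.h>
#include <vector>
#include <cstdio>

struct Track { float params[6]; };            // placeholder track state
void fit_track(Track& t) { /* Kalman filter fit would go here */ }

int main() {
    std::vector<Track> tracks(100000);

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (int)tracks.size(); ++i)
        fit_track(tracks[i]);                 // tracks are independent: ideal for threading
    double t1 = omp_get_wtime();

    std::printf("fitted %zu tracks with up to %d threads in %.3f s\n",
                tracks.size(), omp_get_max_threads(), t1 - t0);
    return 0;
}
```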
Intel Larrabee: 32 Cores
LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
• it will use the x86 instruction set with Larrabee-specific extensions;
• it will feature cache coherency across all its cores;
• it will include very little specialized graphics hardware.
LRB vs. CPU: The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's 32 x86 cores will be based on the much simpler Pentium design;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
• each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
• LRB includes explicit cache control instructions;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes one fixed-function graphics hardware unit.
L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
General Purpose Graphics Processing Units (GPGPU)
• Substantial evolution of graphics hardware over the past years
• Remarkable programmability and flexibility
• Reasonably cheap
• New branch of research: GPGPU
NVIDIA Hardware
• Streaming multiprocessors
• No thread-switching overhead
• FPUs instead of cache/control logic
• Complex memory hierarchy
• SIMT – Single Instruction, Multiple Threads
GT200:
• 30 multiprocessors
• 8 SP FPUs per multiprocessor, 240 SP units in total
• 30 DP units
• 16 000 registers per multiprocessor
• 16 kB shared memory per multiprocessor
• ≥ 1 GB main memory
• 1.4 GHz clock
• 933 GFlops SP
S. Kalcher, M. Bach
SIMD/SIMT Kalman Filter on the CSC-Scout Cluster
Cluster configuration: 18 × (2 × (Quad-Xeon, 3.0 GHz, 2 × 6 MB L2), 16 GB) + 27 × Tesla S1070 (4 × (GT200, 4 GB))
GPU: 9100 vs. CPU: 1600
M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth
CPU/GPU Programming Frameworks
• Cg, OpenGL Shading Language, Direct X
   • Designed to write shaders
   • Require the problem to be expressed graphically
• AMD Brook
   • Pure stream computing
   • Not hardware specific
• AMD CAL (Compute Abstraction Layer)
   • Generic usage of the hardware at assembler level
• NVIDIA CUDA (Compute Unified Device Architecture)
   • Defines the hardware platform
   • Generic programming
   • Extension to the C language
   • Explicit memory management
   • Programming at thread level
• Intel Ct (C for throughput)
   • Extension to the C language
   • Intel CPU/GPU specific
   • SIMD exploitation for automatic parallelism
• OpenCL (Open Computing Language)
   • Open standard for generic programming
   • Extension to the C language
   • Supposed to work on any hardware (a minimal host-side sketch follows below)
   • Usage of specific hardware capabilities by extensions
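To give a feel for the "extension to the C language" style, the sketch below runs a toy saxpy kernel through the OpenCL host API. It assumes an OpenCL 1.x runtime and a default device are available; error handling and resource releases are omitted for brevity, and the kernel is an illustration, not CBM code.

```cpp
// Minimal OpenCL host-side sketch: one work-item per array element.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSource =
    "__kernel void saxpy(float a, __global const float* x, __global float* y) {\n"
    "    int i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main() {
    const size_t n = 1024;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);                        // error handling omitted
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "saxpy", &err);

    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), x.data(), &err);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), y.data(), &err);

    float a = 3.0f;
    clSetKernelArg(kernel, 0, sizeof(float), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dx);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dy);

    size_t global = n;
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, n * sizeof(float), y.data(), 0, nullptr, nullptr);

    std::printf("y[0] = %f (expect 5.0)\n", y[0]);
    return 0;
}
```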
On-line = Off-line Reconstruction?
• Off-line and on-line reconstruction will and should be parallelized
• Both versions will run on similar many-core systems, or even on the same PC farm
• Both versions will (probably) use the same parallel language(s), such as OpenCL
• Can we use the same code, with some additional physics cuts applied when running on-line, like the L1 CA?
• If the final code is fast, can we think about global on-line event reconstruction and selection?
Summary
• Think parallel!
• Parallel programming is the key to the full potential of Tera-scale platforms
• Data parallelism vs. parallelism of the algorithm
• Stream processing – no branches (see the branch-free sketch below)
• Avoid direct access to main memory; no maps, no look-up tables
• Use the SIMD unit; in the nearest future: many cores, TF/s, …
• Use single-precision floating point where possible
• In critical parts use double precision if necessary
• Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, …)
• New parallel languages appear: OpenCL, Ct, CUDA
• A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!
• Should we start buying them for testing?
CPU Intel: XXX-cores | Gaming STI: Cell | GP GPU Nvidia: Tesla | GP CPU Intel: Larrabee | CPU/GPU AMD: Fusion | FPGA Xilinx — OpenCL?
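A hedged illustration of the "stream processing – no branches" point, assuming an SSE-capable x86 CPU: a per-element if/else is replaced by a SIMD comparison mask, so all four lanes follow the same instruction stream.

```cpp
// Branch-free selection with SSE: out[i] = (x[i] > 0) ? x[i] : 0, without an if.
#include <xmmintrin.h>
#include <cstdio>

int main() {
    alignas(16) float x[4] = {-1.0f, 2.0f, -3.0f, 4.0f};
    alignas(16) float out[4];

    __m128 v    = _mm_load_ps(x);
    __m128 zero = _mm_setzero_ps();

    __m128 mask = _mm_cmpgt_ps(v, zero);                 // all-ones where x > 0
    __m128 res  = _mm_or_ps(_mm_and_ps(mask, v),         // keep x where mask is set
                            _mm_andnot_ps(mask, zero));  // else take 0
    _mm_store_ps(out, res);

    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```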