L1 Event Reconstruction in the STS
I. Kisel, GSI / KIP
CBM Collaboration Meeting, Dubna, October 16, 2008
Many-core HPC
Hardware families: Gaming (STI: Cell), GP GPU (Nvidia: Tesla), GP CPU (Intel: Larrabee), CPU/GPU (AMD: Fusion)
• High performance computing (HPC)
  • The highest clock rates have been reached
  • Performance/power optimization
  • Heterogeneous systems of many (>8) cores
  • Similar programming languages (Ct and CUDA)
  • We need a uniform approach to all CPU/GPU families
• On-line event selection
  • Mathematical and computational optimization
  • SIMDization of the algorithm (from scalars to vectors), see the sketch after this slide
  • MIMDization (multi-threads, multi-cores)
  • Optimize the STS geometry (strips, sector navigation)
  • Smooth magnetic field
Ivan Kisel, GSI
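To illustrate what "SIMDization of the algorithm (from scalars to vectors)" means in practice, here is a minimal C++ sketch, assuming 4-wide SSE single-precision registers; the function names and the toy computation are illustrative and not taken from the L1/KF code.

  // Minimal SIMDization sketch (not taken from the L1/KF code): the same
  // computation written once per scalar and once per 4-wide SSE vector.
  #include <xmmintrin.h>   // SSE intrinsics: 4 single-precision floats per register

  // Scalar version: one value per loop iteration.
  void scale_add_scalar(const float* x, const float* y, float* out, int n, float a) {
      for (int i = 0; i < n; ++i)
          out[i] = a * x[i] + y[i];
  }

  // SIMD version: four values per loop iteration (n assumed to be a multiple of 4).
  void scale_add_simd(const float* x, const float* y, float* out, int n, float a) {
      __m128 va = _mm_set1_ps(a);
      for (int i = 0; i < n; i += 4) {
          __m128 vx = _mm_loadu_ps(x + i);
          __m128 vy = _mm_loadu_ps(y + i);
          _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
      }
  }

In the SIMD KF track fitter the same pattern is used so that several tracks can be processed in parallel by one instruction stream.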
NVIDIA GeForce GTX 280
• NVIDIA GT200 (GeForce GTX 280), 1024 MB
• 933 GFlops single precision (240 FPUs)
• Finally double precision support, but only ~90 GFlops (8-core Xeon: ~80 GFlops)
• Currently under investigation:
  • Tracking
  • Linpack
  • Image processing
CUDA (Compute Unified Device Architecture)
Sebastian Kalcher
Ivan Kisel, GSI
Intel Larrabee: 32 Cores
• Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
  • it will use the x86 instruction set with Larrabee-specific extensions;
  • it will feature cache coherency across all its cores;
  • it will include very little specialized graphics hardware.
• The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
  • LRB's x86 cores will be based on the much simpler Pentium design;
  • each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
  • LRB includes one fixed-function graphics hardware unit;
  • LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
  • LRB includes explicit cache control instructions;
  • each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.
L. Seiler et al., Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
Ivan Kisel, GSI
Intel Ct Language
Extend C++ for Throughput-Oriented Computing
• Ct adds new data types (parallel vectors) and operators to C++
  • Library-like interface, fully ANSI/ISO-compliant
• Ct abstracts away architectural details
  • Vector ISA width / core count / memory model / cache sizes
• Ct forward-scales software written today
  • The Ct platform-level API, the Virtual Intel Platform (VIP), is designed to be dynamically retargetable to SSE, SSEx, LRB, etc.
• Ct is fully deterministic
  • No data races
• Nested data parallelism and deterministic task parallelism differentiate Ct in parallelizing irregular data and algorithms

The basic type in Ct is a TVEC; vector operations subsume loops: an element-wise multiply followed by a reduction (a global sum) replaces the explicit loop.

Dot product using C loops:
  for (i = 0; i < n; i++) {
    dst += src1[i] * src2[i];
  }

Dot product using Ct:
  TVEC<F64> Dst, Src1(src1, n), Src2(src2, n);
  Dst = addReduce(Src1 * Src2);

Ct: Throughput Programming in C++. Tutorial. Intel.
Ivan Kisel, GSI
Ct vs. CUDA
Matthias Bach
Ivan Kisel, GSI
Multi/Many-Core Investigations
• CA: Game of Life (a toy sketch follows below)
• L1/HLT CA Track Finder
• SIMD KF Track Fitter
• LINPACK
• MIMDization (multi-threads, multi-cores)
GSI, KIP, CERN, Intel
Ivan Kisel, GSI
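Since the Game of Life is listed above as a many-core test case, here is a toy C++ sketch of one update step of this cellular automaton; it is illustrative only, not the benchmark code used in these investigations. Each cell update is independent of the others, which is what makes the loop an easy target for SIMD and multi-threading.

  // Toy sketch of one Game of Life update step (illustrative only).
  #include <vector>

  // grid[y][x] is 1 (alive) or 0 (dead); cells outside the grid count as dead.
  std::vector<std::vector<int>> lifeStep(const std::vector<std::vector<int>>& grid) {
      int h = (int)grid.size(), w = (int)grid[0].size();
      std::vector<std::vector<int>> next(h, std::vector<int>(w, 0));
      for (int y = 0; y < h; ++y)
          for (int x = 0; x < w; ++x) {
              int n = 0;  // count the 8 neighbours
              for (int dy = -1; dy <= 1; ++dy)
                  for (int dx = -1; dx <= 1; ++dx)
                      if ((dy || dx) && y + dy >= 0 && y + dy < h && x + dx >= 0 && x + dx < w)
                          n += grid[y + dy][x + dx];
              // Standard rules: a live cell survives with 2 or 3 neighbours,
              // a dead cell becomes alive with exactly 3 neighbours.
              next[y][x] = grid[y][x] ? (n == 2 || n == 3) : (n == 3);
          }
      return next;
  }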
SIMD KF Track Fit on Multicore Systems: Scalability
[Plot: real fit time per track (ms) vs. number of threads]
Using Intel Threading Building Blocks: linear scaling on multiple cores (see the TBB sketch below)
Håvard Bjerke
Ivan Kisel, GSI
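A minimal sketch of how track fits can be distributed over cores with Intel Threading Building Blocks is shown below; Track and fitTrack() are hypothetical placeholders, not the actual SIMD KF fitter interface.

  // Minimal sketch of distributing track fits over cores with Intel TBB.
  // Track and fitTrack() are placeholders, not the actual SIMD KF fitter.
  #include <tbb/parallel_for.h>
  #include <tbb/blocked_range.h>
  #include <vector>

  struct Track { /* hits, parameters, covariance ... */ };

  void fitTrack(Track& /*t*/) { /* placeholder for the (SIMDized) Kalman filter fit */ }

  void fitAllTracks(std::vector<Track>& tracks) {
      tbb::parallel_for(tbb::blocked_range<size_t>(0, tracks.size()),
          [&](const tbb::blocked_range<size_t>& r) {
              for (size_t i = r.begin(); i != r.end(); ++i)
                  fitTrack(tracks[i]);  // each thread fits its own chunk of tracks
          });
  }

Since each track is fitted independently, the work splits cleanly into chunks, which is consistent with the linear scaling reported above.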
Parallelization of the L1 CA Track Finder
Two stages: (1) create tracklets, (2) collect tracks (a schematic sketch follows below)
GSI, KIP, CERN, Intel, ITEP, Uni-Kiev
Ivan Kisel, GSI
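The following is a schematic, heavily simplified C++ sketch of the two stages in a 2D toy detector; all names, cuts and data structures are illustrative placeholders, not the actual L1 code.

  // Schematic sketch of a cellular-automaton track finder's two stages.
  #include <vector>
  #include <cmath>

  struct Hit      { float x, z; int station; };
  struct Tracklet { int h1, h2; float slope; };   // segment between adjacent stations

  // Stage 1: create tracklets from hit pairs on adjacent stations,
  // keeping only pairs whose slope passes a (toy) acceptance cut.
  std::vector<Tracklet> createTracklets(const std::vector<Hit>& hits, float maxSlope) {
      std::vector<Tracklet> tl;
      for (int i = 0; i < (int)hits.size(); ++i)
          for (int j = 0; j < (int)hits.size(); ++j)
              if (hits[j].station == hits[i].station + 1) {
                  float s = (hits[j].x - hits[i].x) / (hits[j].z - hits[i].z);
                  if (std::fabs(s) < maxSlope) tl.push_back({i, j, s});
              }
      return tl;
  }

  // Stage 2: collect tracks by chaining tracklets that share a hit and have
  // compatible slopes (greedy; competition between track candidates is omitted).
  std::vector<std::vector<int>> collectTracks(const std::vector<Tracklet>& tl, float maxDiff) {
      std::vector<std::vector<int>> tracks;
      std::vector<bool> used(tl.size(), false);
      for (int i = 0; i < (int)tl.size(); ++i) {
          if (used[i]) continue;
          std::vector<int> track(1, i);
          used[i] = true;
          for (int cur = i, extended = 1; extended; ) {
              extended = 0;
              for (int j = 0; j < (int)tl.size(); ++j)
                  if (!used[j] && tl[j].h1 == tl[cur].h2 &&
                      std::fabs(tl[j].slope - tl[cur].slope) < maxDiff) {
                      track.push_back(j); used[j] = true; cur = j; extended = 1; break;
                  }
          }
          tracks.push_back(track);
      }
      return tracks;
  }

Both stages loop over independent items (hit pairs in stage 1, starting tracklets in stage 2), which is what makes them candidates for distribution over threads.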
L1 Standalone Package for Event Selection
Igor Kulakov
Ivan Kisel, GSI
KFParticle: Primary Vertex Finder
The algorithm has been implemented and has passed its first tests.
Ruben Moor
Ivan Kisel, GSI
L1 Standalone Package for Event Selection
Efficiency of D+ selection: 48.9%
Igor Kulakov, Iouri Vassiliev
Ivan Kisel, GSI
Magnetic Field: Smooth in the Acceptance
We need a smooth magnetic field in the acceptance:
• Approximate the field with a polynomial in the plane of each station
• Approximate the field with a parabolic function in z between each set of three stations
(A minimal sketch follows below.)
Ivan Kisel, GSI
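As a worked illustration of the two bullets above, here is a minimal C++ sketch, assuming a second-order polynomial per station plane and a Lagrange parabola in z through three consecutive stations; the structure, degree and coefficients are illustrative, not the actual CBM field parameterisation.

  // Minimal smooth-field sketch: polynomial in (x, y) at each station plane,
  // parabolic interpolation in z between three stations. Illustrative only.

  // Toy 2nd-order polynomial approximation of one field component in a station plane.
  struct FieldSlice {
      double c[6];  // c0 + c1*x + c2*y + c3*x*x + c4*x*y + c5*y*y
      double By(double x, double y) const {
          return c[0] + c[1]*x + c[2]*y + c[3]*x*x + c[4]*x*y + c[5]*y*y;
      }
  };

  // Parabolic interpolation in z through three station planes z0 < z1 < z2.
  double ByBetweenStations(const FieldSlice& s0, const FieldSlice& s1, const FieldSlice& s2,
                           double z0, double z1, double z2,
                           double x, double y, double z) {
      double f0 = s0.By(x, y), f1 = s1.By(x, y), f2 = s2.By(x, y);
      // Lagrange form of the parabola through (z0, f0), (z1, f1), (z2, f2).
      return f0 * (z - z1) * (z - z2) / ((z0 - z1) * (z0 - z2))
           + f1 * (z - z0) * (z - z2) / ((z1 - z0) * (z1 - z2))
           + f2 * (z - z0) * (z - z1) / ((z2 - z0) * (z2 - z1));
  }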
CA on the STS Geometry with Overlapping Sensors
UrQMD MC, central Au+Au at 25 AGeV
The efficiency and the fraction of killed tracks remain acceptable up to ∆Z = Zhit - Zstation < ~0.2 cm
Irina Rostovtseva
Ivan Kisel, GSI
Summary and Plans
• Learn the Ct (Intel) and CUDA (Nvidia) programming languages
• Develop the L1 standalone package for event selection
• Parallelize the CA track finder
• Investigate large multi-core systems (CPU and GPU)
• Parallel hardware -> parallel languages -> parallel algorithms
Ivan Kisel, GSI