Parallel considerations of VELO PatPixelTracking. Daniel Hugo Cámpora Pérez, LHCb Online team. Outline: PatPixel problem description; test setup and some results; integration with the Gaudi framework.
Parallel considerations of VELO PatPixelTracking Daniel Hugo Cámpora Pérez LHCb Online team
Outline
• PatPixel problem description
• Test setup, some results
• Integration with Gaudi framework
Daniel Hugo Cámpora Pérez - Parallel Considerations of VELO PatPixelTracking 21-11-2012
Fast Pixel problem description
• 48 sensors with 12 chips each
• Each chip has 256x256 pixels
• Clustered 2x2 by readout board
• Right and left sensors at different z with overlap
• The algorithm searches for hits starting from the last pixel lattice.
• For each hit, it searches for compatible hits (within a given radius) in the next pixel lattice.
• Finding at least three compatible hits forms a track.
However, the current approach is very sequential (albeit efficient!).
• Hits must not be already used.
• continue and break statements cut the loops short and make it fast.
Porting the same algorithm as is to other programming models makes for a proof of concept (the physics output is identical).
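The sequential search described above can be sketched roughly as follows. This is a simplified illustration, not the PatPixelTracking code: the `Hit`, `Sensor` and `Track` types and both function names are hypothetical, and "compatible" is assumed here to mean "within `radius` of the previous hit in (x, y)", which is cruder than a real track extrapolation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified hit: a position on one sensor plus a "used" flag.
struct Hit {
    float x;
    float y;
    bool used;
};

using Sensor = std::vector<Hit>;   // all hits of one sensor
using Track = std::vector<Hit*>;   // hits of one track, last sensor first

// Return the first unused hit of `sensor` within `radius` of (x, y), or nullptr.
Hit* findCompatible(Sensor& sensor, float x, float y, float radius) {
    for (Hit& h : sensor) {
        if (h.used) continue;  // hits must not be already used
        const float dx = h.x - x;
        const float dy = h.y - y;
        if (dx * dx + dy * dy <= radius * radius) return &h;  // early exit
    }
    return nullptr;
}

// Seed on every sensor, starting from the last one, and follow each seed
// towards the first sensor. A chain of at least three hits becomes a track.
std::vector<Track> searchTracks(std::vector<Sensor>& sensors, float radius) {
    assert(radius >= 0.0f);
    std::vector<Track> tracks;
    for (std::size_t s = sensors.size(); s-- > 0;) {
        for (Hit& seed : sensors[s]) {
            if (seed.used) continue;
            Track candidate{&seed};
            for (std::size_t next = s; next-- > 0;) {
                const Hit* last = candidate.back();
                Hit* h = findCompatible(sensors[next], last->x, last->y, radius);
                if (h == nullptr) break;  // chain is broken, stop following it
                candidate.push_back(h);
            }
            if (candidate.size() < 3) continue;  // too short to be a track
            for (Hit* h : candidate) h->used = true;
            tracks.push_back(candidate);
        }
    }
    return tracks;
}
```

The `used` flag is what makes the loop order matter: whether a hit is still available depends on every seed processed before it, which is why the approach is hard to parallelise inside one event.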
Current test setup
We are interested in the search bit!
• Input is 200 Monte Carlo generated events.
• Implementations produce exactly the same output as Brunel, unless stated otherwise.
• The current setup runs TBB with a variable number of threads, specified by task_scheduler_init init(i);
• 1000 experiments are run per configuration. The results shown are the mean of those; the standard deviation is checked as well.
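The measurement protocol above (repeat each configuration many times, report the mean, check the spread) could be sketched as below. This is a generic harness under stated assumptions, not the one used for the talk: `Stats`, `summarize` and `timeRuns` are hypothetical names, the standard deviation is the population one, and the workload is passed in as a callable so the sketch does not depend on TBB.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

struct Stats {
    double mean;
    double stddev;  // population standard deviation
};

// Mean and population standard deviation of a non-empty sample set.
Stats summarize(const std::vector<double>& samples) {
    assert(!samples.empty());
    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(samples.size());
    double squares = 0.0;
    for (double s : samples) squares += (s - mean) * (s - mean);
    return Stats{mean, std::sqrt(squares / static_cast<double>(samples.size()))};
}

// Run `work` `repetitions` times and return the wall-clock time of each run in seconds.
std::vector<double> timeRuns(const std::function<void()>& work, int repetitions) {
    assert(repetitions > 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(repetitions));
    for (int r = 0; r < repetitions; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

A configuration would then be measured with something like `summarize(timeRuns(runAllEvents, 1000))`, where `runAllEvents` stands for whatever processes the 200 input events with a given thread count.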
Comparing apples to…
• Lab13
  • Intel Xeon CPU E5-2650 (2 CPUs)
  • 20M Cache, 2.00 GHz (2.80 GHz TB)
  • 8 cores, 16 HW threads
• Intel MIC (Pre-Production Intel® Xeon Phi™ coprocessors)
  • 1.1 GHz
  • 61 cores, 244 HW threads
• GPU
  • NVIDIA GeForce GTX 680
  • 1 GHz
  • 1536 CUDA cores (96 SIMD cores)
Precision
• Is there a real need for double-precision floating-point operations? How about single precision instead…
Mean 1 (correct / produced tracks): 100%
Mean 2 (correct / total number of tracks): 99.9964%
• We miss one track in 28,000.
• No incorrect tracks are generated.
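As a quick check of how the two figures relate to "one track in 28,000", the ratios can be worked out as below. The struct and function names are hypothetical, and the counts used in the usage note are illustrative round numbers taken from that bullet, not the actual totals of the study.

```cpp
#include <cassert>

// Track counts for one comparison against the reference (double-precision) output.
struct TrackCounts {
    long correct;    // produced tracks that match a reference track
    long produced;   // all tracks produced by the single-precision run
    long reference;  // all tracks in the reference output
};

// "Mean 1" on the slide: correct / produced tracks, in percent.
double correctOverProduced(const TrackCounts& c) {
    assert(c.produced > 0);
    return 100.0 * static_cast<double>(c.correct) / static_cast<double>(c.produced);
}

// "Mean 2" on the slide: correct / total number of tracks, in percent.
double correctOverTotal(const TrackCounts& c) {
    assert(c.reference > 0);
    return 100.0 * static_cast<double>(c.correct) / static_cast<double>(c.reference);
}
```

With `TrackCounts{27999, 27999, 28000}`, the first ratio is 100% (no incorrect tracks) and the second is 27999/28000, about 99.9964%, which matches the quoted value.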
Ma(g)ny-cores like single precision!
Current implementation
• The setup is decoupled from the Gaudi framework.
• The physics output is the same.
• Parallelism is set up as one thread per event.
• The GPU acts as a simple SIMD device (“speedup” of 0.3x!)
  • Divergent branches and warps are not good friends.
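The thread-per-event scheme can be illustrated with a small worker pool in which each thread repeatedly claims the next unprocessed event. This sketch uses `std::thread` instead of TBB so that it has no external dependency; `Event`, `Result`, `processEvent` and `processAll` are hypothetical stand-ins, and `processEvent` only counts hits where the real code would run the sequential search.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Event {
    std::vector<float> hits;  // hypothetical event payload
};

struct Result {
    std::size_t nHits;
};

// Stand-in for the sequential per-event search; events are independent of each other.
Result processEvent(const Event& event) {
    return Result{event.hits.size()};
}

// Process all events with `nThreads` workers; one event is handled by exactly one thread.
std::vector<Result> processAll(const std::vector<Event>& events, unsigned nThreads) {
    assert(nThreads > 0);
    std::vector<Result> results(events.size());
    std::atomic<std::size_t> next{0};  // index of the next event to claim
    auto worker = [&events, &results, &next] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= events.size()) break;
            results[i] = processEvent(events[i]);  // each slot is written by one thread only
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t) pool.emplace_back(worker);
    for (std::thread& th : pool) th.join();
    return results;
}
```

Because every event is processed by the unchanged sequential code, the output is identical to the single-threaded run. The same property explains the poor GPU figure: threads of one warp follow different events and therefore different branches.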
Event-wise parallelism?
Using a similar idea to the baseline algorithm, we can exploit the inherently parallel nature of the problem.
• Average #hits per sensor: 22.6
• Average multiplicity (hit x hit): 771.15
• Average multiplicity (hit x hit x hit): 1544.7
The early-stage parallel algorithm produces 85% of the correct tracks. Different results don’t necessarily mean wrong! Physics demonstration!
What’s missing?
The current setup is fine and dandy for comparing results, but not for testing the real setup!
Daniel Hugo Cámpora Pérez 26-10-2012
Integration with Gaudi
The current HLT does not consider having coprocessors to help in the execution of any step. The framework is sequential!
Per-event execution on a coprocessor is not realistic. Memory copies will kill us!
• Each event is approximately 50 kB.
• Processing one single event is trivial.
We have to hide the latency!
Pipelining!
E.g. events per chunk = 200
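One way to picture the pipelining idea is a double-buffered loop: while chunk k is being processed, the copy of chunk k+1 is already in flight. The sketch below models this on the host with `std::async`; it is an assumption-laden illustration, not the Gaudi integration. `Event`, `transfer`, `compute`, `PipelineReport` and `runPipeline` are hypothetical, and `transfer` and `compute` stand in for the host-to-device copy and the coprocessor kernel. At roughly 50 kB per event, a 200-event chunk is about 10 MB per transfer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

struct Event {
    int id;  // hypothetical event payload
};
using Chunk = std::vector<Event>;

struct PipelineReport {
    std::size_t events;  // events processed in total
    std::size_t chunks;  // chunks sent through the pipeline
};

// Stand-in for a host-to-device copy of events [begin, end).
Chunk transfer(const std::vector<Event>& events, std::size_t begin, std::size_t end) {
    return Chunk(events.begin() + static_cast<std::ptrdiff_t>(begin),
                 events.begin() + static_cast<std::ptrdiff_t>(end));
}

// Stand-in for the coprocessor kernel: returns the number of events it handled.
std::size_t compute(const Chunk& chunk) {
    return chunk.size();
}

// Overlap the transfer of the next chunk with the computation on the current one.
PipelineReport runPipeline(const std::vector<Event>& events, std::size_t chunkSize) {
    assert(chunkSize > 0);
    PipelineReport report{0, 0};
    if (events.empty()) return report;
    auto startTransfer = [&events, chunkSize](std::size_t begin) {
        const std::size_t end = std::min(begin + chunkSize, events.size());
        return std::async(std::launch::async,
                          [&events, begin, end] { return transfer(events, begin, end); });
    };
    std::size_t begin = 0;
    std::future<Chunk> inFlight = startTransfer(begin);
    while (inFlight.valid()) {
        const Chunk current = inFlight.get();  // wait for the pending transfer
        begin += current.size();
        if (begin < events.size()) inFlight = startTransfer(begin);  // prefetch next chunk
        report.events += compute(current);     // runs while the next copy is in flight
        ++report.chunks;
    }
    return report;
}
```

The chunk size is the tuning knob: larger chunks amortise the fixed cost of each copy, at the price of more latency before the first result comes back.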
Integration with Gaudi
Gaudihive
In conclusion
• The analysis of the sequential algorithm is complete.
• A speedup of 10.70x has been obtained by properly configuring TBB.
• The MIC underperforms because the VPUs are not used; more tweaking is necessary.
• Using floats rather than doubles is beneficial for many-core architectures, and the results are practically the same (one track in 28,000 is missed, none are incorrect).
• A parallel version of PatPixel would give a more realistic architecture comparison, and should perform better.
• The current framework with a good pipeline could enable the use of a coprocessor in a production environment.