"Update on GPU trigger"
NA62 collaboration meeting, 9.12.2009
Gianluca Lamanna, Scuola Normale Superiore and INFN Pisa
Introduction
We are investigating the possible use of video cards for trigger purposes in our experiment.
A first attempt, presented at the Capri meeting, was based on my notebook's video card and a preliminary ring-finding algorithm for the RICH.
The first result was 128 ms to find a ring (too slow!!!).
Two possible applications identified: L1 and/or L0.
GPU parallelization structure
The GPU (video card processor) is natively built for parallelization.
The GT200 (in the NVIDIA TESLA C1060) has 240 computing cores subdivided into 30 multiprocessors of 8 cores each.
Each core runs a thread; a group of threads is called a thread block, and each thread block runs on one multiprocessor.
32 threads (a warp) are scheduled concurrently on a multiprocessor with a SIMD structure: all the threads of a kernel use the same instruction pool on different data sets (one instruction pool, many processing units sharing a data pool).
Divergence between threads, due to different code paths, causes a loss of parallelism.
Fast parallel code (many threads work on the same problem): parallelization over the multi-core structure.
Multi-event: many events are processed at the same time.
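Below is a minimal CUDA sketch of this mapping, not NA62 code: a hypothetical processHits kernel is launched with one 32-thread block (one warp) per event, so the blocks are distributed over the 30 multiprocessors and all threads of a warp execute the same instructions on different data.

```cuda
// Hypothetical kernel, not NA62 code: one block (= one warp of 32 threads)
// per event, one thread per hit.
#include <cuda_runtime.h>

__global__ void processHits(const float *hitX, const float *hitY, int nEvents)
{
    int event = blockIdx.x;    // one thread block per event
    int hit   = threadIdx.x;   // one thread per hit (up to 32 hits assumed)

    if (event < nEvents) {
        // All threads of the warp execute the same instruction on different
        // data (SIMD); divergent branches inside a warp are serialized.
        float x = hitX[event * 32 + hit];
        float y = hitY[event * 32 + hit];
        (void)x; (void)y;      // per-hit computation would go here
    }
}

int main()
{
    const int nEvents = 1000;
    float *dX, *dY;
    cudaMalloc(&dX, nEvents * 32 * sizeof(float));
    cudaMalloc(&dY, nEvents * 32 * sizeof(float));

    // 32 threads per block = one warp; nEvents blocks are scheduled over
    // the 30 multiprocessors of the GT200.
    processHits<<<nEvents, 32>>>(dX, dY, nEvents);
    cudaDeviceSynchronize();

    cudaFree(dX);
    cudaFree(dY);
    return 0;
}
```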
Memory management
Correct use of the memory is crucial for every GPU application where high performance is required.
The shared memory is subdivided into 16 banks. Different banks can be accessed simultaneously by different threads of a block; read/write conflicts on the same bank are serialized.
Global memory accesses are coalesced into 64-byte (or 128-byte) segments. Alignment within a warp guarantees the best performance.
(Slide diagram: threads of a warp mapped onto the 16 shared-memory banks.)
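A small illustrative kernel (hypothetical names, not from the experiment) showing the two access patterns described above: consecutive threads read consecutive global-memory words, so the reads coalesce into 64/128-byte segments, and each thread of a half-warp touches a different shared-memory bank, so no conflict is serialized.

```cuda
// Illustrative kernel (hypothetical names): coalesced global reads and
// conflict-free shared-memory accesses, assuming a block of 256 threads.
__global__ void copyThroughShared(const float *gIn, float *gOut)
{
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: thread k of a warp reads word k of a contiguous segment,
    // so the accesses are combined into 64/128-byte transactions.
    // Conflict-free: threadIdx.x maps each thread of a half-warp onto a
    // different 4-byte shared-memory bank (16 banks on the GT200).
    tile[threadIdx.x] = gIn[i];
    __syncthreads();

    // Coalesced, aligned write back to global memory.
    gOut[i] = tile[threadIdx.x];
}
```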
GPU in NA62 trigger system
At L1 & L2 the GPUs can support the trigger decision in the PC farm:
No strict constraints on latency and computing time (but they must stay reasonable).
Reliability and stability required.
Benefits: a smaller L1 PC farm; more sophisticated algorithms can allow a higher trigger purity.
At L0 the GPU can be exploited for an effective trigger rate reduction:
Latency and computing time limitations: 10 MHz input at L0 and 1 ms latency in the TELL1 readout -> the trigger decision has to be ready in 1 ms and the computing time is 100 ns per event (example: 100 ns to find a ring in the RICH). Very challenging!!!
Benefits: very selective L0 triggers, both for signal and for the collection of secondary processes.
(Slide diagrams: detectors — RICH, MUV, STRAWS — sending data to PCs with GPUs at L1/L2; at L0, GPU-computed trigger primitives feed the L0TS, which issues the L0 trigger and sends reduced data to L1.)
Hough transform
Each hit is the center of a test circle with a given radius; the ring center is the best common point of the test circles.
PM positions -> constant memory; hits -> global memory; test-circle prototypes -> constant memory.
3D space for the histograms in shared memory (2D grid vs test-circle radius); limitations due to the total shared memory amount (16 KB).
One thread for each center (hit) -> 32 threads (in one thread block) per event.
Pros: natural parallelization, small number of threads per event.
Cons: unpredictable memory access (read & write conflicts), heavy use of the on-chip fast memory.
(Slide figure: the (X, Y, radius) accumulator space.)
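The following is a hedged sketch of how such a Hough voting kernel could look in CUDA; the accumulator dimensions, number of test radii, prototype table and kernel name are illustrative placeholders, not the real NA62 values.

```cuda
// Hedged sketch of the Hough voting; bin counts, radii and the prototype
// table are illustrative placeholders, not the real NA62 values.
#include <cuda_runtime.h>

#define NX    16       // x bins of the accumulator
#define NY    16       // y bins
#define NR    4        // test radii
#define NPTS  32       // points sampled on each test circle

// Test-circle prototypes (x,y offsets for each radius) in constant memory.
__constant__ float2 c_proto[NR * NPTS];

__global__ void houghVote(const float2 *hits, const int *nHits, int *accOut)
{
    // 3D accumulator (radius, y, x) in shared memory: 4 KB here, bounded
    // in general by the 16 KB of shared memory per multiprocessor.
    __shared__ int acc[NR][NY][NX];

    // Cooperative zeroing of the accumulator.
    for (int k = threadIdx.x; k < NR * NY * NX; k += blockDim.x)
        ((int *)acc)[k] = 0;
    __syncthreads();

    // One block per event, one thread per hit (<= 32 hits assumed); hit
    // coordinates are assumed to be expressed in accumulator-bin units.
    int event = blockIdx.x;
    if (threadIdx.x < nHits[event]) {
        float2 h = hits[event * 32 + threadIdx.x];
        for (int r = 0; r < NR; ++r)
            for (int p = 0; p < NPTS; ++p) {
                int ix = (int)(h.x + c_proto[r * NPTS + p].x);
                int iy = (int)(h.y + c_proto[r * NPTS + p].y);
                if (ix >= 0 && ix < NX && iy >= 0 && iy < NY)
                    atomicAdd(&acc[r][iy][ix], 1);  // conflicts are serialized
            }
    }
    __syncthreads();

    // Copy the accumulator out; the maximum bin gives center and radius.
    for (int k = threadIdx.x; k < NR * NY * NX; k += blockDim.x)
        accOut[event * NR * NY * NX + k] = ((int *)acc)[k];
}
```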
Optimized for problem, multi histos (OPMH)
Each PM (1000) is considered as the center of a circle. For each center a histogram is built with the distances between the center and the hits (<32).
The whole processor is used for a single event (huge number of centers); each single thread computes only a few distances.
Several histograms are computed in different shared memory spaces.
Not natural for the processor: it isn't possible to process more than one event at the same time (the parallelism is fully exploited to speed up the computation).
Pros: very fast and simple kernel operation.
Cons: not a natural resource assignment; the whole processor is used for a single event.
(Slide figure: distance histogram for one candidate center.)
Optimized for device, multi histo (ODMH)
Exactly the same algorithm as OPMH, but with a different resource assignment.
The system is exploited in a more natural way: each block is dedicated to a single event, using the shared memory for one histogram.
Several events are processed in parallel at the same time.
Easier to avoid conflicts in shared and global memory.
Mapping: the events sit in global memory; each block has its own shared memory; 1 event -> M threads (each thread handles N PMs).
Pros: natural memory optimization and resource assignment.
Cons: waste of resources.
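A hedged CUDA sketch of the ODMH mapping (one block per event, M threads each handling N candidate PM centers); the constants, kernel name and the final reduction are illustrative, not the actual implementation. The per-center histogram is kept in a local array here, which is exactly the kind of storage that the later V11 optimization moves to shared memory.

```cuda
// Hedged sketch of the ODMH mapping: one block per event, M = 128 threads,
// each thread testing PM_PER_THR = 8 candidate centers (illustrative values).
#include <cuda_runtime.h>

#define N_PM        1024   // candidate centers (PM positions), illustrative
#define PM_PER_THR  8      // N: PMs handled by each thread
#define MAX_HITS    32
#define NBINS       32     // distance-histogram bins
#define BIN_WIDTH   10.f   // mm per bin, illustrative

__constant__ float2 c_pm[N_PM];   // PM positions in constant memory

__global__ void odmhKernel(const float2 *hits, const int *nHits, float *bestRadius)
{
    int event = blockIdx.x;            // one block per event
    int n     = nHits[event];

    // Stage the (<=32) hits of this event once in shared memory.
    __shared__ float2 sHits[MAX_HITS];
    if (threadIdx.x < n)
        sHits[threadIdx.x] = hits[event * MAX_HITS + threadIdx.x];
    __syncthreads();

    int   bestCount = 0;
    float bestR     = 0.f;

    // Each thread loops over its PM_PER_THR candidate centers and builds a
    // distance histogram per center (kept in a local array in this sketch).
    for (int k = 0; k < PM_PER_THR; ++k) {
        float2 c = c_pm[threadIdx.x * PM_PER_THR + k];
        int histo[NBINS] = {0};

        for (int h = 0; h < n; ++h) {
            float dx  = sHits[h].x - c.x;
            float dy  = sHits[h].y - c.y;
            int   bin = (int)(sqrtf(dx * dx + dy * dy) / BIN_WIDTH);
            if (bin >= NBINS) bin = NBINS - 1;
            ++histo[bin];
        }
        for (int b = 0; b < NBINS; ++b)
            if (histo[b] > bestCount) {
                bestCount = histo[b];
                bestR     = (b + 0.5f) * BIN_WIDTH;
            }
    }

    // Crude block-wide reduction: the most populated bin wins (a real
    // kernel would use a proper parallel reduction here).
    __shared__ int   sCount;
    __shared__ float sR;
    if (threadIdx.x == 0) { sCount = 0; sR = 0.f; }
    __syncthreads();
    atomicMax(&sCount, bestCount);
    __syncthreads();
    if (bestCount == sCount) sR = bestR;    // benign race: any winner is fine
    __syncthreads();
    if (threadIdx.x == 0) bestRadius[event] = sR;
}
```

With these illustrative constants the launch would be odmhKernel<<<nEvents, N_PM / PM_PER_THR>>>(...), i.e. one block per event and 128 threads per block.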
Optimization procedure & results
Several steps of optimization (still room for improvement).
Algo1 and Algo3 are suitable to process many events concurrently at the same time (in Algo2 a single event saturates the chip resources).
(Slide plot: processing time in us for algo 1, algo 2 and algo 3 — best ring found on GPU, 12 hits per ring, hits generated with the NA62 G4 MC — at each optimization step: first version, memory-init optimization, memory-access optimization, kernel optimization, multiple events.)
Improvement with respect to the result shown in Capri (128 ms per ring):
Different video card (factor of 10)
Better understanding of resource assignment
Better understanding of memory conflicts and memory management
Multi-event approach
Other algorithms
Further optimization on ODMH
• Two crucial parameters are:
• N: total number of events per kernel
• PM: number of PMs processed in the same thread
The time is linear in the number of events per kernel (plot: time in us vs N events, at 8 PMs per thread).
The minimum of the time as a function of the number of PMs processed in a single thread is at PMs = 8 (plot: time in us vs PMs per thread, at 1000 events per kernel).
Working point (N = 1000, PMs = 8) -> 10.8 us
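One possible way to extract such per-event figures, shown only as an assumed measurement procedure: time a whole packet with CUDA events and divide by the number of events. The odmhKernel, N_PM and PM_PER_THR symbols refer to the hypothetical sketch on the previous slide and are assumed to be compiled in the same file.

```cuda
// Hypothetical timing helper around the odmhKernel sketch above.
#include <cuda_runtime.h>

float timePerEventUs(int nEvents, const float2 *dHits, const int *dNHits,
                     float *dBestRadius)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // One block per event, N_PM / PM_PER_THR = 128 threads per block.
    odmhKernel<<<nEvents, N_PM / PM_PER_THR>>>(dHits, dNHits, dBestRadius);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 1000.f * ms / nEvents;             // microseconds per event
}
```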
Present result
Further optimizations on the working point:
• V11: moving arrays from registers (spilled to local memory) to shared memory -> 6 us
• V12: occupancy optimization -> 5.9 us
• V13: temporary-histogram optimization in shared memory -> 5 us
This means that a group of 1000 events is processed in 5 ms, the latency (due to the algorithm) is 5 ms and the maximum rate is 200 kHz.
• Data transfer time (per packet):
• Data from host to GPU RAM -> 70 us
• Results from GPU RAM to host -> 7 us
• Concurrent copy and kernel execution in streams (see the sketch below)
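A sketch of the concurrent copy and kernel execution mentioned above, using two CUDA streams so that the transfer of one packet overlaps the processing of another; the layout constants and the odmhKernel name are carried over from the earlier hypothetical sketches and are not the real NA62 data format. The host buffers must be page-locked for the asynchronous copies to overlap with kernel execution.

```cuda
// Sketch of concurrent copy and kernel execution with two CUDA streams;
// MAX_HITS, N_PM, PM_PER_THR and odmhKernel come from the ODMH sketch.
#include <cuda_runtime.h>

#define N_STREAMS  2
#define PKT_EVENTS 1000    // events per packet, as in the working point

void processPackets(const float2 *hHits, const int *hNHits, float *hBest,
                    int nPackets,
                    float2 *dHits[N_STREAMS], int *dNHits[N_STREAMS],
                    float *dBest[N_STREAMS])
{
    cudaStream_t stream[N_STREAMS];
    for (int s = 0; s < N_STREAMS; ++s)
        cudaStreamCreate(&stream[s]);

    for (int p = 0; p < nPackets; ++p) {
        int    s      = p % N_STREAMS;
        size_t hitOff = (size_t)p * PKT_EVENTS * MAX_HITS;
        size_t evtOff = (size_t)p * PKT_EVENTS;

        // Copy the next packet to the GPU while another stream computes.
        cudaMemcpyAsync(dHits[s], hHits + hitOff,
                        PKT_EVENTS * MAX_HITS * sizeof(float2),
                        cudaMemcpyHostToDevice, stream[s]);
        cudaMemcpyAsync(dNHits[s], hNHits + evtOff,
                        PKT_EVENTS * sizeof(int),
                        cudaMemcpyHostToDevice, stream[s]);

        odmhKernel<<<PKT_EVENTS, N_PM / PM_PER_THR, 0, stream[s]>>>(
            dHits[s], dNHits[s], dBest[s]);

        // Results go back to the host in the same stream.
        cudaMemcpyAsync(hBest + evtOff, dBest[s], PKT_EVENTS * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < N_STREAMS; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```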
Prospects
Next optimization steps:
Copy the hits into shared memory
Computation of the PM center positions (instead of using constant memory)
Eliminate the slow operations (division by integer)
Eliminate syncthreads in a perfectly aligned warp structure
Understand the differences in time between reading and writing shared memory
3 us isn't a too optimistic hypothesis for the next month -> several optimizations still to be done.
1 us isn't too far from our possibilities; a new algorithm ("triplets") will be implemented very soon.
1 us means that by using 10 video cards in parallel at the same time we can reach the 10 MHz with a latency of 5 ms -> NA62-TELL1 can easily manage this latency -> the only issue is with the GTK readout.
The next generation of video cards (already available on the market) will offer at least a factor of 2 in performance (probably much higher due to the different architecture).
Going through a working system
• The maximum bandwidth in each part of the system is very high: no intrinsic system bottleneck.
• (Slide diagram; the maximum "equivalent frequency" is computed assuming 200 B/event:)
• GbE receiver (INTEL PRO/1000 QUAD GBE) -> CPU over PCI-E x16: 4 GB/s (20 MHz)
• CPU RAM: 30 GB/s (150 MHz)
• CPU -> GPU (TESLA) over PCI-E gen2 x16: 8 GB/s (40 MHz)
• TESLA VRAM: 100 GB/s (500 MHz)
• Three important parts are still missing to test a real system:
• An intelligent GbE receiver, in which the data coming from the TELL1 (Ethernet packets) are prepared to be directly transferred to the GPU memory
• A real-time Linux system on a dual-processor PC (one processor for the system, one processor to move the data)
• A re-synchronization card to send the trigger decision synchronously to the TTC
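As an illustration of the host-side ingredient only (not the missing "intelligent GbE receiver" itself), a receive buffer can be allocated as page-locked host memory with cudaHostAlloc, so that data written into it can be moved to the GPU with a fast asynchronous DMA copy; the buffer name and size are hypothetical and assume the 200 B/event figure and 1000-event packets quoted above.

```cuda
// Illustrative allocation of a page-locked receive buffer (hypothetical
// names; size assumes ~200 B/event and 1000-event packets).
#include <cuda_runtime.h>

int main()
{
    const size_t bufSize = 1000 * 200;   // one packet: 1000 events x 200 B

    unsigned char *hRxBuf = 0;           // host-side receive buffer
    unsigned char *dRxBuf = 0;           // GPU-side copy of the packet

    // Page-locked (pinned) host memory: needed for asynchronous DMA copies.
    cudaHostAlloc((void **)&hRxBuf, bufSize, cudaHostAllocDefault);
    cudaMalloc((void **)&dRxBuf, bufSize);

    // ... the receiver process would fill hRxBuf with Ethernet payloads ...

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(dRxBuf, hRxBuf, bufSize, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(dRxBuf);
    cudaFreeHost(hRxBuf);
    return 0;
}
```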
Conclusions
The use of GPUs in the trigger will be useful both at L1/L2 and at L0.
An effective reduction of the trigger rate at L0 will be very important both for the main trigger and for secondary triggers, with great benefits for physics.
The L0-GPU is very challenging with respect to the requirements on latency and processing time.
A very careful use of the parallelization structure and of the memory allows us to reach 5 us of processing time and 5 ms of latency (for 1000-event packets).
The 3 us level will be reached with relatively small effort, by optimizing the data transfer and dispatching.
The 1 us level isn't completely impossible: by using 10 video cards in parallel and increasing the latency in the readout, the system seems to be feasible.
More work is needed on the PC & HW side to design a complete new trigger paradigm with high-quality primitives computed in the first stage.
(Slide diagram: FE digitization + buffer + (trigger primitives) -> L0 with PCs+GPU -> L1 with PCs+GPU; "quasi-triggerless" with GPUs.)