This study explores the use of GPU-based algorithms for track recognition in imaging devices, specifically nuclear emulsions. The study focuses on converting raw data to particle tracks in a quasi-real-time manner.
GPU-based quasi-real-time track recognition in imaging devices: from raw data to particle tracks
Cristiano Bozza, Università di Salerno / INFN
Pisa – 10/9/2014
on behalf of C. B.¹, Umut Kose², Simona Maria Stellacci¹, Chiara De Sio¹
1: University of Salerno  2: CERN
Nuclear emulsions as visualizing detectors
[Figure: emulsion film cross-section – two 44 µm emulsion layers on a 205 µm plastic base, with microtracks]
• Nuclear emulsions as data source
• Used recently in CHORUS, DONUT, PEANUT, OPERA
• Application to muon radiography of volcanoes and buildings (e.g. nuclear reactors, nuclear waste depots)
• Highest spatial resolution available: 0.1 µm, 1 mrad or better
• No time trigger
• Ideal test bench for tracking algorithms for many detectors
Nuclear emulsions as visualizing detectors
[Figure: ESS/QSS optical tomography scheme – consecutive views #1 to #5 in XYZ]
• Automatic microscopes
  • SySal, ESS, QSS (Europe)
  • TS, NTS, UTS, S-UTS, HTS (Japan)
• Optical tomography: take data from a large volume by "scanning" it in views
  • X and Z axes moving during data taking
  • Tracks normally span two views
Nuclear emulsions as visualizing detectors
[Figure: typical emulsion image crop (87 × 124 µm) showing 10 years' background pile-up]
• Typical emulsion image (from the Quick Scanning System)
• FOV size: 770×550 µm², 31 images/view (tomography)
• Grain diameter: 0.5 µm
• "Background" grains ("fog", radioactivity): 25000/image
• m.i.p. track grains: 0~10/image
• Task: find all 3D tracks!
  • Lighting is not uniform
  • No time trigger
  • No way to identify "good" grains before tracking
Nuclear emulsions as visualizing detectors
• This study was triggered by R&D on QSS
  • Current top speed: 41~90 cm²/h/side
  • Outlook with improved stage control: 150 cm²/h/side
  • Faster cameras: 250~300 cm²/h/side (image transmission speed sets the limit)
• Applications
  • High energy physics with topological study of events on µm~cm scale
    • Neutrino physics
    • Charm physics
    • Tau physics
    • Exotic particles with characteristic decay signatures
  • Muon radiography: Stromboli, Unzen, Teide, La Palma fault
Data flow from automatic microscopes
• Relevant figures
  • QSS scanning speed: 41 or 90 cm²/h/side
  • Clusters of dark pixels/view (31 images): 5×10⁵
  • Grains (dark clusters with size constraints): 1.5×10⁵
  • Raw data (grains) / film (120 cm²): 300 GB
  • Image data rate: 2 or 4 GB/s (2×, 4× Camera Link protocol)
  • Raw data rate/microscope: 50 or 110 GB/h (110 or 250 Mbps)
  • Processed data (tracks as sequences of grains): amount depends on angular acceptance and film quality, but about 2 GB/film is realistic
  • Microscopes/laboratory: 2~10
Data flow from automatic microscopes
• Living in a GPU-less world (all numbers for 1 microscope, 20 cm²/h/side (ESS), "standard" angular acceptance)
  • 2D image preprocessing by FPGA device: Matrox Odyssey
  • Dark cluster detection by host CPU: 4 cores
  • 3D tracking by networked servers: 50 cores
  • Processing hardware cost: ~40 k€
• GPU-powered data acquisition (all numbers for 1 microscope, 41 cm²/h/side (QSS), "standard" angular acceptance)
  • 2D image processing by GPU on host PC: NVidia GTX 590/690
  • 3D tracking by GPU-powered servers: 6×GTX 690 (18432 cores)
  • Temporary staging area (RAMDisk) for raw data: 32 GB
  • Processing hardware cost: 7 k€ (GTX 690), includes cost of host workstation
Data flow from automatic microscopes
• Work organisation with GPU
[Diagram: GTX 590/690 hosted in the microscope workstation → temporary storage server (RAMDisk, 32 GB) → tracking servers hosting 1 or 2 GTX 690 each (6×GTX 690 shown) → final storage]
• Temporary storage server: ensures constant flow, manages job allocation, dynamic reconfiguration
• Data protocol: networked file system
• Control protocol: HTTP + SAWI (Server Application with Web Interface); integrates web interface and interprocess communication
From images to microtracks
• Step #1: from images to dark clusters
  • GPU on data acquisition workstation
  • Each image is treated separately
  • Need feedback to drive Z axis (check emulsion entry/exit surfaces)
• Step #2: from sets of dark clusters to grains
  • GPUs on tracking servers
  • Correct optical aberrations
  • Correct vibrations and motion effects
• Step #3: from grains to microtracks
  • GPUs on tracking servers
  • Find microtracks = sequences of aligned grains
  • Algorithms could be ported to other detectors
Step #1: From images to dark clusters
[Plots: grey-level histograms, grey (0–255), before and after equalization]
• Image upload to GPU: the images in the same view (31 for 44 µm) are uploaded together (~124 MB bunch)
• Equalization of grey-level histogram: 5 kernels
• Convolution with a 5×5 FIR filter and threshold: 1 kernel (1 thread per output pixel)
• Find horizontal "segments" of dark pixels: 2 kernels (1 thread per horizontal line)
• Assembling dark clusters from segments: 5 kernels (1 thread per horizontal line, recursive)
• Output to host memory
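The convolution-and-threshold step maps naturally onto one thread per output pixel. A minimal CUDA sketch of such a kernel is shown below; the filter coefficients, the 8-bit row-major image layout, the border handling and the dark/background convention are illustrative assumptions, not the actual QSS implementation.

    // Sketch of the 5x5 FIR convolution + threshold step, one thread per output pixel.
    // Filter coefficients, image layout and threshold convention are illustrative only.
    __constant__ float d_fir[25];   // 5x5 filter, uploaded once from the host

    __global__ void convolveAndThreshold(const unsigned char *in, unsigned char *out,
                                         int width, int height, float threshold)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float acc = 0.0f;
        for (int dy = -2; dy <= 2; ++dy) {
            for (int dx = -2; dx <= 2; ++dx) {
                // Clamp at the image border so edge pixels still get a filtered value.
                int sx = min(max(x + dx, 0), width  - 1);
                int sy = min(max(y + dy, 0), height - 1);
                acc += d_fir[(dy + 2) * 5 + (dx + 2)] * in[sy * width + sx];
            }
        }
        // Binarise: mark the pixel as "dark" (candidate cluster member) or background.
        out[y * width + x] = (acc < threshold) ? 1 : 0;
    }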
Step #1: From images to dark clusters (2011 prices)
• Image upload to GPU
• Equalization of grey-level histogram
• Convolution with a 5×5 FIR filter and threshold
• Output to host memory
• Comparison with FPGA devices: Matrox Odyssey (v1 and v2)
• Full processing: 2.5 ms/MB (GTX 590), includes segments + clusters
• ~10 ms for a 4 MPixel image
Step #2: From dark clusters to grains
[Figure: optical aberrations and view geometry – XY curvature, Z curvature, Z-axis slant (X and Y), XY trapezium, magnification vs. Z; consecutive images n, n+1 and views n, n+1]
• Dark cluster data upload to GPU: the dark clusters in the same view are uploaded together (raw data file)
• Correction of optical aberrations: 5 kernels
• Correction of "in-view" alignment: 23 kernels
  • The X and Z axes move during readout; the mechanics is not perfectly rigid; vibrations in the XY plane can occur
  • Pattern matching of clusters seen in consecutive images yields the highest precision
• Correction of "cross-view" alignment: 26 kernels
  • Overall misalignment due to vibrations is corrected by 3D pattern matching of clusters in the overlap region between views
• Clusters can be merged to form grains
  • Useful in some operational conditions
  • Z is obtained by weighted average
• Output to host memory is optional
  • Data immediately reused for tracking
  • Dump to host memory used only for debugging
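The aberration corrections amount to per-cluster coordinate transforms that parallelise trivially, one thread per cluster. The sketch below assumes a simple parametric model (quadratic XY and Z curvature, linear magnification vs. Z, Z-axis slant); the actual ESS/QSS calibration model and parameter names may well differ.

    // Illustrative per-cluster coordinate correction, one thread per cluster.
    // The correction model and the parameter struct are assumptions for this sketch.
    struct Cluster { float x, y, z; };

    struct ViewCorrection {
        float curvXY;          // quadratic field curvature in the XY plane
        float curvZ;           // quadratic Z curvature (focal surface bending)
        float magPerZ;         // relative magnification change per unit Z
        float slantX, slantY;  // Z-axis slant components
    };

    __global__ void correctClusters(Cluster *c, int n, ViewCorrection corr,
                                    float cx, float cy /* optical centre */)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float dx = c[i].x - cx;
        float dy = c[i].y - cy;
        float r2 = dx * dx + dy * dy;

        // Z curvature: the focal surface is not a plane, so Z depends on r^2.
        float z = c[i].z - corr.curvZ * r2;

        // Magnification varies with depth; rescale XY around the optical centre.
        float mag = 1.0f + corr.magPerZ * z;

        // XY curvature plus Z-axis slant.
        c[i].x = cx + dx * mag * (1.0f + corr.curvXY * r2) + corr.slantX * z;
        c[i].y = cy + dy * mag * (1.0f + corr.curvXY * r2) + corr.slantY * z;
        c[i].z = z;
    }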
Step #2: From dark clusters to grains
• Results of pattern matching
[Plots: in-view image-to-image alignment and cross-view alignment distributions (µm)]
• In-view image-to-image alignment: XY alignments 0.12 µm
• Cross-view alignment: XY alignments 0.15 µm, Z alignments 2.6 µm
Step #3: From grains to microtracks
• Track recognition: search for aligned grains
  • Grains w.r.t. straight line fit: σ = 50 nm
  • All grains in a track should lie within a cylinder defined by two grains (track "seed")
• Cylinder geometry available in different "flavours":
  • XY distance from axis
  • XY + weighted Z
  • XYZ distance from axis
• Combinatorial complexity
  • 6~30 grains per track
  • N ~ 5×10⁵ grains/view
  • N² possible tracks
  • N³ possible grains in tracks
• Reducing combinations
  • Check only neighbouring grains
  • Check only combinations within a defined angular tolerance
  • "Constructively" enforce the constraints: browsing combinations and discarding them is already a waste of time!
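A minimal sketch of the "XY distance from axis" cylinder flavour: the seed line defined by two grains is interpolated at the Z of a candidate grain and the transverse distance is compared against a tolerance. The Grain layout and the tolerance value are assumptions for illustration.

    // Sketch of the "XY distance from axis" cylinder test: a candidate grain is
    // accepted if its transverse distance from the straight line through the two
    // seed grains is below a tolerance (value assumed, not from the original code).
    struct Grain { float x, y, z; };

    __device__ bool insideCylinderXY(const Grain &a, const Grain &b,
                                     const Grain &g, float tolerance)
    {
        // Interpolate the seed line at the Z of the candidate grain.
        float dz = b.z - a.z;
        if (fabsf(dz) < 1e-6f) return false;   // degenerate seed: same image plane
        float t  = (g.z - a.z) / dz;
        float lx = a.x + t * (b.x - a.x);
        float ly = a.y + t * (b.y - a.y);

        // Transverse (XY) distance of the grain from the seed axis.
        float ddx = g.x - lx;
        float ddy = g.y - ly;
        return ddx * ddx + ddy * ddy <= tolerance * tolerance;
    }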
Step #3: From grains to microtracks
• Grain proximity in position/direction space
  • In 2D: arrange grains in a grid of cells, check proximity only within the same cell (or nearest neighbours)
  • In 3D: scan the angular acceptance region in fixed steps; for each direction step, define a set of skewed prisms, arrange grains in the skewed prisms and check for tracks only within each prism
• Size of prisms and angular step are chosen by fine-tuning
• Tracking time is proportional to the angular acceptance (in this presentation, 1.32 sr ≈ 11% of 4π)
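One way to implement the skewed prisms, sketched below under assumed names and cell sizes, is to "de-skew" each grain along the trial slope of the current direction step and bin the shifted XY position on a regular grid; grains belonging to a track with that slope then fall into the same prism.

    // Sketch of skewed-prism binning for one direction step (sx, sy): project each
    // grain back to a reference Z along the trial slope, then bin the de-skewed XY
    // position on a regular grid. Cell size and grid extent are assumptions.
    struct GrainPos { float x, y, z; };

    __device__ int prismIndex(const GrainPos &g, float sx, float sy,
                              float zRef, float cellSize,
                              int nCellsX, int nCellsY)
    {
        // Remove the trial slope so that grains of a track with slope (sx, sy)
        // fall into (nearly) the same XY cell at the reference depth.
        float x0 = g.x - sx * (g.z - zRef);
        float y0 = g.y - sy * (g.z - zRef);

        int ix = (int)floorf(x0 / cellSize);
        int iy = (int)floorf(y0 / cellSize);
        if (ix < 0 || ix >= nCellsX || iy < 0 || iy >= nCellsY) return -1;
        return iy * nCellsX + ix;   // linear prism index for this direction step
    }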
Step #3: From grains to microtracks
• The tracking algorithm can be easily adapted to other detectors
  • Stacked planes of 2D pixels: resemble Z layers
  • Volume detectors (e.g. Liquid Argon): grains are always treated as 3D entities
  • 4π angular acceptance: in this "flavour", no prisms are used to constrain the slope, but a limit on track length is set (deviation from straight fit due to multiple scattering, bremsstrahlung, etc.); long tracks are obtained by stitching "short" pieces in the track-merging stage
Step #3: From grains to microtracks
• Pitfalls of GPU coding for this algorithm
• #1 Filling prisms
  • If each thread corresponds to one grain, the risk of "collisions" (i.e. threads accessing the same prism at the same time) is very high
  • "Atomic" functions (CUDA 1.1 or higher) can be used to settle "race" conditions
  • With too many collisions the code becomes "quasi-serial"
  • "Striding" threads: each thread handles a sequential block of n grains, to increase the chance that they access different prisms
  • Drawback of thread striding: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
  • The code is non-deterministic: the exact order of filling is not specified, while it is ensured that all prisms will contain the right set of grains
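A sketch of the prism-filling kernel with "striding" threads and atomic slot reservation follows; the array layout, the per-prism capacity handling and all names are assumptions, not the production code.

    // Sketch of prism filling with "striding" threads: each thread handles a
    // contiguous block of grains, and atomicAdd settles the race when two threads
    // hit the same prism at the same time.
    __global__ void fillPrisms(const int *grainPrism,    // prism index per grain (-1 = outside)
                               int nGrains, int grainsPerThread,
                               int *prismFill,           // grains already stored per prism
                               int *prismGrains,         // flattened [prism][slot] grain ids
                               int maxPerPrism)
    {
        int tid   = blockIdx.x * blockDim.x + threadIdx.x;
        int first = tid * grainsPerThread;

        for (int g = first; g < first + grainsPerThread && g < nGrains; ++g) {
            int p = grainPrism[g];
            if (p < 0) continue;
            // Atomically reserve a slot; the filling order is non-deterministic,
            // but every prism ends up with the same set of grains.
            int slot = atomicAdd(&prismFill[p], 1);
            if (slot < maxPerPrism)
                prismGrains[p * maxPerPrism + slot] = g;
        }
    }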
Step #3: From grains to microtracks
• Pitfalls of GPU coding for this algorithm
• #2 Seed scanning
  • One seed is formed by a pair of grains in the same prism
  • If each thread corresponds to one prism, with an average fill of N grains the fluctuations will be O(N^(1/2))
  • The fluctuations in seeds will be O(N^(3/2))
    • Example: if the number of grains fluctuates from 6 to 12, the number of pairs fluctuates from 15 to 66!
  • Only a few threads will be running while all others have completed
  • Allocate one thread per seed
    • A seed in a crowded prism will still take more time because of more grains to check, but the fluctuation is only O(N^(1/2))
    • Further optimisation is possible, but not worth the effort
  • Drawback: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
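Allocating one thread per seed requires recovering the grain pair from a linear seed index. A sketch, with assumed bookkeeping arrays (seed-to-prism map, per-prism seed offsets), is shown below.

    // Sketch of "one thread per seed": a prism holding n grains contributes
    // n(n-1)/2 seeds, and each thread recovers its grain pair (i, j) from a
    // linear seed index. The bookkeeping arrays are assumptions.
    __device__ void seedToPair(int seed, int n, int &i, int &j)
    {
        // Walk the rows of the upper-triangular pair matrix; n is small
        // (tens of grains at most), so the loop cost is negligible.
        i = 0;
        int rowLen = n - 1;
        while (seed >= rowLen) {
            seed -= rowLen;
            ++i;
            --rowLen;
        }
        j = i + 1 + seed;
    }

    __global__ void enumerateSeeds(const int *prismOfSeed, // prism id per seed
                                   const int *seedOffset,  // first seed id of each prism
                                   const int *prismFill,   // grains stored per prism
                                   int2 *seedPairs,        // output: local grain pair per seed
                                   int nSeeds)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nSeeds) return;

        int p     = prismOfSeed[s];
        int local = s - seedOffset[p];          // seed index within its prism
        int i, j;
        seedToPair(local, prismFill[p], i, j);  // grain pair forming this seed
        seedPairs[s] = make_int2(i, j);         // later stages grow the track from (i, j)
    }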
Step #3: From grains to microtracks
• Pitfalls of GPU coding for this algorithm
• #3 Track merging
  • In the ideal case, a track with n grains has n(n-1)/2 seeds
  • The same track is reconstructed several times: track "clones" must be merged
  • Tracks are checked in pairs, comparing position and direction
  • The track with fewer grains is suppressed (or the "second" in case of a tie)
  • Reduction of combinations by proximity (tracks are stored in a grid of XY cells)
  • The code is non-deterministic: the order of tracks matters in producing the result
  • The "quality" of the set is always the same, but small differences can arise (0.1%)
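A sketch of the clone-suppression pass: each thread takes one entry of the cell-sorted track list and compares its track with the following ones in the same XY cell, flagging the clone with fewer grains (or the "second" one on a tie). Tolerances, data layout and neighbouring-cell handling (omitted here) are assumptions.

    // Sketch of clone suppression within one XY cell of the track grid.
    struct Track {
        float x, y;        // intercept at a reference Z
        float sx, sy;      // slopes
        int   nGrains;
    };

    __global__ void mergeClones(const Track *tracks,
                                const int *cellTrackIds, // track ids, grouped by XY cell
                                const int *entryCell,    // cell id of each entry
                                const int *cellEnd,      // one-past-last entry of each cell
                                int nEntries,
                                int *suppressed,         // 1 = track flagged as clone
                                float posTol, float slopeTol)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nEntries) return;

        int ia = cellTrackIds[e];
        Track a = tracks[ia];
        int end = cellEnd[entryCell[e]];

        for (int m = e + 1; m < end; ++m) {
            int ib = cellTrackIds[m];
            Track b = tracks[ib];
            // Two tracks are clones if both position and direction agree.
            if (fabsf(a.x - b.x) < posTol && fabsf(a.y - b.y) < posTol &&
                fabsf(a.sx - b.sx) < slopeTol && fabsf(a.sy - b.sy) < slopeTol) {
                // Keep the track with more grains; drop the "second" one on a tie.
                int loser = (b.nGrains > a.nGrains) ? ia : ib;
                suppressed[loser] = 1;
            }
        }
    }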
Performances: view correction/mapping
[Plot: time (ms) vs. dark clusters, for GTX640, Tesla C2050, GTX690 (1/2), GTX780Ti]
• Time spent in image/view correction
Performances: view correction/mapping
• Most time is spent in tracking
• Fraction of total time (Steps 1+2+3) spent in cluster-grain processing (Steps 1+2)
Performances: tracking
[Plot: time (ms) vs. grains, for GTX640, Tesla C2050, GTX690 (1/2), GTX780Ti]
• Tracking time vs. grains/view
Performances: tracking
[Plot: time (ms) vs. tracks, for GTX640, Tesla C2050, GTX690 (1/2), GTX780Ti]
• Tracking time vs. tracks/view
• No visible nonlinear bottlenecks
Performances: tracking
• Tracking time vs. grains/view (log-log)
[Plots: Log10(time) vs. Log10(grains/view) for GTX780Ti and GTX640; fitted exponents 1.88 and 1.71]
• The dependency is better than N²: the weight of computation stages with high combinatorial complexity is relatively small
Performances: tracking
• Compute work vs. grains
[Plot: Log10(compute work) vs. Log10(grains/view) for GTX640 (Fermi 2.1), Tesla C2050 (Fermi 2.0), GTX690 (1/2) (Kepler 3.0), GTX780Ti (Kepler 3.5)]
• Compute work := Time(ms) × Cores × Clock(MHz)
• More recent architectures seem less efficient: effect of branch divergence with more cores/multiprocessor?
Performances: tracking
• Identifying and understanding bottlenecks (data from GTX640)
• Branch divergence is almost only end-wait (completed threads wait for others running)
  • Difficult to improve without additional complications in the code
  • May be worth the effort for the Maxwell architecture
• Track merging jumps over "Track" data blocks that are anyway large
  • Memory could be coalesced by reshuffling thread order – under study
• Track fitting is negligible w.r.t. track recognition
Conclusions
• The solution shown uses GPUs to implement a complex algorithm with many logical branches and non-trivial memory access patterns
• The tracking portion (Step #3) is suitable for a wide range of detector types with straight tracks (magnetic field weak or absent); tracking planes and volume detectors are natively supported
• "Know your code": GPUs are effective, but efficient solutions need careful optimisation
• The algorithm's performance scales well with data size
• There is room to improve performance with the latest generation of boards