GPU-based quasi-real-time track recognition in imaging devices: from raw data to particle tracks

This study explores the use of GPU-based algorithms for track recognition in imaging devices, specifically nuclear emulsions. The study focuses on converting raw data to particle tracks in a quasi-real-time manner.


  1. GPU-based quasi-real-time track recognition in imaging devices: from raw data to particle tracks
     Cristiano Bozza, Università di Salerno / INFN
     Pisa, 10/9/2014
     On behalf of C. B. (1), Umut Kose (2), Simona Maria Stellacci (1), Chiara De Sio (1)
     1: University of Salerno; 2: CERN

  2. Nuclear emulsions as visualizing detectors
     • Nuclear emulsions as data source
       • Used recently in CHORUS, DONUT, PEANUT, OPERA
       • Application to muon radiography of volcanoes and buildings (e.g. nuclear reactors, nuclear waste depots)
     • Highest spatial resolution available
       • 0.1 µm, 1 mrad or better
     • No time trigger
     • Ideal test bench for tracking algorithms for many detectors
     [Figure: film cross-section with microtracks; layers labelled emulsion, plastic, emulsion; thicknesses 205 µm and 44 µm]
     Cristiano Bozza - Università di Salerno / INFN

  3. Nuclear emulsions as visualizing detectors
     • Automatic microscopes
       • SySal, ESS, QSS (Europe)
       • TS, NTS, UTS, S-UTS, HTS (Japan)
     • Optical tomography: take data from a large volume by “scanning” it in views
       • XZ moving during data taking
       • Tracks normally span two views
     [Figure: ESS/QSS microscope with x, y, z axes; View #1 to View #5]

  4. Nuclear emulsions as visualizing detectors
     • Typical emulsion image (from Quick Scanning System)
       • FOV size: 770×550 µm², 31 images/view (tomography)
       • Grain diameter: 0.5 µm
       • “Background” grains (“fog”, radioactivity): 25000/image
       • m.i.p. track grains: 0~10/image
     • Task: find all 3D tracks!
       • Lighting is not uniform
       • No time trigger
       • No way to identify “good” grains before tracking
     [Figure: emulsion image with 10 years’ background pile-up; scale labels 87 µm and 124 µm]

  5. Nuclear emulsions as visualizing detectors
     • This study was triggered by R&D on QSS
       • Current top speed: 41~90 cm²/h/side
       • Outlook with improved stage control: 150 cm²/h/side
       • Faster cameras: 250~300 cm²/h/side (image transmission speed sets the limit)
     • Applications
       • High energy physics with topological study of events on µm~cm scale
         • Neutrino physics
         • Charm physics
         • Tau physics
         • Exotic particles with characteristic decay signatures
       • Muon radiography
         • Stromboli
         • Unzen
         • Teide
         • La Palma fault

  6. Data flow from automatic microscopes
     • Relevant figures
       • QSS scanning speed: 41 or 90 cm²/h/side
       • Clusters of dark pixels/view (31 images): 5×10⁵
       • Grains (dark clusters with size constraints): 1.5×10⁵
       • Raw data (grains) / film (120 cm²): 300 GB
       • Image data rate: 2 or 4 GB/s (2×, 4× Camera Link protocol)
       • Raw data rate/microscope: 50 or 110 GB/h (110 or 250 Mbps)
       • Processed data (tracks as sequences of grains): amount depends on angular acceptance and film quality, but about 2 GB/film is realistic
       • Microscopes/laboratory: 2~10
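
The rates quoted on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the presented software: it assumes decimal units (1 GB = 10⁹ bytes) and that a film is scanned on two sides, and the function names are made up for illustration.

```python
def gb_per_hour_to_mbps(gb_per_hour):
    """Convert decimal gigabytes per hour to megabits per second."""
    return gb_per_hour * 1e9 * 8 / 3600 / 1e6


def raw_data_per_film_gb(area_cm2, speed_cm2_per_h, rate_gb_per_h, sides=2):
    """Raw data for one film: total scanning time (all sides) times data rate.

    sides=2 is an assumption; the slide quotes speeds per side.
    """
    hours = sides * area_cm2 / speed_cm2_per_h
    return hours * rate_gb_per_h


# (scanning speed in cm2/h/side, raw data rate in GB/h) pairs from the slide
for speed, rate in ((41, 50), (90, 110)):
    print(f"{rate} GB/h = {gb_per_hour_to_mbps(rate):.0f} Mbps, "
          f"film = {raw_data_per_film_gb(120, speed, rate):.0f} GB")
```

Under these assumptions both operating points give a little under 300 GB per 120 cm² film, and bit rates close to the rounded 110 and 250 Mbps on the slide.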

  7. Data flow from automatic microscopes
     • Living in a GPU-less world
       All numbers for 1 microscope, 20 cm²/h/side (ESS), “standard” angular acceptance
       • 2D image preprocessing by FPGA device: Matrox Odyssey
       • Dark cluster detection by host CPU: 4 cores
       • 3D tracking by networked servers: 50 cores
       • Processing hardware cost: ~40 k€
     • GPU-powered data acquisition
       All numbers for 1 microscope, 41 cm²/h/side (QSS), “standard” angular acceptance
       • 2D image processing by GPU on host PC: NVidia GTX 590/690
       • 3D tracking by GPU-powered servers: 6×GTX 690 (18432 cores)
       • Temporary staging area (RAMDisk) for raw data: 32 GB
       • Processing hardware cost: 7 k€ (GTX 690)
         • Includes cost of host workstation

  8. Data flow from automatic microscopes
     [Diagram: work organisation with GPU]
     • GTX 590/690 hosted in microscope workstation
     • Temporary storage server (RAMDisk, 32 GB)
       • Ensures constant flow
       • Manages job allocation
       • Dynamic reconfiguration
     • Data protocol: networked file system
     • Control protocol: HTTP + SAWI (Server Application with Web Interface)
       • Integrates web interface and interprocess communication
     • Tracking servers host 1 or 2 GTX 690 each (six GTX 690 boards shown)
     • Final storage

  9. From images to microtracks
     • Step #1: from images to dark clusters
       • GPU on data acquisition workstation
       • Each image is treated separately
       • Need feedback to drive Z axis (check emulsion entry/exit surfaces)
     • Step #2: from sets of dark clusters to grains
       • GPUs on tracking servers
       • Correct optical aberrations
       • Correct vibrations and motion effects
     • Step #3: from grains to microtracks
       • GPUs on tracking servers
       • Find microtracks = sequences of aligned grains
       • Algorithms could be ported to other detectors

  10. Step #1: From images to dark clusters
      • Image upload to GPU
        • The images in the same view (31 for 44 µm) are uploaded together (~124 MB bunch)
      • Equalization of grey-level histogram
        • 5 kernels
      • Convolution with a 5×5 FIR filter and threshold
        • 1 kernel (1 thread per output pixel)
      • Find horizontal “segments” of dark pixels
        • 2 kernels (1 thread per horizontal line)
      • Assembling dark clusters from segments
        • 5 kernels (1 thread per horiz. line, recursive)
      • Output to host memory
      [Figure: grey-level histograms, axis: grey (0-255)]
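
The last two processing stages of this slide (segments, then clusters) can be illustrated with a serial sketch. It is not the GPU code from the talk: the slide assigns one thread per horizontal line and uses several kernels, whereas this version runs on one CPU thread and joins segments with a union-find structure, which is a substitute chosen here for brevity. The function names, the threshold value and the tiny test image are all assumptions.

```python
def find_segments(row, threshold):
    """Return (start, end) column ranges, end exclusive, of pixels darker than threshold."""
    segments, start = [], None
    for x, value in enumerate(row):
        dark = value < threshold
        if dark and start is None:
            start = x
        elif not dark and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, len(row)))
    return segments


def assemble_clusters(image, threshold):
    """Group segments that overlap in x on adjacent rows into clusters."""
    segs = [(y, s, e) for y, row in enumerate(image)
            for s, e in find_segments(row, threshold)]
    parent = list(range(len(segs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, (yi, si, ei) in enumerate(segs):
        for j in range(i + 1, len(segs)):
            yj, sj, ej = segs[j]
            if yj > yi + 1:
                break  # segments are sorted by row: nothing further can touch
            if yj == yi + 1 and si < ej and sj < ei:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seg in enumerate(segs):
        clusters.setdefault(find(i), []).append(seg)
    return list(clusters.values())


image = [
    [255, 10, 10, 255, 255, 20],
    [255, 255, 10, 255, 255, 20],
    [255, 255, 255, 255, 255, 255],
    [10, 255, 255, 255, 255, 255],
]
clusters = assemble_clusters(image, threshold=128)
areas = sorted(sum(e - s for _, s, e in c) for c in clusters)
print(len(clusters), areas)
```

Working on segments rather than single pixels keeps the amount of data to join small, which is presumably why the slide splits the work into a segment stage and a cluster stage.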

  11. Step #1: From images to dark clusters
      • Image upload to GPU
      • Equalization of grey-level histogram
      • Convolution with a 5×5 FIR filter and threshold
      • Output to host memory
      • Comparison with FPGA devices: Matrox Odyssey (v1 and v2), 2011 prices
      • Full processing: 2.5 ms/MB (GTX 590)
        • Includes segments + clusters
        • ~10 ms for 4 MPixel image

  12. Step #2: From dark clusters to grains
      • Dark cluster data upload to GPU
        • The dark clusters in the same view are uploaded together (raw data file)
      • Correction of optical aberrations
        • 5 kernels
      • Correction of “in-view” alignment
        • 23 kernels
        • The X and Z axes move during readout
        • The mechanics is not perfectly rigid
        • Vibrations in the XY plane can occur
        • Pattern matching of clusters seen in consecutive images yields highest precision
      • Correction of “cross-view” alignment
        • 26 kernels
        • Overall misalignment due to vibrations is corrected by 3D pattern matching of clusters in overlap region between views
      • Clusters can be merged to form grains
        • Useful in some operational conditions
        • Z is obtained by weighted average
      • Output to host memory is optional
        • Data immediately reused for tracking
        • Dump to host memory used only for debugging
      [Figures: optical aberrations (XY curvature, Z curvature, Z axis slant in X and Y, XY trapezium, magnification vs. Z); in-view alignment of Image n and Image n+1; cross-view alignment of View n and View n+1]
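
The slide does not say how the pattern matching is implemented. One common way to align two sets of points, shown here purely as an illustration, is to histogram the displacement between every nearby pair and take the most populated bin. The function name, the search window `max_shift` and the bin size are hypothetical choices.

```python
from collections import Counter


def estimate_shift(ref, moved, max_shift=2.0, bin_size=0.1):
    """Estimate the XY offset of `moved` relative to `ref` by displacement voting.

    Every pair closer than max_shift in both coordinates votes for its
    displacement; true matches pile up in one bin, random pairs spread out.
    Returns None when no pair falls inside the search window.
    """
    votes = Counter()
    for xr, yr in ref:
        for xm, ym in moved:
            dx, dy = xm - xr, ym - yr
            if abs(dx) <= max_shift and abs(dy) <= max_shift:
                votes[(round(dx / bin_size), round(dy / bin_size))] += 1
    if not votes:
        return None
    (bx, by), _ = votes.most_common(1)[0]
    return bx * bin_size, by * bin_size


# Synthetic clusters: the second image is the first one shifted by (0.3, -0.2).
image_n = [(10.0, 10.0), (25.0, 40.0), (70.0, 15.0), (55.0, 60.0)]
image_n1 = [(x + 0.3, y - 0.2) for x, y in image_n]
print(estimate_shift(image_n, image_n1))
```

The same voting idea extends to three dimensions for the cross-view case by adding a Z component to the displacement key.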

  13. Step #2: From dark clusters to grains
      • Results of pattern matching
      [Plots: in-view image-to-image alignment; cross-view alignment; axes in µm. Quoted values, in slide order: XY alignments: 0.12 µm; XY alignments: 0.15 µm; Z alignments: 2.6 µm]

  14. Step #3: From grains to microtracks
      • Track recognition: search for aligned grains
        • Images of grains w.r.t. straight line fit: σ = 50 nm
        • All grains in a track should lie within a cylinder defined by two grains (track “seed”)
      • Cylinder geometry available in different “flavours”:
        • XY distance from axis
        • XY + weighted Z
        • XYZ distance from axis
      • Combinatorial complexity
        • 6~30 grains per track
        • N ≈ 5×10⁵/view
        • N² possible tracks
        • N³ possible grains in tracks
      • Reducing combinations
        • Check only neighbouring grains
        • Check only combinations within a defined angular tolerance
        • “Constructively” enforce the constraints: browsing combinations and discarding them is already a waste of time!
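
The cylinder test can be written down directly for the “XYZ distance from axis” flavour: a grain belongs to the seed if its perpendicular distance from the line through the two seed grains is below a radius. This is a minimal sketch with made-up names and an arbitrary radius; the other two flavours on the slide are not covered.

```python
import math


def distance_from_axis(p, a, b):
    """Perpendicular 3D distance of point p from the line through a and b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    cross = (
        ap[1] * ab[2] - ap[2] * ab[1],
        ap[2] * ab[0] - ap[0] * ab[2],
        ap[0] * ab[1] - ap[1] * ab[0],
    )
    return math.sqrt(sum(c * c for c in cross)) / math.sqrt(sum(c * c for c in ab))


def grains_in_cylinder(grains, seed_a, seed_b, radius):
    """Keep the grains lying inside the cylinder around the seed axis."""
    return [g for g in grains if distance_from_axis(g, seed_a, seed_b) <= radius]


seed_a, seed_b = (0.0, 0.0, 0.0), (0.0, 0.0, 10.0)
grains = [(0.1, 0.0, 2.0), (1.0, 1.0, 5.0), (0.0, 0.05, 8.0)]
print(grains_in_cylinder(grains, seed_a, seed_b, radius=0.2))
```

Applied naively to every pair of grains, this test is exactly the N² seed loop that the “Reducing combinations” bullets set out to avoid.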

  15. Step #3: From grains to microtracks
      • Grain proximity in position/direction space
        • In 2D: arrange grains in a grid of cells, check proximity only within same cell (or nearest neighbours)
        • In 3D: scan the angular acceptance region in fixed steps. For each direction step, define a set of skewed prisms. Arrange grains in the skewed prisms and check for tracks only within each prism
        • Size of prisms and angular step are chosen by fine-tuning
      • Tracking time is proportional to angular acceptance (in this presentation, 1.32 sr ≈ 11% of 4π)
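
One way to picture a skewed prism is as a grid cell sheared along the current direction step: project every grain along the slope onto a reference plane and bin the projected XY position. The sketch below follows that reading, which is an interpretation of the slide rather than the presented implementation; `fill_prisms`, `cell_size` and `z_ref` are hypothetical names.

```python
from collections import defaultdict


def fill_prisms(grains, slope_x, slope_y, cell_size, z_ref=0.0):
    """Bin grains into prisms skewed along the direction (slope_x, slope_y).

    Each grain is projected along the slope onto the plane z = z_ref; grains
    of a track with that slope land in the same cell (or a neighbouring one).
    """
    prisms = defaultdict(list)
    for x, y, z in grains:
        px = x - slope_x * (z - z_ref)
        py = y - slope_y * (z - z_ref)
        key = (int(px // cell_size), int(py // cell_size))
        prisms[key].append((x, y, z))
    return prisms


# Synthetic track with slope 0.5 in X, sampled at four depths.
track = [(5.0 + 0.5 * z, 5.0, z) for z in (0.0, 10.0, 20.0, 30.0)]
print(len(fill_prisms(track, 0.5, 0.0, cell_size=10.0)))  # matching direction step
print(len(fill_prisms(track, 0.0, 0.0, cell_size=10.0)))  # wrong direction step
```

With the matching slope all four grains share one prism, while with a vertical direction step they scatter over several cells, so only the right direction step produces a seed for this track.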

  16. Step #3: From grains to microtracks
      • The tracking algorithm can be easily adapted to other detectors
        • Stacked planes of 2D pixels: resemble Z layers
        • Volume detectors (e.g. Liquid Argon): grains are always treated as 3D entities
        • 4π angular acceptance: in this “flavour”, no prisms are used to constrain the slope, but a limit on track length is set (deviation from straight fit due to multiple scattering, bremsstrahlung, etc.). Long tracks are obtained by stitching “short” pieces in the track merging stage

  17. Step #3: From grains to microtracks
      • Pitfalls of GPU-coding for this algorithm
        • #1 Filling prisms
          • If each thread corresponds to one grain, the risk of “collisions” (i.e. threads accessing the same prism at the same time) is very high
          • “Atomic” functions (CUDA compute capability 1.1 or higher) can be used to settle “race” conditions
          • With too many collisions the code becomes “quasi-serial”
          • “Striding” threads: each thread handles a sequential block of n grains, to increase the chance that they access different prisms
          • Drawback of thread striding: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
          • The code is non-deterministic: the exact order of filling is not specified, while it is ensured that all prisms will contain the right set of grains
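
The effect of thread striding can be illustrated with a toy model. It assumes that threads advance in lockstep and that grains are stored in spatial order, so that neighbouring grains tend to fall in the same prism; neither assumption comes from the slide, and real GPU scheduling is more complicated. The function name and the data are hypothetical.

```python
def colliding_accesses(prism_of_grain, n_threads, block):
    """Count accesses that hit a prism already touched by another thread in the same step.

    Thread t handles the sequential block of grains t*block .. t*block+block-1,
    one grain per lockstep step; block=1 means one thread per grain.
    """
    collisions = 0
    for step in range(block):
        touched = set()
        for t in range(n_threads):
            g = t * block + step
            if g >= len(prism_of_grain):
                continue
            prism = prism_of_grain[g]
            if prism in touched:
                collisions += 1
            touched.add(prism)
    return collisions


# 32 grains in spatial order, 4 consecutive grains per prism (8 prisms).
prism_of_grain = [g // 4 for g in range(32)]
print(colliding_accesses(prism_of_grain, n_threads=32, block=1))
print(colliding_accesses(prism_of_grain, n_threads=8, block=4))
```

In this model the one-thread-per-grain layout makes most accesses collide, while blocks of four grains per thread send concurrent accesses to different prisms, at the price of threads reading memory locations that are far apart.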

  18. Step #3: From grains to microtracks
      • Pitfalls of GPU-coding for this algorithm
        • #2 Seed scanning
          • One seed is formed by a pair of grains in the same prism
          • If each thread corresponds to one prism, with an average fill of N grains the fluctuations will be O(N^(1/2))
          • The fluctuations in seeds will be O(N^(3/2))
          • Example: if the number of grains fluctuates from 6 to 12, the number of pairs fluctuates from 15 to 66!
          • Only a few threads will be running while all others have completed
          • Allocate one thread per seed
          • A seed in a crowded prism will still take more time because of more grains to check, but the fluctuation is only O(N^(1/2))
          • Further optimisation is possible, but not worth the effort
          • Drawback: memory access is poorly coalesced (but tracks are not known in advance, and the memory span is broad)
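
The pair counts quoted in the example follow from n(n-1)/2, and the "one thread per seed" layout amounts to flattening all pairs of all prisms into a single work list. The sketch below shows both; the function names and the toy prisms are illustrative assumptions, and the list is built serially rather than on a GPU.

```python
def seeds_in_prism(n_grains):
    """Number of grain pairs (seeds) in a prism holding n_grains grains."""
    return n_grains * (n_grains - 1) // 2


def enumerate_seeds(prisms):
    """Flatten every (prism index, i, j) grain pair into one work list.

    Each entry could be handed to its own thread, so that a crowded prism
    produces more work items instead of one long-running thread.
    """
    return [(p, i, j)
            for p, grains in enumerate(prisms)
            for i in range(len(grains))
            for j in range(i + 1, len(grains))]


print(seeds_in_prism(6), seeds_in_prism(12))
work = enumerate_seeds([["a", "b", "c"], ["d", "e"]])
print(len(work), work[0])
```

Doubling the fill of a prism roughly quadruples its seeds, which is why balancing the load per prism leaves most threads idle while a few crowded prisms finish.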

  19. Step #3: From grains to microtracks
      • Pitfalls of GPU-coding for this algorithm
        • #3 Track merging
          • In the ideal case, a track with n grains has n(n-1)/2 seeds
          • The same track is reconstructed several times
          • Track “clones” must be merged
          • Tracks are checked in pairs comparing position and direction
          • The track with fewer grains is suppressed (or the “second” in case of a tie)
          • Reduction of combinations by proximity (tracks are stored in a grid of XY cells)
          • The code is non-deterministic: the order of tracks matters in producing the result
          • The “quality” of the set is always the same, but small differences can arise (0.1%)
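
The clone suppression rule can be sketched serially as follows. Tracks are compared over all pairs here, whereas the slide limits the comparison to tracks in nearby XY cells; the dictionary layout of a track, the tolerances and the function name are assumptions made for the example.

```python
def merge_clones(tracks, pos_tol, slope_tol):
    """Suppress duplicate tracks; each track is a dict with x, y, sx, sy, n."""
    suppressed = set()
    for i, a in enumerate(tracks):
        for j in range(i + 1, len(tracks)):
            if i in suppressed:
                break
            if j in suppressed:
                continue
            b = tracks[j]
            same_pos = (abs(a["x"] - b["x"]) <= pos_tol
                        and abs(a["y"] - b["y"]) <= pos_tol)
            same_dir = (abs(a["sx"] - b["sx"]) <= slope_tol
                        and abs(a["sy"] - b["sy"]) <= slope_tol)
            if same_pos and same_dir:
                # Fewer grains loses; on a tie the second track (j) is dropped.
                suppressed.add(j if b["n"] <= a["n"] else i)
    return [t for k, t in enumerate(tracks) if k not in suppressed]


tracks = [
    {"x": 0.0, "y": 0.0, "sx": 0.10, "sy": 0.00, "n": 8},
    {"x": 0.1, "y": 0.0, "sx": 0.10, "sy": 0.01, "n": 10},
    {"x": 50.0, "y": 50.0, "sx": 0.10, "sy": 0.00, "n": 7},
    {"x": 50.0, "y": 50.1, "sx": 0.10, "sy": 0.00, "n": 7},
]
kept = merge_clones(tracks, pos_tol=0.5, slope_tol=0.05)
print([t["n"] for t in kept])
```

Because the winner of each comparison depends on which pairs are examined first, a parallel version that visits pairs in a different order can keep a slightly different set, in line with the order dependence noted on the slide.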

  20. Performances: view correction/mapping
      • Time spent in image/view correction
      [Plot: time (ms) vs. dark clusters for GTX640, Tesla C2050, GTX690 (1/2), GTX780Ti]

  21. Performances: view correction/mapping
      • Most time is spent in tracking
      • Fraction of total time (Steps 1+2+3) spent in cluster-grain processing (Steps 1+2)

  22. Performances: tracking
      • Tracking time vs. grains/view
      [Plot: time (ms) vs. grains for GTX640, Tesla C2050, GTX690 (1/2), GTX780Ti]

  23. Performances: tracking
      • Tracking time vs. tracks/view
      • No visible nonlinear bottlenecks
      [Plot: time (ms) vs. tracks for GTX640, Tesla C2050, GTX690 (1/2), GTX780Ti]

  24. Performances: tracking
      • Tracking time vs. tracks/view
      • The dependency is better than N²: the weight of computation stages with high combinatorial complexity is relatively small
      [Plots: Log10(Time) vs. Log10(grains/view) for GTX780Ti and GTX640; fitted exponents, in slide order: 1.88 and 1.71]
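
An exponent such as those quoted on this slide is the slope of a straight-line fit in log-log space. The sketch below shows such a fit with an ordinary least-squares formula; the input data are synthetic, generated from a known power law, and are not the measurements from the talk.

```python
import math


def fit_exponent(sizes, times):
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic timings following time = c * N**1.8 (made-up constant and exponent).
sizes = [1e4, 3e4, 1e5, 3e5]
times = [2e-6 * n ** 1.8 for n in sizes]
print(round(fit_exponent(sizes, times), 3))
```

An exponent below 2 on real timings would indicate, as the slide argues, that the stages with quadratic or worse combinatorics do not dominate the total time.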

  25. Performances: tracking
      • Compute work vs. grains
        • Compute work := Time(ms) × Cores × Clock(MHz)
      • More recent architectures seem less efficient
        • Effect of branch divergence with more cores/multiprocessor?
      [Plot: Log10(compute work) vs. Log10(grains/view) for GTX640 (Fermi 2.1), Tesla C2050 (Fermi 2.0), GTX690 (1/2) (Kepler 3.0), GTX780Ti (Kepler 3.5)]
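
The compute work metric normalises the elapsed time by the raw capacity of the board. The sketch below applies the slide's definition to two hypothetical boards; the board names, timings, core counts and clocks are placeholders chosen for illustration, not specifications or measurements of the GPUs listed above.

```python
import math


def compute_work(time_ms, cores, clock_mhz):
    """Compute work as defined on the slide: Time(ms) x Cores x Clock(MHz)."""
    return time_ms * cores * clock_mhz


# Hypothetical boards: B finishes sooner but has far more cores.
boards = {
    "board_A": {"time_ms": 100.0, "cores": 400, "clock_mhz": 1000.0},
    "board_B": {"time_ms": 30.0, "cores": 2000, "clock_mhz": 900.0},
}
for name, spec in boards.items():
    work = compute_work(**spec)
    print(f"{name}: compute work = {work:.3g}, log10 = {math.log10(work):.2f}")
```

A board can therefore be faster in wall-clock time and still show a larger compute work, which is the sense in which the slide calls more recent architectures less efficient.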

  26. Performances: tracking
      • Identifying and understanding bottlenecks
        • Data from GTX640
        • Branch divergence is almost only end-wait (completed threads wait for others running)
          • Difficult to improve without additional complications in the code
          • May be worth the effort for Maxwell architecture
        • Track merging jumps over “Track” data blocks that are anyway large
          • Memory could be coalesced by reshuffling thread order (under study)
        • Track fitting is negligible w.r.t. track recognition

  27. Conclusions
      • The solution shown uses GPUs to implement a complex algorithm with many logical branches and non-trivial memory access patterns
      • The tracking portion (Step #3) is suitable for a wide range of types of straight tracks (magnetic field weak or absent)
        • Tracking planes or volume detectors natively supported
      • “Know your code”: GPUs are effective, but efficient solutions need careful optimisation
      • The algorithm's performance scales well with data size
      • There is room to improve performance with the latest generation of boards
