
High Performance Sorting and Searching using Graphics Processors

Naga K. Govindaraju, Microsoft Concurrency


Presentation Transcript


  1. High Performance Sorting and Searching using Graphics Processors Naga K. Govindaraju, Microsoft Concurrency

  2. Sorting and Searching “I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth

  3. Sorting and Searching • Well studied • High performance computing • Databases • Computer graphics • Programming languages • ... • Google MapReduce algorithm • SPEC benchmark routine!

  4. Massive Databases • Terabyte data sets are common • Google sorts more than 100 billion terms in its index • > 1 trillion records in the web index! • Database sizes are rapidly increasing! • Maximum DB sizes increase 3x per year (http://www.wintercorp.com) • Processor improvements are not matching the information explosion

  5. CPU vs. GPU [system diagram: CPU (3 GHz, 2 x 1 MB cache) with system memory (2 GB) and AGP memory (512 MB), connected over a PCI-E bus (4 GB/s) to two GPUs (690 MHz), each with video memory (512 MB)]

  6. Massive Data Handling on CPUs • Require random memory accesses • Small CPU caches (< 2MB) • Random memory accesses slower than even sequential disk accesses • High memory latency • Huge memory to compute gap! • CPUs are deeply pipelined • Pentium 4 has 30 pipeline stages • Do not hide latency - high cycles per instruction (CPI) • CPU is under-utilized for data intensive applications

  7. Massive Data Handling on CPUs • Sorting is hard! • GPU a potentially scalable solution to terabyte sorting and scientific computing • We beat the sorting benchmark with GPUs and provide a scalable solution

  8. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Low memory latency pipeline • Programmable • High growth rate • Power-efficient

  9. GPU: Commodity Processor [photos: laptops, consoles, cell phones, PSP, desktops]

  10. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • 10x more operations per sec than CPUs • High memory bandwidth • Better hides memory latency pipeline • Programmable • High growth rate • Power-efficient

  11. Parallelism on GPUs [graphics FLOPS chart: GPU – 1.3 TFLOPS vs. CPU – 25.6 GFLOPS]

  12. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Better hides latency pipeline • Programmable • 10x more memory bandwidth than CPUs • High growth rate • Power-efficient

  13. Graphics Pipeline • Low pipeline depth • 56 GB/s memory bandwidth • Hides memory latency!! [pipeline diagram: programmable vertex processing (fp32) → polygon setup, culling, rasterization → programmable per-pixel math (fp32) → per-pixel texture, fp16 blending → Z-buffer, fp16 blending, anti-aliasing (MRT) → memory (image)]

  14. NON-Graphics Pipeline Abstraction (Courtesy: David Kirk, Chief Scientist, NVIDIA) [pipeline diagram: programmable MIMD processing (fp32) on data → SIMD "rasterization" setup on lists → programmable SIMD processing (fp32) on data → data fetch, fp16 blending → predicated write, fp16 blend, multiple output → memory (data)]

  15. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Better hides latency pipeline • Programmable • High growth rate • Power-efficient

  16. GPU Growth Rate [chart: GPU vs. CPU growth rate] • Exploiting technology moving faster than Moore's Law

  17. Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Better hides latency pipeline • Programmable • High growth rate • Power-efficient

  18. CPU vs. GPU(Henry Moreton: NVIDIA, Aug. 2005)

  19. GPUs for Sorting and Searching: Issues • No support for arbitrary writes • Optimized CPU algorithms do not map! • Lack of support for general data types • Cache-efficient algorithms • Small data caches • No cache information from vendors • Out-of-core algorithms • Limited GPU memory

  20. Outline • Overview • Sorting and Searching on GPUs • Applications • Conclusions and Future Work

  21. Sorting on GPUs • Adaptive sorting algorithms • Extent of sorted order in a sequence • General sorting algorithms • External memory sorting algorithms

  22. Adaptive Sorting on GPUs • Prior adaptive sorting algorithms require random data writes • GPUs optimized for minimum depth or visible surface computation • Using depth test functionality • Design adaptive sorting using only minimum computations

  23. Adaptive Sorting Algorithm (N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. of ACM I3D, 2005) • Multiple iterations • Each iteration uses a two-pass algorithm • First pass – compute an increasing sequence M • Second pass – compute the sorted elements in M • Iterate on the remaining unsorted elements

  24. Increasing Sequence Given a sequence S = {x1,…, xn}, an element xi belongs to M if and only if xi ≤ xj for all xj in S with i < j

  25. Increasing Sequence [diagram: x1 x2 x3 … xi-1 xi xi+1 … xn-2 xn-1 xn, with M highlighted] M is an increasing sequence

  26. Increasing Sequence Computation x1 x2 … xi-1 xi xi+1 … xn-1 xn

  27. Increasing Sequence Computation Compute: xn ≤ ∞, so xn always belongs to M

  28. Increasing Sequence Computation Compute: xi ≤ Min? Yes → prepend xi to M, set Min = xi [scan continues over xi xi+1 … xn-1 xn]

  29. Increasing Sequence Computation Compute: x1 ≤ {x2,…,xn}? x1 x2 … xi-1 xi xi+1 … xn-1 xn
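The first pass above is a backward scan with a running minimum. A minimal CPU-side Python sketch of it (the function name is mine; on the GPU these minimum computations are realized with the depth test, as the talk describes):

```python
def increasing_sequence(seq):
    """Pass 1: x_i belongs to M iff x_i <= every later element,
    i.e. x_i <= the running minimum of the suffix to its right."""
    m = []
    running_min = float("inf")  # x_n always qualifies (x_n <= infinity)
    for x in reversed(seq):
        if x <= running_min:
            m.append(x)         # conceptually, prepend x to M
            running_min = x
    m.reverse()
    return m
```

For example, `increasing_sequence([8, 1, 2, 3, 4, 5, 6, 7, 9])` returns `[1, 2, 3, 4, 5, 6, 7, 9]`: only the out-of-place 8 is excluded.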

  30. Computing Sorted Elements Theorem 1: Given the increasing sequence M, the rank of an element xi in M is determined if xi < min (I − M)

  31. Computing Sorted Elements X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn

  32. Computing Sorted Elements [diagram: elements of M (x1, x3, …, xi, xi+1, …, xn-2, xn-1) compared (≤) against the remaining elements (x2, …, xi-1, …, xn)]

  33. Computing Sorted Elements • Linear-time algorithm • Maintaining minimum

  34. Computing Sorted Elements Compute: scan the elements of M (x1 x3 … xi xi+1 … xn-2 xn-1) against the rest (x2 … xi-1 … xn)

  35. Computing Sorted Elements Compute: xi in M? No → update min. Yes → xi ≤ min? Yes → append xi to the sorted list [scan over x1 x2 … xi-1 xi]

  36. Computing Sorted Elements Compute: x1 x2 … xi-1 xi xi+1 … xn-1 xn
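The second pass can be sketched the same way. Here `in_m` is a hypothetical membership mask produced by the first pass (not a name from the talk); following the flow on slide 35, only elements outside M update the running minimum:

```python
def place_sorted(seq, in_m):
    """Pass 2 (Theorem 1): left-to-right scan.  An element of M whose
    value is <= the minimum of all non-M elements seen so far has its
    final rank determined; everything else stays for the next iteration."""
    sorted_out, remaining = [], []
    min_not_in_m = float("inf")
    for x, member in zip(seq, in_m):
        if member and x <= min_not_in_m:
            sorted_out.append(x)       # final position is known
        else:
            remaining.append(x)        # carry over to the next iteration
            if not member:
                min_not_in_m = min(min_not_in_m, x)
    return sorted_out, remaining
```

On the sequence 8 1 2 3 4 5 6 7 9 (where only 8 is outside M), this emits 1…7 as sorted and leaves 8 and 9 for the next iteration, since 9 fails the test 9 ≤ 8.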

  37. Algorithm Analysis Knuth’s measure of disorder: Given a sequence I and its longest increasing sequence LIS(I), the sequence of disordered elements Y = I - LIS(I) Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2 ||Y|| + 1) iterations

  38. Pictorial Proof X1 X2 …Xl Xl+1 Xl+2 …Xm Xm+1 Xm+2 ...Xq Xq+1 Xq+2 ...Xn

  39. Pictorial Proof [each disordered segment takes 2 iterations] x1 x2 … xl | xl+1 xl+2 … xm | xm+1 xm+2 … xq | xq+1 xq+2 … xn

  40. Example 8 1 2 3 4 5 6 7 9

  41. Sorted Example 8 9 1 2 3 4 5 6 7
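Putting the two passes together, the worked example above can be run end to end with a minimal CPU-side Python sketch (names are mine, not from the talk; the GPU implementation performs the minimum computations via the depth test). Concatenating the sorted output of each iteration is correct because every emitted element is ≤ everything left behind:

```python
def adaptive_sort(seq):
    """Iterate the two-pass algorithm of slides 23-36 until nothing remains."""
    out, remaining = [], list(seq)
    while remaining:
        n = len(remaining)
        # Pass 1: mark the increasing sequence M (backward running minimum).
        in_m = [False] * n
        running_min = float("inf")
        for i in range(n - 1, -1, -1):
            if remaining[i] <= running_min:
                in_m[i] = True
                running_min = remaining[i]
        # Pass 2: emit elements of M whose rank is determined (Theorem 1).
        srt, rest = [], []
        min_not_in_m = float("inf")
        for x, member in zip(remaining, in_m):
            if member and x <= min_not_in_m:
                srt.append(x)
            else:
                rest.append(x)
                if not member:
                    min_not_in_m = min(min_not_in_m, x)
        out.extend(srt)
        remaining = rest
    return out
```

On the example input 8 1 2 3 4 5 6 7 9 this finishes in two iterations, consistent with Theorem 2's bound of 2 ||Y|| + 1 for ||Y|| = 1 disordered element.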

  42. Analysis

  43. Advantages • Linear in the input size and sorted extent • Works well on almost-sorted input • Maps well to GPUs • Uses depth test functionality for minimum operations • Useful for performing 3D visibility ordering • Performs transparency computations on dynamic 3D environments • Cons: • Expected time: O(n² – 2n√n) on random sequences

  44. Video: Transparent PowerPlant • 790K polygons • Depth complexity ~ 13 • 1600x1200 resolution • NVIDIA GeForce 6800 • 5-8 fps

  45. Video: Transparent PowerPlant

  46. General Sorting on GPUs (N. Govindaraju, N. Raghuvanshi and D. Manocha, Proc. of ACM SIGMOD, 2005) • General datasets • High performance

  47. General Sorting on GPUs • Design sorting algorithms with deterministic memory accesses – “Texturing” on GPUs • 56 GB/s peak memory bandwidth • Can better hide the memory latency!! • Require minimum and maximum computations – “Blending functionality” on GPUs • Low branching overhead • No data dependencies • Utilize high parallelism on GPUs

  48. GPU-Based Sorting Networks • Represent data as 2D arrays • Multi-stage algorithm • Each stage involves multiple steps • In each step • Compare one array element against exactly one other element at fixed distance • Perform a conditional assignment (MIN or MAX) at each element location
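One classic sorting network with exactly this step structure is Batcher's bitonic network: each step compares element i against its partner at a fixed distance (i XOR j) and conditionally writes the MIN or MAX. A CPU-side Python sketch of the step pattern (the slides describe the general scheme; bitonic sort is offered here as a concrete instance, not necessarily the exact network used):

```python
def bitonic_sort_steps(a):
    """In-place bitonic sorting network (n must be a power of two).
    Stages k = 2, 4, ..., n; within a stage, steps at distances
    j = k/2, k/4, ..., 1.  Every element is compared against exactly
    one partner per step -- a data-independent access pattern."""
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "network assumes power-of-two input"
    k = 2
    while k <= n:                      # stages
        j = k // 2
        while j >= 1:                  # steps within a stage
            for i in range(n):
                partner = i ^ j        # compare at fixed distance j
                if partner > i:
                    if (i & k) == 0:   # ascending sub-block: keep MIN first
                        if a[i] > a[partner]:
                            a[i], a[partner] = a[partner], a[i]
                    else:              # descending sub-block: keep MAX first
                        if a[i] < a[partner]:
                            a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Because every comparison location and partner is fixed in advance, the memory accesses are fully deterministic, which is what lets the GPU version use texturing for reads and blending for the conditional MIN/MAX writes.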

  49. Sorting [Flash animation of the sorting network]

  50. 2D Memory Addressing • GPUs optimized for 2D representations • Map 1D arrays to 2D arrays • Minimum and maximum regions mapped to row-aligned or column-aligned quads
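The 1D-to-2D index mapping can be sketched as follows (`width` is the texture width, a hypothetical parameter for illustration):

```python
def to_2d(i, width):
    """Map a 1D array index to a (row, column) texel address in a
    2D texture of the given width."""
    return divmod(i, width)

def to_1d(row, col, width):
    """Inverse mapping: texel address back to the 1D array index."""
    return row * width + col
```

With this layout, elements compared at a fixed distance fall into row-aligned or column-aligned regions, which is what allows the MIN/MAX regions to be rendered as quads.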
