Database Operations on GPU Changchang Wu 4/18/2007
Outline • Database Operations on GPU • Point List Generation on GPU • Nearest Neighbor Searching on GPU
Design Issues • Low bandwidth between GPU and CPU • Avoid frame buffer readbacks • No arbitrary writes • Avoid data rearrangements • Programmable pipeline has poor branching • Evaluate branches using fixed function tests
Design Overview • Use depth test functionality of GPUs for performing comparisons • Implements all possible comparisons <, <=, >=, >, ==, !=, ALWAYS, NEVER • Use stencil test for data validation and storing results of comparison operations • Use occlusion query to count number of elements that satisfy some condition
Basic Operations • Basic SQL query: SELECT A FROM T WHERE C • A = attributes or aggregations (SUM, COUNT, MAX, etc.) • T = relational table • C = Boolean combination of predicates (using operators AND, OR, NOT)
Basic Operations • Predicates – ai op constant or ai op aj • Op is one of <,>,<=,>=,!=, =, TRUE, FALSE • Boolean combinations – Conjunctive Normal Form (CNF) expression evaluation • Aggregations – COUNT, SUM, MAX, MEDIAN, AVG
Predicate Evaluation • ai op constant (d) • Copy the attribute values ai into the depth buffer • Define the comparison operation using the depth test • Draw a screen-filling quad at depth d • OpenGL calls: glDepthFunc(…); glStencilOp(fail, zfail, zpass);
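The pass above can be sketched on the CPU in Python. This is only an illustration of the logic, not GPU code: `DEPTH_FUNCS`, `MIRROR`, and `depth_test_pass` are hypothetical names, and plain integers stand in for normalized depth values. It assumes the OpenGL convention that a fragment passes when `incoming FUNC stored`, so with the attributes in the depth buffer and the quad drawn at depth d, the hardware evaluates `d FUNC ai`; to obtain `ai op d` the function has to be mirrored.

```python
import operator

# Depth functions in the OpenGL sense: the test is "incoming FUNC stored".
DEPTH_FUNCS = {
    "LESS": operator.lt, "LEQUAL": operator.le,
    "GREATER": operator.gt, "GEQUAL": operator.ge,
    "EQUAL": operator.eq, "NOTEQUAL": operator.ne,
    "ALWAYS": lambda incoming, stored: True,
    "NEVER": lambda incoming, stored: False,
}

# To evaluate "ai OP d" with d as the incoming depth, mirror the operator.
MIRROR = {"<": "GREATER", "<=": "GEQUAL", ">": "LESS",
          ">=": "LEQUAL", "==": "EQUAL", "!=": "NOTEQUAL"}

def depth_test_pass(depth_buffer, op, d):
    """Simulate drawing a screen-filling quad at depth d.

    depth_buffer holds one attribute value per pixel. Returns a stencil
    list with 1 where the predicate "ai op d" holds and 0 elsewhere.
    """
    func = DEPTH_FUNCS[MIRROR[op]]
    return [1 if func(d, a) else 0 for a in depth_buffer]

stencil = depth_test_pass([10, 24, 37, 99, 192], ">=", 37)
```

Here `stencil` marks the records whose attribute is at least 37, which is the role the stencil buffer plays after the pass.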
Predicate Evaluation • Comparing two attributes: • ai op aj is treated as (ai – aj) op 0 • Semi-linear queries • Easy to compute with a fragment shader
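A small Python sketch of the rewrite `(ai – aj) op 0`. It assumes that a semi-linear query means a linear combination of a record's attributes compared against a constant; the helper names `compare_attributes` and `semilinear` are hypothetical, and on the GPU the arithmetic would run per pixel in a fragment shader.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def compare_attributes(ai, aj, op):
    """Evaluate "ai op aj" per record as "(ai - aj) op 0"."""
    return [OPS[op](x - y, 0) for x, y in zip(ai, aj)]

def semilinear(rows, coeffs, op, b):
    """Evaluate "dot(coeffs, row) op b" for every record (row)."""
    return [OPS[op](sum(c * v for c, v in zip(coeffs, row)), b)
            for row in rows]

hits = compare_attributes([5, 2, 7], [3, 2, 9], ">")
```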
Boolean Combinations • Expression provided as a CNF • A CNF has the form (A1 AND A2 AND … AND Ak) where Ai = (Bi1 OR Bi2 OR … OR Bimi) • A CNF does not have a NOT operator • If the expression has a NOT operator, invert the comparison operation to eliminate it, e.g. NOT (ai < d) => (ai >= d)
Range Query • Compute ai within [low, high] • Evaluated as ( ai >= low ) AND ( ai <= high )
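The CNF evaluation and the range query can be sketched together in Python. The stencil bookkeeping below is simplified and is not the exact sequence of stencil operations used on the GPU; `eval_cnf` and the `(attribute index, operator, constant)` predicate tuples are hypothetical conventions chosen for this illustration.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def eval_cnf(records, cnf):
    """Evaluate a CNF over records.

    cnf is a list of clauses that are AND-ed together; each clause is a
    list of predicates (attr_index, op, constant) that are OR-ed.
    Returns a stencil list: 1 for records that satisfy the expression.
    """
    stencil = [1] * len(records)          # every record starts as valid
    for clause in cnf:
        clause_hit = [0] * len(records)
        for attr, op, const in clause:    # one rendering pass per predicate
            for r, rec in enumerate(records):
                # Only records still valid after earlier clauses can pass.
                if stencil[r] and OPS[op](rec[attr], const):
                    clause_hit[r] = 1
        stencil = clause_hit
    return stencil

records = [(5,), (15,), (25,), (35,)]
in_range = eval_cnf(records, [[(0, ">=", 10)], [(0, "<=", 30)]])
```

The range query [10, 30] is the two-clause case: one pass for `ai >= low`, one for `ai <= high`.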
Aggregations • COUNT, MAX, MIN, SUM, AVG • No data rearrangements
COUNT • Use occlusion queries to get pixel pass count • Syntax: • Begin occlusion query • Perform database operation • End occlusion query • Get count of number of attributes that passed database operation • Involves no additional overhead!
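The begin/perform/end pattern can be mimicked in Python. `OcclusionQuery` and `draw_quad` are hypothetical stand-ins; on real hardware the counter is maintained by the GPU's occlusion query, which reports how many fragments passed the tests while the query was active.

```python
class OcclusionQuery:
    """Counts fragments that pass while the query is active."""
    def __init__(self):
        self.samples_passed = 0
        self.active = False

    def begin(self):
        self.samples_passed = 0
        self.active = True

    def end(self):
        self.active = False

def draw_quad(depth_buffer, predicate, query=None):
    """Run the predicate on every pixel; tally passes in the query."""
    stencil = []
    for a in depth_buffer:
        passed = predicate(a)
        if passed and query is not None and query.active:
            query.samples_passed += 1
        stencil.append(1 if passed else 0)
    return stencil

query = OcclusionQuery()
query.begin()                                    # begin occlusion query
draw_quad([10, 24, 37, 99, 192, 200, 200, 232],
          lambda a: a >= 128, query)             # database operation
query.end()                                      # end occlusion query
count = query.samples_passed                     # COUNT of matching records
```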
MAX, MIN, MEDIAN • We compute Kth-largest number • Traditional algorithms require data rearrangements • We perform no data rearrangements, no frame buffer readbacks
K-th Largest Number • By comparing and counting, determine every bit in order from MSB to LSB
Example: Parallel Max • S = {10, 24, 37, 99, 192, 200, 200, 232} • Step 1: Draw quad at 128 (10000000): 192, 200, 200, 232 pass • Step 2: Draw quad at 192 (11000000): 192, 200, 200, 232 pass • Step 3: Draw quad at 224 (11100000): 232 passes • Step 4: Draw quad at 240 (11110000): no values pass • Step 5: Draw quad at 232 (11101000): 232 passes • Steps 6, 7, 8: Draw quads at 236, 234, 233: no values pass, so Max is 232
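The bit-by-bit search can be written as a short Python sketch. The function name `kth_largest` and the 8-bit default are assumptions for illustration; on the GPU each `count` would come from one quad rendering plus an occlusion query rather than a loop over the values.

```python
def kth_largest(values, k, bits=8):
    """Find the k-th largest value by fixing bits from MSB to LSB.

    For each bit, tentatively set it and count how many values are >=
    the candidate. The bit is kept if at least k values pass.
    Returns the result and the list of candidates that were tested.
    """
    x = 0
    candidates = []
    for i in reversed(range(bits)):
        t = x | (1 << i)                  # candidate with bit i set
        candidates.append(t)
        count = sum(1 for v in values if v >= t)   # one "draw + count"
        if count >= k:
            x = t                         # bit i belongs to the answer
    return x, candidates

S = [10, 24, 37, 99, 192, 200, 200, 232]
maximum, tested = kth_largest(S, 1)
```

With k = 1 the tested candidates are 128, 192, 224, 240, 232, 236, 234, 233, matching the quad depths in the example, and no data is rearranged.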
Accumulator, Mean • Accumulator - Use sorting algorithm and add all the values • Mean – Use accumulator and divide by n • Interval range arithmetic • Alternative algorithm • Use fragment programs – requires very few renderings • Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]
Accumulator • Data representation is of the form $a_k 2^k + a_{k-1} 2^{k-1} + \dots + a_0$ • $\mathrm{Sum} = \mathrm{sum}(a_k)\, 2^k + \mathrm{sum}(a_{k-1})\, 2^{k-1} + \dots + \mathrm{sum}(a_0)$ • Current GPUs support no bit-masking operations
The Algorithm • >= 0.5 means the i-th bit is 1
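A Python sketch of the bitwise accumulator. It assumes that the ">= 0.5" test refers to the fractional part of v / 2^(i+1), which is one way to read a bit without bit-masking operations; the slide does not spell the expression out, and `bit_is_set` and `accumulate` are hypothetical names.

```python
import math

def bit_is_set(v, i):
    """Test bit i of integer v using only division and a fraction.

    frac(v / 2^(i+1)) >= 0.5 exactly when bit i of v is 1.
    """
    x = v / 2 ** (i + 1)
    return (x - math.floor(x)) >= 0.5

def accumulate(values, bits=8):
    """Sum the values as sum over i of count(bit i set) * 2^i."""
    total = 0
    for i in range(bits):
        # On the GPU this count would be one pass plus an occlusion query.
        count = sum(1 for v in values if bit_is_set(v, i))
        total += count << i
    return total

S = [10, 24, 37, 99, 192, 200, 200, 232]
total = accumulate(S)
```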
Implementation • Algorithm • CPU – Intel compiler 7.1 with hyper-threading, multi-threading, SIMD optimizations • GPU – NVIDIA Cg Compiler • Hardware • Dell Precision Workstation with Dual 2.8GHz Xeon Processor • NVIDIA GeForce FX 5900 Ultra GPU • 2GB RAM
Benchmarks • TCP/IP database with 1 million records and four attributes • Census database with 360K records
Analysis: Issues • Precision • Copy time • Integer arithmetic • Depth compare masking • Memory management • No Branching • No random writes
Analysis: Performance • Relative Performance Gain • High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count • Medium Performance – Kth-largest number • Low Performance - Accumulator
High Performance • Parallel pixel processing engines • Pipelining • Early Z-cull • Eliminate branch mispredictions
Medium Performance • Parallelism • FX 5900 has clock speed 450MHz, 8 pixel processing engines • Rendering a single 1000x1000 quad takes 0.278ms • Rendering 19 such quads takes 5.28ms. Observed time is 6.6ms • 80% efficiency in parallelism!!
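The figures on this slide can be reproduced with a few lines of Python, assuming each pixel engine processes one pixel per clock cycle (an assumption implied by the numbers rather than stated on the slide).

```python
clock_hz = 450e6            # FX 5900 clock speed
pixel_engines = 8           # parallel pixel processing engines
pixels = 1000 * 1000        # one screen-filling 1000x1000 quad

# Ideal time for one quad, in milliseconds.
quad_ms = pixels / (clock_hz * pixel_engines) * 1e3

ideal_ms = 19 * quad_ms     # 19 quads
observed_ms = 6.6           # measured time from the slide
efficiency = ideal_ms / observed_ms
```

This gives roughly 0.278 ms per quad, 5.28 ms for 19 quads, and an efficiency of about 0.8.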
Low Performance • No gain over SIMD based CPU implementation • Two main reasons: • Lack of integer-arithmetic • Clock rate
Advantages • Algorithms progress at GPU growth rate • Offload CPU work • Fast due to massive parallelism on GPUs • Algorithms could be generalized to any geometric shape • Eg. Max value within a triangular region • Commodity hardware!
GPU Point List Generation • Data compaction
Timing • Reduces a highly sparse matrix with N elements to a list of its M active entries in O(N) + M log N steps
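One scheme with this cost profile is a pyramid of counts: building it touches every element once, O(N), and each of the M output entries is located by a top-down walk through log N levels. The slide does not name the method here, so the one-dimensional Python sketch below is an assumption offered for illustration; `build_pyramid` and `point_list` are hypothetical names.

```python
def build_pyramid(active):
    """Build levels of pairwise counts; levels[-1][0] is M."""
    n = 1
    while n < len(active):
        n *= 2                                    # pad to a power of two
    level = [1 if a else 0 for a in active] + [0] * (n - len(active))
    levels = [level]
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def point_list(active):
    """Return the indices of the active entries, in order."""
    levels = build_pyramid(active)
    m = levels[-1][0]
    out = []
    for k in range(m):                # each output is independent
        rank, idx = k, 0
        for level in reversed(levels[:-1]):       # walk down log N levels
            idx *= 2
            left = level[idx]         # active entries under the left child
            if rank >= left:          # target lies under the right child
                rank -= left
                idx += 1
        out.append(idx)
    return out

points = point_list([0, 0, 1, 0, 0, 0, 0, 1, 0, 1])
```

Because every output index is found independently, the walk for each of the M entries could run in parallel, one per output pixel.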
Applications • Image Analysis • Feature Detection • Volume Analysis • Sparse Matrix Generation