GPGPU Assignment 2: String Matching
by Dominik Seifert (B97902122)
Overview
• Data Alignment
• Hashtable
  • MurmurHash Function
  • "Stupid Parallel Hashing"
  • Lookup
• The Complete Algorithm
• The Little Things
Data Alignment (1/3): The Alignment Trap
• x86 supports this (reading a word through a pointer that is not word-aligned)
• But GPUs don't!
• On the GPU, word-sized pointers must always be word-size-aligned!
Data Alignment (2/3)
• Copy all words (corpus and query separately) into a new array consisting of 4-byte chunks
  • Improves memory access patterns
  • Allows us to always consider 4 bytes at a time
  • Needs more space, but who cares!
• Keep the old offsets and translate them to new offsets (counted in 4-byte chunks, using integer division) with:
  • AlignedWordOffset = OrigWordOffset / 4 + WordIndex
• What's the size of the i'th string?
  • strlen(i'th string) == offset(i+1) - offset(i) - 1
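The two formulas above can be sketched as small host-side helpers. This is only an illustration: the function names and the convention that the offsets array carries one extra entry equal to the total size are assumptions, not taken from the slides.

```cpp
#include <cstddef>
#include <vector>

// Length of the i'th word without its terminator.
// Assumption: offsets holds wordCount + 1 byte offsets of null-terminated
// words, with the last entry equal to the total size.
std::size_t wordLength(const std::vector<std::size_t>& offsets, std::size_t i) {
    return offsets[i + 1] - offsets[i] - 1;
}

// New offset of a word, counted in 4-byte chunks (integer division).
std::size_t alignedWordOffset(std::size_t origOffset, std::size_t wordIndex) {
    return origOffset / 4 + wordIndex;
}
```

For three hypothetical words stored at byte offsets 0, 3 and 8 (total size 10), the lengths are 2, 4 and 1, and the aligned chunk offsets are 0, 1 and 4.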
Data Alignment (3/3)
• AlignedWordOffset = OrigWordOffset / 4 + WordIndex
• NewSize = 4 x (TotalSize / 4 + WordCount)
• Example:
  • TotalSize = 10
  • WordCount = 3
  • NewSize = 4 x (10 / 4 + 3) = 4 x (2 + 3) = 4 x 5 = 20
[Figure: original string (10 bytes) next to the aligned string (5 x 4 = 20 bytes)]
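A minimal sequential sketch of the copy step, assuming the words are packed back to back with null terminators. The struct and function names are hypothetical, and the original runs this on the GPU rather than in a host loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct AlignedWords {
    std::vector<std::uint32_t> chunks;      // zero-padded 4-byte chunks
    std::vector<std::size_t> chunkOffsets;  // start chunk of every word
};

// packed:  all words back to back, each followed by '\0'
// offsets: byte offset of every word inside packed
AlignedWords alignWords(const std::string& packed,
                        const std::vector<std::size_t>& offsets) {
    const std::size_t wordCount = offsets.size();
    const std::size_t totalSize = packed.size();
    AlignedWords out;
    // NewSize / 4 = TotalSize / 4 + WordCount chunks, all initialised to zero.
    out.chunks.assign(totalSize / 4 + wordCount, 0);
    for (std::size_t i = 0; i < wordCount; ++i) {
        const std::size_t end = (i + 1 < wordCount) ? offsets[i + 1] : totalSize;
        const std::size_t bytes = end - offsets[i];    // includes the terminator
        const std::size_t dst = offsets[i] / 4 + i;    // AlignedWordOffset
        std::memcpy(&out.chunks[dst], packed.data() + offsets[i], bytes);
        out.chunkOffsets.push_back(dst);
    }
    return out;
}
```

With the made-up input `"ab\0cdef\0x\0"` (10 bytes, 3 words) the result has 5 chunks (20 bytes) and the words start at chunks 0, 1 and 4. Chunk 3 stays unused, which is the extra space the formula trades for its simplicity.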
Hashtable: Motivation and Overview
• A hash is an index into an array that contains a value
• Hashtables are perfect for exact matching
  • Simple
  • Build time: O(1) per word (expected), so O(n) for n words
  • Lookup time: O(1) (expected)
• Databases commonly use hashtables when they don't need to support range queries
• Trees are too much work, slower and way harder to parallelize
• Idea:
  • Build a hashtable of all corpus words
  • Search for every query word
Hashtable: MurmurHash Function (1/2)
• Simple
  • Only a few lines (available online)
• Fast
  • Always considers 4 bytes at a time
• Conflict-resilient
  • Very few strings have the same hash
• I simplified it slightly for my case: I removed the 6 lines that handle strings whose size is not divisible by 4 (all my aligned string sizes are divisible by 4)
• Largest bucket size for the corpus (found through trial and error): 4
• A hashtable of the query strings has a largest bucket size of 6
  • Inverting the lookup (hashing the query and searching for the corpus words) was slower!
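The simplification can be sketched as follows. This follows the publicly available MurmurHash2 (same constants and mixing steps) with the tail handling dropped; it is not the author's actual code, and the function name and signature are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// MurmurHash2 restricted to inputs that are a whole number of 4-byte chunks.
// The switch statement that handles the last 1 to 3 bytes is omitted because
// aligned words never have such a tail.
std::uint32_t murmur2Aligned(const std::uint32_t* chunks, std::size_t chunkCount,
                             std::uint32_t seed) {
    const std::uint32_t m = 0x5bd1e995u;
    const int r = 24;
    // The reference implementation mixes the length in bytes into the seed.
    std::uint32_t h = seed ^ static_cast<std::uint32_t>(chunkCount * 4);
    for (std::size_t i = 0; i < chunkCount; ++i) {
        std::uint32_t k = chunks[i];
        k *= m;
        k ^= k >> r;
        k *= m;
        h *= m;
        h ^= k;
    }
    // Final avalanche, unchanged from MurmurHash2.
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}
```

Because every read is a full aligned 32-bit word, the same loop body can be used unchanged inside a GPU kernel.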
Hashtable: Stupid Parallel Hashing (1/2)
• No space optimization constraint
  • Available space: about 900 MB (not counting the space required for input and output)
• Outline:
  • Create H layers, each about 900/H MB in size (the number of entries should be a prime number!)
  • A layer is an array that maps hash to index
  • For each layer L:
    • Place all previously conflicting words in L
  • Number of layers = largest bucket size: 4
• Conflicting parallel writes = race condition
  • CUDA C Programming Guide, section 4.1: "If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device and which thread performs the final write is undefined."
  • One thread will always succeed!
Hashtable: Stupid Parallel Hashing (2/2)
• Note:
  • Rows = Layers
  • Columns = Buckets
[Figure: input words placed into Layer 1, Layer 2 and Layer 3; legend: occupied, occupied / conflicted, empty (-1)]
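A sequential simulation of the layer construction, as one reading of the outline above. On the GPU every pending word is written by its own thread and an arbitrary writer wins; in this sketch the last writer wins, which is one of the permitted outcomes. The function name, the read-back check and all parameters are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Layer = std::vector<std::int32_t>;  // bucket -> word index, -1 = empty

std::vector<Layer> buildLayers(const std::vector<std::uint32_t>& hashes,
                               std::size_t bucketCount, std::size_t maxLayers) {
    std::vector<Layer> layers;
    std::vector<std::int32_t> pending(hashes.size());
    for (std::size_t i = 0; i < pending.size(); ++i)
        pending[i] = static_cast<std::int32_t>(i);

    while (!pending.empty() && layers.size() < maxLayers) {
        Layer layer(bucketCount, -1);
        // Write phase: every pending word writes its index into its bucket.
        for (std::int32_t w : pending) layer[hashes[w] % bucketCount] = w;
        // Read-back phase: a word that does not find its own index lost the
        // race and is retried in the next layer.
        std::vector<std::int32_t> losers;
        for (std::int32_t w : pending)
            if (layer[hashes[w] % bucketCount] != w) losers.push_back(w);
        layers.push_back(std::move(layer));
        pending.swap(losers);
    }
    return layers;
}
```

Each layer absorbs exactly one word per non-empty bucket, which is why the number of layers equals the largest bucket size. For the made-up hashes {5, 5, 7, 5} and 4 buckets, three words share bucket 1 and three layers are produced.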
Lookup: Problems
• Slowest kernel!
• Needs too many registers!
• Did not benefit from shared memory! (But should)
The Complete Algorithm
• Align words into 4-byte chunks
• Compute hashes of all corpus words
• For each hashtable layer L (total of 4):
  • Place all previously conflicting words in L
  • Use templates to determine the layer number
• Look up the index of every query word in every layer L until the word stored there matches or the current layer has no entry for that hash
• Four kernels:
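The lookup step can be sketched on the host like this. It assumes layers built so that a bucket is empty in a layer only when no remaining word hashes to it, and it compares plain strings instead of aligned 4-byte chunks; the function name and parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the corpus index of query, or -1 if the word is not in the corpus.
std::int32_t lookupWord(const std::vector<std::vector<std::int32_t>>& layers,
                        const std::vector<std::string>& corpus,
                        const std::string& query, std::uint32_t queryHash) {
    for (const auto& layer : layers) {
        const std::int32_t candidate = layer[queryHash % layer.size()];
        if (candidate < 0) return -1;  // this layer has no such hash: stop
        if (corpus[static_cast<std::size_t>(candidate)] == query)
            return candidate;          // stored word matches
        // Otherwise a different word owns this slot; try the next layer.
    }
    return -1;
}
```

A query therefore touches at most one slot per layer, so with 4 layers a lookup costs at most 4 word comparisons.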
The Little Things (1/2)
• A previous presenter inspired this idea:
  • Init: allocate and memset all arrays (using maximum sizes)
  • Cleanup: free all arrays
The Little Things (2/2)
• Compare Words:
  • I did not really use shared memory
  • It did not improve performance, even though it should have, due to load balancing:
    • Every thread reads roughly the average word size
    • Vs. some threads reading only 1 byte and some reading 100 bytes
  • Did not investigate further since the kernel was already very fast
References
• MurmurHash: https://sites.google.com/site/murmurhash/MurmurHash2.cpp?attredirects=0
• Real-time Parallel Hashing on the GPU
  • ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2009)
  • by Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta
  • I took some ideas from it but did not implement it at all
  • Needs atomicAdd