270 likes | 440 Views
Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann Technische Universität München. Hardware trends …. Massive processing parallelism Huge main memory Non-uniform Memory Access (NUMA) Our server :
E N D
Massively Parallel Sort-MergeJoins (MPSM) in Main Memory Multi-Core Database Systems Martina Albutiu, Alfons Kemper, and Thomas NeumannTechnische Universität München
Hardware trends … • Massive processingparallelism • Hugemainmemory • Non-uniform Memory Access (NUMA) • Ourserver: • 4 CPUs • 32 cores • 1 TB RAM • 4 NUMA partitions CPU 0
Main memorydatabasesystems MPSM forefficient OLAP queryprocessing • VoltDB, SAP HANA, MonetDB • HyPer*: real-time businessintelligencequerieson transactionaldata * http://www-db.in.tum.de/HyPer/ ICDE 2011 A. Kemper, T. Neumann
How to exploitthesehardwaretrends? • Parallelizealgorithms • Exploit fast mainmemoryaccess Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast JoinImplementation on Modern Multi-Core CPUs. VLDB‘09 Blanas, Li, Patel: Design and Evaluation of Main Memory Hash JoinAlgorithmsfor Multi-core CPUs. SIGMOD‘11 • AND beaware of fast local vs. slow remote NUMA access
Howmuchdifferencedoes NUMA make? 100% 22756 ms 22756 ms 417344 ms 417344 ms 1000 ms 837 ms scaledexecution time 7440 ms 7440 ms 12946 ms remote local local remote synchronized sequential mergejoin(sequentialread) sort partitioning
IgnoringNUMA … NUMA partition 1 NUMA partition 3 core 1 core 5 hashtable core 2 core 6 NUMA partition 2 NUMA partition 4 core 3 core 7 core 4 core 8
The three NUMA commandments C2 Thoushaltreadthyneighbor‘smemoryonlysequentially-- lettheprefetcherhidethe remote accesslatency. C3 Thoushalt not waitforthyneighbors-- don‘tusefine-grainedlatchingorlockingandavoidsynchronizationpoints of parallel threads. C1 Thoushalt not writethyneighbor‘smemoryrandomly-- chunkthedata, redistribute, andthensort/workon yourdatalocally.
Basic idea of MPSM chunkprivate R private R R chunks S chunks public S chunkpublic S
Basic idea of MPSM • C1: Work locally: sort • C3: Work independently: sortandmergejoin • C2: Access neighbor‘sdataonlysequentially chunkprivate R R chunks sort R chunkslocally mergejoinchunks MJ MJ MJ MJ S chunks sort S chunkslocally chunkpublic S
Range partitioning of private input R • To constrainmergejoinwork • To providescalability in thenumber of parallel workers
Range partitioning of private input R • To constrainmergejoinwork • To providescalability in thenumber of parallel workers R chunks rangepartition R rangepartitionedR chunks
Range partitioning of private input R • To constrainmergejoinwork • To providescalability in thenumber of parallel workers S isimplicitlypartitioned rangepartitionedR chunks sort R chunks S chunks sort S chunks
Range partitioning of private input R • To constrainmergejoinwork • To providescalability in thenumber of parallel workers S isimplicitlypartitioned rangepartitionedR chunks sort R chunks MJ MJ MJ MJ mergejoinonlyrelevant parts S chunks sort S chunks
Range partitioning of private input R • Time efficient • branch-free • comparison-free • synchronization-free and • Space efficient denselypacked • in-place byusingradix-clusteringandprecomputedtargetpartitions to scatterdatato
Range partitioning of private input prefixsum of worker W1 histogram of worker W1 19=10011 19 19 9 W1 <16 0 4 7 = 00111 7 1 ≥16 0 3 chunk of worker W1 3 21 21 2 17 = 10001 W2 1 17 17 2=00010 histogram of worker W2 prefixsum of worker W2 2 19 W1 23 23 5 <16 4 3 4 ≥16 3 4 chunk of worker W2 31 31 8 W2 20 20 26 26
Skewresilience of MPSM • Locationskewisimplicitlyhandled • Distribution skew • Dynamicallycomputedpartitionbounds • Determinedbasedon the global datadistributionsof R and S • CostbalancingforsortingR andjoiningR and S 2 19 19 1 2 13 9 2 4 4 7 3 5 31 31 3 5 8 8 21 21 7 10 20 20 1 10 13 6 7 17 20 20 24 24 R1 R2 28 28 25 25 S1 S2
Skewresilience 1. Global S datadistribution • Localequi-heighthistograms (forfree) • Combined to Cumulative Distribution Function (CDF) CDF # tuples 1 2 2 4 16 3 5 12 5 8 7 10 10 13 20 24 28 25 S1 S2 keyvalue 15
Skewresilience 2. Global R datadistribution • Localequi-widthhistograms as before • More fine-grainedhistograms histogram 2 2 = 00010 13 13 <8 3 4 [8,16) 2 31 31 8 = 01000 [16,24) 1 8 8 ≥24 1 20 20 20 6 R1
Skewresilience 3. Computesplittersso thatoverallworkloadsarebalanced*: greedilycombinebuckets, therebybalancingthecosts of eachthreadforsorting R andjoining R and S arebalanced CDF # tuples histogram 2 + = 4 3 6 2 1 13 1 31 8 20 keyvalue * Ross andCieslewicz: Optimal Splitters for Database Partitioningwith Size Bounds. ICDT‘09
Performance evaluation • MPSM performance in a nutshell: • 160 miorecordsjoinedper second • 27 biorecordsjoinedin lessthan 3 minutes • scaleslinearlywiththenumber of cores • Platform: • Linux server • 1 TB RAM • 32 cores • Benchmark: • Jointables R and S withschema{[joinkey: 64bit, payload: 64bit]} • Dataset sizes ranging from 50GB to 400GB
Execution time comparison • MPSM, Vectorwise (VW), andBlanashashjoin* • 32 workers • |R| = 1600 mio(25 GB), varyingsize of S * S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash JoinAlgorithmsfor Multi-core CPUs. SIGMOD 2011
Scalability in thenumber of cores • MPSM andVectorwise (VW) • |R| = 1600 mio(25 GB),|S|=4*|R|
Locationskew • Locationskew in Rhasnoeffectbecause of repartitioning • Locationskew in S:in the extreme caseall joinpartners ofRiarefound in onlyoneSj(eitherlocalor remote)
Distribution skew: anti-correlateddata R: 80% of thejoinkeysarewithinthe 20% high end of thekeydomainS: 80% of thejoinkeysarewithinthe 20% low end of thekeydomain withoutbalancedpartitioning withbalancedpartitioning
Conclusions • MPSM is a sort-based parallel joinalgorithm • MPSM isNUMA-aware & NUMA-oblivious • MPSM isspaceefficient(works in-place) • MPSM scaleslinearlyin thenumber of cores • MPSM isskewresilient • MPSM outperformsVectorwise(4X) andBlanas et al.‘shashjoin (18X) • MPSM isadaptablefordisk-basedprocessing • See details in paper
Massively Parallel Sort-MergeJoins (MPSM) in Main Memory Multi-Core Database Systems Martina Albutiu, Alfons Kemper, and Thomas NeumannTechnische Universität MünchenTHANK YOU FOR YOUR ATTENTION!