Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems


Presentation Transcript


  1. Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München

  2. Hardware trends … • Massive processing parallelism • Huge main memory • Non-uniform Memory Access (NUMA) • Our server: • 4 CPUs • 32 cores • 1 TB RAM • 4 NUMA partitions

  3. Main memory database systems • VoltDB, SAP HANA, MonetDB • HyPer*: real-time business intelligence queries on transactional data (* http://www-db.in.tum.de/HyPer/; A. Kemper, T. Neumann, ICDE 2011) • MPSM for efficient OLAP query processing

  4. How to exploit these hardware trends? • Parallelize algorithms • Exploit fast main memory access → Kim, Sedlar, Chhugani et al.: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB '09 → Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11 • AND be aware of fast local vs. slow remote NUMA access

  5. How much difference does NUMA make? [Bar chart: scaled execution times for merge join with sequential read (remote vs. local), sort (local vs. remote), and partitioning (synchronized vs. sequential)]

  6. Ignoring NUMA … [Diagram: cores 1–8, spread across NUMA partitions 1–4, all probing a single shared hash table]

  7. The three NUMA commandments • C1 Thou shalt not write thy neighbor's memory randomly: chunk the data, redistribute, and then sort/work on your data locally. • C2 Thou shalt read thy neighbor's memory only sequentially: let the prefetcher hide the remote access latency. • C3 Thou shalt not wait for thy neighbors: don't use fine-grained latching or locking and avoid synchronization points of parallel threads.

  8. Basic idea of MPSM [Diagram: the private input R and the public input S are each split into chunks, one per worker (chunk private R / chunk public S)]

  9. Basic idea of MPSM • C1: Work locally: sort • C3: Work independently: sort and merge join • C2: Access neighbor's data only sequentially [Diagram: sort the R chunks locally, sort the S chunks locally, then merge join (MJ) the chunks]
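The two-phase structure on this slide can be sketched in a few lines of code. The sketch below is only an illustration: it assumes one worker thread per chunk, uses plain std::sort for the local sort phase, omits the range partitioning of R introduced on the following slides, and none of the identifiers come from the authors' implementation.

```cpp
// Minimal MPSM-style sketch: local sorts, then each worker merge-joins its
// sorted R run against every sorted S run (sequential scans only, no locks).
#include <algorithm>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

void mpsm_join(std::vector<std::vector<Tuple>>& rChunks,
               std::vector<std::vector<Tuple>>& sChunks,
               std::vector<std::vector<std::pair<Tuple, Tuple>>>& results) {
    auto byKey = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };

    // Phase 1 (C1/C3): every worker sorts its own chunk, fully independently.
    std::vector<std::thread> sorters;
    for (auto& chunk : rChunks)
        sorters.emplace_back([&chunk, byKey] { std::sort(chunk.begin(), chunk.end(), byKey); });
    for (auto& chunk : sChunks)
        sorters.emplace_back([&chunk, byKey] { std::sort(chunk.begin(), chunk.end(), byKey); });
    for (auto& t : sorters) t.join();

    // Phase 2 (C2/C3): each worker scans the (possibly remote) S runs only
    // sequentially and writes matches into its private result vector.
    std::vector<std::thread> joiners;
    results.assign(rChunks.size(), {});
    for (size_t w = 0; w < rChunks.size(); ++w) {
        joiners.emplace_back([&, w] {
            for (const auto& sRun : sChunks) {
                size_t i = 0, j = 0;
                while (i < rChunks[w].size() && j < sRun.size()) {
                    if (rChunks[w][i].key < sRun[j].key) ++i;
                    else if (rChunks[w][i].key > sRun[j].key) ++j;
                    else {
                        // Emit all S matches for this key (handles duplicates in S).
                        size_t j2 = j;
                        while (j2 < sRun.size() && sRun[j2].key == rChunks[w][i].key)
                            results[w].push_back({rChunks[w][i], sRun[j2++]});
                        ++i;
                    }
                }
            }
        });
    }
    for (auto& t : joiners) t.join();
}
```

Because every worker reads the remote S runs strictly sequentially and writes only into its own result vector, the sketch needs no latches or synchronization points beyond the two thread joins, which is exactly what the three commandments ask for.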

  10. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers

  11. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers [Diagram: the R chunks are range partitioned into range-partitioned R chunks]

  12. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers → S is implicitly partitioned [Diagram: the range-partitioned R chunks and the S chunks are sorted locally]

  13. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers → S is implicitly partitioned [Diagram: merge join (MJ) only the relevant parts of the sorted R and S chunks]

  14. Range partitioning of private input R • Time efficient: branch-free, comparison-free, and synchronization-free • Space efficient: densely packed and in-place • Achieved by radix-clustering with precomputed target partitions into which the data is scattered

  15. Range partitioning of private input [Example: workers W1 and W2 build histograms over their chunks using the high-order key bits (e.g. 19 = 10011 and 17 = 10001 fall into the ≥16 bucket, 7 = 00111 and 2 = 00010 into the <16 bucket); combining the per-worker histograms yields prefix sums that give each worker its exclusive write offsets in the range-partitioned target chunks]
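The histogram and prefix-sum mechanics of this example can be made concrete with a small sketch. It is hypothetical: the names are illustrative, the two-way split at key 16 mirrors the slide's <16 / ≥16 buckets, keys are assumed to fit into 5 bits as in the example, and the scatter writes to a separate target array rather than in-place as in MPSM.

```cpp
// Hypothetical sketch of histogram + prefix-sum radix partitioning.
#include <array>
#include <cstdint>
#include <vector>

using Key = uint64_t;
constexpr int    kShift = 4;   // radix bit deciding the range: <16 vs. >=16
constexpr size_t kParts = 2;   // keys assumed < 32, as in the slide's example

inline size_t partitionOf(Key k) { return (k >> kShift) & (kParts - 1); }

// Pass 1: count this worker's tuples per target partition
// (pure radix lookup, no key comparisons).
std::array<size_t, kParts> buildHistogram(const std::vector<Key>& chunk) {
    std::array<size_t, kParts> hist{};
    for (Key k : chunk) ++hist[partitionOf(k)];
    return hist;
}

// Pass 2: scatter every tuple to its precomputed slot. writePos holds this
// worker's exclusive start offsets per partition, obtained from a prefix sum
// over all workers' histograms, so the writes need no synchronization.
void scatter(const std::vector<Key>& chunk, std::vector<Key>& target,
             std::array<size_t, kParts>& writePos) {
    for (Key k : chunk) target[writePos[partitionOf(k)]++] = k;
}

// Prefix sum idea: worker w starts writing partition p at
//   offset(p) + sum of hist[w'][p] for all workers w' < w,
// where offset(p) is the total size of all partitions before p.
```

The key point is that once the prefix sums are known, every worker owns a contiguous, densely packed write region, which is what makes the scatter branch-, comparison-, and synchronization-free.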

  16. Skew resilience of MPSM • Location skew is implicitly handled • Distribution skew: • Dynamically computed partition bounds • Determined based on the global data distributions of R and S • Cost balancing for sorting R and joining R and S [Example: chunks R1, R2 and S1, S2 with skewed key distributions]

  17. Skew resilience 1. Global S data distribution • Local equi-height histograms (for free) • Combined into a Cumulative Distribution Function (CDF) [Chart: CDF of the number of S tuples over the key domain, built from the runs S1 and S2]
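The slide notes that the local equi-height histograms come essentially for free. A minimal sketch of how such histograms might be combined into an approximate global CDF is shown below; the function name, the sampling granularity, and the assumption that the local S runs are already sorted are illustrative assumptions, not taken from the paper.

```cpp
// Approximate the global CDF of S from per-worker equi-height histograms
// (every step-th key of each locally sorted run represents `step` tuples).
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<uint64_t, size_t>>
approximateCdf(const std::vector<std::vector<uint64_t>>& sortedRuns, size_t bucketsPerRun)
{
    std::vector<std::pair<uint64_t, size_t>> samples;  // (key, tuples represented)
    for (const auto& run : sortedRuns) {
        size_t step = std::max<size_t>(1, run.size() / bucketsPerRun);
        for (size_t i = step - 1; i < run.size(); i += step)
            samples.push_back({run[i], step});
    }
    std::sort(samples.begin(), samples.end());          // order sample points by key

    // Turn per-sample counts into a running total: (key, cumulative #tuples <= key).
    size_t cumulative = 0;
    for (auto& s : samples) { cumulative += s.second; s.second = cumulative; }
    return samples;
}
```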

  18. Skew resilience 2. Global R data distribution • Local equi-width histograms as before • More fine-grained histograms [Example: histogram of R1 over the key ranges <8, [8,16), [16,24), ≥24, derived from the binary key prefixes (e.g. 2 = 00010, 8 = 01000)]

  19. Skew resilience 3. Compute splitters so that the overall workloads are balanced*: greedily combine buckets so that each thread's cost for sorting R and joining R and S is balanced. [Chart: combining the fine-grained R histogram with the S CDF to derive the splitters] * Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT '09
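A rough sketch of the greedy splitter computation described here is given below. The cost model (a partition's cost taken as its R tuples to sort plus its S tuples to join) and all identifiers are simplifying assumptions for illustration, not the paper's exact formulation.

```cpp
// Greedily combine key-range buckets into `workers` partitions so that the
// per-partition cost (R tuples to sort + S tuples to join) is roughly equal.
#include <cstddef>
#include <vector>

// rHist[b] = #R tuples in bucket b (fine-grained histogram);
// sHist[b] = #S tuples in bucket b (derived from the CDF).
// Returns, for each partition, the index one past its last bucket.
std::vector<size_t> computeSplitters(const std::vector<size_t>& rHist,
                                     const std::vector<size_t>& sHist,
                                     size_t workers)
{
    size_t totalCost = 0;
    for (size_t b = 0; b < rHist.size(); ++b) totalCost += rHist[b] + sHist[b];
    const size_t target = (totalCost + workers - 1) / workers;  // cost budget per worker

    std::vector<size_t> splitters;
    size_t acc = 0;
    for (size_t b = 0; b < rHist.size() && splitters.size() + 1 < workers; ++b) {
        acc += rHist[b] + sHist[b];
        if (acc >= target) {             // budget reached: close this partition
            splitters.push_back(b + 1);
            acc = 0;
        }
    }
    splitters.push_back(rHist.size());   // last partition takes the remaining buckets
    return splitters;
}
```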

  20. Performance evaluation • MPSM performance in a nutshell: • 160 million records joined per second • 27 billion records joined in less than 3 minutes • scales linearly with the number of cores • Platform: • Linux server • 1 TB RAM • 32 cores • Benchmark: • join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]} • dataset sizes ranging from 50 GB to 400 GB

  21. Execution time comparison • MPSM, Vectorwise (VW), and the Blanas hash join* • 32 workers • |R| = 1600 million tuples (25 GB), varying size of S * S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011

  22. Scalability in the number of cores • MPSM and Vectorwise (VW) • |R| = 1600 million tuples (25 GB), |S| = 4 · |R|

  23. Location skew • Location skew in R has no effect because of repartitioning • Location skew in S: in the extreme case all join partners of Ri are found in only one Sj (either local or remote)

  24. Distribution skew: anti-correlated data • R: 80% of the join keys are within the 20% high end of the key domain • S: 80% of the join keys are within the 20% low end of the key domain [Charts: execution times without vs. with balanced partitioning]

  25. Distribution skew: anti-correlated data

  26. Conclusions • MPSM is a sort-based parallel join algorithm • MPSM is NUMA-aware & NUMA-oblivious • MPSM is space efficient (works in-place) • MPSM scales linearly in the number of cores • MPSM is skew resilient • MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X) • MPSM is adaptable for disk-based processing • See details in the paper

  27. Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München. THANK YOU FOR YOUR ATTENTION!
