
Joins in Hadoop



Presentation Transcript


  1. Joins in Hadoop Gang and Ronnie

  2. Agenda • Introduction of new types of joins • Experiment results • Join plan generator • Summary and future work

  3. Problem at hand • Map join (fragment-duplicate join) • Fragment: the large table, split across the map tasks • Duplicate: the small table, copied to every map task

  4. Slide taken from project proposal • Too many copies of the small table are shuffled across the network • Partially solved by the distributed cache • Doesn’t work with too many nodes involved: there are 64 nodes in our cluster, and the distributed cache copies the data at most that many times

  5. Slide taken from project proposal II • Memory limitation • The hash table is not memory-efficient. • The table size is usually larger than the heap memory assigned to a task • Result: Out Of Memory Exception!

  6. Solving the Not-Enough-Memory problem • New map joins: • Multi-phase map join (MMJ) • Reversed map join (RMJ) • JDBM-based map join (JMJ) • Roles: small table as duplicate, large table as fragment

  7. Multi-phase map join • n-phase map join • The duplicate is split into parts 1 to n; in each phase the map tasks join the fragment against one duplicate part • Problem? Reading the large table multiple times!
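
The n-phase idea on this slide can be illustrated with a minimal in-memory sketch. It is not the authors' Hadoop implementation: the function names, the round-robin split of the duplicate, and the toy tables are all assumptions made for illustration. Each phase builds a hash table over one part of the duplicate and re-reads the whole fragment, which is exactly the cost the slide points out.

```python
from collections import defaultdict


def build_hash_table(rows, key_index=0):
    """Group rows by their join key (hypothetical helper)."""
    table = defaultdict(list)
    for row in rows:
        table[row[key_index]].append(row)
    return table


def multi_phase_map_join(fragment, duplicate, n_phases, key_index=0):
    """Join one fragment split against the duplicate in n_phases passes."""
    # Assumption: a round-robin split keeps each part small enough for memory.
    parts = [duplicate[i::n_phases] for i in range(n_phases)]
    results = []
    for part in parts:
        table = build_hash_table(part, key_index)
        # The fragment is scanned once per phase.
        for row in fragment:
            for match in table.get(row[key_index], []):
                results.append(match + row)
    return results


# Toy data: (cid, name) and (cid, amount).
customers = [(1, "ann"), (2, "bob"), (3, "cy")]
sales = [(1, 9.5), (3, 4.0), (3, 1.25), (4, 7.0)]
joined = multi_phase_map_join(sales, customers, n_phases=2)
print(sorted(joined))
```

With `n_phases=1` the sketch degenerates to a plain hash join over the whole duplicate, and the fragment is read only once.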

  8. Reversed map join • Default map join (in each Map task): 1. read duplicate to memory, build hash table 2. for each tuple in fragment, probe the hash table • Reversed map join (in each Map task): 1. read fragment to memory, build hash table 2. for each tuple in duplicate, probe the hash table • Problem? – not really a Map job…
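
The two build/probe orders on this slide can be compared in a small sketch. It runs in plain Python on in-memory lists rather than inside Hadoop map tasks, and the function names and toy tables are assumptions for illustration only.

```python
from collections import defaultdict


def hash_join(build_side, probe_side, key_index=0):
    """Build a hash table on build_side, then probe it with probe_side."""
    table = defaultdict(list)
    for row in build_side:
        table[row[key_index]].append(row)
    for row in probe_side:
        for match in table.get(row[key_index], []):
            yield match, row


def default_map_join(fragment_split, duplicate):
    # Hash table holds the duplicate (small table); the fragment is streamed.
    return [dup + frag for dup, frag in hash_join(duplicate, fragment_split)]


def reversed_map_join(fragment_split, duplicate):
    # Hash table holds this task's fragment split; the duplicate is streamed.
    return [dup + frag for frag, dup in hash_join(fragment_split, duplicate)]


customers = [(1, "ann"), (2, "bob"), (3, "cy")]   # duplicate
sales = [(1, 9.5), (3, 4.0), (3, 1.25), (4, 7.0)]  # one fragment split
print(sorted(default_map_join(sales, customers)))
print(sorted(reversed_map_join(sales, customers)))
```

Both orders produce the same rows; what changes is which side must fit in memory. In the reversed order that is a single fragment split, whose size is bounded by the split size rather than by the size of the small table.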

  9. JDBM-based map join • JDBM is a transactional persistence engine for Java. • Using JDBM, we can eliminate OutOfMemoryException. The size of the hash table is no longer bound by the heap size. Problem? – Probing a hashtable on disk might take much time!
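
The trade-off on this slide can be sketched with a disk-backed hash table. JDBM is a Java library, so this sketch swaps in Python's standard `shelve` module as a stand-in persistent map; it is not JDBM and not the authors' code, and the function name and toy tables are assumptions.

```python
import os
import shelve
import tempfile


def disk_backed_map_join(fragment, duplicate, key_index=0):
    """Hash join whose build-side table lives on disk instead of the heap."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        with shelve.open(os.path.join(tmp, "duplicate_table")) as table:
            # Build phase: every insert goes through the on-disk store.
            for row in duplicate:
                key = str(row[key_index])  # shelve keys must be strings
                rows = table.get(key, [])
                rows.append(row)
                table[key] = rows  # reassign so the change is persisted
            # Probe phase: each lookup may touch the disk, hence the slowdown.
            for row in fragment:
                for match in table.get(str(row[key_index]), []):
                    results.append(match + row)
    return results


customers = [(1, "ann"), (2, "bob"), (3, "cy")]
sales = [(1, 9.5), (3, 4.0), (3, 1.25), (4, 7.0)]
print(sorted(disk_backed_map_join(sales, customers)))
```

Memory use no longer grows with the size of the duplicate, but every probe pays a storage lookup, which matches the concern raised on the slide.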

  10. Advanced Joins • Step 1: Semi join on join key only; • Step 2: Use the result to filter the table; • Step 3: Join new tables. • Can be applied to both map and reduce-side joins • Problem? – Step 1 and 2 have overhead!
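
The three steps above can be sketched on in-memory lists. This is only an illustration under assumptions: the function name and toy tables are hypothetical, and in a real cluster each step would be its own job or phase rather than a few lines of Python.

```python
def advanced_join(left, right, key_index=0):
    """Semi-join-based join: find matching keys, filter, then join."""
    # Step 1: semi join on the join key only (no payload columns involved).
    left_keys = {row[key_index] for row in left}
    right_keys = {row[key_index] for row in right}
    matching_keys = left_keys & right_keys

    # Step 2: use the result to filter both tables.
    left_filtered = [r for r in left if r[key_index] in matching_keys]
    right_filtered = [r for r in right if r[key_index] in matching_keys]

    # Step 3: join the (smaller) filtered tables with a hash join.
    table = {}
    for row in left_filtered:
        table.setdefault(row[key_index], []).append(row)
    joined = [l + r for r in right_filtered for l in table[r[key_index]]]
    return joined, left_filtered, right_filtered


customers = [(1, "ann"), (2, "bob"), (3, "cy")]
sales = [(1, 9.5), (3, 4.0), (3, 1.25), (4, 7.0)]
joined, kept_customers, kept_sales = advanced_join(customers, sales)
print(sorted(joined))
print(len(kept_customers), len(kept_sales))
```

Steps 1 and 2 are extra passes over the data, so the approach only pays off when they remove enough rows before step 3.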

  11. The Nine Candidates (DC = distributed cache) • AMJ/no dist: advanced map join without DC • AMJ/dist: advanced map join with DC • DMJ/no dist: default map join without DC • DMJ/dist: default map join with DC • MMJ: multi-phase map join • RMJ/dist: reversed map join with DC • JMJ/dist: JDBM-based map join with DC • ARJ/dist: advanced reduce join with DC • DRJ: default reduce join

  12. Experiment Setup • TPC-DS benchmark • Evaluated query: JOIN customer, web_sales ON cid • Performed on different scales of generated data, e.g. 10GB, 170GB (benchmark scale, not actual table size) • Each combination is run five times • Results are analyzed with error bars

  13. Hadoop Cluster • 128 Hewlett Packard DL160 Compute Building Blocks • Each equipped with: • 2 quad-core CPUs • 16 GB RAM • 2 TB storage • High-speed network connection • Used in the experiment: • Hadoop Cluster (Altocumulus): 64 nodes

  14. Result analysis • Some results ignored

  15. One small note • What does 50*200 mean? • TABLE customer: from the 50GB version of TPC-DS, actual table size about 100MB • TABLE web_sales: from the 200GB version of TPC-DS, actual table size about 30GB

  16. Distributed Cache

  17. Distributed Cache II • The distributed cache introduces an overhead when copying the file from HDFS to local disks. • The following situations favor the distributed cache (compared to non-DC): 1. the number of nodes is low 2. the number of map tasks is high
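
Why those two situations favor the distributed cache can be seen by counting copies of the small table. The sketch below is a simplified counting model, not the authors' cost model: it assumes that without the distributed cache every map task fetches its own copy, and that with it the table is copied at most once per node. The figure of 480 map tasks is hypothetical; the 100MB table size and 64 nodes come from the slides.

```python
def small_table_copies(num_nodes, num_map_tasks, use_distributed_cache):
    """Number of times the small table crosses the network (simplified)."""
    if use_distributed_cache:
        # At most one copy per node, and never more than one per task.
        return min(num_nodes, num_map_tasks)
    # Assumption: without the cache, each map task reads its own copy.
    return num_map_tasks


small_table_mb = 100   # customer table size reported on the slides
num_nodes = 64         # nodes used in the experiment
num_map_tasks = 480    # hypothetical number of map tasks

for use_dc in (False, True):
    copies = small_table_copies(num_nodes, num_map_tasks, use_dc)
    print(use_dc, copies, copies * small_table_mb, "MB")
```

The model ignores the per-file overhead of the distributed cache mentioned on the slide, so it overstates the benefit when there are few map tasks.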

  18. Advanced vs. Default

  19. Advanced vs. Default II

  20. Advanced vs. Default III • The overhead of semi-join and filtering is heavy. • The following situations favor advanced joins (compared to reduce joins): 1. join selectivity gets lower 2. network becomes slower (true!) 3. we need to handle skewed data

  21. Map Join vs Reduce Join: Part I

  22. Map Join vs Reduce Join: Part II

  23. Map Join vs Reduce Join • In most situations, Default Map Join performs better than Default Reduce Join • It eliminates the data transfer and sorting at the shuffle stage • The gap is not significant due to the fast network • Potential problems of Map Joins • A job involving too many map tasks causes a large amount of data to be transferred over the network • Distributed cache may harm performance

  24. Beyond Default Map Join • Multi-Phase Map Join • Succeeds in all experiment groups. • Performance is comparable with DMJ when only one phase is involved. • Performance degrades sharply when the number of phases is greater than 2, due to the many more tasks we launch. • Currently no support for distributed cache, so not scalable

  25. Beyond Default Map Join • Reversed Map Join • Succeeds in all experiment groups. • Does not perform as well as DRJ due to the overhead of the distributed cache • Performs best when

  26. Beyond Default Map Join • JDBM Map Join • Fails in the last two experiment groups, mainly due to improper configuration settings.

  27. Join Plan Generator • Cost-based + rule-based • Focus on three aspects • Whether or not to use distributed cache • Whether to use Default Map Join • Map joins or reduce-side join • Parameters

  28. Join Plan Generator • Whether to use distributed cache • Only works for map join approaches • Cost model • With distributed cache: • where is the average overhead to distribute one file • Without distributed cache:

  29. Join Plan Generator • Whether to use Default Map Join • We give Default Map Join the highest priority since it usually works best • The choice of whether to use the distributed cache ensures that Default Map Join works efficiently • Rule: if the small table fits entirely in memory, use Default Map Join.

  30. Join Plan Generator • Map Joins or Default Reduce side Join • In those situations where DMJ fails, Reversed Map Join is most promising in terms of usability and scalability. • Cost model: • RMJ: • (without distributed cache) • (with distributed cache) • where is the average overhead to distribute one file • DRJ:

  31. Join Plan Generator • Decision flow: • Distributed cache? Y / N • Default Map Join? • Y: do it • N: choose between Reversed Map Join and Default Reduce-side Join, then do it
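
The decision flow on this slide can be written as a short function. This is a sketch under stated assumptions, not the authors' generator: the cost formulas are not present in this transcript, so every cost is passed in by the caller, and the function name, parameter names, and example numbers are all hypothetical.

```python
def choose_join_plan(small_table_fits_in_memory,
                     map_cost_with_dc, map_cost_without_dc,
                     rmj_cost_with_dc, rmj_cost_without_dc,
                     drj_cost):
    """Return (join_type, use_distributed_cache) following the slide's flow."""
    # Decision 1: distributed cache or not (cost-based).
    use_dc = map_cost_with_dc < map_cost_without_dc

    # Decision 2: Default Map Join has the highest priority (rule-based).
    if small_table_fits_in_memory:
        return ("DMJ", use_dc)

    # Decision 3: Reversed Map Join versus Default Reduce-side Join.
    rmj_cost = rmj_cost_with_dc if use_dc else rmj_cost_without_dc
    if rmj_cost <= drj_cost:
        return ("RMJ", use_dc)
    return ("DRJ", False)  # the reduce-side join does not use the cache here


# Made-up cost estimates in arbitrary units, for illustration only.
print(choose_join_plan(True, 40, 90, 120, 150, 100))
print(choose_join_plan(False, 40, 90, 80, 150, 100))
print(choose_join_plan(False, 40, 90, 120, 150, 100))
```

Keeping the cost estimates as inputs leaves the rule-based skeleton visible without committing to any particular cost model.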

  32. Summary • Distributed cache is a double-edged sword • When the distributed cache is used properly, Default Map Join performs best • The three new map join approaches extend the usability of default map join

  33. Future Work • SPJA workflow (selection, projection, join, aggregation) • Better optimizer • Multi-way join • Build into a hybrid system • Need a dedicated (slower) cluster…
