Joins in Hadoop Gang and Ronnie
Agenda • Introduction of new types of joins • Experiment results • Join plan generator • Summary and future work
Problem at hand • Map join (fragment-duplicate join) • [diagram: the large table (the fragment) is split across map tasks; each map task receives a full copy of the small table (the duplicate)]
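A minimal sketch of this baseline, assuming a "|"-delimited record layout with the join key first and a local copy of the small table named duplicate.tbl (both hypothetical, for illustration only):

```java
// Sketch of the fragment-duplicate join. Every map task loads the whole
// duplicate (small table) into a hash table in setup(), then streams its
// fragment split through map() and probes.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final HashMap<String, String> duplicate = new HashMap<String, String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // "duplicate.tbl" is a local copy of the small table (e.g. from the
    // distributed cache); every map task pays this loading cost.
    BufferedReader in = new BufferedReader(new FileReader("duplicate.tbl"));
    String line;
    while ((line = in.readLine()) != null) {
      duplicate.put(line.split("\\|")[0], line);  // join key -> whole tuple
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String key = line.toString().split("\\|")[0];
    String match = duplicate.get(key);            // probe the hash table
    if (match != null) {
      ctx.write(new Text(line + "|" + match), NullWritable.get());
    }
  }
}
```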
Slide taken from project proposal • Too many copies of the small table are shuffled across the network • Partially solved by the Distributed Cache • Doesn't work when too many nodes are involved: with 64 nodes in our cluster, the distributed cache still copies the data up to 64 times (once per node)
Slide taken from project proposal II • Memory limitation • Hash tables are not memory-efficient • The table size is usually larger than the heap memory assigned to a task: OutOfMemoryException!
Solving the Not-Enough-Memory problem • New map joins: • Multi-phase map join (MMJ) • Reversed map join (RMJ) • JDBM-based map join (JMJ) • Throughout, the small table serves as the duplicate and the large table as the fragment
Multi-phase map join • n-phase map join: split the duplicate into n parts (Part 1 … Part n) that each fit in memory, and join the fragment against one part per phase • Problem? Reading the large table multiple times!
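A sketch of an n-phase driver under the same assumptions; the paths and the "join.duplicate.part" configuration key are ours, not the project's, and MapJoinMapper is the mapper sketched earlier:

```java
// Sketch of an n-phase map join driver: the duplicate is pre-split into n
// parts that each fit in a task's heap, and one map-only join job runs per
// part over the same fragment.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiPhaseMapJoinDriver {
  public static void main(String[] args) throws Exception {
    int phases = Integer.parseInt(args[0]);  // n
    for (int i = 0; i < phases; i++) {
      Configuration conf = new Configuration();
      // Tell each map task which part of the duplicate to load.
      conf.set("join.duplicate.part", "/joins/duplicate/part-" + i);
      Job job = new Job(conf, "map join, phase " + i);
      job.setJarByClass(MultiPhaseMapJoinDriver.class);
      job.setMapperClass(MapJoinMapper.class);
      job.setNumReduceTasks(0);              // map-only job
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);
      FileInputFormat.addInputPath(job, new Path("/joins/fragment"));
      FileOutputFormat.setOutputPath(job, new Path("/joins/out/phase-" + i));
      if (!job.waitForCompletion(true)) System.exit(1);
      // The cost shows up here: the fragment is scanned once per phase.
    }
  }
}
```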
Reversed map join • Default map join (in each map task): 1. read the duplicate into memory, build a hash table; 2. for each tuple in the fragment, probe the hash table • Reversed map join (in each map task): 1. read the fragment into memory, build a hash table; 2. for each tuple in the duplicate, probe the hash table • Problem? Not really a Map job…
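A sketch of the reversed variant under the same hypothetical record layout and file name; note that all join output is emitted in cleanup(), after the split has been consumed, which is why it is "not really a Map job":

```java
// Sketch of the reversed map join: buffer THIS task's fragment split in a
// hash table during map(), then probe it with every duplicate tuple in
// cleanup().
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReversedMapJoinMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  // join key -> fragment tuples seen in this task's split
  private final HashMap<String, List<String>> split =
      new HashMap<String, List<String>>();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx) {
    String key = line.toString().split("\\|")[0];
    List<String> tuples = split.get(key);
    if (tuples == null) {
      tuples = new ArrayList<String>();
      split.put(key, tuples);
    }
    tuples.add(line.toString());
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // Stream the duplicate (a local copy, e.g. via the distributed cache)
    // and probe the hash table built from the fragment split.
    BufferedReader in = new BufferedReader(new FileReader("duplicate.tbl"));
    String line;
    while ((line = in.readLine()) != null) {
      List<String> matches = split.get(line.split("\\|")[0]);
      if (matches == null) continue;
      for (String m : matches) {
        ctx.write(new Text(m + "|" + line), NullWritable.get());
      }
    }
    in.close();
  }
}
```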
JDBM-based map join • JDBM is a transactional persistence engine for Java • Using JDBM, we can eliminate the OutOfMemoryException: the size of the hash table is no longer bound by the heap size • Problem? Probing a hash table on disk can take much time!
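A sketch of swapping the in-memory hash table for an on-disk one. It assumes a JDBM2-style API (RecordManagerFactory.createRecordManager, recman.hashMap); exact signatures differ between JDBM versions:

```java
// Sketch of backing the join hash table with JDBM so it lives on local
// disk instead of the heap.
import java.util.Map;

import jdbm.RecordManager;
import jdbm.RecordManagerFactory;

public class JdbmHashTableDemo {
  public static void main(String[] args) throws Exception {
    // Backing files go to local disk, so the table is bound by disk, not heap.
    RecordManager recman =
        RecordManagerFactory.createRecordManager("/tmp/join-hashtable");
    Map<String, String> table = recman.hashMap("duplicate");

    table.put("cid-42", "customer tuple ...");  // build phase: no OOM risk
    recman.commit();                            // flush past the heap

    String match = table.get("cid-42");         // probe phase: may hit disk
    // The trade-off: every probe can touch disk, which is why JMJ is slow.
    recman.close();
  }
}
```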
Advanced Joins • Step 1: semi-join on the join key only • Step 2: use the result to filter the large table • Step 3: join the filtered tables • Can be applied to both map and reduce-side joins • Problem? Steps 1 and 2 add overhead!
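A sketch of step 2 as a map-only filter; the "join-keys.txt" file name and the record layout are assumptions for illustration:

```java
// Sketch of the semi-join filter: keep only large-table tuples whose join
// key appears in the key set produced by step 1.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SemiJoinFilterMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final HashSet<String> keys = new HashSet<String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Distinct join keys from step 1 -- far smaller than the table itself.
    BufferedReader in = new BufferedReader(new FileReader("join-keys.txt"));
    String k;
    while ((k = in.readLine()) != null) keys.add(k);
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String key = line.toString().split("\\|")[0];
    if (keys.contains(key)) {
      ctx.write(line, NullWritable.get());  // tuple survives the filter
    }
  }
}
```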
The Nine Candidates • AMJ/no dist: advanced map join without DC • AMJ/dist: advanced map join with DC • DMJ/no dist: default map join without DC • DMJ/dist: default map join with DC • MMJ: multi-phase map join • RMJ/dist: reversed map join with DC • JMJ/dist: JDBM-based map join with DC • ARJ/dist: advanced reduce join with DC • DRJ: default reduce join
Experiment Setup • TPC-DS benchmark • Evaluated query: JOIN customer, web_sales ON cid • Performed on different scales of generated data, e.g. 10GB, 170GB (the scale of the generated data set, not the actual table size) • Each combination is run five times • Results are analyzed with error bars
Hadoop Cluster • 128 Hewlett-Packard DL160 compute building blocks • Each equipped with: 2 quad-core CPUs, 16 GB RAM, 2 TB storage, high-speed network connection • Used in the experiment: the Altocumulus Hadoop cluster, 64 of those nodes
Result analysis • Some results are omitted below
One small note • What does 50*200 mean? • TABLE customer: from the 50GB version of TPC-DS; actual table size about 100MB • TABLE web_sales: from the 200GB version of TPC-DS; actual table size about 30GB
Distributed Cache II • The distributed cache introduces an overhead: files are copied from HDFS to every node's local disk • The following situations favor the distributed cache (compared to non-DC): 1. the number of nodes is low; 2. the number of map tasks is high
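For reference, a sketch with the classic (pre-YARN) DistributedCache API, which copies the file once per node and symlinks it into each task's working directory; the HDFS path is hypothetical:

```java
// Sketch of shipping the duplicate via the distributed cache: the file is
// copied from HDFS to each node's local disk once per node -- the overhead
// discussed above.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Tasks then read the local copy through the "duplicate.tbl" symlink
    // in their working directory.
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheFile(
        new URI("/joins/duplicate/customer.tbl#duplicate.tbl"), conf);
    return new Job(conf, "map join with distributed cache");
  }
}

// In a mapper's setup(), the local copies are also visible as:
//   Path[] local = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
```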
Advanced vs. Default III • The overhead of the semi-join and filtering steps is heavy • The following situations favor advanced joins (compared to their default counterparts): 1. join selectivity gets lower; 2. the network becomes slower (true in our case!); 3. we need to handle skewed data
Map Join vs. Reduce Join • In most situations, Default Map Join performs better than Default Reduce Join • It eliminates the data transfer and sorting of the shuffle stage • The gap is not significant, though, owing to the fast network • Potential problems of map joins: • A job with too many map tasks causes a large amount of data to be transferred over the network • The distributed cache may hurt performance
Beyond Default Map Join • Multi-Phase Map Join • Succeeds in all experiment groups • Performance is comparable with DMJ when only one phase is involved • Performance degrades sharply when the number of phases exceeds 2, due to the many more tasks launched • Currently no support for the distributed cache, so it does not scale
Beyond Default Map Join • Reversed Map Join • Succeeds in all experiment groups • Does not perform as well as DRJ, due to the overhead of the distributed cache • Performs best in the situations where DMJ fails (cf. the join plan generator)
Beyond Default Map Join • JDBM Map Join • Fails for the last two experiment groups, mainly due to improper configuration settings
Join Plan Generator • Cost-based + rule-based • Focuses on three aspects: • whether or not to use the distributed cache • whether to use Default Map Join • map joins or reduce-side join • Parameters
Join Plan Generator • Whether to use the distributed cache • Only applies to the map-join approaches • Cost model • With distributed cache: cost ≈ n_nodes · c, where c is the average overhead to distribute one file to one node • Without distributed cache: cost ≈ n_maps · r, where r is the cost of one map task reading the small table from HDFS
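The decision reduces to comparing the two estimates. A sketch with our own variable names, not the project's actual code:

```java
// Sketch of the distributed-cache decision from the cost model above.
public class DistributedCacheCost {
  /**
   * @param c average overhead to distribute one file to one node
   * @param r cost for one map task to read the small table from HDFS
   */
  static boolean useDistributedCache(int nodes, int mapTasks, double c, double r) {
    double withDC = nodes * c;         // the file is copied once per node
    double withoutDC = mapTasks * r;   // every map task re-reads from HDFS
    return withDC < withoutDC;         // DC wins: few nodes, many map tasks
  }
}
```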
Join Plan Generator • Whether to use Default Map Join • We give Default Map Join the highest priority, since it usually works best • The distributed-cache decision ensures Default Map Join runs efficiently • Rule: if the small table fits entirely into memory, just do it.
Join Plan Generator • Map Joins or Default Reduce-side Join • In the situations where DMJ fails, Reversed Map Join is the most promising in terms of usability and scalability • Cost model • RMJ: cost ≈ n_maps · r without the distributed cache; cost ≈ n_nodes · c + n_maps · l with it, where c is the average overhead to distribute one file and l is the cost of one local scan of the duplicate • DRJ: the cost of shuffling and sorting both tables across the network
Join Plan Generator • Decision flow: • Distributed cache? (yes/no, by the cost model) • Default Map Join? • yes → do it • no → Reversed Map Join / Default Reduce-side Join (by the cost model) → do it
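The whole flow, as a sketch with our own names and signatures: the rule-based step first (DMJ whenever the duplicate fits in a task's heap), then the cost-based choice between RMJ and DRJ:

```java
// Sketch of the generator's decision sequence.
public class JoinPlanGenerator {
  enum Plan { DEFAULT_MAP_JOIN, REVERSED_MAP_JOIN, DEFAULT_REDUCE_JOIN }

  static Plan choose(long smallTableBytes, long taskHeapBytes,
                     double costRMJ, double costDRJ) {
    // Rule: if the small table fits entirely into memory, just do DMJ.
    if (smallTableBytes < taskHeapBytes) {
      return Plan.DEFAULT_MAP_JOIN;
    }
    // Otherwise pick the cheaper of RMJ and DRJ from the cost model.
    return costRMJ < costDRJ ? Plan.REVERSED_MAP_JOIN : Plan.DEFAULT_REDUCE_JOIN;
  }
}
```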
Summary • The distributed cache is a double-edged sword • When the distributed cache is used properly, Default Map Join performs best • The three new map-join approaches extend the usability of the default map join
Future Work • SPJA workflow (selection, projection, join, aggregation) • Better optimizer • Multi-way join • Build into a hybrid system • Need a dedicated (slower) cluster…