Join algorithms using MapReduce
Haiping Wang ctqlwhp1022@163.com
Outline • MapReduce Framework • MapReduce implementation on Hadoop • Join algorithms using MapReduce
MapReduce: Simplified data processing on large clusters. In OSDI, 2004
MapReduce WordCount Diagram
[Figure: seven input splits holding the words "ah ah er ah if or or uh or ah if" flow through the map, shuffle, and reduce phases.]
• map(String input_key, String input_value): each mapper emits (word, 1) for every word, e.g. ah:1, ah:1, er:1, …
• Shuffle groups the intermediate pairs by key: ah:1,1,1,1 er:1 if:1,1 or:1,1,1 uh:1
• reduce(String output_key, Iterator intermediate_values): each reducer sums its list, producing ah:4, er:1, if:2, or:3, uh:1
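The diagram's three phases can be sketched as a minimal in-process simulation; the four-file split of the slide's word stream is a hypothetical grouping for brevity:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(filename, contents):
    # map(String input_key, String input_value): emit (word, 1) per word
    for word in contents.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce(String output_key, Iterator intermediate_values): sum the counts
    yield (key, sum(values))

def run_wordcount(files):
    # map phase over every input split
    pairs = [kv for name, text in files.items() for kv in map_fn(name, text)]
    # shuffle: sort and group the intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=itemgetter(0))}
    # reduce phase
    return dict(kv for k, vs in grouped.items() for kv in reduce_fn(k, vs))

files = {"file1": "ah ah er", "file2": "ah if", "file3": "or or uh", "file4": "or ah if"}
print(run_wordcount(files))  # {'ah': 4, 'er': 1, 'if': 2, 'or': 3, 'uh': 1}
```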
MapReduce implementation on Hadoop
• JobTracker / TaskTracker: coordinate job and task execution
• InputFormat / RecordReader: parse input splits into records
• Mapper → Partitioner → Copy (shuffle) → Sorter → Reducer
• OutputFormat / RecordWriter: write the final output
Join algorithms using MapReduce
• Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. SIGMOD 2007
• Semi-join Computation on Distributed File Systems Using Map-Reduce-Merge Model. SAC 2010
• Optimizing Joins in a Map-Reduce Environment. VLDB 2009, EDBT 2010
• A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 2010
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. SIGMOD 2007
Map-Reduce-Merge Implementations of Relational Join Algorithms
Example: Hash Join
• Use a hash partitioner: each mapper reads one split and partitions its output by the hash of the join key
• Each reducer reads one designated partition from every mapper
• Each merger reads from two sets of reducer outputs that share the same hashing buckets
• One set is used as the build set and the other as the probe set
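The build/probe step inside one merger can be sketched as follows; the employee/department rows are hypothetical data standing in for one pair of co-hashed reducer outputs:

```python
from collections import defaultdict

def hash_join(build_side, probe_side):
    """Hash join for one merge bucket: build on one set, probe with the other."""
    table = defaultdict(list)
    for key, row in build_side:          # build phase: hash the build set
        table[key].append(row)
    out = []
    for key, row in probe_side:          # probe phase: look up each probe row
        for match in table.get(key, []):
            out.append((key, match, row))
    return out

# one pair of reducer outputs that landed in the same hash bucket (hypothetical)
emp  = [(1, "alice"), (2, "bob")]
dept = [(1, "sales"), (1, "hr"), (3, "ops")]
print(hash_join(emp, dept))  # [(1, 'alice', 'sales'), (1, 'alice', 'hr')]
```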
Analysis and conclusion
• Connections: A(ma, ra), B(mb, rb), r mergers; suppose ra = rb = r
• Map → Reduce: connections = ra*ma + rb*mb = r*(ma + mb)
• Reduce → Merge: in the one-to-one case, connections = 2r
• Matcher: compares tuples to decide whether they should be merged or not
• Conclusion
• Uses multiple map-reduce jobs
• The partitioner may cause a data-skew problem
• The choice of ma, ra, mb, rb, r (must ra = rb?) determines the number of connections
Semi-join computation steps and workflow
• Equi-join
• Reduces communication costs and disk I/O costs
• Insensitive to data skew?
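The idea behind the communication savings can be sketched as follows (a minimal single-process sketch with hypothetical data, not the paper's distributed workflow): only the distinct join keys of the small side travel to the nodes holding the large side, which is filtered locally before the full join.

```python
def semi_join(L, R, key):
    # step 1: extract the distinct join keys of the small table R
    r_keys = {r[key] for r in R}
    # step 2: ship only r_keys to the nodes holding L and filter L locally,
    # so only the matching L tuples are sent on to the final join
    return [l for l in L if l[key] in r_keys]

L = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}, {"k": 5, "v": "c"}]
R = [{"k": 1}, {"k": 5}]
print(semi_join(L, R, "k"))  # [{'k': 1, 'v': 'a'}, {'k': 5, 'v': 'c'}]
```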
A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD 2010
• Equi-join between a log table L and a reference table R on a single column: L ⋈L.k=R.k R, with |L| ≫ |R|
• L, R, and the join result are stored in DFS
• Scans are used to access L and R
• Each map or reduce task can optionally implement two additional functions, init() and close(), called before and after each map or reduce task
Repartition join (Hive)
[Figure: records from L: Ratings.dat (e.g. 1::661::3::978302109) and R: movies.dat (e.g. 661::James and the Giant…, 1193::One Flew Over the…) flow through map, shuffle, and reduce.]
• Map: tag each record with its source table and emit (join key, tagged record), e.g. (661, L:1::661::3::978302109) and (661, R:661::James and the Giant…)
• Shuffle: group by join key
• Reduce: buffer the records for one key into two sets according to the table tag, then emit the cross-product, e.g. {(661::James…)} × {(1::661::3::…), (1::661::3::…), (1::661::4::…)}
• Drawback: all records for a key may have to be buffered, causing out-of-memory errors when the key cardinality is small or the data is highly skewed
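The tag-shuffle-buffer steps above can be sketched as follows; the rows are hypothetical stand-ins in the slide's "id::field::…" format:

```python
from collections import defaultdict

def map_phase(tag, records, key_of):
    # tag every record with its table of origin so the reducer can tell them apart
    return [(key_of(rec), (tag, rec)) for rec in records]

def reduce_phase(tagged):
    # buffer the records for one join key into two sets according to the table
    # tag, then emit the cross-product; this buffering is the drawback: a
    # low-cardinality or highly skewed key can run the reducer out of memory
    r_buf = [rec for tag, rec in tagged if tag == "R"]
    l_buf = [rec for tag, rec in tagged if tag == "L"]
    return [(r, l) for r in r_buf for l in l_buf]

L = ["1::661::3", "1::661::4", "1::1193::5"]   # ratings: user::movie::stars
R = ["661::Movie A", "1193::Movie B"]          # movies:  movie::title

pairs = map_phase("L", L, lambda s: s.split("::")[1]) + \
        map_phase("R", R, lambda s: s.split("::")[0])
shuffle = defaultdict(list)                    # shuffle: group by join key
for k, v in pairs:
    shuffle[k].append(v)
joined = [out for vs in shuffle.values() for out in reduce_phase(vs)]
print(joined)
```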
The Cost Measure for MR Algorithms
• The communication cost of a process is the size of the input to the process
• This paper does not count the output size for a process
• The output must be input to at least one other process
• The final output is much smaller than its input
• The total communication cost is the sum of the communication costs of all processes that constitute an algorithm
• The elapsed communication cost is defined on the acyclic graph of processes
• Consider a path through this graph, and sum the communication costs of the processes along that path
• The maximum such sum, over all paths, is the elapsed communication cost
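The two measures can be illustrated on a tiny hypothetical process graph (two mappers feeding one reducer, with made-up input sizes):

```python
# hypothetical process DAG: each node's value is the size of its input,
# i.e. its communication cost
costs = {"map1": 10, "map2": 10, "reduce": 20}
edges = {"map1": ["reduce"], "map2": ["reduce"], "reduce": []}

# total communication cost: sum over all processes
total_cost = sum(costs.values())

def elapsed(node):
    # elapsed cost: the maximum cost sum over any path through the DAG
    nexts = edges[node]
    return costs[node] + (max(elapsed(n) for n in nexts) if nexts else 0)

elapsed_cost = max(elapsed(n) for n in costs)  # a path may start at any node
print(total_cost, elapsed_cost)  # 40 30
```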
2-Way Join in MapReduce
• R(A,B) ⋈ S(B,C)
• Map: send each tuple of R and S, keyed by its B-value, to the reduce process for that key: b → (a, c)
• Reduce: for each B-value, join the R-tuples and S-tuples received, producing the final output
Joining Several Relations at Once
• R(A,B) ⋈ S(B,C) ⋈ T(C,D)
[Figure: tuples of R, S, and T flow through Map into the reduce inputs and on to the final output.]
Joining Several Relations at Once
• R(A,B) ⋈ S(B,C) ⋈ T(C,D)
• Let h be a hash function with range 1, 2, …, m
• Reduce processes form an m × m grid indexed by (h(b), h(c)); with m = 4 the number of reduce processes is k = 4² = 16
• S(b, c) → the single process (h(b), h(c))
• R(a, b) → (h(b), all), i.e. replicated to every column
• T(c, d) → (all, h(c)), i.e. replicated to every row
• Each reduce process computes the join of the tuples it receives
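The routing rule can be sketched as follows; `route` returns the grid coordinates of every reduce process that must receive a tuple, so any matching (R, S, T) triple is guaranteed to meet at S's single reducer:

```python
m = 4  # hash range; k = m * m = 16 reduce processes indexed by (h(b), h(c))

def h(x):
    # stand-in hash with range 0 .. m-1
    return hash(x) % m

def route(kind, t):
    # which reduce processes receive this tuple
    if kind == "S":                      # S(b, c) goes to exactly one process
        b, c = t
        return [(h(b), h(c))]
    if kind == "R":                      # R(a, b) -> (h(b), all): one full row
        a, b = t
        return [(h(b), j) for j in range(m)]
    if kind == "T":                      # T(c, d) -> (all, h(c)): one full column
        c, d = t
        return [(i, h(c)) for i in range(m)]

print(route("R", ("a1", "b1")))  # the m processes in row h("b1")
```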
Problem Solving
• Find the optimal shares using the method of Lagrange multipliers
• Take derivatives with respect to the three variables a, b, c
• Multiply the three equations to eliminate the multiplier and solve for the shares
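The shape of that derivation, sketched here for a two-share instance matching the previous slide's setup (shares b and c on the two shared attributes, with bc = k reduce processes); the three-variable case from the slide follows the same multiply-the-equations pattern:

```latex
% S is sent once, R is replicated c times, T is replicated b times, so
% minimize f(b, c) = |R|\,c + |T|\,b + |S| subject to bc = k.
\mathcal{L} = |R|\,c + |T|\,b - \lambda\,(bc - k)
% Take derivatives with respect to each share:
\frac{\partial \mathcal{L}}{\partial b} = |T| - \lambda c = 0,
\qquad
\frac{\partial \mathcal{L}}{\partial c} = |R| - \lambda b = 0
% Multiply the equations to eliminate b and c via bc = k:
|R|\,|T| = \lambda^2\,bc = \lambda^2 k
\;\Rightarrow\;
\lambda = \sqrt{|R|\,|T|/k},
\quad
b = \sqrt{k\,|R|/|T|},
\quad
c = \sqrt{k\,|T|/|R|}
```

At the optimum the replicated communication is 2√(k·|R|·|T|), plus the fixed |S|.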
Special Cases
• Star Joins
• Chain Joins
• A chain join is a join of the form R1(A0,A1) ⋈ R2(A1,A2) ⋈ ⋯ ⋈ Rn(An−1,An)
Conclusion
• Suitable only for equi-joins
• Uses a single map-reduce job
• Does not consider I/O (for the intermediate <K,V> pairs) or CPU time
• Main contribution: use of the Lagrange-multipliers method to choose the shares