Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Presented by: Mohammed Ali Alawi Shehab
Outline • Introduction • Heterogeneous & Homogeneous databases • Map-Reduce • Map-Reduce-Merge • Optimizations • Enhancements • Conclusions
Introduction • Search engines process and manage vast amounts of data collected from the WWW. • To reduce DBMS cost, they usually build large clusters of shared-nothing commodity nodes. • Example: the Google File System (GFS). • Hadoop is an open-source implementation. • Users include Google, Yahoo!, Facebook, Amazon, and others.
Heterogeneous & Homogeneous databases • Homogeneous: different nodes use the same technology at each location. • Heterogeneous: different nodes may use different and incompatible technologies at each location. • Technology examples: • operating system used • data structures used • database application
Map-Reduce • Map: a function that processes input key/value pairs and emits a list of intermediate key/value pairs. • Reduce: a function that merges all intermediate values associated with the same key and generates the outputs (a minimal sketch of both functions follows). • The Map-Reduce framework is best at handling homogeneous datasets. • Multiple heterogeneous datasets do not quite fit into the Map-Reduce framework.
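The sketch below is my own minimal illustration of these two signatures in plain Python (a word-count style example, not code from the slides or the paper); the names map_fn, reduce_fn, and run are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    # Emit an intermediate (word, 1) pair for every word in the input line.
    for word in value.split():
        yield word, 1

def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # Merge all intermediate values sharing the same key into one output.
    return key, sum(values)

def run(records):
    # Shuffle: group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(run([("doc1", "map reduce merge"), ("doc2", "map reduce")]))
# -> [('map', 2), ('reduce', 2), ('merge', 1)]
```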
Map-Reduce cont. • A search engine stores: • Crawled URLs in a crawler database • Inverted indexes in an index database • Click or execution logs in log databases • These databases are huge and distributed over a large cluster of nodes.
Map-Reduce Features and Principles • Low cost: • High performance • Symmetric multiprocessing (SMP) • Scalable RAIN cluster • Fault tolerant yet easy to administer: • Replicates data and launches backup tasks • New nodes can be plugged in at any time • High throughput
Map-Reduce Features and Principles cont. • Shared-disk storage yet shared-nothing computing: • Map and Reduce tasks share an integrated GFS that makes thousands of disks behave like one. • Distributed partitioning/sorting framework: • A partition function distributes mapper outputs to reducers by key (a minimal sketch follows).
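A minimal sketch of such a partition function, assuming a hash-based partitioner over R reducers (my own illustration; the function name and the use of CRC32 are assumptions, not the framework's actual code):

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # A stable hash ensures every mapper routes a given key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

print(partition("url-123", 4))  # a reducer index in [0, 4)
```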
Map-Reduce-Merge • Keeps the Map and Reduce functions unchanged. • Adds a Merge phase that joins the reduced outputs (a toy merge is sketched below). • This makes it more efficient and easier to process data relationships among heterogeneous datasets.
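A minimal sketch, my own illustration rather than the paper's code, of a merge step that joins the reduced outputs of two lineages A and B on their keys (the data and the name merge_fn are hypothetical):

```python
from typing import Dict, Iterator, Tuple

def merge_fn(reduced_a: Dict[str, int],
             reduced_b: Dict[str, str]) -> Iterator[Tuple[str, Tuple[int, str]]]:
    # Equi-join: emit one merged record for every key present in both lineages.
    for key, a_val in reduced_a.items():
        if key in reduced_b:
            yield key, (a_val, reduced_b[key])

# Reduced outputs of two separate Map-Reduce passes (hypothetical data).
sales_by_dept = {"toys": 120, "books": 80}         # lineage A
manager_by_dept = {"toys": "Ana", "games": "Bo"}   # lineage B
print(list(merge_fn(sales_by_dept, manager_by_dept)))
# -> [('toys', (120, 'Ana'))]
```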
Map-Reduce & Map-Reduce-Merge [Diagram comparing the Map-Reduce and Map-Reduce-Merge data flows; the signatures below summarize the difference]
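Paraphrased from the Map-Reduce-Merge paper (from memory, so treat as a summary rather than a quotation), the Map-Reduce-Merge signatures are written below; α and β denote the two dataset lineages. In plain Map-Reduce, reduce emits only a list of values [v3] and there is no merge step.

```latex
\begin{align*}
\text{map:}    &\; (k_1, v_1) \rightarrow [(k_2, v_2)] \\
\text{reduce:} &\; (k_2, [v_2]) \rightarrow (k_2, [v_3]) \\
\text{merge:}  &\; \big((k_2, [v_3])_{\alpha},\, (k_3, [v_4])_{\beta}\big) \rightarrow [(k_4, v_5)]
\end{align*}
```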
Example [Illustration of the Mapping, Reducing, and Merging steps on sample datasets; a toy run is sketched below]
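Since the slide's figure is not reproduced here, the following end-to-end toy run is my own substitute example (hypothetical data and names): two datasets, per-employee bonuses and department names, pass through mapping, reducing, and merging.

```python
from collections import defaultdict

# Two input datasets (hypothetical): per-employee bonuses and department names.
bonuses = [("emp1", ("dept-A", 50)), ("emp2", ("dept-A", 30)), ("emp3", ("dept-B", 20))]
departments = [("dept-A", "Engineering"), ("dept-B", "Sales")]

# Mapping: re-key bonus records by department id; departments are already keyed.
mapped_a = [(dept, amount) for _, (dept, amount) in bonuses]
mapped_b = departments

# Reducing: sum bonuses per department; pass department names through unchanged.
reduced_a = defaultdict(int)
for dept, amount in mapped_a:
    reduced_a[dept] += amount
reduced_b = dict(mapped_b)

# Merging: join the two reduced outputs on department id.
merged = [(d, reduced_b[d], total) for d, total in reduced_a.items() if d in reduced_b]
print(merged)  # -> [('dept-A', 'Engineering', 80), ('dept-B', 'Sales', 20)]
```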
Optimizations • Optimal Reduce-Merge connections: • Remote reads between mappers (M) and reducers (R) total R × (M_A + M_B) when the datasets on A and B are the same size. • If the datasets on A and B are not the same size, the remote reads are R + R. • Remote reads between reducers and mergers total 2R. • A worked instance of these counts follows.
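Plugging sample numbers (my own, for illustration only) into the counts above, say M_A = M_B = 4 mappers per dataset and R = 2 reducers:

```latex
R \times (M_A + M_B) = 2 \times (4 + 4) = 16 \quad \text{mapper-to-reducer remote reads},
\qquad 2R = 2 \times 2 = 4 \quad \text{reducer-to-merger remote reads}.
```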
Enhancements • A library of merge-phase join types (a hypothetical sketch follows). • Configurable workflows.
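The sketch below is purely hypothetical (not the paper's actual API): it shows how a small library of merge-phase join strategies might be selected by name from a configured workflow. All names (nested_loop_join, hash_join, MERGE_LIBRARY) are assumptions for illustration.

```python
def nested_loop_join(a, b):
    # Compare every pair of records; simple but O(|a| * |b|).
    return [(ka, va, vb) for ka, va in a for kb, vb in b if ka == kb]

def hash_join(a, b):
    # Build a hash table on one side, then probe it with the other side.
    table = {}
    for kb, vb in b:
        table.setdefault(kb, []).append(vb)
    return [(ka, va, vb) for ka, va in a for vb in table.get(ka, [])]

MERGE_LIBRARY = {"nested-loop": nested_loop_join, "hash": hash_join}

print(MERGE_LIBRARY["hash"]([("x", 1)], [("x", "left"), ("y", "right")]))
# -> [('x', 1, 'left')]
```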
Conclusions • Map-Reduce and GFS represent a rethinking of data processing. • This "simplified" philosophy drives down hardware and software costs. • Map-Reduce does not directly support joins of heterogeneous datasets, so a Merge phase is added.