Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Presented by: Mohammed Ali Alawi Shehab
Outline • Introduction • Heterogeneous & Homogeneous databases • Map-Reduce • Map-Reduce-Merge • Optimizations • Enhancements • Conclusions
Introduction • Search engines process and manage vast amounts of data collected from the WWW. • To reduce DBMS cost, they usually build large clusters of shared-nothing commodity nodes. • Example: the Google File System (GFS). • Hadoop is an open-source implementation. • Users include Google, Yahoo!, Facebook, Amazon, and others.
Heterogeneous & Homogeneous databases • Homogeneous: different nodes use the same technology at each location. • Heterogeneous: different nodes may use different and incompatible technologies at each location. • Technology examples: • operating system used • data structures used • database application
Map-Reduce • Map: a function that processes input key/value pairs and emits a list of intermediate key/value pairs. • Reduce: a function that merges all intermediate values associated with the same key and generates the outputs (a minimal sketch of both functions follows). • The Map-Reduce framework is best at handling homogeneous datasets. • Multiple heterogeneous datasets do not quite fit into the Map-Reduce framework.
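The sketch below is my own minimal illustration of these two signatures in plain Python (a word-count style example, not code from the slides or the paper); the names map_fn, reduce_fn, and run are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    # Emit an intermediate (word, 1) pair for every word in the input line.
    for word in value.split():
        yield word, 1

def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # Merge all intermediate values sharing the same key into one output.
    return key, sum(values)

def run(records):
    # Shuffle: group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(run([("doc1", "map reduce merge"), ("doc2", "map reduce")]))
# -> [('map', 2), ('reduce', 2), ('merge', 1)]
```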
Map-Reduce cont. • A search engine stores: • Crawled URLs in a crawler database • Inverted indexes in an index database • Click or execution logs in log databases • These databases are huge and distributed over a large cluster of nodes.
Map-Reduce Features and Principles • Low cost: • High performance • Symmetric multiprocessing (SMP) • Scalable RAIN cluster • Fault tolerant yet easy to administer: • Replicates data and launches backup tasks • New nodes can be plugged in at any time • High throughput
Map-Reduce Features and Principles cont. • Shared-disk storage yet shared-nothing computing: • Map and Reduce tasks share an integrated GFS that makes thousands of disks behave like one. • Distributed partitioning/sorting framework: • A partition function distributes mapper outputs to reducers by key (a minimal sketch follows).
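A minimal sketch of such a partition function, assuming a hash-based partitioner over R reducers (my own illustration; the function name and the use of CRC32 are assumptions, not the framework's actual code):

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # A stable hash ensures every mapper routes a given key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

print(partition("url-123", 4))  # a reducer index in [0, 4)
```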
Map-Reduce-Merge • Keeps the Map and Reduce functions unchanged. • Adds a Merge phase that joins the reduced outputs (a toy merge is sketched below). • This makes it more efficient and easier to process data relationships among heterogeneous datasets.
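A minimal sketch, my own illustration rather than the paper's code, of a merge step that joins the reduced outputs of two lineages A and B on their keys (the data and the name merge_fn are hypothetical):

```python
from typing import Dict, Iterator, Tuple

def merge_fn(reduced_a: Dict[str, int],
             reduced_b: Dict[str, str]) -> Iterator[Tuple[str, Tuple[int, str]]]:
    # Equi-join: emit one merged record for every key present in both lineages.
    for key, a_val in reduced_a.items():
        if key in reduced_b:
            yield key, (a_val, reduced_b[key])

# Reduced outputs of two separate Map-Reduce passes (hypothetical data).
sales_by_dept = {"toys": 120, "books": 80}         # lineage A
manager_by_dept = {"toys": "Ana", "games": "Bo"}   # lineage B
print(list(merge_fn(sales_by_dept, manager_by_dept)))
# -> [('toys', (120, 'Ana'))]
```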
Map-Reduce & Map-Reduce-Merge [Diagram comparing the Map-Reduce and Map-Reduce-Merge data flows; the signatures below summarize the difference]
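Paraphrased from the Map-Reduce-Merge paper (from memory, so treat as a summary rather than a quotation), the Map-Reduce-Merge signatures are written below; α and β denote the two dataset lineages. In plain Map-Reduce, reduce emits only a list of values [v3] and there is no merge step.

```latex
\begin{align*}
\text{map:}    &\; (k_1, v_1) \rightarrow [(k_2, v_2)] \\
\text{reduce:} &\; (k_2, [v_2]) \rightarrow (k_2, [v_3]) \\
\text{merge:}  &\; \big((k_2, [v_3])_{\alpha},\, (k_3, [v_4])_{\beta}\big) \rightarrow [(k_4, v_5)]
\end{align*}
```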
Example [Illustration of the Mapping, Reducing, and Merging steps on sample datasets; a toy run is sketched below]
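Since the slide's figure is not reproduced here, the following end-to-end toy run is my own substitute example (hypothetical data and names): two datasets, per-employee bonuses and department names, pass through mapping, reducing, and merging.

```python
from collections import defaultdict

# Two input datasets (hypothetical): per-employee bonuses and department names.
bonuses = [("emp1", ("dept-A", 50)), ("emp2", ("dept-A", 30)), ("emp3", ("dept-B", 20))]
departments = [("dept-A", "Engineering"), ("dept-B", "Sales")]

# Mapping: re-key bonus records by department id; departments are already keyed.
mapped_a = [(dept, amount) for _, (dept, amount) in bonuses]
mapped_b = departments

# Reducing: sum bonuses per department; pass department names through unchanged.
reduced_a = defaultdict(int)
for dept, amount in mapped_a:
    reduced_a[dept] += amount
reduced_b = dict(mapped_b)

# Merging: join the two reduced outputs on department id.
merged = [(d, reduced_b[d], total) for d, total in reduced_a.items() if d in reduced_b]
print(merged)  # -> [('dept-A', 'Engineering', 80), ('dept-B', 'Sales', 20)]
```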
Optimizations • Optimal Reduce-Merge connections: • Remote reads between mappers (M) and reducers (R) total R × (M_A + M_B) when the datasets on A and B are the same size. • If the datasets on A and B are not the same size, the remote reads are R + R. • Remote reads between reducers and mergers total 2R. • A worked instance of these counts follows.
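Plugging sample numbers (my own, for illustration only) into the counts above, say M_A = M_B = 4 mappers per dataset and R = 2 reducers:

```latex
R \times (M_A + M_B) = 2 \times (4 + 4) = 16 \quad \text{mapper-to-reducer remote reads},
\qquad 2R = 2 \times 2 = 4 \quad \text{reducer-to-merger remote reads}.
```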
Enhancements • A library of merge-phase join types (a hypothetical sketch follows). • Configurable workflows.
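The sketch below is purely hypothetical (not the paper's actual API): it shows how a small library of merge-phase join strategies might be selected by name from a configured workflow. All names (nested_loop_join, hash_join, MERGE_LIBRARY) are assumptions for illustration.

```python
def nested_loop_join(a, b):
    # Compare every pair of records; simple but O(|a| * |b|).
    return [(ka, va, vb) for ka, va in a for kb, vb in b if ka == kb]

def hash_join(a, b):
    # Build a hash table on one side, then probe it with the other side.
    table = {}
    for kb, vb in b:
        table.setdefault(kb, []).append(vb)
    return [(ka, va, vb) for ka, va in a for vb in table.get(ka, [])]

MERGE_LIBRARY = {"nested-loop": nested_loop_join, "hash": hash_join}

print(MERGE_LIBRARY["hash"]([("x", 1)], [("x", "left"), ("y", "right")]))
# -> [('x', 1, 'left')]
```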
Conclusions • Map-Reduce and GFS represent a rethinking of data processing. • This "simplified" philosophy drives down hardware and software costs. • Map-Reduce does not directly support joins of heterogeneous datasets, so a Merge phase is added.