COP5725 Advanced Database Systems MapReduce Spring 2014
What is MapReduce? • Programming model • expressing distributed computations at a massive scale • “…the computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.” - Jeff Dean and Sanjay Ghemawat [OSDI’04] • Execution framework • organizing and performing data-intensive computations • processing parallelizable problems across huge datasets using a large number of computers (nodes) • Open-source implementation: Hadoop and others
Why does MapReduce Matter? • We are now in the so-called Big Data era • “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine, 2011 • “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2011 • In short: data whose volume, variety and velocity make it difficult to manage with traditional data management technology
How Much Data? • Google processes 100 PB (1 PB = 10^15 bytes) a day (2013) • Facebook has 300 PB of user data and adds 500 TB/day (2013) • YouTube stores about 1,000 PB of video (2013) • CERN’s LHC (Large Hadron Collider) generates about 15 PB a year (2013) • “640K ought to be enough for anybody”
Who Cares? • Organizations and companies that can leverage large-scale consumer-generated data • Consumer markets (hotels, airlines, Amazon, Netflix) • Social media (Facebook, Twitter, LinkedIn, YouTube) • Search providers (Google, Microsoft) • Other enterprises are slowly getting into it • Healthcare • Financial institutions • ……
Why not RDBMS? • Types of data • Structured data or transactions, text data, semi-structured data, unstructured data, streaming data, …… • Ways to cook data • Aggregation and statistics • Indexing, searching and querying • Knowledge discovery • Limitations • Very difficult to scale out (scaling up is easier) • Scaling up is physically limited by the CPUs, memory and disk storage of a single machine • Requires data structured as tables with rows and columns • Table schemas have to be pre-defined
What We Need… • A Distributed System • Scalable • Fault-tolerant • Easy to program • Applicable to many real-world Big Data problems • …… Here comes MapReduce
General Idea • Divide & Conquer • Partition the “work” into pieces w1, w2, w3, … and hand each piece to a “worker” • Each worker produces a partial result r1, r2, r3, … • Combine the partial results into the final “result”
Scale Out Over Many Machines • Challenges • Workload partitioning: how do we assign work units to workers? • Load balancing: what if we have more work units than workers? • Synchronization: what if workers need to share partial results? • Aggregation: how do we aggregate partial results? • Termination: how do we know all the workers have finished? • Fault tolerance: what if workers die? • Common theme • Communication between workers (e.g., to exchange states) • Access to shared resources (e.g., data)
Existing Methods • Programming models • Shared memory (pthreads) • Message passing (MPI) • Design Patterns • Master-slaves • Producer-consumer flows • Shared work queues • [Diagrams: processes P1…P5 sharing a common memory vs. passing messages; a master dispatching work to slaves; producers and consumers connected by a work queue]
Problem with Current Solutions • Lots of programming work • communication and coordination • work partitioning • status reporting • optimization • locality • Repeat for every problem you want to solve • Stuff breaks • One server may stay up three years (1,000 days) • If you have 10,000 servers, expect to lose 10 a day
MapReduce: General Ideas • Typical procedure: • Iterate over a large number of records • Extract something of interest from each (this is the map step) • Shuffle and sort intermediate results • Aggregate intermediate results (this is the reduce step) • Generate final output • Key idea: provide a functional abstraction for these two operations • map (k, v) → <k’, v’> • reduce (k’, [v’]) → <k’’, v’’> • All values with the same key are sent to the same reducer • The execution framework handles everything else…
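To make the abstraction concrete, here is a minimal, purely illustrative Python sketch of what the framework does with the user-supplied map and reduce functions. It simulates the flow on one machine; it is not the Hadoop or Google API, and the names run_mapreduce, map_fn and reduce_fn are invented for this sketch:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine simulation of the MapReduce data flow."""
    # Map phase: each input (k, v) record may emit zero or more (k', v') pairs
    intermediate = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            intermediate[k2].append(v2)
    # Shuffle and sort: all values with the same key end up in one list,
    # so each distinct key is handed to exactly one reduce call
    # Reduce phase: reduce_fn(k', [v']) -> (k'', v'')
    return [reduce_fn(k2, values) for k2, values in sorted(intermediate.items())]
```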
General Ideas • [Diagram: four mappers consume input pairs (k1, v1) … (k6, v6) and emit intermediate pairs such as (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8); shuffle and sort aggregates values by key, giving a → {1, 5}, b → {2, 7}, c → {2, 3, 6, 8}; three reducers then produce the final outputs (r1, s1), (r2, s2), (r3, s3)]
Two More Functions • Apart from Map and Reduce, the execution framework handles everything else… • Not quite…usually, programmers can also specify: • partition (k’, number of partitions) → partition for k’ • Divides up key space for parallel reduce operations • Often a simple hash of the key, e.g., hash(k’) mod n • combine(k’, v’) → <k’, v’>* • Mini-reducers that run in memory after the map phase • Used as an optimization to reduce network traffic
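A hedged sketch of these two hooks in the same illustrative Python style (the function names and signatures are simplified for illustration and do not match Hadoop's Partitioner or combiner interfaces):

```python
def partition(key, num_partitions):
    # Typical default partitioner: hash the key and take it modulo the number
    # of reducers; a real framework uses a deterministic hash function
    return hash(key) % num_partitions

def combine(key, values):
    # Example combiner for a sum-style job such as word count: collapse all
    # local (key, 1) pairs from one mapper into a single partial sum, so far
    # less data crosses the network during the shuffle
    yield (key, sum(values))
```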
[Diagram: the same flow as above, now with a combine step after each mapper and a partition step before the shuffle. For example, the mapper that emitted (c, 3) and (c, 6) combines them locally into (c, 9); after shuffle and sort the reducer for c receives {2, 9, 8} instead of {2, 3, 6, 8}, but still computes the same final result]
Importance of Local Aggregation • Ideal scaling characteristics: • Twice the data, twice the running time • Twice the resources, half the running time • Why can’t we achieve this? • Synchronization requires communication • Communication kills performance • Thus… avoid communication! • Reduce intermediate data via local aggregation • Combiners can help
Example: Word Count v1.0 • Input: {<document-id, document-contents>} • Output: <word, total-number-of-occurrences across all documents>, e.g., <“obama”, 1000>
Map input: <doc1, “obama is the president”>, <doc2, “hennesy is the president of stanford”>, … , <docn, “this is an example”>
Map output: <“obama”, 1>, <“is”, 1>, <“the”, 1>, <“president”, 1>, <“hennesy”, 1>, <“is”, 1>, <“the”, 1>, <“president”, 1>, <“of”, 1>, <“stanford”, 1>, … , <“this”, 1>, <“is”, 1>, <“an”, 1>, <“example”, 1>
Group by reduce key: <“obama”, {1}>, <“is”, {1, 1, 1}>, <“the”, {1, 1}>, …
Reduce output: <“obama”, 1>, <“is”, 3>, <“the”, 2>, …
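The corresponding mapper and reducer for word count, as a runnable illustrative Python sketch (a tiny inline driver stands in for the framework's shuffle and sort; this is not Hadoop API code):

```python
from collections import defaultdict

def wordcount_map(doc_id, contents):
    # Emit ("word", 1) once for every occurrence of every word in the document
    for word in contents.split():
        yield (word, 1)

def wordcount_reduce(word, counts):
    # The framework has already grouped the 1s for this word; just sum them
    return (word, sum(counts))

# Tiny driver standing in for the framework's shuffle-and-sort step
docs = [("doc1", "obama is the president"),
        ("doc2", "hennesy is the president of stanford")]
grouped = defaultdict(list)
for doc_id, text in docs:
    for word, one in wordcount_map(doc_id, text):
        grouped[word].append(one)
print([wordcount_reduce(w, cs) for w, cs in sorted(grouped.items())])
# [('hennesy', 1), ('is', 2), ('obama', 1), ('of', 1), ('president', 2), ('stanford', 1), ('the', 2)]
```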
Combiner Design • Combiners and reducers share the same method signature • Sometimes, reducers can serve as combiners • Often, not… • Remember: combiners are optional optimizations • Should not affect algorithm correctness • May be run 0, 1, or multiple times • Example: find the average of all integers associated with the same key
Computing the Mean v1.0 • Mapper emits <key, value> for every input integer; the reducer averages all values for each key • Why can’t we use the reducer as a combiner? The mean of partial means is not, in general, the mean of all the values, so running it 0, 1, or multiple times would change the result
Computing the Mean v2.0 (the combiner now emits partial (sum, count) pairs while the mapper still emits plain values) • Why doesn’t this work? • Combiners must have the same input and output key-value types, which must also match the mapper output type and the reducer input type
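A hedged sketch of the usual fix (roughly what a "v3.0" looks like): make the mapper itself emit (sum, count) pairs, so the combiner's input and output types match the mapper output and the reducer input, and the job stays correct whether the combiner runs 0, 1, or many times. Python illustration, not Hadoop code:

```python
def mean_map(key, value):
    # Each input value contributes a partial (sum, count) pair
    yield (key, (value, 1))

def mean_combine(key, pairs):
    # Pre-aggregate locally; output type equals input type: a (sum, count) pair
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, (total, count))

def mean_reduce(key, pairs):
    # Same aggregation, followed by the final division
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return (key, total / count)
```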
MapReduce Runtime • Handles scheduling • Assigns workers to map and reduce tasks • Handles “data distribution” • Moves processes to data • Handles synchronization • Gathers, sorts, and shuffles intermediate data • Handles errors and faults • Detects worker failures and restarts their tasks • Everything happens on top of a distributed FS
Execution • [Diagram, after Dean & Ghemawat: (1) the user program submits the job to the master; (2) the master schedules map tasks and reduce tasks on workers; (3) each map worker reads its input split (split 0 … split 4); (4) map output is written to local disk as intermediate files; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the final output files (output file 0, output file 1). Stages: input files → map phase → intermediate files (on local disk) → reduce phase → output files]
Implementation • Google has a proprietary implementation in C++ • Bindings in Java, Python • Hadoop is an open-source implementation in Java • Development led by Yahoo, used in production • Now an Apache project • Rapidly expanding software ecosystem • Lots of custom research implementations • For GPUs, cell processors, etc.
Distributed File System • Don’t move data to workers… move workers to the data! • Store data on the local disks of the nodes in the cluster • Start up the workers on the nodes that hold the data locally • Why? • Not enough RAM to hold all the data in memory • Disk access is slow, but disk throughput (data transfer rate) is reasonable • A distributed file system is the answer • GFS (Google File System) for Google’s MapReduce • HDFS (Hadoop Distributed File System) for Hadoop
GFS • Commodity hardware over “exotic” hardware • Scale “out”, not “up” • Scale out (horizontally): add more nodes to a system • Scale up (vertically): add resources to a single node in a system • High component failure rates • Inexpensive commodity components fail all the time • “Modest” number of huge files • Multi-gigabyte files are common, if not encouraged • Files are write-once, mostly appended to • Perhaps concurrently • Large streaming reads over random access • High sustained throughput over low latency
Seeks vs. Scans • Consider a 1 TB database with 100-byte records • We want to update 1 percent of the records • Scenario 1: random access • Each update takes ~30 ms (seek, read, write) • 10^8 updates = ~35 days • Scenario 2: rewrite all records • Assume 100 MB/s throughput • Time = 5.6 hours (!) • Lesson: avoid random seeks!
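The arithmetic behind these figures (a quick check, assuming Scenario 2 reads and rewrites the full 1 TB, i.e., 2 TB of sequential I/O at 100 MB/s):

```python
records = 10**12 // 100                     # 1 TB of 100-byte records = 10^10 records
updates = records // 100                    # update 1% of them        = 10^8 updates

random_days = updates * 0.030 / 86400       # ~30 ms per random update
scan_hours  = 2 * 10**12 / (100 * 10**6) / 3600   # read + rewrite 1 TB at 100 MB/s

print(f"random access: {random_days:.1f} days")   # ~34.7 days
print(f"full rewrite:  {scan_hours:.1f} hours")   # ~5.6 hours
```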
GFS • Files stored as chunks • Fixed size (64MB) • Reliability through replication • Each chunk replicated across 3+ chunk servers • Single master to coordinate access, keep metadata • Simple centralized management • No data caching • Little benefit due to large datasets, streaming reads • Simplify the API • Push some of the issues onto the client (e.g., data layout)
Relational Databases vs. MapReduce • Relational databases: • Multipurpose: analysis and transactions; batch and interactive • Data integrity via ACID transactions • Lots of tools in software ecosystem (for ingesting, reporting, etc.) • Supports SQL (and SQL integration, e.g., JDBC) • Automatic SQL query optimization • MapReduce (Hadoop): • Designed for large clusters, fault tolerant • Data is accessed in “native format” • Supports many query languages • Programmers retain control over performance • Open source
Workloads • OLTP (online transaction processing) • Typical applications: e-commerce, banking, airline reservations • User facing: real-time, low latency, highly-concurrent • Tasks: relatively small set of “standard” transactional queries • Data access pattern: random reads, updates, writes (involving relatively small amounts of data) • OLAP (online analytical processing) • Typical applications: business intelligence, data mining • Back-end processing: batch workloads, less concurrency • Tasks: complex analytical queries, often ad hoc • Data access pattern: table scans, large amounts of data involved per query
Relational Algebra in MapReduce • Projection • Map over tuples, emit new tuples with appropriate attributes • No reducers, unless for regrouping or resorting tuples • Alternatively: perform in reducer, after some other processing • Selection • Map over tuples, emit only tuples that meet criteria • No reducers, unless for regrouping or resorting tuples • Alternatively: perform in reducer, after some other processing
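Both operators need only a map phase; a minimal illustrative sketch in Python (the tuple layout and the predicate are made up for the example):

```python
def projection_map(key, tup):
    # Keep only the attributes of interest, here columns 0 and 2
    yield (key, (tup[0], tup[2]))

def selection_map(key, tup):
    # Emit the tuple only if it satisfies the predicate, here "column 1 > 100"
    if tup[1] > 100:
        yield (key, tup)
```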
Relational Algebra in MapReduce • Group by • Example: What is the average time spent per URL? • In SQL: • SELECT url, AVG(time) FROM visits GROUP BY url • In MapReduce: • Map over tuples, emit time, keyed by url • Framework automatically groups values by keys • Compute average in reducer • Optimize with combiners
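An illustrative sketch of the group-by query above (visit tuples are assumed to be (url, user, time); this is not Hadoop API code):

```python
def visits_map(record_id, visit):
    url, _user, time = visit            # visit assumed to be (url, user, time)
    yield (url, time)                   # key by url so the framework groups per URL

def visits_reduce(url, times):
    # One call per distinct url, with all of its visit times
    return (url, sum(times) / len(times))

# A combiner would emit (sum, count) pairs per url, exactly as in the
# computing-the-mean example, to cut shuffle traffic.
```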
Join in MapReduce • Reduce-side Join: group by join key • Map over both sets of tuples • Emit tuple as value with join key as the intermediate key • Execution framework brings together tuples sharing the same key • Perform actual join in reducer • Similar to a “sort-merge join” in database terminology
Reduce-side Join: Example • [Diagram: map emits tuples R1, R4, S2, S3 keyed by their join keys; after shuffle and sort, the reducer for one key receives R1 together with its matching S tuples S2 and S3, while R4 arrives at a different reducer under its own key] • Note: there is no guarantee whether the R tuple or the S tuples arrive first
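A sketch of the reduce-side join in the same illustrative style: the mapper tags every tuple with the relation it came from, keys it by the join attribute, and the reducer pairs up R tuples with S tuples for each key (tuple layouts are made up for the example):

```python
def join_map(_, tagged_tuple):
    relation, tup = tagged_tuple        # e.g. ("R", (key, a, b)) or ("S", (key, c))
    join_key = tup[0]                   # join attribute assumed to be field 0
    yield (join_key, (relation, tup))   # the tag survives the shuffle

def join_reduce(join_key, tagged_tuples):
    # No arrival-order guarantee, so separate the two sides first
    r_side = [t for rel, t in tagged_tuples if rel == "R"]
    s_side = [t for rel, t in tagged_tuples if rel == "S"]
    # Cross product of matching tuples for this join key
    return [(r, s) for r in r_side for s in s_side]
```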
Join in MapReduce • Map-side Join: parallel scans • Assume the two datasets (R1, R2, R3, R4 and S1, S2, S3, S4) are both sorted by the join key • A sequential scan through both datasets performs the join (called a “merge join” in database terminology)
Join in MapReduce • Map-side Join • If datasets are sorted by join key, join can be accomplished by a scan over both datasets • How can we accomplish this in parallel? • Partition and sort both datasets in the same manner • In MapReduce: • Map over one dataset, read from other corresponding partition • No reducers necessary (unless to repartition or resort)
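The per-partition work of a map-side join is essentially a merge join over two sorted runs; a minimal sketch of that merge step (each list is one mapper's partition of R and the corresponding partition of S, both sorted by join key in field 0; the example data is hypothetical):

```python
def merge_join(r_part, s_part):
    # Both inputs are lists of tuples sorted by join key (field 0)
    out, i, j = [], 0, 0
    while i < len(r_part) and j < len(s_part):
        rk, sk = r_part[i][0], s_part[j][0]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Emit every S tuple matching this R tuple, then advance R;
            # j stays put so duplicate R keys also see these S tuples
            k = j
            while k < len(s_part) and s_part[k][0] == rk:
                out.append((r_part[i], s_part[k]))
                k += 1
            i += 1
    return out

# Example (hypothetical data):
# merge_join([(1, "r1"), (2, "r2")], [(2, "s1"), (2, "s2"), (3, "s3")])
# -> [((2, "r2"), (2, "s1")), ((2, "r2"), (2, "s2"))]
```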
Join in MapReduce • In-memory Join • Basic idea: load one dataset into memory, stream over other dataset • Works if R << S and R fits into memory • Called a “hash join” in database terminology • MapReduce implementation • Distribute R to all nodes • Map over S, each mapper loads R in memory, hashed by join key • For every tuple in S, look up join key in R • No reducers, unless for regrouping or resorting tuples
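A sketch of the in-memory (hash, or broadcast) join: every mapper builds a hash table over the small relation R, then streams its split of S and probes the table; map-only, no reducers (join key assumed to be field 0):

```python
from collections import defaultdict

def build_r_table(r_tuples):
    # R must be small enough to fit in each mapper's memory
    table = defaultdict(list)
    for tup in r_tuples:
        table[tup[0]].append(tup)       # hash R on the join key
    return table

def hashjoin_map(s_tuple, r_table):
    # Probe the in-memory table for every streamed S tuple
    for r_tuple in r_table.get(s_tuple[0], []):
        yield (r_tuple, s_tuple)
```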
Which Join Algorithm to Use? • In-memory join > map-side join > reduce-side join • Why? • Limitations of each? • In-memory join: memory • Map-side join: sort order and partitioning • Reduce-side join: general purpose
Processing Relational Data: Summary • MapReduce algorithms for processing relational data: • Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce • Selection, projection, and other computations (e.g., aggregation), are performed either in mapper or reducer • Multiple strategies for relational joins • Complex operations require multiple MapReduce jobs • Example: top ten URLs in terms of average time spent • Opportunities for automatic optimization