Commodity Clusters • Web data sets can be very large • Tens to hundreds of terabytes • Cannot mine on a single server • Standard architecture emerging: • Cluster of commodity Linux nodes • Gigabit Ethernet interconnect • How to organize computations on this architecture?
Cluster Architecture • Each rack contains 16-64 commodity nodes, each with its own CPU, memory, and disk. • Nodes within a rack are connected by a switch, with 1 Gbps between any pair of nodes. • Racks are connected by a 2-10 Gbps backbone.
Map Reduce • Map-reduce is a high-level programming system that allows database processes to be written simply. • The user writes code for two functions, map and reduce. • A master controller divides the input data into chunks, and assigns different processors to execute the map function on each chunk. • Other processors, perhaps the same ones, are then assigned to perform the reduce function on pieces of the output from the map function.
Data Organization • Data is assumed stored in files. • Typically, the files are very large compared with the files found in conventional systems. • For example, one file might be all the tuples of a very large relation. • Or, the file might be a terabyte of "market baskets." • Or, the file might be the "transition matrix of the Web," which is a representation of the graph with all Web pages as nodes and hyperlinks as edges. • Files are divided into chunks, which might be complete cylinders of a disk, and are typically many megabytes.
The Map Function • Input is a set of key-value records. • Executed by one or more processes, located at any number of processors. • Each map process is given a chunk of the entire input data on which to work. • Output is a list of key-value pairs. • The types of keys and values for the output of the map function need not be the same as the types of input keys and values. • The "keys" that are output from the map function are not true keys in the database sense. • That is, there can be many pairs with the same key value.
Map Example Constructing an Inverted Index • Input is a collection of documents. • Final output (not the output of map) is, for each word, the list of the documents that contain that word at least once. Map Function • Input is a set of (i, d) pairs, where • i is a document ID, • d is the corresponding document. • The map function scans d and, for each word w it finds, emits the pair (w, i). • Notice that in the output, the word is the key and the document ID is the associated value. • Output of map is a list of word-ID pairs. • It is not necessary to catch duplicate words in the document; the elimination of duplicates can be done later, in the reduce phase. • The intermediate result is the collection of all word-ID pairs created from all the documents in the input database.
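A minimal sketch of this map function in plain Python (framework-agnostic; the name inverted_index_map and whitespace tokenization are illustrative assumptions, not part of the original formulation):

def inverted_index_map(doc_id, text):
    # Emit one (word, document ID) pair for every word occurrence.
    # Duplicates are not eliminated here; that happens in the reduce phase.
    return [(word, doc_id) for word in text.split()]

# inverted_index_map(7, "to be or not to be")
# -> [('to', 7), ('be', 7), ('or', 7), ('not', 7), ('to', 7), ('be', 7)]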
Note. The output of a map-reduce algorithm is always a set of key-value pairs. Since this matches the form of the input, it is useful in some applications to compose two or more map-reduce operations.
The Reduce Function • The second user-defined function, reduce, is also executed by one or more processes, located at any number of processors. • Input is a key value from the intermediate result, together with the list of all values that appear with this key in the intermediate result. • The reduce function itself combines the list of values associated with a given key k.
Reduce Example Constructing an Inverted Index • Input is a collection of documents. • Final output is, for each word, the list of the documents that contain that word at least once. Reduce Function • The intermediate result consists of pairs of the form (w, [i1, i2, …, in]), • where the i's are a list of document IDs, one for each occurrence of word w. • The reduce function takes a list of IDs, eliminates duplicates, and sorts the list of unique IDs.
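A matching sketch of the reduce function, under the same assumptions (the name inverted_index_reduce is illustrative):

def inverted_index_reduce(word, doc_ids):
    # doc_ids holds one ID per occurrence of word, possibly with duplicates.
    # Eliminate duplicates and sort the unique IDs.
    return (word, sorted(set(doc_ids)))

# inverted_index_reduce('be', [7, 3, 7, 1])  ->  ('be', [1, 3, 7])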
Parallelism • This organization of the computation makes excellent use of whatever parallelism is available. • The map function works on a single document, so we could have as many processes and processors as there are documents in the database. • The reduce function works on a single word, so we could have as many processes and processors as there are words in the database. • Of course, it is unlikely that we would use so many processors in practice.
Another Example – Word Count Construct a word count. • For each word w that appears at least once in our database of documents, output the pair (w, c), where c is the number of times w appears among all the documents. The map function • Input is a document. • Goes through the document and, each time it encounters another word w, emits the pair (w, 1). • Intermediate result is a list of pairs (w1, 1), (w2, 1), …. The reduce function • Input is a pair (w, [1, 1, …, 1]), with a 1 for each occurrence of word w. • Sums the 1's, producing the count. • Output is word-count pairs (w, c).
What about Joins? Compute the natural join R(A, B) ⋈ S(B, C). The map function • Input is key-value pairs (X, t), where • X is either R or S, • t is a tuple of the relation named by X. • Output is a single pair (b, (R, a)) or (b, (S, c)), depending on X, where • b is the B-value of t, • a is the A-value of t (if X = R), • c is the C-value of t (if X = S). The reduce function • Input is a pair (b, [(R, a), (S, c), …]). • Extracts all the A-values associated with R and all the C-values associated with S. These are paired in all possible ways, with the b in the middle, to form a tuple (a, b, c) of the result.
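A sketch of both functions in plain Python (the function names and the tagging scheme using the strings 'R' and 'S' are illustrative assumptions):

def join_map(X, t):
    # t is a tuple of R(A, B) or S(B, C); key it by its B-value and
    # tag the remaining component with the relation it came from.
    if X == 'R':
        a, b = t
        return [(b, ('R', a))]
    else:
        b, c = t
        return [(b, ('S', c))]

def join_reduce(b, tagged):
    # Pair every A-value from R with every C-value from S, b in the middle.
    a_values = [v for (tag, v) in tagged if tag == 'R']
    c_values = [v for (tag, v) in tagged if tag == 'S']
    return [(a, b, c) for a in a_values for c in c_values]

# join_reduce(2, [('R', 'x'), ('S', 'y'), ('R', 'z')])
# -> [('x', 2, 'y'), ('z', 2, 'y')]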
Reading • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html • DISCO (Nokia) – open-source Erlang implementation of MapReduce with a Python interface. http://discoproject.org • Hadoop (Apache) – open-source implementation of MapReduce. http://hadoop.apache.org/core
Word Count in DISCO

def fun_map(e, params):
    return [(w, 1) for w in e.split()]

def fun_reduce(iter, out, params):
    stats = {}
    for word, count in iter:
        if word in stats:
            stats[word] += int(count)
        else:
            stats[word] = int(count)
    for word, total in stats.iteritems():
        out.add(word, total)
Word Count in DISCO

import sys
from disco import Disco, result_iterator

master = sys.argv[1]

print "Starting Disco job.."
print "Go to %s to see status of the job." % master

results = Disco(master).new_job(
    name = "wordcount",
    input = ["http://discoproject.org/chekhov.txt"],
    map = fun_map,
    reduce = fun_reduce).wait()

print "Job done. Results:"
for word, frequency in result_iterator(results):
    print word, frequency
Word Count in DISCO

mkdir bigtxt
split -l 100000 bigfile.txt bigtxt/bigtxt-

After running these commands, the directory bigtxt contains many files, named bigtxt-aa, bigtxt-ab, etc., each containing 100,000 lines (except the last chunk, which may contain fewer).
Decision Trees • Key observation (RainForest): • The best split for a node can be determined if we have the AVC-sets for the node (AVC stands for Attribute-Value, Class label): the counts of tuples for each (attribute value, class label) combination. • AVC-sets are typically small and probably fit in main memory. • E.g., the AVC-set for an attribute "age" and class "car type" has cardinality not more than 100 × 5 (roughly 100 distinct ages times 5 class labels, i.e., at most 500 entries). • Remarks: • AVC-sets aren't a compact representation of the dataset. • We can't reconstruct the dataset from the AVC-sets. • Although the AVC-sets can be small, the dataset can be very big! • Challenge: • Computing the AVC-sets.
AVC-sets in Map Reduce The map function • Input is (rid, tuple) pairs. • Output is a list of ((a, v, c), 1) pairs, where • a is an attribute name, • v is the corresponding attribute value, • c is the class label of the tuple. The reduce function • Input is a pair ((a, v, c), [1, 1, …, 1]). • Adds up the 1's, giving the count for that (attribute, value, class) combination.
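A sketch in plain Python, assuming each tuple is represented as a dict whose class label is stored under the (hypothetical) key 'class':

def avc_map(rid, record):
    # Emit ((attribute, value, class), 1) for every predictor attribute.
    c = record['class']
    return [((a, v, c), 1) for a, v in record.items() if a != 'class']

def avc_reduce(key, ones):
    # key is an (attribute, value, class) triple; summing the 1's gives
    # the count needed for the corresponding AVC-set entry.
    return (key, sum(ones))

# avc_map(0, {'age': 31, 'income': 'high', 'class': 'sports'})
# -> [(('age', 31, 'sports'), 1), (('income', 'high', 'sports'), 1)]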