Map Reduce

  1. Map Reduce
  Based on A. Rajaraman and J. D. Ullman, Mining Massive Data Sets, Cambridge U. Press, 2009; Ch. 2. Some figures “stolen” from their slides.

  2. Big Data and Cluster Computing 1/2
  • What’s “big data”? Very large – think several terabytes – and often beyond the storage capacity of a single compute node.
  • While there is no unique definition of big data, the kind we focus on here has two properties:
  • It is enormous in size (common to all kinds of big data!)
  • Updates mostly take the form of appends; in-place updates are rare.

  3. Big Data and Cluster Computing 2/2
  • Some examples of big data:
  • The web graph
  • Social networks
  • Computations on big data are expensive:
  • Computing PageRank: iterated matrix-vector products over tens of billions of web pages
  • Finding your friends on Facebook (or another social network): search over a graph with > 100M nodes and > 10B edges
  • Similarity “search” in recommender systems
  • Some non-examples:
  • A bank accounts database, no matter how large (why?)
  • Any update-intensive (i.e., modify-intensive) database
  • Online retail stores

  4. Compute Node
  [Figure: a single compute node, with a CPU, memory, and disk.]
  • “Big data” typically far exceeds the capabilities of a single compute node.

  5. Cluster Computing – Distributed File System
  [Figure: racks of compute nodes (each with CPU, memory, and disk) connected by switches. Each rack contains 16-64 nodes, with 1 Gbps bandwidth between any pair of nodes in a rack and a 2-10 Gbps backbone between racks.]
  • Examples: Google DFS; Hadoop DFS (Apache); CloudStore (open source; Kosmix, now part of Walmart Labs).

  6. DFS
  • Divide each file into chunks (e.g., 64 MB) and replicate the chunks in different racks (e.g., 3 times) – redundancy and resiliency against failures.
  • Divide the computation into independent tasks: if task T fails, it can be restarted without affecting tasks T′ ≠ T.
  • This is the Map Reduce paradigm – tolerant to hardware failure.
  • Master file: where are the chunks of this file? The master file is itself replicated, and the directory of the DFS keeps track of its copies.

  7. Map Reduce Schematic
  [Figure: input chunks feed Map tasks, which emit (key, value) pairs (ki, vi); a master controller groups the pairs by key into pairs (k, [v1, ..., vm]); Reduce tasks consume these groups, and their outputs are combined.]
  • A chunk is a set of elements, e.g., tuples, docs, tweets, ...
  • A Map task may get more than one chunk as input.

  8. Example 1 – Count word occurrences in a collection of documents
  • Element = doc; key = word; value = frequency.
  • Chunk = {docs} → Map task → (k, v) pairs, initially just (w, 1).
  • Master controller: group by word (across the output from the various Map tasks) and merge all values.
  • Reduce task: each key is hashed to some Reduce task, which aggregates (in this case, sums) all values associated with that key.
  • The output from all Reduce tasks is merged again.
  • Typically, the Reduce function is associative and commutative, e.g., addition.
  • Some of the Reduce functionality can be “pushed” inside the Map tasks. (Reduce still needs to do its job.)
  • The numbers of Map tasks and Reduce tasks are decided by the user.
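
These Map and Reduce functions are easy to sketch in Python. The following is a minimal illustration, not how Hadoop is actually invoked; run_mapreduce is a hypothetical in-memory stand-in for the framework (apply Map, group by key, apply Reduce), reused by the later sketches in this transcript:

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Toy master controller: apply Map, group the (key, value)
        # pairs by key, then apply Reduce to each group.
        groups = defaultdict(list)
        for x in inputs:
            for k, v in map_fn(x):
                groups[k].append(v)
        return [out for k, vs in sorted(groups.items()) for out in reduce_fn(k, vs)]

    def wordcount_map(doc):
        # Element = doc; emit (w, 1) for each word occurrence.
        for word in doc.split():
            yield (word, 1)

    def wordcount_reduce(word, counts):
        # Sum all values associated with the key.
        yield (word, sum(counts))

    docs = ["the cat sat", "the dog sat"]
    print(run_mapreduce(docs, wordcount_map, wordcount_reduce))
    # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]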

  9. Example 2 – Matrix-Vector Multiplication
  • At the core of the PageRank computation.
  • x = M v, where M is n × n and v is n × 1, so x_i = Σ_j m_ij · v_j; here n ≈ 10^10.
  • M is extremely sparse (the web link graph): roughly 10-15 non-zeros per row.
  • Assume v will fit in memory. Then:
  • Map: (chunk of M, v) → pairs (i, m_ij · v_j); what is the key for the terms in the sum expression for x_i?
  • Reduce: add up all values for a given key i → x_i.
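
A sketch in the same style, reusing the hypothetical run_mapreduce helper from the word-count example; it assumes M arrives as sparse (i, j, m_ij) triples and that v is a small in-memory list:

    def make_mv_map(v):
        def map_fn(entry):
            # entry = (i, j, m_ij); the key is the row index i.
            i, j, m_ij = entry
            yield (i, m_ij * v[j])
        return map_fn

    def mv_reduce(i, products):
        # x_i = sum over j of m_ij * v_j.
        yield (i, sum(products))

    M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # [[1, 2], [0, 3]] as triples
    v = [1.0, 1.0]
    print(run_mapreduce(M, make_mv_map(v), mv_reduce))   # [(0, 3.0), (1, 3.0)]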

  10. Matrix-Vector Multiplication – what if v won’t fit in memory?
  [Figure: v is divided into horizontal stripes and M into matching vertical stripes (one color per stripe); each stripe of the matrix is further divided into chunks, and each chunk is multiplied with the stripe of v it corresponds to.]

  11. Relational Algebra
  • Review RA (from any database text).
  • We discuss MR implementations of RA not because we want to implement a DBMS over MR.
  • Rather, operations/computations over large networks can be captured using RA.
  • An efficient MR implementation of RA → an efficient implementation of a whole family of such computations over large networks.
  • E.g., (node pairs connected by) paths of length two: PROJECT_{L1.From, L2.To}(RENAME(Link L1) JOIN_{L1.To = L2.From} RENAME(Link L2)).
  • #friends of each user: GROUP-BY_{User: COUNT(Friend)}(Friends).

  12. MR implementations of SELECT / PROJECT
  • SELECT_C(R): Map: for each tuple t, if t satisfies C, output (t, t).
  • Reduce: identity, i.e., simply pass on each incoming (key, value) pair.
  • The relation can be extracted by taking just the values (or just the keys!).
  • PROJECT_X(R): Map: for each tuple t, project it on X; let t′ = t[X], then output (t′, t′).
  • Reduce: transform (t′, [t′, ..., t′]) into (t′, t′), i.e., duplicate elimination.
  • Optimization: duplicates encountered in Map can be thrown out early; duplicate elimination in Reduce is still needed.
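
A sketch of both operators, with tuples represented as Python tuples and run_mapreduce as before (all helper names are illustrative):

    def make_select_map(condition):
        def map_fn(t):
            # Emit (t, t) only if t satisfies the selection condition C.
            if condition(t):
                yield (t, t)
        return map_fn

    def make_project_map(indices):
        def map_fn(t):
            # t' = t[X]: keep only the attributes in X.
            t_proj = tuple(t[i] for i in indices)
            yield (t_proj, t_proj)
        return map_fn

    def identity_reduce(k, vs):
        # Pass the key through once; for PROJECT this eliminates duplicates.
        yield (k, k)

    R = [(1, "a"), (2, "b"), (2, "c")]
    print(run_mapreduce(R, make_select_map(lambda t: t[0] == 2), identity_reduce))
    print(run_mapreduce(R, make_project_map([0]), identity_reduce))
    # second line prints [((1,), (1,)), ((2,), (2,))] -- duplicates eliminated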

  13. MR implementations of set operations
  • Union / Intersection / Minus: Map: turn each tuple t in R into (t, R) and each tuple t in S into (t, S). Merging can create (t, [R]), (t, [S]), (t, [R, S]), or (t, [S, R]).
  • Reduce: the action depends on the operation. For union, turn any of those into (t, t); for minus, turn (t, [R]) into (t, t) and everything else into (t, NULL); for intersection, turn (t, [R, S]) or (t, [S, R]) into (t, t) and everything else into (t, NULL).
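
A sketch of intersection and minus in the same style; for simplicity these Reduce functions drop non-members instead of emitting (t, NULL):

    def setop_map(item):
        # item = (t, relation_name); the relation name is the value.
        t, rel = item
        yield (t, rel)

    def intersect_reduce(t, rels):
        # t is in the intersection iff it appeared in both R and S.
        if set(rels) == {"R", "S"}:
            yield (t, t)

    def minus_reduce(t, rels):
        # t is in R - S iff it appeared in R only.
        if set(rels) == {"R"}:
            yield (t, t)

    R = [(1,), (2,)]
    S = [(2,), (3,)]
    inputs = [(t, "R") for t in R] + [(t, "S") for t in S]
    print(run_mapreduce(inputs, setop_map, intersect_reduce))   # [((2,), (2,))]
    print(run_mapreduce(inputs, setop_map, minus_reduce))       # [((1,), (1,))]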

  14. MR implementation of Join
  • Natural join; the idea also works for equi-join. Consider, e.g., R(A, B) and S(B, C).
  • Map: map each tuple (a, b) in R to (b, (R, a)) and each tuple (b, c) in S to (b, (S, c)). [Hadoop passes the Map output to the Reduce tasks sorted on key.]
  • Reduce: from each pair (b, [a set of pairs of the form (R, a) or (S, c)]), produce (b, [(a1, b, c1), (a1, b, c2), ..., (am, b, cn)]). The “value” of this key-value pair is the subset of the join with B = b.
  • Boldface indicates a tuple of attributes/values.
  • Typically, join selectivity is high, so the cost is close to linear in the total size of the two relations.
  • What if a(nother) implementation of MR did not pass the Map output sorted on key?
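
A sketch of the natural join R(A, B) ⋈ S(B, C) in the same style; tuples are tagged with their relation name before being fed to Map:

    def join_map(item):
        # item = ("R", (a, b)) or ("S", (b, c)); key on the join attribute B.
        rel, t = item
        if rel == "R":
            a, b = t
            yield (b, ("R", a))
        else:
            b, c = t
            yield (b, ("S", c))

    def join_reduce(b, tagged):
        # Pair every R-side value with every S-side value sharing key b.
        r_side = [x for rel, x in tagged if rel == "R"]
        s_side = [x for rel, x in tagged if rel == "S"]
        for a in r_side:
            for c in s_side:
                yield (a, b, c)

    R = [(1, "x"), (2, "y")]     # R(A, B)
    S = [("x", 10), ("x", 20)]   # S(B, C)
    inputs = [("R", t) for t in R] + [("S", t) for t in S]
    print(run_mapreduce(inputs, join_map, join_reduce))
    # [(1, 'x', 10), (1, 'x', 20)]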

  15. MR Implementation of GROUP BY
  • Example: R(A, B, C). We want GROUP-BY_{A: agg1(B1), ..., aggk(Bk)}(R).
  • Map: map each tuple (a, b, c) to (a, b), where b is the tuple of attributes being aggregated.
  • Reduce: turn each (a, [b1, ..., bm]) into (a, agg1{b1[1], ..., bm[1]}, ..., aggk{b1[k], ..., bm[k]}).
  • Optimization: if an aggregate is associative and commutative, and the b’s associated with the same a are encountered together, some of the computation can be pushed into Map.
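
A sketch for R(A, B, C) with two aggregates, SUM over B and MAX over C, grouping on A (the aggregate choices are illustrative):

    def groupby_map(t):
        # (a, b, c) -> (a, (b, c)): key on the grouping attribute A.
        a, b, c = t
        yield (a, (b, c))

    def groupby_reduce(a, rows):
        # agg1 = SUM over B, agg2 = MAX over C.
        yield (a, sum(r[0] for r in rows), max(r[1] for r in rows))

    R = [(1, 10, 5), (1, 20, 3), (2, 7, 9)]
    print(run_mapreduce(R, groupby_map, groupby_reduce))
    # [(1, 30, 5), (2, 7, 9)]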

  16. Matrix Multiplication via Join! 1/2
  • First, we will do this by composing two MR steps. View matrix M as triples M(I, J, V) and N as triples N(J, K, W).
  • Map: map M(i, j, m_ij) to (j, (M, i, m_ij)) and N(j, k, n_jk) to (j, (N, k, n_jk)).
  • Reduce: from each (key, value) pair (j, [triples from M and from N]), and for each (M, i, m_ij) and (N, k, n_jk) in that set of triples, output (j, (i, k, m_ij · n_jk)).

  17. Matrix Multiplication via Join 2/2
  • Second MR step:
  • Map: from each (key, value) pair (j, [(i1, k1, v1), ..., (ip, kp, vp)]), produce the (key, value) pairs ((i1, k1), v1), ..., ((ip, kp), vp).
  • Reduce: for each (key, value) pair ((i, k), [v1, ..., vm]), produce the output ((i, k), v1 + ... + vm). This is the value in row i and column k of M × N.
  • Not the most efficient method, but interesting: it uses joins and composes MR steps like algebraic operators!
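
A sketch of the two composed MR steps; to keep it short, step 1's Reduce emits the re-keyed ((i, k), product) pairs directly, so step 2's Map is the identity:

    # Step 1: join M and N on the shared index j and multiply matching entries.
    def mm_map1(item):
        # item = ("M", (i, j, v)) or ("N", (j, k, w)); key on j.
        name, t = item
        if name == "M":
            i, j, v = t
            yield (j, ("M", i, v))
        else:
            j, k, w = t
            yield (j, ("N", k, w))

    def mm_reduce1(j, entries):
        ms = [(i, v) for name, i, v in entries if name == "M"]
        ns = [(k, w) for name, k, w in entries if name == "N"]
        for i, v in ms:
            for k, w in ns:
                yield ((i, k), v * w)

    # Step 2: sum the products for each output cell (i, k).
    def mm_map2(pair):
        yield pair   # already keyed by (i, k)

    def mm_reduce2(ik, vs):
        yield (ik, sum(vs))

    M = [(0, 0, 1.0), (0, 1, 2.0)]   # 1 x 2 matrix [1 2]
    N = [(0, 0, 3.0), (1, 0, 4.0)]   # 2 x 1 matrix [3 4]^T
    inputs = [("M", t) for t in M] + [("N", t) for t in N]
    step1 = run_mapreduce(inputs, mm_map1, mm_reduce1)
    print(run_mapreduce(step1, mm_map2, mm_reduce2))   # [((0, 0), 11.0)]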

  18. Matrix Multiplication in one MR step
  [Figure: M (p × q, entries m_ij) times N (q × r, entries n_jk); row i of M combines with column k of N to form entry (i, k) of the product.]

  19. Matrix Multiplication in one MR step
  • Map: send each element m_ij of M to the r key-value pairs ((i, 1), (M, j, m_ij)), ..., ((i, r), (M, j, m_ij)), and each element n_jk of N to the p key-value pairs ((1, k), (N, j, n_jk)), ..., ((p, k), (N, j, n_jk)).
  • Reduce: each key (i, k) receives the values (M, 1, m_i1), ..., (M, q, m_iq) and (N, 1, n_1k), ..., (N, q, n_qk); match them on the shared index j and output ((i, k), Σ_j m_ij · n_jk).
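
A sketch of the one-step version (0-based indices); note that Map must know the dimensions p and r in order to replicate each entry to all the keys that need it:

    def make_onestep_map(p, r):
        def map_fn(item):
            # item = ("M", (i, j, v)) or ("N", (j, k, w)).
            name, t = item
            if name == "M":
                i, j, v = t
                for k in range(r):       # replicate m_ij to every column of N
                    yield ((i, k), ("M", j, v))
            else:
                j, k, w = t
                for i in range(p):       # replicate n_jk to every row of M
                    yield ((i, k), ("N", j, w))
        return map_fn

    def onestep_reduce(ik, entries):
        # Match M- and N-entries on the shared index j, then sum the products.
        ms = {j: v for name, j, v in entries if name == "M"}
        ns = {j: w for name, j, w in entries if name == "N"}
        yield (ik, sum(ms[j] * ns[j] for j in ms if j in ns))

    M = [(0, 0, 1.0), (0, 1, 2.0)]   # p x q = 1 x 2
    N = [(0, 0, 3.0), (1, 0, 4.0)]   # q x r = 2 x 1
    inputs = [("M", t) for t in M] + [("N", t) for t in N]
    print(run_mapreduce(inputs, make_onestep_map(p=1, r=1), onestep_reduce))
    # [((0, 0), 11.0)]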
