MapReduce – An overview
Medha Atre (May 7, 2008)
Dept of Computer Science, Rensselaer Polytechnic Institute
Roadmap
• Motivation
• MapReduce by Google
• Tools
  • Hadoop
  • Hbase
• MapReduce for SPARQL
  • HRdfStore
• References
Motivation
• MapReduce is inspired by the map and reduce primitives present in Lisp and other functional programming languages.
• Managing large amounts of data on clusters of machines.
• Processing this data in a distributed fashion without aggregating it at a single point.
• Minimal user expertise is required to carry out tasks in parallel on a cluster of machines.
MapReduce by Google
• Input: a set of key-value pairs
• Output: a set of key-value pairs (not necessarily the same as the input!)
• Example: word count (sketched below)
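A minimal single-process sketch of the word-count example, in the spirit of the paper's pseudocode; the in-memory "shuffle" and all names here are my own illustration, not Google's actual API:

```java
import java.util.*;

// Single-process word count: map emits (word, 1) pairs, a toy in-memory
// "shuffle" groups them by word, and reduce sums the counts per word.
public class WordCount {
    // map(contents) -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String doc) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : doc.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // reduce(word, counts) -> total number of occurrences of the word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        // Group intermediate values by key, then reduce each group.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : map("to be or not to be"))
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            System.out.println(g.getKey() + " -> " + reduce(g.getKey(), g.getValue()));
    }
}
```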
MapReduce by Google (contd.)
• The architecture has a master server
• Map-task workers
• Reduce-task workers
• The map input is split into M splits and distributed to the map workers
• Reduce invocations are distributed to R nodes by partitioning the intermediate key space, e.g. hash(key) mod R (see the sketch below)
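The partitioning step can be sketched as follows; the class and method names are mine, but hash(key) mod R is the default scheme described in the paper:

```java
// Sketch of the intermediate-key partitioning step hash(key) mod R.
// The bit mask guards against negative hashCode values in Java.
class Partitioner {
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```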
MapReduce by Google (contd.)
• Uses the Google File System (GFS)
• Provides fault tolerance
• Preserves locality by scheduling jobs on machines in the same cluster and keeping replicated input data
• There is a trade-off in the selection of the M and R values
• The master makes O(M + R) scheduling decisions and keeps O(M × R) states in memory
• Typically M = 200,000 and R = 5,000, with 2,000 worker machines
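As a sanity check with the typical figures above: the master makes on the order of M + R = 200,000 + 5,000 = 205,000 scheduling decisions, while the O(M × R) state covers 200,000 × 5,000 = 10^9 map/reduce task pairs; the paper notes this state is only about one byte per pair, so it fits in roughly 1 GB of master memory.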
Tools
• Hadoop (http://hadoop.apache.org/core/) (this talk does not detail the Hadoop APIs)
• Uses the Hadoop Distributed File System (HDFS), designed specifically for large, distributed, data-intensive applications running on commodity hardware
• Inspired by the Google File System (GFS)
• For MapReduce operations:
  • A master JobTracker and one slave TaskTracker per cluster node
  • Applications specify input/output locations
  • Applications supply map and reduce functions by implementing the appropriate interfaces and abstract classes (a condensed example follows below)
• Implemented in Java, but applications can be written in other languages using Hadoop Streaming
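For a flavor of those interfaces, here is a condensed version of the classic Hadoop word-count mapper and reducer against the org.apache.hadoop.mapred API of that era (job configuration and the driver class are omitted):

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Mapper: emit (word, 1) for every word in the input line.
class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tok = new StringTokenizer(value.toString());
        while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            output.collect(word, ONE);
        }
    }
}

// Reducer: sum the counts collected for each word.
class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) sum += values.next().get();
        output.collect(key, new IntWritable(sum));
    }
}
```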
Tools (contd.)
• Hbase (http://wiki.apache.org/hadoop/Hbase)
• Inspired by Google's Bigtable architecture for distributed data storage using sparse tables
• It is like a multidimensional sorted map, indexed by a row key, a column key, and a timestamp (see the model below)
• A column name has the form <family>:<label>
• A single table enforces a fixed set of column families
• Column families are stored physically close together on disk to improve locality while searching
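The "multidimensional sorted map" view can be written down directly as nested sorted maps; this is purely an in-memory illustration of the data model, not how Hbase actually stores data on disk:

```java
import java.util.TreeMap;

// Conceptual model of an Hbase/Bigtable table as nested sorted maps:
// row key -> (column "family:label" -> (timestamp -> value)).
public class TableModel {
    TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> table = new TreeMap<>();

    void put(String rowKey, String column, long timestamp, byte[] value) {
        table.computeIfAbsent(rowKey, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(timestamp, value);
    }
}
```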
[Figure: Hbase storage layout and Hbase table view]
Hbase architecture
• A table is a list of data tuples sorted by the row key
• It is physically broken into HRegions, each identified by the table name plus a start and end key
• Each HRegion is served by an HRegionServer
• There is an HStore for each column family
• HStoreFiles are B-Tree-like structures
• An HMaster controls the HRegionServers
• A META table stores meta-information about the HRegions and the HRegionServer locations
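To make the start/end-key ranges concrete, here is a toy sketch (entirely my own illustration, not real Hbase classes) of resolving a row key to the region that serves it, the way a lookup through the META table conceptually works:

```java
import java.util.TreeMap;

// Toy illustration: resolve a row key to the HRegion whose
// [startKey, endKey) range contains it. Regions of one table are
// contiguous, so the region with the greatest start key <= rowKey wins.
public class RegionLocator {
    record Region(String table, String startKey, String endKey, String server) {}

    // Regions of one table, keyed by start key.
    private final TreeMap<String, Region> regions = new TreeMap<>();

    void addRegion(Region r) { regions.put(r.startKey(), r); }

    Region locate(String rowKey) {
        var e = regions.floorEntry(rowKey);
        return (e == null) ? null : e.getValue();
    }
}
```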
[Figure: HMaster controlling three HRegionServers, each serving two HRegions (HRegion1–HRegion6)]
MapReduce for SPARQL (HRdfStore)
• The HRdfStore Data Loader (HDL) reads RDF files and organizes the data in Hbase
• The sparsity of RDF data makes it particularly well suited to storage in Hbase
• Hbase's compression techniques are useful here
• The HRdfStore Query Processor (HQP) executes RDF queries on Hbase tables
• SPARQL query -> parse tree -> logical operator tree -> physical operator tree -> execution
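The pipeline in the last bullet has the generic shape below; these interfaces are my own illustration of a standard query-processor structure, not HRdfStore's actual (proposed) code:

```java
// Generic shape of the query pipeline: SPARQL query -> parse tree ->
// logical operator tree -> physical operator tree -> execution.
interface ParseTree {}
interface LogicalPlan {}
interface PhysicalPlan { Iterable<String[]> execute(); }

interface QueryProcessor {
    ParseTree parse(String sparqlQuery);        // SPARQL query -> parse tree
    LogicalPlan toLogical(ParseTree tree);      // parse tree -> logical plan
    PhysicalPlan toPhysical(LogicalPlan plan);  // logical -> physical plan
}
```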
MapReduce for SPARQL (some more thoughts)
• How should RDF data be organized in Hbase?
• Subjects/objects as row keys:
  • a "Predicates" column family
  • each predicate as a "label", e.g. "Predicates:rdf:type"
• Or predicates as row keys, with subjects/objects as column families
• Convert each SPARQL query into the corresponding query over Hbase (a storage sketch follows below)
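Under the first layout, storing a triple (subject, predicate, object) could look roughly like this; the sketch uses today's Hbase client API, which postdates the version discussed in this talk, and the table name is hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: store an RDF triple with the subject as the row key and the
// predicate as a label inside the "Predicates" column family.
class TripleStore {
    static void storeTriple(Connection conn, String s, String p, String o)
            throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("triples"))) {
            Put put = new Put(Bytes.toBytes(s));        // row key = subject
            put.addColumn(Bytes.toBytes("Predicates"),  // column family
                          Bytes.toBytes(p),             // label = predicate
                          Bytes.toBytes(o));            // value = object
            table.put(put);
        }
    }
}
```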
MapReduce for SPARQL (some more thoughts)
• Each RDF triple is mapped to one or more keys and stored in Hbase according to these keys
• Each cluster node is responsible for the triples associated with one or more particular keys
• Map each triple pattern in the SPARQL query to a key, together with the associated restrictions, e.g. FILTERs
• Execute the query by mapping the triple patterns to the cluster nodes associated with those keys
• This is essentially a Distributed Hash Table (DHT)-like system
• A different hashing scheme can be employed to avoid the skew in triple distribution experienced in conventional DHT-based P2P systems (one possibility is sketched below)
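A toy sketch of the key-to-node mapping; hashing a composite of two triple positions is one possible way to spread the load of very frequent predicates (such as rdf:type), and all names here are my own:

```java
// DHT-style triple placement: derive one key per indexed position of the
// triple, so a triple is reachable from lookups on its subject or on a
// (predicate, object) composite. Composite keys spread skewed predicates
// across nodes instead of piling them on one.
class TriplePartitioner {
    private final int numNodes;

    TriplePartitioner(int numNodes) { this.numNodes = numNodes; }

    private int node(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % numNodes;
    }

    // Nodes responsible for the triple (one per indexed key).
    int[] placementNodes(String s, String p, String o) {
        return new int[] { node(s), node(p + "|" + o) };
    }
}
```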
Map-Reduce-Merge (an application)
• Map-Reduce does not work well with heterogeneous databases
• It does not directly support joins
• Map-Reduce-Merge (as proposed by Yahoo! and UCLA researchers) supports the features of Map-Reduce while adding relational algebra, joins in particular, to the list of database principles it covers
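The extra merge phase can be pictured as a join over the key-sorted outputs of two separate MapReduce jobs; the following is my own simplification, not the exact interface from the SIGMOD '07 paper:

```java
import java.util.*;

// Simplified picture of the "merge" phase: join the key-sorted outputs
// of two MapReduce jobs on their shared key.
class MergePhase {
    static List<String> merge(SortedMap<String, String> left,
                              SortedMap<String, String> right) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            String r = right.get(l.getKey());
            if (r != null)  // keys match across the two data sets
                joined.add(l.getKey() + ": " + l.getValue() + " | " + r);
        }
        return joined;
    }
}
```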
References
• MapReduce - http://labs.google.com/papers/mapreduce.html
• Bigtable - http://labs.google.com/papers/bigtable.html
• Hadoop - http://hadoop.apache.org/core/docs/r0.15.3/index.html
• Hbase - http://hadoop.apache.org/hbase/docs/r0.1.1/
• HRdfStore - http://wiki.apache.org/incubator/HRdfStoreProposal
• IBM MapReduce tool for Eclipse - http://www.alphaworks.ibm.com/tech/mapreducetools
• Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, SIGMOD '07