
MapReduce – An overview




  1. MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute

  2. Roadmap • Motivation • MapReduce by Google • Tools • Hadoop • Hbase • MapReduce for SPARQL • HRdfStore • References

  3. Motivation • MapReduce is inspired by the map and reduce primitives of Lisp and other functional programming languages. • Managing large amounts of data on clusters of machines. • Processing this data in a distributed fashion without aggregating it at a single point. • Minimal user expertise required to carry out tasks in parallel on a cluster of machines.

  4. MapReduce by Google • Input: a set of key-value pairs • Output: a set of key-value pairs (not necessarily of the same type as the input). • Example: word count.
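
The word-count example can be sketched in plain Python. This is a toy, single-process simulation of the programming model only (map, shuffle/group-by-key, reduce), not the distributed Google or Hadoop implementation; the function names are illustrative.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all partial counts for a single word.
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle phase: group intermediate values by key.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            grouped[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(k, v) for k, v in grouped.items())

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(map_reduce(docs))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In the real system the map calls and reduce calls run on different worker machines, and the grouping happens via a distributed sort rather than an in-memory dictionary.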

  5. MapReduce by Google (contd..) • Architecture has a master server • Map task workers • Reduce task workers • The input is split into M pieces and distributed to the map workers. • Reduce invocations are distributed over R nodes by partitioning the intermediate key space. • E.g. Hash(key) mod R
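
The Hash(key) mod R partitioning can be sketched as follows. A stable hash (here MD5, an arbitrary choice for illustration; Python's built-in `hash` is randomized across runs) guarantees that every occurrence of an intermediate key is routed to the same one of the R reduce tasks.

```python
import hashlib

R = 5  # number of reduce tasks

def partition(key, num_reducers):
    # Stable hash so a given intermediate key always reaches
    # the same reducer, regardless of which mapper emitted it.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_reducers

# Distribute some intermediate keys into R buckets.
buckets = {}
for key in ["the", "quick", "brown", "fox"]:
    buckets.setdefault(partition(key, R), []).append(key)
```

Because the assignment depends only on the key, all (key, value) pairs for one key end up in one reduce invocation, which is what makes the per-key reduce correct.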

  6. MapReduce by Google (contd..) • Uses the Google File System (GFS) • Provides fault tolerance • Preserves locality by scheduling map tasks on machines that hold a replica of the input data. • Trade-off in the selection of the M and R values: the master makes O(M + R) scheduling decisions and keeps O(M*R) state in memory. • Typically M = 200,000 and R = 5,000 with 2,000 worker machines.

  7. Tools • Hadoop (http://hadoop.apache.org/core/) (This talk is not going to detail the Hadoop APIs) • Uses the Hadoop Distributed File System (HDFS) – specifically meant for large, distributed, data-intensive applications running on commodity hardware. • Inspired by the Google File System (GFS) • For MapReduce operations • A master JobTracker and one slave TaskTracker per cluster node • Applications specify input/output locations • Supply map and reduce functions implementing the appropriate interfaces and abstract classes. • Implemented in Java, but applications can be written in other languages • Using Hadoop Streaming
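
A Hadoop Streaming job communicates through plain text: the mapper reads raw lines and writes tab-separated key/value lines, the framework sorts them by key, and the reducer reads the sorted stream. The sketch below models that contract with ordinary functions over line iterables (real Streaming scripts read stdin and write stdout); the function names are illustrative.

```python
from itertools import groupby

def stream_mapper(lines):
    # A Streaming mapper emits one tab-separated key<TAB>value
    # pair per output line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reducer(lines):
    # The framework sorts mapper output by key before the reducer
    # runs, so equal keys arrive consecutively and can be grouped.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the framework: map, sort by key, reduce.
intermediate = sorted(stream_mapper(["the quick brown fox", "the lazy dog"]))
result = list(stream_reducer(intermediate))
```

This text-based protocol is what lets non-Java languages plug into Hadoop: any executable that follows it can serve as the map or reduce step.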

  8. Tools (contd..) • Hbase (http://wiki.apache.org/hadoop/Hbase) • Inspired by Google’s Bigtable architecture for distributed data storage using sparse tables • It’s like a multidimensional sorted map, indexed by a row key, a column key, and a timestamp. • A column name has the form <family>:<label> • A single table enforces a fixed set of column families. • Column families are stored physically close together on disk to improve locality while searching.
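
The "multidimensional sorted map" data model can be sketched as a toy in-memory class. This models only the logical view described on the slide (row key, `family:label` column, timestamp, enforced families), not HBase's storage engine or API; the class and method names are hypothetical.

```python
from collections import defaultdict

class SparseTable:
    """Toy model of the Bigtable/Hbase data model: a sparse map
    indexed by (row key, 'family:label' column, timestamp)."""

    def __init__(self, families):
        self.families = set(families)  # the table enforces its families
        self.rows = defaultdict(dict)  # row -> {column -> [(ts, value)]}

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family {family!r}")
        self.rows[row].setdefault(column, []).append((timestamp, value))

    def get(self, row, column):
        # Return the most recent version of the cell, if any.
        versions = self.rows[row].get(column, [])
        return max(versions)[1] if versions else None

table = SparseTable(["Predicates"])
table.put("row1", "Predicates:rdf:type", 1, "foaf:Person")
table.put("row1", "Predicates:rdf:type", 2, "foaf:Agent")
```

Note how sparseness falls out of the representation: a row stores only the columns it actually has, which is why this model suits wide, mostly-empty tables.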

  9. Hbase storage Hbase table view

  10. Hbase storage

  11. Hbase architecture • A table is a list of data tuples sorted by the row key. • Physically broken into HRegions, each identified by the table name plus its start and end row keys. • Each HRegion is served by an HRegionServer. • An HStore for each column group. • HStoreFiles use a B-Tree-like structure. • An HMaster controls the HRegionServers. • A META table stores meta-information about HRegions and HRegionServer locations.
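
The META-table lookup described above amounts to a binary search over sorted region start keys. The sketch below illustrates that idea with hypothetical key ranges and region names; real Hbase clients consult the META table over the network rather than a local list.

```python
import bisect

# Hypothetical regions: each serves the half-open row-key range
# [start, end), with start keys sorted; the last range is unbounded.
regions = [
    ("",  "g", "HRegion1"),
    ("g", "p", "HRegion2"),
    ("p", None, "HRegion3"),
]

def find_region(row_key):
    # Binary-search the sorted start keys for the region whose
    # range contains row_key (what a META lookup resolves).
    starts = [start for start, _end, _name in regions]
    idx = bisect.bisect_right(starts, row_key) - 1
    return regions[idx][2]
```

Keeping rows sorted by key is what makes this range-based sharding possible, and it is also why scans over adjacent row keys stay within few regions.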

  12. Hbase architecture [Diagram: one HMaster controlling three HRegionServers; HRegionServer1 serves HRegion1 and HRegion2, HRegionServer2 serves HRegion3 and HRegion4, HRegionServer3 serves HRegion5 and HRegion6.]

  13. MapReduce for SPARQL (HRdfStore) • Use the HRdfStore Data Loader (HDL) to read RDF files and organize the data in HBase. • The sparsity of RDF data makes it especially well suited to storage in Hbase. • Hbase’s compression techniques are useful here. • The HRdfStore Query Processor (HQP) executes RDF queries on HBase tables. • SPARQL Query -> Parse tree -> Logical operator tree -> Physical operator tree -> Execution

  14. MapReduce for SPARQL (some more thoughts) • How to organize RDF data in Hbase? • Subjects/Objects as row keys • A “Predicates” column family • Each predicate as a “label”, e.g. “Predicates:rdf:type”. • Or predicates as row keys • Subjects/Objects as column families • Convert each SPARQL query into an associated query over the Hbase tables.
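
The first layout on the slide (subject as row key, a "Predicates" column family with one label per predicate) can be sketched as a plain-dictionary model. The loader function and the example triples are illustrative, not HRdfStore's actual code.

```python
def load_triples(triples):
    # Hypothetical layout: subject as row key, one
    # "Predicates:<predicate>" column per predicate,
    # holding the list of object values.
    table = {}
    for s, p, o in triples:
        row = table.setdefault(s, {})
        row.setdefault(f"Predicates:{p}", []).append(o)
    return table

triples = [
    (":alice", "rdf:type", "foaf:Person"),
    (":alice", "foaf:knows", ":bob"),
    (":bob", "rdf:type", "foaf:Person"),
]
rdf_table = load_triples(triples)
```

Under this layout a query bound on the subject becomes a single row lookup, while a query bound only on the predicate would have to scan every row; the alternative predicate-as-row-key layout inverts that trade-off.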

  15. MapReduce for SPARQL (some more thoughts) • Each RDF triple is mapped to one or more keys and stored in Hbase according to these keys. • Each cluster node is responsible for the triples associated with one or more particular keys. • Map each triple pattern in the SPARQL query to a key, with its associated restrictions, e.g. FILTERs. • Execute the query by mapping the triple patterns to the cluster nodes associated with those keys. • This is essentially a Distributed Hash Table (DHT)-like system. • A different hashing scheme can be employed to avoid the skew in triple distribution experienced in conventional DHT-based P2P systems.
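
The DHT-like placement above can be sketched as follows: each triple is indexed under each of its components, so a triple pattern with any bound constant can be routed to the node owning that constant. The hash choice (SHA-1 here) and function names are illustrative assumptions.

```python
import hashlib

def node_for(key, num_nodes):
    # A stable hash decides which cluster node owns a key; a
    # different hashing scheme could be swapped in here to
    # counter skew in the triple distribution.
    h = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    return h % num_nodes

def place_triple(s, p, o, num_nodes):
    # Index the triple under each component, so a pattern with
    # any single bound term (s ?p ?o, ?s p ?o, ?s ?p o) can be
    # routed to the node owning that term.
    return {part: node_for(part, num_nodes) for part in (s, p, o)}
```

A query pattern such as `?x rdf:type foaf:Person` would then be sent to `node_for("rdf:type", N)` or `node_for("foaf:Person", N)`, whichever key is expected to be more selective.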

  16. Map-Reduce-Merge (an application) • Map-Reduce does not work well with heterogeneous databases. • It does not directly support joins. • Map-Reduce-Merge (as proposed by Yahoo! and UCLA researchers) retains the features of Map-Reduce while adding relational algebra to the list of database principles it supports.
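
The essence of the added Merge step can be sketched as a sort-merge equi-join over two reduced, key-sorted outputs. This simplified version assumes each key appears at most once per side; the actual Map-Reduce-Merge operator (SIGMOD '07) handles general relations and user-defined merge logic.

```python
def merge_join(left, right):
    # Sketch of the Merge phase: walk two key-sorted lists of
    # (key, value) pairs in lockstep, emitting a joined tuple
    # whenever the keys match (the relational step that plain
    # map/reduce lacks).
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (lk, lv), (rk, rv) = left[i], right[j]
        if lk == rk:
            out.append((lk, lv, rv))
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out
```

Because both inputs arrive sorted by key out of their reduce phases, the join needs only a single linear pass over each side, with no re-shuffling.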

  17. References • MapReduce – http://labs.google.com/papers/mapreduce.html • BigTable – http://labs.google.com/papers/bigtable.html • Hadoop – http://hadoop.apache.org/core/docs/r0.15.3/index.html • Hbase – http://hadoop.apache.org/hbase/docs/r0.1.1/ • HRdfStore – http://wiki.apache.org/incubator/HRdfStoreProposal • IBM MapReduce tool for Eclipse – http://www.alphaworks.ibm.com/tech/mapreducetools • Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, SIGMOD ’07.
