Map/Reduce in Practice Hadoop, HBase, MongoDB, Accumulo, and related Map/Reduce-enabled data stores
How we got here Google uses Map/Reduce and GFS to provide BigTable; in the open-source world, Hadoop uses MapReduce and HDFS to provide HBase. Related stuff: Accumulo, Cassandra, MongoDB.
In the beginning was the Google • Larry and Sergey had a lot of data • Needed fast, distributed storage for very large files • Needed location awareness • GFS was born
Processing that data • Needed some way to process it all efficiently • Move processing to the data • Distributed processing • Only transfer minimal results • Map/Reduce
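To make the model concrete, here is a minimal word-count sketch in plain Java, with no framework: the map step emits key/value pairs, the pairs are grouped by key, and the reduce step aggregates each group. The names and data are made up for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog");

        // Map phase: each line emits (word, 1) pairs.
        // Shuffle phase: pairs are grouped by key.
        // Reduce phase: the values for each key are summed.
        Map<String, Long> counts = lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+")))               // map: emit words
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));   // shuffle + reduce

        counts.forEach((word, n) -> System.out.println(word + " => " + n));
    }
}
```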
Files are good, structure is better • Map/Reduce naturally produces and consumes structured data (key => value pairs) • Needed a way to efficiently store and access that data • BigTable • A compressed, sparse, distributed, multidimensional map
Open, sort of • Google told the world about this great stuff: • Dean, Jeffrey and Ghemawat, Sanjay. “MapReduce: Simplified Data Processing on Large Clusters,” OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. • Chang, Fay et al. “Bigtable: A Distributed Storage System for Structured Data,” OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006. • But they weren’t sharing the implementations
Hadoop: Map/Reduce for the masses • Open source Apache project • Derived from the Google papers • Consists of the Hadoop kernel, MapReduce, and HDFS • Also related projects: Hive, HBase, ZooKeeper, etc.
MapReduce Layer • Takes Jobs, which are split into Tasks • Tasks are executed on worker nodes that, ideally, store the data the task needs to process • If that’s not possible, the task attempts to execute on a worker node in the same rack as the data • Tasks might be map tasks or reduce tasks, depending on what the JobTracker needs at the time
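As a concrete example, here is the classic word-count job written against Hadoop's MapReduce API (a sketch; exact class wiring varies a bit across Hadoop versions). The combiner is a nod to "only transfer minimal results": it pre-aggregates counts on the map side before anything crosses the network.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map task: runs on the worker that (ideally) holds the input split.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                word.set(tok);
                ctx.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // Reduce task: receives all values for one key and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // map-side pre-aggregation cuts transfer
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```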
HDFS Layer • Consists of a namenode, a secondary namenode (which checkpoints the namespace image; it is not a hot standby), and datanodes • Datanodes hold redundant copies of data: generally two copies on one rack and a third copy on a different rack • Exposes data location information to the JobTracker so tasks can be distributed to workers close to the data • Not a POSIX file system, and can’t be mounted directly
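A minimal sketch of reading and writing a file through Hadoop's FileSystem API; the path is hypothetical, and the namenode address comes from the cluster configuration (fs.defaultFS).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // namenode resolved from fs.defaultFS

        Path file = new Path("/tmp/example.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```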
Other Storage • Hadoop is flexible about which storage system is used • Alternatives include Amazon S3, CloudStore, an FTP filesystem, and read-only HTTP(S) file systems • Only HDFS and CloudStore are rack-aware, though • The common FileSystem abstraction allows multiple data store implementations (see the sketch below) • Also, HDFS isn’t restricted to Hadoop: HBase and other projects use it as storage
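A sketch of that pluggability: the URI scheme selects the FileSystem implementation behind the same API. The host and bucket names here are hypothetical, and the S3 connector needs its own jar on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PluggableStorage {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URI scheme picks the implementation; names below are hypothetical.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        FileSystem s3   = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        System.out.println(hdfs.getUri() + " and " + s3.getUri());
    }
}
```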
HBase • Basically an open-source BigTable • Non-relational, distributed, sparse, multi-dimensional, compressed data • Tables can be input/output for MapReduce jobs run in Hadoop • Supports Bloom filters • Another thing borrowed from BigTable • Can tell you if something definitely isn’t in a column, but not for certain that it is there
Data Model • Data is stored as rows with a single key, a timestamp, and multiple column families • Data is sorted by key, but otherwise there aren’t any indexes • Supports 4 operations: Get, Put, Scan, Delete (sketched below) • Deletes don’t actually delete; they mark a row with a tombstone for later compactions to clean up
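A sketch of the four operations with the HBase Java client; the table, column family, and row key are hypothetical, and exact client classes vary across HBase versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrud {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Put: write a cell under (row key, column family, qualifier).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Get: fetch one row by key.
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            // Scan: iterate rows in key order (optionally over a key range).
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) System.out.println(Bytes.toString(row.getRow()));
            }

            // Delete: writes a tombstone; compaction removes the data later.
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```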
Digression: Bloom Filters • Maintains a bit array, like a hash table • Each item, when inserted, is hashed with k different algorithms, and the bit at each resulting index is set to 1 • To determine whether a value is in the set, hash it with the k algorithms and check that all the indexed bits are 1; if any is 0, the value definitely isn’t there • But there is a non-zero probability that all k bits are 1 even though the value was never inserted (a false positive) • Insert-only: entries can’t be removed, since you never know which other entries also set a given bit
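A toy Bloom filter in plain Java, deriving the k hash functions from one base hash via double hashing (a common trick); this is a sketch of the idea, not production code.

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;   // m: number of bits
    private final int k;      // number of hash functions

    public BloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th index from two base hashes (double hashing).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;            // force odd so strides differ
        return Math.floorMod(h1 + i * h2, size);
    }

    // Insert: set k bits. There is no delete; we can't know which
    // other items also set a given bit.
    public void add(String item) {
        for (int i = 0; i < k; i++) bits.set(index(item, i));
    }

    // Query: any 0 bit means definitely absent; all 1s means probably
    // present (false positives possible, false negatives impossible).
    public boolean mightContain(String item) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(item, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1 << 16, 4);
        bf.add("alice");
        System.out.println(bf.mightContain("alice")); // true
        System.out.println(bf.mightContain("bob"));   // almost certainly false
    }
}
```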
So, why bother? • Column scans are expensive, and that’s about the only way to find values in a column that isn’t the key
Accumulo • HBase for the NSA • Provides basically the same functionality as HBase, but with security • Adds a new element to the key: Column Visibility • Stores a logical combination of security labels that must be satisfied at query time for the key/value to be returned (see the sketch below) • Hence a single table can store data at various security levels, and users only see what they’re allowed to see
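A sketch of how that looks with the Accumulo client API (the older Connector style; the table name, labels, and data are hypothetical).

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

public class VisibilityExample {
    // Assumes a Connector obtained elsewhere (e.g. from a ZooKeeperInstance);
    // the table "people" is hypothetical.
    static void writeAndRead(Connector conn) throws Exception {
        // Write a cell visible only to principals satisfying (admin AND audit) OR secret.
        Mutation m = new Mutation("row1");
        m.put("info", "ssn", new ColumnVisibility("(admin&audit)|secret"),
              new Value("123-45-6789".getBytes()));
        BatchWriter writer = conn.createBatchWriter("people", new BatchWriterConfig());
        writer.addMutation(m);
        writer.close();

        // At query time the scanner carries the user's authorizations; cells whose
        // visibility expression isn't satisfied are silently filtered out.
        Scanner scan = conn.createScanner("people", new Authorizations("secret"));
        scan.forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```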
Cassandra • A lot like HBase, with BigTable inspiration, but also inspired by Amazon Dynamo (a cloud key/value store) • Also has column families (and even supercolumns), but allows secondary indexes • Distribution and replication are tunable • Writes are faster than reads, so it’s good for logging, etc.
Cassandra vs. HBase • Basically comes down to the CAP theorem: • You have to pick two of Consistency, Availability, and Partition tolerance; you can’t have all 3 • Cassandra chooses AP, though you can get consistency if you can tolerate greater latency (see the sketch below) • By default it provides weak consistency • HBase values CP, but availability may suffer: in the event of a partition (node failure), data won’t be served if it can’t be guaranteed to be consistent with committed operations
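A sketch of Cassandra's tunable consistency using the DataStax Java driver (3.x-style API; the contact point, keyspace, and table are hypothetical): ONE is fast but weakly consistent, while QUORUM on both reads and writes trades latency for seeing the latest committed write.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableConsistency {
    public static void main(String[] args) {
        // Contact point and keyspace/table names are hypothetical.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // ONE: fast, weakly consistent -- any single replica may answer.
            session.execute(new SimpleStatement("SELECT * FROM users WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.ONE));

            // QUORUM on reads and writes: a majority must agree (R + W > N),
            // so reads see the latest write, at the price of extra latency.
            session.execute(new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (42, 'Ada')")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM));
        }
    }
}
```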
MongoDB • Document-oriented storage • Full index support • Replication and high availability • Auto-sharding to scale horizontally • JavaScript-based querying • Map/Reduce • GridFS storage
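A sketch tying a few of these together with the MongoDB Java driver: the map and reduce functions are JavaScript strings executed server-side. The collection and field names are hypothetical, and note that the mapReduce helper is deprecated in newer drivers in favor of the aggregation pipeline.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoMapReduce {
    public static void main(String[] args) {
        // Connection string and collection name are hypothetical.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("demo").getCollection("orders");

            // Map and reduce are JavaScript, run server-side:
            // map emits (customer, amount); reduce sums each customer's amounts.
            String map = "function() { emit(this.customer, this.amount); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            for (Document d : orders.mapReduce(map, reduce)) {
                System.out.println(d.toJson());
            }
        }
    }
}
```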
Conclusion • There are a lot of options out there, and more all the time • An RDBMS offers the most functionality, but stumbles on scalability • Key/value stores scale, but require a different processing model • The best option is determined by the combination of data and task