Software Systems Development: MAP-REDUCE, Hadoop, HBase
The problem • Batch (offline) processing of huge data sets using commodity hardware • Linear scalability • Need infrastructure that handles all the mechanics, allowing developers to focus on the processing logic/algorithms
Data Sets • The New York Stock Exchange: 1 Terabyte of data per day • Facebook: 100 billion photos, 1 Petabyte (1000 Terabytes) • Internet Archive: 2 Petabytes of data, growing by 20 Terabytes per month • The data cannot fit on a single node; a distributed file system is needed to hold it
Batch processing • Single write/append, multiple reads • Example: analyze log files for the most frequent URL • Each data entry is self-contained • At each step, each data entry can be treated individually • After the aggregation, each aggregated data set can be treated individually
Grid Computing • Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network) • Works well for computation-intensive tasks, but struggles with huge data sets because the network becomes a bottleneck • Programming paradigm: low-level Message Passing Interface (MPI)
Hadoop • Open-source implementation of 2 key ideas • HDFS: Hadoop Distributed File System • Map-Reduce: programming model • Built on ideas from Google infrastructure (GFS and Map-Reduce papers published 2003/2004) • Java/Python/C interfaces, several projects built on top of it
Approach • A limited but simple model that fits a broad range of applications • Handle communications, redundancy, and scheduling in the infrastructure • Move computation to data instead of moving data to computation
Distributed File System (HDFS) • Files are split into large blocks (64 MB or 128 MB) • Compare with a typical disk block of 512 bytes (ordinary file system blocks are a few kilobytes) • Blocks are replicated among Data Nodes (DN) • 3 copies by default • Name Node (NN) keeps track of files and their blocks • Single master node • Stream-based I/O • Sequential access
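To make the block sizes and the 3-copy replication concrete, here is a minimal sketch. The function names and the round-robin placement are assumptions made for illustration only; real HDFS placement is rack-aware and decided by the Name Node.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

def place_replicas(num_blocks, data_nodes, replication=3):
    """Toy round-robin placement: each block goes to `replication` distinct nodes.

    Hypothetical policy for illustration; it is not the HDFS algorithm.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

one_gb = 1024 * MB
print(block_count(one_gb, 128 * MB))  # 8 blocks to track
print(block_count(one_gb, 512))       # 2097152 blocks to track
print(place_replicas(2, ["dn1", "dn2", "dn3", "dn4"]))
```

The comparison suggests why large blocks matter: the single Name Node has far fewer entries to track per file.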
Map Reduce • A programming model • Decompose a processing job into Map and Reduce stages • The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest
MAP function • Map each data entry into a pair • <key, value> • Examples • Map each log file entry into <URL, 1> • Map each daily stock trading record into <STOCK, Price>
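The two examples on this slide could look roughly like the following plain-Python functions. This is a sketch, not the Hadoop API: the log format (URL as the last whitespace-separated field) and the "symbol,price" record format are assumptions chosen for illustration.

```python
def map_log_entry(line):
    """Map one log line to a (URL, 1) pair.

    Assumes a hypothetical log format where the URL is the last field.
    """
    url = line.split()[-1]
    return (url, 1)

def map_trade_record(line):
    """Map one 'symbol,price' record (assumed format) to a (STOCK, price) pair."""
    symbol, price = line.strip().split(",")
    return (symbol, float(price))

print(map_log_entry("2009-01-01 10:00:00 GET /index.html"))
print(map_trade_record("IBM,101.5"))
```

Each call looks at a single entry only, which is what lets the map stage run on many nodes in parallel.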
Hadoop: Shuffle/Merge phase • Hadoop merges (shuffles) the output of the MAP stage into • <key, value1, value2, value3> • Examples • <URL, 1, 1, 1, 1, 1, 1> • <STOCK, Price on day 1, Price on day 2, ...>
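The grouping done by the shuffle/merge phase can be sketched in memory as follows. Hadoop does this across nodes and on disk; the `shuffle` function here is a hypothetical single-process stand-in.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key and return the groups in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

map_output = [("/a", 1), ("/b", 1), ("/a", 1), ("/a", 1)]
print(shuffle(map_output))  # {'/a': [1, 1, 1], '/b': [1]}
```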
Reduce function • Reduce the entries produced by Hadoop's merge phase into a <key, value> pair • Examples • Reduce <URL, 1, 1, 1> into <URL, 3> (count) • Reduce <Stock, 3, 2, 10> into <Stock, 10> (maximum)
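The two reduce examples correspond to a sum and a maximum. A minimal sketch, with function names chosen here for illustration:

```python
def reduce_count(key, values):
    """Sum the 1s emitted by the map stage to count occurrences of a key."""
    return (key, sum(values))

def reduce_max(key, values):
    """Keep the largest value seen for a key."""
    return (key, max(values))

print(reduce_count("URL", [1, 1, 1]))   # ('URL', 3)
print(reduce_max("Stock", [3, 2, 10]))  # ('Stock', 10)
```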
Hadoop Infrastructure • Replicate/distribute data among the nodes • Input • Output • Map/Shuffle output • Schedule processing • Partition data • Assign processing nodes (PN) • Move code to PN (e.g. send Map/Reduce code) • Manage failures (block CRC checks, rerun Map/Reduce tasks if necessary)
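Two of the mechanics listed above, partitioning data and checking blocks with a CRC, can be sketched as below. The idea of hashing the key modulo the number of reducers follows Hadoop's default hash partitioning; using `zlib.crc32` as the hash, and the function names, are assumptions made to keep the example deterministic in Python.

```python
import zlib

def partition(key, num_reducers):
    """Pick the reducer for a key: stable hash of the key modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def verify_block(data, expected_crc):
    """Return True if the block's CRC32 matches the checksum stored with it."""
    return zlib.crc32(data) == expected_crc

block = b"IBM,100.0,104.5\n"
stored_crc = zlib.crc32(block)

print(partition("IBM", 4) == partition("IBM", 4))         # same key, same reducer
print(verify_block(block, stored_crc))                    # intact block
print(verify_block(b"IBM,100.0,104.6\n", stored_crc))     # corrupted block
```

Because every pair with the same key lands on the same reducer, each reducer sees the complete list of values for its keys.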
Example: Trading Data Processing • Input: • Historical stock data • Records are stored in a CSV (comma-separated values) text file • Each line: stock_symbol, low_price, high_price • 1987-2009 data for all stocks, one record per stock per day • Output: • Maximum interday delta for each stock
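A single-process sketch of this job is shown below. It assumes that the delta for a record is high_price minus low_price; the slide does not define it, so treat that as one possible reading. The sample data and function names are made up for illustration, and a real job would be submitted to Hadoop rather than run in memory.

```python
import csv
import io
from collections import defaultdict

def map_record(row):
    """Map one CSV row (stock_symbol, low_price, high_price) to (symbol, delta)."""
    symbol, low, high = row[0].strip(), float(row[1]), float(row[2])
    return (symbol, high - low)  # assumed definition of the delta

def run_job(csv_text):
    """Map every record, group the deltas by symbol, then reduce with max."""
    grouped = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        key, value = map_record(row)
        grouped[key].append(value)
    return {key: max(values) for key, values in sorted(grouped.items())}

# Made-up sample records, one per stock per day.
sample = "IBM,100.0,104.5\nIBM,101.0,103.0\nGE,20.0,20.75\n"
print(run_job(sample))  # {'GE': 0.75, 'IBM': 4.5}
```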
Datastore: HBASE • Distributed column-oriented database on top of HDFS • Modeled after Google's BigTable data store • Random reads/writes on top of sequential, stream-oriented HDFS • Billions of rows * millions of columns * thousands of versions
HBASE: Region Servers • Tables are split into horizontal regions • Each region comprises a subset of rows • Master/worker roles in each layer • HDFS: NameNode, DataNode • MapReduce: JobTracker, TaskTracker • HBase: Master Server, Region Server
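Since rows are kept in key order, finding the region that holds a row reduces to a search over region start keys. The sketch below assumes each region covers the key range from its start key up to the next region's start key; the `RegionMap` class and the server names are hypothetical and not part of the HBase client API.

```python
import bisect

class RegionMap:
    """Toy lookup table from row key to the region server holding that row."""

    def __init__(self, start_keys, servers):
        # start_keys must be sorted; the first region starts at the empty key.
        self.start_keys = list(start_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Return the server whose region's key range contains row_key."""
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.servers[index]

regions = RegionMap(["", "g", "p"], ["rs1", "rs2", "rs3"])
print(regions.locate("apple"))  # rs1
print(regions.locate("g"))      # rs2
print(regions.locate("zebra"))  # rs3
```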
HBASE vs RDBMS • HBase tables are similar to RDBMS tables, with some differences • Rows are sorted by Row Key • Only cells are versioned • Columns can be added on the fly by the client as long as the column family they belong to already exists
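The three differences can be illustrated with a small in-memory model. `ToyTable` is a hypothetical class written for this sketch, not the HBase API; it only mimics sorted rows, versioned cells, and column families that must exist before columns are added.

```python
class ToyTable:
    """In-memory illustration of an HBase-style table."""

    def __init__(self, column_families):
        self.families = set(column_families)  # fixed when the table is created
        self.rows = {}  # row_key -> {(family, qualifier): [(timestamp, value), ...]}

    def put(self, row_key, family, qualifier, value, timestamp):
        """Write one cell version; new qualifiers are allowed, new families are not."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows.setdefault(row_key, {}).setdefault((family, qualifier), [])
        cell.append((timestamp, value))
        cell.sort(key=lambda version: version[0], reverse=True)  # newest first

    def get(self, row_key, family, qualifier, versions=1):
        """Return up to `versions` values of a cell, newest first."""
        cell = self.rows.get(row_key, {}).get((family, qualifier), [])
        return [value for _, value in cell[:versions]]

    def scan(self):
        """Return row keys in sorted order, as a scan would visit them."""
        return sorted(self.rows)

table = ToyTable(["price"])
table.put("IBM", "price", "close", 100, timestamp=1)
table.put("IBM", "price", "close", 101, timestamp=2)
table.put("GE", "price", "open", 20, timestamp=1)  # new column, existing family
print(table.get("IBM", "price", "close", versions=2))  # [101, 100]
print(table.scan())                                    # ['GE', 'IBM']
```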