BigTable & MapReduce

BigTable & MapReduce http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 12/16/2014

Google File System

Google File System • 一个master，若干个chunkserver，若干个client • 存储大文件（GB-TB） • 一个文件由若干个定长块（chunk，64MB） • 块是普通linux文件，有若干个复本（replica）

Google File System • 对一个文件块，在不同机器上存储3个备份 • 只要不是这3个机器同时宕掉，该文件块就是可恢复的

BigTable

Google’s Motivation – Scale! • Scale Problem • Lots of data • Millions of machines • Different project/applications • Hundreds of millions of users • Storage for (semi-)structured data • No commercial system big enough • Couldn’t afford if there was one • Low-level storage optimization help performance significantly • Much harder to do when running on top of a database layer

Bigtable • Distributed multi-level map • Fault-tolerant, persistent • Scalable • Thousands of servers • Terabytes of in-memory data • Petabyte of disk-based data • Millions of reads/writes per second, efficient scans • Self-managing • Servers can be added/removed dynamically • Servers adjust to load imbalance

Real Applications

Data Model • a sparse, distributed persistent multi-dimensional sorted map (row, column, timestamp) -> cell contents

Data Model • Rows • Arbitrary string • Access to data in a row is atomic • Ordered lexicographically

Data Model • Column • Tow-level name structure: • family: qualifier • Column Family is the unit of access control

Data Model • Timestamps • Store different versions of data in a cell • Lookup options • Return most recent K values • Return all values

Data Model The row range for a table is dynamically partitioned Each row range is called a tablet Tablet is the unit for distribution and load balancing

APIs • Metadata operations • Create/delete tables, column families, change metadata • Writes • Set(): write cells in a row • DeleteCells(): delete cells in a row • DeleteRow(): delete all cells in a row • Reads • Scanner: read arbitrary cells in a bigtable • Each row read is atomic • Can restrict returned rows to a particular range • Can ask for just data from 1 row, all rows, etc. • Can ask for all columns, just certain column families, or specific columns

Typical Cluster Shared pool of machines that also run other distributed applications

Building Blocks • Google File System (GFS) • stores persistent data (SSTable file format) • Scheduler • schedules jobs onto machines • Chubby • Lock service: distributed lock manager • master election, location bootstrapping • MapReduce (optional) • Data processing • Read/write Bigtable data

Chubby • {lock/file/name} service • Coarse-grained locks • Each clients has a session with Chubby. • The session expires if it is unable to renew its session lease within the lease expiration time. • 5 replicas, need a majority vote to be active • Also an OSDI ’06 Paper

Implementation • Single-master distributed system • Three major components • Library that linked into every client • One master server • Assigning tablets to tablet servers • Detecting addition and expiration of tablet servers • Balancing tablet-server load • Garbage collection • Metadata Operations • Many tablet servers • Tablet servers handle read and write requests to its table • Splits tablets that have grown too large

Implementation

Tablets • Each Tablets is assigned to one tablet server. • Tablet holds contiguous range of rows • Clients can often choose row keys to achieve locality • Aim for ~100MB to 200MB of data per tablet • Tablet server is responsible for ~100 tablets • Fast recovery: • 100 machines each pick up 1 tablet for failed machine • Fine-grained load balancing: • Migrate tablets away from overloaded machine • Master makes load-balancing decisions

How to locate a Tablet? • METADATA: Key: table id + end row, Data: location • Aggressive Caching and Prefetching at Client side Given a row, how do clients find the location of the tablet whose row range covers the target row?

Tablet Assignment Each tablet is assigned to one tablet server at a time. Master server keeps track of the set of live tablet servers and current assignments of tablets to servers. When a tablet is unassigned, master assigns the tablet to an tablet server with sufficient room. It uses Chubby to monitor health of tablet servers, and restart/replace failed servers.

Tablet Assignment • Chubby • Tablet server registers itself by getting a lock in a specific directory chubby • Chubby gives “lease” on lock, must be renewed periodically • Server loses lock if it gets disconnected • Master monitors this directory to find which servers exist/are alive • If server not contactable/has lost lock, master grabs lock and reassigns tablets • GFS replicates data. Prefer to start tablet server on same machine that the data is already at

Refinement – Locality groups & Compression • Locality Groups • Can group multiple column families into a locality group • Separate SSTable is created for each locality group in each tablet. • Segregating columns families that are not typically accessed together enables more efficient reads. • In WebTable, page metadata can be in one group and contents of the page in another group. • Compression • Many opportunities for compression • Similar values in the cell at different timestamps • Similar values in different columns • Similar values across adjacent rows

Performance - Scaling Not Linear! WHY? • As the number of tablet servers is increased by a factor of 500: • Performance of random reads from memory increases by a factor of 300. • Performance of scans increases by a factor of 260.

Not linearly? • Load Imbalance • Competitions with other processes • Network • CPU • Rebalancing algorithm does not work perfectly • Reduce the number of tablet movement • Load shifted around as the benchmark progresses

MapReduce Application

1. Top 10 Problem • Lots of data • paper,author, contents • million papers, million authors, millions of possible terms (occuring in contents) • Problem • Top 10 terms for each author • Top 10 authors for each term • Use MapReduce to solve the problem

2. Naive Bayes • Lost of cases • Each case has many features, and YES or NO • Problem • Likelihoods of features

Thank You! Q&A

BigTable & MapReduce