Bigtable: A Distributed Storage System for Structured Data
Google, Inc.
김 윤호
1. Introduction 2. Data Model 3. API 4. Building Blocks 5. Implementation 6. Refinements 7. Performance Evaluation 8. Real Application 9. Lessons 10. Conclusions
WHAT IS BIGTABLE?
1. Introduction
• Very large amounts of data (petabytes)
• Manages structured data
• Distributed storage system
• Used by Google Earth, Google Finance, and many other products
• Bigtable resembles a database, but is a NoSQL system
1. Introduction
A Bigtable is a sparse, distributed, persistent, multidimensional sorted map.
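The map is indexed by a row key, a column key, and a timestamp; each value is an uninterpreted array of bytes. A toy C++ rendering of that index, purely for illustration (the names and types here are not part of Bigtable's API):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// (row:string, column:string, time:int64) -> string
// Versions of a cell are distinguished by their timestamps.
using Key = std::tuple<std::string, std::string, int64_t>;
using CellMap = std::map<Key, std::string>;

int main() {
  CellMap table;
  // Two cells of the webtable example: page contents and an anchor,
  // each stored under (row key, column key, timestamp).
  table[{"com.cnn.www", "contents:", 3}] = "<html>...";
  table[{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
  return 0;
}
```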
2. Data Model • Rows • Column Family • Timestamp
2. Data Model – Rows
• Row keys are kept in lexicographic order
• The row range is dynamically partitioned into tablets
• Good locality: choose keys so related rows sort together (e.g., reversed URLs such as com.cnn.www)
2. Data Model – Column Family
• Unit of access control
• A set of column keys (family:qualifier)
• Data in the same family is usually of the same type
• A table has a small number of column families
• But may have a very large number of columns
2. Data Model – Timestamp
• Each cell can hold multiple versions of the same data
• Timestamps are assigned by Bigtable (real time) or by the client application
• Versions are stored in decreasing timestamp order, so the most recent version is read first
3. API
• Not SQL: no full relational query language (NoSQL)
• Functions for creating and deleting tables and column families
• Client library written in C++
3. API - Write
• RowMutation: batches a series of updates that are applied atomically to a single row (example below)
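The write example from the Bigtable paper (Figure 2), lightly abridged; OpenOrDie, RowMutation, and Apply are the paper's internal C++ client API, not a public library:

```cpp
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor;
// Apply performs an atomic mutation to the row.
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
```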
3. API - Read
• Scanner: iterates over the values in one or more column families, with filtering on columns and timestamps (example below)
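The matching read example from the paper (Figure 3) iterates over all anchors of one row, returning every stored version:

```cpp
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next())
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->StringValue());
```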
4. Building Blocks
• Google File System (GFS)
• Stores log and data files
• SSTable file format
• Chubby lock service
4. Building Blocks
(Architecture diagram: a client talks to the master and to tablet servers; the master coordinates tablet servers through Chubby; tablet servers store their SSTables in GFS)
Reference: The Technology That Supports Google (Keisuke Nishida)
4. Building Blocks
• SSTable
• Used internally to store Bigtable data
• Provides a persistent, ordered, immutable map from keys to values
• Contains a sequence of blocks (typically 64 KB) plus a block index used to locate blocks (see the sketch below)
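A lookup needs only a single disk seek: the block index is loaded into memory when the SSTable is opened, a binary search finds the right block, and that one block is read from disk. A minimal sketch of that idea, assuming a hypothetical in-memory index of (first key, block offset) pairs:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical block-index entry: the first key stored in a block
// and the block's byte offset within the SSTable file.
struct IndexEntry {
  std::string first_key;
  uint64_t offset;
};

// Binary-search the in-memory index for the block that could contain
// `key`; the caller then reads that single block from disk.
uint64_t FindBlockOffset(const std::vector<IndexEntry>& index,
                         const std::string& key) {
  if (index.empty()) return 0;  // no blocks: nothing to read
  auto it = std::upper_bound(
      index.begin(), index.end(), key,
      [](const std::string& k, const IndexEntry& e) { return k < e.first_key; });
  if (it == index.begin()) return index.front().offset;
  return std::prev(it)->offset;  // last block whose first key <= key
}
```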
4. Building Blocks
• Chubby
• A small distributed file system combined with a distributed lock service
• Used to ensure there is at most one active master at any time
• To store the bootstrap location of Bigtable data (the root tablet)
• To discover tablet servers and finalize tablet server deaths
• To store Bigtable schema information
• To store access control lists
5. Implementation • Tablet Location • Tablet Assignment • Tablet Serving • Compactions
5. Implementation - Tablet Location
• Three-level hierarchy: Chubby file → root tablet → METADATA tablets → user tablets
• Each METADATA row is about 1 KB
• Each METADATA tablet is limited to 128 MB
• # of tablets addressed per METADATA tablet = 128 MB / 1 KB = 2^17
• # of addressable user tablets = 2^17 × 2^17 = 2^34
• Total capacity = 128 MB × 2^34 = 2^61 bytes ≈ 2 EB
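The arithmetic behind these numbers (assuming ~1 KB per METADATA row and 128 MB METADATA tablets, as in the paper):

```latex
\[
\frac{128\,\mathrm{MB}}{1\,\mathrm{KB}} = \frac{2^{27}}{2^{10}} = 2^{17}
\ \text{rows per METADATA tablet},
\qquad
2^{17} \times 2^{17} = 2^{34}\ \text{user tablets},
\qquad
2^{34} \times 128\,\mathrm{MB} = 2^{61}\,\mathrm{B} \approx 2\,\mathrm{EB}.
\]
```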
5. Implementation - Tablet Assignment
• Each tablet is assigned to one tablet server at a time
(Diagram: the master keeps tablet info for every tablet; tablet servers serve the tablets, whose data lives in GFS)
5. Implementation - Tablet Serving
• Tablet recovery
• Write operation
• Read operation
5. Implementation - Tablet Serving
• Memtable: sorted in-memory buffer of recently committed updates
• Commit log: redo log stored in GFS
Tablet Recovery
• The tablet server reads the tablet's metadata from the METADATA table
• The metadata lists the SSTables that comprise the tablet and a set of redo points
• Redo points are pointers into the commit logs that may contain data for the tablet
• The server reconstructs the memtable by applying all updates committed since the redo points
Write operation
• Check that the request is well-formed and the sender is authorized
• Append the mutation to the commit log
• After the write has been committed, insert its contents into the memtable
Read operation
• Check that the request is well-formed and the sender is authorized
• Execute the read on a merged view of the sequence of SSTables and the memtable
• Ill-formed or unauthorized requests fail
(A sketch of the write path follows)
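A minimal sketch of that write path, with hypothetical Mutation, CommitLog, and Memtable types standing in for the tablet server's internals (not Bigtable's actual code):

```cpp
#include <string>

// Hypothetical stand-ins for the tablet server's internals.
struct Mutation  { std::string row, column, value; };
struct CommitLog { void Append(const Mutation&) { /* write a redo record to GFS */ } };
struct Memtable  { void Insert(const Mutation&) { /* update the sorted in-memory buffer */ } };

// Write path: validate the request, append to the commit log for durability,
// then insert into the memtable so the write becomes visible to reads.
bool ApplyWrite(CommitLog& log, Memtable& mem,
                const Mutation& m, bool authorized) {
  if (!authorized || m.row.empty()) return false;  // well-formedness + ACL check
  log.Append(m);
  mem.Insert(m);
  return true;
}
```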
5. Implementation - Compactions • Minor compaction • Merging compaction • Major compaction
Minor compaction
• When the memtable reaches a threshold size, it is frozen, converted to an SSTable, and written to GFS; incoming writes go to a new memtable
• Shrinks the memory usage of the tablet server
• Reduces the amount of data that has to be read from the commit log during recovery
(Diagram: write operations go to the memtable on the tablet server; frozen memtables are written out as SSTables in GFS)
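A rough, self-contained sketch of the minor-compaction step; Memtable and SSTableWriter here are illustrative stand-ins, not Bigtable's real types:

```cpp
#include <map>
#include <string>

// Hypothetical stand-ins for the memtable and an on-disk SSTable writer.
using Memtable = std::map<std::string, std::string>;
struct SSTableWriter {
  void Add(const std::string&, const std::string&) { /* append key/value to block */ }
  void Finish() { /* flush blocks and block index to GFS */ }
};

// Minor compaction: freeze the current memtable, write it out as a new
// immutable SSTable, and hand back a fresh memtable for incoming writes.
Memtable MinorCompaction(Memtable frozen) {
  SSTableWriter writer;
  for (const auto& [key, value] : frozen)  // memtable is already sorted
    writer.Add(key, value);
  writer.Finish();
  return Memtable{};
}
```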
Compaction
• Merging compaction: reads the contents of a few SSTables and the memtable and writes out a single new SSTable; the inputs can then be discarded
• Major compaction: a merging compaction that rewrites all SSTables into exactly one SSTable
(Diagram: several SSTables in GFS are merged into a single SSTable)
5. Implementation - Compactions
• Major compactions allow Bigtable to reclaim the resources used by deleted data and ensure that deleted data disappears from the system promptly
6. Refinements
• Refinements needed to achieve high performance, availability, and reliability
• Locality groups
• Compression
• Caching for read performance
• Bloom filters
• Commit-log implementation
• Speeding up tablet recovery
Locality groups
• Clients can group multiple column families into a locality group
• A separate SSTable is generated for each locality group in each tablet
• Segregating families that are not typically read together makes reads more efficient
Compression
• Clients can choose a two-pass custom compression scheme
• First pass: Bentley and McIlroy's scheme (compresses long common strings across a large window)
• Second pass: a fast compression algorithm that looks for repetitions in a small window
Caching for read performance
• Two-level caching in the tablet server
• Scan Cache: caches key-value pairs returned by the SSTable interface
• Block Cache: caches SSTable blocks read from GFS
Bloom filters
• Answers whether an SSTable might contain any data for a given row/column pair
• Drastically reduces the number of disk accesses for read operations (see the sketch below)
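As a quick illustration of the idea, a generic Bloom filter is a bit array plus a few hash functions: false positives are possible, false negatives are not. This is a toy sketch, not Bigtable's implementation:

```cpp
#include <bitset>
#include <functional>
#include <string>

// Tiny illustrative Bloom filter: k hash functions map a key to k bits.
// If any bit is unset, the key is definitely absent; if all are set,
// the key is *probably* present (false positives are possible).
class BloomFilter {
 public:
  void Add(const std::string& key) {
    for (size_t i = 0; i < kHashes; ++i) bits_.set(Hash(key, i) % kBits);
  }
  bool MightContain(const std::string& key) const {
    for (size_t i = 0; i < kHashes; ++i)
      if (!bits_.test(Hash(key, i) % kBits)) return false;
    return true;
  }
 private:
  static constexpr size_t kBits = 1 << 16;
  static constexpr size_t kHashes = 4;
  static size_t Hash(const std::string& key, size_t seed) {
    // Derive k pseudo-independent hashes from one base hash.
    return std::hash<std::string>{}(key) * (2 * seed + 1) ^ (seed * 0x9e3779b9u);
  }
  std::bitset<kBits> bits_;
};
```

In Bigtable the filter covers the row/column pairs in an SSTable, so a lookup the filter rejects never touches that SSTable on disk at all.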
Commit-log implementation
• A separate log file per tablet would cause many concurrent writes to GFS
• Instead, mutations for all tablets on a server are appended to a single commit log per tablet server
Speeding up tablet recovery
• Before the master moves a tablet from one tablet server to another, the source server performs a minor compaction on that tablet
• The new server then does not have to replay the commit log
7. Performance Evaluation • Single tablet-server performance • Scaling
8. Real Application • Google Analytics • Google Earth • Personalized Search
8. Real Application - Google Analytics
• Helps webmasters analyze traffic patterns on their websites
• Number of visitors, page views, site-tracking reports
• Websites embed a small JavaScript program in their pages
• Two tables
• Raw click table (~200 TB)
• Summary table (~20 TB)
8. Real Application - Google Earth • Table to preprocess data • Set of tables for serving client data
8. Real Application - Personalized Search
• Records user queries and clicks across Google services (web search, images, news)
• Row – userid
• Column family – one per type of user action
• Timestamp – the time at which the user action occurred
9. Lessons
• Large distributed systems are vulnerable to many types of failures
• It is important to delay adding new features until it is clear how they will be used
• The importance of proper system-level monitoring
• The value of simple designs
10. Conclusions
• Resource-sharing issues within Bigtable itself remain an open topic
THANK YOU Q & A