Bigtable: A Distributed Storage System for Structured Data, Google, Inc. 김 윤호
1. Introduction 2. Data Model 3. API 4. Building Blocks 5. Implementation 6. Refinements 7. Performance Evaluation 8. Real Application 9. Lessons 10. Conclusions
WHAT IS BIGTABLE?
1. Introduction • Very large data sizes (petabytes) • Manages structured data • Distributed storage system • Used by Google Earth, Google Finance, and more • Bigtable resembles a database, but does not support a full relational model (NoSQL)
1. Introduction A Bigtable is a Sparse, distributed, persistent Multi dimensional Map. Sorted
2. Data Model • Rows • Column Family • Timestamp
2. Data Model – Rows • Row keys kept in lexicographic order • Row ranges are dynamically partitioned into tablets • Good locality (e.g., reversed URLs keep pages from the same domain together)
2. Data Model – Column Family • Unit of access control • A set of column keys • Data in a column family is usually of the same type • A table has a small number of column families • ...but may have an unbounded number of columns
2. Data Model – Timestamp • Multiple versions of the same data • Timestamps assigned by Bigtable (real time) or by the client application • Versions stored in decreasing timestamp order, most recent first
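As a purely illustrative, self-contained C++ sketch (hypothetical ToyBigtable type, not Google's code), the data model can be imitated with an in-memory sorted map keyed by (row key, column key, timestamp):

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

// Toy model of the Bigtable data model: a sorted map keyed by
// (row key, column key, timestamp) -> uninterpreted string value.
// Timestamps are stored negated so newer versions sort first,
// matching Bigtable's decreasing-timestamp order.
using Cell = std::tuple<std::string, std::string, int64_t>;
using ToyBigtable = std::map<Cell, std::string>;  // hypothetical name

int main() {
    ToyBigtable t;
    // Row keys are reversed URLs so pages from the same domain sort together.
    t[{"com.cnn.www", "contents:", -3}] = "<html>... (version at t=3)";
    t[{"com.cnn.www", "contents:", -2}] = "<html>... (version at t=2)";
    t[{"com.cnn.www", "anchor:cnnsi.com", -9}] = "CNN";

    for (const auto& [key, value] : t) {
        const auto& [row, column, neg_ts] = key;
        std::cout << row << " / " << column
                  << " @ t=" << -neg_ts << " -> " << value << "\n";
    }
    return 0;
}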
3. API • NoSQL: no SQL-style query interface • Functions for creating and deleting tables and column families • Client library written in C++
3. API - Write • RowMutation: atomically applies a set of updates and deletes to a single row (example below)
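The paper illustrates RowMutation with roughly the following C++ fragment (its Figure 2). Table, RowMutation, Operation, OpenOrDie, and Apply belong to Google's internal client library, so the snippet is not compilable outside Google; it is shown only to convey the shape of the write API.

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor in one atomic row mutation
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);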
3. API - Read • Scanner: iterates over column families and versions within one or more rows (example below)
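The paper's read example (its Figure 3) is roughly the following, again using the internal client library, with T opened as in the write example:

// Read all anchors for one row, returning all versions
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestampValue(),
         stream->StringValue());
}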
4. Building Blocks • Google File System (GFS): stores log and data files • SSTable file format • Chubby lock service
4. Building Blocks • Architecture diagram: a client talks to Chubby, the master, and the tablet servers; tablet servers store SSTables on GFS (Reference: Google wo Sasaeru Gijutsu / The Technology That Supports Google, Keisuke Nishida)
4. Building Blocks • SSTable • Used internally to store Bigtable data • Provides a persistent, ordered, immutable map from keys to values • Consists of a sequence of blocks (typically 64KB) plus a block index used to locate blocks (see the sketch below)
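A minimal, self-contained C++ sketch of that block-plus-index lookup (hypothetical ToySSTable class, not the real file format; it indexes by entry count rather than by 64KB byte blocks):

#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy SSTable: an immutable, key-sorted list of entries split into
// fixed-size blocks, plus a sparse index of each block's first key.
// Lookup is two steps: binary-search the index, then scan one block.
class ToySSTable {
 public:
  explicit ToySSTable(std::vector<std::pair<std::string, std::string>> sorted,
                      size_t block_size = 4)
      : entries_(std::move(sorted)), block_size_(block_size) {
    for (size_t i = 0; i < entries_.size(); i += block_size_)
      index_.push_back({entries_[i].first, i});  // first key of each block
  }

  std::optional<std::string> Get(const std::string& key) const {
    // Find the last block whose first key is <= key.
    auto it = std::upper_bound(
        index_.begin(), index_.end(), key,
        [](const std::string& k, const auto& e) { return k < e.first; });
    if (it == index_.begin()) return std::nullopt;
    size_t start = std::prev(it)->second;
    size_t end = std::min(start + block_size_, entries_.size());
    for (size_t i = start; i < end; ++i)  // scan only that one block
      if (entries_[i].first == key) return entries_[i].second;
    return std::nullopt;
  }

 private:
  std::vector<std::pair<std::string, std::string>> entries_;
  std::vector<std::pair<std::string, size_t>> index_;
  size_t block_size_;
};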
4. Building Blocks • Chubby • Small distributed file system • Distributed lock service • To Ensure one active master • To store the bootstrap location of Bigtable data • To discover tablet servers and finalize tablet server deaths • To store Bigtable schema information • To store access control lists
5. Implementation • Tablet Location • Tablet Assignment • Tablet Serving • Compactions
5. Implementation - Tablet Location • Three-level hierarchy (Chubby file → root tablet → METADATA tablets → user tablets) • METADATA row ≈ 1KB, tablet size = 128MB • # of tablets addressed per METADATA tablet = 128MB / 1KB = 2^17 • # of user tablets = 2^17 × 2^17 = 2^34 • Total capacity = 2^34 × 128MB = 2^61 bytes ≈ 2EB
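A tiny standalone sanity check of those figures (the constants come from the paper; the program just redoes the arithmetic):

#include <cstdint>
#include <cstdio>

int main() {
  constexpr uint64_t kTabletBytes = 128ULL << 20;                  // 128 MB per tablet
  constexpr uint64_t kMetadataRow = 1ULL << 10;                    // ~1 KB per METADATA row
  constexpr uint64_t kPerMetadata = kTabletBytes / kMetadataRow;   // 2^17 tablets per METADATA tablet
  constexpr uint64_t kUserTablets = kPerMetadata * kPerMetadata;   // 2^34 user tablets
  constexpr uint64_t kTotalBytes  = kUserTablets * kTabletBytes;   // 2^61 bytes

  std::printf("tablets per METADATA tablet: %llu\n", (unsigned long long)kPerMetadata);
  std::printf("addressable user tablets:    %llu\n", (unsigned long long)kUserTablets);
  std::printf("addressable data:            %.1f EB\n", kTotalBytes / 1e18);
  return 0;
}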
5. Implementation - Tablet Assignment • Each tablet is assigned to at most one tablet server at a time • Diagram: the master keeps the tablet-to-server assignment (tablet info); tablet servers manage tablets whose data lives on GFS
5. Implementation - Tablet Serving • Tablet recovery • Write operation • Read operation
5. Implementation - Tablet Serving • Memtable • Commit log
Tablet Recovery • Read the tablet's metadata from the METADATA table • The metadata contains the list of SSTables that comprise the tablet and a set of redo points • Redo points are pointers into the commit logs • The server reconstructs the memtable by replaying all updates committed since the redo points (see the sketch below)
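A self-contained sketch of the replay step, assuming a hypothetical in-memory commit log of key/value mutations and a redo point expressed as an index into that log:

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy recovery: rebuild the memtable by replaying only the mutations
// written after the redo point recorded in METADATA. The SSTables
// listed in METADATA are opened as-is and are not shown here.
using Mutation = std::pair<std::string, std::string>;  // key -> value

std::map<std::string, std::string> RecoverMemtable(
    const std::vector<Mutation>& commit_log, size_t redo_point) {
  std::map<std::string, std::string> memtable;
  for (size_t i = redo_point; i < commit_log.size(); ++i)
    memtable[commit_log[i].first] = commit_log[i].second;  // redo the mutation
  return memtable;
}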
Write operation • Check that the request is well-formed and the sender is authorized • Append the mutation to the commit log • Insert its contents into the memtable
Read operation • Check that the request is well-formed and the sender is authorized • Execute the read on a merged view of the sequence of SSTables and the memtable • Otherwise the operation fails (combined sketch below)
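A combined, self-contained sketch of both paths (hypothetical ToyTablet type; authorization, timestamps, and GFS are omitted):

#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy tablet serving: writes go to the commit log and then the in-memory
// memtable; reads merge the memtable with the tablet's SSTables
// (newest source wins).
struct ToyTablet {
  std::vector<std::string> commit_log;                        // redo log (stand-in for the GFS log)
  std::map<std::string, std::string> memtable;                // recent writes, sorted
  std::vector<std::map<std::string, std::string>> sstables;   // older data, newest last

  void Write(const std::string& key, const std::string& value) {
    commit_log.push_back(key + "=" + value);  // 1) append the mutation to the log
    memtable[key] = value;                    // 2) apply it to the memtable
  }

  std::optional<std::string> Read(const std::string& key) const {
    if (auto it = memtable.find(key); it != memtable.end()) return it->second;
    for (auto s = sstables.rbegin(); s != sstables.rend(); ++s)  // newest first
      if (auto it = s->find(key); it != s->end()) return it->second;
    return std::nullopt;  // "or fail": key not present in any source
  }
};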
5. Implementation - Compactions • Minor compaction • Merging compaction • Major compaction
Minor compaction • When the memtable reaches a threshold size, it is frozen and written to GFS as a new SSTable, and incoming writes continue into a new memtable (diagram: tablet server memtable flushed to SSTables on GFS) • Shrinks the memory usage of the tablet server • Reduces the amount of data that has to be read from the commit log during recovery (sketch below)
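A minimal sketch of the idea, using simple in-memory maps as stand-ins for the memtable and the on-disk SSTables:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy minor compaction: once the memtable grows past a threshold, its
// contents are written out as a new immutable SSTable and the memtable
// is emptied, bounding both memory use and log replay at recovery.
using KVMap = std::map<std::string, std::string>;

void MaybeMinorCompact(KVMap& memtable, std::vector<KVMap>& sstables,
                       size_t threshold_entries) {
  if (memtable.size() < threshold_entries) return;
  sstables.push_back(memtable);  // "write" the frozen memtable as an SSTable
  memtable.clear();              // writes continue into a fresh memtable
}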
Compaction • Merging compaction: reads a few SSTables and the memtable and writes out one new SSTable (diagram: several SSTables on GFS merged into one) • Major compaction: rewrites all SSTables into exactly one SSTable that contains no deleted data
5. Implementation - Compactions • Major compactions reclaim resources used by deleted data and allow it to disappear in a timely manner (sketch below)
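A sketch of a major compaction under the simplifying assumption that a deletion is modeled as an empty value (the real system uses explicit deletion entries):

#include <iterator>
#include <map>
#include <string>
#include <vector>

// Toy major compaction: merge every SSTable (newer tables win) into a
// single SSTable and drop deletion markers, reclaiming the space used
// by deleted data.
using KVMap = std::map<std::string, std::string>;

KVMap MajorCompact(const std::vector<KVMap>& sstables) {  // ordered oldest..newest
  KVMap merged;
  for (const auto& table : sstables)
    for (const auto& [key, value] : table)
      merged[key] = value;  // later (newer) tables overwrite older values
  // Discard tombstones: deleted cells no longer need to be remembered.
  for (auto it = merged.begin(); it != merged.end();)
    it = it->second.empty() ? merged.erase(it) : std::next(it);
  return merged;
}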
6. Refinements • High performance, Availability, Reliability • Locality groups • Compression • Caching for read performance • Bloom filters • Commit-log implementation • Speeding up tablet recovery
Locality groups • Clients can group multiple column families together into a locality group • A separate SSTable is generated for each locality group in each tablet • Makes reads more efficient
Compression • Two-pass custom compression scheme • First pass: Bentley and McIlroy's scheme (long common strings across a large window) • Second pass: a fast compression algorithm that looks for repetitions in a small window
Caching for read performance • Two-level caching on tablet servers • Scan Cache: key/value pairs returned by the SSTable interface • Block Cache: SSTable blocks read from GFS
Bloom filters • Tell whether an SSTable might contain data for a given row/column pair • Reduce the number of disk accesses for read operations (sketch below)
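A toy Bloom filter sketch (hypothetical ToyBloomFilter; the real filters are created per locality group and sized for a target false-positive rate):

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Toy Bloom filter: answers "possibly present" / "definitely absent",
// so most reads for rows or columns that do not exist skip the SSTable
// entirely. Two hash functions over a fixed-size bit array.
class ToyBloomFilter {
 public:
  void Add(const std::string& key) {
    bits_.set(Hash1(key));
    bits_.set(Hash2(key));
  }
  bool MayContain(const std::string& key) const {
    return bits_.test(Hash1(key)) && bits_.test(Hash2(key));
  }

 private:
  static constexpr size_t kBits = 1 << 16;
  static size_t Hash1(const std::string& k) {
    return std::hash<std::string>{}(k) % kBits;
  }
  static size_t Hash2(const std::string& k) {
    return std::hash<std::string>{}(k + "#salt") % kBits;  // cheap second hash
  }
  std::bitset<kBits> bits_;
};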
Commit-log implementation • A separate log file per tablet would mean a very large number of files being written concurrently to GFS • Instead, mutations are appended to a single commit log per tablet server
Speeding up tablet recovery • When the master moves a tablet from one tablet server to another • The source server first performs a minor compaction, so the new server does not need to replay the commit log
7. Performance Evaluation • Single tablet-server performance • Scaling
8. Real Application • Google Analytics • Google Earth • Personalized Search
8. Real Application - Google Analytics • Helps webmasters analyze traffic patterns • Number of visitors, page views, site-tracking reports • Webmasters embed a small JavaScript program in their pages • Two tables • Raw click table (~200TB) • Summary table (~20TB)
8. Real Application - Google Earth • One table used by the preprocessing pipeline • A set of tables used for serving client data
8. Real Application - Personalized Search • Records user queries and clicks • Across web search, images, news, and other services • Row key: userid • Column family: one per type of user action • Timestamp: the time at which the user action occurred
9. Lessons • Large distributed systems are vulnerable to many types of failures • It is important to delay adding new features until it is clear how they will be used • The importance of proper system-level monitoring • The value of simple designs
10. Conclusions • Resource-sharing issues within Bigtable itself
THANK YOU Q & A