Homework 2

Homework 2 • What is the role of the secondary database that we have to create? • A relational DBMS supports multiple index structures on a table: • Create B-tree index on Salary attribute of Emp • Create Hash index on SS# of Emp • When an application deletes a record from the Emp table (say using the hash index given the SS# of the employee to be fired), all index structures are updated. • The same concept exists with a relational storage manager (Berkeley DB). • Create primary and secondary databases and associate them with one another: • Create a B+-tree primary database on Salary • Create a Hash secondary database on SS# • Associate the two indexes together.

Homework 2 • When we verify that all of our records have been stored correctly, is it sufficient to just count the number of retrieved records, or do we have to keep our own copy of all the records and verify each row in the database against our "shadow" copy? • Either approach is acceptable. • Hint 1: Start by defining very small amount of memory for your main memory database (say 10 MB) to minimize time required to debug your program. Once your program is stable, scale to a large amount of memory. • Hint 2: Do not be surprised if Berkeley DB stores 12 MB of data into 10 MB – read the documentation carefully!

Google’s Bigtable Shahram Ghandeharizadeh Computer Science Department University of Southern California

Overall Architecture • Shared-nothing architecture consisting of thousands of nodes! • A node is an off-the-shelf, commodity PC. Yahoo’s Pig Latin Google’s Map/Reduce Framework Google’s Bigtable Data Model Google File System …….

Bigtable • A data model (a schema). • A sparse, distributed persistent multi-dimensional sorted map. • Data is partitioned across the nodes seamlessly. • The map is indexed by a row key, column key, and a timestamp. • Output value in the map is an un-interpreted array of bytes. • (row: byte[ ], column: byte[ ], time: int64)  byte[ ]

Rows • A row key is an arbitrary string. • Typically 10-100 bytes in size, up to 64 KB. • Every read or write of data under a single row is atomic. • Data is maintained in lexicographic order by row key. • The row range for a table is dynamically partitioned. • Each partition (row range) is named a tablet. • Unit of distribution and load-balancing. • Objective: make read operations single-sited! • E.g., In Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com.

Column Families • Column keys are grouped into sets called column families. • A column family must be created before data can be stored in a column key. • Hundreds of static column families. • Syntax is family:key, e.g., Language:English, Language:German, etc.

Timestamps • 64 bit integers • Assigned by: • Bigtable: real-time in microseconds, • Client application: when unique timestamps are a necessity. • Items in a cell are stored in decreasing timestamp order. • Application specifies how many versions (n) of data items are maintained in a cell. • Bigtable garbage collects obsolete versions.

Bigtable • Used in different applications supported by Google.

Application 1: Google Analytics • Enables webmasters to analyze traffic pattern at their web sites. Statistics such as: • Number of unique visitors per day and the page views per URL per day, • Percentage of users that made a purchase given that they earlier viewed a specific page. • How? • A small JavaScript program that the webmaster embeds in their web pages. • Every time the page is visited, the program is executed. • Program records the following information about each request: • User identifier • The page being fetched

Application 1: Google Analytics (Cont…) • Two of the Bigtables • Raw click table (~ 200 TB) • A row for each end-user session. • Row name include website’s name and the time at which the session was created. • Clustering of sessions that visit the same web site. And a sorted chronological order. • Compression factor of 6-7. • Summary table (~ 20 TB) • Stores predefined summaries for each web site. • Generated from the raw click table by periodically scheduled MapReduce jobs. • Each MapReduce job extracts recent session data from the raw click table. • Row name includes website’s name and the column family is the aggregate summaries. • Compression factor is 2-3. Single-sited

Application 2: Google Earth & Maps • Functionality: Pan, view, and annotate satellite imagery at different resolution levels. • One Bigtable stores raw imagery (~ 70 TB): • Row name is a geographic segments. Names are chosen to ensure adjacent geographic segments are clustered together. • Column family maintains sources of data for each segment. • There are different sets of tables for serving client data, e.g., index table. Single-sited

Application 3: Personalized Search • Records user queries and clicks across Google properties. • Users browse their search histories and request for personalized search results based on their historical usage patterns. • One Bigtable: • Row name is userid • A column family is reserved for each action type, e.g., web queries, clicks. • User profiles are generated using MapReduce. • These profiles personalize live search results. • Replicated geographically to reduce latency and increase availability. Single-sited

Bigtable API • Implements interfaces to • create and delete tables and column families, • modify cluster, table, and column family metadata such as access control rights, • Write or delete values in Bigtable, • Look up values from individual rows, • Iterate over a subset of the data in a table, • Atomic R-M-W sequences on data stored in a single row key (No support for Xacts across multiple rows).

Function Shipping • Similar to Gamma, Bigtable is based on function shipping. Yay! Very smart!

Assumptions • Uses GFS to store log and data files. • Bigtable processes share the same machines with processes from other applications. • A shared cluster of commodity PCs. • A cluster management system: • Schedules jobs, • Manages resources on shared machines, • Monitors PC status and handles failures.

Building Blocks • Google File System • High availability. • SSTable • A key/value database. • Chubby • Name space.

SSTable • A database similar to a BDB database: • Stores and retrieves key/data pairs. • Key and data are arbitrary byte arrays. • Cursors to iterate key/value pairs given a selection predicate (exact and range). • Configurable to use either persistent store (disk) or main-memory based. • A SSTable is stored in GFS.

Bigtable: Hybrid Range Partitioning [VLDB’90] • To minimize the impact of load imbalance, construct more (HN) ranges than (N) nodes, e.g., 10 ranges for a 5 node system; H = 2. • H is higher in practice; 10 in the experimental section of the paper. • A range is named a tablet. A tablet is represented as: • A set of SSTable files. • A set of redo points which are pointers into any commit logs that main contain data for the tablet. 0-10 51-60 11-20 61-70 31-40 81-90 41-50 91-100 21-30 71-80

Chubby • A persistent and distributed lock service. • Consists of 5 active replicas, one replica is the master and serves requests. • Service is functional when majority of the replicas are running and in communication with one another – when there is a quorum. • Implements a nameservice that consists of directories and files.

Software Infrastructure • A Bigtable library linked to every client. • Many tablet servers. • Tablet servers are added and removed dynamically. • Ten to a thousand tablets assigned to a tablet server. • Each tablet is typically 100-200 MB in size. • One master server responsible for: • Assigning tablets to tablet servers, • Detecting the addition and deletion of tablet servers, • Balancing tablet-server load, • Garbage collection of files in GFS. • Client communicates directly with tablet server for reads/writes.

Location of Tablets (Ranges) • A 3-level hierarchy: • 1st Level: A file stored in chubby contains location of the root tablet, i.e., a directory of ranges (tablets) and associated meta-data. • The root tablet never splits. • 2nd Level: Each meta-data tablet contains the location of a set of user tablets. • 3rd Level: A set of SSTable identifiers for each tablet. • Analysis: • Each meta-data row stores ~ 1KB of data, • With 128 MB tablets, the three level store addresses 234 tablets (261 bytes in 128 MB tablets). • Approaches a Zetabyte (million Petabytes).

Client/Master • Client caches tablet locations.

Bigtable and Chubby • Bigtable uses Chubby to: • Ensure there is at most one active master at a time, • Store the bootstrap location of Bigtable data (Root tablet), • Discover tablet servers and finalize tablet server deaths, • Store Bigtable schema information (column family information), • Store access control list. • If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

Placement of Tablets • A tablet is assigned to one tablet server at a time. • Master maintains: • The set of live tablet servers, • Current assignment of tablets to tablet servers (including the unassigned ones) • Chubby maintains tablet servers: • A tablet server creates and acquires an eXclusive lock on a uniquely named file in a specific chubby directory (named server directory), • Master monitors server directory to discover tablet server, • A tablet server stops processing requests if it loses its X lock (network partitioning). • Tablet server will try to obtain an X lock on its uniqely named file as long as it exists. • If the uniquely named file of a tablet server no longer exists then the tablet server kills itself. Goes back to a free pool to be assigned tablets by the master.

Placement of Tablets • Master detects when a tablet server is in the free pool. • How? Master periodically probes each tablet server for the status of its lock.

Master • Should the Master die, a new Master is initiated. The master executes the following steps:

Client Write & Read Operations • Write operation arrives at a tablet server: • Server ensures the client has sufficient privileges for the write operation (Chubby), • A log record is generated to the commit log file, • Once the write commits, its contents are inserted into the memtable. • Read operation arrives at a tablet server: • Server ensures client has sufficient privileges for the read operation (Chubby), • Read is performed on a merged view of (a) the SSTables that constitute the tablet, and (b) the memtable.

Write Operations • As writes execute, size of memtable increases. • Once memtable reaches a threshold: • Memtable is frozen, • A new memtable is created, • Frozen metable is converted to an SSTable and written to GFS. • This minor compaction minimizes memory usage of tablet server, and reduces recovery time in the presence of crashes (checkpoints). • Merging compaction (in the background) reads a few SSTables and memtable to produce one SSTable. (Input SSTables and memtable are discareded.) • Major compaction rewrites all SSTables into exactly one SSTable (containing no deletion entries).

System Performance • Experiments involving random reads (from GFS and main memory) and writes, sequential reads and writes, and scans. • Scan: A single RPC fetches a large sequence of values from the tablet server.

Random Reads

Sequential Reads • A read request is for 1000 bytes. Sequential reads perform better because a tablet server caches the 64 KB SSTable block (from GFS) and uses it to serve the next 64 read requests.

Random Reads from Memory Random reads from memory avoid the overhead of fetching a 64 KB block from GFS. Data is mapped onto the memory of the tablet server

Writes • Tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes to GFS efficiently.

Scale-up • As the number of tablet servers is increased by a factor of 500: • Performance of random reads from memory increases by a factor of 300. • Performance of scans increases by a factor of 260. • Why?

1 2 3 R1 R2 R3 R1 R3 R2 R2 R1 R3 6 Ideal cases R2 R3 R1 R3 R1 R2 R3 R2 R1 {R1, R3} R2 {R1, R3} R2 {R1, R3} R2 27 ways to assign 3 requests to the 3 nodes! R2 {R1, R3} R2 {R1, R3} R2 {R1, R3} {R2, R3} R1 {R2, R3} R1 {R2, R3} R1 R1 {R2, R3} R1 {R2, R3} R1 {R2, R3} {R2, R1} R3 {R2, R1} R3 {R2, R1} R3 R3 {R2, R1} R3 {R2, R1} R3 {R2, R1} {R1, R2, R3} {R1, R2, R3} {R1, R2, R3}

Brain Teaser • Given N servers and M requests, • compute the probability of: • M/N requests per node. • Number of ways M requests may map onto N servers and the probability of each scenario. • Reward for correct answer:

Data Shipping? • Data shipping will saturate resources. • Do not be fooled by this discussion because Bigtable has “function shipping” (not reported in the evaluation section).

Lessons • Many types of errors in a real system. • Delay adding new features until it is clear how the new feature will be used. • Very important to have eyes that can see:

Conclusion • From the very first lecture of this semester:

Homework 2

Homework 2

Presentation Transcript

Homework #2

Homework 2

Homework # 2

Homework 2

Homework 2

Homework 2.

Homework 2

Homework # 2

Homework #2

homework #2

Homework 2

Homework #2

Homework 2

Homework #2

Homework #2

Homework 2

Homework #2

Homework #2